The Data Day Texas 2023 Sessions

We still have more sessions to announce. Check back for updates.

Augmenting Fraud Protection Pipelines At Scale Using Graph Analytics and AI Features

Justin Fine (KatanaGraph)

Fighting fraud is a never-ending battle for the financial services industry.
LexisNexis recently estimated that for every dollar lost by U.S. financial services firms due to fraud in 2020, companies incurred $3.78 in total costs, including legal fees, investigation and recovery expenses, fines and other costs, in addition to the lost transaction value.
Rules-based systems and traditional machine learning (ML) for fraud detection can no longer keep up with the increasingly sophisticated, ever-changing tactics of today's bad actors.
Graph technology has been shown to dramatically improve fraud detection efforts; however, due to the massive scale and complex computing required for real-time fraud detection, implementing graph systems has been historically difficult… until now.
Katana Graph efficiently handles highly complex queries, algorithms and deep learning models, at massive scale and speed that other graph solutions simply cannot match. This talk will walk through the process and touch on some of the key pieces of technology that make this possible.
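To give a flavor of the kind of graph feature such pipelines compute, here is a toy sketch in plain Python. Everything in it (the data, the `flag_shared_identifiers` helper, the threshold) is invented for illustration and is not Katana Graph's API: it flags identifiers shared by suspiciously many accounts, a classic fan-in fraud signal.

```python
from collections import defaultdict

# Toy bipartite graph: (account, shared identifier) edges.
edges = [
    ("acct1", "device9"), ("acct2", "device9"), ("acct3", "device9"),
    ("acct4", "device7"), ("acct5", "card42"), ("acct6", "card42"),
]

def flag_shared_identifiers(edges, threshold=3):
    """Return identifiers linked to >= threshold accounts -- a fan-in
    pattern commonly fed to fraud models as a graph feature."""
    by_identifier = defaultdict(set)
    for account, identifier in edges:
        by_identifier[identifier].add(account)
    return {ident: accts for ident, accts in by_identifier.items()
            if len(accts) >= threshold}

flagged = flag_shared_identifiers(edges)
```

At production scale the same idea runs over billions of edges in real time, which is exactly where purpose-built graph engines earn their keep.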

Database Keynote
Things databases don’t do - but should!

Gwen Shapira (Stealth)

Relational databases have been around for 40+ years, so you’d think that by now, they do everything an engineer could possibly need. Surprisingly, this isn’t quite the case. Software architectures, design patterns, devops practices and security all evolved side by side - but not in perfect lockstep. As a result, there are common patterns that developers have to re-implement again and again, while wondering “why can’t my database just do this for me?”. In this talk, we’ll review how software engineering and architectures evolved in the last decade, and how databases can do more to keep up.

Data Engineering Keynote
Engineering Data Systems for the next 10 years of growth

Adi Polak

For most of my professional life, I dealt with data. As a data practitioner, I developed algorithms to solve real-world problems leveraging machine learning techniques. As an engineer, I led the direction that brought the value of my hands-on machine learning experience into our products and services by building upon cutting-edge and emerging technologies. This experience has taught me that scaling data systems is harder than you might think.
Supporting the operations of scalable data environments poses a challenge greater than the familiar application-level support, due to the complexity of managing data together with the application. Why does data add so much complexity? Data is big, so all systems are now becoming distributed. Data changes and evolves, and it's hard to create repeatable, automated pipelines. Plus, technology is advancing at an alarming rate, and changes are messy.
Interested in the most recent developments in the data space, and how they impact our data systems? In this talk, you will learn about the challenges of product quality, delivery velocity, production monitoring, and outage recovery, see how those can be met using best practices in the tech stack, and develop empathy for those who manage scalable data environments.

Chat GPT Keynote
Neuro-Symbolic Story Extraction from Natural Language

Using Chat GPT to write language extraction rules.
Jans Aasman (Franz Inc)

The majority of the Knowledge Graphs we are building for and with customers contain massive amounts of unstructured data, usually in the form of informal natural language. Think of doctor's notes in a medical chart, agent/customer conversations in a call center, and maintenance records for an aircraft. Yes, of course we do advanced taxonomy-based entity extraction and relationship detection on these texts, but that doesn't even come close to what we really need: true Natural Language Understanding (NLU) that turns text into an understandable story represented as usable data in a knowledge graph.
But what is the state of NLU? A recent article from MIT Technology Review concludes that AI still doesn't have the common sense to understand human language [1]. Yes, transformer models like ChatGPT and GPT-3 do an amazing job of writing prose that resembles human writing in ways that dazzle naive users and newspaper journalists. But that is only the first impression; on closer inspection you will find that these models have many shortcomings. They don't hold logical and consistent context over many paragraphs; they don't have a mental model, memory, or a sense of meaning; and they really don't understand what their inputs and outputs mean. The famous author and cognitive scientist Douglas Hofstadter calls these transformer models "cluelessly clueless."
So why mention these models at all? Because they are incredibly useful in helping us write rules for normalizing and reducing informal natural language, and even in writing rules that turn natural language into collections of reified triples that represent stories.

This presentation will cover several examples in domains where we extracted understandable, explainable, and queryable stories from unstructured text. The extraction pipeline relies on several technologies, but the two important pillars are rules written by GPT-3 and hand-edited by humans, and detailed domain ontologies. The extracted stories are represented in knowledge graphs and can be queried for deeper domain insight and predictive analytics.
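As a toy illustration of what such a rule might look like (the pattern, the `extract_triples` helper, and the `Prescription` schema are invented for this sketch, not Franz's pipeline), here is one hand-editable rule that reifies a sentence into triples:

```python
import re

# One hypothetical extraction rule of the kind GPT-3 might draft and a
# human might then edit; the pattern and schema are invented for this sketch.
RULE = re.compile(r"(?P<agent>\w+) prescribed (?P<drug>\w+) to (?P<patient>\w+)")

def extract_triples(sentence, event_id="e1"):
    m = RULE.search(sentence)
    if not m:
        return []
    # Reify the event as its own node so it can later carry time,
    # place, certainty, and other story context.
    return [
        (event_id, "type", "Prescription"),
        (event_id, "agent", m.group("agent")),
        (event_id, "drug", m.group("drug")),
        (event_id, "patient", m.group("patient")),
    ]

triples = extract_triples("Alice prescribed aspirin to Bob")
```

Reifying the event (rather than emitting a single subject-predicate-object edge) is what lets a knowledge graph represent a story rather than an isolated fact.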

Data Quality Keynote
Data Contracts - Accountable Data Quality

Chad Sanderson

Data Contracts are a mechanism for driving accountability and data ownership between producers and consumers. Contracts are used to ensure production-grade data pipelines are treated as part of the product and have clear SLAs and ownership. Chad Sanderson, former Head of Data at Convoy, has implemented Data Contracts at scale on everything from Machine Learning models to Embedded Metrics. In this talk, Chad will dive into the why, when, and how of Data Contracts, covering the spectrum from culture change to implementation details.

Closing Keynote
Fundamentals of Data Engineering

Joe Reis / Matthew Housley (Ternary Data)

Joe Reis and Matt Housley, co-authors of Fundamentals of Data Engineering, unpack data engineering from first principles. Along the way, they'll discuss the data engineering lifecycle, its undercurrents, and current and future trends in data engineering.

Data Lake Keynote
Turning your Data Lake into an Asset

Bill Inmon

People have built data lakes that quickly turn into data swamps. So what do you need to know to turn your data lake into a useful asset? This presentation describes how a data lake/data sewer can be turned into an analytical asset by building the analytical infrastructure and loading integrated data into the data lake. Turning the data lake into a data lakehouse is the best thing an organization can do in order to turn the data scientist into a useful human being.

NLP Keynote
Data Demonology

John Bohannon (Primer AI)

Once upon a time, natural language processing (NLP) was all about the data. To make a good document classifier, the big blocker was getting enough training data from those documents. To make a good question-answering model, you needed a trove of (question, document, answer) triples from your corpus. Even text generation tasks such as summarization were powered by special data. (Practitioners passed around a grubby corpus of CNN and Daily Mail news summaries, earnestly hoping to generalize beyond the news domain.) But today, your data matters less and less. Practitioners of the NLP dark arts are becoming demonologists, slinging arcane spells known as "prompts". Their magic is powered by demons known as "large language models" summoned via API. Taming these demons requires a technique called reinforcement learning from human feedback (RLHF). With the launch of ChatGPT, we have truly entered the Age of NLP Magic. You wouldn't be faulted for believing that there is no problem these demons can't solve. But you would be wrong. I will share some of the dirty secrets of the new data demonology.

Spark Keynote
Metaprogramming — making easy problems hard enough to get promoted (w/ Spark & Friends)

Holden Karau

No one enjoys upgrading code from legacy systems (if you do, let me know), and not only is it not fun, you probably won't get promoted for it. That being said, we all know we shouldn't be running on old versions of our tools -- so how do we do both? The solution is to take a simple problem we won't get promoted for -- and make it a hard one for which we can get promoted. Come for a tongue-in-cheek look at how to automate our upgrades using metaprogramming :D You'll leave with not only an idea of how to make easy problems hard -- but also when to make the easy problems hard for your (and, of course, your shareholders') benefit.
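To make the idea concrete, here is a minimal sketch of upgrade-by-metaprogramming using Python's standard `ast` module; the `legacy_sum` rename is a made-up example, not one from the talk:

```python
import ast

class RenameCalls(ast.NodeTransformer):
    """Rewrite calls to a deprecated function name -- a tiny taste of
    automating an upgrade instead of editing every call site by hand."""
    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func = ast.Name(id=self.new, ctx=ast.Load())
        return node

source = "total = legacy_sum([1, 2, 3])"
tree = RenameCalls("legacy_sum", "sum").visit(ast.parse(source))
upgraded = ast.unparse(tree)
```

Real upgrade tooling layers many such transforms (plus formatting preservation) on top of this pattern, but the core trick is exactly this: treat the codebase as data.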

Healthcare AI Keynote
AI in Healthcare : Opportunities Amid Landmines

Andrew Nguyen

Long before COVID, there was already tremendous interest and excitement around the use of AI, machine learning, and analytics in healthcare. From research, to drug discovery, to clinical decision support, the potential use cases are endless -- all with the ultimate goal of ensuring the right patients get the right care at the right time. However, among these opportunities lies a hidden field of landmines. One misstep and what was once an exciting and promising project becomes a drastic failure.
All too often, the focus is on the need for better data, algorithms, pipelines, architectures, or infrastructure. But what healthcare issues are we trying to address? Do we have a clear understanding of the problems that patients and physicians encounter? What exactly is needed to provide that expected level of healthcare delivery? If our fancy technologies and algorithms don’t improve clinical outcomes, even statistically valid “results” are not useful.
The challenges that we face when analyzing healthcare data are not only technical problems needing technical solutions. These challenges reflect the complexities of the delivery of medicine. On the one hand, our scientific understanding of physiology, pathophysiology, diagnostics, and therapeutics is constantly changing. On the other, there is a delicate balance of sociopolitical, economic, and operational factors that influence what data are collected, and why, where, and how they are collected. The end result is a fragmented and messy world of data from the real world.

This talk is for anyone interested in healthcare applications of AI and is asking:
How have we evolved? What has been done so far and what is slowing down the whole process?
What are the barriers and challenges in healthcare data, is it really that different and difficult to understand?
How can we accelerate the current momentum?

We will discuss how modern technologies and architectures are essential and necessary but unfortunately not sufficient, that we need a mindful orchestration of people, processes, and technology for the success of AI in healthcare.

Math Keynote
How Math Simplifies AI

Hala Nelson (James Madison University)

In this talk, Hala will make the case that the fastest way to get into the AI field, and to apply it successfully, is to study AI's underlying mathematical foundation and to understand the symbiotic interdependence between AI and math. In the midst of a mad rush by many industries to integrate AI and data-driven technologies into their systems and operations, a firm grasp of the mathematics of AI bridges the gap between the unlimited potential of AI and its successful development and application. An extra advantage is that mathematics differentiates the reality and true potential of AI from exaggeration, false claims, and fiction. Thankfully, only a handful of mathematical ideas underlie the seemingly different models for computer vision, natural language, data analysis, and machine learning, so mastering those is the key to accessing the world of AI. The underlying mathematics provides a fundamental structure that makes the overwhelming pace of progress in AI far less dizzying for newcomers to the field. We will briefly survey the basic mathematics required to fast-track newcomers into many aspects of AI, from language models such as ChatGPT and nano-robotics to AI applications such as food systems and the supply chain. This talk is for a broad audience in business, data, engineering, and similar fields who do not possess advanced mathematical knowledge, and for people with strong STEM backgrounds but without AI knowledge.

Metadata Keynote
In Search of the Control Plane for Data

Shirshanka Das (Acryl)

Data Discovery, Data Observability, Data Quality, Data Governance, and Data Management are all linked by a common thread: metadata. Yet we seem content to pursue each of these as a separate problem with minimal linkage across them. As a result, problems such as access management, data retention, cost attribution and optimization, and data-aware orchestration have become impossible to solve in a uniform manner across the stack. We've gotten into this state because, while we were busy building individual tools that solve specific use cases, we forgot to build a harmonizing layer that makes these tools work together to achieve outsized outcomes for the business.
I'm calling this layer the control plane of data.
In this talk, I'll describe what the control plane of data looks like, and how it fits into the reference architecture for the deconstructed data stack: a data stack that includes operational data stores, streaming systems, transformation engines, BI tools, warehouses, ML tools, and orchestrators.

We’ll dig into the fundamental characteristics of a control plane:
• Breadth (completeness)
• Latency (freshness)
• Scale
• Consistency
• Source of Truth

We’ll discuss what use-cases you can accomplish with a unified control plane and why this leads to a simpler, more flexible data stack.
Finally, I’ll share how DataHub, the open source metadata platform for the modern data stack is evolving towards this vision of the future, through some real-world examples.

Data ROI Keynote
Show me the money: Practical tips for data teams to show ROI

Juan Sequeda

If you are being asked to provide ROI and are not able to, you should be concerned. The irony is that if you are a data person but don't have the data to measure your own impact, how can you conceivably measure it for other parts of the organization? It's a sign that the team is not clearly aligned with the larger goals of the organization and that you are not focusing on the right use cases. In 2023, sh*t's going to get real. Abundant times are over. If you want to keep your data job amid this wave of layoffs, you need to show how your role, work, and projects align with the top-level strategic and organizational goals of the company that will make or save money. In this talk, I'll provide practical tips that data teams can start using immediately to follow the money, and ROI models to consider in order to show the money.

The Modern Data Stack Evolution - 2023 will be the year of consolidation

Chris Tabb (LEIT DATA)

Each era of the data world delivers new ways of making data more accessible to add business value. But to stay modern, the stack needs to evolve, and simplifying the vendor market and consolidating the technology required to deliver business value from data is the key to success. This session will provide an overview of the evolution of data platforms into what we now know as the Modern Data Stack.

The Architected Cloud Environment

Bill Inmon

If you are not careful, your organization is throwing money and other resources away by building a cloud environment that is wasteful and inefficient. What is meant by a cloud data architecture, and how and why can it enhance your cloud environment? Do you know how much money and time are being wasted? How much data do you have in the cloud that is worthless, yet you are paying for it? And that is slowing everything down? This presentation discusses an approach that will save you time and money and make your cloud more effective, all without losing the advantages of going to the cloud. Sounds too good to be true? Come see this presentation.

Cassandra on ACID: this changes everything

Patrick McFadin - DataStax

Cassandra has established itself as the database of choice for high-scale, resilient, globally available data applications for more than ten years. All this has come with one tradeoff: working with eventually consistent data models. "I can't use Cassandra because it doesn't have transactions." With the arrival of Accord and CEP-15, those days will soon be over. ACID-compliant, strictly serializable transactions are available alongside everything Cassandra already promises in scale, distribution, and reliability. Sounds like magic? Nope, it's computer science, and I'll tell you all about it. Prepare to completely level up any data-driven applications you have, with far fewer tradeoffs. What we'll cover:
• Transactional guarantees now available
• Changes in CQL syntax
• Examples of how it could be used in your application

Real-Time Recommendations with Graph and Event Streaming

Aaron Ploetz - DataStax

Real-time data is becoming increasingly important to the success of the enterprise mission. Successfully collecting incoming data while maintaining the ability to react both quickly and strategically is paramount. However, the data collection process is often far from trivial. Write contention is a common bottleneck in many large-scale architectures. To further complicate matters, today's developers are abstracted farther away from the critical areas of the write path than ever before. How can we ensure that all of our data is being stored without overloading the storage layer?
Focusing on the use case of a movie recommendation service, we will discuss the choices and trade-offs of different architecture components. We will show the data model implementation as well as ways to leverage our graph database to maximize data discovery. Finally, we will show ways to improve data delivery guarantees and how to mitigate write back-pressure on our database by using an event streaming platform (Apache Pulsar).
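One way to picture the back-pressure mitigation is a buffer sitting between producers and the database. The sketch below is a toy stand-in (the `BufferedWriter` class is invented for illustration; a platform like Pulsar provides this durably and at scale): producers append to a log without blocking, and a consumer drains bounded batches so the database sees a smoothed write rate instead of spikes.

```python
from collections import deque

class BufferedWriter:
    """Absorb write bursts in a log and drain them to storage in
    bounded batches -- the essence of event-streaming back-pressure relief."""
    def __init__(self, store, batch_size=3):
        self.log = deque()          # stand-in for a topic/partition
        self.store = store          # the "database": here just a list
        self.batch_size = batch_size

    def publish(self, event):
        self.log.append(event)      # fast path: never blocks on the DB

    def drain_once(self):
        n = min(self.batch_size, len(self.log))
        batch = [self.log.popleft() for _ in range(n)]
        self.store.extend(batch)    # consumer side: bounded batch per tick
        return len(batch)

store = []
writer = BufferedWriter(store)
for i in range(7):                  # a burst of 7 writes
    writer.publish({"rating": i})
while writer.drain_once():
    pass
```

The real system adds durability, partitioning, and delivery guarantees, but the decoupling of publish rate from write rate is the core idea.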

How to build someone we can talk to

Jonathan Mugan - DeUmbra

So many intelligent robots have come and gone, failing to become a commercial success. We’ve lost Aibo, Romo, Jibo, Baxter, Scout, and even Alexa is reducing staff. I posit that they failed because you can’t talk to them, not really. AI has recently made substantial progress, speech recognition now actually works, and we have neural networks such as ChatGPT that produce astounding natural language. But you can’t just throw a neural network into a robot because there isn’t anybody home in those networks—they are just purposeless mimicry agents with no understanding of what they are saying. This lack of understanding means that they can’t be relied upon because the mistakes they make are ones that no human would make, not even a child. Like a child first learning about the world, a robot needs a mental model of the current real-world situation and must use that model to understand what you say and generate a meaningful response. This model must be composable to represent the infinite possibilities of our world, and to keep up in a conversation, it must be built on the fly using human-like, immediate learning. This talk will cover what is required to build that mental model so that robots can begin to understand the world as well as human children do. Intelligent robots won’t be a commercial success based on their form—they aren’t as cute and cuddly as cats and dogs—but robots can be much more if we can talk to them.

Why your database needs an API

Jeffrey Carpenter

For many years, application developers have been dependent on database administrators (DBAs) in order to design schema and write efficient queries. This worked in a world in which client applications used drivers to talk to databases within the same datacenter using custom binary protocols. In our modern cloud native world, much has changed. Developers now write in a wider variety of languages than ever, and HTTP is the dominant network protocol, making traditional drivers more difficult to use effectively.
In this talk, we’ll introduce the concept of a Data API gateway and its key features, and examine the Stargate project, a data API gateway built on top of Apache Cassandra. We’ll discuss the various API styles that Stargate supports including REST, GraphQL, Document, and gRPC, and the benefits of each. We’ll also dig into Stargate’s architecture to see how it scales horizontally, abstracts the underlying database, and how easy it is to create new APIs.
Prerequisite knowledge: basic familiarity with HTTP APIs and DB query languages.
#cassandra #database #dataengineering
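As a rough illustration of the gateway idea only (not Stargate's actual code; all class names below are invented), the sketch layers two API "styles" over one shared storage engine:

```python
class Storage:
    """The single underlying store (Cassandra, in Stargate's case)."""
    def __init__(self):
        self.rows = {}

class RestStyle:
    """Resource-oriented verbs over the shared engine."""
    def __init__(self, engine):
        self.engine = engine
    def put(self, key, doc):
        self.engine.rows[key] = doc
    def get(self, key):
        return self.engine.rows.get(key)

class DocumentStyle:
    """Path-based access to nested fields, over the very same engine."""
    def __init__(self, engine):
        self.engine = engine
    def get_path(self, key, path):
        doc = self.engine.rows.get(key, {})
        for part in path.split("/"):
            doc = doc[part]
        return doc

engine = Storage()
rest, docs = RestStyle(engine), DocumentStyle(engine)
rest.put("user1", {"name": "Ada", "address": {"city": "Austin"}})
```

The point of the pattern is that each API style is a thin translation layer, so adding a new one never requires touching the database itself.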

Bigger Data beats Better Math, No Question

Brent Schneeman

This session explores the trade-offs between ML model architectures, the quantity of training data, and the quality of training data. For a constant model architecture, how does more data affect performance? How about higher-quality data? For a constant data set, how does different math (i.e., different model architectures) affect performance? This talk will dive into these questions, both quantitatively and qualitatively, and present an economic framework to help make the trade-off between investing in bigger data or better math. The problem presented in the talk is fashion-industry image classification, but the concepts are broadly generalizable across industries and ML models.
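A toy simulation can make the quantity/quality trade-off tangible. Everything below is invented for illustration (it is not the talk's experiment): "performance" is the mean absolute error of estimating a known true value from noisy samples, so we can vary sample count (quantity) and noise level (quality) independently.

```python
import random
import statistics

def estimation_error(n, noise, trials=200, seed=0):
    """Mean absolute error of estimating a true value of 1.0 from n
    samples with Gaussian noise -- a stand-in for 'model performance'."""
    rng = random.Random(seed)
    errs = []
    for _ in range(trials):
        samples = [1.0 + rng.gauss(0, noise) for _ in range(n)]
        errs.append(abs(statistics.mean(samples) - 1.0))
    return statistics.mean(errs)

# Compare a small high-quality dataset against a large noisy one.
small_clean = estimation_error(n=10, noise=0.5)
big_noisy = estimation_error(n=1000, noise=2.0)
```

Because error shrinks roughly as noise divided by the square root of the sample count, either lever can win depending on cost, which is exactly where an economic framework becomes useful.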

Towards Data Science Design Patterns

Michael Berthold (KNIME)

In this talk, Michael Berthold will present first ideas on how data flow diagrams can be used to model data science design patterns. Using a number of explanatory patterns, Michael will demonstrate how they can be used to explain and document data science best practices, aid data science education, and enable validation of data science processes.

Reinforcement Learning with Ray RLlib

Dean Wampler - IBM

Reinforcement Learning (RL) trains an agent to maximize a cumulative reward in an environment. It has been used to achieve expert-level performance in Atari games and even the game of Go, and is now being applied to many other problems. Dean will begin with why RL is important and how it works, and discuss several applications of RL. Then he will discuss how RL requires a variety of computational patterns: data processing, simulations, model training, model serving, etc. Few frameworks efficiently support all these patterns at scale. Finally, Dean will show how RLlib, implemented with Ray, seamlessly and efficiently supports RL, providing an ideal platform for building Python-based RL applications with an intuitive, flexible API.
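For readers new to RL, the core loop can be sketched with tabular Q-learning on a tiny corridor environment. This is a from-scratch illustration of the pattern only; RLlib provides a much higher-level API over environments, trainers, and distributed execution.

```python
import random

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=1):
    """Tabular Q-learning on a 5-state corridor: start at 0, reward at 4."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(5) for a in (-1, +1)}
    for _ in range(episodes):
        s = 0
        while s != 4:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.choice((-1, +1))
            else:
                a = max((-1, +1), key=lambda act: q[(s, act)])
            s2 = min(4, max(0, s + a))
            r = 1.0 if s2 == 4 else 0.0
            # the Q-learning update rule
            best_next = max(q[(s2, -1)], q[(s2, +1)])
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q

q = train()
# After training, the greedy policy should walk right toward the reward.
policy = {s: max((-1, +1), key=lambda act: q[(s, act)]) for s in range(4)}
```

Note how the loop interleaves simulation, learning, and acting -- the mix of computational patterns the talk argues few frameworks support well at scale.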

Building MLOps Organizations for Scale

Joey Jablonski - (Pythian)

Many organizations have now successfully created their first analytical model and leveraged it to drive new business value and more impactful decisions. Now the hard part begins, building an operational framework for the deployment of future models and management at scale with governed principles of performance, bias, responsiveness, and accuracy.
Today’s Machine Learning Operations (ML Ops) capabilities are more than just technology. They are capabilities that must align with the model development process, a data engineering platform, and an operational model. These processes must ensure that models are unbiased when deployed, that the accuracy of their answers does not drift over time, and that upstream data changes do not affect the accuracy of model outputs.
This process starts with defining metrics of success. These are the anchor point for how we measure and intervene in the lifecycle of our models. Effective metrics align with organizational goals for product adoption and revenue growth targets.
These metrics become key components in the design of our ML Ops technology stack, assisting us in prioritizing what aspects of model performance we monitor, how we intervene and what urgency data science and ML engineering teams feel when receiving alerts about adverse performance. The technology landscape for ML Ops is changing rapidly and making purposeful decisions early about architecture and modularity will ensure seamless addition of new future capabilities.
Behind the ML Ops technology stack is our operational model. These are the constructs for how teams operate across data engineering, data science and ML engineering to build, test and deploy analytical assets. Operational models capture the process for work handoff, verification, and review to ensure adherence with organization and industry objectives. Operational models are backed by our governance standards that define testing criteria for new models, de-identification of data used for training and management of multiple models that converge for driving application experiences.
Bringing together defined metrics, flexible technology stacks and a well-defined operational model will enable your organization to deploy reliable models at scale. The collection of these capabilities will allow the organization to move to higher levels of maturity in their operations, higher levels of reliability and enable scalability as the use of analytical models for decisioning increases.

Geospatial Keynote
"one ant, one bird, one tree"...

Bonny McClain

"When you have seen one ant, one bird, one tree, you have not seen them all.”--E.O. Wilson
The power of GIS and visualizing history is the transitory nature of the present. The world writ large doesn’t exist in political terms, marketing campaigns, or lifespans--think of a continuous link evolving into our current zeitgeist--and well beyond.
Geospatial analysts and scientists evaluate demographic shifts, social and cultural shifts, economic shifts, and environmental dynamics--what we need now is a powerful intersection of our insights. Understanding the role of location intelligence and spatial awareness just might be the missing link.
Using open source tools and data we will examine how powerful data questions elevate our discussion and re-focus potential solutions to address community level discord and marginalization.

Authoring unified batch and streaming workflows

Santona Tuli - (Upsolver)

The ability to orchestrate batch jobs into data pipelines that correctly honor dependencies and gracefully handle failure states has been a boon for data engineering in recent years. Although our ability to handle data from streaming sources has also improved, most end-to-end pipelines can only guarantee data freshness up to a batch interval. Streaming data end up in warehouses or lakes before they are used in analytics or ML deliverables through a secondary pipeline. The exception is online inference in ML applications, where streaming data are featurized and sent to a prediction endpoint in one motion. But even then, we don't see unification of the offline training and online serving components of the application lifecycle. In other words, batch and streaming data pipelines are universally handled separately.
Today, the increase in streaming data sources and data freshness SLAs measured in minutes or seconds, even for analytics use cases, creates the need for pipelines that can handle both streaming and batch data. Unlike batch-only use cases, the price of even an infrequent failure is much higher for a pipeline feeding live actionable insights or applications. Predicting and preparing for all failure states and corner cases in code simply isn't realistic when products rely on reacting to data and schema changes in flight. Most orchestration frameworks tend to be data-unaware, making it difficult to react to changes and allowing bad data to flow through otherwise functioning pipelines without an alert.
In this talk, we will dive into:
• Why modern data needs are not compatible with batch-only orchestrators.
• How programmatic pipeline authoring becomes an asymptotic time sink for developers and a source of tech debt for operators.
• A new approach to data pipelines, where the execution plan is automated from declarative transformation code, and implicit, halting data contracts between each step guarantee data quality all the way to end-user deliverables.
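The "halting data contract" idea can be sketched in a few lines: each step declares the schema it emits, and the runner stops rather than letting bad data flow downstream. Everything below (the `check` and `run_pipeline` helpers and the toy steps) is invented for illustration, not the product's implementation.

```python
def check(record, contract):
    """A contract here is just required field names and types."""
    return all(isinstance(record.get(k), t) for k, t in contract.items())

def run_pipeline(records, steps):
    """Run each (step, contract) pair; halt on the first violation
    instead of passing bad data to the next step."""
    for step, contract in steps:
        records = [step(r) for r in records]
        bad = [r for r in records if not check(r, contract)]
        if bad:
            raise ValueError(f"contract violated after {step.__name__}: {bad[0]}")
    return records

def parse(r):
    return {"user_id": int(r["user_id"]), "amount": r["amount"]}

def enrich(r):
    return {**r, "amount_usd": float(r["amount"])}

steps = [
    (parse, {"user_id": int}),
    (enrich, {"user_id": int, "amount_usd": float}),
]
out = run_pipeline([{"user_id": "7", "amount": "9.99"}], steps)
```

A data-aware orchestrator generalizes this: the contracts are implicit in the declared transformations, and the halt happens in flight rather than after a bad load.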

Computer Vision Landscape at Chegg: Present and Future

Sanghamitra Deb - (Chegg)

Millions of people all around the world learn with Chegg. Education at Chegg is powered by the depth and diversity of the content that we have. A huge part of our content is in the form of images, uploaded either by students or by content creators. Images contain text that is extracted using a transcription service. Very often, uploaded images are noisy, which leads to irrelevant characters or words in the transcribed text. Using object detection techniques, we developed a service that extracts the relevant parts of the image and uses a transcription service to get clean text. In the first part of the presentation, I will talk about building an object detection model using YOLO for cropping and masking images to obtain cleaner text from transcription. YOLO is a deep learning object detection and recognition framework that is able to produce highly accurate results with low latency. In the second part of my presentation, I will talk about building the Computer Vision landscape at Chegg. Starting from images of academic materials composed of elements such as text, equations, and diagrams, we create a pipeline for extracting these image elements. Using state-of-the-art deep learning techniques, we create embeddings for these elements to enhance downstream machine learning models such as content quality and similarity.

Lessons learned adopting Ray

Zac Carrico - (Freenome)

Ray is an open-source Python framework for distributed computing. It offers many attractive features for machine learning model development. It is highly scalable, has a simple API for distributed computing, and provides integrations for the most common machine learning packages. This presentation will describe our journey using Ray at Freenome to improve the scalability and ease-of-use of our machine learning platform. Freenome is developing next-generation blood tests for early cancer detection, powered by multiomics and machine learning. We are continuously exploring ways to make model development more efficient and scalable because the faster we can evaluate new models, the faster we can test their potential utility for use in new cancer diagnosis products. This presentation describes why we chose Ray and our experience adopting it. The presentation is general enough to be applicable to other industries and is appropriate for any group interested in improving machine learning model development. It is especially appropriate for teams that are looking for opportunities to scale and simplify machine learning operations (MLOps).
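For readers unfamiliar with Ray, its core task pattern (submit work, get handles back, gather results) resembles standard-library futures. The sketch below uses `concurrent.futures` purely as a stand-in for illustration; Ray adds distribution across machines, scheduling, and a shared object store on top of this shape, and `featurize` is an invented placeholder function.

```python
from concurrent.futures import ThreadPoolExecutor

def featurize(sample):
    """Stand-in for real feature extraction on one sample."""
    return sum(sample) / len(sample)

samples = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Submit work, keep handles, gather results -- the same shape as
# launching Ray remote tasks and calling ray.get() on their refs.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(featurize, s) for s in samples]
    features = [f.result() for f in futures]
```

The appeal of Ray for teams like Freenome's is that scaling this pattern from one machine to a cluster requires very little change to the calling code.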

Your laptop is faster than your data warehouse. Why wait for the cloud? (DuckDB)

Ryan Boyd - (MotherDuck)

The tech industry has been on a pendulum swinging between centralized and distributed computing power. We started with mainframes accessed through thin clients, moved to fully-functional PCs and now have these PCs backed by cloud compute -- the modern "mainframes." With advances in CPUs, memory, SSDs, and the software that enables it all, our personal machines are powerful beasts relegated to handling a few Chrome tabs and sitting 90% idle. As data engineers and data analysts, this seems like a waste that's not only expensive, but also impacting the environment. How can we take advantage of this compute?
DuckDB is an open source in-process OLAP engine for vectorized analysis of columnar data. It supports SQL, including some advanced analytics capabilities like window functions and data sampling. In this session, I'll introduce DuckDB, show its capabilities with a bunch of live demos and talk about how we're exploring using the cloud and a hybrid querying architecture to support in-process analytics instead of replacing it.

Zero-copy integration

Dave McComb - (Semantic Arts)

The reason we need to talk about zero copy integration is that its opposite is so well entrenched that most practitioners can't imagine a world without some form of extract, transform and load, or of systems integrated by copying and manipulating data through APIs. The traditional enterprise data landscape is an almost endless set of pipelines, data sets, and chutes and ladders that ferry data from its source to myriad destinations. This seems necessary because each application we use and each tool we employ has its own bespoke way of structuring data. Each application ends up morphing the prior application's idiosyncrasies into its own. In this talk we unpack the prerequisites needed to achieve data-centricity and zero copy integration. We will present two case studies of firms that are enjoying zero copy integration, along with a simple demonstration to make the idea more concrete.

Clinical trials exploration: surfacing a clinical application from a larger Bio-Pharma KnowledgeGraph

David Hughes - Graphable

Clinical, proteomic, and pharma knowledge graphs are complex aggregations of constituent subgraphs. These linked graphs provide meaningful insights as a whole, but in many cases a single subgraph can independently prove to be a valuable asset. In this session, David will identify possible applications of the NLM's Clinical Trials resource as a standalone application. He will review how to query the API and how to populate/run ETL through tools like Hume Orchestra and Apache Hop. He will then explore how to create an application using Streamlit as a POC, and discuss potential refinements.

Biomedical Knowledge Graph to Power Better AI in Health

Ying Ding - University of Texas

Understanding human health and providing better care are knowledge-intensive endeavors, covering the broad spectrum of pre-clinical and clinical practices. Multi-perspective health data have been accumulated and augmented with semantics, but the connectedness of heterogeneous and multimodal health data remains a challenge. This talk highlights the potential of using knowledge graphs to enrich the connectedness of pre-clinical and clinical data and showcases best practices of AI in health.

An Introduction to Apache Pinot

Tim Berglund - StarTree

When things get a little bit cheaper, we buy a little bit more of them. When things get cheaper by several orders of magnitude, you don't just see changes in the margins, but fundamental transformations in entire ecosystems. Apache Pinot is a driver of this kind of transformation in the world of real-time analytics.
Pinot is a real-time, distributed, user-facing analytics database. The rich set of indexing strategies makes it a perfect fit for running highly concurrent queries on multi-dimensional data, often with millisecond latency. It has out-of-the-box integration with Apache Kafka, S3, Presto, HDFS, and more. And it's so much faster on typical analytics workloads that it is not just a marginally better data warehouse, but the cornerstone of the next revolution in analytics: systems that expose data not just to internal decision makers, but to customers using the system itself. Pinot helps expand the definition of a "decision-maker" not just down the org chart, but out of the organization to everyone who uses the system.
In this talk, you'll learn how Pinot is put together and why it performs the way it does. You'll leave knowing its architecture, how to query it, and why it's a critical infrastructure component in the modern data stack. This is a technology you're likely to need soon, so come to this talk for a jumpstart.

Your data infrastructure will be in Kubernetes

Patrick McFadin / Jeff Carpenter - (DataStax)

Are people actually moving stateful workloads to K8s? Yes, yes they are. In the process of writing the book Managing Cloud Native Data on Kubernetes, we spoke with a bunch of the experts who are moving various types of stateful workloads to K8s. In this talk we’ll share what we learned:
• What’s solid: storage and workload management
• What’s good and getting better: operators, streaming, and database workloads
• What needs work: analytics and machine learning
We’ll also share what this means for your data infrastructure:
• Infrastructure should conform to your application and not the other way around.
• Stop creating new data infrastructure projects and start assembling new architectures
• Look to open source projects for inspiration

Artistic Processing: the Untapped Power of Data Visualization

Weidong Yang (Kineviz)

Data visualization is often regarded as an output, a way to package the results of analysis in a digestible form for decision makers. Less explored is the power of visualization to drive analysis, increasing speed and flexibility while revealing hidden stories and potential black swan events.
Humans regularly ingest and analyze a firehose of high dimensional data—the sensory stream from our surroundings. We’re especially tuned to detect change and tease out patterns from complexity. Approaching visualization as an ongoing process rather than an end result enables us to tap this innate sensitivity, accelerating the exploration of data and development of hypotheses.
In this talk, Weidong Yang will discuss the theory and practice of data visualization across his career as an artist and scientist. Examples will span manufacturing, healthcare, law enforcement, dance, and installation art. Methodologies including dynamic filters and schemas, capturing context through graphs, dimensionality reduction for clarity, and more will be demonstrated. Attendees will come away from the talk with new strategies to leverage data visualization for both analysis and communication.

A DataOps Approach to Global Data Observability

Arvind Prabhakar (StreamSets)

DataOps is a necessary practice for delivering continuous data analytics. While many equate DataOps with "DevOps for data", smart data pipelines are the foundation needed to enable a people, processes, and technology framework for DataOps. Current data observability approaches promise data health by applying algorithms to black-box pipelines as an afterthought. True global data observability is only achieved with instrumented data pipelines that provide a system-level view of data health and harness data drift to ensure reliable analytics delivery.
We at StreamSets believe that data observability is a feature of data integration, not a platform by itself. Our DataOps platform is a data integration platform built on the foundation of DataOps. It decouples all data producers and consumers, giving unprecedented ability to work with changing data structures, formats, infrastructures and even semantics, aka data drift. While smart data pipelines enable the automatic handling of data drift for the most part, the platform's built-in data observability capabilities provide a system-level view of data and data architecture health.
In this session, I will walk you through the internals of the StreamSets DataOps platform and how you can use it to solve data integration with built-in data observability.

Outrageous ideas for Graph Databases

Max De Marzi (Relational AI)

Almost every graph database vendor raised money in 2021. I am glad they did, because they are going to need it. Our current graph databases are terrible and need a lot of work. There, I said it. It's the ugly truth in our little niche industry. That's why, despite waiting over a decade for the "Year of the Graph" to come, we still haven't set the world on fire. Graph databases can be painfully slow, they can't handle non-graph workloads, their APIs are clunky, and their query languages are either hard to learn or hard to scale. Most graph projects require expert shepherding to succeed. 80% of the work takes 20% of the time, but that last 20% takes forever. The graph database vendors optimize for new users, not grizzled veterans. They optimize for sales, not solutions. Come listen to a rant by an industry OG on where we could go from here if we took the time to listen to the users that haven't given up on us yet.

How to automate data monitoring to support a scaling data strategy

Andy Petrella (Kensu)

As stated in the Fundamentals of Data Engineering, the undercurrents of the data engineer role include data management, data quality, and data observability practices. Therefore, data engineers who develop and maintain pipelines are becoming responsible for monitoring and controlling the quality of their deliverables.
In this talk, Andy will show how the tasks supporting these responsibilities can be executed without impacting the efficiency of the data teams yet increasing the value generated for the organization. This is important to maintain the efficiency of the data teams because they must focus on scaling the data strategy of their organizations.
Andy will also demonstrate how to automate most of these tasks, such as monitoring, detection, and troubleshooting, by making Spark jobs data-observable and connecting them to a data observability platform.
#observability #spark

Apache Iceberg: An Architectural Look Under the Covers

Alex Merced (Dremio)

Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve this is hiding the complexity of underlying data structures and physical data storage from users. The de facto standard has been the Hive table format, released by Facebook in 2009, which addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS. In this talk, Alex Merced will walk through the architectural details of Iceberg, and show how the Iceberg table format addresses the shortcomings of the Hive format, as well as additional benefits that stem from Iceberg’s approach.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design

Apache Iceberg and the Right to Be Forgotten

Alex Merced (Dremio)

Regulatory requirements can make data engineering more complex than it otherwise would be. In this talk, we will discuss how to navigate hard deletions in an Apache Iceberg-based data lakehouse. You will learn:
- How to ensure data is hard deleted using copy-on-write Iceberg tables
- How to ensure data is hard deleted using merge-on-read Iceberg tables
- Other strategic and technical options to consider when architecting regulatory compliance
Some familiarity with Apache Iceberg and the data lakehouse concept will be helpful.
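The difference between the two deletion strategies can be illustrated with a toy, pure-Python model (a sketch of the concepts, not the Iceberg API; `data_file` stands in for a Parquet data file):

```python
# Toy model of Iceberg's two row-level delete strategies.
def copy_on_write_delete(data_file, predicate):
    """Copy-on-write: rewrite the data file without the matching rows,
    so the deleted values are physically gone after a single write."""
    return [row for row in data_file if not predicate(row)]

def merge_on_read_delete(data_file, predicate):
    """Merge-on-read: leave the data file untouched and write a delete file;
    readers merge the two at query time, so a *hard* delete still requires
    the data file to be rewritten (compacted) later."""
    delete_file = [row for row in data_file if predicate(row)]
    return data_file, delete_file
```

For right-to-be-forgotten purposes the key distinction is visible above: copy-on-write removes the bytes immediately, while merge-on-read only hides them until the underlying file is rewritten.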

Visualizing Connected Data as It Evolves Over Time

Janet Six - (Tom Sawyer)

Connected data visualization and analysis techniques are becoming increasingly popular for their ability to work well with graph databases and to communicate key results to stakeholders and decision makers. Applying these techniques to static connected data benefits from specific methods, but what are the best practices for visualizing connected data that changes dynamically? And how do you best model the changes that are occurring in the system?
In this session, we will discuss how connected data can change over time and the implications of those changes for visualization and analysis techniques. We will also explore visualization techniques for dynamically changing connected data, including social networks that evolve over time, digital transformation model simulations, and event analysis. These visualization techniques allow us to:
• Apply post-situation analysis so that we can understand what happened in the system and when
• Better understand simulations of future scenarios and compare them
• Discover important trends

Database Schema Optimization in Apache Cassandra

Artem Chebotko - (DataStax)

Level up your data modeling skills with these five schema optimization techniques that are frequently used to create efficient and scalable data models for Apache Cassandra:
• splitting large partitions and dynamic bucketing
• data duplication and batches
• indexes and materialized views
• race conditions and lightweight transactions
• tombstone-related problems and solutions
This presentation will be most beneficial to audiences with prior experience using Cassandra.
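To make the first technique concrete, here is a hedged, pure-Python sketch of dynamic time bucketing (the sensor-reading schema is an invented example, not from the talk):

```python
from datetime import datetime, timezone

# Dynamic bucketing: an unbounded partition (e.g., all readings for one
# sensor) is split into bounded per-day partitions by folding a time bucket
# into the partition key.
def partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Compute the composite partition key (sensor_id, day_bucket)."""
    day_bucket = ts.strftime("%Y-%m-%d")
    return (sensor_id, day_bucket)

# Every read or write for a given sensor and day targets one bounded partition.
key = partition_key("sensor-42", datetime(2023, 1, 28, 14, 30, tzinfo=timezone.utc))
```

In CQL terms this corresponds to making the bucket part of the partition key, e.g. `PRIMARY KEY ((sensor_id, day_bucket), ts)`, so no single partition grows without bound.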

What You Can't do With Graph Databases

Tomás Sobat Stöfsel - (Vaticle)

Developing with graph databases has a number of challenges, such as the modelling of complex schemas and maintaining data consistency in your database. In this talk, we discuss how TypeDB addresses these challenges, as well as how it compares to property graph databases. We'll look at how to read and write data, how to model complex domains, and TypeDB's ability to infer new data. The main differences between TypeDB and graph databases can be summarised as:
1. TypeDB provides a concept-level schema with a type system that fully implements the Entity-Relationship (ER) model. Graph databases, on the other hand, use vertices and edges without integrity constraints imposed in the form of a schema.
2. TypeDB contains a built-in inference engine; graph databases don't provide native inferencing capabilities.
3. TypeDB offers an abstraction layer over a graph, leveraging a graph engine under the hood to create a higher-level model; graph databases offer a lower level of abstraction.

Graphing without the database - creating graphs from relational databases

Corey Lanum - (Cambridge Intelligence)

Many projects I’ve worked on assume that presenting business users with a node-link view means transferring all the data to a graph database. Worse still, I’ve seen teams duplicate and synchronize their data into a second database, creating new layers of completely unnecessary complexity.
The truth is that analyzing and visualizing data as a graph doesn’t necessarily mean storing the data in a graph database, or in a graph format, at all.
While graph databases have value for complex traversal queries, in many cases, they create unnecessary complexity. The simpler model of translating tabular data to nodes and links on the fly is easier to implement and allows the flexibility to choose from different data models of the same source.
In this talk, I’ll walk through the process and architecture of building a graph application from the standard relational database you probably already have.
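A minimal sketch of that on-the-fly translation, using an invented two-table schema in SQLite as the stand-in relational database:

```python
import sqlite3

# Hypothetical schema standing in for "the relational database you probably
# already have": a person table plus a join table of who knows whom.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE knows(src INTEGER, dst INTEGER);
    INSERT INTO person VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO knows VALUES (1, 2);
""")

# Translate tabular rows into nodes and links on the fly -- no graph
# database, and no second synchronized copy of the data.
nodes = [{"id": i, "label": n}
         for i, n in con.execute("SELECT id, name FROM person ORDER BY id")]
links = [{"source": s, "target": t}
         for s, t in con.execute("SELECT src, dst FROM knows")]
```

The node-link structures produced this way can be handed straight to a visualization layer, and a different SQL query yields a different graph model of the same source.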

Enabling the Computational Future of Biology

Tomás Sobat Stöfsel - (Vaticle)

Computational biology has revolutionised biomedicine. The volume of data it is generating is growing exponentially. This requires tools that enable computational and non-computational biologists to collaborate and derive meaningful insights. However, traditional systems are inadequate to accurately model and handle data at this scale and complexity. In this talk, we discuss how TypeDB enables biologists to build a deeper understanding of life, and increase the probability of groundbreaking discoveries, across the life sciences.

Where Is the Graph? Best practices for extracting data from unstructured data sources for effective visualization and analysis

Janet Six - (Tom Sawyer)

As unstructured data becomes larger and more complex, and stakeholders expect increasingly useful results, we can apply graph techniques to discover ever-elusive insights. These graph techniques can assist with data discovery and understanding, and can be used to communicate findings to decision makers. But what are the best practices for applying graph technology to the connected data inherent in unstructured data sources? Where is the graph?
Currently, many companies are still trying to visualize or analyze the whole data source. This leads to mixed results and hairball visualizations that may be beautiful artistically, but don’t show the level of detail needed for visual analysis and for communicating results to stakeholders. How do we get beyond all the noise in unstructured data to discover the knowledge needed to bring business value to that data?
In this session, we will discuss several approaches to finding useful graphs in your unstructured data and how to apply visualization and analysis techniques to them.

Data Professional's Career: Techniques to Practice Rigor and Avoid Ten Mistakes

Jike Chong / Yue Cathy Chang

Trust is precious to build for all data science work. Much of the trust in data and modeling comes from the rigor with which data professionals treat the subject. What are some aspects of rigor required throughout a data professional's career?
One aspect of rigor is to detect anti-patterns in your work. Anti-patterns are undesirable data science practices that increase the risk of failure. Just as a chess master can observe a chessboard and articulate which side may be winning, you can detect anti-patterns in your projects and work before they cause irreparable damage.
This talk highlights the top ten areas in which to practice rigor and mitigate anti-patterns, focusing on four career development stages, from team lead through team manager and function director to executive, with scopes of influence ranging from within-team to industry-wide.
Requirements : Some experience as a data professional.
Takeaways : Recognize and detect anti-patterns in your projects and work before they cause irreparable damage.

For the overwhelmed data professionals: What to do when there is so much to do?

Jike Chong / Yue Cathy Chang

95% of the companies with data science teams have teams of fewer than ten members. In a nimble team with a wide range of possibilities to make business impacts with data science, how do you explore and prioritize the opportunities? How do you ensure that there are sponsors and champions for your project? How do you set realistic expectations with business partners for project success? If you are leading a team, how can you delegate your projects effectively?
In this talk, the presenters will share three techniques for project prioritization, two roles (sponsor and champions) to identify, four levels of confidence (for predictive models) to specify project success, and discuss best practices for delegating work as a team lead/manager.

Build your analytics architecture for performance and growth on a budget.

Paige Roberts - (Vertica)

When building or changing an enterprise analytics architecture, there are a lot of things to consider: cloud or on-prem, hybrid or multi-cloud, this cloud or that cloud, containerized or not, build tech or buy tech, use the skills in house or train new skills, etc. While balancing those decisions, the main three considerations are performance, cost, and planning for the future, including future growth in analytics demand. In this session, get some solid data on how to build a data analytics architecture for performance and rapid growth, without breaking the budget, by focusing on what is important and looking at some examples of architectures at companies that are tackling some of the toughest analytics use cases. Learn from others’ mistakes and successes. Learn how real companies like Index Exchange and The Trade Desk analyze data up to petabyte ranges, track millions of real-time actions, generate tens of thousands of reports a day, keep thousands of machine learning models in production and performing, and still keep budgets under control.
Level of Difficulty: Non-Technical
Takeaways:
• Guiding principles for analytics architectures
• Gotchas to avoid that might save you money, or save your company and your job
• New strategies for dealing with old problems

Ontology in Healthcare: a survey

Sivaram Arabandi (Ontopro)

It is a little-known fact that ontologies have been playing a critical role in the healthcare domain for over two decades. The Gene Ontology (GO) was started in 1998 as a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. SNOMED CT, which provides broad coverage of the clinical domain, was released in 2002 in a highly computable form: a directed acyclic graph anchored in formal description logic. Similarly, LOINC (lab data), RxNorm (medications) and many others have been an integral part of healthcare data. This talk will provide a broad overview of the healthcare terminology space, including formal ontologies, controlled vocabularies, value sets, etc., and how they relate to other healthcare standards such as FHIR and CPG.

Topics covered include:
1. Why do we need ontologies in healthcare
2. Overview of Healthcare terminology space
3. Terminology, Value sets and Cross maps
4. Ontology vs Information model
5. Terminology servers
6. Ontology and AI

Introducing a Strongly-typed Database: TypeDB & TypeQL

Haikal Pribadi - (Vaticle)

In this session, Haikal Pribadi will present the origins of TypeDB, the impetus for inventing a new query language, TypeQL, and how Vaticle has arrived at an altogether new evolution of the database. As a strongly-typed database, TypeDB allows you to model your domain based on logical and object-oriented principles, allowing you to think higher-level, as opposed to join-tables, columns, documents, vertices, and edges. Types describe the logical structures of your data, allowing TypeDB to validate that your code inserts and queries data correctly. Query validation goes beyond static type-checking, and includes logical validation of meaningless queries. TypeDB also encodes your data for logical interpretation by its reasoning engine. It enables type-inference and rule-inference, which create logical abstractions of data, allowing for the discovery of facts and patterns that would otherwise be too hard to find. With these abstractions, queries in the tens to hundreds of lines in SQL or NoSQL databases can be written in just a few lines in TypeQL.

Beyond process safety: expanding assurance capabilities and guaranteeing system safety with mathematics

Brandon Baylor / James Hansen

Process safety deals with the prevention and control of incidents that have the potential to release hazardous materials or energy. This talk describes new capabilities to assure process safety by formalizing information processes and systems, using the mathematics of structure. Distinct entities, each having their own database, ontology, and other models of the world, must collaborate to handle emerging changes and new contexts. To do so, the entities must continue amending their shared language as well as updating their own models. Current logical frameworks are not flexible enough to manage constant schema changes that arise in our engineering systems.
In this talk we share two case studies that demonstrate new means of assurance (system verification and interoperability) in the context of process safety.
The first capability describes a method for merging multiple spreadsheets into one sheet, and/or exchanging data among the sheets, by expressing each sheet's formulae as an algebraic theory and each sheet's values as a model of its theory, expressing the overlap between the sheets as theory and model morphisms, and then performing constructions from category theory to compute a canonically universal integrated theory and model. The topic also covers the automated theorem proving burden associated with both verifying the semantics preservation of the overlap mappings as well as verifying the consistency of the resulting integrated sheet.
The second capability brings computation to ETL development by migrating queries from one SQL schema to another. Schema evolution in this way ensures provably correct implementations during information system upgrades. This query migration uses symbolic AI to analyze all the queries to determine the exact conditions, if any, under which the translated queries could fail at runtime due to data integrity issues. Using the query migrator capability provides assurance that data is consistent and trustworthy when used to make safety critical decisions.

Workshops at Data Day Texas 2023

Introduction to Taxonomies for Data Scientists

Heather Hedden - (Semantic Web Company)

This 90-minute tutorial - with an optional 40-minute hands-on session - teaches the fundamentals and best practices for using and creating quality taxonomies, whether for the enterprise or for specific knowledge bases in any industry. Emphasis is on serving users rather than on theory. Topics to be covered include: the appropriateness of different kinds of knowledge organization systems (taxonomies, thesauri, ontologies, etc.), standards, taxonomy concept creation and labeling, and taxonomy relationship creation. The connection of taxonomies to ontologies and knowledge graphs will also be discussed. This session will cover:
• Introduction to taxonomies and their relevance to data
• Comparisons of taxonomies and knowledge organization system types
• Standards for taxonomies and knowledge organization systems
• Taxonomy concept sources and creation
• Wording of concept labels
• Taxonomy concept relationships
• Semantically enriching a taxonomy to extend it to become an ontology

Following the 90-minute tutorial will be an optional additional 40-minute session for a deeper dive into taxonomies, which also includes hands-on exercises. The deeper dive topics include:
• Creating alternative labels
• Creating hierarchical relationships
• Taxonomy linking and mapping
• Taxonomy governance and quality
• Taxonomy management software demo

Introduction to Graph Data Science for Python Developers

Sean Robinson - (Graphable)

This workshop will cover a variety of graph data science techniques using Python, Neo4j, and other libraries. The goal of the workshop is to serve as a springboard for attendees to identify which graph-based tools/techniques can provide novel value to existing workflows. Some of the techniques to be covered are:
• How to think about data as a graph and the implications that has on downstream analysis
• How to use graph algorithms at scale using both Neo4j and other pythonic libraries
• How to enhance traditional ML models with graph embeddings
• How to visualize these insights in the context of a graph for greater business intelligence
• How to integrate these techniques with your existing data science tool belt

Hands-On Introduction To GraphQL For Data Scientists & Developers

William Lyon - (Neo4j)

This hands-on workshop will introduce GraphQL and explore how to build GraphQL APIs backed by Neo4j, a native graph database, and show why GraphQL is relevant for both developers and data scientists. This workshop will show how to use the Neo4j GraphQL Library, which allows developers to quickly design and implement fully functional GraphQL APIs without writing boilerplate code, to build a Node.js GraphQL API, including adding custom logic, authorization rules, and operationalizing data science techniques.

- Overview of GraphQL and building GraphQL APIs
- Building Node.js GraphQL APIs backed by a native graph database using the Neo4j GraphQL Library
- Adding custom logic to our GraphQL API using the @cypher schema directive and custom resolvers
- Adding authentication and authorization rules to our GraphQL API

We will be using online hosted environments, so no local development setup is required. Specifically, we will use the Neo4j Aura database-as-a-service and CodeSandbox for running our GraphQL API application. Prior to the workshop, please register for Neo4j Aura and create a "Free Tier" database. You will also need a GitHub account to sign in to CodeSandbox, or create a CodeSandbox account.

Ontology for Data Scientists - 90 minute tutorial

Michael Uschold - (Semantic Arts)

We start with an interactive discussion to identify the main things that data scientists do, why they do them, and what some of the key challenges are. We give a brief overview of ontology and semantic technology with the goal of identifying how and where it may be useful for data scientists.
The main part of the tutorial gives a deeper understanding of what ontologies are and how they are used. This technology grew out of core AI research in the 70s and 80s, and was formalized and standardized in the 00s by the W3C under the rubric of the Semantic Web. We introduce the following foundational concepts for building an ontology in OWL, the W3C standard language for representing ontologies.
- Individual things are OWL individuals - e.g., JaneDoe
- Kinds of things are OWL classes - e.g., Organization
- Kinds of relationships are OWL properties - e.g., worksFor
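The three constructs above can be made concrete with a toy triple set in plain Python (the "Acme" organization and the lookup helper are invented for illustration; real work would use OWL/RDF tooling):

```python
# Toy triple store using the tutorial's own examples, as (subject, predicate,
# object) tuples -- the same shape as RDF triples.
triples = {
    ("JaneDoe", "rdf:type", "owl:NamedIndividual"),  # an individual thing
    ("Organization", "rdf:type", "owl:Class"),       # a kind of thing
    ("worksFor", "rdf:type", "owl:ObjectProperty"),  # a kind of relationship
    ("JaneDoe", "worksFor", "Acme"),                 # a fact about JaneDoe
}

def objects(subject: str, predicate: str) -> set:
    """A minimal SPARQL-like lookup: all objects related to a subject."""
    return {o for s, p, o in triples if s == subject and p == predicate}
```

Asking `objects("JaneDoe", "worksFor")` is the toy equivalent of a one-line SPARQL query, which is the style of querying the tutorial builds toward.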
Through interactive sessions, participants will identify what the key things are in everyday subjects and how they are related to each other. We will start to build an ontology in healthcare, using this as a driver to introduce key OWL constructs that we use to describe the meaning of data. Key topics and learning points will be:
- An ontology is a model of subject matter that you care about, represented as triples.
- Populating the ontology as triples using TARQL, R2RML and SHACL
- The ontology is used as a schema that gives data meaning.
- Building a semantic application using SPARQL.
We close the loop by again considering how ontology and semantic technology can help data scientists, and what next steps they may wish to take to learn more.