The Data Day Texas 2023 Sessions

We still have discount rooms at the AT&T. If you are coming from out of town, this is where all the action is. For the best selection, book a room now.

We're just beginning to announce the sessions for Data Day Texas 2023. We'll be adding more every few days. We are still accepting proposals.

Building ML Ops Organizations for Scale

Joey Jablonski - (Pythian)

Many organizations have now successfully created their first analytical model and leveraged it to drive new business value and more impactful decisions. Now the hard part begins: building an operational framework for deploying future models and managing them at scale under governed principles of performance, bias, responsiveness, and accuracy.
Today’s Machine Learning Operations (ML Ops) capabilities are more than just technology. They are capabilities that must align with the model development process, a data engineering platform, and an operational model. These processes must ensure that models are unbiased when deployed, that their accuracy does not drift over time, and that upstream data changes do not affect the accuracy of model outputs.
This process starts with defining metrics of success. These metrics are the anchor point for how we measure and intervene in the lifecycle of our models. Effective metrics align with organizational goals such as product adoption and revenue growth targets.
These metrics become key components in the design of our ML Ops technology stack, assisting us in prioritizing what aspects of model performance we monitor, how we intervene and what urgency data science and ML engineering teams feel when receiving alerts about adverse performance. The technology landscape for ML Ops is changing rapidly and making purposeful decisions early about architecture and modularity will ensure seamless addition of new future capabilities.
Behind the ML Ops technology stack is our operational model. These are the constructs for how teams operate across data engineering, data science, and ML engineering to build, test, and deploy analytical assets. Operational models capture the process for work handoff, verification, and review to ensure adherence to organizational and industry objectives. Operational models are backed by our governance standards, which define testing criteria for new models, de-identification of data used for training, and management of multiple models that converge to drive application experiences.
Bringing together defined metrics, flexible technology stacks and a well-defined operational model will enable your organization to deploy reliable models at scale. The collection of these capabilities will allow the organization to move to higher levels of maturity in their operations, higher levels of reliability and enable scalability as the use of analytical models for decisioning increases.
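The metric-driven monitoring and intervention the abstract describes can be sketched in miniature. This is a hedged toy illustration, not Pythian's actual framework: it tracks a single success metric (accuracy) over a sliding window and signals when intervention is needed; the class name, window size, and threshold are all illustrative assumptions.

```python
from collections import deque

class ModelMonitor:
    """Toy sketch of metric-driven model monitoring: track accuracy over a
    sliding window of recent predictions and alert when it drops below a
    threshold agreed on when the success metrics were defined."""

    def __init__(self, window_size=100, accuracy_threshold=0.9):
        self.window = deque(maxlen=window_size)  # 1 = correct, 0 = incorrect
        self.threshold = accuracy_threshold

    def record(self, prediction, actual):
        self.window.append(prediction == actual)

    def accuracy(self):
        return sum(self.window) / len(self.window) if self.window else 1.0

    def needs_intervention(self):
        # Alert only once the window is full, so a few early errors
        # do not page the data science team.
        return (len(self.window) == self.window.maxlen
                and self.accuracy() < self.threshold)
```

In a real stack, the alert would feed whatever urgency and escalation process the operational model defines.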
This session is part of the Data Day Texas ML Track.

GIS Keynote
"one ant , one bird, one tree"...

Bonny McClain

"When you have seen one ant, one bird, one tree, you have not seen them all.”--E.O. Wilson
The power of GIS and visualizing history is the transitory nature of the present. The world writ large doesn’t exist in political terms, marketing campaigns, or lifespans--think of a continuous link evolving into our current zeitgeist--and well beyond.
Geospatial analysts and scientists evaluate demographic shifts, social and cultural shifts, economic shifts, and environmental dynamics--what we need now is a powerful intersection of our insights. Understanding the role of location intelligence and spatial awareness just might be the missing link.
Using open source tools and data we will examine how powerful data questions elevate our discussion and re-focus potential solutions to address community level discord and marginalization.

Zero-copy integration

Dave McComb - (Semantic Arts)

The reason we need to talk about zero-copy integration is that its opposite is so well entrenched that most practitioners can’t imagine a world without some form of extract, transform and load, or system integration that copies and manipulates data through APIs. The traditional enterprise data landscape is an almost endless set of pipelines, data sets, and chutes and ladders that ferry data from its source to myriad destinations. This seems necessary, because each application we use and each tool we employ has its own bespoke way of structuring data. Each application ends up morphing the prior application’s idiosyncrasies into its own idiosyncrasies. In this talk we unpack the prerequisites needed to achieve data-centricity and zero-copy integration. We will present two case studies of firms that are enjoying zero-copy integration. We will also present a simple demonstration to make the idea more concrete.

Clinical trials exploration: surfacing a clinical application from a larger bio-pharma knowledge graph

David Hughes - (Graphable)

Clinical, proteomic, and pharma knowledge graphs are complex aggregations of constituent subgraphs. These linked graphs provide meaningful insights as a whole, but in many cases a single subgraph can independently prove to be a valuable asset. In this session, David will identify possible applications of the NLM’s Clinical Trials resource as a standalone application. He will review how to query the API and how to populate/run ETL through tools like Hume Orchestra and Apache Hop. He will then explore how to create an application using Streamlit as a POC, and discuss potential refinements.

Your data infrastructure will be in Kubernetes

Patrick McFadin / Jeff Carpenter - (DataStax)

Are people actually moving stateful workloads to K8s? Yes, yes they are. In the process of writing the book Managing Cloud Native Data on Kubernetes, we spoke with many of the experts who are moving various types of stateful workloads to K8s. In this talk we’ll share what we learned:
• What’s solid: storage and workload management
• What’s good and getting better: operators, streaming, and database workloads
• What needs work: analytics and machine learning
We’ll also share what this means for your data infrastructure:
• Infrastructure should conform to your application and not the other way around.
• Stop creating new data infrastructure projects and start assembling new architectures
• Look to open source projects for inspiration

Protecting Against Ransomware Attacks using Kafka, Flink and Boostgraph

Brian Hall - (Qomplx)

Ransomware attacks are now commonplace in the news and only becoming more so - and the reported incidents do not include those handled quietly. Come see how Kafka, Flink, and graph technologies like Boostgraph can be used to identify anomalous behaviors on corporate networks.
Prerequisites: General knowledge around messaging technologies, parallelism and autoscaling.
Takeaways: How to keep track of and manage network traffic in a highly scalable manner and interpret relative risk of potential breaches using readily available technologies.
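One signal the session's takeaways point toward can be sketched in a few lines. This is a hedged toy illustration, not the speaker's pipeline: in a real deployment the events would stream through Kafka and be processed by Flink, with the connection graph held in a graph library. Here we simply flag source hosts that connect to an unusually large number of distinct destinations, a common lateral-movement pattern; the threshold and function name are illustrative assumptions.

```python
from collections import defaultdict

def flag_anomalous_hosts(connection_events, fanout_threshold=3):
    """Build a tiny graph of (source, destination) network connections
    and flag sources whose fan-out exceeds a threshold -- a toy stand-in
    for the scalable anomaly detection the talk describes."""
    neighbors = defaultdict(set)
    for src, dst in connection_events:
        neighbors[src].add(dst)
    # Hosts reaching out to many distinct destinations are suspicious.
    return sorted(h for h, dsts in neighbors.items()
                  if len(dsts) > fanout_threshold)
```

The graph framing matters: fan-out, connected components, and path queries are natural to express once traffic is modeled as edges rather than rows.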

Outrageous ideas for Graph Databases

Max De Marzi - (Amazon Web Services)

Almost every graph database vendor raised money in 2021. I am glad they did, because they are going to need the money. Our current graph databases are terrible and need a lot of work. There, I said it. It's the ugly truth in our little niche industry. That's why, despite waiting over a decade for the "Year of the Graph" to come, we still haven't set the world on fire. Graph databases can be painfully slow, they can't handle non-graph workloads, their APIs are clunky, and their query languages are either hard to learn or hard to scale. Most graph projects require expert shepherding to succeed: 80% of the work takes 20% of the time, but that last 20% takes forever. The graph database vendors optimize for new users, not grizzled veterans. They optimize for sales, not solutions. Come listen to a rant by an industry OG on where we could go from here if we took the time to listen to the users who haven't given up on us yet.

Apache Iceberg: An Architectural Look Under the Covers

Alex Merced - (Dremio)

Data lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve this is hiding the complexity of underlying data structures and physical data storage from users. The de facto standard has been the Hive table format, released by Facebook in 2009, which addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
The Apache Iceberg table format is now in use at, and contributed to by, many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS. In this talk, Alex Merced will walk through the architectural details of Iceberg and show how the Iceberg table format addresses the shortcomings of the Hive format, as well as the additional benefits that stem from Iceberg’s approach.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design
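The metadata tree the talk walks through can be sketched with plain dictionaries. This is a hedged toy model, not Iceberg's actual file formats: a table tracks a current snapshot, each snapshot points at a manifest list, and each manifest lists data files; the field names here are simplified assumptions. The point it illustrates is that a query starts from one metadata pointer and walks down to the data files, so readers never have to list directories in object storage.

```python
# Toy model of the Iceberg-style metadata tree (names simplified).
def data_files_for_current_snapshot(table):
    """Resolve the current snapshot and walk its manifests down to the
    concrete data files a query would scan."""
    snapshot = table["snapshots"][table["current_snapshot_id"]]
    files = []
    for manifest in snapshot["manifest_list"]:
        files.extend(manifest["data_files"])
    return files

table = {
    "current_snapshot_id": 2,
    "snapshots": {
        # Each commit creates a new snapshot; old ones remain readable
        # (the basis for time travel).
        1: {"manifest_list": [{"data_files": ["f1.parquet"]}]},
        2: {"manifest_list": [{"data_files": ["f1.parquet"]},
                              {"data_files": ["f2.parquet"]}]},
    },
}
```

Because CRUD operations only add new snapshots and swap the current pointer, readers always see a consistent view of the table.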

Apache Iceberg and the Right to Be Forgotten

Alex Merced - (Dremio)

Regulatory requirements can make data engineering more complex than it otherwise would be. In this talk, we will discuss how to navigate hard deletions in an Apache Iceberg-based data lakehouse. You will learn:
- How to ensure data is hard deleted using copy-on-write Iceberg tables
- How to ensure data is hard deleted using merge-on-read Iceberg tables
- Other strategic and technical options to consider when architecting regulatory compliance
Some familiarity with Apache Iceberg and data lakehouses will be helpful.
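The contrast between the two delete strategies can be illustrated with a toy model. This is a hedged sketch, not Iceberg's actual implementation: copy-on-write rewrites the affected data files without the deleted rows, so the data is physically gone immediately; merge-on-read writes a separate delete file that readers apply at query time, so the rows only leave storage after a later compaction or rewrite - a distinction that matters for right-to-be-forgotten deadlines.

```python
# Toy illustration of copy-on-write vs. merge-on-read deletes
# (row values stand in for records; lists stand in for data files).

def copy_on_write_delete(data_files, should_delete):
    """Rewrite every affected file without the matching rows.
    The deleted data no longer exists anywhere in storage."""
    return [[row for row in f if not should_delete(row)] for f in data_files]

def merge_on_read_scan(data_files, delete_files):
    """Leave data files untouched; readers merge in delete files at
    query time. The rows are hidden but still physically present
    until a compaction rewrites the data files."""
    deleted = set().union(*delete_files) if delete_files else set()
    return [row for f in data_files for row in f if row not in deleted]
```

For hard-delete compliance, a merge-on-read table therefore needs a follow-up rewrite step before the retention deadline, while copy-on-write pays that cost up front.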

Workshops at Data Day Texas 2023

The mini-workshops are 90 minutes - the same length as two regular Data Day sessions. These workshops run throughout the day, and are held in tiered classrooms with space to open and plug in your laptop. The goal of each workshop is to set you up with a new tool/skill and enough knowledge to continue on your own.
We're just beginning to announce the workshops for Data Day Texas 2023. We'll be adding more each week.

Hands-on Introduction to Web Scraping with Python 2023

Ryan Mitchell - (GLG)

Ryan Mitchell, author of Web Scraping with Python (3rd edition in progress), has brought her 4-hour web scraping workshop to Data Day Texas multiple times. It has always been held the day before Data Day and required separate registration. For 2023, Ryan will be returning to Data Day, this time offering a 90-minute version of her workshop - included as part of Data Day, with no additional registration fee. Best practices for web scraping have changed considerably in the last few years; Ryan will be covering the latest tools, tips, and tricks. Ryan is pretty much the #1 go-to trainer for web scraping - and someone we get frequent requests to bring back.
Workshop requirements and goals will be published in the next week.
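As a taste of the topic, here is a minimal sketch using only the Python standard library (the workshop will cover current third-party tooling and best practices, which this does not represent): extracting link targets from an HTML page with `html.parser`. The sample HTML and class name are illustrative.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A stand-in for a fetched page; a real scraper would download this.
html = '<p><a href="/talks">Talks</a> and <a href="/workshops">Workshops</a></p>'
parser = LinkExtractor()
parser.feed(html)
```

Real-world scraping adds the parts this omits: polite fetching, error handling, pagination, and respecting robots.txt - exactly the practices the workshop focuses on.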

Introduction to Graph Data Science for Python Developers

Sean Robinson - (Graphable)

This workshop will cover a variety of graph data science techniques using Python, Neo4j, and other libraries. The goal of the workshop is to serve as a springboard for attendees to identify which graph-based tools/techniques can provide novel value to existing workflows. Some of the techniques to be covered are:
• How to think about data as a graph and the implications that has on downstream analysis
• How to use graph algorithms at scale using both Neo4j and other pythonic libraries
• How to enhance traditional ML models with graph embeddings
• How to visualize these insights in the context of a graph for greater business intelligence
• How to integrate these techniques with your existing data science tool belt
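The first bullet - thinking about data as a graph - can be shown in miniature without any graph database. This is a hedged toy example, not workshop material: it builds an adjacency structure from (source, target) pairs and computes degree centrality, one of the simplest algorithms that tools like Neo4j's graph data science library generalize to much larger data.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Treat (a, b) pairs as undirected edges and return each node's
    degree normalized by the maximum possible degree (n - 1)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    n = len(adjacency)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}
```

Once data is framed this way, swapping in richer algorithms (PageRank, community detection, embeddings) is a change of function call, not of data model - which is the springboard the workshop aims for.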

Hands-On Introduction To GraphQL For Data Scientists & Developers

William Lyon - (Neo4j)

This hands-on workshop will introduce GraphQL, explore how to build GraphQL APIs backed by Neo4j, a native graph database, and show why GraphQL is relevant for both developers and data scientists. It will show how to use the Neo4j GraphQL Library, which allows developers to quickly design and implement fully functional GraphQL APIs without writing boilerplate code, to build a Node.js GraphQL API - including adding custom logic, authorization rules, and operationalized data science techniques.

- Overview of GraphQL and building GraphQL APIs
- Building Node.js GraphQL APIs backed by a native graph database using the Neo4j GraphQL Library
- Adding custom logic to our GraphQL API using the @cypher schema directive and custom resolvers
- Adding authentication and authorization rules to our GraphQL API

We will be using online hosted environments, so no local development setup is required. Specifically, we will use the Neo4j Aura database-as-a-service and CodeSandbox for running our GraphQL API application. Prior to the workshop, please register for Neo4j Aura and create a "Free Tier" database. You will also need a GitHub account to sign in to CodeSandbox, or create a CodeSandbox account at
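The core idea a GraphQL server automates can be sketched in plain Python. This is a hedged toy model, not the Neo4j GraphQL Library's API: every requested field is answered by a resolver function, and libraries like Neo4j's generate those resolvers from the schema (translating them into database queries), while custom logic - analogous to the @cypher directive or hand-written resolvers in the workshop - slots in alongside. All names below are illustrative.

```python
def resolve(query_fields, resolvers, context):
    """Dispatch each requested field to its resolver function --
    the mechanism at the heart of any GraphQL server."""
    return {field: resolvers[field](context) for field in query_fields}

resolvers = {
    # A generated-style resolver: read a value straight from the data.
    "title": lambda ctx: ctx["movie"]["title"],
    # A custom resolver adding logic, in the spirit of @cypher or a
    # hand-written resolver.
    "shoutedTitle": lambda ctx: ctx["movie"]["title"].upper() + "!",
}
context = {"movie": {"title": "The Matrix"}}
```

The client chooses which fields to request, and only those resolvers run - which is why GraphQL avoids the over-fetching common in REST APIs.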

Ontology for Data Scientists - 90 minute tutorial

Michael Uschold - (Semantic Arts)

We start with an interactive discussion to identify the main things data scientists do, why they do them, and what some of the key challenges are. We then give a brief overview of ontology and semantic technology, with the goal of identifying how and where it may be useful for data scientists.
The main part of the tutorial gives a deeper understanding of what an ontology is and how it is used. This technology grew out of core AI research in the 70s and 80s, and was formalized and standardized in the 00s by the W3C under the rubric of the Semantic Web. We introduce the following foundational concepts for building an ontology in OWL, the W3C standard language for representing ontologies.
- Individual things are OWL individuals - e.g., JaneDoe
- Kinds of things are OWL classes - e.g., Organization
- Kinds of relationships are OWL properties - e.g., worksFor
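The three constructs listed above are, at bottom, triples. As a hedged sketch (real work would use an RDF library and full IRIs rather than these shortened names), they can be written as (subject, predicate, object) tuples in plain Python, with a miniature lookup in the spirit of a SPARQL basic graph pattern:

```python
# OWL constructs and data, both expressed as triples (names abbreviated;
# a real ontology would use full IRIs and an RDF toolkit).
triples = {
    ("JaneDoe", "rdf:type", "owl:NamedIndividual"),   # an individual thing
    ("Organization", "rdf:type", "owl:Class"),        # a kind of thing
    ("worksFor", "rdf:type", "owl:ObjectProperty"),   # a kind of relationship
    # The ontology giving meaning to a data fact:
    ("JaneDoe", "worksFor", "Acme"),
}

def objects(triples, subject, predicate):
    """A toy query: all objects matching a (subject, predicate, ?) pattern,
    the basic move behind a SPARQL triple pattern."""
    return {o for s, p, o in triples if s == subject and p == predicate}
```

The key point for the tutorial: schema (classes, properties) and data (individuals, facts) live in the same triple store, which is what lets the ontology act as a schema that gives data meaning.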
Through interactive sessions, participants will identify what the key things are in everyday subjects and how they are related to each other. We will start to build an ontology in healthcare, using this as a driver to introduce key OWL constructs that we use to describe the meaning of data. Key topics and learning points will be:
- An ontology is a model of subject matter that you care about, represented as triples.
- Populating the ontology as triples using TARQL, R2RML and SHACL
- The ontology is used as a schema that gives data meaning.
- Building a semantic application using SPARQL.
We close the loop by again considering how ontology and semantic technology can help data scientists, and what next steps they may wish to take to learn more.

Introduction to Taxonomies for Data Scientists - 90 minute tutorial

Heather Hedden - (Semantic Web Company)

This tutorial/workshop teaches the fundamentals and best practices for creating quality taxonomies, whether for the enterprise or for specific knowledge bases in any industry. Emphasis is on serving users rather than on theory. Topics to be covered include: the appropriateness of different kinds of knowledge organization systems (taxonomies, thesauri, ontologies, etc.), standards, taxonomy concept creation and labeling, taxonomy relationship creation. The connection of taxonomies to ontologies and knowledge graphs will also be briefly discussed. There will be some interactive activities and hands-on exercises. This session will cover:
• Introduction to taxonomies and their relevance to data
• Comparisons of taxonomies and knowledge organization system types
• Standards for taxonomies and knowledge organization systems
• Taxonomy concept creation
• Preferred and alternative label creation
• Taxonomy relationship creation
• Taxonomy relationships to ontologies and knowledge graphs
• Best practices and taxonomy management software use
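The core structures the tutorial covers - preferred labels, alternative labels, and broader/narrower relationships - can be sketched in a few lines. This is a hedged toy model loosely following SKOS naming, not taxonomy-management software; the sample concepts are illustrative.

```python
# Toy taxonomy: each concept has one preferred label, optional
# alternative labels, and broader (parent) relationships.
taxonomy = {
    "beverages": {"prefLabel": "Beverages", "altLabels": ["Drinks"], "broader": []},
    "coffee":    {"prefLabel": "Coffee",    "altLabels": [],         "broader": ["beverages"]},
    "espresso":  {"prefLabel": "Espresso",  "altLabels": [],         "broader": ["coffee"]},
}

def ancestors(taxonomy, concept):
    """Walk broader relationships from a concept up to the top concept --
    the hierarchy that powers query expansion and faceted browsing."""
    result = []
    queue = list(taxonomy[concept]["broader"])
    while queue:
        parent = queue.pop(0)
        result.append(parent)
        queue.extend(taxonomy[parent]["broader"])
    return result
```

Alternative labels are what let a search for "Drinks" land on the Beverages concept - the serving-users emphasis the tutorial highlights.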