The Data Day Texas 2024 Sessions
We are continuing to publish the sessions for Data Day Texas 2024. Follow us on LinkedIn for the latest news, and book a discount room at the conference hotel while there are still a few left.
MLOps Keynote
MLOps: Where do we go from here?
Mikiko Bazeley - Labelbox
Is MLOps at a dead end? How can we bridge the chasm between MLOps and DataOps/Data Engineering? Do we need another X-Ops? The arrival of GenAI and LLMs to the masses has surfaced a number of important and increasingly difficult questions about how to make sense of it all (with "all" meaning the X-Ops chaos and mishmash of tooling, vendors, and patterns, and the war between OSS and proprietary).
As practitioners, technical leaders, and teams, what do we need to know in order to build ML products and teams responsibly & effectively, even amongst the noise?
Among the topics covered in this session, Mikiko will touch on:
- What are the main problems MLOps tries to solve?
- What does the archetypal MLOps platform look like? What are the most common components of an MLOps platform?
- How can existing ML platforms be extended to account for new use cases involving LLMs?
- Does the team composition change? Do we now need to start hiring “prompt engineers”? Should we stop existing initiatives? Do we need to pivot?
#mlops #dataengineering
Patterns of Power: Uncovering control points to influence outcomes
Amy Hodler - graphgeeks.org
We all know everything is connected, but what can you do with this information? Some organizations use the relationships between data to improve predictions, but there's a huge untapped opportunity to apply these patterns to shape more positive outcomes. Simply put, we are wasting insights hiding in existing data that can help grow revenue, increase resiliency and safety, and improve lives.
In this session, Amy Hodler will share the common forms of power within various networks ranging from IT and software systems to social, biological, and financial ecosystems. She'll show how companies like LinkedIn and Google have revealed the structural patterns of influence using network analysis and graph analytics. Then she'll delve into the concept of centrality and how it's used to combat terrorism, cyberattacks, and infectious diseases. And we'll look at how this science of measuring importance is evolving to be more readily applied to different needs.
From identifying key actors to strategies for using control points, you'll learn about different approaches and tools for moving beyond predictive analytics to enacting change. Finally, Amy will examine ethical considerations and explore the responsible use of technology to influence power dynamics. You'll walk away with practical knowledge about uncovering power and influence patterns, along with methods to actively shape positive outcomes.
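To make the idea of centrality concrete ahead of the session (a toy illustration, not material from Amy's talk), here is a minimal sketch that ranks nodes in a small network with the networkx library; the graph and names are invented:

```python
# Minimal sketch (not from the session): ranking "control points" in a small
# network with two common centrality measures, using the networkx library.
import networkx as nx

# A toy communication network; edges represent who talks to whom.
G = nx.Graph([
    ("alice", "bob"), ("bob", "carol"), ("carol", "dave"),
    ("bob", "eve"), ("eve", "frank"), ("carol", "eve"),
])

# Betweenness: how often a node sits on shortest paths between others,
# a common proxy for brokerage and control over information flow.
betweenness = nx.betweenness_centrality(G)

# PageRank: influence based on the importance of a node's neighbors.
pagerank = nx.pagerank(G)

for node in sorted(G, key=betweenness.get, reverse=True):
    print(f"{node:6s}  betweenness={betweenness[node]:.3f}  pagerank={pagerank[node]:.3f}")
```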
Ten Simple Rules for Writing a Technical Book
If you have deep expertise in a technical topic, you may have thought about developing educational materials, like documentation, a learning course, or a book. But undertaking the voyage into publishing is daunting. Join Jess Haberman, a seasoned navigator of the ever-changing seas of technical book publishing, for an enlightening journey through the intricacies of the publishing world. As Director of Learning Solutions at Anaconda and formerly acquisitions editor at O'Reilly Media, Jess has spent much of her career bringing world-shaping ideas of technology innovators to knowledge-seeking professionals and students. Together, let's unravel some behind-the-curtain insights and refined best practices tailored specifically for the publishing-curious among you.
Through engaging with industry titans, Jess has gathered a treasure trove of experiences to share. We'll explore the hidden gems of the publishing realm, tips to follow, and pitfalls to avoid so you can elevate your writing endeavors. Chances are you already know many of the authors Jess signed while at O'Reilly; get to know more about the work they did to create their masterpieces. Discover the knowledge that will transform your writing journey into an epic odyssey you can be proud of undertaking.
Causality: The Next Frontier of GenAI Explainability (90 minute hands-on)
In a world obsessed with making predictions and generative AI, we often overlook the crucial task of making sense of these predictions and understanding results. If we have no understanding of how and why recommendations are made, and we can't explain predictions, then we can't trust our resulting decisions and policies.
In the realm of predictions and explainability, graphs have emerged as a powerful model that has recently yielded remarkable breakthroughs. This talk will examine the implications of incorporating graphs into the realm of causal inference and the potential for even greater advancements. Learn about foundational concepts such as modeling causality using directed acyclic graphs (DAGs), Judea Pearl's "do" operator, and keeping domain expertise in the loop. You'll hear how the explainability landscape is evolving, comparisons of graph-based models to other methods, and how we can evaluate the different fairness models available.
In this work session, you’ll get an overview of using the open source PyWhy project with a hands-on example that includes the DoWhy and EconML libraries. We’ll walk through identifying assumptions and constraints up front as a graph and applying that through each phase of modeling mechanisms, identifying targets, estimating causal effects, and robustness testing to check prediction validity. We’ll also cover our lessons learned and provide tips for getting started.
Join us as we unravel the transformative potential of graphs and their impact on predictive modeling, explainability, and causality in the era of generative AI.
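For readers who want a preview of the hands-on portion, here is a minimal sketch of the DoWhy workflow described above; the toy data, variable names, and graph are illustrative only, not material from the session:

```python
# Illustrative sketch only: encode causal assumptions as a DAG, identify and
# estimate an effect, then run a refutation test with the DoWhy library.
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Toy data: Z confounds both the treatment X and the outcome Y.
rng = np.random.default_rng(0)
z = rng.normal(size=5_000)
x = (z + rng.normal(size=5_000) > 0).astype(int)
y = 2.0 * x + 1.5 * z + rng.normal(size=5_000)
df = pd.DataFrame({"Z": z, "X": x, "Y": y})

# Causal assumptions expressed up front as a graph (DOT syntax).
model = CausalModel(
    data=df, treatment="X", outcome="Y",
    graph="digraph { Z -> X; Z -> Y; X -> Y; }",
)

estimand = model.identify_effect()                      # apply identification rules
estimate = model.estimate_effect(
    estimand, method_name="backdoor.linear_regression"  # adjust for Z
)
refutation = model.refute_estimate(
    estimand, estimate, method_name="random_common_cause"  # robustness check
)
print(estimate.value, refutation)
```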
Knowledge Graph Keynote
Beyond Human Oversight: The Rise of Self-Building Knowledge Graphs in AI
The rapid success in extracting 'de-hallucinated' graphs from Large Language Models (LLMs) marks a big step forward in AI. Knowledge Graphs, now the industry standard for knowledge-intensive applications in enterprises, are at the forefront of this progress. The future of these Knowledge Graphs lies in their evolution into self-replicating systems, significantly reducing the need for programming and human oversight. This shift towards automated and self-sufficient Knowledge Graphs will ensure a reliable and constantly updated "Source of Truth" in various applications.
In this presentation, Jans Aasman will discuss the four essential capabilities a Knowledge Graph must possess to achieve autonomous knowledge generation and curation:
- A set of primitives in the query processor allowing for the direct extraction of knowledge and graphs from an LLM without requiring programming skills.
- An embedded vector store in the Knowledge Graph. This feature enables natural language queries to interact with your private structured and unstructured data, leading to efficient Retrieval Augmented Generation (see the sketch after this list).
- A methodology adapted from symbolic AI that allows users to use natural language to generate queries in structured query languages like SQL, SPARQL, or Prolog, even when the underlying schemas are highly complex.
- Rule-based logic for a true NeuroSymbolic computing platform. The rule-based system can directly invoke LLM functions, rather than being purely symbolic. The goal is for LLM functions to have the capability to write and execute their own rules, significantly enhancing the system's intelligence and functionality.
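To make the second capability concrete, here is a minimal, vendor-neutral sketch of the retrieval-augmented pattern it describes; it does not use any particular knowledge graph product's API, and `llm_complete` is a hypothetical placeholder for whatever model endpoint you call:

```python
# Generic retrieval-augmented generation sketch (not any specific vendor's API).
# `llm_complete` is a hypothetical placeholder for an LLM call of your choice.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Acme Corp acquired Widgets Inc in 2021.",
    "Widgets Inc manufactures industrial sensors.",
    "Acme Corp is headquartered in Austin, Texas.",
]

# Embed the private documents once and keep the vectors alongside the graph.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents closest to the question in embedding space."""
    q = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q                  # cosine similarity (vectors are normalized)
    return [documents[i] for i in np.argsort(-scores)[:k]]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_complete(prompt)               # hypothetical LLM call

print(retrieve("Who owns Widgets Inc?"))
```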
Jans will provide a demo of enterprise case studies that illustrate the essential role these capabilities play in the development of self-sustaining knowledge systems.
#knowledgegraphs #ai
Using LLMs to Fight Health Insurance Denials: From Data Synthesis to Production
This talk will cover every step of building a fine-tuned LLM and running it in production. I'll start by talking about how we collected the data and worked around the lack of good public datasets for what we were trying to accomplish. Then we'll dive into the fine-tuning, looking at the different model options and why we chose what we chose. Finally, you'll get to see it running in production, along with the sketchy hardware I bought so it didn't cost an arm and a leg.
Almost all of us have had an unfair health insurance denial at some point, and for health care professionals, dealing with health insurance denials can quickly become a full-time job. As a trans person I've had more than my share, and my friends have seen more than their fair share of denials for everything from routine care to gender-affirming care. Even if you have not had to deal with health insurance denials yet, ask around in your family and social circle and you won't need to go far.
With some insurance companies allegedly using AI to unfairly deny claims (with over 90% false positive rates), it's time that consumers are able to fight back. Many less-than-scrupulous insurers depend on the appeals process being too complicated. Thankfully, with LLMs we can lower the barrier to appeal, and if things go really, really well, maybe we can increase the cost to insurance companies of denying our claims.
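As background for the fine-tuning step, here is a rough sketch of parameter-efficient fine-tuning in the HuggingFace ecosystem; the base model, data, and hyperparameters are placeholders, not the ones used in this project:

```python
# Rough sketch of parameter-efficient fine-tuning with LoRA; the model name,
# data, and hyperparameters below are illustrative, not this project's actual setup.
from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"  # placeholder; a real appeal-writing model would be much larger
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model so only small low-rank adapter matrices are trained.
model = get_peft_model(
    model, LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05))

# Tiny synthetic (denial, appeal) example standing in for real training data.
examples = ["Denial: not medically necessary.\nAppeal: Per the policy language ...\n"]
ds = Dataset.from_dict({"text": examples}).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="appeal-lora", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```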
#ai #llms #dataengineering
Building Generative AI Applications: An LLM Case Study
Michelle Yi - Women In Data
The adoption and application of Large Language Models (LLMs) such as Llama 2, Falcon 40B, GPT-4, etc. in building generative AI applications is an exciting and emerging domain. In this talk, Michelle Yi will dive into the end-to-end process of and framework for building a generative AI application, leveraging a case study with open-source tooling (e.g., HuggingFace models, Python, PyTorch, Jupyter Notebooks). She will then guide attendees through key stages from model selection and training to deployment, while also addressing fine-tuning versus prompt-engineering for specific tasks, ensuring the quality of output, and mitigating risks. The discussion will explore the challenges encountered and emerging solutions and architectures developed. Finally, she will provide attendees with a pragmatic framework for assessing the opportunities and hurdles associated with LLM-based applications. This talk is suitable for AI researchers, developers, and anyone interested in understanding the practicalities of building generative AI applications.
#ai
From Open Source to SaaS: The ClickHouse Odyssey
Roopa Tangirala - ClickHouse
Have you ever wondered what it takes to go from an open-source project to a fully-fledged SaaS product? How about doing that in only 1 year’s time? If the answer is yes, then this talk is for you. Join Roopa Tangirala as she unfolds the captivating journey of accomplishing this feat, delving into the architectural intricacies of ClickHouse Cloud. This includes a detailed exploration of the fully separated storage and compute elements orchestrated on Kubernetes, shedding light on the design decisions made and potential pitfalls encountered along the way. Whether you're keen on the technical nuances or seeking practical insights for your own SaaS venture, this talk promises to serve as a compelling guide for your journey ahead.
An abridged history of DuckDB: database tech from Amsterdam
Peter Boncz - MotherDuck
In this session, Peter Boncz will discuss the evolution of analytical database systems, starting from the classical relational database systems all the way to DuckDB, the fastest-growing data system today. His home institution, CWI (birthplace of Python!), located in Amsterdam, was party to multiple relevant events in this evolution, which includes the introduction of column stores (MonetDB), vectorized query execution and lightweight compression (VectorWise), cloud-optimized database architectures (Databricks and Snowflake), and finally embedded analytics (DuckDB). Peter will combine technical info with personal anecdotes along this 30-year tale, which finishes with the launch of MotherDuck, a new startup that aims to put DuckDB in the cloud.
#database
ELT and ETL are not products, technologies, or features. If someone says otherwise, they are selling you something.
Roy Hasson - Upsolver
Building a data platform is hard. You wade through hundreds of blogs, videos and whitepapers by vendors, engineers and social media influencers, and at the end you’re paralyzed, not sure where to start or what to build. This leads to teams just buying whatever vendors are selling, because it sounds easy and promises to solve all of their problems…with a sprinkling of AI to get executives excited. In this session, Roy will explore the 20+ year old battle between ELT and ETL. He'll discuss what these ideas are and how they are sold to us by vendors and influencers. Should you choose Snowflake+Fivetran or Databricks+Arcion? What about Trino+Spark+Airbyte? Roy will provide a framework for you to evaluate which one is the best approach for you, how to choose the right tools, when to consider open source vs. commercial tools and how to get started building a data platform in a quick, educated and effective way.
#dataengineering
OneTable: Interoperate between Apache Hudi, Delta Lake & Apache Iceberg
Dipankar Mazumdar - Onehouse
Apache Hudi, Iceberg, and Delta Lake have emerged as leading open-source projects, providing decoupled storage with a powerful set of primitives that offer transaction and metadata layers (commonly referred to as table formats) in cloud storage. They all provide a table abstraction over a set of files and, with it, a schema, commit history, partitions, and column stats. Each of these projects has its own rich set of features that may cater to different use-cases. Hudi works great for update-heavy and incremental workloads. Iceberg may be a good choice if you already have Hive-based tables and want to deal with consistency issues. You shouldn’t have to choose one over the others.
In this session, Dipankar will introduce OneTable, an open source project created to provide omni-directional interoperability between the various lakehouse table formats. Rather than a new or additional format, OneTable provides abstractions and tools for the translation of existing table format metadata. Dipankar will show how, with OneTable, you can write data in any format of your choice and convert the source table format to one or more targets that can then be consumed by the compute engine of your choice.
#dataengineering #analytics #streamingdata
Battle of the warehouses, lakehouses and streaming DBs: choose your platform wisely.
Roy Hasson - Upsolver
Snowflake or Databricks, Iceberg or Hudi, ClickHouse or Pinot. Should you choose based on performance, cost, features, ecosystem, or what some LinkedIn influencers are telling you? So many options, so little time and money. How do you choose? In this session, Roy will review the key business use cases requiring modern data tools and compare the top players in the warehouse, lakehouse and streaming DB market intended to offer solutions. He'll share a framework to help you quickly sort out the options and decide on the one that best fits your business needs for storing, processing and analyzing data. At the end of this session, your boss will be amazed at how well you know the market players; at a minimum, it could lead to a fun debate.
#dataengineering
Cost containment: Scaling your data function on a budget
Lindsay Murphy - Secoda
Data teams spend a lot of time measuring and optimizing the effectiveness of other teams. Unfortunately, we're not so great at doing this for ourselves. In this talk, we will dive into a big blind spot that a lot of data teams operate with: not knowing how much they're costing their business (now and in the future). Given how easy it is to rack up expensive bills in pay-as-you-go tools across the modern data stack, this can become a big problem for data teams, very fast. We'll discuss how to build your own cost monitoring metrics and reporting (and why you should make this a priority), some of the challenges you will face along the way, and best practices for implementing cost containment into your workflows to drive cost accountability across your company.
#dataengineering
Introduction to the Streaming Plane: Bridging the Operational and Analytical Data Planes
Hubert Dulay - StarTree
In the dynamic landscape of data management, the concept of the "data divide" has emerged as a pivotal idea that highlights the crucial distinction between two essential components: the operational data plane and the analytical data plane. The bridge between these two planes has traditionally been a one-way highway from the operational to the analytical plane. The path in the opposite direction is an arduous, awkward, and costly one that includes solutions such as Reverse ETL (rETL) and Data Activation. These solutions try to extract already cleansed and mastered data residing in the analytical plane from data systems that aren't optimized for large extractions. By merging real-time and historical data, one can gain a more comprehensive and nuanced view of operations, customers, and markets. The Streaming Plane is the bridge that connects the operational and analytical aspects of data processing. It captures and processes real-time data, allowing it to flow seamlessly into the analytical phase where it's stored, analyzed, and used for insights and decision-making. The Streaming Plane empowers organizations to access and analyze a broader spectrum of data types, enabling better-informed decisions in real-time and over time.
#dataengineering
Type Theory as a Unifying Paradigm for Modern Databases
Haikal Pribadi - Vaticle
Over the past decades, data modeling has become a highly diversified discipline with many competing paradigms emerging across various application domains. We argue that higher levels of abstraction, including in particular the integration of high-level programming and reasoning techniques, will pave the way forward for future knowledge management systems. As a programmatic foundation for this endeavor, we will discuss a novel type-theoretical modeling and reasoning paradigm, which aims to strike a powerful balance between what can be naturally semantically modeled and what can be practically implemented. TypeQL is a multi-purpose database language rooted in these foundations: it is designed around the expressivity of natural language and backed by type-theoretical principles. This rigorous high-level approach to database language design reduces development and maintenance loads, preventing hard-to-spot data integrity and logic errors through its underlying type system, while also providing a unifying toolset for a large class of domain-specific applications, ranging from querying connected data in knowledge graphs for drug discovery to reasoning and adaptive decision making in cognitive robotics.
Data Product Chaos
Malcolm Hawker - Profisee
Data products have become all the rage, even while most data people struggle to define what they are. For some, the concept of a data mesh provides slightly more clarity, while being completely unattainable. For others, the benefits of data products loom large but remain out of grasp thanks to misguided perceptions of what they are and how they should be managed. This combination of hype and confusion is creating chaos, ensuring most will fail to realize any meaningful value from data products. In this session, Malcolm Hawker will dispel the many myths around data products and share why data product management should be your primary focus. Malcolm will show you how to:
• Define data products, while acknowledging the definition is largely meaningless.
• Delineate between data products in a data mesh, data products in general, and data product management.
• Outline the concepts and benefits of product management to data and analytics teams.
• Provide recommendations on how to break free from the chaos.
#dataengineering #dataproducts
Invisible Threats and Data Hygiene: The Hidden Value of Data Products
Ron Itelman - Intelligence.AI
Drawing parallels between the groundbreaking work of Dr. Ignaz Semmelweis in 19th-century medicine and modern organizational dynamics, this talk explores the concept of "hidden threats" that often go unnoticed but can have significant impacts. Reviewing the story of Dr. Semmelweis, a Viennese obstetrician who reduced maternal mortality rates through simple handwashing procedures, provides an illuminating case study on how seemingly minute factors can have profound systemic consequences. Similarly, contemporary organizations face their own “invisible germs” in the form of ambiguity, knowledge gaps, and blind spots. These elements, often overlooked in the day-to-day decision-making processes, can have cascading effects that disrupt an organization's coherence and performance. This talk introduces a unifying methodology, operationalized through high-quality data products, designed to address and mitigate these hidden threats. By leveraging techniques such as JSON Schema, businesses can achieve a harmonized approach to minimizing ambiguity and enhancing data hygiene, leading to transformative operational improvements.
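To make the JSON Schema point concrete (the schema and field names below are invented for illustration, not taken from the talk), a minimal data-product contract and validation check in Python might look like this:

```python
# Illustrative only: a tiny data-product contract expressed as JSON Schema,
# validated with the `jsonschema` library. Field names are invented.
from jsonschema import validate, ValidationError

order_schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount_usd": {"type": "number", "minimum": 0},
        "placed_at": {"type": "string", "format": "date-time"},
    },
    "required": ["order_id", "amount_usd", "placed_at"],
    "additionalProperties": False,   # ambiguity surfaces as an explicit error
}

record = {"order_id": "A-1001", "amount_usd": 42.5, "placed_at": "2024-01-27T09:30:00Z"}

try:
    validate(instance=record, schema=order_schema)
    print("record conforms to the contract")
except ValidationError as err:
    print("contract violation:", err.message)
```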
#dataproducts
Geospatial Analytics With Apache Sedona In Python, R & SQL
William Lyon - Wherobots
Learn to work with geospatial data at scale. William will start with the basics, including how to model and query point, line, and polygon geometries using SQL. Then he'll demonstrate how to build geo-aware applications using Python and R. Finally, William will introduce techniques for applying geospatial artificial intelligence (GeoAI) and machine learning, all using Apache Sedona, the open-source geospatial analytics platform.
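As a taste of the SQL-over-geometries style the session will cover, here is a minimal point-in-polygon sketch with PySpark; the registration call and data are illustrative, and the exact bootstrap depends on your Sedona version:

```python
# Illustrative point-in-polygon query using Apache Sedona's spatial SQL functions.
# Assumes the Sedona jars are on the Spark classpath; registration varies by version.
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator

spark = SparkSession.builder.appName("sedona-demo").getOrCreate()
SedonaRegistrator.registerAll(spark)   # registers the ST_* functions with Spark SQL

spark.sql("""
    SELECT ST_GeomFromWKT('POINT(-97.74 30.27)') AS geom
""").createOrReplaceTempView("points")

result = spark.sql("""
    SELECT *
    FROM points
    WHERE ST_Contains(
        ST_GeomFromWKT('POLYGON((-98 30, -97 30, -97 31, -98 31, -98 30))'),
        points.geom)
""")
result.show()
```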
Querying a Graph Through an LLM
Marko Budiselic - Memgraph
Large Language Models are powerful tools for extracting knowledge from existing text datasets. However, there are also many issues because of how LLMs work under the hood: bias, misinformation, scalability, and cost, just to name a few. Marko will show how graph data can enhance LLM output, lessen hallucination, and increase cost-effectiveness.
Computational Trinitarianism
Ryan Wisnesky - Conexus
In this session, Ryan Wisnesky will show how computation has three aspects: logic, type theory, and algebra. He will describe how type theory can be used to formalize almost anything, how logic informs 'proof engineering', and how algebra, and specifically category theory, allows us to relate our type-theoretic constructions to each other. Ryan will also describe applications of this way of thinking, including how to build provably correct data transformation systems.
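As a tiny illustration of the logic/type-theory leg of that correspondence (an example for orientation, not material from the talk), under propositions-as-types a proof is just a program, as in this Lean 4 snippet:

```lean
-- Propositions-as-types: proving "A and B implies B and A" is the same as
-- writing a function that swaps the components of a pair.
theorem and_swap {A B : Prop} : A ∧ B → B ∧ A :=
  fun h => ⟨h.right, h.left⟩
```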