The Data Day Texas 2024 Sessions

Plenary Keynote
Practitioner turned Executive: what caught me by surprise, and lessons I learned about how decisions are really made in data ecosystems.

Sol Rashidi

Sol will present an unfiltered and transparent conversation about the unexpected challenges and revelations she experienced pivoting from a hands-on data practitioner to a strategic data executive. She'll discuss the successes, the failures, and the popularity one sometimes has to trade away in order to make progress when deploying large-scale data ecosystems. There are under-appreciated intricacies involved in transitioning from a technical role to an executive position, and there are EQ, SQ, and BQ that must be developed to complement the IQ. While it may be obvious what needs to be done, Sol will share the real-world processes and politics that drive decision-making in the data world.

Machine Learning Keynote
Distilling the meaning of language: How vector embeddings work

Susan Shu Chang - Elastic

Vector embeddings have been a foundation of NLP for years and have evolved as fast as the field itself. How do they work behind the scenes? This talk will take a deep dive into how vector embeddings work, how they are built, and the various challenges that can make improving them tough. Finally, we'll discuss how they fit into a full ML pipeline.
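As a quick, hedged illustration of the idea (not drawn from the talk itself), the sketch below uses the open-source sentence-transformers library to turn sentences into vectors and compare them with cosine similarity; the model name is just a common default.

```python
# A minimal sketch of vector embeddings: encode text into fixed-length vectors
# and compare their meaning with cosine similarity.
# Assumes `pip install sentence-transformers numpy`; the model choice is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

sentences = [
    "How do vector embeddings work?",
    "What is the mechanism behind text embeddings?",
    "The weather in Austin is warm in January.",
]
vectors = model.encode(sentences, normalize_embeddings=True)  # unit-length vectors

# With normalized vectors, cosine similarity is just a dot product.
similarity = vectors @ vectors.T
print(np.round(similarity, 3))  # the two embedding questions score far higher than the weather sentence
```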

MLOps Keynote
MLOps: Where do we go from here?

Mikiko Bazeley - Labelbox

Is MLOps at a dead end? How can we bridge the chasm between MLOps and DataOps/Data Engineering? Do we need another X-Ops? The arrival of Gen-AI and LLMs to the masses has surfaced a number of important and increasingly difficult questions about how to make sense of it all (with "all" = the X-Ops chaos and mishmash of tooling, vendors, patterns, and the war between OSS and private).
As practitioners, technical leaders, and teams, what do we need to know in order to build ML products and teams responsibly & effectively, even amongst the noise?
Among the topics covered in this session, Mikiko will touch on:
- What are the main problems MLOps tries to solve?
- What does the archetypal MLOps platform look like? What are the most common components of an MLOps platform?
- How can existing ML platforms be extended to account for new use cases involving LLMs?
- Does the team composition change? Do we now need to start hiring “prompt engineers”? Should we stop existing initiatives? Do we need to pivot?
#mlops #dataengineering

Data Architecture Keynote
What Data Architects and Engineers can learn from Library Science

Jessica Talisman

Data architecture and information science are fundamental to all things digital. The HTTP protocol, underlying the internet itself, was designed to facilitate the organization and sharing of academic research. From computer hardware systems to enterprise software, data architecture is a base requirement. But what are the essential components that define an information system? Controlled vocabularies, data catalogs, thesauri, taxonomies, ontologies and knowledge graphs are the building blocks that make up the academic discipline of Information & Library Science.
Library and Information Science (LIS) may seem like a niche domain, and it is. Coming from a discipline founded upon ambiguity, information scientists specialize in the art of disambiguation. Resource description is a core element of LIS, responsible for cleaning, defining, classifying, cataloging and structuring data. Operationalized, resource description is evident through metadata, data catalogs, schemas, records and repositories. Search systems rely on resource description for findability, access, provenance, system reconciliation and usage metrics. Knowledge organization is born out of resource description, when context and meaning can be derived from the networks of resources being described.
For Data Engineers, an understanding of information science is a superpower, especially in the age of AI. Need to create a valid classification structure? A framework for a data catalog? Create transformers to reconcile schemas and entities? What about property graphs and knowledge graphs? In this talk, we will demystify the fundamentals of information classification systems, and delve into how information science can work synergistically with data engineering.

Bridging the Gap: Enhancing Collaboration Between Executives and Practitioners in Data-Driven Organizations

Joe Reis - Ternary Data
Sol Rashidi

A good relationship between executive leadership and data practitioners is crucial in the swiftly evolving data landscape. This discussion, presented by decorated data executive Sol Rashidi and veteran data practitioner Joe Reis, delves into the dynamic interplay between strategy and practical implementation in the world of data.
Our session will explore effective methods for fostering a productive dialogue between executives focused on strategic vision and business outcomes and practitioners deeply involved in the technical execution and day-to-day challenges. We will discuss how these two roles can not only coexist but thrive together, creating a robust and forward-thinking data environment.
Key topics include understanding executives' and practitioners' different perspectives and languages, establishing common goals, and leveraging each other's strengths. We will also touch upon the importance of cultivating a culture of mutual respect, continuous learning, and adaptability in the face of technological advancements.
Through real-world examples and interactive discussions, this talk aims to equip executives and practitioners with insights and tools to enhance their collaboration, leading to more innovative solutions and successful outcomes in data. Join us to learn how to bridge the gap and harness the full potential of your organization's data capabilities.

Data Engineering Keynote
The State of Data Engineering ... and not repeating history.

Jesse Anderson - Big Data Institute

Let's have a frank and honest chat about the current state of data engineering. We (mostly) suck as an industry. Data science influencers created an alternate reality of the industry. We'll go through some history to see where we're headed. Finally, we'll talk about how we can turn things around. WARNING: This talk will be offensive to people who’ve invested too much and done too little.

Survey of recent progress in intelligent robotics

Jonathan Mugan - De Umbra

Robots are finally getting smart. This talk will survey this progress from the vantage points of hardware, intelligence, and simulation infrastructure. On the hardware front, we will look at robots that can elegantly walk around and manipulate objects. We will break down the advancements in robot intelligence into two components: improvements in thinking and progress on skill learning. Thinking has been helped significantly by large language models (LLMs), but these LLMs must be grounded in a representation that allows the robot to search through spaces of options. LLMs also aid in skill learning, but we will also discuss how diffusion methods have enabled breakthroughs. And finally, we will look at the current simulation landscape that allows robots to be trained cheaply in diverse environments.

Data Quality Skepticism

Santona Tuli - Upsolver

Failure modes are hard to predict—we shouldn’t fool ourselves into thinking that we’ve guaranteed high-quality data by implementing constraints-as-expectations or contracts on data. You can implement all the quality checks and alerts you want, at every stage of the pipeline, but if you don’t have general health monitoring, relational/semantic metadata and the ability to react to changes, you can’t promise high quality data.
In this session, Santona will reflect on her decade of working with data. She will share how she has learned to be ever skeptical of claiming to have high quality data, and how the better approach is simply to define and quantify data health within a system at any given time. Santona will then outline strategies for being process-driven and thinking holistically about data quality, and how things can still be tricky.
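For readers unfamiliar with the "constraints-as-expectations" pattern the session pushes back on, here is a minimal, hypothetical example using the pandera library; the table, column names, and thresholds are invented, and, as the talk argues, passing such a check is not by itself a guarantee of data health.

```python
# A minimal "constraints-as-expectations" check (the kind of gate the talk argues
# is necessary but not sufficient). Assumes `pip install pandera pandas`;
# column names and thresholds are illustrative.
import pandas as pd
import pandera as pa

orders_schema = pa.DataFrameSchema({
    "order_id": pa.Column(int, unique=True),
    "amount_usd": pa.Column(float, pa.Check.ge(0)),                       # no negative amounts
    "status": pa.Column(str, pa.Check.isin(["new", "paid", "refunded"])),  # allowed states only
})

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount_usd": [19.99, 5.00, 42.50],
    "status": ["new", "paid", "refunded"],
})

# Raises a SchemaError on violation; passing says nothing about overall data health.
orders_schema.validate(df)
```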

Evolving as a Data Scientist in the age of AI

Megan Lieu - Deepnote

The ground is shifting beneath us every day as data scientists. So how do we future-proof our careers when we have no way of predicting that future?
Data science is a field where many of its professionals have invested a lot of time, money, and degrees just to get into it in the first place. That means we have even more to lose when meteoric shifts like the arrival of Generative AI affect the industry.
This talk goes through how we as data scientists can prime ourselves for longevity in a space where nothing is certain and everything seems to be evolving at lightning speed every day. We will do so by looking at past industry shifts, taking inventory of our current skills, and attempting to predict where we and our field are headed in the future. Whether you’re just starting out or a seasoned veteran, this talk will offer actionable lessons for all data scientists.

Data's Product Pivot: The Reliability Imperative

Shane Murray - Monte Carlo

In today's data landscape, as organizations transition from managing data monolithically to treating it as a product, the spotlight intensifies on the reliability and trustworthiness of data. This imperative reshapes organizational structures and challenges traditional notions that mere data quality suffices. In this talk, Shane will delve into the challenges of upholding data reliability at scale, navigate the complexities of ownership models, and share strategies for success.

Demystifying the next advancement in LLMs: Retrieval-Augmented Generation (RAG)

Adi Polak

Have you ever asked an AI language model like ChatGPT about the latest developments on a certain topic, only to receive this response: "I'm sorry, but as of my last knowledge update in January 2022, I don't have information on the topic at hand..." If you have, you've encountered a fundamental limitation of large language models. You can think of these models as time capsules of knowledge, frozen at the point of their last training. They can only learn new information by going through a retraining process, which is both time-consuming and computationally intensive. In the fast-paced world of artificial intelligence, a new technology is emerging to tackle this challenge — Retrieval-Augmented Generation, or RAG. This innovative approach is revolutionizing how language models operate, breaking down barriers and opening up new possibilities. But what exactly is RAG? Why is it important? And how does it work? All this and more in this talk.
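To make the retrieve-then-generate loop concrete, here is a minimal, hedged sketch (not the speaker's code): it embeds a tiny in-memory corpus, retrieves the most relevant passages for a question, and assembles a grounded prompt. The final call to a chat model is left as a stub because the talk does not prescribe a provider.

```python
# A minimal retrieval-augmented generation (RAG) sketch: embed a small corpus,
# retrieve the most relevant passages for a question, and assemble a grounded prompt.
# Assumes `pip install sentence-transformers numpy`; the LLM call itself is left as a stub.
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = [
    "RAG augments a language model with passages retrieved at query time.",
    "Model weights are frozen at training time, so new facts require retraining or retrieval.",
    "Vector databases store embeddings and support approximate nearest-neighbor search.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_vecs = embedder.encode(corpus, normalize_embeddings=True)

def retrieve(question, k=2):
    """Return the k passages most similar to the question."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = corpus_vecs @ q_vec
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

question = "Why can't a frozen LLM answer questions about recent events?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# Hand `prompt` to any chat-completion API of your choice; that call is
# provider-specific and intentionally omitted here.
print(prompt)
```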

We decomposed the database - now what?

Matthias Broecheler - DataSQRL

Once upon a time, a simple relational database in a 3-tiered architecture was all you needed. Over the last 20 years, data management has gotten increasingly more complex as hundreds of data technologies were invented to address specific needs. As developers and data engineers, we're left grappling with the challenge of integrating these diverse technologies into working applications and data pipelines.
This session will show how the disassembly of the database into its components to scale data-intensive applications has led to this complexity. We introduce an open-source database compiler called DataSQRL that aims to put the database back together by orchestrating data technologies like Kafka, Flink, and Postgres into scalable data microservices or data pipelines.
DataSQRL eliminates integration complexity like connector implementations, schema mapping, and data synchronization.
The vision for DataSQRL is a higher-level database framework that combines the simplicity of a single database with the features, performance, and scalability of modern data technologies.

Under the hood of vector search with JVector

Jonathan Ellis - DataStax

This is a speedrun through the past ten years of R&D on approximate nearest-neighbor search algorithms and vector databases, covering the major advances, the current state of the art, and possible future directions. This will be informed by Jonathan's experience leading the development of JVector, an embedded search engine for Java that powers vector search for Apache Cassandra and DataStax Astra.

Expert Systems are Generative AIs

Ryan Wisnesky - Conexus

We claim that symbolic, rule-based AIs - millions of which have already been deployed since the phrase ‘expert system’ became popular in the 1980s - can and should be repurposed as generative AIs, enabled by recent advances in mathematics and computer science. Although how to do so has been well-known to computer scientists since the 1970s, it is only now post 2020 that computers and algorithms are fast enough to take advantage of it.
How does this work? ‘Generativity’ is about generating new from old. In the case of ChatGPT, new English text (the response to a question) is generated from old English text (the training corpus). In the case of Stable Diffusion, new images are generated from old images. Although ChatGPT and Stable Diffusion operate by statistical, heuristic methods, symbolic AIs can perform many of the same generative tasks, as evidenced by the eerie similarity between the popular discourse around the ‘Eliza’ chatbot of the 1980s and today’s ChatGPT. Generativity is not about machine learning: it is about generating new things, as opposed to simply answering questions. And for the most part, the same sets of rules that define expert systems can be re-used for generative purposes; we need only change the underlying algorithms that we run on them: rather than run deduction/inference/entailment algorithms, we need only run model completion/‘chase’ algorithms. As we explain in this talk, only some logics admit model completion, providing concrete guidance about which logics (and therefore technologies) to use for generative symbolic AI purposes.

Graph Analytics Keynote
Patterns of Power: Uncovering control points to influence outcomes

Amy Hodler - graphgeeks.org

We all know everything is connected, but what can you do with this information? Some organizations use the relationships between data to improve predictions, but there's a huge untapped opportunity to apply these patterns to shape more positive outcomes. Simply put, we are wasting insights hiding in existing data that can help grow revenue, increase resiliency and safety, and improve lives.
In this session, Amy Hodler will share the common forms of power within various networks ranging from IT and software systems to social, biological, and financial ecosystems. She'll show how companies like LinkedIn and Google have revealed the structural patterns of influence using network analysis and graph analytics. Then she'll delve into the concept of centrality and how it's used to combat terrorism, cyberattacks, and infectious diseases. And we'll look at how this science of measuring importance is evolving to be more readily applied to different needs.
From identifying key actors to strategies for using control points, you'll learn about different approaches and tools for moving beyond predictive analytics to enacting change. Finally, Amy will examine ethical considerations and explore the responsible use of technology to influence power dynamics. You'll walk away with practical knowledge about uncovering power and influence patterns, along with methods to actively shape positive outcomes.

I am an AI Product Manager, Am I Not?

Greg Coquillo

Should you embrace the title of an AI Product Manager? Will it align with your professional brand, product portfolio, and strategy? In this session, Greg Coquillo will dive into the future of product management, and explore the diverse PM archetypes in the age of ubiquitous GenAI. Greg will go on to discuss the innovation, challenges, and evolving landscape of AI-infused products, and prepare you to navigate the frontier of AI product management in 2024.

My Attempt To Build a Foundation for AI and Data

Hala Nelson

I will describe my own journey from academia and mathematics to the worlds of AI and data, the treacherous waters that I encountered, and my attempt to remedy that by bridging the gaps between science, technology, engineering, mathematical modeling, computation, AI, and everything data. I will survey the modeling, analytical, and engineering skills necessary to have a solid footing in the AI and data fields, discuss various projects I worked on, and brainstorm strategies for implementing AI, which almost always translates to data strategy, at institutional levels.

Managing Competing AI Decisions in Large Ontologies

Ryan Mitchell

Knowledge graphs, semantic triples, and their ontological kin are fast and cheap to generate with modern artificial intelligence algorithms. However, these algorithms are not infallible, and, as the pace of research increases, we may find ourselves managing multiple AI experiments, competing outputs, and even human-labeled judgments. Ryan will discuss strategies for thinking about and managing ontologies in order to assess their “truthiness” in large complex systems with many sources of data, decision-making algorithms, and knowledge validation methods. Along with high-level concepts, you’ll get concrete, and immediately implementable data architecture patterns for organizing, using, and future-proofing this information.

Conceptual Modeling - a practical way to capture business needs for data products

Juha Korpela

Data teams often struggle to show impact on the business. To ensure value creation, Data Products should be designed to fit real business needs. However, communicating business needs in an actionable form is sometimes difficult. Conceptual Modeling is a tried and tested way to document business realities in a structured format that is easy to utilize in data projects. Instead of massive top-down modeling efforts, in the modern data world we should keep it simple: use Conceptual Modeling to facilitate fast and agile collaboration between business users and data experts when designing Data Products. This drives engagement and improves value creation for individual projects. Connected to Data Engineering workflows as design inputs and business documentation, Conceptual Models provide the engineers with basic guidelines on what business objects should exist and how the data elements are connected in real life. The linkage between engineering and modeling also enables organizations to look at the bigger picture - how all the different Data Products are related, and how they cover the needs of the business.
Key takeaways from this talk include: 1) How Conceptual Modeling helps to document real business needs, 2) How to do it fast & easy in agile projects, 3) Connecting data engineering with Conceptual Models to enable the bigger picture, 4) Benefits for individual projects & the whole organization.

Ensuring Success for your Data Team

Clair Sullivan

A significant majority of all data projects fail, with most never making it to production or delivering business outcomes. While many companies have embraced the need to be data driven, most have taken an “upside-down approach” to do so, resulting in wasted money and failing to create any ROI. To be successful, there are four key things that need to be in place: data culture, the right problem, the right data, and the right people. After defining what it means to have an “upside-down approach” to data, we will explore what makes for a good data culture and signs that you have one. We will then turn to identifying the right business problems and key decisions that should be made along the way in terms of getting the data in place before any analytics or modeling work begins. The session will conclude with suggestions for the continued “care and feeding” of the data culture to ensure scalability and long-term success.

Practical Large Language Models: Using LLMs in Your Business With Python

Jonathan Mugan - De Umbra

LLMs can teach you the theory of general relativity, but let’s talk about something important. We will cover how you can use large language models (LLMs) such as ChatGPT in your business using the Python ecosystem. We will survey the most popular Python libraries for interacting with LLMs and embeddings. We will cover document retrieval (semantic search) without using keywords and enabling an LLM to use information in your documents to answer questions. We will also cover how to enable LLMs to use tools and how to use LLMs to generate SQL queries so that you can build a chatbot that provides quantitative answers from your database. To store embeddings of your documents, we will survey vector databases and show how you can use Milvus, a popular open-source vector database. Finally, we will cover how LLMs work at the level of matrix multiplication so you can get a grounded sense of their strengths and weaknesses.
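As a hedged illustration of the "chatbot that answers from your database" pattern mentioned above (not the speaker's code), the sketch below prompts an LLM for SQL and runs the result against SQLite; the LLM call is stubbed with a hard-coded query so the flow is runnable end to end.

```python
# A minimal "chat with your database" sketch: give an LLM the schema, ask it for SQL,
# then run that SQL. The LLM call is stubbed with a hard-coded answer so the flow is
# runnable; swap in your preferred client. Table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("TX", 120.0), ("TX", 80.0), ("CA", 200.0)])

SCHEMA = "Table sales(region TEXT, amount REAL)"

def llm_generate_sql(question, schema):
    """Placeholder for a real LLM call. A production version would send
    `question` and `schema` in a prompt and return the model's SQL string."""
    prompt = f"Given the schema:\n{schema}\nWrite a SQL query for: {question}"
    _ = prompt  # the prompt you would actually send
    return "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"

sql = llm_generate_sql("What are total sales by region?", SCHEMA)
for row in conn.execute(sql):  # always review/validate generated SQL before running it in production
    print(row)
```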

On the Data Highway: Is Data Vault Speedy or Reckless?

Veronika Durgin - Saks

Data Vault is a comprehensive system of Information Delivery that includes methodology, architecture, and data model for effectively managing and integrating enterprise data. It is platform and tool agnostic. Data Vault emphasizes the importance of agility, scalability, and adaptability in the context of data management. Learning Data Vault is a lot like learning to drive - anyone can do it, but there are a few key steps to navigate it safely and smoothly. In this session, I will cover Data Vault’s fundamentals, its core architecture, data modeling patterns, as well as standards and best practices. We will discuss when and why to use it and how it helps put “pedal to the metal” on the path from data to value. Ready to take Data Vault for a spin? Buckle up, and join this session.

Past, Present and Future of Data Catalogs

Juan Sequeda

A data catalog is a tool used to organize, manage, and discover data assets within an organization. The modern data stack boosted the popularity of data catalogs to the point that they are now an essential part of the stack in addition to ETL tools, Data Lakes/House/Cloud and Analytics/ML tools. But how did this happen?
In this talk, we will go down memory lane by revisiting different independent threads (Library Sciences, Web, Big Tech, Open source) that have brought us to the data catalogs of today, and postulate where data catalogs can and should be going in the future.

Ten Simple Rules for Writing a Technical Book

Jess Haberman

If you have deep expertise in a technical topic, you may have thought about developing educational materials, like documentation, a learning course, or a book. But undertaking the voyage into publishing is daunting. Join Jess Haberman, a seasoned navigator of the ever-changing seas of technical book publishing, for an enlightening journey through the intricacies of the publishing world. As Director of Learning Solutions at Anaconda and formerly acquisitions editor at O'Reilly Media, Jess has spent much of her career bringing world-shaping ideas of technology innovators to knowledge-seeking professionals and students. Together, let's unravel some behind-the-curtain insights and refined best practices tailored specifically for the publishing-curious among you.
Through engaging with industry titans, Jess has gathered a treasure trove of experiences to share. We'll explore the hidden gems of the publishing realm, tips to follow, and pitfalls to avoid so you can elevate your writing endeavors. Chances are you already know many of the authors Jess signed while at O’Reilly—get to know more about the work they did to create their masterpieces. Discover the knowledge that will transform your writing journey into an epic odyssey you can be proud of undertaking.

Causality: The Next Frontier of GenAI Explainability (90 minute hands-on)

Amy Hodler and Michelle Yi

In a world obsessed with making predictions and generative AI, we often overlook the crucial task of making sense of these predictions and understanding results. If we have no understanding of how and why recommendations are made, if we can’t explain predictions – we can’t trust our resulting decisions and policies.
In the realm of predictions and explainability, graphs have emerged as a powerful model that has recently yielded remarkable breakthroughs. This talk will examine the implications of incorporating graphs into the realm of causal inference and the potential for even greater advancements. Learn about foundational concepts such as modeling causality using directed acyclic graphs (DAGs), Judea Pearl’s “do” operator, and keeping domain expertise in the loop. You’ll hear how the explainability landscape is evolving, comparisons of graph-based models to other methods, and how we can evaluate the different fairness models available.
In this work session, you’ll get an overview of using the open source PyWhy project with a hands-on example that includes the DoWhy and EconML libraries. We’ll walk through identifying assumptions and constraints up front as a graph and applying that through each phase of modeling mechanisms, identifying targets, estimating causal effects, and robustness testing to check prediction validity. We’ll also cover our lessons learned and provide tips for getting started.
Join us as we unravel the transformative potential of graphs and their impact on predictive modeling, explainability, and causality in the era of generative AI.
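For attendees who want a preview of the hands-on portion, here is a minimal sketch using DoWhy's four-step API on synthetic data; the variable names, effect sizes, and method choices are illustrative assumptions, not the workshop's actual example.

```python
# A minimal causal-inference sketch with the PyWhy/DoWhy four-step API
# (model -> identify -> estimate -> refute), on synthetic data.
# Assumes `pip install dowhy pandas numpy`; variables and effect sizes are invented.
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(0)
n = 5_000
w = rng.normal(size=n)                         # confounder
t = (w + rng.normal(size=n) > 0).astype(int)   # treatment influenced by w
y = 2.0 * t + 1.5 * w + rng.normal(size=n)     # outcome; true treatment effect is 2.0
df = pd.DataFrame({"W": w, "T": t, "Y": y})

# 1. Model: state the causal assumptions (here via common causes; an explicit DAG can also be supplied).
model = CausalModel(data=df, treatment="T", outcome="Y", common_causes=["W"])

# 2. Identify the estimand implied by those assumptions.
estimand = model.identify_effect(proceed_when_unidentifiable=True)

# 3. Estimate the effect with a backdoor-adjusted linear regression.
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print("estimated effect:", estimate.value)  # should land near 2.0

# 4. Refute: check robustness by adding a random common cause.
print(model.refute_estimate(estimand, estimate, method_name="random_common_cause"))
```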

Knowledge Graph Keynote
Beyond Human Oversight: The Rise of Self-Building Knowledge Graphs in AI

Jans Aasman

The rapid success in extracting 'de-hallucinated' graphs from Large Language Models (LLMs) marks a big step forward in AI. Knowledge Graphs, now the industry standard for knowledge-intensive applications in enterprises, are at the forefront of this progress. The future of these Knowledge Graphs lies in their evolution into self-replicating systems, significantly reducing the need for programming and human oversight. This shift towards automated and self-sufficient Knowledge Graphs will ensure a reliable and constantly updated "Source of Truth" in various applications.
In this presentation, Jans Aasman will discuss the four essential capabilities a Knowledge Graph must possess to achieve autonomous knowledge generation and curation:
• A set of primitives in the query processor allowing for the direct extraction of knowledge and graphs from an LLM without requiring programming skills.
• An embedded vector store in the Knowledge Graph. This feature enables natural language queries to interact with your private structured and unstructured data, leading to efficient Retrieval Augmented Information.
• A methodology adapted from symbolic AI that allows users to use natural language to generate queries in structured query languages like SQL, SPARQL, or Prolog, even when the underlying schemas are highly complex.
• Rule-based logic for a true NeuroSymbolic computing platform. The rule-based system can directly invoke LLM functions, rather than being purely symbolic. The goal is for LLM functions to have the capability to write and execute their own rules, significantly enhancing the system's intelligence and functionality.
Jans will provide a demo of enterprise case studies that illustrate the essential role these capabilities play in the development of self-sustaining knowledge systems.
#knowledgegraphs #ai

Using LLMs to Fight Health Insurance Denials: From Data Synthesis to Production

Holden Karau

This talk will cover every step of building a fine-tuned LLM and running it in production. I'll start by talking about how we collected the data and worked around the lack of good public datasets for what we were trying to accomplish. Then we'll dive into the fine-tuning, looking at the different model options and why we chose what we chose. Finally, you'll get to see it running in production – along with the sketchy hardware I bought so it didn't cost an arm and a leg.
Almost all of us have had an unfair health insurance denial at some point, and for health care professionals dealing with health insurance denials can quickly become a full time job. As a trans person I've had more than my share, and my friends have seen more than their fair share of denials for everything from routine care to gender affirming care. Even if you have not had to deal with health insurance denials yet, ask around in your family and social circle and you won't need to go far.
With some insurance companies allegedly using AI to unfairly deny claims (with over 90% false positive rates), it's time that the consumers are able to fight back. Many less than scrupulous insurers depend on the appeals process being too complicated. Thankfully, with LLMs we can lower the barrier to appeal, and if things go really really well maybe we can increase the cost to insurance companies for denying our claims.
#ai #llms #dataengineering

Building Generative AI Applications: An LLM Case Study

Michelle Yi - Women In Data

The adoption and application of Large Language Models (LLMs) such as Llama 2, Falcon 40B, GPT-4, etc. in building generative AI applications is an exciting and emerging domain. In this talk, Michelle Yi will dive into the end-to-end process of and framework for building a generative AI application, leveraging a case study with open-source tooling (e.g., HuggingFace models, Python, PyTorch, Jupyter Notebooks). She will then guide attendees through key stages from model selection and training to deployment, while also addressing fine-tuning versus prompt-engineering for specific tasks, ensuring the quality of output, and mitigating risks. The discussion will explore the challenges encountered and emerging solutions and architectures developed. Finally, she will provide attendees with a pragmatic framework for assessing the opportunities and hurdles associated with LLM-based applications. This talk is suitable for AI researchers, developers, and anyone interested in understanding the practicalities of building generative AI applications.
#ai
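As a hedged starting point for the "model selection" stage discussed above (not the case study's actual stack), the snippet below loads a small open model with the Hugging Face transformers pipeline; the model choice is purely illustrative.

```python
# A minimal sketch of loading an open-source model with Hugging Face transformers
# and generating text. Assumes `pip install transformers torch`; the model choice
# is illustrative, not the one used in the case study.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
result = generator("Generative AI applications are built by", max_new_tokens=30)
print(result[0]["generated_text"])
```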

From Open Source to SaaS: The ClickHouse Odyssey

Roopa Tangirala - ClickHouse

Have you ever wondered what it takes to go from an open-source project to a fully-fledged SaaS product? How about doing that in only 1 year’s time? If the answer is yes, then this talk is for you. Join Roopa Tangirala as she unfolds the captivating journey of accomplishing this feat, delving into the architecture intricacies of ClickHouse Cloud. This includes a detailed exploration of the fully separated storage and compute elements orchestrated on Kubernetes, shedding light on the design decisions made and potential pitfalls encountered along the way. Whether you're keen on the technical nuances or seeking practical insights for your own SaaS venture, this talk guarantees to serve as a compelling guide for your journey ahead.

An abridged history of DuckDB: database tech from Amsterdam

Peter Boncz - MotherDuck

In this session, Peter Boncz will discuss the evolution of analytical database systems, starting from the classical relational database systems, all the way to DuckDB - the fastest growing data system today. His home institution CWI (birthplace of Python!), located in Amsterdam, was party to multiple relevant events in this evolution, which include the introduction of: column stores (MonetDB), vectorized query execution and lightweight compression (VectorWise), cloud-optimized database architectures (Databricks and Snowflake) and finally, embedded analytics (DuckDB). Peter will combine technical info with personal anecdotes along this 30-year tale, which finishes with the launch of MotherDuck, a new startup that aims to put DuckDB in the cloud.
#database

ELT and ETL are not products, technologies, or features. If someone says otherwise, they are selling you something.

Roy Hasson - Upsolver

Building a data platform is hard. You wade through hundreds of blogs, videos and whitepapers by vendors, engineers and social media influencers, and at the end you’re paralyzed, not sure where to start or what to build. This leads to teams just buying whatever vendors are selling, because it sounds easy and promises to solve all of their problems…with a sprinkling of AI to get executives excited. In this session, Roy will explore the 20+ year old battle between ELT and ETL. He'll discuss what these ideas are and how they are sold to us by vendors and influencers. Should you choose Snowflake+Fivetran or Databricks+Arcion? What about Trino+Spark+Airbyte? Roy will provide a framework for you to evaluate which one is the best approach for you, how to choose the right tools, when to consider open source vs. commercial tools and how to get started building a data platform in a quick, educated and effective way.
#dataengineering

You need to be more strategic - The mantra for data leader career growth.

Aaron Wilkerson - Carhartt

Data professionals have built careers out of being experts in meeting the data and reporting/analytics needs of their companies. However, that's not enough. Pick any survey taken by business leaders and you'll find that companies are not seeing the value from their data investments. Something needs to change. In order for the data industry to survive and thrive, a new type of data leader needs to emerge. This session will discuss practical methods for growing into a strategic data leadership role or taking your data leadership to the next level.

OneTable: Interoperate between Apache Hudi, Delta Lake & Apache Iceberg

Dipankar Mazumdar - ONEHOUSE

Apache Hudi, Iceberg, and Delta Lake have emerged as leading open-source projects, providing decoupled storage with a powerful set of primitives that offer transaction and metadata layers (commonly referred to as table formats) in cloud storage. They all provide a table abstraction over a set of files and, with it, a schema, commit history, partitions, and column stats. Each of these projects has its own rich set of features that may cater to different use-cases. Hudi works great for update-heavy and incremental workloads. Iceberg may be a good choice if you already have Hive-based tables and want to deal with consistency issues. You shouldn’t have to choose one over the others.
In this session, Dipankar will introduce OneTable, an open source project created to provide omni-directional interoperability between the various lakehouse table formats. Not a new or additional format, OneTable provides abstractions and tools for the translation of existing table format metadata. Dipankar will show how, with OneTable, you can write data in any format of your choice and convert the source table format to one or more targets that can then be consumed by the compute engine of your choice.
#dataengineering #analytics #streamingdata

Battle of the warehouses, lakehouses and streaming DBs: choose your platform wisely.

Roy Hasson - Upsolver

Snowflake or Databricks, Iceberg or Hudi, ClickHouse or Pinot. Should you choose based on performance, cost, features, ecosystem or what some LinkedIn influencers are telling you? So many options, so little time and money. How do you choose? In this session, Roy will review the key business use cases requiring modern data tools and compare the top players in the warehouse, lakehouse and streaming DB markets offering solutions. He'll share a framework to help you quickly sort out the options and decide on the one that best fits your business needs for storing, processing and analyzing data. At the end of this session your boss will be amazed at how well you know the market players, or at a minimum you'll walk away ready for a fun debate.
#dataengineering

Cost containment: Scaling your data function on a budget

Lindsay Murphy - Secoda

Data teams spend a lot of time measuring and optimizing the effectiveness of other teams. Unfortunately, we're not so great at doing this for ourselves. In this talk, we will dive into a big blind spot that a lot of data teams operate with: not knowing how much they're costing their business (now and in the future). Given how easy it is to rack up expensive bills in pay-as-you-go tools across the modern data stack, this can become a big problem for data teams, very fast. We'll discuss how to build your own cost monitoring metrics and reporting (and why you should make this a priority), some of the challenges you will face along the way, and best practices for implementing cost containment into your workflows to drive cost accountability across your company.
#dataengineering

Introduction to the Streaming Plane: Bridging the Operational and Analytical Data Planes

Hubert Dulay - StarTree

In the dynamic landscape of data management, the concept of the "data divide" has emerged as a pivotal idea that highlights the crucial distinction between two essential components: the operational data plane and the analytical data plane. The bridge between these two planes has traditionally been a one-way highway from the operational to the analytical plane. The path in the opposite direction is an arduous, awkward, and costly one that includes solutions named Reverse ETL (rETL) and Data Activation. These solutions try to extract already cleansed and mastered data residing in the analytical plane from data systems that aren’t optimized for large extractions. By merging real-time and historical data, one can gain a more comprehensive and nuanced view of operations, customers, and markets. The Streaming Plane is the bridge that connects the operational and analytical aspects of data processing. It captures and processes real-time data, allowing it to flow seamlessly into the analytical phase where it's stored, analyzed, and used for insights and decision-making. The Streaming Plane empowers organizations to access and analyze a broader spectrum of data types, enabling better-informed decisions in real time and over time.
#dataengineering

90 minute workshop
Introduction to Graph Data Science in Python: Leveling Up Your Data Science Toolbelt

Sean Robinson - Graphable

This workshop will cover a variety of graph data science techniques using Python, Neo4j, and other libraries. The goal of the workshop is to serve as a springboard for attendees to identify which graph-based tools/techniques can provide novel value to existing workflows; a brief Python sketch follows the list below. Some of the techniques to be covered are:
• How to think about data as a graph and the implications that has on downstream analysis
• How to use graph algorithms at scale using both Neo4j and other pythonic libraries
• How to enhance traditional ML models with graph embeddings
• How to visualize these insights in the context of a graph for greater business intelligence
• How to integrate these techniques with your existing data science tool belt
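A brief sketch of the kind of workflow covered above, using networkx on an invented toy graph; Neo4j's Graph Data Science library exposes the same algorithms at scale, so treat this as an illustrative assumption rather than workshop material.

```python
# A minimal sketch of the "think of your data as a graph" step using networkx;
# Neo4j's Graph Data Science library exposes the same algorithms at scale.
# Assumes `pip install networkx`; the toy edge list is invented.
import networkx as nx

# Customers who co-purchased products, modeled as an undirected graph.
edges = [
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("carol", "dave"), ("dave", "erin"),
]
G = nx.Graph(edges)

# A classic graph algorithm: PageRank as a measure of node importance.
for node, score in sorted(nx.pagerank(G).items(), key=lambda kv: -kv[1]):
    print(f"{node:6s} {score:.3f}")

# Downstream, per-node scores like these (or learned graph embeddings) can be joined
# back onto tabular features to enrich a traditional ML model.
```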

LLMs for Enhanced ETL into Graph

Sean Robinson - Graphable

In this session, we will look at how LLMs provide new options for capabilities within ETL pipelines. Specifically, we will look at how to use LLMs as a named entity recognition stage to extract entities from unstructured data and use them to enhance our graph representation of the data.
Attendees will get a first-hand look at how to enhance their ETL pipelines through the power of LLMs, as well as several take-home examples to consider for their own use cases.
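As a hedged sketch of the LLM-as-NER stage described above (not the presenter's pipeline), the example below stubs the LLM call with a canned JSON response and converts the extracted entities into idempotent Cypher MERGE statements; the prompt, labels, and node keys are invented.

```python
# A minimal sketch of an LLM-as-NER stage in an ETL pipeline: ask a model to return
# entities as JSON, then turn them into Cypher MERGE statements for a graph load.
# The LLM call is stubbed with a canned response so the flow is runnable;
# labels, prompt, and node keys are illustrative.
import json

TEXT = "Acme Corp hired Jane Doe as CTO in Austin."

def llm_extract_entities(text):
    """Placeholder for a real LLM call that would send an extraction prompt such as:
    'Return a JSON list of {name, type} entities found in the text.'"""
    return json.dumps([
        {"name": "Acme Corp", "type": "Organization"},
        {"name": "Jane Doe", "type": "Person"},
        {"name": "Austin", "type": "City"},
    ])

entities = json.loads(llm_extract_entities(TEXT))

# Emit idempotent Cypher for the graph stage of the pipeline (run via the Neo4j driver).
for e in entities:
    label = e["type"]
    print(f'MERGE (:{label} {{name: "{e["name"]}"}})')
```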

Data-Driven Transformation: Building a Business Value Machine

Chris Tabb - LEIT DATA

In today's data-driven world, businesses are increasingly recognizing the power of data to drive innovation, improve efficiency, and achieve competitive advantage. Data-driven transformation (DDT), the process of fundamentally shifting an organization's culture and processes to become data-centric, is becoming an imperative for organizations of all sizes.
Definition of Business Value:
Business value encompasses a range of metrics that reflect the positive impact of data-driven insights and decisions on an organization's success. This includes metrics such as:
• Increased revenue: Data-driven insights can help businesses identify new market opportunities, optimize pricing strategies, and improve customer satisfaction, leading to increased revenue.
• Enhanced customer experience: Data analysis can help businesses better understand customer behavior, preferences, and pain points, enabling them to provide personalized and relevant experiences that drive customer loyalty and retention.
• Improved operational efficiency: Data-powered automation, decision-making, and resource allocation can streamline operations, reduce costs, and boost productivity.
• Enhanced risk management: Data analytics can identify and mitigate risks, allowing businesses to make informed decisions and protect their assets.
Key Components of Data-Driven Transformation:
A successful DDT journey involves several key components:
• Cultivating a Data-Driven Culture: Creating a company culture that embraces data literacy, analysis, and decision-making is crucial. This includes training employees in data skills, fostering open communication, and celebrating data-driven successes.
• Data Governance and Quality: Establishing robust data governance practices ensures data quality, consistency, and accessibility, laying the foundation for effective analysis.
• Data Infrastructure and Analytics: Building a scalable and reliable data infrastructure enables efficient data collection, storage, and analysis. Leveraging advanced analytics tools and techniques unlocks the insights hidden within data.
• Data-Driven Decision-Making: Integrating data-driven insights into business processes and decision-making empowers organizations to make informed choices that drive value.
• Continuous Improvement: Embracing a culture of continuous learning and improvement ensures that data-driven practices remain relevant and effective as business needs evolve.

How polyglot storage cost me a job, almost killed data modeling, (and started my quest for one data model to rule them all)

Brian Greene - NeuronSphere

A quick review of some major evolutionary steps in database and software architectures over the last few decades reveals that we have traded scale and delivery speed for data governance, clarity, quality, and long-term maintainability.
Yet the same issue emerges - once the default transactional storage engine is no longer capable of handling the analytical use cases due to engine performance, team focus, or data presence/locality - the data always ends up back in a SQL-accessible data warehouse. For all the beauty of NoSQL, we end up using SQL for insight generation far beyond anything else. Polyglot storage helped in one place, and brought an exponentially greater degree of model mismatches or outright unmodeled systems.
We can talk about "best" for analytical modeling, but we're just discussing degrees of normalization and where we put the views - none of the Dimensional vs Data Vault vs Temporal Relational vs Anchor model discussions talk about bridging the gap up the stack and data lifecycle to provide a model that can be used across both the polyglot application and analytical lifecycles. The deeper challenge is always between the state or message-based data models and those that seek to capture history and offer analytical insight, and the best we often can do is post-hoc governance by capture.
What if we could use a labeled property graph data model as the root of both our application and analytical use cases? Can we use the same data models and techniques, based on metadata-driven development from software, to create an end-to-end modeling and development experience from App and API to data warehouse and pipeline? How can this work? What's an Entity to Polyglot Storage Manager and why is that the key? We'll introduce how NeuronSphere uses these ideas to deliver a graph-based meta-programming toolkit with a set of reference implementations to cover an extensible stack of modern software and data platform use cases.

Data Product Chaos

Malcolm Hawker - Profisee

Data products have become all the rage, even while most data people struggle to define what they are. For some, the concept of a data mesh provides slightly more clarity, while being completely unattainable. For others, the benefits of data products loom large, but remain out of grasp thanks to misguided perceptions of what they are and how they should be managed. This combination of hype and confusion is creating chaos – ensuring most will fail to realize any meaningful value from data products. In this session, Malcolm Hawker will dispel the many myths around data products and share why data product management should be your primary focus. Malcolm will show a way to:
• Define data products, while acknowledging the definition is largely meaningless.
• Delineate data products in a data mesh, vs. data products, vs. data product management.
• Outline the concepts and benefits of product management to data and analytics teams.
• Provide recommendations on how to break free from the chaos.
#dataengineering #dataproducts

When linear scaling is too slow – strategies for high scale data processing

Paige Roberts

How does the TradeDesk handle 10 million ad auctions per second and generate 40 thousand reports in less than 6 hours on 15 petabytes of data? If you want to crunch all the data to train an LLM AI model, or handle real-time machine scale IoT problems for AIOps, or juggle millions of transactions per second, linear scaling is far too slow.
Is the answer a 1000-node database with a ton of memory on every node? If it was, companies like the TradeDesk would have to declare bankruptcy. Throwing more nodes or serverless executors at the problem either on cloud or on-premises is neither the only, nor even a good solution. You will rapidly hit both performance and cost limitations, providing diminishing returns.
So, how do extreme high scale databases keep up? What strategies in both open source and proprietary data processing systems leave linear scaling in the dust, without eating up corporate ROI? In this talk, you’ll learn some of the strategies that provide affordable reverse linear scaling for multiple modern databases, and which direction the future of data processing is going.

Data Products: The Value of Simplicity

Ron Itelman - Intelligence.AI

This presentation will explore the pivotal role of simplicity in the design and implementation of data products. In today’s data-driven landscape, the creation and use of data products have become complex, often inhibiting efficiency and effectiveness. We will examine how simple principles can enhance data product usability and user engagement, and streamline decision-making. Case studies where simplification led to significant improvements in performance and satisfaction will be reviewed. Attendees will be provided practical strategies for simplifying data products including focusing on core functionalities, employing intuitive interfaces, and prioritizing user-centric design. The overarching aim is to demonstrate how in data products, less can be more, and simplicity is not just an aesthetic choice but a strategic tool for achieving greater impact and value. This talk will also overview a practical approach to unifying your business, data, and code perspectives with simple, yet powerful tools to harmonize and synchronize collaboration.
#dataproducts

Geospatial Analytics With Apache Sedona In Python, R & SQL

William Lyon - Wherobots

Learn to work with geospatial data at scale. William will start with the basics including how to model and query point, line, and polygon geometries using SQL. Then he'll demonstrate how to build geo-aware applications using Python and R. Finally, William will introduce techniques for applying geospatial artificial intelligence (GeoAI) and machine learning techniques, all using Apache Sedona, the open-source geospatial analytics platform.
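For a taste of the SQL portion, here is a minimal, hedged sketch assuming Apache Sedona's Python SedonaContext entry point (the older SedonaRegistrator route also works); the inline table and geometries are invented, and the Sedona Spark jars must be available on the classpath.

```python
# A minimal sketch of spatial SQL with Apache Sedona from Python: register the ST_*
# functions on a Spark session, then run a point-in-polygon query.
# Assumes `pip install apache-sedona pyspark` with the Sedona Spark jars available;
# the data is invented.
from sedona.spark import SedonaContext

config = (SedonaContext.builder()
          .master("local[*]")
          .appName("sedona-sketch")
          .getOrCreate())
sedona = SedonaContext.create(config)  # registers Sedona's spatial SQL functions

sedona.sql("""
    SELECT name
    FROM VALUES ('downtown', 'POINT (1 1)'), ('suburb', 'POINT (9 9)') AS pts(name, wkt)
    WHERE ST_Contains(
        ST_GeomFromWKT('POLYGON ((0 0, 0 5, 5 5, 5 0, 0 0))'),  -- an illustrative city boundary
        ST_GeomFromWKT(wkt)
    )
""").show()  # only 'downtown' falls inside the polygon
```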

Accelerating Insights with Memgraph and GraphXR: A Unified Approach to Graph Database Analytics and Visualization

David Hughes - Graphable

In this presentation, we will explore the potential of Memgraph and GraphXR in transforming data analytics and visualization. Memgraph, a high-performing, in-memory graph database, is renowned for its speed and scalability, offering a powerful solution for managing complex and large-scale graph data. On the other hand, GraphXR by Kineviz stands out as a browser-based visual analytics platform, providing intuitive tools for transforming complex data into clear, compelling visualizations. It excels in accelerating time to insight, enabling deep analysis of high-dimensional, big data through innovative features like geospatial and time series data representation, and data fusion from varied sources. This presentation will demonstrate how the integration of Memgraph's robust database capabilities with GraphXR's advanced visualization tools can facilitate powerful analytics and insights, catering to both technical and business users. Attendees will gain a comprehensive understanding of how these technologies can be leveraged for complex data analysis, from ingestion to visualization, and learn about their practical applications in various industries. Join us to discover how Memgraph and GraphXR together are redefining the landscape of graph database analytics and visualization.

Advancing Graph Data Insights through a Graph Query Engine: PuppyGraph

David Hughes - Graphable

In this talk, we'll delve into the world of graph query engines, exploring how they are revolutionizing the field of data analytics and visualization. Our focus will be on PuppyGraph, an emerging graph query engine that layers atop traditional relational data stores. We'll examine how PuppyGraph exemplifies the capabilities and potential of graph query engines in enhancing the efficiency and flexibility of data analysis.
Our exploration will include a brief demonstration of PuppyGraph's integration with GraphXR, a web-based visual analytics platform from Kineviz. While our primary emphasis will be on the broader context of graph query engines, this demonstration will provide a glimpse into how these technologies can be combined to transform complex data into actionable insights.
Throughout the presentation, we'll maintain a balanced view, acknowledging the developmental stage of PuppyGraph and graph query engines at large, while also highlighting their current strengths and future potential. We'll explore realistic applications across various sectors, considering both present capabilities and anticipated advancements. Attendees will gain a broader understanding of the role of graph query engines like PuppyGraph in existing data analytics frameworks, discussing their scalability and transformative impact across industries.
Join us to discover how technologies like PuppyGraph are charting a new course in data analytics and visualization, offering insights into the dynamic and rapidly evolving landscape of graph query engines.

Querying a Graph Through an LLM

Marko Budiselic - Memgraph / William Firth - Microchip Technology Inc.

On one hand, Large Language Models are powerful tools for extracting knowledge from existing text datasets. However, many issues are also involved because of how LLMs work under the hood: bias, misinformation, scalability, and cost are just a few. On the other hand, graphs can accurately encode a lot of unstructured data, providing a playground for running complex analytics on top, but that's not enough on its own because a knowledgeable human has to figure out the exact query, be able to interpret the results, and so on.
Luckily, humans are good at merging existing tools, solving problems and providing new value. Combining LLMs and graphs creates a synergy, allowing a more comprehensive understanding of any particular domain and dataset. Leveraging the strengths of AI powered by LLMs and graph-based techniques such as graph modeling and graph algorithms accelerates and enhances human ability to understand complex datasets and extract valuable knowledge. Let's examine how querying LLMs enhanced by graphs opens up many new opportunities. As always, in the process, we'll discover new issues and learn where to apply the new capabilities.
To make everything tangible, Microchip will share a real-world implementation of using LLMs to interact with a graph database, inspired by the desire of a Microchip executive to quickly answer the question: "Why is this customer sales order late?" We'll walk through how the idea came about, how we implemented it, and how a solution like this truly benefits an enterprise organization.

Computational Trinitarianism

Ryan Wisnesky - Conexus

In this session, Ryan Wisnesky will show how computation has three aspects: logic, type theory, and algebra. He will describe how type theory can be used to formalize almost anything, how logic informs 'proof engineering', and how algebra, and specifically category theory, allows us to relate our type-theoretic constructions to each other. Ryan will also describe applications of this way of thinking, including how to build provably correct data transformation systems.

Exploring the Apache Iceberg Ecosystem

Alex Merced - Dremio

This talk provides a concise overview of the Apache Iceberg ecosystem, a pivotal component in the evolution of open lakehouse architectures. Alex Merced will delve into the key aspects:
• Querying Tools: Learn about tools for querying Apache Iceberg tables, enhancing data querying within the open lakehouse framework.
• Cataloging Vendors: Discover vendors offering solutions for cataloging and managing Apache Iceberg tables, crucial for maintaining metadata and data organization.
• Unique Use Cases: Explore innovative uses of Apache Iceberg tables in various products and technologies, showcasing their adaptability.
• Open-Source Projects: Uncover valuable open-source projects that complement Apache Iceberg, expanding its functionality and adoption.
Whether you're a data engineer, analyst, or architect, this session will provide the background to effectively utilize Apache Iceberg in your lakehouse journey.
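As a hedged illustration of querying Apache Iceberg tables (not taken from the talk), the sketch below follows the pattern in the Iceberg quickstart documentation: configure a local Hadoop catalog in PySpark, write a table, and inspect its snapshot history. It assumes the matching iceberg-spark-runtime package is on the Spark classpath; names and paths are invented.

```python
# A minimal sketch of creating and inspecting an Apache Iceberg table from PySpark.
# Assumes `pip install pyspark` and the matching iceberg-spark-runtime jar on the
# classpath; the catalog name, warehouse path, and table are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("iceberg-sketch")
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.local.type", "hadoop")
         .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, msg STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'hello'), (2, 'world')")

# Every commit produces a snapshot; the metadata table below exposes the table's history,
# which is what enables time travel (e.g. SELECT * FROM local.db.events VERSION AS OF <snapshot_id>).
spark.sql("SELECT snapshot_id, committed_at, operation FROM local.db.events.snapshots").show()
```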

The Role of Knowledge Graphs for LLM accuracy in the Enterprise

Juan Sequeda

A popular application of LLMs for the enterprise is question answering systems, aka chat with the data. Being able to chat with data residing in enterprise SQL databases provides tremendous potential for transforming the way data-driven decision making is executed. But challenges persist, notably in the form of hallucinations. Knowledge Graphs (KGs) have been identified as a promising solution to fill the business context gaps in order to reduce hallucinations.
In this talk, we will present our work on understanding the accuracy of LLM-powered question answering systems with respect to enterprise questions and enterprise SQL databases, and the role knowledge graphs play in improving that accuracy. We will go over our current results, in which, using GPT-4 and zero-shot prompting, the overall Knowledge Graph accuracy was 3x the SQL accuracy. Furthermore, we will provide practical approaches that can be readily integrated into existing Retrieval Augmented Generation (RAG) architectures.
Our work concludes that investing in Knowledge Graphs provides higher accuracy for LLM-powered question-answering systems. Ultimately, to succeed in this AI world, enterprises must treat business context and semantics as first-class citizens.

The Ins & Outs of Data Lakehouse Versioning at the File, Table, and Catalog Level

Alex Merced - Dremio

Data lakehouse versioning is a critical technique for ensuring the accuracy and reliability of data in a data lakehouse. It allows you to track changes to data over time, which can be helpful for troubleshooting problems, auditing data, and reproducing experiments. This presentation will explore the ins and outs of data lakehouse versioning. We will discuss the different levels of versioning, including catalog, file, and table-level versioning. We will also discuss the benefits of data lakehouse versioning and the pros and cons of each type of versioning. By the end of this presentation, you will have a better understanding of data lakehouse versioning and how it can be used to improve the accuracy and reliability of your data.

Low-Code Data Analysis with the KNIME Analytics Platform

Scott Fincher - KNIME

Curious about how a low-code tool could help you make better decisions, faster? In this talk, we’ll introduce KNIME Analytics Platform: a free, open-source tool with which you can build intuitive visual workflows to solve any data problem. Using an example dataset, we’ll cover the entire data science life cycle from data access, data wrangling & transformation, standard visualization tasks, and training and validation of predictive models. At the end, we will briefly provide an overview how more sophisticated AI/ML methods (including LLMs) can be brought to bear within the same environment.

Enhancing low-code with graph, vector and AI

Tom Zeppenfeldt - Graphileon

In an era where the speed of information evolution accelerates and the ability to derive insights from data becomes a linchpin for organizational viability, the prominence of low-code platforms has surged. The foundational architecture of software, however, remains constant, consisting of interconnected functions that activate each other, some behind the scenes and others fronting user interfaces for data visualization and input collection.
In his presentation, Tom Zeppenfeldt will share insights into the innovative integration of graph databases, vector indexes, and generative AI. This integration has enabled the Graphileon team to develop a versatile platform that simplifies complex programming demands, catering to sectors as varied as pharmaceuticals, food services, fintech, architectural design, engineering, and fraud detection. Graphileon's solutions, with their seamless API connectors, meld into existing IT frameworks with remarkable ease, offering flexibility and the capacity for swift enhancements. Their platform stands as a beacon of innovation, redefining the creation of data-centric solutions with minimal coding requirements. Join us to explore how Graphileon is charting a new course in the realm of data-driven organizational tools.

Building a Resilient Food Supply Chain with Neo4J and Graphileon

Vish Puttagunta - Power Central

Vish Puttagunta, who co-owns a Food Manufacturing company, encountered first-hand the intricate challenges of timely product delivery during the COVID-19 pandemic. The notification of a delay in a seemingly simple ingredient, such as Sugar, posed a considerable hurdle for his Sales, Procurement, and Production teams. Visualizing the impact on Production Orders and Sales Orders became essential, requiring swift and informed actions.
Existing planning modules in the ERP, including Material Requisition Planning (MRP) and Production Planning, proved cumbersome to set up and lacked the flexibility to handle exceptions effectively. The complexity of multi-level Bill-Of-Materials, a standard in Food Manufacturing, added another layer of difficulty. For instance, understanding how an ingredient like Sugar could cascade through various stages, eventually contributing to different SKUs, rendered traditional SQL queries challenging to decipher.
This session will delve into the development of an easy-to-use solution for Planning staff by Power Central, leveraging Graphileon and Neo4J. The aim is to streamline the planning process, offering a more intuitive and efficient approach for addressing the dynamic challenges faced in the food manufacturing industry.

90 minute Deep Dive
LLMs In a Low-Code Environment – Is it Possible?

Satoru Hayasaka - KNIME

Almost everybody is familiar by now with Large Language Models (LLMs) and how powerful they can be. What if there was a way to train and customize them without writing a single line of code? In this workshop, we’ll cover some technical background on LLMs, and dive into methods for training our own local models using a custom knowledge base. The tool we’ll use is KNIME Analytics Platform: a free, open-source package for building intuitive visual workflows. Without any coding required, we’ll show you how to:
• create a vector store from a corpus of documents and attached metadata
• prompt engineer your custom chat model via the OpenAI integration to let AI answer specific questions on the provided documents
• build a chatbot data app by adding a user interface to control a KNIME workflow
• deploy the data app to make it available to authenticated users on a web browser
We’ll also cover other KNIME AI integrations including GPT4All, Hugging Face, Chroma, FAISS and more.

Discover Insights in a Large Multi-Decade Life Sciences Database Through Data Visualization and Analysis

Janet Six - Tom Sawyer Software

Effective analysis of legacy databases requires overcoming several obstacles, including understanding the heterogeneous data stored within, evaluating the quality and health of the data, and discovering key insights. All of these tasks must be accomplished without being limited by the technology that existed at the time the database was designed and initiated. Furthermore, results must be presented to stakeholders and decision makers in a compelling and understandable manner. In this session, we will discuss graph technology methods for understanding the data within a large multi-decade life sciences relational database, evaluating data quality and health, and applying data visualization and analysis for both discovery and communication of key results. These methods also support effective navigation of the data.