The Data Day Texas 2025 Sessions
Improve your RAG pipelines with semantic re-ranking
Susan Shu Chang - Elastic
Hallucinations in Generative AI can undermine trust and accuracy. Retrieval Augmented Generation (RAG) has emerged in the last few years as a proven solution - in short, RAG works by retrieving relevant ground truths from a knowledge base to pass to the LLM along with a prompt. However, its effectiveness hinges on retrieving the right information. If the retrieved content is irrelevant, we're back to square one. In this talk, we'll explore re-ranking, a technique with a long track record in recommender systems, to improve the relevance of retrieved data. Join this talk to learn how to make RAG outputs more accurate and improve your AI applications.
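The core idea is simple enough to sketch: a first-stage retriever (vector or keyword search) returns candidate passages cheaply, then a cross-encoder re-scores each (query, passage) pair jointly before the best few reach the LLM. A minimal illustration with the sentence-transformers library - the model name and candidate passages are placeholder assumptions, not Elastic's implementation:

from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # The cross-encoder reads query and passage together, so it is far more
    # precise than the bi-encoder used for first-stage retrieval, at the
    # cost of scoring every candidate pair.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]

# Candidates would come from your first-stage search of the knowledge base.
candidates = [
    "API keys can be rotated from the security console.",
    "Our offices are closed on public holidays.",
    "Key rotation invalidates the previous credential immediately.",
]
context = rerank("How do I rotate an API key?", candidates, top_k=2)
prompt = "Answer using only this context:\n" + "\n".join(context)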
Data Modeling in the Age of AI
Keith Belanger - SqlDBM
While AI can indeed be a tremendous asset, accelerating many of the routine tasks involved in data modeling, it cannot replace the strategic thinking and deep business understanding that a skilled data professional brings to the table. The art of interacting with stakeholders, understanding their unique needs, and translating those needs into a data model is something that, at least for now, AI cannot fully replicate. While the core concepts and patterns of data modeling have remained relatively stable over the years, the physical data platforms, formats, and volumes have changed dramatically. We now operate in a world where data is generated at an unprecedented scale and in a variety of formats, from structured relational data to unstructured text and multimedia. This shift has necessitated a corresponding evolution in how we approach data modeling. AI can and should be leveraged in your data modeling practice, but it should be seen as a tool that enhances human capability, not a replacement for it.
In this session, Keith Belanger will discuss how to combine the speed and efficiency of AI with the insight and experience of seasoned data professionals - in a hybrid approach that will allow your organization to not only keep pace with the demands of modern data environments, but also to innovate and lead.
From Office Cubicles to Independent Success: How to Create a Career and Thrive as a Freelance Data Scientist
Clair Sullivan - Clair Sullivan & Associates
In a world where corporate stability is increasingly uncertain, freelancing offers data scientists a powerful way to take control of their careers and insulate themselves from layoffs. This talk will empower you to take that leap with confidence, showing you how to build a freelance practice that not only sustains but thrives, even in turbulent times. Drawing on my own journey from corporate employee to independent freelancer, I’ll share the critical steps to ensure financial stability and client consistency, from setting up the right pricing models and navigating the business logistics of company formation to developing a network that leads directly to opportunities. You’ll learn how to position yourself to avoid being just another name in a crowded job application list, instead connecting directly with clients who value your expertise and have a real need for your work.
We’ll also tackle mindset shifts—breaking away from the corporate myth of job security and embracing the control that freelancing brings over your work and income. This talk goes beyond practical tips, inspiring you to see freelancing as a viable, empowering alternative to traditional employment, enabling you to focus on work that truly excites you while offering flexibility and peace of mind. If you’re ready to take charge of your future, this talk will guide you through every step, turning freelancing from a risky idea into a secure, fulfilling reality.
Moving Beyond Text-to-SQL: Reliable Database Access through LLM Tooling
Patrick McFadin - DataStax
In the quickly evolving landscape of data engineering, traditional text-to-SQL methods often struggle with non-deterministic outputs and potential errors. In this session, Patrick will explore an alternative approach: leveraging Large Language Model (LLM) tooling for direct and reliable database interactions. By integrating LLMs with databases, data engineers can achieve efficient data retrieval and manipulation without relying on intermediate SQL generation. This method improves reliability and performance and simplifies complex data workflows. Patrick will then show some of the work he has done on LangChain and LlamaIndex and the insights he gained along the way. Patrick will also review the current state of LLMs and explain why perfecting this methodology might be a needed survival technique.
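One framework-free sketch of this pattern, with table and function names invented for illustration: rather than letting the model emit raw SQL, you expose a small catalog of vetted, parameterized queries as tools, and the LLM only chooses a tool and supplies arguments. LangChain and LlamaIndex provide first-class bindings for exactly this kind of tool:

# Instead of asking the LLM to write SQL, expose a small set of vetted,
# parameterized operations as "tools" the model can call by name.
import sqlite3

def get_orders_by_customer(customer_id: int, limit: int = 10) -> list[tuple]:
    # The SQL is fixed and parameterized, so the model can never inject
    # arbitrary clauses; it only supplies validated arguments.
    conn = sqlite3.connect("shop.db")  # hypothetical database
    rows = conn.execute(
        "SELECT id, total, created_at FROM orders "
        "WHERE customer_id = ? ORDER BY created_at DESC LIMIT ?",
        (customer_id, limit),
    ).fetchall()
    conn.close()
    return rows

TOOLS = {"get_orders_by_customer": get_orders_by_customer}

def dispatch(tool_call: dict):
    # tool_call is the JSON a function-calling LLM returns, e.g.
    # {"name": "get_orders_by_customer", "arguments": {"customer_id": 42}}
    return TOOLS[tool_call["name"]](**tool_call["arguments"])

The output is deterministic for a given tool call, which is precisely what raw text-to-SQL cannot guarantee.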
Adopting AI in a Large, Complex Organization: Aspiration vs Reality
Hala Nelson - James Madison University
Because Data and AI have historically evolved within relatively separate communities, their capabilities, benefits, and adoption strategies are valued differently by different work teams and investment decision makers. Data emerged over the last decade as the new oil, and AI has now succeeded it as the combustion engine, igniting data to power innovation. Many of us are currently attempting to harness the power of AI technologies at large complex organizations, or small ones for that matter. Initiatives span a wide range of interests within an organization: AI specialists, data engineers, IT departments, strategists, ethicists, executives, and the people on the ground. How does an implementation team guide a 32,000-person institution to optimally adopt AI within the very short attention span of an executive who wants an immediate return on investment? Is striking a deal with Microsoft Copilot or OpenAI enough? Can resources be justified for the ambitious goal of creating a digital twin of the processes and systems of an entire organization, where AI can be applied to drive efficiencies and improvements? Are the required technologies, expertise, and resources available? In this talk I will present my experience of what it takes to move from aspiration to an implementable reality: from math to data to strategy to people to everything in between. We'll also try to answer the question on everyone's mind: will we eventually succeed, or will all our efforts end up in the wasteland of failed projects, efforts, funding, and time?
#ai
History and Future of Iceberg REST Catalogs
Lisa Cao - Datastrato
While Iceberg primarily concentrates on its role as an open data format for lakehouse implementation, it relies heavily on its catalog for tracking tables and allowing external tools to interface with the metadata. In Iceberg 0.14.0, the community introduced the REST Open API Specification, and there is an instructive history behind why it was developed and why the Iceberg community decided to provide a spec rather than its own service. In 2024 especially, we've seen many third-party catalog service providers pop up, each with its own unique flavour - but realistically, what outcome can we expect from this widespread adoption? Together, we'll review not only the history of the REST Catalog Spec, but the future of the many offshoot services it has sparked. Please note this talk is not a comparison of the catalog service providers (we'll save that for the data discussions!) but rather the rationale behind the Iceberg community providing a spec, and why everyone is hedging their bets on Iceberg as the next standard.
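The practical consequence of a spec rather than a service is that any compliant client can point at any compliant server. A minimal PyIceberg sketch - the endpoint, warehouse, and table identifier below are placeholders, and the same code should work against Polaris, Gravitino's Iceberg REST service, or any other spec-compliant catalog:

from pyiceberg.catalog import load_catalog

# Any REST-spec-compliant catalog can sit behind this URI; the client
# code does not change when you swap providers.
catalog = load_catalog(
    "my_catalog",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",           # placeholder endpoint
        "warehouse": "s3://my-bucket/warehouse",  # placeholder warehouse
    },
)

print(catalog.list_namespaces())
table = catalog.load_table("analytics.events")    # placeholder identifier
print(table.schema())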
All Your Base Are Belong To Us: Adversarial Attack and Defense
Michelle Yi - Women in Data
The increasing deployment of generative AI models in production environments introduces new security challenges, particularly in the realm of adversarial attacks. While visually or textually subtle, these attacks can manipulate generative models, leading to harmful consequences such as medical misdiagnoses from tampered images or the spread of misinformation through compromised chatbots. This talk examines the vulnerabilities of generative models in production settings and explores potential defenses against adversarial attacks. Drawing on insights from attacks against Vision-Language Pre-training (VLP) models, which are a key component in text-to-image and text-to-video models, this talk highlights the importance of understanding cross-modal interactions and leveraging diverse data for crafting robust defenses.
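The attacks discussed in the talk target multimodal VLP models, but the basic mechanics are easiest to see in the classic single-model case. A minimal sketch of the Fast Gradient Sign Method (FGSM) in PyTorch, with the classifier and inputs assumed:

import torch
import torch.nn.functional as F

def fgsm_attack(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
                eps: float = 8 / 255) -> torch.Tensor:
    # Perturb each pixel by +/- eps in whichever direction increases the
    # loss; the change is visually negligible but can flip the prediction.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

# x_adv = fgsm_attack(classifier, images, labels)
# A common defense is adversarial training: include x_adv in the training
# batch so the model learns to resist the perturbation.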
We Are All Librarians: Systems for Organizing in the Age of AI
Jessica Talisman - Adobe
Libraries have been used to collect, organize, and make human knowledge available for over 3,000 years. Over the years, practices such as collection development, curation, cataloging, serializing, recording and archiving have evolved as technology has advanced, keeping pace with the complexities of organizing for human and machine consumption. These information systems have proven to be very useful for AI, which benefits from clean, semantically structured data. What does this mean for the AI technologist? In the age of AI, we are all librarians, tasked with curating and making sense of vast amounts of data and information. As we navigate this new landscape, we would do well to learn from the expertise of librarians, who have spent centuries perfecting the art of organizing and making sense of the world's knowledge.
Empowering Change: Building and Sustaining a Data Culture from the Ground Up
Clair Sullivan - Clair Sullivan and Associates
This is a follow-up to Clair's session on Data Culture at Data Day Texas 2024.
Many organizations struggle to create a data culture that drives real business value, often facing issues such as misalignment between teams, unclear objectives, and poorly managed data practices. These challenges typically stem from simple yet correctable mistakes that, once addressed, can unlock significant potential. In this session, we’ll focus on practical steps that anyone in an organization can take, whether you are an individual contributor or a senior director, to align your teams around data-driven outcomes, improve data governance practices, and enhance collaboration across departments. You’ll learn how to influence data-driven decision-making processes, advocate for better data practices, and create an environment where data insights lead to measurable improvements. We will discuss practical approaches to champion data initiatives from within, regardless of your position, and drive meaningful change by influencing processes, communication, and shared goals. Additionally, we will explore how to build momentum for data projects by showcasing early wins and creating a feedback loop that promotes continuous improvement.
In this session, you will learn how to identify and address gaps in your team’s data culture, with a focus on driving measurable business outcomes. We’ll explore strategies to align business objectives with data insights at the team and departmental levels, making sure that data projects are closely tied to real business needs. You’ll discover ways to foster collaboration between technical and non-technical teams, ensuring that communication is clear and expectations are aligned. We will also cover how to calculate and demonstrate ROI from data initiatives, helping you build a strong case for continued investment in data-driven solutions. Ultimately, you’ll leave with practical approaches to championing data initiatives and creating a culture of continuous improvement, even without direct involvement from executive leadership.
Escape the Data & AI Death Cycle, Enter the Data & AI Product Mindset
Anne-Claire Baschet - Mirakl
Yoann Benoit - Hymaïa
A transformation is underway: soon, every digital product will encompass Data & AI capabilities.
However, we must recognize that Data and Product teams have distinct cultures and origins. Data teams possess an array of tools and technical expertise, yet they often struggle with quantifying the value they deliver. They frequently miss the mark in addressing the right problems that align with customer needs or in collaborating with Business-Product-Engineering teams.
This is where adopting a Product Mindset becomes paramount. Closing the divide between the Data and Product communities is imperative, as both groups must collaborate on a daily basis to create value for users and support businesses in reaching their goals.
In this talk, you will get insights into:
• Identifying and overcoming the most common traps that Data Teams fall into when delivering Data & AI initiatives
• Crafting impactful Data & AI Products that solve the right problems
• Scaling a Data & AI Product Culture throughout the whole organization and defining a Data & AI Product Strategy
#ai
The Outcomes Economy: A Technical Introduction To AI Agentic Systems, Multi-Simulations, & Ontologies
Vin Vashishta - V Squared
Linear has given way to exponential. Digital apps are tools for agents. Data models complex and dynamical systems. The goal is models training models and building their own tools, but nothing is designed to support that today. AI platforms must follow new architectural tenets.
The AI platform roadmap must be designed to accept the realities of where businesses are today: low data maturity and resistance to change. Businesses are in a state of continuous transformation or managed decline. As Sam Altman said, “Stasis is a myth,” which means startups and SMBs have a new competitive advantage.
The speaker will take a deep dive into the three primary architectural components of AI platforms. He’ll explain how they are constructed using real-world case studies from emerging AI platforms. This talk will touch on complex and dynamical systems modeling and where ontologies fit. The talk wraps up with a pragmatic approach to aligning technology with the business and its customers.
The human side of data: Using technical storytelling to drive action
Annie Nelson - GitLab / Annie's Analytics
Join Annie for a session on learning the art of technical storytelling. Drawing from her background in psychology and experience as a data analyst, Annie will share strategies that go beyond just communicating data - how to influence stakeholders from the start and throughout a project’s lifecycle. Whether you're at the kickoff of a project, guiding decisions along the way, or presenting final results, the way you tell the story can have a big impact on its success.
In this session, Annie will explore a practical framework for crafting technical stories that not only explain data but also build trust, influence decision-making, and inspire action at every stage of the process. She will also provide real-world examples of how to tailor your message for both technical teams and business leaders, so you can engage all of your stakeholders effectively. You’ll leave with actionable techniques that help you drive results by tapping into an overlooked tool in data: emotion.
How to Start Investing in Semantics and Knowledge: A Practical Guide
Juan Sequeda - Data.World
What do enterprises lose by not investing in semantics and knowledge? The ability to reuse data effectively, due to the lack of context and understanding of what the data means. How is AI going to use data if we don't even understand it? This is why we waste so much time and money, and why we lack strategic focus.
Many practitioners are already doing critical data and knowledge work, but it’s often overlooked and treated as second-class. In this talk, I will focus on practical knowledge engineering steps to start investing in semantics and knowledge and demonstrate how to elevate this data and knowledge work as a first-class citizen.
We’ll explore four key areas: communication, culture, methodology and technology. The goal is for attendees to leave with concrete steps on how to start investing in semantics and knowledge today, empowering them to be efficient and resilient.
Fundamentals of DataOps
Lisa Cao - Datastrato
While building pipeline after pipeline, we might wonder: what comes next? Automation and Data Quality, of course! Organizations today are facing complex challenges in the end-to-end deployment of data applications, from initial development to operational maintenance. This process requires seamless integration of CI/CD practices, containerization, data infrastructure, MLOps, and security measures. This session offers strategies and a complete beginner's roadmap for teams implementing their own DataOps infrastructure from scratch, empowering developers, architects, and decision-makers to effectively leverage open-source tools and frameworks for streamlined, secure, and scalable ML application deployments.
Deployment at scale of an AI system based on custom LLMs: technical challenges and architecture
Arthur Delaitre - Mirakl
Mirakl is transforming seller catalog onboarding through the deployment of a scalable AI system based on custom fine-tuned Large Language Models (LLMs) and state-of-the-art multimodal models. Traditional onboarding processes can take up to two months; the new system reduces this to mere hours, efficiently handling millions of products.
This presentation will delve into the technical challenges and architectural solutions involved in deploying custom LLMs at scale. Key topics include:
• Infrastructure Deployment: Building scalable environments for LLM inference.
• Model Fine-Tuning: Customizing LLMs and improving output quality by reducing hallucinations and increasing consistency.
• Micro-Service Architecture: Orchestrating model services and hosting for efficient operation, including the synergies of systems that combine LLMs with other ML models.
• Layered Approach: Selecting optimal results while minimizing computational costs.
Arthur will explore how these technologies are integrated into a production-ready system, discussing the strategies used to overcome scaling challenges and ensure high performance. Attendees will gain insights into deploying advanced AI systems in real-world environments, optimizing large-scale inference, and setting new industry standards in marketplace technology.
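As a toy sketch of the layered approach described above - all model names, the client, and the checks are hypothetical stand-ins, not Mirakl's system - the pattern is a cascade that escalates from the cheapest capable model only when validation fails:

def call_model(model: str, text: str) -> dict:
    # Stand-in for a real inference client (e.g. an internal HTTP endpoint).
    return {"title": text.strip().title(), "model": model}

def passes_checks(result: dict) -> bool:
    # Cheap deterministic validation: required fields present and non-empty.
    # In practice this is where schema and consistency checks would live.
    return bool(result.get("title"))

def extract_attributes(product_text: str) -> dict:
    # Walk the layers from cheapest to most expensive; stop at the first
    # output that survives validation, so most traffic never pays for the
    # largest model.
    for model in ("small-finetuned", "large-finetuned", "frontier-multimodal"):
        result = call_model(model, product_text)
        if passes_checks(result):
            return result
    return {"needs_human_review": True, "input": product_text}

print(extract_attributes("wireless noise-cancelling headphones"))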
Modeling in Graph Databases
Max De Marzi - maxdemarzi.com
Modeling is a word with many meanings; we will cover two. First, how to structure your graph data to take advantage of the mechanical sympathy of most graph databases. You'll learn how to use relationship types to partition your data to speed up your queries. Second, how to model your business domain as a graph. You'll learn how to relate data so it is easier to find clusters and create frequently traveled paths. We'll finish with the worst-case optimal joins (WCOJs) and view capabilities of newer graph databases and how they affect both types of modeling.
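A tiny illustration of the first kind of modeling: encoding a frequently filtered property into the relationship type itself, so the traversal engine only touches the edges a query needs. The movie-ratings schema and credentials below are invented for the sketch:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # placeholder

# Generic model: one relationship type, filter on a property. The engine
# must expand and inspect every RATED relationship for the user.
SLOW = """
MATCH (u:User {id: $id})-[r:RATED]->(m:Movie)
WHERE r.stars = 5
RETURN m.title
"""

# Partitioned model: the property is baked into the relationship type, so
# the traversal follows only the exact edges it needs.
FAST = """
MATCH (u:User {id: $id})-[:RATED_5]->(m:Movie)
RETURN m.title
"""

with driver.session() as session:
    titles = [record["m.title"] for record in session.run(FAST, id=42)]

The trade-off is write-side complexity: updating a rating now means deleting one relationship type and creating another.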
#graphday #datamodeling
Automating Financial Reconciliation with Linear Programming and Optimization
(90-minute deep-dive session)
Bethany Lyons - Assured Insights
Some of the gnarliest data quality problems arise when relationships that exist in the world are absent from the data. Suppose you've raised three payment requests, for $1,000, $2,000, and $3,000. Then $6,000 hits your bank account. Those three invoices should be linked to the $6,000 payment, but many systems fail to capture those links. As a result, you have to infer the relationships after the fact through a series of computational math techniques. This session will take you through real-world examples and challenges of such a solution, with broad applications across finance and financial services.
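The invoice example reduces to an exact subset-sum: choose the set of open invoices whose amounts total the observed payment. A minimal sketch with the PuLP MILP library, using invented amounts; real reconciliation adds tolerances, partial payments, and many payments competing for the same invoices:

from pulp import LpMaximize, LpProblem, LpVariable, lpSum

invoices = {"INV1": 1000, "INV2": 2000, "INV3": 3000, "INV4": 2500}
payment = 6000

prob = LpProblem("reconcile", LpMaximize)
# One binary decision per invoice: is it part of this payment?
x = {k: LpVariable(k, cat="Binary") for k in invoices}

# Objective: clear as many invoices as possible with this payment.
prob += lpSum(x.values())
# The selected invoices must sum exactly to the observed payment amount.
prob += lpSum(amount * x[k] for k, amount in invoices.items()) == payment

prob.solve()
matched = [k for k in invoices if x[k].value() > 0.5]
print(matched)  # ['INV1', 'INV2', 'INV3']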
GraphBI: Expanding Analytics to All Data Through the Combination of GenAI, Graph, and Visual Analytics
Weidong Yang - Kineviz
Existing BI and big data solutions depend largely on structured data, which makes up only about 20% of all available information, leaving the vast majority untapped. In this talk, we introduce GraphBI, which aims to address this challenge by combining GenAI, graph technology, and visual analytics to unlock the full potential of enterprise data.
Recent technologies like RAG (Retrieval-Augmented Generation) and GraphRAG leverage GenAI for tasks such as summarization and Q&A, but they often function as black boxes, making verification challenging. In contrast, GraphBI uses GenAI for data pre-processing—converting unstructured data into a graph-based format—enabling a transparent, step-by-step analytics process that ensures reliability.
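The pre-processing step is the part that lends itself to a quick sketch: ask a model for explicit triples and load them into the graph, so every downstream step can be inspected. A minimal version using the OpenAI client - the model name is a placeholder and this is an illustration of the general idea, not Kineviz's implementation:

import json
from openai import OpenAI

PROMPT = (
    "Extract entities and relations from the text below. Respond with JSON "
    'only, in the form {"triples": [["subject", "relation", "object"], ...]}.'
    "\n\nText: "
)

def text_to_triples(text: str) -> list[tuple[str, str, str]]:
    # The triples come back as ordinary data, so they can be reviewed,
    # corrected, and versioned before they ever reach the graph.
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable chat model works
        messages=[{"role": "user", "content": PROMPT + text}],
        response_format={"type": "json_object"},
    )
    payload = json.loads(resp.choices[0].message.content)
    return [tuple(t) for t in payload["triples"]]

print(text_to_triples("Acme Corp acquired Widgets Ltd in 2023."))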
We will walk through the GraphBI workflow, exploring best practices and challenges in each step of the process: managing both structured and unstructured data, data pre-processing with GenAI, iterative analytics using a BI-focused graph grammar, and final insight presentation. This approach uniquely surfaces business insights by effectively incorporating all types of data.
What Superintelligence Will Look Like
Superintelligence will be able to invent new knowledge and understand new situations. Current generative methods, such as the LLMs in chat models like ChatGPT, are insufficient for superintelligence because they only interpolate the human knowledge in their training data. We’ve seen flashes of superintelligence in limited and controlled domains such as chess, go, StarCraft II, protein folding, and even math problems. In more general domains, superintelligence will overcome the hallucination issues that plague current LLMs—an obstacle that even DARPA has identified as a barrier to adoption for critical applications.
Superintelligence will invent new cures for rare diseases and new scientific theories, and it will create art and entertainment personalized to your specific taste. Superintelligence will also understand novel situations, enabling it to serve as a customized teacher, explaining new concepts in terms of those that you already know. It will understand the requirements for tax and regulatory compliance and will be able to guide you through them. It may even be able to install CUDA on the first try.
To get there, we must overcome the limitations of LLMs. Current neural network algorithms are too compute-, memory-, and power-hungry. This inefficiency exists because they lack the ability to draw clean distinctions using abstractions. LLMs and associated methods also struggle with lifelong learning because they lack compiled knowledge and the pointers to reach it. These clean distinctions and pointers to compiled knowledge are benefits that come from symbolic systems. In the past, symbolic systems failed because they were unable to adapt to unexpected situations, but current LLMs can be used to dynamically write the symbolic systems on the fly to fit any new situation. This talk will cover what AI systems smarter than humans will look like, what it will take to combine neural networks and symbolic methods to achieve it, and the potential societal effects of this superintelligence.
Unleashing the Power of Multimodal GraphRAG: Integrating Image Features for Deeper Insights
GraphRAG has proven to be a powerful tool across various use cases, enhancing retrieval accuracy, language model integration, and delivering deeper insights to users. However, a critical dimension remains underexplored: the integration of visual data. How can images—so rich in contextual and relational information—be seamlessly incorporated to further augment the power of GraphRAG?
In this presentation, we introduce Multimodal GraphRAG, an innovative framework that brings image data to the forefront of graph-based reasoning and retrieval. By extracting meaningful objects and features from images, and linking them with text-based semantics, Multimodal GraphRAG unlocks new pathways for surfacing insights. From images embedded in documents to collections of related visuals, we’ll demonstrate how this approach enables more comprehensive understanding, amplifying both the depth and accuracy of insights.
WTF Is A Triple? My Journey From Neo4j To Dgraph
Will Lyon - Hypermode
As a longtime Neo4j user, I recently started working at Hypermode, the maintainers of the Dgraph graph database. As part of my work at Hypermode, I've become a user of Dgraph and have gone through the process of becoming an expert in a second graph database. In this talk, I'll highlight some of my learnings from comparing and contrasting these two graph databases, and give an overview of how Dgraph fits into what we're building at Hypermode: a fullstack framework for building intelligent applications alongside technologies like AI models, WebAssembly, and GraphQL.
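To answer the title question directly: a triple is the (subject, predicate, object) unit that Dgraph ingests, where even attributes are edges, while Neo4j attaches properties to nodes and relationships. A side-by-side sketch of the same two-person graph, using Python strings purely as containers for each database's syntax:

# Dgraph mutations take RDF-style N-Quad triples. Note that the attribute
# "name" is itself a triple, exactly like the follows edge.
DGRAPH_NQUADS = """
_:alice <name> "Alice" .
_:alice <follows> _:bob .
_:bob <name> "Bob" .
"""

# Neo4j expresses the same data as a property-graph pattern in Cypher,
# with the name stored as a property on each node rather than as an edge.
NEO4J_CYPHER = """
CREATE (a:Person {name: 'Alice'})-[:FOLLOWS]->(b:Person {name: 'Bob'})
"""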
Data Governance – It’s Time to Start Over
Malcolm Hawker - Profisee
After 20 years of trying, most Data Governance programs have failed to become anything more than a compliance check box. To realize more meaningful and impactful benefits from data governance, it’s time to start over.
If data governance is needed to realize the full value of data, then drastic changes are needed to how we approach the governance function. Companies seeking to use data governance as a lever of business transformation must:
- Transition from a rules-based system to an exception-based system
- Govern operational uses of data differently than analytical uses of data
- Quantify the value of data governance across both operational and analytical uses
- Integrate incentives to motivate governance behaviors and offset costs of governance
- Jettison outdated frameworks and approaches, including the misguided idea of data ‘ownership’.
Join Malcolm Hawker, the CDO of Profisee Software, as he shares his vision for a new approach to data governance that focuses on incentives and business value, not controls.
Optimisation Platforms for Energy Trading
Adam Sroka - Hypercube
As the energy sector transitions to new technologies and hardware, the data requirements are undergoing significant changes. At the same time, the markets in which energy systems operate are also evolving - giving traders and energy teams a vastly more complex set of options against which they need to make decisions.
The move to real-time data for battery energy storage system (BESS) operation and the addition of multiple markets make revenue optimisation for storage assets intractable for human operation alone.
In this talk, Adam Sroka will walk through one solution deployed at a leading BESS trading company in the UK that aligned probabilistic forecasting and stochastic methodologies with a linear optimisation engine to determine the best markets, prices, and trades for any given portfolio of mixed energy and storage assets.
Adam will walk through an architecture diagram for a system that integrates real-time, near real-time, and slow-moving data with AI-driven forecasts and the complexities of optimisation management.
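To give a flavour of the linear core, here is a deliberately simplified dispatch model: a single market, deterministic prices, and invented numbers. The production system described above layers probabilistic forecasts, multiple markets, and portfolio constraints on top of this kind of formulation:

from pulp import LpMaximize, LpProblem, LpVariable, lpSum

prices = [42.0, 35.5, 28.1, 55.9, 80.3, 61.2]  # £/MWh forecast (invented)
T = range(len(prices))
CAP, POWER, ETA = 4.0, 1.0, 0.9  # MWh capacity, MW limit, charge efficiency

prob = LpProblem("bess_dispatch", LpMaximize)
charge = [LpVariable(f"c{t}", 0, POWER) for t in T]
discharge = [LpVariable(f"d{t}", 0, POWER) for t in T]

# Revenue: sell energy on discharge, pay for energy on charge.
prob += lpSum(prices[t] * (discharge[t] - charge[t]) for t in T)

# State of charge must stay within the battery's physical limits at all times.
for t in T:
    soc = lpSum(ETA * charge[k] - discharge[k] for k in range(t + 1))
    prob += soc >= 0
    prob += soc <= CAP

prob.solve()
plan = [(charge[t].value(), discharge[t].value()) for t in T]
print(plan)  # charge in cheap hours, discharge in expensive ones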
Ontologies vs Ologs vs Graphs
Ryan Wisnesky - Conexus
Ontologies are seeing renewed interest as 'semantic applications' such as LLMs proliferate. In this talk Ryan will go over the hundred-plus-year history of ontologies, including their origin in taxonomies, the rise and fall of the semantic web, how knowledge graphs have been serving as lightweight ontologies, and how computational extensions to knowledge graphs turn them into ontologies proper. He will also describe the present of ontologies, including connections to formal logic and relational algebra, and 'ologs' - a category-theoretic kind of ontology that uniquely admits a way to bidirectionally exchange data between ontologies. Next he will discuss lambda-graph, a type-theoretic kind of ontology that allows results such as type inference to be put to work on graph data, with applications to Tinkerpop. Ryan will conclude by looking towards the future, describing how ontologies can guide data integration and warehousing and how they can add context to prompts to increase LLM accuracy.
Validating LLM-Generated SQL Code: A mathematical approach
Ryan Wisnesky - Conexus
Organizations using LLMs (Large Language Models) to generate SQL code face a significant hurdle: ensuring the generated code is reliable and safe to execute. Unforeseen errors in the code can lead to unpredictable behavior, ranging from minor inconveniences to catastrophic data loss. This lack of trust becomes a major roadblock in deploying LLM-based applications. In this talk we describe a technology that leverages advanced mathematics to rigorously analyze LLM-generated SQL code. The analysis goes beyond basic syntax checks, delving into complex logic and potential unintended consequences. For example, the analysis can detect missing join conditions. Once checked, LLM-generated SQL code can be deployed with assurance. In this talk we go through an in-depth example of a validation scenario and describe the formal methods required to build such a verifier at scale.
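The formal analysis described in the talk goes well beyond syntax, but even a syntactic pass catches one common instance of this bug class. A linter-level sketch using the sqlglot parser to flag explicit joins that lack a join condition - an illustration of the error category, not the verifier from the talk:

import sqlglot
from sqlglot import exp

def unconstrained_joins(sql: str) -> list[str]:
    # Flag JOINs with no ON / USING clause: a frequent LLM failure mode
    # that silently produces a cartesian product.
    tree = sqlglot.parse_one(sql)
    flagged = []
    for join in tree.find_all(exp.Join):
        if (not join.args.get("on")
                and not join.args.get("using")
                and join.kind != "CROSS"):
            flagged.append(join.sql())
    return flagged

print(unconstrained_joins(
    "SELECT * FROM orders JOIN customers ON orders.cid = customers.id"
))  # [] - the join is constrained
print(unconstrained_joins(
    "SELECT * FROM orders JOIN customers"
))  # the bare join is flagged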
The Future of Data Education and Publishing in the Era of AI
Jess Haberman - Anaconda
Michelle Yi - Women in Data
Hala Nelson - James Madison University
With easier access to expert knowledge, we are in the midst of a significant shift in the technical education and publishing landscapes. Do these advancements propel us toward educational bliss, or do they pose unprecedented threats to industry and academia? Join us as we seek to unravel the future of data and tech education.
The surge of generative AI content sparks a range of debates: Does it herald a new era of learning or threaten academic integrity? Will AI augment or overshadow human-generated educational materials? What implications does the proliferation of AI-generated content hold for authors and the discoverability of their work? Does democratized access to generative AI writing tools make our writing better and more efficient, or simply more generic? We will delve into the ramifications of AI tools on writing, teaching, and student learning, exploring the opportunities they present for knowledge dissemination and the concerns they raise with regard to content quality and correctness. Join us for a discussion on the future of data education and its transformative impact on the realms of technology, academia, and publishing.
Our esteemed panelists bring education and publishing perspectives:
Hala Nelson: Associate Professor of Mathematics at James Madison University and author of Essential Math for AI.
Michelle Yi: Board Member at Women in Data and advocate for STEM education among underrepresented minorities.
Jess Haberman (panelist and moderator): Content and education leader at Anaconda, leveraging 14 years of publishing experience, including as an acquisitions editor at O’Reilly Media.