The Data Day Texas 2023 Sessions

How to build someone we can talk to

Jonathan Mugan - DeUmbra

So many intelligent robots have come and gone, failing to become a commercial success. We’ve lost Aibo, Romo, Jibo, Baxter, Scout, and even Alexa is reducing staff. I posit that they failed because you can’t talk to them, not really. AI has recently made substantial progress, speech recognition now actually works, and we have neural networks such as ChatGPT that produce astounding natural language. But you can’t just throw a neural network into a robot because there isn’t anybody home in those networks—they are just purposeless mimicry agents with no understanding of what they are saying. This lack of understanding means that they can’t be relied upon because the mistakes they make are ones that no human would make, not even a child. Like a child first learning about the world, a robot needs a mental model of the current real-world situation and must use that model to understand what you say and generate a meaningful response. This model must be composable to represent the infinite possibilities of our world, and to keep up in a conversation, it must be built on the fly using human-like, immediate learning. This talk will cover what is required to build that mental model so that robots can begin to understand the world as well as human children do. Intelligent robots won’t be a commercial success based on their form—they aren’t as cute and cuddly as cats and dogs—but robots can be much more if we can talk to them.

In Search of the Control Plane for Data

Shirshanka Das - Acryl

Data Discovery, Data Observability / Quality, Data Governance, and Data Management are all linked via a common thread: Metadata. Rather than pursuing vertical, siloed solutions in each of these areas, we can invert the model while preserving the ability to innovate at the edges of each.
Over the last few years, we’ve witnessed frenetic innovation that has transformed the data landscape from a mostly monolithic operational database plus analytical warehouse stack to a hyper-specialized, deconstructed data stack composed of NoSQL data stores, search indexes, specialized OLAP stores, streaming systems and new and unique segments like “data prep” and “reverse ETL”. Some of these segments (e.g. streaming) have become mainstays of the data stack, while others (e.g. reverse ETL) are constantly under threat of being subsumed by an adjacent segment.
The consequence of this Cambrian explosion, and of the constant redrawing of the data jigsaw puzzle, is that while team-specific use-cases and business outcomes have gotten the attention they need (hello, upload CSV to Salesforce), use-cases and outcomes that span the entire data stack have gotten harder and harder to achieve. What are we missing?
The first hard problem in this category that people usually encounter is “data discovery”, or the ability to find data assets that are hidden behind team and tool boundaries, and “data lineage” or the understanding of how data flows from one system to another. With modern data catalogs like DataHub and others, we’re starting to see promising case studies emerge from the data community in solving this problem.
However, there are other and much more impactful problems that we don’t talk about as much, mainly because as an industry we have resigned ourselves to solving them manually, or in an inefficient way. Problems such as access management, data retention, cost attribution and optimization, data-aware orchestration, etc. When was the last time that you felt like you have a single place where you can provide a high level policy like “sales people should not have access to personally identifiable information collected on your platform”, and rest assured that this was being adhered to across all of your data assets stored across all the different tools that you have purchased at your company? How about specifying in a single place how long your user behavior data should be retained and knowing that it is being respected across your streaming systems, your cloud warehouse and your CRM system?
In an interview conducted within the DataHub community of 5000 data practitioners, more than 50% of respondents were unhappy with the state of the art when it came to solving all of these problem categories above and felt that they were having to solve for these outcomes in a disjointed way or just not solve them at all.
We’ve gotten into this state because, while we were busy building individual tools that solve specific use-cases, we forgot to build a harmonizing layer that makes these tools work together to achieve outsized outcomes for the business.
I’m calling this layer the control plane of data.
In this talk, I’ll describe what the control plane of data looks like, and how it fits into the reference architecture for the deconstructed data stack: a data stack that includes operational data stores, streaming systems, transformation engines, BI tools, warehouses, ML tools and orchestrators.
We’ll dig into the fundamental characteristics for a control plane:
• Breadth (completeness)
• Latency (freshness)
• Scale
• Consistency
• Source of Truth
We’ll discuss what use-cases you can accomplish with a unified control plane at a fraction of the implementation cost of building these as bespoke, custom solutions.
Finally, I’ll share how DataHub, the open source metadata platform for the modern data stack, is evolving towards this vision of the future, through some real-world examples and use-cases from the community.
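The single-place policy described in this abstract ("sales people should not have access to personally identifiable information") can be illustrated with a minimal Python sketch. This is not DataHub's API: the `Asset` and `Policy` classes, the `pii` tag, and the platform names are hypothetical, and the sketch assumes that metadata from every tool has already been collected into one store.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Asset:
    """One data asset as recorded in a central metadata store (hypothetical)."""
    name: str
    platform: str                      # e.g. "warehouse", "stream", "crm"
    tags: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class Policy:
    """A high-level rule: `role` must not read assets carrying `denied_tag`."""
    role: str
    denied_tag: str


def can_read(role, asset, policies):
    """Return True unless some policy denies this role on this asset."""
    return not any(
        p.role == role and p.denied_tag in asset.tags for p in policies
    )


def violations(grants, assets, policies):
    """List (role, asset name) grants that contradict the policies.

    `grants` maps a role to the asset names it can currently read,
    as reported by each individual tool.
    """
    by_name = {a.name: a for a in assets}
    return sorted(
        (role, name)
        for role, names in grants.items()
        for name in names
        if not can_read(role, by_name[name], policies)
    )


# Hypothetical assets living in three different tools.
assets = [
    Asset("users_raw", "warehouse", frozenset({"pii"})),
    Asset("click_events", "stream", frozenset()),
    Asset("contacts", "crm", frozenset({"pii"})),
]
policies = [Policy(role="sales", denied_tag="pii")]
grants = {"sales": {"click_events", "contacts"}, "support": {"users_raw"}}

found = violations(grants, assets, policies)
print(found)
```

The point of the sketch is that the policy is written once against metadata (tags), not once per tool; a real control plane would also need to push the decision back into each system.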

Reinforcement Learning with Ray RLlib

Dean Wampler - IBM

Reinforcement Learning trains an agent to maximize a cumulative reward in an environment. It has been used to achieve expert-level performance in Atari games and even the game of Go. It is now being applied to many other problems. Dean will begin with why RL is important and how it works, and will discuss several applications of RL. Then he will discuss how RL requires a variety of computational patterns: data processing, simulations, model training, model serving, etc. Few frameworks efficiently support all these patterns at scale. Finally, Dean will show how RLlib, implemented with Ray, seamlessly and efficiently supports RL, providing an ideal platform for building Python-based RL applications with an intuitive, flexible API.
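As a rough illustration of "maximizing a cumulative reward", here is a minimal sketch of tabular Q-learning in plain Python. It deliberately does not use RLlib or Ray; the five-cell corridor environment, the reward of 1 at the right end, and all hyperparameters are assumptions made up for this example.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)    # index 0 moves left, index 1 moves right
GAMMA = 0.9           # discount factor
ALPHA = 0.5           # learning rate
EPSILON = 0.2         # exploration probability


def step(state, action_index):
    """Apply an action; return (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def discounted_return(rewards, gamma=GAMMA):
    """Cumulative reward the agent maximizes: sum of gamma**t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def choose_action(q_row, rng):
    """Epsilon-greedy choice with random tie-breaking."""
    if rng.random() < EPSILON:
        return rng.randrange(len(ACTIONS))
    best = max(q_row)
    return rng.choice([a for a, value in enumerate(q_row) if value == best])


def train(episodes=500, max_steps=100, seed=0):
    """Learn a Q-table for the corridor with tabular Q-learning."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            action = choose_action(q[state], rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy action per non-terminal state (1 means "move right").
policy = [max(range(len(ACTIONS)), key=lambda a: q_table[s][a])
          for s in range(N_STATES - 1)]
print(policy)
```

Even this toy mixes simulation (`step`), training (`train`), and serving a policy (`policy`), which hints at why the abstract stresses support for several computational patterns at once.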

Building ML Ops Organizations for Scale

Joey Jablonski - (Pythian)

Many organizations have now successfully created their first analytical model and leveraged it to drive new business value and more impactful decisions. Now the hard part begins: building an operational framework for deploying future models and managing them at scale, governed by principles of performance, bias, responsiveness, and accuracy.
Today’s Machine Learning Operations (ML Ops) capabilities are more than just technology. They are capabilities that must align with the model development process, a data engineering platform, and an operational model. These processes must ensure that models are unbiased when deployed, that the accuracy of their answers does not drift over time, and that upstream data changes do not degrade the accuracy of model outputs.
This process starts with defining metrics of success. These are the anchor point for how we measure and intervene in the lifecycle of our models. Effective metrics align with organizational goals for product adoption and revenue growth targets.
These metrics become key components in the design of our ML Ops technology stack, helping us prioritize which aspects of model performance we monitor, how we intervene, and how urgently data science and ML engineering teams respond to alerts about adverse performance. The technology landscape for ML Ops is changing rapidly, and making purposeful decisions early about architecture and modularity will ensure the seamless addition of future capabilities.
Behind the ML Ops technology stack is our operational model. These are the constructs for how teams operate across data engineering, data science and ML engineering to build, test and deploy analytical assets. Operational models capture the process for work handoff, verification, and review to ensure adherence with organization and industry objectives. Operational models are backed by our governance standards that define testing criteria for new models, de-identification of data used for training and management of multiple models that converge for driving application experiences.
Bringing together defined metrics, flexible technology stacks and a well-defined operational model will enable your organization to deploy reliable models at scale. The collection of these capabilities will allow the organization to move to higher levels of maturity in their operations, higher levels of reliability and enable scalability as the use of analytical models for decisioning increases.
This session is part of the Data Day Texas ML Track.

GIS Keynote
"one ant, one bird, one tree"...

Bonny McClain

"When you have seen one ant, one bird, one tree, you have not seen them all." --E.O. Wilson
The power of GIS and visualizing history is the transitory nature of the present. The world writ large doesn’t exist in political terms, marketing campaigns, or lifespans--think of a continuous link evolving into our current zeitgeist--and well beyond.
Geospatial analysts and scientists evaluate demographic shifts, social and cultural shifts, economic shifts, and environmental dynamics--what we need now is a powerful intersection of our insights. Understanding the role of location intelligence and spatial awareness just might be the missing link.
Using open source tools and data we will examine how powerful data questions elevate our discussion and re-focus potential solutions to address community level discord and marginalization.

Computer Vision Landscape at Chegg: Present and Future

Sanghamitra Deb - (Chegg)

Millions of people all around the world learn with Chegg. Education at Chegg is powered by the depth and diversity of our content, a huge part of which is in the form of images. These images may be uploaded by students or by content creators. They contain text that is extracted using a transcription service, but uploaded images are very often noisy, which leads to irrelevant characters or words in the transcribed text. Using object detection techniques, we develop a service that extracts the relevant parts of the image and uses a transcription service to get clean text.
In the first part of the presentation, I will talk about building an object detection model using YOLO for cropping and masking images to obtain cleaner text from transcription. YOLO is a deep learning object detection and recognition framework that produces highly accurate results with low latency. In the next part of my presentation, I will talk about building the Computer Vision landscape at Chegg. Starting from images of academic materials that are composed of elements such as text, equations, and diagrams, we create a pipeline for extracting these image elements. Using state-of-the-art deep learning techniques, we create embeddings for these elements to enhance downstream machine learning models such as content quality and similarity.
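The cropping and masking step mentioned in this abstract can be pictured with a small Python sketch. It does not call YOLO or any transcription service: the bounding boxes are hard-coded stand-ins for what a detector might return, the image is a tiny nested list, and the function names are hypothetical.

```python
def crop(image, box):
    """Return the sub-image inside box = (top, left, bottom, right).

    Bottom and right are exclusive, as in Python slicing.
    """
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def mask(image, boxes, fill=255):
    """Return a copy of image with every pixel inside any box set to fill."""
    masked = [list(row) for row in image]
    for top, left, bottom, right in boxes:
        for r in range(top, bottom):
            for c in range(left, right):
                masked[r][c] = fill
    return masked


# 4x6 grayscale "image": 0 = ink, 255 = blank paper.
image = [
    [255, 255, 255, 255, 0, 0],
    [255, 0, 0, 255, 0, 0],
    [255, 0, 0, 255, 255, 255],
    [255, 255, 255, 255, 255, 255],
]
question_box = (1, 1, 3, 3)      # hypothetical detector output: region to keep
noise_boxes = [(0, 4, 2, 6)]     # hypothetical detector output: region to blank

cleaned = mask(image, noise_boxes)       # blank out the noisy region
question = crop(cleaned, question_box)   # keep only the relevant region
print(question)
```

In a pipeline like the one described, `question` (or `cleaned`) would be what gets sent to transcription instead of the raw upload.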

Your laptop is faster than your data warehouse. Why wait for the cloud? (DuckDB)

Ryan Boyd - (MotherDuck)

The tech industry has been on a pendulum swinging between centralized and distributed computing power. We started with mainframes accessed through thin clients, moved to fully-functional PCs and now have these PCs backed by cloud compute -- the modern "mainframes."
With advances in CPUs, memory, SSDs, and the software that enables it all, our personal machines are powerful beasts relegated to handling a few Chrome tabs and sitting 90% idle. As data engineers and data analysts, this seems like a waste that's not only expensive, but also impacting the environment. How can we take advantage of this compute?
DuckDB is an open source in-process OLAP engine for vectorized analysis of columnar data. It supports SQL, including some advanced analytics capabilities like window functions and data sampling.
In this session, I'll introduce DuckDB, show its capabilities with a bunch of live demos and talk about how we're exploring using the cloud and a hybrid querying architecture to support in-process analytics instead of replacing it.
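To give a feel for in-process SQL with a window function, here is a minimal sketch. Note the substitution: it uses Python's built-in `sqlite3` module as a stand-in so that it runs without installing anything, not DuckDB itself, and SQLite is a row-oriented engine rather than a vectorized columnar one. The `sales` table and its values are made up, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

# In-memory, in-process database: no server and no network round trip.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 10.0), ("east", 2, 5.0), ("west", 1, 7.0), ("west", 2, 3.0)],
)

# Window function: running total of amount per region, ordered by day.
rows = conn.execute(
    """
    SELECT region, day,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
    FROM sales
    ORDER BY region, day
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The query runs inside the Python process on the laptop, which is the property the abstract highlights; the same SQL shape is the kind of analytics query an in-process OLAP engine is built for.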

Zero-copy integration

Dave McComb - (Semantic Arts)

The reason we need to talk about zero copy integration is that its opposite is so well entrenched that most practitioners can’t imagine a world without some form of extract, transform and load, or of system integration that copies and manipulates data through APIs. The traditional enterprise data landscape is an almost endless set of pipelines, data sets, and chutes and ladders that ferry data from its source to myriad destinations. This seems necessary, because each application we use and each tool we employ has its own bespoke way of structuring data. Each application ends up morphing the prior application’s idiosyncrasies into its own idiosyncrasies. In this talk we unpack the prerequisites needed to achieve Data-centricity and zero copy integration. We will present two case studies of firms that are enjoying zero copy integration. We will also present a simple demonstration, to make the idea more concrete.

Clinical trials exploration: surfacing a clinical application from a larger Bio-Pharma KnowledgeGraph

David Hughes - Graphable

Clinical, proteomic, and pharma knowledge graphs are complex aggregations of constituent subgraphs. These linked graphs provide meaningful insights as a whole, but in many cases a single subgraph can independently prove to be a valuable asset. In this session, David will identify possible applications of the NLM’s Clinical Trials resource as a standalone application. He will review how to query the API and how to populate/run ETL through tools like Hume Orchestra and Apache Hop. He will then explore how to create an application using Streamlit as a POC, and then discuss potential refinements.

An Introduction to Apache Pinot

Tim Berglund - StarTree

When things get a little bit cheaper, we buy a little bit more of them. When things get cheaper by several orders of magnitude, you don't just see changes in the margins, but fundamental transformations in entire ecosystems. Apache Pinot is a driver of this kind of transformation in the world of real-time analytics.
Pinot is a real-time, distributed, user-facing analytics database. The rich set of indexing strategies makes it a perfect fit for running highly concurrent queries on multi-dimensional data, often with millisecond latency. It has out-of-the-box integration with Apache Kafka, S3, Presto, HDFS, and more. And it's so much faster on typical analytics workloads that it is not just a marginally better data warehouse, but the cornerstone of the next revolution in analytics: systems that expose data not just to internal decision makers, but to customers using the system itself. Pinot helps expand the definition of a "decision-maker" not just down the org chart, but out of the organization to everyone who uses the system.
In this talk, you'll learn how Pinot is put together and why it performs the way it does. You'll leave knowing its architecture, how to query it, and why it's a critical infrastructure component in the modern data stack. This is a technology you're likely to need soon, so come to this talk for a jumpstart.

Your data infrastructure will be in Kubernetes

Patrick McFadin / Jeff Carpenter - (DataStax)

Are people actually moving stateful workloads to K8s? Yes, yes they are. In the process of writing the book Managing Cloud Native Data on Kubernetes, we spoke with a bunch of the experts who are moving various types of stateful workloads to K8s. In this talk we’ll share what we learned:
• What’s solid: storage and workload management
• What’s good and getting better: operators, streaming, and database workloads
• What needs work: analytics and machine learning
We’ll also share what this means for your data infrastructure:
• Infrastructure should conform to your application and not the other way around.
• Stop creating new data infrastructure projects and start assembling new architectures
• Look to open source projects for inspiration

Protecting Against Ransomware Attacks using Kafka, Flink and Boostgraph

Brian Hall - (Qomplx)

Ransomware attacks are now commonplace in the news and are only becoming more so, and the reported cases do not include those handled quietly. Come see how Kafka, Flink and graph technologies like boostgraph can be used to identify anomalous behaviors on corporate networks.
Prerequisites: General knowledge around messaging technologies, parallelism and autoscaling.
Takeaways: How to keep track of and manage network traffic in a highly scalable manner and interpret relative risk of potential breaches using readily available technologies.

Outrageous ideas for Graph Databases

Max De Marzi - Amazon Web Services

Almost every graph database vendor raised money in 2021. I am glad they did, because they are going to need the money. Our current Graph Databases are terrible and need a lot of work. There, I said it. It's the ugly truth in our little niche industry. That's why, despite waiting for over a decade for the "Year of the Graph" to come, we still haven't set the world on fire. Graph databases can be painfully slow, they can't handle non-graph workloads, their APIs are clunky, and their query languages are either hard to learn or hard to scale. Most graph projects require expert shepherding to succeed. 80% of the work takes 20% of the time, but that last 20% takes forever. The graph database vendors optimize for new users, not grizzled veterans. They optimize for sales, not solutions. Come listen to a Rant by an industry OG on where we could go from here if we took the time to listen to the users that haven't given up on us yet.

Apache Iceberg: An Architectural Look Under the Covers

Alex Merced - Dremio

Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve this is hiding the complexity of underlying data structures and physical data storage from users. The de facto standard has been the Hive table format, released by Facebook in 2009, which addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS. In this talk, Alex Merced will walk through the architectural details of Iceberg, and show how the Iceberg table format addresses the shortcomings of the Hive format, as well as additional benefits that stem from Iceberg’s approach.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design

Apache Iceberg and the Right to Be Forgotten

Alex Merced - Dremio

Regulatory requirements can make data engineering more complex than it otherwise would be. In this talk, we will discuss how to navigate hard deletions in an Apache Iceberg-based data lakehouse. You will learn:
- How to ensure data is hard deleted using copy-on-write Iceberg tables
- How to ensure data is hard deleted using merge-on-read Iceberg tables
- Other strategic and technical options to consider when architecting regulatory compliance
Some familiarity with Apache Iceberg and data lakehouses will be helpful.
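The difference between the two deletion strategies named in this abstract can be pictured with a toy model. This is a sketch under stated assumptions, not Iceberg's API or the speaker's method: `ToyTable` and its methods are hypothetical, data files are plain Python lists, and the model assumes only the general behavior of snapshot-based table formats, where older snapshots keep referencing old files until they are expired.

```python
class ToyTable:
    """Toy snapshot-based table; each snapshot is (data_files, deleted_ids)."""

    def __init__(self, data_files):
        self.snapshots = [([list(f) for f in data_files], set())]

    def read(self):
        """Rows visible in the current snapshot, with deletes applied."""
        files, deleted_ids = self.snapshots[-1]
        return [row for f in files for row in f if row[0] not in deleted_ids]

    def delete_copy_on_write(self, user_id):
        """Rewrite data files without the matching rows (all files, for brevity)."""
        files, deleted_ids = self.snapshots[-1]
        rewritten = [[row for row in f if row[0] != user_id] for f in files]
        self.snapshots.append((rewritten, set(deleted_ids)))

    def delete_merge_on_read(self, user_id):
        """Leave data files untouched; record the delete separately."""
        files, deleted_ids = self.snapshots[-1]
        self.snapshots.append((files, deleted_ids | {user_id}))

    def compact(self):
        """Rewrite data files with recorded deletes applied."""
        files, deleted_ids = self.snapshots[-1]
        rewritten = [[r for r in f if r[0] not in deleted_ids] for f in files]
        self.snapshots.append((rewritten, set()))

    def expire_old_snapshots(self):
        """Drop every snapshot except the current one."""
        self.snapshots = self.snapshots[-1:]

    def physically_stored(self, user_id):
        """True if any retained snapshot still holds a row for user_id."""
        return any(row[0] == user_id
                   for files, _ in self.snapshots for f in files for row in f)


events = [[(1, "login"), (2, "login")], [(1, "purchase"), (3, "login")]]

cow = ToyTable(events)
cow.delete_copy_on_write(1)
cow_before_expiry = cow.physically_stored(1)   # old snapshot still has the rows
cow.expire_old_snapshots()
cow_after_expiry = cow.physically_stored(1)

mor = ToyTable(events)
mor.delete_merge_on_read(1)
mor_visible = mor.read()                       # readers no longer see user 1
mor_before_compaction = mor.physically_stored(1)
mor.compact()
mor.expire_old_snapshots()
mor_after_cleanup = mor.physically_stored(1)

print(cow_before_expiry, cow_after_expiry, mor_before_compaction, mor_after_cleanup)
```

In this toy, a logical delete is not yet a hard delete in either mode: copy-on-write still needs old snapshots expired, and merge-on-read needs a rewrite of the data files first and then expiry.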

Visualizing Connected Data as It Evolves Over Time

Janet Six - (Tom Sawyer)

Connected data visualization and analysis techniques are becoming increasingly popular for their ability to work well with graph databases and to communicate key results to stakeholders and decision makers. Connected data that is static benefits from specific techniques, but what are the best practices for visualizing connected data that dynamically changes? And how do you best model the changes that are occurring in the system?
In this session, we will discuss how connected data can change over time and the implications of those changes for visualization and analysis techniques. We will also explore visualization techniques for dynamically changing connected data, including social networks that evolve over time, digital transformation model simulations, and event analysis. These visualization techniques allow us to:
• Apply post-situation analysis so that we can understand what happened in the system and when
• Better understand simulations of future scenarios and compare them
• Discover important trends

What You Can't do With Graph Databases

Tomás Sobat Stöfsel - (Vaticle)

Developing with graph databases has a number of challenges, such as the modelling of complex schemas and maintaining data consistency in your database.
In this talk, we discuss how TypeDB addresses these challenges, as well as how it compares to property graph databases. We’ll look at how to read and write data, how to model complex domains, and TypeDB’s ability to infer new data.
The main differences between TypeDB and graph databases can be summarised as:
1. TypeDB provides a concept-level schema with a type system that fully implements the Entity-Relationship (ER) model. Graph databases, on the other hand, use vertices and edges without integrity constraints imposed in the form of a schema.
2. TypeDB contains a built-in inference engine - graph databases don’t provide native inferencing capabilities.
3. TypeDB offers an abstraction layer over a graph, and leverages a graph engine under the hood to create a higher-level model; graph databases offer a lower level of abstraction.

Graphing without the database - creating graphs from relational databases

Corey Lanum - (Cambridge Intelligence)

Many projects I’ve worked on assume that to present business users with a node-link view means transferring all the data to a graph database. Worse still, I’ve seen teams duplicate and synchronize their data into a second database creating new layers of completely unnecessary complexity.
The truth is that analyzing and visualizing data as a graph doesn’t necessarily mean storing the data in a graph database, or in a graph format, at all.
While graph databases have value for complex traversal queries, in many cases, they create unnecessary complexity. The simpler model of translating tabular data to nodes and links on the fly is easier to implement and allows the flexibility to choose from different data models of the same source.
In this talk, I’ll walk through the process and architecture of building a graph application from the standard relational database you probably already have.
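The idea of translating tabular data to nodes and links on the fly can be sketched in a few lines of Python. This is only an illustration, not the architecture presented in the talk: the in-memory SQLite database, the `customers` and `orders` tables, and the node/link dictionary shape are all assumptions chosen for the example.

```python
import sqlite3

# Stand-in for "the relational database you probably already have".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 'laptop'), (11, 1, 'monitor'), (12, 2, 'laptop');
""")


def customer_product_graph(connection):
    """Build nodes and links from a join, without storing a graph anywhere."""
    nodes, links = {}, []
    query = """
        SELECT c.id, c.name, o.product
        FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
        ORDER BY o.id
    """
    for customer_id, name, product in connection.execute(query):
        customer_key = f"customer:{customer_id}"
        product_key = f"product:{product}"
        # Dict keys deduplicate nodes that appear in several rows.
        nodes[customer_key] = {"id": customer_key, "label": name, "type": "customer"}
        nodes[product_key] = {"id": product_key, "label": product, "type": "product"}
        links.append({"source": customer_key, "target": product_key})
    return list(nodes.values()), links


nodes, links = customer_product_graph(conn)
conn.close()
print(len(nodes), "nodes,", len(links), "links")
```

Because the graph is derived per query, a different SELECT over the same tables (for example, linking customers who bought the same product) would yield a different graph model of the same source, with no second database to keep in sync.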

Enabling the Computational Future of Biology

Tomás Sobat Stöfsel - (Vaticle)

Computational biology has revolutionised biomedicine. The volume of data it is generating is growing exponentially. This requires tools that enable computational and non-computational biologists to collaborate and derive meaningful insights. However, traditional systems are inadequate to accurately model and handle data at this scale and complexity.
In this talk, we discuss how TypeDB enables biologists to build a deeper understanding of life, and increase the probability of groundbreaking discoveries, across the life sciences.

Where Is the Graph? Best practices for extracting data from unstructured data sources for effective visualization and analysis

Janet Six - (Tom Sawyer)

As unstructured data becomes larger and more complex while stakeholders expect increasingly useful data results, we can apply graph techniques to discover ever elusive insights. These graph techniques can assist with data discovery and understanding, and be used to communicate findings to decision makers. But what are the best practices to apply graph technology to the connected data that is inherent in unstructured data sources? Where is the graph?
Currently, many companies are still trying to visualize or analyze the whole data source. This leads to mixed results and hairball visualizations that may be beautiful artistically, but don’t show the level of detail that is needed for visual analysis and for communication of results to stakeholders. How do we get beyond all the noise in unstructured data to discover the knowledge needed to bring business value to that data?
In this session, we will discuss several approaches to finding useful graphs in your unstructured data and how to apply visualization and analysis techniques to them.

Data Professional's Career: Techniques to Practice Rigor and Avoid Ten Mistakes

Jike Chong / Yue Cathy Chang

Trust is precious to build for all data science work. Much of the trust in data and modeling comes from the rigor with which data professionals treat the subject. What are some aspects of rigor required throughout a data professional's career?
One aspect of rigor is to detect anti-patterns in your work. Anti-patterns are undesirable data science practices that increase the risk of failure. Just as a chess master can observe a chessboard and articulate which side may be winning, you can detect anti-patterns in your projects and work before they cause irreparable damage.
This talk highlights the top 10 areas in which to practice rigor and mitigate anti-patterns, across four career stages (team lead, team manager, function director, and executive) with scopes of influence ranging from within-team to industry-wide.
Requirements: Some experience as a data professional.
Takeaways: Recognize and detect anti-patterns in your projects and work before they cause irreparable damage.

For the overwhelmed data professionals: What to do when there is so much to do?

Jike Chong / Yue Cathy Chang

95% of the companies with data science teams have teams of fewer than ten members. In a nimble team with a wide range of possibilities to make business impacts with data science, how do you explore and prioritize the opportunities? How do you ensure that there are sponsors and champions for your project? How do you set realistic expectations with business partners for project success? If you are leading a team, how can you delegate your projects effectively?
In this talk, the presenters will share three techniques for project prioritization, two roles (sponsor and champion) to identify, and four levels of confidence (for predictive models) to specify project success, and will discuss best practices for delegating work as a team lead/manager.

Ontology in Healthcare: a survey

Sivaram Arabandi - Ontopro

It is a little-known fact that ontologies have been playing a critical role in the healthcare domain for over two decades. The Gene Ontology (GO) was started in 1998 as a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species [1]. SNOMED-CT, which provides a broad coverage of the clinical domain, was released in 2002 in a highly computable form: a directed acyclic graph anchored in formal representation logic [2]. Similarly, LOINC (lab data), RxNorm (medications) and many others have been an integral part of healthcare data. This talk will provide a broad overview of the healthcare terminology space including formal ontologies, controlled vocabularies, value sets, etc. and how they relate to other healthcare standards such as FHIR and CPG.

Topics covered include:
1. Why do we need ontologies in healthcare
2. Overview of Healthcare terminology space
3. Terminology, Value sets and Cross maps
4. Ontology vs Information model
5. Terminology servers
6. Ontology and AI

Introducing a Strongly-typed Database: TypeDB & TypeQL

Haikal Pribadi - (Vaticle)

In this session, Haikal Pribadi will present the origins of TypeDB, the impetus for inventing a new query language, TypeQL, and how Vaticle has arrived at an altogether new evolution of the database. As a strongly-typed database, TypeDB allows you to model your domain based on logical and object-oriented principles, allowing you to think higher-level, as opposed to join-tables, columns, documents, vertices, and edges. Types describe the logical structures of your data, allowing TypeDB to validate that your code inserts and queries data correctly. Query validation goes beyond static type-checking, and includes logical validation of meaningless queries. TypeDB also encodes your data for logical interpretation by its reasoning engine. It enables type-inference and rule-inference, which create logical abstractions of data, allowing for the discovery of facts and patterns that would otherwise be too hard to find. With these abstractions, queries in the tens to hundreds of lines in SQL or NoSQL databases can be written in just a few lines in TypeQL.

Workshops at Data Day Texas 2023

The mini-workshops are 90 minutes - the same length as two regular Data Day sessions. These workshops run throughout the day, and are held in tiered classrooms with space to open and plug in your laptop. The goal of each workshop is to set you up with a new tool/skill and enough knowledge to continue on your own.
We're just beginning to announce the workshops for Data Day Texas 2023. We'll be adding more each week.

Introduction to Graph Data Science for Python Developers

Sean Robinson - (Graphable)

This workshop will cover a variety of graph data science techniques using Python, Neo4j, and other libraries. The goal of the workshop is to serve as a springboard for attendees to identify which graph-based tools/techniques can provide novel value to existing workflows. Some of the techniques to be covered are:
• How to think about data as a graph and the implications that has on downstream analysis
• How to use graph algorithms at scale using both Neo4j and other pythonic libraries
• How to enhance traditional ML models with graph embeddings
• How to visualize these insights in the context of a graph for greater business intelligence
• How to integrate these techniques with your existing data science tool belt

Hands-On Introduction To GraphQL For Data Scientists & Developers

William Lyon - (Neo4j)

This hands-on workshop will introduce GraphQL and explore how to build GraphQL APIs backed by Neo4j, a native graph database, and show why GraphQL is relevant for both developers and data scientists. This workshop will show how to use the Neo4j GraphQL Library, which allows developers to quickly design and implement fully functional GraphQL APIs without writing boilerplate code, to build a Node.js GraphQL API, including adding custom logic, authorization rules, and operationalizing data science techniques.

- Overview of GraphQL and building GraphQL APIs
- Building Node.js GraphQL APIs backed by a native graph database using the Neo4j GraphQL Library
- Adding custom logic to our GraphQL API using the @cypher schema directive and custom resolvers
- Adding authentication and authorization rules to our GraphQL API

We will be using online hosted environments so no local development setup is required. Specifically, we will use Neo4j Aura database-as-a-service and CodeSandbox for running our GraphQL API application. Prior to the workshop please register for Neo4j Aura and create a "Free Tier" database: You will also need a GitHub account to sign-in to CodeSandbox or create a CodeSandbox account at

Ontology for Data Scientists - 90 minute tutorial

Michael Uschold - (Semantic Arts)

We start with an interactive discussion to identify what are the main things that data scientists do and why and what some key challenges are. We give a brief overview of ontology and semantic technology with the goal of identifying how and where it may be useful for data scientists.
The main part of the tutorial gives a deeper understanding of what ontologies are and how they are used. This technology grew out of core AI research in the 70s and 80s. It was formalized and standardized in the 2000s by the W3C under the rubric of the Semantic Web. We introduce the following foundational concepts for building an ontology in OWL, the W3C standard language for representing ontologies:
- Individual things are OWL individuals - e.g., JaneDoe
- Kinds of things are OWL classes - e.g., Organization
- Kinds of relationships are OWL properties - e.g., worksFor
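The three concepts listed above can be made concrete with a minimal sketch that stores facts as (subject, predicate, object) triples in plain Python. It is not an OWL toolkit and does no reasoning: the `ex:` prefix, the `ex:Person` class, the `ex:Acme` individual, and the `match` helper are hypothetical additions for illustration, while JaneDoe, Organization, and worksFor come from the examples above.

```python
# Each fact is a (subject, predicate, object) triple.
RDF_TYPE = "rdf:type"

triples = {
    # Kinds of things are classes.
    ("ex:Person", RDF_TYPE, "owl:Class"),
    ("ex:Organization", RDF_TYPE, "owl:Class"),
    # Kinds of relationships are properties.
    ("ex:worksFor", RDF_TYPE, "owl:ObjectProperty"),
    # Individual things are individuals, typed by a class.
    ("ex:JaneDoe", RDF_TYPE, "ex:Person"),
    ("ex:Acme", RDF_TYPE, "ex:Organization"),
    # A property links two individuals.
    ("ex:JaneDoe", "ex:worksFor", "ex:Acme"),
}


def match(pattern, facts):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in facts
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


employers = [o for _, _, o in match(("ex:JaneDoe", "ex:worksFor", None), triples)]
classes = [s for s, _, _ in match((None, RDF_TYPE, "owl:Class"), triples)]
print(employers, classes)
```

The `match` helper plays the role a SPARQL triple pattern would play against a real triple store; here it only shows how schema and data live side by side as triples.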
Through interactive sessions, participants will identify what the key things are in everyday subjects and how they are related to each other. We will start to build an ontology in healthcare, using this as a driver to introduce key OWL constructs that we use to describe the meaning of data. Key topics and learning points will be:
- An ontology is a model of subject matter that you care about, represented as triples.
- Populating the ontology as triples using TARQL, R2RML and SHACL
- The ontology is used as a schema that gives data meaning.
- Building a semantic application using SPARQL.
We close the loop by again considering how ontology and semantic technology can help data scientists, and what next steps they may wish to take to learn more.

Introduction to Taxonomies for Data Scientists - 90 minute tutorial

Heather Hedden - (Semantic Web Company)

This tutorial/workshop teaches the fundamentals and best practices for creating quality taxonomies, whether for the enterprise or for specific knowledge bases in any industry. Emphasis is on serving users rather than on theory. Topics to be covered include: the appropriateness of different kinds of knowledge organization systems (taxonomies, thesauri, ontologies, etc.), standards, taxonomy concept creation and labeling, taxonomy relationship creation. The connection of taxonomies to ontologies and knowledge graphs will also be briefly discussed. There will be some interactive activities and hands-on exercises. This session will cover:
• Introduction to taxonomies and their relevance to data
• Comparisons of taxonomies and knowledge organization system types
• Standards for taxonomies and knowledge organization systems
• Taxonomy concept creation
• Preferred and alternative label creation
• Taxonomy relationship creation
• Taxonomy relationships to ontologies and knowledge graphs
• Best practices and taxonomy management software use