The Data Day Texas 2022 Sessions

We still have discount rooms at the AT&T. If you are coming from out of town, this is where all the action is. For the best selection, book a room now.

We are continuing to confirm speakers and add sessions.

The Data Practitioner's Guide to Metadata

Shirshanka Das

Ever wonder what the secret is behind the productivity of data scientists and engineers at data-driven companies like LinkedIn, Airbnb, and others? The answer lies in metadata! In this session, Shirshanka Das walks you through the industry's #1 open-source data catalog, DataHub. We will share how DataHub stitches together metadata from tools like dbt, Airflow, Spark, Looker, and many others to power metadata-driven data discovery, governance, and observability at companies such as LinkedIn, Expedia, Peloton, Optum, and Klarna. Dive under the hood of DataHub’s architecture and learn about approaches to automation that are being implemented in the 3K+ strong DataHub Community to create new capabilities for metadata-driven data management.
Requirements: A general understanding of the tools that comprise the modern data stack and familiarity with the open source data catalog ecosystem.
Takeaways: Metadata Management needs to be addressed holistically to create a well governed and productive data practice. There are open source implementations out there that data practitioners can apply in their daily workflows to start tackling this problem today.

Unlocking time-based Machine Learning with a new paradigm for engineering features from event-based data

Ryan Michael

Understanding behavior allows us to anticipate outcomes we care about, for example, customer churn or mismatches between inventory and demand. Accurately anticipating an outcome enables us to intervene and improve the result. Unfortunately, traditional data-processing tools have a hard time describing behavior. These tools focus on efficiently computing the current answer to a query, but understanding behavior depends on the story of how these answers change over time.
A new computational paradigm can uncover the temporal aspect of event-based data processing, exposing stories of cause and effect. This talk introduces a set of abstractions focused on manipulating values as timelines. By preserving the story of how a computation changes over time, a unique new set of operations becomes possible, providing a more intuitive understanding of behavior. These operations unlock the ability to build & deploy behavior ML models in a fraction of the time required by traditional data processing tools by aligning the nature of a problem with the abstractions used to solve it.
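The timeline idea above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the talk's actual abstractions: a `Timeline` keeps every historical value rather than only the latest one, so you can ask what the answer was at any point in time.

```python
from bisect import bisect_right

class Timeline:
    """A value that remembers its full history, not just its latest state."""
    def __init__(self):
        self.times, self.values = [], []

    def record(self, t, value):
        # Events must be recorded in increasing time order for lookups to work.
        self.times.append(t)
        self.values.append(value)

    def at(self, t):
        """The value as it was at time t (None before the first event)."""
        i = bisect_right(self.times, t)
        return self.values[i - 1] if i else None

# Hypothetical running purchase count for one customer over time.
purchases = Timeline()
for t, total in [(1, 1), (5, 2), (9, 3)]:
    purchases.record(t, total)

print(purchases.at(6))    # the story mid-way: 2 purchases
print(purchases.at(100))  # the current answer: 3 purchases
```

A traditional query engine would only return the final value; keeping the whole timeline is what lets you ask "how did this change?" rather than "what is this now?".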

Data Professional's Career: Techniques to Practice Rigor and Avoid Ten Mistakes

Jike Chong / Yue Cathy Chang

Trust is precious to build for all data science work. Much of the trust in data and modeling comes from the rigor with which data professionals treat the subject. What are some aspects of rigor required throughout a data professional's career?
One aspect of rigor is to detect anti-patterns in your work. Anti-patterns are undesirable data science practices that increase the risk of failure. Just as a chess master can observe a chessboard and articulate which side may be winning, you can detect anti-patterns in your projects and work before they cause irreparable damage.
This talk highlights the top 10 areas in which to practice rigor and mitigate anti-patterns, focusing on four career development stages: team lead, team manager, function director, and executive, with scopes of influence ranging from within-team to industry-wide.
Requirements: Some experience as a data professional.
Takeaways: Recognize and detect anti-patterns in your projects and work before they cause irreparable damage.

Introduction to Vector Search Engines

Bob Van Luijt

For years, relational databases and full-text search engines have been the foundation of information retrieval in modern IT systems. With these, you add tags or category keywords such as "movie", "music", or "actor" to each piece of content (image or text) or each entity (a product, user, IoT device, etc). You’d then add those records to a database, so you can perform searches against those tags or keywords.
In contrast, vector search uses vectors (where each vector is a list of numbers) for representing and searching content. The combination of the numbers defines similarity to specific topics. 
With keyword search, you can only specify a binary choice as an attribute of each piece of content; it's either about a movie or not, either music or not, and so on. Also, if you specify the keyword "films", for example, you would not see any content related to "movies" unless a synonyms dictionary explicitly linked these two terms in the database or search engine.
Vector search provides a much more refined way to find content, with subtle nuances and meanings. Vectors can represent a subset of content that contains "much about actors, some about movies, and a little about music". Vectors can represent the meaning of content where “films”, “movies”, and “cinema” are all collected together. Vectors also have the flexibility to represent categories previously unknown to or undefined by service providers.
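As a minimal sketch of the idea (using toy hand-made vectors rather than a real embedding model or Weaviate itself), ranking by cosine similarity is what lets "cinema" and "film" content surface for a "movies" query without any synonyms dictionary:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors with hypothetical topic axes: (actors, movies, music).
docs = {
    "cinema review":  [0.2, 0.9, 0.1],
    "film festival":  [0.3, 0.8, 0.0],
    "concert recap":  [0.0, 0.1, 0.9],
}
query = [0.1, 0.9, 0.0]  # a query that is "mostly about movies"

ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # the two movie-related documents rank above the music one
```

In a real system the vectors would come from a trained model and the search would use an approximate nearest-neighbor index rather than a full sort, but the similarity principle is the same.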
Bob Van Luijt, creator of the Weaviate vector search engine, will take you through a tour of what you can do when searching with vectors. At the end of this session, you will be ready to build your first vector-based applications.

Approaches to Modern Data Governance

Joey Jablonski - (Pythian)

The common model of people-process-technology is an out-of-date oversimplification of the complexities that modern data governance organizations face when ensuring the protection and appropriate use of data in complex organizations. Today's data governance teams must ensure proper access to data supporting both internal analysis and external consumer experiences. They must do this while balancing the need to protect consumer privacy, meet legal obligations, and ensure employees can effectively trial new products and capabilities. Most activities of a data governance team focus on how best to execute a balanced approach between data access and privacy. This balance of privacy and access is further challenged by the regular use and joining of both first- and third-party data for decision making and product creation. Organizations must work with a growing set of complexities about what is allowed for data that is generated and transformed at multiple stages of the value chain.
This talk will focus on how the structure of a modern data governance organization becomes an enabling factor for engineering teams, compliance teams, and information security organizations. We will focus on how data governance teams can operate as a facilitator, balancing the requirements of consumer privacy with the needs of data access for new product creation, while discussing the technology elements that become the foundation of successful data governance programs.

RLlib for Deep Hierarchical Multiagent Reinforcement Learning

90 minute hands-on tutorial
Jonathan Mugan - (DeUmbra)

Reinforcement learning (RL) is an effective method for solving problems that require agents to learn the best way to act in complex environments. RLlib is a powerful tool for applying reinforcement learning to problems where there are multiple agents or when agents must take on multiple roles. There exist many resources for learning about RLlib from a theoretical or academic perspective, but there is a lack of materials for learning how to use RLlib to solve your own practical problems. This tutorial helps to fill that gap. We show you how to apply reinforcement learning to your problem by taking you through a custom environment and demonstrating how to apply RLlib to that environment. We make the code available on GitHub.
Requirements: ability to code in Python.

A 90 minute hands-on GNN Primer

Everything you wanted to know about Graph AI (but were too afraid to ask)
Alexander Morrise - (Graphistry)

The utility of Graph Neural Nets (GNNs) is growing day by day. By bringing the power of graph thinking to the already-amazing world of neural networks, GNNs are achieving better results than graph analytics or neural nets have on their own. In some cases, GNNs even make it possible to answer new kinds of questions about nodes and edges. Our workshop walks through how you can approach automatically turning your typical datasets into GNNs ready for a variety of useful visual and analytic insights. Graph AI thinking topics will include matching different tasks to different GNNs, how to choose node and edge feature representations, and top examples of applying graph thinking to domains like social & behavioral data, event data, natural language, supply chain, and more. For technologies, we will focus on the popular path of using the open source PyData GPU ecosystem and modern data stack, including PyTorch, DGL, RAPIDS (cuGraph/cuDF), Arrow, and PyGraphistry[AI] from Jupyter/Streamlit. We will emphasize workflows that automate handling large and heterogeneous data. At the end, attendees should feel ready to, in just 1-2 lines of code, go from any data source like CSVs, Databricks, SQL, logs, and graph databases to powerful graph AI visualizations and models.

Introduction to Codeless Deep Learning

90 minute hands-on tutorial
Satoru Hayasaka - (KNIME)

Deep learning is used successfully in many data science applications, such as image processing, text processing, and fraud detection. KNIME offers an integration to the Keras libraries for deep learning, combining the codeless ease of use of KNIME Analytics Platform with the extensive coverage of deep learning paradigms by the Keras libraries. Though codeless, implementing deep learning networks still requires orientation to distinguish between the learning paradigms, the feedforward multilayer architectures, the networks for sequential data, the encoding required for text data, the convolutional layers for image data, and so on. This course will provide you with the basic orientation in the world of deep learning and the skills to assemble, train, and apply different deep learning modules.

Panel: How do you scale a knowledge graph when there is no consensus?

Brandon Baylor - (Chevron) / James Hansen - (Chevron) / Josh Shinavier - (Linkedin) / Ryan Wisnesky - (Conexus) / Michael Uschold - (Semantic Arts)

Our friends at Chevron posed the following problem: How do you scale a knowledge graph when there is no consensus? The problem takes many forms. How do you aggregate a bunch of silos when there is no agreement about things and their attributes? The panelists will discuss various approaches to solving this problem.

Building a Modern Data Platform Using Advanced Redshift Topologies

Elliott Cordo - (Capsule)

Over the past few years there have been amazing advancements in technology and data architecture, complemented by recent feature releases for the AWS Redshift data warehouse platform. Features such as Data Sharing, Redshift Serverless, and Redshift Spectrum have unlocked our ability to create flexible, performant, and dynamic data platforms. In this talk we will first explore overall data platform design considerations and common capabilities. We will then step through the different layers of the data platform, from ingest to data processing to serving, walking through architectural patterns leveraging AWS services. We will then share an architectural walkthrough of Capsule, the pharmacy that works for everyone, and how we’ve set out to build a data platform that works for everyone. We will share the "why" behind our data platform, as well as the technical details on the platform capabilities, technologies, and architecture that enabled this.

A Comparison of Deep Learning Versus Parametric Time Series Models

Bivin Sadler - (Southern Methodist University)

Once the shiny new object in the room, “Deep Learning” has realized and even extended much of its promise! It has become a major and increasingly trusted workhorse in deployed industrial and scientific machine learning models. It is no secret that deep learning models can leverage non-linear relationships, but what does that mean with respect to data collected over time? A brief introduction to some of the most popular parametric (AR/ARIMA) and deep learning (RNN / LSTM / Bidirectional LSTM) models will be provided, as well as a comparison between them with respect to both linear and non-linear time series. This talk has something for everyone with an interest in data collected over time!
Topics Include:
Introduction to AR and ARIMA models
Introduction to LSTM and Bidirectional LSTM (BiLSTM) models
Explanation of Linear and Non-Linear Time Series
Comparison between AR/ARIMA and LSTM / BiLSTM models for both linear and nonlinear time series with simulated and real-world data.

What is Data Observability, and why should data teams consider it?

Andy Petrella - Kensu

In this talk, I’ll review the challenges faced by data teams related to the management of data in production, especially those associated with data quality issues. I’ll highlight why classical data quality approaches are no longer sufficient or suited to this era of data, where data teams are growing rapidly to sizes we have never witnessed. Similar management and operations challenges have already been encountered in IT, which led to the development of the DevOps culture, in which observability plays a big part alongside automation and decentralization. So, I’ll introduce observability with a laser focus on data, its dependencies on other areas, and how it can be introduced into data teams' (DataOps) culture to help them detect, resolve, and prevent data issues.

What is Graph Intelligence

Leo Meyerovich - Graphistry

Graph intelligence is going through a once-in-a-generation watershed moment. From fraud detection to supply chain analysis to user analytics, graph neural networks (GNNs) are replacing previously popular techniques due to their significant lift in results. This talk shares our studies of how modern graph intelligence has been transitioning from AI research to industry, and the rapidly mounting implications we found for data scientists, data engineers, and leaders. On the theoretical side, we overview the key concepts that have grown GNNs from an academic niche to state-of-the-art for long-standing scientific challenges, and have grown GNNs far past use cases having to do with the traditional graph database market. For practitioners, we first map the emerging use cases we’re finding in top companies. One key disrupter in their graph data projects has been the modern data stack: it drastically improves core areas like implementation time, and by doing so, explains a rethinking of the graph stack to support graph intelligence. Finally, as graph intelligence is beginning its democratization phase, we discuss what is happening in key areas like making graph AI automatic, explainable, and visual.

It was the best of Graph, it was the worst of Graph - Choosing between Graph ML and Graph Analytics

Jörg Schad - ArangoDB

Graph Analytics has long demonstrated that it solves real-world problems, including Fraud, Ranking, Recommendation, text summarization, and other NLP tasks.
More recently, Graph Machine Learning, applied directly to graphs using graph algorithms and machine learning, has demonstrated significant advantages in solving the same problems as graph analytics, as well as problems that are impractical to solve using graph analytics. Graph Machine Learning does this by training statistical models on the graph, resulting in graph embeddings and graph neural networks that are used to solve complex problems in a different way.
In this talk, we will compare and contrast these two approaches (spoiler: often complexity vs precision) in real-world scenarios. What factors should you consider when choosing one over the other, and when do you even have a choice? Join this talk to learn about exciting new developments in Graph ML, and especially when not to use them.

What's up with upper ontologies?

Boroslav Iordanov - Semantic Arts

Upper ontologies are domain-agnostic, highly abstract models of the world that offer a starting point for building knowledge graphs, and in particular enterprise knowledge graphs (EKGs). We will look at several well-established upper ontologies, their history, underlying philosophy, and applicability in an enterprise context. Not unlike classical software application frameworks, upper ontologies try to capture commonality and offer modeling patterns that capture best practices. As such, they facilitate modularity and reuse when building large knowledge graphs spanning multiple domains. In addition, upper ontologies make it possible for users of EKGs to operate at different levels of abstraction when “talking” to the graph, a highly desirable feature in a complex system that only a well-crafted semantic model can provide. While most upper ontologies aim to achieve semantic interoperability at internet scale, motivated by the original semantic web vision, some are purely academic in nature and others deeply rooted in the enterprise world. The Gist upper ontology, initiated by Semantic Arts, is one such ontology; it grew iteratively by answering practical enterprise needs, and it will receive special treatment here. Hopefully, at the end of this talk you will be in a better position to answer questions like “should I use an upper ontology?”, “which one?”, or “should I create my own?”

Transpilers Gone Wild - announcing Hydra

Josh Shinavier - (LinkedIn)

If you have ever built an enterprise knowledge graph, you know that heterogeneity comes at a cost. The more complex the interfaces to the graph become – more domain data models, more data representation languages and data exchange formats, more programming languages in which applications and ETL code are written – the more time is spent on mappings, and the harder it becomes to keep these mappings in a consistent state. At the same time, support for heterogeneity is often what motivates us to build a graph in the first place. In a previous Data Day talk, A Graph is a Graph is a Graph, I talked about a generic approach for reconciling graph and non-graph data models. The approach was later formalized as Algebraic Property Graphs and implemented in a proprietary tool which I was ultimately not permitted to release as open source software. This time around, I would like to introduce you to a new, open-source project called Hydra which expands the scope of the problem from defining composable transformations for data and schemas, to also porting those transformations between concrete programming languages, encapsulating them in developer-friendly DSLs. Learn to love typed lambda calculi, and see how weird and wonderful things get when a transformation library starts transforming itself.

Shortcut MLOps with In-Database Machine Learning

Paige Roberts - Vertica

MLOps has rocketed to prominence based on one clear problem: many machine learning projects never make it into production. According to a recent survey by Algorithmia, 35% of even the small percentage of projects that do make it take between one month and a year to be put to work. Since data science is a cost center for organizations until those models are deployed, shortening, organizing, and streamlining the process from ideation to production is essential.
Data science is not simply for the broadening of human knowledge, data science teams get paid to find ways to shave costs and boost revenues. That can mean preventative maintenance that keeps machines on line, churn reduction, customer experience improvements, targeted marketing that earns and keeps good customers, fraud prevention or cybersecurity that keeps assets safe and prevents loss, or AIOps that optimizes IT to get maximum capabilities for minimum costs.
To get those benefits, do you need to add yet another piece of technology to already bloated stacks? There may be a way for organizations to get machine learning into production faster with something nearly every company already has: a good analytics database.
Learn how to:
• Enable data science teams to use their preferred tools – Python, R, Jupyter – on multi-terabyte data sets
• Provide dozens of data types and formats at high scale to data science teams, without duplicating data pipeline efforts
• Make new machine learning projects just as straightforward as enabling BI teams to create a new dashboard
• Get machine learning projects from finished model to production money-maker in minutes, not months

What is a metadata platform and why do you need it?

Shirshanka Das - Acryl Data

Ever wonder what the secret is behind the legendary data-driven cultures of companies like LinkedIn, Airbnb, and others? The answer is metadata!
In this session, Shirshanka Das presents the multiple ways in which metadata platforms shaped the data ecosystems at these companies, through his own journey in creating the open-source DataHub project. This project is now in wide use at companies like LinkedIn, Expedia, Saxo Bank, Peloton, Klarna, Wolt, and many others, and is part of the critical day-to-day workflows of thousands of data professionals around the world.
We will deep dive into the product and architecture of DataHub, and discuss the foundations of a modern metadata platform, which includes capabilities like streaming event processing, schema-first extensible modeling, time-series metadata, and graph storage. These features allow multiple use cases like discovery, observability, and automated governance to be implemented without having to rebuild the metadata storage, indexing, and retrieval layer multiple times.
The talk will conclude with a short demonstration of how you can get started with DataHub and accomplish interesting things within 10 minutes or less.

Ontology for Data Scientists
90 minute workshop

Michael Uschold - Semantic Arts

We start with an interactive discussion to identify the main things that data scientists do, why they do them, and what some key challenges are. We give a brief overview of ontology and semantic technology with the goal of identifying how and where it may be useful for data scientists.
The main part of the tutorial gives a deeper understanding of what ontologies are and how they are used. This technology grew out of core AI research in the 70s and 80s. It was formalized and standardized in the 00s by the W3C under the rubric of the Semantic Web. We introduce the following foundational concepts for building an ontology in OWL, the W3C standard language for representing ontologies.
- Individual things are OWL individuals - e.g., JaneDoe
- Kinds of things are OWL classes - e.g., Organization
- Kinds of relationships are OWL properties - e.g., worksFor
Through interactive sessions, participants will identify what the key things are in everyday subjects and how they are related to each other. We will start to build an ontology in healthcare, using this as a driver to introduce key OWL constructs that we use to describe the meaning of data. Key topics and learning points will be:
- An ontology is a model of subject matter that you care about, represented as triples.
- Populating the ontology as triples using TARQL, R2RML and SHACL
- The ontology is used as a schema that gives data meaning.
- Building a semantic application using SPARQL.
We close the loop by again considering how ontology and semantic technology can help data scientists, and what next steps they may wish to take to learn more.
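The individual/class/property distinction above can be sketched for data scientists in plain Python, with triples as tuples. The names (`JaneDoe`, `Acme`, `worksFor`) echo the tutorial's examples and are purely illustrative; a real system would use rdflib or a triple store rather than a list:

```python
# Hypothetical triples illustrating the three OWL building blocks:
# an individual (JaneDoe), a class (Organization), and a property (worksFor).
triples = [
    ("JaneDoe", "rdf:type", "Person"),       # an individual belongs to a class
    ("Acme", "rdf:type", "Organization"),
    ("JaneDoe", "worksFor", "Acme"),         # a property relates two individuals
]

def objects(subject, predicate):
    """A tiny SPARQL-like query: what does `subject` point to via `predicate`?"""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("JaneDoe", "worksFor"))  # ['Acme']
```

The point is that the same subject-predicate-object shape carries both the schema (class membership) and the data (relationships), which is exactly what lets an ontology give data its meaning.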

Fighting COVID-19 using Knowledge Graphs

Ying Ding - University of Texas

The COVID-19 pandemic has caused the tragic loss of human lives and the worst economic downturn since the Great Depression. Scientific discovery heavily depends on the knowledge accumulated by peer scientists. There are over 500,000 COVID-19-related articles published, but nobody can read them all. This talk highlights the importance of knowledge graphs and how they can help scientists and the public in the fight against COVID-19. It covers topics ranging from building a COVID-19 knowledge graph based on scientific publications to facilitate drug discovery, to using contrastive deep learning to enhance risk prediction for COVID patients, to investigating parachuting collaboration and research novelty in COVID-19-related research activities.

Business Transformers - Leveraging Transfer Learning for B2B Insights

Paul Azunre - Dun and Bradstreet

Pretrained transformer-based neural network models, such as BERT, GPT-3 & XLM have rapidly advanced the state of natural language processing (NLP). Instead of training models from scratch, practitioners can now download general pretrained models and quickly adapt them to new scenarios and tasks with relatively little training data. This has led to rapid advances in various applications, from medicine to natural language generation and chatbots. Paul will overview related recent advances and concepts, and discuss some ways in which these tools can be applied to extract B2B Insights – from detecting and measuring the various stages of the customer buying journey to extending any such analysis capability from English to multiple other languages.

For the overwhelmed data professionals: What to do when there is so much to do?

Jike Chong / Yue Cathy Chang

95% of the companies with data science teams have teams of fewer than ten members. In a nimble team with a wide range of possibilities to make business impacts with data science, how do you explore and prioritize the opportunities? How do you ensure that there are sponsors and champions for your project? How do you set realistic expectations with business partners for project success? If you are leading a team, how can you delegate your projects effectively?
In this talk, the presenters will share three techniques for project prioritization, two roles (sponsor and champions) to identify, four levels of confidence (for predictive models) to specify project success, and discuss best practices for delegating work as a team lead/manager.

A gentle introduction to using graph neural networks on knowledge graphs

Dave Bechberger - Amazon Web Services

Knowledge graphs are a hot topic in the enterprise data space but they often suffer from a lack of rich connections within the data. Graph neural networks (GNNs) can help fill this gap by using the structure of connections within the data to predict new connections. These two seem like a natural partnership, however understanding what each of these is and how to leverage them together is a challenge.
In this talk we’ll provide a gentle introduction to the concepts of knowledge graphs and graph neural networks. We will discuss what a knowledge graph is, why you might want to use one, and some of the challenges you will face. We will then discuss key concepts of graph-based machine learning, including an introduction to how they work, what use cases they solve, and how they can be leveraged within knowledge graphs. Finally, we’ll walk through a few demonstrations to show the power of leveraging knowledge graphs and GNNs together.


What Happens Next? Event Predictions with Machine Learning and Graph Neural Networks

Jans Aasman - Franz Inc.

Enterprises are subscribed to the power of modeling data as a graph and the importance of using Knowledge Graphs for customer 360 and beyond. The ability to explain the results of AI models, and produce consistent results from them, involves modeling real-world events with the adaptive schema consistently provided via Knowledge Graphs.
Probably the most important reason for building Knowledge Graphs has been to answer the age-old question: “What is going to happen next?” Given the data, relationships, and timelines we know about a customer, patient, product, etc. (“The Entity of Interest”), how can we confidently predict the most likely next event?
For example, in healthcare, what is the outcome for this patient given the sequence of previous diseases, medications, and procedures? For manufacturers, what is going to require repair next in this aircraft, or at some other point in the supply chain?
Machine Learning and more recently, Graph Neural Networks (GNNs) have emerged as a mature AI approach used by companies for Knowledge Graph enrichment. GNNs enhance neural network methods by processing graph data through rounds of message passing, as such, the nodes know more about their own features as well as neighbor nodes. This creates an even more accurate representation of the entire graph network.
In this presentation we describe how to use graph embeddings and regular recurrent neural networks to predict events via Graph Neural Networks. We will also demonstrate creating a GNN in the context of a Knowledge Graph for building event predictions.
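The message-passing idea described above can be sketched in a few lines of dependency-free Python. This is a toy illustration of one round of neighborhood aggregation, not the presenters' actual GNN: each node's new feature blends its own feature with those of its neighbors, so nodes come to "know more about" their neighborhood.

```python
# A tiny undirected graph and one scalar feature per node (both hypothetical).
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
features = {"a": 1.0, "b": 3.0, "c": 5.0}

def message_passing_round(graph, features):
    """One round: each node averages its own feature with incoming messages."""
    updated = {}
    for node, neighbors in graph.items():
        incoming = [features[n] for n in neighbors]   # messages from neighbors
        updated[node] = (features[node] + sum(incoming)) / (1 + len(incoming))
    return updated

features = message_passing_round(graph, features)
print(features)  # each node's feature now reflects its neighborhood
```

A real GNN would use learned weight matrices and nonlinearities in place of the plain average, and would stack several such rounds so information propagates across multi-hop neighborhoods.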

Introduction to Taxonomies for Data Scientists (tutorial)

Heather Hedden - Semantic Web Company

This tutorial/workshop teaches the fundamentals and best practices for creating quality taxonomies, whether for the enterprise or for specific knowledge bases in any industry. Emphasis is on serving users rather than on theory. Topics to be covered include: the appropriateness of different kinds of knowledge organization systems (taxonomies, thesauri, ontologies, etc.), standards, taxonomy concept creation and labeling, taxonomy relationship creation. The connection of taxonomies to ontologies and knowledge graphs will also be briefly discussed. There will be some interactive activities and hands-on exercises. This session will cover:
Introduction to taxonomies and their relevance to data
• Comparisons of taxonomies and knowledge organization system types
• Standards for taxonomies and knowledge organization systems
• Taxonomy concept creation
• Preferred and alternative label creation
• Taxonomy relationship creation
• Taxonomy relationships to ontologies and knowledge graphs
• Best practices and taxonomy management software use

The Future of Taxonomies - Linking data to knowledge

Heather Hedden - Semantic Web Company

Taxonomies are no longer just for navigation or tagging documents. AI technologies, such as machine learning and NLP, are now being used in combination with taxonomies rather than merely in place of them. The results include applications in semantic search, personalization, recommendation, and question answering. Combining taxonomies with ontologies and linked instance data supports more powerful search and analytics across all kinds of data, structured and unstructured, not just documents. Topics to be discussed include:
- Trends in the uses of taxonomies (industries, applications, implementation, and management) and the benefits of taxonomies
- How knowledge-based recommendation and personalization systems are built
- How NLP and taxonomies support each other
- How taxonomies contribute to knowledge graphs
- Semantic Web standards and their benefits for taxonomies and other knowledge organization systems

Visual timeline analytics: applying concepts from graph theory to timeline and time series data

Corey Lanum - Cambridge Intelligence

Timelines are one of the most basic forms of data visualization, and plotting events on a timeline is usually a straightforward process. But what happens when we have connections in that data? Time-based data often takes on the properties of both time series and graph data sets, but the traditional visualizations for each are lacking. Time series visualizations like a stock ticker chart can show how a variable changes over time, but not how data elements are connected to one another. And node-link charts can show relationships, but not how they may be changing over time. This talk introduces the new concept of visual timeline analysis, a kind of visualization designed to untangle complex time-based connected data. Corey will share his experience helping organizations harness visual timelines in their applications, taking inspiration from the graph world. He’ll show some examples of how a timeline visualization can show connected data over time, and how looking at the data this way can produce some really interesting insights.

Unstructured Data Management: It's not just for your documents

Kirk Marple - Unstruk

For structured data, the ‘Modern Data Stack’ offers solutions for ETL, aggregation, visualization, and analytics. For unstructured data (i.e. images, audio, video, 3D, documents), there is no complementary ‘Modern Unstructured Data Stack’. Disparate solutions exist for geospatial data, document management, and visual intelligence, but none provides a holistic view across unstructured data types, correlates unstructured data with structured data, or describes relationships between real-world assets and unstructured data. In this talk, I will discuss the concept of a data platform for all your unstructured data, not just documents, and how ETL, data modeling, visualization, and analytics differ for unstructured data.

Using Reproducible Experiments To Create Better Models

Milecia McGregor - Iterative

It's easy to lose track of which changes gave you the best result when you start exploring multiple model architectures. Tracking the changes in your hyperparameter values, along with code and data changes, will help you build a more efficient model by giving you an exact reproduction of the conditions that made the model better.
In this talk, you will learn how to use the open-source tool DVC to increase reproducibility for two methods of tuning hyperparameters: grid search and random search. We'll go through a live demo of setting up and running grid search and random search experiments. By the end of the talk, you'll know how to add reproducibility to your existing projects.
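For readers unfamiliar with the two tuning strategies the talk covers, here is a minimal dependency-free sketch contrasting grid search and random search over a hyperparameter space. The `score` function is a stand-in for a real train/evaluate cycle, and the parameter names are illustrative.

```python
import itertools
import random

# Illustrative hyperparameter space.
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_estimators": [50, 100, 200],
}

def score(params):
    # Stand-in for training a model and measuring validation accuracy.
    return params["learning_rate"] * params["n_estimators"]

# Grid search: evaluate every combination (3 x 3 = 9 runs here).
grid = [dict(zip(space, values)) for values in itertools.product(*space.values())]
best_grid = max(grid, key=score)

# Random search: sample a fixed budget of combinations instead.
rng = random.Random(42)  # fixed seed, a first step toward reproducibility
samples = [{k: rng.choice(v) for k, v in space.items()} for _ in range(5)]
best_random = max(samples, key=score)

print(len(grid), best_grid)
# 9 {'learning_rate': 0.1, 'n_estimators': 200}
```

Tools like DVC go further by versioning the parameters, code, and data behind each run, so any experiment can be reproduced exactly.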

Connecting the Dots in a Million Node Graph

Thorsten Liebig - derivo

How can we gain meaningful insight into diverse and big data graphs? Which groups of nodes are connected to each other? This talk addresses these challenges and demonstrates SemSpect, a tool that uses visual aggregation to solve the hairball problem of graph visualization. The aggregation approach enables users to explore large knowledge graphs, from the meta level down to full node details. We will show how to define query-based node labels while browsing to refine the graph data schema. A reasonable schema is of great benefit, as it adds meaning to the data and enables higher query performance. The audience will learn about tooling to interactively discover the structure of large graph databases and identify data flaws, even without any prior idea of the data. We will demonstrate how to build sophisticated requests in a stepwise, data-driven manner without the need for query-language skills.

What is Truth? - Strategies for managing semantic triples in large complex systems

Ryan Mitchell - Gerson Lehrman Group

Knowledge graphs and their ontology-based database kin have exploded in popularity in recent years. Through cheap data sources and powerful inference engines, they can grow extremely quickly. But when bad data gets in, their judgments can be poisoned just as fast. We will discuss strategies for managing semantic triples and assessing their “truthiness” in large complex systems with many sources of data, machine learning algorithms, and fact validation methods.
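As a toy illustration of the problem space (not GLG's actual system), one common strategy is to attach provenance and confidence to each triple and resolve conflicts by trusting the highest-confidence source. The sources and confidence values below are hypothetical.

```python
from collections import defaultdict

# (subject, predicate) -> list of (object, source, confidence) assertions.
assertions = defaultdict(list)

def assert_triple(s, p, o, source, confidence):
    """Record a triple along with where it came from and how much
    we trust that source."""
    assertions[(s, p)].append((o, source, confidence))

def resolve(s, p):
    """Return the most trusted object for a (subject, predicate) pair,
    or None if nothing has been asserted."""
    candidates = assertions.get((s, p), [])
    if not candidates:
        return None
    obj, source, conf = max(candidates, key=lambda t: t[2])
    return obj

# Two sources disagree about a (hypothetical) company's headquarters.
assert_triple("AcmeCorp", "headquarteredIn", "Austin", "web_scrape", 0.4)
assert_triple("AcmeCorp", "headquarteredIn", "Dallas", "sec_filing", 0.9)
print(resolve("AcmeCorp", "headquarteredIn"))  # Dallas
```

Real systems layer on richer signals, such as source agreement, recency, and fact-validation models, but the core idea of scoring competing assertions is the same.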

A Path to Strong AI

Jonathan Mugan - DeUmbra

We need strong artificial intelligence (AI) so it can help us understand the nature of the universe to satiate our curiosity, devise cures for diseases to ease our suffering, and expand to other star systems to ensure our survival. To do all this, AI must be able to learn condensed representations of the environment in the form of models that enable it to recognize entities, infer missing information, and predict events. And because the universe is of almost infinite complexity, AI must be able to compose these models dynamically to generate combinatorial representations to match this complexity. This talk will cover why models are needed for robust intelligence and why an intelligent agent must be able to dynamically compose those models. It will also cover the current state of the art in AI model building, discussing both the strengths and weaknesses of neural networks and probabilistic programming. The talk will cover how we can train an AI to build models that will enable it to be robust, and it will conclude with how we can effectively evaluate AI using technology from computer-animated movies and video games.
This talk is based on an article Jonathan will be publishing in The Gradient.

History of Network Science - A Look at How Networks Have Connected Us

Sean Robinson - Graphable

The use of graph analytics and network science has been growing steadily for several years. Today, top companies offer new ways to leverage the interconnected power of networks to answer the increasingly complex and subtle questions businesses are asking of their data. However, using networks to model data is a technique that has developed over hundreds of years. This talk takes viewers through a timeline of network science, exploring how people have taken on novel real-world challenges and used networks to solve them. Beginning in 1736 with the Seven Bridges of Königsberg and continuing through to today's innovations in network science, we will explore the challenges that drove early innovators to rely on the power of networks. Along the way, we can both understand the groundwork that led to the innovations we see arising today and gain insight into how those innovations can shape our own solutions to everyday data problems.

Outrageous ideas for Graph Databases

Max De Marzi - Amazon Web Services

Almost every graph database vendor raised money in 2021. I am glad they did, because they are going to need it. Our current graph databases are terrible and need a lot of work. There, I said it. It's the ugly truth in our little niche industry. That's why, despite waiting over a decade for the "Year of the Graph" to come, we still haven't set the world on fire. Graph databases can be painfully slow, they can't handle non-graph workloads, their APIs are clunky, and their query languages are either hard to learn or hard to scale. Most graph projects require expert shepherding to succeed: 80% of the work takes 20% of the time, but that last 20% takes forever. The graph database vendors optimize for new users, not grizzled veterans. They optimize for sales, not solutions. Come listen to a rant by an industry OG on where we could go from here if we took the time to listen to the users who haven't given up on us yet.

This Dashboard Should Have Been a Meeting

Michael Zelenetz - PEAK6

How many times has your team spent weeks or months developing a dashboard that ultimately serves almost no users? This talk will present guidelines for making your data products successful: what should go in a dashboard and what should not, how to design your dashboard for maximum impact, and, most importantly, when you should not start building a dashboard in the first place.

Graph Neural Networks with PyTorch Geometric and ArangoDB

Sachin Sharma - ArangoDB

Convolutional neural networks are a well-known method for handling Euclidean data structures (like images, text, etc.). In the real world, however, we are also surrounded by non-Euclidean data structures such as graphs, and the machine learning method for handling this type of data domain is known as the graph neural network. In this workshop, we will first dive deep into the concepts of graph neural networks and their applications (Part 1), and then, during the hands-on session (Part 2), we will build a graph neural network application with PyTorch Geometric and ArangoDB.
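To preview the core idea behind the workshop, here is a dependency-free sketch of the neighborhood-aggregation step at the heart of a graph neural network layer. Real layers (e.g. in PyTorch Geometric) add learned weight matrices and nonlinearities; this toy version simply averages each node's feature with its neighbors' features.

```python
# Adjacency list for a small undirected graph: node 0 is connected
# to nodes 1 and 2. Illustrative data only.
edges = {0: [1, 2], 1: [0], 2: [0]}

# One scalar feature per node (real GNNs use feature vectors).
features = {0: 1.0, 1: 2.0, 2: 4.0}

def aggregate(features, edges):
    """One round of message passing: each node's new feature is the
    mean of its own feature and its neighbors' features."""
    new_features = {}
    for node, neighbors in edges.items():
        values = [features[node]] + [features[n] for n in neighbors]
        new_features[node] = sum(values) / len(values)
    return new_features

print(aggregate(features, edges))
# {0: 2.333..., 1: 1.5, 2: 2.5}
```

Stacking several such rounds lets information flow across multi-hop neighborhoods, which is what gives GNNs their expressive power on non-Euclidean data.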

Machine Learning, Semantics, and Knowledge Graphs

Arthur Keen - ArangoDB

This talk tells the story of lessons learned on a journey that started with the objective of building an application requiring the integration of a knowledge graph with multiple machine learning models, quickly ran into the hard reality of impedance mismatches between the two technologies, and addressed those differences using semantic models. Knowledge graphs and graph machine learning seem like a perfect match, though in practice there are subtle differences between the two domains that can cause friction. By analogy, the ideal knowledge graph is the perfect crystalline structure of a diamond, representing everything known about a domain in a logical way, whereas machine learning values the flexibility of models trained on a subset of the data that can stretch like rubber to accommodate data never encountered before. The knowledge graph and machine learning communities use different approaches to data transformation and problem solving. They have different key performance metrics. They use dissimilar graph structures and vocabularies. They use different forms of inference, and they deal with imprecision in different ways. The talk describes these differences and systematic ways to address them.

Katana Graph Technical Foundations

Chris Rossbach - Katana Graph

This talk will describe the technical foundations of Katana Graph's Graph Intelligence platform. We will cover the high-level goals and design principles behind the system and discuss how each of those principles translates into design choices that enable a graph computing platform to operate at scale across multiple domains of graph computing. The technical focus will be on how storage-layer design, flexible graph partitioning, and HPC-style computing techniques can be combined with a shared-log foundation to support a highly scalable and efficient graph computing platform.

Clinical trials exploration: surfacing a clinical application from a larger Bio-Pharma Knowledge Graph

David Hughes - Graphable

Clinical, proteomic, and pharma knowledge graphs are complex aggregations of constituent subgraphs. These linked graphs provide meaningful insights as a whole, but in many cases a single subgraph can independently prove to be a valuable asset. In this session, David will identify possible applications of the NLM's Clinical Trials resource as a standalone application. He will review how to query the API and how to populate and run ETL through tools like Hume Orchestra and Apache Hop. He will then explore how to create an application using Streamlit as a proof of concept, and discuss potential refinements.

Relational Graphs Part I: Business Models Become the Program

Amy Hodler / Michelle Yi - RelationalAI

The world is full of poorly tuned queries and badly designed data models that don’t accurately reflect the business. This hairball leads to excessive hardware and operational costs – not to mention a frustrating degree of customization required for each and every application. In this session, we’ll look at building data apps using composable blocks of data, context, and logic. You’ll learn about a new relational approach to graphs where your business model actually becomes the program to make knowledge reusable and reduce misalignment. We’ll look at modeling business logic in a graph to support multiple workloads from graph analytics and reasoning to ML and optimization over the same data instance. DBAs, data engineers, data architects, and developers will hear about extending their data modeling skills to model knowledge itself using the RelationalAI knowledge graph system and the Rel declarative modeling language.

Relational Graphs Part II: UN Demo of Composable Knowledge

Amy Hodler / Michelle Yi - RelationalAI

Creating apps to model complex and dynamic environments is challenging and often implies a long-tail maintenance burden. In this session, we'll look at how a relational approach to graphs can tackle one of the most complex situations to model: disasters in cities around the world. The UN Office for Information Technology has partnered with RelationalAI to support the analysis of ongoing stresses and sudden shocks on cities. We'll look at an automated system based on a knowledge graph to help boost city resilience in a way that scales deep domain knowledge to many cities. You'll learn how we codified domain knowledge from city resilience experts and ingested public data from satellite imagery, numerical data, policy documents, and more. We'll demo combining graph analytics, simulation, and reasoning (using neurosymbolic and probabilistic ML) to understand how disasters impact cities.