The Data Day Texas 2020 Sessions

We have many more sessions to announce for Data Day Texas 2020. Check back frequently for updates.

Managing your Kafka in an explosive growth environment

Alon Gavra - AppsFlyer

Frequently, Kafka is just a piece of the production stack that no one wants to touch, because it just works. Kafka sits at the core of AppsFlyer’s infrastructure, which processes billions of events daily.
Alon Gavra dives into how AppsFlyer built its microservices architecture with Kafka as its core piece to support 70B+ requests daily. With continuous growth, the company needed to “learn on the job” how to improve its Kafka architecture by moving to the producer-owner cluster model, breaking up its massive monolith clusters into smaller, more robust clusters, and migrating from an older version of Kafka with real-time production clients and data streams. Alon outlines best practices for leveraging Kafka’s in-memory capabilities and built-in partitioning, as well as some of the tweaks and stabilization mechanisms that enable real-time performance at web scale, alongside processes for continuous upgrades and deployments with end-to-end automation, in an environment of constant traffic growth.

Cost-Optimized Data Labeling Strategy

Jennifer Prendki - Alectio

Active Learning is one of the oldest topics when it comes to Human-in-the-Loop Machine Learning. And with more research than ever on the topic, we seem to be coming closer to a time when all ML projects will adopt some version of it as an alternative to the old brute-force supervised learning approach. That unfortunately doesn’t account for one major caveat: Active Learning essentially runs on the assumption that all provided labels are correct.
In this talk, Jennifer will discuss the tradeoffs between the size of a training set and the accuracy of the labeling process, and present a framework for smart labeling cost optimization that can simultaneously reduce labeling costs and diagnose the model.
This session is part of the Human in the Loop session track.

Practicing data science: A collection of case studies

Rosaria Silipo - KNIME

There are many delineations of data science projects: with or without labeled data; stopping at data wrangling or involving machine learning algorithms; predicting classes or predicting numbers; with unevenly distributed classes, with binary classes, or even with no examples of one of the classes; with structured data and with unstructured data; using past samples or just remaining in the present; with real-time or close-to-real-time execution requirements and with acceptably slower performances; showing the results in shiny reports or hiding the nitty-gritty behind a REST service; and—last but not least—with large budgets or no budget at all.
Rosaria Silipo discusses some of her past data science projects, showing what was possible and sharing the tricks used to solve their specific challenges. You’ll learn about demand prediction in energy, anomaly detection in the IoT, risk assessment in finance, the most common applications in customer intelligence, social media analysis, topic detection, sentiment analysis, fraud detection, bots, recommendation engines, and more.

Learning with limited labeled data

Shioulin Sam - Cloudera Fast Forward Labs

Being able to teach machines with examples is a powerful capability, but it hinges on the availability of vast amounts of data. The data not only needs to exist but has to be in a form that allows relationships between input features and output to be uncovered. Creating labels for each input feature fulfills this requirement, but is an expensive undertaking.
Classical approaches to this problem rely on human and machine collaboration. In these approaches, engineered heuristics are used to smartly select “best” instances of data to label in order to reduce cost. A human steps in to provide the label; the model then learns from this smaller labeled dataset. Recent advancements have made these approaches amenable to deep learning, enabling models to be built with limited labeled data.
Shioulin Sam explores algorithmic approaches that drive this capability and provides practical guidance for translating this capability into production. You’ll view a live demonstration to understand how and why these algorithms work.
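The abstract does not say which selection heuristic the talk covers. As a rough illustration only, the sketch below uses uncertainty sampling, one common active learning heuristic, to pick the items a model is least sure about. The function names, item ids, and probabilities are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` unlabeled items with the highest entropy.

    `predictions` maps a hypothetical item id to the model's class probabilities.
    """
    ranked = sorted(predictions, key=lambda item: entropy(predictions[item]), reverse=True)
    return ranked[:budget]


# Made-up model outputs for three unlabeled documents.
predictions = {
    "doc_a": [0.98, 0.02],  # confident prediction, low value to label
    "doc_b": [0.55, 0.45],  # near the decision boundary, high value to label
    "doc_c": [0.80, 0.20],
}
chosen = select_for_labeling(predictions, budget=2)
print(chosen)
```

A human would then label only the chosen items, and the model would be retrained on the enlarged labeled set before the next selection round.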

How to start your first computer vision project

Sanghamitra Deb - Chegg

Computer vision is becoming an integral part of machine learning pipelines underlying products that serve users with recommendations, ranking, search results, etc. This is particularly true in content-driven fields such as education, healthcare, and news, to mention a few. With 2.2 million subscribers and two hundred million content views, Chegg is a centralized hub where students come to get help with writing, science, and math through online tutoring, Chegg Study, flash cards, and other products. Students spend a large amount of time on their smartphones, and very often they will upload a photo related to the concept they are trying to learn. Disciplines such as Chemistry or Biology are dominated by diagrams, so students submit more images in these disciplines. These images can be of low quality and contain too many irrelevant details, which makes the images difficult to interpret and, in turn, makes helping the students more difficult.

Starting a computer vision project can be intimidating, and several questions come up when creating solutions with image data: Do third-party tools provide feasible solutions? How long does it take to train people in computer vision? Do all image-related problems require machine learning solutions? For machine learning solutions, what is the optimal way to collect data? Is hiring someone with computer vision experience required?

In this presentation I will talk about how to make a cold start in computer vision and explore the areas of (1) traditional image analysis techniques, (2) image classification, and (3) object detection. The goal is to detect low-quality images and create a cropper that removes irrelevant details from the image. In order to detect low-quality images we use properties of the image such as variance of the Laplacian and histogram analysis. These approaches are computationally simple and require very few resources to deploy. For building the cropper we use stacked deep learning models: the first model detects the source of the image, and then for each source we have a model that crops out unnecessary details in the image. The cropping task is challenging since the properties of the image outside the cropping boundary are not significantly different from the properties of the image inside the cropping boundary. The goal of these tasks is to make suggestions and give feedback to students so that they upload images of higher quality that are easy to process. I will present some results and discuss product applications of computer vision projects.
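
The variance-of-the-Laplacian check mentioned above can be sketched in a few lines. This is a minimal pure-Python illustration, not Chegg's implementation. It assumes a grayscale image given as a list of rows at least 3x3 in size, uses a simple 4-neighbour Laplacian, and applies a made-up threshold that would need tuning on real photos.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a grayscale image as a list of equal-length rows (at least 3x3).
    """
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def is_low_quality(gray, threshold=100.0):
    """Flag an image as blurry when it has little edge energy.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return laplacian_variance(gray) < threshold


# Synthetic test images: a featureless gray patch and a high-contrast checkerboard.
flat = [[128.0] * 16 for _ in range(16)]
sharp = [[255.0 * ((x + y) % 2) for x in range(16)] for y in range(16)]

print(is_low_quality(flat), is_low_quality(sharp))
```

A blurred photo has weak edges, so its Laplacian responses cluster near zero and the variance is small. A sharp photo produces large positive and negative responses and therefore a large variance.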

Adding a Machine to the Loop: What if the Loop began with Humans?

Brent Schneeman - Alegion

Machine-in-the-Loop processes blend human judgements with machine-learning power. Oftentimes, an ML project focuses on the ML power: train a model and deploy it. What if the initial state of the project is highly human-centric? How do you add ML to “human-is-the-loop” processes in a manner to improve the objective function(s) of those processes?
This talk describes a real-world human-centric process (data annotation) and how adding machine learning has improved the outcome of the process with respect to speed and quality. It will also provide examples of when the process was not improved and how the process is not a single loop but is, much like turtles, loops all the way down.
This session is part of the Human in the Loop session track.

mm-ADT: A Multi-Model Abstract Data Type

Marko Rodriguez - RRedux

mm-ADT™ is a distributed virtual machine capable of integrating a diverse collection of data processing technologies. This is made possible via three language-agnostic interfaces: language, process, and storage. When a technology implements a respective interface, the technology is considered mm-ADT compliant and is able to communicate with any other compliant technologies via the virtual machine. In this manner, query language developers can develop languages irrespective of the underlying storage system being manipulated. Processing engines can be programmed by any query language and executed over any storage system. Finally, data storage systems automatically support all mm-ADT compliant query languages and processors.
Follow mm-ADT™ on GitHub, Twitter, StackExchange, or join the Slack Channel.

A Brief History of Knowledge Graph's Main Ideas

Juan Sequeda - data.world

Knowledge Graphs can be considered to be fulfilling an early vision in Computer Science of creating intelligent systems that integrate knowledge and data at large scale. The term “Knowledge Graph” has rapidly gained popularity in academia and industry since Google popularized it in 2012. It is paramount to note that, regardless of the discussions on, and definitions of the term “Knowledge Graph”, it stems from scientific advancements in diverse research areas such as Semantic Web, Databases, Knowledge Representation and Reasoning, NLP, Machine Learning, among others.
The integration of ideas and techniques from such disparate disciplines gives richness to the notion of Knowledge Graph, but at the same time presents a challenge for practitioners and researchers: knowing how current advances develop from, and are rooted in, early techniques.
In this talk, Juan will provide a historical context on the roots of Knowledge Graphs grounded in the advancements of the computer science disciplines of Knowledge, Data and the combination thereof, starting from the 1950s.

Human Centered Machine Learning

Robert Munro - Author / Serial Founder

Most Machine Learning relies on intensive human feedback. Probably 90% of Machine Learning applications today are powered by Supervised Machine Learning, including autonomous vehicles, in-home devices, and every item you purchase on-line. These systems are powered by people. There are thousands of people fine-tuning the models offline by annotating new data. There are also thousands of people tuning the models online: you, the end user, interacting with your car, device or an on-line store. This talk will cover the ways in which you can design Machine Learning algorithms and Human-Computer Interaction strategies at the same time to get the most out of your Machine Learning applications in the real-world.
This session is part of the Human in the Loop session track.

Where’s my lookup table? Modeling relational data in a denormalized world

Rick Houlihan - Amazon Web Services

When Amazon decided to migrate thousands of application services to NoSQL, many of those services required complex relational models that could not be reduced to simple key-value access patterns. The most commonly documented use cases for NoSQL are simplistic, and there’s a large amount of irrelevant and even outright false information published regarding design patterns and best practices for NoSQL applications. For this migration to succeed, Amazon needed to redefine how NoSQL is applied to modern online transactional processing (OLTP) apps. NoSQL applications work best when access patterns are well defined, which means the sweet spot for a NoSQL database is OLTP applications. This is good because 90% of the apps that get written support a common business process which for all practical purposes is the definition of OLTP. One of the common steps in building an OLTP app is designing the entity relationship model (ERM) which essentially defines how the application uses and stores data. With a relational database management system- (RDBMS) backed application, the ERM was essentially mapped directly into the database schema by creating tables for the top-level entities and defining relationships between them as defined in the ERM. With NoSQL, the data is still relational, it just gets managed differently. Rick Houlihan breaks down complex applications and effectively denormalizes the ERM based on workflows and access patterns. He demonstrates how to apply the design patterns and best practices defined by the Amazon team responsible for migrating thousands of RDBMS based applications to NoSQL and when it makes sense to use them.
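
The abstract does not spell out the design patterns themselves. As a generic illustration of modeling by access pattern, the sketch below stores related entities under one partition key so that a single lookup answers "get a customer's orders". The `ItemStore` class, key formats, and data are hypothetical and do not represent any particular NoSQL product's API.

```python
from collections import defaultdict


class ItemStore:
    """Toy key-value store: items are grouped by partition key and addressed by sort key."""

    def __init__(self):
        self._partitions = defaultdict(dict)

    def put(self, pk, sk, item):
        self._partitions[pk][sk] = item

    def query(self, pk, sk_prefix=""):
        """Return items in one partition whose sort key starts with `sk_prefix`, in sort-key order."""
        partition = self._partitions.get(pk, {})
        return [partition[sk] for sk in sorted(partition) if sk.startswith(sk_prefix)]


store = ItemStore()
# The customer and its orders live in the same partition, so no join is needed.
store.put("CUSTOMER#42", "PROFILE", {"name": "Ada"})
store.put("CUSTOMER#42", "ORDER#2020-01-03", {"total": 30})
store.put("CUSTOMER#42", "ORDER#2020-01-17", {"total": 12})

orders = store.query("CUSTOMER#42", "ORDER#")
print(orders)
```

In a normalized relational schema the same question would join a customers table to an orders table. Here the relationship is encoded in the key design, which is chosen up front from the known access pattern.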

Information Extraction with Humans in the Loop

Dr. Anna Lisa Gentile - IBM Research Almaden

Information Extraction (IE) techniques enable us to distill knowledge from the abundantly available unstructured content. Some of the basic IE methods include the automatic extraction of relevant entities from text (e.g. places, dates, people, …), understanding relations among them, building semantic resources (dictionaries, ontologies) to inform the extraction tasks, and connecting extraction results to standard classification resources. IE techniques cannot decouple from human input: at bare minimum some of the data needs to be manually annotated by a human so that automatic methods can learn patterns to recognize certain types of information. The human-in-the-loop paradigm applied to IE techniques focuses on how to better take advantage of human annotations (the recorded observations) and how much interaction with the human is needed for each specific extraction task. In this talk Dr. Gentile will describe various experiments with the human-in-the-loop model on various IE tasks, such as building dictionaries from text corpora in various languages, extracting entities from text and matching them to a reference knowledge base, and relation extraction.
This session is part of the Human in the Loop session track.

Responsible AI Requires Context and Connections

Amy Hodler - Neo4j

As creators and users of artificial intelligence (AI), we have a duty to guide the development and application of AI in ways that fit our social values, in particular, to increase accountability, fairness and public trust. AI systems require context and connections to have more responsible outcomes and make decisions similar to the way humans do.
AI today is effective for specific, well-defined tasks but struggles with ambiguity which can lead to subpar or even disastrous results. Humans deal with ambiguities by using context to figure out what’s important in a situation and then also extend that learning to understanding new situations. In this talk, Amy Hodler will cover how artificial intelligence (AI) can be more situationally appropriate and “learn” in a way that leverages adjacency to understand and refine outputs, using peripheral information and connections.
Graph technologies are a state-of-the-art, purpose-built method for adding and leveraging context from data and are increasingly integrated with machine learning and artificial intelligence solutions in order to add contextual information. For any machine learning or AI application, data quality – and not just quantity – is critical. Graphs also serve as a source of truth for AI-related data and components for greater reliability. Amy will discuss how graphs can add essential context to guide more responsible AI that is more robust, reliable, and trustworthy.

Creating Explainable AI with Rules

Jans Aasman - Franz Inc.

This talk is based on Jans' recent article for Forbes magazine.
"There’s a fascinating dichotomy in artificial intelligence between statistics and rules, machine learning and expert systems. Newcomers to artificial intelligence (AI) regard machine learning as innately superior to brittle rules-based systems, while the history of this field reveals both rules and probabilistic learning are integral components of AI.
This fact is perhaps nowhere truer than in establishing explainable AI, which is central to the long-term business value of AI front-office use cases."
"The fundamental necessity for explainable AI spans regulatory compliance, fairness, transparency, ethics and lack of bias -- although this is not a complete list. For example, the effectiveness of counteracting financial crimes and increasing revenues from advanced machine learning predictions in financial services could be greatly enhanced by deploying more accurate deep learning models. But all of this would be arduous to explain to regulators. Translating those results into explainable rules is the basis for more widespread AI deployments producing a more meaningful impact on society."

Moving Your Machine Learning Models to Production with TensorFlow Extended

Jonathan Mugan - DeUmbra

ML is great fun, but now we want it to solve real problems. To do this, we need a way of keeping track of all of our data and models, and we need to know when our models fail and why. This talk will cover how to move ML to production with TensorFlow Extended (TFX). TFX is used by Google internally for machine-learning model development and deployment, and it has recently been made public. TFX consists of multiple pipeline elements and associated components, and this talk will cover them all, but three elements are particularly interesting: TensorFlow Data Validation, TensorFlow Model Analysis, and the What-If Tool.

The TensorFlow Data Validation library analyses incoming data and computes distributions over the feature values. This can show us which features may not be useful, maybe because they always have the same value, or which features may contain bugs. TensorFlow Model Analysis allows us to understand how well our model performs on different slices of the data. For example, we may find that our predictive models are more accurate for events that happen on Tuesdays, and such knowledge can be used to help us better understand our data and our business. The What-If Tool is an interactive tool that allows you to change data and see what the model would say if a particular record had a particular feature value. It lets you probe your model, and it can automatically find the closest record with a different predicted label, which allows you to learn what the model is homing in on. Machine learning is growing up.
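
To make the two ideas above concrete, here is a plain-Python sketch of finding features that never vary and of measuring accuracy per slice. It deliberately does not use the TensorFlow Data Validation or TensorFlow Model Analysis APIs. The function names, field names, and records are hypothetical.

```python
from collections import defaultdict


def constant_features(records):
    """Names of features that take a single value across all records."""
    seen = defaultdict(set)
    for record in records:
        for name, value in record.items():
            seen[name].add(value)
    return sorted(name for name, values in seen.items() if len(values) == 1)


def accuracy_by_slice(records, slice_key, label_key="label", prediction_key="prediction"):
    """Fraction of correct predictions within each value of `slice_key`."""
    correct, total = defaultdict(int), defaultdict(int)
    for record in records:
        bucket = record[slice_key]
        total[bucket] += 1
        correct[bucket] += int(record[label_key] == record[prediction_key])
    return {bucket: correct[bucket] / total[bucket] for bucket in total}


# Made-up evaluation records with model predictions attached.
records = [
    {"weekday": "Tue", "country": "US", "label": 1, "prediction": 1},
    {"weekday": "Tue", "country": "US", "label": 0, "prediction": 0},
    {"weekday": "Fri", "country": "US", "label": 1, "prediction": 0},
    {"weekday": "Fri", "country": "US", "label": 0, "prediction": 0},
]

print(constant_features(records))
print(accuracy_by_slice(records, "weekday"))
```

In this toy data `country` never varies, so it carries no signal, and the model does better on Tuesday records than on Friday ones.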

Machine Learning Counterclockwise

Shawn Rutledge - Sigma IQ

"Ship early, ship often". Continuous integration. Test-driven development. As software engineering has matured in industry, these and other patterns/anti-patterns have helped avoid the pitfalls common to moving from code to working software systems. As machine learning grows in practice and pervasiveness in industry, similar patterns can help the move from models into working machine learning pipelines and systems. I'll propose some of these patterns and relate tips and best practices from an industry practitioner's perspective. Along the way I hope to challenge researchers working in the area of machine learning with open problems in feature representations, model deployments, and model explanation.

Automated Encoding of Knowledge from Unstructured Natural Language Text into a Graph Database

Chris Davis - Lymba

Most contemporary data analysts are familiar with mapping knowledge onto tabular data models like spreadsheets or relational databases. However, these models are sometimes too broad to capture subtle relationships between granular concepts among different records. Graph databases provide this conceptual granularity, but they typically require that knowledge is curated and formatted by subject matter experts, which is extremely time- and labor-intensive. This talk presents an approach to automate the conversion of natural language text into a structured RDF graph database.

Improving Real-Time Predictive Algorithms with Asynchronous Graph Augmentation

Dave Bechberger / Kelly Mondor - DataStax

Shop online, swipe a credit card, check-in on social media – predictive algorithms are watching all of this in real-time, analyzing the behaviors in order to find fraud, or tailor a news feed, or just suggest some other product to purchase.
Graphs are frequently helpful when working with these sorts of predictive algorithms, as these use cases can benefit heavily from examining how data is connected. The difficulty is that the relevance of connections changes over time, and efficiently finding the connections that matter becomes exponentially harder as more and more data is added. Historically, due to the length of time and amount of computation required, this has been solved by running large batch processes daily, weekly, or even less frequently to update the relevance of connections within a graph. However, in today's world, this is not always fast enough.
What we will show is a method for decoupling these complex analytical transactions from real-time transactions to improve these predictions in near real-time without performance degradation. We will discuss how this method can leverage algorithms such as graph analytics or machine learning to provide optimized graph connections leading to more accurate predictions. To wrap up, we will demonstrate how to apply this process to common use cases such as fraud or personalization to provide better real-time predictive results.

JGTSDB: A JanusGraph/TimescaleDB Mashup

Ted Wilmes - Expero

Time series data is ubiquitous, appearing in many use cases including finance, supply chain, and energy production. Consequently, “How should I model time series data in my graph database?” is one of the top questions folks have when first kicking the tires on a graph database like JanusGraph. Time series data poses a number of challenges for a graph database, both as it’s coming into the system and when it’s being read out. High volume and velocity mean you need to ingest tens to hundreds of thousands of points per second (or more!). Users expect to be able to perform low latency aggregations and more complicated analytics functions over this time series data. JanusGraph can meet the ingest requirements but it requires some very specific data modeling and tuning tricks that frequently are not worth the extra development complexity. Because of this, we’d usually recommend storing this data in an entirely different database that is more suited to time series workloads. For this talk, we will discuss an alternative approach where we integrate TimescaleDB access into JanusGraph itself, allowing users to write a single Gremlin query that transparently traverses their graph and time series data. This setup inherits the operational characteristics of TimescaleDB while providing a single, unified and low latency query interface that does not require careful and specific graph data modeling and tuning.

NuGraphStore: a Transactional Graph Store Backend for JanusGraph

Dr. Jun Li / Dr. Mohammad Roohitavaf / Dr. Gene Zhang - eBay

JanusGraph is a distributed graph database system with pluggable storage backend servers, such as Cassandra, HBase, or BerkeleyDB (which is non-scale-out). There were no fully transactional scale-out backends for JanusGraph. Without transaction support, applications face challenges in dealing with index/data inconsistency and inconsistency related to vertices and edges, such as dangling edges, as well as data loss or data duplication. We have been developing a scale-out KCV storage engine with distributed transaction support for JanusGraph, called NuGraphStore. In this talk, we will present the architecture and design of NuGraphStore, its storage engine and distributed transaction mechanisms. NuGraphStore is (going to be) open-sourced under the Apache 2.0 license. We invite interested developers and users to join the community to make NuGraphStore the best backend storage engine for JanusGraph. Its distributed transaction protocol could be adapted for use with other KV store engines as well.

90 min Workshop
Graph Feature Engineering for More Accurate Machine Learning

Amy Hodler / Justin Fine - Neo4j

Graph-enhanced AI and ML are changing the landscape of intelligent applications. In this workshop, we’ll focus on using graph feature engineering to improve the accuracy, precision, and recall of machine learning models. You’ll learn how graph algorithms can provide more predictive features. We’ll illustrate a link prediction workflow using Spark and Neo4j to predict collaboration, and discuss how to avoid missteps along with tips for getting measurable improvements.
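
As a rough idea of what graph features for link prediction can look like, the sketch below computes three standard pairwise scores (common neighbours, preferential attachment, and Jaccard similarity) on a tiny collaboration graph. It uses plain Python rather than the Spark or Neo4j tooling used in the workshop, and the graph and names are made up.

```python
def common_neighbors(graph, a, b):
    """Number of neighbours shared by nodes a and b."""
    return len(graph[a] & graph[b])


def preferential_attachment(graph, a, b):
    """Product of the two nodes' degrees."""
    return len(graph[a]) * len(graph[b])


def jaccard(graph, a, b):
    """Shared neighbours divided by all neighbours of either node."""
    union = graph[a] | graph[b]
    return len(graph[a] & graph[b]) / len(union) if union else 0.0


def pair_features(graph, a, b):
    """Feature dictionary for one candidate link, ready to feed to a classifier."""
    return {
        "common_neighbors": common_neighbors(graph, a, b),
        "preferential_attachment": preferential_attachment(graph, a, b),
        "jaccard": jaccard(graph, a, b),
    }


# Hypothetical undirected co-authorship graph stored as adjacency sets.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"bob"},
}

features = pair_features(graph, "ann", "dan")
print(features)
```

Feature rows like this, computed for pairs that did and did not later collaborate, can be used to train a binary classifier that predicts future links.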

Modeling, Querying, and Seeing Time Series Data within a Self-Organizing Mesh Network

Denise Gosnell - DataStax

Self-organizing networks rely on sensor communication and a centralized mechanism, like a cell tower, for transmitting the network's status.
So, what happens if the tower goes down? And, how does a graph data structure get involved in the network's healing process?
In this session, Dr. Gosnell will show you how we see graphs in this dynamic network and how path information helps sensors come back online. She will walk through the data, model, and Gremlin queries which help a power company have real-time visibility into different failure scenarios.