The Data Day Texas 2019 Sessions

Below is the first round of confirmed talks. We will be adding new sessions daily.

Opening Keynote - Lies Enterprise Architects Tell

Gwen Shapira - Confluent

Let's face it - we are all liars. We often lie unintentionally, and most of all, we lie to ourselves. I've spent the last 10 years working with enterprise architects intent on modernizing their data infrastructure, and I've heard many "facts" that turned out to be… less than perfectly accurate. Self-deception about the state of the industry, our requirements, and our capabilities can lead us to make bad choices, which lead us to build bad architectures and often to bad business outcomes.
If you say or hear phrases like "we have big data", "we don't have big data", "this business app must be real-time", and "hybrid-cloud doesn't exist" - you may work for an organization that could use a bit of a reality check. In this talk, Gwen Shapira, principal data architect at Confluent, will share common enterprise architecture myths that did not survive contact with reality and offer some advice on how to design good data architecture given our inherent capacity for self-deception.

Creating A Data Engineering Culture

Jesse Anderson - Big Data Institute

The biggest initial hurdle to success with Big Data isn't technical - it's management. Your data engineering project's initial success is predicated on your management team correctly staffing and resourcing it. This runs opposite to how most data engineering teams are started and run: the assumption is that if you just choose the best technologies, things will fall into place. They don't, and that's a common pattern for failure.
But how do you correctly do something that's so new? This could be your team's first data engineering project. What should the team look like? What skills should the team have? What should you look for in a Data Engineer (because you'll probably have to hire a Software Engineer and train them)? What are some of the management pitfalls?
In this talk, Jesse will cover the most common reasons why data engineering teams fail and how to correct them. This will include ways to get your management to understand that data engineering is really complex and time-consuming. It is not data warehousing with new names. Management needs to understand that you can't compare a data engineering team to the web development team, for example.
Jesse will share the stories of teams who haven’t set up their data engineering culture correctly and what happened. Then, he will talk about the teams who’ve turned around their culture and how they did it.
Finally, Jesse will share the skills that every data engineering team needs.

The Role of Data Science

Jon Allen - SyncThink

Data Science is often a somewhat nebulously defined role within an organization. "Are they developers? Are they mathematicians? When do we need one? How many do we need? I mean, we definitely need a data science team, 'cause everyone else has one…" It's important to identify and categorize the responsibilities of data science within an organization, both to find appropriate candidates as the company grows and to alleviate the political pressures that often arise when roles are ill-defined.

Extracting Real-Time Insights from Streaming Data

Roger Barga - Amazon

Stream data processing is about identifying and responding to events happening in your business, in your service or application, and with your customers in near real-time. Sensors, IoT and mobile devices, and online transactions all generate data that can be monitored constantly to enable a business to detect and then act on events and insights before they lose their value. The need for large scale, real-time stream processing of big data in motion is more evident now than ever before. In this talk I will draw upon our experience with Amazon Kinesis data streaming services to highlight use cases and dive deep into the role of machine learning over streaming data to extract insights in real-time.
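Kinesis is driven through ordinary API calls; as a minimal, illustrative sketch (not from the talk), a producer and consumer using boto3 might look like the following. The stream name, region, shard ID, and payload are all hypothetical, and error handling and checkpointing are omitted.

```python
# Minimal sketch of producing to and consuming from a Kinesis stream
# with boto3. The stream name "clickstream" and the event payload are
# hypothetical; error handling and checkpointing are omitted.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Producer: each record is routed to a shard by its partition key.
event = {"user_id": "u42", "action": "page_view", "ts": 1546300800}
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)

# Consumer: read the newest records from one shard.
it = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=it)["Records"]:
    print(json.loads(record["Data"]))
```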

Graph Keynote - From Theory to Production

Dr. Denise Gosnell - DataStax

We are here to build applications with graph data and deliver value. The graph community has spent years defining and describing our passion. But to translate graph thinking into a production application, a suite of hard decisions has to be made. It's time for graph to go mainstream!
This talk will walk through some practical and tangible decisions that come into play when shipping distributed graph applications. Developers need a tangible set of playbooks to work from, and my years of experience have narrowed these decisions down to the ones that are most universal and most difficult to spot. Let's see how well they match up with yours.

Performant time-series data management and analytics with Postgres

Michael Freedman - TimescaleDB

Time-series databases are one of the fastest-growing segments of the database market, spreading across industries and use cases. Common requirements include ingesting high volumes of structured data; answering complex, performant queries for both recent and historical time intervals; and performing specialized time-centric analysis and data management.
Today, many developers working with time series data turn to polyglot solutions: a NoSQL database to store their time series data (for scale) and a relational database for associated metadata and key business data. Yet this leads to engineering complexity, operational challenges, and even referential integrity concerns.
I explain how one can avoid these operational problems by re-engineering Postgres to serve as a general data platform, including for high-volume time-series workloads. In particular, TimescaleDB is an open-source time-series database, implemented as a Postgres plugin, that improves insert rates by 20x over vanilla Postgres and delivers much faster queries, even while offering full SQL (including JOINs). TimescaleDB achieves this by storing data on an individual server in a manner more common to distributed systems: heavily partitioning (sharding) data into chunks to ensure that hot chunks corresponding to recent time records are maintained in memory.
In this talk, I focus on two newly-released features of TimescaleDB, and discuss how these capabilities ease time-series data management: (1) the automated adaptation of time-partitioning intervals, which the database learns by observing data volumes; (2) continuous aggregations in near-real-time, in a manner robust to late-arriving data and transparently supporting queries across different aggregation levels. I discuss how these capabilities have been leveraged across several different use cases.
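As a rough illustration of the model described above, here is a minimal sketch (not from the talk) of creating a hypertable and running a time-bucketed query from Python via psycopg2, assuming a Postgres instance with the TimescaleDB extension installed; the table and column names are made up.

```python
# Minimal sketch of a TimescaleDB workflow from Python, assuming
# Postgres with the TimescaleDB extension. Table/column names are
# hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")
cur = conn.cursor()

# A regular table becomes a hypertable, transparently partitioned
# into chunks by time.
cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        TIMESTAMPTZ NOT NULL,
        device_id   TEXT,
        temperature DOUBLE PRECISION
    );
""")
cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")
conn.commit()

# Full SQL still works, including time-centric aggregates over
# recent data.
cur.execute("""
    SELECT time_bucket('5 minutes', time) AS bucket,
           device_id,
           avg(temperature)
    FROM conditions
    WHERE time > now() - INTERVAL '1 day'
    GROUP BY bucket, device_id
    ORDER BY bucket;
""")
for row in cur.fetchall():
    print(row)
```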

Kubeflow explained: Portable machine learning on Kubernetes

Michelle Casbon - Google

Practically speaking, some of the biggest challenges facing ML applications are composability, portability, and scalability. The Kubernetes framework is well suited to address these issues, which is why it’s a great foundation for deploying ML products. Kubeflow is designed to take advantage of these benefits.

Kubeflow makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. It removes the need for expertise in a large number of areas, reducing the barrier to entry for developing and maintaining ML products. The composability problem is addressed by providing a single, unified tool for running common processes such as data ingestion, transformation, and analysis, model training, evaluation, and serving, as well as monitoring, logging, and other operational tools. The portability problem is resolved by supporting the use of the entire stack either locally, on-premise, or on the cloud platform of your choice. Scalability is native to the Kubernetes platform and leveraged by Kubeflow to run all aspects of the product, including resource-intensive model training tasks.
Michelle Casbon demonstrates how to build a machine learning application with Kubeflow. By providing a platform that reduces variability between services and environments, Kubeflow enables applications that are more robust and resilient, resulting in less downtime, quality issues, and customer impact. Additionally, it supports the use of specialized hardware such as GPUs, which can reduce operational costs and improve model performance. Join Michelle to find out what Kubeflow currently supports and the long-term vision for the project.

Using weak supervision and transfer learning techniques to build a knowledge graph to improve student experiences at Chegg

Sanghamitra Deb - Chegg

With 1.6 million subscribers and over a hundred fifty million content views, Chegg is a centralized hub where students come to get help with writing, science, math, and other educational needs. In order to impact a student's learning capabilities, we present personalized content to students. Student needs are unique based on their learning style, studying environment, location, and many other factors. Most students will engage with a subset of the products and content available at Chegg. In order to recommend personalized content to students, we have developed a generalized Machine Learning Pipeline that is able to handle training data generation and model building for a wide range of problems.

We generate a knowledge graph with a hierarchy of concepts and associate student-generated content, such as chatroom data, equations, chemical formulae, reviews, etc., with concepts in the knowledge graph. Collecting training data to generate different parts of the knowledge graph is a key bottleneck in developing NLP models. Employing subject matter experts to provide annotations is prohibitively expensive. Instead, we use weak supervision and active learning techniques, with tools such as Snorkel, an open-source project from Stanford, to make training data generation dramatically easier. With these methods, training data is generated by using broad-stroke filters and high-precision rules. The rules are modeled probabilistically to incorporate dependencies. Features for classification tasks are generated using transfer learning from language models, question answering systems, and text summarization techniques. The generated structured information is then used to improve product features and enhance recommendations made to students.
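To make the weak supervision workflow concrete, here is a minimal sketch in the style of Snorkel's labeling-function API; the rules, labels, and example texts below are hypothetical stand-ins for the broad-stroke filters and high-precision rules described above, not Chegg's actual pipeline.

```python
# Minimal weak supervision sketch with Snorkel (0.9-style API).
# The labeling rules, labels, and texts are hypothetical.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, OTHER, CHEMISTRY = -1, 0, 1

@labeling_function()
def mentions_formula(x):
    # High-precision rule: chemical formulae suggest a chemistry concept.
    return CHEMISTRY if "H2O" in x.text or "NaCl" in x.text else ABSTAIN

@labeling_function()
def mentions_essay(x):
    # Broad-stroke filter for writing-related content.
    return OTHER if "essay" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "How many moles of NaCl are in this solution?",
    "Please review my essay introduction.",
]})

# Apply all labeling functions, then combine their noisy, possibly
# dependent votes with a probabilistic label model.
applier = PandasLFApplier([mentions_formula, mentions_essay])
L_train = applier.apply(df_train)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100)
print(label_model.predict_proba(L_train))
```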

Need for speed: Boosting Apache Cassandra's performance using Netty

Dinesh A. Joshi

Apache Cassandra 4.0 has several enhancements. One of the biggest is the switch from blocking network IO using JDK Sockets to non-blocking IO with Netty. As a result, Cassandra has seen gains in performance and efficiency. These gains translate into real-world costs and allow Cassandra to scale better. This presentation will take you on a tour of the improvements to Cassandra's network layer (old and new) and help quantify the gains in real-world terms.
At the end of the talk, the audience will understand the motivations behind moving Cassandra's internode and streaming communication to Netty. They will also learn how these changes significantly affect scalability, and about recent enhancements such as zero-copy streaming over Netty that make scaling and rebuilding clusters faster than ever before.
Intended audience: developers, database admins, and Apache Cassandra users interested in running and managing it.
Technical skills and concepts required: Basic understanding of networking and Cassandra.

Data Science Keynote - Obvious conclusions that are actually wrong

Sean Owen - Databricks

A little knowledge can be a dangerous thing for a new data scientist. It's all too easy to draw obvious conclusions from data analysis that are actually wrong -- or worse, draw reasonable conclusions that are contradictory. The seasoned data scientist knows the classic stats 'paradoxes' and avoids costly business mistakes and humiliation at the hands of stats majors. This talk will survey five seemingly straightforward scenarios where the right answer looks wrong or ambiguous. It will explore how causation provides a resolution to many such situations and briefly introduce Judea Pearl's do-calculus. Don't do data science until you've seen how correlation and causation are vital to understanding data correctly.
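One classic instance of an "obvious conclusion that is actually wrong" is Simpson's paradox, where an aggregate trend reverses inside every subgroup. A small illustration with entirely made-up numbers:

```python
# Simpson's paradox with made-up numbers: treatment B looks better
# overall, yet treatment A wins within every subgroup, because the
# group sizes are imbalanced.
import pandas as pd

df = pd.DataFrame({
    "treatment": ["A"] * 100 + ["B"] * 100,
    "severity":  ["severe"] * 90 + ["mild"] * 10 + ["severe"] * 10 + ["mild"] * 90,
    "recovered": [1] * 54 + [0] * 36 + [1] * 9 + [0] * 1
                 + [1] * 3 + [0] * 7 + [1] * 80 + [0] * 10,
})

print(df.groupby("treatment")["recovered"].mean())               # B looks better overall
print(df.groupby(["severity", "treatment"])["recovered"].mean()) # A wins in each subgroup
```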

Statistically representative graph generation and benchmarking

Chris Lu

In order to evaluate the performance of a graph database or a graph query solution, it is often necessary to generate a large graph dataset over which we can then execute queries. However, while there are a number of benchmarks for graph databases which provide tools for data generation, they typically offer few if any options for tailoring the generated graph to the unique schema, topology, and statistics of the target domain. In practice, this limits the value of these benchmarks for capacity planning and estimation of query latency. In this talk, we will describe an open-source framework for property graph generation and benchmarking in close correspondence with a schema and a statistical model. A simple declarative language is provided for the schema and model, while the reference implementation is written in Java and builds upon the Apache TinkerPop graph database API.
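The framework's declarative language is not shown here, but the core idea of sampling a graph against a statistical model can be sketched in a few lines; the schema and statistics below are invented and far simpler than what the framework supports.

```python
# Toy sketch of statistically driven graph generation: sample
# vertices per label from given counts, then sample edges according
# to a simple out-degree model. All numbers are hypothetical.
import random

schema = {"person": 1000, "product": 200}                 # vertex label -> count
edge_model = {("person", "bought", "product"): 3.0}       # mean out-degree

vertices = {label: [f"{label}-{i}" for i in range(n)]
            for label, n in schema.items()}

edges = []
for (src, rel, dst), mean_deg in edge_model.items():
    for v in vertices[src]:
        # Vary each vertex's out-degree around the configured mean.
        for _ in range(max(0, round(random.gauss(mean_deg, 1)))):
            edges.append((v, rel, random.choice(vertices[dst])))

print(len(edges), "edges, e.g.", edges[:3])
```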
Intended audience: Graph developers
Technical skills and concepts required: Familiarity with the property graph data model. Some experience with graph database backends recommended though not required.

Understanding Spark Tuning with Auto Tuning (or how to stop your pager going off at 2am*)

Holden Karau - Google

Tuning Apache Spark is somewhat of a dark art, although thankfully when it goes wrong all we tend to lose is several hours of our day and our employer's money. This talk will look at how we can go about auto-tuning selective workloads using a combination of live and historical data, including new settings proposed in Spark 2.4.
Much of the data required to effectively tune jobs is already collected inside of Spark; we just need to understand it. This talk will look at some sample auto-tuners and discuss the options for improving them and applying similar techniques in your own work.
It will also look at what kind of tuning can be done statically (e.g. without depending on historical information), as well as Spark's own built-in components for auto-tuning (currently dynamic scaling of cluster size) and how we can improve them.
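As a flavor of static tuning, one simple heuristic is to derive spark.sql.shuffle.partitions from the input size before the job runs rather than accepting the default of 200; the sketch below is illustrative only, and the path and 128 MB target partition size are assumptions.

```python
# Minimal static-tuning sketch: size shuffle partitions from the
# input data before the job starts. Path and target size are
# hypothetical.
import os
from pyspark.sql import SparkSession

input_path = "/data/events"                      # hypothetical dataset
input_bytes = sum(
    os.path.getsize(os.path.join(root, f))
    for root, _, files in os.walk(input_path) for f in files
)
target_partition_bytes = 128 * 1024 * 1024       # assumed 128 MB target
shuffle_partitions = max(1, input_bytes // target_partition_bytes)

spark = (SparkSession.builder
         .appName("auto-tuned-job")
         .config("spark.sql.shuffle.partitions", str(shuffle_partitions))
         .getOrCreate())
```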
Even if the idea of building an “auto-tuner” sounds as appealing as “using a rusty spoon to debug the JVM on a haunted super computer”, this talk will give you a better understanding of the knobs available to you to tune your Apache Spark jobs.
*Also to be clear we don’t promise to stop your pager going off at 2am, we just hope this helps.

Moving Beyond Node Views

Lynn Pausic - Expero

When people talk about visualizing graph data, what typically comes to mind is the canonical node view. Node views display nodes (vertices) and the relationships (edges) between them. With large data sets consisting of millions of vertices and edges, node views can quickly become unwieldy to use and comprehend. Further, traditional UI patterns and visualizations conceived for relational schemas often don't work with graph data. Relational schemas are predefined and relatively static, making it easy to tailor UI navigation to the available data dimensions. Due to the distinct mathematical nature of graph data, traversing data in a graph is fairly different. While this presents additional challenges, there are also opportunities. Traversing a graph with certain algorithms allows you to, for example, show key influencers in social networks, clusters of communities in customer reviews, or weak points in electrical grids. These new insights into data provide novel tools to craft innovative user experiences. But this opportunity comes at a price, namely more complexity.
Through building and deploying dozens of applications driven by graph data, we've developed a unique approach to building UIs for graph data and an arsenal of data visualizations that work well across a broad range of contexts. In this talk we'll share various tools and examples for displaying graph data in meaningful ways to users.

High Performance JanusGraph Batch & Stream Loading

Ted Wilmes - Expero

You've downloaded JanusGraph, installed it, and run a few queries from the Gremlin console, but what's next? Data loading is the logical next step, but it is a common pain point for JanusGraph newcomers. Inevitably data loading touches on more advanced topics such as performance tuning and an understanding of JanusGraph transaction semantics. This talk will demystify the data loading process by presenting JanusGraph batch and stream loading patterns that you can apply to your next graph database project.
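One widely used batch-loading pattern, sketched below with gremlinpython, is to chain many addV steps into a single traversal so that each network round trip (and transaction) covers a whole batch instead of a single vertex; the endpoint, batch size, and data here are assumptions for illustration.

```python
# Batch-loading sketch: chain addV steps so each round trip commits
# a whole batch. Assumes a Gremlin Server for JanusGraph at the
# default endpoint; the data is made up.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

people = [{"name": f"user-{i}"} for i in range(10_000)]
BATCH = 100

for start in range(0, len(people), BATCH):
    t = g
    for p in people[start:start + BATCH]:
        t = t.addV("person").property("name", p["name"])
    t.iterate()   # one round trip, committed as one transaction

conn.close()
```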

Taming of the Shrew: Using a Knowledge Graph to capture structured Health Information Data

Chris Wixon - Savannah Vascular

Despite investing billions of dollars in modernizing health information infrastructure, the impact on healthcare outcomes has been relatively modest. Patient care remains fragmented, costs continue to rise, and medical professionals remain frustrated and dissatisfied. Our success in a new era of digital health will depend on the ability to derive insight from data.
Freedom of expression comes from order and structure: placing limits, and working against them rigorously. Our graph-powered solution offers a schema-driven method of information capture between patient and provider, at the point of service. In contradistinction to a source-oriented method, the knowledge graph models the corpus of medicine itself and incorporates concepts from multiple medical terminology systems (CPT, ICD, SNOMED, NPI, medical taxonomy) into its persistence layer. A seed pattern of clinically relevant predicates defines concepts in a formal way that reflects the semantics of each concept. The meta-model supports a uniform user interface and enables efficient documentation by way of a semantic browser: click and discover, rather than search and retrieve.
Implementing a clinically oriented Knowledge Graph saves time for the physician, returns the focus back to the patient, and creates computable medical records for healthcare payers to make more informed decisions.
Point of service documentation minimizes the time lag between the patient encounter and data entry and provides more complete and reliable health information data.
The use of the computer as a recall mechanism for medical knowledge allows medical personnel to function at the top of their credentials, drawing on a universe of knowledge more comprehensive than human cognition.
The data gathered in a specific section of the record should be entered in a manner that is unambiguous and consistent among all health practitioners entering it.
Intended audience: healthcare professionals and anyone interested in data modeling, domain-driven design (DDD), and knowledge graphs
Technical skills and concepts required: Beginner – Intermediate knowledge

Eight Prerequisites of a Graph Query Language

Dr. Mingxi Wu - TigerGraph

A graph query language is the key to unleashing the value of interconnected data. The talk discusses eight prerequisites of a graph query language for the successful implementation of real-world graph analytics use cases, and presents the pros and cons of three query languages: Cypher, Gremlin, and SPARQL. Finally, the talk will provide an overview of GSQL, a Turing-complete graph query language that is a conceptual descendant of Cypher, Gremlin, and SPARQL and has incorporated design features from SQL as well as Hadoop MapReduce, and will compare GSQL with each of these languages, pointing out the differences, including pros and cons.

GQL: Towards a Standardized Property Graph Query Language

Dr. Petra Selmer - Neo4j

Over the past decade, property graph databases have experienced phenomenal growth within industry across multiple domains such as master data and knowledge management, fraud detection, network management, access control and healthcare, among others. With this proliferation in usage, the need for a standardized property graph query language has become ever more pressing. Efforts are underway within the ISO framework to define and describe a standard property graph query language, GQL. In this talk, I will introduce GQL, and detail the landscape, scope and features envisaged for the first version of GQL, such as complex pattern matching and composable graph querying. I will provide a roadmap of the standardization process, and also describe the outcome of an analytical comparison of existing property graph query languages, which will be one of the inputs into the design of GQL. To conclude, I will outline future directions.
Technical skills and concepts required: Some knowledge/awareness of property graphs would be useful.

A Graph is a Graph is a Graph: Equivalence, Transformations, and Composition of Graph Data Models

Joshua Shinavier - Uber

The power of graphs lies in their intuitiveness: there is nothing much simpler to visualize or reason about than a bunch of dots connected by a bunch of lines. In practice, however, there are a variety of graph data models, separated by shades of expressivity and nuance. These include property graphs and their variants, RDF-based ontology languages, hypergraph data models, entity-relationship models, and any number of formats and schema languages which are somehow graph-like, though not specifically designed for graphs. Over the years, countless special-purpose tools have been written to transform one graph data model to another, or to provide graph views over this or that type of data source. In this talk, we will bring some order to this chaos using concepts from functional programming and category theory, with an emphasis on bidirectional and composable transformations. Along the way, we will ponder the grand vision of bringing together the whole of a company’s data as a knowledge graph.
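As a toy flavor of the kind of transformation under discussion, consider mapping a single property graph edge into RDF-style triples; the URIs and the reification scheme below are hypothetical, and a real framework would make such mappings composable and bidirectional.

```python
# Toy graph-to-graph transformation: a property graph edge becomes a
# set of RDF-style triples. URIs and reification scheme are made up.
def pg_edge_to_triples(edge_id, label, src, dst, props):
    base = f"urn:edge:{edge_id}"
    triples = [(f"urn:v:{src}", f"urn:rel:{label}", f"urn:v:{dst}")]
    # Edge properties need reification, since plain triples cannot
    # carry attributes: one of the expressivity gaps noted above.
    triples += [(base, f"urn:prop:{k}", repr(v)) for k, v in props.items()]
    return triples

triples = pg_edge_to_triples(7, "RATED", "alice", "matrix", {"stars": 5})
for t in triples:
    print(t)
```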
Technical skills and concepts required: Basic familiarity with the property graph data model. Some experience with functional programming may help. However, concepts will be introduced at a high level and should be reasonably easy to follow.

Breaking Down Silos with Knowledge Graphs

Michael Uschold - Semantic Arts

For most of you who work in enterprise computing, silos are the bane of your existence. We explore the origins of silos and point out some technical factors that exert a strong gravitational pull, drawing your enterprise into the deep pit of silos. Chief among these are application-centricity and the limitations of relational database technology, including the lack of explicit semantics. We describe a semantics-based, data-centric approach using ontologies and knowledge graphs. We show how it works in practice and illustrate with case studies. We warn against using these newer technologies to gain a local advantage in one part of an organization while ultimately recreating silos across the wider enterprise.
Conclusion: The use of an enterprise ontology as a schema to populate an RDF-based knowledge graph opens the door to removing silos and never creating them again. The technology is mature and ready for prime time.

The Intelligent Sales Organization Runs on Speech Recognition, Knowledge Graphs and AI

Dr. Jans Aasman - Franz Inc. / Shannon Copeland - N3

We describe a real-world Intelligent Sales Organization that uses graph-based technology for taxonomy-driven entity extraction, speech recognition, machine learning, and predictive analytics to improve the quality of conversations, increase sales, and improve business visibility.
The details: In the typical sales organization, the contents of the actual chat or voice conversation between agent and customer are a black hole. In the modern Intelligent Sales Organization ("ISO"), the interactions between agent and customer are a source of rich information that helps agents improve the quality of the interaction in real time, creates more sales, and provides far better analytics for management. An ISO is enabled by at least five main technologies: a taxonomy of the products and services sold; speech recognition to turn conversations into text; a taxonomy-driven entity extractor to take the important concepts out of conversations; machine learning to classify chats in various ways; and a real-time Knowledge Graph that also knows (and stores) everything about customers and agents and provides the raw data for machine learning to improve doing the business of the ISO.

90 minute workshop - Hands-On Introduction to Gremlin Traversals

Dr. Artem Chebotko - DataStax

This workshop introduces the Gremlin graph traversal language from Apache TinkerPop by exploring graph access patterns that are commonly seen in real-life applications. It features many practice problems, where the access pattern complexity is gradually increased from elements to paths, from paths to subgraphs, and from subgraphs to arbitrary graph patterns.
This workshop is hands-on. Each attendee gets a free cloud instance with pre-installed database and notebook software that can be accessed via a web browser to run Gremlin traversal examples and complete practice problems. A laptop is required to participate in the exercises, but is not absolutely necessary to learn and benefit.
This workshop has no vendor-specific content to learn or understand but does use a graph database, DataStax Enterprise Graph, and notebook software, DataStax Studio, to run examples and test solutions to practice problems. All practice problems use Apache TinkerPop’s Gremlin with no vendor-specific extensions.
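As a small taste of that progression, the following gremlinpython snippet moves from element access to path traversals; it assumes a Gremlin Server loaded with TinkerPop's sample "modern" graph at the default endpoint.

```python
# A taste of the traversal progression the workshop follows, using
# gremlinpython against a Gremlin Server with TinkerPop's sample
# "modern" graph (an assumption for this sketch).
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(
    DriverRemoteConnection("ws://localhost:8182/gremlin", "g"))

# Elements: who does marko know?
print(g.V().has("person", "name", "marko").out("knows").values("name").toList())

# Paths: the full route from marko to the software his friends created.
print(g.V().has("person", "name", "marko")
       .out("knows").out("created").path().toList())
```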
Intended audience: Beginners and Intermediate
Technical skills and concepts required: No prior graph data management experience is required.

Operationalizing Graph Analytics With Neo4j

William Lyon - Neo4j

Data science is great and all, but when it comes time to implement some of the advanced features data scientists have prototyped, developers can be left struggling. This talk will show how data scientists and developers can live in harmony by using the Neo4j graph database for both operational and analytic workloads.
Taking a tour through the process of making an application smarter with personalized recommendations, graph-based search, and knowledge graph features, we will move beyond just operational workloads to add these features to our application with graph algorithms, using Neo4j for HTAP (hybrid transactional/analytical processing).
We demonstrate how to run graph algorithms such as personalized PageRank, community detection, and similarity metrics in Neo4j and how these algorithms can be used to improve our application. We'll show how to keep our application up to date as data comes into the system, discuss architectural considerations, and enhance data scientists’ capabilities by building user-friendly applications.
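As a rough sketch of what running such an algorithm from an application can look like, the snippet below calls PageRank from the Python driver; it assumes Neo4j with the Graph Algorithms plugin (the algo.* procedures of that era) and a hypothetical Page/LINKS graph.

```python
# Minimal sketch of operationalizing a graph algorithm, assuming
# Neo4j with the Graph Algorithms plugin and a hypothetical
# Page/LINKS graph. Credentials and URI are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Write PageRank scores back onto the nodes...
    session.run("""
        CALL algo.pageRank('Page', 'LINKS',
            {iterations: 20, dampingFactor: 0.85,
             write: true, writeProperty: 'pagerank'})
    """)
    # ...then let the operational side of the application query them.
    for record in session.run("""
            MATCH (p:Page)
            RETURN p.name AS name, p.pagerank AS rank
            ORDER BY rank DESC LIMIT 5"""):
        print(record["name"], record["rank"])

driver.close()
```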