The Data Day Texas 2019 Sessions

We'll begin publishing the Data Day Texas 2019 sessions soon. In the meantime, below are a few previews.

Creating A Data Engineering Culture

Jesse Anderson - Big Data Institute

The biggest initial hurdle to success with Big Data isn’t technical - it’s management. Your data engineering project’s initial success is predicated on your management team correctly staffing and resourcing it. This runs opposite to how most data engineering teams are started and run, on the assumption that if you just choose the best technologies, things will fall into place. They don’t, and that’s a common pattern for failure.
But how do you correctly do something that’s so new? This could be your team’s first data engineering project. What should the team look like? What skills should the team have? What should you look for in a Data Engineer (because you’ll probably have to hire a Software Engineer and train them)? What are some of the management pitfalls?
In this talk, Jesse will cover the most common reasons why data engineering teams fail and how to correct them. This will include ways to get your management to understand that data engineering is really complex and time-consuming. It is not data warehousing with new names. Management needs to understand that you can’t compare a data engineering team to the web development team, for example.
Jesse will share the stories of teams who haven’t set up their data engineering culture correctly and what happened. Then, he will talk about the teams who’ve turned around their culture and how they did it.
Finally, Jesse will share the skills that every data engineering team needs.

Graph Keynote - From Theory to Production

Dr. Denise Gosnell - DataStax

We are here to build applications with graph data and deliver value. The graph community has spent years defining and describing our passion. In order to translate graph thinking into a production application, there is a suite of hard decisions that have to be made. It's time for graph to go mainstream!
This talk will walk through some practical and tangible decisions that come into play when shipping distributed graph applications. Developers need a tangible set of playbooks to work from, and my years of experience have narrowed the decisions down to those that are the most universal and the most difficult to spot. Let's see how well they match up with yours.

Using weak supervision and transfer learning techniques to build a knowledge graph to improve student experiences at Chegg.

Sanghamitra Deb - Chegg

With 1.6 million subscribers and over one hundred fifty million content views, Chegg is a centralized hub where students come to get help with writing, science, math, and other educational needs. In order to impact a student’s learning capabilities, we present personalized content to students. Student needs are unique, based on their learning style, studying environment, location, and many other factors. Most students will engage with a subset of the products and content available at Chegg. In order to recommend personalized content to students, we have developed a generalized Machine Learning Pipeline that is able to handle training data generation and model building for a wide range of problems.

We generate a knowledge graph with a hierarchy of concepts and associate student-generated content, such as chatroom data, equations, chemical formulae, reviews, etc., with concepts in the knowledge graph. Collecting training data to generate different parts of the knowledge graph is a key bottleneck in developing NLP models. Employing subject matter experts to provide annotations is prohibitively expensive. Instead, we use weak supervision and active learning techniques, with tools such as Snorkel, an open source project from Stanford, to make training data generation dramatically easier. With these methods, training data is generated by using broad-stroke filters and high-precision rules. The rules are modeled probabilistically to incorporate dependencies. Features for classification tasks are generated using transfer learning from language models, question answering systems, and text summarization techniques. The generated structured information is then used to improve product features and enhance recommendations made to students.
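
As a rough illustration of the weak supervision idea described above, the sketch below labels text with a concept by combining several noisy rules. It is a minimal, assumption-laden example and not Chegg's pipeline or Snorkel's API: the rule names, the keywords, and the `weak_label` helper are all hypothetical, and a simple vote over non-abstaining rules stands in for the probabilistic model that accounts for dependencies between rules.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion about the text

# Hypothetical high-precision rules; each votes for one concept or abstains.
def lf_mentions_derivative(text):
    return "calculus" if "derivative" in text.lower() else ABSTAIN

def lf_mentions_integral(text):
    return "calculus" if "integral" in text.lower() else ABSTAIN

def lf_mentions_mole(text):
    return "chemistry" if "mole" in text.lower().split() else ABSTAIN

def lf_has_reaction_arrow(text):
    return "chemistry" if "->" in text else ABSTAIN

LABELING_FUNCTIONS = [
    lf_mentions_derivative,
    lf_mentions_integral,
    lf_mentions_mole,
    lf_has_reaction_arrow,
]

def weak_label(text, labeling_functions=LABELING_FUNCTIONS):
    """Return (label, confidence) from the rules' votes.

    Confidence is the share of non-abstaining rules that agree with the
    winning label. This is a plain vote, an assumed simplification of a
    learned probabilistic label model. Ties go to the label voted first.
    """
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return None, 0.0  # no rule fired, so the example stays unlabeled
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

if __name__ == "__main__":
    for sample in [
        "What is the derivative of the integral of x?",
        "2H2 + O2 -> 2H2O, how many mole of water?",
        "Please review my essay",
    ]:
        print(sample, "=>", weak_label(sample))
```

Examples on which no rule fires are left unlabeled rather than guessed, and the confidence value can be used to down-weight examples where the rules disagree.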

High Performance JanusGraph Batch & Stream Loading

Ted Wilmes - Expero

You've downloaded JanusGraph, installed it, and run a few queries from the Gremlin console, but what's next? Data loading is the logical next step, but it is a common pain point for JanusGraph newcomers. Inevitably data loading touches on more advanced topics such as performance tuning and an understanding of JanusGraph transaction semantics. This talk will demystify the data loading process by presenting JanusGraph batch and stream loading patterns that you can apply to your next graph database project.

Eight Prerequisites of a Graph Query Language

Dr. Mingxi Wu - TigerGraph

A graph query language is the key to unleashing the value of interconnected data. The talk discusses eight prerequisites of a graph query language for successful implementation of real-world graph analytics use cases. The talk will present the pros and cons of three query languages - Cypher, Gremlin, and SPARQL. Finally, the talk will provide an overview of GSQL, a Turing-complete graph query language that is a conceptual descendant of Cypher, Gremlin and SPARQL and has incorporated design features from SQL as well as Hadoop MapReduce. The talk will compare the GSQL query language with Gremlin, Cypher and SPARQL, pointing out the differences, including pros and cons for each language.