Confirmed Sessions for Data Day Texas 2018

Get $100 off the regular room rate at the official conference hotel. Use the following link to book your room:

We are just now beginning to announce the confirmed sessions. Check this page regularly for updates.

Lexicon Mining for Semiotic Squares: Exploding Binary Classification

Jason Kessler - CDK Global

A common task in natural language processing is category-specific lexicon mining, or identifying words and phrases that are associated with the presence or absence of a specific category. For example, lists of words associated with positive (vs. negative) product reviews may be automatically discovered from labeled corpora.
In the 1960s, the semanticists A. J. Greimas and F. Rastier developed a framework for turning two opposing categories into a network of 10 semantic classes. This talk introduces an algorithm for discovering lexicons associated with those semantic classes given a corpus of categorized documents. This algorithm is implemented as part of Scattertext, and the output can be viewed in an interactive browser-based visualization.
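As an illustrative sketch (plain Python, not Scattertext's actual API), the core of category-specific lexicon mining can be as simple as scoring each word by a smoothed log-odds ratio between two opposing categories: words with strongly positive or negative scores anchor the two opposing corners of the square, while near-zero scores fall toward its neutral axis.

```python
from collections import Counter
import math

def category_lexicons(docs, top_n=3):
    """Given (label, text) pairs for two opposing categories, score each
    word by a smoothed log-odds ratio. Strongly positive scores associate
    with the first category, strongly negative with the second, and
    near-zero scores with neither."""
    counts = {}
    for label, text in docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
    (a, ca), (b, cb) = counts.items()
    vocab = set(ca) | set(cb)
    na, nb = sum(ca.values()), sum(cb.values())
    # Add-one smoothing so unseen words do not produce log(0).
    scores = {w: math.log((ca[w] + 1) / (na + len(vocab)))
                 - math.log((cb[w] + 1) / (nb + len(vocab)))
              for w in vocab}
    ranked = sorted(scores, key=scores.get)
    return {a: ranked[-top_n:], b: ranked[:top_n]}
```

For example, on a toy review corpus this surfaces "great" as the word most associated with positive reviews and "terrible" with negative ones.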

Generating Natural-Language Text with Neural Networks

Jonathan Mugan - Deep Grammar

Automatic text generation enables computers to summarize text, to hold conversations in customer-service and other settings, and to customize content based on the characteristics and goals of the human interlocutor. Using neural networks to automatically generate text is appealing because they can be trained from examples, with no need to manually specify what should be said when. In this talk, we will provide an overview of the existing algorithms used in neural text generation, such as sequence-to-sequence (seq2seq) models, reinforcement learning, variational methods, and generative adversarial networks. We will also discuss existing work on controlling the content of generated text by manipulating a latent code. The talk will conclude with a discussion of current challenges and shortcomings of neural text generation.

Everything is not a graph problem (but there are plenty)

Dr. Denise Gosnell - DataStax

As the reality of the graph hype cycle sets in, the graph pragmatists have shown up to guide the charge. What we are seeing and experiencing is an adjustment in mindset: the convergence to multi-model database systems parallels the mentality of using the right tool for the problem. With graph databases, there is an intricate balance to find where the rubber meets the road between theorists and practitioners.
Before hammering away on the keyboard to insert vertices and edges, it is crucial to iterate and drive the development life cycle from definitive use cases. Too many times the field has seen monoglot system thinking pressure the construction of the one graph to rule them all, which can result in some impressive scope creep. In this talk, Dr. Gosnell will walk through common solution design considerations that can make or break a graph implementation and suggest some best practices for navigating common misconceptions.

Next Generation Real Time Architectures

Karthik Ramasamy - Streamlio

Across diverse industry segments, the focus has shifted from Big Data to Fast Data, driven by enterprises producing data not only in high volume but also at high velocity. Many daily business operations depend on real-time insights and on how quickly enterprises can react to them. In this talk, we will describe what constitutes a real-time stack and how the stack is organized to provide an end-to-end real-time experience. The next-generation real-time stack consists of Apache Pulsar (a messaging system), Heron (a distributed streaming engine), and Apache BookKeeper (fast streaming storage). We will delve into the details of each of these systems and explain why they improve on previous-generation systems.

Machine Learning: From The Lab To The Factory

John Akred - Silicon Valley Data Science

When data scientists are done building their models, there are questions to ask:
* How do the model results get to the hands of the decision makers or applications that benefit from this analysis?
* Can the model run automatically without issues, and how does it recover from failure?
* What happens if the model becomes stale because it was trained on data that is no longer relevant?
* How do you deploy and manage new versions of that model without breaking downstream consumers?
This talk will illustrate the importance of these questions and provide a perspective on how to address them. John will share experiences deploying models across many enterprises, some of the problems encountered along the way, and best practices for running machine learning models in production.

Introduction to SparkR in AWS EMR (90 minute session)

Alex Engler - Urban Institute

This session is a hands-on tutorial on working with Spark through R and RStudio in AWS Elastic MapReduce (EMR). The demonstration will show how to launch and access Spark clusters in EMR with R and RStudio installed. Participants will be able to launch their own clusters and run Spark code during an introduction to SparkR, including the sparklyr package, for data science applications. Theoretical concepts of Spark, such as the directed acyclic graph and lazy evaluation, as well as mathematical considerations of distributed methods, will be interspersed throughout the training. Follow-up materials on launching SparkR clusters and tutorials in SparkR will be provided.
Intended Audience: R users who are interested in a first foray into distributed cloud computing for the analysis of massive datasets. No big data, dev ops, or Spark experience is required.

Autopiloting #realtime processing in Heron

Karthik Ramasamy - Streamlio

Many enterprises now produce data not only at high volume but also at high velocity. Because many daily business operations depend on real-time insights, real-time processing of data is gaining significance, and there is a need for scalable infrastructure that can continuously process billions of events per day the instant the data is acquired. To achieve real-time performance at scale, Twitter developed and deployed Heron, a next-generation cloud streaming engine that provides unparalleled performance at large scale. Heron has been successfully meeting strict performance requirements for various streaming applications and is now an open source project with contributors from various institutions. Heron faced some crucial challenges from the developers' and operators' points of view: the manual, time-consuming, and error-prone tasks of tuning various configuration knobs to achieve service level objectives (SLOs), as well as maintaining those SLOs in the face of sudden, unpredictable load variations and hardware or software performance degradation.
To address these issues, we conceived and implemented Dhalion, which aims to bring self-regulating capabilities to streaming systems. Dhalion monitors the streaming application, identifies problems that prevent the application from meeting its targeted performance, and automatically takes corrective actions, such as restarting slow processes or scaling resources up and down in response to load variations. Dhalion has been built as an extension to Heron and contributed back as open source. In this talk, I will give a brief introduction to Heron, enumerate the challenges we faced while running it in production, and describe how Dhalion solves some of them. This is joint work with Avrilia Floratou and Ashvin Agrawal at Microsoft and Bill Graham at Twitter.

Real-time deep link analytics: The next stage of graph analytics

Dr. Victor Lee - TigerGraph

Graph databases are the fastest-growing category in data management, according to DB-Engines. However, most graph queries traverse only two hops in big graphs, due to limitations in current graph databases, while real-world applications require deep link analytics that traverse far more than three hops. To support real-time deep link analytics, we need to combine real-time data updates, big datasets, and deep link traversals.
Dr. Victor Lee offers an overview of TigerGraph's distributed Native Parallel Graph, covering the techniques behind the platform, including how it partitions graph data across machines, supports fast updates, and is still able to perform fast graph traversal and computation. He also shares a subsecond real-time fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups.
(Product Showcase)
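To make "hops" concrete, the sketch below (a toy in-memory traversal, not TigerGraph's engine) collects every vertex within k hops of a start vertex. On real graphs the frontier can grow exponentially with each additional hop, which is why deep link analytics is so demanding.

```python
from collections import deque

def k_hop_neighbors(adj, start, k):
    """Breadth-first traversal over an adjacency-list graph, returning
    all vertices reachable within k hops of `start` (excluding start)."""
    seen = {start: 0}          # vertex -> hop distance from start
    frontier = deque([start])
    while frontier:
        v = frontier.popleft()
        if seen[v] == k:       # do not expand past the hop limit
            continue
        for w in adj.get(v, ()):
            if w not in seen:
                seen[w] = seen[v] + 1
                frontier.append(w)
    return {v for v, d in seen.items() if 0 < d <= k}
```

On a five-vertex chain 1-2-3-4-5, a two-hop query from vertex 1 returns {2, 3}, while a three-hop query returns {2, 3, 4}.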

Writing Distributed Graph Algorithms

Andrew Ray - Sam's Club

Distributed graph algorithms are an important tool for understanding large-scale connected data. One such algorithm, Google's PageRank, changed internet search forever. Efficient implementations of these algorithms in distributed systems are essential to operate at scale.
This talk will introduce the main abstractions for these types of algorithms. First we will discuss the Pregel abstraction, created by Google to solve the PageRank problem at scale. Then we will discuss the PowerGraph abstraction and how it overcomes some of the weaknesses of Pregel. Finally, we will turn to GraphX and how it combines some of the best parts of Pregel and PowerGraph into an easier-to-use abstraction.
For all of these abstractions we will discuss implementations of three key examples: Connected Components, Single-Source Shortest Path, and PageRank. For the first two abstractions this will be in pseudocode; for GraphX we will use Scala. At the end we will cover some practical GraphX tips and tricks.
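As a rough sketch of the superstep model these abstractions share (plain Python here, not GraphX's Scala API), Pregel-style Connected Components has every vertex repeatedly send its smallest known label to its neighbors and halt once no vertex changes.

```python
def pregel_connected_components(vertices, edges):
    """Pregel-style Connected Components: each vertex starts with its own
    id as its label, and in every superstep adopts the minimum label seen
    among its neighbors. Termination when no label changes mirrors
    Pregel's 'vote to halt'."""
    # Build an undirected adjacency list from the edge list.
    adj = {v: set() for v in vertices}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    label = {v: v for v in vertices}
    changed = True
    while changed:                  # one loop iteration = one superstep
        changed = False
        # "Messages": every vertex receives its neighbors' current labels.
        inbox = {v: [label[u] for u in adj[v]] for v in vertices}
        for v in vertices:
            best = min(inbox[v] + [label[v]])
            if best < label[v]:
                label[v] = best
                changed = True
    return label
```

For example, on vertices 1-5 with edges (1,2), (2,3), and (4,5), the algorithm converges to labels {1, 1, 1, 4, 4}: one component containing vertices 1-3 and another containing 4 and 5.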

We R What We Ask: The Landscape of R Users on Stack Overflow

Dave Robinson - Stack Overflow

Since its founding in 2008, the question and answer website Stack Overflow has been a valuable resource for the R community, collecting more than 200,000 questions about R that are visited millions of times each month. This makes it a useful source of data for observing trends about how people use and learn the language. In this talk, I show what we can learn from Stack Overflow data about the global use of the R language over the last decade. I'll examine what ecosystems of R packages are asked about together, what other technologies are used alongside it, in what industries it has been most quickly adopted, and what countries have the highest density of users. Together, the data paints a picture of a global and rapidly growing community. Aside from presenting these results, I'll introduce interactive tools and visualizations that the company has published to explore this data, as well as a number of open datasets that analysts can use to examine trends in software development.