Confirmed Sessions at Data Day Texas 2017
We are still confirming talks for Data Day Texas 2017. Bookmark this page for updates.
Building Production Spark Streaming Applications
Joey Echeverria - Rocana
Designing, implementing, and testing a Spark Streaming application is necessary to deploy to production but is not sufficient for long term management and monitoring. Simply learning the Spark Streaming APIs only gets you part of the way there. In this talk, I’ll be focusing on everything that happens after you’ve implemented your application in the context of a real-time alerting system for IT operational data. I’ll discuss:
Deployment options (Apache Hadoop YARN, Apache Mesos, etc.)
Execution options (spark-submit, APIs, etc.)
Configuration and tuning
Saving the World with Data
Andrew Therriault - City of Boston
The emergence of data science over the past decade has had an effect on almost every field you can imagine, but its importance to the non-profit, advocacy, and government sectors has been especially profound. With a wide variety of data and limited resources which beg for optimization, these types of civic organizations can benefit tremendously from data-driven tools and analysis. What's more, the real-world challenges these organizations take on create opportunities for data scientists, analysts, and engineers to do incredibly exciting work and have a much greater impact than you can get anywhere else. This presentation will introduce the audience to the landscape of civic data and technology, present real case studies of data science applications, and discuss how anyone interested in getting more involved can find their place in using data for good.
Extending Spark Machine Learning - Adding your own algorithms & tools
Holden Karau - IBM
Spark's scikit-learn inspired Machine Learning pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren't available yet. This talk introduces Spark's ML pipelines and then looks at how to extend them with your own custom algorithms. By integrating your own data preparation and machine learning tools into Spark's ML pipelines you will be able to take advantage of useful meta-algorithms, like parameter searching and pipeline persistence (with a bit more work of course :p). Even if you don't have your own machine learning algorithms you want to implement, this talk peels back the covers on how the ML APIs are built and can help you make even more awesome ML pipelines and customize Spark models for your needs.
The examples in this talk will be presented in Scala, but time will be taken to explain any non-standard syntax. A basic understanding of Spark will make it easier to follow along, but if this is your first Spark talk it will still be useful in illustrating how Spark's Machine Learning tools are designed.
Introducing Amazon AI
Robert Munro - Amazon
Details forthcoming ( ! )
Scaling Data Science at Stitch Fix
Stefan Krawczyk - Stitch Fix
Stitch Fix is an online clothing retailer that not only focuses on delivering personalized clothing recommendations for our customers, but also applies the output of data science to automate numerous other business functions through the delivery of forecasts, predictions, and analyses. We rely heavily on the ability for applied mathematics & statistics and our human decision makers to synergistically work; doing this well requires us to merge art & science together. However, with around eighty data scientists in residence, it can be challenging to support so many different needs from an infrastructure perspective. The talk will also cover how Stitch Fix scales access to data using S3 as our source of truth as well as how Stitch Fix scales ad-hoc compute resources for data scientists using Docker & ECS.
Predictive Models for Inter-Linking Text-Rich Semantic Databases
Gabor Melli - OpenGov
Linguistically rich semantic databases such as taxonomies, knowledge graphs (e.g. Google’s and LinkedIn’s) and lexicalized ontologies are an increasingly important new resource required to create ‘smart’ information systems. On a regular basis these databases need to be interlinked to other overlapping databases. I present the state-of-the-art approaches (instance-based and deep neural network based), and end-to-end predictive model-based applications and solutions from several domains ranging from “open data” to B2B and B2C solutions. I also present topics such as evaluation, scalability and human-in-the-loop labeling systems.
Machine learning with humans in the loop
Nick Gaylord - Crowdflower
Increasingly, businesses are exploring machine learning not only to speed the processing of structured data, but to extract insights from unstructured data like text and images as well. These areas have traditionally been the domain of human-powered data analysis, and the role of human intelligence should not be forgotten -- successful applications of machine learning to these tasks are often more about augmenting humans than about replacing them outright. I discuss two strategies for keeping humans in the loop when using ML to process unstructured data: using human judgments to offset your model's lower-confidence predictions, and using active learning to repurpose those supplementary labels as additional training data so your model improves over time.
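The two strategies in this abstract can be sketched in a few lines of Python. This is only an illustration, not the speaker's implementation: the 0.8 confidence threshold, the toy model, and the stand-in human labeler are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

def route_predictions(
    items: List[str],
    predict: Callable[[str], Tuple[str, float]],
    ask_human: Callable[[str], str],
    threshold: float = 0.8,  # assumed cut-off, tune for your own model
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Keep confident model labels and send the rest to a human.

    Returns the final labels plus the human-labelled examples, which can be
    added to the training set for the next round (the active learning step).
    """
    final_labels = []
    new_training = []
    for item in items:
        label, confidence = predict(item)
        if confidence >= threshold:
            final_labels.append(label)
        else:
            human_label = ask_human(item)
            final_labels.append(human_label)
            new_training.append((item, human_label))
    return final_labels, new_training

# Hypothetical stand-ins for a real model and a real annotation service.
def toy_predict(text: str) -> Tuple[str, float]:
    if "great" in text:
        return "positive", 0.95
    if "awful" in text:
        return "negative", 0.90
    return "positive", 0.55  # the model is unsure about everything else

def toy_human(text: str) -> str:
    return "negative" if "not" in text else "positive"

items = ["great product", "awful service", "not what I expected"]
labels, extra_training = route_predictions(items, toy_predict, toy_human)
print(labels)
print(extra_training)
```

Only the third item falls below the threshold, so it is the only one that costs human time, and its label is the only new training example.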
Machine Learning with Opponents
Brendan Herger - Capital One
Many areas of applied machine learning require models optimized for rare occurrences, such as class imbalances, and users actively attempting to subvert the system (adversaries). The Data Innovation Lab at Capital One has explored advanced modeling techniques for just these challenges. The lab’s use case necessitated that it survey the many related fields that deal with these issues and perform many of the suggested modeling techniques. It has also introduced a few novel variations of its own.
Brendan Herger offers an introduction to the problem space and a brief overview of the modeling frameworks the Data Innovation Lab has chosen to work with, outlines the lab’s approaches, discusses the lessons learned along the way, and explores proposed future work.
Outlier detection via dimensionality reduction (PCA and neural network auto-encoders)
The synthetic minority over-sampling technique (SMOTE sampling)
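As a rough illustration of the SMOTE idea listed above, the sketch below creates synthetic minority-class points by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified, dependency-free version written for this listing, not the lab's code; the function name, the default `k`, and the sample data are assumptions.

```python
import random
from typing import List, Sequence

def smote(minority: Sequence[Sequence[float]], n_synthetic: int,
          k: int = 3, seed: int = 0) -> List[List[float]]:
    """Generate n_synthetic points between minority samples and their neighbours.

    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base` (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class samples (e.g. rare fraud cases with two features).
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote(minority_points, n_synthetic=6)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the region the minority class already occupies.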
ScyllaDB dance like butterfly sting like bee
Dor Laor - ScyllaDB
Scylla is a complete re-write of good, old Cassandra. Scylla can drop-in replace Cassandra while providing 10X the speed, consistent tail latency and automatic tuning. Revolutionary shard-per-core design and extreme async programming turn Scylla into a throughput machine with the ability to support the big data ecosystem, from Spark to KairosDB to TitanDB. Join a journey that was started by IBM's Compose and Samsung SDS; be advised, you may get hooked yourself.
Pragmatic Deep Learning for image labelling. An application to a travel recommendation engine
Pierre Gutierrez - Dataiku
Many companies don’t have the luxury to train their own image recognition deep learning models because they often lack data, hardware, human resources or time. In this talk, we will explain how to tackle these issues.
We will explain how to have a pragmatic approach to deep learning using open pre-trained models and transfer learning. We will also describe how to generate more labels using external APIs. We will then take the real case of an e-business vacation retailer as an example. Recommender systems are paramount for this type of company since there is an increasing need to take into account all the user information to tailor the best product proposition. Indeed, when it comes to hostels, some people can be more attracted by pictures of the room, the building or even the nearby beach. Finally, we will describe how we managed to extract the sale images content and themes to further customise the recommendation without training any deep neural network ourselves.
Database Reliability Engineering
Laine Campbell - O'Reilly Author
Consider this a new database administration primer, focused on teaching the developer taking on Operations; or the systems administrator diving into site reliability. This is about the core concepts of database operations within today’s IT paradigms, which can include continuous deployment and delivery, DevOps culture, infrastructure as code, and cloud/virtualized environments. By the nature of the topic and time allotted, this talk will focus on breadth, not depth, but should allow the participant to understand the concepts, and gain a framework to focus their additional pursuits and education. You should come out of this course with a better understanding of how a database specialist fits in today's reliability engineering paradigms.
1. Database reliability engineering, overview
Site reliability engineering overview/history
Database administration overview/history
Today’s operational DBA
Service level management
The observability stack
Anatomy of a Datastore, choose your poison
3. Build and Deployment
Infrastructure engineering and management
Capacity planning and performance
Release engineering and change management
4. Operations Core
Data Integrity in Depth
Disaster preparedness and business continuity
Incident management and Oncall
Sampling is always Bad, approximate the Good way with Sketches
Eric Tschetter - Yahoo
Sampling is a common technique employed to shrink the size of data and make it manageable. But, when employed in practice, sampling introduces error that only produces "good" answers under very specific conditions that require a scientist to create and manage. This talk introduces a suite of data structures and companion algorithms called sketches which can also be used to make data manageable in size, but whose results can be interpreted intuitively by someone without an advanced degree. The talk targets technical individuals who manage data warehouses or closely interact with data systems on a daily basis, but should be consumable by anyone with a general understanding of probability.
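To give a flavor of what a sketch looks like, here is a minimal K-Minimum Values (KMV) sketch for estimating the number of distinct items in a stream. The abstract does not say which sketches the talk covers, so the choice of KMV, the class name, and the parameter `k` are assumptions made purely for illustration.

```python
import hashlib
import heapq

class KMVSketch:
    """Estimate distinct counts by remembering only the k smallest hash values."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is at the root
        self._members = set()  # hashes currently kept, to ignore repeats

    @staticmethod
    def _hash(item) -> float:
        # Map the item to a pseudo-uniform number in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2 ** 64

    def add(self, item) -> None:
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest

sketch = KMVSketch(k=1024)
for i in range(50_000):
    sketch.add(f"user-{i % 10_000}")  # 10,000 distinct values, each seen 5 times
print(round(sketch.estimate()))  # an approximation of 10,000, not an exact count
```

The sketch stores at most `k` numbers regardless of stream length, and its relative error shrinks roughly as one over the square root of `k`, so the accuracy is set by a single knob rather than by how the data was sampled.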
Towards an open standard for metrics and time series
Paul Dix - InfluxDB
In this presentation I look at different models for representing time series and metrics data in various open source projects. We'll take a look at the different types of metadata, data types, regular and irregular time series and the different use cases for each. I'll explore advantages and disadvantages to each and look at a potential model going forward that can provide flexibility for different use cases.
Structured Logs with InfluxDB - Log management at scale without full text search
Paul Dix - InfluxDB
In this presentation I cover some of the problems with log management at scale and explore an alternative to using full text search as a way to mine information from logs. I look at how adding structure in the log collector along with narrowing down time scale can be used to efficiently grep through logs.
Choose your own Time Series adventure
Patrick McFadin - DataStax
Have a lot of questions about managing and processing time series data? You are not alone! I do a lot of teaching and speaking on time series data topics and I’m here to help. I have collected a bundle of topics that capture a lot of the questions I get. This talk will be a little different format than you might be used to because you get to choose the direction! I’ll present the potential 10-15 minute topics and let the audience pick which ones to talk about. I’ll have more than the time allotted so choose carefully! Here is an idea of the topics you can pick from:
- Time series use cases
- Architectures for scaling
- Apache Cassandra for storage
- Apache Spark for analysis
Robot farmers and chefs: In the field and in your kitchen
Tim Gasper - Ponos
Food production and preparation have always been labor and capital intensive. Today, these traditional industries are rapidly moving beyond technology and tools to support humans and into the realm of complete automation. With the internet of things, low-cost sensors, cloud-computing ubiquity, and big data analysis, farmers and chefs are being replaced with connected, big data robots, not just in the field but also in the kitchens of our homes and restaurants.
Internet-connected systems can be built with just commodity sensors and open source software that gather millions of sensor readings, transport those readings into the cloud, and process that information in real time and with large-scale, batch machine-learning engines. From farming to food shipment and logistics to food prep and cooking, we can:
- Diagnose plant issues with image recognition
- Automate food production in both outdoor and indoor environments
- Automate food prep and even basic cooking functions
- Learn in both unassisted and assisted ways to improve food taste, nutritional value, and food production yield
- Collect and process thousands of sensor readings per second
- Integrate natural language analysis or interfaces
- Monitor complex robotics systems
Tim Gasper explores the tech stack, data science techniques, and use cases driving this revolution. You’ll learn about some very cool next-generation use cases in food production and food prep, the big data and IoT architectures that support those use cases, and how to implement the software and data science techniques that drive these innovations.
Untangling the Ball of Strings: Machine Learning for Localization
Michelle Casbon - Qordoba
To grow beyond an initial alpha launch, most platforms need to support a variety of locales. The challenge with supporting multiple locales is the maintenance and generation of localized strings, which are deeply integrated into many facets of a product. To address these challenges at Qordoba, we’re using highly scalable tools and machine learning to automate the process. Specifically, we need to generate high-quality translations in many different languages and make them available in real-time across platforms, e.g. mobile, print, and web. The combination of various open source tools such as Apache Spark MLlib provides structure for a scalable localization platform with machine learning at its core.
In this talk, we describe the techniques we’re using to provide:
- Continuous deployment of localized strings
- Live syncing across platforms (mobile, web, Photoshop, Sketch, help desk, etc.)
- Content generation for any locale
- Emotional response
We will also share our architecture for handling billions of localized strings in many different languages. We talk about our use of:
- Scala and Akka as an orchestration layer
- Apache Cassandra and MariaDB as a storage layer
- Apache Spark & Apache PredictionIO (incubating) for natural language processing
- Apache Kafka as a message bus for reporting, billing, & notifications
- Docker, Marathon, & Apache Mesos for containerized deployment
We present our solution in the context of a platform that makes it feasible to build products that feel native to every user, regardless of language.
NEW - Truth is Dead
Jonathon Morgan - New Knowledge / Data for Democracy
If the 2016 elections made one thing clear, it's that Americans are divided. Hyper-partisan rhetoric, social media filter bubbles, and fake news didn't just shape the outcome, but are warping how Americans experience reality -- even inspiring real-world acts of violence.
We can quantify these different realities and their consequences with neural networks trained to examine language from social media. Using word vectors from models trained on partisan corpora embedded in a neutral vector space, we measured the increasing radicalization of the alt-right, the distance between the left and right on contentious campaign issues, and how racist rhetoric has infected mainstream political discourse.
The results are alarming, but while voters seem to inhabit different realities when it comes to politics and policy issues like immigration, guns, taxes, we'll show that, when it comes to everyday issues like work, family, and even government, we may be more alike than we think.
Ryan Mitchell - HedgeServ
NEW - Distilling dark knowledge from neural networks
Recent papers from the NIPS 2015 workshop on feature extraction suggest that representational learning consisting of "supervised coupled" methods (such as the training of supervised deep neural networks) can significantly improve classification accuracy vis-à-vis unsupervised and/or uncoupled methods. Such methods jointly learn a representation function and a labeling function. If you are a machine learning practitioner in a field whose applications demand or require strict interpretability constraints, a major drawback of using deep neural networks is that they are notoriously difficult to interpret. In this talk, Alex will discuss "distilled learning" -- training a classifier and extracting its outputs for use as training labels for another model -- and "dark knowledge" -- implicit knowledge of the underlying data representation learned by a classifier. Alex will show how, taken together, these techniques improve classification accuracy in more readily interpretable models such as single decision tree and logistic regression learners. Finally, Alex will discuss applications such as health sciences, credit decisions, and fraud detection.
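The idea of using one classifier's outputs as training labels for another can be sketched as follows. The temperature-scaled softmax is a common way to produce such "soft" labels in the distillation literature; it is an assumption here, as the abstract does not describe the exact recipe, and the logits are made-up numbers.

```python
import math
from typing import List

def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets: List[float], predictions: List[float]) -> float:
    """Loss a student model would minimise against the teacher's soft labels."""
    return -sum(t * math.log(p) for t, p in zip(targets, predictions))

# Hypothetical teacher scores for one example over three classes.
teacher_logits = [6.0, 2.0, 1.0]

hard_view = softmax(teacher_logits, temperature=1.0)    # nearly one-hot
soft_labels = softmax(teacher_logits, temperature=4.0)  # keeps the class ranking visible

print([round(p, 3) for p in hard_view])
print([round(p, 3) for p in soft_labels])
```

The softened distribution still picks the same top class, but it also tells the student which wrong classes the teacher considered plausible, which is the "dark knowledge" a simpler model such as a decision tree or logistic regression can then be trained on.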
NEW - Distances, Similarities, and Scores: Practical Model Examples
Whether we’re talking about spam emails, merging records, or investigating clusters, there are many times when having a measure of how alike things are makes them easier to work with. You may have unstructured or vague data that isn’t incorporated into your data models (e.g., information from subject-matter experts who have a sense of whether something is good or bad, similar or different). Melissa Santos offers a practical approach to creating a distance metric and validating with business owners that it provides value—providing you with the tools to turn that expert information into numbers you can compare and use to quickly see structures in the data.
Melissa walks you through setting expectations for a distance, creating distance metrics, iterating with experts to check expectations, validating the distance on a large chunk of the dataset, and then circling back to add more complexity. She also shares some real-world examples, such as distance from usual emails from a domain, quality scores for geographic data, and ranking your customers.
What is a distance?
Turning expert opinion into training data
Making a very basic model
Why your model is wrong
Working with experts and stakeholders to validate usefulness
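The workflow outlined above (define a distance, then check it against expert judgment) might look like the following sketch. The email fields, the weights, and the 0.5 threshold are all hypothetical choices for illustration and are not taken from the talk.

```python
from typing import Dict, Iterable, List, Tuple

Email = Dict[str, str]

def jaccard_distance(a: Iterable[str], b: Iterable[str]) -> float:
    """1 minus the share of tokens the two collections have in common."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)

def email_distance(e1: Email, e2: Email,
                   w_subject: float = 0.7, w_sender: float = 0.3) -> float:
    """A very basic model: weighted mix of subject overlap and sender match."""
    subject_d = jaccard_distance(e1["subject"].lower().split(),
                                 e2["subject"].lower().split())
    sender_d = 0.0 if e1["sender_domain"] == e2["sender_domain"] else 1.0
    return w_subject * subject_d + w_sender * sender_d

def agreement_with_experts(pairs: List[Tuple[Email, Email, str]],
                           threshold: float = 0.5) -> float:
    """Fraction of expert-labelled pairs where the distance agrees with the expert."""
    hits = 0
    for e1, e2, expert_says in pairs:
        model_says = "similar" if email_distance(e1, e2) < threshold else "different"
        hits += model_says == expert_says
    return hits / len(pairs)

march = {"subject": "Your invoice for March", "sender_domain": "billing.example.com"}
april = {"subject": "Your invoice for April", "sender_domain": "billing.example.com"}
promo = {"subject": "Win a free cruise now", "sender_domain": "promo.example.net"}

expert_pairs = [(march, april, "similar"), (march, promo, "different")]
print(email_distance(march, april))
print(agreement_with_experts(expert_pairs))
```

When the agreement score is low, the disagreeing pairs are the ones worth taking back to the experts, which is where the iteration in the outline comes from.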
Data Pipelines with Kafka and Spark (2 hour deep-dive)
Spark and Kafka have emerged as a core part of distributed data processing pipelines. This tutorial will explain how Spark, Kafka and the rest of the big data ecosystem fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads. By examining use cases and architectures, we’ll trace the flow of data from source to output, and explore the options and considerations for each stage of the pipeline.
When A/B Testing Fails: A Case Study in Real Estate
Nelson Ray - Opendoor
When sample sizes are small, measurement lags long, and the treatment space that you want to explore large, A/B testing often fails.
American homes represent a $25 trillion asset class, with very little liquidity. Selling a home on the market takes months of hassle and uncertainty. Opendoor offers to buy houses from sellers, charging a fee for their service. Opendoor bears the risk in reselling the house and needs to understand the effectiveness of different liquidity models. Key metrics and resale outcomes can take many months to measure, suggesting that A/B testing may not be the best tool.
Given historical data and believable user models, simulation offers a compelling alternative. We'll walk through how Opendoor data scientists approached simulating these trade-offs and validated them through launch via quasi-experimental means.
How to Observe: Lessons from Epidemiologists, Actuaries and Charlatans
Juliet Hougland - Cloudera
What can you do when you can’t implement an experiment? A/B testing is the bread and butter of many data scientists’ work. Some data scientists aspire to “a culture of experimentation” or even go as far as to (incorrectly) claim that randomized controlled trials are the only way to make inferences from data. What analytic tools are available when we can’t randomize treatment groups and perform direct experiments?
Epidemiologists and actuaries have been working in this situation for decades because many of the processes they need to study are impossible or unethical to experiment on. This talk will provide an overview of observational methods: their strengths, limitations, and the situations they are best suited for. We will dig into a real observational study that was not suited for A/B testing -- measuring comparative quality of different versions of on-premise enterprise software.
As useful as observational methods of inference are, they are also easy to misuse and misinterpret. We will discuss some choice examples of misuse and abuse of observational methods and hopefully avoid our own charlatanry in the future.
NLP Day at Data Day Texas
At Data Day Texas 2016, we launched the inaugural NLP Day Texas - a conference within a conference.
Based on community demand, we are expanding NLP Day for 2017. This is not a separate event.
Your registration for Data Day Texas includes the 20+ NLP sessions and workshops as well.
Below is a list of the currently confirmed NLP Day sessions. We are awaiting final abstracts from several of the confirmed speakers. For a complete list of NLP Day speakers, visit the Data Day Speakers page
NEW - Scattertext: A Tool for Visualizing Differences in Language
Jason Kessler (CDK Global)
What sorts of language do persuasive arguments have in common that doesn't tend to appear in unpersuasive arguments? How do Republicans and Democrats speak differently to the American public? How is this speech different from non-political language?
These are all questions the new Python library Scattertext is designed to help answer. Given two contrasting sets of documents, the library places dots representing thousands of frequently used phrases on a scatter plot, intelligently labeling terms when space is available. We'll see how the location of each point reveals a phrase's class-association, and how different point-coloring schemes can reveal topic association and association of semantic categories. For example, we will see how Scattertext can show how different categories speak to different genders using a word2vec based coloring scheme.
Finally, we will see how Scattertext can act as an interactive tool to highlight sections of documents that provide an interpretable explanation of the concepts a characteristic term invokes.
NEW - How to Progress from NLP to Artificial Intelligence
Jonathan Mugan (DeepGrammar)
Why isn’t Siri smarter? Our computers currently have no commonsense understanding of our world. This deficiency forces them to rely on parlor tricks when processing natural language because our language evolved not to describe the world but rather to communicate the delta on top of our shared conception. The field of natural language processing has largely been about developing these parlor tricks into robust techniques. This talk will explain these techniques and discuss what needs to be done to move from tricks to understanding.
NEW - Bootstrapping a corpus: how to build a rich topic library from a handful of words
Rob McDaniel (Live Stories)
What do you do when you're trying to discover insights from a small amount of source data? This talk will discuss methods for mining text from the web to augment a small corpus, and exposing semantic relationships within data when you're starting from scratch. Topics will include unsupervised keyword generation and topic modelling, and how these can improve real-world scenarios such as search.
Exploring Modeling Methods in Named Entity Recognition
Jacob Su Wang - Ojo Labs
Named Entity Recognition (NER) refers to the sequence labeling task where a model is learned to map unannotated word sequences to labeled sequences with pre-defined “named entities” (e.g., George Washington = person’s name) identified under different categories.
We overview popular methods of NER in two categories: feature engineering based and feature abstraction based (i.e., deep learning), reviewing supervised, semi-supervised and unsupervised models in the literature. We comment on the pros and cons of the methods and share with the community the difficulties we have encountered in the real-world application of the models.
NEW - Creating Knowledgebases from text in absence of training data
Sanghamitra Deb - Accenture
A major part of Big Data collected in most industries is in the form of unstructured text. Some examples are log files in the IT sector, analyst reports in the finance sector, patents, laboratory notes and papers, etc. Some of the challenges of gaining insights from unstructured text are converting it into structured information and generating training sets for machine learning. Typically training sets for supervised learning are generated through the process of human annotation. In the case of text this involves subject matter experts reading several thousand to millions of lines of text. This is very expensive and may not always be available, hence it is important to solve the problem of generating training sets before attempting to build machine learning models. Our approach is to combine rule based techniques with small amounts of SME time to bypass time consuming manual creation of training data. The end goal here is to create knowledgebases of structured data which are used to derive insights on the domain. I have applied this technique to several domains, such as data from drug labels and medical journals, log data generated through customer interaction, generation of market research reports, etc. I will talk about the results in some of these domains and the advantage of using this approach.
Graph Day at Data Day Texas
Following the success of the inaugural Graph Day in January 2016, scores of people asked if we would consider bringing Graph Day back to Austin for 2017. We succumbed. There will be a Graph Day 2017, and it is included as part of Data Day Texas. Your Data Day ticket includes all the Graph Day sessions and workshops as well as all the Data Day and NLP sessions.
Below is a list of the currently confirmed Graph Day sessions. We are awaiting final abstracts from several of the confirmed speakers. For a complete list of Graph Day speakers, visit the Graph Day Speakers page
Using Graph Analytics for Machine Learning Problems
Alex Dimakis (University of Texas)
Many datasets from the web, social networks, bioinformatics and neuroscience are naturally graph-structured while others can be mapped to graph representations in non-obvious ways. We discuss several graph prediction problems like link prediction, finding important nodes and graph classification. We show how machine learning techniques can be used for these problems and how to extract features from the graph structure that can be used to improve a machine learning model.
NEW - Graph Databases: what's next?
Luca Garulli (OrientDB)
Luca Garulli, the Founder of OrientDB, the 2nd Graph Database on the market, will analyze the main differences between today's leading Graph Database products, discussing each product's strengths and the direction the Graph Database market is headed. If you're working with a Graph Database or you're interested in learning more about the Power of Graphs today and in the upcoming future, you can't miss this presentation.
NEW - Neo4j Graph Database Workshop For The Data Scientist Using Python. (90 minutes)
William Lyon (Neo4j)
Graph databases provide a flexible and intuitive data model that is ideal for many data science use cases such as ad-hoc data analysis, generating personalized recommendations, social network analysis, natural language processing, and fraud detection. In addition, Cypher, the query language for graphs, allows for traversing the graph by defining expressive graph queries using graph pattern matching. In this workshop we will work through a series of hands-on use cases using Neo4j and common Python science tools such as pandas, igraph, and matplotlib. We will cover how to connect to Neo4j from Python, an overview of how to query graphs using Cypher, how to import data into Neo4j, data visualization, and how to use Python data science tools in conjunction with Neo4j for network analysis, generating recommendations, and fraud detection. Attendees should install Neo4j, Jupyter and be somewhat familiar with Python to get the most out of the session.
NEW - Enabling a Multimodel Graph Platform with Apache TinkerPop.
Jason Plurad (IBM / Apache Software Foundation)
Graphs are everywhere, but in a modern data stack, they are not the only tool in the toolbox. With Apache TinkerPop, adding graph capability on top of your existing data platform is not as daunting as it sounds. We will do a deep dive on writing Traversal Strategies to optimize performance of the underlying graph database. We will investigate how various TinkerPop systems offer unique possibilities in a multimodel approach to graph processing. We will discuss how using Gremlin frees you from vendor lock-in and enables you to swap out your graph database as your requirements evolve.
NEW - Graphs vs Tables: Ready? Fight.
Denise Gosnell (PokitDok)
Lessons learned from building similarity models from structured healthcare data in both graph and relational dbs
The infrastructure debate for the “optimal” data science environment is a loud and ever changing conversation. At PokitDok, the data engineering and data science teams have tested and deployed a myriad of architecture combinations including dbs like Titan, DataStax Enterprise, Neo4j, Elasticsearch, MySQL, Cassandra, Mongo, … the list goes on. For us, the final implementations of tested and deployed data science pipelines became a balance of the scientific modeling domain, the right engineering tool, and a bunch of sandboxes.
In this talk, Denise Gosnell from PokitDok will discuss the polarizing false dichotomy of graph dbs vs. relational dbs. She will step through two different recommendation pipelines which ingest and transform structured healthcare transactions into similarity models. She will use (a) graph traversals to rank entities in a database, (b) relational tables to create co-occurrence similarity clusters, and then (c) discuss the modeling intricacies of each development process. Attendees of this talk will be introduced to the complexities of healthcare data analysis, step through graph and tabular based similarity models, and dive into the ongoing false dichotomy of graph vs relational dbs.
NEW - Building a Graph Database in the Cloud: challenges and advantages.
Alaa Mahmoud (IBM)
There are various challenges that face new and existing Graph Database users that make it hard to get started and to contain the cost of maintaining the infrastructure. A Cloud offering that’s cost-effective, robust and scalable seems to be the right answer to these challenges. However, it comes with its own challenges as well. In this talk, we’ll go over the lessons learned from building IBM Graph, a Graph database as a service offering from IBM. Here are the topics we'll be presenting in this talk:
- Hurdles that slow down the adoption of Graph databases
- The need for a cloud-based Graph Database solution
- Different strategies to provide a cloud solution
- Challenges that face Graph Database providers in putting a Graph database on the cloud.
NEW - Graphs in time and space: A visual example
Corey Lanum, Cambridge Intelligence
Graphs and graph databases are helping to solve some of today’s most pressing challenges. From managing critical infrastructure and understanding cyber threats to detecting fraud, we have worked with hundreds of developers building all kinds of mission-critical graph applications.
In almost all of these projects, graphs are being used not just to understand the ‘who’ / ‘how’ / ‘what’ questions, but also the ‘where’ and ‘when’.
This presentation will explore two dimensions of graphs that, from our experience, cause the most confusion but potentially contain vital data insight: space and time.
Corey will use visual examples to explain the quirks (and importance) of dynamic and geospatial graphs. He will then show how graph visualization tools empower users to explore connections between people, events, locations and times.
Do I need a Graph Database
Juan Sequeda, Capsenta
This talk grew out of Juan Sequeda's office hours following the Seattle Graph Meetup. Some of the questions posed were: How do I recognize a problem best solved with a graph solution? How do I determine the best type of graph to solve the problem? How do I manage the data where both graph and relational operations will be performed? Juan did such a great job of explaining the options, we asked him to develop his responses into a formal talk.
Graph Database Implementation on PostgreSQL
Kisung Kim, Bitnine
In this presentation, we will share our experiences related to implementing a graph database based on PostgreSQL. Bitnine Global is currently developing a graph database, namely Agens Graph based on PostgreSQL, that is expected to be released at the end of this year. Agens Graph is a multi-model database which supports both the relational and the graph data model. It will simultaneously support SQL as well as the most popular graph query language - Cypher.
We’ll also discuss the architecture of Agens Graph, various challenges of PostgreSQL that we have experienced, including how to overcome the mismatches in the two different data models, the integration of both SQL and Cypher in one single processing engine, and how we exploit the great features of PostgreSQL to implement the new multi-model database. Lastly, we will show the future roadmap for Agens Graph.
Time for a new relation: Going from RDBMS to Graph
Patrick McFadin, DataStax
Most of our introductory graph sessions come from practitioners with a heavy graph background. Patrick McFadin will present a session from the perspective of someone with a broad relational background (at scale) who has recently started working with graphs.
Like many of you, I have a good deal of experience building data models and applications using a relational database. Along the way you may have learned to data model for non-relational databases, but wait! Now we are seeing Graph databases increase in popularity and here’s yet another thing to figure out. I’m here to help! Let’s take all that hard won database knowledge and apply it to building proper Graph based applications. You should take away the following:
- How graph creates relations differently than an RDBMS
- How to insert and query data
- When to use a graph database
- When NOT to use a graph database
- Things that are unique to a graph database
Moving Your Data To Graph
Dave Bechberger, Expero
Graphs are a great analysis and transactional model for certain kinds of data, but unless you're starting your company from scratch, chances are you've got relational or document data you'd like to start with. Using cases from recent work, we will discuss the fundamentals of good graph data modeling and how relational models and document models are best expressed in property graph form, including some common anti-patterns.
NEW - Traversing our way through Spark GraphFrames and GraphX
Mo Patel, Think Big
The power of network effects has been well studied and put into production by some of the most successful organizations around the world. Networks form graph data structures, and being able to harness analytic value from these structures further increases the utility of networks. In this talk, Mo Patel will review the newly introduced Spark GraphFrames feature and walk through an end to end Graph Analytics use case using the Spark GraphX library.
NEW - Implementing Network Algorithms in TinkerPop's GraphComputer
Ted Wilmes, Apache Software Foundation / Expero
The Apache TinkerPop project comes with a set of centrality and clustering graph algorithm implementations, but even more importantly, provides the building blocks to implement your own. There are a plethora of other algorithms that support a wide variety of use cases including fraud detection, flow analysis, and resource scheduling to name just a few. This talk will dig into how the TinkerPop GraphComputer can execute vertex programs in parallel across massive graphs and how you can implement algorithms that fit your specific use cases.
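To give a feel for the vertex-program style of computation, here is a minimal single-machine sketch of PageRank written as repeated supersteps: each vertex sends messages along its out-edges, then every vertex updates its own state from the messages it received. This is plain Python and does not use the TinkerPop API; the function name and the sample graph are assumptions made for illustration, and a real GraphComputer would run the per-vertex work in parallel.

```python
def pagerank_supersteps(out_edges, damping=0.85, iterations=20):
    """PageRank as message-passing supersteps over an adjacency dict.

    out_edges maps every vertex to a list of its out-neighbors; every
    neighbor must itself appear as a key.
    """
    vertices = list(out_edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(iterations):
        # Message phase: each vertex splits its rank among its out-neighbors.
        inbox = {v: 0.0 for v in vertices}
        dangling = 0.0
        for v in vertices:
            targets = out_edges[v]
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    inbox[t] += share
            else:
                # Vertices without out-edges spread their rank evenly.
                dangling += rank[v]

        # Compute phase: each vertex updates its state from its inbox only.
        rank = {
            v: (1 - damping) / n + damping * (inbox[v] + dangling / n)
            for v in vertices
        }
    return rank


# Hypothetical four-vertex graph.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank_supersteps(graph)
```

Because each vertex only reads its own inbox during the compute phase, the per-vertex work is independent, which is what allows this style of algorithm to be distributed across a large graph.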
NEW - Graphs + Sensors = The Internet of Connected Things
Ryan Boyd (Neo4j)
There is no question that the proliferation of connected devices has increased the volume, velocity, and variety of data available. Deriving value and business insight from this data is an ever evolving challenge for the enterprise. Moving beyond analyzing just discrete data points is when the real value of streaming sensor data begins to emerge. Graph databases allow for working with data in the context of the overall network, not just a stream of values from a sensor. This talk will cover an architecture for working with streaming data and graph databases, use-cases that make sense for graphs and IoT data, and how graphs can enable better real-time decisions from sensor data.
NEW - Graph Query Languages
Juan Sequeda, Capsenta
The Linked Data Benchmark Council (LDBC) is a non-profit organization dedicated to establishing benchmarks, benchmark practices and benchmark results for graph data management software. The Graph Query Language task force of LDBC is studying query languages for graph data management systems, and specifically those systems storing so-called Property Graph data. The goals of the GraphQL task force are to:
- devise a list of desired features and functionalities of a graph query language
- evaluate a number of existing languages (i.e. Cypher, Gremlin, PGQL, SPARQL, SQL) and identify possible issues
- provide a better understanding of the design space and state-of-the-art
- develop proposals for changes to existing query languages, or even a new graph query language. This query language should cover the needs of the most important use-cases for such systems, such as social network and Business Intelligence workloads.
This talk will present an update of the work accomplished by the LDBC GraphQL task force. We also look for input from the graph community.
NEW - Time Series and Audit Trails: Modeling Time in an Industrial Equipment Property Graph
Sebastian Good, Expero
Natural networks make great cases for graph databases -- telecommunications, interconnected parts in engines, transportation routes. Companies collect changes in and measurements across this network: new connections made, maintenance and sensor readings, truck locations. In this talk, we will discuss several methods for storing sensor data in graph databases, and for storing a history of changes to a network. Concepts covered will include time series, temporal and bitemporal models.
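For readers new to the term, a bitemporal model tracks two timelines per fact: when something was true in the real world (valid time) and when the database believed it (recorded time). The sketch below is one possible way to express that for an edge between two pieces of equipment. The class, field names, and equipment ids are hypothetical and are not drawn from the talk.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class EdgeVersion:
    source: str
    target: str
    valid_from: datetime               # when the connection began in the real world
    valid_to: Optional[datetime]       # None means "still connected"
    recorded_from: datetime            # when the database started believing this version
    recorded_to: Optional[datetime]    # None means "current belief"


def _covers(start, end, instant):
    """True if instant falls in the half-open interval [start, end)."""
    return start <= instant and (end is None or instant < end)


def connected_as_of(versions, source, target, valid_at, known_at):
    """Was source connected to target at valid_at, as believed at known_at?"""
    return any(
        v.source == source
        and v.target == target
        and _covers(v.valid_from, v.valid_to, valid_at)
        and _covers(v.recorded_from, v.recorded_to, known_at)
        for v in versions
    )


# Hypothetical audit trail: on March 10 we learn the link actually ended on February 1.
history = [
    EdgeVersion("pump-7", "valve-3", datetime(2017, 1, 1), None,
                datetime(2017, 1, 2), datetime(2017, 3, 10)),
    EdgeVersion("pump-7", "valve-3", datetime(2017, 1, 1), datetime(2017, 2, 1),
                datetime(2017, 3, 10), None),
]

believed_on_mar_1 = connected_as_of(
    history, "pump-7", "valve-3", datetime(2017, 2, 15), datetime(2017, 3, 1))
believed_on_mar_15 = connected_as_of(
    history, "pump-7", "valve-3", datetime(2017, 2, 15), datetime(2017, 3, 15))
```

Keeping the superseded version instead of overwriting it is what makes the audit trail possible: the same question about February 15 gets a different answer depending on the date from which it is asked.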
NEW - Meaningful User Experience with Graph Data
Chris LaCava, Expero
Congratulations, your data is up and running in a graph database! This is the first step of many to unlocking the potential in your data. It’s easy to get mired in the complexities of graph technology and forget that real users, mere mortals, will need to use this information to inform mission critical tasks. To get the value out of your graph investment, you’ll need to provide an experience that enables users to explore and visualize your graph data in meaningful ways. In this talk we’ll take a hands on approach to applying user-centered strategies and leveraging the latest UI tools to rapidly create great experiences with graph data. Topics will include tailoring experiences to the intended audience and data, and determining the right visualization for the job.
NEW - LEBM: Making a Thoroughly Nasty Graph Database Benchmark
David Mizell, Cray
LUBM (Lehigh University BenchMark) is the most-used benchmark for measuring the query performance of graph databases that use the World-Wide Web Consortium’s “RDF Triples” data representation standard and the SPARQL query language (also a W3C standard). The LUBM benchmark contains 14 SPARQL queries that are run against a synthetic database that contains information about however many fictional universities the user specifies. The LUBM synthesizer (written in Java) creates data about the universities’ faculties, their students, their grad students, what department they’re in, papers that the faculty and grad students have published, and so on. The problem with LUBM is that its data is too localized to make it representative of real graph databases. That makes it unrealistically easy to achieve high performance. It was really designed to test the power of a graph database’s ontology processing logic, rather than its performance on complex, graph-oriented queries. We started with the LUBM synthetic university data and superimposed a “social network” – x is friends with y, y is friends with z… between its students. This kind of graph is typically very irregular, non-local, unbalanced and thus hard to efficiently query. Social networks come up a lot in real-world graph databases, so this extension of LUBM (we call it LEBM, the Lehigh Extended BenchMark) is much more representative of the kinds of graphs people want to run queries against. We wrote the LEBM synthesizer in Java, and plan to make it publicly available, probably via Lehigh’s LUBM web site.
Make Graphs Great Again - Analyzing Election Data Using Neo4j
William Lyon (Neo4j)
The US 2016 election was data-rich - from hundreds of millions of tweets about the election, to polling data, to election results, to campaign funding reports. Throughout the election cycle, our team worked along with the Neo4j community to understand the relationship between these data. This talk will discuss how graphs enable us to use these relationships to understand the candidates, races, and overall election. Learn about the Cypher graph query language, graph algorithms (using user defined procedures in Java) and the neo4j-spatial extension and how graph analysis helped us make sense of the abundance of election related data.
NEW - Demography Estimation on a Large Telco Graph – a Case Study
Gergely Svigruha (Lynx Analytics)
Many South-East Asian telecommunication companies have a huge customer base (from a few tens of millions to a hundred million customers), but most of the time they know little about who those customers are, lacking even the simplest demographic information such as gender and age.
However, they have a very rich dataset about the calls and texts between the customers (CDR - call data records) that can be used to enrich the customer demography data. The customers (vertices) and the calls (edges) together represent an extremely large graph, with hundreds of millions of vertices and billions of edges that the existing graph tools are unable to analyze.
Lynx built a graph tool on top of Spark that allows the analysis of this large graph. In this talk, we will show how we used graph segmentation to discover homogeneous communities where some of the demographic variables were known and how we virally stamped the rest of the network by knowing the memberships of these communities.
We will demonstrate how we defined the graph from the CDR, the vertex and edge attributes we applied, the rules we used to filter out anomalies (such as the call centers who accept thousands of calls, and shouldn’t be part of the social network analysis), the kind of graph segmentations we used to define the membership in these communities, the challenges we faced and how we overcame these.
With the help of this graph analysis, we achieved 70% age estimation accuracy (within +/- 5 years) on 100M customers, using a 30M base model of 82% accuracy together with community and neighborhood based models. The results were evaluated on a 30k ground-truth dataset.
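Two of the steps described in this abstract, filtering out anomalous vertices such as call centers and spreading known demographic labels through the network, can be sketched in a few lines. The version below is a deliberately simplified stand-in: it uses a plain degree threshold and neighbor majority voting instead of the graph segmentation Lynx describes, and every name, threshold, and record in it is hypothetical.

```python
from collections import Counter, defaultdict


def build_graph(calls, max_degree):
    """Build an undirected call graph, dropping vertices with too many contacts."""
    neighbors = defaultdict(set)
    for caller, callee in calls:
        neighbors[caller].add(callee)
        neighbors[callee].add(caller)
    # Treat very high-degree vertices (e.g. call centers) as anomalies.
    hubs = {v for v, ns in neighbors.items() if len(ns) > max_degree}
    return {v: ns - hubs for v, ns in neighbors.items() if v not in hubs}


def propagate_labels(graph, known, rounds=3):
    """Give unlabeled vertices the most common label among labeled neighbors."""
    labels = dict(known)
    for _ in range(rounds):
        updates = {}
        for v, ns in graph.items():
            if v in labels:
                continue
            votes = Counter(labels[n] for n in ns if n in labels)
            if votes:
                updates[v] = votes.most_common(1)[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels


# Hypothetical call records; "cc" plays the role of a call center.
calls = [("a", "b"), ("b", "c"), ("c", "d"),
         ("cc", "a"), ("cc", "b"), ("cc", "c"), ("cc", "d"), ("cc", "e")]
graph = build_graph(calls, max_degree=3)
labels = propagate_labels(graph, known={"a": "18-25"})
```

Removing the hub first matters: if the call center stayed in the graph, it would connect every customer to every other customer and any label would leak across the whole network.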
NEW - How to Work with Large and Complex Graphs.
Haikal Pribadi (GRAKN.AI)
In this tutorial, we will describe the characteristics of large-scale interconnected data and why they are challenging to work with. We will then dive into different techniques to overcome these challenges using an open source knowledge graph, Grakn.
In order to maintain information consistency over large network data containing heterogeneous data types, the ability to expressively model your complex dataset is critical. We will demonstrate how to model your dataset through an ontology, which will also function as the schema to guarantee data consistency. You will then learn how to easily modify your schema to mimic any changes in your domain.
Big datasets often come from multiple sources, consist of different types and are in various formats, from JSON to CSV amongst others. Because of this heterogeneity, migration and consolidation into a single consistent store is fraught with problems. This section will cover some typical methodologies and tools used to migrate data into a single source as well as common issues encountered. We will also introduce a language to help us migrate this heterogeneous, multi-sourced data into a consolidated information network.
Performing complex queries efficiently is an integral part of processing interconnected big data. However, queries that involve multiple tables, different data formats or perform aggregation functions over this type of data are frequently verbose, slow to execute or both when using conventional datastores. You will learn about how to compress queries and reduce their complexity via generic rules that can be defined as reusable patterns. We will also demonstrate how to leverage domain specific rules to infer knowledge that is not explicitly stored.
The tutorial will conclude by introducing how to perform complex traversals and intelligent discoveries using a graph query language, Graql. We will show you how to explore connections in your information network, draw implicit insights from explicitly stored data, and perform real time analytics.
NEW - Large Scale Graph Analytics Through Graql.
Borislav Iordanov (GRAKN.AI)
We will discuss the development of graph analytics through distributed algorithms, the different types of analysis that are possible, and some of the potential benefits and business applications, such as fraud detection, recommendation engines, customer 360 and biomedical research. We go on to share our insight on the complexity, lack of reusability and specialised engineering talent required to implement graph analytics successfully, and will demonstrate how the development of graph analytics is costly for every unique dataset. We then introduce a method that combines an open-source knowledge graph, Grakn, with Apache Spark to run Bulk Synchronous Parallel (BSP) algorithms, such as MapReduce and Pregel, to perform massively parallel graph analytics through a graph query language. By abstracting the low-level implementation details of graph analytics, we show the audience a way to avoid the pitfalls of developing graph analytics from scratch as described above.
The audience will learn how they can harness the power of graph analytics through a few lines of a knowledge-oriented query language, Graql, to perform:
1. Cluster analysis to identify common structures within data
2. Path analysis to determine the shortest distance between pieces of information
3. Centrality analysis to identify most interconnected instances in the network
4. Large scale statistics to summarise and understand quantitative data over information networks
We will demonstrate how the audience can perform simple queries for each type of analysis, how they can easily integrate it into the development of intelligent systems, and how Graql enables the development of powerful business applications.
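For context, two of the analyses listed above (path analysis and centrality analysis) are shown here in their simplest single-machine form. This is plain Python and not Graql syntax; the function names and the sample network are assumptions for illustration, and degree is only one of several centrality measures.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search for a shortest path in an unweighted graph."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # goal is not reachable from start


def degree_centrality(graph):
    """Count the direct connections of every node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


# Hypothetical undirected network stored as adjacency lists.
network = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}
path = shortest_path(network, "alice", "erin")
centrality = degree_centrality(network)
most_connected = max(centrality, key=centrality.get)
```

A query language that exposes these analyses as built-in operations saves the caller from writing and distributing this kind of traversal code by hand.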
Visiting SF? Quite a few Data Day alumni joined us for the recent Bay Area NLP Happy Hour. See you at the next one?
Jonathan Mugan, CEO of Deep Grammar, speaking at the recent NLP Community Day in Austin.
We host many NLP events in Austin throughout the year. Join the Austin Text / NLP Meetup to stay in the loop.