The Data Day Texas 2019 Sessions
Below is the first round of confirmed talks. We will be adding new sessions daily.
The Intelligent Sales Organization Runs on Speech Recognition, Knowledge Graphs and AI
Dr. Jans Aasman - Franz Inc. / Shannon Copeland - N3
We describe a real-world Intelligent Sales Organization that uses graph-based technology for taxonomy-driven entity extraction, speech recognition, machine learning, and predictive analytics to improve the quality of conversations, increase sales, and improve business visibility.
The details: In the typical sales organization, the contents of the actual chat or voice conversation between agent and customer is a black hole. In the modern Intelligent Sales Organization (“ISO”), the interactions between agent and customer are a source of rich information that helps agents improve the quality of the interaction in real time, creates more sales, and provides far better analytics for management. An ISO is enabled by at least five main technologies: a taxonomy of the products and services sold, speech recognition to turn conversations into text, a taxonomy-driven entity extractor to pull the important concepts out of conversations, machine learning to classify chats in various ways, and a real-time Knowledge Graph that stores all of this, knows (and stores) everything about customers and agents, and provides the raw data for machine learning to improve the business of the ISO.
The Role of Data Science
Jon Allen - SyncThink
Data Science is often a somewhat nebulously defined role within an organization. “Are they developers? Are they mathematicians? When do we need one? How many do we need? I mean, we definitely need a data science team, cause everyone else has one…” It’s important to identify and categorize the responsibilities of data science within an organization, both to find appropriate candidates as the company grows and to alleviate the political pressures that often arise when roles are ill-defined.
Creating A Data Engineering Culture
Jesse Anderson - Big Data Institute
The biggest initial hurdle to success with Big Data isn’t technical - it’s management. Your data engineering project’s initial success is predicated on your management team correctly staffing and resourcing it. This runs opposite to how most data engineering teams are started and run: the assumption is that if you just choose the best technologies, things will fall into place. They don’t, and that’s a common pattern for failure.
But how do you correctly do something that’s so new? This could be your team’s first data engineering project. What should the team look like? What skills should the team have? What should you look for in a Data Engineer (because you’ll probably have to hire a Software Engineer and train them)? What are some of the management pitfalls?
In this talk, Jesse will cover the most common reasons why data engineering teams fail and how to correct them. This will include ways to get your management to understand that data engineering is really complex and time-consuming. It is not data warehousing with new names. Management needs to understand that you can’t compare a data engineering team to the web development team, for example.
Jesse will share the stories of teams who haven’t set up their data engineering culture correctly and what happened. Then, he will talk about the teams who’ve turned around their culture and how they did it.
Finally, Jesse will share the skills that every data engineering team needs.
Extracting Real-Time Insights from Streaming Data
Roger Barga - Amazon
Stream data processing is about identifying and responding to events happening in your business, in your service or application, and with your customers in near real-time. Sensors, IoT and mobile devices, and online transactions all generate data that can be monitored constantly to enable a business to detect and then act on events and insights before they lose their value. The need for large scale, real-time stream processing of big data in motion is more evident now than ever before. In this talk I will draw upon our experience with Amazon Kinesis data streaming services to highlight use cases and dive deep into the role of machine learning over streaming data to extract insights in real-time.
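For orientation, the producer side of such a pipeline can be as small as the sketch below, which writes events into a Kinesis data stream with boto3; the stream name, region, and event fields are illustrative rather than anything from the talk.

```python
# Minimal sketch: writing events to an Amazon Kinesis data stream with boto3.
# The stream name ("clickstream") and event fields are illustrative only.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish_event(event: dict) -> None:
    """Send one event; records with the same partition key land on the same shard."""
    kinesis.put_record(
        StreamName="clickstream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],
    )

publish_event({"user_id": "u-42", "action": "checkout", "value": 129.99})
```

Downstream consumers (Kinesis Data Analytics, Lambda, or custom applications) then read the shards continuously to detect and act on events while they still have value.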
Intro to Graph Databases for Data Scientists
Dave Bechberger - DataStax
With the rise of graph databases, graphs are no longer just a data structure but a powerful set of capabilities at the persistence layer which data scientists can leverage to accelerate the speed to insight. Unlike the relational world, we can now create graph models that work on 10,000 records or tens of millions of records and do so in real time.
In this talk, we will walk someone familiar with building on relational databases through the process of how to start leveraging the power of graph databases. We will talk about what makes for good and bad use cases, how to build effective models, and how to prevent your developers and data engineers from pulling their hair out when it goes to production.
Dissolving the Problem: Kafka is more ACID Than Your Database
Tim Berglund - Confluent
It has become a truism in the past decade that building systems at scale, using non-relational databases, requires giving up on the transactional guarantees afforded by the relational databases of yore. ACID transactional semantics are fine, but we all know you can’t have them all in a distributed system. Or can we?
In this talk, I will argue that by designing our systems around a distributed log like Kafka, we can in fact achieve ACID semantics at scale. We can ensure that distributed write operations can be applied atomically, consistently, in isolation between services, and of course with durability. An elusive set of properties becomes relatively easy to achieve with the right architectural paradigm underlying the application.
Data Science Automation: Facts & Fiction
Michael Berthold - KNIME & Konstanz University
Automation of Machine Learning or Artificial Intelligence is often associated with almost magical skills. In this session, Michael will pull the covers off that black box. He will discuss which aspects of the data science cycle can be automated and where human interaction is absolutely necessary.
Michael will then dive deeper into the secret sauce and describe how established techniques from the Machine Learning community such as Bayesian Optimization, Parameter Optimization, and Active Learning are the backbone of the automation engine and how those pieces work together to "not so magically" find the right mix of parameters and models.
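To make the parameter-optimization ingredient concrete, here is a generic scikit-learn sketch of a hyperparameter search; it illustrates the idea only and is not KNIME's automation engine.

```python
# Generic illustration of parameter optimization (not KNIME's implementation):
# search a small hyperparameter grid and keep the best cross-validated model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [3, 10, None]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Bayesian optimization follows the same loop but chooses the next parameter set to try based on the results observed so far, rather than exhaustively enumerating a grid.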
Data Science Tools: Cypher for Data Munging
Ryan Boyd - Neo4j
Running data analysis using tools like Pandas, Scikit-Learn, or Apache Spark requires that your data be in a clean format. However, as data scientists, we're often forced to bring data in from many different sources and understand the relationships between the data before running our analysis.
This session will discuss and show how we can use the power of the Cypher query language to bring data in from a variety of different sources, clean it, and prepare it for analysis in a variety of tools. We'll also show how we can supplement the native functionality available in Cypher with APOC - an amazing library of hundreds of utility functions for cleaning, refactoring and analyzing data.
While Cypher is currently used in databases like Neo4j and SAP HANA to query graph structures, it can now be used on Apache Spark with the CAPS alpha project. We'll show how Cypher can be used for Data Prep in both of these scenarios.
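To give a flavor of Cypher-based data munging, the sketch below ingests and normalizes a CSV through the Neo4j Python driver; the connection details, file, labels, and properties are placeholders rather than material from the talk.

```python
# Sketch: ingest and clean tabular data with Cypher via the Neo4j Python driver.
# URI, credentials, file name, and property names are illustrative.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LOAD_AND_CLEAN = """
LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row
MERGE (c:Customer {id: row.customer_id})
SET c.email  = trim(toLower(row.email)),   // normalize values on the way in
    c.signup = date(row.signup_date)
"""

with driver.session() as session:
    session.run(LOAD_AND_CLEAN)
driver.close()
```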
Kubeflow explained: Portable machine learning on Kubernetes
Michelle Casbon - Google
Practically speaking, some of the biggest challenges facing ML applications are composability, portability, and scalability. The Kubernetes framework is well suited to address these issues, which is why it’s a great foundation for deploying ML products. Kubeflow is designed to take advantage of these benefits.
Kubeflow makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. It removes the need for expertise in a large number of areas, reducing the barrier to entry for developing and maintaining ML products. The composability problem is addressed by providing a single, unified tool for running common processes such as data ingestion, transformation, and analysis, model training, evaluation, and serving, as well as monitoring, logging, and other operational tools. The portability problem is resolved by supporting the use of the entire stack either locally, on-premise, or on the cloud platform of your choice. Scalability is native to the Kubernetes platform and leveraged by Kubeflow to run all aspects of the product, including resource-intensive model training tasks.
Michelle Casbon demonstrates how to build a machine learning application with Kubeflow. By providing a platform that reduces variability between services and environments, Kubeflow enables applications that are more robust and resilient, resulting in less downtime, quality issues, and customer impact. Additionally, it supports the use of specialized hardware such as GPUs, which can reduce operational costs and improve model performance. Join Michelle to find out what Kubeflow currently supports and the long-term vision for the project.
90 minute workshop - Hands-On Introduction to Gremlin Traversals
Dr. Artem Chebotko - DataStax
This workshop introduces the Gremlin graph traversal language from Apache TinkerPop by exploring graph access patterns that are commonly seen in real-life applications. It features many practice problems, where the access pattern complexity is gradually increased from elements to paths, from paths to subgraphs, and from subgraphs to arbitrary graph patterns.
This workshop is hands-on. Each attendee gets a free cloud instance with pre-installed database and notebook software that can be accessed via a web browser to run Gremlin traversal examples and complete practice problems. A laptop is required to participate in the hands-on exercises, but is not absolutely necessary to learn and benefit from the session.
This workshop has no vendor-specific content to learn or understand but does use a graph database, DataStax Enterprise Graph, and notebook software, DataStax Studio, to run examples and test solutions to practice problems. All practice problems use Apache TinkerPop’s Gremlin with no vendor-specific extensions.
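For readers who want a preview of the kind of traversal the workshop builds toward, here is a small example using the TinkerPop Python client (gremlinpython); the server endpoint and the person/knows/worksOn schema are placeholders.

```python
# Preview of a simple Gremlin traversal via gremlinpython; the server URL and
# the person/knows/worksOn schema are placeholders, not workshop material.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# From elements to paths: who does Alice know, and what do those people work on?
titles = (g.V().has("person", "name", "Alice")
           .out("knows")
           .out("worksOn")
           .values("title")
           .toList())
print(titles)
conn.close()
```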
Intended audience: Beginners and Intermediate
Technical skills and concepts required: No prior graph data management experience is required.
Using weak supervision and transfer learning techniques to build a knowledge graph to improve student experiences at Chegg.
Sanghamitra Deb - Chegg
With 1.6 million subscribers and over a hundred fifty million content views, Chegg is a centralized hub where students come to get help with writing, science, math, and other educational needs. In order to impact a student’s learning capabilities we present personalized content to students. Student needs are unique, based on their learning style, studying environment, location, and many other factors. Most students will engage with a subset of the products and content available at Chegg. In order to recommend personalized content to students we have developed a generalized Machine Learning Pipeline that is able to handle training data generation and model building for a wide range of problems.
We generate a knowledge graph with a hierarchy of concepts and associate student-generated content, such as chatroom data, equations, chemical formulae, reviews, etc., with concepts in the knowledge graph. Collecting training data to generate different parts of the knowledge graph is a key bottleneck in developing NLP models. Employing subject matter experts to provide annotations is prohibitively expensive. Instead, we use weak supervision and active learning techniques, with tools such as Snorkel, an open source project from Stanford, to make training data generation dramatically easier. With these methods, training data is generated by using broad-stroke filters and high-precision rules. The rules are modeled probabilistically to incorporate dependencies. Features are generated using transfer learning from language models, question answering systems, and text summarization techniques for classification tasks. The generated structured information is then used to improve product features and enhance recommendations made to students.
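The labeling-function idea at the core of this approach is easy to sketch in plain Python; the labels and rules below are invented for illustration and stand in for the label model (such as Snorkel's) that would combine the noisy votes probabilistically. This is not Chegg's pipeline.

```python
# Illustrative weak-supervision sketch: broad-stroke rules emit noisy labels,
# which a label model (e.g., Snorkel's) later combines probabilistically.
# Labels and rules are made up; this is not Chegg's production pipeline.
ABSTAIN, CALCULUS, CHEMISTRY = -1, 0, 1

def lf_mentions_derivative(text: str) -> int:
    return CALCULUS if "derivative" in text.lower() else ABSTAIN

def lf_contains_formula(text: str) -> int:
    return CHEMISTRY if any(tok in text for tok in ("H2O", "NaCl", "mol")) else ABSTAIN

documents = [
    "Find the derivative of x^2 + 3x",
    "How many mol of NaCl are in 58 g?",
]

# Label matrix: one row per document, one (possibly abstaining) vote per rule.
label_matrix = [
    [lf(text) for lf in (lf_mentions_derivative, lf_contains_formula)]
    for text in documents
]
print(label_matrix)  # [[0, -1], [-1, 1]]
```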
NEW Visual Authoring of Real-Time Streaming Pipelines
Joey Echeverria - Splunk
Popular stream processing frameworks such as Apache Spark Streaming, Apache Flink, and Apache Kafka Streams make stream processing accessible to developers with language bindings typically in Java, Scala, and Python. These frameworks also include some variant of streaming SQL support to further expand the accessibility of large-scale, low-latency, high-throughput stream processing. What’s missing is bringing the world of stream processing to the Business Intelligence user.
Joey Echeverria presents the design and architecture of a visual authoring tool that makes stream processing accessible to the widest possible audience. This allows users to visually author and preview stream processing pipelines and instantly deploy them at scale. Topics include:
An overview of common stream processing operations (filter, extract, transform, group, aggregate, and join; see the plain-Python sketch after this list).
The design of a language agnostic AST for representing streaming pipelines.
The design of a simple DSL for writing stream processing pipelines.
The design of a graphical authoring tool, Splunk Data Stream Processor, for authoring and previewing stream processing pipelines in real-time.
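To ground those operation names, here is a plain-Python illustration of extract, filter, and group/aggregate over a toy event stream; it is unrelated to the Splunk Data Stream Processor itself.

```python
# Plain-Python illustration of common stream operations: extract, filter,
# group, and aggregate. Not related to any specific streaming engine.
from collections import defaultdict

events = [
    {"raw": "host=web1 status=200 bytes=512"},
    {"raw": "host=web1 status=500 bytes=128"},
    {"raw": "host=db1 status=200 bytes=2048"},
]

def extract(e):                      # extract: parse fields out of the raw record
    return dict(kv.split("=") for kv in e["raw"].split())

parsed = (extract(e) for e in events)
errors_only = (e for e in parsed if e["status"] == "500")   # filter

bytes_by_host = defaultdict(int)
for e in errors_only:                # group + aggregate
    bytes_by_host[e["host"]] += int(e["bytes"])

print(dict(bytes_by_host))           # {'web1': 128}
```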
Shipping a Machine Learning model to Production; is it always smooth sailing?
Leanne Fitzpatrick - HelloSoda
Deploying data services in production environments can involve many barriers to entry, from culture to frameworks to real-time practicalities. Many data scientists are trained to develop machine learning models, but how can these seamlessly be integrated into the rest of the production technology stack, and what pain points can this bring?
Hello Soda have developed a range of in-house capabilities in order to tackle how data scientists’ models and products can be deployed in production environments. This talk will cover how Docker and APIs have helped to support a framework where data scientists’ code can be deployed, whilst working specifically with NoSQL data and being code agnostic (R, Python, Scala). Despite developing these in-house capabilities, processes and practices in order to tackle the problem of data science in production, it’s not always smooth sailing! Pitfalls such as internal schema development, package management, continuous integration and testing can all leave you wondering why you wanted to deploy your model in the first place! We’ll discuss the learnings, trials and tribulations, whilst understanding how to get started with deploying a machine learning model, considerations around the practicalities of the build, and how to bridge gaps between a technical team’s skill sets to enable data products in production.
Performant time-series data management and analytics with Postgres
Michael Freedman - TimescaleDB
Time-series databases are one of the fastest-growing segments of the database market, spreading across industries and use cases. Common requirements include ingesting high volumes of structured data; answering complex, performant queries for both recent and historical time intervals; and performing specialized time-centric analysis and data management.
Today, many developers working with time series data turn to polyglot solutions: a NoSQL database to store their time series data (for scale) and a relational database for associated metadata and key business data. Yet this leads to engineering complexity, operational challenges, and even referential integrity concerns.
I explain how one can avoid these operational problems by re-engineering Postgres to serve as a general data platform, including for high-volume time-series workloads. In particular, TimescaleDB is an open-source time-series database, implemented as a Postgres plugin, that improves insert rates by 20x over vanilla Postgres and delivers much faster queries, even while offering full SQL (including JOINs). TimescaleDB achieves this by storing data on an individual server in a manner more common to distributed systems: heavily partitioning (sharding) data into chunks to ensure that hot chunks corresponding to recent time records are maintained in memory.
In this talk, I focus on two newly-released features of TimescaleDB, and discuss how these capabilities ease time-series data management: (1) the automated adaptation of time-partitioning intervals, which the database learns by observing data volumes; (2) continuous aggregations in near-real-time, in a manner robust to late-arriving data and transparently supporting queries across different aggregation levels. I discuss how these capabilities have been leveraged across several different use cases.
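For readers new to TimescaleDB, the basic workflow looks roughly like the sketch below, shown from Python via psycopg2; the table, columns, and connection string are illustrative.

```python
# Sketch of the basic TimescaleDB workflow from Python (psycopg2).
# Table/column names and the connection string are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        TIMESTAMPTZ NOT NULL,
        device_id   TEXT,
        temperature DOUBLE PRECISION
    );
""")
# Turn the plain table into a time-partitioned hypertable.
cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")

# Time-centric analysis: average temperature per device in 5-minute buckets.
cur.execute("""
    SELECT time_bucket('5 minutes', time) AS bucket, device_id, avg(temperature)
    FROM conditions
    WHERE time > now() - interval '1 day'
    GROUP BY bucket, device_id
    ORDER BY bucket;
""")
rows = cur.fetchall()
conn.commit()
cur.close(); conn.close()
```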
Graph Keynote - From Theory to Production
Dr. Denise Gosnell - DataStax
We are here to build applications with graph data and deliver value. The graph community has spent years defining and describing our passion. In order to translate graph thinking into a production application, there is a suite of hard decisions that have to be made. It's time for graph to go mainstream!
This talk will walk through some practical and tangible decisions that come into play when shipping distributed graph applications. Developers need a tangible set of playbooks to work from, and my years of experience have narrowed these down to some of the most universal and difficult to spot. Let's see how well they match up with yours.
A Scalable Graph Database Platform with ArangoDB on Kubernetes
Michael Hackstein - ArangoDB
Many applications today rely on highly connected data consisting of edges & vertices. Context and semantics become more and more important for fraud detection, recommendation systems, identity & access management or neural networks in artificial intelligence. Graph datasets can quickly outgrow the capabilities of a single machine.
Ideally, we would like to move to a cluster of small, cheap machines. But how do we overcome the network-hop problem when the data being queried resides on different machines?
Kubernetes has become the leading orchestration system to run containers in the cloud. In the past, running stateful applications was considered “difficult”. Recent developments in Kubernetes like Persistent Volume Claims, Custom Resource Definitions and Service Operators enabled the creation of advanced solutions to stateful services.
Michael will show developers, DevOps, Data Scientists and all interested folk how to deploy & run a distributed graph database with only 7 lines of YAML on Kubernetes. Furthermore, he will show live on stage how to scale a graph database to billions of nodes & edges while preserving fast query execution.
10 Easy Ways to Tune Your Cassandra Cluster
Jon Haddad - The Last Pickle
There's a direct correlation between database performance and how much it costs to run. If you're using the default Cassandra and OS configuration, you're not getting the most out of your cluster. Not even close. In this talk, Jon Haddad, Principal Consultant at The Last Pickle, will show you how to understand where your cluster bottlenecks are, then 10 easy ways to improve its performance and cut costs.
The Rise of the New AI: The Relationship between the Growth of Data and the AI of Today
Kristian Hammond - Narrative Science
Whether you call it cognitive computing, machine learning or smart machines, it is clear that Artificial Intelligence is back and new types of intelligent systems are emerging. We are now living in a world in which we are surrounded by machines that are fast becoming smarter than we are.
We already have systems that can recommend movies, use machine learning to predict our shopping behavior, or use evidence from medical journals to diagnose illness. And a day doesn’t go by without news of another learning or advanced reasoning system that is doing something better than we can imagine.
In this talk, we will discuss the current landscape of cognitive computing and artificial intelligence with the goal of getting beyond the hype and fear and building an understanding of how these systems work and will affect our lives.
We will examine the notion that this renaissance of AI technologies is less a product of the technologies themselves and more a result of a change in the environment in which they work. And we will look at how the new technologies based on statistical analysis can be integrated into the more semantically based technologies of core AI.
NEW Why PostgreSQL? PostgreSQL 10's coolest features
Álvaro Hernández - OnGres
NEW Community Detection using Graph Algorithms and Neo4j
Amy Hodler - Neo4j
Data scientists use graph algorithms for community detection because relationships are one of the most predictive indicators of behavior and preferences. Community detection algorithms help us infer similar preferences in peer groups, anticipate future behavior, estimate group resiliency, find hierarchies, and prepare data for other analysis.
In this session, you’ll learn how graph algorithms can help you forecast real-world situations and why an averages approach fails to describe group dynamics. We’ll focus on how iconic community detection algorithms identify groups and demonstrate examples using Neo4j, such as Connected Components and Louvain Modularity. You’ll hear how seeding in Label Propagation improves machine learning results, and we’ll introduce Balanced Triads for identifying unstable groups.
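As a taste of what running these algorithms looks like, the sketch below streams Louvain communities from Python; the procedure name follows the Neo4j Graph Algorithms library available at the time, and the User/FRIEND graph is illustrative.

```python
# Sketch: running community detection from Python against Neo4j.
# Procedure names follow the Neo4j Graph Algorithms library of this era
# (algo.*); yield field names may vary slightly across library versions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LOUVAIN = """
CALL algo.louvain.stream('User', 'FRIEND', {})
YIELD nodeId, community
MATCH (u:User) WHERE id(u) = nodeId
RETURN u.name AS member, community
ORDER BY community
"""

with driver.session() as session:
    for record in session.run(LOUVAIN):
        print(record["member"], record["community"])
driver.close()
```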
Need for speed: Boosting Apache Cassandra's performance using Netty
Apache Cassandra 4.0 has several enhancements. One of the biggest enhancements is switching from blocking network IO using JDK Sockets to Non Blocking IO with Netty. As a result, Cassandra has seen gains in performance and efficiency. These gains translate into real world costs and allow Cassandra to scale better. This presentation will take you on a tour of the improvements of Cassandra's network layer (old & new) and help quantify the gains in real world terms.
At the end of the talk the audience will learn about the motivations behind changing Cassandra's internode and streaming communication to Netty. They will also learn how these changes significantly affect scalability, along with recent enhancements such as zero-copy streaming over Netty that make scaling and rebuilding clusters faster than ever before.
Intended audience: Developers, database admins, and Apache Cassandra users interested in running and managing it.
Technical skills and concepts required: Basic understanding of Networking & Cassandra.
Morals from a Type 2 Diabetes dataset analytics journey...
Sanjay Joshi - Dell EMC
Sanjay will talk about his adventures with an open source diabetes dataset and some lessons learnt:
Data is dirtiest in healthcare, cleaning it adds bias sometimes
Math magic on data without context confuses the outcomes
Data changes and therefore models change
Luck and serendipity matters (a collection of material from the past)
We don't understand genetics yet...
Understanding Spark Tuning with Auto Tuning (or how to stop your pager going off at 2am*)
Holden Karau - Google
Tuning Apache Spark is somewhat of a dark art, although thankfully when it goes wrong all we tend to lose is several hours of our day and our employer’s money. This talk will look at how we can go about auto-tuning selective workloads using a combination of live and historical data, including new settings proposed in Spark 2.4.
Much of the data required to effectively tune jobs is already collected inside of Spark, we just need to understand it. This talk will look at some sample auto-tuners and discuss the options for improving them and applying similar techniques in your own work.
This talk will also look at what kind of tuning can be done statically (e.g. without depending on historical information), as well as Spark’s own built-in components for auto-tuning (currently dynamic scaling of cluster size) and how we can improve them.
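For orientation, a few of the knobs in question look like this in PySpark; the values shown are arbitrary examples, not recommendations from the talk.

```python
# A taste of the knobs involved in Spark tuning (values are arbitrary examples,
# not recommendations). Dynamic allocation is Spark's built-in elasticity.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuning-example")
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         .config("spark.dynamicAllocation.enabled", "true")   # built-in auto-scaling
         .config("spark.shuffle.service.enabled", "true")     # needed for dynamic allocation
         .getOrCreate())

# Some settings can also be adjusted per-session at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "400")
```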
Even if the idea of building an “auto-tuner” sounds as appealing as “using a rusty spoon to debug the JVM on a haunted super computer”, this talk will give you a better understanding of the knobs available to you to tune your Apache Spark jobs.
*Also to be clear we don’t promise to stop your pager going off at 2am, we just hope this helps.
Let's embed everything!
Mayank Kejriwal - Information Sciences Institute
Word embeddings like Word2Vec have emerged as a revolutionary technique in Natural Language Processing in the last decade, allowing machines to read large reams of unlabeled text and automatically answer questions such as What is to man as queen is to woman? The core idea behind word embeddings is to 'embed' each word, using neural networks, into a continuous real-valued space of a few hundred dimensions, and to infer analogies such as the example above using simple mathematical operators. Following the success of word embeddings, there have been massive efforts in both academia and industry to embed all kinds of data, including images, speech, video, entire sentences, phrases and documents, structured data, and even computer programs. These piecemeal approaches are now starting to converge, drawing on a similar mix of techniques. In this talk, I will discuss this ongoing movement that is attempting to embed every conceivable kind of data, sometimes jointly, into rich vector spaces. I will also cover some of the techniques that are now in play. I will conclude by showing the exciting, still largely unrealized, possibilities in general AI that have become possible through such multi-modal embeddings.
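The canonical analogy query looks like this with pre-trained Word2Vec vectors via gensim; the example assumes a downloaded vector file such as the GoogleNews vectors.

```python
# The classic word-embedding analogy, via gensim. Assumes a pre-trained
# Word2Vec vector file (e.g., the GoogleNews vectors) has been downloaded.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# "king" is to "man" as ? is to "woman":  vector(king) - vector(man) + vector(woman)
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically returns [('queen', ...)]; the score depends on the vectors used.
```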
How to Destroy Your Graph Project with Terrible Visualization
Corey Lanum - Cambridge Intelligence
We are all using graphs for a reason - in many cases, it's because the graph model presents an intuitive view of the data. Unfortunately, the most elegant graph data models can often be stymied by bad visualizations that obscure rather than enlighten. In this talk, Corey Lanum will discuss a number of bad practices in graph visualization that are surprisingly common. He will then outline graph visualization best practices to help create visual interfaces to graph data that convey useful insight into the data.
Build a Visualization Application in Real Time
Corey Lanum - Cambridge Intelligence
In this talk Corey will walk through the considerations one needs to take when deriving a graph model from tabular data such as survey data, and show the impact of those decisions in a visualization that he will build in real-time during the talk. To do so, Corey will distribute a brief online survey to the attendees with benign biographical questions (attendees are free to opt out or even fabricate answers) and Corey will derive a graph model from the results, make design and styling choices, and build in interactivity to create a full visualization application by the end of the session. Attendees will leave the session with a better understanding of how the user's requirements can drive the data modeling when creating graph data sets and how those modeling and design decisions can flow down to the visualization.
Statistically representative graph generation and benchmarking
In order to evaluate the performance of a graph database or a graph query solution, it is often necessary to generate a large graph dataset, over which we can then execute queries. However, while there are a number of benchmarks for graph databases which provide tools for data generation, they typically offer few if any options for tailoring the generated graph to the unique schema, topology, and statistics of the target domain. In practice, this limits the value of these benchmarks for capacity planning and estimation of query latency. In this talk, we will describe an open-source framework for property graph generation and benchmarking in close correspondence with a schema and a statistical model. A simple declarative language is provided for the schema and model, while the reference implementation is written in Java and builds upon the Apache TinkerPop graph database API.
Intended audience: Graph developers
Technical skills and concepts required: Familiarity with the property graph data model. Some experience with graph database backends recommended though not required.
Operationalizing Graph Analytics With Neo4j
William Lyon - Neo4j
Data science is great and all, but when it comes time to implement some of the advanced features data scientists have prototyped, developers can be left struggling. This talk will show how data scientists and developers can live in harmony by using Neo4j graph database for both operational and analytic workloads.
Taking a tour through the process of making an application smarter with personalized recommendations, graph based search, and knowledge graph features we will move beyond just operational workloads to add these features to our application with graph algorithms, using Neo4j for HTAP (hybrid transaction analytics processing).
We demonstrate how to run graph algorithms such as personalized PageRank, community detection, and similarity metrics in Neo4j and how these algorithms can be used to improve our application. We'll show how to keep our application up to date as data comes into the system, discuss architectural considerations, and enhance data scientists’ capabilities by building user-friendly applications.
Six Things You Need to Know about Cassandra and Kafka
Patrick McFadin - DataStax / Tim Berglund - Confluent
When you're building big, beautiful, modern applications, don’t limit yourself. Go huge or stay on your laptop. Distributed systems are built for now and later, ready to scale when you need them. Cassandra and Kafka were born in the fire of crazy-scale applications and can be a great combo when used together. Let’s be clear: used together properly! Fresh and ready for 2019, we will give you a quorum tour of the things you need to know to be successful. There's a lot on the line, so you want to get it right. Here are six things you need to know:
Cassandra is a purpose built application database
Kafka isn't a queue; Kafka is a log. The difference matters a lot, particularly when you're integrating microservices.
When persisting your data across on-prem and one or more clouds, Cassandra is a perfect fit for microservices.
When you're doing data integration, avoid the framework trap and use Connect. When you connect Kafka to external systems, you'll either use Kafka Connect or you'll create a potentially buggy, partial implementation of it.
DataStax has built an amazing Kafka Connector that takes all the hard work out of integration. Now you can worry about your application.
Some answers are best when computed in real time. Know Kafka's stream processing options and use them when results have to happen now.
NEW Using GPUs & Design to Scale Visual Analysis of Digital Crime
Leo Meyerovich - Graphistry
What happens if the performance concerns around visually interacting with today's largest graphs somehow got solved?
We examine this question for several recent incidents: a cybersecurity breach logdump, a multi-million dollar Ethereum blockchain theft, and a trafficking extract. Graphs help tie many disparate events together, but enterprise-scale workloads cause performance and usability breakdowns. First, we demonstrate how GPU Open Analytics Initiative technologies (Arrow, PyGDF, NvGraph, and Graphistry) are tantalizingly close to scaling subsecond visual interactions. The basic idea is to connect GPUs in the data center to GPUs in the browser. Then, we show how the bottleneck is increasingly shifting to human-in-the-loop interaction design questions for tasks such as data wrangling. Finally, we provide examples of augmented interaction techniques that make large-scale graph analysis more practical. Put together, we provide a peek into the emerging generation of scalable visual graph tooling.
How to Progress from NLP to Artificial Intelligence
Jonathan Mugan - Deep Grammar
Why isn’t Siri smarter? Our computers currently have no commonsense understanding of our world. This deficiency forces them to rely on parlor tricks when processing natural language because our language evolved not to describe the world but rather to communicate the delta on top of our shared conception. The field of natural language processing has largely been about developing these parlor tricks into robust techniques. This talk will explain these techniques and discuss what needs to be done to move from tricks to understanding.
Transfer Learning Today: the Good, the Bad, and the Ugly of NLP in 2019
Transfer learning is the hottest research area in Natural Language Processing (NLP) today. For the first time in NLP’s history, the most popular research focus is how we can adapt models to new data and new problems. The current movement began only about 5 years ago but has grown quickly. We have gone from static per-word vectors to transformer-generated vectors that dynamically encode that word’s context in every sequence. The math can be complicated, but the strategies are very easy to understand: the highest performing pre-trained models are only trying to predict a missing word from the surrounding sentence.
This talk will bring you up to speed on transfer learning in NLP, with a candid evaluation of what approaches and frameworks really matter and what you can ignore. The “good” is Machine Translation (MT), where Neural MT can adapt to new domains with less than 10% of the data it needed 5 years ago. The “bad” is everything else non-English, as researchers are increasingly ignoring data freely available in other languages. The “ugly” is the pre-trained models all named after puppets (ELMo, BERT, etc.). No one likes puppets; they are creepy.
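The masked-word objective mentioned above is easy to see in action; the sketch below uses the Hugging Face transformers fill-mask pipeline, a convenience API chosen purely for illustration rather than anything covered in the talk.

```python
# Illustration of the masked-word pre-training objective behind models like BERT.
# Uses the Hugging Face transformers fill-mask pipeline (not part of the talk).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for guess in fill_mask("Transfer learning lets us adapt a pre-trained [MASK] to a new task."):
    print(guess["token_str"], round(guess["score"], 3))
```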
Exploring Graphs with R and igraph
This talk will explore how the R statistical programming language and its robust open source ecosystem provides many tools for researchers to create, explore and visualize networks.
We begin by addressing the fact that data in the wild needs to be extracted, cleaned, and manipulated before it can be formatted into a graph. These preliminary steps take many different forms depending on the original data input and the desired graph output. There are R libraries for scraping the web, for pulling text from pdfs, and for querying databases. If the purpose of the graph is natural language processing, then libraries for word stemming and part of speech tagging may be of use.
After preprocessing is complete, we can then create the desired graph with the igraph library. The igraph library has proven to be incredibly powerful with regard to both querying and modeling graphs. This library allows researchers to filter down the network, aggregate the information held in the nodes & edges, and identify communities. Finally, researchers can explore the graph further by using igraph's visualization capabilities.
One goal of this talk is to encourage graph and network researchers to explore the R statistical programming language for their own projects. The other goal, which is the inverse of the first, is to encourage R useRs to explore their data as a graph.
Data Science Keynote - Obvious conclusions that are actually wrong
Sean Owen - Databricks
A little knowledge can be a dangerous thing for a new data scientist. It's all too easy to draw obvious conclusions from data analysis that are actually wrong -- or worse, draw reasonable conclusions that are contradictory. The seasoned data scientist knows stats 'paradoxes' and avoids costly business mistakes and humiliation at the hands of stats majors. This talk will survey five seemingly straightforward scenarios where the right answer looks wrong or ambiguous. It will explore how causation provides a resolution to many such situations and briefly introduce Judea Pearl's do-calculus. Don't do data science until you've seen how correlation and causation are vital to understanding data correctly.
Moving Beyond Node Views
Lynn Pausic - Expero
When people talk about visualizing graph data, what typically comes to mind is the canonical node view. Node views display nodes (vertices) and the relationships (edges) between them. With large data sets consisting of millions of vertices and edges, node views can quickly become unwieldy to use and comprehend. Further, traditional UI patterns and visualizations conceived for relational schemas often don’t work with graph data. Relational schemas are predefined and relatively static, making it easy to tailor UI navigation to the available data dimensions. Due to the distinct mathematical nature of graph data, traversing data in a graph is fairly different. While this presents additional challenges, there are also opportunities. Traversing a graph with certain algorithms allows you to, for example, show key influencers in social networks, clusters of communities in customer reviews, or weak points in electrical grids. These new insights into data provide novel tools to craft innovative user experiences. But this opportunity comes at a price, namely more complexity. Through building and deploying dozens of applications driven by graph data, we’ve developed a unique approach to building UIs driven by graph data and an arsenal of data visualizations that work well across a broad range of contexts. In this talk we’ll share various tools and examples for displaying graph data in meaningful ways to users.
And Bad Mistakes, I’ve Made a Few: Experience from the Trenches as a Graph Data Architect
Josh Perryman - Expero
For four years Josh has crossed the country and traveled the globe working on graph data projects of all sizes, in a variety of industries, with several different engines. And he’s not been alone. Expero has several graph data architects who have worked with a host of client projects. The best of their combined experiences are collected in this one talk.
In this session Josh will cover several specific lessons learned in the pursuit of client success on the frontiers of connected data technology. These include fables such as: “graph is always awesomer than relational databases (except when it isn’t)”, “my practical access patterns beat up your elegant schema”, “the trick to fast ingest is to store no data (and the client loved us)”, ”I’ve got write amplification and I know just how to use it”, “you really can have too many vertex labels in the model”, and the classic tale: “sometimes the best edge is the one that only works one way”.
Choosing Sides When Choosing Tools Hurts
Davin Potts - Appliomics
The current wealth of machine learning tools forces data scientists and engineers to "choose sides" between Java-based toolsets versus C++ toolsets or some other implementation choice behind a tool, distracting from the true goals and best algorithms and methodology. Tools like KNIME help by making it downright easy to integrate heterogeneous mixes of ML tools written in any language from Scala to C to JavaScript to Python or R, employing graph databases (e.g. Neo4j) or NoSQL (e.g. MongoDB) or traditional RDBMS (e.g. PostgreSQL), executing in parallel on Spark or dask or MPI, proprietary or open source -- all so that the human problem solver can spend less time fussing over implementation and more time exploiting the key features of these tools. The KNIME developers are experimenting with new ways to reduce the cost of data motion as data moves from one tool to the next, which is often the true villain behind needing to "choose sides" in the first place. This session will describe recent advances involving KNIME's Python integration, in particular.
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Karthik Ramasamy / David Kjerrumgaard - Streamlio
The business value of data decreases rapidly after it is created, particularly in use cases such as fraud prevention, cybersecurity, and real-time system monitoring. The high-volume, high-velocity datasets used to feed these use cases often contain valuable, but perishable, insights that must be acted upon immediately.
In order to maximize the value of their data, enterprises must fundamentally change their approach to processing real-time data, focusing on reducing their decision latency around the perishable insights that exist within their real-time data streams, thereby enabling the organization to act upon them while the window of opportunity is open.
Generating timely insights in a high-volume, high-velocity data environment is challenging for a multitude of reasons. As the volume of data increases, so does the amount of time required to transmit it back to the datacenter and process it. Secondly, the greater the velocity of the data, the faster the data and the insights derived from it lose value.
In this talk, we will present a solution based on Apache Pulsar Functions that significantly reduces decision latency by using probabilistic algorithms to perform analytic calculations on the edge.
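A Pulsar Function of the sort described can be very small; the Python sketch below flags anomalous sensor readings with a streaming mean, and its topic wiring and logic are illustrative only, not the talk's code.

```python
# Sketch of a Pulsar Function doing lightweight analytics close to the data.
# The rolling-mean anomaly logic is illustrative, not the talk's implementation.
from pulsar import Function

class AnomalyFlagger(Function):
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def process(self, input, context):
        value = float(input)                             # one sensor reading per message
        self.count += 1
        self.mean += (value - self.mean) / self.count    # streaming (online) mean
        if self.count > 10 and value > 2 * self.mean:
            return f"anomaly:{value:.2f}"                # routed to the function's output topic
        return None                                      # nothing emitted for normal readings
```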
Apache Pulsar: Next Generation Cloud Native Messaging Streaming System
Karthik Ramasamy - Streamlio
Apache Pulsar is the next generation pub/sub messaging system and serverless stream processing platform that leverages cloud-native technologies. Apache Pulsar was open sourced by Yahoo! in 2016 and is one of the fastest growing open source projects. It is used in more than 100 companies.
Apache Pulsar is designed as a layered architecture, separating message storage from serving and compute. It takes advantage of cloud-native technologies such as Kubernetes to auto-scale each layer independently. With the Pulsar 2.x release, Apache Pulsar provides Pulsar Functions and Pulsar I/O, a framework for writing serverless functions and connectors to process data immediately when it arrives. This serverless stream processing approach leverages Kubernetes for scaling its computing capability and provides cloud portability. Furthermore, because of the separation of the storage and serving layers, Apache Pulsar can provide infinite storage by tiering to cheaper cloud-native storage such as Amazon S3, Google Cloud Storage, Microsoft Azure Cloud Storage, and Hadoop. Apache Pulsar seamlessly moves data across the various storage tiers and enables easy access.
In this talk, Karthik will give an overview of Apache Pulsar, its architecture in depth, and its cloud-native capabilities for messaging, storage and serverless data processing. We will also describe how Apache Pulsar is deployed on Kubernetes to provide a stream-native data stack and share our operational experiences.
NEW Benchmarking a Graph OLAP database to complement OLTP systems
Steve Sarsfield - Cambridge Semantics
Analytics that traverse large portions of large graphs have been problematic for both RDF and LPG graph engines. In this session, we discuss the native parallel-computing approach taken in AnzoGraph to yield interactive, scalable performance for RDF and LPG graphs. We discuss benchmarking results that highlight the performance possible with well-understood industry-standard data. We’ll dig into how graph OLAP databases scale differently than graph OLTP databases and how AnzoGraph complements Neo4j, AWS Neptune, and other graph and relational OLTP systems.
GQL: Towards a Standardized Property Graph Query Language
Dr. Petra Selmer - Neo4j
Over the past decade, property graph databases have experienced phenomenal growth within industry across multiple domains such as master data and knowledge management, fraud detection, network management, access control and healthcare, among others. With this proliferation in usage, the need for a standardized property graph query language has become ever more pressing. Efforts are underway within the ISO framework to define and describe a standard property graph query language, GQL. In this talk, I will introduce GQL, and detail the landscape, scope and features envisaged for the first version of GQL, such as complex pattern matching and composable graph querying. I will provide a roadmap of the standardization process, and also describe the outcome of an analytical comparison of existing property graph query languages, which will be one of the inputs into the design of GQL. To conclude, I will outline future directions.
Technical skills and concepts required: Some knowledge/awareness of property graphs would be useful.
Building Enterprise Knowledge Graphs: Lessons Learned from the Trenches
Juan Sequeda - Capsenta
Knowledge Graphs are fulfilling the vision of creating intelligent systems that integrate knowledge and data at large scale. The technology has matured: graph databases, information extraction techniques, ontology languages and tools.
We observe the adoption of Knowledge Graphs by the Googles of the world. However, not everybody is a Google. Enterprises still struggle to understand their hundreds of data sources with thousands of tables and millions of attributes, and how the data all works together. How can enterprises adopt Knowledge Graphs successfully without boiling the ocean?
This talk will chronicle the obstacles encountered and lessons learned when deploying Semantic Web and Knowledge Graph technologies with enterprise users to address Data Integration and Business Intelligence needs. The audience will take away concrete steps on how to effectively start building knowledge graphs that will be widely useful within their enterprise.
Design Knowledge Graphs Simply: An Introduction to Gra.fo
Juan Sequeda - Capsenta
Every large company has accumulated tens, hundreds or even thousands of data sources, these days mainly relational databases. Business users do not understand these complex data sources. Indeed, frequently even the IT department struggles to understand the thousands of tables, millions of attributes and how the data all works together.
Knowledge Graphs bridge the conceptual gap between how users think about the data and how the data is physically organized, because the graph data model provides a beautiful view of these myriad, complex relational data sources. This beautiful data can then be used in machine learning applications and business intelligence tools to answer important business questions.
A common frustration we’ve encountered is the lack of adequate tooling around knowledge graph schema design. Many tools exist: some are overly complex, some are very expensive, and none allow one to work visually, collaboratively, and in real time on a schema with multiple concurrent users.
In this presentation we will introduce and demo our solution to this problem: Gra.fo, the first and only visual, collaborative and real-time knowledge graph schema editor.
Opening Keynote - Lies Enterprise Architects Tell
Gwen Shapira - Confluent
Let’s face it - we are all liars. We often lie unintentionally and, most of all, we lie to ourselves. I’ve spent the last 10 years working with enterprise architects intent on modernizing their data infrastructure, and I’ve heard many “facts” that turned out to be… less than perfectly accurate. Self-deception about the state of the industry, our requirements, and our capabilities can lead us to make bad choices, which leads us to build bad architectures and often leads to bad business outcomes.
If you say or hear phrases like “we have big data”, “we don't have big data”, “this business app must be real-time” and “hybrid-cloud doesn’t exist”, you may work for an organization that could use a bit of a reality check. In this talk, Gwen Shapira, principal data architect at Confluent, will share common enterprise architecture myths that did not survive contact with reality and offer some advice on how to design good data architecture given our inherent capacity for self-deception.
A Graph is a Graph is a Graph: Equivalence, Transformations, and Composition of Graph Data Models
Joshua Shinavier - Uber
The power of graphs lies in their intuitiveness: there is nothing much simpler to visualize or reason about than a bunch of dots connected by a bunch of lines. In practice, however, there are a variety of graph data models, separated by shades of expressivity and nuance. These include property graphs and their variants, RDF-based ontology languages, hypergraph data models, entity-relationship models, and any number of formats and schema languages which are somehow graph-like, though not specifically designed for graphs. Over the years, countless special-purpose tools have been written to transform one graph data model to another, or to provide graph views over this or that type of data source. In this talk, we will bring some order to this chaos using concepts from functional programming and category theory, with an emphasis on bidirectional and composable transformations. Along the way, we will ponder the grand vision of bringing together the whole of a company’s data as a knowledge graph.
Technical skills and concepts required: Basic familiarity with the property graph data model. Some experience with functional programming may help. However, concepts will be introduced at a high level and should be reasonably easy to follow.
Predicting new edges in large scale dynamic graphs
Gabriel Tanase - Graphen
Lots of enterprises today are using graph databases and graph algorithms to model and solve complex problems. At Graphen we are successfully using graphs in the Fintech, Cybersecurity and Health domains. With knowledge extracted from graphs we improve the accuracy of various machine learning pipelines when predicting non-performing loans, money laundering, or users that behave outside of their predefined ‘normal’ activities (intrusion detection).
In this talk we focus on predicting whether a new connection will appear between two entities in a graph (link prediction). This is the high-level concept we use when predicting non-performing loans in a bank, but the concept itself can be applied to various other domains. For example, one can be interested in predicting which two authors may co-author a paper in the coming year, who will be friends with whom in a social network, or even who may acquire a certain item (product recommendation). It turns out that link prediction is a computationally intensive problem, requiring lots of complex algorithms to extract features and perform machine learning. In this talk we will show how a data scientist can implement a link prediction pipeline and experiment with different features to obtain the best results for their particular domain. We exemplify using a DBLP co-authorship dataset available online.
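In outline, such a pipeline can be prototyped with off-the-shelf tools; the toy sketch below derives topological features with networkx and feeds them to a scikit-learn classifier, and is illustrative rather than the Graphen implementation.

```python
# Toy link-prediction pipeline: topological features from networkx feed a
# scikit-learn classifier. Illustrative only; not the Graphen implementation.
import networkx as nx
from sklearn.linear_model import LogisticRegression

G = nx.karate_club_graph()                       # stand-in for a co-authorship graph

def features(u, v):
    jac = next(nx.jaccard_coefficient(G, [(u, v)]))[2]
    aa = next(nx.adamic_adar_index(G, [(u, v)]))[2]
    return [jac, aa]

# Training pairs: existing edges are positives, a few non-edges are negatives.
positives = list(G.edges())[:20]
negatives = [(0, 9), (1, 33), (2, 25), (4, 14), (5, 23)]
X = [features(u, v) for u, v in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([features(0, 33)])[0][1])   # P(an edge appears between 0 and 33)
```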
NEW Introduction to Memgraph
Dominik Tomicevic - Memgraph
Memgraph is the world's first native in-memory distributed graph database. Memgraph is on a mission to fulfil the true potential of graph databases and open the doors to a whole new era of applications, by delivering a high-performance, horizontally scalable enterprise graph platform.
Engineered from the ground up, Memgraph scales horizontally on cloud instances or industry-standard hardware, providing high throughput and low latency across a wide range of platforms. Memgraph also maintains broad compatibility with common technologies in the modern data processing ecosystem so you can easily integrate it into your existing environment. It features an in-memory-first, durable and redundant architecture and handles both highly concurrent operational and analytical workloads. Learn more at www.memgraph.com and follow us @memgraphdb.
Breaking Down Silos with Knowledge Graphs
Michael Uschold - Semantic Arts
For most of you who work in enterprise computing, silos are the bane of your existence. We explore the origins of silos and point out some technical factors that exert a strong gravitational pull, drawing your enterprise into the deep pit of silos. Chief among these are application-centricity and limitations in relational database technology, including the lack of explicit semantics. We describe a semantics-based, data-centric approach using ontologies and knowledge graphs. We show how it works in practice and illustrate with case studies. We warn against using these newer technologies to gain a local advantage in an organization while ultimately recreating silos across the wider enterprise.
Conclusion: The use of an enterprise ontology as a schema to populate an RDF-based knowledge graph opens the door to removing silos and never creating them again. The technology is mature and ready for prime time.
High Performance JanusGraph Batch & Stream Loading
Ted Wilmes - Expero
You've downloaded JanusGraph, installed it, and run a few queries from the Gremlin console, but what's next? Data loading is the logical next step, but it is a common pain point for JanusGraph newcomers. Inevitably data loading touches on more advanced topics such as performance tuning and an understanding of JanusGraph transaction semantics. This talk will demystify the data loading process by presenting JanusGraph batch and stream loading patterns that you can apply to your next graph database project.
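One batch-loading pattern newcomers run into is chunking mutations so that each request commits a bounded amount of work; the gremlinpython sketch below assumes a remote JanusGraph server, and settings such as storage.batch-loading remain separate server-side configuration covered in the talk.

```python
# One common batch-loading pattern: group many mutations into a single remote
# traversal so each request commits one chunk. Schema and sizes are illustrative;
# server-side settings such as storage.batch-loading are separate JanusGraph config.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

people = [{"name": f"person-{i}"} for i in range(10_000)]
CHUNK = 500

for start in range(0, len(people), CHUNK):
    chunk = people[start:start + CHUNK]
    t = g.addV("person").property("name", chunk[0]["name"])
    for row in chunk[1:]:
        t = t.addV("person").property("name", row["name"])
    t.iterate()          # one round trip (and one commit) per chunk

conn.close()
```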
Taming of the Shrew: Using a Knowledge Graph to capture structured Health Information Data
Chris Wixon - Savannah Vascular
Despite investing billions of dollars in modernizing health information infrastructure, the impact on healthcare outcomes has been relatively modest. Patient care remains fragmented, costs continue to rise and medical professionals remain frustrated and dissatisfied. Our success in a new era of digital health will depend on the ability to derive insight from data.
Freedom of expression comes from order and structure: placing limits, and working against them rigorously. Our graph-powered solution offers a schema-driven method of information capture between patient and provider, at the point of service. In contradistinction to a source-oriented method, the knowledge graph models the corpus of medicine itself and incorporates concepts from multiple medical terminology systems (CPT, ICD, SNOMED, NPI, Medical taxonomy) into its persistence layer. A seed pattern of clinically-relevant predicates defines concepts in a formal way to reflect the semantics of the concept. The meta-model supports a uniform user interface and enables efficient documentation by way of a semantic browser: click and discover, rather than search and retrieval.
Implementing a clinically oriented Knowledge Graph saves time for the physician, returns the focus back to the patient, and creates computable medical records for healthcare payers to make more informed decisions.
Point of service documentation minimizes the time lag between the patient encounter and data entry and provides more complete and reliable health information data.
The use of the computer as a recall mechanism for medical knowledge allows medical personnel to function at the top of their credentials, drawing on a universe of knowledge more comprehensive than human cognition.
The data gathered in a specific section of the record should be entered in a manner that is unambiguous and consistent among all health practitioners entering it.
Intended audience: Healthcare Professionals, Data Modeling, Domain-driven design (DDD), Knowledge Graph
Technical skills and concepts required: Beginner – Intermediate knowledge
Eight Prerequisites of a Graph Query Language
Dr. Mingxi Wu - TigerGraph
A graph query language is the key to unleashing the value of interconnected data. The talk includes a discussion of 8 prerequisites of a graph query language for successful implementation of real-world graph analytics use cases. The talk will present the pros and cons of three query languages - Cypher, Gremlin, and SPARQL. Finally, the talk will provide an overview of GSQL, a Turing-complete graph query language that is a conceptual descendent of Cypher, Gremlin and SPARQL and has incorporated design features from SQL as well as Hadoop MapReduce. The talk will compare the GSQL query language with Gremlin, Cypher and SPARQL, pointing out the differences including pros and cons for each language.
Choosing the Right Graph Architecture for Your Use-Case - Operations vs. Analytics.
Barry Zane - Cambridge Semantics
Graph databases and application stacks are put to many uses ranging from operational "point-queries" all the way to ad-hoc "big-data analytics". This talk focuses on the identification of the primary characteristics of these uses and how that drives the selection, deployment and predictable performance of these systems.
The goal is to provide the technical user of these systems an "under the hood" sense of the underlying dynamics of these systems to broaden an understanding of what graph-based analytics means in the modern world of discovery, ML and AI.
Using New Open Source Tools for Apache Cassandra – Kubernetes Operator, Prometheus Exporter, and LDAP + Kerberos Authentication
Adam Zegelin - Instaclustr
As the popularity of both Kubernetes and Apache Cassandra has swelled, it’s no surprise that more developers are looking to run these technologies in tandem. Similarly, devs using Cassandra have been increasingly looking for ways to utilize LDAP and Kerberos authentication. Answers thus far have largely been roll-your-own, but in the interest of speeding up adoption of Kubernetes with Cassandra, improving monitoring with the help of Prometheus, or making more seamless use of LDAP and Kerberos authentication, Instaclustr and partner contributors have created and released four new open source projects that are now ready for use.
This session walks devs through these new tools and how they add key functionality and ease-of-use to their Cassandra deployments.
- A Cassandra operator for running and operating Cassandra within Kubernetes
The open source Cassandra operator functions as a Cassandra-as-a-Service on Kubernetes, fully handling deployment and operations duties so that developers don’t have to. It also offers a consistent environment and set of operations founded on best practices, which is reproducible across production clusters and development, staging, and QA environments.
- Cassandra Prometheus Exporter
The cassandra-exporter is a high-performance metrics collection agent that allows for easy integration with the Prometheus monitoring solution. It has been designed to collect detailed metrics on production-sized clusters with complex schemata with minimal performance impact while at the same time following Prometheus's best practices for exporting metrics.
- An LDAP authenticator plug-in for Cassandra
The open source LDAP authenticator plug-in works closely with the existing CassandraAuthorizer implementation. The plug-in enables developers to quickly reap the benefits of secure LDAP authentication without the need to write their own solutions, and to transition to using the authenticator with zero downtime.
- A Kerberos authenticator plug-in for Cassandra
The open source Kerberos authenticator plug-in enables Cassandra users to leverage Kerberos’ industry-leading secure authentication and true single sign-on capabilities. The open source project also includes a Kerberos authenticator plugin for the Cassandra Java driver.
The audience for this presentation will learn the specifics of how to implement – and get the most out of – these open source solutions.
Your application is a graph too!
Tom Zeppenfeldt - Graphileon
Graph technology and algorithms offer huge possibilities when applied to big or business data. But when you look closely at the applications that developers build to manage, explore and analyze graph databases, it becomes clear that many of those applications can be modeled as graphs too! In a typical case, a query provides data for a table or a histogram, and a subsequent click on a tablerow or a bar of the histogram triggers a new query that provides data for a map or a network visualisation.
Modelling an application as a graph, in which vertices represent queries or user-interface widgets and edges provide a way to pass data from one function to another, increases the speed of development dramatically and allows for agile development, prototyping and deployment that is accessible and understandable for non-developers. In addition, you can apply graph algorithms to the application graph, facilitating debugging and optimization.