The Data Day Texas 2020 Sessions
The conference hotel is now sold out, but rooms are still available for Thursday, Friday and Saturday night at the nearby Hampton Inn (Hilton) - just one block away from the conference. If they should sell out, reasonably priced rooms are still available at the DoubleTree Hilton on 15th St. - just three blocks away.
We have still more sessions to announce for Data Day Texas 2020. Check back frequently for updates.
Opening Keynote
Working Together as Data Teams
Jesse Anderson - Big Data Institute
Between the books and the real world lies the actual work of creating data teams: taking scientists, engineers, and operations people and getting them to work together on something that actually solves a business problem.
This talk covers the balancing act that data team managers and members deal with. I share experiences that every manager and member of a data team should know about. These issues aren't well known or talked about, which makes managers and team members think they're alone or unique in facing them. The reality is that these issues are common and should be addressed.
These common issues include:
- Dealing with the politics of the organization. How do you get data from people siloed due to politics?
- High bandwidth connections in data teams. How do you create high bandwidth connections between the teams to allow for the fastest help and least friction?
- Communicating the value of data teams to the rest of the organization. How do managers educate the rest of the organization on what they do and the value each data team creates?
- Giving credit to data engineering and operations. How do managers make sure that credit for the work doesn't go only to the data scientists?
- Creating the right ratio of data scientists to data engineers. What should the right ratio of data engineers to data scientists be?
This talk is based on the Working as a Data Team chapter of my forthcoming Data Teams book from O’Reilly.
Main Takeaways:
There are common, real-world issues that every data team manager and team member should know.
Managers need to know and understand the balancing act they need to deal with in order to create data products.
Managers should know and address the common issues in working together as a data team.
From Stanford to Startup:
making academic human-in-the-loop technology work in the real world
Abraham Starosta / Tanner Gilligan
This talk will share the presenters' experience taking the academic literature on human-in-the-loop machine learning and turning it into a product that many customers interact with daily.
Sculpt Intel helps companies classify and extract information from text with an interactive interface for non-technical users. One feature that our customers like is the ability to highlight the key phrases in a piece of text that they thought had the biggest impact on their labeling decision. This helps the customers feel like they are controlling the machine learning process, without needing to write code themselves. This feature also makes their models learn faster than if our customers labeled examples without explaining their decisions.
The presenters will demo the product on the well-known IMDb movie review dataset and discuss what they have learned so far about the tradeoffs and limits between cost and accuracy.
Featured Session
Humans, machines and disagreement: Lessons from production at Stitch Fix
Brad Klingenberg - Stitch Fix
Most recommendation algorithms produce results without humans-in-the-loop. Combining algorithms with expert human curation can make recommendations much more effective, especially in hard-to-quantify domains like fashion. But it also makes things more complicated, introducing new sources of statistical bias and challenges for traditional approaches to training and evaluating algorithms. Humans and machines can also disagree, further complicating the design of production systems. In this talk I’ll share lessons from combining algorithms and human judgement for personal styling recommendations at Stitch Fix, an online personal styling service that commits to its recommendations through the physical delivery of merchandise to clients.
Serverless Data Integration
Sony Green - Kineviz
Criminals are innovative. Because their strategies are constantly evolving, the data model that revealed pattern A could miss pattern B completely. We’ll present an intuitive analytics methodology based on visual graph transformations that gives crime fighters the flexibility to form inferences and iterate rapidly.
The multitude of attack vectors available to criminals enables their strategies to evolve rapidly. This imposes a limited shelf life on fraud fighting solutions. It’s imperative to integrate new (and frequently messy) data sources as well as to combine existing data sources in new ways. Combining sources like biometrics, social media, financial, geospatial, and time series data can reveal fraud rings, identity theft, and more. Unfortunately, quickly iterating on these data sources and managing them long term is a nightmare with traditional static schemas.
We have defined a set of Graph Operators---Map, Extract, Aggregate, Link and Shortcut---that enable rapid, non-destructive transformation of a property graph. Individually or in concert, these operators enable statistical analysis, anomaly detection, and simplification or delineation of graph patterns. The resulting workflow is intuitive and highly visual, enabling non-technical users to perform complex analyses. For fraud fighters, effective application of graph operators is a game changer.
Kineviz develops visual analytics software and solutions. Our GraphXR platform is used across a range of disciplines including law enforcement, medical research, and business intelligence. In addition to accelerating time to insight, GraphXR enables analysts and business users to make needle-in-haystack discoveries that evade traditional analytic workflows.
Serverless Data Integration
Gaja Krishna Vaidyanatha - DBPerfMan LLC
Serverless computing is a layer of abstraction that enables IT practitioners to solely focus on the event-driven execution aspects of an application (built as microservices in one’s favorite programming language), without any concern for hardware capacity planning, physical infrastructure provisioning, operating system installation, patching and management.
Data Integration strives to solve the pervasive problem of data quality and inconsistency caused by the multiple data silos created within an organization over many years of IT supporting an enterprise’s business needs. A Data Integration Platform thus brings multiple sources of data together, cleanses them, and provides a high-quality unified data view that enables digital transformation. The primary goal of any data integration effort is to provide a single and accurate data view for the business. Without this unified view of the data, the value proposition of all initiatives that deliver information/insights to the business, i.e., Analytics, Machine Learning, Deep Learning & Artificial Intelligence, is questionable.
The rationale is pretty simple: if the quality and standardization of the data are questionable, how could the generated business insights be trustworthy?
Deploying DataOps for analytics agility
Arvind Prabhakar - StreamSets
Advanced analytics is driving future revenue streams and competitive differentiation. Because of this, data scientists and analysts want unfettered access to data without the cascade of “no” that emanates from core IT. The need for self-service data availability is driving teams to rethink their data delivery strategy with an emerging practice called DataOps. A DataOps technology platform operationalizes the full lifecycle of data movement. It applies monitoring, automation, and embedded policy enforcement to the building, executing, and operating of data pipelines, so you can accelerate delivery of data with confidence against the backdrop of ceaseless change.
Arvind Prabhakar explains DataOps, a set of practices and technologies that combine the development and continuous operation of data movement architectures, independent of the underlying data sources and processing systems. DataOps solves problems that manifest with managing data movement at scale (i.e., data drift, stream processing, and edge computing). It moves beyond traditional approaches to designing, deploying, and operating data pipelines as siloed activities and applies repeatable, Agile practices to data delivery. This is why it’s become critical for business initiatives like customer 360, IoT, and cybersecurity.
The Next Five Years in Databases
Jonathan Ellis - DataStax
At his last Data Day Texas talk, Jonathan explained five lessons from the last five years in distributed databases. This time, he offers his predictions on the future: what impact will databases experience from machine learning, graph algorithms, hardware trends, and cloud services?
Defeating pipeline debt with Great Expectations
Abe Gong - Superconductive Health
Data organizations everywhere struggle with pipeline debt: untested, unverified assumptions that corrupt data quality, drain productivity, and erode trust in data. Since its launch in early 2018, Great Expectations has become the leading open source library for fighting pipeline debt.
Abe Gong and Taylor Miller share insights gathered from across the data community in the course of developing Great Expectations. They detail success stories and best practices from teams fighting pipeline debt, plus new features in Great Expectations developed to support those practices.
Fun features include:
Batteries-included data validation: Building a data validation stack from scratch usually takes months of engineering effort. Great Expectations provides an "on rails" experience to compress this down to a day. It's still a work in progress, developing rapidly.
Tests are docs and docs are tests: Many data teams struggle to maintain up-to-date data documentation. Great Expectations solves this problem by rendering Expectations directly into clean, human-readable documentation. In this workflow, your docs are rendered from tests, and tests are run against new data as it arrives, which means your documentation is guaranteed never to go stale.
Automated data profiling: Wouldn't it be great if your tests could write themselves? Great Expectations' data profilers use raw data to automatically generate a first draft of Expectation test suites, plus data documentation. Profiling provides the double benefit of helping you explore data faster, and capturing knowledge for the future.
Pluggable and extensible: Every component of the framework is designed to be extensible. This design choice gives a lot of creative freedom to developers working with Great Expectations. We're excited to see what the data community comes up with!
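For readers new to the library, here is a minimal sketch of the batteries-included validation workflow described above, using Great Expectations' pandas-backed API; the file and column names are hypothetical.

```python
import great_expectations as ge

# Load a batch of data as a Great Expectations dataset (hypothetical file/columns).
orders = ge.read_csv("orders.csv")

# Declare Expectations: executable tests that double as documentation.
orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_be_unique("order_id")
orders.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

# Validate the batch and inspect the result.
results = orders.validate()
print(results)
```

The same Expectations can later be rendered into the human-readable data docs described above.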
Data Governance and FATTER AI
Devangana Khokhar - Thoughtworks
Data Governance has become the need of the hour for most enterprises around the world, whether because of the new data protection policies that various countries are establishing, the enterprise platforms that everyone seems to be building, or simply because it's a buzzword everyone is trying to latch onto. None of this diminishes the fact that most businesses need to define good data governance strategies in order to serve their customers effectively and responsibly. Coupled with the need to guarantee a responsible data ecosystem, this presents a challenging problem for everyone in the value chain, from executives, strategists, data scientists, and developers to product designers. FATTER AI stands for Fair, Accountable, Transparent, Testable, Ethical and Responsible AI; it is an incredibly complex topic, but nonetheless a highly important one. In this talk, I will take the audience through the journey of understanding an evolutionary data governance approach with an explicit focus on FATTER AI, drawing on my own experiences leading data platform and governance teams.
Fujitsu Graph AI Technologies for Powering Business Decisions
Ajay Chander - Fujitsu
Fujitsu’s Deep Tensor Graph AI technologies integrate with Neo4j and work directly on graph data without requiring encoding of the graph into vectors. We describe how we have used the Neo4j and Deep Tensor combination to understand real world data sets and build practical Machine Learning solutions.
How Declarative Configurations and Automation Can Prevent A Data Mutiny
Sean Knapp - Ascend.io
As the data landscape becomes increasingly complex with more data types, application demands and end points, it becomes exponentially harder to ensure data is continuously flowing as required. In response to these accelerating complexities, data engineering teams often attempt to build their own systems in an effort to restore some control and ensure a level of performance. But the complexity can’t be outrun because data is only going to continue growing in quantity and importance. Instead of addressing these challenges one by one as they appear - the data engineering equivalent of Whac-A-Mole - teams need to start at the end and work back.
This is what led us to rethink the state of pipeline orchestration. In taking a page from other technologies, such as Kubernetes, React, and even compilers, we opted to move away from imperative, task-based design and move to declarative, data-centric automation. In this session, Sean Knapp, CEO & founder of Ascend, will discuss what led us to take this approach and essentially rearchitect pipeline orchestration from the ground up.
With more users and applications, data challenges will only continue to get bigger. Data engineers and IT at large are under the gun. They either meet the needs of the business, or risk a “data mutiny” in which internal users create their own shadow IT organizations with a mandate to move fast and scale to better serve the business. We'll walk through how combining the power of declarative configurations with automation can help bring back this balance of speed and flexibility, without risking quality, so the business won’t move on without you.
Stateful Streaming Application Integration - Leveraging Kafka Streams Processor API with KSQL DB
Clay Lambert - Expero
Approaching the Kafka Streams Processor API can seem daunting at first glance, yet it is actually easy to work with and is a powerful tool for solving stateful stream processing problems quickly. Additionally, by leveraging Confluent KSQL DB alongside the Kafka Streams Processor API, solutions can be arrived at even faster.
Large enterprises have a plethora of systems and databases that need to be integrated seamlessly. Incorporating proprietary and legacy databases with custom applications via real-time stream processing is an increasingly common business use case.
There are many situations where such data cannot be easily integrated or efficiently processed using stateless streaming patterns with the Kafka Streams DSL: specifically, where state needs to be established, or data needs to accumulate in stages from multiple sources within the processing pipeline, before being moved in streaming fashion to an application such as an ML-oriented analytics application.
In this talk we will lay out how to build a Kafka-based streaming data pipeline that integrates a proprietary database source and scans multiple Kafka topics fed by other application sources to accumulate state in staged fashion, pulling the data together into coherent data sets that feed a custom data processing application via a real-time data stream.
Specifically you will learn:
- How to feed Kafka topics that drive your workflow pipeline using the Confluent JDBC connector
- How to model and build stream processing component workflow chains using the Kafka Processor API
- How to stage multiple phases of stateful data transformation and data merging within a Kafka streaming processor workflow pipeline
- How to manage state within the Kafka Processor API, with RocksDB and runtime in-memory data structures.
- How to integrate Confluent KSQL DB persistent queries into your streaming workflow pipeline programmatically via REST to reduce code needed to meet tight deadlines
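As a hedged illustration of that last point, here is one way a Python service might submit a ksqlDB persistent query over REST; the server address, stream names, and statement are assumptions for the sketch, not the speaker's actual pipeline.

```python
import json
import requests

KSQLDB_URL = "http://localhost:8088"  # assumed local ksqlDB server

# A persistent query that continuously filters one stream into another;
# the stream names and predicate are hypothetical.
statement = """
    CREATE STREAM large_orders AS
      SELECT order_id, amount
      FROM orders_stream
      WHERE amount > 1000
      EMIT CHANGES;
"""

resp = requests.post(
    f"{KSQLDB_URL}/ksql",
    headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
    data=json.dumps({"ksql": statement, "streamsProperties": {}}),
)
resp.raise_for_status()
print(resp.json())
```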
Creating Cloud-Native Machine Learning Workflows on Kubernetes using ScyllaDB Backend as Persistent Storage
Timo Mechler / Charles Adetiloye - SmartDeployAI
At SmartDeployAI, we build data workflow pipelines for running large scale Industrial IoT applications. Our software platform is a shared multi-tenant Kubernetes cluster environment where multiple workflow pipelines can be bootstrapped and scheduled to run concurrently. When IoT sensors or devices are provisioned on our platform, a new workflow pipeline is first instantiated to run and orchestrate ingested data streams, and then executes the various modular tasks in each stage of the pipeline, such as data transformation / aggregation, near real-time inferencing, rolling window computations, and batched model training. This process requires us to keep track of several markers in our metadata store or, in some cases, the parameters to run various pipeline models. We need to be able to persist and make this information available throughout the entire data workflow pipeline life-cycle, i.e. from when the pipeline is first instantiated until it has run to completion.
In all our use-cases, the primary goals are to:
1. Always minimize latency (network or IO) wherever and whenever possible
2. Maintain data storage isolation for each workflow pipeline even though the pipelines run in a shared Kubernetes cluster
3. Instantaneously bootstrap pipeline artifacts and resources on demand
4. Reduce the resource consumption footprint of each instance of the pipeline as much as possible
Our evolutionary journey in trying to accomplish these goals ultimately led us to ScyllaDB. In this talk we will go through the major architectural decisions we have had to make to accomplish this solution, and how ScyllaDB has helped us to become more nimble and agile in our data workflow pipeline provisioning compared to any other current alternatives.
How to trust your Human-In-The-Loop (HITL) data annotations
Emanuel Ott - iMerit
How to trust your Human-In-The-Loop (HITL) data annotations: A set of ingredients and approaches which make you sleep better at night and turn perceived quality into a set of quantifiable metrics.
When creating HITL annotations, there are unexpected complexities that are not always easy to solve. This is especially the case if your efforts involve large groups of remote workforces, nuanced understanding of specific use cases, and domain-specific knowledge mixed into a rapidly changing set of requirements.
Emanuel has implemented countless annotation workflows for large HITL projects across various applications and industries. Drawing from his and his team’s experience of use cases such as key-value extraction of form documents, semantic segmentation of road scenes for autonomous vehicles and medical MRI analysis, he will walk through the best practices of building efficient HITL annotation workflows by highlighting the commonalities he has observed across these implementations.
Furthermore, Emanuel will explore techniques on how to turn perceived quality into quantifiable metrics, which can be used to report success in a more comprehensive manner.
Data Journeys: From Document DB to RDF to Property Graph
Clark Richey - FactGem
Clark Richey, CTO & Co-Founder of FactGem, will discuss his journey in building a commercial product founded on a labeled property graph database. Describing why the product originated with a major document database, and what forced its transition, Clark will also discuss his experiences with RDF and supporting databases, including challenges inherent to RDF implementations. Clark will describe his search to find a new data storage system to address the challenges encountered when building the RDF based system, and why he selected a labeled property graph database solution. Clark will also share his thoughts around what he sees as the future for graph databases.
Cassandra 4.0 In Action
Jeffrey Carpenter - DataStax
Cassandra 4.0 has been 3 years in the making, and the community is looking forward to an official release this year. If you can’t wait to get your hands on this new release, join Jeff Carpenter for a guided tour of how to enable and use features including virtual tables, full query logging, transient replication, and more. We’ll also take a look forward at what’s next, including Cassandra sidecars, Kubernetes operators, why the Cassandra Enhancement Proposal (CEP) process is a game changer, and what might be in a 5.0 release.
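For a taste of one of those features, here is a minimal sketch, assuming a locally running Cassandra 4.0 node and the DataStax Python driver, that reads the new virtual tables exposing client connections and runtime settings.

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra 4.0 node (contact point is an assumption).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Virtual tables live in the system_views keyspace in Cassandra 4.0.
for row in session.execute("SELECT * FROM system_views.clients"):
    print(row.address, row.username, row.driver_name)

for row in session.execute("SELECT name, value FROM system_views.settings LIMIT 10"):
    print(row.name, row.value)

cluster.shutdown()
```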
Shining a Light on Dark Documents
Dale Markowitz - Google
Data Scientists love working with clean, tabular data stored in queryable databases. Alas, for most businesses, slick datasets like these are the exception rather than the rule. Meanwhile, the majority of useful information is scattered throughout paper documents, images, PDF scans, call recordings, and other equally cumbersome data types. Just as machine learning can make predictions from sanitized numbers in a spreadsheet, so too can it do the heavy lifting of parsing, transforming, and organizing data of all types. In this session, Dale Markowitz will cover all the different techniques needed to transform a set of so-called “dark documents” into a structured, searchable database, including extracting text, identifying entities, and building custom models that can parse forms, contracts, and more.
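As a rough illustration of the first step (extracting text from scans), the sketch below uses a generic open-source OCR library rather than the specific tooling covered in the talk; the file name is hypothetical, and entity extraction and form parsing would follow as separate steps.

```python
import pytesseract
from PIL import Image

# Extract raw text from a scanned document image (hypothetical file).
text = pytesseract.image_to_string(Image.open("invoice_scan.png"))
print(text[:500])
```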
This talk is designed for software developers with little ML expertise, data scientists looking to expedite their workflows, and business professionals seeking to understand how document understanding can transform their workflows.
Fresh and Fast with Apache Druid: Analytics for Real-time and Ad-hoc Applications
Gian Merlino - Imply
Real-time analytics is what happens when you reimagine analytics for a stream-based, real-time world. There are two sides to this coin: ingest and query. They can stand alone: real-time ingestion by itself helps simplify data pipelines, while real-time querying by itself powers interactive, exploratory workflows. But they are much more powerful together. When put together, the combination means you can do more ad-hoc data exploration and monitoring, versus simple reporting.
In this talk, we will drill into what real-time exploration and monitoring are, and how building your data pipeline with Apache Druid is the way it is being done at Netflix, Amazon, Lyft, and many more. We will delve into all the fun, nitty-gritty technical details of how Druid's internals work and the use cases it can power in your organization.
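To make the ad-hoc query side concrete, here is a minimal sketch of issuing a Druid SQL query over HTTP; the endpoint, datasource, and columns are assumptions drawn from Druid's quickstart examples, not the speaker's material.

```python
import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"  # assumed router endpoint

# An ad-hoc aggregation over a streaming-ingested datasource (names are hypothetical).
query = """
    SELECT channel, COUNT(*) AS edits
    FROM wikipedia
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
    GROUP BY channel
    ORDER BY edits DESC
    LIMIT 10
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query})
resp.raise_for_status()
for row in resp.json():
    print(row["channel"], row["edits"])
```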
Scaling Your Cassandra Cluster For Fluctuating Workloads
Brian Hall - Expero
Now that shared-nothing, NoSQL solutions are widely available in the marketplace, deploying clusters of most sizes is both practical and necessary. But how quickly do your clusters need to expand (or contract), and how can you be sure that you've properly planned and budgeted for that deluge of traffic after your new product announcement?
In this case study, we'll walk you through how we set up a dynamically scaling cluster management solution that can add or remove capacity as planned traffic demands, all of it informed by a sophisticated test harness used to game out possible and probable scenarios.
Topics that will be covered are:
1. Deploying complex configuration dynamically within Ansible and Terraform scripts
2. Automating a fluctuating test harness to model spikes and plateaus within read and write traffic patterns with Kubernetes
3. Finding and addressing performance bottlenecks in your Cassandra environment
4. Keeping a healthy Cassandra ecosystem through monitoring and proactive maintenance
5. Push button provisioning for handling “Black Friday” type spikes in demand
The Death of Data: 2020
Heidi Waterhouse - LaunchDarkly
Big data is big business. That’s why we’re here at this conference. But data retention is not neutral, either ethically or financially. How can we understand the impact of data retention, tainted data sets, and human effects? What should we be considering as we work on the data, tools, and governance that predict the future and explicate the past? How can we kon-mari our data and keep only that which sparks joy?
Deep Learning and the Analysis of Time Series Data
Dr. Bivin Sadler - Southern Methodist University
We are all familiar with the 4 Vs of Big Data: Velocity, Veracity, Volume and Variety. With respect to the fourth ‘V’, ‘Variety’, various unstructured data types such as text, image and video data have gained quite a bit of attention lately and continue to gain momentum. However, there is a type of structured data that has maintained its intrigue and importance in both business and academia … Time Series Data. While it is not hard to find applications of Deep Learning methods to text, image and video data, these methods have also shown great promise in the analysis of data collected over time. In this session, we will investigate the theory, implementation and performance of recurrent neural networks (RNNs), long short-term memory (LSTM), and gated recurrent units (GRUs) in the context of forecasting and predictive analytics. In addition, we will compare these deep learning methods with the more traditional but widely used ARMA and ARIMA type models.
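As a minimal sketch of the kind of recurrent model discussed in this session, the toy example below windows a univariate series and fits a small LSTM forecaster with Keras; the data, window size, and layer sizes are arbitrary stand-ins, not material from the talk.

```python
import numpy as np
import tensorflow as tf

def make_windows(series, window):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step targets."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., np.newaxis], y

series = np.sin(np.linspace(0, 60, 1000)).astype("float32")  # toy stand-in for real data
X, y = make_windows(series, window=24)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(24, 1)),   # swap in GRU(32) to compare unit types
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

print(model.predict(X[-1:]))  # one-step-ahead forecast
```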
TinkerPop Keynote
TinkerPop 2020
Joshua Shinavier - Uber
What is now Apache TinkerPop began in late 2009 as a collection of developer tools that intermingled graph structure and processing. The framework quickly evolved into a unifying data model and query paradigm for the emerging family of property graph databases, contributing significantly to the rise of graphs in industry and in the open source community. Ten years later, graphs are truly everywhere; there are dozens of graph systems implementing TinkerPop, and the labeled property graph data model is playing a major role in enterprise knowledge graphs and data integration efforts of all kinds. While TinkerPop has been hugely popular with developers, however, it is likely that graphs have only tapped a fraction of their potential. What will take this community “to eleven” are abstractions for structure and process that are as powerful from a formal and computational point of view as they are compelling to the human developer or end user. In this talk, we will take a brief look back in time at TinkerPop versions 1, 2, and 3 before reviewing the current state of the art and setting the stage for TinkerPop 4. Looking ahead to the next year or so, we are prioritizing strong schemas, strongly-typed traversals, and functional encapsulation of side-effects, exceptions, and transactions that will make data and process in TinkerPop far more portable across language environments and enable new-to-graph capabilities like automated data migration and query translation, as well as various new forms of query optimization. As TinkerPop transcends the JVM, we will rely to a greater extent on composable mappings and code generation to propagate data structures and logic into places no graph has gone before. Done right, we think graphs may become as ubiquitous as the relational model, but so much more interesting and so much more similar to the way we humans naturally structure our world. Of course, the best way to predict the future is to make it happen.
Theory for the Working Database Programmer
Ryan Wisnesky - Conexus
They say theory is not so much learned outright as absorbed over a long period of time. In this talk, we aim to motivate the audience to start absorbing and applying theory by telling the history of how theory (including category theory, database theory, programming language theory, and more) led directly to many of the foundational advances in query processing that we take for granted today, including:
- SQL / relational algebra
- Structural Recursion (fold) on monadic collection types (list,bag,set) as a Query Language
- Language Integrated Comprehensions (map,filter,bind and do-notation)
We conclude by briefly summarizing more recent applications of category theory to data processing, including the Multi-Model Abstract Datatype (mm-ADT), Categorical Query Language (CQL), and Algebraic Property Graph (APG) projects.
MLflow: An open platform to simplify the machine learning lifecycle
Corey Zumar - Databricks
Developing applications that successfully leverage machine learning is difficult. Building and deploying a machine learning model is challenging to do once. Enabling other data scientists (or even yourself, one month later) to reproduce your pipeline, compare the results of different versions, track what’s running where, and redeploy and rollback updated models is much harder.
Corey Zumar offers an overview of MLflow, a new open source project from Databricks that simplifies this process. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment and for managing the deployment of models to production. Moreover, MLflow is designed to be an open, modular platform—you can use it with any existing ML library and incorporate it incrementally into an existing ML development process.
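A minimal sketch of the tracking workflow looks like this, logging parameters, a metric, and a model artifact for a toy scikit-learn run; the model and dataset are placeholders, not part of the talk.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 6}
    model = RandomForestRegressor(**params).fit(X_train, y_train)

    mlflow.log_params(params)                       # hyperparameters for this run
    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_metric("mse", mse)                   # evaluation metric
    mlflow.sklearn.log_model(model, "model")        # versioned model artifact
```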
Learning sequential tasks from human feedback
Brad Knox - Bosch USA
Robots and software-only agents can benefit greatly from harnessing the knowledge and preferences of the humans around them. In particular, human-provided feedback can guide agents as they learn tasks, dramatically reducing the data and computation required to reach satisfactory performance. Alternatively, when a task is not fully defined, human-provided feedback can inform agents what behavior and task specifications fit the specific desires of the person.
Brad will first survey past work on learning from humans as they explicitly provide feedback in a teaching role. He will then share an ongoing project focusing on learning from implicit human feedback, the feedback present in natural human reactions as humans observe an automated system perform tasks they have a shared interest in. The potential information-richness of such reactions is vividly illustrated by the example of someone watching a sports team that they follow intensely; a secondary observer could often infer much about the game -- when each team has scored, when a team is threatening the other with a scoring opportunity, and so on -- by watching only the sports fan's reactions to the game. In another example, passengers in autonomous vehicles will provide reactions based on their evaluations of the driving behavior. This project seeks a rigorous, data-driven interpretation of such human reactions, which are currently untapped, low-cost sources of information about an agent's behavior.
How to use a Knowledge Graph to Improve the Search Experience
Mike Grove - Stardog
It's no secret that data is growing exponentially, and its importance to the enterprise is increasing in kind. It's now more important than ever to be able to find the right information at the right time, but given the increase in the scale of data, finding the needle in the haystack is harder than ever.
Traditional search engines are struggling to keep up. They end up missing results or, just as bad, drowning users in irrelevant ones. Now many are looking at how they can make search smarter, and increasingly, knowledge graphs are the answer.
Unlike other data management tools, a knowledge graph accurately captures both the raw data and its real world context, easily representing and describing the relationships in the data. Further, with their ability to represent data of all types, they provide a wider corpus of information to search over, greatly increasing the odds of finding the right answer.
This presentation will provide an overview of the core features of a knowledge graph and briefly describe how they work. It will then walk through how they apply to the challenges of enterprise search and compare with other options such as Elastic or Solr. Finally, we will dig into how these principles work in the real world: Springer Nature, Roche, and the National Bank of Canada have all improved their search experiences by leveraging a knowledge graph.
Architecting Production IoT Analytics
Paige Roberts - Vertica
Analyzing Internet of Things data has broad applications in a variety of industries from smart buildings to smart farming, from network optimization for telecoms to preventative maintenance on expensive medical machines or factory robots. When you look at technology and data engineering choices, even in companies with wildly different use cases and requirements, you see something surprising: Successful production IoT architectures show a remarkable number of similarities.
Join us as we drill into the data architectures in a selection of companies like Philips, Anritsu, and Optimal+. Each company, regardless of industry or use case, has one thing in common: highly successful IoT analytics programs in large scale enterprise production deployments.
By comparing the architectures of these companies, you’ll see the commonalities, and gain a deep understanding of why certain architectural choices make sense in a variety of IoT applications.
Immutable Data Pipelines for Fun and Profit
Rob McDaniel - Sigma IQ
Modern machine learning platforms require more flexible data pipelines than have traditionally been designed for data applications. While databases may be easy to query, and offer useful tools such as foreign-key constraints, they also tend to be inflexible and often end up requiring much care and feeding. Additionally, they require a certain amount of foresight in terms of schema design -- and capacity planning. Try as you might, you'll never guess the correct schema for a project from the outset; and any future changes to your data model tend to require complicated (or delicate!) data migrations to prevent accidental data loss or corruption. Furthermore, reliance on a living database as your "source of truth" necessitates various data recovery and backup schemes, which everyone hopes are never needed. And worse still, rolling back to previous states is complicated and failure-prone -- if even possible. This talk will discuss various techniques that, while still allowing for standard application development against common databases, completely side-step these limitations by providing a schema-flexible, immutable platform that deemphasizes the concept of a "living" source of truth by leveraging the classic database at the end of the pipeline, rather than at the beginning.
Building a streaming data warehouse using Flink and Pulsar
Sijie Guo - StreamNative
As organizations get better at capturing streaming data, and as data velocity and volume keep increasing, traditional messaging queues and log storage systems suffer from scalability, operational, and maintenance problems.
Apache Pulsar is a cloud-native event streaming system. Pulsar includes multiple features, such as native support for multiple clusters in a Pulsar instance, seamless geo-replication of messages across clusters, very low publishing and end-to-end latency, seamless scalability to over a million topics, and guaranteed message delivery with persistent message storage provided by Apache BookKeeper. Its segment centric storage design along with layered architecture makes Pulsar a perfect unbounded streaming data system for building a real-time streaming data warehouse.
In this talk, Sijie will use one of the most popular stream processing engines, Apache Flink, as an example to share his experiences in building a streaming data warehouse using Apache Pulsar and Apache Flink. Some of the topics covered are:
- Introduction of Pulsar and its architecture
- Introduction of Pulsar’s two-level read API
- Why Pulsar provides a unified data abstraction for both stream and batch processing
- How to implement Pulsar topics as infinite streaming tables based on Pulsar’s schema
- A real-world use case showing how Pulsar is used for real-time risk control at a financial institution
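To ground the Pulsar side of the stack, here is a minimal produce-and-consume sketch using the Pulsar Python client; the service URL, topic, and subscription names are assumptions, and the Flink integration itself is beyond the scope of this snippet.

```python
import pulsar

# Connect to a local standalone Pulsar broker (service URL is an assumption).
client = pulsar.Client("pulsar://localhost:6650")

# Subscribe first so the new subscription sees the messages produced below
# (topic and subscription names are hypothetical).
consumer = client.subscribe("persistent://public/default/payments", "risk-control-sub")
producer = client.create_producer("persistent://public/default/payments")

for i in range(3):
    producer.send(f"payment-{i}".encode("utf-8"))

for _ in range(3):
    msg = consumer.receive(timeout_millis=5000)
    print(msg.data().decode("utf-8"))
    consumer.acknowledge(msg)

client.close()
```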
Managing your Kafka in an explosive growth environment
Alon Gavra - AppsFlyer
Frequently, Kafka is just a piece of the production stack that no one wants to touch, because it just works. Kafka sits at the core of AppsFlyer’s infrastructure that processes billions of events daily.
Alon Gavra dives into how AppsFlyer built its microservices architecture with Kafka as its core piece to support 70B+ requests daily. With continuous growth, the company needed to “learn on the job” how to improve its Kafka architecture by moving to the producer-owner cluster model, breaking up its massive monolith clusters to smaller, more robust clusters and migrating from an older version of Kafka with real-time production clients and data streams. Alon outlines best practices for leveraging Kafka’s in-memory capabilities and built-in partitioning, as well as some of the tweaks and stabilization mechanisms that enable real-time performance at web scale, alongside processes for continuous upgrades and deployments with end-to-end automation, in an environment of constant traffic growth.
Cost-Optimized Data Labeling Strategy (slides)
Jennifer Prendki - Alectio
Active Learning is one of the oldest topics when it comes to Human-in-the-Loop Machine Learning. And with more research than ever on the topic, we seem to be coming closer to a time when all ML projects will adopt some version of it as an alternative to the old brute-force supervised learning approach. That unfortunately doesn’t account for one major caveat: Active Learning essentially runs on the assumption that all provided labels are correct.
In this talk, Jennifer will discuss the tradeoffs between the size of a training set and the accuracy of the labeling process, and present a framework for smart labeling cost optimization that can simultaneously reduce labeling costs and diagnose the model.
This session is part of the Human in the Loop session track.
Practicing data science: A collection of case studies
Rosaria Silipo - KNIME
There are many delineations of data science projects: with or without labeled data; stopping at data wrangling or involving machine learning algorithms; predicting classes or predicting numbers; with unevenly distributed classes, with binary classes, or even with no examples of one of the classes; with structured data and with unstructured data; using past samples or just remaining in the present; with real-time or close-to-real-time execution requirements and with acceptably slower performances; showing the results in shiny reports or hiding the nitty-gritty behind a REST service; and—last but not least—with large budgets or no budget at all.
Rosaria Silipo discusses some of her past data science projects, showing what was possible and sharing the tricks used to solve their specific challenges. You’ll learn about demand prediction in energy, anomaly detection in the IoT, risk assessment in finance, the most common applications in customer intelligence, social media analysis, topic detection, sentiment analysis, fraud detection, bots, recommendation engines, and more.
Learning with limited labeled data
Shioulin Sam - Cloudera Fast Forward Labs
Being able to teach machines with examples is a powerful capability, but it hinges on the availability of vast amounts of data. The data not only needs to exist but has to be in a form that allows relationships between input features and output to be uncovered. Creating labels for each input feature fulfills this requirement, but is an expensive undertaking.
Classical approaches to this problem rely on human and machine collaboration. In these approaches, engineered heuristics are used to smartly select “best” instances of data to label in order to reduce cost. A human steps in to provide the label; the model then learns from this smaller labeled dataset. Recent advancements have made these approaches amenable to deep learning, enabling models to be built with limited labeled data.
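A minimal sketch of the classical select-label-retrain loop described above (uncertainty sampling) might look like the following; the synthetic pool and the oracle labels stand in for real unlabeled data and a human annotator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A synthetic pool standing in for a large unlabeled corpus.
X_pool, y_oracle = make_classification(n_samples=5000, n_features=20, random_state=0)

labeled = list(np.random.RandomState(0).choice(len(X_pool), size=20, replace=False))
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_oracle[labeled])

    # Score the unlabeled pool and pick the examples the model is least sure about.
    probs = model.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - probs.max(axis=1)
    query = [unlabeled[i] for i in np.argsort(-uncertainty)[:20]]

    # A human annotator would label `query` here; we use the oracle labels instead.
    labeled.extend(query)
    unlabeled = [i for i in unlabeled if i not in set(query)]
    print(f"round {round_}: {len(labeled)} labels, "
          f"accuracy={model.score(X_pool, y_oracle):.3f}")
```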
Shioulin Sam explores algorithmic approaches that drive this capability and provides practical guidance for translating this capability into production. You’ll view a live demonstration to understand how and why these algorithms work.
Time-Series analysis in healthcare: A practical approach
Sanjay Joshi - Dell
The 2017 Nobel Prize in medicine and physiology was awarded to Hall, Rosbash and Young for "their discoveries of molecular mechanisms controlling the circadian rhythm." Most of the basis of our "health" relies on these molecular clocks in almost all human cells; yet of the more than 250,000 yearly clinical trials, a tiny few have time-series study design. The hypothesis testing of clinical pathways in the bio-pharmaceutical and healthcare worlds are themselves time-series exercises. Sanjay will present a practical approach to time-series analytics with an app security log dataset and a more complex microbiome dataset.
How to start your first computer vision project
Sanghamitra Deb - Chegg
Computer vision is becoming an integral part of machine learning pipelines underlying products that serve users with recommendations, ranking, search results, etc. This is particularly true in content-driven fields such as education, healthcare, and news, to mention a few. With 2.2 million subscribers and two hundred million content views, Chegg is a centralized hub where students come to get help with writing, science, and math through online tutoring, Chegg Study, flashcards, and other products. Students spend a large amount of time on their smartphones, and very often they will upload a photo related to the concept they are trying to learn. Disciplines such as chemistry or biology are dominated by diagrams, so students submit more images in these disciplines. These images can be of low quality and contain too many irrelevant details, which sometimes makes interpreting the images difficult, which in turn makes helping the students more difficult.
Starting a computer vision project can be intimidating, and several questions come up when creating solutions with image data: Do third-party tools provide feasible solutions? How long does it take to train people in computer vision? Do all image-related problems require machine learning solutions? For machine learning solutions, what is the optimal way to collect data? Is hiring someone with computer vision experience required?
In this presentation I will talk about how to make a cold start in computer vision, exploring (1) traditional image analysis techniques, (2) image classification, and (3) object detection. The goal is to detect low-quality images and to create a cropper that removes irrelevant details from the image. To detect low-quality images we use properties of the image such as the variance of the Laplacian and histogram analysis. These approaches are computationally simple and require very few resources to deploy. For the cropper we use stacked deep learning models: the first model detects the source of the image, and then for each source we have a model that crops out unnecessary details. The cropping task is challenging since the properties of the image outside the cropping boundary are not significantly different from those inside it. The goal of these tasks is to make suggestions and give feedback to students so that they upload higher-quality images that are easy to process. I will present some results and discuss product applications of computer vision projects.
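As a hedged sketch of the low-quality-image check mentioned above, the variance-of-the-Laplacian measure can be computed in a few lines with OpenCV; the threshold and file name are placeholders that would be tuned on labeled examples.

```python
import cv2

def is_low_quality(image_path: str, blur_threshold: float = 100.0) -> bool:
    """Flag blurry uploads using the variance of the Laplacian.

    The threshold is a guess and would be tuned against labeled examples.
    """
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    focus_measure = cv2.Laplacian(gray, cv2.CV_64F).var()
    return focus_measure < blur_threshold

print(is_low_quality("student_upload.jpg"))  # hypothetical file
```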
Bigger data vs. better math: which is most effective in ML?
Brent Schneeman - Alegion
As data science and machine learning teams attempt to squeeze more performance out of ML solutions, they are faced with how to spend their resources: should they pursue more data or invest in different algorithms?
This talk looks at the trade-offs associated with both approaches and attempts to provide practical guidance. It expands on the subject and offers a third dimension: how do biased labels affect performance of ML systems?
Database Keynote
mm-ADT A Multi-Model Abstract Data Type
Marko Rodriguez - RRedux
mm-ADT™ is a distributed virtual machine capable of integrating a diverse collection of data processing technologies. This is made possible via three language-agnostic interfaces: language, process, and storage. When a technology implements a respective interface, the technology is considered mm-ADT compliant and is able to communicate with any other compliant technologies via the virtual machine. In this manner, query language developers can develop languages irrespective of the underlying storage system being manipulated. Processing engines can be programmed by any query language and executed over any storage system. Finally, data storage systems automatically support all mm-ADT compliant query languages and processors.
Follow mm-ADT™ on GitHub, Twitter, StackExchange, or join the Slack Channel.
Knowledge Graph Keynote - A Brief History of Knowledge Graph's Main Ideas
Juan Sequeda - data.world
Knowledge Graphs can be considered to be fulfilling an early vision in Computer Science of creating intelligent systems that integrate knowledge and data at large scale. The term “Knowledge Graph” has rapidly gained popularity in academia and industry since Google popularized it in 2012. It is paramount to note that, regardless of the discussions on, and definitions of the term “Knowledge Graph”, it stems from scientific advancements in diverse research areas such as Semantic Web, Databases, Knowledge Representation and Reasoning, NLP, Machine Learning, among others.
The integration of ideas and techniques from such disparate disciplines gives the notion of Knowledge Graph its richness, but at the same time presents a challenge to practitioners and researchers who want to know how current advances develop from, and are rooted in, earlier techniques.
In this talk, Juan will provide a historical context on the roots of Knowledge Graphs grounded in the advancements of the computer science disciplines of Knowledge, Data and the combination thereof, starting from the 1950s.
Case Study: AnzoGraphDB at Parabole.ai
Steve Sarsfield - Cambridge Semantics / Sandip Bhaumik - Parabole.ai
As AI makes its way into the corporate research domain, is it possible for investors to gain a competitive advantage by using Knowledge Graph-based analytics, NLP, and advanced automated discovery tools? Come to this session for a case study on Parabole’s Alpha ESG solution, powered by AnzoGraph DB from Cambridge Semantics. See how Parabole has developed an application that uses NLP, graph databases, and machine learning to enable portfolio managers and analysts to monitor ESG signals, discover new investment opportunities, expose potential risks, and accelerate knowledge discovery from structured and unstructured data. ESG, the standard nomenclature for environmental, social and governance, drives the investment decisions of many modern fund managers.
Human Centered Machine Learning
Robert Munro - Author / Serial Founder
Most Machine Learning relies on intensive human feedback. Probably 90% of Machine Learning applications today are powered by Supervised Machine Learning, including autonomous vehicles, in-home devices, and every item you purchase on-line. These systems are powered by people. There are thousands of people fine-tuning the models offline by annotating new data. There are also thousands of people tuning the models online: you, the end user, interacting with your car, device or an on-line store. This talk will cover the ways in which you can design Machine Learning algorithms and Human-Computer Interaction strategies at the same time to get the most out of your Machine Learning applications in the real-world.
This session is part of the Human in the Loop session track.
Where’s my lookup table? Modeling relational data in a denormalized world
Rick Houlihan - Amazon Web Services
When Amazon decided to migrate thousands of application services to NoSQL, many of those services required complex relational models that could not be reduced to simple key-value access patterns. The most commonly documented use cases for NoSQL are simplistic, and there’s a large amount of irrelevant and even outright false information published regarding design patterns and best practices for NoSQL applications. For this migration to succeed, Amazon needed to redefine how NoSQL is applied to modern online transactional processing (OLTP) apps. NoSQL applications work best when access patterns are well defined, which means the sweet spot for a NoSQL database is OLTP applications. This is good because 90% of the apps that get written support a common business process which for all practical purposes is the definition of OLTP. One of the common steps in building an OLTP app is designing the entity relationship model (ERM) which essentially defines how the application uses and stores data. With a relational database management system (RDBMS)-backed application, the ERM was essentially mapped directly into the database schema by creating tables for the top-level entities and defining relationships between them as defined in the ERM. With NoSQL, the data is still relational, it just gets managed differently. Rick Houlihan breaks down complex applications and effectively denormalizes the ERM based on workflows and access patterns. He demonstrates how to apply the design patterns and best practices defined by the Amazon team responsible for migrating thousands of RDBMS based applications to NoSQL and when it makes sense to use them.
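To illustrate the access-pattern-first style of modeling Rick describes, here is a hedged sketch of querying a hypothetical single-table design in DynamoDB, where one partition holds a customer and all of that customer's orders; the table name and key schema are invented for illustration, not taken from the talk.

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical single-table design: PK = "CUSTOMER#<id>", SK = "ORDER#<date>", so that
# a single query satisfies the access pattern "fetch a customer's recent orders".
table = boto3.resource("dynamodb").Table("app-table")

resp = table.query(
    KeyConditionExpression=(
        Key("PK").eq("CUSTOMER#42") & Key("SK").begins_with("ORDER#2020-01")
    ),
    ScanIndexForward=False,  # newest first
    Limit=25,
)
for item in resp["Items"]:
    print(item["SK"], item.get("total"))
```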
Information Extraction with Humans in the Loop
Dr. Anna Lisa Gentile - IBM Research Almaden
Information Extraction (IE) techniques enable us to distill knowledge from the abundantly available unstructured content. Some of the basic IE methods include the automatic extraction of relevant entities from text (e.g. places, dates, people, …), understanding relations among them, building semantic resources (dictionaries, ontologies) to inform the extraction tasks, and connecting extraction results to standard classification resources. IE techniques cannot decouple from human input: at a bare minimum, some of the data needs to be manually annotated by a human so that automatic methods can learn patterns to recognize certain types of information. The human-in-the-loop paradigm applied to IE techniques focuses on how to better take advantage of human annotations (the recorded observations) and how much interaction with the human is needed for each specific extraction task. In this talk Dr. Gentile will describe various experiments with the human-in-the-loop model on IE tasks such as building dictionaries from text corpora in various languages, extracting entities from text and matching them to a reference knowledge base, and relation extraction.
This session is part of the Human in the Loop session track.
Knowledge Graph for drug discovery
Dr. Ying Ding - University of Texas at Austin
A critical barrier in current drug discovery is the inability to utilize public datasets in an integrated fashion to fully understand the actions of drugs and chemical compounds on biological systems. There is a need to intelligently integrate the heterogeneous datasets pertaining to compounds, drugs, targets, genes, diseases, and drug side effects now available to enable effective network data mining algorithms to extract important biological relationships. In this talk, we demonstrate the semantic integration of 25 different databases and develop various mining and prediction methods to identify hidden associations that could provide valuable directions for further exploration at the experimental level.
Responsible AI Requires Context and Connections
Amy Hodler - Neo4j
As creators and users of artificial intelligence (AI), we have a duty to guide the development and application of AI in ways that fit our social values, in particular, to increase accountability, fairness and public trust. AI systems require context and connections to have more responsible outcomes and make decisions similar to the way humans do.
AI today is effective for specific, well-defined tasks but struggles with ambiguity which can lead to subpar or even disastrous results. Humans deal with ambiguities by using context to figure out what’s important in a situation and then also extend that learning to understanding new situations. In this talk, Amy Hodler will cover how artificial intelligence (AI) can be more situationally appropriate and “learn” in a way that leverages adjacency to understand and refine outputs, using peripheral information and connections.
Graph technologies are a state-of-the-art, purpose-built method for adding and leveraging context from data and are increasingly integrated with machine learning and artificial intelligence solutions in order to add contextual information. For any machine learning or AI application, data quality – and not just quantity – is critical. Graphs also serve as a source of truth for AI-related data and components for greater reliability. Amy will discuss how graphs can add essential context to guide more responsible AI that is more robust, reliable, and trustworthy.
Creating Explainable AI with Rules
Jans Aasman - Franz Inc.
This talk is based on Jans' recent article for Forbes magazine.
"There’s a fascinating dichotomy in artificial intelligence between statistics and rules, machine learning and expert systems. Newcomers to artificial intelligence (AI) regard machine learning as innately superior to brittle rules-based systems, while the history of this field reveals both rules and probabilistic learning are integral components of AI.
This fact is perhaps nowhere truer than in establishing explainable AI, which is central to the long-term business value of AI front-office use cases."
"The fundamental necessity for explainable AI spans regulatory compliance, fairness, transparency, ethics and lack of bias -- although this is not a complete list. For example, the effectiveness of counteracting financial crimes and increasing revenues from advanced machine learning predictions in financial services could be greatly enhanced by deploying more accurate deep learning models. But all of this would be arduous to explain to regulators. Translating those results into explainable rules is the basis for more widespread AI deployments producing a more meaningful impact on society."
Moving Your Machine Learning Models to Production with TensorFlow Extended
Jonathan Mugan - DeUmbra
ML is great fun, but now we want it to solve real problems. To do this, we need a way of keeping track of all of our data and models, and we need to know when our models fail and why. This talk will cover how to move ML to production with TensorFlow Extended (TFX). TFX is used by Google internally for machine-learning model development and deployment, and it has recently been made public. TFX consists of multiple pipeline elements and associated components, and this talk will cover them all, but three elements are particularly interesting: TensorFlow Data Validation, TensorFlow Model Analysis, and the What-If Tool.
The TensorFlow Data Validation library analyzes incoming data and computes distributions over the feature values. This can show us which features may not be useful, maybe because they always have the same value, or which features may contain bugs. TensorFlow Model Analysis allows us to understand how well our model performs on different slices of the data. For example, we may find that our predictive models are more accurate for events that happen on Tuesdays, and such knowledge can be used to help us better understand our data and our business. The What-If Tool is an interactive tool that allows you to change data and see what the model would say if a particular record had a particular feature value. It lets you probe your model, and it can automatically find the closest record with a different predicted label, which allows you to learn what the model is homing in on. Machine learning is growing up.
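A minimal sketch of the TensorFlow Data Validation flow described above might look like this; the CSV paths are hypothetical.

```python
import tensorflow_data_validation as tfdv

# Compute feature statistics for a training set and a new serving batch
# (file paths are placeholders).
train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
serving_stats = tfdv.generate_statistics_from_csv(data_location="new_batch.csv")

# Infer a schema from the training data, then check the new batch against it.
schema = tfdv.infer_schema(statistics=train_stats)
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)
print(anomalies)  # reports missing values, unexpected categories, drifted features, ...
```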
Machine Learning Counterclockwise
Shawn Rutledge - Sigma IQ
"Ship early, ship often". Continuous integration. Test-driven development. As software engineering has matured in industry, these and other patterns/anti-patterns have helped avoid the pitfalls common to moving from code to working software systems. As machine learning grows in practice and pervasiveness in industry, similar patterns can help the move from models into working machine learning pipelines and systems. I'll propose some of these patterns and relate tips and best practices from an industry practitioner's perspective. Along the way I hope to challenge researchers working in the area of machine learning with open problems in feature representations, model deployments, and model explanation.
Automated Encoding of Knowledge from Unstructured Natural Language Text into a Graph Database
Chris Davis - Lymba
Most contemporary data analysts are familiar with mapping knowledge onto tabular data models like spreadsheets or relational databases. However, these models are sometimes too coarse to capture subtle relationships between granular concepts across different records. Graph databases provide this conceptual granularity, but they typically require that knowledge be curated and formatted by subject matter experts, which is extremely time- and labor-intensive. This presentation describes an approach to automating the conversion of natural language text into a structured RDF graph database.
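As an illustration of the target representation only (not Lymba's extraction pipeline), here is a toy sketch that loads already-extracted subject/relation/object tuples into an RDF graph with rdflib; the namespace and tuples are assumptions.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/kb/")
g = Graph()
g.bind("ex", EX)

# Tuples an upstream NLP step might have extracted from text (made up here).
extracted = [
    ("AcmeCorp", "acquired", "WidgetCo"),
    ("WidgetCo", "headquarteredIn", "Austin"),
]
for subj, rel, obj in extracted:
    g.add((EX[subj], EX[rel], EX[obj]))

print(g.serialize(format="turtle"))
```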
90 min Workshop
Intro to RDF/SPARQL and RDF*/SPARQL*
Thomas Cook - Cambridge Semantics
SPARQL is the premier query language to retrieve and manipulate data stored in Resource Description Framework (RDF) format, a format that’s used for many applications – from elastic search to NLP to web data sharing. Come to this hands-on session to learn about this declarative language and its similarities to SQL.
This session will also look at RDF*/SPARQL*, the next evolution of RDF/SPARQL, which has been proposed to the W3C as a standard for labeled property graph support. Labeled property graphs let users add properties to triples that further define the relationships in the data, such as start and end dates, provenance, weight, score, or veracity, a much-needed capability when doing analytics, running graph algorithms, or applying data science/ML functions.
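A small sketch of the flavor of querying covered in the workshop, using rdflib to run a SPARQL SELECT over a tiny in-memory graph; the data and prefix are made up. The trailing comment shows the RDF* syntax for annotating a triple itself.

```python
from rdflib import Graph

ttl = """
@prefix ex: <http://example.org/> .
ex:alice ex:worksFor ex:AcmeCorp .
ex:bob   ex:worksFor ex:AcmeCorp .
"""
g = Graph()
g.parse(data=ttl, format="turtle")

q = """
PREFIX ex: <http://example.org/>
SELECT ?person WHERE { ?person ex:worksFor ex:AcmeCorp . }
"""
for row in g.query(q):
    print(row.person)

# RDF* (the second half of the workshop) attaches properties to a triple itself, e.g.:
#   << ex:alice ex:worksFor ex:AcmeCorp >> ex:since "2019" .
# Tool support for this syntax varies by engine and version.
```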
Managing Relationships in the Healthcare Industry with Graphileon: A CHG Healthcare Use Case
Tyler Glaittli - CHG Healthcare
One of the great powers of graph technology is that it visually communicates relationships to users. At CHG Healthcare, we're using Graphileon to manage complex relationships in America's healthcare industry in an intuitive and non-technical way. Many information systems in production today weren't built to manage multi-layered, complex relationships. As part of CHG's digital transformation, we're integrating a new wave of tools to bring order to the chaos. Neo4j and Graphileon are at the forefront of this transformation. In this presentation, we'll describe some of the challenges of managing complex relationships in a monolithic framework with legacy systems. Then we'll show how a solution built with Graphileon's components (functions connected by triggers) allows us to manage the most critical parts of our industry. Lastly, we'll share some of the lessons we learned.
Modeling, Querying, and Seeing Time Series Data within a Self-Organizing Mesh Network
Denise Gosnell - DataStax
Self-organizing networks rely on sensor communication and a centralized mechanism, like a cell tower, for transmitting the network's status.
So, what happens if the tower goes down? And, how does a graph data structure get involved in the network's healing process?
In this session, Dr. Gosnell will show how graphs emerge in this dynamic network and how path information helps sensors come back online. She will walk through the data, the model, and the Gremlin queries that give a power company real-time visibility into different failure scenarios.
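The session's queries are not reproduced here, so the following is only a hypothetical gremlinpython sketch of the kind of path traversal described: starting from an offline sensor and walking connections until a healthy tower is reached. The labels, property keys, and endpoint are assumptions.

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.graph_traversal import __

g = traversal().withRemote(
    DriverRemoteConnection("ws://localhost:8182/gremlin", "g"))

# From an offline sensor, repeat sensor-to-sensor hops (no revisits) until a tower
# vertex is reached, and return one candidate communication path.
path = (g.V().has("sensor", "sensorId", "S-1042")
          .repeat(__.out("connectsTo").simplePath())
          .until(__.hasLabel("tower"))
          .path()
          .limit(1)
          .toList())
print(path)
```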
Building a Graph User Interface for Malware Analysis
Stefan Hausotte - G DATA / Ethan Hasson - Expero
As a security company, G DATA has built a large JanusGraph database with information about different malware threats over the years. In this talk, the presenters will show how they built a web interface to explore the data. This allows malware analysts to get insights about threats without needing to query the database manually, and an appealing visualization helps them understand the connections between threats. They will also discuss how they built a GraphQL API for the JanusGraph instance and how the user interface was built with open-source JavaScript libraries.
90 min Workshop
Graph Feature Engineering for More Accurate Machine Learning
Amy Hodler / Justin Fine - Neo4j
Graph-enhanced AI and ML are changing the landscape of intelligent applications. In this workshop, we’ll focus on using graph feature engineering to improve the accuracy, precision, and recall of machine learning models. You’ll learn how graph algorithms can provide more predictive features. We’ll illustrate a link prediction workflow using Spark and Neo4j to predict collaboration, and we'll discuss how to avoid missteps along with tips for getting measurable improvements.
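As a taste of graph feature engineering, here is a hedged sketch (not the workshop's exact code) that uses the Neo4j Python driver and plain Cypher to compute a common-neighbors count, a classic link-prediction feature, for author pairs that have not yet collaborated. The labels, relationship type, and credentials are assumptions.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (a:Author)-[:CO_AUTHORED]-(m)-[:CO_AUTHORED]-(b:Author)
WHERE id(a) < id(b) AND NOT (a)-[:CO_AUTHORED]-(b)
RETURN a.name AS a, b.name AS b, count(DISTINCT m) AS common_neighbors
ORDER BY common_neighbors DESC
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query):
        # common_neighbors becomes one input feature for the downstream model
        print(record["a"], record["b"], record["common_neighbors"])
driver.close()
```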
Securing your Analytics Environment at Scale
Rick Paiste / Sam Jacob - Expero
Most sophisticated enterprises have more than one compute environment, more than one department doing sophisticated analysis and more than one data source they are interrogating for information. To complicate matters, enterprises must deal with a myriad of data sources, data source types, authorization schemes, auditability requirements and regulatory compliance mandates. In this talk, we’ll explore solution options to address these requirements and the trade-offs involved.
Along the way, you will learn about:
- Techniques to enforce fine-grained access control to data hosted in a virtual data lake
- Applicable open source technologies: Presto, Hive, Ranger, Impala, Alluxio, Parquet, Hudi
- Commercial product offerings that we explored in this space
- What we discovered and implemented
Using Graphs to Improve Machine Learning and Produce Explainable AI
Victor Lee - TigerGraph
Graphs are changing the rules for machine learning. Traditionally, data must be structured as a 2D matrix or 3D tensor for analysis, and dimensionality reduction is often employed. Graphs, on the other hand, are an intuitive, adaptable, and efficient way to represent knowledge and relationships in an unlimited number of dimensions, where each type of relationship represents an additional dimension. In short, graphs provide a potentially richer source of knowledge, but data scientists need to know practical methods for leveraging this resource.
The session combines lecture, demonstration, and hands-on exercises to introduce how scalable graph analytics platforms are being used today for machine learning and explainable AI.
GQL: Get Ready for a Standard Graph Query Language
Stefan Plantikow - Neo4j
A new standard query language is coming. For the first time in decades, the ISO Standards Committee has initiated work on a new language, a graph query language (GQL). With the backing of experts from seven nations and major database vendors, an early draft of this query language for property graphs is now ready.
Attend this session to learn about the initial features for GQL (ISO/IEC 39075), as well as ongoing efforts by the community and in the official standardization bodies. Get early information on capabilities such as the generalized property graph data model, advanced pattern matching, graph schema, parameterized graph views, query composition, and the advanced type system. As GQL is a sibling to SQL, we’ll also discuss how it aligns with shared features from the upcoming edition of SQL.
This talk will help you get ready for GQL, an unprecedented development in the graph landscape, with tips on planning for future transitions. You’ll also get guidance on how to engage with the GQL community and how to keep up to date with the official standardization process.
AI/ML Model Serving using Apache Pulsar Functions
Karthik Ramasamy - Streamlio
Machine learning and AI are invading every enterprise. While many algorithms and processes have been invented for model building and training, one area that is not getting enough attention is model serving: how to serve the models that have been trained using machine learning and AI. Today, model serving is done in an ad-hoc fashion where the model is often encoded in the application code. When a new model is available, the application has to be changed, debugged, and redeployed. This increases the iteration time for getting new models into production and getting feedback before the model is further enhanced. Furthermore, the problem is compounded by the fact that the responsibility for model training and model serving often lies with different teams. In this talk, we will propose an overall architecture in which both data and models flow through a streaming system, so that models are served in real time on incoming data and updated without restarting the model-serving application, using Apache Pulsar. We will delve into the details of how Apache Pulsar's architecture and the lightweight compute paradigm of Pulsar Functions simplify model serving.
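To ground the idea, here is a minimal sketch of a Python Pulsar Function that scores incoming events with a pre-trained model; the topics, model path, and feature format are assumptions rather than the speaker's implementation.

```python
import json
import pickle

from pulsar import Function  # Pulsar Functions Python SDK


class ModelServer(Function):
    def __init__(self):
        # Hypothetical model location; in the streaming design described above, model
        # updates could instead arrive on a dedicated topic so serving never restarts.
        with open("/models/latest.pkl", "rb") as f:
            self.model = pickle.load(f)

    def process(self, input, context):
        # One event arrives from the function's input topic as a JSON string.
        event = json.loads(input)
        score = self.model.predict([event["features"]])[0]
        context.get_logger().info("scored event %s", event.get("id"))
        # The returned value is published to the function's output topic.
        return json.dumps({"id": event.get("id"), "score": float(score)})
```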
Using interactive Querying of Streaming Data for Anomaly Detection
Karthik Ramasamy - Streamlio
Identifying a sequence of events that could indicate anomalous behavior of a system needs to happen immediately on the data stream, so that alerting and reaction are fast enough to minimize risk or damage. The delays inherent in batch processing are unacceptable for this case, so the most common approach has been to use a streaming system with windowing semantics to evaluate anomalies in each window. Such an approach requires predetermining the window size and thus sizing the streaming system a priori. With this approach, changes in window size and changes in the logic for detecting anomalies and outliers require expensive debugging and performance tuning.
In this talk, we outline an alternative approach where streams are stored as soon as data is captured, organized by time, and queried with SQL to surface anomalies and outliers. For this approach to work seamlessly, the supporting technology needs to provide fast ingestion, durable storage, data organization optimized for fast access, and the ability to query shards of data in parallel using SQL. As part of this talk, we will describe Apache Pulsar and how it satisfies the requirements to capture and organize data streams and to make data available within a few milliseconds. We'll also cover how SQL queries are processed in parallel over time-based slices of distributed data segments. Finally, we will outline the various anomaly detection techniques that exploit Pulsar SQL.
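A sketch of what querying Pulsar SQL from Python might look like, using the Presto Python client to surface per-minute event counts above a threshold; the topic name, schema, and threshold are assumptions, while __publish_time__ is Pulsar SQL's per-message timestamp column.

```python
import prestodb  # pip install presto-python-client

# Pulsar SQL exposes topics as Presto tables under the "pulsar" catalog.
conn = prestodb.dbapi.connect(
    host="localhost", port=8081, user="analyst",
    catalog="pulsar", schema="public/default")
cur = conn.cursor()

# Flag minutes whose event counts exceed a (made-up) threshold.
cur.execute("""
    SELECT date_trunc('minute', __publish_time__) AS minute, count(*) AS events
    FROM "sensor-readings"
    GROUP BY 1
    HAVING count(*) > 10000
    ORDER BY 1
""")
for minute, events in cur.fetchall():
    print(minute, events)
```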
Query Processor of GraphflowDB and Techniques for the Graph Databases of 2020s
Semih Salihoglu - University of Waterloo
Graph database management systems (GDBMSs), in contemporary jargon, refer to systems that adopt the property graph model and often power applications, such as fraud detection and recommendations, that require very fast joins of records, often beyond the performance that existing relational systems generally provide. There are several techniques universally adopted by GDBMSs to speed up joins, such as double indexing of pre-defined relations in adjacency lists and ID-based hash joins. In this talk, I will give an overview of the query processor of GraphflowDB, a graph database we are actively developing at the University of Waterloo, which integrates three other novel techniques to perform very fast joins tailored for large-scale graphs: (1) worst-case optimal, intersection-based joins; (2) a novel indexing subsystem that allows indexing subsets of edges, similar to relational views, and binding adjacency lists to edges; and (3) factorized processing, which allows query processing on compressed intermediate data. These techniques have been introduced by the theory community in the context of relational database management systems, but I will argue that some of their best applications are in GDBMSs.
90 min Workshop
Ontology for Data Scientists
Michael Uschold - Semantic Arts
We start with an interactive discussion to identify the main things that data scientists do, why they do them, and what some of the key challenges are. We then give a brief overview of ontology and semantic technology, with the goal of identifying how and where it may be useful for data scientists.
The main part of the tutorial gives a deeper understanding of what an ontology is and how it is used. This technology grew out of core AI research in the 70s and 80s and was formalized and standardized in the 00s by the W3C under the rubric of the Semantic Web. We introduce the following foundational concepts for building an ontology in OWL, the W3C standard language for representing ontologies (a short code sketch follows the list):
- Individual things are OWL individuals - e.g., JaneDoe
- Kinds of things are OWL classes - e.g., Organization
- Kinds of relationships are OWL properties - e.g., worksFor
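Here is the promised sketch: a few rdflib statements that express the three building blocks above as OWL triples. The example namespace and any names beyond those in the bullets are assumptions.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/ontology/")
g = Graph()
g.bind("ex", EX)

g.add((EX.Organization, RDF.type, OWL.Class))         # a kind of thing
g.add((EX.worksFor, RDF.type, OWL.ObjectProperty))    # a kind of relationship
g.add((EX.worksFor, RDFS.range, EX.Organization))
g.add((EX.JaneDoe, RDF.type, OWL.NamedIndividual))    # an individual thing
g.add((EX.JaneDoe, EX.worksFor, EX.AcmeCorp))         # AcmeCorp is a made-up example

print(g.serialize(format="turtle"))
```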
Through interactive sessions, participants will identify what the key things are in everyday subjects and how they are related to each other. We will start to build an ontology in healthcare, using this as a driver to introduce key OWL constructs that we use to describe the meaning of data. Key topics and learning points will be:
- An ontology is a model of subject matter that you care about, represented as triples.
- Populating the ontology as triples using TARQL, R2RML and SHACL
- The ontology is used as a schema that gives data meaning.
- Building a semantic application using SPARQL.
We close the loop by again considering how ontology and semantic technology can help data scientists, and what next steps they may wish to take to learn more.
JGTSDB: A JanusGraph/TimescaleDB Mashup
Ted Wilmes - Expero
Time series data is ubiquitous, appearing in many use cases including finance, supply chain, and energy production. Consequently, "How should I model time series data in my graph database?" is one of the top questions folks have when first kicking the tires on a graph database like JanusGraph. Time series data presents a number of challenges for a graph database, both as it's coming into the system and when it's being read back out. High volume and velocity mean you need to ingest tens to hundreds of thousands of points per second (or more!). Users expect to be able to perform low-latency aggregations and more complicated analytics functions over this time series data. JanusGraph can meet the ingest requirements, but it requires some very specific data modeling and tuning tricks that frequently are not worth the extra development complexity. Because of this, we'd usually recommend storing this data in an entirely different database that is better suited to time series workloads. For this talk, we will discuss an alternative approach where we integrate TimescaleDB access into JanusGraph itself, allowing users to write a single Gremlin query that transparently traverses their graph and time series data. This setup inherits the operational characteristics of TimescaleDB while providing a single, unified, low-latency query interface that does not require careful and specific graph data modeling and tuning.
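For context, here is a hedged sketch of the two queries such a mashup aims to fold into one Gremlin traversal: a JanusGraph lookup for the devices of interest, followed by a TimescaleDB aggregation over their readings. The schema, table, and column names are assumptions, not JGTSDB's actual interface.

```python
import psycopg2
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(
    DriverRemoteConnection("ws://localhost:8182/gremlin", "g"))

# Graph side: which meters hang off a given substation? (hypothetical schema)
meter_ids = (g.V().has("substation", "name", "north-42")
               .out("feeds").hasLabel("meter")
               .values("meterId").toList())

# Time series side: 5-minute averages for those meters over the last day.
conn = psycopg2.connect("dbname=metrics user=postgres")
cur = conn.cursor()
cur.execute("""
    SELECT time_bucket('5 minutes', ts) AS bucket, meter_id, avg(reading)
    FROM meter_readings
    WHERE meter_id = ANY(%s) AND ts > now() - interval '1 day'
    GROUP BY bucket, meter_id
    ORDER BY bucket
""", (meter_ids,))
for bucket, meter_id, avg_reading in cur.fetchall():
    print(bucket, meter_id, avg_reading)
```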
Crime Analysis with Visual Graph Transformation
Sony Green - Kineviz
Criminals are innovative. Because their strategies are constantly evolving, the data model that revealed pattern A could miss pattern B completely. We’ll present an intuitive analytics methodology based on visual graph transformations that gives crime fighters the flexibility to form inferences and iterate rapidly.
The multitude of attack vectors available to criminals enables their strategies to evolve rapidly. This imposes a limited shelf life on fraud-fighting solutions. It's imperative to integrate new (and frequently messy) data sources as well as to combine existing data sources in new ways. Combining sources like biometrics, social media, financial, geospatial, and time series data can reveal fraud rings, identity theft, and more. Unfortunately, quickly iterating on these data sources and managing them long term is a nightmare with traditional static schemas.
We have defined a set of Graph Operators (Map, Extract, Aggregate, Link, and Shortcut) that enable rapid, non-destructive transformation of a property graph. Individually or in concert, these operators enable statistical analysis, anomaly detection, and the simplification or delineation of graph patterns. The resulting workflow is intuitive and highly visual, enabling non-technical users to perform complex analyses. For fraud fighters, effective application of graph operators is a game changer.
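As a purely hypothetical reading of one of these operators, the sketch below collapses nodes that share a property into summary nodes, the spirit of an Aggregate step, using networkx; it is an illustration, not Kineviz's implementation.

```python
import networkx as nx

# A tiny property graph of accounts (made-up data).
g = nx.Graph()
g.add_node("acct1", bank="FirstBank", balance=1200)
g.add_node("acct2", bank="FirstBank", balance=300)
g.add_node("acct3", bank="SecondBank", balance=9500)

# "Aggregate": collapse accounts into one summary node per bank, keeping counts and totals.
agg = nx.Graph()
for node, attrs in g.nodes(data=True):
    bucket = attrs["bank"]
    if bucket not in agg:
        agg.add_node(bucket, count=0, total_balance=0)
    agg.nodes[bucket]["count"] += 1
    agg.nodes[bucket]["total_balance"] += attrs["balance"]

print(dict(agg.nodes(data=True)))
```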