2014 workshops

In addition to presentations, birds-of-a-feather meetings, book signings, and socials, Data Day will host a variety of workshops on the technologies requested by our attendees. Below are some of the workshops scheduled. We'll be publishing the remainder in the next few days.

For those of you interested in a deep dive into machine learning, Paco Nathan will be teaching his Hands-on Introduction to Machine Learning on Friday, January 10. This is a full day course.

How to Build a Hadoop Data Application

Eric Sammer, O'Reilly Author and Engineering Manager, Cloudera
With such a large number of components in the Hadoop ecosystem, writing Hadoop applications can be a challenge for users who are new to the platform. The Cloudera Development Kit (CDK) is an open source project with the goal of simplifying Hadoop application development. It codifies best practices for writing Hadoop applications by providing documentation, examples, tools, and APIs for Java developers.
We will discuss the architecture of a common data pipeline, from the ingest of application data through to report generation. Hadoop concepts and components (including HDFS, Avro, Flume, Crunch, HCatalog, Hive, Impala, and Oozie) will be introduced along the way and explained in the context of solving a concrete problem for the application. The goal is to build a simple end-to-end Hadoop data application that you can take away and adapt to your own use cases.
Attendees should be familiar with Java and common enterprise APIs like Servlets. No prior experience with Hadoop is necessary, although an awareness of what the components in the Hadoop stack do is a plus.

Introduction to KNIME Data Mining Software (2 hours)

Michael Berthold
We're fortunate to have the folks from KNIME joining us for Data Day. This is their first trip to Texas. Michael Berthold, the original developer behind KNIME, will be speaking as well. For those of you unfamiliar with it, KNIME, the Konstanz Information Miner, is an open source data analytics, reporting, and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. A graphical user interface allows assembly of nodes for data preprocessing (ETL: Extraction, Transformation, Loading), for modeling, and for data analysis and visualization. Since 2006, KNIME has been used in pharmaceutical research, but it is also used in other areas such as CRM customer data analysis, business intelligence, and financial data analysis.
Additional details on this workshop are forthcoming.

Mining Social Web APIs with IPython Notebook (2 hours)

Matthew Russell, CTO of Digital Reasoning and author of Mining the Social Web.
Social web properties such as Twitter, Facebook, LinkedIn, and Google+ have vast amounts of valuable insights lurking just beneath the surface, and this workshop minimizes the barriers to exploring and mining this valuable data by presenting turn-key examples from Mining the Social Web (2nd Edition) with IPython Notebook.
Each module of the workshop begins with a brief period in which attendees customize the corresponding notebook with their own account credentials; the remainder of the module is devoted to learning what data is available from the API and to exercises demonstrating analysis of that data—all from a pre-populated IPython Notebook. Even attendees with minimal programming experience should be able to walk away from this workshop with a working knowledge of the material, equipped with sample code that can be easily repurposed given the design of this tutorial.
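To give a flavor of the kind of analysis the notebooks walk through, here is a minimal sketch of counting term frequencies across tweets. The sample texts and function names below are illustrative (they are not taken from the book's notebooks); in the workshop, the tweet texts would come back from the Twitter API after you plug in your own credentials.

```python
from collections import Counter

# Stand-in tweet texts; in the workshop these would be fetched from
# the Twitter API using your own account credentials.
sample_tweets = [
    "Big data is big at #ddtx14",
    "Learning Hadoop and Mesos at #ddtx14",
    "IPython Notebook makes data mining approachable",
]

def term_frequencies(tweets):
    """Lowercase each tweet, split on whitespace, and count terms."""
    counts = Counter()
    for text in tweets:
        counts.update(text.lower().split())
    return counts

freqs = term_frequencies(sample_tweets)
print(freqs.most_common(3))
```

The same pattern—pull data from an API, then explore it interactively with a few lines of standard-library Python—is what makes the notebook format approachable for newcomers.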

Apache Mesos as the Building Blocks for Distributed Systems (2 hours)

Paco Nathan, O'Reilly author and Mesos evangelist