# R User Day at Data Day Texas

For several years, we've received a multitude of requests from the data community to increase the Data Day coverage of the **R language and environment**. This year, we decided to do something about it.

**Imagine if Austin had a world class R User Conference.**

Early this spring, our friend **Daniel Woodie** took over as organizer of the **Austin R User Group**. Seeing the great job he is doing to revive the group and bring in good content, we asked him if he would help us curate an R User track for Data Day Texas. After reviewing our lists of potential speakers, we decided not to limit R to an individual track, but to go all the way -- and create a mini R conference within Data Day Texas. We will set aside an entire portion of the conference facility for the R community. **You will not need to buy a separate ticket for R User Day. Your Data Day Texas ticket gets you into all the content for R User Day as well.**

# Confirmed Talks for R User Day

## Pilgrim’s Progress: a journey from confusion to contribution

*Mara Averick - RStudio*

Navigating the data science landscape can be overwhelming. Luckily, you don't have to do it alone! In fact, I'll argue you shouldn't do it alone. Whether it be by tweeting your latest mistake, asking a well-formed question, or submitting a pull request to a popular package, you can help others and yourself by "learning out loud." No matter how much (or little) you know, you can turn your confusion into contributions, and have a surprising amount of fun along the way.

## Making Causal Claims as a Data Scientist: Tips and Tricks Using R

*Lucy D'Agostino - Vanderbilt University Medical Center*

Making believable causal claims can be difficult, especially with the much-repeated adage “correlation is not causation”. This talk will walk through some tools often used to practice safe causation, such as propensity scores and sensitivity analyses. In addition, we will cover principles that suggest causation, such as understanding counterfactuals and applying Hill’s criteria in a data science setting. We will walk through specific examples, as well as provide R code for all methods discussed.

## Statistics for Data Science: what you should know and why

*Gabriela de Queiroz - R-Ladies*

Data science is not only about machine learning. To be a successful data person, you also need a significant understanding of statistics. Gabriela de Queiroz walks you through the top five statistical concepts every Data Scientist should know to work with data.

## R, What is it good for? Absolutely Everything

*Jasmine Dumas - Simple Finance*

Good does not mean great, but good is better than bad. When we try to compare programming languages we tend to look at the surface components (popular developer influence, singular use cases or language development & design choices) and sometimes we forget the substantive (sometimes secondary) components of what can make a programming language appropriate for use, such as: versatility, environment and inclusivity. I’ll highlight each of these themes in the presentation to show, not tell, why R is good for everything!

## Introduction to SparkR in AWS EMR (90 minute session)

*Alex Engler - Urban Institute*

This session is a hands-on tutorial on working in Spark through R and RStudio in AWS Elastic MapReduce (EMR). The demonstration will give an overview of how to launch and access Spark clusters in EMR with R and RStudio installed. Participants will be able to launch their own clusters and run Spark code during an introduction to SparkR, including the sparklyr package, for data science applications. Theoretical concepts of Spark, such as the directed acyclic graph and lazy evaluation, as well as mathematical considerations of distributed methods, will be interspersed throughout the training. Follow-up materials on launching SparkR clusters and tutorials in SparkR will be provided.

Intended Audience: R users who are interested in a first foray into distributed cloud computing for the analysis of massive datasets. No big data, dev ops, or Spark experience is required.

## Using R for Advanced Analytics with MongoDB

*Jane Fine - MongoDB*

In the age of big data, organizations rely on data scientists to provide critical decision support and predictive analysis. Most industries now leverage new kinds of data to innovate, understand their customers, and capture new markets. MongoDB’s flexible schema and scalability make it a natural choice for storing the diverse data sets needed to accomplish these tasks.

In this session, we will explore the tools and design patterns available to the data scientist to harness the power of MongoDB for data preparation and enrichment. We will focus on R for advanced analytics, utilizing mongolite as well as the MongoDB Spark Connector R API.

## infer: an R package for tidy statistical inference

*Chester Ismay - DataCamp*

How do you code up a permutation test in R? What about an ANOVA or a chi-square test? Have you ever been uncertain as to exactly which type of test you should run given the data and questions asked? The `infer` package was created to unite common statistical inference tasks into an expressive and intuitive framework to alleviate some of these struggles. This talk will focus on the design principles of the package, which are firmly motivated by Hadley Wickham's tidy tools manifesto. It will also discuss the implementation, centered on the common conceptual threads that link a surprising range of hypothesis tests and confidence intervals. Lastly, we'll walk through some examples of how to use the `infer` package. The package aims to be useful to new students of statistics as well as seasoned practitioners.

## Something old, something new, something borrowed, something blue: Ways to teach data science (and learn it too!)

*Albert Y. Kim - Amherst College*

How can we help newcomers take their first steps into the world of data science and statistics? In this talk, I present ModernDive: An Introduction to Statistical and Data Sciences via R, an open source, fully reproducible electronic textbook available at ModernDive.com, co-authored by myself and Chester Ismay, Data Science Curriculum Lead at DataCamp. ModernDive’s authoring follows a paradigm of “versions, not editions” much more in line with software development than traditional textbook publishing, as it is built using RStudio’s bookdown interface to R Markdown. In this talk, I will present details on our book’s construction, our approaches to teaching novices to use tidyverse tools for data science (in particular ggplot2 for data visualization and dplyr for data wrangling), how we leverage these data science tools to teach data modeling via regression, and preview the new infer package for statistical inference, which performs statistical inference using an expressive syntax that follows tidy design principles. We’ll conclude by presenting example vignettes and R Markdown analyses created by undergraduate students to demonstrate the great potential yielded by effectively empowering new data scientists with the right tools.

## Building Shiny Apps: Challenges and Responsibilities

*Jessica Minnier - Oregon Health and Science University*

R Shiny has revolutionized the way statisticians and data scientists distribute analytic results and research methods. We can easily build interactive web tools that empower non-statisticians to interrogate and visualize their data or perform their own analyses with methods we develop. However, ensuring the user has an enjoyable experience while guaranteeing the analysis options are statistically sound is a difficult balance to achieve. Through a case study of building START (Shiny Transcriptome Analysis Resource Tool), a Shiny app for "omics" data visualization and analysis, I will present the challenges you may face when building and deploying an app of your own. By allowing the non-statistician user to explore and analyze data, we can make our job easier and improve collaborative relationships, but the success of this goal requires software development skills. We may need to consider such issues as data security, open source collaborative code development, error handling and testing, user education, maintenance due to advancing methods and packages, and responsibility for downstream analyses and decisions based on the app’s results. With Shiny we do not want to fully eliminate the statistician or analyst “middle man” but instead need to stay relevant and in control of all types of statistical products we create.

## Using R on small teams in industry

*Jonathan Nolis - Lenati*

Doing statistical analyses and machine learning in R requires many different components: data, code, models, outputs, and presentations. While one person can usually keep track of their own work, as you grow into a team of people it becomes more important to stay coordinated. This session discusses how we do data science work at Lenati, a marketing and strategy consulting firm, and why R is a great tool for us. It covers the best practices we found for working on R code together over many projects and people, and how we handle the occasional instances where we must use other languages.

## Opinionated Analysis Development

*Hilary Parker - Stitch Fix*

Traditionally, statistical training has focused primarily on mathematical derivations and proofs of statistical tests. The process of developing the technical artifact -- that is, the paper, dashboard, or other deliverable -- is much less frequently taught, presumably because of an aversion to cookbookery or prescribing specific software choices. In this talk, I argue that it's critical to teach generalized opinions for how to go about developing an analysis in order to maximize the probability that an analysis is reproducible, accurate and collaborative. A critical component of this is adopting a blameless postmortem culture. By encouraging the use of and fluency in tooling that implements these opinions, as well as a blameless way of correcting course as analysts encounter errors, we as a community can foster the growth of processes that fail the practitioners as infrequently as possible.

## We R What We Ask: The Landscape of R Users on Stack Overflow

*Dave Robinson - Stack Overflow*

Since its founding in 2008, the question and answer website Stack Overflow has been a valuable resource for the R community, collecting more than 200,000 questions about R that are visited millions of times each month. This makes it a useful source of data for observing trends about how people use and learn the language. In this talk, I show what we can learn from Stack Overflow data about the global use of the R language over the last decade. I'll examine what ecosystems of R packages are asked about together, what other technologies are used alongside it, in what industries it has been most quickly adopted, and what countries have the highest density of users. Together, the data paints a picture of a global and rapidly growing community. Aside from presenting these results, I'll introduce interactive tools and visualizations that the company has published to explore this data, as well as a number of open datasets that analysts can use to examine trends in software development.

## The Lesser Known Stars of the Tidyverse

*Emily Robinson - Etsy*

While most R programmers have heard of ggplot2 and dplyr, many are unfamiliar with the breadth of the tidyverse and the variety of problems it can solve. In this talk, we will give a brief introduction to the concept of the tidyverse and then describe three packages you can immediately start using to make your workflow easier. The first package is forcats, designed for making working with categorical variables easier; the second is glue, for programmatically combining data and strings; and the third package is tibble, an alternative to data.frames. We will cover their basic functions so that, at the end of the talk, you will be able to use these packages and learn more about the broader tidyverse.

## Text Mining Using Tidy Data Principles

*Julia Silge - Stack Overflow*

Text data is increasingly important in many domains, and tidy data principles and tidy tools can make text mining easier and more effective. I will demonstrate how we can manipulate, summarize, and visualize the characteristics of text using these methods and R packages from the tidy tool ecosystem. These tools are highly effective for many analytical questions and allow analysts to integrate natural language processing into effective workflows already in wide use. We will explore how to implement approaches such as sentiment analysis of texts, measuring tf-idf, and measuring word vectors.

## Speeding up R with Parallel Programming in the Cloud

*David Smith - Microsoft*

There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.

Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the sparklyr package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.

## Making Magic with Keras and Shiny

*Nicholas Strayer - Vanderbilt University*

The web-application framework Shiny has opened up enormous opportunities for data scientists by giving them a way to bring their models and visualizations to the public in interactive applications with only R code. Likewise, the package keras has simplified the process of getting up and running with deep neural networks by abstracting away much of the boilerplate and bookkeeping associated with writing models in a lower-level library such as TensorFlow. In this presentation, I will demo and discuss the development of a Shiny app that allows users to cast 'spells' simply by waving their phone around like a wand. The app gathers the motion of the device using the library shinysense and feeds it into a convolutional neural network which predicts spell casts with high accuracy. A supplementary Shiny app for gathering data will also be shown. These applications demonstrate the ability of Shiny to be used at both the data-gathering and model-presentation steps of data science.

# Confirmed Speakers for R User Day

## Mara Averick (*Boston*) @dataandme

Mara is a polymath and self-confessed data nerd. With a strong background in research, she has a breadth of experience in data analysis, visualization, and applications thereof. Currently, by day, she’s a Tidyverse Developer Advocate at RStudio. By night, you’ll find her sharing dope R-related stuff on Twitter and translating heavily technical subject matter into easy reading for a non-technical audience. When she’s not talking data, she's diving into NBA stats, exploring weird and wonderful words, and/or indulging in her obsession with all things Archer.

*Mara will be presenting the R User Day session: Pilgrim’s Progress: a journey from confusion to contribution.*

## Lucy D'Agostino (*Nashville*) @LucyStats

Lucy is a Biostatistics PhD candidate at Vanderbilt University, where her research focuses on observational studies, large-scale inference, and methods for quantifying and estimating the effect of unmeasured confounding. She is the co-founder of R-Ladies Nashville and is enthusiastic about learning from and uplifting other women in the R and STEM communities.

*Lucy will be presenting the R User Day session: Making Causal Claims as a Data Scientist: Tips and Tricks Using R.*

## Gabriela de Queiroz (*San Francisco*) @gdequeiroz

Gabriela is the Lead Data Scientist at SelfScore. Formerly, Gabriela was a data scientist at Sharethrough, where she developed statistical models from concept creation to production, designed, ran, and analyzed experiments, and employed a variety of techniques to derive insights and drive data-centric decisions. Gabriela is the founder of R-Ladies, an organization created to promote diversity in the R community, which now has over 25 chapters worldwide. Currently, she is developing an online course on machine learning in partnership with DataCamp.

*Gabriela will be presenting the R User Day session: Statistics for Data Science: what you should know and why.*

## Jasmine Dumas (*Connecticut*) @jasdumas

Jasmine is a Data Scientist at Simple Finance, where she is focused on experimentation and data product development. She earned a B.S.E. in Biomedical Engineering from the University of Hartford and has experience in Aerospace Manufacturing, Medical Devices and Financial Technology. She is an active member of the R programming community, has developed the open source packages shinyGEO, ttbbeer, shinyLP, and gramr, and has participated in Google Summer of Code, NASA Datanauts, R-Ladies, and Forwards. She is currently developing a course on Shiny with DataCamp and co-organizing the regional Noreast'R Conference.

*Jasmine will be giving the following R User Day talk: R, What is it good for? Absolutely Everything.*

## Alex Engler (*Washington, D.C.*) @alexcengler

Alex is the Program Director and Lecturer for the M.S. in Computational Analysis and Public Policy program at the University of Chicago. He is also a contributing data scientist to the Urban Institute, where he worked before UChicago. Alex also previously taught visualization and data science for policy analysis at Georgetown University and Johns Hopkins University.

*Alex will be presenting the following workshop as part of R User Day: Introduction to SparkR in AWS EMR.*

## Jane Fine (*SF Bay Area*) @janeuyvova

Jane joined MongoDB in 2016 from Teradata Aster, where she ran their Big Data Practice delivering advanced analytics such as churn & customer 360 modeling, digital marketing attribution, sentiment analysis and IoT. Jane has over ten years of experience in enterprise software, analytics, and consulting, as well as a background in biology and life sciences. She loves solving real world problems and talking to customers.

*Jane will be giving the following R User Day talk: Using R for Advanced Analytics with MongoDB.*

## Chester Ismay (*Portland*) @old_man_chester

Chester is Curriculum Lead at DataCamp. He was formerly an Adjunct Professor of Sociology at Pacific University and an Instructional Technologist and Consultant for Data Science, Statistics, and R at Reed College. He obtained his PhD in statistics from Arizona State University and has taught courses and led workshops in statistics, data science, mathematics, computer science, and sociology. He is the co-author of the fivethirtyeight R data package and is the author of the thesisdown R package. He is also a co-author of an open source textbook entitled ModernDive: An Introduction to Statistical and Data Sciences via R.

*Chester will be giving the following R User Day talk: infer: an R package for tidy statistical inference.*

## Albert Y. Kim (*Amherst*) @rudeboybert

Albert is a Lecturer in Statistics in the Mathematics & Statistics Department at Amherst College. Born in Montreal, Quebec, he earned his BSc in Mathematics and Computer Science from McGill University in 2004 and his PhD in Statistics from the University of Washington in 2011. Prior to joining Amherst College, he was a Decision Support Engineering Analyst in the AdWords division of Google Inc, a Visiting Assistant Professor of Statistics at Reed College, and an Assistant Professor of Statistics at Middlebury College.

*Albert will be giving the following R User Day talk: Something old, something new, something borrowed, something blue: Ways to teach data science (and learn it too!).*

## Jared Lander (*NYC*) @jaredlander

Jared is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master's from Columbia University in statistics and a bachelor's from Muhlenberg College in mathematics, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fund raising to finance and humanitarian relief efforts.

Jared specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R Programming geared toward Data Scientists and Non-Statisticians alike, and is creating a course on glmnet with DataCamp.

## Jessica Minnier (*Portland*) @datapointier

Jessica is an Assistant Professor of Biostatistics at Oregon Health & Science University. She is a faculty member of the OHSU-PSU School of Public Health with appointments in the Knight Cardiovascular Institute and Knight Cancer Institute Biostatistics Shared Resource. Her statistical research interests include risk prediction with high dimensional data sets and the analysis of genetic and other omics data. She is also interested in statistical computing (mostly in R), reproducible research and open science.

Jessica teaches Mathematics/Statistics II, a statistical inference course for the MS in Biostatistics program at OHSU-PSU School of Public Health. Jessica has an A.M. and Ph.D. in Biostatistics from Harvard University and a B.A. in Mathematics with a minor in Computer Science from Lewis & Clark College.

*Jessica will be presenting the R User Day session: Building Shiny Apps: Challenges and Responsibilities.*

## Jonathan Nolis (*Seattle*) @skyetetra

Jonathan is the Director of Insights & Analytics at Lenati, and is the lead of the Customer Insights & Analytics team. He has over a decade of experience in solving business problems using data science. Jonathan has provided insights and strategic advice in industries such as retail, manufacturing, aerospace, health care, and e-commerce. Jonathan helps create proprietary technology for Lenati, including the Loyalty Program ROI Simulator, a tool that uses big data to predict the value of a loyalty program. He has a PhD in industrial engineering, and has several academic publications in the field of applied optimization. Prior to joining Lenati, Jonathan was a Lead of Advanced Analytics at Promontory Financial Group, a regulatory compliance consulting firm.

*Jonathan will be presenting the R User Day session: Using R on small teams in industry.*

## Hilary Parker (*San Francisco*) @hspter

Hilary is a Data Scientist at Stitch Fix and co-host of the Not So Standard Deviations podcast. She is an R and statistics enthusiast determined to bring rigor to analysis wherever she goes. At Stitch Fix she works on teasing apart correlation from causation, with a strong dose of reproducibility. Formerly a Senior Data Analyst at Etsy, she received a PhD in Biostatistics from the Johns Hopkins Bloomberg School of Public Health.

*Hilary will be presenting the R User Day session: Opinionated Analysis Development.*

## David Robinson (*NYC*) @drob

David is a data scientist at Stack Overflow with a PhD in Quantitative and Computational Biology from Princeton University. He enjoys developing open source R packages, including broom, gganimate, fuzzyjoin and widyr, as well as blogging about statistics, R, and text mining on his blog, Variance Explained.

*David will be giving the following R User Day talk: We R What We Ask: The Landscape of R Users on Stack Overflow.*

## Emily Robinson (*NYC*) @robinson_es

Emily works as a Data Analyst at Etsy with the search team to design, implement, and analyze experiments on the ranking algorithm, UI changes, and new features. Emily earned her master’s in Organizational Behavior from INSEAD in 2016 and her bachelor’s in Decision Sciences from Rice University (where she took classes from Hadley Wickham). She's a co-organizer of the NYC chapter of R-Ladies, a global organization to promote gender diversity in the R community. She enjoys blogging about A/B Testing, conferences, and data science projects on her blog, Hooked on Data.

*Emily will be giving the following R User Day presentation: The Lesser Known Stars of the Tidyverse.*

## Julia Silge (*Salt Lake City*) @juliasilge

Julia is a data scientist at Stack Overflow. She enjoys making beautiful charts, the statistical programming language R, black coffee, red wine, and the mountains of her adopted home here in Utah. She has a PhD in astrophysics and an abiding love for Jane Austen. Her work involves analyzing and modeling complex data sets while communicating about technical topics with diverse audiences.

*Julia will be giving the following R User Day presentation: Text Mining Using Tidy Data Principles.*

## David Smith (*Chicago*) @revodavid

*David will be presenting the R User Day session: Speeding up R with Parallel Programming in the Cloud*

## Nick Strayer (*Nashville*) @NicholasStrayer

Nick has worked in many different realms, including as a journalist at the New York Times, a data scientist at Dealer.com in Vermont, and a "data artist in residence" at tech startup Conduce in California. Currently, he is a PhD student in biostatistics at Vanderbilt University and also an intern at the Johns Hopkins Data Science Lab. Recently (May '15), he graduated from the University of Vermont, where he majored in mathematics and statistics and minored in computer science.

Nick likes data. Manipulating it, modeling it, making it (simulation), visualizing it and yes, even cleaning it. He does these things with some combination of R, Python and Javascript (d3.js in particular). Most recently he has been fascinated with conveying complex statistical topics and methods using intuitive and interactive graphics.

Nick's current research interests include: data gathering, extracting inference from machine learning, data visualization and scientific communication. When not in "school mode" Nick loves to bike places, read science fiction and wander around gardens/museums.

*Nick will be presenting the R User Day session: Making Magic with Keras and Shiny.*

## Daniel Woodie (*Austin*) @DanielWoodie5

*Daniel will be emcee for R User Day at Data Day Texas.*

# Author Signings at R User Day

## Text Mining with R (O'Reilly Media)

Tackle a variety of tasks in natural language processing by learning how to use the R language and tidy data principles. This practical guide provides examples and resources to help you get up to speed with dplyr, broom, ggplot2, and other tidy tools from the R ecosystem. You’ll discover how tidy data principles can make text mining easier, more effective, and consistent by employing tools already in wide use.

- Get real-world examples for implementing text mining using tidy R packages
- Understand natural language processing concepts like sentiment analysis, tf-idf, and topic modeling
- Learn how to analyze unstructured, text-heavy data using the R language and ecosystem