Ava: From Data to Insights Through Conversation

Enterprises increasingly employ a wide array of tools and processes to make data-driven decisions. However, there are large inefficiencies in the enterprise-wide workflow that stem from the fact that business workflows are expressed in natural language, while the actual computational workflow has to be manually translated into computational programs. In this paper, we present an initial approach to bridge this gap by targeting the data science component of enterprise workflows. In many cases, this component is the slowest part of the overall enterprise process, and focusing on it allows us to take an initial step toward solving the larger enterprise-wide productivity problem. In this initial approach, we propose using a chatbot to allow a data scientist to assemble data analytics pipelines. A crucial insight is that while precise interpretation of general natural language continues to be challenging, controlled natural language methods are starting to become practical as natural interfaces in complex decision-making domains. In addition, we recognize that data science workflow components are often templatized. Putting these two insights together, we develop a practical system, called Ava, that uses (controlled) natural language to program data science workflows. We have an initial proof-of-concept that demonstrates the potential of our approach.


Demystifying Black-Box Models with SHAP Value Analysis

As an Applied Data Scientist at Civis, I implemented the latest data science research to solve real-world problems. We recently worked with a global tool manufacturing company to reduce churn among their most loyal customers. A newly proposed tool, called SHAP (SHapley Additive exPlanation) values, allowed us to build a complex time-series XGBoost model capable of making highly accurate predictions for which customers were at risk, while still allowing for an individual-level interpretation of the factors that made each of these customers more or less likely to churn.
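The blurb does not include code, but the idea of per-customer SHAP contributions is easy to illustrate. Below is a minimal sketch (not the authors' code) in R, using the xgboost package's predcontrib option to return SHAP values; the churn features and data are made up for illustration.

```r
# Minimal sketch: per-row SHAP contributions from an XGBoost model,
# via the R xgboost package's predcontrib option. Feature names and
# data are hypothetical.
library(xgboost)

set.seed(1)
X <- matrix(rnorm(500 * 4), ncol = 4,
            dimnames = list(NULL, c("tenure", "orders", "returns", "support_calls")))
y <- rbinom(500, 1, 0.3)   # 0/1 churn labels

bst <- xgboost(data = X, label = y, nrounds = 50,
               objective = "binary:logistic", verbose = 0)

# One row per customer, one column per feature plus a BIAS column.
shap <- predict(bst, X, predcontrib = TRUE)

# Interpretation for a single customer: which features push the churn
# prediction up or down (contributions on the log-odds scale).
sort(shap[1, ], decreasing = TRUE)
```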


Data Science with Statistical Modeling (Interview)

Robustify your data science with statistical modeling, whether you work in tech, epidemiology, finance or anything else.


Create your Machine Learning library from scratch with R! (2/5) – PCA

This is the second post in the “Create your Machine Learning library from scratch with R!” series. Today, we will see how you can implement Principal Component Analysis (PCA) using only the linear algebra available in R. Previously, we implemented linear regression and logistic regression from scratch, and next time we will deal with k-nearest neighbors (KNN).
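For a flavor of what "only linear algebra" means here, the following is a minimal sketch (not necessarily the post's exact code): center the data, eigen-decompose the covariance matrix, and project onto the eigenvectors.

```r
# Minimal PCA sketch using only base R linear algebra: center the data,
# eigen-decompose the covariance matrix, then project onto the components.
pca_fit <- function(X) {
  X   <- scale(X, center = TRUE, scale = FALSE)  # center each column
  eig <- eigen(cov(X), symmetric = TRUE)         # eigenvectors = loadings
  list(loadings = eig$vectors,
       variance = eig$values,
       scores   = X %*% eig$vectors)             # projected data
}

fit <- pca_fit(iris[, 1:4])
fit$variance / sum(fit$variance)   # proportion of variance explained
head(fit$scores)

# Cross-check against the built-in implementation.
summary(prcomp(iris[, 1:4]))
```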


Understanding Random Forests Classifiers in Python

A random forest is a supervised learning algorithm that can be used for both classification and regression. It is also one of the most flexible and easy-to-use algorithms. A forest is comprised of trees, and the more trees it has, the more robust the forest is said to be. A random forest builds decision trees on randomly selected data samples, gets a prediction from each tree, and selects the best solution by means of voting. It also provides a pretty good indicator of feature importance. Random forests have a variety of applications, such as recommendation engines, image classification, and feature selection. They can be used to classify loyal loan applicants, identify fraudulent activity, and predict diseases. They also lie at the base of the Boruta algorithm, which selects important features in a dataset.
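The linked tutorial works in Python with scikit-learn; as a rough illustration of the same workflow in R (the language used by most posts in this digest, and not the tutorial's code), the randomForest package covers fitting, voting-based prediction, and feature importance.

```r
# Rough R equivalent of the tutorial's workflow using the randomForest package.
library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

fit                       # out-of-bag error estimate and confusion matrix
importance(fit)           # per-feature importance (mean decrease in accuracy / Gini)
predict(fit, head(iris))  # class chosen by majority vote across the 500 trees
```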


How will the GDPR impact machine learning?

Much has been made about the potential impact of the EU’s General Data Protection Regulation (GDPR) on data science programs. But there’s perhaps no more important—or uncertain—question than how the regulation will impact machine learning (ML), in particular. Given the recent advancements in ML, and given increasing investments in the field by global organizations, ML is fast becoming the future of enterprise data science. This article aims to demystify this intersection between ML and the GDPR, focusing on the three biggest questions I’ve received at Immuta about maintaining GDPR-compliant data science and R&D programs. Granted, with an enforcement date of May 25, the GDPR has yet to come into full effect, and a good deal of what we do know about how it will be enforced is either vague or evolving (or both!). But key questions and key challenges have already started to emerge.


Detecting and dating structural breaks in functional data without dimension reduction

Methodology is proposed to uncover structural breaks in functional data; it is ‘fully functional’ in the sense that it does not rely on dimension reduction techniques. A thorough asymptotic theory is developed for the fully functional break detection procedure as well as for a break date estimator, under both a fixed and a shrinking break size. The latter result is used to derive confidence intervals for the unknown break date. The main results highlight that the fully functional procedures perform best under conditions where analogous estimators based on functional principal component analysis are at their worst, namely when the feature of interest is orthogonal to the leading principal components of the data. The theoretical findings are confirmed by a Monte Carlo simulation study in finite samples. An application to annual temperature curves illustrates the practical relevance of the proposed procedures.


Time Series of the World, Unite!

The R ecosystem knows a ridiculous number of time series classes. So, I decided to create a new universal standard that finally covers everyone’s use case… Ok, just kidding! tsbox, now freshly on CRAN, provides a set of tools that are agnostic towards existing time series classes. It is built around a set of converters, which convert time series stored as ts, xts, data.frame, data.table, tibble, zoo, tsibble or timeSeries to each other.
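To show what the converter approach looks like in practice, here is a short sketch; the function names (ts_xts(), ts_df(), ts_tbl(), ts_ts()) follow my reading of the tsbox API and should be checked against the package documentation.

```r
# Illustration of the converter idea in tsbox: the same series moved
# between classes with one-line converters.
library(tsbox)

x_ts  <- AirPassengers     # a classic "ts" object
x_xts <- ts_xts(x_ts)      # ... as an xts object
x_df  <- ts_df(x_ts)       # ... as a long data.frame (time, value)
x_tbl <- ts_tbl(x_ts)      # ... as a tibble

# Conversions are designed to be reversible, so classes can be mixed freely.
all.equal(x_ts, ts_ts(x_tbl))
```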


CHAID and R — When you need explanation

A modern data scientist using R has access to an almost bewildering number of tools, libraries, and algorithms to analyze data. In my next two posts I’m going to focus on an in-depth look at CHAID (Chi-square Automatic Interaction Detection). The title should give you a hint as to why I think CHAID is a good “tool” for your analytical toolbox. There are lots of tools that can help you predict or classify, but CHAID is especially good at helping you explain to any audience how the model arrives at its prediction or classification. It’s also incredibly robust from a statistical perspective, making almost no assumptions about the distribution or normality of your data. I’ll try to elaborate on that as we work through the example.
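As a hedged sketch of what fitting a CHAID tree in R looks like (not the post's own example): to my knowledge the CHAID package is hosted on R-Forge rather than CRAN, and it expects categorical outcomes and categorical predictors, so numeric variables must be binned first.

```r
# Sketch of a CHAID fit in R. Assumptions: the CHAID package from R-Forge,
# a factor outcome, and factor predictors only.
# install.packages("CHAID", repos = "http://R-Forge.R-project.org")
library(CHAID)

# Expand the aggregated UCBAdmissions table into one row per applicant,
# keeping the three factor columns.
ucb <- as.data.frame(UCBAdmissions)
ucb <- ucb[rep(seq_len(nrow(ucb)), ucb$Freq), c("Admit", "Gender", "Dept")]

fit <- chaid(Admit ~ Gender + Dept, data = ucb)
print(fit)   # the printed tree is what makes the model easy to explain
plot(fit)
```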


19 Data Science Tools for people who aren’t so good at Programming

1. RapidMiner
2. DataRobot
3. BigML
4. Google Cloud AutoML
5. Paxata
6. Trifacta
7. MLBase
8. Auto-WEKA
9. Driverless AI
10. Microsoft Azure ML Studio
11. MLJar
12. Amazon Lex
13. IBM Watson Studio
14. Automatic Statistician
15. KNIME
16. FeatureLab
17. MarketSwitch
18. Logical Glue
19. Pure Predictive


Taking the baseline measurement into account: constrained LDA in R

In Randomized Controlled Trials (RCTs), a “Pre” measurement is often taken at baseline (before randomization), and treatment effects are measured at one or more time points after randomization (“Post” measurements). There are many ways to take the baseline measurement into account when comparing two groups in a classic pre-post design with one post measurement. Here, I will discuss constrained longitudinal data analysis (cLDA).
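One common way to fit a cLDA model (a sketch under my own assumptions, not necessarily the post's implementation) is to model Pre and Post jointly and impose the constraint that both randomized arms share the same baseline mean by omitting the main treatment effect, keeping only the treatment-by-post interaction. The simulated data below are purely illustrative.

```r
# cLDA sketch with nlme::gls on simulated two-arm trial data.
library(nlme)

set.seed(1)
n     <- 100
treat <- rep(0:1, each = n / 2)
pre   <- rnorm(n, mean = 10)
post  <- pre + 1 + 0.8 * treat + rnorm(n)     # true treatment effect = 0.8

dat_long <- data.frame(
  id    = rep(seq_len(n), each = 2),
  time  = factor(rep(c("pre", "post"), times = n), levels = c("pre", "post")),
  treat = rep(treat, each = 2),
  y     = as.vector(rbind(pre, post))
)
dat_long$post <- as.numeric(dat_long$time == "post")

fit <- gls(
  y ~ time + post:treat,                     # no main effect of treat: arms share the baseline mean
  data        = dat_long,
  correlation = corSymm(form = ~ 1 | id),    # unstructured within-subject correlation
  weights     = varIdent(form = ~ 1 | time)  # separate residual variance per time point
)
summary(fit)  # the post:treat coefficient estimates the treatment effect at follow-up
```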


5 Common GDPR Misconceptions

Misconception #1: Companies must ensure that personal data resides in the country of origin.
Misconception #2: Individual privacy rights are the end-all, be-all.
Misconception #3: GDPR will limit my company’s ability to do business.
Misconception #4: Consultants will save the day.
Misconception #5: Companies can relax after May 25, 2018.


GANs in TensorFlow from the Command Line: Creating Your First GitHub Project

In this article I will present the steps to create your first GitHub project, using Generative Adversarial Networks (GANs) as the example.


A new benchmark suite for machine learning

We are in an empirical era for machine learning, and it’s important to be able to identify tools that enable efficient experimentation with end-to-end machine learning pipelines. Organizations that are using and deploying machine learning are confronted with a plethora of options for training models and model inference, at the edge and on cloud services. To that end, MLPerf, a new set of benchmarks compiled by a growing list of industry and academic contributors, was recently announced at the Artificial Intelligence conference in NYC.


Elegant regression results tables and plots in R: the finalfit package

The finalfit package brings together the day-to-day functions we use to generate final results tables and plots when modelling. I spent many years repeatedly and manually copying results from R analyses, and built these functions to automate our standard healthcare data workflow. It is particularly useful when undertaking a large study involving multiple different regression analyses. When combined with RMarkdown, the reporting becomes entirely automated. Its design follows Hadley Wickham’s tidy tools manifesto.
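A short sketch of the typical finalfit workflow, as I understand it from the package documentation (the colon_s example data set ships with finalfit; the variable choices below mirror its examples):

```r
# Univariable and multivariable logistic regression, merged into one
# publication-style table, plus an odds-ratio plot.
library(finalfit)

explanatory <- c("age", "sex.factor", "obstruct.factor", "perfor.factor")
dependent   <- "mort_5yr"

results_table <- finalfit(colon_s, dependent, explanatory)
results_table                           # ready to drop into an RMarkdown report

or_plot(colon_s, dependent, explanatory)  # odds-ratio plot for the same model
```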


Setting benchmarks in machine learning

Machine learning has rapidly become one of the most impactful fields in industry. For that reason, it has also become one of the most competitive. However, everyone stands to benefit from setting common goals and standards to drive the entire field forward in a non-zero-sum fashion. With MLPerf, a cooperative of leading academic and industry institutions has come together to do just that. Because machine learning is such a diverse field, MLPerf will define an entire suite of benchmarks to measure performance of software, hardware, and cloud systems. The different types of machine learning problems covered range from image classification to translation to reinforcement learning. MLPerf will feature two divisions for benchmarking. The Closed Division fixes both models and hyperparameters to fairly compare performance across different hardware systems and software frameworks. The Open Division relaxes those constraints, and the hope is that best-in-class models and hyperparameters emerging from the Open Division might then set the standards for next-generation Closed Division testing. Perhaps the most potent ingredient of this ambitious new effort is the roster of industry-leading players that have already come together. In this interview, you will have a chance to hear from some of them, including Turing Award winner Dave Patterson.


Mastering Advanced Analytics with Apache Spark

Apache Spark has rapidly emerged as the de facto standard for big data processing across all industries and use cases – from providing recommendations based on user behavior to analyzing millions of genomic sequences to accelerate drug innovation and development for personalized medicine.
This eBook, the second of a series, offers a collection of the most popular technical blog posts that provide an introduction to machine learning on Apache Spark, and highlights many of the major developments around Spark MLlib and GraphX.
Whether you are just getting started with Spark or are already a Spark power user, this eBook will arm you with the knowledge to be successful on your next Spark project, including:
• An introduction to machine learning in Apache Spark
• Using Spark for advanced topics such as clustering, trees, and graph processing
• How you can use SparkR to analyze data at scale with the R language
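As a flavor of the SparkR topic in the last bullet, here is a minimal sketch under my own assumptions (not an excerpt from the eBook): a local data frame is promoted to a Spark DataFrame and a generalized linear model is fit through Spark MLlib.

```r
# Minimal SparkR illustration: distributed GLM on a small example data set.
library(SparkR)
sparkR.session(appName = "mtcars-glm")

df <- as.DataFrame(mtcars)                        # local data.frame -> Spark DataFrame

model <- spark.glm(df, mpg ~ wt + hp, family = "gaussian")
summary(model)                                    # coefficients computed by MLlib

head(collect(predict(model, df)))                 # score and pull predictions back to R

sparkR.session.stop()
```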