In the Time of Big Data and Machine Learning, It’s Important to Ask “Why?”

In this special guest feature, Sundeep Sanghavi, co-founder and CEO of DataRPM, argues that the “Five Whys” interrogation technique developed by Sakichi Toyoda, founder of Toyota Industries and a father of the Japanese industrial revolution, is still very relevant in today’s age of big data and machine learning. Sanghavi co-founded DataRPM, a Progress company, with the goal of providing a cognitive platform for predictive maintenance to organizations challenged by the volume, velocity, and variety of their big data.

Three enablers for machine learning in data unification: trust, legacy, and scale

Data unification is the process of combining multiple, diverse data sets and preparing them for analysis by matching, deduplicating, and otherwise cleaning the records (Figure 1). This effort consumes more than 60% of data scientists’ time, according to many recent studies. Moreover, it has been shown that cleaning data before feeding it to machine learning (ML) models (e.g., to perform predictive analytics) is far more effective than trying to fiddle with model parameters and feature engineering applied to dirty data. These facts render data cleaning activities absolutely necessary, yet unfortunately both tedious and time-consuming, given the state of the tools available to most data scientists. The use of machine learning models for analytics is well understood for predicting end-of-quarter spending, capacity planning, and deciding on investment portfolios. However, the application of machine learning to data unification and cleaning, although effective in dealing with vast amounts of dirty and siloed data, faces many technical and pragmatic challenges.
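To make the matching and deduplication step concrete, here is a minimal pure-Python sketch of rule-based record deduplication. The record fields and normalization rules are illustrative assumptions, not any particular product’s API; real data unification systems use far richer (often ML-driven) matching.

```python
# Illustrative sketch: canonicalize a couple of fields, then collapse
# records that normalize to the same key. Field names are hypothetical.

def normalize(record):
    """Canonicalize fields so trivially different records compare equal."""
    return (
        record["name"].strip().lower(),
        record["email"].strip().lower(),
    )

def deduplicate(records):
    """Keep the first record seen for each normalized key."""
    seen = {}
    for rec in records:
        key = normalize(rec)
        if key not in seen:
            seen[key] = rec
    return list(seen.values())

raw = [
    {"name": "Ada Lovelace ", "email": "ADA@example.com"},
    {"name": "ada lovelace", "email": "ada@example.com "},
    {"name": "Alan Turing", "email": "alan@example.com"},
]
clean = deduplicate(raw)
print(len(clean))  # the two Ada records collapse into one: 2 remain
```

Even this toy version shows why the work is tedious: every new data source tends to need new normalization rules, which is exactly the burden ML-based unification tries to lift.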

Bayesian A/B Testing Made Easy

A/B testing is a familiar task for many working in business analytics. Essentially, A/B testing is a simple form of hypothesis testing with one control group and one treatment group. Classical frequentist methodology instructs the analyst to estimate the expected effect of the treatment, calculate the required sample size, and perform a test to determine whether a large enough effect is observed. This method is somewhat lacking: it leaves one with only point estimates for the control and treatment groups, and a verdict to reject (an effect is observed) or fail to reject (no effect is observed). Let’s consider an alternative approach following Bayesian methods with the bayesAB package. Suppose that we have the current version and a proposed version of a web page, each containing a button of interest, and we wish to determine whether the proposed version leads to more clicks on that button. Currently, approximately half of all visitors click the button. Suppose the proposed version of the web page is actually much worse, and only 30 percent of visitors will click it.
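The quantity bayesAB reports can be sketched in a language-agnostic way with only the standard library: simulate the two variants at the stated click rates, put a uniform Beta(1, 1) prior on each rate, and estimate P(treatment > control) by Monte Carlo draws from the two Beta posteriors. The sample sizes and seed below are illustrative assumptions.

```python
# Bayesian A/B sketch: Beta-Binomial model, Monte Carlo comparison.
import random

random.seed(42)

n = 1000                                                  # visitors per variant
clicks_a = sum(random.random() < 0.5 for _ in range(n))   # control, ~50% rate
clicks_b = sum(random.random() < 0.3 for _ in range(n))   # treatment, ~30% rate

# With a Beta(1, 1) (uniform) prior, the posterior for each click rate
# is Beta(clicks + 1, misses + 1). Estimate P(B > A) by sampling.
draws = 100_000
wins = sum(
    random.betavariate(clicks_b + 1, n - clicks_b + 1)
    > random.betavariate(clicks_a + 1, n - clicks_a + 1)
    for _ in range(draws)
)
prob_b_better = wins / draws
print(round(prob_b_better, 3))  # essentially 0: the worse variant almost never wins
```

Unlike the reject / fail-to-reject verdict, this gives a full posterior for each rate and a directly interpretable probability that the treatment is better.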

Compare Tube Types with R – Repeated Measures ANOVA

Sometimes we might want to compare three or four tube types for a particular analyte on a group of patients, or we might want to see whether a particular analyte is stable over time in aliquoted samples. In these experiments we are essentially doing the multivariable analogue of the paired t-test. In the tube-type experiment, the factor that differs between the (‘paired’) groups is the container: serum separator tubes (SST), EDTA plasma tubes, plasma separator tubes (PST), etc. In a stability experiment, the factor that differs is storage duration. Since this is a fairly common clinical lab experiment, I thought I would just jot down how this is accomplished in R – though I must confess I know just about lim x->0 x about statistics. In any case, the statistical test is a repeated-measures ANOVA, and this is one way to do it (there are many), including an approach to the post-hoc testing.
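The arithmetic behind the repeated-measures ANOVA the post applies in R can be sketched by hand: partition the total sum of squares into between-subject, between-condition, and residual pieces, then form the F ratio. The sketch below does this in plain Python on made-up measurements (three hypothetical tube types, four patients); it is not the post’s code or data.

```python
# Repeated-measures one-way ANOVA by hand on illustrative numbers:
# the same analyte measured on each patient in three tube types.
data = {                       # patient -> one measurement per tube type
    "p1": [4.1, 4.3, 4.0],
    "p2": [5.0, 5.3, 4.8],
    "p3": [3.8, 3.9, 3.7],
    "p4": [4.6, 4.9, 4.4],
}
n = len(data)                  # subjects
k = 3                          # conditions (tube types)
all_vals = [v for row in data.values() for v in row]
grand = sum(all_vals) / len(all_vals)

cond_means = [sum(row[j] for row in data.values()) / n for j in range(k)]
subj_means = [sum(row) / k for row in data.values()]

# Partition the total sum of squares; subject variation is removed
# from the error term, which is what "repeated measures" buys you.
ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
ss_total = sum((v - grand) ** 2 for v in all_vals)
ss_error = ss_total - ss_cond - ss_subj

df_cond, df_error = k - 1, (k - 1) * (n - 1)
F = (ss_cond / df_cond) / (ss_error / df_error)
print(round(F, 1))  # ≈ 24.4 on these numbers; compare to F with (2, 6) df
```

The key point is the error term: pairing by patient strips the large patient-to-patient variation (ss_subj) out of the denominator, just as the paired t-test does for two groups.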

Datasets for Data Science and Machine Learning

Exploratory analysis is your first step in most data science exercises. The best datasets for practicing exploratory analysis should be fun, interesting, and non-trivial (i.e. require you to dig a little to uncover all the insights).

Ensemble Learning to Improve Machine Learning Results

Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), bias (boosting), or improve predictions (stacking).
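Bagging, the variance-reduction member of that list, can be illustrated in a few lines: bootstrap-resample the training data, fit a weak learner on each resample, and combine them by majority vote. The one-dimensional data and the single-threshold “stump” learner below are deliberate simplifications to keep the sketch self-contained.

```python
# Toy bagging: bootstrap samples -> threshold stumps -> majority vote.
import random

random.seed(0)

# 1-D data: the true rule is "label 1 when x > 5", with a few noisy points.
X = [1, 2, 3, 4, 4.5, 5.5, 6, 7, 8, 9]
y = [0, 0, 0, 1, 0,   1,   1, 0, 1, 1]

def fit_stump(xs, ys):
    """Pick the threshold t minimizing training errors of 'predict 1 if x > t'."""
    best = min(
        (sum((x > t) != lbl for x, lbl in zip(xs, ys)), t)
        for t in xs
    )
    return best[1]

def bagged_predict(x, stumps):
    votes = sum(x > t for t in stumps)
    return int(votes > len(stumps) / 2)

# Train 25 stumps, each on a bootstrap resample of the data.
stumps = []
for _ in range(25):
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))

preds = [bagged_predict(x, stumps) for x in [2, 8]]
print(preds)  # [0, 1]: the vote recovers the underlying rule despite the noise
```

Boosting and stacking differ in how the pieces combine: boosting reweights the data toward past mistakes, and stacking trains a second-level model on the base learners’ outputs.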

Using regression trees for forecasting double-seasonal time series with trend in R

After a blogging break caused by writing research papers, I managed to secure time to write something new about time series forecasting. This time I want to share with you my experiences with seasonal-trend time series forecasting using simple regression trees. Classification and regression trees (decision trees) are a widely used machine learning method for modeling. They are a favorite for several reasons:
• simple to understand (white box)
• interpretable results can be extracted from a tree to support simple decisions
• helpful for exploratory analysis, as the binary structure of a tree is easy to visualize
• good predictive accuracy
• very fast to train and evaluate
• easily improved by ensemble learning techniques
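The core trick for forecasting with trees is to recast the series as a supervised problem via lagged features. The sketch below does this in plain Python on a synthetic seasonal series, fitting a one-split regression “tree” (a stump) by exhaustive search; both the single split and the synthetic data are simplifying assumptions, not the post’s R workflow.

```python
# Seasonal series -> lagged feature -> one-split regression tree.
import math

# Synthetic series with period-24 seasonality (think hourly load).
series = [10 + 5 * math.sin(2 * math.pi * t / 24) for t in range(24 * 14)]

lag = 24  # predict y[t] from the value one full season earlier
X = series[:-lag]
y = series[lag:]

def fit_stump(xs, ys):
    """Find the split on x minimizing the summed squared error of two leaf means."""
    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    best = None
    for cand in sorted(set(xs))[1:]:
        left = [yy for xx, yy in zip(xs, ys) if xx < cand]
        right = [yy for xx, yy in zip(xs, ys) if xx >= cand]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, cand, sum(left) / len(left), sum(right) / len(right))
    return best[1:]

split, left_mean, right_mean = fit_stump(X, y)

def predict(x):
    return left_mean if x < split else right_mean

# Low last-season values predict a low leaf mean, high ones a high leaf mean.
print(round(left_mean, 1), round(right_mean, 1))
```

A real regression tree recurses on each leaf (and a forecasting setup would add trend and multiple lags as features), but even a single split already captures the seasonal low/high structure.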

Some Neat New R Notations

The R package seplyr supplies a few neat new coding notations.

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets

Few things delight developers more than a set of APIs that makes them productive, is easy to use, and is intuitive and expressive. One of Apache Spark’s appeals to developers has been its easy-to-use APIs for operating on large datasets across languages: Scala, Java, Python, and R. In this blog, I explore the three sets of APIs available in Apache Spark 2.0 and beyond (RDDs, DataFrames, and Datasets); explain why and when you should use each set; outline their performance and optimization benefits; and enumerate scenarios in which to use DataFrames and Datasets instead of RDDs. I will mostly focus on DataFrames and Datasets, because in Apache Spark 2.0 these two APIs are unified. Our primary motivation behind this unification is our quest to simplify Spark by limiting the number of concepts you have to learn and by offering ways to process structured data. Through structure, Spark can offer higher-level abstractions and APIs as domain-specific-language constructs.

37 Reasons why your Neural Network is not working

The network had been training for the last 12 hours. It all looked good: the gradients were flowing and the loss was decreasing. But then came the predictions: all zeroes, all background, nothing detected. “What did I do wrong?” I asked my computer, which didn’t answer. Where do you start checking if your model is outputting garbage (for example, predicting the mean of all outputs, or achieving really poor accuracy)? A network might not be training for a number of reasons. Over the course of many debugging sessions, I would often find myself doing the same checks. I’ve compiled my experience, along with the best ideas around, into this handy list. I hope they will be of use to you, too.

An interactive GPU-powered deep dive into 11.6 billion rows of US shipping data

For some of us, summer is synonymous with salt water and waves. For many others, the sea is a year-round occupation. The US has 12,383 miles of coastline and 95,471 miles of shoreline, and it buzzes with billions of trips each year, all tracked by the US Coast Guard. Our latest demo of MapD Core and MapD Immerse reveals the vast scope of marine activity around America’s shores: everything from the tracks of commercial freighters to the patrols of military vessels to the lazy patterns of pleasure boats out for a Sunday sail on San Francisco Bay. With more than 11 billion rows of public ship AIS data to explore, spanning 2009 to 2014, you can filter the data by ship type (tugboat, cargo ship, passenger ship, or tanker), by length (the largest vessels are about 350 ft), and of course by time, showing seasonality and trends. The visualization traces the path of each vessel, allowing you to investigate the main shipping lanes around the US coasts, key ports, and waterways.