Covariate Shift – Unearthing hidden problems in Real World Data Science

You may have heard from various people that data science competitions are a good way to learn data science, but that they are not as useful for solving real world data science problems. Why do you think this is the case? One of the differences lies in the quality of the data provided. In data science competitions, the datasets are carefully curated. Usually, a single large dataset is split into train and test files, so most of the time the train and test data have been generated from the same distribution. But this is not the case when dealing with real world problems, especially when the data has been collected over a long period of time. In such cases, multiple variables or the environment itself may have changed during that period. If proper care is not taken, the training dataset cannot be used to predict anything about the test dataset in a usable manner. In this article, we will see the different types of problems, or Dataset Shift, that we might encounter in the real world. Specifically, we will talk in detail about one particular kind of dataset shift (Covariate Shift), the existing methods to deal with this kind of shift, and an in-depth demonstration of a particular method to correct it.
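One common way to check for this kind of shift is to ask how well a single feature separates training rows from test rows. The sketch below is a minimal illustration of that idea and is not necessarily the method demonstrated in the linked article; the function name `domain_auc` and the synthetic Gaussian data are assumptions made for illustration.

```python
import random


def domain_auc(train_values, test_values):
    """Rank-based AUC for telling test rows (label 1) from train rows (label 0).

    A value near 0.5 means the feature looks alike in both sets;
    a value far from 0.5 suggests the feature's distribution has shifted.
    """
    pairs = [(v, 0) for v in train_values] + [(v, 1) for v in test_values]
    pairs.sort(key=lambda p: p[0])
    # Sum of the 1-based ranks held by test rows (Mann-Whitney U statistic).
    rank_sum = sum(rank for rank, (_, label) in enumerate(pairs, start=1) if label == 1)
    n_test, n_train = len(test_values), len(train_values)
    u = rank_sum - n_test * (n_test + 1) / 2
    return u / (n_test * n_train)


rng = random.Random(0)  # fixed seed so the example is repeatable
train = [rng.gauss(0.0, 1.0) for _ in range(2000)]
test_same = [rng.gauss(0.0, 1.0) for _ in range(2000)]     # same distribution
test_shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]  # mean moved by 1

auc_same = domain_auc(train, test_same)
auc_shifted = domain_auc(train, test_shifted)
print(f"no shift: AUC = {auc_same:.3f}")
print(f"shifted:  AUC = {auc_shifted:.3f}")
```

In practice the same check would be run per feature, and features with an AUC far from 0.5 are candidates for closer inspection.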

Keras: The Python Deep Learning library

Keras is a high-level neural networks API, written in Python and capable of running on top of either TensorFlow, CNTK or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. Use Keras if you need a deep learning library that:
• Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility).
• Supports both convolutional networks and recurrent networks, as well as combinations of the two.
• Runs seamlessly on CPU and GPU.

In my last post I demonstrated how to obtain linear regression parameter estimates in R using only matrices and linear algebra. Using the well-known Boston data set of housing characteristics, I calculated ordinary least-squares parameter estimates using the closed-form solution. In this post I’ll explore how to do the same thing in Python using numpy arrays and then compare our estimates to those obtained using the linear_model module from the statsmodels package. First, let’s import the modules and functions we’ll need. We’ll use numpy for matrix and linear algebra. In the last post, we obtained the Boston housing data set from R’s MASS library. In Python, we can find the same data set in the scikit-learn module.
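The closed-form solution mentioned above can be sketched in a few lines of numpy. This is a hedged illustration rather than the linked post's code: it uses synthetic data instead of the Boston data set, and the coefficient values in `true_beta` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic predictors and a known coefficient vector (intercept first).
X_raw = rng.normal(size=(n, 2))
true_beta = np.array([1.5, -2.0, 0.5])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), X_raw])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equations: (X'X) beta = X'y.
# Solving the linear system avoids forming an explicit matrix inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against numpy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("closed form:", beta_hat)
print("lstsq:      ", beta_lstsq)
```

Both estimates should agree to numerical precision and land close to `true_beta`, since the noise added to `y` is small.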

Generalized Additive Models

GAMs are a class of statistical models in which the usual linear relationship between the response and the predictors is replaced by several nonlinear smooth functions, so that the model can capture nonlinearities in the data. They are a flexible and smooth technique for fitting models in which the response can depend either linearly or nonlinearly on several predictors X_i. In this article I am going to discuss the implementation of GAMs in R using the ‘gam’ package. Simply put, GAMs are a generalized version of linear models in which each predictor X_i enters through a smooth, possibly nonlinear function such as a spline, a polynomial or a step function. The regression function F(x) is modified accordingly, and because of this transformation GAMs often generalize better to unseen data: they fit the data smoothly and flexibly, most of the time without adding much complexity or variance to the model. The basic idea of splines is to fit smooth nonlinear functions to a set of predictors X_i in order to capture and learn the nonlinear relationships between the model’s variables, X and Y. ‘Additive’ in the name means that the fit retains the additivity of linear models.
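For reference, the additive structure described above is usually written as follows. This is the standard textbook form rather than a formula taken from the linked article; the symbols \beta_0, f_j and \varepsilon are notation assumed here.

```latex
Y = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p) + \varepsilon
```

Choosing f_j(X_j) = \beta_j X_j for every predictor recovers the ordinary linear model, which is why a GAM can be read as a generalization of it.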

Introduction to Neural Networks, Advantages and Applications

An Artificial Neural Network (ANN) uses the processing of the brain as a basis to develop algorithms that can be used to model complex patterns and prediction problems. Let’s begin by first understanding how our brain processes information: in our brain, there are billions of cells called neurons, which process information in the form of electric signals. External information or stimuli is received by the dendrites of the neuron, processed in the neuron cell body, converted to an output and passed through the axon to the next neuron. The next neuron can choose to either accept it or reject it depending on the strength of the signal.
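The biological picture above maps loosely onto a single artificial neuron: inputs are weighted, summed, and passed through an activation that decides how strongly the signal is forwarded. The sketch below is a minimal illustration under that analogy; the function name `neuron`, the sigmoid activation and the example weights are assumptions, not details from the linked article.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs followed by a sigmoid."""
    # The weighted sum plays the role of signals arriving at the dendrites.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The sigmoid squashes the sum into (0, 1), a soft "accept or reject".
    return 1.0 / (1.0 + math.exp(-z))


weak_signal = neuron([0.1, 0.2], weights=[0.5, -0.3], bias=0.0)
strong_signal = neuron([1.0, 1.0], weights=[2.0, 2.0], bias=0.0)

print(f"weak input   -> output {weak_signal:.3f}")
print(f"strong input -> output {strong_signal:.3f}")
```

An output near 1 corresponds to a signal that is passed on strongly, and an output near 0 to one that is largely suppressed.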

Machine Learning: Pruning Decision Trees

Machine learning is a problem of trade-offs. The classic issue is overfitting versus underfitting. Overfitting happens when a model memorizes its training data so well that it is learning noise on top of the signal. Underfitting is the opposite: the model is too simple to find the patterns in the data. Simplicity versus accuracy is a similar consideration. Do you want a model that can fit onto one sheet of paper and be understood by a broad audience? Or do you want the best possible accuracy, even if it is a “black box”? In this post I am going to look at two techniques (called pruning and early stopping) for managing these trade-offs in the context of decision trees. The techniques I am going to describe in this post give you the power to find a model that suits your needs.

Modeling Agents with Probabilistic Programs

This book describes and implements models of rational agents for (PO)MDPs and Reinforcement Learning. One motivation is to create richer models of human planning, which capture human biases and bounded rationality. Agents are implemented as differentiable functional programs in a probabilistic programming language based on Javascript. Agents plan by recursively simulating their future selves or by simulating their opponents in multi-agent games. Our agents and environments run directly in the browser and are easy to modify and extend. The book assumes basic programming experience but is otherwise self-contained. It includes short introductions to “planning as inference”, MDPs, POMDPs, inverse reinforcement learning, hyperbolic discounting, myopic planning, and multi-agent planning.

Introduction to Market Basket Analysis in Python

There are many data analysis tools available to the Python analyst and it can be challenging to know which ones to use in a particular situation. A useful (but somewhat overlooked) technique is called association analysis, which attempts to find common patterns of items in large data sets. One specific application is often called market basket analysis. The most commonly cited example of market basket analysis is the so-called “beer and diapers” case. The basic story is that a large retailer was able to mine their transaction data and find an unexpected purchase pattern of individuals that were buying beer and baby diapers at the same time. Unfortunately this story is most likely a data urban legend. However, it is an illustrative (and entertaining) example of the types of insights that can be gained by mining transactional data. While these types of associations are normally used for looking at sales transactions, the basic analysis can be applied to other situations like click stream tracking, spare parts ordering and online recommendation engines – just to name a few. If you have some basic understanding of the Python data science world, your first inclination would be to look at scikit-learn for a ready-made algorithm. However, scikit-learn does not support this algorithm. Fortunately, the very useful MLxtend library by Sebastian Raschka has an implementation of the Apriori algorithm for extracting frequent item sets for further analysis. The rest of this article will walk through an example of using this library to analyze a relatively large online retail data set and try to find interesting purchase combinations. By the end of this article, you should be familiar enough with the basic approach to apply it to your own data sets.
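To make the idea of an association concrete, the sketch below computes the three measures most often reported for a rule such as "diapers -> beer": support, confidence and lift. It is plain Python rather than MLxtend or the Apriori algorithm itself, and the five toy transactions and helper names are invented for illustration.

```python
# Toy transaction data, invented for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """How often the consequent appears among baskets with the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline frequency."""
    return confidence(antecedent, consequent) / support(consequent)


rule = ({"diapers"}, {"beer"})
print("support   :", support(rule[0] | rule[1]))
print("confidence:", confidence(*rule))
print("lift      :", lift(*rule))
```

A lift above 1 indicates that the two items appear together more often than their individual frequencies alone would suggest. Apriori's contribution is finding the frequent item sets efficiently on large data, which this brute-force sketch does not attempt.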

The R Shiny packages you need for your web apps!

Shiny is an R Package to deploy web apps using an R backend. Let’s face it, Shiny is awesome! It brings all the power of R to a simple web app with interactivity, user inputs, and interactive visualizations. If you don’t know Shiny yet, you can access a selection of apps on Show me shiny.

The Nature Conservancy Fisheries Monitoring Competition, 1st Place Winner’s Interview: Team ‘Towards Robust-Optimal Learning of Learning’

This year, The Nature Conservancy Fisheries Monitoring competition challenged the Kaggle community to develop algorithms that automatically detect and classify species of sea life that fishing boats catch. Illegal and unreported fishing practices threaten marine ecosystems. These algorithms would help increase The Nature Conservancy’s capacity to analyze data from camera-based monitoring systems. In this winners’ interview, the first place team, ‘Towards Robust-Optimal Learning of Learning’ (Gediminas Pekšys, Ignas Namajunas, Jonas Bialopetravicius), shares details of their approach, such as how they needed a validation set with images from different ships than the training set and how they handled night-vision images. Because the photos from the competition’s dataset aren’t publicly releasable, the team recruited graphic designer Jurgita Avišansyte to contribute illustrations for this blog post.

Data Science Governance – Why does it matter? Why now?

Everyone is talking about GDPR, Data Governance and Data Privacy these days. Here we discuss what it is and why it matters.

Exploratory Data Analysis in Python

Earlier this year, we wrote about the value of exploratory data analysis and why you should care. In that post, we covered at a very high level what exploratory data analysis (EDA) is, and the reasons both the data scientist and business stakeholder should find it critical to the success of their analytical projects. However, that post may have left you wondering: How do I do EDA myself? Last month, my fellow senior data scientist, Jonathan Whitmore, and I taught a tutorial at PyCon titled Exploratory Data Analysis in Python—you can watch it here. In this post, we will summarize the objectives and contents of the tutorial, and then provide instructions for following along so you can begin developing your own EDA skills.

Deploying Data Science Projects [Whitepaper]

In a new whitepaper from Team Anaconda, Productionizing and Deploying Data Science Projects, our data science experts share the factors to consider when deploying data science projects, how to leverage Anaconda Project to encapsulate your data science projects, and more.

Set Theory Ordered Pairs and Cartesian Product with R

Part 5 of 5 in the series Set Theory
• Introduction to Set Theory and Sets with R
• Set Operations Unions and Intersections in R
• Set Theory Arbitrary Union and Intersection Operations with R
• Algebra of Sets in R
• Set Theory Ordered Pairs and Cartesian Product with R

Text Mining of Stack Overflow Questions

This week, my fellow Stack Overflow data scientist David Robinson and I are happy to announce the publication of our book Text Mining with R with O’Reilly. We are so excited to see this project out in the world, and so relieved to finally be finished with it! Text data is being generated all the time around us, in healthcare, finance, tech, and beyond; text mining allows us to transform that unstructured text data into real insight that can increase understanding and inform decision-making. In our book, we demonstrate how using tidy data principles can make text mining easier and more effective. Let’s mark this happy occasion with an exploration of Stack Overflow text data, and show how natural language processing techniques we cover in our book can be applied to real-world data to gain insight. For this analysis, I’ll use Stack Overflow questions from StackSample, a dataset of text from 10% of Stack Overflow questions and answers on programming topics that is freely available on Kaggle. The code that I’m using in this post is available as a kernel on Kaggle, so you can fork it for your own exploration. This analysis focuses only on questions posted on Stack Overflow, and uses topic modeling to dig into the text.