Practicing ‘No Code’ Data Science

We are entering a new phase in the practice of data science, the ‘Code-Free’ era. Like all major changes, this one has not sprung up fully grown, but the movement is now large enough that its momentum is clear. Here’s what you need to know.


New Course: Mixture Models in R

Mixture modeling is a way of representing populations when we are interested in their heterogeneity. Mixture models use familiar probability distributions (e.g. Gaussian, Poisson, Binomial) to provide a convenient yet formal statistical framework for clustering and classification. Unlike standard clustering approaches, we can estimate the probability of belonging to a cluster and make inference about the sub-populations. For example, in the context of marketing, you may want to cluster different customer groups and find their respective probabilities of purchasing specific products to better target them with custom promotions. When applying natural language processing to a large set of documents, you may want to cluster documents into different topics and understand how important each topic is across each document. In this course, you will learn what Mixture Models are, how they are estimated, and when it is appropriate to apply them!
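The cluster-membership probability mentioned above can be illustrated with a minimal sketch. It assumes a two-component Gaussian mixture whose parameters are already known; the weights, means, and standard deviations below are hypothetical values chosen for illustration, not taken from the course, and the example is in Python rather than R. In practice the parameters would be estimated from data, typically with the EM algorithm.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def membership_probabilities(x, weights, means, stds):
    """Posterior probability that x belongs to each mixture component (Bayes' rule)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical mixture: 60% low-spend customers, 40% high-spend customers.
WEIGHTS = [0.6, 0.4]
MEANS = [20.0, 80.0]
STDS = [10.0, 15.0]

for spend in (15.0, 50.0, 90.0):
    probs = membership_probabilities(spend, WEIGHTS, MEANS, STDS)
    print(spend, [round(p, 3) for p in probs])
```

Unlike a hard cluster assignment, each observation receives a probability for every component, so a borderline customer is visibly uncertain rather than forced into one group.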


New Course: Developing R Packages

In this course, you will learn the end-to-end process for creating an R package from scratch. You will start off by creating the basic structure for your package and adding important details like functions and metadata. Once the basic components of your package are in place, you will learn how to document your package, and why this is important for creating quality packages that other people – as well as your future self – can use with ease. Once you have created the components of your package, you will learn how to test that they work properly by creating tests, running checks, and building your package. By the end of this course you can expect to have all the necessary skills to create and share your own R packages.


neuralnet: Train and Test Neural Networks Using R

A neural network is a computational system that creates predictions based on existing data. Let us train and test a neural network using the neuralnet library in R.


Classifying time series using feature extraction

When you want to classify a time series, there are two approaches. One is to use a time-series-specific method, such as an LSTM or a recurrent neural network in general. The other is to extract features from the series and use them with ordinary supervised learning. In this article, we look at how to automatically extract relevant features with a Python package called tsfresh. The datasets we use come from the Time Series Classification Repository. The site reports the best accuracy achieved for each dataset. It looks like we get results close to the state of the art, or better, with every dataset we try.
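The second approach, extracting features and handing them to an ordinary classifier, can be sketched as follows. This is an illustration only: it does not use tsfresh, the three hand-written features and the nearest-centroid classifier are stand-ins, and the synthetic series (a sine wave with low or high noise) are an assumption made for the example rather than data from the repository.

```python
import math
import random


def extract_features(series):
    """Summarize a series as [mean, standard deviation, mean absolute change]."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    mean_abs_change = sum(abs(b - a) for a, b in zip(series, series[1:])) / (n - 1)
    return [mean, std, mean_abs_change]


def make_series(label, rng, length=100):
    """Synthetic data: class 0 is a lightly noisy sine wave, class 1 a very noisy one."""
    noise = 0.1 if label == 0 else 1.0
    return [math.sin(2 * math.pi * t / 25) + rng.gauss(0, noise) for t in range(length)]


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]


def predict(features, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


rng = random.Random(0)
train = [(extract_features(make_series(label, rng)), label)
         for label in (0, 1) for _ in range(20)]
centroids = {label: centroid([f for f, lab in train if lab == label])
             for label in (0, 1)}

held_out = [(extract_features(make_series(label, rng)), label)
            for label in (0, 1) for _ in range(20)]
accuracy = sum(predict(f, centroids) == label for f, label in held_out) / len(held_out)
print("held-out accuracy:", accuracy)
```

Once each series is reduced to a fixed-length feature vector, any standard supervised learner can replace the nearest-centroid rule used here.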


Top 8 Python Machine Learning Libraries

1. scikit-learn
2. Keras
3. XGBoost
4. StatsModels
5. LightGBM
6. CatBoost
7. PyBrain
8. Eli5


In regression, we assume noise is independent of all measured predictors. What happens if it isn’t?

A number of key assumptions underlie the linear regression model – among them linearity and normally distributed noise (error) terms with constant variance. In this post, I consider an additional assumption: the unobserved noise is uncorrelated with any covariates or predictors in the model.
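A small simulation shows what can go wrong when this assumption fails. The sketch below is an illustration under stated assumptions, not the post's own analysis: the true slope of 2.0, the single predictor, and the `noise_dependence` parameter that controls how strongly the noise depends on the predictor are all hypothetical choices.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least squares slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate_slope(noise_dependence, true_slope=2.0, n=20000, seed=0):
    """Fit y = true_slope * x + noise, where noise = noise_dependence * x + e."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = noise_dependence * x + rng.gauss(0, 1)
        xs.append(x)
        ys.append(true_slope * x + noise)
    return ols_slope(xs, ys)


slope_independent = simulate_slope(0.0)  # noise unrelated to the predictor
slope_correlated = simulate_slope(0.5)   # noise partly driven by the predictor

print("independent noise:", round(slope_independent, 3))
print("correlated noise: ", round(slope_correlated, 3))
```

With independent noise the estimate lands near the true slope, while with correlated noise it absorbs the dependence and lands near 2.5. More data does not help, because the bias is in what the estimator converges to rather than in its variance.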


All about Logistic regression

All about Logistic regression in one article


Building Machine Learning at LinkedIn Scale

Building machine learning at scale is a road full of challenges, and there are not many well-documented case studies that can be used as a reference. My team at Invector Labs recently published a slide deck that summarizes some of the lessons we have learned building machine learning solutions at scale, but we are also always trying to study how other companies in the space are solving these issues. LinkedIn is one of the companies that have been applying machine learning to large-scale scenarios for years, but little was known about the specific methods and techniques used at the software giant. Recently, the LinkedIn engineering team published a series of blog posts that provide some very interesting insights into their machine learning infrastructure and practices. While many of the scenarios are very specific to LinkedIn, the techniques and best practices are applicable to many large-scale machine learning solutions.


Open Sourcing Active Question Reformulation with Reinforcement Learning

Natural language understanding is a significant ongoing focus of Google’s AI research, with application to machine translation, syntactic and semantic parsing, and much more. Importantly, as conversational technology increasingly requires the ability to directly answer users’ questions, one of the most active areas of research we pursue is question answering (QA), a fundamental building block of human dialogue. Because open sourcing code is a critical component of reproducible research, we are releasing a TensorFlow package for Active Question Answering (ActiveQA), a research project that investigates using reinforcement learning to train artificial agents for question answering. Introduced for the first time in our ICLR 2018 paper ‘Ask the Right Questions: Active Question Reformulation with Reinforcement Learning’, ActiveQA interacts with QA systems using natural language with the goal of providing better answers.


How To Dockerize an R shiny App – Part 2

First, I want to apologize for the lengthy delay in publishing the second part. The delay happened because some of the Part 2 blog materials (commands, snapshots) were accidentally deleted from my laptop. Lack of time, and the fact that my work has become more data science-y than DevOps or data engineering, kept me from piecing them back together into a worthwhile article. Nevertheless, since I had been getting a lot of queries about Part 2 from readers of Part 1, I finally decided to write it. So, readers, thanks for your patience. Let us dive in where I left off in Part 1. Okay then, a quick refresher on Docker and its concepts.


The One Theorem Every Data Scientist Should Know

This article serves as a quick guide to one of the most important theorems that every data scientist should know, the Central Limit Theorem. What is it? When can you not use it? Why is it important? Is it the same thing as the law of large numbers?
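The theorem can be checked with a short simulation. The sketch below is an illustration under assumptions not taken from the article: it draws from an exponential distribution with mean 1 and standard deviation 1 (a clearly skewed, non-normal source), uses a sample size of 50, and repeats the sampling 5000 times.

```python
import math
import random
import statistics


def sample_means(sample_size, n_samples=5000, seed=0):
    """Means of repeated samples drawn from an exponential distribution (mean 1, std 1)."""
    rng = random.Random(seed)
    return [
        sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]


SAMPLE_SIZE = 50
means = sample_means(SAMPLE_SIZE)

center = statistics.fmean(means)
spread = statistics.pstdev(means)
predicted_spread = 1.0 / math.sqrt(SAMPLE_SIZE)  # sigma / sqrt(n), with sigma = 1

# For a normal distribution, about 68% of values fall within one standard deviation.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print("mean of sample means:", round(center, 3))
print("observed spread:", round(spread, 3), "predicted:", round(predicted_spread, 3))
print("fraction within one sd:", round(within_one_sd, 3))
```

The law of large numbers only says the sample mean converges to the true mean. The Central Limit Theorem adds the shape and scale of the fluctuations around it, which is what the spread and the one-standard-deviation fraction probe here.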