The case for index-free data manipulation

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this “ready to model” format (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. This is how R data.frames describe themselves (try “str(data.frame(x=1:2))” in an R console to see this), and it is part of the tidy data manifesto. Tools like SQL (structured query language) and dplyr can make the data arrangement process less burdensome, but using them effectively requires “index-free thinking”, where the data are not thought of in terms of row indices. We will explain and motivate this idea below.
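As a rough illustration of the idea (a minimal sketch, not taken from the article; the toy data frame is an assumption), here is a dplyr pipeline in which every step refers to columns and predicates rather than row positions:

library(dplyr)

# a toy data frame; note we never ask "what is in row 5?"
d <- data.frame(id = c(2, 1, 3), value = c(10, 5, 8))

d %>%
  filter(value > 6) %>%        # choose rows by predicate, not by position
  arrange(id) %>%              # order by a column, not by an index
  mutate(value2 = value * 2)   # derive new columns from existing ones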


10 Dataviz Tools To Enhance Data Science

Data visualizations can help business users understand analytics insights and actually see the reasons why certain recommendations make the most sense. Traditional business intelligence and analytics vendors, as well as newer market entrants, are offering data visualization technologies and platforms. Here’s a collection of 10 data visualization tools worthy of your consideration:


Bayesian Basics, Explained

This interview between Professor Andrew Gelman of Columbia University and marketing scientist Kevin Gray covers the basics of Bayesian statistics.


Introduction to K-means Clustering

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
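For readers who want to try this, a minimal sketch in base R (the dataset, the two columns, and K = 3 are illustrative assumptions):

set.seed(42)                              # k-means starts from random centers
km <- kmeans(iris[, c("Sepal.Length", "Petal.Length")], centers = 3)
km$centers                                # the K fitted group centers
table(km$cluster, iris$Species)           # compare clusters to known labels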


Descriptive Analytics – Part 5: Data Visualisation (Categorical variables)

Descriptive Analytics is the examination of data or content, usually manually performed, to answer the question “What happened?”.
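Since this installment covers categorical variables, here is a minimal descriptive sketch in base R (the dataset is an illustrative assumption, not from the post):

counts <- table(mtcars$cyl)          # frequency of each category
barplot(counts,
        main = "Counts by number of cylinders",
        xlab = "cyl", ylab = "frequency")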


(A Very) Experimental Threading in R

I’ve been trying to find a way to introduce threads to R. There can be many reasons to do that, among which I could mention simplified input/output logic, sending tasks to the background (e.g. building a model asynchronously), and running computation-intensive tasks in parallel (e.g. a parallel, chunk-wise var() on a large vector). Finally, it’s just a neat problem to look at. I’m trying to follow an approach similar to Python’s global interpreter lock.
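R itself offers no user-level threads, but a rough stand-in for the chunk-wise var() idea is possible with the stock parallel package (separate forked processes, not threads; this sketch is an assumption, not the author's implementation):

library(parallel)

x <- rnorm(1e7)
chunks <- split(x, cut(seq_along(x), 4, labels = FALSE))

# each worker returns the sufficient statistics for its chunk
# (mclapply forks on Unix-alikes; on Windows use parLapply instead)
stats <- mclapply(chunks, function(ch)
  c(n = length(ch), s = sum(ch), ss = sum(ch^2)), mc.cores = 4)

# the per-chunk statistics combine exactly into the global variance
n  <- sum(sapply(stats, `[`, "n"))
s  <- sum(sapply(stats, `[`, "s"))
ss <- sum(sapply(stats, `[`, "ss"))
(ss - s^2 / n) / (n - 1)   # matches var(x) up to floating-point error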


Don’t give up on single trees yet… An interactive tree with Microsoft R

A few days ago Microsoft announced their new Microsoft R Server 9.0. Among a lot of new things, it includes some new and improved machine learning algorithms in the MicrosoftML package: a fast linear learner with support for L1 and L2 regularization, fast boosted decision trees, fast random forests, logistic regression with support for L1 and L2 regularization, GPU-accelerated Deep Neural Networks (DNNs) with convolutions, and binary classification using a One-Class Support Vector Machine. And the nice thing is, the MicrosoftML package is now also available in the Microsoft R Client version, which you can download and use for free.


Handling Class Imbalance with R and Caret – An Introduction

When faced with classification tasks in the real world, it can be challenging to deal with an outcome where one class heavily outweighs the other (a.k.a., imbalanced classes). The following will be a two-part post on some of the techniques that can help to improve prediction performance in the case of imbalanced classes using R and caret. This first post provides a general overview of how these techniques can be implemented in practice, and the second post highlights some caveats to keep in mind when using these methods.
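As a taste of what the posts cover, here is a minimal sketch of one such technique, down-sampling the majority class inside resampling via caret's trainControl(sampling = "down") (the simulated data and the glm model are illustrative assumptions):

library(caret)

set.seed(1)
d <- twoClassSim(1000, intercept = -12)   # caret helper: imbalanced two-class data
table(d$Class)

ctrl <- trainControl(method = "cv", number = 5,
                     sampling = "down",    # rebalance within each fold
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fit <- train(Class ~ ., data = d, method = "glm",
             metric = "ROC", trControl = ctrl)
fit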


Predictive modeling, supervised machine learning, and pattern classification – the big picture

When I was working on my next pattern classification application, I realized that it might be worthwhile to take a step back and look at the big picture of pattern classification in order to put my previous topics into context and to provide an introduction to the topics that are going to follow. Pattern classification and machine learning are very hot topics and are used in almost every modern application: Optical Character Recognition (OCR) in the post office, spam filtering in our email clients, barcode scanners in the supermarket … the list is endless. In this article, I want to give a quick overview of the main concepts of a typical supervised learning task as a primer for future articles and implementations of various learning algorithms and applications.


Linear Discriminant Analysis – Bit by Bit

Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class separability in order to avoid overfitting (the “curse of dimensionality”) and also to reduce computational costs.
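A minimal sketch of this projection with MASS::lda() (the iris data and the plotting details are illustrative assumptions):

library(MASS)

fit  <- lda(Species ~ ., data = iris)   # fit on the class labels
proj <- predict(fit)$x                  # at most (k - 1) = 2 discriminants here

plot(proj, col = iris$Species,
     xlab = "LD1", ylab = "LD2")        # classes separate along LD1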


Dixon’s Q test for outlier identification – A questionable practice

I recently faced the impossible task of identifying outliers in a dataset with very, very small sample sizes, and Dixon’s Q test caught my attention. Honestly, I am not a big fan of this statistical test, but Dixon’s Q test is still quite popular in certain scientific fields (e.g., chemistry), so it is important to understand its principles in order to draw your own conclusions about the research data that you might stumble upon in research articles or scientific talks.
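For reference, the Q statistic itself is simple: the gap between the suspect value and its nearest neighbour, divided by the full range. A minimal sketch (the data are illustrative; the critical value 0.466 for n = 10 at 95% confidence is quoted from commonly used Q tables, which vary slightly by source):

x <- sort(c(0.189, 0.167, 0.187, 0.183, 0.186, 0.182,
            0.181, 0.184, 0.181, 0.177))

gap <- x[2] - x[1]          # here the smallest value is the suspect
rng <- x[length(x)] - x[1]
Q   <- gap / rng            # about 0.455 for these data
Q > 0.466                   # FALSE: the suspect point is not rejected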


Chatbots to become a major part of CX by 2020

When asked which technologies will most improve the customer experience, nearly 40% of sales and marketing leaders cited Virtual Reality (VR) and 34% believe Artificial Intelligence (AI) will be the biggest game-changer. More than three quarters (78%) of brands say they have already implemented, or are planning to implement, AI and VR by 2020 to better serve customers. When it comes to chatbots – one of the most recognisable forms of AI – 80% of sales and marketing leaders say they already use these in their CX or plan to do so by 2020.


Estimate Regression with (Type-I) Pareto Response

The positive lower bound of the Type-I Pareto distribution is particularly appealing for modeling the severity measure, in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of ABA operational risk consortium data is $10,000, and any loss event below the threshold value would not be reported, which might add complexity to the severity model estimation.
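For intuition, here is a minimal sketch of Type-I Pareto estimation with a known lower bound x_m, i.e. the reporting threshold (simulated data; the shape value is an illustrative assumption). The density is f(x) = a * x_m^a / x^(a + 1) for x >= x_m, and with x_m known the shape has a closed-form MLE:

set.seed(1)
x_m <- 10000                       # reporting threshold, as in the post
a   <- 1.5                         # true shape
x   <- x_m / runif(5000)^(1 / a)   # inverse-CDF draw from Pareto(x_m, a)

a_hat <- length(x) / sum(log(x / x_m))
a_hat                              # close to the true value 1.5

# a regression version replaces the constant shape with exp(z'beta) and
# maximizes the corresponding log-likelihood, e.g. with optim()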


Three Shiny Apps to Celebrate the Beauty of Maths

One of the best decisions I took this year related to this blog was to move it to my own self-hosted domain using WordPress.org. It allows me, for example, to embed dynamic JavaScript visualizations like this one. Another thing I can do now is upload my Shiny Apps to share them with my readers. In this post I have gathered three apps I made some time ago; you can play with them as well as get the code I wrote for each one:


Where Analytics, Data Mining, Data Science were applied in 2016

CRM/Consumer Analytics, Finance, and Banking are still the leading applications, but Anti-spam, Mobile apps, and Travel/hospitality grew the most in 2016. The share of Health care, Consumer analytics, and Direct Marketing/Fundraising data science applications declined for two years in a row.


Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Code and info for the paper ‘Plug and Play Generative Networks’ (PPGNs), which are models composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable ‘condition’ network C that tells the generator what to draw. The authors demonstrate the generation of images conditioned on a class (when C is an ImageNet or MIT Places classification network) and also conditioned on a caption (when C is an image captioning network).


Faster Implicit Matrix Factorization

As part of my post on matrix factorization, I released a fast Python version of the Implicit Alternating Least Squares matrix factorization algorithm that is frequently used to recommend items. While this matrix factorization code was already extremely fast, this post looks at making it faster still.