SQL for Data Analysis – Tutorial for Beginners

This is the second episode of my SQL for Data Analysis (for beginners) series, and today I’ll show you every tiny detail of the SQL WHERE clause. It’s no accident that I dedicate a whole article to this topic: the SQL WHERE clause is essential if you want to select the right bit of your data from your data table! In the first half of this article I’ll show you the different operators. In the second half, we will finally import our favorite dataset of 7,000,000+ rows (the one with the airplane delays), and at the end I’ll give you a few assignments to test your SQL knowledge and practice a bit!
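To give a flavor of what a WHERE clause does, here is a minimal sketch using Python's built-in sqlite3 module. The table name flight_delays, its columns, and the four sample rows are assumptions made up for illustration; they are not the tutorial's actual dataset or schema.

```python
import sqlite3

# Hypothetical miniature stand-in for the airplane-delay dataset.
# Table name, column names, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight_delays (origin TEXT, dest TEXT, arrdelay INTEGER)"
)
conn.executemany(
    "INSERT INTO flight_delays VALUES (?, ?, ?)",
    [
        ("SFO", "JFK", 45),
        ("SFO", "LAX", -5),
        ("ORD", "JFK", 120),
        ("ORD", "LAX", 10),
    ],
)

# WHERE keeps only the rows for which the whole condition is true:
# here, flights into JFK that arrived more than 30 minutes late.
late_to_jfk = conn.execute(
    "SELECT origin, arrdelay FROM flight_delays "
    "WHERE dest = ? AND arrdelay > ? "
    "ORDER BY arrdelay",
    ("JFK", 30),
).fetchall()
conn.close()
```

Comparison operators such as = and > can be combined with AND and OR, which is the kind of filtering the article walks through.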

Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs

Generative Adversarial Networks (GANs) have shown remarkable success as a framework for training models to produce realistic-looking data. In this work, we propose a Recurrent GAN (RGAN) and Recurrent Conditional GAN (RCGAN) to produce realistic real-valued multi-dimensional time series, with an emphasis on their application to medical data. RGANs make use of recurrent neural networks in the generator and the discriminator. In the case of RCGANs, both of these RNNs are conditioned on auxiliary information. We demonstrate our models in a set of toy datasets, where we show visually and quantitatively (using sample likelihood and maximum mean discrepancy) that they can successfully generate realistic time-series. We also describe novel evaluation methods for GANs, where we generate a synthetic labelled training dataset, and evaluate on a real test set the performance of a model trained on the synthetic data, and vice-versa. We illustrate with these metrics that RCGANs can generate time-series data useful for supervised training, with only minor degradation in performance on real test data. This is demonstrated on digit classification from ‘serialised’ MNIST and by training an early warning system on a medical dataset of 17,000 patients from an intensive care unit. We further discuss and analyse the privacy concerns that may arise when using RCGANs to generate realistic synthetic medical time series data.
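For readers unfamiliar with maximum mean discrepancy (MMD), the following is a minimal sketch of the idea: compare two samples through average kernel similarities. It assumes a Gaussian (RBF) kernel with a fixed bandwidth, the simple biased estimator, and each time series flattened into a tuple of numbers. These choices, and the function names, are illustrative and are not necessarily the ones used in the paper.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two equal-length sequences of numbers."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy "time series" of length 2 (made-up numbers, illustration only).
real = [(0.0, 0.1), (0.2, 0.0)]
close = [(0.1, 0.1), (0.1, 0.0)]
far = [(3.0, 3.0), (3.2, 2.9)]

mmd_close = mmd_squared(real, close)
mmd_far = mmd_squared(real, far)
```

A smaller value means the two samples look more alike under the chosen kernel, so a generator whose samples resemble the real data should give a value near zero.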

The Practical Importance of Feature Selection

Feature selection is useful on a variety of fronts: it is the best weapon against the Curse of Dimensionality; it can reduce overall training times; and it is a powerful defense against overfitting, increasing generalizability.

Joining Tables in SparkR

Interfacing with APIs using R: the basics

While R (and its package ecosystem) provides a wealth of functions for querying and analyzing data, in our cloud-enabled world there’s now a plethora of online services with APIs you can use to augment R’s capabilities. Many of these APIs use a RESTful interface, which means you will typically send/receive data encoded in the JSON format using HTTP commands.
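The request/response pattern described here is language-independent. The post itself works in R; the sketch below uses Python's standard library purely as an illustration. The endpoint URL, the payload fields, and the sample response are all hypothetical, and no network call is made.

```python
import json
import urllib.request

# Hypothetical request body; the field names are assumptions.
payload = {"text": "hello world", "language": "en"}

# Build (but do not send) an HTTP POST request carrying JSON.
# The URL is a placeholder, not a real service.
request = urllib.request.Request(
    url="https://api.example.com/v1/analyze",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending would be urllib.request.urlopen(request); it is omitted here
# because the endpoint does not exist. A made-up response body stands in
# for what such a service might return.
sample_response_body = '{"sentiment": "positive", "score": 0.5}'
result = json.loads(sample_response_body)
```

The same two steps, encoding a structure to JSON for the request and decoding JSON from the response, are what an R client does as well.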


Hello, everyone! I’ve been meaning to get a new blog post out for the past couple of weeks. During that time I’ve been messing around with clustering. Clustering, or cluster analysis, is a method of data mining that groups similar observations together. Classification and clustering are quite alike, but clustering is more concerned with exploration than an end result. Note: This post is far from an exhaustive look at all clustering has to offer. Check out this guide for more. I am reading Data Mining by Aggarwal presently, which is very informative.

LASSO regression in R exercises

Least Absolute Shrinkage and Selection Operator (LASSO) performs regularization and variable selection on a given model. Depending on the size of the penalty term, LASSO shrinks the coefficients of less relevant predictors toward zero, possibly to exactly zero. Thus, it enables us to consider a more parsimonious model. In this exercise set we will use the glmnet package to implement LASSO regression in R.
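The shrink-to-zero behaviour can be seen in a small sketch. It relies on a simplifying assumption: when the predictors are orthonormal and the objective is half the residual sum of squares plus the penalty times the sum of absolute coefficients, the LASSO solution is the soft-thresholded least-squares coefficient. The exercises themselves use glmnet in R; this Python snippet, with made-up coefficients, only illustrates the mechanism and is not how glmnet is called.

```python
def soft_threshold(value, penalty):
    """Move value toward zero by penalty; return 0.0 if it would cross zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Made-up least-squares coefficients (illustration only).
ols_coefficients = [2.5, -0.25, 0.75, -1.75]

# A small penalty removes only the weakest predictor;
# a larger penalty removes more of them.
shrunk_small = [soft_threshold(b, 0.5) for b in ols_coefficients]
shrunk_large = [soft_threshold(b, 1.0) for b in ols_coefficients]
```

With the penalty at 0.5 only the second coefficient becomes zero; at 1.0 the third one is dropped as well, which is the variable-selection effect described above.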

Facilitating Exploratory Data Visualization: Application to TCGA Genomic Data

In genomic fields, it’s very common to explore the gene expression profile of one or a list of genes involved in a pathway of interest. Here, we present some helper functions in the ggpubr R package to facilitate exploratory data analysis (EDA) for life scientists.

A Data Scientist’s Guide to Predicting Housing Prices in Russia

In May 2017, Sberbank, Russia’s oldest and largest bank, challenged data scientists on Kaggle to come up with the best machine learning models to estimate housing prices for its customers, who include consumers and developers looking to buy, invest in, or rent properties. This blog post outlines the end-to-end process of how we went about tackling this challenge.