Tutorial on Automated Machine Learning using MLBox

What if I told you there is a library called MLBox that does most of the heavy lifting in machine learning for you in minimal lines of code? From missing value imputation to feature engineering using state-of-the-art Entity Embeddings for categorical features, MLBox has it all. In these 8 lines of code using MLBox, I have also performed hyperparameter optimisation and tested around 50 models with blazing speed – isn’t that awesome? You will be able to use this library by the end of this article.


7 Great Articles About TensorFlow

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, Hadoop, decision trees, ensembles, correlation, outliers, Python, R, TensorFlow, SVM, data reduction, feature selection, experimental design, time series, cross-validation, model fitting, dataviz, AI and many more.


Recommendation System Algorithms

Main existing recommendation engines and how they work


Bayesian Bootstrap in Python

bayesian_bootstrap is a package for Bayesian bootstrapping in Python. For an overview of the Bayesian bootstrap, I highly recommend reading Rasmus Bååth’s writeup. This Python package is similar to his R package. This README contains some examples below. For the documentation of the package’s API, see the docs. This package is on PyPI – you can install it with pip install bayesian_bootstrap.
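To make the idea concrete, here is a minimal sketch of the Bayesian bootstrap for a mean, written with the Python standard library only. It does not use the bayesian_bootstrap package or reproduce its API; the function names and the data values are made up for illustration. Each replication draws weights from a uniform Dirichlet distribution (instead of resampling rows, as the classical bootstrap does) and computes the weighted mean.

```python
import random


def dirichlet_uniform_weights(n, rng):
    # A Dirichlet(1, ..., 1) draw can be obtained by normalising
    # n independent Exponential(1) variates so that they sum to 1.
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def bayesian_bootstrap_mean(data, n_replications, seed=0):
    # Hypothetical helper: returns one weighted mean per replication,
    # which together approximate the posterior of the mean.
    rng = random.Random(seed)
    posterior = []
    for _ in range(n_replications):
        weights = dirichlet_uniform_weights(len(data), rng)
        posterior.append(sum(w * x for w, x in zip(weights, data)))
    return posterior


# Made-up observations, for illustration only.
observations = [1.2, 0.7, 3.4, 2.2, 1.9, 2.8]

posterior_means = sorted(bayesian_bootstrap_mean(observations, n_replications=2000))

# Rough 95% credible interval taken from the sorted posterior draws.
lower = posterior_means[int(0.025 * len(posterior_means))]
upper = posterior_means[int(0.975 * len(posterior_means)) - 1]
print(f"95% interval for the mean: [{lower:.2f}, {upper:.2f}]")
```

For real work, the package itself (or NumPy's Dirichlet sampler) is likely the better choice; the sketch is only meant to show what the weights are doing.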


Word Vectors and SAT Analogies

The king/queen example is not difficult, and I don’t know whether it was tested or discovered. A better evaluation would use a set of challenging pre-determined questions. There is a Google set of analogy questions, but all the relationships are grammatical, geographical, or by gender. Typical: ‘fast : fastest :: old : oldest.’ (dataset, paper, context) SAT questions are more interesting. Selecting from fixed answer choices provides a nice guessing baseline (1/5 is 20%) and using a human test means it’s easier to get human performance levels (average US college applicant is 57%; human voting is 81.5%). Michael Littman and Peter Turney have made available a set of 374 SAT analogy questions since 2003. You have to email Turney to get them, and I appreciate that he helped me out.


How to Handle Imbalanced Classes in Machine Learning

Imbalanced classes put “accuracy” out of business. This is a surprisingly common problem in machine learning (specifically in classification), occurring in datasets with a disproportionate ratio of observations in each class. Standard accuracy no longer reliably measures performance, which makes model training much trickier.
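A small illustration of why accuracy breaks down, using made-up labels with a 95:5 class ratio (the numbers are an assumption for the example, not taken from the article). A "model" that always predicts the majority class scores well on accuracy while never finding a single positive case.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    # Fraction of actual positives that the model identified.
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A trivial baseline that always predicts the majority class.
y_pred = [0] * 100

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)

print(f"accuracy = {baseline_accuracy:.2f}")  # high, despite a useless model
print(f"recall   = {baseline_recall:.2f}")    # no positives are detected
```

Metrics that look at the minority class directly, such as recall here, expose the problem that accuracy hides.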


A pick of the best R packages for interactive plot and visualisation (2/2)

In the first part of A pick of the best R packages for interactive plot and visualization, we saw the best packages for interactive plots in R. Now, let’s look at the best packages for interactive visualizations.
Plots tend to represent ‘classic’ data: they have an x-axis, a y-axis and one or two other variables represented as colors, sizes or symbols. Visualizations represent data that are not structured in a ‘regular’ way, for instance:
• Network and graph data (for instance social network connections)
• Sequential data (for example a consumer journey on a website or in a shop)
• Hierarchical data (to represent nested groups, like a Russian doll)
• Textual data


Getting Started with Python for Data Analysis

A friend recently asked this and I thought it might benefit others if published here. This is for someone new to Python who wants the easiest path from zero to one.


Data wrangling : Transforming (1/3)

Data wrangling is a task of great importance in data analysis. Data wrangling is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process which is estimated to take about 60-80% of an analyst’s time. In this series we will go through this process. It will be a brief series with the goal of sharpening the reader’s data wrangling skills. This is the third part of the series and it aims to cover the transforming of data. This can include filtering, summarizing, and ordering your data by different means. This also includes combining various data sets, creating new variables, and many other manipulation tasks. In this post, we will go through the most basic tasks, including slicing and filtering, on the famous mtcars data set.


Teach the tidyverse to beginners

A few years ago, I wrote a post Don’t teach built-in plotting to beginners (teach ggplot2). I argued that ggplot2 was not an advanced approach meant for experts, but rather a suitable introduction to data visualization.


How perceptions of R have changed

In the sponsor presentation for Microsoft at the useR!2017 conference in Brussels this morning, I thought I’d share how perceptions of R have changed over the years. Today, R is known as a popular, comprehensive, accepted, scalable, production-ready and supported software environment for data analysis, but that wasn’t always the case. You can find the slides for my presentation R, Then and Now below: …