Spark Release 1.2.0
Spark 1.2.0 is the third release on the 1.X line. This release brings performance and usability improvements in Spark’s core engine, a major new API for MLlib, expanded ML support in Python, a fully high-availability (H/A) mode in Spark Streaming, and much more. GraphX has seen major performance and API improvements and graduates from alpha status. Spark 1.2 represents the work of 172 contributors from more than 60 institutions, across more than 1,000 individual patches.

Statistical and causal approaches to machine learning
Where would you take machine learning? 2014’s Milner Award winner Professor Bernhard Schölkopf, of the Max Planck Institute for Intelligent Systems, talks through machine learning, from its basic concepts to the pioneering research now widely used in science and industry.

NeuralNetTools 1.0.0 now on CRAN
After successfully navigating the perilous path of CRAN submission, I’m pleased to announce that NeuralNetTools is now available! From the description file, the package provides visualization and analysis tools to aid in the interpretation of neural networks, including functions for plotting, variable importance, and sensitivity analyses. I’ve written at length about each of these functions (see here, here, and here), so I’ll only provide an overview in this post. Most of these functions have remained unchanged since I initially described them, with one important change for the Garson function. Rather than reporting variable importance on a scale from -1 to 1 for each variable, I’ve returned to the original method, which reports importance on a scale from 0 to 1. I was getting inconsistent results after toying around with some additional examples, and decided the original method was a safer approach for the package. The modified version can still be installed from my GitHub gist. The development version of the package is also available on GitHub. Please use the development page to report issues.

Post–Big Data World in 2015
Predictive analytics, and the Big Data that fuels it, are becoming deeply and broadly embedded in business and society – no longer a phenomenon, but widely accepted as part of the foundation. Here are a few predictions for the coming year.
1. Big Data and analytics are ‘business as usual’.
2. Predictive security will become an important tool in the effort to stop cyber criminals.
3. Automation of modeling as well as the related reporting will be a priority.
4. The unstructured shall inherit the Earth.
5. The causation vs. correlation debate is becoming passé.

Averaging improves both accuracy and speed of time series classification
Time series classification using k-nearest neighbors and dynamic time warping can be improved in both speed and accuracy in many practical applications by averaging the training series.
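The speed gain is easy to see in a sketch: instead of computing a DTW distance to every training series, you compare against one averaged template per class. The toy data and the plain mean below are illustrative assumptions; the actual averaging method in the post (e.g. DTW barycenter averaging) aligns series before averaging.

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic time warping distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy training set: two classes of equal-length series.
train = {
    "sine": [np.sin(np.linspace(0, 2 * np.pi, 50) + s) for s in (0.0, 0.1, 0.2)],
    "ramp": [np.linspace(0, 1, 50) + s for s in (0.0, 0.05, 0.1)],
}

# Averaging step: one mean template per class (a crude stand-in for
# DTW barycenter averaging, which aligns the series before averaging).
templates = {label: np.mean(series, axis=0) for label, series in train.items()}

def classify(x):
    """1-NN against the averaged templates: one DTW call per class
    instead of one per training series."""
    return min(templates, key=lambda label: dtw(x, templates[label]))

print(classify(np.sin(np.linspace(0, 2 * np.pi, 50))))
```

With three series per class this saves little, but with thousands of training series per class the per-query cost drops from thousands of DTW calls to a handful.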

How to conduct a tombola with R amongst Twitter followers
Two weeks ago, we announced that we would raffle off three hardcover copies of our ADCR book among all followers of our Twitter account @RDataCollection. Tomorrow is the closing day, so it is high time to present the drawing procedure, which, as a matter of course, is conducted with R.

An Extended Version of the Scikit-Learn Cheat Sheet
You probably know the famous scikit-learn algorithm cheat sheet. It is a kind of decision tree that helps you figure out which machine learning algorithm to choose, depending on the type of problem you have: classification, regression, etc. …

Contextual Measurement Is a Game Changer
Decontextualized questions tend to activate a self-presentation strategy and retrieve memories of past positioning of oneself (impression management). Such personality inventories can be completed without ever thinking about how we actually behave in real situations. The phrase “at work” may disrupt that process if we do not have a prepared statement concerning our workplace demeanor. Yet, a simple “at work” may not be sufficient, and we may be forced to become more concrete and operationally define what we mean by courteous workplace behavior (performance appraisal). Our measures are still self-reports, but the added specificity requires that we relive the events described by the question (episodic memory) rather than providing inferences concerning the possible causes of our behavior.

One-way ANOVA with fixed and random effects from a Bayesian perspective
This blog post is derived from a computer practical session that I ran as part of my new course on Statistics for Big Data, previously discussed. This course covered a lot of material very quickly. In particular, I deferred introducing notions of hierarchical modelling until the Bayesian part of the course, where I feel it is more natural and powerful. However, some of the terminology associated with hierarchical statistical modelling probably seems a bit mysterious to those without a strong background in classical statistical modelling, and so this practical session was intended to clear up some potential confusion. I will analyse a simple one-way Analysis of Variance (ANOVA) model from a Bayesian perspective, making sure to highlight the difference between fixed and random effects in a Bayesian context where everything is random, as well as emphasising the associated identifiability issues. R code is used to illustrate the ideas.
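The post works in R; as a minimal sketch of the same kind of model in Python, here is a hand-rolled Gibbs sampler for a one-way layout y_ij = mu + theta_i + eps_ij, with the random effects theta_i ~ N(0, tau²). The variance components are held fixed to keep the conditionals simple, an assumption made here for brevity; a full analysis would sample them too.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a one-way layout: y_ij = mu + theta_i + eps_ij
I, n = 8, 25                      # groups, observations per group
mu_true, tau, sigma = 10.0, 2.0, 1.0
theta_true = rng.normal(0, tau, I)
y = mu_true + theta_true[:, None] + rng.normal(0, sigma, (I, n))

# Gibbs sampler: flat prior on mu, random effects theta_i ~ N(0, tau^2),
# with tau and sigma treated as known for simplicity.
iters, burn = 3000, 500
mu, theta = 0.0, np.zeros(I)
mu_draws = np.empty(iters)
for t in range(iters):
    # mu | theta, y ~ N(mean of residuals, sigma^2 / (I*n))
    mu = rng.normal((y - theta[:, None]).mean(), sigma / np.sqrt(I * n))
    # theta_i | mu, y ~ N with precision n/sigma^2 + 1/tau^2
    prec = n / sigma**2 + 1 / tau**2
    mean_i = ((y - mu).sum(axis=1) / sigma**2) / prec
    theta = rng.normal(mean_i, 1 / np.sqrt(prec))
    mu_draws[t] = mu

print("posterior mean of mu:", mu_draws[burn:].mean())
```

Note how, in the Bayesian setting, the only difference between "fixed" and "random" effects is the hierarchical prior on theta: everything is random, but the theta_i share a common distribution while mu does not.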

Interactive in-browser 3D visualization of datasets
In this post we’ll be looking at 3D visualization of various datasets using data-projector from Datacratic. The original demo didn’t initially impress us as much as it could have, maybe because the data is synthetic – it shows a bunch of small spheres in rainbow colors. Real datasets look better. The basic view in data-projector is a rotating cube. You can drag it any which way with your mouse. Use the mouse wheel to embiggen the cube.

Abridged List of Machine Learning Topics
1. Deep Learning
2. Online Learning
3. Graphical Models
4. Structured Predictions
5. Ensemble Methods
6. Kernel Machines
7. Hyper-parameter Optimization
8. Optimization
9. Graphs
10. Hadoop / Spark
11. GPU learning
12. Julia
13. Robotics
14. Natural Language Processing
15. Visualization
16. Computer Vision

Simple Multimodal Design Recommender
In the last post I discussed ways to evaluate the performance of recommender systems. In my experience, there is almost nothing as important, when building recommender and predictive models, as correctly evaluating their quality and performance. You could spend hours on preparing the training data, on feature engineering, and on choosing the most advanced algorithm, but it’s all worth nothing if you can’t tell whether your model will actually work – i.e., positively influence your KPIs and business processes.
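One common offline metric of the kind the evaluation post discusses is precision@k: of the top k items the model recommends, how many did the user actually like? A sketch, with hypothetical data:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items the user actually liked."""
    top_k = recommended[:k]
    return sum(item in relevant for item in top_k) / k

# Hypothetical evaluation data: a ranked recommendation list
# versus a held-out set of items the user liked.
recommended = ["A", "B", "C", "D", "E"]
relevant = {"A", "C", "F"}

print(precision_at_k(recommended, relevant, 3))  # 2 of the top 3 are relevant
```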

Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency
This essay is a (near) transcript of a talk I recently gave at the NIPS 2014 workshop on “Fairness, Accountability, and Transparency in Machine Learning.”

Handwritten digit recognition
The handwritten digit recognition task was one of the first great successes of machine learning methods. Nowadays the task can be carried out with very high accuracy (over 97% correct answers) by many specialized libraries, so that even though we indirectly use these features in tablets and smartphones, we generally do not know exactly how the method works.
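To demystify the idea a little, here is the simplest possible approach – nearest-neighbour matching on raw pixels – on made-up 3×3 "digits". This toy is only illustrative; real systems work on larger images (e.g. 28×28 MNIST) with far more sophisticated models.

```python
import numpy as np

# Made-up 3x3 binary "digit" templates (stand-ins for real digit images).
templates = {
    0: np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]], float),
    1: np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], float),
}

def predict(image):
    """1-nearest-neighbour on raw pixels: label of the closest template."""
    flat = image.ravel()
    return min(templates, key=lambda d: np.linalg.norm(templates[d].ravel() - flat))

noisy_one = np.array([[0, 1, 0], [0, 1, 0], [1, 1, 0]], float)  # a corrupted "1"
print(predict(noisy_one))
```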

Adding Cost Functions to ROCR performance objects
The ROCR reference manual states that “new performance measures can be added using a standard interface”, but I have not found that to be so. I may have missed some crucial step, but others have also had to adapt new performance measures by hand. One example I came across had “patched” the performance code to add a new measure, wss (work saved over sampling). I liked some parts of what they did, but wanted to add my own measure and allow a user to pass a new measure into a function without having to re-copy all the code.
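Independently of ROCR's internals, the computation behind a custom cutoff-based measure is simple. As a hypothetical illustration (not ROCR's interface, and shown in Python rather than the post's R), here is an expected-misclassification-cost measure evaluated at every cutoff:

```python
import numpy as np

def expected_cost(scores, labels, cost_fp=1.0, cost_fn=5.0):
    """Expected misclassification cost at every cutoff, in the spirit
    of a custom ROCR-style performance measure."""
    cutoffs = np.unique(scores)
    costs = []
    for c in cutoffs:
        pred = scores >= c
        fp = np.sum(pred & (labels == 0))   # false positives at this cutoff
        fn = np.sum(~pred & (labels == 1))  # false negatives at this cutoff
        costs.append((cost_fp * fp + cost_fn * fn) / len(labels))
    return cutoffs, np.array(costs)

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0])
cutoffs, costs = expected_cost(scores, labels)
best = cutoffs[np.argmin(costs)]
print("cheapest cutoff:", best)
```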

Predictive Analytics: When to use or not to use a consultant?
From someone who failed, and failed again, and is finally moving forward.

Data Just Right: A Practical Introduction to Data Science Skills
Michael’s talk is an introductory session covering current tools, skills, and trends in data analysis. Using a hands-on walk through with a real data set, you’ll look at common data use cases and patterns, along with the technology most appropriate for solving some of these data challenges. You’ll also take a look at trends in technology, learn which Data Scientist tasks are becoming automated, and which tasks require human skills more than ever!

The Best Data Visualization Projects of 2014
These are my favorites for the year, roughly in order of favorite on down and based on use of data, design, and being useful. Mostly though, my picks are based on gut.

A Resurgence of Neural Networks in Machine Learning
There’s really nothing magical about a neural network. It’s a natural extension of a simple linear model.
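The claim is easy to see in code. A minimal NumPy sketch with made-up random weights (forward pass only): a network is the linear model applied twice, with a nonlinearity squeezed in between; remove the nonlinearity and it collapses back into a single linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                    # one input with 3 features

# Simple linear model: a single affine map.
W, b = rng.normal(size=(1, 3)), rng.normal(size=1)
linear_out = W @ x + b

# One-hidden-layer network: two affine maps with a nonlinearity between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
hidden = np.tanh(W1 @ x + b1)             # drop the tanh and the two layers
net_out = W2 @ hidden + b2                # compose into one linear model
```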

Julia By Example
Below are a series of examples of common operations in Julia. They assume you already have Julia installed and working (the examples are currently tested with Julia v0.3 – latest).

Making sense of word2vec
One year ago, Tomáš Mikolov (together with his colleagues at Google) made some ripples by releasing word2vec, an unsupervised algorithm for learning the meaning behind words. In this blog post, I’ll evaluate some of the extensions that have appeared over the past year.
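As background for those extensions: the training signal in word2vec's skip-gram model is just (center, context) word pairs drawn from a sliding window over the text. A minimal sketch of that pair generation:

```python
def skipgram_pairs(tokens, window=2):
    """Generate the (center, context) pairs that skip-gram trains on."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

The model then learns vectors such that a center word's vector predicts its context words, which is where the "meaning" ends up encoded.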

Dataframes: Julia, Python, R
I took this awesome tutorial by Greg Reda on “Working with DataFrames” and tried to port the example to R and Julia.

Rapid A/B-testing with Sequential Analysis
A common issue with classical A/B-tests, especially when you want to be able to detect small differences, is that the sample size needed can be prohibitively large. In many cases it can take several weeks, months or even years to collect enough data to conclude a test. In this post I’ll introduce a little-known test that in many cases severely reduces the number of samples needed, namely the Sequential Generalized Likelihood Ratio Test.
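The sequential GLR test itself takes some care to implement, but its simpler ancestor, Wald's SPRT for a Bernoulli rate, shows the core idea: accumulate a log-likelihood ratio observation by observation and stop the moment it crosses a decision boundary. The rates and error levels below are hypothetical:

```python
import math

def sprt(observations, p0=0.05, p1=0.08, alpha=0.05, beta=0.20):
    """Wald's SPRT for a Bernoulli rate: stop as soon as the evidence is decisive."""
    upper = math.log((1 - beta) / alpha)   # accept H1 (rate is p1) above this
    lower = math.log(beta / (1 - alpha))   # accept H0 (rate is p0) below this
    llr = 0.0
    for n, x in enumerate(observations, start=1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", len(observations)

# A stream of conversions heavily favouring the higher rate stops early.
decision, n = sprt([1, 1, 1, 1, 1, 1, 1, 1])
print(decision, "after", n, "observations")
```

Unlike a fixed-horizon test, there is no single sample size: the test stops early when the data are clear-cut, which is where the savings come from.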

Animations and GIFs using ggplot2
I thought I’d have a little fun with the new animation package in R. It’s actually really easy to use. I recently had some fun with it when I presented my research at an electronic poster session and had an animated movie embedded into the PowerPoint.

Principal Component Analysis on Imaging
Ever wonder what mathematics is behind the face recognition in most gadgets, like digital cameras and smartphones? Well, for the most part it has something to do with statistics. One statistical tool capable of such a feat is Principal Component Analysis (PCA). In this post, however, we will not do face recognition (sorry to disappoint you), as we reserve that for a future post while I’m still doing research on it. Instead, we go through the basic concept and use PCA for data reduction on the spectral bands of an image, using R.
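The post works in R; the core computation is small enough to sketch in NumPy on made-up data standing in for correlated spectral bands: center the data, take an SVD, and keep the leading components.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up stand-in for pixel values across 4 highly correlated spectral bands.
base = rng.normal(size=(500, 1))
X = np.hstack([base * w for w in (1.0, 0.9, 0.8, 0.7)]) \
    + 0.05 * rng.normal(size=(500, 4))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)   # variance share per component
scores = Xc @ Vt[:2].T            # project the 4 bands onto 2 components

print("variance explained by PC1:", explained[0])
```

Because the bands are strongly correlated, the first component captures nearly all the variance: that is the data reduction, replacing four bands with one or two components.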

Regression Analysis using R explained
What is regression analysis, and where is it applicable?

Introductory R Presentation
A short introductory presentation explaining a little bit about R from a beginner’s point of view. Slides put together with R/markdown and ioslides.

Spectral Clustering to Uncover the Structure of the US Economy
…Group the nodes by clusters such that you minimize inter-group links and maximize intra-group links. In other words, spectral clustering on the production consumption graph. …
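The recipe in that sentence can be sketched directly on a toy graph (two dense groups joined by one weak link, standing in for the actual production-consumption data): form the graph Laplacian and split on the sign of its Fiedler vector.

```python
import numpy as np

# Toy adjacency matrix: two 3-node cliques joined by a single weak link.
A = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                A[i, j] = 1.0
A[2, 3] = A[3, 2] = 0.1   # the inter-group link we want to cut

# Unnormalized graph Laplacian L = D - A.
L = np.diag(A.sum(axis=1)) - A

# The Fiedler vector (eigenvector of the second-smallest eigenvalue)
# splits the graph so that intra-group links stay together.
eigvals, eigvecs = np.linalg.eigh(L)
labels = (eigvecs[:, 1] > 0).astype(int)
print(labels)
```

The sign pattern of the Fiedler vector minimizes (a relaxation of) the weight of links cut between the two groups, which is exactly the minimize-inter / maximize-intra objective above.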

HiScore: A Python Library for elegantly creating scoring functions
HiScore is a Python library for making scoring functions, which map objects (vectors of numerical attributes) to scores (a single numerical value). Scores are a way for domain experts to communicate the quality of a complex, multi-faceted object to a broader audience. Scores are ubiquitous; everything from NFL quarterbacks to the walkability of neighborhoods has a score. HiScore provides a new way for domain experts to quickly create and improve intuitive scoring functions: by using reference sets – sets of representative objects that are assigned scores.
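This is not HiScore's actual API, but the reference-set idea can be illustrated with a simple inverse-distance interpolation over hypothetical expert-scored examples (HiScore itself uses a more principled monotone interpolation):

```python
import numpy as np

# Hypothetical reference set: attribute vectors a domain expert has scored.
reference = {
    (0.0, 0.0): 0.0,      # worst case
    (5.0, 5.0): 50.0,     # middling
    (10.0, 10.0): 100.0,  # best case
}

def score(obj, eps=1e-9):
    """Inverse-distance-weighted interpolation of the reference scores
    (a crude stand-in for HiScore's interpolation scheme)."""
    pts = np.array(list(reference))
    vals = np.array(list(reference.values()))
    d = np.linalg.norm(pts - np.array(obj), axis=1)
    if d.min() < eps:                 # exactly on a reference object
        return float(vals[d.argmin()])
    w = 1.0 / d**2
    return float(w @ vals / w.sum())

print(score((5.0, 5.0)))  # reference objects keep their assigned score
```

The appeal of the approach is that the expert only has to score a handful of concrete examples, not design a formula.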

How to extract a data.frame from string data
Sometimes, subject data are recorded on a server (e.g. an SQL server) as one string record per subject. In some cases we need only a part of each string, and we need it as numerical data (e.g. as a data.frame). How can we get the required data?
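The post does this in R; the same idea in Python is a regular-expression capture per record, using a hypothetical record format for illustration:

```python
import re

# Hypothetical string records, one per subject, as they might come
# back from an SQL server.
records = [
    "id=101;weight=72.5;height=1.80",
    "id=102;weight=81.0;height=1.75",
]

pattern = re.compile(r"id=(\d+);weight=([\d.]+);height=([\d.]+)")

rows = []
for rec in records:
    m = pattern.match(rec)
    rows.append({"id": int(m.group(1)),
                 "weight": float(m.group(2)),
                 "height": float(m.group(3))})

print(rows[0])
```

From `rows`, a call like `pandas.DataFrame(rows)` (or R's `data.frame`) gives the tabular form.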