XLA – TensorFlow, compiled

One of the design goals and core strengths of TensorFlow is its flexibility. TensorFlow was designed to be a flexible and extensible system for defining arbitrary data flow graphs and executing them efficiently in a distributed manner using heterogenous computing devices (such as CPUs and GPUs). But flexibility is often at odds with performance. While TensorFlow aims to let you define any kind of data flow graph, it’s challenging to make all graphs execute efficiently because TensorFlow optimizes each op separately. When an op with an efficient implementation exists or when each op is a relatively heavyweight operation, all is well; otherwise, the user can still compose this op out of lower-level ops, but this composition is not guaranteed to run in the most efficient way.

Github Course of Practical Reinforcement Learning

A course on reinforcement learning in the wild. Taught on-campus in HSE and Yandex SDA (russian) and maintained to be friendly to online students (both english and russian).

Fitting Gaussian Process Models in Python

A common applied statistics task involves building regression models to characterize non-linear relationships between variables. It is possible to fit such models by assuming a particular non-linear functional form, such as a sinusoidal, exponential, or polynomial function, to describe one variable’s response to the variation in another. Unless this relationship is obvious from the outset, however, it involves possibly extensive model selection procedures to ensure the most appropriate model is retained. Alternatively, a non-parametric approach can be adopted by defining a set of knots across the variable space and use a spline or kernel regression to describe arbitrary non-linear relationships. However, knot layout procedures are somewhat ad hoc and can also involve variable selection. A third alternative is to adopt a Bayesian non-parametric strategy, and directly model the unknown underlying function. For this, we can employ Gaussian process models.

8 simple ways how to boost your coding skills (not just) in R

Our world is generating more and more data, which people and businesses want to turn into something useful. This naturally attracts many data scientists – or sometimes called data analysts, data miners, and many other fancier names – who aim to help with this extraction of information from data. A lot of data scientists around me graduated in statistics, mathematics, physics or biology. During their studies they focused on individual modelling techniques or nice visualizations for the papers they wrote. Nobody had ever taken a proper computer science course that would help them tame the programming language completely and allow them to produce a nice and professional code that is easy to read, can be re-used, runs fast and with reasonable memory requirements, is easy to collaborate on and most importantly gives reliable results.

What makes a good data visualization – a Data Scientist perspective

We examine principles of good data visualization, including some great and terrible examples, guidelines for human perception, focus on key variables, changes and trends, avoiding chart junk, and more.

The Challenges of Building a Predictive Churn Model

Unlike other data science problems, there is no one method for predicting which customers are likely to churn in the next month. Here we review the most popular approaches.

Building Regression Models in R using Support Vector Regression

The article studies the advantage of Support Vector Regression (SVR) over Simple Linear Regression (SLR) models for predicting real values, using the same basic idea as Support Vector Machines (SVM) use for classification.

Changing names in the tidyverse: An example for many regressions

A collaborator posed an interesting R question to me today. She wanted to do several regressions using different outcomes, with models being computed on different strata defined by a combination of experimental design variables. She then just wanted to extract the p-values for the slopes for each of the models, and then filter the strata based on p-value levels.

Stream processing with R in AWS

R is rarely mentioned among the big data tools, although it’s fairly well scalable for most data science problems and ETL tasks. This talk presents an open-source R package to interact with Amazon Kinesis via the MultiLangDaemon bundled with the Amazon KCL to start multiple R sessions on a machine or cluster of nodes to process data from theoretically any number of Kinesis shards. Besides the technical background and a quick introduction on how Kinesis works, this talk will feature some stream processing use-cases at CARD.com, and will also provide an overview and hands-on demos on the related data infrastructure built on the top of Docker, Amazon ECS, ECR, KMS, Redshift and a bunch of third-party APIs.

Advanced Econometrics: Model Selection

On Thursday, March 23rd, Arthur Charpentier will give the third lecture of the PhD course on advanced tools for econometrics, on model selection and variable selection, where he will focus on ridge and lasso regressions . Slides are available online.