
**29**
*Thursday*
Jun 2017

Posted in Distilled News

**Measuring the Progress of AI Research**

This pilot project collects problems and metrics/datasets from the AI research literature, and tracks progress on them. You can use this Notebook to see how things are progressing in specific subfields or AI/ML as a whole, as a place to report new results you’ve obtained, as a place to look for problems that might benefit from having new datasets/metrics designed for them, or as a source to build on for data science projects. At EFF, we’re ultimately most interested in how this data can influence our understanding of the likely implications of AI. To begin with, we’re focused on gathering it.

**Using csvkit to Summarize Data: A Quick Example**

As data analysts, we’re frequently presented with comma-separated values (CSV) files and tasked with reporting insights. While it’s tempting to import that data directly into R or Python in order to perform data munging and exploratory data analysis, there are also a number of utilities for examining, fixing, slicing, transforming, and summarizing data from the command line. In particular, csvkit is a suite of Python-based utilities for working with CSV files from the terminal. For this post, we will grab data using wget, subset rows containing a particular value, and summarize the data in different ways. The goal is to take data on criminal activity, group it by a particular offense type, and develop counts to understand the frequency distribution.
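csvkit does this work from the terminal; as a rough illustration of the final group-and-count step, here is a minimal pure-Python sketch. The column name and sample rows below are invented for illustration, not taken from the post’s dataset:

```python
import csv
import io
from collections import Counter

# Hypothetical crime data; in the post this comes from a CSV fetched with wget.
raw = """offense,district
THEFT,1
BATTERY,2
THEFT,3
ASSAULT,1
THEFT,2
"""

# Count rows per offense type -- the same frequency summary a
# csvgrep/csvstat pipeline would report.
reader = csv.DictReader(io.StringIO(raw))
counts = Counter(row["offense"] for row in reader)

for offense, n in counts.most_common():
    print(offense, n)
```

The same idea scales to the real file by swapping `io.StringIO(raw)` for an `open(...)` call on the downloaded CSV.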

**Julia vs R and Python: what does Stack Overflow Developer Survey 2017 tell us?**

TL;DR: Most Julia programmers also use Python. However, among all languages, R is the one whose users are most likely to also develop in Julia. Recently Stack Overflow made public the results of its Developer Survey 2017. It is definitely an interesting data set. In this post I analyzed the answers to the question ‘Which of the following languages have you done extensive development work in over the past year, and which do you want to work in over the next year?’ from the perspective of the Julia language against other programming languages. In effect we get two variables of interest: 1) what was used and 2) what is planned to be used.
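The core of such an analysis is a conditional share: among users of language X, what fraction also report Julia? A sketch with invented toy responses (the real post parses the survey CSV; the numbers below are illustrative only):

```python
# Toy stand-in for the survey's multi-select "languages used" answers;
# each set represents one respondent.
respondents = [
    {"Python", "Julia"},
    {"Python"},
    {"Python"},
    {"R", "Julia"},
    {"R", "Julia"},
    {"R"},
    {"JavaScript"},
]

def share_also_julia(language, answers):
    """Among users of `language`, the fraction who also use Julia."""
    users = [a for a in answers if language in a]
    if not users:
        return 0.0
    return sum("Julia" in a for a in users) / len(users)

for lang in ("Python", "R", "JavaScript"):
    print(lang, share_also_julia(lang, respondents))
```

With the toy data above, R users have the highest co-usage share, mirroring the post’s headline finding.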

I’m pleased to announce the release of the dbplyr package, which now contains all dplyr code related to connecting to databases. This shouldn’t affect you-as-a-user much, but it makes dplyr simpler, and makes it easier to release improvements just for database related code.

**Using the TensorFlow API: An Introductory Tutorial Series**

This post summarizes and links to a great multi-part tutorial series on learning the TensorFlow API for building a variety of neural networks, as well as a bonus tutorial on backpropagation from the beginning.

Below is an example showing how to fit a Generalized Linear Model with H2O in R. The output is much more comprehensive than the one generated by the generic R glm().


**28**
*Wednesday*
Jun 2017

Posted in Distilled News

**Hands on with Deep Learning – Solution for Age Detection Practice Problem**

It is one thing to learn data science by reading or watching a video / MOOC and another to apply it to problems. You need to do both to learn the subject effectively. Today’s article is meant to help you apply deep learning to an interesting problem. If you are questioning why you should learn or apply deep learning – you have most likely just come out of a cave. Deep learning is already powering face detection in cameras, voice recognition on mobile devices, and self-driving cars. Today, we will solve the age detection problem using deep learning. If you are new to deep learning, I would recommend that you refer to the articles below before going through this tutorial and making a submission.

**Neural Networks as a Corporation Chain of Command**

Neural networks are considered complicated, and they are usually explained in terms of neurons and brain function. But we do not need to learn how the brain works to understand how neural networks are structured and how they operate.

**Recurrent Neural Nets – The Third and Least Appreciated Leg of the AI Stool**

Convolutional Neural Nets are getting all the press but it’s Recurrent Neural Nets that are the real workhorse of this generation of AI.

**How Feature Engineering can help you do well in a Kaggle competition – Part II**

In the first part of this series, I introduced the Outbrain Click Prediction machine learning competition. That post described some preliminary and important data science tasks like exploratory data analysis and feature engineering performed for the competition, using a Spark cluster deployed on Google Dataproc. In this post, I describe the competition evaluation, the design of my cross-validation strategy and my baseline models using statistics and trees ensembles.

For R users, there hasn’t been a production grade solution for deep learning (sorry MXNET). This post introduces the Keras interface for R and how it can be used to perform image classification. The post ends by providing some code snippets that show Keras is intuitive and powerful.

**Course: Visualizing Time Series Data in R**

I’m very pleased to announce my DataCamp course on Visualizing Time Series Data in R. This course is also part of the Time Series with R skills track. Feel free to have a look, the first chapter is free!

The padr package was designed to prepare datetime data for analysis: to take raw, timestamped data and quickly convert it into a tidy format that can be analyzed with all the tidyverse tools. Recently, a colleague and I discovered a second use for the package that I had not anticipated: checking data quality. Every analysis should include a check that the data are as expected. In the case of timestamped data, observations are sometimes missing due to technical malfunction of the system that produced them. Here are two examples that show how pad and thicken can be leveraged to quickly detect problems in timestamped data.
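pad and thicken are R functions from padr; as an illustration of the underlying quality check, here is a minimal Python sketch that finds timestamps missing from a supposedly regular series (the sample log and the hourly step are assumptions for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical sensor log with one missing hourly observation -- the kind
# of gap padr's pad() surfaces in R.
stamps = [
    datetime(2017, 6, 28, 0),
    datetime(2017, 6, 28, 1),
    # 02:00 is missing -- a technical malfunction
    datetime(2017, 6, 28, 3),
    datetime(2017, 6, 28, 4),
]

def missing_intervals(timestamps, step):
    """Return the expected timestamps absent from a supposedly regular series."""
    expected, gaps = timestamps[0], []
    for ts in timestamps:
        while expected < ts:        # everything before ts should have existed
            gaps.append(expected)
            expected += step
        expected = ts + step
    return gaps

gaps = missing_intervals(stamps, timedelta(hours=1))
print(gaps)
```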

**Volatility modelling in R exercises (Part-1)**

Volatility modelling is typically used for high-frequency financial data. Asset returns are typically uncorrelated, while the variation of asset prices (volatility) tends to be correlated across time. In this exercise set we will use the rugarch package (package description: here) to implement the ARCH (Autoregressive Conditional Heteroskedasticity) model in R.
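The exercises use rugarch in R; to build intuition for what an ARCH model describes, here is a pure-Python sketch that simulates an ARCH(1) process (parameter values are arbitrary illustrations, not estimates from any data):

```python
import random

random.seed(42)

def simulate_arch1(n, omega=0.2, alpha=0.5):
    """Simulate an ARCH(1) process: r_t = sigma_t * z_t with
    sigma_t^2 = omega + alpha * r_{t-1}^2 and z_t ~ N(0, 1)."""
    returns, prev = [], 0.0
    for _ in range(n):
        cond_var = omega + alpha * prev ** 2   # today's conditional variance
        prev = (cond_var ** 0.5) * random.gauss(0.0, 1.0)
        returns.append(prev)
    return returns

r = simulate_arch1(5000)
# The unconditional variance of ARCH(1) is omega / (1 - alpha) = 0.4 here;
# a long simulation should land near it.
sample_var = sum(x * x for x in r) / len(r)
print(round(sample_var, 3))
```

Returns themselves are uncorrelated, but large moves cluster because the conditional variance feeds on the previous squared return — exactly the stylized fact the paragraph describes.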

**27**
*Tuesday*
Jun 2017

Posted in Distilled News

**AI for Summarization – Enabling Human-Consumable Information**

In the first post in this series on Artificial Intelligence: Monster or Mentor? we saw that there are several key ways in which AI advances can improve human productivity in organizations. In this article, we’ll look at the first: Distillation. Distillation is applying AI approaches to automate making large data volumes interpretable. Just like miners distill tons of raw ore into ounces of gold using machines, the goal is to automate the identification of value in big data. Here, we’ll focus specifically on how Distillation can be applied to the business problem of customer experience.

**Deep Learning with TensorFlow in Python**

The following problems appeared in the first few assignments in the Udacity course Deep Learning (by Google). The descriptions of the problems are taken from the assignments. Classifying the letters with the notMNIST dataset: let’s first learn about simple data curation practices and familiarize ourselves with some of the data that are going to be used for deep learning with TensorFlow. The notMNIST dataset will be used for the Python experiments. This dataset is designed to look like the classic MNIST dataset, while looking a little more like real data: it’s a harder task, and the data is a lot less ‘clean’ than MNIST.

**Data Visualization with googleVis exercises part 4**

We saw in the previous charts some basic and well-known types of charts that googleVis offers to users. Before continuing with other, more sophisticated charts in the next parts, we are going to “dig a little deeper” and see some interesting features of those we already know. Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let’s begin!

**Course: Operations Research with R**

This blog entry concerns our course on “Operations Research with R” that we teach as part of our study program. We hope that the materials are of value to lecturers and everyone else working in the field of numerical optimization.

**24**
*Saturday*
Jun 2017

Posted in Distilled News

**For Companies, Data Analytics is a Pain; But Why?**

1. Analytics is not a vaccine, but a routine workout

2. Insights are just the initiations, and don’t add immediate value to your business

3. Scalability

4. Descriptive analytics is a post-mortem; does it really help?

5. Human intervention in analytics is a friend and a foe too

6. Opportunity cost is huge; stale answers make dents

7. Manually intensive

8. Numerical data is analyzed, but what about categorical values?

9. Users without expertise

10. Increased lead time to value


**An introduction to Support Vector Machines (SVM)**

So you’re working on a text classification problem. You’re refining your training set, and maybe you’ve even tried stuff out using Naive Bayes. But now you’re feeling confident in your dataset, and want to take it one step further. Enter Support Vector Machines (SVM): a fast and dependable classification algorithm that performs very well with a limited amount of data. Perhaps you have dug a bit deeper, and ran into terms like linearly separable, kernel trick and kernel functions. But fear not! The idea behind the SVM algorithm is simple, and applying it to natural language classification doesn’t require most of the complicated stuff. Before continuing, we recommend reading our guide to Naive Bayes classifiers first, since a lot of the things regarding text processing that are said there are relevant here as well. Done? Great! Let’s move on.
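To make that “simple idea” concrete, here is a minimal sketch of a linear SVM trained by subgradient descent on the hinge loss, in pure Python on an invented toy dataset. No kernels and no optimized solver — just the margin-maximizing objective the article describes:

```python
# Invented, linearly separable toy data: label +1 when x0 + x1 > 0.
data = [((2.0, 1.0), 1), ((1.5, 0.5), 1), ((3.0, 0.2), 1),
        ((-1.0, -2.0), -1), ((-0.5, -1.5), -1), ((-2.0, -0.1), -1)]

def train_linear_svm(samples, lam=0.001, eta=0.1, epochs=100):
    """Subgradient descent on the regularized hinge loss
    max(0, 1 - y * (w . x + b)) + lam * |w|^2 -- a bare-bones linear SVM."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in samples:
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:    # inside the margin: the hinge term is active
                w = [wi + eta * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += eta * y
            else:             # safely classified: only the shrinkage applies
                w = [wi * (1 - eta * lam) for wi in w]
    return w, b

w, b = train_linear_svm(data)

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

print(all(predict(x) == y for x, y in data))
```

For text classification you would feed in bag-of-words feature vectors instead of these 2D points; the training loop is unchanged.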

**Taxonomy of Methods for Deep Meta Learning**

Let’s talk about Meta-Learning, because this is one confusing topic. I wrote a previous post about Deconstructing Meta-Learning which explored “Learning to Learn”. I realized, though, that there is another kind of Meta-Learning that practitioners are more familiar with. This kind of Meta-Learning can be understood as algorithms that search for and select different DL architectures. Hyper-parameter optimization is an instance of this; however, there are other, more elaborate algorithms that follow the same prescription of searching for architectures.

**Set Theory Arbitrary Union and Intersection Operations with R**

Part 3 of 3 in the series Set Theory

• Introduction to Set Theory and Sets with R

• Set Operations Unions and Intersections in R

• Set Theory Arbitrary Union and Intersection Operations with R


**Interactive R visuals in Power BI**

Power BI has long had the capability to include custom R charts in dashboards and reports. But in sharp contrast to standard Power BI visuals, these R charts were static. While R charts would update when the report data was refreshed or filtered, it wasn’t possible to interact with an R chart on the screen (to display tool-tips, for example).

OpenCV is an incredibly powerful tool to have in your toolbox. I have had a lot of success using it in Python but very little success in R. I haven’t done too much other than searching Google, but it seems as if “imager” and “videoplayR” provide a lot of the functionality but not all of it. I have never actually called Python functions from R before. Initially, I tried the “rPython” library – it has a lot of advantages, but it was completely unnecessary for me, so system() worked absolutely fine. While this example is extremely simple, it should help to illustrate how easy it is to utilize the power of Python from within R. I need to give credit to Harrison Kinsley for all of his efforts and work at PythonProgramming.net – I used a lot of his code and ideas for this post (especially the Python portion). Using videoplayR I created a function which would take a picture with my webcam and save it as “originalWebcamShot.png”.

Data wrangling is a task of great importance in data analysis. Data wrangling is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process which is estimated to take about 60-80% of an analyst’s time. In this series we will go through this process. It will be a brief series with the goal of sharpening the reader’s skills on the data wrangling task. This is the second part of the series, and it covers the reshaping of data used to turn it into a tidy form. By tidy form, we mean that each feature forms a column and each observation forms a row.

**22**
*Thursday*
Jun 2017

Posted in Distilled News

**A comprehensive beginners guide for Linear, Ridge and Lasso Regression**

I was talking to one of my friends, who happens to be an operations manager at one of the supermarket chains in India. Over our discussion, we started talking about the amount of preparation the store chain needs to do before the Indian festive season (Diwali) kicks in. He told me how critical it is for them to estimate / predict which products will sell like hot cakes and which will not, prior to the purchase. A bad decision can leave your customers to look for offers and products in the competitors’ stores. The challenge does not finish there – you need to estimate the sales of products across a range of different categories for stores in varied locations and with consumers having different consumption patterns. While my friend was describing the challenge, the data scientist in me started smiling! Why? I had just figured out a potential topic for my next article. In today’s article, I will tell you everything you need to know about regression models and how they can be used to solve prediction problems like the one mentioned above.
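As one concrete piece of what such a guide covers, ridge regression’s closed-form solution for a single centered feature shows the shrinkage effect directly. A tiny sketch with invented data (not the supermarket dataset):

```python
# For one centered feature, ridge regression has the closed-form slope
#   w = sum(x*y) / (sum(x^2) + lambda),
# which shrinks toward 0 as the penalty lambda grows (lambda = 0 is OLS).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -2.0, 0.1, 2.1, 3.9]   # roughly y = 2x, invented for illustration

def ridge_slope(x, y, lam):
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

for lam in (0.0, 1.0, 10.0):
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```

Lasso has no such closed form in general (its penalty is not differentiable at zero), which is why it is usually fit by coordinate descent instead.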

**Can we predict flu deaths with Machine Learning and R?**

Among the many R packages, there is the outbreaks package. It contains datasets on epidemics, one of which is from the 2013 outbreak of influenza A H7N9 in China, as analysed by Kucharski et al. (2014). I will be using their data as an example to test whether we can use Machine Learning algorithms for predicting disease outcome. To do so, I selected and extracted features from the raw data, including age, days between onset and outcome, gender, whether the patients were hospitalised, etc. Missing values were imputed, and different model algorithms were used to predict outcome (death or recovery); the models were compared on prediction accuracy, sensitivity and specificity. The thus prepared dataset was divided into training and testing subsets. The test subset contained all cases with an unknown outcome. Before I applied the models to the test data, I further split the training data into validation subsets. The tested modeling algorithms were similarly successful at predicting the outcomes of the validation data. To decide on final classifications, I compared predictions from all models and defined the outcome “Death” or “Recovery” as a function of all models, whereas classifications with a low prediction probability were flagged as “uncertain”. Accounting for this uncertainty led to a 100% correct classification of the validation test set. The test cases with unknown outcome were then classified based on the same algorithms. Of 57 unknown cases, 14 were classified as “Recovery”, 10 as “Death” and 33 as uncertain.

**Envisioning the Future of Intelligent Applications**

This morning we announced the release of Ayasdi Envision – a framework for accelerating the development of intelligent applications based on our award winning machine intelligence platform. Envision does a lot of innovative things and while I won’t recount the press release here, I do want to expand on a few of them.

**Making Sense of Machine Learning**

Machine learning gets a lot of buzz these days, usually in connection with big data and artificial intelligence (AI). But what exactly is it? Broadly speaking, machine learners are computer algorithms designed for pattern recognition, curve fitting, classification and clustering. The word learning in the term stems from the ability to learn from data. Machine learning is also widely used in data mining and predictive analytics, which some commentators loosely call big data. It also is used for consumer survey analytics and is not restricted to high-volume, high-velocity data or unstructured data and need not have any connection with AI. In fact, many methods marketing researchers are well acquainted with, such as regression and k-means clustering, are also frequently called machine learners. For examples, see Apache Spark’s machine learning library or the books I cite in the last section of this article. To keep things simple, I will refer to well-known statistical techniques like regression and factor analysis as older machine learners and methods such as artificial neural networks as newer machine learners since they are generally less familiar to marketing researchers.

**Probabilistic programming from scratch**

Real-world data is almost always incomplete or inaccurate in some way. This means that the uncertain conclusions we draw from it are only meaningful if we can answer the question: how uncertain? One way to do this is using Bayesian inference. But, while Bayesian inference is conceptually simple, it can be analytically and computationally difficult in practice. Probabilistic programming is a paradigm that abstracts away some of this complexity. There are many probabilistic programming systems. Perhaps the most advanced is Stan, and the most accessible to non-statistician programmers is PyMC3. At Fast Forward Labs, we recently shared with our clients a detailed report on the technology and uses of probabilistic programming in startups and enterprises. But in this article, rather than use either of these advanced comprehensive systems, we’re going to build our own extremely simple system from scratch.
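As a taste of what “from scratch” can mean (this is not necessarily the article’s own example), here is a minimal sketch of Bayesian inference by simulation for a coin’s bias, assuming a uniform prior and invented data:

```python
import random

random.seed(1)

# Observed (invented) data: 6 heads in 9 tosses. Prior on the bias p: Uniform(0, 1).
heads, tosses = 6, 9

def posterior_samples(n_samples):
    """Approximate Bayes by simulation: draw p from the prior, simulate the
    experiment, and keep p only if the simulation reproduces the observed
    data -- the simplest form of approximate Bayesian computation."""
    kept = []
    while len(kept) < n_samples:
        p = random.random()                              # draw from the prior
        simulated = sum(random.random() < p for _ in range(tosses))
        if simulated == heads:                           # exact match -> accept
            kept.append(p)
    return kept

samples = posterior_samples(2000)
post_mean = sum(samples) / len(samples)
print(round(post_mean, 2))
```

The exact posterior here is Beta(7, 4) with mean 7/11 ≈ 0.64; the accepted samples approximate it without any calculus, which is precisely the convenience probabilistic programming systems industrialize.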

**Updated Data Science Virtual Machine for Windows: GPU-enabled with Docker support**

The Windows edition of the Data Science Virtual Machine (DSVM), the all-in-one virtual machine image with a wide collection of open-source and Microsoft data science tools, has been updated to the Windows Server 2016 platform. This update brings built-in support for Docker containers and GPU-based deep learning.

**Smoothing a time-series with a Bayesian model**

Recently I looked at fitting a smoother to a time-series using Bayesian modelling. Now I will look at how you can control the smoothness by using more or less informative priors on the precision (1/variance) of the random effect.

**Analytics Administration for R**

Analytic administrator is a role that data scientists assume when they onboard new tools, deploy solutions, support existing standards, or train other data scientists. It is a role that works closely with IT to maintain, upgrade, and scale analytic environments. Analytic admins have a multiplier effect – as they go about their work, they influence others in the organization to be more effective. If you are a data scientist using R, you might consider filling the role of analytic admin for your organization. Consider the data scientist who wants to make R a legitimate part of their organization. This person has to introduce a new technology and help IT build the architecture around it. In this role, the data scientist – acting as an analytic admin – influences their entire organization.

**A scalable time-series database that supports SQL**

In this episode of the Data Show, I spoke with Michael Freedman, CTO of Timescale and professor of computer science at Princeton University. When I first heard that Freedman and his collaborators were building a time-series database, my immediate reaction was: “Don’t we have enough options already?” The early incarnation of Timescale was a startup focused on IoT, and it was while building tools for the IoT problem space that Freedman and the rest of the Timescale team came to realize that the database they needed wasn’t available (at least out in open source). Specifically, they wanted a database that could easily support complex queries and the sort of real-time applications many have come to associate with streaming platforms. Based on early reactions to TimescaleDB, many users concur.

**22**
*Thursday*
Jun 2017

Posted in Distilled News

**A Semi-Supervised Classification Algorithm using Markov Chain and Random Walk in R**

In this article, a semi-supervised classification algorithm implementation will be described using Markov Chains and Random Walks. We have the following 2D circles dataset (with 1000 points) with only 2 points labeled (as shown in the figure, colored red and blue respectively, for all others the labels are unknown, indicated by the color black). Now the task is to predict the labels of the other (unlabeled) points. From each of the unlabeled points (Markov states) a random walk with Markov transition matrix (computed from the row-stochastic kernelized distance matrix) will be started that will end in one labeled state, which will be an absorbing state in the Markov Chain.
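The absorption idea can be sketched on a tiny graph: a hypothetical path of five states with the two endpoints labeled, classifying each unlabeled state by where random walks starting from it are absorbed. This Monte Carlo version is a stand-in for the transition-matrix computation the post describes:

```python
import random

random.seed(7)

# Path graph 0 - 1 - 2 - 3 - 4. States 0 and 4 carry the two known labels
# (the red and blue points of the post); states 1-3 are unlabeled.
labels = {0: "A", 4: "B"}
neighbors = {1: [0, 2], 2: [1, 3], 3: [2, 4]}   # walks stop at labeled states

def classify(state, n_walks=2000):
    """Start many random walks at `state` and assign the label of the
    absorbing state reached most often -- a Monte Carlo estimate of the
    absorption probabilities of the Markov chain."""
    votes = {"A": 0, "B": 0}
    for _ in range(n_walks):
        s = state
        while s not in labels:                   # walk until absorbed
            s = random.choice(neighbors[s])
        votes[labels[s]] += 1
    return max(votes, key=votes.get)

print(classify(1), classify(3))
```

On a path graph the absorption probabilities are the classic gambler's-ruin values, so state 1 inherits label "A" and state 3 inherits "B"; the post computes the same quantities exactly from a kernelized distance matrix instead of by simulation.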

**Python: Implementing a k-means algorithm with sklearn**

Below is an example of how sklearn in Python can be used to develop a k-means clustering algorithm. The purpose of k-means clustering is to partition the observations in a dataset into a specific number of clusters in order to aid in analysis of the data. From this perspective, it has particular value for data visualisation.
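sklearn’s KMeans is the practical choice; as a sketch of what it computes under the hood, here is a from-scratch Lloyd’s-algorithm version on invented 1D data:

```python
import random

random.seed(3)

# Two well-separated, invented 1D blobs around 0 and 10.
points = ([random.gauss(0.0, 0.5) for _ in range(50)] +
          [random.gauss(10.0, 0.5) for _ in range(50)])

def kmeans_1d(xs, k, iters=20):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    data = sorted(xs)
    # Spread the initial centroids across the data range (a cheap stand-in
    # for the smarter k-means++ initialization sklearn uses by default).
    centroids = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

centers = kmeans_1d(points, 2)
print([round(c, 1) for c in centers])
```

The recovered centers sit near the blob means (0 and 10), which is exactly what `KMeans(n_clusters=2).fit(...)` reports on the same data.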

**Second step with non-linear regression: adding predictors**

In this post we will see how to include the effect of predictors in non-linear regressions. In other words, letting the parameters of non-linear regressions vary according to some explanatory variables (or predictors). Be sure to check the first post on this if you are new to non-linear regressions. The example that I will use throughout this post is the logistic growth function, which is often used in ecology to model population growth. For instance, say you count the number of bacteria cells in a petri dish; in the beginning the cell counts will increase exponentially, but after some time, due to limits in resources (be it space or food), the bacteria population will reach an equilibrium. This produces the classical S-shaped, non-linear, logistic growth function. The logistic growth function has three parameters: the growth rate called “r”, the population size at equilibrium called “K” and the population size at the beginning called “n0”.
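The three-parameter curve itself can be sketched directly (parameter values below are invented for illustration):

```python
import math

def logistic_growth(t, r, K, n0):
    """Population size at time t under logistic growth:
    n(t) = K / (1 + ((K - n0) / n0) * exp(-r * t))."""
    return K / (1.0 + ((K - n0) / n0) * math.exp(-r * t))

# Hypothetical bacteria culture: growth rate r, carrying capacity K,
# initial population n0.
r, K, n0 = 0.8, 1000.0, 10.0
for t in (0, 5, 10, 20):
    print(t, round(logistic_growth(t, r, K, n0)))
```

At t = 0 the curve returns n0 exactly, and for large t it saturates at K — the S-shape the paragraph describes. Letting r, K, or n0 depend on predictors is then just a matter of replacing those constants with functions of the explanatory variables.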

**Matrix Factorization in PyTorch**

Hey, remember when I wrote those ungodly long posts about matrix factorization chock-full of gory math? Good news! You can forget it all. We have now entered the Era of Deep Learning, and automatic differentiation shall be our guiding light. Less facetiously, I have finally spent some time checking out these new-fangled deep learning frameworks, and damn if I am not excited. In this post, I will show you how to use PyTorch to bypass the mess of code from my old post on Explicit Matrix Factorization and instead implement a model that will converge faster in fewer lines of code. But first, let’s take a trip down memory lane.
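PyTorch gets you this via autograd; to recall the underlying model, here is a pure-Python sketch of the same factorization trained with handwritten gradient descent on an invented toy ratings matrix (not the post’s code):

```python
import random

random.seed(0)

# Tiny ratings matrix (0 = unobserved), factorized as R ~ U V^T with
# rank-2 factors, trained by gradient descent on squared error.
R = [[5, 3, 0],
     [4, 0, 1],
     [1, 1, 5]]
n_users, n_items, rank = 3, 3, 2

U = [[random.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
V = [[random.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]

def loss():
    """Squared reconstruction error over the observed entries only."""
    total = 0.0
    for u in range(n_users):
        for i in range(n_items):
            if R[u][i]:
                pred = sum(U[u][f] * V[i][f] for f in range(rank))
                total += (R[u][i] - pred) ** 2
    return total

lr = 0.02
for _ in range(2000):
    for u in range(n_users):
        for i in range(n_items):
            if R[u][i]:
                err = R[u][i] - sum(U[u][f] * V[i][f] for f in range(rank))
                for f in range(rank):
                    # Simultaneous update of both factors from the old values.
                    U[u][f], V[i][f] = (U[u][f] + lr * err * V[i][f],
                                        V[i][f] + lr * err * U[u][f])

print(round(loss(), 3))
```

In PyTorch the inner gradient arithmetic disappears: you declare U and V as parameters, compute the loss, and let `loss.backward()` plus an optimizer step do the rest.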

At Indeed, machine learning is key to our mission of helping people get jobs. Machine learning lets us collect, sort, and analyze millions of job postings a day. In this post, we’ll describe our open-source Java wrapper for a particularly useful machine learning library, and we’ll explain how you can benefit from our work.

Challenges of machine learning

It’s not easy to build a machine learning system. A good system needs to do several things right:

• Feature engineering. For example, converting text to a feature vector requires you to precalculate statistics about words. This process can be challenging.

• Model quality. Most algorithms require hyperparameter tuning, which is usually done through grid search. This process can take hours, making it hard to iterate quickly on ideas.

• Model training for large datasets. The implementations for most algorithms assume that the entire dataset fits in memory in a single process. Extremely large datasets, like those we work with at Indeed, are harder to train.
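The grid-search step mentioned above can be sketched in a few lines; the scoring function below is a made-up stand-in for training and validating a real model:

```python
import itertools

# Hypothetical validation score for a hyperparameter setting; in a real
# system this would train a model and evaluate it on held-out data.
def validation_score(learning_rate, depth):
    return -(learning_rate - 0.1) ** 2 - 0.01 * (depth - 4) ** 2

grid = {
    "learning_rate": [0.01, 0.1, 0.5],
    "depth": [2, 4, 8],
}

# Exhaustive grid search: evaluate every combination, keep the best.
names = list(grid)
best_params, best_score = None, float("-inf")
for values in itertools.product(*(grid[n] for n in names)):
    params = dict(zip(names, values))
    score = validation_score(**params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)
```

The cost is the product of the grid sizes, which is why the paragraph notes that tuning "can take hours" once each evaluation involves real training.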


Max-pooling is a procedure in a neural network which has several benefits. It performs dimensionality reduction by taking a collection of neurons and reducing them to a single value for future layers to receive as input. It can also prevent overfitting, since it takes a large set of inputs and admits only one value, making it harder to memorize the input. In this episode, we discuss the intuitive interpretation of max-pooling and why it’s more common than mean-pooling or (theoretically) quartile-pooling.
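The procedure described above reduces to a few lines; here is a sketch of 2x2 max-pooling with stride 2 on an invented feature map:

```python
def max_pool(grid, size=2):
    """Max-pooling with stride `size`: each output cell keeps only the
    largest value in its window, shrinking the input by `size` per side."""
    rows, cols = len(grid), len(grid[0])
    return [[max(grid[r + dr][c + dc]
                 for dr in range(size) for dc in range(size))
             for c in range(0, cols, size)]
            for r in range(0, rows, size)]

# Invented 4x4 feature map; pooling yields a 2x2 output.
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [1, 2, 7, 2]]
print(max_pool(feature_map))
```

Mean-pooling would replace `max` with an average over the window; the episode discusses why keeping only the maximum is the more common choice.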

**How to Properly Introduce a Neural Network**

I discuss the concept of a “neural network” by providing some examples of recent successes in neural network machine learning algorithms and providing a historical perspective on the evolution of the neural network concept from its biological origins.

**MultiModel: Multi-Task Machine Learning Across Domains**

Over the last decade, the application and performance of Deep Learning has progressed at an astonishing rate. However, the current state of the field is that the neural network architectures are highly specialized to specific domains of application. An important question remains unanswered: Will a convergence between these domains facilitate a unified model capable of performing well across multiple domains? Today, we present MultiModel, a neural network architecture that draws from the success of vision, language and audio networks to simultaneously solve a number of problems spanning multiple domains, including image recognition, translation and speech recognition. While strides have been made in this direction before, namely in Google’s Multilingual Neural Machine Translation System used in Google Translate, MultiModel is a first step towards the convergence of vision, audio and language understanding into a single network. The inspiration for how MultiModel handles multiple domains comes from how the brain transforms sensory input from different modalities (such as sound, vision or taste), into a single shared representation and back out in the form of language or actions. As an analog to these modalities and the transformations they perform, MultiModel has a number of small modality-specific sub-networks for audio, images, or text, and a shared model consisting of an encoder, input/output mixer and decoder, as illustrated below.

**Accelerating Deep Learning Research with the Tensor2Tensor Library**

Deep Learning (DL) has enabled the rapid advancement of many useful technologies, such as machine translation, speech recognition and object detection. In the research community, one can find code open-sourced by the authors to help in replicating their results and further advancing deep learning. However, most of these DL systems use unique setups that require significant engineering effort and may only work for a specific problem or architecture, making it hard to run new experiments and compare the results.

Today, we are happy to release Tensor2Tensor (T2T), an open-source system for training deep learning models in TensorFlow. T2T facilitates the creation of state-of-the-art models for a wide variety of ML applications, such as translation, parsing, image captioning and more, enabling the exploration of various ideas much faster than previously possible. This release also includes a library of datasets and models, including the best models from a few recent papers (Attention Is All You Need, Depthwise Separable Convolutions for Neural Machine Translation and One Model to Learn Them All) to help kick-start your own DL research.


What is Jupyter, and why do you care? After all, Jupyter has never become a buzzword like data science, artificial intelligence, or Web 2.0. Unlike those big abstractions, Jupyter is very concrete. It’s an open source project, a piece of software, that does specific things.

But without attracting the hype, Jupyter Notebooks are revolutionizing the way engineers and data scientists work together. If all important work is collaborative, the most important tools we have are tools for collaboration, tools that make working together more productive.

That’s what Jupyter is, in a nutshell: it’s a tool for collaborating. It’s built for writing and sharing code and text, within the context of a web page. The code runs on a server, and the results are turned into HTML and incorporated into the page you’re writing. That server can be anywhere: on your laptop, behind your firewall, or on the public internet. Your page contains your thoughts, your code, and the results of running the code.


**How to build a color palette from any image with R and k-means algo**

Some weeks ago, I was working on a dataviz to show the results of an analysis I had performed, and I found myself looking at the default ggplot2 palette, which is optimal in terms of discrimination among categories but nevertheless cannot be compared to the wonderful palettes you see employed in art masterpieces like Monet’s Impression, soleil levant or Michelangelo’s Doni Tondo. Those palettes are the product of years and years of study of colour theory and pictorial technique, and, I started thinking, they would do great service if employed as plot palettes, with their complementary colours and balanced sets of hues.
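
The article builds its palettes in R; as a minimal sketch of the core idea, the snippet below runs base R’s kmeans() on a stand-in pixel matrix (random RGB triplets, since reading a real image would need an extra package such as jpeg) and turns the cluster centres into hex colours:

```r
set.seed(42)
# Stand-in for an image: n pixels as rows of (red, green, blue) in [0, 1]
pixels <- matrix(runif(300), ncol = 3)
# Cluster the pixels; the k cluster centres become the palette
fit <- kmeans(pixels, centers = 5, nstart = 10)
palette <- rgb(fit$centers[, 1], fit$centers[, 2], fit$centers[, 3])
palette   # five hex colour codes
```

With a real image, the same call would run on the image’s flattened pixel matrix instead of random data.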

**Data Visualization with googleVis exercises part 3**

This is the third part of our data visualization series, and in this part we will explore the features of two more of the charts that googleVis provides. Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let’s begin!

**Using sparklyr with Microsoft R Server**

The sparklyr package (by RStudio) provides a high-level interface between R and Apache Spark. Among many other things, it allows you to filter and aggregate data in Spark using the dplyr syntax. In Microsoft R Server 9.1, you can now connect to a Spark session using the sparklyr package as the interface, allowing you to combine the data-preparation capabilities of sparklyr and the data-analysis capabilities of Microsoft R Server in the same environment. In a presentation at the Spark Summit (embedded below; you can find the slides here), Ali Zaidi shows how to connect to a Spark session from Microsoft R Server and use the sparklyr package to extract a data set. He then shows how to build predictive models on this data (specifically, a deep neural network and a boosted-trees classifier). He also shows how to build general ensemble models, cross-validate hyperparameters in parallel, and even gives a preview of forthcoming streaming-analysis capabilities.

**Ridge Regression in R Exercises**

The bias-variance tradeoff is always encountered when applying supervised learning algorithms. Least squares regression provides a good fit for the training set but can suffer from high variance, which lowers predictive ability. To counter this problem, we can regularize the beta coefficients by employing a penalization term. Ridge regression applies an l2 penalty to the residual sum of squares; in contrast, LASSO regression, which was covered here previously, applies an l1 penalty. Using ridge regression, we can shrink the beta coefficients towards zero, which reduces variance at the cost of higher bias and can result in better predictive ability than least squares regression. In this exercise set we will use the glmnet package (package description: here) to implement ridge regression in R.
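
The exercises use glmnet, but the shrinkage effect of the l2 penalty can be illustrated in base R with the closed-form ridge solution on toy data (all names and values below are illustrative, not from the exercise set):

```r
set.seed(1)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))   # standardized predictors
y <- X %*% c(2, -1, 0, 0, 1) + rnorm(n)  # toy response
lambda <- 10
# Ridge solution: (X'X + lambda * I)^{-1} X'y
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
beta_ols   <- solve(crossprod(X), crossprod(X, y))
# The penalty shrinks the coefficients towards zero (smaller l2 norm)
sum(beta_ridge^2) < sum(beta_ols^2)   # TRUE
```

In practice glmnet handles the standardization and a whole path of lambda values for you; the point here is only to show what the penalty does to the coefficients.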

**19**
*Monday*
Jun 2017

Posted in Distilled News

**How to create animated GIF images for data visualization using gganimate (in R)?**

I say that because how you create data stories and visualizations has a huge impact on how your customers look at your work. Ultimately, data science is not only about how complicated and sophisticated your models are; it is about solving problems using data-based insights. And in order to implement those solutions, your stakeholders need to understand what you are proposing. One of the challenges in creating effective visualizations is to create images which speak for themselves. This article presents one way to do so: animated GIF images (Graphics Interchange Format). This is particularly helpful when you want to tell time- or flow-based stories. Using animation, you can plot comparable data over time for a specific set of parameters; in other words, it becomes easy to see the growth of a given parameter over time. Let me show this with an example.

**Data Modelling Topologies of a Graph Database**

There is a lot of confusion around the definition of graph databases. In my opinion, any definition that avoids reference to the semantics of nodes and edges, or to their internal structure, is preferable. Definitions that fail to follow this guideline inevitably favor specific implementations, e.g. property graph databases or triple stores, and may leave you myopic to other types that are based on different models, e.g. hypergraph databases, or on different data storage paradigms, e.g. key-value stores. Therefore, I propose we adopt a vendor-neutral definition, such as the following one, which cannot exclude any future type of graph database.

**100 Free Tutorials for Learning R**

R programming language tutorials are listed below, suitable for beginners through advanced users. R is the world’s most widely used programming language for statistical analysis, predictive modeling and data science; its popularity is confirmed in many recent surveys and studies. The R language grows more powerful by the day as the number of supported packages grows, and some big IT companies, such as Microsoft and IBM, have started developing packages for R and offering enterprise versions of it.

**Image Segmentation using deconvolution layer in Tensorflow**

In this series of posts, we shall learn the algorithm for image segmentation and its implementation in TensorFlow. This is the first part of the series, where we focus on understanding and implementing a deconvolutional (fractionally-strided convolutional) layer in TensorFlow.

**Weather Forecast With Regression Models – Part 4**

…

**How to use Windows Linux Subsystem & Win10 side by side for Machine Learning and Coding!**

A large number of open-source machine learning libraries and modules are made available for Linux first, with Windows versions released later. Maintaining two separate OSes in a dual boot, or switching between virtual machines, is not the best way to work. If you already work on Linux only, this guide is not for you. But if you are in a corporate environment with no dual boot and want to run Linux while remaining on the AD server, this is the ideal solution.

**K-means Clustering with Tableau – Call Detail Records Example**

We show how to use Tableau 10’s clustering feature to create statistically-based segments that provide insights about similarities within different groups, and about the performance of the groups when compared to each other.

**Normalization in Deep Learning**

A few days ago (Jun 2017), a 100-page paper on Self-Normalizing Networks appeared. An amazing piece of theoretical work, it claims to have solved the problem of building very large feed-forward networks (FNNs). It builds upon Batch Normalization (BN), introduced in 2015 and now the de facto standard for all CNNs and RNNs, but not so useful for FNNs. What makes normalization so special? It makes very deep networks easier to train, by damping out oscillations in the distribution of activations.

My last entry introduces principal component analysis (PCA), one of many unsupervised learning tools. I concluded the post with a demonstration of principal component regression (PCR), which essentially is an ordinary least squares (OLS) fit using the first k principal components (PCs) from the predictors. This brings about many advantages …

**Automatic tools for improving R packages**

During my talk at RUG BCN, I gave a short introduction to each tool and then applied it to a small package I had created for the occasion. In this post I’ll just briefly present each tool. Most of them are automatic only in the sense that they provide you with a list of things to fix; they won’t do the work for you, sorry. If you have an R package of your own at hand, I’d really advise you to apply them to it and see what you get! I concentrated on tools for improving coding style, package structure, testing and documentation, but not features or performance.

**Non-Standard Evaluation and Function Composition in R**

In this article we will discuss composing standard-evaluation interfaces (SE) and composing non-standard-evaluation interfaces (NSE) in R.

**17**
*Saturday*
Jun 2017

Posted in Distilled News

**Random Effects Neural Networks in Edward and Keras**

Bayesian probabilistic models provide a nimble and expressive framework for modeling ‘small-world’ data. In contrast, deep learning offers a more rigid yet much more powerful framework for modeling data of massive size. Edward is a probabilistic programming library that bridges this gap: ‘black-box’ variational inference enables us to fit extremely flexible Bayesian models to large-scale data. Furthermore, these models themselves may take advantage of classic deep-learning architectures of arbitrary complexity. Edward uses TensorFlow for symbolic gradients and data flow graphs. As such, it interfaces cleanly with other libraries that do the same, namely TF-Slim, PrettyTensor and Keras. Personally, I’ve been working often with the latter, and am consistently delighted by the ease with which it allows me to specify complex neural architectures. The aim of this post is to lay a practical foundation for Bayesian modeling in Edward, then explore how, and how easily, we can extend these models in the direction of classical deep learning via Keras. It will give both a conceptual overview of the models below, as well as notes on the practical considerations of their implementation — what worked and what didn’t. Finally, this post will conclude with concrete ways in which to extend these models further, of which there are many. If you’re just getting started with Edward or Keras, I recommend first perusing the Edward tutorials and Keras documentation respectively. To ‘pull us down the path,’ we build three models in additive fashion: a Bayesian linear regression model, a Bayesian linear regression model with random effects, and a neural network with random effects. We fit them on the Zillow Prize dataset, which asks us to predict logerror (in house-price estimate, i.e. the ‘Zestimate’) given metadata for a list of homes. These models are intended to be demonstrative, not performant: they will not win you the prize in their current form.

**Supercharge your Computer Vision models with the TensorFlow Object Detection API**

At Google, we develop flexible state-of-the-art machine learning (ML) systems for computer vision that not only can be used to improve our products and services, but also spur progress in the research community. Creating accurate ML models capable of localizing and identifying multiple objects in a single image remains a core challenge in the field, and we invest a significant amount of time training and experimenting with these systems.

You’ve probably heard the adage “two heads are better than one.” Well, it applies just as well to machine learning, where the combination of diverse approaches leads to better results. And if you’ve followed Kaggle competitions, you probably also know that this approach, called stacking, has become a staple technique among top Kagglers. In this interview, Marios Michailidis (AKA Competitions Grandmaster KazAnova on Kaggle) gives an intuitive overview of stacking, including its rise in use on Kaggle, and how the resurgence of neural networks led to the genesis of his stacking library introduced here, StackNet. He shares how to make StackNet, a computational, scalable and analytical meta-modeling framework, part of your toolkit, and explains why machine learning practitioners shouldn’t always shy away from complex solutions in their work.

**Understanding deep learning requires re-thinking generalization**

This paper has a wonderful combination of properties: the results are easy to understand, somewhat surprising, and then leave you pondering over what it all might mean for a long while afterwards! By “generalize well,” the authors simply mean “what causes a network that performs well on training data to also perform well on the (held out) test data?” (As opposed to transfer learning, which involves applying the trained network to a related but different problem). If you think about that for a moment, the question pretty much boils down to “why do neural networks work as well as they do?” Generalisation is the difference between just memorising portions of the training data and parroting it back, and actually developing some meaningful intuition about the dataset that can be used to make predictions. So it would be somewhat troubling, would it not, if the answer to the question “why do neural networks work (generalize) as well as they do?” turned out to be “we don’t really know!”

**Design Context for the bot Revolution**

Bots are going to disrupt the software industry in the same way the web and mobile revolutions did. History has taught us that great opportunities arise in these revolutions: we’ve seen how successful companies like Uber, Airbnb, and Salesforce were created as a result of new technology, user experience, and distribution channels. At the end of this book, I hope you will be better prepared to grab these opportunities and design a great product for this bot revolution. Our lives have become full of bots in 2017: I wake up in the morning and ask Amazon’s Alexa (a voice bot by Amazon) to play my favorite bossa nova, Amy (an email bot by x.ai) emails me about today’s meetings, and Slackbot (a bot powered by Slack) sends me a notification to remind me to buy airline tickets to NYC today. Bots are everywhere!

**Using Partial Least Squares to conduct relative importance analysis in Displayr**

Partial Least Squares (PLS) is a popular method for relative importance analysis in fields where the data typically includes more predictors than observations. Relative importance analysis is a general term applied to any technique used for estimating the importance of predictor variables in a regression model. The output is a set of scores which enable the predictor variables to be ranked based upon how strongly each influences the outcome variable. There are a number of different approaches to calculating relative importance analysis including Relative Weights and Shapley Regression as described here and here. In this blog post I briefly describe an alternative method – Partial Least Squares. Because it effectively compresses the data before regression, PLS is particularly useful when the number of predictor variables is more than the number of observations.
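
The post works in Displayr; purely to illustrate how PLS compresses the predictors before regression, here is a hypothetical one-component, NIPALS-style sketch in base R, on toy data with more predictors than observations (everything below is illustrative, not the post’s code):

```r
set.seed(6)
n <- 20; p <- 50                      # more predictors than observations
X <- scale(matrix(rnorm(n * p), n, p))
y <- scale(X[, 1] + X[, 2] + rnorm(n))
# PLS weight vector: covariance of each predictor with the outcome,
# normalized to unit length
w <- crossprod(X, y)
w <- w / sqrt(sum(w^2))
t1 <- X %*% w                         # first PLS component (score vector)
# The magnitude of each weight serves as a crude importance score
head(order(abs(w), decreasing = TRUE))
```

Real PLS implementations iterate this step, deflate X, and extract several components; the point is that the p predictors are reduced to a handful of scores before any regression is run.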

**An easy way to accidentally inflate reported R-squared in linear regression models**

Here is an absolutely horrible way to confuse yourself and get an inflated reported R-squared on a simple linear regression model in R. We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out here.
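
The post’s specific twist involves categorical variable encoding, but the general mechanism is easy to reproduce: adding predictors, even pure noise, never lowers the reported R-squared. A toy illustration:

```r
set.seed(2)
d <- data.frame(y = rnorm(100), x = rnorm(100))
base_fit <- lm(y ~ x, data = d)
# Add 20 pure-noise columns; in-sample R-squared can only go up
noise <- as.data.frame(matrix(rnorm(100 * 20), 100))
big_fit <- lm(y ~ ., data = cbind(d, noise))
summary(base_fit)$r.squared
summary(big_fit)$r.squared   # larger, despite there being no real signal
```

This is why adjusted R-squared or out-of-sample evaluation is the safer yardstick when models differ in the number of effective parameters.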

**Finer Monotonic Binning Based on Isotonic Regression**

In an earlier post (https://…/monotonic-binning-with-smbinning-package), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple of potential drawbacks due to the coarse binning. First of all, the derived Information Value for each binned variable might tend to be low. Secondly, the binned variable might not be granular enough to reflect the nature of the data.
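
This is not the author’s monobin() code, but base R’s isoreg() shows the underlying idea: an isotonic fit of the outcome against the score produces constant runs, and each run can serve as one finer monotonic bin. A toy sketch:

```r
set.seed(3)
score <- sort(runif(200))
bad   <- rbinom(200, 1, score)   # toy outcome: default rate rises with score
fit   <- isoreg(score, bad)      # monotone (non-decreasing) step-function fit
# Each constant run of fitted values defines one monotonic bin
bins  <- rle(fit$yf)
length(bins$lengths)             # number of monotone bins found
```

Because the fitted step function is non-decreasing by construction, the bin event rates are guaranteed monotonic, typically with more (finer) bins than a coarse quantile-based scheme.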

**Set Operations Unions and Intersections in R**

Part 2 of 2 in the series Set Theory
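
For readers new to the topic, base R ships the relevant set functions directly:

```r
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5)
union(A, B)      # 1 2 3 4 5  (elements in A or B, deduplicated)
intersect(A, B)  # 3 4        (elements in both A and B)
setdiff(A, B)    # 1 2        (elements in A but not in B)
```

These operate on the unique elements of their arguments, so duplicates within A or B are silently collapsed.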

**Demo: Real-Time Predictions with Microsoft R Server**

At the R/Finance conference last month, I demonstrated how to operationalize models developed in Microsoft R Server as web services using the mrsdeploy package. Then, I used that deployed model to generate predictions for loan delinquency, using a Python script as the client. (You can see slides here, and a video of the presentation below.)

**Neural networks Exercises (Part-2)**

Neural networks have become a cornerstone of machine learning in the last decade. Created in the late 1940s with the intention of writing computer programs that mimic the way neurons process information, these kinds of algorithms were long believed to be only an academic curiosity, deprived of practical use, since they require a lot of processing power and other machine learning algorithms outperformed them. However, since the mid-2000s, the creation of new neural network types and techniques, coupled with the increased availability of fast computers, has made the neural network a powerful tool that every data analyst or programmer must know. In this series of articles, we’ll see how to fit a neural network with R, learn the core concepts needed to apply those algorithms well, and see how to evaluate whether our model is appropriate for use in production. Today, we’ll practice using the nnet and neuralnet packages to create feedforward neural networks, which we introduced in the last set of exercises. In this type of neural network, all the neurons of the input layer are linked to the neurons of the hidden layer, and all of those are linked to the output layer, as seen in this image. Since there is no cycle in this network, information flows in one direction, from the input layer through the hidden layers to the output layer. For more information about these types of neural networks, you can read this page.
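
As a minimal warm-up for the exercises, nnet (a recommended package shipped with standard R installations) fits a single-hidden-layer feedforward network in one call; the dataset and settings below are illustrative, not taken from the exercise set:

```r
library(nnet)
set.seed(4)
# One hidden layer with 3 units, fit on the built-in iris data
fit  <- nnet(Species ~ ., data = iris, size = 3, trace = FALSE)
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)   # in-sample accuracy
```

The neuralnet package offers a similar formula interface but lets you specify several hidden layers and plot the fitted network.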

**Sampling weights and multilevel modeling in R**

So much has been said about weighting, but in my personal view of statistical inference, you do have to weight. From a single statistic to a complex model, you have to weight, because the probability measure that induces the variation of the sample comes from an (almost always) complex sampling design that you should not ignore. Weighting is a complex issue that has been discussed by several authors in recent years. Social researchers have found no consensus about the appropriateness of weighting when fitting statistical models. Angrist and Pischke (2009, p. 91) claim that few things are as confusing to applied researchers as the role of sample weights: even now, 20 years post-Ph.D., they read the section of the Stata manual on weighting with some dismay. Anyway, despite the lack of consensus on when to weight, the reality is that you have to be careful when doing so. For example, when it comes to estimating totals, means or proportions, you can use the inverse inclusion probability as a weight, and it looks like every social researcher agrees to weight when estimating these kinds of descriptive statistics. The rationale behind this practice is that every unit belonging to the sample represents itself and many others that were not selected. When using weights to estimate model parameters, you have to keep in mind the nature of the sampling design. For example, when estimating multilevel parameters, you have to take into account not only the final sampling unit weights but also the first-stage sampling unit weights. Assume you have a sample of students selected from a national frame of schools. Then we have two sets of weights: the first regarding schools (notice that one selected school represents itself as well as others not in the sample) and the second regarding students.
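
A tiny numeric illustration of the inverse-probability idea (all values hypothetical): a unit sampled with inclusion probability pi stands for 1/pi population units, so ignoring unequal probabilities biases the plain mean towards the over-sampled units.

```r
y  <- c(10, 12, 30, 35)      # observed values for four sampled units
pi <- c(0.5, 0.5, 0.1, 0.1)  # their inclusion probabilities
w  <- 1 / pi                 # design weights: units represented by each observation
weighted.mean(y, w)          # design-weighted estimate of the population mean
mean(y)                      # unweighted mean, biased under this design
```

In a two-stage student-within-school design, the final student weight would combine the school-level and the within-school inclusion probabilities, which is exactly why multilevel models need both sets of weights.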

**16**
*Friday*
Jun 2017

Posted in Distilled News

**Introductory guide to Generative Adversarial Networks (GANs) and their promise!**

Neural networks have made great progress. They now recognize images and voice at levels comparable to humans, and they are able to understand natural language with good accuracy. But even then, the talk of automating human tasks with machines looks a bit far-fetched. After all, we do much more than just recognize images and voices or understand what the people around us are saying – don’t we?

**GOAI: Open GPU-Accelerated Data Analytics**

Recently, Continuum Analytics, H2O.ai, and MapD announced the formation of the GPU Open Analytics Initiative (GOAI). GOAI—also joined by BlazingDB, Graphistry and the Gunrock project from the University of California, Davis—aims to create open frameworks that allow developers and data scientists to build applications using standard data formats and APIs on GPUs. Bringing standard analytics data formats to GPUs will allow data analytics to be even more efficient, and to take advantage of the high throughput of GPUs. NVIDIA believes this initiative is a key contributor to the continued growth of GPU computing in accelerated analytics.

**MobileNets: Open-Source Models for Efficient On-Device Vision**

Deep learning has fueled tremendous progress in the field of computer vision in recent years, with neural networks repeatedly pushing the frontier of visual recognition technology. While many of those technologies such as object, landmark, logo and text recognition are provided for internet-connected devices through the Cloud Vision API, we believe that the ever-increasing computational power of mobile devices can enable the delivery of these technologies into the hands of our users, anytime, anywhere, regardless of internet connection. However, visual recognition for on-device and embedded applications poses many challenges: models must run quickly with high accuracy in a resource-constrained environment making use of limited computation, power and space. Today we are pleased to announce the release of MobileNets, a family of mobile-first computer vision models for TensorFlow, designed to effectively maximize accuracy while being mindful of the restricted resources for an on-device or embedded application. MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use cases. They can be built upon for classification, detection, embeddings and segmentation, similar to how other popular large-scale models, such as Inception, are used.

**The Surprising Complexity of Randomness**

Previously, in a walkthrough on building a simple application without a database, I touched on randomness. Randomness and generating random numbers is a surprisingly deep and important area of computer science, and also one that few outside of computer science know much about. As such, for my own benefit as much as yours, I thought I would take a deeper look at the surprising complexity of randomness.

**Data Scientist: 21st Century Sexiest Job For Free**

It is a well-known fact by now that “data scientist” is the sexiest job of the current century, but how much does it cost to prepare yourself to become one? Are there other jobs relevant to data science, and what is the career pathway for this job? What are the essential skills needed to enter this career? Where can we find free courses, and what are the must-read books in this field? I will try to answer most of these questions in this blog post, since I know these are common questions on the minds of all those considering a data science career.

**Exploratory Factor Analysis – Exercises**

This set of exercises is about exploratory factor analysis. We shall use some basic features of the psych package. For a quick introduction to exploratory factor analysis and the psych package, we recommend this short “how to” guide.

The LASSO has two important uses: the first is forecasting and the second is variable selection. We are going to talk about the second. The variable selection objective is to recover the correct set of variables that generate the data, or at least the best approximation given the candidate variables. The LASSO has attracted a lot of attention lately because it allows us to estimate a linear regression with thousands of variables and have the model select the right ones for us. However, what many people ignore is when the LASSO fails. Like any model, the LASSO relies on assumptions in order to work. The first is sparsity, i.e. only a small number of variables may actually be relevant; if this assumption does not hold, there is no hope of using the LASSO for variable selection. Another assumption is that the irrepresentable condition must hold. This condition may look very technical, but it only says that the relevant variables may not be too correlated with the irrelevant ones.
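
As a toy sketch (not from the original post), the irrepresentable condition can be checked numerically in base R. With S the set of relevant variables, it requires max |X_N' X_S (X_S' X_S)^{-1} sign(beta_S)| < 1, and a high correlation between a relevant and an irrelevant variable pushes this quantity towards, and past, 1:

```r
set.seed(5)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# Irrelevant variable, but highly correlated with the relevant x1
x3 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)
X  <- cbind(x1, x2, x3)
S  <- 1:2                                 # indices of the relevant variables
XS <- X[, S]; XN <- X[, -S, drop = FALSE]
s  <- c(1, 1)                             # signs of the true coefficients
# Irrepresentable quantity: must stay below 1 for consistent selection
irc <- max(abs(t(XN) %*% XS %*% solve(crossprod(XS), s)))
irc
```

With the 0.95 correlation above, irc sits close to the boundary; pushing the correlation higher makes the condition fail and the LASSO starts picking the irrelevant x3.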

**Dynamic Networks: Visualizing Arms Trades Over Time**

I previously made some network graphs of Middle East country relationships here, using Slate’s Middle East Friendship Chart. I was thinking of a way to visualize these relationships (and possibly other ones) with a more rule-based and objective measure over time. What kind of public dataset could show country relationships accurately? I used weapons/arms trades between nations to explore these relationships. I think arms trades are a good indicator of how friendly countries are with each other, because 1) countries won’t sell to enemies, and 2) if a country wants to befriend another, buying weapons is a good way to buy influence.

**Installing R packages with rxInstallPackages in Microsoft R Server**

The MicrosoftML package brings – in my opinion – a long-anticipated function for installing R packages for SQL Server and Microsoft R Server. And I am super happy. Last year, in one of my previous blog posts, I showed how to install an R package from SSMS using sp_execute_external_script. Now, with the new MicrosoftML package (part of Microsoft R Server 9.x and above), a new function is available that lets you install packages easily, and a little bit more.

**14**
*Wednesday*
Jun 2017

Posted in Distilled News

**Office Politics: A survivor’s guide for data scientists (and other techies)**

Everyone gets sunk by office politics at some point in their career, but data scientists are in some ways especially ill-prepared to navigate the unspoken rules and hidden agendas that together form a critical part of the corporate world. There are those who leverage office politics as a tool to advance their careers and increase their power, but I would like to simply discuss a few basic survival skills, so we can focus on doing the interesting analytics work.

**Quantum Computing, Deep Learning, and Artificial Intelligence**

Quantum computing is already being used in deep learning and promises dramatic reductions in processing time and resource utilization to train even the most complex models. Here are a few things you need to know.

**How to Start Incorporating Machine Learning in Enterprises**

The world is long past the Industrial Revolution, and now we are experiencing an era of Digital Revolution. Machine Learning, Artificial Intelligence, and Big Data Analysis are the reality of today’s world. I recently had a chance to talk to Ciaran Dynes, Senior Vice President of Products at Talend, and Justin Mullen, Managing Director at Datalytyx. Talend is a software integration vendor that provides Big Data solutions to enterprises, and Datalytyx is a leading provider of big data engineering, data analytics, and cloud solutions, enabling faster, more effective, and more profitable decision-making throughout an enterprise.

**How to make and share an R package in 3 steps**

If you find yourself often repeating the same scripts in R, you might come to the point where you want to turn them into reusable functions and create your own R package. I recently reached that point and wanted to learn how to build my own R package – as simple as possible.

**Syberia: A development framework for R code in production**

Putting R code into production generally involves orchestrating the execution of a series of R scripts. Even if much of the application logic is encoded into R packages, a run-time environment typically involves scripts to ingest and prepare data, run the application logic, validate the results, and operationalize the output. Managing those scripts, especially in the face of working with multiple R versions, can be a pain — and worse, very complex scripts are difficult to understand and reuse for future applications. That’s where Syberia comes in: an open-source framework created by Robert Krzyzanowski and other engineers at the consumer lending company Avant. There, Syberia has been used by more than 30 developers to build a production data modeling system.

**Data Science for Business – Time Series Forecasting Part 3: Forecasting with Facebook’s Prophet**

Predicting future events/sales/etc. isn’t trivial for a number of reasons and different algorithms use different approaches to handle these problems. Time series data does not behave like a regular numeric vector, because months don’t have the same number of days, weekends and holidays differ between years, etc. Because of this, we often have to deal with multiple layers of seasonality (i.e. weekly, monthly, yearly, irregular holidays, etc.). Regularly missing days, like weekends, are easier to incorporate into time series models than irregularly missing days.