Document worth reading: “Interpreting Blackbox Models via Model Extraction”

Interpretability has become an important issue as machine learning is increasingly used to inform consequential decisions. We propose an approach for interpreting a blackbox model by extracting a decision tree that approximates the model. Our model extraction algorithm avoids overfitting by leveraging blackbox model access to actively sample new training points. We prove that as the number of samples goes to infinity, the decision tree learned using our algorithm converges to the exact greedy decision tree. In our evaluation, we use our algorithm to interpret random forests and neural nets trained on several datasets from the UCI Machine Learning Repository, as well as control policies learned for three classical reinforcement learning problems. We show that our algorithm improves over a baseline based on CART on every problem instance. Furthermore, we show how an interpretation generated by our approach can be used to understand and debug these models. Interpreting Blackbox Models via Model Extraction
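
A minimal, hedged sketch in R of the surrogate-tree idea only, not the authors' active-sampling algorithm: train a random forest as the blackbox, label freshly perturbed inputs with its predictions, and fit a CART tree (rpart) to those predictions.

library(randomForest)
library(rpart)

# Blackbox to be interpreted.
blackbox <- randomForest(Species ~ ., data = iris)

# Crude stand-in for active sampling: perturb existing inputs and label them
# with the blackbox's predictions rather than the true labels.
new_x <- iris[sample(nrow(iris), 500, replace = TRUE), 1:4]
new_x[] <- lapply(new_x, function(col) col + rnorm(length(col), sd = 0.1))
surrogate_data <- cbind(new_x, label = predict(blackbox, new_x))

# The decision tree is the human-readable approximation of the forest.
surrogate <- rpart(label ~ ., data = surrogate_data)
print(surrogate)
mean(predict(surrogate, new_x, type = "class") == surrogate_data$label)   # fidelity to the blackbox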

Book Memo: “Statistical Shape Analysis”

With Applications in R
A thoroughly revised and updated edition of this introduction to modern statistical methods for shape analysis. Shape analysis is an important tool in the many disciplines where objects are compared using geometrical features. Examples include comparing brain shape in schizophrenia, investigating protein molecules in bioinformatics, and describing the growth of organisms in biology. This book is a significant update of the highly regarded ‘Statistical Shape Analysis’ by the same authors. The new edition lays the foundations of landmark shape analysis, including geometrical concepts and statistical techniques, and extends to include analysis of curves, surfaces, images and other types of object data. Key definitions and concepts are discussed throughout, and the relative merits of different approaches are presented. The authors have included substantial new material on recent statistical developments and offer numerous examples throughout the text. Concepts are introduced in an accessible manner, while retaining sufficient detail for more specialist statisticians to appreciate the challenges and opportunities of this new field. Computer code has been included for instructional use, along with exercises to enable readers to implement the applications themselves in R and to follow the key ideas by hands-on analysis. Statistical Shape Analysis: with Applications in R will offer a valuable introduction to this fast-moving research area for statisticians and other applied scientists working in diverse areas, including archaeology, bioinformatics, biology, chemistry, computer science, medicine, morphometrics and image analysis.
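
A small, hedged taste of what the book covers, assuming the ‘shapes’ R package (by the same first author) and its bundled gorf.dat landmark data set:

library(shapes)

data(gorf.dat)             # 8 landmarks x 2 dimensions x 30 female gorilla skulls
gpa <- procGPA(gorf.dat)   # generalised Procrustes analysis
plotshapes(gpa$rotated)    # the registered landmark configurations
gpa$rho                    # Riemannian distances of each specimen to the mean shape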

Magister Dixit

“Today, you are much less likely to face a scenario in which you cannot query data and get a response back in a brief period of time. Analytical processes that used to require months, days, or hours have been reduced to minutes, seconds, and fractions of seconds. But shorter processing times have led to higher expectations. Two years ago, many data analysts thought that generating a result from a query in less than 40 minutes was nothing short of miraculous. Today, they expect to see results in under a minute. That’s practically the speed of thought – you think of a query, you get a result, and you begin your experiment. “It’s about moving with greater speed toward previously unknown questions, defining new insights, and reducing the time between when an event happens somewhere in the world and someone responds or reacts to that event,” says Erickson. A rapidly emerging universe of newer technologies has dramatically reduced data processing cycle time, making it possible to explore and experiment with data in ways that would not have been practical or even possible a few years ago. Despite the availability of new tools and systems for handling massive amounts of data at incredible speeds, however, the real promise of advanced data analytics lies beyond the realm of pure technology. “Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse,” says Michael Minelli, co-author of Big Data, Big Analytics. “It’s about the ability to make better decisions and take meaningful actions at the right time. It’s about detecting fraud while someone is swiping a credit card, or triggering an offer while a shopper is standing on a checkout line, or placing an ad on a website while someone is reading a specific article. It’s about combining and analyzing data so you can take the right action, at the right time, and at the right place.” For some, real-time big data analytics (RTBDA) is a ticket to improved sales, higher profits and lower marketing costs. To others, it signals the dawn of a new era in which machines begin to think and respond more like humans.” Mike Barlow (2013)

If you did not already know

Deep Rotation Equivariant Network (DREN) google
Recently, learning equivariant representations has attracted considerable research attention. Dieleman et al. introduce four operations which can be inserted into a CNN to learn deep representations equivariant to rotation. However, in their approach the feature maps must be copied and rotated four times in each layer, which incurs considerable running time and memory overhead. To address this problem, we propose the Deep Rotation Equivariant Network (DREN), consisting of cycle layers, isotonic layers and decycle layers. Our proposed layers apply the rotation transformation to filters rather than feature maps, achieving a speed-up of more than 2 times with even less memory overhead. We evaluate DRENs on the Rotated MNIST and CIFAR-10 datasets and demonstrate that they can improve the performance of state-of-the-art architectures. Our code is released on GitHub. …
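
A toy illustration of the central trick (not the authors' code): rotating the small convolution kernel itself, instead of copying and rotating the feature maps.

# 90-degree clockwise rotation of a matrix.
rotate90 <- function(k) t(apply(k, 2, rev))

# A Sobel-like edge filter as the example kernel.
kernel <- matrix(c(1, 0, -1,
                   2, 0, -2,
                   1, 0, -1), nrow = 3, byrow = TRUE)

# The four rotated copies a rotation-equivariant layer would share weights across.
kernels <- list(kernel,
                rotate90(kernel),
                rotate90(rotate90(kernel)),
                rotate90(rotate90(rotate90(kernel))))
lapply(kernels, print)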

Semantic Matching google
Semantic matching is a technique used in computer science to identify information which is semantically related. Given any two graph-like structures, e.g. classifications, taxonomies, database or XML schemas and ontologies, matching is an operator which identifies those nodes in the two structures which semantically correspond to one another. For example, applied to file systems it can identify that a folder labeled “car” is semantically equivalent to another folder “automobile” because they are synonyms in English. This information can be taken from a linguistic resource like WordNet. In recent years many such matching operators have been proposed. S-Match is an example of a semantic matching operator. It works on lightweight ontologies, namely graph structures where each node is labeled by a natural language sentence, for example in English. These sentences are translated into a formal logical formula (according to an artificial unambiguous language) codifying the meaning of the node, taking into account its position in the graph. For example, if the folder “car” is under another folder “red”, the meaning of the folder “car” becomes “red car”, which is translated into the logical formula “red AND car”. The output of S-Match is a set of semantic correspondences called mappings, each attached with one of the following semantic relations: disjointness (⊥), equivalence (≡), more specific (⊑) and less specific (⊒). In our example the algorithm will return a mapping between “car” and “automobile” attached with an equivalence relation. Information semantically matched can also be used as a measure of relevance through a mapping of near-term relationships. Such use of S-Match technology is prevalent in the career space, where it is used to gauge depth of skills through relational mapping of information found in applicant resumes. Semantic matching is a fundamental technique in many application areas such as resource discovery, data integration, data migration, query translation, peer-to-peer networks, agent communication, and schema and ontology merging. Its use is also being investigated in other areas such as event processing. In fact, it has been proposed as a valid solution to the semantic heterogeneity problem, namely managing diversity in knowledge. Interoperability among people of different cultures and languages, having different viewpoints and using different terminology, has always been a huge problem. Especially with the advent of the Web and the consequent information explosion, the problem has become even more pronounced. People face the concrete problem of retrieving, disambiguating and integrating information coming from a wide variety of sources. …
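
A hand-rolled toy in R, far simpler than a real S-Match implementation, showing how label relations could be decided from a small synonym and hypernym table (all names here are made up for illustration):

synonyms  <- list(car = c("car", "automobile"))
hypernyms <- list(car = "vehicle", automobile = "vehicle")

match_labels <- function(a, b) {
  if (any(b %in% synonyms[[a]]) || any(a %in% synonyms[[b]])) return("equivalence")
  if (identical(hypernyms[[a]], b)) return("more specific")
  if (identical(hypernyms[[b]], a)) return("less specific")
  "disjoint or unknown"
}

match_labels("car", "automobile")   # "equivalence"
match_labels("car", "vehicle")      # "more specific"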

Waterfall Plot google
A waterfall plot is a three-dimensional plot in which multiple curves of data, typically spectra, are displayed simultaneously. Typically the curves are staggered both across the screen and vertically, with ‘nearer’ curves masking the ones behind. The result is a series of ‘mountain’ shapes that appear to be side by side. The waterfall plot is often used to show how two-dimensional information changes over time or some other variable such as rpm. The term ‘waterfall plot’ is sometimes used interchangeably with ‘spectrogram’ or ‘Cumulative Spectral Decay’ (CSD) plot. …
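
A quick base-R sketch of the idea: several synthetic spectra staggered across and up the page, drawn from back to front so that nearer curves mask the ones behind.

freq   <- seq(0, 10, length.out = 200)
nslice <- 15

plot(NULL, xlim = c(0, 10 + nslice * 0.3), ylim = c(0, nslice * 0.5 + 2),
     xlab = "frequency", ylab = "time (offset)", main = "Waterfall plot")
for (i in nslice:1) {                    # far slices first, near slices last
  spectrum <- exp(-(freq - 3 - i * 0.15)^2) + 0.3 * exp(-(freq - 7)^2)
  x <- freq + i * 0.3                    # horizontal stagger
  y <- spectrum + i * 0.5                # vertical stagger
  polygon(c(x, rev(x)), c(y, rep(i * 0.5, length(x))),
          col = "white", border = NA)    # blank out whatever lies behind
  lines(x, y)
}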

What’s new on arXiv

Unsupervised Learning Layers for Video Analysis

This paper presents two unsupervised learning layers (UL layers) for label-free video analysis: one for fully connected layers, and the other for convolutional ones. The proposed UL layers can play two roles: they can be the cost function layer for providing global training signal; meanwhile they can be added to any regular neural network layers for providing local training signals and combined with the training signals backpropagated from upper layers for extracting both slow and fast changing features at layers of different depths. Therefore, the UL layers can be used in either pure unsupervised or semi-supervised settings. Both a closed-form solution and an online learning algorithm for two UL layers are provided. Experiments with unlabeled synthetic and real-world videos demonstrated that the neural networks equipped with UL layers and trained with the proposed online learning algorithm can extract shape and motion information from video sequences of moving objects. The experiments demonstrated the potential applications of UL layers and online learning algorithm to head orientation estimation and moving object localization.


Proximity Variational Inference

Variational inference is a powerful approach for approximate posterior inference. However, it is sensitive to initialization and can be subject to poor local optima. In this paper, we develop proximity variational inference (PVI). PVI is a new method for optimizing the variational objective that constrains subsequent iterates of the variational parameters to robustify the optimization path. Consequently, PVI is less sensitive to initialization and optimization quirks and finds better local optima. We demonstrate our method on three proximity statistics. We study PVI on a Bernoulli factor model and sigmoid belief network with both real and synthetic data and compare to deterministic annealing (Katahira et al., 2008). We highlight the flexibility of PVI by designing a proximity statistic for Bayesian deep learning models such as the variational autoencoder (Kingma and Welling, 2014; Rezende et al., 2014). Empirically, we show that PVI consistently finds better local optima and gives better predictive performance.


Approximation and Convergence Properties of Generative Adversarial Learning

Generative adversarial networks (GAN) approximate a target data distribution by jointly optimizing an objective function through a ‘two-player game’ between a generator and a discriminator. Despite their empirical success, however, two very basic questions on how well they can approximate the target distribution remain unanswered. First, it is not known how restricting the discriminator family affects the approximation quality. Second, while a number of different objective functions have been proposed, we do not understand when convergence to the global minima of the objective function leads to convergence to the target distribution under various notions of distributional convergence. In this paper, we address these questions in a broad and unified setting by defining a notion of adversarial divergences that includes a number of recently proposed objective functions. We show that if the objective function is an adversarial divergence with some additional conditions, then using a restricted discriminator family has a moment-matching effect. Additionally, we show that for objective functions that are strict adversarial divergences, convergence in the objective function implies weak convergence, thus generalizing previous results.


Neural Decomposition of Time-Series Data for Effective Generalization

We present a neural network technique for the analysis and extrapolation of time-series data called Neural Decomposition (ND). Units with a sinusoidal activation function are used to perform a Fourier-like decomposition of training samples into a sum of sinusoids, augmented by units with nonperiodic activation functions to capture linear trends and other nonperiodic components. We show how careful weight initialization can be combined with regularization to form a simple model that generalizes well. Our method generalizes effectively on the Mackey-Glass series, a dataset of unemployment rates as reported by the U.S. Department of Labor Statistics, a time-series of monthly international airline passengers, the monthly ozone concentration in downtown Los Angeles, and an unevenly sampled time-series of oxygen isotope measurements from a cave in north India. We find that ND outperforms popular time-series forecasting techniques including LSTM, echo state networks, ARIMA, SARIMA, SVR with a radial basis function, and Gashler and Ashmore’s model.
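
The paper's network is more than a fixed Fourier basis, but the flavour of the decomposition can be sketched with a plain linear model in R: sinusoidal terms plus a linear trend fitted to the airline-passenger series mentioned in the abstract, then extrapolated.

y <- as.numeric(AirPassengers)          # monthly international airline passengers, 1949-1960
t <- seq_along(y)
basis <- data.frame(y = y, t = t,
                    s12 = sin(2 * pi * t / 12), c12 = cos(2 * pi * t / 12),
                    s6  = sin(2 * pi * t / 6),  c6  = cos(2 * pi * t / 6))
fit <- lm(y ~ ., data = basis)          # linear trend plus annual and semi-annual sinusoids

t_new <- max(t) + 1:24                  # extrapolate two years ahead
newb  <- data.frame(t = t_new,
                    s12 = sin(2 * pi * t_new / 12), c12 = cos(2 * pi * t_new / 12),
                    s6  = sin(2 * pi * t_new / 6),  c6  = cos(2 * pi * t_new / 6))
plot(c(t, t_new), c(y, predict(fit, newb)), type = "l",
     xlab = "month index", ylab = "passengers")
lines(t, y, col = "blue")               # observed portion in blue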


Towards Consistency of Adversarial Training for Generative Models

This work presents a rigorous statistical analysis of adversarial training for generative models, advancing recent work by Arjovsky and Bottou [2]. A key element is the distinction between the objective function with respect to the (unknown) data distribution, and its empirical counterpart. This yields a straightforward explanation for common pathologies in practical adversarial training such as vanishing gradients. To overcome such issues, we pursue the idea of smoothing the Jensen-Shannon Divergence (JSD) by incorporating noise in the formulation of the discriminator. As we show, this effectively leads to an empirical version of the JSD in which the true and the generator densities are replaced by kernel density estimates. We analyze statistical consistency of this objective, and demonstrate its practical effectiveness.


Neural Attribute Machines for Program Generation

Recurrent neural networks have achieved remarkable success at generating sequences with complex structures, thanks to advances that include richer embeddings of input and cures for vanishing gradients. Trained only on sequences from a known grammar, though, they can still struggle to learn rules and constraints of the grammar. Neural Attribute Machines (NAMs) are equipped with a logical machine that represents the underlying grammar, which is used to teach the constraints to the neural machine by (i) augmenting the input sequence, and (ii) optimizing a custom loss function. Unlike traditional RNNs, NAMs are exposed to the grammar, as well as samples from the language of the grammar. During generation, NAMs make significantly fewer violations of the constraints of the underlying grammar than RNNs trained only on samples from the language of the grammar.


Geometric Methods for Robust Data Analysis in High Dimension

Machine learning and data analysis now find both scientific and industrial application in biology, chemistry, geology, medicine, and physics. These applications rely on large quantities of data gathered from automated sensors and user input. Furthermore, the dimensionality of many datasets is extreme: more and more details are being gathered about single user interactions or sensor readings. All of these applications encounter problems with a common theme: using observed data to make inferences about the world. Our work obtains the first provably efficient algorithms for Independent Component Analysis (ICA) in the presence of heavy-tailed data. The main tool in this result is the centroid body (a well-known topic in convex geometry), along with optimization and random walks for sampling from a convex body. This is the first algorithmic use of the centroid body and it is of independent theoretical interest, since it effectively replaces the estimation of covariance from samples, and is more generally accessible. We then use ICA as a black box for learning an intersection of halfspaces: this reduction relies on a non-linear transformation of samples from such an intersection of halfspaces (i.e. a simplex) to samples which are approximately from a linearly transformed product distribution. Through this transformation of samples, which can be done efficiently, one can then use an ICA algorithm to recover the vertices of the intersection of halfspaces. Finally, we again use ICA as an algorithmic primitive to construct an efficient solution to the widely-studied problem of learning the parameters of a Gaussian mixture model. Our algorithm again transforms samples from a Gaussian mixture model into samples which fit into the ICA model and, when processed by an ICA algorithm, result in recovery of the mixture parameters. Our algorithm is effective even when the number of Gaussians in the mixture grows polynomially with the ambient dimension.
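
The thesis results concern heavy-tailed data, which vanilla ICA routines do not handle; still, a short reminder of what ICA itself does may help, using the 'fastICA' package on an easy synthetic example:

library(fastICA)

set.seed(1)
s <- cbind(sin(seq(0, 20, length.out = 1000)),   # two independent sources
           runif(1000) - 0.5)
a <- matrix(c(1, 0.6, 0.4, 1), 2, 2)             # unknown mixing matrix
x <- s %*% a                                     # observed mixtures

ica <- fastICA(x, n.comp = 2)
par(mfrow = c(2, 2))
plot(s[, 1],     type = "l", main = "source 1")
plot(s[, 2],     type = "l", main = "source 2")
plot(ica$S[, 1], type = "l", main = "recovered 1")
plot(ica$S[, 2], type = "l", main = "recovered 2")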


Who Will Share My Image? Predicting the Content Diffusion Path in Online Social Networks

Content popularity prediction has been extensively studied due to its importance and interest for both users and hosts of social media sites like Facebook, Instagram, Twitter, and Pinterest. However, existing work mainly focuses on modeling popularity using a single metric such as the total number of likes or shares. In this work, we propose Diffusion-LSTM, a memory-based deep recurrent network that learns to recursively predict the entire diffusion path of an image through a social network. By combining user social features and image features, and encoding the diffusion path taken thus far with an explicit memory cell, our model predicts the diffusion path of an image more accurately compared to alternate baselines that either encode only image or social features, or lack memory. By mapping individual users to user prototypes, our model can generalize to new users not seen during training. Finally, we demonstrate our model’s capability of generating diffusion trees, and show that the generated trees closely resemble ground-truth trees.


Implicit Regularization in Matrix Factorization

We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix X with gradient descent on a factorization of X. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.
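
A small numerical sketch of the conjecture (step size and iteration count are ad hoc and may need tuning): run gradient descent on a full-dimensional factorization X = U t(V) for a partially observed low-rank matrix, starting near the origin, and inspect the nuclear norm of the result.

set.seed(42)
n <- 10; r <- 2
X_true <- matrix(rnorm(n * r), n, r) %*% matrix(rnorm(r * n), r, n)   # rank-2 target
mask   <- matrix(runif(n * n) < 0.4, n, n)                            # observed entries

U <- matrix(rnorm(n * n, sd = 1e-3), n, n)   # full-dimensional factors, tiny initialization
V <- matrix(rnorm(n * n, sd = 1e-3), n, n)
eta <- 0.02
for (it in 1:10000) {
  R  <- (U %*% t(V) - X_true) * mask         # residual on observed entries only
  gU <- R %*% V
  gV <- t(R) %*% U
  U  <- U - eta * gU
  V  <- V - eta * gV
}
X_hat <- U %*% t(V)
sum(svd(X_hat)$d)    # nuclear norm of the gradient-descent solution
sum(svd(X_true)$d)   # nuclear norm of the rank-2 matrix that generated the data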


Consistent Kernel Density Estimation with Non-Vanishing Bandwidth

Exploring the Regularity of Sparse Structure in Convolutional Neural Networks

Attention-based Natural Language Person Retrieval

Counterfactual Multi-Agent Policy Gradients

Compiling Quantum Circuits to Realistic Hardware Architectures using Temporal Planners

Adaptive Estimation of High Dimensional Partially Linear Model

Doubly Stochastic Variational Inference for Deep Gaussian Processes

Visual Servoing from Deep Neural Networks

Dual Dynamic Programming with cut selection: convergence proof and numerical experiments

Joint PoS Tagging and Stemming for Agglutinative Languages

Novel Deep Convolution Neural Network Applied to MRI Cardiac Segmentation

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

New Results for Provable Dynamic Robust PCA

Efficient, Safe, and Probably Approximately Complete Learning of Action Models

Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo

Communication vs Distributed Computation: an alternative trade-off curve

Logic Tensor Networks for Semantic Image Interpretation

Optimal Cooperative Inference

Cultural Diffusion and Trends in Facebook Photographs

The Onsager-Machlup functional associated with additive fractional noise

Multicut decomposition methods with cut selection for multistage stochastic programs

Automatic sequences and generalised polynomials

Modeling The Intensity Function Of Point Process Via Recurrent Neural Networks

Plug-and-Play Unplugged: Optimization Free Reconstruction using Consensus Equilibrium

The Dual Graph Shift Operator: Identifying the Support of the Frequency Domain

Matroids Hitting Sets and Unsupervised Dependency Grammar Induction

State Space Decomposition and Subgoal Creation for Transfer in Deep Reinforcement Learning

Large induced subgraphs with $k$ vertices of almost maximum degree

Extraction and Classification of Diving Clips from Continuous Video Footage

Principled Hybrids of Generative and Discriminative Domain Adaptation

The tessellation problem of quantum walks

Learning to Pour

Spectrum Sharing and Cyclical Multiple Access in UAV-Aided Cellular Offloading

Online Edge Grafting for Efficient MRF Structure Learning

Fast Causal Inference with Non-Random Missingness by Test-Wise Deletion

Lat-Net: Compressing Lattice Boltzmann Flow Simulations using Deep Neural Networks

Deriving Neural Architectures from Sequence and Graph Kernels

A Conic Integer Programming Approach to Constrained Assortment Optimization under the Mixed Multinomial Logit Model

Energy-Efficient Multi-Pair Two-Way AF Full-Duplex Massive MIMO Relaying

Cross-Domain Perceptual Reward Functions

Expectation Propagation for t-Exponential Family Using Q-Algebra

Convergence of Langevin MCMC in KL-divergence

A Clustering-based Consistency Adaptation Strategy for Distributed SDN Controllers

Weakly Supervised Semantic Segmentation Based on Co-segmentation

Circular law for the sum of random permutation matrices

Max-Cosine Matching Based Neural Models for Recognizing Textual Entailment

The cost of fairness in classification

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

A Spatial Branch-and-Cut Method for Nonconvex QCQP with Bounded Complex Variables

An Empirical Analysis of Approximation Algorithms for the Euclidean Traveling Salesman Problem

Vector Transport-Free SVRG with General Retraction for Riemannian Optimization: Complexity Analysis and Practical Implementation

Triangle Finding and Listing in CONGEST Networks

MagNet: a Two-Pronged Defense against Adversarial Examples

Gaps between avalanches in 1D Random Field Ising Models

Load Balancing for Skewed Streams on Heterogeneous Cluster

Wireless Powered Communications with Finite Battery and Finite Blocklength

Port-Hamiltonian descriptor systems

Dynamic degree-corrected blockmodels for social networks: a nonparametric approach

Performance Optimization of Co-Existing Underlay Secondary Networks

Recent progress in many-body localization

SLAM based Quasi Dense Reconstruction For Minimally Invasive Surgery Scenes

A matrix-based method of moments for fitting multivariate network meta-analysis models with multiple outcomes and random inconsistency effects

The structure of delta-matroids with width one twists

Topology Induced Oscillations in Majorana Fermions in a Quasiperiodic Superconducting Chain

First-spike based visual categorization using reward-modulated STDP

Deep image representations using caption generators

Distributionally Robust Optimisation in Congestion Control

Cut-norm and entropy minimization over weak* limits

Boolean dimension and local dimension

Shorter stabilizer circuits via Bruhat decomposition and quantum circuit transformations

On the (parameterized) complexity of recognizing well-covered (r,l)-graphs

Investigation of Using VAE for i-Vector Speaker Verification

Jointly Learning Sentence Embeddings and Syntax with Unsupervised Tree-LSTMs

Classification of Quantitative Light-Induced Fluorescence Images Using Convolutional Neural Network

Firing rate equations require a spike synchrony mechanism to correctly describe fast oscillations in inhibitory networks

Learning Structured Text Representations

A simplicial decomposition framework for large scale convex quadratic programming

Hypergeometric and basic hypergeometric series and integrals associated with root systems

Geometry of time-reversible group-based models

Asynchronous Parallel Bayesian Optimisation via Thompson Sampling

GSplit LBI: Taming the Procedural Bias in Neuroimaging for Disease Prediction

Arrangements of homothets of a convex body II

On the Cauchy problem for integro-differential equations in the scale of spaces of generalized smoothness

Quantum-secured blockchain

Entanglement properties of quantum grid states

Flux-dependent localisation in a disordered flat-band lattice

Is Our Model for Contention Resolution Wrong?

Filtering Variational Objectives

Gated XNOR Networks: Deep Neural Networks with Ternary Weights and Activations under a Unified Discretization Framework

Document worth reading: “Living Together: Mind and Machine Intelligence”

In this paper we consider the nature of the machine intelligences we have created in the context of our human intelligence. We suggest that the fundamental difference between human and machine intelligence comes down to embodiment factors. We define embodiment factors as the ratio between an entity’s ability to communicate information and its ability to compute information. We speculate on the role of embodiment factors in driving our own intelligence and consciousness. We briefly review dual process models of cognition and cast machine intelligence within that framework, characterising it as a dominant System Zero, which can drive behaviour through interfacing with us subconsciously. Driven by concerns about the consequences of such a system, we suggest prophylactic courses of action that could be considered. Our main conclusion is that it is not sentient intelligence we should fear but non-sentient intelligence. Living Together: Mind and Machine Intelligence

Book Memo: “Health 4.0”

How Virtualization and Big Data are Revolutionizing Healthcare
This book describes how the creation of new digital services, through vertical and horizontal integration of data coming from sensors on top of existing legacy systems, which has already had a major impact on industry, is now extending to healthcare. The book describes the fourth industrial revolution (i.e. Health 4.0), which is based on virtualization and service aggregation. It shows how sensors, embedded systems, and cyber-physical systems are fundamentally changing the way industrial processes work, their business models, and how we consume, while also affecting the health and care domains. Chapters describe the technology behind the shift of the point of care to the point of need and away from hospitals and institutions; how care will be delivered virtually outside hospitals; how services will be tailored to individuals rather than being designed as statistical averages; how data analytics will be used to help patients manage their chronic conditions with the help of smart devices; and how pharmaceuticals will become interactive to help prevent adverse reactions. The topics presented will have an impact on a variety of healthcare stakeholders in a continuously global and hyper-connected world.

Distilled News

A comprehensive beginners guide to Linear Algebra for Data Scientists

One of the most common questions we get on Analytics Vidhya is: ‘How much maths do I need to learn to be a data scientist?’ Even though the question sounds simple, there is no simple answer to it. Usually, we say that you need to know basic descriptive and inferential statistics to start, and that is a good start. But once you have covered the basic concepts in machine learning, you will need to learn some more math. You need it to understand how these algorithms work, what their limitations are, and whether they make any underlying assumptions. Now, there could be a lot of areas to study, including algebra, calculus, statistics, 3-D geometry etc. If you get confused (like I did) and ask experts what you should learn at this stage, most of them would suggest / agree that you go ahead with Linear Algebra. But the problem does not stop there. The next challenge is to figure out how to learn Linear Algebra. You can get lost in the detailed mathematics and derivations, and learning them would not help as much! I went through that journey myself and hence decided to write this comprehensive guide. If you have faced this question about how to learn and what to learn in Linear Algebra, you are at the right place. Just follow this guide.


Two Sigma Financial Modeling Challenge, Winner’s Interview: 2nd Place, Nima Shahbazi, Chahhou Mohamed

Our Two Sigma Financial Modeling Challenge ran from December 2016 to March 2017. Asked to search for signal in financial markets data with limited hardware and computational time, this competition attracted over 2,000 competitors. In this winners’ interview, 2nd place winners Nima and Chahhou describe how paying close attention to unreliable engineered features was important to building a successful model.


Regular Expression & Treemaps to Visualize Emergency Department Visits

It’s been a while since my last post on some TB WHO data. A lot has happened since then, including the opportunity to attend the Open Data Science Conference (ODSC) East held in Boston, MA. Over a two-day period I had the opportunity to listen to a number of leaders in various industries and fields. It was inspiring to learn about the wide variety of data science applications, ranging from finance and marketing to genomics and even the refugee crisis. One of the workshops at ODSC was on text analytics, which includes basic text processing, dendrograms, natural language processing and sentiment analysis. This gave me the idea of applying some text analytics to visualize some data I was working on last summer. In this post I’m going to walk through how I used regular expressions to label classification codes in a large dataset (NHAMCS) representing emergency department visits in the United States and eventually visualize the data.
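
Not the post's actual NHAMCS code, but a toy R version of the same two steps, assuming the 'treemap' package: grepl() buckets visit codes into labels, and treemap() visualises the counts.

library(treemap)

visits <- data.frame(code = c("250.0", "250.1", "410.1", "410.9", "786.5", "786.2"),
                     n    = c(120, 45, 30, 22, 210, 80))

# Regular expressions on the leading digits assign each code to a group
# (labels here are purely illustrative).
visits$label <- ifelse(grepl("^250", visits$code), "Diabetes",
                ifelse(grepl("^410", visits$code), "Acute MI", "Chest / respiratory symptoms"))

treemap(visits, index = "label", vSize = "n",
        title = "ED visits by diagnosis group")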


A recipe for dynamic programming in R: Solving a “quadruel” case from PuzzlOR

As also described in Cormen et al. (2009), p. 65, in algorithm design the divide-and-conquer paradigm incorporates a recursive approach in which the main problem is:
• Divided into smaller sub-problems (divide),
• The sub-problems are solved (conquer),
• And the solutions to sub-problems are combined to solve the original and “bigger” problem (combine).
Instead of constructing an indefinite number of nested loops, destroying the readability of the code and the performance of execution, the “recursive” way utilizes just one block of code which calls itself (hence the term “recursive”) for the smaller problem. The main point is to define a “stop” rule, so that the function does not sink into an infinite recursion depth. While nested loops modify the same object (or address space, in the low-level sense), recursion moves the “stack pointer”, so each recursion depth uses a different part of the stack (a copy of the objects will be created for each recursion). This illustrates a well-known trade-off in algorithm design: memory versus performance; recursion enhances performance at the expense of using more memory.
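
A minimal sketch of the recursive pattern in R, with a memo environment so that repeated sub-problems are solved only once; Fibonacci stands in for the PuzzlOR “quadruel” problem itself.

memo <- new.env()

fib <- function(n) {
  if (n <= 1) return(n)                                  # stop rule: smallest sub-problems
  key <- as.character(n)
  if (exists(key, envir = memo, inherits = FALSE)) {
    return(get(key, envir = memo))                       # reuse an already-solved sub-problem
  }
  result <- fib(n - 1) + fib(n - 2)                      # divide, conquer, combine
  assign(key, result, envir = memo)
  result
}

fib(40)   # fast, even though the naive recursion would be exponential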


Microsoft R Open 3.4.0 now available

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.4.0 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to R 3.4.0, reduces the size of the installer image, and updates the bundled packages. R 3.4.0 (upon which MRO 3.4.0 is based) is a major update to the R language, with many fixes and improvements. Most notably, R 3.4.0 introduces a just-in-time (JIT) compiler to improve performance of the scripts and functions that you write. There have been a few minor tweaks to the language itself, but in general functions and packages written for R 3.3.x should work the same in R 3.4.0. As usual, MRO points to a fixed CRAN snapshot from May 1 2017, but you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages). MRO is supported on Windows, Mac and Linux (Ubuntu, RedHat/CentOS, and SUSE). MRO 3.4.0 is 100% compatible with R 3.4.0, and you can use it with any of the 10,000+ packages available on CRAN. Here are some highlights of new packages released since the last MRO update. We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.
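
For example, pinning package versions to a CRAN date with the bundled checkpoint package takes a single call (the later date below is only an illustration):

library(checkpoint)
checkpoint("2017-05-01")    # packages exactly as they were on the default MRO snapshot date
# checkpoint("2017-06-15")  # or a later date, to pick up newer packages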


What Kaggle has learned from almost a million data scientists

This is a keynote highlight from the Strata Data Conference in London 2017.


Data science and deep learning in retail

In this episode of the Data Show, I spoke with Jeremy Stanley, VP of data science at Instacart, a popular grocery delivery service that is expanding rapidly. As Stanley describes it, Instacart operates a four-sided marketplace composed of retail stores, products within the stores, shoppers assigned to the stores, and customers who order from Instacart. The objective is to get fresh groceries from popular retailers delivered to customers in a timely fashion. Instacart’s goals land them in the center of the many opportunities and challenges involved in building high-impact data products.


Running a word count application using Spark

This is a highlight from Ted Malaska’s Introduction to Apache Spark for Java and Scala developers.


DataScience.com Releases Python Package for Interpreting the Decision-Making Processes of Predictive Models

DataScience.com new Python library, Skater, uses a combination of model interpretation algorithms to identify how models leverage data to make predictions.


How A Data Scientist Can Improve His Productivity

Data science and machine learning are iterative processes. It is never possible to successfully complete a data science project in a single pass. A data scientist constantly tries new ideas and changes steps of his pipeline:
1. extract new features and accidentally find noise in the data
2. clean up the noise, find one more promising feature
3. extract the new feature
4. rebuild and validate the model, realize that the learning algorithm parameters are not perfect for the new feature set
5. change machine learning algorithm parameters and retrain the model
6. find the ineffective feature subset and remove it from the feature set
7. try a few more new features
8. try another ML algorithm. And then a data format change is required.
This is only a small episode in a data scientist’s daily life and it is what makes our job different from a regular engineering job.


Deep learning for natural language processing, Part 1

The machine learning revolution leaves no stone unturned. Natural language processing is yet another field that underwent a small revolution thanks to the second coming of artificial neural networks. Let’s just briefly discuss two advances in the natural language processing toolbox made thanks to artificial neural networks and deep learning techniques.


Python vs R. Which language should you choose?

Data science is an interdisciplinary field where scientific techniques from statistics, mathematics, and computer science are used to analyze data and solve problems more accurately and effectively. It is no wonder, then, that languages such as R and Python, with their extensive packages and libraries that support statistical methods and machine learning algorithms, are cornerstones of the data science revolution. Oftentimes, beginners find it hard to decide which language to learn first. This guide will help you make that decision.


A practical explanation of a Naive Bayes classifier

The simplest solutions are usually the most powerful ones, and Naive Bayes is good proof of that. In spite of the great advances in machine learning over the last years, it has proven to be not only simple but also fast, accurate and reliable. It has been successfully used for many purposes, but it works particularly well with natural language processing (NLP) problems. Naive Bayes is a family of probabilistic algorithms that take advantage of probability theory and Bayes’ Theorem to predict the category of a sample (like a piece of news or a customer review). They are probabilistic, which means that they calculate the probability of each category for a given sample, and then output the category with the highest one. The way they get these probabilities is by using Bayes’ Theorem, which describes the probability of a feature based on prior knowledge of conditions that might be related to that feature. We’re going to be working with an algorithm called Multinomial Naive Bayes. We’ll walk through the algorithm applied to NLP with an example, so by the end not only will you know how this method works, but also why it works. Then we’ll lay out a few advanced techniques that can make Naive Bayes competitive with more complex machine learning algorithms, such as SVMs and neural networks.
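
To make the arithmetic concrete before the full walkthrough, here is a hand-rolled multinomial Naive Bayes on a four-sentence toy corpus in R (a real system would use a proper text pipeline):

train <- data.frame(
  text  = c("great goal and great match", "boring film", "what a match", "terrible film plot"),
  class = c("sports", "cinema", "sports", "cinema"),
  stringsAsFactors = FALSE)

tokens <- strsplit(tolower(train$text), "\\s+")
vocab  <- unique(unlist(tokens))

# Word counts per class with Laplace (add-one) smoothing, then P(word | class).
counts <- sapply(split(tokens, train$class),
                 function(docs) table(factor(unlist(docs), levels = vocab)) + 1)
word_probs  <- sweep(counts, 2, colSums(counts), "/")
class_prior <- table(train$class) / nrow(train)           # P(class)

classify <- function(sentence) {
  w <- strsplit(tolower(sentence), "\\s+")[[1]]
  w <- w[w %in% vocab]                                     # ignore words never seen in training
  scores <- log(c(class_prior)) + colSums(log(word_probs[w, , drop = FALSE]))
  names(which.max(scores))
}

classify("a great match")   # "sports"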


Dimensionality Reduction Algorithms: Strengths and Weaknesses

Welcome to Part 2 of our tour through modern machine learning algorithms. In this part, we’ll cover methods for Dimensionality Reduction, further broken into Feature Selection and Feature Extraction. In general, these tasks are rarely performed in isolation. Instead, they’re often preprocessing steps to support other tasks.
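
As a tiny reminder of the feature-extraction side, principal component analysis in base R compresses the four iris measurements into two derived components:

pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)              # variance explained by each component
head(pca$x[, 1:2])        # the extracted two-dimensional representation
plot(pca$x[, 1:2], col = iris$Species, xlab = "PC1", ylab = "PC2")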


Exponential Change and Unsupervised Learning

Brian Hopkins of Forrester Research recently penned an excellent blog post about why companies are getting disrupted and why they realize it so late. The post draws from Ray Kurzweil’s Law Of Accelerating Returns and speaks to the fact that the human brain doesn’t do well with exponential growth.


An Introduction to the MXNet Python API?

This post outlines an entire 6-part tutorial series on the MXNet deep learning library and its Python API. In-depth and descriptive, this is a great guide for anyone looking to start leveraging this powerful neural network library.


Versioning R model objects in SQL Server

If you build a model and never update it you’re missing a trick. Behaviours change so your model will tend to perform worse over time. You’ve got to regularly refresh it, whether that’s adjusting the existing model to fit the latest data (recalibration) or building a whole new model (retraining), but this means you’ve got new versions of your model that you have to handle. You need to think about your methodology for versioning R model objects, ideally before you lose any versions. You could store models with ye olde YYYYMMDD style of versioning but that means regularly changing your code to use the latest model version. I’m too lazy for that! If we’re storing our R model objects in SQL Server then we can utilise another SQL Server capability, temporal tables, to take the pain out of versioning and make it super simple. Temporal tables will track changes automatically so you would overwrite the previous model with the new one and it would keep a copy of the old one automagically in a history table. You get to always use the latest version via the main table but you can then write temporal queries to extract any version of the model that’s ever been implemented. Super neat! For some of you, if you’re not interested in the technical details you can drop off now with the knowledge that you can store your models in a non-destructive but easy to use way in SQL Server if you need to. If you want to see how it’s done, read on!
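
A hedged sketch of the storage step only, assuming an ODBC DSN called "MyModelDB" and a system-versioned (temporal) table dbo.Models with a VARBINARY(MAX) column named model; the DSN, table and column names are illustrative, and SQL Server itself keeps the history of overwritten versions.

library(DBI)
library(odbc)

model   <- lm(mpg ~ wt + hp, data = mtcars)        # stand-in for the real model
payload <- serialize(model, connection = NULL)     # the R object as raw bytes

con <- dbConnect(odbc::odbc(), dsn = "MyModelDB")  # hypothetical DSN
dbExecute(con,
          "UPDATE dbo.Models SET model = ? WHERE model_name = ?",
          params = list(list(payload), "mpg_model"))   # overwrite; the old version goes to history

# Later: fetch the current version and restore the R object.
row      <- dbGetQuery(con, "SELECT model FROM dbo.Models WHERE model_name = 'mpg_model'")
restored <- unserialize(row$model[[1]])
dbDisconnect(con)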


What’s Next for Big Data? Thinking Beyond Hadoop with Elastic Object Storage

In this special guest feature, Irshad Raihan, Product Marketing Manager at Red Hat Storage, discusses how organizations can save money and realize greater flexibility by moving data with lower business value to a more affordable storage solution. Irshad Raihan is a product manager at Red Hat Storage, responsible for product strategy, messaging, and go to market activities. Previously, he held senior product marketing and product management positions at HP and IBM responsible for big data and data management products. Irshad holds a Masters in Computer Science from Clemson University, and an MBA from Carnegie Mellon University.


What is an Ontology?

This is a short blog post to introduce the concept of an ontology for those who are unfamiliar with the term, or who have previously encountered explanations that make little or no sense, as I have. I’m aiming to “democratise knowledge of this topic” as one of my colleagues put it.

R Packages worth a look

Constructing an Epistemic Model for the Games with Two Players (EpistemicGameTheory)
Constructing an epistemic model such that, for every player i and for every choice c(i) which is optimal, there is one type that expresses common belief in rationality.

Bayesian Network Belief Propagation (BayesNetBP)
Belief propagation methods in Bayesian Networks to propagate evidence through the network. The implementation of these methods is based on the article: Cowell, RG (2005). Local Propagation in Conditional Gaussian Bayesian Networks <http://…/>.

Quick Generalized Full Matching (quickmatch)
Provides functions for constructing near-optimal generalized full matching. Generalized full matching is an extension of the original full matching method to situations with more intricate study designs. The package is made with large data sets in mind and derives matches more than an order of magnitude quicker than other methods.

Convolution of Gamma Distributions (coga)
Convolution of gamma distributions in R. The convolution of gamma distributions is the distribution of the sum of a series of gamma random variables, and all of the gamma distributions here can have different parameters. This package can calculate the density and distribution function and do simulation work.

Graphical User Interface for Generalized Multistate Simulation Model (GUIgems)
A graphical user interface for the R package Gems. Apart from providing the functionality of the Gems package within a graphical user interface, GUIgems allows adding states to a defined model, merging states for the analysis, and plotting progression paths between states based on the simulated cohort. There is also a module in GUIgems which allows the user to compare costs and QALYs between different cohorts.

R Packages worth a look

A Faster Implementation of the Poisson-Binomial Distribution (poisbinom)
Provides the probability, distribution, and quantile functions and random number generator for the Poisson-Binomial distribution. This package relies on FFTW to implement the discrete Fourier transform, so that it is much faster than the existing implementation of the same algorithm in R.

Evolutionary Monte Carlo (EMC) Methods for Clustering (EMCC)
Evolutionary Monte Carlo methods for clustering, temperature ladder construction and placement. This package implements methods introduced in Goswami, Liu and Wong (2007) <doi:10.1198/106186007X255072>. The paper above introduced probabilistic genetic-algorithm-style crossover moves for clustering. The paper applied the algorithm to several clustering problems including Bernoulli clustering, biological sequence motif clustering, BIC based variable selection, mixture of Normals clustering, and showed that the proposed algorithm performed better both as a sampler and as a stochastic optimizer than the existing tools, namely, Gibbs sampling, “split-merge” Metropolis-Hastings algorithm, K-means clustering, and the MCLUST algorithm (in the package ‘mclust’).

JAR Files of the Apache Commons Mathematics Library (commonsMath)
Java JAR files for the Apache Commons Mathematics Library for use by users and other packages.

Interactive Graphs with R (RJSplot)
Creates interactive graphs with ‘R’. It joins the data analysis power of R and the visualization libraries of JavaScript in one package.

Simulation Education (simEd)
Contains various functions to be used for simulation education, including queueing simulation functions, variate generation functions capable of producing independent streams and antithetic variates, functions for illustrating random variate generation for various discrete and continuous distributions, and functions to compute time-persistent statistics. Also contains two queueing data sets (one fabricated, one real-world) to facilitate input modeling.