# If you did not already know

Internet Shopping Problem
Introduced by Blazewicz et al. (2010), where a customer wants to buy a list of products at the lowest possible total cost from shops which offer discounts when purchases exceed a certain threshold. The problem is NP-hard. …

EigenPro 2.0
In recent years machine learning methods that nearly interpolate the data have achieved remarkable success. In many settings achieving near-zero training error leads to excellent test results. In this work we show how the mathematical and conceptual simplicity of interpolation can be harnessed to construct a framework for very efficient, scalable and accurate kernel machines. Our main innovation is in constructing kernel machines that output solutions mathematically equivalent to those obtained using standard kernels, yet capable of fully utilizing the available computing power of a parallel computational resource, such as GPU. Such utilization is key to strong performance since much of the computational resource capability is wasted by the standard iterative methods. The computational resource and data adaptivity of our learned kernels is based on theoretical convergence bounds. The resulting algorithm, which we call EigenPro 2.0, is accurate, principled and very fast. For example, using a single GPU, training on ImageNet with $1.3\times 10^6$ data points and $1000$ labels takes under an hour, while smaller datasets, such as MNIST, take seconds. Moreover, as the parameters are chosen analytically, based on the theory, little tuning beyond selecting the kernel and kernel parameter is needed, further facilitating the practical use of these methods. …

Customer Experience Management (CEM)
Customer experience management (CEM or CXM) is the process that companies use to oversee and track all interactions with a customer during the duration of their relationship. This involves the strategy of building around the needs of individual customers. According to Jeananne Rae, companies are realizing that ‘building great consumer experiences is a complex enterprise, involving strategy, integration of technology, orchestrating business models, brand management and CEO commitment.’ …

# Document worth reading: “Deep Probabilistic Programming Languages: A Qualitative Study”

Deep probabilistic programming languages try to combine the advantages of deep learning with those of probabilistic programming languages. If successful, this would be a big step forward in machine learning and programming languages. Unfortunately, as of now, this new crop of languages is hard to use and understand. This paper addresses this problem directly by explaining deep probabilistic programming languages and indirectly by characterizing their current strengths and weaknesses. Deep Probabilistic Programming Languages: A Qualitative Study

# Distilled News

This tool finds the Granger causality relationship among the input time series and visualizes the results in a directed causal graph and a directed adjacency matrix. It applies the Lasso-Granger and Copula-Granger algorithms with length of lag l=1.
• Andrew Arnold, Yan Liu, and Naoki Abe. Temporal Causal Modeling with Graphical Granger Methods, KDD 2007.
• Taha Bahadori, Yan Liu. An Examination of Practical Granger Causality Inference, SDM 2013
In my most recent blog, I discussed the idea of aligning the supply of services to market demand. My conceptualization of ‘alignment’ specifically relates to time intervals: i.e. having people at the right place and at the right time – for example, to take advantage of opportunities – is a sign of alignment. Alignment for me is often about the relationship between capacity and incapacity: the ability to supply services versus the inability to satisfy the market demand for those services. In this blog I will be considering the interpretation of charts to evaluate the effectiveness of strategic allocation.
In this study, we predict the outcome of the football matches in the FIFA World Cup 2018 to be held in Russia this summer. We do this using classification models over a dataset of historic football results that includes attributes from the playing teams by rating them in attack, midfield, defence, aggression, pressure, chance creation and building ability. This last training data was a result of merging international matches results with AE games ratings of the teams considering the timeline of the matches with their respective statistics. Final predictions show the four countries with the most chances of getting to the semifinals as France, Brazil, Spain and Germany while giving Spain as the winner.
In the exercises below, we will explore more in Time Series analysis.The previous exercise is here,Please follow this in sequence.
We discuss a new approach for selecting features from a large set of features, in an unsupervised machine learning framework. In supervised learning such as linear regression or supervised clustering, it is possible to test the predicting power of a set of features (also called independent variables by statisticians, or predictors) using metrics such as goodness of fit with the response (the dependent variable), for instance using the R-squared coefficient. This makes the process of feature selection rather easy. Here this is not feasible. The context could be pure clustering, with no training sets available, for instance in a fraud detection problem. We are also dealing with discrete and continuous variables, possibly including dummy variables that represent categories, such as gender. We assume that no simple statistical model explains the data, so the framework here is model-free, data-driven. In this context, traditional methods are based on information theory metrics to determine which subset of features brings the largest amount of information.
Teaser: Tensor Networks can be seen as a higher-order generalization of traditional deep neural networks, and yet they lack an explicit non-linearity such as applying the ReLU or sigmoid function as we do with neural nets. A deeper understanding of what nonlinearity actually means, however, reveals that tensor networks can indeed learn non-linear functions. The non-linearity of tensor networks arises soley from the architecture and topology of the network itself.
In Data Science, we can use clustering analysis to gain some valuable insights from our data by seeing what groups the data points fall into when we apply a clustering algorithm. Today, we´re going to look at 5 popular clustering algorithms that data scientists need to know and their pros and cons!
• K-Means Clustering
• Mean-Shift Clustering
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
• Agglomerative Hierarchical Clustering
This is a tutorial for beginners interested in learning about MNIST and Softmax regression using machine learning (ML) and TensorFlow. When we start learning programming, the first thing we learned to do was to print ‘Hello World.’ It´s like Hello World, the entry point to programming, and MNIST, the starting point for machine learning.
At first glance, this reminds us of AI, when a machine decides how to manage a task based on statistical data. In fact, this concept is part of the AI phenomenon and makes it possible to develop machine intelligence and improve the decision-making process. According to NVidia, ‘Machine Learning at its most basic is the practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world’. In a nutshell, a machine analyzes and recommends information without human participation. This work can be done manually but takes a plenty of time and effort. Thanks to huge computing power, modern machines perform data mining, data analytics and predictive modeling more effectively than people do. The next section will be dedicated to a recommender technique based on the machine learning approach.

# Whats new on arXiv

We present an approach to identify concise equations from data using a shallow neural network approach. In contrast to ordinary black-box regression, this approach allows understanding functional relations and generalizing them from observed data to unseen parts of the parameter space. We show how to extend the class of learnable equations for a recently proposed equation learning network to include divisions, and we improve the learning and model selection strategy to be useful for challenging real-world data. For systems governed by analytical expressions, our method can in many cases identify the true underlying equation and extrapolate to unseen domains. We demonstrate its effectiveness by experiments on a cart-pendulum system, where only 2 random rollouts are required to learn the forward dynamics and successfully achieve the swing-up task.
Do-calculus is concerned with estimating the interventional distribution of an action from the observed joint probability distribution of the variables in a given causal structure. All identifiable causal effects can be derived using the rules of do-calculus, but the rules themselves do not give any direct indication whether the effect in question is identifiable or not. Shpitser and Pearl constructed an algorithm for identifying joint interventional distributions in causal models, which contain unobserved variables and induce directed acyclic graphs. This algorithm can be seen as a repeated application of the rules of do-calculus and known properties of probabilities, and it ultimately either derives an expression for the causal distribution, or fails to identify the effect, in which case the effect is non-identifiable. In this paper, the R package causaleffect is presented, which provides an implementation of this algorithm. Functionality of causaleffect is also demonstrated through examples.
Obtaining a non-parametric expression for an interventional distribution is one of the most fundamental tasks in causal inference. Such an expression can be obtained for an identifiable causal effect by an algorithm or by manual application of do-calculus. Often we are left with a complicated expression which can lead to biased or inefficient estimates when missing data or measurement errors are involved. We present an automatic simplification algorithm that seeks to eliminate symbolically unnecessary variables from these expressions by taking advantage of the structure of the underlying graphical model. Our method is applicable to all causal effect formulas and is readily available in the R package causaleffect.
We derive a formal, decision-based method for comparing the performance of counterfactual treatment regime predictions using the results of experiments that give relevant information on the distribution of treated outcomes. Our approach allows us to quantify and assess the statistical significance of differential performance for optimal treatment regimes estimated from structural models, extrapolated treatment effects, expert opinion, and other methods. We apply our method to evaluate optimal treatment regimes for conditional cash transfer programs across countries where predictions are generated using data from experimental evaluations in other countries and pre-program data in the country of interest.
We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a blackbox differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program semantics robustly, due to structural features such as function calls, branching, and interchangeable order of statements. In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks. In particular, we stipulate that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. We then analyze the embeddings qualitatively using analogies and clustering, and evaluate the learned representation on three different high-level tasks. We show that with a single RNN architecture and pre-trained fixed embeddings, inst2vec outperforms specialized approaches for performance prediction (compute device mapping, optimal thread coarsening); and algorithm classification from raw code (104 classes), where we set a new state-of-the-art.
Machine learning has an emerging critical role in high-performance computing to modulate simulations, extract knowledge from massive data, and replace numerical models with efficient approximations. Decision forests are a critical tool because they provide insight into model operation that is critical to interpreting learned results. While decision forests are trivially parallelizable, the traversals of tree data structures incur many random memory accesses and are very slow. We present memory packing techniques that reorganize learned forests to minimize cache misses during classification. The resulting layout is hierarchical. At low levels, we pack the nodes of multiple trees into contiguous memory blocks so that each memory access fetches data for multiple trees. At higher levels, we use leaf cardinality to identify the most popular paths through a tree and collocate those paths in cache lines. We extend this layout with out-of-order execution and cache-line prefetching to increase memory throughput. Together, these optimizations increase the performance of classification in ensembles by a factor of four over an optimized C++ implementation and a actor of 50 over a popular R language implementation.
We empirically investigate learning from partial feedback in neural machine translation (NMT), when partial feedback is collected by asking users to highlight a correct chunk of a translation. We propose a simple and effective way of utilizing such feedback in NMT training. We demonstrate how the common machine translation problem of domain mismatch between training and deployment can be reduced solely based on chunk-level user feedback. We conduct a series of simulation experiments to test the effectiveness of the proposed method. Our results show that chunk-level feedback outperforms sentence based feedback by up to 2.61% BLEU absolute.
Smart factories are on the verge of becoming the new industrial paradigm, wherein optimization permeates all aspects of production, from concept generation to sales. To fully pursue this paradigm, flexibility in the production means as well as in their timely organization is of paramount importance. AI is planning a major role in this transition, but the scenarios encountered in practice might be challenging for current tools. Task planning is one example where AI enables more efficient and flexible operation through an online automated adaptation and rescheduling of the activities to cope with new operational constraints and demands. In this paper we present SMarTplan, a task planner specifically conceived to deal with real-world scenarios in the emerging smart factory paradigm. Including both special-purpose and general-purpose algorithms, SMarTplan is based on current automated reasoning technology and it is designed to tackle complex application domains. In particular, we show its effectiveness on a logistic scenario, by comparing its specialized version with the general purpose one, and extending the comparison to other state-of-the-art task planners.
Fraud detection is a difficult problem that can benefit from predictive modeling. However, the verification of a prediction is challenging; for a single insurance policy, the model only provides a prediction score. We present a case study where we reflect on different instance-level model explanation techniques to aid a fraud detection team in their work. To this end, we designed two novel dashboards combining various state-of-the-art explanation techniques. These enable the domain expert to analyze and understand predictions, dramatically speeding up the process of filtering potential fraud cases. Finally, we discuss the lessons learned and outline open research issues.
The restricted Boltzmann machine is a network of stochastic units with undirected interactions between pairs of visible and hidden units. This model was popularized as a building block of deep learning architectures and has continued to play an important role in applied and theoretical machine learning. Restricted Boltzmann machines carry a rich structure, with connections to geometry, applied algebra, probability, statistics, machine learning, and other areas. The analysis of these models is attractive in its own right and also as a platform to combine and generalize mathematical tools for graphical models with hidden variables. This article gives an introduction to the mathematical analysis of restricted Boltzmann machines, reviews recent results on the geometry of the sets of probability distributions representable by these models, and suggests a few directions for further investigation.
Deep neural networks have been proven powerful at processing perceptual data, such as images and audio. However for tabular data, tree-based models are more popular. A nice property of tree-based models is their natural interpretability. In this work, we present Deep Neural Decision Trees (DNDT) — tree models realised by neural networks. A DNDT is intrinsically interpretable, as it is a tree. Yet as it is also a neural network (NN), it can be easily implemented in NN toolkits, and trained with gradient descent rather than greedy splitting. We evaluate DNDT on several tabular datasets, verify its efficacy, and investigate similarities and differences between DNDT and vanilla decision trees. Interestingly, DNDT self-prunes at both split and feature-level.
Inverse reinforcement learning is the problem of inferring the reward function of an observed agent, given its policy or behavior. Researchers perceive IRL both as a problem and as a class of methods. By categorically surveying the current literature in IRL, this article serves as a reference for researchers and practitioners in machine learning to understand the challenges of IRL and select the approaches best suited for the problem on hand. The survey formally introduces the IRL problem along with its central challenges which include accurate inference, generalizability, correctness of prior knowledge, and growth in solution complexity with problem size. The article elaborates how the current methods mitigate these challenges. We further discuss the extensions of traditional IRL methods: (i) inaccurate and incomplete perception, (ii) incomplete model, (iii) multiple rewards, and (iv) non-linear reward functions. This discussion concludes with some broad advances in the research area and currently open research questions.
Tensors are higher-order extensions of matrices. In recent work [Kilmer and Martin, 2011], the authors introduced the notion of the t-product, a generalization of matrix multiplication for tensors of order three. The multiplication is based on a convolution-like operation, which can be implemented efficiently using the Fast Fourier Transform (FFT). Based on t-product, there has a similar linear algebraic structure of tensors to matrices. For example, there has the tensor SVD (t-SVD) which is computable. By using some properties of FFT, we have a more efficient way for computing t-product and t-SVD in [C. Lu, et al., 2018]. We develop a Matlab toolbox to implement several basic operations on tensors based on t-product. The toolbox is available at https://…/tproduct.
Convolutional Neural Networks(CNNs) are complex systems. They are trained so they can adapt their internal connections to recognize images, texts and more. It is both interesting and helpful to visualize the dynamics within such deep artificial neural networks so that people can understand how these artificial networks are learning and making predictions. In the field of scientific simulations, visualization tools like Paraview have long been utilized to provide insights and understandings. We present in situ TensorView to visualize the training and functioning of CNNs as if they are systems of scientific simulations. In situ TensorView is a loosely coupled in situ visualization open framework that provides multiple viewers to help users to visualize and understand their networks. It leverages the capability of co-processing from Paraview to provide real-time visualization during training and predicting phases. This avoid heavy I/O overhead for visualizing large dynamic systems. Only a small number of lines of codes are injected in TensorFlow framework. The visualization can provide guidance to adjust the architecture of networks, or compress the pre-trained networks. We showcase visualizing the training of LeNet-5 and VGG16 using in situ TensorView.

# R Packages worth a look

Facilities for Simulating from ODE-Based Models (RxODE)
Facilities for running simulations from ordinary differential equation (ODE) models, such as pharmacometrics and other compartmental models. A compilation manager translates the ODE model into C, compiles it, and dynamically loads the object code into R for improved computational efficiency. An event table object facilitates the specification of complex dosing regimens (optional) and sampling schedules. NB: The use of this package requires both C and Fortran compilers, for details on their use with R please see Section 6.3, Appendix A, and Appendix D in the ‘R Administration and Installation’ manual. Also the code is mostly released under GPL. The VODE and LSODA are in the public domain. The information is available in the inst/COPYRIGHTS.

Stanford ‘ATLAS’ Search Engine API (atlas)
Stanford ‘ATLAS’ (Advanced Temporal Search Engine) is a powerful tool that allows constructing cohorts of patients extremely quickly and efficiently. This package is designed to interface directly with an instance of ‘ATLAS’ search engine and facilitates API queries and data dumps. Prerequisite is a good knowledge of the temporal language to be able to efficiently construct a query. More information available at <https://…/start>.

In-place Operators for R (inplace)
It provides in-place operators for R that are equivalent to ‘+=’, ‘-=’, ‘*=’, ‘/=’ in C++. Those can be applied on integer|double vectors|matrices. You have also access to sweep operations (in-place).

Simulation Extrapolation Inverse Probability Weighted Generalized Estimating Equations (swgee)
Simulation extrapolation and inverse probability weighted generalized estimating equations method for longitudinal data with missing observations and measurement error in covariates. References: Yi, G. Y. (2008) <doi:10.1093/biostatistics/kxm054>; Cook, J. R. and Stefanski, L. A. (1994) <doi:10.1080/01621459.1994.10476871>; Little, R. J. A. and Rubin, D. B. (2002, ISBN:978-0-471-18386-0).

A User-Oriented Statistical Toolkit for Analytical Variance Estimation (gustave)
Provides a toolkit for analytical variance estimation in survey sampling. Apart from the implementation of standard variance estimators, its main feature is to help the sampling expert produce easy-to-use variance estimation ‘wrappers’, where systematic operations (linearization, domain estimation) are handled in a consistent and transparent way for the end user.

# Magister Dixit

“Sometimes some data scientists seem to ignore this: you can think of using the most sophisticated and trendy algorithm, come up with brilliant ideas, imagine the most creative visualizations but, if you do not know how to get the data and handle it in the exact way you need it, all of this becomes worthless.” Hernán Resnizky ( May 15, 2015 )

# Book Memo: “Practical Text Analytics”

 Maximizing the Value of Your Text Data This book explores the process of text analytics in order to increase the accessibility of information available in unstructured text data. Unlike other books available in the text analytics field, Practical Text Analytics opens the door to business analysts and practitioners that may not have extensive coding experience or knowledge of the area. This allows readers without a programming background to take advantage of the nearly limitless information currently shrouded by text. Text analytics can help organizations derive insights for their business from text-based content like emails, documents, or social media posts. This book covers the elements involved in creating a text-mining pipeline. While analysts will not use every element in every project, each tool provides a potential segment in the final pipeline. Understanding the options is key to choosing the appropriate elements in designing and conducting text analysis.

# Document worth reading: “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review”

The framework of reinforcement learning or optimal control provides a mathematical formalization of intelligent decision making that is powerful and broadly applicable. While the general form of the reinforcement learning problem enables effective reasoning about uncertainty, the connection between reinforcement learning and inference in probabilistic models is not immediately obvious. However, such a connection has considerable value when it comes to algorithm design: formalizing a problem as probabilistic inference in principle allows us to bring to bear a wide array of approximate inference tools, extend the model in flexible and powerful ways, and reason about compositionality and partial observability. In this article, we will discuss how a generalization of the reinforcement learning or optimal control problem, which is sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and variational inference in the case of stochastic dynamics. We will present a detailed derivation of this framework, overview prior work that has drawn on this and related ideas to propose new reinforcement learning and control algorithms, and describe perspectives on future research. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

# If you did not already know

CYCLOSA
By regularly querying Web search engines, users (unconsciously) disclose large amounts of their personal data as part of their search queries, among which some might reveal sensitive information (e.g. health issues, sexual, political or religious preferences). Several solutions exist to allow users querying search engines while improving privacy protection. However, these solutions suffer from a number of limitations: some are subject to user re-identification attacks, while others lack scalability or are unable to provide accurate results. This paper presents CYCLOSA, a secure, scalable and accurate private Web search solution. CYCLOSA improves security by relying on trusted execution environments (TEEs) as provided by Intel SGX. Further, CYCLOSA proposes a novel adaptive privacy protection solution that reduces the risk of user re- identification. CYCLOSA sends fake queries to the search engine and dynamically adapts their count according to the sensitivity of the user query. In addition, CYCLOSA meets scalability as it is fully decentralized, spreading the load for distributing fake queries among other nodes. Finally, CYCLOSA achieves accuracy of Web search as it handles the real query and the fake queries separately, in contrast to other existing solutions that mix fake and real query results. …

Stochastic Decorrelation Loss (SDL)
Multi-view learning aims to learn an embedding space where multiple views are either maximally correlated for cross-view recognition, or decorrelated for latent factor disentanglement. A key challenge for deep multi-view representation learning is scalability. To correlate or decorrelate multi-view signals, the covariance of the whole training set should be computed which does not fit well with the mini-batch based training strategy, and moreover (de)correlation should be done in a way that is free of SVD-based computation in order to scale to contemporary layer sizes. In this work, a unified approach is proposed for efficient and scalable deep multi-view learning. Specifically, a mini-batch based Stochastic Decorrelation Loss (SDL) is proposed which can be applied to any network layer to provide soft decorrelation of the layer’s activations. This reveals the connection between deep multi-view learning models such as Deep Canonical Correlation Analysis (DCCA) and Factorisation Autoencoder (FAE), and allows them to be easily implemented. We further show that SDL is superior to other decorrelation losses in terms of efficacy and scalability. …

Markov Chain Las Vegas (MCLV)
We propose a Las Vegas transformation of Markov Chain Monte Carlo (MCMC) estimators of Restricted Boltzmann Machines (RBMs). We denote our approach Markov Chain Las Vegas (MCLV). MCLV gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and has maximum number of Markov chain steps K (referred as MCLV-K). We present a MCLV-K gradient estimator (LVS-K) for RBMs and explore the correspondence and differences between LVS-K and Contrastive Divergence (CD-K), with LVS-K significantly outperforming CD-K training RBMs over the MNIST dataset, indicating MCLV to be a promising direction in learning generative models. …