# If you did not already know

Residual Gated Graph ConvNet
Graph-structured data such as functional brain networks, social networks, gene regulatory networks, and communications networks have spurred interest in generalizing neural networks to graph domains. In this paper, we are interested in designing efficient neural network architectures for graphs with variable length. Several existing works such as Scarselli et al. (2009); Li et al. (2016) have focused on recurrent neural networks (RNNs) to solve this task. A recent different approach was proposed in Sukhbaatar et al. (2016), where a vanilla graph convolutional neural network (ConvNet) was introduced. We believe the latter approach to be a better paradigm to solve graph learning problems because ConvNets are better suited to deep networks than RNNs. For this reason, we propose the most generic class of residual multi-layer graph ConvNets that make use of an edge gating mechanism, as proposed in Marcheggiani & Titov (2017). Gated edges appear to be a natural property in the context of graph learning tasks, as the system has the ability to learn which edges are important or not for the task to solve. We apply several graph neural models to two basic network science tasks: subgraph matching and semi-supervised clustering for graphs with variable length. Numerical results show the performances of the new model. …

Gated Linear Network
This paper describes a family of probabilistic architectures designed for online learning under the logarithmic loss. Rather than relying on non-linear transfer functions, our method gains representational power by the use of data conditioning. We state under general conditions a learnable capacity theorem that shows this approach can in principle learn any bounded Borel-measurable function on a compact subset of Euclidean space; the result is stronger than many universality results for connectionist architectures because we provide both the model and the learning procedure for which convergence is guaranteed. …

Spatially Compact Semantic Scan (SCSS)
Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. We describe Spatially Compact Semantic Scan (SCSS), which has been developed specifically to overcome the shortcomings of current methods in detecting new spatially compact events in text streams. SCSS employs alternating optimization between using semantic scan to estimate contrastive foreground topics in documents, and discovering spatial neighborhoods with high occurrence of documents containing the foreground topics. We evaluate our method on an Emergency Department chief complaints dataset (ED dataset) to verify the effectiveness of our method in detecting real-world disease outbreaks from free-text ED chief complaint data. …

# R Packages worth a look

Pena-Yohai Initial Estimator for Robust S-Regression (pyinit)
Deterministic Pena-Yohai initial estimator for robust S estimators of regression. The procedure is described in detail in Pena, D., & Yohai, V. (1999) <doi:10.2307/2670164>.

Measuring Disparity (dispRity)
A modular package for measuring disparity from multidimensional matrices. Disparity can be calculated from any matrix defining a multidimensional space. The package provides a set of implemented metrics to measure properties of the space and allows users to provide and test their own metrics. The package also provides functions for looking at disparity in a serial way (e.g. disparity through time) or per groups as well as visualising the results. Finally, this package provides several basic statistical tests for disparity analysis.

Functional Concurrent Regression for Sparse Data (fcr)
Dynamic prediction in functional concurrent regression with an application to child growth. Extends the pffr() function from the ‘refund’ package to handle the scenario where the functional response and concurrently measured functional predictor are irregularly measured. Leroux et al. (2017), Statistics in Medicine, <doi:10.1002/sim.7582>.

Age-Structured Population Dynamics Model (albopictus)
Implements discrete time deterministic and stochastic age-structured population dynamics models described in Erguler and others (2016) <doi:10.1371/journal.pone.0149282> and Erguler and others (2017) <doi:10.1371/journal.pone.0174293>.

Compare Big Datasets to the Uniform Distribution (ggQQunif)
A quantile-quantile plot can be used to compare a sample of p-values to the uniform distribution. But when the dataset is big (i.e. > 1e4 p-values), plotting the quantile-quantile plot can be slow. geom_QQ uses all the data to calculate the quantiles, but thins it out in a way that focuses on points near zero before plotting to speed up plotting and decrease file size, when vector graphics are stored.
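
The thinning idea described above can be sketched in a few lines of Python. This is only an illustration, not the package's actual algorithm: the function name `thinned_qq_points`, the `digits` parameter, and the choice to thin by rounding on a -log10 scale are all assumptions made here. On that scale, points near zero are spread far apart and survive, while the dense mass of points near one collapses into a few grid cells.

```python
import math
import random

def thinned_qq_points(p_values, digits=2):
    """Return (expected, observed) pairs on a -log10 scale, thinned on a grid.

    Hypothetical helper for illustration; assumes every p-value is > 0.
    """
    n = len(p_values)
    observed = sorted(p_values)
    kept = {}
    for i, p in enumerate(observed, start=1):
        # Expected quantile of the i-th smallest value under Uniform(0, 1).
        expected = (i - 0.5) / n
        x = -math.log10(expected)
        y = -math.log10(p)
        # All quantiles are computed, but points that round to the same
        # grid cell are stored (and later plotted) only once.
        kept[(round(x, digits), round(y, digits))] = (x, y)
    return sorted(kept.values())

# Synthetic p-values in (0, 1], made up for this example.
rng = random.Random(1)
p_values = [1.0 - rng.random() for _ in range(20000)]
points = thinned_qq_points(p_values)
```

The returned list is what a plotting layer would draw; the smallest p-values each keep their own point because neighbouring quantiles are far apart on the log scale.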

# If you did not already know

Riemann-Theta Boltzmann Machine
A general Boltzmann machine with continuous visible and discrete integer valued hidden states is introduced. Under mild assumptions about the connection matrices, the probability density function of the visible units can be solved for analytically, yielding a novel parametric density function involving a ratio of Riemann-Theta functions. The conditional expectation of a hidden state for given visible states can also be calculated analytically, yielding a derivative of the logarithmic Riemann-Theta function. The conditional expectation can be used as activation function in a feedforward neural network, thereby increasing the modelling capacity of the network. Both the Boltzmann machine and the derived feedforward neural network can be successfully trained via standard gradient- and non-gradient-based optimization techniques. …

Hierarchical Compositional Network (HCN)
We introduce the hierarchical compositional network (HCN), a directed generative model able to discover and disentangle, without supervision, the building blocks of a set of binary images. The building blocks are binary features defined hierarchically as a composition of some of the features in the layer immediately below, arranged in a particular manner. At a high level, HCN is similar to a sigmoid belief network with pooling. Inference and learning in HCN are very challenging and existing variational approximations do not work satisfactorily. A main contribution of this work is to show that both can be addressed using max-product message passing (MPMP) with a particular schedule (no EM required). Also, using MPMP as an inference engine for HCN makes new tasks simple: adding supervision information, classifying images, or performing inpainting all correspond to clamping some variables of the model to their known values and running MPMP on the rest. When used for classification, fast inference with HCN has exactly the same functional form as a convolutional neural network (CNN) with linear activations and binary weights. However, HCN’s features are qualitatively very different. …

Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
CMA-ES stands for Covariance Matrix Adaptation Evolution Strategy. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation (via recombination and mutation) and selection: in each generation (iteration) new individuals (candidate solutions, denoted as x) are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value f(x). In this way, over the generation sequence, individuals with better and better f-values are generated. In an evolution strategy, new candidate solutions are sampled according to a multivariate normal distribution in R^n. Recombination amounts to selecting a new mean value for the distribution. Mutation amounts to adding a random vector, a perturbation with zero mean. Pairwise dependencies between the variables in the distribution are represented by a covariance matrix. The covariance matrix adaptation (CMA) is a method to update the covariance matrix of this distribution. This is particularly useful if the function f is ill-conditioned. Adaptation of the covariance matrix amounts to learning a second-order model of the underlying objective function, similar to the approximation of the inverse Hessian matrix in the quasi-Newton method in classical optimization. In contrast to most classical methods, fewer assumptions on the nature of the underlying objective function are made. Only the ranking between candidate solutions is exploited for learning the sample distribution and neither derivatives nor even the function values themselves are required by the method. …
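
The sample, select, recombine loop described above can be sketched as follows. This is a deliberately simplified evolution strategy and not CMA-ES itself: it samples from an isotropic Gaussian, performs no covariance matrix adaptation, and shrinks the step size on a fixed schedule. The function names, the default parameters, and the decay factor are assumptions chosen for illustration.

```python
import random

def sphere(x):
    """Toy objective: sum of squares, minimised at the origin."""
    return sum(v * v for v in x)

def simple_es(f, mean, sigma=0.5, pop_size=20, n_parents=5,
              generations=100, seed=0):
    """Minimal evolution strategy with an isotropic Gaussian (no CMA)."""
    rng = random.Random(seed)
    mean = list(mean)
    dim = len(mean)
    for _ in range(generations):
        # Mutation: add a zero-mean Gaussian perturbation to the current mean.
        population = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            for _ in range(pop_size)
        ]
        # Selection: only the ranking of the f-values is used.
        population.sort(key=f)
        parents = population[:n_parents]
        # Recombination: the new mean is the average of the selected parents.
        mean = [sum(p[i] for p in parents) / n_parents for i in range(dim)]
        # Assumption: a fixed decay stands in for real step-size adaptation.
        sigma *= 0.97
    return mean

start = [3.0, -2.0]
best = simple_es(sphere, start)
```

A full CMA-ES would additionally update a covariance matrix from the selected steps, which is what makes it effective on ill-conditioned problems.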

# What’s new on arXiv

Gaussian processes (GPs) are Bayesian nonparametric generative models that provide interpretability of hyperparameters, admit closed-form expressions for training and inference, and are able to accurately represent uncertainty. To model general non-Gaussian data with complex correlation structure, GPs can be paired with an expressive covariance kernel and then fed into a nonlinear transformation (or warping). However, overparametrising the kernel and the warping is known to, respectively, hinder gradient-based training and make the predictions computationally expensive. We remedy this issue by (i) training the model using derivative-free global-optimisation techniques so as to find meaningful maxima of the model likelihood, and (ii) proposing a warping function based on the celebrated Box-Cox transformation that requires minimal numerical approximations—unlike existing warped GP models. We validate the proposed approach by first showing that predictions can be computed analytically, and then on a learning, reconstruction and forecasting experiment using real-world datasets.
In this position paper, we describe our vision of the future of machine-based programming through a categorical examination of three pillars of research. Those pillars are: (i) intention, (ii) invention, and (iii) adaptation. Intention emphasizes advancements in the human-to-computer and computer-to-machine-learning interfaces. Invention emphasizes the creation or refinement of algorithms or core hardware and software building blocks through machine learning (ML). Adaptation emphasizes advances in the use of ML-based constructs to autonomously evolve software.
As concerns about unfairness and discrimination in ‘black box’ machine learning systems rise, a legal ‘right to an explanation’ has emerged as a compellingly attractive approach for challenge and redress. We outline recent debates on the limited provisions in European data protection law, and introduce and analyze newer explanation rights in French administrative law and the draft modernized Council of Europe Convention 108. While individual rights can be useful, in privacy law they have historically unreasonably burdened the average data subject. ‘Meaningful information’ about algorithmic logics is more technically possible than commonly thought, but this exacerbates a new ‘transparency fallacy’—an illusion of remedy rather than anything substantively helpful. While rights-based approaches deserve a firm place in the toolbox, other forms of governance, such as impact assessments, ‘soft law,’ judicial review, and model repositories deserve more attention, alongside catalyzing agencies acting for users to control algorithmic system design.
Memory- and computation-efficient deep learning architectures are crucial to the continued proliferation of machine learning capabilities to new platforms and systems. Binarization of operations in convolutional neural networks has shown promising results in reducing model size and improving computing efficiency. In this paper, we tackle the problem using a strategy different from the existing literature by proposing local binary pattern networks, or LBPNet, which is able to learn and perform binary operations in an end-to-end fashion. LBPNet uses local binary comparisons and random projection in place of conventional convolution (or approximation of convolution) operations. These operations can be implemented efficiently on different platforms, including direct hardware implementation. We applied LBPNet and its variants on standard benchmarks. The results are promising across benchmarks while providing an important means to improve memory and speed efficiency that is particularly suited for small footprint devices and hardware accelerators.
The ability to anticipate the future is essential when making real-time critical decisions, provides valuable information to understand dynamic natural scenes, and can help unsupervised video representation learning. State-of-the-art video prediction is based on LSTM recursive networks and/or generative adversarial network learning. These are complex architectures that need to learn large numbers of parameters, are potentially hard to train, slow to run, and may produce blurry predictions. In this paper, we introduce DYAN, a novel network with very few parameters that is easy to train and produces accurate, high-quality frame predictions significantly faster than previous approaches. DYAN owes its good qualities to its encoder and decoder, which are designed following concepts from systems identification theory and exploit the dynamics-based invariants of the data. Extensive experiments using several standard video datasets show that DYAN is superior at generating frames and that it generalizes well across domains.
AI researchers employ not only the scientific method, but also methodology from mathematics and engineering. However, the use of the scientific method – specifically hypothesis testing – in AI is typically conducted in service of engineering objectives. Growing interest in topics such as fairness and algorithmic bias shows that engineering-focused questions comprise only a subset of the important questions about AI systems. This results in the AI Knowledge Gap: the number of unique AI systems grows faster than the number of studies that characterize these systems’ behavior. To close this gap, we argue that the study of AI could benefit from the greater inclusion of researchers who are well positioned to formulate and test hypotheses about the behavior of AI systems. We examine the barriers preventing social and behavioral scientists from conducting such studies. Our diagnosis suggests that accelerating the scientific study of AI systems requires new incentives for academia and industry, mediated by new tools and institutions. To address these needs, we propose a two-sided marketplace called TuringBox. On one side, AI contributors upload existing and novel algorithms to be studied scientifically by others. On the other side, AI examiners develop and post machine intelligence tasks designed to evaluate and characterize algorithmic behavior. We discuss this market’s potential to democratize the scientific study of AI behavior, and thus narrow the AI Knowledge Gap.
We propose a new network architecture, Gated Attention Networks (GaAN), for learning on graphs. Unlike the traditional multi-head attention mechanism, which equally consumes all attention heads, GaAN uses a convolutional sub-network to control each attention head’s importance. We demonstrate the effectiveness of GaAN on the inductive node classification problem. Moreover, with GaAN as a building block, we construct the Graph Gated Recurrent Unit (GGRU) to address the traffic speed forecasting problem. Extensive experiments on three real-world datasets show that our GaAN framework achieves state-of-the-art results on both tasks.
This paper presents findings for training a Q-learning reinforcement learning agent using natural gradient techniques. We compare the original deep Q-network (DQN) algorithm to its natural gradient counterpart (NGDQN), measuring NGDQN and DQN performance on classic controls environments without target networks. We find that NGDQN performs favorably relative to DQN, converging to significantly better policies faster and more frequently. These results indicate that natural gradient could be used for value function optimization in reinforcement learning to accelerate and stabilize training.
The paper tackles the unsupervised estimation of the effective dimension of a sample of dependent random vectors. The proposed method uses the principal components (PC) decomposition of sample covariance to establish a low-rank approximation that helps uncover the hidden structure. The number of PCs to be included in the decomposition is determined via a Probabilistic Principal Components Analysis (PPCA) embedded in a penalized profile likelihood criterion. The choice of penalty parameter is guided by a data-driven procedure that is justified via analytical derivations and extensive finite sample simulations. Application of the proposed penalized PPCA is illustrated with three gene expression datasets in which the number of cancer subtypes is estimated from all expression measurements. The analyses point towards hidden structures in the data, e.g. additional subgroups, that could be of scientific interest.
Data efficiency, i.e., learning from small data sets, is critical in many practical applications where data collection is time consuming or expensive, e.g., robotics, animal experiments or drug design. Meta learning is one way to increase the data efficiency of learning algorithms by generalizing learned concepts from a set of training tasks to unseen, but related, tasks. Often, this relationship between tasks is hard coded or relies in some other way on human expertise. In this paper, we propose to automatically learn the relationship between tasks using a latent variable model. Our approach finds a variational posterior over tasks and averages over all plausible (according to this posterior) tasks when making predictions. We apply this framework within a model-based reinforcement learning setting for learning dynamics models and controllers of many related tasks, and show that our model effectively generalizes to novel tasks and that it reduces the average interaction time needed to solve tasks by up to 60% compared to strong baselines.
In this paper, we introduce a powerful technique, Leave-One-Out, to the analysis of low-rank matrix completion problems. Using this technique, we develop a general approach for obtaining fine-grained, entry-wise bounds on iterative stochastic procedures. We demonstrate the power of this approach in analyzing two of the most important algorithms for matrix completion: the non-convex approach based on Singular Value Projection (SVP), and the convex relaxation approach based on nuclear norm minimization (NNM). In particular, we prove for the first time that the original form of SVP, without re-sampling or sample splitting, converges linearly in the infinity norm. We further apply our leave-one-out approach to an iterative procedure that arises in the analysis of the dual solutions of NNM. Our results show that NNM recovers the true $d$-by-$d$ rank-$r$ matrix with $\mathcal{O}(\mu^2 r^3d \log d )$ observed entries, which has optimal dependence on the dimension and is independent of the condition number of the matrix. To the best of our knowledge, this is the first sample complexity result for a tractable matrix completion algorithm that satisfies these two properties simultaneously.

# Book Memo: “Introduction to HPC with MPI for Data Science”

This gentle introduction to High Performance Computing (HPC) for Data Science using the Message Passing Interface (MPI) standard has been designed as a first course for undergraduates on parallel programming on distributed memory models, and requires only basic programming notions. The book is divided into two parts: the first covers high performance computing using C++ with the Message Passing Interface (MPI) standard, and the second provides high-performance data analytics on computer clusters. In the first part, the fundamental notions of blocking versus non-blocking point-to-point communications, global communications (like broadcast or scatter) and collaborative computations (reduce), along with the Amdahl and Gustafson speed-up laws, are described before addressing parallel sorting and parallel linear algebra on computer clusters. The common ring, torus and hypercube topologies of clusters are then explained and global communication procedures on these topologies are studied. This first part closes with the MapReduce (MR) model of computation, well suited to processing big data using the MPI framework. In the second part, the book focuses on high-performance data analytics. Flat and hierarchical clustering algorithms are introduced for data exploration along with how to program these algorithms on computer clusters, followed by machine learning classification, and an introduction to graph analytics. This part closes with a concise introduction to data core-sets that let big data problems be amenable to tiny data problems.

# Book Memo: “Probability and Statistics for Computer Science”

This textbook is aimed at computer science undergraduates late in sophomore or early in junior year, supplying a comprehensive background in qualitative and quantitative data analysis, probability, random variables, and statistical methods, including machine learning. With careful treatment of topics that fill the curricular needs for the course, Probability and Statistics for Computer Science features:

- A treatment of random variables and expectations dealing primarily with the discrete case.
- A practical treatment of simulation, showing how many interesting probabilities and expectations can be extracted, with particular emphasis on Markov chains.
- A clear but crisp account of simple point inference strategies (maximum likelihood; Bayesian inference) in simple contexts. This is extended to cover some confidence intervals, samples and populations for random sampling with replacement, and the simplest hypothesis testing.
- A chapter dealing with classification, explaining why it’s useful; how to train SVM classifiers with stochastic gradient descent; and how to use implementations of more advanced methods such as random forests and nearest neighbors.
- A chapter dealing with regression, explaining how to set up, use and understand linear regression and nearest neighbors regression in practical problems.
- A chapter dealing with principal components analysis, developing intuition carefully, and including numerous practical examples. There is a brief description of multivariate scaling via principal coordinate analysis.
- A chapter dealing with clustering via agglomerative methods and k-means, showing how to build vector quantized features for complex signals.

Illustrated throughout, each main chapter includes many worked examples and other pedagogical elements such as boxed Procedures, Definitions, Useful Facts, and Remember This (short tips). Problems and Programming Exercises are at the end of each chapter, with a summary of what the reader should know.
Instructor resources include a full set of model solutions for all problems, and an Instructor’s Manual with accompanying presentation slides.

# Distilled News

I was recently chatting to a friend whose startup’s machine learning models were so disorganized it was causing serious problems as his team tried to build on each other’s work and share it with clients. Even the original author sometimes couldn’t train the same model and get similar results! He was hoping that I had a solution I could recommend, but I had to admit that I struggle with the same problems in my own work. It’s hard to explain to people who haven’t worked with machine learning, but we’re still back in the dark ages when it comes to tracking changes and rebuilding models from scratch. It’s so bad it sometimes feels like stepping back in time to when we coded without source control.
Machine learning can drive tangible business value for a wide range of industries — but only if it is actually put to use. Despite the many machine learning discoveries being made by academics, new research papers showing what is possible, and an increasing amount of data available, companies are struggling to deploy machine learning to solve real business problems. In short, the gap for most companies isn’t that machine learning doesn’t work, but that they struggle to actually use it. How can companies close this execution gap? In a recent project we illustrated the principles of how to do it. We used machine learning to augment the power of seasoned professionals — in this case, project managers — by allowing them to make data-driven business decisions well in advance. And in doing so, we demonstrated that getting value from machine learning is less about cutting-edge models, and more about making deployment easier.
The k-Nearest-Neighbors (kNN) method of classification is one of the simplest methods in machine learning, and is a great way to introduce yourself to machine learning and classification in general. At its most basic level, it is essentially classification by finding the most similar data points in the training data, and making an educated guess based on their classifications. Although very simple to understand and implement, this method has seen wide application in many domains, such as in recommendation systems, semantic searching, and anomaly detection.
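
A minimal sketch of the idea in plain Python, assuming Euclidean distance and a simple majority vote (the summary above does not specify either); the function name `knn_predict` and the toy data are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The "educated guess": the most common label among those neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training data: two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

near_first_cluster = knn_predict(points, labels, (1.1, 1.0))
near_second_cluster = knn_predict(points, labels, (5.1, 5.0))
```

There is no training step beyond storing the data; all the work happens at prediction time, which is why kNN is simple to implement but can be slow on large training sets.
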
In logistic regression, separation refers to the situation in which a linear combination of predictors perfectly discriminates the binary outcome. Because finite-valued maximum likelihood parameter estimates do not exist under separation, Bayesian regressions with informative shrinkage of the regression coefficients offer a suitable alternative. Little attention has been paid to whether and how to shrink the intercept parameter. Based upon classical studies of separation, we argue that efficiency in estimating regression coefficients may vary with the intercept prior. We adapt alternative prior distributions for the intercept that downweight implausibly extreme regions of the parameter space, yielding less sensitivity to separation. Through simulation and the analysis of exemplar datasets, we quantify differences across priors stratified by established statistics measuring the degree of separation. Relative to diffuse priors, our recommendations generally result in more efficient estimation of the regression coefficients themselves when the data are nearly separated. They are equally efficient in non-separated datasets, making them suitable for default use. Modest differences were observed with respect to out-of-sample discrimination. Our work also highlights the interplay between priors for the intercept and the regression coefficients: numerical results are more sensitive to the choice of intercept prior when using a weakly informative prior on the regression coefficients than an informative shrinkage prior.
If you regularly have to deal with specific versions of R, or different package combinations, or getting R set up to work with other databases or applications then, well, it can be a pain. You could dedicate a special machine for each configuration you need, I guess, but that’s expensive and impractical. You could set up virtual machines in the cloud, which works well for one-off situations, but gets tedious having to re-configure a new VM each time. Or, you could use Docker containers, which were expressly designed to make it quick and easy to configure and launch an independent and secure collection of software and services. If you’re new to the concept of Docker containers, here’s a docker tutorial for data scientists. But the concepts are pretty simple. At Docker hub, you can search ‘images’ – basically, bundles of software with pre-configured settings – contributed by the community and by vendors. (You’ll be referring to the images by name, for example: rocker/r-base.) You can then create a ‘container’ (a running instance of that image) on your machine with the docker application, or in the cloud using the tools offered by your provider of choice.
Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (x). Briefly, the goal of regression model is to build a mathematical equation that defines y as a function of the x variables. Next, this equation can be used to predict the outcome (y) on the basis of new values of the predictor variables (x).
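
For the simplest case of one predictor, the "mathematical equation" is a straight line, y = intercept + slope * x. The sketch below fits it by ordinary least squares in plain Python; the function names and the toy data are assumptions made for illustration, and the summary above is not tied to this particular method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x_new):
    """Use the fitted equation to predict y for a new x."""
    return intercept + slope * x_new

# Toy data generated from y = 1 + 2x with no noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
y_at_10 = predict(intercept, slope, 10.0)
```

With several predictors the same idea applies, but the fit is usually computed with matrix routines from a numerical library rather than by hand.
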
We’re stuck. Or at least we’re plateaued. Can anyone remember the last time a year went by without a major notable advance in algorithms, chips, or data handling? It was so unusual to go to the Strata San Jose conference a few weeks ago and see no new eye catching developments. As I reported earlier, it seems we’ve hit maturity and now our major efforts are aimed at either making sure all our powerful new techniques work well together (converged platforms) or making a buck from those massive VC investments in same. I’m not the only one who noticed. Several attendees and exhibitors said very similar things to me. And just the other day I had a note from a team of well-regarded researchers who had been evaluating the relative merits of different advanced analytic platforms, and concluding there weren’t any differences worth reporting.
SketchCode: Go from idea to HTML in 5 seconds
Most data scientists have to write code to analyze data or build products. While coding, data scientists act as software engineers. Adopting best practices from software engineering is key to ensuring the correctness, reproducibility, and maintainability of data science projects. This post describes some of our efforts in the area.
There are lots of applications of text classification in the commercial world. For example, news stories are typically organized by topics; content or products are often tagged by categories; users can be classified into cohorts based on how they talk about a product or brand online.
Natural Language Processing (NLP) has been seen as one of the black boxes of Data Analytics. The aim of this post is to introduce udpipe, a simple-to-use but effective R package for NLP and Text Analytics. The udpipe R package provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which is an essential part of natural language processing.
This article describes the problem of learning word representations from scratch with a neural network. The problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by Prof. Geoffrey Hinton from the University of Toronto in 2012.
The research on improving Artificial Intelligence (A.I.) has been ongoing for decades. However, it wasn’t until recently that developers were finally able to create smart systems that closely resemble the A.I. capabilities of humans. The main reason for this breakthrough in technology is advancements in Big Data. Recent developments in Big Data have allowed us the capability to organize a very large amount of information into structured components that can be very quickly processed by computers. Another technology that has the potential for rapidly advancing and transforming Artificial Intelligence is the Blockchain. While some of the applications that have been developed on Blockchain are nothing more than ledger records of transactions, others are so incredibly smart that they almost appear like AI. Here, we will look more closely at the opportunities for A.I. advancement through the Blockchain protocol.

# Magister Dixit

“Big data is not for the faint of heart, you and your team must be willing to master many disciplines in order to be successful. You’ll need understanding of code, hardware, Virtualization, networking, databases (SQL & NoSQL), ETL, Cloud, and more. Don’t fool yourself, you’ll need some serious skills on-board.” Kevin Daly ( 10.11.2014 )

# Document worth reading: “An introduction to Graph Data Management”

A graph database is a database where the data structures for the schema and/or instances are modeled as a (labeled)(directed) graph or generalizations of it, and where querying is expressed by graph-oriented operations and type constructors. In this article we present the basic notions of graph databases, give a historical overview of their main development, and study the main current systems that implement them.