Self-attentive feed-forward sequence models have been shown to achieve impressive results on sequence modeling tasks, thereby presenting a compelling alternative to recurrent neural networks (RNNs), which have remained the de facto standard architecture for many sequence modeling problems to date. Despite these successes, however, feed-forward sequence models like the Transformer fail to generalize in many tasks that recurrent models handle with ease (e.g. copying when the string lengths exceed those observed at training time). Moreover, and in contrast to RNNs, the Transformer model is not computationally universal, limiting its theoretical expressivity. In this paper we propose the Universal Transformer, which addresses these practical and theoretical shortcomings, and we show that it leads to improved performance on several tasks. Instead of recurring over the individual symbols of sequences like RNNs, the Universal Transformer repeatedly revises its representations of all symbols in the sequence with each recurrent step. In order to combine information from different parts of a sequence, it employs a self-attention mechanism in every recurrent step. Assuming sufficient memory, its recurrence makes the Universal Transformer computationally universal. We further employ an adaptive computation time (ACT) mechanism to allow the model to dynamically adjust the number of times the representation of each position in a sequence is revised. Beyond saving computation, we show that ACT can improve the accuracy of the model. Our experiments show that on various algorithmic tasks and a diverse set of large-scale language understanding tasks the Universal Transformer generalizes significantly better and outperforms both a vanilla Transformer and an LSTM in machine translation, and achieves a new state of the art on the bAbI linguistic reasoning task and the challenging LAMBADA language modeling task.
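The recurrence described in this abstract, in which one shared block revises the representation of every position at each step, can be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' implementation: it assumes single-head attention and leaves out layer normalisation, position and time-step embeddings, the decoder, and ACT. The names `recurrent_step`, `universal_encode` and `init_params` are hypothetical.

```python
import numpy as np


def softmax(scores):
    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(h, wq, wk, wv):
    # Single-head scaled dot-product attention over all positions.
    q, k, v = h @ wq, h @ wk, h @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v


def recurrent_step(h, p):
    # One revision of every position: attention, then a position-wise
    # feed-forward transition, each with a residual connection.
    h = h + self_attention(h, p["wq"], p["wk"], p["wv"])
    return h + np.maximum(h @ p["w1"], 0.0) @ p["w2"]


def universal_encode(x, p, steps):
    # The same parameters p are reused at every step (recurrence in depth).
    h = x
    for _ in range(steps):
        h = recurrent_step(h, p)
    return h


def init_params(d_model, d_hidden, seed=0):
    # Hypothetical small random initialisation, for illustration only.
    rng = np.random.default_rng(seed)
    shapes = {
        "wq": (d_model, d_model),
        "wk": (d_model, d_model),
        "wv": (d_model, d_model),
        "w1": (d_model, d_hidden),
        "w2": (d_hidden, d_model),
    }
    return {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}


x = np.random.default_rng(1).standard_normal((5, 8))  # 5 symbols, width 8
params = init_params(d_model=8, d_hidden=16)
out = universal_encode(x, params, steps=3)
print(out.shape)
```

In this sketch the number of steps is fixed; per the abstract, ACT would instead let each position decide how many revisions it receives.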
In this paper, we propose a novel Convolutional Neural Network (CNN) architecture for learning multi-scale feature representations with good tradeoffs between speed and accuracy. This is achieved by using a multi-branch network, which has different computational complexity at different branches. Through frequent merging of features from branches at distinct scales, our model obtains multi-scale features while using less computation. The proposed approach demonstrates improvement of model efficiency and performance on both object recognition and speech recognition tasks, using popular architectures including ResNet and ResNeXt. For object recognition, our approach reduces computation by 33% while improving accuracy by 0.9%. Furthermore, our model surpasses state-of-the-art CNN acceleration approaches by a large margin in accuracy and FLOPs reduction. On the task of speech recognition, our proposed multi-scale CNNs save 30% FLOPs with slightly better word error rates, showing good generalization across domains.
While model-based reinforcement learning has empirically been shown to significantly reduce the sample complexity that hinders model-free RL, the theoretical understanding of such methods has been rather limited. In this paper, we introduce a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees, and a practical algorithm Optimistic Lower Bounds Optimization (OLBO). In particular, we derive a theoretical guarantee of monotone improvement for model-based RL with our framework. We iteratively build a lower bound of the expected reward based on the estimated dynamical model and sample trajectories, and maximize it jointly over the policy and the model. Assuming the optimization in each iteration succeeds, the expected reward is guaranteed to improve. The framework also incorporates an optimism-driven perspective, and reveals the intrinsic measure for the model prediction error. Preliminary simulations demonstrate that our approach outperforms the standard baselines on continuous control benchmark tasks.
Automatic machine learning performs predictive modeling with high-performing machine learning tools without human interference. This is achieved by making machine learning applications parameter-free, i.e. only a dataset is provided while the complete model selection and model building process is handled internally through (often meta) optimization. Projects like Auto-WEKA and auto-sklearn aim to solve the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem, resulting in huge configuration spaces. However, for most real-world applications, the optimization over only a few different key learning algorithms can not only be sufficient, but also potentially beneficial. The latter becomes apparent when one considers that models have to be validated, explained, deployed and maintained. Here, less complex models are often preferred for validation or efficiency reasons, or are even a strict requirement. Automatic gradient boosting takes this idea one step further, using only gradient boosting as a single learning algorithm in combination with model-based hyperparameter tuning, threshold optimization and encoding of categorical features. We introduce this general framework as well as a concrete implementation called autoxgboost. It is compared to current AutoML projects on 16 datasets and, despite its simplicity, is able to achieve comparable results on about half of the datasets as well as performing best on two.
Observed multidimensional network data can have different levels of complexity, as nodes may be characterized by heterogeneous individual-specific features. Also, such characteristics may vary across the networks. This article discusses a novel class of models for multidimensional networks, able to deal with different levels of heterogeneity within and between networks. The proposed framework is developed within the family of latent space models, in order to distinguish recurrent symmetrical relations between the nodes from node-specific features in the different views. Model parameters are estimated via a Markov Chain Monte Carlo algorithm. Simulated data and FAO fruit import/export data are analysed to illustrate the performance of the proposed models.
Deep generative models have shown promising results in generating realistic images, but it is still non-trivial to generate images with complicated structures. The main reason is that most of the current generative models fail to explore the structures in the images, including spatial layout and semantic relations between objects. To address this issue, we propose a novel deep structured generative model which boosts generative adversarial networks (GANs) with the aid of structure information. In particular, the layout or structure of the scene is encoded by a stochastic and-or graph (sAOG), in which the terminal nodes represent single objects and edges represent relations between objects. With the sAOG appropriately harnessed, our model can successfully capture the intrinsic structure in the scenes and generate images of complicated scenes accordingly. Furthermore, a detection network is introduced to infer scene structures from an image. Experimental results demonstrate the effectiveness of our proposed method in both modeling the intrinsic structures and generating realistic images.
In this study, we address the challenge of measuring the ability of a recommender system to make surprising recommendations. Although current evaluation methods make it possible to determine if two algorithms can make recommendations with a significant difference in their average surprise measure, it could be of interest to our community to know how competent an algorithm is at embedding surprise in its recommendations, without having to resort to making a direct comparison with another algorithm. We argue that a) surprise is a finite resource in a recommender system, b) there is a limit to how much surprise any algorithm can embed in a recommendation, and c) this limit can provide us with a scale against which the performance of any algorithm can be measured. By exploring these ideas, it is possible to define the concepts of maximum and minimum potential surprise and design a surprise metric called ‘normalised surprise’ that employs these limits to potential surprise. Two experiments were conducted to test the proposed metric. The aim of the first was to validate the quality of the estimates of minimum and maximum potential surprise produced by a greedy algorithm. The purpose of the second experiment was to analyse the behaviour of the proposed metric using the MovieLens dataset. The results confirmed the behaviour that was expected, and showed that the proposed surprise metric is both effective and consistent for differing choices of recommendation algorithms, data representations and distance functions.
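One plausible reading of a metric that "employs these limits" is a min-max rescaling of the observed surprise between the minimum and maximum potential surprise. The sketch below assumes exactly that; the abstract does not give the formula, so the paper's definition may differ, and the function name and arguments are hypothetical. The two limits are taken as given here (in the paper they are estimated by a greedy algorithm, which is not reproduced).

```python
def normalised_surprise(observed, s_min, s_max):
    """Rescale an observed surprise value to the range set by its limits.

    Assumption: min-max scaling, so 0 means the minimum potential surprise
    and 1 means the maximum potential surprise.
    """
    if s_max < s_min:
        raise ValueError("s_max must not be smaller than s_min")
    if s_max == s_min:
        # No room for surprise at all; report 0 by convention (an assumption).
        return 0.0
    return (observed - s_min) / (s_max - s_min)


# Hypothetical values: a recommendation list whose surprise lies halfway
# between the two limits.
print(normalised_surprise(0.5, 0.0, 1.0))
```

Under this reading, scores from different algorithms, data representations or distance functions become comparable because each is expressed relative to its own attainable range.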
Multimodal machine learning is a core research area spanning the language, visual and acoustic modalities. The central challenge in multimodal learning involves learning representations that can process and relate information from multiple modalities. In this paper, we propose two methods for unsupervised learning of joint multimodal representations using sequence to sequence (Seq2Seq) methods: a \textit{Seq2Seq Modality Translation Model} and a \textit{Hierarchical Seq2Seq Modality Translation Model}. We also explore multiple different variations on the multimodal inputs and outputs of these seq2seq models. Our experiments on multimodal sentiment analysis using the CMU-MOSI dataset indicate that our methods learn informative multimodal representations that outperform the baselines and achieve improved performance on multimodal sentiment analysis, specifically in the Bimodal case where our model is able to improve F1 Score by twelve points. We also discuss future directions for multimodal Seq2Seq methods.
An analytic process is iterative between two agents, an analyst and an analytic toolbox. Each iteration comprises three main steps: preparing a dataset, running an analytic tool, and evaluating the result, where dataset preparation and result evaluation, conducted by the analyst, are largely domain-knowledge driven. In this work, the focus is on automating the result evaluation step. The underlying problem is to identify plots that are deemed interesting by an analyst. We propose a methodology to learn such an analyst’s intent based on Generative Adversarial Networks (GANs) and demonstrate its applications in the context of production yield optimization using data collected from several product lines.
Deep learning uses a hierarchical network architecture to represent complicated features of input patterns. Such an architecture is well known to offer higher learning capability than some conventional models if the best set of parameters in the optimal network structure is found. We have been developing an adaptive learning method that can discover the optimal network structure in a Deep Belief Network (DBN). The learning method can construct the network structure with the optimal number of hidden neurons in each Restricted Boltzmann Machine (RBM) and with the optimal number of layers in the DBN during the learning phase. The network structure can be self-organized according to the given input patterns of a big data set. In this paper, we embed the adaptive learning method into the recurrent temporal RBM and the self-generated layer into the DBN. To verify the effectiveness of the proposed method, we report experimental results that show higher classification capability than conventional methods.
We propose a novel end-to-end neural network architecture that, once trained, directly outputs a probabilistic clustering of a batch of input examples in one pass. It estimates a distribution over the number of clusters $k$, and for each $1 \leq k \leq k_\mathrm{max}$, a distribution over the individual cluster assignment for each data point. The network is trained in advance in a supervised fashion on separate data to learn grouping by any perceptual similarity criterion based on pairwise labels (same/different group). It can then be applied to different data containing different groups. We demonstrate promising performance on high-dimensional data like images (COIL-100) and speech (TIMIT). We call this “learning to cluster” and show its conceptual difference to deep metric learning, semi-supervised clustering and other related approaches while having the advantage of performing learnable clustering fully end-to-end.
The data mining technique of time series clustering is well established in many fields. However, as an unsupervised learning method, it requires making choices that are nontrivially influenced by the nature of the data involved. The aim of this paper is to verify the usefulness of the time series clustering method for macroeconomics research, and to develop the most suitable methodology. By extensively testing various possibilities, we arrive at a choice of a dissimilarity measure (compression-based dissimilarity measure, or CDM) which is particularly suitable for clustering macroeconomic variables. We check that the results are stable in time and reflect large-scale phenomena such as crises. We also successfully apply our findings to the analysis of national economies, specifically to identifying their structural relations.
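The compression-based dissimilarity measure is commonly defined as CDM(x, y) = C(xy) / (C(x) + C(y)), where C(·) is the compressed size and xy is the concatenation of the two sequences. A minimal sketch follows. The abstract does not say which compressor or discretisation the authors use, so `zlib` and the equal-width binning in the hypothetical helper `discretize` are assumptions made for illustration.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    # Size in bytes after compression; zlib is an assumed choice of compressor.
    return len(zlib.compress(data, 9))


def cdm(x: bytes, y: bytes) -> float:
    # Close to 0.5 when y adds little new information to x,
    # close to 1 when the two sequences share no compressible structure.
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def discretize(series, levels=8):
    # Hypothetical preprocessing: equal-width binning into `levels` symbols,
    # one byte per observation.
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0
    return bytes(min(levels - 1, int((v - lo) / span * levels)) for v in series)


rng = random.Random(0)
cyclical = discretize([i % 20 for i in range(400)])      # strongly periodic series
noisy = discretize([rng.random() for _ in range(400)])   # structureless series

print(cdm(cyclical, cyclical), cdm(cyclical, noisy))
```

A matrix of pairwise CDM values computed this way can be passed to any standard clustering routine that accepts a precomputed dissimilarity matrix.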
Missing data are ubiquitous in many domains such as healthcare. Depending on how they are missing, the (conditional) independence relations in the observed data may be different from those for the complete data generated by the underlying causal process and, as a consequence, simply applying existing causal discovery methods to the observed data may lead to wrong conclusions. It is then essential to extend existing causal discovery approaches to find the true underlying causal structure from such incomplete data. In this paper, we aim at solving this problem for data that are missing with different mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). With missingness mechanisms represented by a missingness graph (m-graph), we analyze conditions under which additional correction is needed to derive conditional independence/dependence relations in the complete data. Based on our analysis, we propose missing value PC (MVPC), which combines additional corrections with a traditional causal discovery algorithm, namely PC. Our proposed MVPC is shown in theory to give asymptotically correct results even using data that are MAR and MNAR. Experimental results illustrate that the proposed algorithm can correct the conditional independence for values MCAR, MAR and rather general cases of values MNAR on both synthetic data and a real-life healthcare application.
Due to the iterative nature of most nonnegative matrix factorization (\textsc{NMF}) algorithms, initialization is a key aspect as it significantly influences both the convergence and the final solution obtained. Many initialization schemes have been proposed for \textsc{NMF}, among which one of the most popular classes of methods is based on the singular value decomposition (SVD). However, these SVD-based initializations do not satisfy a rather natural condition, namely that the error should decrease as the rank of factorization increases. In this paper, we propose a novel SVD-based \textsc{NMF} initialization to specifically address this shortcoming by taking into account the SVD factors that were discarded to obtain a nonnegative initialization. This method, referred to as nonnegative SVD with low-rank correction (NNSVD-LRC), allows us to significantly reduce the initial error at a negligible additional computational cost using the low-rank structure of the discarded SVD factors. NNSVD-LRC has two other advantages compared to previous SVD-based initializations: (1) it provably generates sparse initial factors, and (2) it is faster as it only requires computing a truncated SVD of rank $\lceil r/2 + 1 \rceil$, where $r$ is the factorization rank of the sought \textsc{NMF} decomposition (as opposed to a rank-$r$ truncated SVD for other methods). We show on several standard dense and sparse data sets that our new method competes favorably with state-of-the-art SVD-based initializations for \textsc{NMF}.
With the rise of big data, business intelligence had to find solutions for managing even greater data volumes and variety than in data warehouses, which proved ill-adapted. Data lakes answer these needs from a storage point of view, but require managing adequate metadata to guarantee efficient access to data. Starting from a multidimensional metadata model designed for an industrial heritage data lake, which lacks schema evolutivity, we propose in this paper to use ensemble modeling, and more precisely a data vault, to address this issue. To illustrate the feasibility of this approach, we instantiate our metadata conceptual model into relational and document-oriented logical and physical models, respectively. We also compare the physical models in terms of metadata storage and query response time.
Modern Convolutional Neural Networks (CNNs) are extremely powerful on a range of computer vision tasks. However, their performance may degrade when the data is characterised by large intra-class variability caused by spatial transformations. The Spatial Transformer Network (STN) is currently the method of choice for providing CNNs the ability to remove those transformations and improve performance in an end-to-end learning framework. In this paper, we propose the Densely Fused Spatial Transformer Network (DeSTNet), which, to the best of our knowledge, is the first dense fusion pattern for combining multiple STNs. Specifically, we show how changing the connectivity pattern of multiple STNs from sequential to dense leads to more powerful alignment modules. Extensive experiments on three benchmarks, namely MNIST, GTSRB, and IDocDB, show that the proposed technique outperforms related state-of-the-art methods (i.e., STNs and CSTNs) both in terms of accuracy and robustness.
This paper describes the design and use of the graph-based parsing framework and toolkit UniParse, released as an open-source Python software package. As a framework, UniParse streamlines research prototyping, development and evaluation of graph-based dependency parsing architectures in a novel way. UniParse does this by enabling highly efficient, sufficiently independent, easily readable, and easily extensible implementations for all dependency parser components. We distribute the toolkit with ready-made configurations as re-implementations of all current state-of-the-art first-order graph-based parsers, including even more efficient Cython implementations of both encoders and decoders, as well as the required specialised loss functions.
The size of a website’s active user base directly affects its value. Thus, it is important to monitor and influence a user’s likelihood to return to a site. Essential to this is predicting when a user will return. Current state-of-the-art approaches to solve this problem come in two flavors: (1) Recurrent Neural Network (RNN) based solutions and (2) survival analysis methods. We observe that both techniques are severely limited when applied to this problem. Survival models can only incorporate aggregate representations of users instead of automatically learning a representation directly from a raw time series of user actions. RNNs can automatically learn features, but cannot be directly trained with examples of non-returning users who have no target value for their return time. We develop a novel RNN survival model that removes the limitations of the state-of-the-art methods. We demonstrate that this model can successfully be applied to return time prediction on a large e-commerce dataset with a superior ability to discriminate between returning and non-returning users than either method applied in isolation.
While we are usually focused on predicting future values of time series, it is often valuable to additionally predict their entire probability distributions, for example for risk evaluation or Monte Carlo simulations. Using a time series of $\approx$ 30000 Dow Jones Industrial Average values as an example, we show an application of hierarchical correlation reconstruction for this purpose: mean-square fitting of a polynomial as the joint density for (current value, context), where the context is, for example, a few previous values. Then, substituting the currently observed context and normalizing the density to 1, we obtain the predicted probability distribution for the current value. In contrast to standard machine learning approaches like neural networks, optimal coefficients here can be inexpensively directly calculated, are unique and independent, each has a specific cumulant-like interpretation, and such an approximation can approach a complete description of any joint distribution, providing a perfect tool to quantitatively describe and exploit statistical dependencies in time series.
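The procedure in this abstract (fit a polynomial joint density, substitute the context, normalise) can be illustrated with a small sketch for a single context value. The details below are assumptions rather than the author's code: variables are first mapped to (0, 1) by their empirical ranks, the basis is the orthonormal shifted Legendre polynomials, and each coefficient is then simply an average of basis products over the sample. All function names are hypothetical, and synthetic data stand in for the Dow Jones series.

```python
import numpy as np
from numpy.polynomial import legendre


def to_uniform(values):
    # Empirical-CDF normalisation: ranks mapped into (0, 1).
    ranks = np.argsort(np.argsort(values))
    return (ranks + 0.5) / len(values)


def basis(u, degree):
    # Orthonormal polynomials on [0, 1]: f_j(u) = sqrt(2j + 1) * P_j(2u - 1).
    u = np.atleast_1d(np.asarray(u, dtype=float))
    cols = []
    for j in range(degree + 1):
        coef = np.zeros(j + 1)
        coef[j] = 1.0
        cols.append(np.sqrt(2 * j + 1) * legendre.legval(2 * u - 1, coef))
    return np.stack(cols, axis=-1)  # shape (n, degree + 1)


def fit_coefficients(u_x, u_y, degree=2):
    # Mean-square optimal coefficients: a[j, k] = mean of f_j(x_i) * f_k(y_i).
    return basis(u_x, degree).T @ basis(u_y, degree) / len(u_x)


def conditional_density(a, x_grid, y0):
    # Substitute the observed context y0 and normalise to integrate to 1.
    # A low-degree polynomial density may dip below zero in places.
    f_y = basis(y0, a.shape[1] - 1)[0]
    f_x = basis(x_grid, a.shape[0] - 1)
    return (f_x @ a @ f_y) / (a[0] @ f_y)


rng = np.random.default_rng(0)
context = rng.normal(size=2000)                  # synthetic "previous value"
current = context + 0.5 * rng.normal(size=2000)  # synthetic "current value"

a = fit_coefficients(to_uniform(current), to_uniform(context), degree=2)
grid = (np.arange(1000) + 0.5) / 1000
density = conditional_density(a, grid, y0=0.9)
print(density.mean(), (grid * density).mean())  # approximate integral and mean
```

With several previous values as context, the same idea would use products of basis functions over all coordinates, at the cost of more coefficients.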
In this work we offer a framework for reasoning about a wide class of existing objectives in machine learning. We develop a formal correspondence between this work and thermodynamics and discuss its implications.
Similarity measures play a fundamental role in memory-based nearest neighbors approaches. They recommend items to a user based on the similarity of either items or users in a neighborhood. In this paper we argue that, although it keeps a leading importance in computing recommendations, similarity between users or items should be paired with a value of dissimilarity (computed not just as the complement of the similarity one). We formally modeled and injected this notion in some of the most used similarity measures and evaluated our approach showing its effectiveness in terms of accuracy results.
Whether neural networks can learn abstract reasoning or whether they merely rely on superficial statistics is a topic of recent debate. Here, we propose a dataset and challenge designed to probe abstract reasoning, inspired by a well-known human IQ test. To succeed at this challenge, models must cope with various generalisation ‘regimes’ in which the training and test data differ in clearly-defined ways. We show that popular models such as ResNets perform poorly, even when the training and test sets differ only minimally, and we present a novel architecture, with a structure designed to encourage reasoning, that does significantly better. When we vary the way in which the test questions and training data differ, we find that our model is notably proficient at certain forms of generalisation, but notably weak at others. We further show that the model’s ability to generalise improves markedly if it is trained to predict symbolic explanations for its answers. Altogether, we introduce and explore ways to both measure and induce stronger abstract reasoning in neural networks. Our freely-available dataset should motivate further progress in this direction.