We propose a new learning paradigm called Deep Memory. It has the potential to completely revolutionize the Machine Learning field. Surprisingly, this paradigm has not been reinvented yet, unlike Deep Learning. At the core of this approach is the \textit{Learning By Heart} principle, well studied in primary schools all over the world. Inspired by poem recitation, or by $\pi$ decimal memorization, we propose a concrete algorithm that mimics human behavior. We implement this paradigm on the task of generative modeling, and apply it to images, natural language and even the $\pi$ decimals as long as one can print them as text. The proposed algorithm even generated this paper, in a one-shot learning setting. In carefully designed experiments, we show that the generated samples are indistinguishable from the training examples, as measured by any statistical test or metric.
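As a rough illustration of the Learning By Heart principle, a memorizing generative model might look like the sketch below. The class name `DeepMemory` and its `fit`/`sample` methods are hypothetical and are not taken from the paper; the sketch only assumes that "generation" means reciting a stored training example.

```python
import random


class DeepMemory:
    """Hypothetical memorizing generator: it stores the training set verbatim."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._memory = []

    def fit(self, samples):
        # "Training" is learning by heart: keep every example as-is.
        self._memory = list(samples)
        return self

    def sample(self):
        # "Generation" is recitation: return one memorized example.
        return self._rng.choice(self._memory)


corpus = ["3.14159", "a poem", "this abstract"]
model = DeepMemory(seed=0).fit(corpus)
generated = [model.sample() for _ in range(5)]
```

By construction, every generated sample is an exact copy of a training example, which is why no statistical test could separate the two sets.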
As the size of modern data sets exceeds the disk and memory capacities of a single computer, machine learning practitioners have resorted to parallel and distributed computing. Given that optimization is one of the pillars of machine learning and predictive modeling, distributed optimization methods have recently garnered ample attention, in particular when either observations or features are distributed, but not both. We propose a general stochastic algorithm where observations, features, and gradient components can be sampled in a double distributed setting, i.e., with both features and observations distributed. Very technical analyses establish convergence properties of the algorithm under different conditions on the learning rate (diminishing to zero or constant). Computational experiments in Spark demonstrate a superior performance of our algorithm versus a benchmark in early iterations of the algorithm, which is due to the stochastic components of the algorithm.
Many modern applications require real-time processing of large volumes of high-speed data. Such data processing needs can be modeled as a streaming computation. A streaming computation is specified as a dataflow graph that exposes multiple opportunities for parallelizing its execution, in the form of data, pipeline and task parallelism. On the other hand, many important applications require that processing of the stream be ordered, where inputs are processed in the same order as they arrive. There is a fundamental conflict between ordered processing and parallelizing the streaming computation. This paper focuses on the problem of effectively parallelizing ordered streaming computations on a shared-memory multicore machine. We first address the key challenges in exploiting data parallelism in the ordered setting. We present a low-latency, non-blocking concurrent data structure to order outputs produced by concurrent workers on an operator. We also propose a new approach to parallelizing partitioned stateful operators that can handle load imbalance across partitions effectively and mostly avoid delays due to ordering. We illustrate the trade-offs and effectiveness of our concurrent data-structures on micro-benchmarks and streaming queries from the TPCx-BB benchmark. We then present an adaptive runtime that dynamically maps the exposed parallelism in the computation to that of the machine. We propose several intuitive scheduling heuristics and compare them empirically on the TPCx-BB queries. We find that for streaming computations, heuristics that exploit as much pipeline parallelism as possible perform better than those that seek to exploit data parallelism.
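The ordering problem described above can be illustrated with a small reorder buffer: workers finish in any order, but results are released strictly by input sequence number. This is only a sketch under simplifying assumptions. It uses a lock, so it is blocking, unlike the non-blocking structure the abstract describes, and the names `ReorderBuffer`, `put` and `emit` are hypothetical.

```python
import heapq
import threading


class ReorderBuffer:
    """Releases worker results in input order (lock-based illustrative sketch)."""

    def __init__(self, emit):
        self._emit = emit            # callback that receives results in order
        self._lock = threading.Lock()
        self._pending = []           # min-heap of (sequence number, result)
        self._next_seq = 0           # next sequence number allowed to leave

    def put(self, seq, value):
        """Called by a worker when the input with number `seq` is finished."""
        with self._lock:
            heapq.heappush(self._pending, (seq, value))
            # Drain every result that is now contiguous with the output so far.
            while self._pending and self._pending[0][0] == self._next_seq:
                self._emit(heapq.heappop(self._pending)[1])
                self._next_seq += 1


out = []
buf = ReorderBuffer(out.append)
buf.put(2, "c")   # arrives early, must wait for 0 and 1
buf.put(0, "a")   # released immediately
buf.put(1, "b")   # releases "b" and then the waiting "c"
```

Calling `emit` while holding the lock keeps the output order exact at the cost of serializing the emit step, which is the kind of delay a non-blocking design tries to avoid.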
As neural network algorithms show high performance in many applications, their efficient inference on mobile and embedded systems is of great interest. When a single-stream recurrent neural network (RNN) is executed for a personal user in an embedded system, it demands a large number of DRAM accesses because the network size is usually much bigger than the cache size and the weights of an RNN are used only once at each time step. We overcome this problem by parallelizing the algorithm and executing multiple time steps at a time. This approach also reduces power consumption by lowering the number of DRAM accesses. QRNN (Quasi-Recurrent Neural Network) and SRU (Simple Recurrent Unit) based recurrent neural networks are used for the implementation. The experiments for SRU showed speed-ups of about 300% and 930% when the number of time steps processed together is 4 and 16, respectively, on an ARM CPU based system.
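The multi-time-step idea can be sketched as follows. In an SRU-style cell the matrix products depend only on the inputs, not on the previous state, so each weight row can be fetched once and reused for a whole block of time steps. Only the element-wise recurrence remains sequential. This is a simplified illustration: it keeps only the cell-state update with a forget gate, and the function names are hypothetical rather than the authors' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def project_block(W, xs):
    """Apply W to every input in xs, visiting each weight row once per block."""
    out = [[0.0] * len(W) for _ in xs]
    for i, row in enumerate(W):        # each row is fetched once ...
        for t, x in enumerate(xs):     # ... and reused for all time steps
            out[t][i] = sum(w * v for w, v in zip(row, x))
    return out


def sru_cell_block(W, Wf, bf, xs, c0):
    """Simplified SRU cell-state update over a block of time steps."""
    candidates = project_block(W, xs)
    gate_inputs = project_block(Wf, xs)
    c, states = list(c0), []
    for t in range(len(xs)):           # cheap element-wise recurrence
        f = [sigmoid(p + b) for p, b in zip(gate_inputs[t], bf)]
        c = [fi * ci + (1.0 - fi) * xi
             for fi, ci, xi in zip(f, c, candidates[t])]
        states.append(c)
    return states


W = [[0.5, -0.2], [0.1, 0.3]]
Wf = [[0.2, 0.4], [-0.3, 0.1]]
bf = [0.0, 0.1]
xs = [[1.0, 2.0], [0.5, -1.0], [0.0, 1.5], [2.0, 0.0]]
block_states = sru_cell_block(W, Wf, bf, xs, [0.0, 0.0])
```

Processing the four inputs as one block gives the same states as four single-step calls, while each weight row is read once instead of four times.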
Interaction and collaboration between humans and intelligent machines have become increasingly important as machine learning methods move into real-world applications that involve end users. While much prior work lies at the intersection of natural language and vision, such as image captioning or image generation from text descriptions, less focus has been placed on the use of language to guide or improve the performance of a learned visual processing algorithm. In this paper, we explore methods to flexibly guide a trained convolutional neural network through user input to improve its performance during inference. We do so by inserting a layer that acts as a spatio-semantic guide into the network. This guide is trained to modify the network’s activations, either directly via an energy minimization scheme or indirectly through a recurrent model that translates human language queries to interaction weights. Learning the verbal interaction is fully automatic and does not require manual text annotations. We evaluate the method on two datasets, showing that guiding a pre-trained network can improve performance, and provide extensive insights into the interaction between the guide and the CNN.
In this work, we build an ensemble of machine learning models to predict the price of a product given its image, and visualize the features that result in higher or lower price predictions. We collect two novel datasets of product images and their MSRP prices for this purpose: a bicycle dataset and a car dataset. We set baselines for price regression using linear regression on histogram of oriented gradients (HOG) and convolutional neural network (CNN) features, and a baseline for price segment classification using a multiclass SVM. For our main models, we train several deep CNNs using both transfer learning and our own architectures, for both regression and classification. We achieve strong results on both datasets, with deep CNNs significantly outperforming other models in a variety of metrics. Finally, we use several recently-developed methods to visualize the image features that result in higher or lower prices.
This essay examines how what is considered to be artificial intelligence (AI) has changed over time and come to intersect with the expertise of the author. Initially, AI developed on a separate trajectory, both topically and institutionally, from pattern recognition, neural information processing, decision and control systems, and allied topics by focusing on symbolic systems within computer science departments rather than on continuous systems in electrical engineering departments. The separate evolutions continued throughout the author’s lifetime, with some crossover in reinforcement learning and graphical models, but were shocked into converging by the virality of deep learning, thus making an electrical engineer into an AI researcher. Now that this convergence has happened, opportunity exists to pursue an agenda that combines learning and reasoning bridged by interpretable machine learning models.
Machine-learning algorithms have gained popularity in recent years in the field of ecological modeling due to their promising predictive performance on classification problems. While the application of such algorithms has been greatly simplified in recent years by their well-documented integration in commonly used statistical programming languages such as R, there are several practical challenges in the field of ecological modeling related to unbiased performance estimation, optimization of algorithms using hyperparameter tuning and spatial autocorrelation. We address these issues by comparing several widely used machine-learning algorithms such as Boosted Regression Trees (BRT), k-Nearest Neighbor (WKNN), Random Forest (RF) and Support Vector Machine (SVM) to traditional parametric algorithms such as logistic regression (GLM) and semi-parametric ones like generalized additive models (GAM). Different nested cross-validation methods including hyperparameter tuning methods are used to evaluate model performances with the aim of obtaining bias-reduced performance estimates. As a case study, the spatial distribution of the forest disease Diplodia sapinea in the Basque Country in Spain is investigated using common environmental variables such as temperature, precipitation, soil or lithology as predictors. Results show that GAM and RF (mean AUROC estimates 0.708 and 0.699) outperform all other methods in predictive accuracy. The effect of hyperparameter tuning saturates at around 50 iterations for this data set. The AUROC differences between the bias-reduced (spatial cross-validation) and overoptimistic (non-spatial cross-validation) performance estimates of the GAM and RF are 0.167 (24%) and 0.213 (30%), respectively. It is recommended to also use spatial partitioning for cross-validation hyperparameter tuning of spatial data.
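The difference between spatial and non-spatial cross-validation comes down to how folds are formed. A minimal sketch of spatial partitioning is given below, assuming a simple grid-based blocking of the coordinates. The grid approach, the function names and the `cell_size` parameter are illustrative assumptions and not necessarily the partitioning used in the study.

```python
def spatial_folds(coords, n_folds, cell_size):
    """Assign each (x, y) point to a fold according to the grid cell it lies in."""
    def cell(point):
        x, y = point
        return (int(x // cell_size), int(y // cell_size))

    cells = sorted({cell(p) for p in coords})
    # Whole cells go to one fold, so nearby points are never split between
    # training and test data.
    cell_to_fold = {c: i % n_folds for i, c in enumerate(cells)}
    return [cell_to_fold[cell(p)] for p in coords]


def split_by_fold(folds, test_fold):
    """Return (train indices, test indices) for one held-out fold."""
    train = [i for i, f in enumerate(folds) if f != test_fold]
    test = [i for i, f in enumerate(folds) if f == test_fold]
    return train, test


coords = [(0.1, 0.2), (0.4, 0.9), (5.2, 0.3), (5.9, 0.8), (0.5, 5.5)]
folds = spatial_folds(coords, n_folds=3, cell_size=1.0)
train_idx, test_idx = split_by_fold(folds, test_fold=folds[2])
```

Because neighbouring observations share a fold, the test set is spatially separated from the training set, which is what removes the optimistic bias caused by spatial autocorrelation. The same partitioning can be reused in the inner loop for hyperparameter tuning.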
Reliable identification of molecular biomarkers is essential for accurate patient stratification. While state-of-the-art machine learning approaches for sample classification continue to push boundaries in terms of performance, most of these methods are not able to integrate different data types and lack generalization power, limiting their application in a clinical setting. Furthermore, many methods behave as black boxes, so we have very little understanding of the mechanisms that lead to the prediction provided. While opaqueness concerning machine behaviour might not be a problem in deterministic domains, in health care, providing explanations about the molecular factors and phenotypes that are driving the classification is crucial to build trust in the performance of the predictive system. We propose Pathway Induced Multiple Kernel Learning (PIMKL), a novel methodology to classify samples reliably that can, at the same time, provide a pathway-based molecular fingerprint of the signature that underlies the classification. PIMKL exploits prior knowledge in the form of molecular interaction networks and annotated gene sets, by optimizing a mixture of pathway-induced kernels using a Multiple Kernel Learning (MKL) algorithm, an approach that has demonstrated excellent performance in different machine learning applications. After optimizing the combination of kernels for prediction of a specific phenotype, the model provides a stable molecular signature that can be interpreted in the light of the ingested prior knowledge and that can be used in transfer learning tasks.
Over the past several decades, subspace clustering has received increasing interest and seen continuous progress. However, due to a lack of scalability and/or robustness, existing methods still have difficulty in dealing with data that simultaneously possess three characteristics: high-dimensional, massive and grossly corrupted. To tackle the scalability and robustness issues simultaneously, in this paper we propose to consider a problem called compressive robust subspace clustering, which is to perform robust subspace clustering with compressed data generated by projecting the original high-dimensional data onto a lower-dimensional subspace chosen at random. Given these random projections, the proposed method, row space pursuit (RSP), recovers not only the authentic row space, which provably leads to correct clustering results under certain conditions, but also the gross errors possibly existing in the data. The compressive nature of the random projections gives RSP high computational and storage efficiency, and the recovery property enables RSP to deal with grossly corrupted data. Extensive experiments on high-dimensional and/or large-scale datasets show that RSP can maintain accuracies comparable to prevalent methods with significant reductions in computational time.
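The compression step alone can be sketched as a random projection of each data point to a lower dimension. The sketch below assumes a Gaussian projection matrix and stores data points as rows. Both choices, and the function name `random_projection`, are assumptions for illustration, and the sketch does not cover the row space pursuit recovery itself.

```python
import random


def random_projection(data, k, seed=0):
    """Project each d-dimensional row of `data` to k dimensions.

    Uses a k x d matrix with i.i.d. Gaussian entries scaled by 1/sqrt(k).
    """
    rng = random.Random(seed)
    d = len(data[0])
    scale = 1.0 / k ** 0.5
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(r * v for r, v in zip(row, x)) for row in R] for x in data]


points = [[1.0, 0.0, 2.0, -1.0, 0.5, 3.0],
          [2.0, 0.0, 4.0, -2.0, 1.0, 6.0]]
compressed = random_projection(points, k=3)
```

Since the projection is linear, points that lie in a common subspace before compression still lie in a common (projected) subspace afterwards, which is what makes clustering on the compressed data plausible.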
Most existing Zero-Shot Learning (ZSL) methods suffer from a strong bias problem, in which instances of unseen (target) classes tend to be categorized as one of the seen (source) classes. As a result, they yield poor performance when deployed in generalized ZSL settings. In this paper, we propose a straightforward yet effective method named Quasi-Fully Supervised Learning (QFSL) to alleviate the bias problem. Our method follows the transductive learning approach, which assumes that both the labeled source images and unlabeled target images are available for training. In the semantic embedding space, the labeled source images are mapped to several fixed points specified by the source categories, and the unlabeled target images are forced to be mapped to other points specified by the target categories. Experiments conducted on the AwA2, CUB and SUN datasets demonstrate that our method outperforms existing state-of-the-art approaches by a huge margin of 9.3% to 24.5% under generalized ZSL settings, and by a large margin of 0.2% to 16.2% under conventional ZSL settings.
Although reinforcement learning methods can achieve impressive results in simulation, the real world presents two major challenges: generating samples is exceedingly expensive, and unexpected perturbations can cause proficient but narrowly-learned policies to fail at test time. In this work, we propose to learn how to quickly and effectively adapt online to new situations as well as to perturbations. To enable sample-efficient meta-learning, we consider learning online adaptation in the context of model-based reinforcement learning. Our approach trains a global model such that, when combined with recent data, the model can be rapidly adapted to the local context. Our experiments demonstrate that our approach can enable simulated agents to adapt their behavior online to novel terrains, to a crippled leg, and in highly-dynamic environments.
Convolutional neural network (CNN) architectures utilize downsampling layers, which restrict the subsequent layers to learn spatially invariant features while reducing computational costs. However, such a downsampling operation makes it impossible to use the full spectrum of input features. Motivated by this observation, we propose a novel layer called parallel grid pooling (PGP) which is applicable to various CNN models. PGP performs downsampling without discarding any intermediate feature. It works as data augmentation and is complementary to commonly used data augmentation techniques. Furthermore, we demonstrate that a dilated convolution can naturally be represented using PGP operations, which suggests that the dilated convolution can also be regarded as a type of data augmentation technique. Experimental results based on popular image classification benchmarks demonstrate the effectiveness of the proposed method. Code is available at: https://…/akitotakeki
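The statement that PGP downsamples without discarding features can be illustrated on a single 2-D feature map. The sketch below assumes that PGP with stride s splits the map into s × s downsampled maps, one per grid offset. This reading, and the function name, are assumptions based on the abstract rather than the released code.

```python
def parallel_grid_pooling(feature_map, stride=2):
    """Split a 2-D map into stride*stride downsampled maps, one per grid offset.

    Ordinary strided pooling keeps only the offset (0, 0) map; here every
    offset is kept, so no input value is thrown away.
    """
    return [
        [row[dx::stride] for row in feature_map[dy::stride]]
        for dy in range(stride)
        for dx in range(stride)
    ]


fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
pooled = parallel_grid_pooling(fmap, stride=2)
```

Each of the four outputs is a shifted, downsampled view of the same input, which is one way to see why the abstract describes PGP as a form of data augmentation.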
We train a network to generate mappings between training sets and classification policies (a ‘classifier generator’) by conditioning on the entire training set via an attentional mechanism. The network is directly optimized for test set performance on a training set of related tasks, which is then transferred to unseen ‘test’ tasks. We use this to optimize for performance in the low-data and unsupervised learning regimes, and obtain significantly better performance in the 10-50 datapoint regime than support vector classifiers, random forests, XGBoost, and k-nearest neighbors on a range of small datasets.
We investigate the behavior of convolutional neural networks (CNN) in the presence of label noise. We show empirically that CNN prediction for a given test sample depends on the labels of the training samples in its local neighborhood. This is similar to the way that the K-nearest neighbors (K-NN) classifier works. With this understanding, we derive an analytical expression for the expected accuracy of a K-NN, and hence a CNN, classifier for any level of noise. In particular, we show that K-NN and CNN classifiers are resistant to label noise that is randomly spread across the training set, but are very sensitive to label noise that is concentrated. Experiments on real datasets validate our analytical expression by showing that it matches the empirical results for varying degrees of label noise.
In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. We structurally enforce that the joint-action value is monotonic in the per-agent values, which allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralised and decentralised policies. We evaluate QMIX on a challenging set of StarCraft II micromanagement tasks, and show that QMIX significantly outperforms existing value-based multi-agent reinforcement learning methods.
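The monotonicity constraint can be sketched with a small mixing function whose weights are forced to be non-negative. In QMIX the mixing weights are produced from the global state. Here they are passed in as plain arguments to keep the example short, which is a simplifying assumption, and the two-layer shape, the ELU activation and the function names are illustrative choices.

```python
import math


def elu(z):
    return z if z > 0 else math.exp(z) - 1.0


def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Two-layer mixer of per-agent Q-values.

    abs() makes every weight non-negative and ELU is increasing, so the
    output can never decrease when a single agent's Q-value increases.
    """
    hidden = [
        elu(sum(abs(w) * q for w, q in zip(unit_weights, agent_qs)) + b)
        for unit_weights, b in zip(w1, b1)
    ]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2


w1 = [[0.5, -1.0], [-0.3, 0.2]]   # one weight list per hidden unit
b1 = [0.1, -0.4]
w2 = [-0.7, 0.9]
b2 = 0.05
q_low = monotonic_mix([1.0, 2.0], w1, b1, w2, b2)
q_high = monotonic_mix([1.5, 2.0], w1, b1, w2, b2)
```

Because the joint value is monotonic in each per-agent value, each agent picking its own greedy action also maximizes the joint value, which is the consistency between centralised and decentralised policies that the abstract refers to.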
Online learning algorithms have a wide variety of applications in large-scale machine learning problems due to their low computational and memory requirements. However, standard online learning methods still suffer from issues such as lower convergence rates and limited capability to select features or to recover the true features. In this paper, we present a novel framework for online learning based on running averages and introduce a series of online versions of some popular existing offline algorithms such as Adaptive Lasso, Elastic Net and Feature Selection with Annealing. We prove the equivalence between our online methods and their offline counterparts and give theoretical feature selection and convergence guarantees for some of them. In contrast to the existing online methods, the proposed methods can extract a model with any desired sparsity level at any time. Numerical experiments indicate that our new methods enjoy high feature selection accuracy and a fast convergence rate, compared with standard stochastic algorithms and offline learning algorithms. We also present some applications to large datasets where again the proposed framework shows competitive results compared to popular online and offline algorithms.
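The running-averages idea can be sketched for least-squares problems: the averages of x xᵀ and x y summarize the data seen so far, and a sparse model can be fitted from them at any time without revisiting the samples. The sketch below uses a plain Lasso solved by coordinate descent as a stand-in. It is not the paper's Adaptive Lasso, Elastic Net or Feature Selection with Annealing variants, and all names are hypothetical.

```python
class RunningAverages:
    """Keeps running averages of x*x^T and x*y over a stream of samples."""

    def __init__(self, p):
        self.n = 0
        self.sxx = [[0.0] * p for _ in range(p)]
        self.sxy = [0.0] * p

    def update(self, x, y):
        self.n += 1
        a = 1.0 / self.n
        for i in range(len(x)):
            self.sxy[i] += a * (x[i] * y - self.sxy[i])
            for j in range(len(x)):
                self.sxx[i][j] += a * (x[i] * x[j] - self.sxx[i][j])


def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_from_averages(avg, lam, sweeps=200):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1,
    using only the running averages, never the raw samples."""
    p = len(avg.sxy)
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            if avg.sxx[j][j] == 0.0:
                continue
            rho = avg.sxy[j] - sum(avg.sxx[j][k] * beta[k]
                                   for k in range(p) if k != j)
            beta[j] = soft_threshold(rho, lam) / avg.sxx[j][j]
    return beta


avg = RunningAverages(p=2)
for x in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0), (-1.0, 2.0)]:
    avg.update(x, 2.0 * x[0])          # true model: y = 2 * x0
dense = lasso_from_averages(avg, lam=0.0)
sparse = lasso_from_averages(avg, lam=5.0)
```

Changing `lam` changes the sparsity of the fitted model without touching the stream again, which mirrors the claim that a model of any desired sparsity level can be extracted at any time.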
Learning through experience is time-consuming, inefficient and often bad for your cortisol levels. To address this problem, a number of recently proposed teacher-student methods have demonstrated the benefits of private tuition, in which a single model learns from an ensemble of more experienced tutors. Unfortunately, the cost of such supervision restricts good representations to a privileged minority. Unsupervised learning can be used to lower tuition fees, but runs the risk of producing networks that require extracurriculum learning to strengthen their CVs and create their own LinkedIn profiles. Inspired by the logo on a promotional stress ball at a local recruitment fair, we make the following three contributions. First, we propose a novel almost no supervision training algorithm that is effective, yet highly scalable in the number of student networks being supervised, ensuring that education remains affordable. Second, we demonstrate our approach on a typical use case: learning to bake, developing a method that tastily surpasses the current state of the art. Finally, we provide a rigorous quantitative analysis of our method, proving that we have access to a calculator. Our work calls into question the long-held dogma that life is the best teacher.