A significant challenge for the practical application of reinforcement learning in the real world is the need to specify an oracle reward function that correctly defines a task. Inverse reinforcement learning (IRL) seeks to avoid this challenge by instead inferring a reward function from expert behavior. While appealing, it can be impractically expensive to collect datasets of demonstrations that cover the variation common in the real world (e.g. opening any type of door). Thus in practice, IRL must commonly be performed with only a limited set of demonstrations where it can be exceedingly difficult to unambiguously recover a reward function. In this work, we exploit the insight that demonstrations from other tasks can be used to constrain the set of possible reward functions by learning a ‘prior’ that is specifically optimized for the ability to infer expressive reward functions from limited numbers of demonstrations. We demonstrate that our method can efficiently recover rewards from images for novel tasks and provide intuition as to how our approach is analogous to learning a prior.
In this study we present a sequential sampling methodology for Bayesian inference in decomposable graphical models. We recast the problem of graph estimation, which in general lacks natural sequential interpretation, into a sequential setting. Specifically, we propose a recursive Feynman-Kac model which generates a flow of junction tree distributions over a space of increasing dimensions and develop an efficient sequential Monte Carlo sampler. As a key ingredient of the proposal kernel in our sampler we use the Christmas tree algorithm developed in the companion paper Olsson et al. [2018]. We focus on particle MCMC methods, in particular particle Gibbs (PG) as it allows for generating MCMC chains with global moves on an underlying space of decomposable graphs. To further improve the algorithm mixing properties of this PG, we incorporate a systematic refreshment step implemented through direct sampling from a backward kernel. The theoretical properties of the algorithm are investigated, showing in particular that the refreshment step improves the algorithm performance in terms of asymptotic variance of the estimated distribution. Performance accuracy of the graph estimators are illustrated through a collection of numerical examples demonstrating the feasibility of the suggested approach in both discrete and continuous graphical models.
Semi-supervised node classification involves learning to classify unlabelled nodes given a partially labeled graph. In transductive learning, all unlabelled nodes to be classified are observed during training and in inductive learning, predictions are to be made for nodes not seen at training. In this paper, we focus on both these settings for node classification in attributed graphs, i.e., graphs in which nodes have additional features. State-of-the-art models for node classification on such attributed graphs use differentiable recursive functions. These differentiable recursive functions enable aggregation and filtering of neighborhood information from multiple hops (depths). Despite being powerful, these variants are limited in their ability to combine information from different hops efficiently. In this work, we analyze this limitation of recursive graph functions in terms of their representation capacity to effectively capture multi-hop neighborhood information. Further, we provide a simple fusion component which is mathematically motivated to address this limitation and improve the existing models to explicitly learn the importance of information from different hops. This proposed mechanism is shown to improve over existing methods across 8 popular datasets from different domains. Specifically, our model improves the Graph Convolutional Network (GCN) and a variant of Graph SAGE by a significant margin providing highly competitive state-of-the-art results.
A longstanding problem in machine learning is to find unsupervised methods that can learn the statistical structure of high dimensional signals. In recent years, GANs have gained much attention as a possible solution to the problem, and in particular have shown the ability to generate remarkably realistic high resolution sampled images. At the same time, many authors have pointed out that GANs may fail to model the full distribution (‘mode collapse’) and that using the learned models for anything other than generating samples may be very difficult. In this paper, we examine the utility of GANs in learning statistical models of images by comparing them to perhaps the simplest statistical model, the Gaussian Mixture Model. First, we present a simple method to evaluate generative models based on relative proportions of samples that fall into predetermined bins. Unlike previous automatic methods for evaluating models, our method does not rely on an additional neural network nor does it require approximating intractable computations. Second, we compare the performance of GANs to GMMs trained on the same datasets. While GMMs have previously been shown to be successful in modeling small patches of images, we show how to train them on full sized images despite the high dimensionality. Our results show that GMMs can generate realistic samples (although less sharp than those of GANs) but also capture the full distribution, which GANs fail to do. Furthermore, GMMs allow efficient inference and explicit representation of the underlying statistical structure. Finally, we discuss how a pix2pix network can be used to add high-resolution details to GMM samples while maintaining the basic diversity.
Hidden tree Markov models allow learning distributions for tree structured data while being interpretable as nondeterministic automata. We provide a concise summary of the main approaches in literature, focusing in particular on the causality assumptions introduced by the choice of a specific tree visit direction. We will then sketch a novel non-parametric generalization of the bottom-up hidden tree Markov model with its interpretation as a nondeterministic tree automaton with infinite states.
Most artificial intelligence models have limiting ability to solve new tasks faster, without forgetting previously acquired knowledge. The recently emerging paradigm of continual learning aims to solve this issue, in which the model learns various tasks in a sequential fashion. In this work, a novel approach for continual learning is proposed, which searches for the best neural architecture for each coming task via sophisticatedly designed reinforcement learning strategies. We name it as Reinforced Continual Learning. Our method not only has good performance on preventing catastrophic forgetting but also fits new tasks well. The experiments on sequential classification tasks for variants of MNIST and CIFAR-100 datasets demonstrate that the proposed approach outperforms existing continual learning alternatives for deep networks.
Performance monitoring, anomaly detection, and root-cause analysis in complex cyber-physical systems (CPSs) are often highly intractable due to widely diverse operational modes, disparate data types, and complex fault propagation mechanisms. This paper presents a new data-driven framework for root-cause analysis, based on a spatiotemporal graphical modeling approach built on the concept of symbolic dynamics for discovering and representing causal interactions among sub-systems of complex CPSs. We formulate the root-cause analysis problem as a minimization problem via the proposed inference based metric and present two approximate approaches for root-cause analysis, namely the sequential state switching ($S^3$, based on free energy concept of a restricted Boltzmann machine, RBM) and artificial anomaly association ($A^3$, a classification framework using deep neural networks, DNN). Synthetic data from cases with failed pattern(s) and anomalous node(s) are simulated to validate the proposed approaches. Real dataset based on Tennessee Eastman process (TEP) is also used for comparison with other approaches. The results show that: (1) $S^3$ and $A^3$ approaches can obtain high accuracy in root-cause analysis under both pattern-based and node-based fault scenarios, in addition to successfully handling multiple nominal operating modes, (2) the proposed tool-chain is shown to be scalable while maintaining high accuracy, and (3) the proposed framework is robust and adaptive in different fault conditions and performs better in comparison with the state-of-the-art methods.
Automated dermoscopic image analysis has witnessed rapid growth in diagnostic performance. Yet adoption faces resistance, in part, because no evidence is provided to support decisions. In this work, an approach for evidence-based classification is presented. A feature embedding is learned with CNNs, triplet-loss, and global average pooling, and used to classify via kNN search. Evidence is provided as both the discovered neighbors, as well as localized image regions most relevant to measuring distance between query and neighbors. To ensure that results are relevant in terms of both label accuracy and human visual similarity for any skill level, a novel hierarchical triplet logic is implemented to jointly learn an embedding according to disease labels and non-expert similarity. Results are improved over baselines trained on disease labels alone, as well as standard multiclass loss. Quantitative relevance of results, according to non-expert similarity, as well as localized image regions, are also significantly improved.
The problem of attributing a deep network’s prediction to its \emph{input/base} features is well-studied. We introduce the notion of \emph{conductance} to extend the notion of attribution to the understanding the importance of \emph{hidden} units. Informally, the conductance of a hidden unit of a deep network is the \emph{flow} of attribution via this hidden unit. We use conductance to understand the importance of a hidden unit to the prediction for a specific input, or over a set of inputs. We evaluate the effectiveness of conductance in multiple ways, including theoretical properties, ablation studies, and a feature selection task. The empirical evaluations are done using the Inception network over ImageNet data, and a sentiment analysis network over reviews. In both cases, we demonstrate the effectiveness of conductance in identifying interesting insights about the internal workings of these networks.
We consider the problem of how to improve automatic target recognition by fusing the naive sensor-level classification decisions with ‘intuition,’ or context, in a mathematically principled way. This is a general approach that is compatible with many definitions of context, but for specificity, we consider context as co-occurrence in imagery. In particular, we consider images that contain multiple objects identified at various confidence levels. We learn the patterns of co-occurrence in each context, then use these patterns as hyper-parameters for a Hierarchical Bayesian Model. The result is that low-confidence sensor classification decisions can be dramatically improved by fusing those readings with context. We further use hyperpriors to address the case where multiple contexts may be appropriate. We also consider the Bayesian Network, an alternative to the Hierarchical Bayesian Model, which is computationally more efficient but assumes that context and sensor readings are uncorrelated.
Clustering methods are a valuable tool for the identification of patterns in high dimensional data with applications in many scientific problems. However, quantifying uncertainty in clustering is a challenging problem, particularly when dealing with High Dimension Low Sample Size (HDLSS) data. We develop here a U-statistics based clustering approach that assesses statistical significance in clustering and is specifically tailored to HDLSS scenarios. These non-parametric methods rely on very few assumptions about the data, and thus can be applied to a wide range of datasets for which the euclidean distance captures relevant features. We propose two significance clustering algorithms, a hierarchical method and a non-nested version. In order to do so, we first propose an extension of a relevant U-statistics and develop its asymptotic theory. Our methods are tested through extensive simulations and found to be more powerful than competing alternatives. They are further showcased in two applications ranging from genetics to image recognition problems.
Deep convolutional network architectures are often assumed to guarantee generalization for small image translations and deformations. In this paper we show that modern CNNs (VGG16, ResNet50, and InceptionResNetV2) can drastically change their output when an image is translated in the image plane by a few pixels, and that this failure of generalization also happens with other realistic small image transformations. Furthermore, the deeper the network the more we see these failures to generalize. We show that these failures are related to the fact that the architecture of modern CNNs ignores the classical sampling theorem so that generalization is not guaranteed. We also show that biases in the statistics of commonly used image datasets makes it unlikely that CNNs will learn to be invariant to these transformations. Taken together our results suggest that the performance of CNNs in object recognition falls far short of the generalization capabilities of humans.
Employing deep neural networks to obtain state-of-the-art performance on computer vision tasks can consume billions of floating point operations and several Joules of energy per evaluation. Network pruning, which statically removes unnecessary features and weights, has emerged as a promising way to reduce this computation cost. In this paper, we propose channel gating, a dynamic, fine-grained, training-based computation-cost-reduction scheme. Channel gating works by identifying the regions in the features which contribute less to the classification result and turning off a subset of the channels for computing the pixels within these uninteresting regions. Unlike static network pruning, the channel gating optimizes computations exploiting characteristics specific to each input at run-time. We show experimentally that applying channel gating in state-of-the-art networks can achieve 66% and 60% reduction in FLOPs with 0.22% and 0.29% accuracy loss on the CIFAR-10 and CIFAR-100 datasets, respectively.