The present paper considers a finite sequence of graphs, e.g., coming from technological, biological, and social networks, each of which is modelled as a realization of a graph-valued random variable, and proposes a methodology to identify possible changes in stationarity in its generating stochastic process. In order to cover a large class of applications, we consider a general family of attributed graphs, chatacterized by a possible variable topology (edges and vertices) also in the stationary case. A Change Point Method (CPM) approach is proposed, that (i) maps graphs into a vector domain; (ii) applies a suitable statistical test; (iii) detects the change –if any– according to a confidence level and provides an estimate for its time of occurrence. Two specific CPMs are proposed: one detecting shifts in the distribution mean, the other addressing generic changes affecting the distribution. We ground our proposal with theoretical results showing how to relate the inference attained in the numerical vector space to the graph domain, and vice versa. Finally, simulations on epileptic-seizure detection problems are conducted on real-world data providing evidence for the CPMs effectiveness.
A code package, BlurRing, is developed as a method to allow for multi-dimensional likelihood visualisation. From the BlurRing visualisation additional information about the likelihood can be extracted. The spread in any direction of the overlaid likelihood curves gives information about the uncertainty on the confidence intervals presented in the two-dimensional likelihood plots.
Our goal is to explore how the abilities brought in by a dialogue manager can be included in end-to-end visually grounded conversational agents. We make initial steps towards this general goal by augmenting a task-oriented visual dialogue model with a decision-making component that decides whether to ask a follow-up question to identify a target referent in an image, or to stop the conversation to make a guess. Our analyses show that adding a decision making component produces dialogues that are less repetitive and that include fewer unnecessary questions, thus potentially leading to more efficient and less unnatural interactions.
We present a novel framework for augmenting data sets for machine learning based on counterexamples. Counterexamples are misclassified examples that have important properties for retraining and improving the model. Key components of our framework include a counterexample generator, which produces data items that are misclassified by the model and error tables, a novel data structure that stores information pertaining to misclassifications. Error tables can be used to explain the model’s vulnerabilities and are used to efficiently generate counterexamples for augmentation. We show the efficacy of the proposed framework by comparing it to classical augmentation techniques on a case study of object detection in autonomous driving based on deep neural networks.
Classical approach to regularization is to design norms enhancing smoothness or sparsity and then to use this norm or some power of this norm as a regularization function. The choice of the regularization function (for instance a power function) in terms of the norm is mostly dictated by computational purpose rather than theoretical considerations. In this work, we design regularization functions that are motivated by theoretical arguments. To that end we introduce a concept of optimal regularization called ‘minimax regularization’ and, as a proof of concept, we show how to construct such a regularization function for the $\ell_1^d$ norm for the random design setup. We develop a similar construction for the deterministic design setup. It appears that the resulting regularized procedures are different from the one used in the LASSO in both setups.
High-dimensional logistic regression is widely used in analyzing data with binary outcomes. In this paper, global testing and large-scale multiple testing for the regression coefficients are considered in both single- and two-regression settings. A test statistic for testing the global null hypothesis is constructed using a generalized low-dimensional projection for bias correction and its asymptotic null distribution is derived. A minimax lower bound for the global testing is established, which shows that the proposed test is asymptotically minimax optimal. For testing the individual coefficients simultaneously, multiple testing procedures are proposed and shown to control the false discovery rate (FDR) and falsely discovered variables (FDV) asymptotically. Simulation studies are carried out to examine the numerical performance of the proposed tests and their superiority over existing methods. The testing procedures are also illustrated by analyzing a metabolomics study that investigates the association between fecal metabolites and pediatric Crohn’s disease and the effects of treatment on such associations.
We present a new dataset and models for comprehending paragraphs about processes (e.g., photosynthesis), an important genre of text describing a dynamic world. The new dataset, ProPara, is the first to contain natural (rather than machine-generated) text about a changing world along with a full annotation of entity states (location and existence) during those changes (81k datapoints). The end-task, tracking the location and existence of entities through the text, is challenging because the causal effects of actions are often implicit and need to be inferred. We find that previous models that have worked well on synthetic data achieve only mediocre performance on ProPara, and introduce two new neural models that exploit alternative mechanisms for state prediction, in particular using LSTM input encoding and span prediction. The new models improve accuracy by up to 19%. The dataset and models are available to the community at http://…/propara.
Spectral dimensionality reduction methods enable linear separations of complex data with high-dimensional features in a reduced space. However, these methods do not always give the desired results due to irregularities or uncertainties of the data. Thus, we consider aggressively modifying the scales of the features to obtain the desired classification. Using prior knowledge on the labels of partial samples to specify the Fiedler vector, we formulate an eigenvalue problem of a linear matrix pencil whose eigenvector has the feature scaling factors. The resulting factors can modify the features of entire samples to form clusters in the reduced space, according to the known labels. In this study, we propose new dimensionality reduction methods supervised using the feature scaling associated with the spectral clustering. Numerical experiments show that the proposed methods outperform well-established supervised methods for toy problems with more samples than features, and are more robust regarding clustering than existing methods. Also, the proposed methods outperform existing methods regarding classification for real-world problems with more features than samples of gene expression profiles of cancer diseases. Furthermore, the feature scaling tends to improve the clustering and classification accuracies of existing unsupervised methods, as the proportion of training data increases.
Deep hierarchical reinforcement learning has gained a lot of attention in recent years due to its ability to produce state-of-the-art results in challenging environments where non-hierarchical frameworks fail to learn useful policies. However, as problem domains become more complex, deep hierarchical reinforcement learning can become inefficient, leading to longer convergence times and poor performance. We introduce the Deep Nested Agent framework, which is a variant of deep hierarchical reinforcement learning where information from the main agent is propagated to the low level $nested$ agent by incorporating this information into the nested agent’s state. We demonstrate the effectiveness and performance of the Deep Nested Agent framework by applying it to three scenarios in Minecraft with comparisons to a deep non-hierarchical single agent framework, as well as, a deep hierarchical framework.
Modern driver assistance systems rely on a wide range of sensors (RADAR, LIDAR, ultrasound and cameras) for scene understanding and prediction. These sensors are typically used for detecting traffic participants and scene elements required for navigation. In this paper we argue that relying on camera based systems, specifically Around View Monitoring (AVM) system has great potential to achieve these goals in both parking and driving modes with decreased costs. The contributions of this paper are as follows: we present a new end-to-end solution for delimiting the safe drivable area for each frame by means of identifying the closest obstacle in each direction from the driving vehicle, we use this approach to calculate the distance to the nearest obstacles and we incorporate it into a unified end-to-end architecture capable of joint object detection, curb detection and safe drivable area detection. Furthermore, we describe the family of networks for both a high accuracy solution and a low complexity solution. We also introduce further augmentation of the base architecture with 3D object detection.
In this paper, we study the problem of modeling users’ diverse interests. Previous methods usually learn a fixed user representation, which has a limited ability to represent distinct interests of a user. In order to model users’ various interests, we propose a Memory Attention-aware Recommender System (MARS). MARS utilizes a memory component and a novel attentional mechanism to learn deep \textit{adaptive user representations}. Trained in an end-to-end fashion, MARS adaptively summarizes users’ interests. In the experiments, MARS outperforms seven state-of-the-art methods on three real-world datasets in terms of recall and mean average precision. We also demonstrate that MARS has a great interpretability to explain its recommendation results, which is important in many recommendation scenarios.
Aspect based sentiment analysis (ABSA) can provide more detailed information than general sentiment analysis, because it aims to predict the sentiment polarities of the given aspects or entities in text. We summarize previous approaches into two subtasks: aspect-category sentiment analysis (ACSA) and aspect-term sentiment analysis (ATSA). Most previous approaches employ long short-term memory and attention mechanisms to predict the sentiment polarity of the concerned targets, which are often complicated and need more training time. We propose a model based on convolutional neural networks and gating mechanisms, which is more accurate and efficient. First, the novel Gated Tanh-ReLU Units can selectively output the sentiment features according to the given aspect or entity. The architecture is much simpler than attention layer used in the existing models. Second, the computations of our model could be easily parallelized during training, because convolutional layers do not have time dependency as in LSTM layers, and gating units also work independently. The experiments on SemEval datasets demonstrate the efficiency and effectiveness of our models.
In this article, we propose a new class of priors for Bayesian inference with multiple Gaussian graphical models. We introduce fully Bayesian treatments of two popular procedures, the group graphical lasso and the fused graphical lasso, and extend them to a continuous spike-and-slab framework to allow self-adaptive shrinkage and model selection simultaneously. We develop an EM algorithm that performs fast and dynamic explorations of posterior modes. Our approach selects sparse models efficiently with substantially smaller bias than would be induced by alternative regularization procedures. The performance of the proposed methods are demonstrated through simulation and two real data examples.
This paper reviews recent developments in statistical structure learning; namely, Bayesian model reduction. Bayesian model reduction is a special but ubiquitous case of Bayesian model comparison that, in the setting of variational Bayes, furnishes an analytic solution for (a lower bound on) model evidence induced by a change in priors. This analytic solution finesses the problem of scoring large model spaces in model comparison or structure learning. This is because each new model can be cast in terms of an alternative set of priors over model parameters. Furthermore, the reduced free energy (i.e. evidence bound on the reduced model) finds an expedient application in hierarchical models, where it plays the role of a summary statistic. In other words, it contains all the necessary information contained in the posterior distributions over parameters of lower levels. In this technical note, we review Bayesian model reduction – in terms of common forms of reduced free energy – and illustrate recent applications in structure learning, hierarchical or empirical Bayes and as a metaphor for neurobiological processes like abductive reasoning and sleep.
Knowledge distillation compacts deep networks by letting a small student network learn from a large teacher network. The accuracy of knowledge distillation recently benefited from adding residual layers. We propose to reduce the size of the student network even further by recasting multiple residual layers in the teacher network into a single recurrent student layer. We propose three variants of adding recurrent connections into the student network, and show experimentally on CIFAR-10, Scenes and MiniPlaces, that we can reduce the number of parameters at little loss in accuracy.
Capsule Networks have shown encouraging results on \textit{defacto} benchmark computer vision datasets such as MNIST, CIFAR and smallNORB. Although, they are yet to be tested on tasks where (1) the entities detected inherently have more complex internal representations and (2) there are very few instances per class to learn from and (3) where point-wise classification is not suitable. Hence, this paper carries out experiments on face verification in both controlled and uncontrolled settings that together address these points. In doing so we introduce \textit{Siamese Capsule Networks}, a new variant that can be used for pairwise learning tasks. The model is trained using contrastive loss with $\ell_2$-normalized capsule encoded pose features. We find that \textit{Siamese Capsule Networks} perform well against strong baselines on both pairwise learning datasets, yielding best results in the few-shot learning setting where image pairs in the test set contain unseen subjects.
Network embedding has become a hot research topic recently which can provide low-dimensional feature representations for many machine learning applications. Current work focuses on either (1) whether the embedding is designed as an unsupervised learning task by explicitly preserving the structural connectivity in the network, or (2) whether the embedding is a by-product during the supervised learning of a specific discriminative task in a deep neural network. In this paper, we focus on bridging the gap of the two lines of the research. We propose to adapt the Generative Adversarial model to perform network embedding, in which the generator is trying to generate vertex pairs, while the discriminator tries to distinguish the generated vertex pairs from real connections (edges) in the network. Wasserstein-1 distance is adopted to train the generator to gain better stability. We develop three variations of models, including GANE which applies cosine similarity, GANE-O1 which preserves the first-order proximity, and GANE-O2 which tries to preserves the second-order proximity of the network in the low-dimensional embedded vector space. We later prove that GANE-O2 has the same objective function as GANE-O1 when negative sampling is applied to simplify the training process in GANE-O2. Experiments with real-world network datasets demonstrate that our models constantly outperform state-of-the-art solutions with significant improvements on precision in link prediction, as well as on visualizations and accuracy in clustering tasks.
This paper presents a new mathematical framework to analyze the loss functions of deep neural networks with ReLU functions. Furthermore, as as application of this theory, we prove that the loss functions can reconstruct the inputs of the training samples up to scalar multiplication (as vectors) and can provide the number of layers and nodes of the deep neural network. Namely, if we have all input and output of a loss function (or equivalently all possible learning process), for all input of each training sample $x_i \in \mathbb{R}^n$, we can obtain vectors $x'_i\in \mathbb{R}^n$ satisfying $x_i=c_ix'_i$ for some $c_i \neq 0$. To prove theorem, we introduce the notion of virtual polynomials, which are polynomials written as the output of a node in a deep neural network. Using virtual polynomials, we find an algebraic structure for the loss surfaces, called semi-algebraic sets. We analyze these loss surfaces from the algebro-geometric point of view. Factorization of polynomials is one of the most standard ideas in algebra. Hence, we express the factorization of the virtual polynomials in terms of their active paths. This framework can be applied to the leakage problem in the training of deep neural networks. The main theorem in this paper indicates that there are many risks associated with the training of deep neural networks. For example, if we have N (the dimension of weight space) + 1 nonsmooth points on the loss surface, which are sufficiently close to each other, we can obtain the input of training sample up to scalar multiplication. We also point out that the structures of the loss surfaces depend on the shape of the deep neural network and not on the training samples.
Recurrent neural networks have become ubiquitous in computing representations of sequential data, especially textual data in natural language processing. In particular, Bidirectional LSTMs are at the heart of several neural models achieving state-of-the-art performance in a wide variety of tasks in NLP. We propose a general and effective improvement to the BiLSTM model which encodes each suffix and prefix of a sequence of tokens in both forward and reverse directions. We call our model Suffix BiLSTM or SuBiLSTM. Using an extensive set of experiments, we demonstrate that using SuBiLSTM instead of a BiLSTM in existing base models leads to improvements in performance in learning general sentence representations, text classification, textual entailment and named entity recognition. We achieve new state-of-the-art results for fine-grained sentiment classification and question classification using SuBiLSTM.