For a deep learning model, efficient execution of its computation graph is key to achieving high performance. Previous work has focused on improving the performance of individual nodes of the computation graph, while ignoring the parallelization of the graph as a whole. However, we observe that running multiple operations simultaneously without interference is critical to executing small parallelizable operations efficiently. Attempts to execute the computation graph in parallel in deep learning frameworks usually incur heavy resource contention among concurrent operations, leading to inferior performance on manycore CPUs. To address these issues, we propose Graphi, a generic and high-performance execution engine that efficiently executes a computation graph in parallel on manycore CPUs. Specifically, Graphi minimizes interference on both software and hardware resources, discovers the best parallel setting with a profiler, and further optimizes graph execution with critical-path-first scheduling. Our experiments show that parallel execution consistently outperforms sequential execution. Training times on four different neural networks with Graphi are 2.1x to 9.5x faster than those with TensorFlow on a 68-core Intel Xeon Phi processor.
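As a rough illustration of critical-path-first scheduling, the sketch below ranks each node of a small DAG by the cost of the longest path from that node to a sink, and always dispatches the ready node with the highest rank. This is a generic list-scheduling sketch, not Graphi's implementation; the function names, the per-node `costs`, and the example graph are hypothetical.

```python
import heapq
from collections import defaultdict


def critical_path_lengths(costs, edges):
    """Cost of the longest path from each node to any sink, own cost included."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    memo = {}

    def length(u):
        if u not in memo:
            memo[u] = costs[u] + max((length(v) for v in succ[u]), default=0)
        return memo[u]

    return {u: length(u) for u in costs}


def critical_path_first_order(costs, edges):
    """Topological order that always picks the ready node with the longest remaining path."""
    succ = defaultdict(list)
    indeg = {u: 0 for u in costs}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    rank = critical_path_lengths(costs, edges)
    # Max-heap on rank, emulated by negating the key.
    ready = [(-rank[u], u) for u in costs if indeg[u] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, u = heapq.heappop(ready)
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                heapq.heappush(ready, (-rank[v], v))
    return order


# Hypothetical graph: "c" is expensive, so it is dispatched before "b".
costs = {"a": 1, "b": 1, "c": 5, "d": 1}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(critical_path_first_order(costs, edges))
```

In a real engine the node costs would presumably come from profiling, and the ready nodes would be handed to a pool of workers rather than emitted as a single order.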
We provide the first information-theoretically tight analysis for inference of latent community structure given a sparse graph along with high-dimensional node covariates that are correlated with the same latent communities. Our work bridges recent theoretical breakthroughs in the detection of latent community structure without node covariates and a large body of empirical work using diverse heuristics for combining node covariates with graphs for inference. The tightness of our analysis implies, in particular, the information-theoretic necessity of combining the different sources of information. Our analysis holds for networks of large degrees as well as for a Gaussian version of the model.
In the Text Classification areas of Sentiment Analysis, Subjectivity/Objectivity Analysis, and Opinion Polarity, Convolutional Neural Networks (CNNs) have gained special attention because of their performance and accuracy. In this work, we apply recent advances in CNNs and propose a novel architecture, Multiple Block Convolutional Highways (MBCH), which achieves improved accuracy on multiple popular benchmark datasets compared to previous architectures. The MBCH is based on recent techniques and architectures including highway networks, DenseNet, batch normalization and bottleneck layers. In addition, to cope with the limitations of the existing pre-trained word vectors that are used as inputs to the CNN, we propose a novel method, Improved Word Vectors (IWV). The IWV improves the accuracy of CNNs used for text classification tasks.
A weighted Shiryaev–Roberts change detection procedure is shown to approximately minimize the expected delay to detection, as well as higher moments of the detection delay, among all change-point detection procedures with a given low maximal local probability of a false alarm within a window of fixed length. The result holds in pointwise and minimax settings, for general non-i.i.d. data models, and for the composite post-change hypothesis in which the post-change parameter is unknown. We establish very general conditions on the models under which the weighted Shiryaev–Roberts procedure is asymptotically optimal. These conditions are formulated in terms of the rate of convergence in the strong law of large numbers for the log-likelihood ratios between the ‘change’ and ‘no-change’ hypotheses, and we also provide sufficient conditions for a large class of ergodic Markov processes. Examples in which these conditions hold are given.
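To make the statistic concrete, the following minimal sketch runs a weighted Shiryaev–Roberts recursion under strong simplifying assumptions that are not those of the abstract: observations are i.i.d. unit-variance Gaussian, the change shifts the mean from 0 to an unknown value, and the weight is a discrete distribution over a hypothetical grid `thetas`. For each candidate value the usual recursion R_n = (1 + R_{n-1}) Λ_n is applied, and the weighted statistic is the weight-averaged sum.

```python
import math


def weighted_sr_detect(xs, thetas, weights, threshold):
    """Return the first 1-based index at which the weighted SR statistic
    reaches `threshold`, or None if no alarm is raised."""
    r = [0.0] * len(thetas)  # one SR statistic per candidate post-change mean
    for n, x in enumerate(xs, start=1):
        for i, theta in enumerate(thetas):
            # Likelihood ratio of N(theta, 1) against N(0, 1) for one observation.
            lr = math.exp(theta * x - 0.5 * theta * theta)
            r[i] = (1.0 + r[i]) * lr
        stat = sum(w * ri for w, ri in zip(weights, r))
        if stat >= threshold:
            return n
    return None


# Hypothetical data: 20 pre-change observations at 0, then a shift to 2.
data = [0.0] * 20 + [2.0] * 20
alarm = weighted_sr_detect(data, thetas=[1.0, 2.0], weights=[0.5, 0.5], threshold=100.0)
print(alarm)
```

The threshold controls the trade-off between false alarms and detection delay; its value here is arbitrary and chosen only for the example.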
In this work we provide a new technique to design fast approximation algorithms for graph problems where the points of the graph lie in a metric space. Specifically, we present a sampling approach for such metric graphs that, using a sublinear number of edge weight queries, provides a {\em linear sampling}, where each edge is (roughly speaking) sampled proportionally to its weight. For several natural problems, such as densest subgraph and max cut among others, we show that by sparsifying the graph using this sampling process, we can run a suitable approximation algorithm on the sparsified graph and the result remains a good approximation for the original problem. Our results have several interesting implications, such as providing the first sublinear time approximation algorithm for densest subgraph in a metric space, and improving the running time of estimating the average distance.
Making sense of a dataset in an automatic and unsupervised fashion is a challenging problem in statistics and AI. Classical approaches for density estimation, even when taking into account mixtures of probabilistic models, are not flexible enough to deal with the uncertainty inherent to real-world data: they are generally restricted to a priori fixed homogeneous likelihood model and to latent variable structures where expressiveness comes at the price of tractability. We propose Automatic Bayesian Density Analysis (ABDA) to go beyond classical mixture model density estimation, casting uncertainty estimation on both the underlying structure in the data, as well as the selection of adequate likelihood models for the data—thus statistical data types of the variable in the data—into a joint inference problem. Specifically, ABDA relies on a hierarchical model explicitly incorporating arbitrarily rich collections of likelihood models at a local level, while capturing global variable interactions by an expressive deep structure built on a sum-product network. Extensive empirical evidence shows that ABDA is more accurate than density estimators in the literature at dealing with both kinds of uncertainties, at modeling and predicting real-world (mixed continuous and discrete) data in both transductive and inductive scenarios, and at recovering the statistical data types.
This article addresses the estimation of a model’s predictive uncertainty, using the detection of atrial fibrillation from a single-lead ECG signal as an example. To this end, the model predicts the parameters of a beta distribution over class probabilities instead of the probabilities themselves. It is shown that this approach makes it possible to detect atypical recordings and significantly improves the quality of the algorithm on confident predictions.
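A small sketch of how predicted beta parameters can be turned into a point estimate and an uncertainty measure: for Beta(α, β), the mean is α/(α+β) and the variance is αβ/((α+β)²(α+β+1)). The confidence rule and the `max_std` cut-off below are hypothetical; the abstract does not state which criterion is used.

```python
def beta_summary(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, var


def is_confident(alpha, beta, max_std=0.1):
    """Hypothetical rule: trust the prediction if the standard deviation is small."""
    _, var = beta_summary(alpha, beta)
    return var ** 0.5 <= max_std


# Same predicted probability (0.5), very different uncertainty.
print(beta_summary(2.0, 2.0), is_confident(2.0, 2.0))
print(beta_summary(50.0, 50.0), is_confident(50.0, 50.0))
```

Both parameter pairs give the same mean, but the larger parameters concentrate the distribution, which is what allows low-confidence (possibly atypical) recordings to be flagged.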
In recent years, deep generative models have been shown to ‘imagine’ convincing high-dimensional observations such as images, audio, and even video, learning directly from raw data. In this work, we ask how to imagine goal-directed visual plans — a plausible sequence of observations that transition a dynamical system from its current configuration to a desired goal state, which can later be used as a reference trajectory for control. We focus on systems with high-dimensional observations, such as images, and propose an approach that naturally combines representation learning and planning. Our framework learns a generative model of sequential observations, where the generative process is induced by a transition in a low-dimensional planning model, and an additional noise. By maximizing the mutual information between the generated observations and the transition in the planning model, we obtain a low-dimensional representation that best explains the causal nature of the data. We structure the planning model to be compatible with efficient planning algorithms, and we propose several such models based on either discrete or continuous states. Finally, to generate a visual plan, we project the current and goal observations onto their respective states in the planning model, plan a trajectory, and then use the generative model to transform the trajectory to a sequence of observations. We demonstrate our method on imagining plausible visual plans of rope manipulation.
Inference models are a key component in scaling variational inference to deep latent variable models, most notably as encoder networks in variational auto-encoders (VAEs). By replacing conventional optimization-based inference with a learned model, inference is amortized over data examples and therefore more computationally efficient. However, standard inference models are restricted to direct mappings from data to approximate posterior estimates. The failure of these models to reach fully optimized approximate posterior estimates results in an amortization gap. We aim toward closing this gap by proposing iterative inference models, which learn to perform inference optimization through repeatedly encoding gradients. Our approach generalizes standard inference models in VAEs and provides insight into several empirical findings, including top-down inference techniques. We demonstrate the inference optimization capabilities of iterative inference models and show that they outperform standard inference models on several benchmark data sets of images and text.
We present a system that hybridizes properties of self-organized maps (SOMs) with spiking neural networks (SNNs) while retaining many of the features of SOMs. Networks are trained in an unsupervised manner to learn a self-organized lattice of filters via excitatory-inhibitory interactions among populations of neurons. We develop and test various inhibition strategies, such as inhibition that grows with inter-neuron distance and two distinct levels of inhibition. The quality of the unsupervised learning algorithm is evaluated using examples with known labels. Several biologically-inspired classification tools are proposed and compared, including population-level confidence rating and n-grams using a spike motif algorithm. Using the optimal choice of parameters, our approach produces improvements over state-of-the-art spiking neural networks.
This thesis applies entropy as a model-independent measure to address three research questions concerning financial time series. In the first study we apply transfer entropy to drawdowns and drawups in foreign exchange rates, to study their correlation and cross-correlation. When applied to daily and hourly EUR/USD and GBP/USD exchange rates, we find evidence of dependence among the largest draws (i.e. 5% and 95% quantiles), but not as strong as the correlation between the daily returns of the same pair of FX rates. In the second study we use state space models (Hidden Markov Models) of volatility to investigate volatility spillovers between exchange rates. Among the currency pairs, the co-movement of EUR/USD and CHF/USD volatility states shows the strongest observed relationship. With the use of transfer entropy, we find evidence of information flows between the volatility state series of AUD, CAD and BRL. The third study uses the entropy of S&P realised volatility to detect changes of volatility regime in order to re-examine the theme of market volatility timing of hedge funds. A one-factor model, conditioned on information about the entropy of market volatility, is used to measure the dynamics of hedge funds’ equity exposure. On a cross section of around 2500 hedge funds with a focus on the US equity markets we find that, over the period from 2000 to 2014, hedge funds adjust their exposure dynamically in response to changes in volatility regime. This adds to the literature on the volatility timing behaviour of hedge fund managers, but uses entropy as a model-independent measure of volatility regime.
Reinforcement Learning (RL) is a learning paradigm concerned with learning to control a system so as to maximize an objective over the long term. This approach to learning has received immense interest in recent times and success manifests itself in the form of human-level performance on games like \textit{Go}. While RL is emerging as a practical component in real-life systems, most successes have been in Single Agent domains. This report will instead specifically focus on challenges that are unique to Multi-Agent Systems interacting in mixed cooperative and competitive environments. The report concludes with advances in the paradigm of training Multi-Agent Systems called \textit{Decentralized Actor, Centralized Critic}, based on an extension of MDPs called \textit{Decentralized Partially Observable MDP}s, which has seen a renewed interest lately.
Convolutional neural networks (CNNs) have achieved great success in many computer vision problems. Unlike existing works that design CNN architectures to improve performance on a single task in a single domain and that do not generalize, we present IBN-Net, a novel convolutional architecture that remarkably enhances a CNN’s modeling ability on one domain (e.g. Cityscapes) as well as its generalization capacity on another domain (e.g. GTA5) without finetuning. IBN-Net carefully integrates Instance Normalization (IN) and Batch Normalization (BN) as building blocks, and can be wrapped into many advanced deep networks to improve their performance. This work has three key contributions. (1) By delving into IN and BN, we disclose that IN learns features that are invariant to appearance changes, such as colors, styles, and virtuality/reality, while BN is essential for preserving content-related information. (2) IBN-Net can be applied to many advanced deep architectures, such as DenseNet, ResNet, ResNeXt, and SENet, and consistently improves their performance without increasing computational cost. (3) When applying the trained networks to new domains, e.g. from GTA5 to Cityscapes, IBN-Net achieves improvements comparable to those of domain adaptation methods, even without using data from the target domain. With IBN-Net, we won 1st place in the WAD 2018 Challenge Drivable Area track, with an mIoU of 86.18%.
We proposed the expected energy-based restricted Boltzmann machine (EE-RBM) as a discriminative RBM method for classification. Two characteristics of the EE-RBM are that the output is unbounded and that the target value of correct classification is set to a value much greater than one. In this study, by adopting features of the EE-RBM approach to feed-forward neural networks, we propose the UnBounded output network (UBnet) which is characterized by three features: (1) unbounded output units; (2) the target value of correct classification is set to a value much greater than one; and (3) the models are trained by a modified mean-squared error objective. We evaluate our approach using the MNIST, CIFAR-10, and CIFAR-100 benchmark datasets. We first demonstrate, for shallow UBnets on MNIST, that a setting of the target value equal to the number of hidden units significantly outperforms a setting of the target value equal to one, and it also outperforms standard neural networks by about 25\%. We then validate our approach by achieving high-level classification performance on the three datasets using unbounded output residual networks. We finally use MNIST to analyze the learned features and weights, and we demonstrate that UBnets are much more robust against adversarial examples than the standard approach of using a softmax output layer and training the networks by a cross-entropy objective.
Data mining and machine learning techniques such as classification and regression trees (CART) represent a promising alternative to conventional logistic regression for propensity score estimation. Whereas incomplete data preclude the fitting of a logistic regression on all subjects, CART is appealing in part because some implementations allow for incomplete records to be incorporated in the tree fitting and provide propensity score estimates for all subjects. Based on theoretical considerations, we argue, however, that the automatic handling of missing data by CART may not be appropriate. Using a series of simulation experiments, we examined the performance of different approaches to handling missing covariate data: (i) applying the CART algorithm directly to the (partially) incomplete data, (ii) complete case analysis, and (iii) multiple imputation. Performance was assessed in terms of bias in estimating exposure-outcome effects among the exposed, standard error, mean squared error and coverage. Applying the CART algorithm directly to incomplete data resulted in bias, even in scenarios where data were missing completely at random. Overall, multiple imputation followed by CART resulted in the best performance. Our study showed that automatic handling of missing data in CART can cause serious bias and does not outperform multiple imputation as a means to account for missing data.
From the viewpoint of physical-layer authentication, spoofing attacks can be foiled by checking channel state information (CSI). Existing CSI-based authentication algorithms mostly require deep knowledge of the channel to deliver decent performance. In this paper, we investigate CSI-based authenticators that spare the effort of predetermining channel properties by utilizing deep neural networks (DNNs). We first propose a convolutional neural network (CNN)-enabled authenticator that is able to extract the local features in CSI. Next, we employ a recurrent neural network (RNN) to capture the dependencies between different frequencies in CSI. In addition, we propose to use a convolutional recurrent neural network (CRNN), a combination of the CNN and the RNN, to learn local and contextual information in CSI for user authentication. To effectively train these DNNs, one needs a large amount of labeled channel records. However, it is often expensive to label large numbers of channel observations in the presence of a spoofer. In view of this, we further study the case in which only a small part of the channel observations are labeled. To handle it, we extend these DNN-enabled approaches into semi-supervised ones. This extension is based on a semi-supervised learning technique that employs both the labeled and unlabeled data to train a DNN. Specifically, our semi-supervised method begins by generating pseudo labels for the unlabeled channel samples by running the K-means algorithm in a semi-supervised manner. Subsequently, both the labeled and pseudo-labeled data are exploited to pre-train a DNN, which is then fine-tuned on the labeled channel records.
Generative adversarial networks (GANs) are one of the most popular methods for generating images today. While impressive results have been validated by visual inspection, a number of quantitative criteria have emerged only recently. We argue here that the existing ones are insufficient and need to be adapted to the task at hand. In this paper we introduce two measures based on image classification, GAN-train and GAN-test, which approximate the recall (diversity) and precision (quality of the image) of GANs respectively. We evaluate a number of recent GAN approaches based on these two measures and demonstrate a clear difference in performance. Furthermore, we observe that the increasing difficulty of the dataset, from CIFAR10 through CIFAR100 to ImageNet, shows an inverse correlation with the quality of the GANs, as is clearly evident from our measures.
Deep neural network models owe their representational power to the high number of learnable parameters. It is often infeasible to run these largely parametrized deep models in limited resource environments, like mobile phones. Network models employing conditional computing are able to reduce computational requirements while achieving high representational power, with their ability to model hierarchies. We propose Conditional Information Gain Networks, which allow the feed forward deep neural networks to execute conditionally, skipping parts of the model based on the sample and the decision mechanisms inserted in the architecture. These decision mechanisms are trained using cost functions based on differentiable Information Gain, inspired by the training procedures of decision trees. These information gain based decision mechanisms are differentiable and can be trained end-to-end using a unified framework with a general cost function, covering both classification and decision losses. We test the effectiveness of the proposed method on MNIST and recently introduced Fashion MNIST datasets and show that our information gain based conditional execution approach can achieve better or comparable classification results using significantly fewer parameters, compared to standard convolutional neural network baselines.
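One way to picture an information-gain objective for a routing decision is as the mutual information between the branch chosen and the class label, estimated on a batch. The sketch below is written under assumptions not stated in the abstract: the decision node outputs soft routing probabilities `route_probs` of shape (N, K), and the joint distribution over branch and class is the batch average. It uses NumPy for clarity, so it only evaluates the quantity; in an autodiff framework the same expression could be maximized with respect to the routing probabilities.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a probability array, ignoring zero entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def information_gain(route_probs, labels, num_classes):
    """I(branch; class) = H(class) + H(branch) - H(branch, class) on one batch."""
    n = route_probs.shape[0]
    onehot = np.eye(num_classes)[labels]        # (N, C) one-hot labels
    joint = route_probs.T @ onehot / n          # (K, C) joint over branch and class
    h_class = entropy(joint.sum(axis=0))
    h_branch = entropy(joint.sum(axis=1))
    return h_class + h_branch - entropy(joint.ravel())


labels = np.array([0, 0, 1, 1])
# Routing that separates the two classes perfectly vs. uninformative routing.
pure = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
flat = np.full((4, 2), 0.5)
print(information_gain(pure, labels, 2), information_gain(flat, labels, 2))
```

Routing that sends each class down its own branch attains the maximum (the class entropy), while uninformative routing yields zero gain.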
Although deep learning approaches have stood out in recent years due to their state-of-the-art results, they continue to suffer from catastrophic forgetting, a dramatic decrease in overall performance when training with new classes added incrementally. This is because current neural network architectures require the entire dataset, consisting of all the samples from the old as well as the new classes, to update the model, a requirement that easily becomes unsustainable as the number of classes grows. We address this issue with our approach to learning deep neural networks incrementally, using new data and only a small exemplar set corresponding to samples from the old classes. This is based on a loss composed of a distillation measure to retain the knowledge acquired from the old classes, and a cross-entropy loss to learn the new classes. Our incremental training is achieved while keeping the entire framework end-to-end, i.e., learning the data representation and the classifier jointly, unlike recent methods with no such guarantees. We evaluate our method extensively on the CIFAR-100 and ImageNet (ILSVRC 2012) image classification datasets, and show state-of-the-art performance.
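The combined objective can be sketched as a cross-entropy term over all classes plus a distillation term that keeps the updated model's outputs on the old classes close to those of the previous model. The temperature value, the equal weighting of the two terms, and the function names below are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax with an optional temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def incremental_loss(new_logits, labels, old_logits, num_old, temperature=2.0):
    """Cross-entropy on all classes plus distillation on the old classes."""
    n = new_logits.shape[0]
    probs = softmax(new_logits)
    cross_entropy = -np.log(probs[np.arange(n), labels]).mean()
    # Softened outputs of the previous model act as targets for the old classes.
    teacher = softmax(old_logits, temperature)
    student = softmax(new_logits[:, :num_old], temperature)
    distillation = -(teacher * np.log(student)).sum(axis=1).mean()
    return cross_entropy + distillation


# Hypothetical sample: two old classes, one new class (index 2).
new_logits = np.array([[2.0, 0.0, 1.0]])
labels = np.array([2])
print(incremental_loss(new_logits, labels, np.array([[2.0, 0.0]]), num_old=2))
```

The distillation term is smallest when the updated model reproduces the previous model's softened outputs on the old classes, which is what discourages forgetting.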
Extracting textual features from tweets is a challenging process due to the noisy nature of the content and the weak signal of most of the words used. In this paper, we propose using singular value decomposition (SVD) with clustering to enhance the signals of the textual features in the tweets and so improve their correlation with events. The proposed technique applies SVD to the time series vector of each feature, factorizing the matrix of feature/day counts in order to ensure the independence of the feature vectors. Afterwards, k-means clustering is applied to build a lookup table that maps the members of each cluster to the cluster centroid. The lookup table is used to map each feature in the original data to the centroid of its cluster; we then add the term-frequency vectors of all features in each cluster to the term-frequency vector of the cluster centroid. To test the technique, we calculated the correlations of the cluster centroids with the golden standard record (GSR) vector before and after summing the vectors of the cluster members into the centroid vector. The proposed method is evaluated with multiple correlation measures, including Pearson, Spearman, distance correlation and Kendall's tau. The experiments also consider different word forms and lengths of the features, including keywords, n-grams, skip-grams and bags-of-words. The correlation results are enhanced significantly: the highest correlation scores increased from 0.3 to 0.6, and the average correlation scores increased from 0.3 to 0.4.
Given two networks with the same training loss on a dataset, when would they have drastically different test losses and errors? Better understanding of this question of generalization may improve practical applications of deep networks. In this paper we show that with cross-entropy loss it is surprisingly simple to induce significantly different generalization performances for two networks that have the same architecture, the same meta parameters and the same training error: one can either pretrain the networks with different levels of ‘corrupted’ data or simply initialize the networks with weights of different Gaussian standard deviations. A corollary of recent theoretical results on overfitting shows that these effects are due to an intrinsic problem of measuring test performance with a cross-entropy/exponential-type loss, which can be decomposed into two components, both minimized by SGD, one of which is not related to expected classification performance. However, if we factor out this component of the loss, a linear relationship emerges between training and test losses. Under this transformation, classical generalization bounds are surprisingly tight: the empirical/training loss is very close to the expected/test loss. Furthermore, the empirical relation between classification error and normalized cross-entropy loss seems to be approximately monotonic.
Ride-hailing services have expanded the role of shared mobility in passenger transportation systems, creating new markets and creative planning solutions for major urban centers. In this paper, we consider their use for last-mile passenger transportation in coordination with a mass transit service to provide a seamless multimodal transportation experience for the user. A system that provides passengers with predictable information on travel and waiting times in their commutes is immensely valuable. We envision that the passengers will inform the system in advance of their desired travel and arrival windows so that the system can jointly optimize the schedules of passengers. The problem we study balances minimizing travel time and the number of trips taken by the last-mile vehicles, so that long-term planning, maintenance, and environmental impact considerations can be taken into account. We focus our attention on the problem where the last-mile service aggregates passengers by destination. We show that this problem is NP-hard, and propose a decision diagram-based branch-and-price decomposition model that can solve instances of real-world size (10,000 passengers, 50 last-mile destinations, 600 last-mile vehicles) in time (~1 minute) that is orders-of-magnitude faster than other methods appearing in the literature. Our experiments also indicate that single-destination last-mile service provides high-quality solutions to more general settings.
We initiate the theoretical study of directory reconciliation, a generalization of document exchange, in which Alice and Bob each have different versions of a set of documents that they wish to synchronize. This problem is designed to capture the setting of synchronizing different versions of file directories, while allowing for changes of file names and locations without significant expense. We present protocols for efficiently solving directory reconciliation based on a reduction to document exchange under edit distance with block moves, as well as protocols combining techniques for reconciling sets of sets with document exchange protocols. Along the way, we develop a new protocol for document exchange under edit distance with block moves inspired by noisy binary search in graphs, which uses only $O(k \log n)$ bits of communication at the expense of $O(k \log n)$ rounds of communication.
Studies of affect labeling, i.e. putting your feelings into words, indicate that it can attenuate positive and negative emotions. Here we track the evolution of individual emotions for tens of thousands of Twitter users by analyzing the emotional content of their tweets before and after they explicitly report having a strong emotion. Our results reveal how emotions and their expression evolve at a temporal resolution of one minute. While the expression of positive emotions is preceded by a short but steep increase in positive valence and followed by a short decay to normal levels, negative emotions build up more slowly and are followed by a sharp reversal to previous levels, matching earlier findings on the attenuating effects of affect labeling. We estimate that positive and negative emotions last approximately 1.25 and 1.5 hours, respectively, from onset to evanescence. A separate analysis for male and female subjects is suggestive of possible gender-specific differences in emotional dynamics.