Improving Variational Auto-Encoders using convex combination linear Inverse Autoregressive Flow

In this paper, we propose a new volume-preserving flow and show that it performs similarly to the linear general normalizing flow. The idea is to enrich a linear Inverse Autoregressive Flow by introducing multiple lower-triangular matrices with ones on the diagonal and combining them using a convex combination. In experiments on MNIST and Histopathology data, we show that the proposed approach outperforms other volume-preserving flows and is competitive with the current state-of-the-art linear normalizing flow.
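
The volume-preserving property described above can be checked with a small sketch. A convex combination of lower-triangular matrices with ones on the diagonal is again lower-triangular with ones on the diagonal, so its determinant is 1. The function names, the fixed weights, and the random matrix entries below are assumptions made for illustration; how the weights and matrices are produced in the actual model is not specified here.

```python
import random


def unit_lower_triangular(dim, rng):
    """Random lower-triangular matrix with ones on the diagonal."""
    return [
        [1.0 if i == j else (rng.gauss(0.0, 1.0) if j < i else 0.0)
         for j in range(dim)]
        for i in range(dim)
    ]


def convex_combination(matrices, weights):
    """Weighted sum of matrices; weights must be non-negative and sum to 1."""
    assert all(w >= 0.0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    dim = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices))
         for j in range(dim)]
        for i in range(dim)
    ]


def apply_linear_flow(matrix, z):
    """One linear flow step: z_new = matrix @ z."""
    return [sum(a * b for a, b in zip(row, z)) for row in matrix]


def triangular_determinant(matrix):
    """The determinant of a triangular matrix is the product of its diagonal."""
    det = 1.0
    for i in range(len(matrix)):
        det *= matrix[i][i]
    return det


rng = random.Random(0)
components = [unit_lower_triangular(4, rng) for _ in range(3)]
weights = [0.2, 0.3, 0.5]  # hypothetical fixed weights, for illustration only
combined = convex_combination(components, weights)
z_new = apply_linear_flow(combined, [1.0, -0.5, 0.25, 2.0])
```

Because the determinant of `combined` stays at 1, the log-determinant term of the flow vanishes, which is what makes the transformation volume-preserving.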

Active Learning for Structured Prediction from Partially Labeled Data

We propose a general purpose active learning algorithm for structured prediction, gathering labeled data for training a model that outputs a set of related labels for an image or video. Active learning starts with a limited initial training set, then iterates querying a user for labels on unlabeled data and retraining the model. We propose a novel algorithm for selecting data for labeling, choosing examples to maximize expected information gain based on belief propagation inference. This is a general purpose method and can be applied to a variety of tasks or models. As a specific example we demonstrate this framework for learning to recognize human actions and group activities in video sequences. Experiments show that our proposed algorithm outperforms previous active learning methods and can achieve accuracy comparable to fully supervised methods while utilizing significantly less labeled data.

Creating Virtual Universes Using Generative Adversarial Networks

Inferring model parameters from experimental data is a grand challenge in many sciences, including cosmology. This often relies critically on high fidelity numerical simulations, which are prohibitively computationally expensive. The application of deep learning techniques to generative modeling is renewing interest in using high dimensional density estimators as computationally inexpensive emulators of fully-fledged simulations. These generative models have the potential to make a dramatic shift in the field of scientific simulations, but for that shift to happen we need to study the performance of such generators in the precision regime needed for science applications. To this end, in this letter we apply Generative Adversarial Networks to the problem of generating cosmological weak lensing convergence maps. We show that our generator network produces maps that are described, with high statistical confidence, by the same summary statistics as the fully simulated maps.

ShiftCNN: Generalized Low-Precision Architecture for Inference of Convolutional Neural Networks

In this paper we introduce ShiftCNN, a generalized low-precision architecture for inference of multiplierless convolutional neural networks (CNNs). ShiftCNN is based on a power-of-two weight representation and, as a result, performs only shift and addition operations. Furthermore, ShiftCNN substantially reduces the computational cost of convolutional layers by precomputing convolution terms. Such an optimization can be applied to any CNN architecture with a relatively small codebook of weights and makes it possible to reduce the number of product operations by at least two orders of magnitude. The proposed architecture targets custom inference accelerators and can be realized on FPGAs or ASICs. Extensive evaluation on ImageNet shows that state-of-the-art CNNs can be converted into ShiftCNN without retraining, with less than a 1% drop in accuracy, when the proposed quantization algorithm is employed. RTL simulations targeting modern FPGAs show that the power consumption of convolutional layers is reduced by a factor of 4 compared to conventional 8-bit fixed-point architectures.
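
To make the power-of-two idea concrete, here is a minimal sketch of how a weight rounded to a signed power of two turns a multiplication into a bit shift. It is not the quantization algorithm or the precomputation scheme from the abstract: the rounding rule, the exponent range, and all function names are assumptions chosen for illustration, and activations are taken to be plain integers.

```python
import math


def quantize_power_of_two(weight, min_exp=-7, max_exp=0):
    """Round a weight to the nearest signed power of two (in the log domain).

    Returns (sign, exponent); the quantized weight is sign * 2**exponent.
    The exponent range is a hypothetical choice for this sketch.
    """
    if weight == 0.0:
        return 0, 0
    sign = 1 if weight > 0 else -1
    exponent = round(math.log2(abs(weight)))
    exponent = max(min_exp, min(max_exp, exponent))
    return sign, exponent


def shift_multiply(activation, sign, exponent):
    """Multiply an integer activation by sign * 2**exponent using only shifts."""
    if sign == 0:
        return 0
    if exponent >= 0:
        shifted = activation << exponent
    else:
        shifted = activation >> -exponent
    return sign * shifted


def shift_dot(activations, weights):
    """Dot product that uses shifts and additions instead of multiplications."""
    total = 0
    for activation, weight in zip(activations, weights):
        sign, exponent = quantize_power_of_two(weight)
        total += shift_multiply(activation, sign, exponent)
    return total


result = shift_dot([64, 32, 8], [0.26, -0.5, 1.0])
```

With these inputs the weights quantize to 2^-2, -2^-1 and 2^0, so every product in the dot product is computed with a shift and a sign flip.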

A Convex Framework for Fair Regression

We introduce a flexible family of fairness regularizers for (linear and logistic) regression problems. These regularizers all enjoy convexity, permitting fast optimization, and they span the range from notions of group fairness to strong individual fairness. By varying the weight on the fairness regularizer, we can compute the efficient frontier of the accuracy-fairness trade-off on any given dataset, and we measure the severity of this trade-off via a numerical quantity we call the Price of Fairness (PoF). The centerpiece of our results is an extensive comparative study of the PoF across six different datasets in which fairness is a primary consideration.

Outlier Detection Using Distributionally Robust Optimization under the Wasserstein Metric

We present a Distributionally Robust Optimization (DRO) approach to outlier detection in a linear regression setting, where the closeness of probability distributions is measured using the Wasserstein metric. Training samples contaminated with outliers skew the regression plane computed by least squares and thus impede outlier detection. Classical approaches, such as robust regression, remedy this problem by downweighting the contribution of atypical data points. In contrast, our Wasserstein DRO approach hedges against a family of distributions that are close to the empirical distribution. We show that the resulting formulation encompasses a class of models, which include the regularized Least Absolute Deviation (LAD) as a special case. We provide new insights into the regularization term and give guidance on the selection of the regularization coefficient from the standpoint of a confidence region. We establish two types of performance guarantees for the solution to our formulation under mild conditions. One is related to its out-of-sample behavior, and the other concerns the discrepancy between the estimated and true regression planes. Extensive numerical results demonstrate the superiority of our approach to both robust regression and the regularized LAD in terms of estimation accuracy and outlier detection rates.

Generalized Value Iteration Networks: Life Beyond Lattices

In this paper, we introduce a generalized value iteration network (GVIN), which is an end-to-end neural network planning module. GVIN emulates the value iteration algorithm by using a novel graph convolution operator, which enables GVIN to learn and plan on irregular spatial graphs. We propose three novel differentiable kernels as graph convolution operators and show that the embedding based kernel achieves the best performance. We further propose episodic Q-learning, an improvement upon traditional n-step Q-learning that stabilizes training for networks that contain a planning module. Lastly, we evaluate GVIN on planning problems in 2D mazes, irregular graphs, and real-world street networks, showing that GVIN generalizes well for both arbitrary graphs and unseen graphs of larger scale and outperforms a naive generalization of VIN (discretizing a spatial graph into a 2D image).

Forward Thinking: Building and Training Neural Networks One Layer at a Time

We present a general framework for training deep neural networks without backpropagation. This substantially decreases training time and also allows for construction of deep networks with many sorts of learners, including networks whose layers are defined by functions that are not easily differentiated, like decision trees. The main idea is that layers can be trained one at a time, and once they are trained, the input data are mapped forward through the layer to create a new learning problem. The process is repeated, transforming the data through multiple layers, one at a time, rendering a new data set, which is expected to be better behaved, and on which a final output layer can achieve good performance. We call this forward thinking and demonstrate a proof of concept by achieving state-of-the-art accuracy on the MNIST dataset for convolutional neural networks. We also provide a general mathematical formulation of forward thinking that allows for other types of deep learning problems to be considered.

Learning Deep Representations for Scene Labeling with Guided Supervision

Scene labeling is a challenging classification problem where each input image requires a pixel-level prediction map. Recently, deep-learning-based methods have shown their effectiveness in solving this problem. However, we argue that the large intra-class variation provides ambiguous training information and hinders the deep models’ ability to learn more discriminative deep feature representations. Unlike existing methods that mainly utilize semantic context for regularizing or smoothing the prediction map, we design novel supervisions from semantic context for learning better deep feature representations. Two types of semantic context, scene names of images and label map statistics of image patches, are exploited to create label hierarchies between the original classes and newly created subclasses as the learning supervisions. Such subclasses show lower intra-class variation, and help CNNs detect more meaningful visual patterns and learn more effective deep features. Novel training strategies and a network structure that take advantage of such label hierarchies are introduced. Our proposed method is evaluated extensively on four popular datasets, Stanford Background (8 classes), SIFTFlow (33 classes), Barcelona (170 classes) and LM+Sun (232 classes), with three different network structures, and shows state-of-the-art performance. The experiments show that our proposed method makes deep models learn more discriminative feature representations without increasing model size or complexity.

Context encoders as a simple but powerful extension of word2vec

With a simple architecture and the ability to learn meaningful word embeddings efficiently from texts containing billions of words, word2vec remains one of the most popular neural language models used today. However, as only a single embedding is learned for every word in the vocabulary, the model fails to optimally represent words with multiple meanings. Additionally, it is not possible to create embeddings for new (out-of-vocabulary) words on the spot. Based on an intuitive interpretation of the continuous bag-of-words (CBOW) word2vec model’s negative sampling training objective in terms of predicting context based similarities, we motivate an extension of the model we call context encoders (ConEc). By multiplying the matrix of trained word2vec embeddings with a word’s average context vector, out-of-vocabulary (OOV) embeddings and representations for a word with multiple meanings can be created based on the word’s local contexts. The benefits of this approach are illustrated by using these word embeddings as features in the CoNLL 2003 named entity recognition (NER) task.
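
The matrix-vector step mentioned above can be sketched as follows. This is an illustrative reading rather than the authors' procedure: the window size, the normalization by number of occurrences, the toy vocabulary, and the two-dimensional embeddings are all assumptions, and the function names are hypothetical.

```python
def average_context_vector(target, sentences, vocab, window=2):
    """Average bag-of-words vector of the words surrounding `target`."""
    index = {word: i for i, word in enumerate(vocab)}
    totals = [0.0] * len(vocab)
    occurrences = 0
    for tokens in sentences:
        for pos, token in enumerate(tokens):
            if token != target:
                continue
            occurrences += 1
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                ctx = tokens[ctx_pos]
                if ctx_pos != pos and ctx in index:
                    totals[index[ctx]] += 1.0
    if occurrences == 0:
        return totals
    return [t / occurrences for t in totals]


def context_encoder_embedding(context_vector, embeddings):
    """Multiply the embedding matrix (one row per vocab word) by the context vector."""
    dim = len(embeddings[0])
    return [
        sum(context_vector[w] * embeddings[w][d] for w in range(len(embeddings)))
        for d in range(dim)
    ]


vocab = ["river", "water", "money"]
embeddings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]  # toy stand-in for trained vectors
sentences = [["the", "river", "creek", "water"], ["creek", "water", "river"]]

# "creek" is out-of-vocabulary, yet it receives an embedding from its contexts.
context = average_context_vector("creek", sentences, vocab)
creek_embedding = context_encoder_embedding(context, embeddings)
```

Restricting `sentences` to the occurrences of one sense of a word would, under the same assumptions, give a sense-specific embedding from that word's local contexts.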

Unlocking the Potential of Simulators: Design with RL in Mind

Using Reinforcement Learning (RL) in simulation to construct policies useful in real life is challenging. This is often attributed to the sequential decision making aspect: inaccuracies in simulation accumulate over multiple steps, hence the simulated trajectories diverge from what would happen in reality. In our work we show the need to consider another important aspect: the mismatch in simulating control. We bring attention to the need for modeling control as well as dynamics, since oversimplifying assumptions about applying actions of RL policies could make the policies fail on real-world systems. We design a simulator for solving a pivoting task (of interest in Robotics) and demonstrate that even a simple simulator designed with RL in mind outperforms high-fidelity simulators when it comes to learning a policy that is to be deployed on a real robotic system. We show that friction, a phenomenon that is hard to model, could be exploited successfully, even when RL is performed using a simulator with a simple dynamics and noise model. Hence, we demonstrate that as long as the main sources of uncertainty are identified, it could be possible to learn policies applicable to real systems even using a simple simulator. RL-compatible simulators could open up possibilities for applying a wide range of RL algorithms in various fields. This is important, since currently data sparsity in fields like healthcare and education frequently forces researchers and engineers to consider only sample-efficient RL approaches. Successful simulator-aided RL could increase the flexibility of experimenting with RL algorithms and help apply RL policies to real-world settings in fields where data is scarce. We believe that lessons learned in Robotics could help other fields design RL-compatible simulators, so we summarize our experience and conclude with suggestions.

Self-Normalizing Neural Networks

Deep Learning has revolutionized vision via convolutional neural networks (CNNs) and natural language processing via recurrent neural networks (RNNs). However, success stories of Deep Learning with standard feed-forward neural networks (FNNs) are rare. FNNs that perform well are typically shallow and therefore cannot exploit many levels of abstract representations. We introduce self-normalizing neural networks (SNNs) to enable high-level abstract representations. While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation functions of SNNs are ‘scaled exponential linear units’ (SELUs), which induce self-normalizing properties. Using the Banach fixed-point theorem, we prove that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance, even in the presence of noise and perturbations. This convergence property of SNNs makes it possible to (1) train deep networks with many layers, (2) employ strong regularization, and (3) make learning highly robust. Furthermore, for activations not close to unit variance, we prove an upper and lower bound on the variance; thus, vanishing and exploding gradients are impossible. We compared SNNs on (a) 121 tasks from the UCI machine learning repository, (b) drug discovery benchmarks, and (c) astronomy tasks with standard FNNs and other machine learning methods such as random forests and support vector machines. SNNs significantly outperformed all competing FNN methods on the 121 UCI tasks, outperformed all competing methods on the Tox21 dataset, and set a new record on an astronomy data set. The winning SNN architectures are often very deep. Implementations are available at:
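
Independently of the authors' released code, the zero-mean, unit-variance fixed point can be checked numerically. The sketch below applies the SELU activation, with the commonly published constants, to samples from a standard normal distribution and measures the mean and variance of the output. The sample size, seed, and helper names are assumptions for illustration, and the check covers a single activation rather than a full network.

```python
import math
import random

# Commonly published SELU constants (alpha and the scale, often written lambda).
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit."""
    if x > 0.0:
        return SCALE * x
    return SCALE * ALPHA * (math.exp(x) - 1.0)


def mean_and_variance(values):
    """Sample mean and (population) variance of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
mean, variance = mean_and_variance([selu(x) for x in pre_activations])
```

Up to sampling noise, `mean` should be close to 0 and `variance` close to 1, which is the fixed point the abstract refers to.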

Clustering with t-SNE, provably

t-distributed Stochastic Neighborhood Embedding (t-SNE), a clustering and visualization method proposed by van der Maaten & Hinton in 2008, has rapidly become a standard tool in a number of natural sciences. Despite its overwhelming success, there is a distinct lack of mathematical foundations and the inner workings of the algorithm are not well understood. The purpose of this paper is to prove that t-SNE is able to recover well-separated clusters; more precisely, we prove that t-SNE in the ‘early exaggeration’ phase, an optimization technique proposed by van der Maaten & Hinton (2008) and van der Maaten (2014), can be rigorously analyzed. As a byproduct, the proof suggests novel ways for setting the exaggeration parameter \alpha and step size h. Numerical examples illustrate the effectiveness of these rules: in particular, the quality of embedding of topological structures (e.g. the swiss roll) improves. We also discuss a connection to spectral clustering methods.

Reading Twice for Natural Language Understanding

Despite the recent success of neural networks in tasks involving natural language understanding (NLU) there has only been limited progress in some of the fundamental challenges of NLU, such as the disambiguation of the meaning and function of words in context. This work approaches this problem by incorporating contextual information into word representations prior to processing the task at hand. To this end we propose a general-purpose reading architecture that is employed prior to a task-specific NLU model. It is responsible for refining context-agnostic word representations with contextual information and lends itself to the introduction of additional, context-relevant information from external knowledge sources. We demonstrate that previously non-competitive models benefit dramatically from employing contextual representations, closing the gap between general-purpose reading architectures and the state-of-the-art performance obtained with fine-tuned, task-specific architectures. Apart from our empirical results we present a comprehensive analysis of the computed representations which gives insights into the kind of information added during the refinement process.

Generative Autotransporters

In this paper, we aim to introduce the classic Optimal Transport theory to enhance deep generative probabilistic modeling. For this purpose, we design a Generative Autotransporter (GAT) model with explicit distribution optimal transport. In particular, the GAT model has a deep distribution transporter that transfers the target distribution to a specific prior probability distribution, which enables a regular decoder to generate target samples from input data that follows the transported prior distribution. With such a design, the GAT model can be stably trained to generate novel data by merely using a very simple l_1 reconstruction loss function with a generalized manifold-based Adam training algorithm. The experiments on two standard benchmarks demonstrate its strong generation ability.

Nuclear Discrepancy for Active Learning

Active learning algorithms propose which unlabeled objects should be queried for their labels to improve a predictive model the most. We study active learners that minimize generalization bounds and uncover relationships between these bounds that lead to an improved approach to active learning. In particular we show the relation between the bound of the state-of-the-art Maximum Mean Discrepancy (MMD) active learner, the bound of the Discrepancy, and a new and looser bound that we refer to as the Nuclear Discrepancy bound. We motivate this bound by a probabilistic argument: we show it considers situations which are more likely to occur. Our experiments indicate that active learning using the tightest Discrepancy bound performs the worst in terms of the squared loss. Overall, our proposed loosest Nuclear Discrepancy generalization bound performs the best. We confirm our probabilistic argument empirically: the other bounds focus on more pessimistic scenarios that are rarer in practice. We conclude that tightness of bounds is not always of main importance and that active learning methods should concentrate on realistic scenarios in order to improve performance.

Principled Detection of Out-of-Distribution Examples in Neural Networks

We consider the problem of detecting out-of-distribution examples in neural networks. We propose ODIN, a simple and effective out-of-distribution detector for neural networks, that does not require any change to a pre-trained model. Our method is based on the observation that using temperature scaling and adding small perturbations to the input can separate the softmax score distributions of in- and out-of-distribution samples, allowing for more effective detection. We show in a series of experiments that our approach is compatible with diverse network architectures and datasets. It consistently outperforms the baseline approach[1] by a large margin, establishing a new state-of-the-art performance on this task. For example, ODIN reduces the false positive rate from the baseline 34.7% to 4.3% on the DenseNet (applied to CIFAR-10) when the true positive rate is 95%. We theoretically analyze the method and prove that performance improvement is guaranteed under mild conditions on the image distributions.
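
The two ingredients named above, temperature scaling and a small input perturbation, can be sketched on a toy linear classifier, for which the input gradient has a closed form. Everything specific here is an assumption for illustration: the linear model, the temperature and epsilon values, the function names, and the choice of a signed-gradient step that raises the predicted class's softmax score.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(weights, bias, x):
    """Logits of a toy linear classifier: W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sign(value):
    """Sign of a number as -1, 0, or 1."""
    return (value > 0) - (value < 0)


def odin_score(weights, bias, x, temperature=1000.0, epsilon=0.0):
    """Max temperature-scaled softmax probability after perturbing the input."""
    probs = softmax(linear_logits(weights, bias, x), temperature)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    # Gradient of log S_predicted(x; T) with respect to x for linear logits:
    # (W[predicted] - sum_j S_j * W[j]) / T
    grad = [
        (weights[predicted][d]
         - sum(p * row[d] for p, row in zip(probs, weights))) / temperature
        for d in range(len(x))
    ]
    # Step in the direction that increases the predicted class's score.
    perturbed = [xi + epsilon * sign(g) for xi, g in zip(x, grad)]
    return max(softmax(linear_logits(weights, bias, perturbed), temperature))


W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
x = [2.0, 1.0]
plain_score = odin_score(W, b, x, epsilon=0.0)
perturbed_score = odin_score(W, b, x, epsilon=0.1)
```

A detector built on this score would flag an input as in-distribution when the score exceeds a threshold chosen on held-out data; with a real network the gradient would come from automatic differentiation rather than the closed form used here.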

Delay Optimal Scheduling for Chunked Random Linear Network Coding Broadcast
CoMaL Tracking: Tracking Points at the Object Boundaries
Low-shot learning with large-scale diffusion
K-polynomials of type A quiver orbit closures and lacing diagrams
Injective chromatic number of outerplanar graphs
Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Network
Testing the simplifying assumption in high-dimensional vine copulas
Backward Pilot Strategy in Constrained Sampling Problems
Sparse Wavelet Estimation in Quantile Regression with Multiple Functional Predictors
New Factor Pairs for Factorizations of Lambert Series Generating Functions
Galerkin approximations for the optimal control of nonlinear delay differential equations
On the Robustness of Deep Convolutional Neural Networks for Music Classification
A New Use of Douglas-Rachford Splitting and ADMM for Identifying Infeasible, Unbounded, and Pathological Conic Programs
Fast Black-box Variational Inference through Stochastic Trust-Region Optimization
Training Quantized Nets: A Deeper Understanding
Microbial Composition Estimation from Sparse Count Data
On learning the structure of Bayesian Networks and submodular function maximization
Time continuity of weak-predictable random field solutions
Weak Moment of a Class of Stochastic Heat Equation with Martingale-valued Harmonic Function
On Non-existence of Global Weak-predictable-random-field Solutions to a Class of SHEs
A sharp multiplier inequality with applications to heavy-tailed regression problems
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
Leveraging deep neural networks to capture psychological representations
Estimating Mixture Entropy with Pairwise Distances
Berry-Esséen bounds for parameter estimation of general Gaussian processes
Seamless Integration and Coordination of Cognitive Skills in Humanoid Robots: A Deep Learning Approach
C-arm Tomographic Imaging Technique for Nephrolithiasis and Detection of Kidney Stones
Enhancement of Network Synchronizability via Two Oscillatory System
Content-Based Table Retrieval for Web Queries
General model discovery using statistical evaluation maps
Image Captioning with Object Detection and Localization
Automatic tracking of vessel-like structures from a single starting point
Precise estimates for biorthogonal families under asymptotic gap conditions
A note on degree distribution in plane-oriented recursive trees
A uniform approach to soliton cellular automata using rigged configurations
Predictive Coding-based Deep Dynamic Neural Network for Visuomotor Learning
Luck is Hard to Beat: The Difficulty of Sports Prediction
Heat trace asymptotics for equiregular sub-Riemannian manifolds
Scaling Exponent and Moderate Deviations Asymptotics of Polar Codes for the AWGN Channel
Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization
Regular Boardgames
Distribution-Free One-Pass Learning
Asynchronous Pattern Formation: the effects of a rigorous approach
Minor stars in plane graphs with minimum degree five
Where is my forearm? Clustering of body parts from simultaneous tactile and linguistic input using sequential mapping
Consistency Results for Stationary Autoregressive Processes with Constrained Coefficients
Physical Layer Security of Generalised Pre-coded Spatial Modulation with Antenna Scrambling
The Generalized Cross Validation Filter
Reciprocal of the First hitting time of the boundary of dihedral wedges by a radial Dunkl process
Quantifying the recency of HIV infection using multiple longitudinal biomarkers
Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017)
Responsible Autonomy
Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes
Monitoring and predicting influenza epidemics from routinely collected severe case data
Convective Instability and Boundary Driven Oscillations in a Reaction-Diffusion-Advection Model
Clique Gossiping
The Analytical Expressions for a Finite-Size 2D Ising Model
OrbTouch: Recognizing Human Touch in Deformable Interfaces with Deep Neural Networks
The Algorithmic Inflection of Russian and Generation of Grammatically Correct Text
Surprise Search for Evolutionary Divergence
Modulation equation and SPDEs on unbounded domains
Pain-Free Random Differential Privacy with Sensitivity Sampling
Toeplitz minors for Szegö and Fisher-Hartwig symbols
ToxTrac: a fast and robust software for tracking organisms
A Vectorization for Nonconvex Set-valued Optimization
DSOS and SDSOS Optimization: More Tractable Alternatives to Sum of Squares and Semidefinite Optimization
Inference For High-Dimensional Split-Plot-Designs: A Unified Approach for Small to Large Numbers of Factor Levels
Explosion and distances in scale-free percolation
Resource Allocation for Wireless Networks: A Distributed Optimization Approach
Chambolle-Pock and Tseng’s methods: relationship and extension to the bilevel optimization
The Chain Group of a Forest
Spatio-Temporal Backpropagation for Training High-performance Spiking Neural Networks
Strong Forms of Stability from Flag Algebra Calculations
Decoupling ‘when to update’ from ‘how to update’
Stochastic LU factorizations, Darboux transformations and urn models
Mobile vs. point guards
Maximum-entropy from the probability calculus: exchangeability, sufficiency
Evidence synthesis for stochastic epidemic models
Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs
Kinetic energy choice in Hamiltonian/hybrid Monte Carlo
Derivation and Analysis of the Primal-Dual Method of Multipliers Based on Monotone Operator Theory
Delocalized Glassy Dynamics and Many Body Localization
The spectral determination of the multicone graphs Kw+P
The Laplacian spectrum of power graphs of cyclic and dicyclic groups
A New Approach to Hierarchical Data Analysis: Targeted Maximum Likelihood Estimation of Cluster-Based Effects Under Interference
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Impact of Detour-Aware Policies on Maximizing Profit in Ridesharing
Learning Local Receptive Fields and their Weight Sharing Scheme on Graphs
Topology of DNA: a honeycomb stable structure under salt effect
What Does a Belief Function Believe In ?
The True Cost of Stochastic Gradient Langevin Dynamics
Structured Light Phase Measuring Profilometry Pattern Design for Binary Spatial Light Modulators