Stochastic Gradient MCMC Methods for Hidden Markov Models

Stochastic gradient MCMC (SG-MCMC) algorithms have proven useful in scaling Bayesian inference to large datasets under an assumption of i.i.d data. We instead develop an SG-MCMC algorithm to learn the parameters of hidden Markov models (HMMs) for time-dependent data. There are two challenges to applying SG-MCMC in this setting: The latent discrete states, and needing to break dependencies when considering minibatches. We consider a marginal likelihood representation of the HMM and propose an algorithm that harnesses the inherent memory decay of the process. We demonstrate the effectiveness of our algorithm on synthetic experiments and an ion channel recording data, with runtimes significantly outperforming batch MCMC.

Information Potential Auto-Encoders

In this paper, we suggest a framework to make use of mutual information as a regularization criterion to train Auto-Encoders (AEs). In the proposed framework, AEs are regularized by minimization of the mutual information between input and encoding variables of AEs during the training phase. In order to estimate the entropy of the encoding variables and the mutual information, we propose a non-parametric method. We also give an information theoretic view of Variational AEs (VAEs), which suggests that VAEs can be considered as parametric methods that estimate entropy. Experimental results show that the proposed non-parametric models have more degree of freedom in terms of representation learning of features drawn from complex distributions such as Mixture of Gaussians, compared to methods which estimate entropy using parametric approaches, such as Variational AEs.

Proximal Backpropagation

We offer a generalized point of view on the backpropagation algorithm, currently the most common technique to train neural networks via stochastic gradient descent and variants thereof. Specifically, we show that backpropagation of a prediction error is equivalent to sequential gradient descent steps on a quadratic penalty energy. This energy comprises the network activations as variables of the optimization and couples them to the network parameters. Based on this viewpoint, we illustrate the limitations on step sizes when optimizing a nested function with gradient descent. Rather than taking explicit gradient steps, where step size restrictions are an impediment for optimization, we propose proximal backpropagation (ProxProp) as a novel algorithm that takes implicit gradient steps to update the network parameters. We experimentally demonstrate that our algorithm is robust in the sense that it decreases the objective function for a wide range of parameter values. In a systematic quantitative analysis, we compare to related approaches on a supervised visual learning task (CIFAR-10) for fully connected as well as convolutional neural networks and for an unsupervised autoencoder (USPS dataset). We demonstrate that ProxProp leads to a significant speed up in training performance.

Accelerated Extra-Gradient Descent: A Novel Accelerated First-Order Method

We provide a novel accelerated first-order method that achieves the asymptotically optimal convergence rate for smooth functions in the first-order oracle model. To this day, Nesterov’s Accelerated Gradient Descent (AGD) and variations thereof were the only methods achieving acceleration in this standard blackbox model. In contrast, our algorithm is significantly different from AGD, as it relies on a predictor-corrector approach similar to that used by Mirror-Prox and Extra-Gradient Descent in the solution of convex-concave saddle point problems. For this reason, we dub our algorithm Accelerated Extra-Gradient Descent (AXGD). Its construction is motivated by the discretization of an accelerated continuous-time dynamics using the classical method of implicit Euler discretization. Our analysis explicitly shows the effects of discretization through a conceptually novel primal-dual viewpoint. Finally, we present experiments showing that our algorithm matches the performance of Nesterov’s method, while appearing more robust to noise in some cases.

A Modularized Efficient Framework for Non-Markov Time Series Estimation

We present a compartmentalized approach to finding the maximum a-posteriori (MAP) estimate of a latent time series that obeys a dynamic stochastic model and is observed through noisy measurements. We specifically consider modern signal processing problems with non-Markov signal dynamics (e.g. group sparsity) and/or non-Gaussian measurement models (e.g. point process observation models used in neuroscience). Through the use of auxiliary variables in the MAP estimation problem, we show that a consensus formulation of the alternating direction method of multipliers (ADMM) enables iteratively computing separate estimates based on the likelihood and prior and subsequently ‘averaging’ them in an appropriate sense using a Kalman smoother. As such, this can be applied to a broad class of problem settings and only requires modular adjustments when interchanging various aspects of the statistical model. Under broad log-concavity assumptions, we show that the separate estimation problems are convex optimization problems and that the iterative algorithm converges to the MAP estimate. As such, this framework can capture non-Markov latent time series models and non-Gaussian measurement models. We provide example applications involving (i) group-sparsity priors, within the context of electrophysiologic specrotemporal estimation, and (ii) non-Gaussian measurement models, within the context of dynamic analyses of learning with neural spiking and behavioral observations.

Adaptive Feature Selection: Computationally Efficient Online Sparse Linear Regression under RIP

Online sparse linear regression is an online problem where an algorithm repeatedly chooses a subset of coordinates to observe in an adversarially chosen feature vector, makes a real-valued prediction, receives the true label, and incurs the squared loss. The goal is to design an online learning algorithm with sublinear regret to the best sparse linear predictor in hindsight. Without any assumptions, this problem is known to be computationally intractable. In this paper, we make the assumption that data matrix satisfies restricted isometry property, and show that this assumption leads to computationally efficient algorithms with sublinear regret for two variants of the problem. In the first variant, the true label is generated according to a sparse linear model with additive Gaussian noise. In the second, the true label is chosen adversarially.

Reinforcement Learning under Model Mismatch

We study reinforcement learning under model misspecification, where we do not have access to the true environment but only to a reasonably close approximation to it. We address this problem by extending the framework of robust MDPs to the model-free Reinforcement Learning setting, where we do not have access to the model parameters, but can only sample states from it. We define robust versions of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy and approximate value function respectively. We scale up the robust algorithms to large MDPs via function approximation and prove convergence under two different settings. We prove convergence of robust approximate policy iteration and robust approximate value iteration for linear architectures (under mild assumptions). We also define a robust loss function, the mean squared robust projected Bellman error and give stochastic gradient descent algorithms that are guaranteed to converge to a local minimum.

Efficient Streaming Algorithms for Submodular Maximization with Multi-Knapsack Constraints

Submodular maximization (SM) has become a silver bullet for a broad class of applications such as influence maximization, data summarization, top-k representative queries, and recommendations. In this paper, we study the SM problem in data streams. Most existing algorithms for streaming SM only support the append-only model with cardinality constraints, which cannot meet the requirements of real-world problems considering either the data recency issues or more general d-knapsack constraints. Therefore, we first propose an append-only streaming algorithm {\sc KnapStream} for SM subject to a d-knapsack constraint (SMDK). Furthermore, we devise the {\sc KnapWindow} algorithm for SMDK over sliding windows to capture the recency constraints. Theoretically, the proposed algorithms have constant approximation ratios for a fixed number of knapsacks and sublinear complexities. We finally evaluate the efficiency and effectiveness of our algorithms in two real-world datasets. The results show that the proposed algorithms achieve two orders of magnitude speedups over the greedy baseline in the batch setting while preserving high quality solutions.

Stochastic Training of Neural Networks via Successive Convex Approximations

This paper proposes a new family of algorithms for training neural networks (NNs). These are based on recent developments in the field of non-convex optimization, going under the general name of successive convex approximation (SCA) techniques. The basic idea is to iteratively replace the original (non-convex, highly dimensional) learning problem with a sequence of (strongly convex) approximations, which are both accurate and simple to optimize. Differently from similar ideas (e.g., quasi-Newton algorithms), the approximations can be constructed using only first-order information of the neural network function, in a stochastic fashion, while exploiting the overall structure of the learning problem for a faster convergence. We discuss several use cases, based on different choices for the loss function (e.g., squared loss and cross-entropy loss), and for the regularization of the NN’s weights. We experiment on several medium-sized benchmark problems, and on a large-scale dataset involving simulated physical data. The results show how the algorithm outperforms state-of-the-art techniques, providing faster convergence to a better minimum. Additionally, we show how the algorithm can be easily parallelized over multiple computational units without hindering its performance. In particular, each computational unit can optimize a tailored surrogate function defined on a randomly assigned subset of the input variables, whose dimension can be selected depending entirely on the available computational power.

Average of Recentered Parallel MCMC for Big Data

In big data context, traditional MCMC methods, such as Metropolis-Hastings algorithms and hybrid Monte Carlo, scale poorly because of their need to evaluate the likelihood over the whole data set at each iteration. In order to resurrect MCMC methods, numerous approaches belonging to two categories: divide-and-conquer and subsampling, are proposed. In this article, we study the parallel MCMC and propose a new combination method in the divide-and-conquer framework. Compared with some parallel MCMC methods, such as consensus Monte Carlo, Weierstrass Sampler, instead of sampling from subposteriors, our method runs MCMC on rescaled subposteriors, but share the same computation cost in the parallel stage. We also give the mathematical justification of our method and show its performance in several models. Besides, even though our new methods is proposed in parametric framework, it can been applied to non-parametric cases without difficulty.

Sobolev Training for Neural Networks

At the heart of deep learning we aim to use neural networks as function approximators – training them to produce outputs from inputs in emulation of a ground truth function or data creation process. In many cases we only have access to input-output pairs from the ground truth, however it is becoming more common to have access to derivatives of the target output with respect to the input – for example when the ground truth function is itself a neural network such as in network compression or distillation. Generally these target derivatives are not computed, or are ignored. This paper introduces Sobolev Training for neural networks, which is a method for incorporating these target derivatives in addition the to target values while training. By optimising neural networks to not only approximate the function’s outputs but also the function’s derivatives we encode additional information about the target function within the parameters of the neural network. Thereby we can improve the quality of our predictors, as well as the data-efficiency and generalization capabilities of our learned function approximation. We provide theoretical justifications for such an approach as well as examples of empirical evidence on three distinct domains: regression on classical optimisation datasets, distilling policies of an agent playing Atari, and on large-scale applications of synthetic gradients. In all three domains the use of Sobolev Training, employing target derivatives in addition to target values, results in models with higher accuracy and stronger generalisation.

Estimation and Mitigation of Channel Non-Reciprocity in Massive MIMO
Simple and explicit bounds for multi-server queues with universal 1 / (1 – rho) scaling
Subjects classification from high-dimensional and small-sample size datasets using a strategy based on Clustering Variables around Latent Components (CLV) method
A distributed algorithm for average aggregative games with coupling constraints
Approximating Gains from Trade in Two-sided Markets via Simple Mechanisms
Stochastic Modeling and Simulation of Viral Evolution
Pseudo-deterministic Proofs
Bayesian analysis of accumulated damage models in lumber reliability
Differentially Private Learning of Undirected Graphical Models using Collective Graphical Models
Parabolic Catalan numbers count flagged Schur functions and their appearances as type A Demazure characters (key polynomials)
Toward Better Spatial Regression Inference Through Bayesian Spatial Filtering
Learning a visuomotor controller for real world robotic grasping using easily simulated depth images
Weak convergence of quantile and expectile processes under general assumptions
Feature Enhancement in Visually Impaired Images
Gene Hunting with Knockoffs for Hidden Markov Models
A Practical Method for Solving Contextual Bandit Problems Using Decision Trees
Bias and high-dimensional adjustment in observational studies of peer effects
A New Adaptive Video SRR Algorithm With Improved Robustness to Innovations
Adversarial Example Defenses: Ensembles of Weak Defenses are not Strong
Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations
Experimental Study of Compressed Stack Algorithms in Limited Memory Environments
An Exact Spectrum Formula for the Maximum Size of Finite Length Block Codes
Optimal performance of heterogeneous networks bases on the bit rate
Recent Progress of Face Image Synthesis
Effective Sequential Classifier Training for Multitemporal Remote Sensing Image Classification
Target Curricula via Selection of Minimum Feature Sets: a Case Study in Boolean Networks
The design of a streaming analytical workflow for processing massive transit feeds
Verifiable sufficient conditions for the error bound property of second-order cone complementarity problems
Sequential detection of low-rank changes using extreme eigenvalues
Revenue Optimization with Approximate Bid Predictions
On Functional Graphs of Quadratic Polynomials
Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation
On the spanning connectivity of tournaments
Improved Distributed Degree Splitting and Edge Coloring
Hybrid LISA Precoding for Multiuser Millimeter-Wave Communications
Holistic Planimetric prediction to Local Volumetric prediction for 3D Human Pose Estimation
The tail process revisited
An independence system as knot invariant
Generalized Bouncy Particle Sampler
Mapping higher-order network flows in memory and multilayer networks with Infomap
A Policy-Aware Model for Intelligent Transportation Systems
S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension
Towards Grounding Conceptual Spaces in Neural Representations
Asymptotic normality of high level-large time crossings of a Gaussian process
On the minimizers of energy forms with completely monotone kernel
A Lyapunov-type approach to convergence of the Douglas-Rachford algorithm
New bounds for the Probability of Causation in Mediation Analysis
Arabian Horse Identification Benchmark Dataset
Towards a theory of word order. Comment on ‘Dependency distance: a new perspective on syntactic patterns in natural language’ by Haitao Liu et al
Statistics of Reflection and Transmission in the Strong Overlap Regime of Fully Chaotic Reverberation Chambers
Locally Feller processes and martingale local problems. Part I: general results
Introducing Enhanced Transport to the Effective Hamiltonian Approach via Random Matrices with a Pair of Connecting States
Improved Set-based Symbolic Algorithms for Parity Games
Second-Order Kernel Online Convex Optimization with Adaptive Sketching
A Novel Construction of Low-Complexity MDS Codes with Optimal Repair Capability for Distributed Storage Systems
The uniform local asymptotics of the total net loss process in a new time-dependent bidimensional renewal model
A survey of cross-lingual embedding models
A remark on Hamilton cycles with few colors
A softening-healing law for self-healing quasi-brittle materials: analyzing with Strong Discontinuity embedded Approach
Dynamical and Topological Aspects of Consensus Formation in Complex Networks
Robust Submodular Maximization: A Non-Uniform Partitioning Approach
Convergence diagnostics for MCMC draws of a categorical variable
DSRIM: A Deep Neural Information Retrieval Model Enhanced by a Knowledge Resource Driven Representation of Documents
The complexity of the Multiple Pattern Matching Problem for random strings
Multi-objective Bandits: Optimizing the Generalized Gini Index
Schubert polynomials as lattice point enumerators of generalized permutahedra
Entropy inequalities for factors of IID
Online Strip Packing with Polynomial Migration
Plus-Minus Player Ratings for Soccer
A Generalized Girsanov’s Theorem
Long rainbow cycles and Hamiltonian cycles using many colors in properly edge-colored complete graphs
DOTE: Dual cOnvolutional filTer lEarning for Super-Resolution and Cross-Modality Synthesis in MRI
Stochastic Primal-Dual Hybrid Gradient Algorithm with Arbitrary Sampling and Imaging Application
Fast Bayesian inference of the multivariate Ornstein-Uhlenbeck process
Learning Deep ResNet Blocks Sequentially using Boosting Theory
Latent Variable Modeling for the Microbiome
A convolutional autoencoder approach for mining features in cellular electron cryo-tomograms and weakly supervised coarse segmentation
German in Flux: Detecting Metaphoric Change via Word Entropy
Research Topics Map: rtopmap
FreezeOut: Accelerate Training by Progressively Freezing Layers
On the $1/3-2/3$ Conjecture
Variational Approaches for Auto-Encoding Generative Adversarial Networks