Input Fast-Forwarding for Better Deep Learning

This paper introduces a new architectural framework, known as input fast-forwarding, that can enhance the performance of deep networks. The main idea is to incorporate a parallel path that sends representations of input values forward to deeper network layers. This scheme is substantially different from ‘deep supervision’ in which the loss layer is re-introduced to earlier layers. The parallel path provided by fast-forwarding enhances the training process in two ways. First, it enables the individual layers to combine higher-level information (from the standard processing path) with lower-level information (from the fast-forward path). Second, this new architecture reduces the problem of vanishing gradients substantially because the fast-forwarding path provides a shorter route for gradient backpropagation. In order to evaluate the utility of the proposed technique, a Fast-Forward Network (FFNet), with 20 convolutional layers along with parallel fast-forward paths, has been created and tested. The paper presents empirical results that demonstrate improved learning capacity of FFNet due to fast-forwarding, as compared to GoogLeNet (with deep supervision) and CaffeNet, which are 4x and 18x larger in size, respectively. All of the source code and deep learning models described in this paper will be made available to the entire research community

The Prediction Advantage: A Universally Meaningful Performance Measure for Classification and Regression

We introduce the Prediction Advantage (PA), a novel performance measure for prediction functions under any loss function (e.g., classification or regression). The PA is defined as the performance advantage relative to the Bayesian risk restricted to knowing only the distribution of the labels. We derive the PA for well-known loss functions, including 0/1 loss, cross-entropy loss, absolute loss, and squared loss. In the latter case, the PA is identical to the well-known R-squared measure, widely used in statistics. The use of the PA ensures meaningful quantification of prediction performance, which is not guaranteed, for example, when dealing with noisy imbalanced classification problems. We argue that among several known alternative performance measures, PA is the best (and only) quantity ensuring meaningfulness for all noise and imbalance levels.

Selective Classification for Deep Neural Networks

Selective classification techniques (also known as reject option) have not yet been considered in the context of deep neural networks (DNNs). These techniques can potentially significantly improve DNNs prediction performance by trading-off coverage. In this paper we propose a method to construct a selective classifier given a trained neural network. Our method allows a user to set a desired risk level. At test time, the classifier rejects instances as needed, to grant the desired risk (with high probability). Empirical results over CIFAR and ImageNet convincingly demonstrate the viability of our method, which opens up possibilities to operate DNNs in mission-critical applications. For example, using our method an unprecedented 2% error in top-5 ImageNet classification can be guaranteed with probability 99.9%, and almost 60% test coverage.

Interpreting Blackbox Models via Model Extraction

Interpretability has become an important issue as machine learning is increasingly used to inform consequential decisions. We propose an approach for interpreting a blackbox model by extracting a decision tree that approximates the model. Our model extraction algorithm avoids overfitting by leveraging blackbox model access to actively sample new training points. We prove that as the number of samples goes to infinity, the decision tree learned using our algorithm converges to the exact greedy decision tree. In our evaluation, we use our algorithm to interpret random forests and neural nets trained on several datasets from the UCI Machine Learning Repository, as well as control policies learned for three classical reinforcement learning problems. We show that our algorithm improves over a baseline based on CART on every problem instance. Furthermore, we show how an interpretation generated by our approach can be used to understand and debug these models.

An effective algorithm for hyperparameter optimization of neural networks

A major challenge in designing neural network (NN) systems is to determine the best structure and parameters for the network given the data for the machine learning problem at hand. Examples of parameters are the number of layers and nodes, the learning rates, and the dropout rates. Typically, these parameters are chosen based on heuristic rules and manually fine-tuned, which may be very time-consuming, because evaluating the performance of a single parametrization of the NN may require several hours. This paper addresses the problem of choosing appropriate parameters for the NN by formulating it as a box-constrained mathematical optimization problem, and applying a derivative-free optimization tool that automatically and effectively searches the parameter space. The optimization tool employs a radial basis function model of the objective function (the prediction accuracy of the NN) to accelerate the discovery of configurations yielding high accuracy. Candidate configurations explored by the algorithm are trained to a small number of epochs, and only the most promising candidates receive full training. The performance of the proposed methodology is assessed on benchmark sets and in the context of predicting drug-drug interactions, showing promising results. The optimization tool used in this paper is open-source.

Causal inference for social network data

We extend recent work by van der Laan (2014) on causal inference for causally connected units to more general social network settings. Our asymptotic results allow for dependence of each observation on a growing number of other units as sample size increases. We are not aware of any previous methods for inference about network members in observational settings that allow the number of ties per node to increase as the network grows. While previous methods have generally implicitly focused on one of two possible sources of dependence among social network observations, we allow for both dependence due to contagion, or transmission of information across network ties, and for dependence due to latent similarities among nodes sharing ties. We describe estimation and inference for causal effects that are specifically of interest in social network settings.

Grounded Recurrent Neural Networks

In this work, we present the Grounded Recurrent Neural Network (GRNN), a recurrent neural network architecture for multi-label prediction which explicitly ties labels to specific dimensions of the recurrent hidden state (we call this process ‘grounding’). The approach is particularly well-suited for extracting large numbers of concepts from text. We apply the new model to address an important problem in healthcare of understanding what medical concepts are discussed in clinical text. Using a publicly available dataset derived from Intensive Care Units, we learn to label a patient’s diagnoses and procedures from their discharge summary. Our evaluation shows a clear advantage to using our proposed architecture over a variety of strong baselines.

Towards Interrogating Discriminative Machine Learning Models

It is oftentimes impossible to understand how machine learning models reach a decision. While recent research has proposed various technical approaches to provide some clues as to how a learning model makes individual decisions, they cannot provide users with ability to inspect a learning model as a complete entity. In this work, we propose a new technical approach that augments a Bayesian regression mixture model with multiple elastic nets. Using the enhanced mixture model, we extract explanations for a target model through global approximation. To demonstrate the utility of our approach, we evaluate it on different learning models covering the tasks of text mining and image recognition. Our results indicate that the proposed approach not only outperforms the state-of-the-art technique in explaining individual decisions but also provides users with an ability to discover the vulnerabilities of a learning model.

MMD GAN: Towards Deeper Understanding of Moment Matching Network

Generative moment matching network (GMMN) is a deep generative model that differs from Generative Adversarial Network (GAN) by replacing the discriminator in GAN with a two-sample test based on kernel maximum mean discrepancy (MMD). Although some theoretical guarantees of MMD have been studied, the empirical performance of GMMN is still not as competitive as that of GAN on challenging and large benchmark datasets. The computational efficiency of GMMN is also less desirable in comparison with GAN, partially due to its requirement for a rather large batch size during the training. In this paper, we propose to improve both the model expressiveness of GMMN and its computational efficiency by introducing adversarial kernel learning techniques, as the replacement of a fixed Gaussian kernel in the original GMMN. The new approach combines the key ideas in both GMMN and GAN, hence we name it MMD-GAN. The new distance measure in MMD-GAN is a meaningful loss that enjoys the advantage of weak topology and can be optimized via gradient descent with relatively small batch sizes. In our evaluation on multiple benchmark datasets, including MNIST, CIFAR- 10, CelebA and LSUN, the performance of MMD-GAN significantly outperforms GMMN, and is competitive with other representative GAN works.

Nonparametric Preference Completion

We consider the task of collaborative preference completion: given a pool of items, a pool of users and a partially observed item-user rating matrix, the goal is to recover the personalized ranking of each user over all of the items. Our approach is nonparametric: we assume that each item i and each user u have unobserved features x_i and y_u, and that the associated rating is given by g_u(f(x_i,y_u)) where f is Lipschitz and g_u is a monotonic transformation that depends on the user. We propose a k-nearest neighbors-like algorithm and prove that it is consistent. To the best of our knowledge, this is the first consistency result for the collaborative preference completion problem in a nonparametric setting. Finally, we conduct experiments on the Netflix and Movielens datasets that suggest that our algorithm has some advantages over existing neighborhood-based methods and that its performance is comparable to some state-of-the art matrix factorization methods.

Deep Rotation Equivariant Network

Recently, learning equivariant representations has attracted considerable research attention. Dieleman et al. introduce four operations which can be inserted to CNN to learn deep representations equivariant to rotation. However, feature maps should be copied and rotated four times in each layer in their approach, which causes much running time and memory overhead. In order to address this problem, we propose Deep Rotation Equivariant Network(DREN) consisting of cycle layers, isotonic layers and decycle layers.Our proposed layers apply rotation transformation on filters rather than feature maps, achieving a speed up of more than 2 times with even less memory overhead. We evaluate DRENs on Rotated MNIST and CIFAR-10 datasets and demonstrate that it can improve the performance of state-of-the-art architectures. Our codes are released on GitHub.

Fast-Slow Recurrent Neural Networks

Processing sequential data of variable length is a major challenge in a wide range of applications, such as speech recognition, language modeling, generative image modeling and machine translation. Here, we address this challenge by proposing a novel recurrent neural network (RNN) architecture, the Fast-Slow RNN (FS-RNN). The FS-RNN incorporates the strengths of both multiscale RNNs and deep transition RNNs as it processes sequential data on different timescales and learns complex transition functions from one time step to the next. We evaluate the FS-RNN on two character level language modeling data sets, Penn Treebank and Hutter Prize Wikipedia, where we improve state of the art results to 1.19 and 1.25 bits-per-character (BPC), respectively. In addition, an ensemble of two FS-RNNs achieves 1.20 BPC on Hutter Prize Wikipedia outperforming the best known compression algorithm with respect to the BPC measure. We also present an empirical investigation of the learning and network dynamics of the FS-RNN, which explains the improved performance compared to other RNN architectures. Our approach is general as any kind of RNN cell is a possible building block for the FS-RNN architecture, and thus can be flexibly applied to different tasks.

General Algorithmic Search

In this paper we present a metaheuristic for global optimization called General Algorithmic Search (GAS). Specifically, GAS is a stochastic, single-objective method that evolves a swarm of agents in search of a global extremum. Numerical simulations with a sample of 31 test functions show that GAS outperforms Basin Hopping, Cuckoo Search, and Differential Evolution, especially in concurrent optimization, i.e., when several runs with different initial settings are executed and the first best wins. Python codes of all algorithms and complementary information are available online.

Learning with Average Top-k Loss

In this work, we introduce the average top-k (AT_k) loss as a new ensemble loss for supervised learning, which is the average over the k largest individual losses over a training dataset. We show that the AT_k loss is a natural generalization of the two widely used ensemble losses, namely the average loss and the maximum loss, but can combines their advantages and mitigate their drawbacks to better adapt to different data distributions. Furthermore, it remains a convex function over all individual losses, which can lead to convex optimization problems that can be solved effectively with conventional gradient-based method. We provide an intuitive interpretation of the AT_k loss based on its equivalent effect on the continuous individual loss functions, suggesting that it can reduce the penalty on correctly classified data. We further give a learning theory analysis of MAT_k learning on the classification calibration of the AT_k loss and the error bounds of AT_k-SVM. We demonstrate the applicability of minimum average top-k learning for binary classification and regression using synthetic and real datasets.

Dense Transformer Networks

The key idea of current deep learning methods for dense prediction is to apply a model on a regular patch centered on each pixel to make pixel-wise predictions. These methods are limited in the sense that the patches are determined by network architecture instead of learned from data. In this work, we propose the dense transformer networks, which can learn the shapes and sizes of patches from data. The dense transformer networks employ an encoder-decoder architecture, and a pair of dense transformer modules are inserted into each of the encoder and decoder paths. The novelty of this work is that we provide technical solutions for learning the shapes and sizes of patches from data and efficiently restoring the spatial correspondence required for dense prediction. The proposed dense transformer modules are differentiable, thus the entire network can be trained. We apply the proposed networks on natural and biological image segmentation tasks and show superior performance is achieved in comparison to baseline methods.

Sequential noise-induced escapes for oscillatory network dynamics

Weakly-normal basis vector fields in RKHS with an application to shape Newton methods

The Benefit of Being Flexible in Distributed Computation

Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation

Efficiently applying attention to sequential data with the Recurrent Discounted Attention unit

Bayesian Pool-based Active Learning with Abstention Feedbacks

Second-Order Word Embeddings from Nearest Neighbor Topological Features

Uplift Modeling with Multiple Treatments and General Response Types

A study on exponential-size neighborhoods for the bin packing problem with conflicts

Clinical Intervention Prediction and Understanding using Deep Networks

Predictive Analytics for Enhancing Travel Time Estimation in Navigation Apps of Apple, Google, and Microsoft

Discontinuous Hamiltonian Monte Carlo for sampling discrete parameters

Designs for estimating the treatment effect in networks with interference

Data-driven Random Fourier Features using Stein Effect

Model-free causal inference of binary experimental data

Convolution estimates and number of disjoint partitions

Statistical Convergence Analysis of Gradient EM on General Gaussian Mixture Models

Conscious and controlling elements in combinatorial group testing problems with more defectives

Critical two-point function for long-range $O(n)$ models below the upper critical dimension

Self-Organized Supercriticality and Oscillations in Networks of Stochastic Spiking Neurons

Deep Multi-instance Networks with Sparse Label Assignment for Whole Mammogram Classification

Safe Model-based Reinforcement Learning with Stability Guarantees

Random ordering formula for sofic and Rokhlin entropy of Gibbs measures

Hashing as Tie-Aware Learning to Rank

Simple Pricing Schemes for the Cloud

Joint Rate Control and Power Allocation for Non-Orthogonal Multiple Access Systems

Flexible Cache-Aided Networks with Backhauling

Exact Recovery of Number of Blocks in Blockmodels

On the multiply robust estimation of the mean of the g-functional

Sequence Summarization Using Order-constrained Kernelized Feature Subspaces

Generative Model with Coordinate Metric Learning for Object Recognition Based on 3D Models

Sufficient conditions for the existence of a path-factor which are related to odd components

Deep Learning Improves Template Matching by Normalized Cross Correlation

Journalists’ information needs, seeking behavior, and its determinants on social media

Substitution invariant Sturmian words and binary trees

Fully reliable error control for evolutionary problems

Which bridge estimator is optimal for variable selection?

Multi-Task Learning for Contextual Bandits

Dictionary-based Monitoring of Premature Ventricular Contractions: An Ultra-Low-Cost Point-of-Care Service

Robust Data Geometric Structure Aligned Close yet Discriminative Domain Adaptation

VANETs Meet Autonomous Vehicles: A Multimodal 3D Environment Learning Approach

On Using Time Without Clocks via Zigzag Causality

Self-supervised learning of visual features through embedding images into text topic spaces

Representing the suffix tree with the CDAWG

Higher order Cheeger inequalities for Steklov eigenvalues

On the Success Probability of Decoding (Partial) Unit Memory Codes

Restriction of odd degree characters of $\mathfrak{S}_n$

Efficient Covariance Approximations for Large Sparse Precision Matrices

Combinatorial n-fold Integer Programming and Applications

Towards Understanding the Invertibility of Convolutional Neural Networks

Bayesian Compression for Deep Learning

Packing parameters in graphs: New bounds and a solution to an open problem

Alliance formation with exclusion in the spatial public goods game

Stochastic decomposition applied to large-scale hydro valleys management

Daisy cubes and distance cube polynomial

On the Möbius Function and Topology of General Pattern Posets

On The Fixatic Number of Graphs

Continual Learning with Deep Generative Replay

Stochastic Sequential Neural Networks with Structured Inference

Tree-Structured Modelling of Varying Coefficients

The de Bruijn-Erdös-Hanani theorem

Inclusive Flavour Tagging Algorithm

V2X Meets NOMA: Non-Orthogonal Multiple Access for 5G Enabled Vehicular Networks

Non-orthogonal Multiple Access for High-reliable and Low-latency V2X Communications in 5G Systems

An experimental study of graph-based semi-supervised classification with additional node information

Open-Category Classification by Adversarial Sample Generation

Hajós’ cycle conjecture for small graphs

Weighted Poisson-Delaunay Mosaics

Non-Stationary Spectral Kernels

Efficient algorithm for large spectral partitions

A counterexample to Comon’s conjecture

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

A causal approach to analysis of censored medical costs in the presence of time-varying treatment

A.Ya. Khintchine’s Work in Probability Theory

Speeding up Dynamic Programming on DAGs through a Fast Approximation of Path Cover

Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-in-the-Blank Image Captioning

Small Sets with Large Difference Sets

Adaptive Detrending to Accelerate Convolutional Gated Recurrent Unit Training for Contextual Video Recognition

Three observations on spectra of zero-nonzero patterns

Threshold functions for small subgraphs: an analytic approach

A Two-Level Graph Partitioning Problem Arising in Mobile Wireless Communications

Group divisible (K_4-e)-packings with any minimum leave

Optimization of the Jaccard index for image segmentation with the Lovász hinge

STFT with Adaptive Window Width Based on the Chirp Rate

Continuous testing for Poisson process intensities: A new perspective on scanning statistics

Characterizing path-like trees from linear configurations

Beyond Parity: Fairness Objectives for Collaborative Filtering

A Bayesian Mallows approach to non-transitive pair comparison data: how human are sounds?

When Will AI Exceed Human Performance? Evidence from AI Experts

Boundary Crossing Probabilities for General Exponential Families

Power Systems Data Fusion based on Belief Propagation

Matrix-product structure of repeated-root constacyclic codes over finite fields

Causal Effect Inference with Deep Latent-Variable Models

From source to target and back: symmetric bi-directional adaptive GAN

Deep Investigation of Cross-Language Plagiarism Detection Methods

Transition to Shock Fluctuations in TASEP and Last Passage Percolation

Multi-Level Variational Autoencoder: Learning Disentangled Representations from Grouped Observations

Parsing with CYK over Distributed Representations: ‘Classical’ Syntactic Parsing in the Novel Era of Neural Networks

How a General-Purpose Commonsense Ontology can Improve Performance of Learning-Based Image Retrieval

Joint Distribution Optimal Transportation for Domain Adaptation

Improved Semi-supervised Learning with GANs using Manifold Invariances

Perturbation of Conservation Laws and Averaging on Manifolds

Audio-replay attack detection countermeasures

More Circulant Graphs exhibiting Pretty Good State Transfer

Anti-spoofing Methods for Automatic SpeakerVerification System

Transport and optics at the node in a nodal loop semimetal

Flow-GAN: Bridging implicit and prescribed learning in generative models

Quantum Channel Capacities Per Unit Cost

Sharp threshold for $K_4$-percolation

Modeling flow in porous media with double porosity/permeability: A stabilized mixed formulation, error analysis, and numerical solutions

Linearizable Iterators for Concurrent Data Structures