Understanding Generalization and Stochastic Gradient Descent

This paper tackles two related questions at the heart of machine learning; how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work is inspired by Zhang et al. (2017), who showed deep networks can easily memorize randomly labeled training data, despite generalizing well when shown real labels of the same inputs. We show here that the same phenomenon occurs in small linear models. These observations are explained by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima. Next, we explore the ‘generalization gap’ between small and large batch training, identifying an optimum batch size which maximizes the test set accuracy. Noise in the gradient updates is beneficial, driving the dynamics towards robust minima for which the evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, we predict the optimum batch size is proportional to both the learning rate and the size of the training set, and verify these predictions empirically.

Coded Fourier Transform

We consider the problem of computing the Fourier transform of high-dimensional vectors, distributedly over a cluster of machines consisting of a master node and multiple worker nodes, where the worker nodes can only store and process a fraction of the inputs. We show that by exploiting the algebraic structure of the Fourier transform operation and leveraging concepts from coding theory, one can efficiently deal with the straggler effects. In particular, we propose a computation strategy, named as coded FFT, which achieves the optimal recovery threshold, defined as the minimum number of workers that the master node needs to wait for in order to compute the output. This is the first code that achieves the optimum robustness in terms of tolerating stragglers or failures for computing Fourier transforms. Furthermore, the reconstruction process for coded FFT can be mapped to MDS decoding, which can be solved efficiently. Moreover, we extend coded FFT to settings including computing general n-dimensional Fourier transforms, and provide the optimal computing strategy for those settings.

Do Convolutional Neural Networks Learn Class Hierarchy?

Convolutional Neural Networks (CNNs) currently achieve state-of-the-art accuracy in image classification. With a growing number of classes, the accuracy usually drops as the possibilities of confusion increase. Interestingly, the class confusion patterns follow a hierarchical structure over the classes. We present visual-analytics methods to reveal and analyze this hierarchy of similar classes in relation with CNN-internal data. We found that this hierarchy not only dictates the confusion patterns between the classes, it furthermore dictates the learning behavior of CNNs. In particular, the early layers in these networks develop feature detectors that can separate high-level groups of classes quite well, even after a few training epochs. In contrast, the latter layers require substantially more epochs to develop specialized feature detectors that can separate individual classes. We demonstrate how these insights are key to significant improvement in accuracy by designing hierarchy-aware CNNs that accelerate model convergence and alleviate overfitting. We further demonstrate how our methods help in identifying various quality issues in the training data.

LASAGNE: Locality And Structure Aware Graph Node Embedding

In this work we propose Lasagne, a methodology to learn locality and structure aware graph node embeddings in an unsupervised way. In particular, we show that the performance of existing random-walk based approaches depends strongly on the structural properties of the graph, e.g., the size of the graph, whether the graph has a flat or upward-sloping Network Community Profile (NCP), whether the graph is expander-like, whether the classes of interest are more k-core-like or more peripheral, etc. For larger graphs with flat NCPs that are strongly expander-like, existing methods lead to random walks that expand rapidly, touching many dissimilar nodes, thereby leading to lower-quality vector representations that are less useful for downstream tasks. Rather than relying on global random walks or neighbors within fixed hop distances, Lasagne exploits strongly local Approximate Personalized PageRank stationary distributions to more precisely engineer local information into node embeddings. This leads, in particular, to more meaningful and more useful vector representations of nodes in poorly-structured graphs. We show that Lasagne leads to significant improvement in downstream multi-label classification for larger graphs with flat NCPs, that it is comparable for smaller graphs with upward-sloping NCPs, and that is comparable to existing methods for link prediction tasks.

Unsupervised Sentence Representations as Word Information Series: Revisiting TF–IDF

Sentence representation at the semantic level is a challenging task for Natural Language Processing and Artificial Intelligence. Despite the advances in word embeddings (i.e. word vector representations), capturing sentence meaning is an open question due to complexities of semantic interactions among words. In this paper, we present an embedding method, which is aimed at learning unsupervised sentence representations from unlabeled text. We propose an unsupervised method that models a sentence as a weighted series of word embeddings. The weights of the word embeddings are fitted by using Shannon’s word entropies provided by the Term Frequency–Inverse Document Frequency (TF–IDF) transform. The hyperparameters of the model can be selected according to the properties of data (e.g. sentence length and textual gender). Hyperparameter selection involves word embedding methods and dimensionalities, as well as weighting schemata. Our method offers advantages over existing methods: identifiable modules, short-term training, online inference of (unseen) sentence representations, as well as independence from domain, external knowledge and language resources. Results showed that our model outperformed the state of the art in well-known Semantic Textual Similarity (STS) benchmarks. Moreover, our model reached state-of-the-art performance when compared to supervised and knowledge-based STS systems.

Basic tasks of sentiment analysis

Subjectivity detection is the task of identifying objective and subjective sentences. Objective sentences are those which do not exhibit any sentiment. So, it is desired for a sentiment analysis engine to find and separate the objective sentences for further analysis, e.g., polarity detection. In subjective sentences, opinions can often be expressed on one or multiple topics. Aspect extraction is a subtask of sentiment analysis that consists in identifying opinion targets in opinionated text, i.e., in detecting the specific aspects of a product or service the opinion holder is either praising or complaining about.

Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting

We describe Honk, an open-source PyTorch reimplementation of convolutional neural networks for keyword spotting that are included as examples in TensorFlow. These models are useful for recognizing ‘command triggers’ in speech-based interfaces (e.g., ‘Hey Siri’), which serve as explicit cues for audio recordings of utterances that are sent to the cloud for full speech recognition. Evaluation on Google’s recently released Speech Commands Dataset shows that our reimplementation is comparable in accuracy and provides a starting point for future work on the keyword spotting task.

Replacement AutoEncoder: A Privacy-Preserving Algorithm for Sensory Data Analysis

An increasing number of sensors on mobile, Internet of things (IoT), and wearable devices generate time-series measurements of physical activities. Though access to the sensory data is critical to the success of many beneficial applications such as health monitoring or activity recognition, a wide range of potentially sensitive information about the individuals can also be discovered through these datasets and this cannot easily be protected using traditional privacy approaches. In this paper, we propose an integrated sensing framework for managing access to personal time-series data in order to provide utility while protecting individuals’ privacy. We introduce \textit{Replacement AutoEncoder}, a novel feature-learning algorithm which learns how to transform discriminative features of multidimensional time-series that correspond to sensitive inferences, into some features that have been more observed in non-sensitive inferences, to protect users’ privacy. The main advantage of Replacement AutoEncoder is its ability to keep important features of desired inferences unchanged to preserve the utility of the data. We evaluate the efficacy of the algorithm with an activity recognition task in a multi-sensing environment using extensive experiments on three benchmark datasets. We show that it can retain the recognition accuracy of state-of-the-art techniques while simultaneously preserving the privacy of sensitive information. We use a Generative Adversarial Network to attempt to detect the replacement of sensitive data with fake non-sensitive data. We show that this approach does not detect the replacement unless the network can train using the users’ original unmodified data.

On Least Squares Linear Regression Without Second Moment

If X and Y are real valued random variables such that the first moments of X, Y, and XY exist and the conditional expectation of Y given X is an affine function of X, then the intercept and slope of the conditional expectation equal the intercept and slope of the least squares linear regression function, even though Y may not have a finite second moment. As a consequence, the affine in X form of the conditional expectation and zero covariance imply mean independence.

A Correspondence Between Random Neural Networks and Statistical Field Theory

A number of recent papers have provided evidence that practical design questions about neural networks may be tackled theoretically by studying the behavior of random networks. However, until now the tools available for analyzing random neural networks have been relatively ad-hoc. In this work, we show that the distribution of pre-activations in random neural networks can be exactly mapped onto lattice models in statistical physics. We argue that several previous investigations of stochastic networks actually studied a particular factorial approximation to the full lattice model. For random linear networks and random rectified linear networks we show that the corresponding lattice models in the wide network limit may be systematically approximated by a Gaussian distribution with covariance between the layers of the network. In each case, the approximate distribution can be diagonalized by Fourier transformation. We show that this approximation accurately describes the results of numerical simulations of wide random neural networks. Finally, we demonstrate that in each case the large scale behavior of the random networks can be approximated by an effective field theory.

Learning Social Image Embedding with Deep Multimodal Attention Networks

Learning social media data embedding by deep models has attracted extensive research interest as well as boomed a lot of applications, such as link prediction, classification, and cross-modal search. However, for social images which contain both link information and multimodal contents (e.g., text description, and visual content), simply employing the embedding learnt from network structure or data content results in sub-optimal social image representation. In this paper, we propose a novel social image embedding approach called Deep Multimodal Attention Networks (DMAN), which employs a deep model to jointly embed multimodal contents and link information. Specifically, to effectively capture the correlations between multimodal contents, we propose a multimodal attention network to encode the fine-granularity relation between image regions and textual words. To leverage the network structure for embedding learning, a novel Siamese-Triplet neural network is proposed to model the links among images. With the joint deep model, the learnt embedding can capture both the multimodal contents and the nonlinear network information. Extensive experiments are conducted to investigate the effectiveness of our approach in the applications of multi-label classification and cross-modal search. Compared to state-of-the-art image embeddings, our proposed DMAN achieves significant improvement in the tasks of multi-label classification and cross-modal search.

An inferential procedure for community structure validation in networks

`Community structure’ is a commonly observed feature of real networks. The term refers to the presence in a network of groups of nodes (communities) that feature high internal connectivity and are poorly connected to each other. Whereas the issue of community detection has been addressed in several works, the problem of validating a partition of nodes as a good community structure for a network has received little attention and remains an open issue. We propose an inferential procedure for community structure validation of network partitions, which relies on concepts from network enrichment analysis. The proposed procedure allows to compare the adequacy of different partitions of nodes as community structures. Moreover, it can be employed to assess whether two networks share the same community structure, and to compare the performance of different network clustering algorithms.

Towards a Seamless Integration of Word Senses into Downstream NLP Applications

Lexical ambiguity can impede NLP systems from accurate understanding of semantics. Despite its potential benefits, the integration of sense-level information into NLP systems has remained understudied. By incorporating a novel disambiguation algorithm into a state-of-the-art classification model, we create a pipeline to integrate sense-level information into downstream NLP applications. We show that a simple disambiguation of the input text can lead to consistent performance improvement on multiple topic categorization and polarity detection datasets, particularly when the fine granularity of the underlying sense inventory is reduced and the document is sufficiently large. Our results also point to the need for sense representation research to focus more on in vivo evaluations which target the performance in downstream NLP applications rather than artificial benchmarks.

Phase Transitions in the Pooled Data Problem

In this paper, we study the pooled data problem of identifying the labels associated with a large collection of items, based on a sequence of pooled tests revealing the counts of each label within the pool. In the noiseless setting, we identify an exact asymptotic threshold on the required number of tests with optimal decoding, and prove a phase transition between complete success and complete failure. In addition, we present a novel noisy variation of the problem, and provide an information-theoretic framework for characterizing the required number of tests for general random noise models. Our results reveal that noise can make the problem considerably more difficult, with strict increases in the scaling laws even at low noise levels. Finally, we demonstrate similar behavior in an approximate recovery setting, where a given number of errors is allowed in the decoded labels.

Weighted Tensor Decomposition for Learning Latent Variables with Partial Data

Tensor decomposition methods are popular tools for learning latent variables given only lower-order moments of the data. However, the standard assumption is that we have sufficient data to estimate these moments to high accuracy. In this work, we consider the case in which certain dimensions of the data are not always observed—common in applied settings, where not all measurements may be taken for all observations—resulting in moment estimates of varying quality. We derive a weighted tensor decomposition approach that is computationally as efficient as the non-weighted approach, and demonstrate that it outperforms methods that do not appropriately leverage these less-observed dimensions.

The Origins of Computational Mechanics: A Brief Intellectual History and Several Clarifications

The principle goal of computational mechanics is to define pattern and structure so that the organization of complex systems can be detected and quantified. Computational mechanics developed from efforts in the 1970s and early 1980s to identify strange attractors as the mechanism driving weak fluid turbulence via the method of reconstructing attractor geometry from measurement time series and in the mid-1980s to estimate equations of motion directly from complex time series. In providing a mathematical and operational definition of structure it addressed weaknesses of these early approaches to discovering patterns in natural systems. Since then, computational mechanics has led to a range of results from theoretical physics and nonlinear mathematics to diverse applications—from closed-form analysis of Markov and non-Markov stochastic processes that are ergodic or nonergodic and their measures of information and intrinsic computation to complex materials and deterministic chaos and intelligence in Maxwellian demons to quantum compression of classical processes and the evolution of computation and language. This brief review clarifies several misunderstandings and addresses concerns recently raised regarding early works in the field (1980s). We show that misguided evaluations of the contributions of computational mechanics are groundless and stem from a lack of familiarity with its basic goals and from a failure to consider its historical context. For all practical purposes, its modern methods and results largely supersede the early works. This not only renders recent criticism moot and shows the solid ground on which computational mechanics stands but, most importantly, shows the significant progress achieved over three decades and points to the many intriguing and outstanding challenges in understanding the computational nature of complex dynamic systems.

Complexity and capacity bounds for quantum channels
S-Isomap++: Multi Manifold Learning from Streaming Data
Learning Inverse Statics Models Efficiently
Convex optimization over intersection of simple sets: improved convergence rate guarantees via an exact penalty approach
Relative Hard Lefschetz Theorem for Fans
On the number of unknot diagrams
Superpixels Based Marker Tracking Vs. Hue Thresholding In Rodent Biomechanics Application
Constructing Datasets for Multi-hop Reading Comprehension Across Documents
Successive Four-Dimensional Stokes-Space Direct Detection
Fine asymptotics for models with Gamma type moments
Classification and Geometry of General Perceptual Manifolds
Hultman elements for the hyperoctahedral groups
A Line-Point Unified Solution to Relative Camera Pose Estimation
Shape optimisation with nearly conformal transformations
Fundamental Limits of Low-Density Spreading NOMA with Fading
Chain Reduction for Binary and Zero-Suppressed Decision Diagrams
Scene Parsing with Global Context Embedding
Multi-focus image fusion using VOL and EOL in DCT domain
Pose-based Deep Gait Recognition
Learning Knowledge-guided Pose Grammar Machine for 3D Human Pose Estimation
On reducing sampling variance in covariate shift using control variates
Dihedral Sieving Phenomena
Sistema de Navegação Autônomo Baseado em Visão Computacional
Near-Optimal Adversarial Policy Switching for Decentralized Asynchronous Multi-Agent Systems
Spectral Conditions for Existence and Uniqueness of Recursive Utilities
Maximum number of colourings. II. 5-chromatic graphs
Asymmetric Actor Critic for Image-Based Robot Learning
New inertial regularized algorithm for solving strongly pseudomonotone equilibrium problems
Exploiting oddsmaker bias to improve the prediction of NFL outcomes
Learning Deep Context-aware Features over Body and Latent Parts for Person Re-identification
A recognition algorithm for simple-triangle graphs
Revenue-based Attribution Modeling for Online Advertising
Hydrodynamic limit and Propagation of Chaos for Brownian Particles reflecting from a Newtonian barrier
Towards a unified theory for testing statistical hypothesis: Multinormal mean with nuisance covariance matrix
The Effects of Memory Replay in Reinforcement Learning
Distributed optimization-based control of dynamically coupled networks
Association and Load Optimization with User Priorities in Load-Coupled Heterogeneous Networks
MEDOC: a Python wrapper to load MEDLINE into a local MySQL database
Review of Data Structures for Computationally Efficient Nearest-Neighbour Entropy Estimators for Large Systems with Periodic Boundary Conditions
Eigenvalue fluctuations for lattice Anderson Hamiltonians: Unbounded potentials
Variational Inference based on Robust Divergences
A New Hierarchy of Second-order Cone Programming Relaxations for Polynomial Optimization: Asymptotic and Finite Convergence
Cell Segmentation in 3D Confocal Images using Supervoxel Merge-Forests with CNN-based Hypothesis Selection
Mirror Descent and Convex Optimization Problems With Non-Smooth Inequality Constraints
Bayesian inversion of convolved hidden Markov models with applications in reservoir prediction
The Robust Reading Competition Annotation and Evaluation Platform
On Optimal Operational Sequence of Components in a Warm Standby System
A Nonconvex Proximal Splitting Algorithm under Moreau-Yosida Regularization
Non-Gaussian Error Distributions of Galactic Rotation Speed Measurements
A Sinkhorn-Newton method for entropic optimal transport
Deceased Organ Matching in Australia
Empirical regression quantile process with possible application to risk analysis
Is the bump significant? An axion-search example
A computerised classification of some almost minimal triangle-free Ramsey graphs
Image Restoration by Iterative Denoising and Backward Projections
Cluster-glass phase in pyrochlore XY antiferromagnets with quenched disorder
Variable selection for the prediction of C[0,1]-valued AR processes using RKHS
On cyclic descents for tableaux
Simultaneous Recognition and Pose Estimation of Instruments in Minimally Invasive Surgery
A quasi-Bayesian framework for the development of computer models
Metastability of one-dimensional, non-reversible diffusions with periodic boundary conditions
A five-decision testing procedure to infer on unidimensional parameter
Dropout Sampling for Robust Object Detection in Open-Set Conditions
Near-domination in graphs
Confidence interval for correlation estimator between latent processes
Statistics of optimal information flow in ensembles of regulatory motifs
Self-to-self transitions in open quantum systems: the origin and solutions
First-Order Perturbation Analysis of the SECSI Framework for the Approximate CP Decomposition of 3-D Noise-Corrupted Low-Rank Tensors
On the Hurwitz action in affine Coxeter groups
Build Fast and Accurate Lemmatization for Arabic
Stochastic Weighted Function Norm Regularization
Application de la récurrence topologique aux intégrales de matrices aléatoires et aux systèmes intégrables
On spectral radii of unraveled balls
Backward SDEs and infinite horizon stochastic optimal control
Semi-Parametric Causal Sufficient Dimension Reduction Of High Dimensional Treatments
Brownian motion with general drift
Robust Model Order Selection in Large Dimensional Elliptically Symmetric Noise
Jointly Optimal Spatial Channel Assignment and Power Allocation for MIMO SWIPT Systems
Characterizing correlations and synchronization in collective dynamics
A new approach for the construction of a Wasserstein diffusion
Caching in Combination Networks: Novel Multicast Message Generation and Delivery by Leveraging the Network Topology
Universally Weakly Secure Coset Coding Schemes for Minimum Storage Regenerating (MSR) Codes
Perfect $k$-colored matchings and $k\!+\!2$-gonal tilings
Koopman operator-based model reduction for switched-system control of PDEs
A complete characterization of optimal dictionaries for least squares representation
SQG-Differential Evolution for difficult optimization problems under a tight function evaluation budget
Counting compositions over finite abelian groups
Enhancing the Performance of Convolutional Neural Networks on Quality Degraded Datasets
Minimax Linear Estimation at a Boundary Point
Photo-Guided Exploration of Volume Data Features
Identifying Mild Traumatic Brain Injury Patients From MR Images Using Bag of Visual Words
Nonlinear Phase-Quantized Constant-Envelope Precoding for Massive MU-MIMO-OFDM
Spatial random field models based on Lévy indicator convolutions
Using Deep Convolutional Networks for Gesture Recognition in American Sign Language