**Understanding Generalization and Stochastic Gradient Descent**

This paper tackles two related questions at the heart of machine learning: how can we predict whether a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work is inspired by Zhang et al. (2017), who showed that deep networks can easily memorize randomly labeled training data despite generalizing well when shown the real labels of the same inputs. We show here that the same phenomenon occurs in small linear models. These observations are explained by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima. Next, we explore the ‘generalization gap’ between small- and large-batch training, identifying an optimum batch size which maximizes test set accuracy. Noise in the gradient updates is beneficial, driving the dynamics towards robust minima for which the evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, we predict that the optimum batch size is proportional to both the learning rate and the size of the training set, and we verify these predictions empirically.
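
The scaling prediction in the final sentence can be sketched numerically. This is a minimal illustration, assuming the gradient-noise scale takes the form g = ε(N/B − 1) with learning rate ε, training set size N, and batch size B; the function names are our own:

```python
def gradient_noise_scale(learning_rate, train_size, batch_size):
    """Noise scale g = eps * (N / B - 1); approximately eps * N / B when B << N."""
    return learning_rate * (train_size / batch_size - 1)

def optimal_batch_size(learning_rate, train_size, optimal_noise_scale):
    """If test accuracy peaks at some noise scale g*, inverting g ~ eps * N / B
    gives B_opt ~ eps * N / g*: proportional to both the learning rate
    and the training set size."""
    return learning_rate * train_size / optimal_noise_scale
```

Doubling either the learning rate or the training set size doubles the predicted optimum batch size, matching the stated proportionality.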

**Coded Fourier Transform**

We consider the problem of computing the Fourier transform of high-dimensional vectors, distributedly over a cluster of machines consisting of a master node and multiple worker nodes, where the worker nodes can only store and process a fraction of the inputs. We show that by exploiting the algebraic structure of the Fourier transform operation and leveraging concepts from coding theory, one can efficiently deal with straggler effects. In particular, we propose a computation strategy, named coded FFT, which achieves the optimal recovery threshold, defined as the minimum number of workers that the master node needs to wait for in order to compute the output. This is the first code that achieves optimum robustness in terms of tolerating stragglers or failures for computing Fourier transforms. Furthermore, the reconstruction process for coded FFT can be mapped to MDS decoding, which can be solved efficiently. Moreover, we extend coded FFT to settings including computing general multidimensional Fourier transforms, and provide the optimal computing strategy for those settings.
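
The role of linearity in such coded strategies can be illustrated with a toy erasure-coded DFT. This is an illustrative sketch, not the paper's exact coded FFT construction: the input is split into k = 2 blocks, encoded with a Vandermonde (MDS) code across n = 3 workers, and the block DFTs are recovered from any 2 surviving results (all names are hypothetical):

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a list of (complex) samples."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def encode(x0, x1, alphas):
    """MDS-encode two blocks element-wise: c_j = x0 + alpha_j * x1."""
    return [[a + al * b for a, b in zip(x0, x1)] for al in alphas]

def decode(results, alphas):
    """DFT linearity gives DFT(c_j) = DFT(x0) + alpha_j * DFT(x1), so any
    two surviving workers define a solvable 2x2 Vandermonde system."""
    (a0, y0), (a1, y1) = list(zip(alphas, results))[:2]
    f0 = [(a1 * u - a0 * v) / (a1 - a0) for u, v in zip(y0, y1)]
    f1 = [(v - u) / (a1 - a0) for u, v in zip(y0, y1)]
    return f0, f1
```

Here the master recovers the DFTs of the individual blocks even if one of the three workers straggles; combining block transforms into the full-length transform is the FFT-structure step the paper's construction handles.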

**Do Convolutional Neural Networks Learn Class Hierarchy?**

Convolutional Neural Networks (CNNs) currently achieve state-of-the-art accuracy in image classification. With a growing number of classes, the accuracy usually drops as the possibilities of confusion increase. Interestingly, the class confusion patterns follow a hierarchical structure over the classes. We present visual-analytics methods to reveal and analyze this hierarchy of similar classes in relation to CNN-internal data. We found that this hierarchy not only dictates the confusion patterns between the classes, it furthermore dictates the learning behavior of CNNs. In particular, the early layers in these networks develop feature detectors that can separate high-level groups of classes quite well, even after a few training epochs. In contrast, the later layers require substantially more epochs to develop specialized feature detectors that can separate individual classes. We demonstrate how these insights are key to significant improvements in accuracy by designing hierarchy-aware CNNs that accelerate model convergence and alleviate overfitting. We further demonstrate how our methods help in identifying various quality issues in the training data.
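
One way to surface such a hierarchy from a trained model is to group classes by their mutual confusion. A toy stand-in (not the paper's visual-analytics method), assuming a confusion matrix of raw counts and a hypothetical merge threshold:

```python
def confusion_groups(confusion, threshold):
    """Union-find over classes: merge any two classes the model confuses
    at least `threshold` times, yielding high-level class groups."""
    n = len(confusion)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(n):
            if i != j and confusion[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

On a small confusion matrix this recovers groups of mutually confusable classes, the coarse structure the early layers already separate.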

**LASAGNE: Locality And Structure Aware Graph Node Embedding**

In this work we propose Lasagne, a methodology to learn locality and structure aware graph node embeddings in an unsupervised way. In particular, we show that the performance of existing random-walk based approaches depends strongly on the structural properties of the graph, e.g., the size of the graph, whether the graph has a flat or upward-sloping Network Community Profile (NCP), whether the graph is expander-like, whether the classes of interest are more k-core-like or more peripheral, etc. For larger graphs with flat NCPs that are strongly expander-like, existing methods lead to random walks that expand rapidly, touching many dissimilar nodes, thereby leading to lower-quality vector representations that are less useful for downstream tasks. Rather than relying on global random walks or neighbors within fixed hop distances, Lasagne exploits strongly local Approximate Personalized PageRank stationary distributions to more precisely engineer local information into node embeddings. This leads, in particular, to more meaningful and more useful vector representations of nodes in poorly-structured graphs. We show that Lasagne leads to significant improvement in downstream multi-label classification for larger graphs with flat NCPs, that it is comparable for smaller graphs with upward-sloping NCPs, and that it is comparable to existing methods for link prediction tasks.

**Unsupervised Sentence Representations as Word Information Series: Revisiting TF–IDF**

Sentence representation at the semantic level is a challenging task for Natural Language Processing and Artificial Intelligence. Despite the advances in word embeddings (i.e. word vector representations), capturing sentence meaning is an open question due to the complexities of semantic interactions among words. In this paper, we present an embedding method aimed at learning unsupervised sentence representations from unlabeled text. We propose an unsupervised method that models a sentence as a weighted series of word embeddings. The weights of the word embeddings are fitted using Shannon word entropies provided by the Term Frequency–Inverse Document Frequency (TF–IDF) transform. The hyperparameters of the model can be selected according to the properties of the data (e.g. sentence length and textual genre). Hyperparameter selection involves word embedding methods and dimensionalities, as well as weighting schemata. Our method offers advantages over existing methods: identifiable modules, short training times, online inference of (unseen) sentence representations, as well as independence from domain, external knowledge and language resources. Results showed that our model outperformed the state of the art in well-known Semantic Textual Similarity (STS) benchmarks. Moreover, our model reached state-of-the-art performance when compared to supervised and knowledge-based STS systems.
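
The core of the method, a sentence as a weighted series of word embeddings, can be sketched as follows. A minimal illustration with toy vectors; plain TF–IDF weights stand in for the entropy-based weights the paper actually fits:

```python
import math
from collections import Counter

def idf_weights(corpus):
    """Inverse document frequency per word over a list of tokenized documents."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n / df[w]) for w in df}

def sentence_embedding(sentence, word_vecs, idf):
    """Weighted series of word embeddings: sum of tf(w) * idf(w) * vec(w)."""
    dim = len(next(iter(word_vecs.values())))
    out = [0.0] * dim
    for w, count in Counter(sentence).items():
        weight = count * idf.get(w, 0.0)
        for i, v in enumerate(word_vecs.get(w, [0.0] * dim)):
            out[i] += weight * v
    return out
```

Words that appear in every document get zero weight, so the representation is dominated by the more informative words of the sentence.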

**Basic tasks of sentiment analysis**

Subjectivity detection is the task of distinguishing objective from subjective sentences. Objective sentences are those which do not express any sentiment; a sentiment analysis engine should therefore identify and set aside objective sentences before further analysis, e.g., polarity detection. In subjective sentences, opinions can be expressed on one or multiple topics. Aspect extraction is a subtask of sentiment analysis that consists in identifying opinion targets in opinionated text, i.e., in detecting the specific aspects of a product or service the opinion holder is either praising or complaining about.
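
The pipeline described above, filtering out objective sentences and then running polarity detection on the rest, can be sketched with a toy opinion lexicon (the lexicon and function names are hypothetical, not from the text):

```python
# Toy opinion lexicon; a real system would learn these cues from data.
OPINION_WORDS = {"great": 1, "love": 1, "terrible": -1, "awful": -1}

def is_subjective(tokens):
    """Subjectivity detection: does the sentence carry any sentiment cue?"""
    return any(t in OPINION_WORDS for t in tokens)

def polarity(tokens):
    """Polarity detection, applied only to subjective sentences."""
    return sum(OPINION_WORDS.get(t, 0) for t in tokens)

def analyze(sentences):
    # Objective sentences are separated out before polarity detection.
    return [(s, polarity(s)) for s in sentences if is_subjective(s)]
```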

**Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting**

We describe Honk, an open-source PyTorch reimplementation of convolutional neural networks for keyword spotting that are included as examples in TensorFlow. These models are useful for recognizing ‘command triggers’ in speech-based interfaces (e.g., ‘Hey Siri’), which serve as explicit cues for audio recordings of utterances that are sent to the cloud for full speech recognition. Evaluation on Google’s recently released Speech Commands Dataset shows that our reimplementation is comparable in accuracy and provides a starting point for future work on the keyword spotting task.

**Replacement AutoEncoder: A Privacy-Preserving Algorithm for Sensory Data Analysis**

An increasing number of sensors on mobile, Internet of Things (IoT), and wearable devices generate time-series measurements of physical activities. Although access to the sensory data is critical to the success of many beneficial applications such as health monitoring or activity recognition, a wide range of potentially sensitive information about individuals can also be discovered through these datasets, and this cannot easily be protected using traditional privacy approaches. In this paper, we propose an integrated sensing framework for managing access to personal time-series data in order to provide utility while protecting individuals’ privacy. We introduce the *Replacement AutoEncoder*, a novel feature-learning algorithm which learns how to transform discriminative features of multidimensional time-series that correspond to sensitive inferences into features more commonly observed in non-sensitive inferences, thereby protecting users’ privacy. The main advantage of the Replacement AutoEncoder is its ability to keep the important features of desired inferences unchanged, preserving the utility of the data. We evaluate the efficacy of the algorithm on an activity recognition task in a multi-sensing environment, using extensive experiments on three benchmark datasets. We show that it can retain the recognition accuracy of state-of-the-art techniques while simultaneously preserving the privacy of sensitive information. We use a Generative Adversarial Network to attempt to detect the replacement of sensitive data with fake non-sensitive data, and show that this approach fails to detect the replacement unless the network can train on the users’ original unmodified data.
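
The replacement idea can be illustrated without the autoencoder machinery. A toy stand-in: windows flagged as sensitive by a hypothetical predicate are swapped for a prototype built from non-sensitive data, while windows supporting desired inferences pass through unchanged:

```python
def mean_window(windows):
    """Element-wise mean of a set of windows: a crude non-sensitive prototype
    (the paper learns this transformation instead of averaging)."""
    n = len(windows)
    return [sum(w[i] for w in windows) / n for i in range(len(windows[0]))]

def replace_sensitive(windows, is_sensitive, prototype):
    """Keep windows for desired inferences unchanged; replace windows carrying
    sensitive inferences with non-sensitive-looking data."""
    return [prototype if is_sensitive(w) else w for w in windows]
```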

**On Least Squares Linear Regression Without Second Moment**

If X and Y are real-valued random variables such that the first moments of X, Y, and XY exist and the conditional expectation of Y given X is an affine function of X, then the intercept and slope of the conditional expectation equal the intercept and slope of the least squares linear regression function, even though Y may not have a finite second moment. As a consequence, an affine-in-X conditional expectation together with zero covariance implies mean independence.
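
The identification can be seen in two lines, assuming 0 < Var(X) < ∞ (a condition on X only; no second moment of Y is needed):

```latex
% Take expectations of E[Y|X] = \alpha + \beta X:
E[Y] = \alpha + \beta\, E[X].
% Multiply by X and use the tower property (E[XY] finite by assumption):
E[XY] = E\bigl(X\, E[Y \mid X]\bigr) = \alpha\, E[X] + \beta\, E[X^2].
% Solve the two linear equations:
\beta = \frac{E[XY] - E[X]\,E[Y]}{E[X^2] - E[X]^2}
      = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)},
\qquad
\alpha = E[Y] - \beta\, E[X].
```

These are exactly the least squares coefficients, and the second moment of Y never enters the derivation.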

**A Correspondence Between Random Neural Networks and Statistical Field Theory**

A number of recent papers have provided evidence that practical design questions about neural networks may be tackled theoretically by studying the behavior of random networks. However, until now the tools available for analyzing random neural networks have been relatively ad hoc. In this work, we show that the distribution of pre-activations in random neural networks can be exactly mapped onto lattice models in statistical physics. We argue that several previous investigations of stochastic networks actually studied a particular factorial approximation to the full lattice model. For random linear networks and random rectified linear networks we show that the corresponding lattice models in the wide network limit may be systematically approximated by a Gaussian distribution with covariance between the layers of the network. In each case, the approximate distribution can be diagonalized by Fourier transformation. We show that this approximation accurately describes the results of numerical simulations of wide random neural networks. Finally, we demonstrate that in each case the large scale behavior of the random networks can be approximated by an effective field theory.

**Learning Social Image Embedding with Deep Multimodal Attention Networks**

Learning social media data embeddings with deep models has attracted extensive research interest and enabled a range of applications, such as link prediction, classification, and cross-modal search. However, for social images, which contain both link information and multimodal contents (e.g., text descriptions and visual content), simply employing an embedding learnt from the network structure or the data content alone results in sub-optimal social image representations. In this paper, we propose a novel social image embedding approach called Deep Multimodal Attention Networks (DMAN), which employs a deep model to jointly embed multimodal contents and link information. Specifically, to effectively capture the correlations between multimodal contents, we propose a multimodal attention network to encode the fine-granularity relation between image regions and textual words. To leverage the network structure for embedding learning, a novel Siamese-Triplet neural network is proposed to model the links among images. With the joint deep model, the learnt embedding can capture both the multimodal contents and the nonlinear network information. Extensive experiments are conducted to investigate the effectiveness of our approach in the applications of multi-label classification and cross-modal search. Compared to state-of-the-art image embeddings, our proposed DMAN achieves significant improvement in the tasks of multi-label classification and cross-modal search.
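
The link-modeling component rests on a triplet objective: pull an image's embedding toward linked (positive) images and push it away from unlinked (negative) ones. A minimal sketch of such a loss; the paper's Siamese-Triplet network learns embeddings under an objective of this general shape, and the margin value here is hypothetical:

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero when the positive is already at least `margin` (in squared
    distance) closer to the anchor than the negative; positive otherwise."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```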

**An inferential procedure for community structure validation in networks**

‘Community structure’ is a commonly observed feature of real networks. The term refers to the presence in a network of groups of nodes (communities) that feature high internal connectivity and are poorly connected to each other. Whereas the issue of community detection has been addressed in several works, the problem of validating a partition of nodes as a good community structure for a network has received little attention and remains an open issue. We propose an inferential procedure for community structure validation of network partitions, which relies on concepts from network enrichment analysis. The proposed procedure allows one to compare the adequacy of different partitions of nodes as community structures. Moreover, it can be employed to assess whether two networks share the same community structure, and to compare the performance of different network clustering algorithms.
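
The defining property, high internal connectivity with sparse connections between groups, suggests a simple summary statistic. A toy proxy (not the paper's enrichment-based inferential procedure):

```python
def internal_edge_fraction(edges, membership):
    """Fraction of edges that fall within communities: values near 1 indicate
    dense internal connectivity and sparse connections between groups."""
    internal = sum(1 for u, v in edges if membership[u] == membership[v])
    return internal / len(edges)
```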

**Towards a Seamless Integration of Word Senses into Downstream NLP Applications**

Lexical ambiguity can impede NLP systems from accurate understanding of semantics. Despite its potential benefits, the integration of sense-level information into NLP systems has remained understudied. By incorporating a novel disambiguation algorithm into a state-of-the-art classification model, we create a pipeline to integrate sense-level information into downstream NLP applications. We show that a simple disambiguation of the input text can lead to consistent performance improvement on multiple topic categorization and polarity detection datasets, particularly when the fine granularity of the underlying sense inventory is reduced and the document is sufficiently large. Our results also point to the need for sense representation research to focus more on in vivo evaluations which target the performance in downstream NLP applications rather than artificial benchmarks.

**Phase Transitions in the Pooled Data Problem**

In this paper, we study the pooled data problem of identifying the labels associated with a large collection of items, based on a sequence of pooled tests revealing the counts of each label within the pool. In the noiseless setting, we identify an exact asymptotic threshold on the required number of tests with optimal decoding, and prove a phase transition between complete success and complete failure. In addition, we present a novel noisy variation of the problem, and provide an information-theoretic framework for characterizing the required number of tests for general random noise models. Our results reveal that noise can make the problem considerably more difficult, with strict increases in the scaling laws even at low noise levels. Finally, we demonstrate similar behavior in an approximate recovery setting, where a given number of errors is allowed in the decoded labels.
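
The setting can be made concrete with a toy instance. Each pooled test reveals the histogram of labels in the chosen pool; exhaustive decoding then shows when the observations pin down the labeling exactly (a brute-force illustration, not the paper's optimal decoder):

```python
from collections import Counter
from itertools import product

def pooled_test(labels, pool):
    """A test reveals the count of each label within the pooled items."""
    return Counter(labels[i] for i in pool)

def brute_force_decode(n_items, label_set, pools, observations):
    """Return every labeling consistent with all pooled observations;
    a singleton result means the tests suffice for exact recovery."""
    return [cand for cand in product(label_set, repeat=n_items)
            if all(pooled_test(cand, p) == obs
                   for p, obs in zip(pools, observations))]
```

With three items and three pairwise pools, the observations below identify the labeling uniquely; dropping a test typically leaves several consistent labelings, the complete-failure side of the phase transition in miniature.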

**Weighted Tensor Decomposition for Learning Latent Variables with Partial Data**

Tensor decomposition methods are popular tools for learning latent variables given only lower-order moments of the data. However, the standard assumption is that we have sufficient data to estimate these moments to high accuracy. In this work, we consider the case in which certain dimensions of the data are not always observed—common in applied settings, where not all measurements may be taken for all observations—resulting in moment estimates of varying quality. We derive a weighted tensor decomposition approach that is computationally as efficient as the non-weighted approach, and demonstrate that it outperforms methods that do not appropriately leverage these less-observed dimensions.

**The Origins of Computational Mechanics: A Brief Intellectual History and Several Clarifications**

The principal goal of computational mechanics is to define pattern and structure so that the organization of complex systems can be detected and quantified. Computational mechanics developed from efforts in the 1970s and early 1980s to identify strange attractors as the mechanism driving weak fluid turbulence via the method of reconstructing attractor geometry from measurement time series, and in the mid-1980s to estimate equations of motion directly from complex time series. In providing a mathematical and operational definition of structure it addressed weaknesses of these early approaches to discovering patterns in natural systems. Since then, computational mechanics has led to a range of results from theoretical physics and nonlinear mathematics to diverse applications—from closed-form analysis of Markov and non-Markov stochastic processes that are ergodic or nonergodic and their measures of information and intrinsic computation to complex materials and deterministic chaos and intelligence in Maxwellian demons to quantum compression of classical processes and the evolution of computation and language. This brief review clarifies several misunderstandings and addresses concerns recently raised regarding early works in the field (1980s). We show that misguided evaluations of the contributions of computational mechanics are groundless and stem from a lack of familiarity with its basic goals and from a failure to consider its historical context. For all practical purposes, its modern methods and results largely supersede the early works. This not only renders recent criticism moot and shows the solid ground on which computational mechanics stands but, most importantly, shows the significant progress achieved over three decades and points to the many intriguing and outstanding challenges in understanding the computational nature of complex dynamic systems.

• Complexity and capacity bounds for quantum channels

• S-Isomap++: Multi Manifold Learning from Streaming Data

• Learning Inverse Statics Models Efficiently

• Convex optimization over intersection of simple sets: improved convergence rate guarantees via an exact penalty approach

• Relative Hard Lefschetz Theorem for Fans

• On the number of unknot diagrams

• Superpixels Based Marker Tracking Vs. Hue Thresholding In Rodent Biomechanics Application

• Constructing Datasets for Multi-hop Reading Comprehension Across Documents

• Successive Four-Dimensional Stokes-Space Direct Detection

• Fine asymptotics for models with Gamma type moments

• Classification and Geometry of General Perceptual Manifolds

• Hultman elements for the hyperoctahedral groups

• A Line-Point Unified Solution to Relative Camera Pose Estimation

• Shape optimisation with nearly conformal transformations

• Fundamental Limits of Low-Density Spreading NOMA with Fading

• Chain Reduction for Binary and Zero-Suppressed Decision Diagrams

• Scene Parsing with Global Context Embedding

• Multi-focus image fusion using VOL and EOL in DCT domain

• Pose-based Deep Gait Recognition

• Learning Knowledge-guided Pose Grammar Machine for 3D Human Pose Estimation

• On reducing sampling variance in covariate shift using control variates

• Dihedral Sieving Phenomena

• Autonomous Navigation System Based on Computer Vision (Sistema de Navegação Autônomo Baseado em Visão Computacional)

• Near-Optimal Adversarial Policy Switching for Decentralized Asynchronous Multi-Agent Systems

• Spectral Conditions for Existence and Uniqueness of Recursive Utilities

• Maximum number of colourings. II. 5-chromatic graphs

• Asymmetric Actor Critic for Image-Based Robot Learning

• New inertial regularized algorithm for solving strongly pseudomonotone equilibrium problems

• Exploiting oddsmaker bias to improve the prediction of NFL outcomes

• Learning Deep Context-aware Features over Body and Latent Parts for Person Re-identification

• A recognition algorithm for simple-triangle graphs

• Revenue-based Attribution Modeling for Online Advertising

• Hydrodynamic limit and Propagation of Chaos for Brownian Particles reflecting from a Newtonian barrier

• Towards a unified theory for testing statistical hypothesis: Multinormal mean with nuisance covariance matrix

• The Effects of Memory Replay in Reinforcement Learning

• Distributed optimization-based control of dynamically coupled networks

• Association and Load Optimization with User Priorities in Load-Coupled Heterogeneous Networks

• MEDOC: a Python wrapper to load MEDLINE into a local MySQL database

• Review of Data Structures for Computationally Efficient Nearest-Neighbour Entropy Estimators for Large Systems with Periodic Boundary Conditions

• Eigenvalue fluctuations for lattice Anderson Hamiltonians: Unbounded potentials

• Variational Inference based on Robust Divergences

• A New Hierarchy of Second-order Cone Programming Relaxations for Polynomial Optimization: Asymptotic and Finite Convergence

• Cell Segmentation in 3D Confocal Images using Supervoxel Merge-Forests with CNN-based Hypothesis Selection

• Mirror Descent and Convex Optimization Problems With Non-Smooth Inequality Constraints

• Bayesian inversion of convolved hidden Markov models with applications in reservoir prediction

• The Robust Reading Competition Annotation and Evaluation Platform

• On Optimal Operational Sequence of Components in a Warm Standby System

• A Nonconvex Proximal Splitting Algorithm under Moreau-Yosida Regularization

• Non-Gaussian Error Distributions of Galactic Rotation Speed Measurements

• A Sinkhorn-Newton method for entropic optimal transport

• Deceased Organ Matching in Australia

• Empirical regression quantile process with possible application to risk analysis

• Is the bump significant? An axion-search example

• A computerised classification of some almost minimal triangle-free Ramsey graphs

• Image Restoration by Iterative Denoising and Backward Projections

• Cluster-glass phase in pyrochlore XY antiferromagnets with quenched disorder

• Variable selection for the prediction of C[0,1]-valued AR processes using RKHS

• On cyclic descents for tableaux

• Simultaneous Recognition and Pose Estimation of Instruments in Minimally Invasive Surgery

• A quasi-Bayesian framework for the development of computer models

• Metastability of one-dimensional, non-reversible diffusions with periodic boundary conditions

• A five-decision testing procedure to infer on unidimensional parameter

• Dropout Sampling for Robust Object Detection in Open-Set Conditions

• Near-domination in graphs

• Confidence interval for correlation estimator between latent processes

• Statistics of optimal information flow in ensembles of regulatory motifs

• Self-to-self transitions in open quantum systems: the origin and solutions

• First-Order Perturbation Analysis of the SECSI Framework for the Approximate CP Decomposition of 3-D Noise-Corrupted Low-Rank Tensors

• On the Hurwitz action in affine Coxeter groups

• Build Fast and Accurate Lemmatization for Arabic

• Stochastic Weighted Function Norm Regularization

• Application of Topological Recursion to Random Matrix Integrals and Integrable Systems (Application de la récurrence topologique aux intégrales de matrices aléatoires et aux systèmes intégrables)

• On spectral radii of unraveled balls

• Backward SDEs and infinite horizon stochastic optimal control

• Semi-Parametric Causal Sufficient Dimension Reduction Of High Dimensional Treatments

• Brownian motion with general drift

• Robust Model Order Selection in Large Dimensional Elliptically Symmetric Noise

• Jointly Optimal Spatial Channel Assignment and Power Allocation for MIMO SWIPT Systems

• Characterizing correlations and synchronization in collective dynamics

• A new approach for the construction of a Wasserstein diffusion

• Caching in Combination Networks: Novel Multicast Message Generation and Delivery by Leveraging the Network Topology

• Universally Weakly Secure Coset Coding Schemes for Minimum Storage Regenerating (MSR) Codes

• Perfect $k$-colored matchings and $k\!+\!2$-gonal tilings

• Koopman operator-based model reduction for switched-system control of PDEs

• A complete characterization of optimal dictionaries for least squares representation

• SQG-Differential Evolution for difficult optimization problems under a tight function evaluation budget

• Counting compositions over finite abelian groups

• Enhancing the Performance of Convolutional Neural Networks on Quality Degraded Datasets

• Minimax Linear Estimation at a Boundary Point

• Photo-Guided Exploration of Volume Data Features

• Identifying Mild Traumatic Brain Injury Patients From MR Images Using Bag of Visual Words

• Nonlinear Phase-Quantized Constant-Envelope Precoding for Massive MU-MIMO-OFDM

• Spatial random field models based on Lévy indicator convolutions

• Using Deep Convolutional Networks for Gesture Recognition in American Sign Language