Why & When Deep Learning Works: Looking Inside Deep Learnings

The Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) has been heavily supporting Machine Learning and Deep Learning research since its foundation in 2012. We have asked six leading ICRI-CI Deep Learning researchers to address the challenge of ‘Why & When Deep Learning works’, with the goal of looking inside Deep Learning, providing insights on how deep networks function, and uncovering key observations on their expressiveness, limitations, and potential. This challenge resulted in five papers that address different facets of deep learning. These facets include a high-level understanding of why and when deep networks work (and do not work), the impact of geometry on the expressiveness of deep networks, and making deep networks interpretable.

Tensor Graphical Lasso (TeraLasso)

The Bigraphical Lasso estimator was proposed to parsimoniously model the precision matrices of matrix-normal data based on the Cartesian product of graphs. By enforcing extreme sparsity (in the number of parameters) and explicit structure on the precision matrix, this model has excellent potential for improving the scalability of the computation and the interpretability of complex data analysis. As a result, this model significantly reduces the sample size needed to learn the precision matrices, and hence the conditional probability models along different coordinates such as space, time and replicates. In this work, we extend the Bigraphical Lasso (BiGLasso) estimator to the TEnsor gRAphical Lasso (TeraLasso) estimator and propose an analogous method for modeling the precision matrix of tensor-valued data. We establish consistency for both the BiGLasso and TeraLasso estimators and obtain the rates of convergence in the operator and Frobenius norms for estimating the precision matrix. We design a scalable gradient descent method for solving the objective function and analyze the computational convergence rate, showing that the composite gradient descent algorithm is guaranteed to converge at a geometric rate to the global minimizer. Finally, we provide simulation evidence and an analysis of a meteorological dataset, showing that we can recover graphical structures and estimate the precision matrices, as predicted by theory.
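To make the "Cartesian product of graphs" structure concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard correspondence between a Cartesian product of graphs and a Kronecker sum of the factor precision matrices; the helper names (kronecker_sum, chain_precision) and the small chain-graph factors are hypothetical choices made only for illustration.

```python
import numpy as np

def kronecker_sum(psi_1, psi_2):
    """Kronecker sum: psi_1 (x) I + I (x) psi_2 (assumed structure)."""
    d1, d2 = psi_1.shape[0], psi_2.shape[0]
    return np.kron(psi_1, np.eye(d2)) + np.kron(np.eye(d1), psi_2)

def chain_precision(d, off_diag=-0.4):
    """Hypothetical sparse factor: a chain graph with unit diagonal."""
    return np.eye(d) + off_diag * (np.eye(d, k=1) + np.eye(d, k=-1))

# Illustrative factors, e.g. 4 spatial sites and 3 time points.
psi_space = chain_precision(4)
psi_time = chain_precision(3)
omega = kronecker_sum(psi_space, psi_time)  # 12 x 12 precision matrix

# Entries of an unstructured symmetric 12 x 12 matrix versus
# the entries of the two symmetric factors.
p = omega.shape[0]
free_full = p * (p + 1) // 2
free_factored = 4 * 5 // 2 + 3 * 4 // 2

# Eigenvalues of a Kronecker sum are all pairwise sums of factor
# eigenvalues, so positive-definite factors give a valid precision matrix.
pair_sums = np.add.outer(np.linalg.eigvalsh(psi_space),
                         np.linalg.eigvalsh(psi_time)).ravel()
```

In this toy setting the factored form is described by at most 16 symmetric entries instead of 78, which is the kind of parameter reduction the abstract credits for the smaller sample-size requirement.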

A survey of Community Question Answering

With the advent of numerous community forums, the tasks associated with them have gained importance in recent years. With new questions arriving on these forums every day, finding answers to those questions and detecting duplicate questions are of practical importance and are challenging in their own right. This paper surveys some of these issues and the methods proposed for tackling them.

SCNet: Learning Semantic Correspondence

This paper addresses the problem of establishing semantic correspondences between images depicting different instances of the same object or scene category. Previous approaches focus on either combining a spatial regularizer with hand-crafted features, or learning a correspondence model for appearance only. We propose instead a convolutional neural network architecture, called SCNet, for learning a geometrically plausible model for semantic correspondence. SCNet uses region proposals as matching primitives, and explicitly incorporates geometric consistency in its loss function. It is trained on image pairs obtained from the PASCAL VOC 2007 keypoint dataset, and a comparative evaluation on several standard benchmarks demonstrates that the proposed approach substantially outperforms both recent deep learning architectures and previous methods based on hand-crafted features.

Neural Style Transfer: A Review

The recent work of Gatys et al. demonstrated the power of Convolutional Neural Networks (CNNs) in creating fantastic artistic imagery by separating and recombining image content and style. This process of using a CNN to migrate the semantic content of one image to different styles is referred to as Neural Style Transfer. Since then, Neural Style Transfer has become a trending topic both in academic literature and industrial applications. It is receiving increasing attention from computer vision researchers, and several methods have been proposed to either improve or extend the original neural algorithm proposed by Gatys et al. However, there is no comprehensive survey presenting and summarizing recent Neural Style Transfer literature. This review aims to provide an overview of the current progress towards Neural Style Transfer, as well as to discuss its various applications and open problems for future research.

Distributed Bayesian Probabilistic Matrix Factorization

Matrix factorization is a common machine learning technique for recommender systems. Despite its high prediction accuracy, the Bayesian Probabilistic Matrix Factorization algorithm (BPMF) has not been widely used on large-scale data because of its high computational cost. In this paper we propose a distributed high-performance parallel implementation of BPMF on shared-memory and distributed architectures. We show that, by using efficient load balancing with work stealing on a single node and asynchronous communication in the distributed version, we beat state-of-the-art implementations.

A First Empirical Study of Emphatic Temporal Difference Learning

In this paper we present the first empirical study of the emphatic temporal-difference learning algorithm (ETD), comparing it with conventional temporal-difference learning, in particular, with linear TD(0), on on-policy and off-policy variations of the Mountain Car problem. The initial motivation for developing ETD was that it has good convergence properties under off-policy training (Sutton, Mahmood & White 2016), but it is also a new algorithm for the on-policy case. In both our on-policy and off-policy experiments, we found that each method converged to a characteristic asymptotic level of error, with ETD better than TD(0). TD(0) achieved a still lower error level temporarily before falling back to its higher asymptote, whereas ETD never showed this kind of ‘bounce’. In the off-policy case (in which TD(0) is not guaranteed to converge), ETD was significantly slower.
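For readers unfamiliar with the baseline, here is a minimal sketch of the on-policy linear TD(0) update that the abstract compares against. It is not the paper's experimental code: the 5-state random walk with one-hot features stands in for Mountain Car purely as an assumption to keep the example short, and the function names and step-size values are hypothetical.

```python
import numpy as np

def td0_update(w, x, reward, x_next, alpha, gamma):
    """One linear TD(0) step: w <- w + alpha * delta * x."""
    delta = reward + gamma * (w @ x_next) - (w @ x)  # TD error
    return w + alpha * delta * x

def run_random_walk(episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    """Toy stand-in task (assumption): 5-state walk, reward 1 on right exit."""
    rng = np.random.default_rng(seed)
    n = 5
    features = np.eye(n)           # one-hot features, so w[s] estimates V(s)
    w = np.zeros(n)
    for _ in range(episodes):
        s = n // 2                 # every episode starts in the middle state
        while True:
            s_next = s + rng.choice([-1, 1])
            done = s_next < 0 or s_next >= n
            reward = 1.0 if s_next >= n else 0.0
            # Terminal states have value zero, hence a zero feature vector.
            x_next = np.zeros(n) if done else features[s_next]
            w = td0_update(w, features[s], reward, x_next, alpha, gamma)
            if done:
                break
            s = s_next
    return w

w_hat = run_random_walk()
true_values = np.arange(1, 6) / 6.0   # exact values for this walk
```

With a constant step size the estimates keep fluctuating around the true values rather than settling exactly, which is one reason empirical comparisons such as the one above report asymptotic error levels.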

Nonnegative Matrix Factorization with Transform Learning

Traditional NMF-based signal decomposition relies on the factorization of spectral data which is typically computed by means of the short-time Fourier transform. In this paper we propose to relax the choice of a pre-fixed transform and learn a short-time unitary transform together with the factorization, using a novel block-descent algorithm. This improves the fit between the processed data and its approximation and is in turn shown to induce better separation performance in a speech enhancement experiment.

Incremental Learning Through Deep Adaptation

Given an existing trained neural network, it is often desirable to be able to add new capabilities without hindering performance of already learned tasks. Existing approaches either learn sub-optimal solutions, require joint training, or incur a substantial increment in the number of parameters for each added task, typically as many as the original network. We propose a method which fully preserves performance on the original task, with only a small increase (around 20%) in the number of required parameters while performing on par with more costly fine-tuning procedures, which typically double the number of parameters. The learned architecture can be controlled to switch between various learned representations, enabling a single network to solve a task from multiple different domains. We conduct extensive experiments showing the effectiveness of our method and explore different aspects of its behavior.

K-sets+: a Linear-time Clustering Algorithm for Data Points with a Sparse Similarity Measure

In this paper, we first propose a new iterative algorithm, called the K-sets+ algorithm, for clustering data points in a semi-metric space, where the distance measure does not necessarily satisfy the triangle inequality. We show that the K-sets+ algorithm converges in a finite number of iterations and that it retains the same performance guarantee as the K-sets algorithm for clustering data points in a metric space. We then extend the applicability of the K-sets+ algorithm from data points in a semi-metric space to data points that only have a symmetric similarity measure. Such an extension leads to a great reduction in computational complexity. In particular, for an n × n similarity matrix with m nonzero elements, the computational complexity of the K-sets+ algorithm is O((Kn + m)I), where I is the number of iterations. The memory complexity needed to achieve that computational complexity is O(Kn + m). As such, both the computational complexity and the memory complexity are linear in n when the n × n similarity matrix is sparse, i.e., m = O(n). We also conduct various experiments to show the effectiveness of the K-sets+ algorithm, using a synthetic dataset from the stochastic block model and a real network from the WonderNetwork website.
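The O(Kn + m) cost per iteration can be illustrated with the bookkeeping that any clustering pass over a sparse similarity matrix needs. The sketch below is not the K-sets+ assignment rule itself, which the abstract does not spell out; it only shows, under the assumption that the nonzero entries are stored as an edge list with each unordered pair listed once, how point-to-cluster similarity totals are obtained in one pass. The function name and the toy data are hypothetical.

```python
def cluster_similarity_sums(n, k, edges, labels):
    """Total similarity from each point to each cluster.

    edges  : list of (i, j, s) for the nonzero entries of a symmetric
             similarity matrix, each unordered pair listed once
    labels : labels[i] is the current cluster index of point i
    """
    sums = [[0.0] * k for _ in range(n)]   # O(Kn) memory and setup
    for i, j, s in edges:                  # O(m) work: one visit per nonzero
        sums[i][labels[j]] += s
        if i != j:
            sums[j][labels[i]] += s        # symmetric counterpart
    return sums

# Toy example: a path 0 - 1 - 2 - 3 split into two clusters.
edges = [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0)]
labels = [0, 0, 1, 1]
sums = cluster_similarity_sums(4, 2, edges, labels)
```

Because the table has K entries per point and the loop touches each nonzero once, both time and memory are O(Kn + m), and they become linear in n when m = O(n).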

Bayesian Distribution Regression

Distribution regression has recently attracted much interest as a generic solution to the problem of supervised learning where labels are available at the group level, rather than at the individual level. Current approaches, however, do not propagate the uncertainty in observations due to sampling variability in the groups. This effectively assumes that small and large groups are estimated equally well, and should have equal weight in the final regression. We construct a Bayesian distribution regression formalism that accounts for this uncertainty, improving the robustness and performance of the model when group sizes vary. We frame the model in a neural network style, allowing for simple MAP inference using backpropagation to learn the parameters, as well as MCMC-based inference which can fully propagate uncertainty. We demonstrate our approach on illustrative toy datasets, as well as on an astrostatistics problem in which velocity distributions are used to predict galaxy cluster masses, quantifying the distribution of dark matter in the universe.

A Deep Reinforced Model for Abstractive Summarization

Attentional, RNN-based encoder-decoder models for abstractive summarization have achieved good performance on short input and output sequences. However, for longer documents and summaries, these models often include repetitive and incoherent phrases. We introduce a neural network model with intra-attention and a new training method. This method combines standard supervised word prediction and reinforcement learning (RL). Models trained only with the former often exhibit ‘exposure bias’ — they assume ground truth is provided at each step during training. However, when standard word prediction is combined with the global sequence prediction training of RL the resulting summaries become more readable. We evaluate this model on the CNN/Daily Mail and New York Times datasets. Our model obtains a 41.16 ROUGE-1 score on the CNN/Daily Mail dataset, a 5.7 absolute points improvement over previous state-of-the-art models. It also performs well as the first abstractive model on the New York Times corpus. Human evaluation also shows that our model produces higher quality summaries.

Zero Sets for Spaces of Analytic Functions

On Covering paths with 3 Dimensional Random Walk

Solving Distributed Constraint Optimization Problems Using Logic Programming

Causal Inference with Two Versions of Treatment

A Minimal Span-Based Neural Constituency Parser

Pre-earthquake State Identification by Micro-earthquake Spike Trains Dissimilarity Analysis

The theory of avoided criticality in quantum motion in a random potential in high dimensions

Multiscale Structure of More-than-Binary Variables

On the Relation Between Two Approaches to Necessary Optimality Conditions in Problems with State Constraints

The Riesz basis property of a class of Euler-Bernoulli beam equation

An Improved Video Analysis using Context based Extension of LSH

Autoscaling Bloom Filter: Controlling Trade-off Between True and False Positives

A three-dimensional statistical model for CLSM images of porous polymer films

Overlap synchronisation in multipartite random energy models

Sub-Nyquist Channel Estimation over IEEE 802.11ad Link

Efficient design of experiments for sensitivity analysis based on polynomial chaos expansions

Structural reliability analysis for p-boxes using multi-level meta-models

Zig-zagging in a Triangulation

Learning 3D Object Categories by Looking Around Them

Superlinearly Convergent Asynchronous Distributed Network Newton Method

Convex equipartitions of colored point sets

Convergence of eigenvector empirical spectral distribution of sample covariance matrices

GQ($λ$) Quick Reference and Implementation Guide

Characteristic Matrices and Trellis Reduction for Tail-Biting Convolutional Codes

Distribution of degrees of freedom over structure and motion of rigid bodies

Learning with Noise: Enhance Distantly Supervised Relation Extraction with Dynamic Transition Matrix

Mining Functional Modules by Multiview-NMF of Phenome-Genome Association

Quantum phase transition and non-Fermi liquid behavior in multi-Weyl semimetals

Content-based Approach for Vietnamese Spam SMS Filtering

Beamforming Optimization for Full-Duplex Wireless-powered MIMO Systems

Faster algorithms for 1-mappability of a sequence

Performance of SWIPT-based Differential Amplify-and-Forward Relaying with Direct Link

Distributed Property Testing for Subgraph-Freeness Revisited

Weighted Selection Combinings for Differential Decode-and-Forward Cooperative Networks

Building a Semantic Role Labelling System for Vietnamese

Robust Routing Made Easy

End-to-end Recurrent Neural Network Models for Vietnamese Named Entity Recognition: Word-level vs. Character-level

Phaseless compressive sensing using partial support information

Nash Region of the Linear Deterministic Interference Channel with Noisy Output Feedback

Ten Conferences WORDS: Open Problems and Conjectures

Singular Riesz measures on symmetric cones

Optically levitated nanoparticle as a model system for stochastic bistable dynamics

A Boltzmann approach to percolation on random triangulations

Coded Multicast Fronthauling and Edge Caching for Multi-Connectivity Transmission in Fog Radio Access Networks

Temporal self-similar synchronization patterns and scaling in repulsively coupled oscillators

Autocalibrating and Calibrationless Parallel Magnetic Resonance Imaging as a Bilinear Inverse Problem

Automatic Extrinsic Calibration for Lidar-Stereo Vehicle Sensor Setups

Two Gilbert-Varshamov Type Existential Bounds for Asymmetric Quantum Error-Correcting Codes

A Generative Model of People in Clothing

On the Tightness of Bounds for Transients and Weak CSR Expansions in Max-Plus Algebra

Uplink Non-Orthogonal Multiple Access for 5G Wireless Networks

Automatic discovery of structural rules of permutation classes

Obstacle Avoidance Using Stereo Camera

Coalitional game based cost optimization of energy portfolio in smart grid communities

Memetic search for identifying critical nodes in sparse graphs

On the minimum degree, edge-connectivity and connectivity of power graphs of finite groups

Asymptotics for the Turán number of Berge-$K_{2,t}$

Adaptively Transformed Mixed Model Prediction of General Finite Population Parameters

Fast Stochastic Variance Reduced ADMM for Stochastic Composition Optimization

From Least Squares to Signal Processing and Particle Filtering

Error-Sensitive Proof-Labeling Schemes

Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems

Dynamic Compositional Neural Networks over Tree Structure

The Localization Transition in the Ultrametric Ensemble

On the Effective Capacity of MTC Networks in the Finite Blocklength Regime

Exploring the onset of collective intelligence in self-organised trails of social organisms

On the role of words in the network structure of texts: application to authorship attribution

Denominator Bounds and Polynomial Solutions for Systems of q-Recurrences over K(t) for Constant K

K-Monotonicity is Not Testable on the Hypercube

Competitive Equilibria with Indivisible Goods and Generic Budgets: Settling the Open Cases

A critical analysis of resampling strategies for the regularized particle filter

Stochastic differential games with state constraints and Isaacs equations with nonlinear Neumann problems

Distributionally Robust Groupwise Regularization Estimator

Spectral gap estimates in mean field spin glasses

Sketching Word Vectors Through Hashing

Expanders and applications over the prime fields

Probabilistic Image Colorization

A Cascaded Convolutional Neural Network for X-ray Low-dose CT Image Denoising

Dual attainment for the martingale transport problem

Realizable sets of catenary degrees of numerical monoids

Dynamical Functional Theory for Compressed Sensing

Hardware-Software Codesign of Accurate, Multiplier-free Deep Neural Networks

Maximizing Wiener Index for Trees with Given Vertex Weight and Degree Sequences

Asymptotics of Nonlinear LSE Precoders with Applications to Transmit Antenna Selection

Maximum principle for a stochastic delayed system involving terminal state constraints

Feature-based or Direct: An Evaluation of Monocular Visual Odometry

A Feature Embedding Strategy for High-level CNN representations from Multiple ConvNets

FDR-Corrected Sparse Canonical Correlation Analysis with Applications to Imaging Genomics