Diagrams often depict complex phenomena and serve as a good test bed for visual and textual reasoning. However, understanding diagrams using natural image understanding approaches requires large training datasets of diagrams, which are very hard to obtain. Instead, this can be addressed as a matching problem either between labeled diagrams, images or both. This problem is very challenging since the absence of significant color and texture renders local cues ambiguous and requires global reasoning. We consider the problem of one-shot part labeling: labeling multiple parts of an object in a target image given only a single source image of that category. For this set-to-set matching problem, we introduce the Structured Set Matching Network (SSMN), a structured prediction model that incorporates convolutional neural networks. The SSMN is trained using global normalization to maximize local match scores between corresponding elements and a global consistency score among all matched elements, while also enforcing a matching constraint between the two sets. The SSMN significantly outperforms several strong baselines on three label transfer scenarios: diagram-to-diagram, evaluated on a new diagram dataset of over 200 categories; image-to-image, evaluated on a dataset built on top of the Pascal Part Dataset; and image-to-diagram, evaluated on transferring labels across these datasets.
Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile.
This paper describes a family of probabilistic architectures designed for online learning under the logarithmic loss. Rather than relying on non-linear transfer functions, our method gains representational power by the use of data conditioning. We state under general conditions a learnable capacity theorem that shows this approach can in principle learn any bounded Borel-measurable function on a compact subset of euclidean space; the result is stronger than many universality results for connectionist architectures because we provide both the model and the learning procedure for which convergence is guaranteed.
We examine the problem of question answering over knowledge graphs, focusing on simple questions that can be answered by the lookup of a single fact. Adopting a straightforward decomposition of the problem into entity detection, entity linking, relation prediction, and evidence combination, we explore simple yet strong baselines. On the SimpleQuestions dataset, we find that baseline LSTMs and GRUs plus a few heuristics yield accuracies that approach the state of the art, and techniques that do not use neural networks also perform reasonably well. These results show that gains from sophisticated deep learning techniques proposed in the literature are quite modest and that some previous models exhibit unnecessary complexity.
Feature selection has evolved to be a very important step in several machine learning paradigms. Especially in the domains of bio-informatics and text classification which involve data of high dimensions, feature selection can help in drastically reducing the feature space. In cases where it is difficult or infeasible to obtain sufficient training examples, feature selection helps overcome the curse of dimensionality which in turn helps improve performance of the classification algorithm. The focus of our research are five embedded feature selection methods which use the ridge regression, or use Lasso regression, and those which combine the two with the goal of simultaneously performing variable selection and grouping correlated variables.
Product compatibility and their functionality are of utmost importance to customers when they purchase products, and to sellers and manufacturers when they sell products. Due to the huge number of products available online, it is infeasible to enumerate and test the compatibility and functionality of every product. In this paper, we address two closely related problems: product compatibility analysis and function satisfiability analysis, where the second problem is a generalization of the first problem (e.g., whether a product works with another product can be considered as a special function). We first identify a novel question and answering corpus that is up-to-date regarding product compatibility and functionality information. To allow automatic discovery product compatibility and functionality, we then propose a deep learning model called Dual Attention Network (DAN). Given a QA pair for a to-be-purchased product, DAN learns to 1) discover complementary products (or functions), and 2) accurately predict the actual compatibility (or satisfiability) of the discovered products (or functions). The challenges addressed by the model include the briefness of QAs, linguistic patterns indicating compatibility, and the appropriate fusion of questions and answers. We conduct experiments to quantitatively and qualitatively show that the identified products and functions have both high coverage and accuracy, compared with a wide spectrum of baselines.
Training deep neural networks with Stochastic Gradient Descent, or its variants, requires careful choice of both learning rate and batch size. While smaller batch sizes generally converge in fewer training epochs, larger batch sizes offer more parallelism and hence better computational efficiency. We have developed a new training approach that, rather than statically choosing a single batch size for all epochs, adaptively increases the batch size during the training process. Our method delivers the convergence rate of small batch sizes while achieving performance similar to large batch sizes. We analyse our approach using the standard AlexNet, ResNet, and VGG networks operating on the popular CIFAR-10, CIFAR-100, and ImageNet datasets. Our results demonstrate that learning with adaptive batch sizes can improve performance by factors of up to 6.25 on 4 NVIDIA Tesla P100 GPUs while changing accuracy by less than 1% relative to training with fixed batch sizes.
In this paper, we propose a new algorithm for learning general latent-variable probabilistic graphical models using the techniques of predictive state representation, instrumental variable regression, and reproducing-kernel Hilbert space embeddings of distributions. Under this new learning framework, we first convert latent-variable graphical models into corresponding latent-variable junction trees, and then reduce the hard parameter learning problem into a pipeline of supervised learning problems, whose results will then be used to perform predictive belief propagation over the latent junction tree during the actual inference procedure. We then give proofs of our algorithm’s correctness, and demonstrate its good performance in experiments on one synthetic dataset and two real-world tasks from computational biology and computer vision – classifying DNA splice junctions and recognizing human actions in videos.
Attention-based Encoder-Decoder has the effective architecture for neural machine translation (NMT), which typically relies on recurrent neural networks (RNN) to build the blocks that will be lately called by attentive reader during the decoding process. This design of encoder yields relatively uniform composition on source sentence, despite the gating mechanism employed in encoding RNN. On the other hand, we often hope the decoder to take pieces of source sentence at varying levels suiting its own linguistic structure: for example, we may want to take the entity name in its raw form while taking an idiom as a perfectly composed unit. Motivated by this demand, we propose Multi-channel Encoder (MCE), which enhances encoding components with different levels of composition. More specifically, in addition to the hidden state of encoding RNN, MCE takes 1) the original word embedding for raw encoding with no composition, and 2) a particular design of external memory in Neural Turing Machine (NTM) for more complex composition, while all three encoding strategies are properly blended during decoding. Empirical study on Chinese-English translation shows that our model can improve by 6.52 BLEU points upon a strong open source NMT system: DL4MT1. On the WMT14 English- French task, our single shallow system achieves BLEU=38.8, comparable with the state-of-the-art deep models.
We introduce a novel embedding method for knowledge base completion task. Our approach advances state-of-the-art (SOTA) by employing a convolutional neural network (CNN) for the task which can capture global relationships and transitional characteristics. We represent each triple (head entity, relation, tail entity) as a 3-column matrix which is the input for the convolution layer. Different filters having a same shape of 1×3 are operated over the input matrix to produce different feature maps which are then concatenated into a single feature vector. This vector is used to return a score for the triple via a dot product. The returned score is used to predict whether the triple is valid or not. Experiments show that ConvKB achieves better link prediction results than previous SOTA models on two current benchmark datasets WN18RR and FB15k-237.
In this work we consider a generalized version of Probability Smoothing, the core elementary model for sequential prediction in the state of the art PAQ family of data compression algorithms. Our main contribution is a code length analysis that considers the redundancy of Probability Smoothing with respect to a Piecewise Stationary Source. The analysis holds for a finite alphabet and expresses redundancy in terms of the total variation in probability mass of the stationary distributions of a Piecewise Stationary Source. By choosing parameters appropriately Probability Smoothing has redundancy $O(S\cdot\sqrt{T\log T})$ for sequences of length $T$ with respect to a Piecewise Stationary Source with $S$ segments.
Over the last couple of years, deep learning and especially convolutional neural networks have become one of the work horses of computer vision. One limiting factor for the applicability of supervised deep learning to more areas is the need for large, manually labeled datasets. In this paper we propose an easy to implement method we call guided labeling, which automatically determines which samples from an unlabeled dataset should be labeled. We show that using this procedure, the amount of samples that need to be labeled is reduced considerably in comparison to labeling images arbitrarily.
Transfer Learning (TL) aims to transfer knowledge acquired in one problem, the source problem, onto another problem, the target problem, dispensing with the bottom-up construction of the target model. Due to its relevance, TL has gained significant interest in the Machine Learning community since it paves the way to devise intelligent learning models that can easily be tailored to many different applications. As it is natural in a fast evolving area, a wide variety of TL methods, settings and nomenclature have been proposed so far. However, a wide range of works have been reporting different names for the same concepts. This concept and terminology mixture contribute however to obscure the TL field, hindering its proper consideration. In this paper we present a review of the literature on the majority of classification TL methods, and also a distribution-based categorization of TL with a common nomenclature suitable to classification problems. Under this perspective three main TL categories are presented, discussed and illustrated with examples.
Non-negative matrix factorization (NMF) is one of the most popular decomposition techniques for multivariate data. NMF is a core method for many machine-learning related computational problems, such as data compression, feature extraction, word embedding, recommender systems etc. In practice, however, its application is challenging for large datasets. The efficiency of NMF is constrained by long data loading times, by large memory requirements and by limited parallelization capabilities. Here we present a novel and efficient compressed NMF algorithm. Our algorithm applies a random compression scheme to drastically reduce the dimensionality of the problem, preserving well the pairwise distances between data points and inherently limiting the memory and communication load. Our algorithm supersedes existing methods in speed. Nonetheless, it matches the best non-compressed algorithms in reconstruction precision.
This paper addresses the question: Why do neural dialog systems generate short and meaningless replies? We conjecture that, in a dialog system, an utterance may have multiple equally plausible replies, causing the deficiency of neural networks in the dialog application. We propose a systematic way to mimic the dialog scenario in a machine translation system, and manage to reproduce the phenomenon of generating short and less meaningful sentences in the translation setting, showing evidence of our conjecture.
While deep learning has led to significant advances in visual recognition over the past few years, such advances often require a lot of annotated data. While unsupervised domain adaptation has emerged as an alternative approach that doesn’t require as much annotated data, prior evaluations of domain adaptation have been limited to relatively simple datasets. This work pushes the state of the art in unsupervised domain adaptation through an in depth evaluation of AlexNet, DenseNet and Residual Transfer Networks (RTN) on multimodal benchmark datasets that shows and identifies which layers more effectively transfer features across different domains. We also modify the existing RTN architecture and propose a novel domain adaptation architecture called ‘Deep MagNet’ that combines Deep Convolutional Blocks with multiple Maximum Mean Discrepancy losses. Our experiments show quantitative and qualitative improvements in performance of our method on benchmarking datasets for complex data domains.
Named Entity Recognition (NER) aims at locating and classifying named entities in text. In some use cases of NER, including cases where detected named entities are used in creating content recommendations, it is crucial to have a reliable confidence level for the detected named entities. In this work we study the problem of finding confidence levels for detected named entities. We refer to this problem as Named Entity Sequence Classification (NESC). We frame NESC as a binary classification problem and we use NER as well as recurrent neural networks to find the probability of candidate named entity is a real named entity. We apply this approach to Tweet texts and we show how we could find named entities with high confidence levels from Tweets.
We present a technique for jointly denoising bursts of images taken from a handheld camera. In particular, we propose a convolutional neural network architecture for predicting spatially varying kernels that can both align and denoise frames, a synthetic data generation approach based on a realistic noise formation model, and an optimization guided by an annealed loss function to avoid undesirable local minima. Our model matches or outperforms the state-of-the-art across a wide range of noise levels on both real and synthetic data.
The Generative Adversarial Networks (GANs) have demonstrated impressive performance for data synthesis, and are now used in a wide range of computer vision tasks. In spite of this success, they gained a reputation for being difficult to train, what results in a time-consuming and human-involved development process to use them. We consider an alternative training process, named SGAN, in which several adversarial ‘local’ pairs of networks are trained independently so that a ‘global’ supervising pair of networks can be trained against them. The goal is to train the global pair with the corresponding ensemble opponent for improved performances in terms of mode coverage. This approach aims at increasing the chances that learning will not stop for the global pair, preventing both to be trapped in an unsatisfactory local minimum, or to face oscillations often observed in practice. To guarantee the latter, the global pair never affects the local ones. The rules of SGAN training are thus as follows: the global generator and discriminator are trained using the local discriminators and generators, respectively, whereas the local networks are trained with their fixed local opponent. Experimental results on both toy and real-world problems demonstrate that this approach outperforms standard training in terms of better mitigating mode collapse, stability while converging and that it surprisingly, increases the convergence speed as well.