Automatically determining the optimal size of a neural network for a given task without prior information currently requires an expensive global search and training many networks from scratch. In this paper, we address the problem of automatically finding a good network size during a single training cycle. We introduce *nonparametric neural networks*, a non-probabilistic framework for conducting optimization over all possible network sizes and prove its soundness when network growth is limited via an L_p penalty. We train networks under this framework by continuously adding new units while eliminating redundant units via an L_2 penalty. We employ a novel optimization algorithm, which we term *adaptive radial-angular gradient descent* or *AdaRad*, and obtain promising results.
We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at http://ai2thor.allenai.org. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.
We introduce Graph-Sparse Logistic Regression, a new algorithm for classification for the case in which the support should be sparse but connected on a graph. We val- idate this algorithm against synthetic data and benchmark it against L1-regularized Logistic Regression. We then explore our technique in the bioinformatics context of proteomics data on the interactome graph. We make all our experimental code public and provide GSLR as an open source package.
Inverse Reinforcement Learning (IRL) is the task of learning a single reward function given a Markov Decision Process (MDP) without defining the reward function, and a set of demonstrations generated by humans/experts. However, in practice, it may be unreasonable to assume that human behaviors can be explained by one reward function since they may be inherently inconsistent. Also, demonstrations may be collected from various users and aggregated to infer and predict user’s behaviors. In this paper, we introduce the Non-parametric Behavior Clustering IRL algorithm to simultaneously cluster demonstrations and learn multiple reward functions from demonstrations that may be generated from more than one behaviors. Our method is iterative: It alternates between clustering demonstrations into different behavior clusters and inverse learning the reward functions until convergence. It is built upon the Expectation-Maximization formulation and non-parametric clustering in the IRL setting. Further, to improve the computation efficiency, we remove the need of completely solving multiple IRL problems for multiple clusters during the iteration steps and introduce a resampling technique to avoid generating too many unlikely clusters. We demonstrate the convergence and efficiency of the proposed method through learning multiple driver behaviors from demonstrations generated from a grid-world environment and continuous trajectories collected from autonomous robot cars using the Gazebo robot simulator.
Semantic similarity based retrieval is playing an increasingly important role in many IR systems such as modern web search, question-answering, similar document retrieval etc. Improvements in retrieval of semantically similar content are very significant to applications like Quora, Stack Overflow, Siri etc. We propose a novel unsupervised model for semantic similarity based content retrieval, where we construct semantic flow graphs for each query, and introduce the concept of ‘soft seeding’ in graph based semi-supervised learning (SSL) to convert this into an unsupervised model. We demonstrate the effectiveness of our model on an equivalent question retrieval problem on the Stack Exchange QA dataset, where our unsupervised approach significantly outperforms the state-of-the-art unsupervised models, and produces comparable results to the best supervised models. Our research provides a method to tackle semantic similarity based retrieval without any training data, and allows seamless extension to different domain QA communities, as well as to other semantic equivalence tasks.
Recently, deep neural networks (DNNs) have been regarded as the state-of-the-art classification methods in a wide range of applications, especially in image classification. Despite the success, the huge number of parameters blocks its deployment to situations with light computing resources. Researchers resort to the redundancy in the weights of DNNs and attempt to find how fewer parameters can be chosen while preserving the accuracy at the same time. Although several promising results have been shown along this research line, most existing methods either fail to significantly compress a well-trained deep network or require a heavy fine-tuning process for the compressed network to regain the original performance. In this paper, we propose the \textit{Block Term} networks (BT-nets) in which the commonly used fully-connected layers (FC-layers) are replaced with block term layers (BT-layers). In BT-layers, the inputs and the outputs are reshaped into two low-dimensional high-order tensors, then block-term decomposition is applied as tensor operators to connect them. We conduct extensive experiments on benchmark datasets to demonstrate that BT-layers can achieve a very large compression ratio on the number of parameters while preserving the representation power of the original FC-layers as much as possible. Specifically, we can get a higher performance while requiring fewer parameters compared with the tensor train method.
We describe Sockeye (version 1.12), an open-source sequence-to-sequence toolkit for Neural Machine Translation (NMT). Sockeye is a production-ready framework for training and applying models as well as an experimental platform for researchers. Written in Python and built on MXNet, the toolkit offers scalable training and inference for the three most prominent encoder-decoder architectures: attentional recurrent neural networks, self-attentional transformers, and fully convolutional networks. Sockeye also supports a wide range of optimizers, normalization and regularization techniques, and inference improvements from current NMT literature. Users can easily run standard training recipes, explore different model settings, and incorporate new ideas. In this paper, we highlight Sockeye’s features and benchmark it against other NMT toolkits on two language arcs from the 2017 Conference on Machine Translation (WMT): English-German and Latvian-English. We report competitive BLEU scores across all three architectures, including an overall best score for Sockeye’s transformer implementation. To facilitate further comparison, we release all system outputs and training scripts used in our experiments. The Sockeye toolkit is free software released under the Apache 2.0 license.
Most of the weights in a Lightweight Neural Network have a value of zero, while the remaining ones are either +1 or -1. These universal approximators require approximately 1.1 bits/weight of storage, posses a quick forward pass and achieve classification accuracies similar to conventional continuous-weight networks. Their training regimen focuses on error reduction initially, but later emphasizes discretization of weights. They ignore insignificant inputs, remove unnecessary weights, and drop unneeded hidden neurons. We have successfully tested them on the MNIST, credit card fraud, and credit card defaults data sets using networks having 2 to 16 hidden layers and up to 4.4 million weights.
We develop a unifying framework for Bayesian nonparametric regression to study the rates of contraction with respect to the integrated $L_2$-distance without assuming the regression function space to be uniformly bounded. The framework is built upon orthonormal random series in a flexible manner. A general theorem for deriving rates of contraction for Bayesian nonparametric regression is provided under the proposed framework. As specific applications, we obtain the near-parametric rate of contraction for the squared-exponential Gaussian process when the true function is analytic, the adaptive rates of contraction for the sieve prior, and the adaptive-and-exact rates of contraction for the un-modified block prior when the true function is {\alpha}-smooth. Extensions to wavelet series priors and fixed-design regression problems are also discussed.
We consider the consistency properties of a regularised estimator for the simultaneous identification of both changepoints and graphical dependency structure in multivariate time-series. Traditionally, estimation of Gaussian Graphical Models (GGM) is performed in an i.i.d setting. More recently, such models have been extended to allow for changes in the distribution, but only where changepoints are known a-priori. In this work, we study the Group-Fused Graphical Lasso (GFGL) which penalises partial-correlations with an L1 penalty while simultaneously inducing block-wise smoothness over time to detect multiple changepoints. We present a proof of consistency for the estimator, both in terms of changepoints, and the structure of the graphical models in each segment.
Critical to evaluating the capacity, scalability, and availability of web systems are realistic web traffic generators. Web traffic generation is a classic research problem, no generator accounts for the characteristics of web robots or crawlers that are now the dominant source of traffic to a web server. Administrators are thus unable to test, stress, and evaluate how their systems perform in the face of ever increasing levels of web robot traffic. To resolve this problem, this paper introduces a novel approach to generate synthetic web robot traffic with high fidelity. It generates traffic that accounts for both the temporal and behavioral qualities of robot traffic by statistical and Bayesian models that are fitted to the properties of robot traffic seen in web logs from North America and Europe. We evaluate our traffic generator by comparing the characteristics of generated traffic to those of the original data. We look at session arrival rates, inter-arrival times and session lengths, comparing and contrasting them between generated and real traffic. Finally, we show that our generated traffic affects cache performance similarly to actual traffic, using the common LRU and LFU eviction policies.
We present and analyze a new framework for graph clustering based on a specially weighted version of correlation clustering, that unifies several existing objectives and satisfies a number of attractive theoretical properties. Our framework, which we call LambdaCC, relies on a single resolution parameter $\lambda$, which implicitly controls both the edge density and sparsest cut of all output clusters. We prove that our new clustering objective interpolates between the cluster deletion problem and the minimum sparsest cut problem as we vary $\lambda$, and is also closely related to the well-studied maximum modularity objective. We provide several algorithms for optimizing our new objective, including a 5-approximation for the case where $\lambda \geq 1/2$, and also the first constant factor approximation algorithm for the NP-hard cluster deletion problem. We demonstrate the effectiveness of our framework and algorithms in finding communities in several real-world networks.
With the increasing commoditization of computer vision, speech recognition and machine translation systems and the widespread deployment of learning-based back-end technologies such as digital advertising and intelligent infrastructures, AI (Artificial Intelligence) has moved from research labs to production. These changes have been made possible by unprecedented levels of data and computation, by methodological advances in machine learning, by innovations in systems software and architectures, and by the broad accessibility of these technologies. The next generation of AI systems promises to accelerate these developments and increasingly impact our lives via frequent interactions and making (often mission-critical) decisions on our behalf, often in highly personalized contexts. Realizing this promise, however, raises daunting challenges. In particular, we need AI systems that make timely and safe decisions in unpredictable environments, that are robust against sophisticated adversaries, and that can process ever increasing amounts of data across organizations and individuals without compromising confidentiality. These challenges will be exacerbated by the end of the Moore’s Law, which will constrain the amount of data these technologies can store and process. In this paper, we propose several open research directions in systems, architectures, and security that can address these challenges and help unlock AI’s potential to improve lives and society.
We present a new recursive generation algorithm for prefix normal words. These are binary strings with the property that no substring has more 1s than the prefix of the same length. The new algorithm uses two operations on binary strings, which exploit certain properties of prefix normal words in a smart way. We introduce infinite prefix normal words and show that one of the operations used by the algorithm, if applied repeatedly to extend the string, produces an ultimately periodic infinite word, which is prefix normal and whose period’s length and density we can predict from the original word.
The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray—a distributed system to address them. Ray implements a dynamic task graph computation model that supports both the task-parallel and the actor programming models. To meet the performance requirements of AI applications, we propose an architecture that logically centralizes the system’s control state using a sharded storage system and a novel bottom-up distributed scheduler. In our experiments, we demonstrate sub-millisecond remote task latencies and linear throughput scaling beyond 1.8 million tasks per second. We empirically validate that Ray speeds up challenging benchmarks and serves as both a natural and performant fit for an emerging class of reinforcement learning applications and algorithms.
Machine learning libraries such as TensorFlow and PyTorch simplify model implementation. However, researchers are still required to perform a non-trivial amount of manual tasks such as GPU allocation, training status tracking, and comparison of models with different hyperparameter settings. We propose a system to handle these tasks and help researchers focus on models. We present the requirements of the system based on a collection of discussions from an online study group comprising 25k members. These include automatic GPU allocation, learning status visualization, handling model parameter snapshots as well as hyperparameter modification during learning, and comparison of performance metrics between models via a leaderboard. We describe the system architecture that fulfills these requirements and present a proof-of-concept implementation, NAVER Smart Machine Learning (NSML). We test the system and confirm substantial efficiency improvements for model development.
The development of mobile cloud computing has brought many benefits to mobile users as well as cloud service providers. However, mobile cloud computing is facing some challenges, especially security-related problems due to the growing number of cyberattacks which can cause serious losses. In this paper, we propose a dynamic framework together with advanced risk management strategies to minimize losses caused by cyberattacks to a cloud service provider. In particular, this framework allows the cloud service provider to select appropriate security solutions, e.g., security software/hardware implementation and insurance policies, to deal with different types of attacks. Furthermore, the stochastic programming approach is adopted to minimize the expected total loss for the cloud service provider under its financial capability and uncertainty of attacks and their potential losses. Through numerical evaluation, we show that our approach is an effective tool in not only dealing with cyberattacks under uncertainty, but also minimizing the total loss for the cloud service provider given its available budget.
Zero-shot Learners are models capable of predicting unseen classes. In this work, we propose a Zero-shot Learning approach for text categorization. Our method involves training model on a large corpus of sentences to learn the relationship between a sentence and embedding of sentence’s tags. Learning such relationship makes the model generalize to unseen sentences, tags, and even new datasets provided they can be put into same embedding space. The model learns to predict whether a given sentence is related to a tag or not; unlike other classifiers that learn to classify the sentence as one of the possible classes. We propose three different neural networks for the task and report their accuracy on the test set of the dataset used for training them as well as two other standard datasets for which no retraining was done. We show that our models generalize well across new unseen classes in both cases. Although the models do not achieve the accuracy level of the state of the art supervised models, yet it evidently is a step forward towards general intelligence in natural language processing.
A central question in statistical learning is to design algorithms that not only perform well on training data, but also generalize to new and unseen data. In this paper, we tackle this question by formulating a distributionally robust stochastic optimization (DRSO) problem, which seeks a solution that minimizes the worst-case expected loss over a family of distributions that are close to the empirical distribution in Wasserstein distances. We establish a connection between such Wasserstein DRSO and regularization. More precisely, we identify a broad class of loss functions, for which the Wasserstein DRSO is asymptotically equivalent to a regularization problem with a gradient-norm penalty. Such relation provides new interpretations for problems involving regularization, including a great number of statistical learning problems and discrete choice models (e.g. multinomial logit). The connection suggests a principled way to regularize high-dimensional, non-convex problems. This is demonstrated through two applications: the training of Wasserstein generative adversarial networks (WGANs) in deep learning, and learning heterogeneous consumer preferences with mixed logit choice model.
We describe TensorFlow-Serving, a system to serve machine learning models inside Google which is also available in the cloud and via open-source. It is extremely flexible in terms of the types of ML platforms it supports, and ways to integrate with systems that convey new models and updated versions from training to serving. At the same time, the core code paths around model lookup and inference have been carefully optimized to avoid performance pitfalls observed in naive implementations. Google uses it in many production deployments, including a multi-tenant model hosting service called TFS.
Inferring causal effects of treatments is a central goal in many disciplines. The potential outcomes framework is a main statistical approach to causal inference, in which a causal effect is defined as a comparison of the potential outcomes of the same units under different treatment conditions. Because for each unit at most one of the potential outcomes is observed and the rest are missing, causal inference is inherently a missing data problem. Indeed, there is a close analogy in the terminology and the inferential framework between causal inference and missing data. Despite the intrinsic connection between the two subjects, statistical analyses of causal inference and missing data also have marked differences in aims, settings and methods. This article provides a systematic review of causal inference from the missing data perspective. Focusing on ignorable treatment assignment mechanisms, we discuss a wide range of causal inference methods that have analogues in missing data analysis, such as imputation, inverse probability weighting and doubly-robust methods. Under each of the three modes of inference–Frequentist, Bayesian, and Fisherian randomization–we present the general structure of inference for both finite-sample and super-population estimands, and illustrate via specific examples. We identify open questions to motivate more research to bridge the two fields.
Deep Neural Networks (DNNs) are very popular these days, and are the subject of a very intense investigation. A DNN is made by layers of internal units (or neurons), each of which computes an affine combination of the output of the units in the previous layer, applies a nonlinear operator, and outputs the corresponding value (also known as activation). A commonly-used nonlinear operator is the so-called rectified linear unit (ReLU), whose output is just the maximum between its input value and zero. In this (and other similar cases like max pooling, where the max operation involves more than one input value), one can model the DNN as a 0-1 Mixed Integer Linear Program (0-1 MILP) where the continuous variables correspond to the output values of each unit, and a binary variable is associated with each ReLU to model its yes/no nature. In this paper we discuss the peculiarity of this kind of 0-1 MILP models, and describe an effective bound-tightening technique intended to ease its solution. We also present possible applications of the 0-1 MILP model arising in feature visualization and in the construction of adversarial examples. Preliminary computational results are reported, aimed at investigating (on small DNNs) the computational performance of a state-of-the-art MILP solver when applied to a known test case, namely, hand-written digit recognition.
With the fast development of information technology, especially the popularization of internet, multi-view learning becomes more and more popular in machine learning and data mining fields. As we all know that, multi-view semi-supervised learning, such as co-training, co-regularization has gained considerable attentions. Although recently, multi-view clustering (MVC) has developed rapidly, there are not a survey or review to summarize and analyze the current progress. Therefore, this paper sums up the common strategies of combining multiple views and based on that we proposed a novel taxonomy of the MVC approaches. We also discussed the relationships between MVC and multi-view representation, ensemble clustering, multi-task clustering, multi-view supervised and multi-view semi-supervised learning. Several representative real-world applications are elaborated. To promote the further development of MVC, we pointed out several open problems that are worth exploring in the future.
Shapelets are phase independent subsequences designed for time series classification. We propose three adaptations to the Shapelet Transform (ST) to capture multivariate features in multivariate time series classification. We create a unified set of data to benchmark our work on, and compare with three other algorithms. We demonstrate that multivariate shapelets are not significantly worse than other state-of-the-art algorithms.
The quality of mathematical modelling is looked at from the perspective of science’s own quality control arrangement and recent crises. It is argued that the crisis in the quality of modelling is at least as serious as that which has come to light in fields such as medicine, economics, psychology, and nutrition. In the context of the nascent sociology of quantification, the linkages between big data, algorithms, mathematical and statistical modelling (use and misuse of p-values) are evident. Looking at existing proposals for best practices the suggestion is put forward that the field needs a thorough Reformation, leading to a new grammar for modelling. Quantitative methodologies such as uncertainty and sensitivity analysis can form the bedrock on which the new grammar is built, while incorporating important normative and ethical elements. To this effect we introduce sensitivity auditing, quantitative storytelling, and ethics of quantification.
A precise definition of what constitutes a community in networks has remained elusive. Consequently, network scientists have compared community detection algorithms on benchmark networks with a particular form of community structure and classified them based on the mathematical techniques they employ. However, this comparison can be misleading because apparent similarities in their mathematical machinery can disguise different reasons for why we would want to employ community detection in the first place. Here we provide a focused review of these different motivations that underpin community detection. This problem-driven classification is useful in applied network science, where it is important to select an appropriate algorithm for the given purpose. Moreover, highlighting the different approaches to community detection also delineates the many lines of research and points out open directions and avenues for future research.
In this paper, we propose a method of improving Convolutional Neural Networks (CNN) by determining the optimal alignment of weights and inputs using dynamic programming. Conventional CNNs convolve learnable shared weights, or filters, across the input data. The filters use a linear matching of weights to inputs using an inner product between the filter and a window of the input. However, it is possible that there exists a more optimal alignment of weights. Thus, we propose the use of Dynamic Time Warping (DTW) to dynamically align the weights to optimized input elements. This dynamic alignment is useful for time series recognition due to the complexities of temporal relations and temporal distortions. We demonstrate the effectiveness of the proposed architecture on the Unipen online handwritten digit and character datasets, the UCI Spoken Arabic Digit dataset, and the UCI Activities of Daily Life dataset.
Stochastic Gradient Descent (SGD) with small mini-batch is a key component in modern large-scale machine learning. However, its efficiency has not been easy to analyze as most theoretical results require adaptive rates and show convergence rates far slower than that for gradient descent, making computational comparisons difficult. In this paper we aim to clarify the issue of fast SGD convergence. The key observation is that most modern architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on test data, these regimes allow for very fast convergence of SGD, comparable in the number of iterations to gradient descent. Specifically, consider the setting with quadratic objective function, or near a minimum, where the quadratic term is dominant. We show that: (1) Mini-batch size $1$ with constant step size is optimal in terms of computations to achieve a given error. (2) There is a critical mini-batch size such that: (a. linear scaling) SGD iteration with mini-batch size $m$ smaller than the critical size is nearly equivalent to $m$ iterations of mini-batch size $1$. (b. saturation) SGD iteration with mini-batch larger than the critical size is nearly equivalent to a gradient descent step. The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization. It is also nearly independent of the data size, implying $O(n)$ acceleration over GD per unit of computation. We give experimental evidence on real data, with the results closely following our theoretical analyses. Finally, we show how the interpolation perspective and our results fit with recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.
Because stochastic gradient descent (SGD) has shown promise optimizing neural networks with millions of parameters and few if any alternatives are known to exist, it has moved to the heart of leading approaches to reinforcement learning (RL). For that reason, the recent result from OpenAI showing that a particular kind of evolution strategy (ES) can rival the performance of SGD-based deep RL methods with large neural networks provoked surprise. This result is difficult to interpret in part because of the lingering ambiguity on how ES actually relates to SGD. The aim of this paper is to significantly reduce this ambiguity through a series of MNIST-based experiments designed to uncover their relationship. As a simple supervised problem without domain noise (unlike in most RL), MNIST makes it possible (1) to measure the correlation between gradients computed by ES and SGD and (2) then to develop an SGD-based proxy that accurately predicts the performance of different ES population sizes. These innovations give a new level of insight into the real capabilities of ES, and lead also to some unconventional means for applying ES to supervised problems that shed further light on its differences from SGD. Incorporating these lessons, the paper concludes by demonstrating that ES can achieve 99% accuracy on MNIST, a number higher than any previously published result for any evolutionary method. While not by any means suggesting that ES should substitute for SGD in supervised learning, the suite of experiments herein enables more informed decisions on the application of ES within RL and other paradigms.