Autoregressive generative models consistently achieve the best results in density estimation tasks involving high dimensional data, such as images or audio. They pose density estimation as a sequence modeling task, where a recurrent neural network (RNN) models the conditional distribution over the next element conditioned on all previous elements. In this paradigm, the bottleneck is the extent to which the RNN can model long-range dependencies, and the most successful approaches rely on causal convolutions, which offer better access to earlier parts of the sequence than conventional RNNs. Taking inspiration from recent work in meta reinforcement learning, where dealing with long-range dependencies is also essential, we introduce a new generative model architecture that combines causal convolutions with self attention. In this note, we describe the resulting model and present state-of-the-art log-likelihood results on CIFAR-10 (2.85 bits per dim) and $32 \times 32$ ImageNet (3.80 bits per dim). Our implementation is available at https://…/pixelsnail-public.
Reinforcement learning (RL) algorithms involve the deep nesting of distinct components, where each component typically exhibits opportunities for distributed computation. Current RL libraries offer parallelism at the level of the entire program, coupling all the components together and making existing implementations difficult to extend, combine, and reuse. We argue for building composable RL components by encapsulating parallelism and resource requirements within individual components, which can be achieved by building on top of a flexible task-based programming model. We demonstrate this principle by building Ray RLLib on top of Ray and show that we can implement a wide range of state-of-the-art algorithms by composing and reusing a handful of standard components. This composability does not come at the cost of performance — in our experiments, RLLib matches or exceeds the performance of highly optimized reference implementations. Ray RLLib is available as part of Ray at https://…/.
Data is inherently dirty and there has been a sustained effort to come up with different approaches to clean it. A large class of data repair algorithms rely on data-quality rules and integrity constraints to detect and repair the data. A well-studied class of integrity constraints is Functional Dependencies (FDs, for short) that specify dependencies among attributes in a relation. In this paper, we address three major challenges in data repairing: (1) Accuracy: Most existing techniques strive to produce repairs that minimize changes to the data. However, this process may produce incorrect combinations of attribute values (or patterns). In this work, we formalize the interaction of FD-induced patterns and select repairs that result in preserving frequent patterns found in the original data. This has the potential to yield a better repair quality both in terms of precision and recall. (2) Interpretability of repairs: Current data repair algorithms produce repairs in the form of data updates that are not necessarily understandable. This makes it hard to debug repair decisions and trace the chain of steps that produced them. To this end, we define a new formalism to declaratively express repairs that are easy for users to reason about. (3) Scalability: We propose a linear-time algorithm to compute repairs that outperforms state-of-the-art FD repairing algorithms by orders of magnitude in repair time. Our experiments using both real-world and synthetic data demonstrate that our new repair approach consistently outperforms existing techniques both in terms of repair quality and scalability.
In many applications of classifier learning, training data suffers from label noise. Deep networks are learned using huge training data where the problem of noisy labels is particularly relevant. The current techniques proposed for learning deep networks under label noise focus on modifying the network architecture and on algorithms for estimating true labels from noisy labels. An alternate approach would be to look for loss functions that are inherently noise-tolerant. For binary classification there exist theoretical results on loss functions that are robust to label noise. In this paper, we provide some sufficient conditions on a loss function so that risk minimization under that loss function would be inherently tolerant to label noise for multiclass classification problems. These results generalize the existing results on noise-tolerant loss functions for binary classification. We study some of the widely used loss functions in deep networks and show that the loss function based on mean absolute value of error is inherently robust to label noise. Thus standard back propagation is enough to learn the true classifier even under label noise. Through experiments, we illustrate the robustness of risk minimization with such loss functions for learning neural networks.
The design of agent-based models (ABMs) is often ad-hoc when it comes to defining their scope. In order for the inclusion of features such as network structure, location, or dynamic change to be justified, their role in a model should be systematically analysed. We propose a mechanism to compare and assess the impact of such features. In particular we are using techniques from software engineering and semantics to support the development and assessment of ABMs, such as graph transformations as semantic representations for agent-based models, feature diagrams to identify ingredients under consideration, and extension relations between graph transformation systems to represent model fragments expressing features.
Text normalization is an essential task in the processing and analysis of social media that is dominated with informal writing. It aims to map informal words to their intended standard forms. Previously proposed text normalization approaches typically require manual selection of parameters for improved performance. In this paper, we present an automatic optimizationbased nearest neighbor matching approach for text normalization. This approach is motivated by the observation that text normalization is essentially a matching problem and nearest neighbor matching with an adaptive similarity function is the most direct procedure for it. Our similarity function incorporates weighted contributions of contextual, string, and phonetic similarity, and the nearest neighbor matching involves a minimum similarity threshold. These four parameters are tuned efficiently using grid search. We evaluate the performance of our approach on two benchmark datasets. The results demonstrate that parameter tuning on small sized labeled datasets produce state-of-the-art text normalization performances. Thus, this approach allows practically easy construction of evolving domain-specific normalization lexicons
Fog computing serves as a computing layer that sits between the edge devices and the cloud in the network topology. They have more compute capacity than the edge but much less so than cloud data centers. They typically have high uptime and always-on Internet connectivity. Applications that make use of the fog can avoid the network performance limitation of cloud computing while being less resource constrained than edge computing. As a result, they offer a useful balance of the current paradigms. This article explores various aspects of fog computing in the context of big data.
In this paper, a neural network-based stock price prediction and trading system using technical analysis indicators is presented. The model developed first converts the financial time series data into a series of buy-sell-hold trigger signals using the most commonly preferred technical analysis indicators. Then, a Multilayer Perceptron (MLP) artificial neural network (ANN) model is trained in the learning stage on the daily stock prices between 1997 and 2007 for all of the Dow30 stocks. Apache Spark big data framework is used in the training stage. The trained model is then tested with data from 2007 to 2017. The results indicate that by choosing the most appropriate technical indicators, the neural network model can achieve comparable results against the Buy and Hold strategy in most of the cases. Furthermore, fine tuning the technical indicators and/or optimization strategy can enhance the overall trading performance.
A novel quantile Fourier neural network is presented for nonparametric probabilistic forecasting. Prediction are provided in the form of composite quantiles using time as the only input to the model. This effectively is a form of extrapolation based quantile regression applied for forecasting. Empirical results showcase that for time series data that have clear seasonality and trend, the model provides high quality probabilistic predictions. This work introduces a new class of forecasting of using only time as the input versus using past data such as an autoregressive model. Extrapolation based regression has not been studied before for probabilistic forecasting.
The Convolution Neural Network (CNN) has demonstrated the unique advantage in audio, image and text learning; recently it has also challenged Recurrent Neural Networks (RNNs) with long short-term memory cells (LSTM) in sequence-to-sequence learning, since the computations involved in CNN are easily parallelizable whereas those involved in RNN are mostly sequential, leading to a performance bottleneck. However, unlike RNN, the native CNN lacks the history sensitivity required for sequence transformation; therefore enhancing the sequential order awareness, or position-sensitivity, becomes the key to make CNN the general deep learning model. In this work we introduce an extended CNN model with strengthen position-sensitivity, called PoseNet. A notable feature of PoseNet is the asymmetric treatment of position information in the encoder and the decoder. Experiments shows that PoseNet allows us to improve the accuracy of CNN based sequence-to-sequence learning significantly, achieving around 33-36 BLEU scores on the WMT 2014 English-to-German translation task, and around 44-46 BLEU scores on the English-to-French translation task.
We present a method to create universal, robust, targeted adversarial image patches in the real world. The patches are universal because they can be used to attack any scene, robust because they work under a wide variety of transformations, and targeted because they can cause a classifier to output any target class. These adversarial patches can be printed, added to any scene, photographed, and presented to image classifiers; even when the patches are small, they cause the classifiers to ignore the other items in the scene and report a chosen target class.
In this paper we study several classes of stochastic optimization algorithms enriched with heavy ball momentum. Among the methods studied are: stochastic gradient descent, stochastic Newton, stochastic proximal point and stochastic dual subspace ascent. This is the first time momentum variants of several of these methods are studied. We choose to perform our analysis in a setting in which all of the above methods are equivalent. We prove global nonassymptotic linear convergence rates for all methods and various measures of success, including primal function values, primal iterates (in L2 sense), and dual function values. We also show that the primal iterates converge at an accelerated linear rate in the L1 sense. This is the first time a linear rate is shown for the stochastic heavy ball method (i.e., stochastic gradient descent method with momentum). Under somewhat weaker conditions, we establish a sublinear convergence rate for Cesaro averages of primal iterates. Moreover, we propose a novel concept, which we call stochastic momentum, aimed at decreasing the cost of performing the momentum step. We prove linear convergence of several stochastic methods with stochastic momentum, and show that in some sparse data regimes and for sufficiently small momentum parameters, these methods enjoy better overall complexity than methods with deterministic momentum. Finally, we perform extensive numerical testing on artificial and real datasets, including data coming from average consensus problems.
In this paper, we present libDirectional, a MATLAB library for directional statistics and directional estimation. It supports a variety of commonly used distributions on the unit circle, such as the von Mises, wrapped normal, and wrapped Cauchy distributions. Furthermore, various distributions on higher-dimensional manifolds such as the unit hypersphere and the hypertorus are available. Based on these distributions, several recursive filtering algorithms in libDirectional allow estimation on these manifolds. The functionality is implemented in a clear, well-documented, and object-oriented structure that is both easy to use and easy to extend.
String matching is one of the most fundamental problems in computer science. A natural problem is to find the number of characters that need to be queried (i.e. the decision tree complexity) in a string in order to determine whether this string contains a certain pattern. Rivest showed that for every pattern $p$, in the worst case any deterministic algorithm needs to query at least $n-|p|+1$ characters, where $n$ is the length of the string and $|p|$ is the length of the pattern. He further conjectured that these bounds are tight. By using adversary methods, Tuza disproved this conjecture and showed that more than half of binary patterns are evasive, i.e. any algorithm needs to query all the characters. In this paper, we give a query algorithm which settles the decision tree complexity for almost all patterns. Using the algebraic approach of Rivest and Vuillemin we give a new sufficient condition for the evasiveness of patterns, which reveals an interesting connection to Skolem’s Problem.
Items from a database are often ranked based on a combination of multiple criteria. Often, a user may have flexibility to accept combinations that weight these criteria differently, within limits. On the other hand, this choice of weights can greatly affect the fairness of the ranking produced. In this paper, we develop a system that helps users choose criterion weights that lead to greater fairness. We consider ranking functions that compute the score of each item as a weighted sum of (numeric) attribute values, and then sort items on their score. Each ranking function can then be expressed as a vector of weights, or a point in a multi-dimensional space. For a broad range of fairness criteria, we show how to efficiently identify regions in this space that satisfy these criteria. Using this identification, our system is able to tell users whether their proposed ranking function satisfies the desired fairness criteria and if it does not, to suggest the smallest modification which does. Our extensive experiments on real datasets demonstrate that our methods are able to find solutions that satisfy fairness criteria effectively and efficiently.
We propose a Topic Compositional Neural Language Model (TCNLM), a novel method designed to simultaneously capture both the global semantic meaning and the local word ordering structure in a document. The TCNLM learns the global semantic coherence of a document via a neural topic model, and the probability of each learned latent topic is further used to build a Mixture-of-Experts (MoE) language model, where each expert (corresponding to one topic) is a recurrent neural network (RNN) that accounts for learning the local structure of a word sequence. In order to train the MoE model efficiently, a matrix factorization method is applied, by extending each weight matrix of the RNN to be an ensemble of topic-dependent weight matrices. The degree to which each member of the ensemble is used is tied to the document-dependent probability of the corresponding topics. Experimental results on several corpora show that the proposed approach outperforms both a pure RNN-based model and other topic-guided language models. Further, our model yields sensible topics, and also has the capacity to generate meaningful sentences conditioned on given topics.
Anomaly detection in videos refers to the identification of events that do not conform to expected behavior. However, almost all existing methods tackle the problem by minimizing the reconstruction errors of training data, which cannot guarantee a larger reconstruction error for an abnormal event. In this paper, we propose to tackle the anomaly detection problem within a video prediction framework. To the best of our knowledge, this is the first work that leverages the difference between a predicted future frame and its ground truth to detect an abnormal event. To predict a future frame with higher quality for normal events, other than the commonly used appearance (spatial) constraints on intensity and gradient, we also introduce a motion (temporal) constraint in video prediction by enforcing the optical flow between predicted frames and ground truth frames to be consistent, and this is the first work that introduces a temporal constraint into the video prediction task. Such spatial and motion constraints facilitate the future frame prediction for normal events, and consequently facilitate to identify those abnormal events that do not conform the expectation. Extensive experiments on both a toy dataset and some publicly available datasets validate the effectiveness of our method in terms of robustness to the uncertainty in normal events and the sensitivity to abnormal events
We study the problem of change point detection for covariance matrices in high dimensions. We assume that {Xi}i=1,…,n is a sequence of independent, centered p-dimensional sub-Gaussian random vectors is observed whose covariance matrices are piece-wise constant. Our task is to recover with high accuracy the number and locations the of change points, which are unknown. Our generic model setting allows for all the model parameters to change with n, including the dimension p, the minimal spacing between consecutive change points, the magnitude of smallest change and the maximal operator norm of the covariance matrices of the sample points. We introduce two procedures, one based on the binary segmentation algorithm (e.g. Vostrikova, 1981) and the other on its extension known as wild binary segmentation of Fryzlewicz (2014), and demonstrate that, under suitable conditions, both are able to consistently estimate the number and locations of change points. Our second algorithm, called Wild Binary Segmentation through Independent Projection (WBSIP) is shown to be minimax optimal in the in terms of all the relevant parameters. Our minimax analysis also reveals a phase transition effect based on our generic model setting. To the best of our knowledge, this type of results has not been established elsewhere in the change point detection literature.
Neural network training relies on our ability to find ‘good’ minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that train easier, and well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, is not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple ‘filter normalization’ method that helps us visualize loss function curvature, and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.
While end-to-end neural conversation models have led to promising advances in reducing hand-crafted features and errors induced by the traditional complex system architecture, they typically require an enormous amount of data. Previous studies adopted a hybrid approach with knowledge-based components to abstract out domain-specific things or to augment data to cover more diverse patterns. On the contrary, we propose to directly address the problem using the recent development in the space of continual learning for neural models. Specifically, we adopt a domain-independent neural conversational model and introduce a novel neural continual learning algorithm that allows the conversational agent to accumulate skills across different tasks in a data-efficient way. To the best of our knowledge, this is the first work that applies continual learning for conversation systems. We verified the efficacy of our method through a conversational skill transfer from synthetic dialogs or human-human dialogs to human-computer conversations in a customer support domain.
Tensor robust principal component analysis (TRPCA) has received a substantial amount of attention in various fields. Most existing methods, normally relying on tensor nuclear norm minimization, need to pay an expensive computational cost due to multiple singular value decompositions (SVDs) at each iteration. To overcome the drawback, we propose a scalable and efficient method, named Parallel Active Subspace Decomposition (PASD), which divides the unfolding along each mode of the tensor into a columnwise orthonormal matrix (active subspace) and another small-size matrix in parallel. Such a transformation leads to a nonconvex optimization problem in which the scale of nulcear norm minimization is generally much smaller than that in the original problem. Furthermore, we introduce an alternating direction method of multipliers (ADMM) method to solve the reformulated problem and provide rigorous analyses for its convergence and suboptimality. Experimental results on synthetic and real-world data show that our algorithm is more accurate than the state-of-the-art approaches, and is orders of magnitude faster.