In this note, we point out a basic link between generative adversarial (GA) training and binary classification — any powerful discriminator essentially computes an (f-)divergence between real and generated samples. The result, repeatedly re-derived in decision theory, has implications for GA Networks (GANs), providing an alternative perspective on training f-GANs by designing the discriminator loss function.
Capturing the temporal dynamics of user preferences over items is important for recommendation. Existing methods mainly assume that all time steps in user-item interaction history are equally relevant to recommendation, which however does not apply in real-world scenarios where user-item interactions can often happen accidentally. More importantly, they learn user and item dynamics separately, thus failing to capture their joint effects on user-item interactions. To better model user and item dynamics, we present the Interacting Attention-gated Recurrent Network (IARN) which adopts the attention model to measure the relevance of each time step. In particular, we propose a novel attention scheme to learn the attention scores of user and item history in an interacting way, thus to account for the dependencies between user and item dynamics in shaping user-item interactions. By doing so, IARN can selectively memorize different time steps of a user’s history when predicting her preferences over different items. Our model can therefore provide meaningful interpretations for recommendation results, which could be further enhanced by auxiliary features. Extensive validation on real-world datasets shows that IARN consistently outperforms state-of-the-art methods.
We consider the fundamental question: how a legacy ‘student’ Artificial Intelligent (AI) system could learn from a legacy ‘teacher’ AI system or a human expert without complete re-training and, most importantly, without requiring significant computational resources. Here ‘learning’ is understood as an ability of one system to mimic responses of the other and vice-versa. We call such learning an Artificial Intelligence knowledge transfer. We show that if internal variables of the ‘student’ Artificial Intelligent system have the structure of an $n$-dimensional topological vector space and $n$ is sufficiently high then, with probability close to one, the required knowledge transfer can be implemented by simple cascades of linear functionals. In particular, for $n$ sufficiently large, with probability close to one, the ‘student’ system can successfully and non-iteratively learn $k\ll n$ new examples from the ‘teacher’ (or correct the same number of mistakes) at the cost of two additional inner products. The concept is illustrated with an example of knowledge transfer from a pre-trained convolutional neural network to a simple linear classifier with HOG features.
Segments that span contiguous parts of inputs, such as phonemes in speech, named entities in sentences, actions in videos, occur frequently in sequence prediction problems. Segmental models, a class of models that explicitly hypothesizes segments, have allowed the exploration of rich segment features for sequence prediction. However, segmental models suffer from slow decoding, hampering the use of computationally expensive features. In this thesis, we introduce discriminative segmental cascades, a multi-pass inference framework that allows us to improve accuracy by adding higher-order features and neural segmental features while maintaining efficiency. We also show that instead of including more features to obtain better accuracy, segmental cascades can be used to speed up training and decoding. Segmental models, similarly to conventional speech recognizers, are typically trained in multiple stages. In the first stage, a frame classifier is trained with manual alignments, and then in the second stage, segmental models are trained with manual alignments and the out- puts of the frame classifier. However, obtaining manual alignments are time-consuming and expensive. We explore end-to-end training for segmental models with various loss functions, and show how end-to-end training with marginal log loss can eliminate the need for detailed manual alignments. We draw the connections between the marginal log loss and a popular end-to-end training approach called connectionist temporal classification. We present a unifying framework for various end-to-end graph search-based models, such as hidden Markov models, connectionist temporal classification, and segmental models. Finally, we discuss possible extensions of segmental models to large-vocabulary sequence prediction tasks.
Methods for inferring average causal effects have traditionally relied on two key assumptions: (i) first, that the intervention received by one unit cannot causally influence the outcome of another, i.e. that there is no interference between units; and (ii) that units can be organized into non-overlapping groups such that outcomes of units in separate groups are independent. In this paper, we develop new statistical methods for causal inference based on a single realization of a network of connected units for which neither assumption (i) nor (ii) holds. The proposed approach allows both for arbitrary forms of interference, whereby the outcome of a unit may depend (directly or indirectly) on interventions received by other units with whom a network path through connected units exists; and long range dependence, whereby outcomes for any two units likewise connected by a path may be dependent. Under the standard assumptions of consistency and lack of unobserved confounding adapted to the network setting, statistical inference is further made tractable by an assumption that the outcome vector defined on the network is a single realization of a certain Markov random field (MRF). As we show, this assumption allows inferences about various network causal effects including direct and spillover effects via the auto-g-computation algorithm, a network generalization of Robins’ well-known g-computation algorithm previously described for causal inference under assumptions (i) and (ii).
We present two techniques to improve landmark localization from partially annotated datasets. Our primary goal is to leverage the common situation where precise landmark locations are only provided for a small data subset, but where class labels for classification tasks related to the landmarks are more abundantly available. We propose a new architecture for landmark localization, where training with class labels acts as an auxiliary signal to guide the landmark localization on unlabeled data. A key aspect of our approach is that errors can be backpropagated through a complete landmark localization model. We also propose and explore an unsupervised learning technique for landmark localization based on having a model predict equivariant landmarks with respect to transformations applied to the image. We show that this technique, used as additional regularization, improves landmark prediction considerably and can learn effective detectors even when only a small fraction of the dataset has labels for landmarks. We present results on two toy datasets and three real datasets, with hands and faces, respectively, showing the performance gain of our method on each.
Generative modeling, which learns joint probability distribution from training data and generates samples according to it, is an important task in machine learning and artificial intelligence. Inspired by probabilistic interpretation of quantum physics, we propose a generative model using matrix product states, which is a tensor network originally proposed for describing (particularly one-dimensional) entangled quantum states. Our model enjoys efficient learning by utilizing the density matrix renormalization group method which allows dynamic adjusting dimensions of the tensors, and offers an efficient direct sampling approach, Zipper, for generative tasks. We apply our method to generative modeling of several standard datasets including the principled Bars and Stripes, random binary patterns and the MNIST handwritten digits, to illustrate ability of our model, and discuss features as well as drawbacks of our model over popular generative models such as Hopfield model, Boltzmann machines and generative adversarial networks. Our work shed light on many interesting directions for future exploration on the development of quantum-inspired algorithms for unsupervised machine learning, which is of possibility of being realized by a quantum device.
Over the last few years, deep learning has revolutionized the field of machine learning by dramatically improving the state-of-the-art in various domains. However, as the size of supervised artificial neural networks grows, typically so does the need for larger labeled datasets. Recently, crowdsourcing has established itself as an efficient and cost-effective solution for labeling large sets of data in a scalable manner, but it often requires aggregating labels from multiple noisy contributors with different levels of expertise. In this paper, we address the problem of learning deep neural networks from crowds. We begin by describing an EM algorithm for jointly learning the parameters of the network and the confusion matrices of the different annotators for classification settings. Then, a novel general-purpose crowd layer is proposed, which allows us to train deep neural networks end-to-end, directly from the noisy labels of multiple annotators, using backpropagation. We empirically show that the proposed approach is able to internally capture the reliability and biases of different annotators and achieve new state-of-the-art results for various crowdsourced datasets across different settings, namely classification, regression and sequence labeling.
Distributed Complex Event Processing (DCEP) is a paradigm to infer the occurrence of complex situations in the surrounding world from basic events like sensor readings. In doing so, DCEP operators detect event patterns on their incoming event streams. To yield high operator throughput, data parallelization frameworks divide the incoming event streams of an operator into overlapping windows that are processed in parallel by a number of operator instances. In doing so, the basic assumption is that the different windows can be processed independently from each other. However, consumption policies enforce that events can only be part of one pattern instance; then, they are consumed, i.e., removed from further pattern detection. That implies that the constituent events of a pattern instance detected in one window are excluded from all other windows as well, which breaks the data parallelism between different windows. In this paper, we tackle this problem by means of speculation: Based on the likelihood of an event’s consumption in a window, subsequent windows may speculatively suppress that event. We propose the SPECTRE framework for speculative processing of multiple dependent windows in parallel. Our evaluations show an up to linear scalability of SPECTRE with the number of CPU cores.
Training deep neural networks is known to require a large number of training samples. However, in many applications only few training samples are available. In this work, we tackle the issue of training neural networks for classification task when few training samples are available. We attempt to solve this issue by proposing a regularization term that constrains the hidden layers of a network to learn class-wise invariant features. In our regularization framework, learning invariant features is generalized to the class membership where samples with the same class should have the same feature representation. Numerical experiments over MNIST and its variants showed that our proposal is more efficient for the case of few training samples. Moreover, we show an intriguing property of representation learning within neural networks. The source code of our framework is freely available https://…/learning-class-invariant-features.
Convolutional neural networks (CNNs) are equivariant with respect to translation; a translation in the input causes a translation in the output. Attempts to generalize equivariance have concentrated on rotations. In this paper, we combine the idea of the spatial transformer, and the canonical coordinate representations of groups (polar transform) to realize a network that is invariant to translation, and equivariant to rotation and scale. A conventional CNN is used to predict the origin of a polar transform. The polar transform is performed in a differentiable way, similar to the Spatial Transformer Networks, and the resulting polar representation is fed into a second CNN. The model is trained end-to-end with a classification loss. We apply the method on variations of MNIST, obtained by perturbing it with clutter, translation, rotation, and scaling. We achieve state of the art performance in the rotated MNIST, with fewer parameters and faster training time than previous methods, and we outperform all tested methods in the SIM2MNIST dataset, which we introduce.
We present a practical way of introducing convolutional structure into Gaussian processes, making them more suited to high-dimensional inputs like images. The main contribution of our work is the construction of an inter-domain inducing point approximation that is well-tailored to the convolutional kernel. This allows us to gain the generalisation benefit of a convolutional kernel, together with fast but accurate posterior inference. We investigate several variations of the convolutional kernel, and apply it to MNIST and CIFAR-10, which have both been known to be challenging for Gaussian processes. We also show how the marginal likelihood can be used to find an optimal weighting between convolutional and RBF kernels to further improve performance. We hope that this illustration of the usefulness of a marginal likelihood will help automate discovering architectures in larger models.
Reliable uncertainty estimation for time series prediction is critical in many fields, including physics, biology, and manufacturing. At Uber, probabilistic time series forecasting is used for robust prediction of number of trips during special events, driver incentive allocation, as well as real-time anomaly detection across millions of metrics. Classical time series models are often used in conjunction with a probabilistic formulation for uncertainty estimation. However, such models are hard to tune, scale, and add exogenous variables to. Motivated by the recent resurgence of Long Short Term Memory networks, we propose a novel end-to-end Bayesian deep model that provides time series prediction along with uncertainty estimation. We provide detailed experiments of the proposed solution on completed trips data, and successfully apply it to large-scale time series anomaly detection at Uber.