We present a regularization-inspired approach for reducing bias in learned classifiers. In particular, we focus on binary classification tasks over individuals from two populations, where, as our criterion for fairness, we wish to achieve similar false positive rates in both populations, and similar false negative rates in both populations. As a proof of concept, we implement our approach and empirically evaluate its ability to achieve both fairness and accuracy, using the COMPAS scores data for prediction of recidivism.
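The fairness criterion above (similar false positive and false negative rates in both populations) can be illustrated with a minimal sketch. The function names, the hard-count penalty, and the `weight` parameter are assumptions made for illustration; the abstract does not specify the form of the regularizer, and a trainable version would likely need a differentiable surrogate.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for labels in {0, 1}."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


def fairness_gap(y_true, y_pred, group):
    """Sum of absolute FPR and FNR differences between group 0 and group 1."""
    rates = []
    for g in (0, 1):
        yt = [t for t, s in zip(y_true, group) if s == g]
        yp = [p for p, s in zip(y_pred, group) if s == g]
        rates.append(error_rates(yt, yp))
    (fpr0, fnr0), (fpr1, fnr1) = rates
    return abs(fpr0 - fpr1) + abs(fnr0 - fnr1)


def penalized_loss(base_loss, y_true, y_pred, group, weight=1.0):
    """Hypothetical objective: accuracy loss plus a weighted fairness penalty."""
    return base_loss + weight * fairness_gap(y_true, y_pred, group)
```

A larger `weight` trades accuracy for smaller gaps in error rates between the two populations.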
We consider the testing and estimation of change-points, locations where the distribution abruptly changes, in a sequence of multivariate or non-Euclidean observations. We study a nonparametric framework that utilizes similarity information among observations, which can be applied to various data types as long as an informative similarity measure on the sample space can be defined. The existing approach along this line has low power and/or biased estimates for change-points under some common scenarios. We address these problems by considering new tests based on similarity information. Simulation studies show that the new approaches exhibit substantial improvements in detecting and estimating change-points. In addition, under some mild conditions, the new test statistics are asymptotically distribution free under the null hypothesis of no change. Analytic p-value approximations to the significance of the new test statistics for the single change-point alternative and changed interval alternative are derived, making the new approaches easy off-the-shelf tools for large datasets. The new approaches are illustrated in an analysis of New York taxi data.
We propose Teacher-Student Curriculum Learning (TSCL), a framework for automatic curriculum learning, where the Student tries to learn a complex task and the Teacher automatically chooses subtasks from a given set for the Student to train on. We describe a family of Teacher algorithms that rely on the intuition that the Student should practice more those tasks on which it makes the fastest progress, i.e. where the slope of the learning curve is highest. In addition, the Teacher algorithms address the problem of forgetting by also choosing tasks where the Student’s performance is getting worse. We demonstrate that TSCL matches or surpasses the results of carefully hand-crafted curricula in two tasks: addition of decimal numbers with LSTM and navigation in Minecraft. Our automatically generated curriculum made it possible to solve a Minecraft maze that could not be solved at all when training directly on the maze, and learning was an order of magnitude faster than with uniform sampling of subtasks.
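The Teacher intuition described above (prefer subtasks whose learning curve has the steepest slope, counting negative slopes so that forgotten tasks are revisited) can be sketched as follows. The class name `SlopeTeacher`, the window-based slope estimate, and the epsilon-greedy exploration are assumptions chosen for illustration, not necessarily the algorithms of the paper.

```python
import random


class SlopeTeacher:
    """Picks the subtask whose recent learning curve has the largest absolute slope."""

    def __init__(self, num_tasks, window=3, epsilon=0.1, seed=0):
        self.scores = [[] for _ in range(num_tasks)]  # score history per subtask
        self.window = window
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def record(self, task, score):
        """Store the Student's latest score on a subtask."""
        self.scores[task].append(score)

    def slope(self, task):
        """Average change per step over the last `window` scores."""
        recent = self.scores[task][-self.window:]
        if len(recent) < 2:
            return float("inf")  # too little data: give the task priority
        return (recent[-1] - recent[0]) / (len(recent) - 1)

    def choose(self):
        """Explore with probability epsilon, otherwise pick the steepest curve."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.scores))
        # The absolute value also selects tasks whose performance is dropping.
        magnitudes = [abs(self.slope(t)) for t in range(len(self.scores))]
        return max(range(len(magnitudes)), key=magnitudes.__getitem__)
```

In a training loop, the Student would train on `teacher.choose()`, evaluate, and report the score back through `teacher.record(task, score)`.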
Recent developments in neural information retrieval models have been promising, but a problem remains: human relevance judgments are expensive to produce, while neural models require a considerable amount of training data. In an attempt to fill this gap, we present an approach for generating weak supervision training data for use in a neural IR model. Specifically, we use a news corpus with article headlines acting as pseudo-queries and article content as pseudo-documents, and we propose a measure of interaction similarity to filter these pseudo-documents. Additionally, we employ techniques for addressing problems related to finding effective negative training examples and disregarding headlines that do not work well as queries. By using our approach to train state-of-the-art neural IR models and comparing to established baselines, we find that training data generated by our approach can lead to good results on a benchmark test collection.
This report is an introduction to transcription methods for trajectory optimization techniques. The first few sections describe the two classes of transcription methods (shooting and simultaneous) that are used to convert the trajectory optimization problem into a general constrained optimization form. The middle of the report discusses a few extensions to the basic methods, including how to deal with hybrid systems (such as walking robots). The final section goes over a variety of implementation details.
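As a minimal sketch of the shooting class of transcription, the example below treats only the control sequence as decision variables and obtains the states by forward simulation, so the boundary condition becomes a constraint on the simulated end state. The unit point mass, explicit Euler integration, and all function names are assumptions chosen for illustration.

```python
def simulate(x0, controls, dt):
    """Forward-simulate a unit point mass (position, velocity) with explicit Euler."""
    pos, vel = x0
    for u in controls:
        pos, vel = pos + dt * vel, vel + dt * u
    return pos, vel


def shooting_problem(x0, target, dt):
    """Return (objective, constraint) as functions of the control sequence only."""

    def objective(controls):
        # Integrated control effort, approximated by a Riemann sum.
        return dt * sum(u * u for u in controls)

    def constraint(controls):
        # Defect between the simulated final state and the target; must be (0, 0).
        pos, vel = simulate(x0, controls, dt)
        return pos - target[0], vel - target[1]

    return objective, constraint
```

A general constrained optimizer would then minimize `objective` subject to `constraint(controls) == (0, 0)`. In a simultaneous method, by contrast, the states at each grid point would also be decision variables, tied to the dynamics through additional constraints.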
We propose a neural encoder-decoder model with reinforcement learning (NRL) for grammatical error correction (GEC). Unlike conventional maximum likelihood estimation (MLE), the model directly optimizes towards an objective that considers a sentence-level, task-specific evaluation metric, avoiding the exposure bias issue in MLE. We demonstrate that NRL outperforms MLE both in human and automated evaluation metrics, achieving the state-of-the-art on a fluency-oriented GEC corpus.
This paper presents a fast decorrelated neuro-ensemble with heterogeneous features for large-scale data analytics, where stochastic configuration networks (SCNs) are employed as base learner models and the well-known negative correlation learning (NCL) strategy is adopted to evaluate the output weights. Feeding a large number of samples into the SCN base models yields a huge linear equation system that is difficult to solve by computing the pseudo-inverse used in the least squares method. Based on the group of heterogeneous features, the block Jacobi and Gauss-Seidel methods are employed to iteratively evaluate the output weights, and a convergence analysis is given with a demonstration of the uniqueness of these iterative solutions. Comparative experiments on two large-scale datasets are carried out, and the robustness of the system with respect to the regularizing factor used in NCL is reported. Results indicate that the proposed ensemble learning techniques have good potential for resolving large-scale data modelling problems.
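To illustrate how output weights can be evaluated iteratively instead of through a pseudo-inverse, here is a minimal sketch of Gauss-Seidel sweeps on the least-squares normal equations. It uses blocks of size one and a plain ridge term; the block partitioning by heterogeneous feature groups and the NCL coupling of the paper are not modelled, and the function names are hypothetical.

```python
def normal_equations(H, y, ridge=0.0):
    """Build A = H^T H + ridge * I and b = H^T y for least-squares output weights."""
    m, n = len(H), len(H[0])
    A = [[sum(H[k][i] * H[k][j] for k in range(m)) + (ridge if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(H[k][i] * y[k] for k in range(m)) for i in range(n)]
    return A, b


def gauss_seidel(A, b, iterations=200, tol=1e-10):
    """Solve A x = b by Gauss-Seidel sweeps; converges when A is symmetric positive definite."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        largest_change = 0.0
        for i in range(n):
            # Entries x[j] with j < i were already updated in this sweep.
            residual = b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_value = residual / A[i][i]
            largest_change = max(largest_change, abs(new_value - x[i]))
            x[i] = new_value
        if largest_change < tol:
            break
    return x
```

A Jacobi variant would compute every `new_value` from the previous sweep's `x` only, which makes the updates independent of each other and therefore easy to parallelize.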
Model-based clustering is a popular approach for clustering multivariate data which has seen application in numerous fields. Nowadays, high-dimensional data are more and more common and the model-based clustering approach has adapted to deal with the increasing dimensionality. In particular, the development of variable selection techniques has received a lot of attention and research effort in recent years. Even for small size problems, variable selection has been advocated to facilitate the interpretation of the clustering results. This review provides a summary of the methods developed for selecting relevant clustering variables in model-based clustering. The methods are illustrated by application to real-world data, and existing software implementing the methods is indicated.
We introduce a novel approach for training adversarial models by replacing the discriminator score with a bi-modal Gaussian distribution over the real/fake indicator variables. In order to do this, we train the Gaussian classifier to match the target bi-modal distribution implicitly through meta-adversarial training. We hypothesize that this approach ensures a non-zero gradient to the generator, even in the limit of a perfect classifier. We test our method on standard benchmark image datasets and show that the classifier output distribution is smooth and has overlap between the real and fake modes.
In this study, we propose a new statistical approach for high-dimensionality reduction of heterogeneous data that limits the curse of dimensionality and deals with missing values. To handle the latter, we propose to use the Random Forest imputation method. The main purpose here is to extract useful information and thereby reduce the search space to facilitate the data exploration process. Several illustrative numeric examples, using data from publicly available machine learning repositories, are also included. The experimental component of the study shows the efficiency of the proposed analytical approach.
Recently, deep learning approaches have achieved significant performance improvement in various imaging problems. However, it is still unclear why these deep learning architectures work. Moreover, the link between deep learning and classical signal processing approaches such as wavelets, non-local processing, compressed sensing, etc., is still not well understood, which often leaves signal processors in deep trouble. To address these issues, here we show that the long-searched-for missing link is the convolutional framelet, which represents a signal by convolving local and non-local bases. Convolutional framelets were originally developed to generalize the recent theory of low-rank Hankel matrix approaches, and this paper significantly extends the idea to derive a deep neural network using multi-layer convolutional framelets with perfect reconstruction (PR) under rectified linear unit (ReLU). Our analysis also shows that popular deep network components such as residual blocks, redundant filter channels, and concatenated ReLU (CReLU) indeed help to achieve PR, while the pooling and unpooling layers should be augmented with multi-resolution convolutional framelets to achieve the PR condition. This discovery reveals the limitations of many existing deep learning architectures for inverse problems, and leads us to propose a novel deep convolutional framelets neural network. Using numerical experiments with sparse view x-ray computed tomography (CT), we demonstrate that our deep convolutional framelets network shows consistent improvement. This discovery suggests that the success of deep learning comes not from the magical power of a black box, but rather from the power of a novel signal representation using a non-local basis combined with a data-driven local basis, which is indeed a natural extension of classical signal processing theory.
Dimensionality reduction for high-order tensors is a challenging problem. In conventional approaches, higher order tensors are vectorized via Tucker decomposition to obtain lower order tensors. This destroys the inherent high-order structures or results in undesired tensors, respectively. This paper introduces a probabilistic vectorial dimensionality reduction model for tensorial data. The model represents a tensor as a linear combination of basis tensors of the same order, and thus offers a mechanism to directly reduce a tensor to a vector. Under this expression, the projection base of the model is based on the tensor CandeComp/PARAFAC (CP) decomposition, and the number of free parameters in the model grows only linearly with the number of modes rather than exponentially. Bayesian inference is established via the variational EM approach. A criterion to set the parameters (factor number of CP decomposition and the number of extracted features) is given empirically. The model outperforms several existing PCA-based methods and CP decomposition on several publicly available databases in terms of classification and clustering accuracy.
Convolutional dictionary learning (CDL or sparsifying CDL) has many applications in image processing and computer vision. There has been growing interest in developing efficient algorithms for CDL, mostly relying on the augmented Lagrangian (AL) method or its variant, the alternating direction method of multipliers (ADMM). When their parameters are properly tuned, AL methods have shown fast convergence in CDL. However, the parameter tuning process is not trivial due to its data dependence and, in practice, the convergence of AL methods depends on the AL parameters for nonconvex CDL problems. To moderate these problems, this paper proposes a new practically feasible and convergent Block Proximal Gradient method using a Majorizer (BPG-M) for CDL. The BPG-M-based CDL is investigated with different block updating schemes and majorization matrix designs, and further accelerated by incorporating some momentum coefficient formulas and restarting techniques. All of the methods investigated incorporate a boundary artifacts removal operator in the learning model. Numerical experiments show that, without needing any parameter tuning process, the proposed BPG-M approach converges more stably to desirable solutions of lower objective values than the existing state-of-the-art ADMM algorithm does. Compared to the ADMM approach, the BPG-M method using a multi-block updating scheme is particularly useful in single-threaded CDL algorithms handling large datasets. Image denoising experiments show that, for relatively strong additive white Gaussian noise, the filters learned by BPG-M-based CDL outperform those trained by the ADMM approach.
Many supervised learning tasks emerge in dual forms, e.g., English-to-French translation vs. French-to-English translation, speech recognition vs. text to speech, and image classification vs. image generation. Two dual tasks have intrinsic connections with each other due to the probabilistic correlation between their models. This connection is, however, not effectively utilized today, since people usually train the models of two dual tasks separately and independently. In this work, we propose training the models of two dual tasks simultaneously, and explicitly exploiting the probabilistic correlation between them to regularize the training process. For ease of reference, we call the proposed approach dual supervised learning. We demonstrate that dual supervised learning can improve the practical performance of both tasks, for various applications including machine translation, image processing, and sentiment analysis.
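One way to read the probabilistic correlation between two dual models is through the identity P(x) P(y|x) = P(y) P(x|y), which any consistent pair of models must satisfy for a joint pair (x, y). The sketch below penalizes the squared violation of that identity in log space. The function names, the squared form, and the `weight` parameter are assumptions for illustration, not a statement of the exact regularizer used in the paper.

```python
import math


def duality_gap(log_p_x, log_p_y_given_x, log_p_y, log_p_x_given_y):
    """Squared violation of log P(x) + log P(y|x) = log P(y) + log P(x|y)."""
    return (log_p_x + log_p_y_given_x - log_p_y - log_p_x_given_y) ** 2


def regularized_losses(loss_primal, loss_dual, gap, weight=0.01):
    """Add the same duality penalty to the losses of both tasks."""
    return loss_primal + weight * gap, loss_dual + weight * gap


# Example with hypothetical probabilities: both factorizations give a joint of 0.2,
# so the gap is zero up to floating-point rounding.
consistent_gap = duality_gap(math.log(0.5), math.log(0.4),
                             math.log(0.25), math.log(0.8))
```

Here the marginals `log_p_x` and `log_p_y` would come from separately estimated models (for example, language models in translation), while the two conditional terms come from the primal and dual models being trained.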
Multi-label classification is a practical yet challenging task in machine learning related fields, since it requires the prediction of more than one label category for each input instance. We propose a novel deep neural network (DNN) based model, Canonical Correlated AutoEncoder (C2AE), for solving this task. Aiming at better relating feature and label domain data for improved classification, we uniquely perform joint feature and label embedding by deriving a deep latent space, followed by the introduction of a label-correlation sensitive loss function for recovering the predicted label outputs. Our C2AE is achieved by integrating the DNN architectures of canonical correlation analysis and autoencoder, which allows end-to-end learning and prediction with the ability to exploit label dependency. Moreover, our C2AE can be easily extended to address the learning problem with missing labels. Our experiments on multiple datasets with different scales confirm the effectiveness and robustness of our proposed method, which is shown to perform favorably against state-of-the-art methods for multi-label classification.
We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving significantly improved error rates that are nearly state-of-the-art on several benchmarks including CIFAR-10 and CIFAR-100, without introducing any additional hyper-parameters. We exploit the phenomenon of flat minima that has been shown to lead to improved generalization error for deep networks. Parle requires very infrequent communication with the parameter server and instead performs more computation on each client, which makes it well-suited to both single-machine, multi-GPU settings and distributed implementations.
In the last decade we have observed a massive increase in information, in particular information that is shared through smartphones. Consequently, the amount of information that is available does not allow the average user to be aware of all their options. In this context, recommender systems use a number of techniques to help a user find the desired product. Hence, nowadays recommender systems play an important role. Recommender systems aim to identify the products that best fit user preferences. These techniques are advantageous to both users and vendors, as they enable users to rapidly find what they need and vendors to promote their products and sales. As industry became aware of the gains that these algorithms could bring, and as the problem attracted the interest of many researchers, recommender systems have been a very active area since the mid-90s. Bearing in mind that this is an ongoing problem, the present thesis examines the value of using a recommender algorithm to find a user's likes by observing their domain preferences. Using a balanced probabilistic method, this thesis shows how news topics can be used to recommend news articles. We use different machine learning methods to determine the user ratings for an article. To tackle this problem, supervised learning methods such as linear regression, Naive Bayes and logistic regression are used. All the aforementioned models have a different nature, which has an impact on the solution of the given problem. Furthermore, a number of experiments are presented and discussed to identify the feature set that best fits the problem.
The aim of this survey is to review the machine learning and stochastic techniques that existing work uses for the challenging problem of visual tracking. It is not intended to cover the whole tracking literature of the last decades, which seems impossible given the incredibly vast number of published papers. This first draft version of the article focuses specifically on recent literature that proposes Siamese networks for learning to track. This approach promises a step forward in terms of robustness, accuracy and computational efficiency. For example, the representative tracker SINT currently performs best on the popular OTB-2013 benchmark with AuC/IoU/prec. of 65.5/62.5/84.8 % for the one-pass experiment (OPE). The CVPR’17 work CVNet by the Oxford group shows the approach’s large potential for HW/SW co-design, with network memory needs around 600 kB and frame rates of 75 fps and beyond. Before a detailed description of this approach is given, the article recaps the definition of tracking, the current state-of-the-art view on designing algorithms, and the state of the art of trackers by summarising insights from existing literature. In future, the article will be extended by a review of two alternative approaches: one using very general recurrent networks such as Long Short-Term Memory (LSTM) networks, and the other, most obvious approach of applying convolutional networks (CNN) alone, the earliest approach since the idea of deep learning tracking appeared at NIPS’13.
Due to the importance of zero-shot learning, i.e. classifying images where there is a lack of labeled training data, the number of proposed approaches has recently increased steadily. We argue that it is time to take a step back and to analyze the status quo of the area. The purpose of this paper is three-fold. First, given the fact that there is no agreed-upon zero-shot learning benchmark, we define a new benchmark by unifying both the evaluation protocols and data splits of publicly available datasets used for this task. This is an important contribution as published results are often not comparable and sometimes even flawed due to, e.g., pre-training on zero-shot test classes. Moreover, we propose a new zero-shot learning dataset, the Animals with Attributes 2 (AWA2) dataset, which we make publicly available both in terms of image features and the images themselves. Second, we compare and analyze a significant number of the state-of-the-art methods in depth, both in the classic zero-shot setting and in the more realistic generalized zero-shot setting. Finally, we discuss in detail the limitations of the current status of the area, which can be taken as a basis for advancing it.
In this paper, we use a variational recurrent model to investigate the time series forecasting problem. Combining a recurrent neural network (RNN) and variational inference (VI), this model has both deterministic hidden states and stochastic latent variables, while previous RNN methods only consider deterministic states. Based on comprehensive experiments, we show that the proposed method significantly improves the state-of-the-art performance on chaotic time series benchmarks and has better performance on real-world data. Both single-output and multiple-output predictions are investigated.