Development of machine learning (ML) workflows is a tedious process of iterative experimentation: developers repeatedly make changes to workflows until the desired accuracy is attained. We describe our vision for a ‘human-in-the-loop’ ML system that accelerates this process: by intelligently tracking changes and intermediate results over time, such a system can enable rapid iteration, quick responsive feedback, introspection and debugging, and background execution and automation. We finally describe Helix, our preliminary attempt at such a system that has already led to speedups of up to 10x on typical iterative workflows against competing systems.
Accurately predicting machine failures in advance can decrease maintenance cost and help allocate maintenance resources more efficiently. Logistic regression was applied to predict machine state 24 hours in the future given the current machine state.
Multidimensional time series are sequences of real valued vectors. They occur in different areas, for example handwritten characters, GPS tracking, and gestures of modern virtual reality motion controllers. Within these areas, a common task is to search for similar time series. Dynamic Time Warping (\dtw) is a common distance function to compare two time series. The Edit Distance with Real Penalty (\erp) and the Dog Keeper Distance (\frechet) are two more distance functions on time series. Their behaviour has been analyzed on 1-dimensional time series. However, it is not easy to evaluate their behaviour in relation to growing dimensionality. For this reason we propose two new data synthesizers generating multidimensional time series. The first synthesizer extends the well known cylinder-bell-funnel (CBF) dataset to multidimensional time series. Here, each time series has an arbitrary type (cylinder, bell, or funnel) in each dimension, thus for $d$-dimensional time series there are $3^{d}$ different classes. The second synthesizer (\ram) creates time series with ideas adapted from Brownian motions which is a common model of movement in physics. Finally, we evaluate the applicability of a 1-nearest neighbor classifier using \dtw{} on datasets generated by our synthesizers.
An ensemble technique is characterized by the mechanism that generates the components and by the mechanism that combines them. A common way to achieve the consensus is to enable each component to equally participate in the aggregation process. A problem with this approach is that poor components are likely to negatively affect the quality of the consensus result. To address this issue, alternatives have been explored in the literature to build selective classifier and cluster ensembles, where only a subset of the components contributes to the computation of the consensus. Of the family of ensemble methods, outlier ensembles are the least studied. Only recently, the selection problem for outlier ensembles has been discussed. In this work we define a new graph-based class of ranking selection methods. A method in this class is characterized by two main steps: (1) Mapping the rankings onto a graph structure; and (2) Mining the resulting graph to identify a subset of rankings. We define a specific instance of the graph-based ranking selection class. Specifically, we map the problem of selecting ensemble components onto a mining problem in a graph. An extensive evaluation was conducted on a variety of heterogeneous data and methods. Our empirical results show that our approach outperforms state-of-the-art selective outlier ensemble techniques.
This paper presents an open-source enforcement learning toolkit named CytonRL (https://…/cytonRL ). The toolkit implements four recent advanced deep Q-learning algorithms from scratch using C++ and NVIDIA’s GPU-accelerated libraries. The code is simple and elegant, owing to an open-source general-purpose neural network library named CytonLib. Benchmark shows that the toolkit achieves competitive performances on the popular Atari game of Breakout.
In this paper, we present BigDL, a distributed deep learning framework for Big Data platforms and workflows. It is implemented on top of Apache Spark, and allows users to write their deep learning applications as standard Spark programs (running directly on large-scale big data clusters in a distributed fashion). It provides an expressive, ‘data-analytics integrated’ deep learning programming model, so that users can easily build the end-to-end analytics + AI pipelines under a unified programming paradigm; by implementing an AllReduce like operation using existing primitives in Spark (e.g., shuffle, broadcast, and in-memory data persistence), it also provides a highly efficient ‘parameter server’ style architecture, so as to achieve highly scalable, data-parallel distributed training. Since its initial open source release, BigDL users have built many analytics and deep learning applications (e.g., object detection, sequence-to-sequence generation, neural recommendations, fraud detection, etc.) on Spark.
Action classification is a widely known and popular task that offers an approach towards video understanding. The absence of an easy-to-use platform containing state-of-the-art (SOTA) models presents an issue for the community. Given that individual research code is not written with an end user in mind and in certain cases code is not released, even for published articles, the importance of a common unified platform capable of delivering results while removing the burden of developing an entire system cannot be overstated. To try and overcome these issues, we develop a tensorflow-based unified platform to abstract away unnecessary overheads in terms of an end-to-end pipeline setup in order to allow the user to quickly and easily prototype action classification models. With the use of a consistent coding style across different models and seamless data flow between various submodules, the platform lends itself to the quick generation of results on a wide range of SOTA methods across a variety of datasets. All of these features are made possible through the use of fully pre-defined training and testing blocks built on top of a small but powerful set of modular functions that handle asynchronous data loading, model initializations, metric calculations, saving and loading of checkpoints, and logging of results. The platform is geared towards easily creating models, with the minimum requirement being the definition of a network architecture and preprocessing steps from a large custom selection of layers and preprocessing functions. M-PACT currently houses four SOTA activity classification models which include, I3D, C3D, ResNet50+LSTM and TSN. The classification performance achieved by these models are, 43.86% for ResNet50+LSTM on HMDB51 while C3D and TSN achieve 93.66% and 85.25% on UCF101 respectively.
Big data analytics is gaining massive momentum in the last few years. Applying machine learning models to big data has become an implicit requirement or an expectation for most analysis tasks, especially on high-stakes applications.Typical applications include sentiment analysis against reviews for analyzing on-line products, image classification in food logging applications for monitoring user’s daily intake and stock movement prediction. Extending traditional database systems to support the above analysis is intriguing but challenging. First, it is almost impossible to implement all machine learning models in the database engines. Second, expertise knowledge is required to optimize the training and inference procedures in terms of efficiency and effectiveness, which imposes heavy burden on the system users. In this paper, we develop and present a system, called Rafiki, to provide the training and inference service of machine learning models, and facilitate complex analytics on top of cloud platforms. Rafiki provides distributed hyper-parameter tuning for the training service, and online ensemble modeling for the inference service which trades off between latency and accuracy. Experimental results confirm the efficiency, effectiveness, scalability and usability of Rafiki.
Graph classification is a fundamental but challenging problem due to the non-Euclidean property of graph. In this work, we jointly leverage the powerful representation ability of random walk and the essential success of standard convolutional network work (CNN), to propose a random walk based convolutional network, called walk-steered convolution (WSC). Different from those existing graph CNNs with deterministic neighbor searching, we randomly sample multi-scale walk fields by using random walk, which is more flexible to the scalability of graph. To encode each-scale walk field consisting of several walk paths, specifically, we characterize the directions of walk field by multiple Gaussian models so as to better analogize the standard CNNs on images. Each Gaussian implicitly defines a directions and all of them properly encode the spatial layout of walks after the gradient projecting to the space of Gaussian parameters. Further, a graph coarsening layer using dynamical clustering is stacked upon the Gaussian encoding to capture high-level semantics of graph. Comprehensive evaluations on several public datasets well demonstrate the superiority of our proposed graph learning method over other state-of-the-arts for graph classification.
In 2018, at the World Economic Forum in Davos it was presented a new countries’ economic performance metric named the Inclusive Development Index (IDI) composed of 12 indicators. The new metric implies that countries might need to realize structural reforms for improving both economic expansion and social inclusion performance. That is why, it is vital for the IDI calculation method to have strong statistical and mathematical basis, so that results are accurate and transparent for public purposes. In the current work, we propose a novel approach for the IDI estimation – the Ranking Relative Principal Component Attributes Network Model (REL-PCANet). The model is based on RELARM and RankNet principles and combines elements of PCA, techniques applied in image recognition and learning to rank mechanisms. Also, we define a new approach for estimation of target probabilities matrix to reflect dynamic changes in countries’ inclusive development. Empirical study proved that REL-PCANet ensures reliable and robust scores and rankings, thus is recommended for practical implementation.
Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be ‘compressed’ to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size. Combined with off-the-shelf compression algorithms, the bound leads to state of the art generalization guarantees; in particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. As additional evidence connecting compression and generalization, we show that compressibility of models that tend to overfit is limited: We establish an absolute limit on expected compressibility as a function of expected generalization error, where the expectations are over the random choice of training examples. The bounds are complemented by empirical results that show an increase in overfitting implies an increase in the number of bits required to describe a trained network.
Receiver operating characteristic (ROC) curves are widely used as a measure of accuracy of diagnostic tests and can be summarized using the area under the ROC curve (AUC). Often, it is useful to construct a confidence intervals for the AUC, however, since there are a number of different proposed methods to measure variance of the AUC, there are thus many different resulting methods for constructing these intervals. In this manuscript, we compare different methods of constructing Wald-type confidence interval in the presence of missing data where the missingness mechanism is ignorable. We find that constructing confidence intervals using multiple imputation (MI) based on logistic regression (LR) gives the most robust coverage probability and the choice of CI method is less important. However, when missingness rate is less severe (e.g. less than 70%), we recommend using Newcombe’s Wald method for constructing confidence intervals along with multiple imputation using predictive mean matching (PMM).
Many problems in NLP require aggregating information from multiple mentions of the same entity which may be far apart in the text. Existing Recurrent Neural Network (RNN) layers are biased towards short-term dependencies and hence not suited to such tasks. We present a recurrent layer which is instead biased towards coreferent dependencies. The layer uses coreference annotations extracted from an external system to connect entity mentions belonging to the same cluster. Incorporating this layer into a state-of-the-art reading comprehension model improves performance on three datasets — Wikihop, LAMBADA and the bAbi AI tasks — with large gains when training data is scarce.
Recently, Factorization Machines (FM) has become more and more popular for recommendation systems, due to its effectiveness in finding informative interactions between features. Usually, the weights for the interactions is learnt as a low rank weight matrix, which is formulated as an inner product of two low rank matrices. This low rank can help improve the generalization ability of Factorization Machines. However, to choose the rank properly, it usually needs to run the algorithm for many times using different ranks, which clearly is inefficient for some large-scale datasets. To alleviate this issue, we propose an Adaptive Boosting framework of Factorization Machines (AdaFM), which can adaptively search for proper ranks for different datasets without re-training. Instead of using a fixed rank for FM, the proposed algorithm will adaptively gradually increases its rank according to its performance until the performance does not grow, using boosting strategy. To verify the performance of our proposed framework, we conduct an extensive set of experiments on many real-world datasets. Encouraging empirical results shows that the proposed algorithms are generally more effective than state-of-the-art other Factorization Machines.
Co-training is a popular semi-supervised learning framework to utilize a large amount of unlabeled data in addition to a small labeled set. Co-training methods exploit predicted labels on the unlabeled data and select samples based on prediction confidence to augment the training. However, the selection of samples in existing co-training methods is based on a predetermined policy, which ignores the sampling bias between the unlabeled and the labeled subsets, and fails to explore the data space. In this paper, we propose a novel method, Reinforced Co-Training, to select high-quality unlabeled samples to better co-train on. More specifically, our approach uses Q-learning to learn a data selection policy with a small labeled dataset, and then exploits this policy to train the co-training classifiers automatically. Experimental results on clickbait detection and generic text classification tasks demonstrate that our proposed method can obtain more accurate text classification results.
In this paper, we propose the Cross-Domain Adversarial Auto-Encoder (CDAAE) to address the problem of cross-domain image inference, generation and transformation. We make the assumption that images from different domains share the same latent code space for content, while having separate latent code space for style. The proposed framework can map cross-domain data to a latent code vector consisting of a content part and a style part. The latent code vector is matched with a prior distribution so that we can generate meaningful samples from any part of the prior space. Consequently, given a sample of one domain, our framework can generate various samples of the other domain with the same content of the input. This makes the proposed framework different from the current work of cross-domain transformation. Besides, the proposed framework can be trained with both labeled and unlabeled data, which makes it also suitable for domain adaptation. Experimental results on data sets SVHN, MNIST and CASIA show the proposed framework achieved visually appealing performance for image generation task. Besides, we also demonstrate the proposed method achieved superior results for domain adaptation. Code of our experiments is available in https://…/CDAAE.
We study feature propagation on graph, an inference process involved in graph representation learning tasks. It’s to spread the features over the whole graph to the $t$-th orders, thus to expand the end’s features. The process has been successfully adopted in graph embedding or graph neural networks, however few works studied the convergence of feature propagation. Without convergence guarantees, it may lead to unexpected numerical overflows and task failures. In this paper, we first define the concept of feature propagation on graph formally, and then study its convergence conditions to equilibrium states. We further link feature propagation to several established approaches such as node2vec and structure2vec. In the end of this paper, we extend existing approaches from represent nodes to edges (edge2vec) and demonstrate its applications on fraud transaction detection in real world scenario. Experiments show that it is quite competitive.
There has been growing interest in extending traditional vector-based machine learning techniques to their tensor forms. An example is the support tensor machine (STM) that utilizes a rank-one tensor to capture the data structure, thereby alleviating the overfitting and curse of dimensionality problems in the conventional support vector machine (SVM). However, the expressive power of a rank-one tensor is restrictive for many real-world data. To overcome this limitation, we introduce a support tensor train machine (STTM) by replacing the rank-one tensor in an STM with a tensor train. Experiments validate and confirm the superiority of an STTM over the SVM and STM.
In many applications of relational learning, the available data can be seen as a sample from a larger relational structure (e.g. we may be given a small fragment from some social network). In this paper we are particularly concerned with scenarios in which we can assume that (i) the domain elements appearing in the given sample have been uniformly sampled without replacement from the (unknown) full domain and (ii) the sample is complete for these domain elements (i.e. it is the full substructure induced by these elements). Within this setting, we study bounds on the error of sufficient statistics of relational models that are estimated on the available data. As our main result, we prove a bound based on a variant of the Vapnik-Chervonenkis dimension which is suitable for relational data.
Ensembles are popular methods for solving practical supervised learning problems. They reduce the risk of having underperforming models in production-grade software. Although critical, methods for learning heterogeneous regression ensembles have not been proposed at large scale, whereas in classical ML literature, stacking, cascading and voting are mostly restricted to classification problems. Regression poses distinct learning challenges that may result in poor performance, even when using well established homogeneous ensemble schemas such as bagging or boosting. In this paper, we introduce MetaBags, a novel, practically useful stacking framework for regression. MetaBags is a meta-learning algorithm that learns a set of meta-decision trees designed to select one base model (i.e. expert) for each query, and focuses on inductive bias reduction. A set of meta-decision trees are learned using different types of meta-features, specially created for this purpose – to then be bagged at meta-level. This procedure is designed to learn a model with a fair bias-variance trade-off, and its improvement over base model performance is correlated with the prediction diversity of different experts on specific input space subregions. The proposed method and meta-features are designed in such a way that they enable good predictive performance even in subregions of space which are not adequately represented in the available training data. An exhaustive empirical testing of the method was performed, evaluating both generalization error and scalability of the approach on synthetic, open and real-world application datasets. The obtained results show that our method significantly outperforms existing state-of-the-art approaches.
Deep Reinforcement Learning (RL) recently emerged as one of the most competitive approaches for learning in sequential decision making problems with fully observable environments, e.g., computer Go. However, very little work has been done in deep RL to handle partially observable environments. We propose a new architecture called Action-specific Deep Recurrent Q-Network (ADRQN) to enhance learning performance in partially observable domains. Actions are encoded by a fully connected layer and coupled with a convolutional observation to form an action-observation pair. The time series of action-observation pairs are then integrated by an LSTM layer that learns latent states based on which a fully connected layer computes Q-values as in conventional Deep Q-Networks (DQNs). We demonstrate the effectiveness of our new architecture in several partially observable domains, including flickering Atari games.