Current machine learning systems operate, almost exclusively, in a statistical, or model-free mode, which entails severe theoretical limits on their power and performance. Such systems cannot reason about interventions and retrospection and, therefore, cannot serve as the basis for strong AI. To achieve human level intelligence, learning machines need the guidance of a model of reality, similar to the ones used in causal inference tasks. To demonstrate the essential role of such models, I will present a summary of seven tasks which are beyond reach of current machine learning systems and which have been accomplished using the tools of causal modeling.
Several deep convolutional neural networks have been proposed for time-series classification and class-imbalanced problems. However, those models performed poorly on, and sometimes failed to recognize, the minority class of an imbalanced time-series dataset. Minority samples cause trouble for temporal deep learning classifiers because the majority and minority classes are treated equally. Until recently, few works applied deep learning to imbalanced time-series classification (ITSC) tasks. This paper tackles ITSC problems with deep learning. An adaptive cost-sensitive learning strategy was proposed to modify temporal deep learning models. Through the proposed strategy, classifiers could automatically assign misclassification penalties to each class. In the experimental section, the proposed method was used to modify five neural networks. They were evaluated on a large, real-life, imbalanced time-series dataset with six metrics. Each network was also tested alone and in combination with several mainstream data samplers. Experimental results showed that the proposed cost-sensitive modified networks worked well on ITSC tasks. Compared to other methods, the cost-sensitive convolutional neural network and residual network performed best on all metrics. Consequently, the proposed cost-sensitive learning strategy can be used to convert deep learning classifiers from cost-insensitive to cost-sensitive. The resulting cost-sensitive convolutional networks can be applied effectively to ITSC problems.
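As a rough illustration of what per-class misclassification penalties look like in a loss function, the sketch below scales cross-entropy by class weights. The inverse-frequency weighting rule and the function names are assumptions made for this example; the abstract does not say how its adaptive strategy derives the penalties.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class penalties proportional to 1 / class frequency (assumed rule)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)  # avoid division by zero for empty classes
    return counts.sum() / (num_classes * counts)

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean cross-entropy with each sample scaled by the penalty of its class."""
    eps = 1e-12  # guards against log(0)
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(class_weights[labels] * -np.log(picked + eps)))

# Toy imbalanced labels: eight majority samples, two minority samples.
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
weights = inverse_frequency_weights(labels, num_classes=2)

# An uninformative classifier that predicts 0.5 for every class.
probs = np.full((10, 2), 0.5)
loss = weighted_cross_entropy(probs, labels, weights)
```

With these weights, an error on a minority sample costs four times as much as an error on a majority sample, which is the effect a cost-sensitive classifier relies on.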
Over the past decade, multivariate time series classification has been receiving a lot of attention. We propose augmenting the existing univariate time series classification models, LSTM-FCN and ALSTM-FCN with a squeeze and excitation block to further improve performance. Our proposed models outperform most of the state of the art models while requiring minimum preprocessing. The proposed models work efficiently on various complex multivariate time series classification tasks such as activity recognition or action recognition. Furthermore, the proposed models are highly efficient at test time and small enough to deploy on memory constrained systems.
Brain electroencephalography (EEG) classification has been widely applied to analyze cerebral diseases in recent years. Unfortunately, invalid or noisy EEGs degrade diagnostic performance, and most previously developed methods ignore the need for EEG selection before classification. To this end, this paper proposes a novel maximum weight clique-based EEG selection approach, named mwcEEGs, which maps EEG selection to searching for maximum similarity-weighted cliques in an improved Fr\'{e}chet distance-weighted undirected EEG graph, simultaneously considering edge weights and vertex weights. mwcEEGs improves classification performance by selecting intra-clique pairwise similar and inter-clique discriminative EEGs with similarity threshold $\delta$. Experimental results demonstrate the effectiveness of the algorithm compared with state-of-the-art time series selection algorithms on real-world EEG datasets.
The problem of quickest detection of a change in distribution is considered under the assumption that the pre-change distribution is known, and the post-change distribution is only known to belong to a family of distributions distinguishable from a discretized version of the pre-change distribution. A sequential change detection procedure is proposed that partitions the sample space into a finite number of bins, and monitors the number of samples falling into each of these bins to detect the change. A test statistic that approximates the generalized likelihood ratio test is developed. It is shown that the proposed test statistic can be efficiently computed using a recursive update scheme, and a procedure for choosing the number of bins in the scheme is provided. Various asymptotic properties of the test statistic are derived to offer insights into its performance trade-off between average detection delay and average run length to a false alarm. Testing on synthetic and real data demonstrates that the performance of our approach is comparable to or better than that of existing non-parametric change detection methods.
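The binning idea can be sketched with a brute-force generalized likelihood ratio over bin counts. This is only an illustration under stated assumptions: it recomputes the exact binned statistic over every candidate change point instead of using the recursive approximation described in the abstract, and the function name, bin edges and threshold are hypothetical.

```python
import numpy as np

def binned_glr_alarm(samples, edges, p0, threshold):
    """Return the 1-based index of the first alarm, or None if none fires.

    p0 holds the known pre-change probability of each bin (all assumed > 0).
    """
    p0 = np.asarray(p0, dtype=float)
    bins = np.digitize(samples, edges)      # bin index of every sample
    onehot = np.eye(len(p0))[bins]          # shape (n_samples, n_bins)
    for t in range(1, len(bins) + 1):
        # counts[i] holds the bin counts of the most recent i + 1 samples
        counts = np.cumsum(onehot[:t][::-1], axis=0)
        sizes = np.arange(1, t + 1)[:, None]
        expected = sizes * p0
        ratio = np.where(counts > 0, counts / expected, 1.0)
        # window size times KL(empirical bin frequencies || p0)
        stats = np.sum(counts * np.log(ratio), axis=1)
        if stats.max() > threshold:
            return t
    return None

# Deterministic toy data: balanced bins before the change, one bin after it.
pre = [0.2, 0.8] * 4
post = [0.8] * 6
alarm = binned_glr_alarm(np.array(pre + post), edges=[0.5], p0=[0.5, 0.5], threshold=3.0)
no_alarm = binned_glr_alarm(np.array(pre * 2), edges=[0.5], p0=[0.5, 0.5], threshold=3.0)
```

The brute-force loop costs quadratic time in the number of samples, which is the kind of overhead a recursive update scheme is meant to avoid.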
Machine Learning (ML) and Deep Learning (DL) are two technologies used to extract representations of the data for a specific purpose. ML algorithms take a set of data as input to generate one or several predictions. To define the final version of a model, there is usually an initial step devoted to training the algorithm (obtaining the right final values of the model parameters). There are several techniques, from supervised learning to reinforcement learning, which have different requirements. On the market, there are some frameworks or APIs that reduce the effort of designing a new ML model. In this report, using the benchmark DLBENCH, we will analyse the performance and the execution modes of some well-known ML frameworks on the Finis Terrae II supercomputer when supervised learning is used. The report will show that placement of data and allocated hardware can have a large influence on the final time-to-solution.
Density estimation is an interdisciplinary topic at the intersection of statistics, theoretical computer science and machine learning. We review some old and new techniques for bounding sample complexity of estimating densities of continuous distributions, focusing on the class of mixtures of Gaussians and its subclasses.
In recent years, there have been tremendous advancements in the field of machine learning. These advancements have been made through both academic as well as industrial research. Lately, a fair amount of research has been dedicated to the usage of generative models in the field of computer vision and image classification. These generative models have been popularized through a new framework called Generative Adversarial Networks. Moreover, many modified versions of this framework have been proposed in the last two years. We study the original model proposed by Goodfellow et al. as well as modifications over the original model and provide a comparative analysis of these models.
We present a noise-injected version of the Expectation-Maximization (EM) algorithm: the Noisy Expectation Maximization (NEM) algorithm. The NEM algorithm uses noise to speed up the convergence of the EM algorithm. The NEM theorem shows that injected noise speeds up the average convergence of the EM algorithm to a local maximum of the likelihood surface if a positivity condition holds. The generalized form of the NEM algorithm allows for arbitrary modes of noise injection, including adding and multiplying noise to the data. We demonstrate these noise benefits on EM algorithms for the Gaussian mixture model (GMM) with both additive and multiplicative NEM noise injection. A separate theorem (not presented here) shows that the noise benefit for independent identically distributed additive noise decreases with sample size in mixture models. This theorem implies that the noise benefit is most pronounced if the data is sparse. Injecting blind noise only slowed convergence.
Multiple imputation is a straightforward method for handling missing data in a principled fashion. This paper presents an overview of multiple imputation, including important theoretical results and their practical implications for generating and using multiple imputations. A review of strategies for generating imputations follows, including recent developments in flexible joint modeling and sequential regression/chained equations/fully conditional specification approaches. Finally, we compare and contrast different methods for generating imputations on a range of criteria before identifying promising avenues for future research.
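For reference, the standard combining rules (Rubin's rules) for analyzing $m$ completed datasets can be written as follows. The symbols $\hat{Q}_i$ (the point estimate from the $i$-th completed dataset) and $U_i$ (its estimated variance) are notation assumed here, not taken from the abstract.

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^2, \qquad
T = \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B .
```

Here $\bar{Q}$ is the pooled estimate and $T$ its total variance, which combines the within-imputation variance $\bar{U}$ with the between-imputation variance $B$.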
We argue that the estimation of the mutual information between high dimensional continuous random variables is achievable by gradient descent over neural networks. This paper presents a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size. MINE is back-propable and we prove that it is strongly consistent. We illustrate a handful of applications in which MINE is successfully applied to enhance the properties of generative models in both unsupervised and supervised settings. We apply our framework to estimate the information bottleneck, and apply it in tasks related to supervised classification problems. Our results demonstrate substantial added flexibility and improvement in these settings.
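A standard variational lower bound that a neural estimator of this kind can maximize by gradient ascent is the Donsker-Varadhan representation of the KL divergence. The statistics network $T_\theta$ and the parameter set $\Theta$ are notation assumed here; the abstract itself does not state the objective.

```latex
I(X;Z) \;=\; D_{\mathrm{KL}}\bigl(P_{XZ}\,\|\,P_X \otimes P_Z\bigr)
\;\ge\; \sup_{\theta \in \Theta}\;
\mathbb{E}_{P_{XZ}}\bigl[T_\theta(x,z)\bigr]
\;-\; \log \mathbb{E}_{P_X \otimes P_Z}\bigl[e^{T_\theta(x,z)}\bigr].
```

Both expectations can be approximated with minibatch averages, the second by shuffling $z$ across the batch to draw from the product of marginals, so the bound is differentiable in $\theta$.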
Information collection is a fundamental problem in big data, where the size of sampling sets plays a very important role. This work considers the information collection process by taking message importance into account. Similar to differential entropy, we define the differential message importance measure (DMIM) as a measure of message importance for continuous random variables. We prove that the change in DMIM describes the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to the Kolmogorov-Smirnov statistic, but it offers a new way to characterize distribution goodness-of-fit. Numerical results show some basic properties of DMIM and the accuracy of the proposed approximate values. Furthermore, we show that the empirical distribution approaches the real distribution as the DMIM deviation decreases, which helps in selecting suitable sampling points in practical systems.
Information transfer, which reveals the state variation of variables, can play a vital role in big data analytics and processing. In fact, a measure of information transfer can reflect system change from the statistics by using the variable distributions, similar to KL divergence and Renyi divergence. Furthermore, in terms of information transfer in big data, small-probability events dominate the importance of the total message to some degree. Therefore, it is important to design an information transfer measure based on message importance that emphasizes small-probability events. In this paper, we propose the message importance divergence (MID) and investigate its characteristics and applications in three aspects. First, the message importance transfer capacity based on MID is presented to offer an upper bound for information transfer with disturbance. Then, we use the MID to guide queue length selection, a fundamental problem of considerable social and academic value in the caching operation of mobile edge computing. Finally, we extend the MID to the continuous case and discuss its robustness by using it to measure information distance.
Salient object detection is a fundamental problem and has received a great deal of attention in computer vision. Recently, deep learning models have become a powerful tool for image feature extraction. In this paper, we propose a multi-scale deep neural network (MSDNN) for salient object detection. The proposed model first extracts global high-level features and context information over the whole source image with a recurrent convolutional neural network (RCNN). Then several stacked deconvolutional layers are adopted to get the multi-scale feature representation and obtain a series of saliency maps. Finally, we investigate a fusion convolution module (FCM) to build a final pixel-level saliency map. The proposed model is extensively evaluated on four salient object detection benchmark datasets. Results show that our deep model significantly outperforms 12 other state-of-the-art approaches.
This paper proposes a fully connected neural network model to map samples from a uniform distribution to samples of any explicitly known probability density function. During the training, the Jensen-Shannon divergence between the distribution of the model’s output and the target distribution is minimized. We experimentally demonstrate that our model converges towards the desired state. It provides an alternative to existing sampling methods such as inversion sampling, rejection sampling, Gaussian mixture models and Markov-Chain-Monte-Carlo. Our model has high sampling efficiency and is easily applied to any probability distribution, without the need of further analytical or numerical calculations. It can produce correlated samples, such that the output distribution converges faster towards the target than for independent samples. But it is also able to produce independent samples, if single values are fed into the network and the input values are independent as well. We focus on one-dimensional sampling, but additionally illustrate a two-dimensional example with a target distribution of dependent variables.
Automated decision making systems are increasingly being used in real-world applications. In these systems for the most part, the decision rules are derived by minimizing the training error on the available historical data. Therefore, if there is a bias related to a sensitive attribute such as gender, race, religion, etc. in the data, say, due to cultural/historical discriminatory practices against a certain demographic, the system could continue discrimination in decisions by including the said bias in its decision rule. We present an information theoretic framework for designing fair predictors from data, which aim to prevent discrimination against a specified sensitive attribute in a supervised learning setting. We use equalized odds as the criterion for discrimination, which demands that the prediction should be independent of the protected attribute conditioned on the actual label. To ensure fairness and generalization simultaneously, we compress the data to an auxiliary variable, which is used for the prediction task. This auxiliary variable is chosen such that it is decontaminated from the discriminatory attribute in the sense of equalized odds. The final predictor is obtained by applying a Bayesian decision rule to the auxiliary variable.
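Equalized odds can be checked empirically by comparing, for each true label, the rate of positive predictions across groups. The sketch below is a minimal diagnostic under stated assumptions (binary labels and predictions, every label-group cell non-empty). It is not the information theoretic framework of the abstract, and the function name is hypothetical.

```python
import numpy as np

def equalized_odds_gap(y_true, y_pred, group):
    """Largest gap across groups in P(y_pred = 1 | y_true = y, group = a)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    gap = 0.0
    for y in (0, 1):
        # positive-prediction rate for each group among samples with label y
        rates = [y_pred[(y_true == y) & (group == a)].mean() for a in np.unique(group)]
        gap = max(gap, max(rates) - min(rates))
    return float(gap)

y_true = [0, 0, 1, 1, 0, 0, 1, 1]
group = [0, 0, 0, 0, 1, 1, 1, 1]
biased_pred = [0, 1, 1, 1, 0, 0, 1, 0]

biased_gap = equalized_odds_gap(y_true, biased_pred, group)
perfect_gap = equalized_odds_gap(y_true, y_true, group)
```

A gap of zero means the prediction is empirically independent of the protected attribute given the true label; the biased predictor above has a gap of 0.5.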
Going deeper and wider in neural architectures improves accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desirable network architectures, or nontrivially dissect a network across multiple GPUs. These distract DL practitioners from concentrating on their original machine learning tasks. We present SuperNeurons: a dynamic GPU memory scheduling runtime that enables network training far beyond the GPU DRAM capacity. SuperNeurons features 3 memory optimizations, \textit{Liveness Analysis}, \textit{Unified Tensor Pool}, and \textit{Cost-Aware Recomputation}, which together reduce the network-wide peak memory usage down to the maximal memory usage among layers. We also address the performance issues in those memory saving techniques. Given the limited GPU DRAM, SuperNeurons not only provisions the necessary memory for training, but also dynamically allocates memory for convolution workspaces to achieve high performance. Evaluations against Caffe, Torch, MXNet and TensorFlow have demonstrated that SuperNeurons trains networks at least $3.2432\times$ deeper than current ones with leading performance. In particular, SuperNeurons can train ResNet2500, which has $10^4$ basic network layers, on a 12GB K40c.
ConvNets have been very effective in many applications where it is required to learn invariances to within-class nuisance transformations. However, through their architecture, ConvNets only enforce invariance to translation. In this paper, we introduce a new class of convolutional architectures called Non-Parametric Transformation Networks (NPTNs) which can learn general invariances and symmetries directly from data. NPTNs are a direct and natural generalization of ConvNets and can be optimized directly using gradient descent. They make no assumption regarding the structure of the invariances present in the data and in that respect are very flexible and powerful. We also model ConvNets and NPTNs under a unified framework called Transformation Networks, which establishes the natural connection between the two. We demonstrate the efficacy of NPTNs on natural data such as MNIST and CIFAR 10, where they outperform ConvNet baselines with the same number of parameters. We show that they are effective in learning invariances unknown a priori directly from data from scratch. Finally, we apply NPTNs to Capsule Networks and show that they enable them to perform even better.
Text Mining is a field that aims at extracting information from textual data. One of the challenges of this field comes from the pre-processing stage, in which a structured vector representation must be extracted from unstructured data. The common extraction creates large and sparse vectors representing the importance of each term to a document. This usually leads to the curse of dimensionality that plagues most machine learning algorithms. To cope with this issue, in this paper we propose a new supervised feature extraction and reduction algorithm, named DCDistance, that creates features based on the distance from a document to a representative of each class label. As such, the proposed technique can reduce the feature set by more than 99% of the original set. Additionally, the algorithm improved classification accuracy over a set of benchmark datasets when compared to traditional and state-of-the-art feature selection algorithms.
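The idea of replacing term vectors with distances to class representatives can be sketched as follows. The choice of the class centroid as the representative and of Euclidean distance are assumptions made for this illustration; the abstract does not specify either, and the function names are hypothetical.

```python
import numpy as np

def class_representatives(X, y):
    """One representative per class, taken here as the class centroid (assumed)."""
    classes = np.unique(y)
    reps = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, reps

def distance_features(X, reps):
    """Map each document to its Euclidean distance from every representative."""
    return np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)

# Four toy documents over a four-term vocabulary, two classes.
X = np.array([[1.0, 0.0, 0.0, 0.0],
              [3.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 0.0],
              [0.0, 0.0, 4.0, 0.0]])
y = np.array([0, 0, 1, 1])

classes, reps = class_representatives(X, y)
features = distance_features(X, reps)  # shape: (documents, classes)
```

The new representation has one column per class instead of one per term, which is where a large reduction comes from when the vocabulary has many thousands of terms and there are only a few classes.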
Learning a classifier with control on the false-positive rate plays a critical role in many machine learning applications. Existing approaches either introduce prior knowledge dependent label cost or tune parameters based on traditional classifiers, which lack consistency in methodology because they do not strictly adhere to the false-positive rate constraint. In this paper, we propose a novel scoring-thresholding approach, tau-False Positive Learning (tau-FPL), to address this problem. We show that the scoring problem, which takes the false-positive rate tolerance into account, can be efficiently solved in linear time, and that an out-of-bootstrap thresholding method can transform the learned ranking function into a low false-positive classifier. Both theoretical analysis and experimental results show superior performance of the proposed tau-FPL over existing approaches.
The growth of big data in domains such as Earth Sciences, Social Networks, Physical Sciences, etc. has led to an immense need for efficient and scalable linear algebra operations, e.g. matrix inversion. Existing methods for efficient and distributed matrix inversion using big data platforms rely on LU decomposition based block-recursive algorithms. However, these algorithms are complex and require a lot of side calculations, e.g. matrix multiplication, at various levels of recursion. In this paper, we propose a different scheme based on Strassen’s matrix inversion algorithm (mentioned in Strassen’s original paper in 1969), which uses far fewer operations at each level of recursion. We implement the proposed algorithm, and through extensive experimentation, show that it is more efficient than the state of the art methods. Furthermore, we provide a detailed theoretical analysis of the proposed algorithm, and derive theoretical running times which match closely with the empirically observed wall clock running times, thus explaining the U-shaped behaviour w.r.t. block-sizes.
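The block-recursive structure of Strassen's inversion scheme can be sketched on a single machine with NumPy. This is an illustrative, non-distributed sketch rather than the implementation evaluated in the abstract. It assumes that the top-left block and the Schur complement are invertible at every level (true for symmetric positive definite inputs), and the names `strassen_inverse` and `leaf_size` are hypothetical.

```python
import numpy as np

def strassen_inverse(A, leaf_size=1):
    """Invert A by recursing on 2x2 block partitions (Strassen, 1969)."""
    n = A.shape[0]
    if n <= leaf_size or n % 2:
        return np.linalg.inv(A)           # fall back to a direct inverse
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    R1 = strassen_inverse(A11, leaf_size)  # A11^{-1}
    R2 = A21 @ R1
    R3 = R1 @ A12
    R4 = A21 @ R3
    R5 = R4 - A22                          # negated Schur complement of A11
    R6 = strassen_inverse(R5, leaf_size)
    C12 = R3 @ R6
    C21 = R6 @ R2
    C11 = R1 - R3 @ C21
    C22 = -R6
    return np.block([[C11, C12], [C21, C22]])

# Symmetric positive definite test matrix, so every recursive block is invertible.
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8.0 * np.eye(8)
A_inv = strassen_inverse(A)
```

Each level performs two recursive inversions and six half-size matrix multiplications, which gives a sense of why the per-level work is small compared with block LU approaches.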
Multi-relation Question Answering is a challenging task because it requires elaborate analysis of questions and reasoning over multiple fact triples in a knowledge base. In this paper, we present a novel model called Interpretable Reasoning Network that employs an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decides which part of an input question should be analyzed at each hop; predicts a relation that corresponds to the current parsed results; utilizes the predicted relation to update the question representation and the state of the reasoning process; and then drives the next-hop reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model can offer traceable and observable intermediate predictions for reasoning analysis and failure diagnosis.
Learning similarity functions between image pairs with deep neural networks yields highly correlated activations of embeddings. In this work, we show how to improve the robustness of such embeddings by exploiting the independence within ensembles. To this end, we divide the last embedding layer of a deep network into an embedding ensemble and formulate training this ensemble as an online gradient boosting problem. Each learner receives a reweighted training sample from the previous learners. Further, we propose two loss functions which increase the diversity in our ensemble. These loss functions can be applied either for weight initialization or during training. Together, our contributions leverage large embedding sizes more effectively by significantly reducing correlation of the embedding and consequently increase retrieval accuracy of the embedding. Our method works with any differentiable loss function and does not introduce any additional parameters during test time. We evaluate our metric learning method on image retrieval tasks and show that it improves over state-of-the-art methods on the CUB 200-2011, Cars-196, Stanford Online Products, In-Shop Clothes Retrieval and VehicleID datasets.
We propose Machines Talking To Machines (M2M), a framework combining automation and crowdsourcing to rapidly bootstrap end-to-end dialogue agents for goal-oriented dialogues in arbitrary domains. M2M scales to new tasks with just a task schema and an API client from the dialogue system developer, but it is also customizable to cater to task-specific interactions. Compared to the Wizard-of-Oz approach for data collection, M2M achieves greater diversity and coverage of salient dialogue flows while maintaining the naturalness of individual utterances. In the first phase, a simulated user bot and a domain-agnostic system bot converse to exhaustively generate dialogue ‘outlines’, i.e. sequences of template utterances and their semantic parses. In the second phase, crowd workers provide contextual rewrites of the dialogues to make the utterances more natural while preserving their meaning. The entire process can finish within a few hours. We propose a new corpus of 3,000 dialogues spanning 2 domains collected with M2M, and present comparisons with popular dialogue datasets on the quality and diversity of the surface forms and dialogue flows.
Database applications are typically written using a mixture of imperative languages and declarative frameworks for data processing. Application logic gets distributed across the declarative and imperative parts of a program. Often, there is more than one way to implement the same program, whose efficiency may depend on a number of parameters. In this paper, we propose a framework that automatically generates all equivalent alternatives of a given program using a given set of program transformations, and chooses the least cost alternative. We use the concept of program regions as an algebraic abstraction of a program and extend the Volcano/Cascades framework for optimization of algebraic expressions, to optimize programs. We illustrate the use of our framework for optimizing database applications. We show through experimental results, that our framework has wide applicability in real world applications and provides significant performance benefits.