Data frames in scripting languages are essential abstractions for processing structured data. However, existing data frame solutions are either not distributed (e.g., Pandas in Python) and therefore have limited scalability, or they are not tightly integrated with array computations (e.g., Spark SQL). This paper proposes a novel compiler-based approach where we integrate data frames into the High Performance Analytics Toolkit (HPAT) to build HiFrames. It provides expressive and flexible data frame APIs which are tightly integrated with array operations. HiFrames then automatically parallelizes and compiles relational operations along with other array computations in end-to-end data analytics programs, and generates efficient MPI/C++ code. We demonstrate that HiFrames is significantly faster than alternatives such as Spark SQL on clusters, without forcing the programmer to switch to embedded SQL for part of the program. HiFrames is 3.6x to 70x faster than Spark SQL for basic relational operations, and can be up to 20,000x faster for advanced analytics operations, such as weighted moving averages (WMA), that the map-reduce paradigm cannot handle effectively. HiFrames is also 5x faster than Spark SQL for TPCx-BB Q26 on 64 nodes of the Cori supercomputer.
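As a rough illustration of the weighted moving average (WMA) operation mentioned above, the following single-node NumPy sketch computes a WMA over a sliding window. The function name, the window length, and the choice of weights are assumptions made for illustration only; this is not the HiFrames API and says nothing about how HiFrames parallelizes the operation.

```python
import numpy as np

def weighted_moving_average(values, weights):
    """Hypothetical helper: WMA where weights[-1] applies to the newest sample."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # np.convolve flips its second argument, so the weights are reversed
    # first; each output is then sum(window * weights) / sum(weights).
    return np.convolve(values, weights[::-1], mode="valid") / weights.sum()

# Example with a window of 3 and linearly increasing weights (an assumption).
wma = weighted_moving_average([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```

Each output depends on a window of neighbouring rows rather than on a single row, which is the kind of stencil-like access pattern the abstract describes as hard for map-reduce.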
In this paper, we introduce an algorithm for performing spectral clustering efficiently. Spectral clustering is a powerful clustering algorithm that suffers from high computational complexity, due to eigendecomposition. In this work, we first build the adjacency matrix of the corresponding graph of the dataset. To build this matrix, we only consider a limited number of points, called landmarks, and compute the similarity of all data points with the landmarks. Then, we present a definition of the Laplacian matrix of the graph that enables us to perform eigendecomposition efficiently, using a deep autoencoder. The overall complexity of the algorithm for eigendecomposition is $O(np)$, where $n$ is the number of data points and $p$ is the number of landmarks. Finally, we evaluate the performance of the algorithm in different experiments.
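The landmark step described above can be sketched as follows. This is a minimal sketch under stated assumptions: landmarks are chosen uniformly at random, similarity is a Gaussian kernel with bandwidth `sigma`, and the function name is hypothetical. The abstract does not specify any of these choices, and the autoencoder-based eigendecomposition is not shown.

```python
import numpy as np

def landmark_affinity(X, p, sigma=1.0, seed=0):
    """Hypothetical helper: n x p similarity of all points to p landmarks."""
    rng = np.random.default_rng(seed)
    # Assumption: landmarks are sampled uniformly without replacement.
    idx = rng.choice(X.shape[0], size=p, replace=False)
    landmarks = X[idx]
    # Squared Euclidean distance from every point to every landmark (n x p).
    sq_dist = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    # Assumption: Gaussian (RBF) similarity.
    return np.exp(-sq_dist / (2.0 * sigma ** 2)), idx

X = np.random.default_rng(1).normal(size=(50, 3))
W, landmark_idx = landmark_affinity(X, p=5)
```

The resulting matrix has $n \times p$ entries rather than $n \times n$, which is where the dependence on $np$ comes from.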
Standard probabilistic linear discriminant analysis (PLDA) for speaker recognition assumes that the sample’s features (usually, i-vectors) are given by a sum of three terms: a term that depends on the speaker identity, a term that models the within-speaker variability and is assumed independent across samples, and a final term that models any remaining variability and is also independent across samples. In this work, we propose a generalization of this model where the within-speaker variability is not necessarily assumed independent across samples but dependent on another discrete variable. This variable, which we call the channel variable as in the standard PLDA approach, could be, for example, a discrete category for the channel characteristics, the language spoken by the speaker, the type of speech in the sample (conversational, monologue, read), etc. The value of this variable is assumed to be known during training but not during testing. Scoring is performed, as in standard PLDA, by computing a likelihood ratio between the null hypothesis that the two sides of a trial belong to the same speaker versus the alternative hypothesis that the two sides belong to different speakers. The two likelihoods are computed by marginalizing over two hypotheses about the channels on both sides of a trial: that they are the same and that they are different. This way, we expect that the new model will be better at coping with same-channel versus different-channel trials than standard PLDA, since knowledge about the channel (or language, or speech style) is used during training and implicitly considered during scoring.
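The scoring rule described above can be written compactly as follows. The notation is an assumption introduced here for illustration: $x_1, x_2$ are the two sides of a trial, $H_s$ and $H_d$ are the same-speaker and different-speaker hypotheses, $C_s$ and $C_d$ are the same-channel and different-channel hypotheses, and $P(C)$ is a prior over the channel hypotheses, which the abstract does not specify.

```latex
\[
\mathrm{LR}(x_1, x_2) = \frac{P(x_1, x_2 \mid H_s)}{P(x_1, x_2 \mid H_d)},
\qquad
P(x_1, x_2 \mid H) = \sum_{C \in \{C_s,\, C_d\}} P(x_1, x_2 \mid H, C)\, P(C),
\quad H \in \{H_s, H_d\}.
\]
```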
The Temporal Group LASSO is an example of a multi-task, regularized regression approach for the prediction of response variables that vary over time. The aim of this work is to introduce the reader to the concepts behind the Temporal Group LASSO and its related methods, as well as to the method's potential applications in a healthcare setting. We argue that the method is attractive because of its ability to reduce overfitting, select predictors, and learn smooth effect patterns over time, and, finally, because of its simplicity.
This paper describes structuring data and constructing plots to explore forest classification models interactively. A forest classifier is an example of an ensemble, produced by bagging multiple trees. The process of bagging and combining results from multiple trees produces numerous diagnostics which, with interactive graphics, can provide a lot of insight into class structure in high dimensions. Various aspects are explored in this paper, to assess model complexity, individual model contributions, variable importance and dimension reduction, and uncertainty in prediction associated with individual observations. The ideas are applied to the random forest algorithm, and to the projection pursuit forest, but could be more broadly applied to other bagged ensembles. Interactive graphics are built in R, using the ggplot2, plotly, and shiny packages.
The Multi-Label Classification toolbox is a MATLAB/OCTAVE library for Multi-Label Classification (MLC). There exist a few Java libraries for MLC, but no MATLAB/OCTAVE library that covers various methods. This toolbox offers an environment for evaluation, comparison and visualization of the MLC results. One attraction of this toolbox is that it enables us to try many combinations of feature space dimension reduction, sample clustering, label space dimension reduction and ensemble, etc.
Undoubtedly, MapReduce is the most powerful programming paradigm in distributed computing. Enhancing MapReduce is essential, as it can make computation faster. Therefore, there are many scheduling algorithms to discuss based on their characteristics. Moreover, there are many shortcomings to discover in this field. In this article, we present state-of-the-art scheduling algorithms to enhance the understanding of these algorithms. The algorithms are presented systematically so that this article can open up future possibilities in scheduling algorithm design. In this paper, we provide in-depth insight into MapReduce scheduling algorithms. In addition, we discuss various issues of MapReduce schedulers developed for large-scale computing as well as heterogeneous environments.
In this paper, we present a new feature selection method that is suitable for both unsupervised and supervised problems. We build upon the recently proposed Infinite Feature Selection (IFS) method where feature subsets of all sizes (including infinity) are considered. We extend IFS in two ways. First, we propose a supervised version of it. Second, we propose new ways of forming the feature adjacency matrix that perform better for unsupervised problems. We extensively evaluate our methods on many benchmark datasets, including large image-classification datasets (PASCAL VOC), and show that our methods outperform both the IFS and the widely used ‘minimum-redundancy maximum-relevancy (mRMR)’ feature selection algorithm.
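To make the idea of considering feature subsets of all sizes concrete, the sketch below scores features by summing the weights of paths of every length through a feature adjacency matrix, using the closed form of the geometric matrix series. This follows the commonly cited formulation of the original IFS method as recalled here, not anything stated in the abstract. The function name, the damping factor `alpha`, and the toy adjacency matrix are assumptions, and neither the supervised extension nor the new adjacency constructions proposed in the paper are shown.

```python
import numpy as np

def infinite_path_scores(A, alpha=0.9):
    """Hypothetical helper: score features by summing paths of all lengths."""
    A = np.asarray(A, dtype=float)
    # Scale A so the series sum_{l>=1} (r*A)^l converges (spectral radius < 1).
    rho = np.abs(np.linalg.eigvals(A)).max()
    r = alpha / rho
    identity = np.eye(A.shape[0])
    # Closed form of the geometric series: (I - rA)^{-1} - I.
    S = np.linalg.inv(identity - r * A) - identity
    # A feature's score is the total weight of all paths starting from it.
    return S.sum(axis=1)

# Toy symmetric adjacency matrix over 4 features (values are made up).
rng = np.random.default_rng(0)
B = rng.random((4, 4))
scores = infinite_path_scores((B + B.T) / 2.0)
ranking = np.argsort(-scores)  # features ordered from highest to lowest score
```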
Most popular word embedding techniques involve implicit or explicit factorization of a word co-occurrence based matrix into low rank factors. In this paper, we aim to generalize this trend by using numerical methods to factor higher-order word co-occurrence based arrays, or \textit{tensors}. We present four word embeddings using tensor factorization and analyze their advantages and disadvantages. One of our main contributions is a novel joint symmetric tensor factorization technique related to the idea of coupled tensor factorization. We show that embeddings based on tensor factorization can be used to discern the various meanings of polysemous words without being explicitly trained to do so, and motivate the intuition behind why this works in a way that it doesn’t with existing methods. We also modify an existing word embedding evaluation metric known as Outlier Detection [Camacho-Collados and Navigli, 2016] to evaluate the quality of the order-$N$ relations that a word embedding captures, and show that tensor-based methods outperform existing matrix-based methods at this task. Experimentally, we show that all of our word embeddings either outperform or are competitive with state-of-the-art baselines commonly used today on a variety of recent datasets. Suggested applications of tensor factorization-based word embeddings are given, and all source code and pre-trained vectors are publicly available online.
In this work we explore a straightforward variational Bayes scheme for Recurrent Neural Networks. Firstly, we show that a simple adaptation of truncated backpropagation through time can yield good quality uncertainty estimates and superior regularisation at only a small extra computational cost during training. Secondly, we demonstrate how a novel kind of posterior approximation yields further improvements to the performance of Bayesian RNNs. We incorporate local gradient information into the approximate posterior to sharpen it around the current batch statistics. This technique is not exclusive to recurrent neural networks and can be applied more widely to train Bayesian neural networks. We also empirically demonstrate how Bayesian RNNs are superior to traditional RNNs on a language modelling benchmark and an image captioning task, as well as showing how each of these methods improves our model over a variety of other schemes for training them. We also introduce a new benchmark for studying uncertainty for language models so future methods can be easily compared.
Improving the precision of heart disease detection has been investigated by many researchers in the literature. Such improvement is motivated by overwhelming health care expenditures and erroneous diagnoses. As a result, various methodologies have been proposed to analyze the disease factors, aiming to decrease physicians' practice variation and reduce medical costs and errors. In this paper, our main motivation is to develop an effective intelligent medical decision support system based on data mining techniques. In this context, five data mining classification algorithms (Naïve Bayes, Decision Tree, Discriminant, Random Forest, and Support Vector Machine) have been applied to large datasets to assess and analyze the risk factors statistically related to heart disease and to compare the performance of the implemented classifiers. To underscore the practical viability of our approach, the selected classifiers have been implemented in MATLAB with two datasets. Results of the conducted experiments showed that all classification algorithms are predictive and can give relatively correct answers. However, the decision tree outperforms the other classifiers with an accuracy rate of 99.0%, followed by Random Forest. This is because both have relatively similar mechanisms, but the Random Forest builds an ensemble of decision trees. Although ensemble learning has been shown to produce superior results, in our case the decision tree has outperformed its ensemble version.
This paper describes an intuitive generalization of Generative Adversarial Networks (GANs) to generate samples while capturing diverse modes of the true data distribution. Firstly, we propose a very simple and intuitive multi-agent GAN architecture that incorporates multiple generators capable of generating samples from high probability modes. Secondly, in order to force different generators to generate samples from diverse modes, we propose two extensions to the standard GAN objective function. (1) We augment the generator-specific GAN objective function with a diversity-enforcing term that encourages different generators to generate diverse samples using a user-defined similarity-based function. (2) We modify the discriminator objective function so that, along with distinguishing real from fake samples, the discriminator has to predict which generator produced a given fake sample. Intuitively, in order to succeed in this task, the discriminator must learn to push different generators towards different identifiable modes. Our framework is generalizable in the sense that it can be easily combined with other existing variants of GANs to produce diverse samples. Experimentally, we show that our framework is able to produce high quality diverse samples for challenging tasks such as image/face generation and image-to-image translation. We also show that it is capable of learning a better feature representation in an unsupervised setting.
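The second extension, in which the discriminator must also identify the generator behind each fake sample, can be pictured as a classification problem over $k + 1$ classes for $k$ generators. The sketch below is only an illustration under assumptions: the class layout (indices $0, \dots, k-1$ for generators, index $k$ for real data), the softmax cross-entropy form, and the function name are not taken from the abstract, and no network or training loop is shown.

```python
import numpy as np

def discriminator_loss(logits, labels):
    """Hypothetical (k+1)-way cross-entropy for a multi-generator discriminator.

    logits: array of shape (batch, k + 1), raw discriminator outputs.
    labels: integer array; i in [0, k) means "fake, from generator i",
            and k means "real" (this layout is an assumption).
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Numerically stable log-softmax over the k + 1 classes.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean negative log-likelihood of the correct source for each sample.
    return -log_probs[np.arange(labels.shape[0]), labels].mean()

# Three generators (classes 0..2) plus the "real" class 3.
example_logits = np.zeros((4, 4))
example_labels = np.array([0, 1, 2, 3])
loss = discriminator_loss(example_logits, example_labels)
```

With uninformative (all-zero) logits, the loss equals the log of the number of classes, and it decreases as the discriminator learns to tell the generators apart.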