HiFrames: High Performance Data Frames in a Scripting Language

Data frames in scripting languages are essential abstractions for processing structured data. However, existing data frame solutions are either not distributed (e.g., Pandas in Python) and therefore have limited scalability, or they are not tightly integrated with array computations (e.g., Spark SQL). This paper proposes a novel compiler-based approach where we integrate data frames into the High Performance Analytics Toolkit (HPAT) to build HiFrames. It provides expressive and flexible data frame APIs which are tightly integrated with array operations. HiFrames then automatically parallelizes and compiles relational operations along with other array computations in end-to-end data analytics programs, and generates efficient MPI/C++ code. We demonstrate that HiFrames is significantly faster than alternatives such as Spark SQL on clusters, without forcing the programmer to switch to embedded SQL for part of the program. HiFrames is 3.6x to 70x faster than Spark SQL for basic relational operations, and can be up to 20,000x faster for advanced analytics operations, such as weighted moving averages (WMA), that the map-reduce paradigm cannot handle effectively. HiFrames is also 5x faster than Spark SQL for TPCx-BB Q26 on 64 nodes of the Cori supercomputer.
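
For readers unfamiliar with the weighted moving average mentioned above, the following is a minimal NumPy sketch of the operation itself. It is not HiFrames code and does not use the HiFrames or HPAT APIs; the function name `weighted_moving_average` and the sample data are hypothetical. It illustrates that each output value depends on a window of neighboring rows.

```python
import numpy as np

def weighted_moving_average(values, weights):
    """Weighted moving average over a sliding window (illustrative only).

    out[i] = sum_k weights[k] * values[i + k] / sum(weights)
    Only full windows are produced, so
    len(out) == len(values) - len(weights) + 1.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # np.convolve flips its second argument, so reverse the weights first
    # to obtain a plain sliding dot product.
    return np.convolve(values, weights[::-1], mode="valid") / weights.sum()

# Hypothetical sample column and window weights.
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
wma = weighted_moving_average(prices, [1.0, 2.0, 1.0])
print(wma)
```

Because every output element reads several adjacent input elements, the computation is a stencil rather than an independent per-row map.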

Fast Spectral Clustering Using Autoencoders and Landmarks

In this paper, we introduce an algorithm for performing spectral clustering efficiently. Spectral clustering is a powerful clustering algorithm that suffers from high computational complexity due to eigendecomposition. In this work, we first build the adjacency matrix of the graph corresponding to the dataset. To build this matrix, we consider only a limited number of points, called landmarks, and compute the similarity of all data points with the landmarks. Then, we present a definition of the Laplacian matrix of the graph that enables us to perform eigendecomposition efficiently, using a deep autoencoder. The overall complexity of the algorithm for eigendecomposition is O(np), where n is the number of data points and p is the number of landmarks. Finally, we evaluate the performance of the algorithm in different experiments.
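
The landmark step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how landmarks are chosen or which similarity is used, so uniform random selection and a Gaussian kernel are assumed here, and the names `landmark_similarity` and `sigma` are hypothetical. The autoencoder stage is not shown.

```python
import numpy as np

def landmark_similarity(X, p, sigma=1.0, seed=0):
    """Return an (n, p) similarity matrix between all points and p landmarks.

    Assumptions (not from the abstract): landmarks are sampled uniformly
    at random without replacement, and similarity is a Gaussian kernel.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=p, replace=False)   # landmark row indices
    landmarks = X[idx]                           # shape (p, d)
    # Squared Euclidean distance from every point to every landmark: (n, p).
    sq_dist = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2)), idx

# Hypothetical data: 200 points in 5 dimensions, 10 landmarks.
X = np.random.default_rng(1).normal(size=(200, 5))
Z, idx = landmark_similarity(X, p=10)
print(Z.shape)
```

Storing only the n-by-p matrix, rather than the full n-by-n adjacency matrix, is what keeps the cost proportional to np.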

Joint Probabilistic Linear Discriminant Analysis

Standard probabilistic linear discriminant analysis (PLDA) for speaker recognition assumes that the sample’s features (usually, i-vectors) are given by a sum of three terms: a term that depends on the speaker identity, a term that models the within-speaker variability and is assumed independent across samples, and a final term that models any remaining variability and is also independent across samples. In this work, we propose a generalization of this model where the within-speaker variability is not necessarily assumed independent across samples but dependent on another discrete variable. This variable, which we call the channel variable as in the standard PLDA approach, could be, for example, a discrete category for the channel characteristics, the language spoken by the speaker, the type of speech in the sample (conversational, monologue, read), etc. The value of this variable is assumed to be known during training but not during testing. Scoring is performed, as in standard PLDA, by computing a likelihood ratio between the null hypothesis that the two sides of a trial belong to the same speaker versus the alternative hypothesis that the two sides belong to different speakers. The two likelihoods are computed by marginalizing over two hypotheses about the channels in both sides of a trial: that they are the same and that they are different. This way, we expect that the new model will be better at coping with same-channel versus different-channel trials than standard PLDA, since knowledge about the channel (or language, or speech style) is used during training and implicitly considered during scoring.

An Introduction to the Temporal Group LASSO and its Potential Applications in Healthcare

The Temporal Group LASSO is an example of a multi-task, regularized regression approach for the prediction of response variables that vary over time. The aim of this work is to introduce the reader to the concepts behind the Temporal Group LASSO and its related methods, as well as to the type of potential applications in a healthcare setting that the method has. We argue that the method is attractive because of its ability to reduce overfitting, select predictors, and learn smooth effect patterns over time, and, finally, because of its simplicity.

Interactive Graphics for Visually Diagnosing Forest Classifiers in R

This paper describes structuring data and constructing plots to explore forest classification models interactively. A forest classifier is an example of an ensemble, produced by bagging multiple trees. The process of bagging and combining results from multiple trees produces numerous diagnostics which, with interactive graphics, can provide a lot of insight into class structure in high dimensions. Various aspects are explored in this paper, to assess model complexity, individual model contributions, variable importance and dimension reduction, and uncertainty in prediction associated with individual observations. The ideas are applied to the random forest algorithm, and to the projection pursuit forest, but could be more broadly applied to other bagged ensembles. Interactive graphics are built in R, using the ggplot2, plotly, and shiny packages.

MLC Toolbox: A MATLAB/OCTAVE Library for Multi-Label Classification

The Multi-Label Classification toolbox is a MATLAB/OCTAVE library for Multi-Label Classification (MLC). There exist a few Java libraries for MLC, but no MATLAB/OCTAVE library that covers various methods. This toolbox offers an environment for evaluation, comparison and visualization of the MLC results. One attraction of this toolbox is that it enables us to try many combinations of feature space dimension reduction, sample clustering, label space dimension reduction and ensemble, etc.

MapReduce Scheduler: A 360-degree view

Undoubtedly, MapReduce is the most powerful programming paradigm in distributed computing. Enhancing MapReduce is essential, and it can make computing faster. Therefore, there are many scheduling algorithms to discuss based on their characteristics. Moreover, there are many shortcomings to discover in this field. In this article, we present the state-of-the-art scheduling algorithms to enhance the understanding of the algorithms. The algorithms are presented systematically so that this article can open up many future possibilities in scheduling algorithms. In this paper, we provide in-depth insight into MapReduce scheduling algorithms. In addition, we discuss various issues of MapReduce schedulers developed for large-scale computing as well as heterogeneous environments.

Supervised Infinite Feature Selection

In this paper, we present a new feature selection method that is suitable for both unsupervised and supervised problems. We build upon the recently proposed Infinite Feature Selection (IFS) method where feature subsets of all sizes (including infinity) are considered. We extend IFS in two ways. First, we propose a supervised version of it. Second, we propose new ways of forming the feature adjacency matrix that perform better for unsupervised problems. We extensively evaluate our methods on many benchmark datasets, including large image-classification datasets (PASCAL VOC), and show that our methods outperform both the IFS and the widely used ‘minimum-redundancy maximum-relevancy (mRMR)’ feature selection algorithm.
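
As background, the following sketch shows the kind of scoring that the original, unsupervised Infinite Feature Selection method is generally described as using: features are nodes of a weighted graph with adjacency matrix A, and the contribution of paths of every length is summed in closed form via a geometric series. This is an illustration under assumptions, not this paper's supervised method or its new adjacency constructions. The function name `inf_fs_scores`, the scaling parameter `alpha`, and the example matrix are hypothetical.

```python
import numpy as np

def inf_fs_scores(A, alpha=0.9):
    """Score features by summing weighted paths of all lengths.

    Uses the identity  sum_{l>=1} (r*A)^l = (I - r*A)^(-1) - I,
    valid when r * spectral_radius(A) < 1. Here r = alpha / rho(A)
    with 0 < alpha < 1 (an assumed, illustrative choice).
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    rho = np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius of A
    r = alpha / rho
    S = np.linalg.inv(np.eye(n) - r * A) - np.eye(n)
    return S.sum(axis=1)                         # one score per feature

# Hypothetical symmetric feature-adjacency matrix for 4 features.
A = np.array([[0.0, 0.9, 0.8, 0.7],
              [0.9, 0.0, 0.1, 0.2],
              [0.8, 0.1, 0.0, 0.1],
              [0.7, 0.2, 0.1, 0.0]])
scores = inf_fs_scores(A)
ranking = np.argsort(-scores)                    # best feature first
print(ranking)
```

The closed form is what allows paths "of all sizes (including infinity)" to be considered without enumerating them.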

Word Embeddings via Tensor Factorization

Most popular word embedding techniques involve implicit or explicit factorization of a word co-occurrence based matrix into low rank factors. In this paper, we aim to generalize this trend by using numerical methods to factor higher-order word co-occurrence based arrays, or tensors. We present four word embeddings using tensor factorization and analyze their advantages and disadvantages. One of our main contributions is a novel joint symmetric tensor factorization technique related to the idea of coupled tensor factorization. We show that embeddings based on tensor factorization can be used to discern the various meanings of polysemous words without being explicitly trained to do so, and motivate the intuition behind why this works in a way that it does not with existing methods. We also modify an existing word embedding evaluation metric known as Outlier Detection [Camacho-Collados and Navigli, 2016] to evaluate the quality of the order-N relations that a word embedding captures, and show that tensor-based methods outperform existing matrix-based methods at this task. Experimentally, we show that all of our word embeddings either outperform or are competitive with state-of-the-art baselines commonly used today on a variety of recent datasets. Suggested applications of tensor factorization-based word embeddings are given, and all source code and pre-trained vectors are publicly available online.

Bayesian Recurrent Neural Networks

In this work we explore a straightforward variational Bayes scheme for Recurrent Neural Networks. Firstly, we show that a simple adaptation of truncated backpropagation through time can yield good quality uncertainty estimates and superior regularisation at only a small extra computational cost during training. Secondly, we demonstrate how a novel kind of posterior approximation yields further improvements to the performance of Bayesian RNNs. We incorporate local gradient information into the approximate posterior to sharpen it around the current batch statistics. This technique is not exclusive to recurrent neural networks and can be applied more widely to train Bayesian neural networks. We also empirically demonstrate how Bayesian RNNs are superior to traditional RNNs on a language modelling benchmark and an image captioning task, as well as showing how each of these methods improves our model over a variety of other schemes for training them. We also introduce a new benchmark for studying uncertainty for language models so future methods can be easily compared.

A Comparative Study for Predicting Heart Diseases Using Data Mining Classification Methods

Improving the precision of heart disease detection has been investigated by many researchers in the literature. Such improvement is motivated by overwhelming health care expenditures and erroneous diagnoses. As a result, various methodologies have been proposed to analyze the disease factors, aiming to decrease variation in physicians' practice and reduce medical costs and errors. In this paper, our main motivation is to develop an effective intelligent medical decision support system based on data mining techniques. In this context, five data mining classification algorithms (Naïve Bayes, Decision Tree, Discriminant, Random Forest, and Support Vector Machine) have been applied to large datasets to assess and analyze the risk factors statistically related to heart diseases and to compare the performance of the implemented classifiers. To underscore the practical viability of our approach, the selected classifiers have been implemented in MATLAB with two datasets. The results of the conducted experiments showed that all classification algorithms are predictive and can give relatively correct answers. However, the decision tree outperforms the other classifiers with an accuracy rate of 99.0%, followed by Random Forest. This is the case because both have a relatively similar mechanism, but the Random Forest builds an ensemble of decision trees. Although ensemble learning has been shown to produce superior results, in our case the decision tree outperformed its ensemble version.

Multi-Agent Diverse Generative Adversarial Networks

This paper describes an intuitive generalization of Generative Adversarial Networks (GANs) to generate samples while capturing diverse modes of the true data distribution. Firstly, we propose a very simple and intuitive multi-agent GAN architecture that incorporates multiple generators capable of generating samples from high probability modes. Secondly, in order to force different generators to generate samples from diverse modes, we propose two extensions to the standard GAN objective function. (1) We augment the generator-specific GAN objective function with a diversity-enforcing term that encourages different generators to generate diverse samples using a user-defined similarity-based function. (2) We modify the discriminator objective function so that, along with distinguishing real from fake samples, the discriminator has to predict which generator produced a given fake sample. Intuitively, in order to succeed in this task, the discriminator must learn to push different generators towards different identifiable modes. Our framework is generalizable in the sense that it can be easily combined with other existing variants of GANs to produce diverse samples. Experimentally we show that our framework is able to produce high quality diverse samples for challenging tasks such as image/face generation and image-to-image translation. We also show that it is capable of learning a better feature representation in an unsupervised setting.
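
One way to read extension (2) is as a (k+1)-way classification problem: with k generators, the discriminator assigns each sample to one of the k generators or to the "real" class. The sketch below illustrates that reading only. Using softmax cross-entropy as the loss is an assumption, and the function name `discriminator_loss` and the label convention (0..k-1 for generators, k for real data) are hypothetical rather than taken from the paper.

```python
import numpy as np

def discriminator_loss(logits, source):
    """Mean softmax cross-entropy over k+1 classes (illustrative only).

    logits: array of shape (batch, k + 1), raw discriminator outputs.
    source: int array of shape (batch,); values 0..k-1 give the index of
            the generator that produced a fake sample, and k marks a real
            sample (assumed label convention).
    """
    logits = np.asarray(logits, dtype=float)
    source = np.asarray(source, dtype=int)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(source)), source].mean()

# Hypothetical batch: k = 3 generators, one sample per source.
k = 3
source = np.array([0, 1, 2, 3])           # three fakes and one real sample
uninformed = np.zeros((4, k + 1))         # discriminator knows nothing
loss = discriminator_loss(uninformed, source)
print(loss)
```

Under this reading, the discriminator can only lower its loss if samples from different generators are distinguishable, which is the intuition the abstract gives for generators being pushed towards different modes.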

Periodic behaviour of coronal mass ejections, eruptive events, and solar activity proxies during solar cycles 23 and 24

Evolutionary Many-Objective Optimization Based on Adversarial Decomposition

Automated Unsupervised Segmentation of Liver Lesions in CT scans via Cahn-Hilliard Phase Separation

Gaussian fluctuations of Jack-deformed random Young diagrams

Three-Dimensional Segmentation of Vesicular Networks of Fungal Hyphae in Macroscopic Microscopy Image Stacks

Stability of Service under Time-of-Use Pricing

Testing hereditary properties of ordered graphs and matrices

Efficient MCMC for parameter inference for Markov jump processes

Large Deviations for the Empirical Distribution in the General Branching Random Walk

AppLP: A Dialogue on Applications of Logic Programming

Uncovering Group Level Insights with Accordant Clustering

Exploring an Infinite Space with Finite Memory Scouts

Adaptive estimation of the rank of the coefficient matrix in high dimensional multivariate response regression models

A Trolling Hierarchy in Social Media and A Conditional Random Field For Trolling Detection

Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Learning Where to Look: Data-Driven Viewpoint Set Selection for 3D Scenes

Practical Synchronous Byzantine Consensus

Stein Variational Policy Gradient

GoDP: Globally optimized dual pathway system for facial landmark localization in-the-wild

More on additive triples of bijections

Canonical correlation coefficients of high-dimensional Gaussian vectors: finite rank case

Second order Lyapunov exponents for parabolic and hyperbolic Anderson models

Proceedings Tenth Workshop on Programming Language Approaches to Concurrency- and Communication-cEntric Software

Average-radius list-recovery of random linear codes: it really ties the room together

Proceedings 3rd International Workshop on Symbolic and Numerical Methods for Reachability Analysis

A Deep Cascade of Convolutional Neural Networks for Dynamic MR Image Reconstruction

Gathering in Dynamic Rings

Asymptotic Formulas for Macdonald Polynomials and the boundary of the $(q, t)$-Gelfand-Tsetlin graph

Pieri Integral Formula and Asymptotics of Jack Unitary Characters

Learning Cross-Modal Deep Representations for Robust Pedestrian Detection

Approximation Algorithms for Barrier Sweep Coverage

Coordination game in bidirectional flow

Non-thermalization in trapped atomic ion spin chains

Seismic facies recognition based on prestack data using deep convolutional autoencoder

Weakly-supervised Transfer for 3D Human Pose Estimation in the Wild

Coupled Deep Learning for Heterogeneous Face Recognition

Consistent Approval-Based Multi-Winner Rules

A New Pseudo-color Technique Based on Intensity Information Protection for Passive Sensor Imagery

First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations

Basic Formal Properties of A Relational Model of The Mathematical Theory of Evidence

DSLR-Quality Photos on Mobile Devices with Deep Convolutional Networks

Difference bases in finite Abelian groups

Difference bases in dihedral groups

Informed Bayesian T-Tests

Phase limitations of Zames-Falb multipliers

Metric Learning in Codebook Generation of Bag-of-Words for Person Re-identification

Monotonicity of expected $f$-vectors for projections of regular polytopes

On the Linearity of Semantic Change: Investigating Meaning Variation via Dynamic Graph Models

Dynamical Stochastic Higher Spin Vertex Models

Mixing properties of multivariate infinitely divisible random fields

Good Deal Hedging and Valuation under Combined Uncertainty about Drift and Volatility

DualGAN: Unsupervised Dual Learning for Image-to-Image Translation

An Empirical Evaluation of Visual Question Answering for Novel Objects

Deep Generative Adversarial Compression Artifact Removal

Identifiability and Estimation of Structural Vector Autoregressive Models for Subsampled and Mixed Frequency Time Series

Metastability of Queuing Networks with Mobile Servers

Algorithm for Overcoming the Curse of Dimensionality for State-dependent Hamilton-Jacobi equations

A Matrix Variate Generalized Hyperbolic Distribution

Deep Reinforcement Learning framework for Autonomous Driving

Noisy Tensor Completion for Tensors with a Sparse Canonical Polyadic Factor

Wireless Information and Power Transfer in Full-Duplex Systems with Massive Antenna Arrays

Dual polynomials and communication complexity of $\textsf{XOR}$ functions

5G Cellular User Equipment: From Theory to Practical Hardware Design

A note on covariance estimation

Solving Parameter Estimation Problems with Discrete Adjoint Exponential Integrators

Embedded Collaborative Filtering for ‘Cold Start’ Prediction

Controllability of the Strongly Damped Impulsive Semilinear Wave Equation with Memory and Delay

Prosody: The Rhythms and Melodies of Speech

Motion Saliency Based Automatic Delineation of Glottis Contour in High-speed Digital Images

An Outlyingness Matrix for Multivariate Functional Data Classification

On Continuous-Time Gaussian Channels

Strict monotonicity of principal eigenvalues of elliptic operators in $\mathbb{R}^d$ and risk-sensitive control

Efficient and Robust Polylinear Analysis of Noisy Time Series

Strictly Proper Kernel Scoring Rules and Divergences with an Application to Kernel Two-Sample Hypothesis Testing

Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks

Phylogenetic complexity of the Kimura 3-parameter model

Generalized Sylvester Formulas and skew Giambelli Identities

Mutual Information of Buffer-Aided Full-Duplex Relay Channels

A Sample Complexity Measure with Applications to Learning Optimal Auctions

Automatic Image Filtering on Social Networks Using Deep Learning and Perceptual Hashing During Crises

A Framework for the Secretary Problem on the Intersection of Matroids

Diversification benefits under multivariate second order regular variation

BigHand2.2M Benchmark: Hand Pose Dataset and State of the Art Analysis

Tail positive words and generalized coinvariant algebras

Mixed Graphical Models for Causal Analysis of Multi-modal Variables

On non-full-rank perfect codes over finite fields

Spectral and Energy Efficiency in Cognitive Radio Systems with Unslotted Primary Users and Sensing Uncertainty

Self-Organization of Self-Clearing Beating Patterns in an Array of Locally Interacting Ciliated Cells Formulated as an Adaptive Boolean Network

Rényi entropy power inequality and a reverse

Duality and Hereditary König-Egerváry Set-systems

Posterior Asymptotic Normality for an Individual Coordinate in High-dimensional Linear Regression

Dimensionality Reduction as a Defense against Evasion Attacks on Machine Learning Classifiers

An Algorithmic Approach to Search Games: Finding Solutions Using Best Response Oracles

Distributed Statistical Estimation and Rates of Convergence in Normal Approximation

Centers of probability measures without the mean

Accurate Prediction of Electoral Outcomes

The average size of the kernel of a matrix and orbits of linear groups

Overlap Coefficients Based on Kullback-Leibler Divergence: Exponential Populations Case

Quaternion Based Camera Pose Estimation From Matched Feature Points

Lattice Gaussian Sampling by Markov Chain Monte Carlo: Convergence Rate and Decoding Complexity

Largest regular multigraph with three distinct eigenvalues

Pyramid Vector Quantization for Deep Learning

Learning Important Features Through Propagating Activation Differences

Zero-sum stochastic differential game with risk-sensitive cost

Littlewood–Paley–Stein Estimates for Non-local Dirichlet Forms

Fully Convolutional Deep Neural Networks for Persistent Multi-Frame Multi-Object Detection in Wide Area Aerial Videos

A proof of the $(α,β)$–inversion formula conjectured by Hsu and Ma

Implementing a Cloud Platform for Autonomous Driving

Linear-Time FPT Algorithms via Half-Integral Non-returning $A$-path Packing

Volumes of generalized Chan-Robbins-Yuen polytopes

Automatic Liver Lesion Detection using Cascaded Deep Residual Networks

Computing and Graphing Probability Values of Pearson Distributions: A SAS/IML Macro

Distribution-free Evolvability of Vector Spaces: All it takes is a Generating Set

Improving Implicit Semantic Role Labeling by Predicting Semantic Frame Arguments

Adaptive Relaxed ADMM: Convergence Theory and Practical Implementation

Formal approaches to a definition of agents

Distributed Learning for Cooperative Inference

DeepPermNet: Visual Permutation Learning

A study of the dual problem of the one-dimensional L-infinity optimal transport problem with applications

Degrees of Freedom and Achievable Rate of Wide-Band Multi-cell Multiple Access Channels With No CSIT

Detail-revealing Deep Video Super-resolution

Integrating Additional Knowledge Into Estimation of Graphical Models

Szemeredi-type theorems for subsets of locally compact abelian groups of positive upper Banach density

Robust Connectivity with Multiple Nonisotropic Antennas for Vehicular Communications

An Approach to the High-level Maintenance Planning for EMU Trains Based on Simulated Annealing

Constructing confidence sets for the matrix completion problem

The largest root of random Kac polynomials is heavy tailed

Deterministic Distributed Edge-Coloring via Hypergraph Maximal Matching

Massively parallel implementation and approaches to simulate quantum dynamics using Krylov subspace techniques

Group Importance Sampling for particle filtering and MCMC

Multiscale Bayesian State Space Model for Granger Causality Analysis of Brain Signal

Equivalence between synaptic current dynamics and heterogeneous propagation delays in spiking neuron networks

Tracking the Trackers: An Analysis of the State of the Art in Multiple Object Tracking

The Kth Traveling Salesman Problem is Pseudopolynomial when TSP is polynomial

Deep Affordance-grounded Sensorimotor Object Recognition

Entity Linking for Queries by Searching Wikipedia Sentences

Parsimonious Random Vector Functional Link Network for Data Streams

Efficient SMC$^2$ schemes for stochastic kinetic models

Fine-grained Image Classification via Combining Vision and Language

Voronoi diagrams on planar graphs, and computing the diameter in deterministic $\tilde{O}(n^{5/3})$ time

Topology in colored tensor models via crystallization theory

Bayesian Inference of Individualized Treatment Effects using Multi-task Gaussian Processes

Serving Distance and Coverage in a Closed Access PHP-Based Heterogeneous Cellular Network

R-Clustering for Egocentric Video Segmentation

Character-Word LSTM Language Models

Flags of almost affine codes

Asymptotic ensemble stabilizability of the Bloch equation

Configurations of FK Ising interfaces and hypergeometric SLE

Learning Human Motion Models for Long-term Predictions

Integral Transforms from Finite Data: An Application of Gaussian Process Regression to Fourier Analysis

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Diffusion dynamics and synchronizability of hierarchical products of networks

The quadratic M-convexity testing problem

Local Asymptotic Normality of Infinite-Dimensional Concave Extended Linear Models

Fully Dynamic Approximate Maximum Matching and Minimum Vertex Cover in $O(\log^3 n)$ Worst Case Update Time

Multi-Kernel LS-SVM Based Bio-Clinical Data Integration: Applications to Ovarian Cancer

Unsupervised prototype learning in an associative-memory network

SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific Publications

A Decision Tree Based Approach Towards Adaptive Profiling of Cloud Applications

Spiral determinants

Towards a general theory for non-linear locally stationary processes

Stability in the Erdős–Gallai Theorem on cycles and paths, II

A note on ‘Extremal graphs with bounded vertex bipartiteness number’

Can AIs learn to avoid human interruption?

ActionVLAD: Learning spatio-temporal aggregation for action classification

Continuously heterogeneous hyper-objects in cryo-EM and 3-D movies of many temporal dimensions

Dynamic Edge-Conditioned Filters in Convolutional Neural Networks on Graphs

Stable Throughput and Delay Analysis of a Random Access Network With Queue-Aware Transmission

Energy Harvesting Enabled MIMO Relaying through Time Switching

When mmWave Communications Meet Network Densification: A Scalable Interference Coordination Perspective

Fourier dimension and spectral gaps for hyperbolic surfaces

Reinterpreting Importance-Weighted Autoencoders

On quantile residuals in beta regression

Fair splitting of colored paths

Pay Attention to Those Sets! Learning Quantification from Images

Fast Learning and Prediction for Object Detection using Whitened CNN Features

A Cooperative Enterprise Agent Based Control Architecture

Minor-matching hypertree width

The Boolean SATisfiability Problem in Clifford algebra

Spectral radii of sparse random matrices

Incentive-rewarding mechanisms to stimulate participation in heterogeneous DTNs

Largest eigenvalues of sparse inhomogeneous Erdős-Rényi graphs

Surface Normals in the Wild

On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks

Improving bounds on packing densities of 4-point permutations

Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization

Using convolutional networks and satellite imagery to identify patterns in urban environments at a large scale

Loss Max-Pooling for Semantic Image Segmentation