Impacts of Dirty Data: and Experimental Evaluation

Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could be applied to the selection of an appropriate algorithm given the data quality, and to the determination of the share of data to clean. However, little research has focused on exploring this relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent and conflicting data on classification, clustering, and regression algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.

Applying the Delta method in metric analytics: A practical guide with novel ideas

During the last decade, the information technology industry has adopted a data-driven culture, relying on online metrics to measure and monitor business performance. Under the setting of big data, the majority of such metrics approximately follow normal distributions, opening up potential opportunities to model them directly and solve big data problems using distributed algorithms. However, certain attributes of the metrics, such as their corresponding data generating processes and aggregation levels, pose numerous challenges for constructing trustworthy estimation and inference procedures. Motivated by four real-life examples in metric development and analytics for large-scale A/B testing, we provide a practical guide to applying the Delta method, one of the most important tools from the classic statistics literature, to address the aforementioned challenges. We emphasize the central role of the Delta method in metric analytics, by highlighting both its classic and novel applications.
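As an illustration of the kind of calculation the abstract refers to, here is a minimal sketch of the Delta method applied to a ratio metric (for example, total clicks divided by total page views, with one pair of values per user). The metric, the function name `ratio_delta_variance`, and the assumption of i.i.d. paired observations are illustrative choices, not taken from the paper.

```python
from statistics import fmean


def ratio_delta_variance(x, y):
    """Delta-method variance estimate of mean(x) / mean(y).

    Assumes (x[i], y[i]) are i.i.d. pairs, e.g. one pair per user.
    This is a first-order approximation, not an exact variance.
    """
    n = len(x)
    mx, my = fmean(x), fmean(y)
    # Sample variances and covariance, using the n - 1 denominator.
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    r = mx / my
    # First-order Taylor expansion of g(mx, my) = mx / my around the means.
    return (vx - 2 * r * cxy + r * r * vy) / (n * my * my)
```

When the denominator is constant, the estimate reduces to the usual variance of a sample mean; when the numerator is an exact multiple of the denominator, the ratio is constant and the estimate is zero.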

Vulnerability of Deep Learning

The Renormalisation Group (RG) provides a framework in which it is possible to assess whether a deep-learning network is sensitive to small changes in the input data and hence prone to error, or susceptible to adversarial attack. Distinct classification outputs are associated with different RG fixed points and sensitivity to small changes in the input data is due to the presence of relevant operators at a fixed point. A numerical scheme, based on Monte Carlo RG ideas, is proposed for identifying the existence of relevant operators and the corresponding directions of greatest sensitivity in the input data. Thus, a trained deep-learning network may be tested for its robustness and, if it is vulnerable to attack, dangerous perturbations of the input data identified.

Some HCI Priorities for GDPR-Compliant Machine Learning

In this short paper, we consider the roles of HCI in enabling the better governance of consequential machine learning systems using the rights and obligations laid out in the recent 2016 EU General Data Protection Regulation (GDPR)—a law which involves heavy interaction with people and systems. Focussing on those areas that relate to algorithmic systems in society, we propose roles for HCI in legal contexts in relation to fairness, bias and discrimination; data protection by design; data protection impact assessments; transparency and explanations; the mitigation and understanding of automation bias; and the communication of envisaged consequences of processing.

Big Data and Reliability Applications: The Complexity Dimension

Big data features not only large volumes of data but also data with complicated structures. Complexity imposes unique challenges in big data analytics. Meeker and Hong (2014, Quality Engineering, pp. 102-116) provided an extensive discussion of the opportunities and challenges in big data and reliability, and described engineering systems that can generate big data that can be used in reliability analysis. Meeker and Hong (2014) focused on large scale system operating and environment data (i.e., high-frequency multivariate time series data), and provided examples on how to link such data as covariates to traditional reliability responses such as time to failure, time to recurrence of events, and degradation measurements. This paper intends to extend that discussion by focusing on how to use data with complicated structures to do reliability analysis. Such data types include high-dimensional sensor data, functional curve data, and image streams. We first provide a review of recent developments in those directions, and then we provide a discussion on how analytical methods can be developed to tackle the challenging aspects that arise from the complexity feature of big data in reliability applications. The use of modern statistical methods such as variable selection, functional data analysis, scalar-on-image regression, spatio-temporal data models, and machine learning techniques will also be discussed.

ORGaNICs: A Theory of Working Memory in Brains and Machines

Working memory is a cognitive process that is responsible for temporarily holding and manipulating information. Most of the empirical neuroscience research on working memory has focused on measuring sustained activity in prefrontal cortex (PFC) and/or parietal cortex during simple delayed-response tasks, and most of the models of working memory have been based on neural integrators. But working memory means much more than just holding a piece of information online. We describe a new theory of working memory, based on a recurrent neural circuit that we call ORGaNICs (Oscillatory Recurrent GAted Neural Integrator Circuits). ORGaNICs are a variety of Long Short Term Memory units (LSTMs), imported from machine learning and artificial intelligence. ORGaNICs can be used to explain the complex dynamics of delay-period activity in prefrontal cortex (PFC) during a working memory task. The theory is analytically tractable so that we can characterize the dynamics, and the theory provides a means for reading out information from the dynamically varying responses at any point in time, in spite of the complex dynamics. ORGaNICs can be implemented with a biophysical (electrical circuit) model of pyramidal cells, combined with shunting inhibition via a thalamocortical loop. Although introduced as a computational theory of working memory, ORGaNICs are also applicable to models of sensory processing, motor preparation and motor control. ORGaNICs offer computational advantages compared to other varieties of LSTMs that are commonly used in AI applications. Consequently, ORGaNICs are a framework for canonical computation in brains and machines.

Assessment meets Learning: On the relation between Item Response Theory and Bayesian Knowledge Tracing

Few models have been more ubiquitous in their respective fields than Bayesian knowledge tracing and item response theory. Both of these models were developed to analyze data on learners. However, the study designs that these models are designed for differ; Bayesian knowledge tracing is designed to analyze longitudinal data while item response theory is built for cross-sectional data. This paper illustrates a fundamental connection between these two models. Specifically, the stationary distribution of the latent variable and the observed response variable in Bayesian knowledge tracing is related to an item response theory model. Furthermore, recent advances in network psychometrics demonstrate how this relationship can be exploited and generalized to a network model.
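To make the notion of a stationary distribution concrete, the sketch below computes it for a two-state knowledge-tracing chain. It assumes a variant with a nonzero forgetting probability (with zero forgetting, the chain simply ends up in the "known" state); the parameter names `learn`, `forget`, `guess` and `slip` are illustrative, and the link to an item response theory model described in the abstract is not reproduced here.

```python
def bkt_stationary(learn, forget, guess, slip):
    """Stationary distribution of a two-state knowledge-tracing chain.

    learn:  P(unknown -> known) per step
    forget: P(known -> unknown) per step (assumed > 0 in this sketch)
    guess:  P(correct response | unknown)
    slip:   P(incorrect response | known)
    Returns (P(known), P(correct response)) in the long run.
    """
    if learn + forget <= 0:
        raise ValueError("learn + forget must be positive")
    # Fixed point of p' = p * (1 - forget) + (1 - p) * learn.
    p_known = learn / (learn + forget)
    # Marginal probability of a correct response under that distribution.
    p_correct = p_known * (1 - slip) + (1 - p_known) * guess
    return p_known, p_correct
```

For instance, `bkt_stationary(0.3, 0.1, 0.2, 0.1)` gives a long-run probability of 0.75 of being in the known state.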

Studying Invariances of Trained Convolutional Neural Networks

Convolutional Neural Networks (CNNs) define an exceptionally powerful class of models for image classification, but the theoretical background and the understanding of how invariances to certain transformations are learned is limited. In a large scale screening with images modified by different affine and nonaffine transformations of varying magnitude, we analyzed the behavior of the CNN architectures AlexNet and ResNet. If the magnitude of different transformations does not exceed a class- and transformation-dependent threshold, both architectures show invariant behavior. In this work we furthermore introduce a new learnable module, the Invariant Transformer Net, which enables us to learn differentiable parameters for a set of affine transformations. This allows us to extract the space of transformations to which the CNN is invariant and its class prediction robust.

Zero-Shot Object Detection: Learning to Simultaneously Recognize and Localize Novel Concepts

Current Zero-Shot Learning (ZSL) approaches are restricted to recognition of a single dominant unseen object category in a test image. We hypothesize that this setting is ill-suited for real-world applications where unseen objects appear only as a part of a complex scene, warranting both the 'recognition' and 'localization' of an unseen category. To address this limitation, we introduce a new 'Zero-Shot Detection' (ZSD) problem setting, which aims at simultaneously recognizing and locating object instances belonging to novel categories without any training examples. We also propose a new experimental protocol for ZSD based on the highly challenging ILSVRC dataset, adhering to practical issues, e.g., the rarity of unseen objects. To the best of our knowledge, this is the first end-to-end deep network for ZSD that jointly models the interplay between visual and semantic domain information. To overcome the noise in the automatically derived semantic descriptions, we utilize the concept of meta-classes to design an original loss function that achieves synergy between max-margin class separation and semantic space clustering. Furthermore, we present a baseline approach extended from the recognition to the detection setting. Our extensive experiments show a significant performance boost over the baseline on the imperative yet difficult ZSD problem.

Constant-Time Predictive Distributions for Gaussian Processes

One of the most compelling features of Gaussian process (GP) regression is its ability to provide well calibrated posterior distributions. Recent advances in inducing point methods have drastically sped up marginal likelihood and posterior mean computations, leaving posterior covariance estimation and sampling as the remaining computational bottlenecks. In this paper we address this shortcoming by using the Lanczos decomposition algorithm to rapidly approximate the predictive covariance matrix. Our approach, which we refer to as LOVE (LanczOs Variance Estimates), substantially reduces the time and space complexity over any previous method. In practice, it can compute predictive covariances up to 2,000 times faster and draw samples 18,000 times faster than existing methods, all without sacrificing accuracy.

A Kernel Theory of Modern Data Augmentation

Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding modern data augmentation techniques. We start by showing that for kernel classifiers, data augmentation can be approximated by first-order feature averaging and second-order variance regularization components. We connect this general approximation framework to prior work in invariant kernels, tangent propagation, and robust optimization. Next, we explicitly tackle the compositional aspect of modern data augmentation techniques, proposing a novel model of data augmentation as a Markov process. Under this model, we show that performing k-nearest neighbors with data augmentation is asymptotically equivalent to a kernel classifier. Finally, we illustrate ways in which our theoretical framework can be leveraged to accelerate machine learning workflows in practice, including reducing the amount of computation needed to train on augmented data, and predicting the utility of a transformation prior to training.

How robust are Structural Equation Models to model miss-specification? A simulation study

Structural Equation Models (SEMs) are routinely used in the analysis of empirical data by researchers spanning different scientific fields, such as psychology or econometrics. In some fields, such as ecology, SEMs have only recently started to attract attention, and thanks to dedicated software packages the use of SEMs has steadily increased. Yet common analysis practices in such fields that might be transposed from other statistical techniques, such as model acceptance or rejection based on p-value screening, might be poorly suited to SEMs. In this simulation study, SEMs were fitted via two commonly used R packages: lavaan and piecewiseSEM. Datasets were simulated under different modelling scenarios to test the impact of sample size and model complexity on various global and local model fitness indices. The results showed that no single model index should be used to decide on model fitness; rather, a combination of different model fitness indices is needed. The global chi-square test for lavaan or the Fisher C statistic are, in isolation, poor indicators of model fitness. Combining the different metrics explored here provided few safeguards against model overfitting, which emphasizes the need to cautiously interpret the inferred (causal) relations from fitted SEMs. Researchers in scientific fields with little experience in SEMs, such as ecology, should consider and accept these limitations.

Gradients on Sets

For a locally Lipschitz continuous function $f:X\to\mathbb{R}$, the generalized gradient $\partial f(x)$ of Clarke is used to develop some (set-valued) gradient on a set $A\subset X$. Existence, uniqueness and some approximation are considered for optimal descent directions on the set $A$. The results serve as a basis for nonsmooth numerical descent algorithms that can be found in subsequent papers.

Graph Partition Neural Networks for Semi-Supervised Classification

We present graph partition neural networks (GPNNs), an extension of graph neural networks (GNNs) able to handle extremely large graphs. GPNNs alternate between locally propagating information between nodes in small subgraphs and globally propagating information between the subgraphs. To efficiently partition graphs, we experiment with several partitioning algorithms and also propose a novel variant for fast processing of large scale graphs. We extensively test our model on a variety of semi-supervised node classification tasks. Experimental results indicate that GPNNs are either superior or comparable to state-of-the-art methods on a wide variety of datasets for graph-based semi-supervised classification. We also show that GPNNs can achieve performance similar to that of standard GNNs with fewer propagation steps.

High-dimensional Stochastic Inversion via Adjoint Models and Machine Learning

Performing stochastic inversion on a computationally expensive forward simulation model with a high-dimensional uncertain parameter space (e.g. a spatial random field) is computationally prohibitive even with gradient information provided. Moreover, the 'nonlinear' mapping from parameters to observables generally gives rise to non-Gaussian posteriors even with Gaussian priors, thus hampering the use of efficient inversion algorithms designed for models with Gaussian assumptions. In this paper, we propose a novel Bayesian stochastic inversion methodology, characterized by a tight coupling between a gradient-based Langevin Markov Chain Monte Carlo (LMCMC) method and a kernel principal component analysis (KPCA). This approach addresses the 'curse-of-dimensionality' via KPCA to identify a low-dimensional feature space within the high-dimensional and nonlinearly correlated spatial random field. Moreover, non-Gaussian full posterior probability distribution functions are estimated via an efficient LMCMC method on both the projected low-dimensional feature space and the recovered high-dimensional parameter space. We demonstrate this computational framework by integrating and adapting recent developments such as data-driven statistics-on-manifolds constructions and reduction-through-projection techniques to solve inverse problems in linear elasticity.

Nesting Probabilistic Programs

We formalize the notion of nesting probabilistic programming queries and investigate the resulting statistical implications. We demonstrate that query nesting allows the definition of models which could not otherwise be expressed, such as those involving agents reasoning about other agents, but that existing systems take approaches that lead to inconsistent estimates. We show how to correct this by delineating possible ways one might want to nest queries and asserting the respective conditions required for convergence. We further introduce, and prove the correctness of, a new online nested Monte Carlo estimation method that makes it substantially easier to ensure these conditions are met, thereby providing a simple framework for designing statistically correct inference engines.
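For readers unfamiliar with nested estimation, the sketch below shows a plain nested Monte Carlo estimator of an expectation of the form E_y[f(y, E_z[g(y, z)])]. It is not the online method the abstract introduces; the function name `nested_mc` and all of its parameters are hypothetical. When `f` is nonlinear in its second argument, the estimate is biased for any finite inner sample size, which is one reason convergence conditions matter in this setting.

```python
import random


def nested_mc(sample_outer, sample_inner, g, f, n_outer, n_inner, rng):
    """Plain nested Monte Carlo estimate of E_y[ f(y, E_z[g(y, z)]) ].

    sample_outer(rng) draws y; sample_inner(rng) draws z.
    For each outer sample, the inner expectation is replaced by an
    average over n_inner fresh inner samples.
    """
    total = 0.0
    for _ in range(n_outer):
        y = sample_outer(rng)
        inner = sum(g(y, sample_inner(rng)) for _ in range(n_inner)) / n_inner
        total += f(y, inner)
    return total / n_outer


# Toy usage: y, z ~ N(0, 1), g(y, z) = y + z, f(y, m) = m ** 2.
# The target is E[y ** 2] = 1; the finite-sample bias is 1 / n_inner.
rng = random.Random(0)
estimate = nested_mc(
    lambda r: r.gauss(0.0, 1.0),
    lambda r: r.gauss(0.0, 1.0),
    lambda y, z: y + z,
    lambda y, m: m ** 2,
    n_outer=2000,
    n_inner=50,
    rng=rng,
)
```

In this toy case the bias shrinks as `n_inner` grows, while the variance shrinks as `n_outer` grows, so both need to increase for the estimate to converge.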

Snap Machine Learning

We describe an efficient, scalable machine learning library that enables very fast training of generalized linear models. We demonstrate that our library can remove the training time as a bottleneck for machine learning workloads, opening the door to a range of new applications. For instance, it allows more agile development, faster and more fine-grained exploration of the hyper-parameter space, enables scaling to massive datasets and makes frequent re-training of models possible in order to adapt to events as they occur. Our library, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern distributed systems. This allows us to effectively leverage available network, memory and heterogeneous compute resources. On a terabyte-scale publicly available dataset for click-through-rate prediction in computational advertising, we demonstrate the training of a logistic regression classifier in 1.53 minutes, a 46x improvement over the fastest reported performance.

Capturing near-equilibrium solutions: a comparison between high-order Discontinuous Galerkin methods and well-balanced schemes
RankME: Reliable Human Ratings for Natural Language Generation
Multilevel Monte Carlo Method for Ergodic SDEs without Contractivity
Some Closure Results for Polynomial Factorization and Applications
CIM/E Oriented Graph Database Model Architecture and Parallel Network Topology Processing
Noise-induced rectification in out-of-equilibrium structures
A picture is worth a thousand words but how to organize thousands of pictures?
Analog simulator of integro-differential equations with classical memristors
Pivot Sampling in QuickXSort: Precise Analysis of QuickMergesort and QuickHeapsort
Generalized Stirling Numbers I
Interplay of Probabilistic Shaping and the Blind Phase Search Algorithm
Mo2Cap2: Real-time Mobile 3D Motion Capture with a Cap-mounted Fisheye Camera
Deep Choice Model Using Pointer Networks for Airline Itinerary Prediction
Contraction and Robustness of Continuous Time Primal-Dual Dynamics
Real-time Deep Registration With Geodesic Loss
Unraveling Go gaming nature by Ising Hamiltonian and common fate graphs: tactics and statistics
Deep Co-Training for Semi-Supervised Image Recognition
EEG machine learning with Higuchi fractal dimension and Sample Entropy as features for successful detection of depression
Scalable analysis of linear networked systems via chordal decomposition
Escaping Saddles with Stochastic Gradients
Symplectic frieze patterns
Efficient Hardware Realization of Convolutional Neural Networks using Intra-Kernel Regular Pruning
Covert Communication over a K-User Multiple Access Channel
Ridge Regression and Provable Deterministic Ridge Leverage Score Sampling
A Unified Theory of Regression Adjustment for Design-based Inference
Database Perspectives on Blockchains
Deep Learning Reconstruction of Ultra-Short Pulses
Crackling to periodic dynamics in sheared granular media
Estimation of lactate threshold with machine learning techniques in recreational runners
Optimal Bipartite Network Clustering
$W^{1,p}$ regularity of solutions to Kolmogorov equation and associated Feller semigroup
Multistage stochastic programs with a random number of stages: dynamic programming equations, solution methods, and application to portfolio selection
Optimality of multi-refraction dividend strategies in the dual model
False discovery rate control for multiple testing based on p-values with càdlàg distribution functions
Matroids and Codes with the Rank Metric
Optimal Boundary Kernels and Weightings for Local Polynomial Regression
Identifying and Estimating Principal Causal Effects in Multi-site Trials
Sufficient Conditions for a Linear Estimator to be a Local Polynomial Regression
Deep Multiple Instance Learning for Zero-shot Image Tagging
The world of research has gone berserk: modeling the consequences of requiring ‘greater statistical stringency’ for scientific publication
Heuristics for vehicle routing problems: Sequence or set optimization?
A Meaning-based Statistical English Math Word Problem Solver
Dynamic-structured Semantic Propagation Network
Modelling sparsity, heterogeneity, reciprocity and community structure in temporal interaction data
Lyapunov Functions for First-Order Methods: Tight Automated Convergence Guarantees
Reconfiguring spanning and induced subgraphs
Real-time Detection, Tracking, and Classification of Moving and Stationary Objects using Multiple Fisheye Images
Load Balancing for 5G Ultra-Dense Networks using Device-to-Device Communications
A Globally Asymptotically Stable Polynomial Vector Field with Rational Coefficients and no Local Polynomial Lyapunov Function
Distributed Caching for Complex Querying of Raw Arrays
Salient Objects in Clutter: Bringing Salient Object Detection to the Foreground
A dataset and architecture for visual reasoning with a working memory
Expected Time to Extinction of SIS Epidemic Model Using Quasy Stationary Distribution
Parameterized Low-Rank Binary Matrix Approximation
Efficient Decoding Schemes for Noisy Non-Adaptive Group Testing when Noise Depends on Number of Items in Test
Varying k-Lipschitz Constraint for Generative Adversarial Networks
Modeling the effects of telephone nursing on healthcare utilization
A constant-ratio approximation algorithm for a class of hub-and-spoke network design problems and metric labeling problems: star metric case
Runlength-Limited Sequences and Shift-Correcting Codes
Gaussian Processes indexed on the symmetric group: prediction and learning
Surjections and double posets
Learning Sparse Deep Feedforward Networks via Tree Skeleton Expansion
On Combination Networks with Cache-aided Relays and Users
Towards Advanced Phenotypic Mutations in Cartesian Genetic Programming
Analysis of an asymptotic preserving scheme for stochastic linear kinetic equations in the diffusion limit
Towards Image Understanding from Deep Compression without Decoding
Signless Laplacian determinations of some graphs with independent edges
Patchwise object tracking via structural local sparse appearance model
Q-process and asymptotic properties of Markov processes conditioned not to hit moving boundaries
Local weak convergence for PageRank
Object Captioning and Retrieval with Natural Language
Semantic Segmentation of Pathological Lung Tissue with Dilated Fully Convolutional Networks
Existence and smoothness of the density for the stochastic continuity equation
Downlink coverage probability in cellular networks with Poisson-Poisson cluster deployed base stations
Fair non-monetary scheduling in federated clouds
The ApolloScape Dataset for Autonomous Driving
Triplet-Center Loss for Multi-View 3D Object Retrieval
Monocular Fisheye Camera Depth Estimation Using Semi-supervised Sparse Velodyne Data
Complex-YOLO: Real-time 3D Object Detection on Point Clouds
Quantile correlation coefficient: a new tail dependence measure
Shellability of face posets of electrical networks and the CW poset property
Further Consequences of the Colorful Helly Hypothesis
Chemi-net: a graph convolutional network for accurate drug property prediction
Trianguloids and Triangulations of Root Polytopes
Induced Saturation of Graphs
Coordination via predictive assistants from a game-theoretic view
Link prediction for interdisciplinary collaboration via co-authorship network
Tropical integrable systems and Young tableaux: Shape equivalence and Littlewood-Richardson correspondence
Joint Recognition of Handwritten Text and Named Entities with a Neural End-to-end Model
Land cover mapping at very high resolution with rotation equivariant CNNs: towards small yet accurate models
Online Controlled Experiments for Personalised e-Commerce Strategies: Design, Challenges, and Pitfalls
Heterogeneous Doppler Spread-based CSI Estimation Planning for TDD Massive MIMO
Consistent sets of lines with no colorful incidence
On the existence of a scalar pressure field in the Bredinger problem
Improved Part Segmentation Performance by Optimising Realism of Synthetic Images using Cycle Generative Adversarial Networks
$EVA^2$ : Exploiting Temporal Redundancy in Live Computer Vision
Activity Detection with Latent Sub-event Hierarchy Learning
A local characterization of crystals for the quantum queer superalgebra
Synchronisation of Partial Multi-Matchings via Non-negative Factorisations
A particle-based variational approach to Bayesian Non-negative Matrix Factorization
Fast approximation and exact computation of negative curvature parameters of graphs
Learning deep structured active contours end-to-end
Faces as Lighting Probes via Unsupervised Deep Highlight Extraction
Distributed Transactions: Dissecting the Nightmare