Kernel clustering: Breiman’s bias and solutions

Clustering is widely used in data analysis, where kernel methods are particularly popular due to their generality and discriminating power. However, kernel clustering has a practically significant bias toward small dense clusters, observed empirically in, e.g., (Shi & Malik, TPAMI’00). Its causes have never been analyzed and understood theoretically, even though many attempts were made to improve the results. We provide conditions for this bias in kernel clustering and formally prove it. Moreover, we show a general class of locally adaptive kernels directly addressing these conditions. Previously, (Breiman, ML’96) proved a bias to histogram mode isolation in the discrete Gini criterion for decision tree learning. We found that kernel clustering reduces to a continuous generalization of the Gini criterion for a common class of kernels, where we prove a bias to density mode isolation and call it Breiman’s bias. These theoretical findings suggest that a principled solution for the bias should directly address data density inhomogeneity. In particular, our density law shows how density equalization can be done implicitly using certain locally adaptive geodesic kernels. Interestingly, a popular heuristic kernel in (Zelnik-Manor and Perona, NIPS’04) approximates a special case of our Riemannian kernel framework. Our general ideas are relevant to any algorithm for kernel clustering. We show many synthetic and real data experiments illustrating Breiman’s bias and its solution. We anticipate that theoretical understanding of kernel clustering limitations and their principled solutions will be important for a broad spectrum of data analysis applications in diverse disciplines.
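To make the idea of a locally adaptive kernel concrete, here is a minimal sketch of the local scaling heuristic from (Zelnik-Manor and Perona, NIPS’04) that the abstract mentions: each point gets its own bandwidth, set to the distance to its k-th nearest neighbor, and the affinity is exp(-d(i, j)^2 / (sigma_i * sigma_j)). This is not the Riemannian kernel framework of the paper itself. The function name `local_scaling_kernel` is hypothetical, the default `k=7` is an assumption, and the code assumes more than `k` distinct points.

```python
import numpy as np

def local_scaling_kernel(points, k=7):
    """Affinity matrix with a per-point bandwidth (illustrative sketch).

    sigma_i is the distance from point i to its k-th nearest neighbor,
    so dense regions get narrow kernels and sparse regions get wide ones.
    Assumes len(points) > k and no duplicate points (sigma_i > 0).
    """
    points = np.asarray(points, dtype=float)
    # Pairwise squared Euclidean distances, shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    dists = np.sqrt(sq_dists)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    sigma = np.sort(dists, axis=1)[:, k]
    return np.exp(-sq_dists / (sigma[:, None] * sigma[None, :]))
```

The resulting matrix can be passed to any kernel or spectral clustering routine that accepts a precomputed affinity. Because the bandwidth shrinks where the data are dense, affinities inside a dense clump are no longer uniformly larger than those in sparse regions, which is the kind of density equalization the abstract argues for.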


REMIX: Automated Exploration for Interactive Outlier Detection

Outlier detection is the identification of points in a dataset that do not conform to the norm. Outlier detection is highly sensitive to the choice of the detection algorithm and the feature subspace used by the algorithm. Extracting domain-relevant insights from outliers requires systematic exploration of these choices, since diverse outlier sets could lead to complementary insights. This challenge is especially acute in an interactive setting, where the choices must be explored in a time-constrained manner. In this work, we present REMIX, the first system to address the problem of outlier detection in an interactive setting. REMIX uses a novel mixed integer programming (MIP) formulation for automatically selecting and executing a diverse set of outlier detectors within a time limit. This formulation incorporates multiple aspects such as (i) an upper limit on the total execution time of detectors, (ii) diversity in the space of algorithms and features, and (iii) meta-learning for evaluating the cost and utility of detectors. REMIX provides two distinct ways for the analyst to consume its results: (i) a partitioning of the detectors explored by REMIX into perspectives through low-rank non-negative matrix factorization, where each perspective can be easily visualized as an intuitive heatmap of experiments versus outliers, and (ii) an ensembled set of outliers which combines outlier scores from all detectors. We demonstrate the benefits of REMIX through extensive empirical validation on real-world data.
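The selection problem described above can be illustrated with a deliberately simplified toy: choose the subset of detectors with the highest total estimated utility whose total estimated cost fits the time limit. This sketch is not REMIX's MIP formulation; it replaces the solver with brute-force enumeration and omits the diversity and meta-learning terms. The detector names, costs, and utilities are hypothetical values chosen only for illustration.

```python
from itertools import combinations

# Hypothetical candidates: (name, estimated cost in seconds, estimated utility).
detectors = [
    ("lof_subspace_a", 4.0, 0.6),
    ("lof_subspace_b", 3.0, 0.5),
    ("iforest_subspace_a", 2.0, 0.4),
    ("knn_subspace_c", 5.0, 0.7),
]

def select_detectors(candidates, time_budget):
    """Return (names, utility) of the best subset that fits the time budget.

    Enumerates every subset, so it is only suitable for a handful of
    candidates; a real system would hand this to an integer programming solver.
    """
    best_subset, best_utility = (), 0.0
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            cost = sum(c for _, c, _ in subset)
            utility = sum(u for _, _, u in subset)
            if cost <= time_budget and utility > best_utility:
                best_subset, best_utility = subset, utility
    return [name for name, _, _ in best_subset], best_utility

chosen, total_utility = select_detectors(detectors, time_budget=9.0)
```

With a 9-second budget, the three cheaper detectors are preferred over any pair that includes the most expensive one, because together they fit the budget and accumulate more utility.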


HR-CTC: A Large Human Resource Corpus for Text Classification

With the rapid development of online recruitment, a large number of job postings enable a new paradigm for studying the national economic state. However, there are two problems when using these data for national economic research. First, there is a mismatch between the job descriptions and the job titles. Second, these titles differ from the category names in the national economic research standards. To map job postings with similar job descriptions but different titles to the same category of the national economic research standards, this paper introduces a text classification corpus named HR-CTC. A novel method that reduces the influence of human subjectivity is proposed to construct the corpus. The experimental results show that the proposed method improves on manual categorization by 47.08% under the accuracy evaluation criteria. To verify the value of this corpus for text classification research and to provide a baseline for further research, we implement five deep learning methods for text classification and achieve promising results.


Practical Processing of Mobile Sensor Data for Continual Deep Learning Predictions

We present a practical approach for processing mobile sensor time series data for continual deep learning predictions. The approach comprises data cleaning, normalization, capping, time-based compression, and finally classification with a recurrent neural network. We demonstrate the effectiveness of the approach in a case study with 279 participants. On the basis of sparse sensor events, the network continually predicts whether the participants would attend to a notification within 10 minutes. Compared to a random baseline, the classifier achieves a 40% performance increase (AUC of 0.702) on a withheld test set. This approach makes it possible to forgo resource-intensive, domain-specific, error-prone feature engineering, which may drastically increase the applicability of machine learning to mobile phone sensor data.
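The quoted 40% figure can be checked with a short calculation, assuming the random baseline corresponds to an AUC of 0.5 (the abstract does not state the baseline value, so this is an assumption) and that the increase is measured relative to that baseline.

```python
# Assumption: a random classifier has AUC 0.5; the abstract reports AUC 0.702.
random_auc = 0.5
model_auc = 0.702

# Relative improvement over the baseline.
relative_gain = (model_auc - random_auc) / random_auc
print(f"{relative_gain:.1%}")  # prints "40.4%"
```

Under that assumption, the relative gain is about 40.4%, consistent with the rounded 40% in the abstract.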


Transfer Learning for Named-Entity Recognition with Neural Networks

Recent approaches based on artificial neural networks (ANNs) have shown promising results for named-entity recognition (NER). To achieve high performance, ANNs need to be trained on a large labeled dataset. However, labels might be difficult to obtain for the dataset on which the user wants to perform NER: label scarcity is particularly pronounced for patient note de-identification, which is an instance of NER. In this work, we analyze to what extent transfer learning may address this issue. In particular, we demonstrate that transferring an ANN model trained on a large labeled dataset to another dataset with a limited number of labels improves upon the state-of-the-art results on two different datasets for patient note de-identification.


Correlation functions of the Pfaffian Schur process using Macdonald difference operators

Selection of Sparse Vine Copulas in High Dimensions with the Lasso

Skew product Smale endomorphisms over countable shifts of finite type

Supply and Shorting in Speculative Markets

Scaling limits for random walks on random critical trees

Static Gesture Recognition using Leap Motion

What’s In A Patch, I: Tensors, Differential Geometry and Statistical Shading Analysis

What’s In A Patch, II: Visualizing generic surfaces

Four NP-complete problems about generalizations of perfect graphs

Coupling conditions for globally stable and robust synchrony of chaotic systems

Lifted Polymatroid Inequalities for Mean-Risk Optimization with Indicator Variables

Network Design with Probabilistic Capacities

Polymatroid inequalities for p-order conic mixed 0-1 optimization

Path Cover and Path Pack Inequalities for the Capacitated Fixed-Charge Network Flow Problem

LCDet: Low-Complexity Fully-Convolutional Neural Networks for Object Detection in Embedded Systems

A Comprehensive Introduction to the Theory of Word-Representable Graphs

Lagrangian Reachability

Free fermions and the classical compact groups

Sub-sampled Cubic Regularization for Non-convex Optimization

Rise of the humanbot

Unusual structures inherent in point pattern data predict colon cancer patient survival

Subregular Complexity and Deep Learning

Inverse Multipath Fingerprinting for Millimeter Wave V2I Beam Alignment

Optimal segmentation of directed graph and the minimum number of feedback arcs

Optimal Resource Allocation for Power-Efficient MC-NOMA with Imperfect Channel State Information

A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing

Calibration of the NDHA model to describe N2O dynamics from respirometric assays

Sharp bounds for the Randic index of graphs with given minimum and maximum degree

Match results prediction ability of official ATP singles ranking

Community Detection for Multilayer Heterogeneous Network

A macroscopic multifractal analysis of parabolic stochastic PDEs

New Directions In Cellular Automata

Polar-Coded Non-Orthogonal Multiple Access

A Continuous Opinion Dynamic Model in Co-evolving Networks–A Novel Group Decision Approach

AI, Native Supercomputing and The Revival of Moore’s Law

A bijection between bargraphs and Dyck paths

Frame Stacking and Retaining for Recurrent Neural Network Acoustic Model

Learning a Hierarchical Latent-Variable Model of Voxelized 3D Shapes

Stability Analysis of Adaptive Control Systems with Event-triggered Try-once-discard Protocol

A Survey on Trapping Sets and Stopping Sets

Automatic Vertebra Labeling in Large-Scale 3D CT using Deep Image-to-Image Network with Message Passing and Sparsity Regularization

Three Asymptotic Regimes for Ranking and Selection with General Sample Distributions

One Shot Joint Colocalization and Cosegmentation

Skew-Cyclic Codes over $B_k$

PaMM: Pose-aware Multi-shot Matching for Improving Person Re-identification

Observational Data-Driven Modeling and Optimization of Manufacturing Processes

Chamber structure for some equivariant relative Gromov-Witten invariants of $\mathbb{P}^1$ in genus $0$

Regularizing with Bregman-Moreau envelopes

A vanishing result for the first twisted cohomology of affine varieties and applications to line arrangements

Demand-Aware Network Designs of Bounded Degree

Joint Positioning and Radio Map Generation Based on Stochastic Variational Bayesian Inference for FWIPS

Learning to Identify Ambiguous and Misleading News Headlines

Information Geometry Approach to Parameter Estimation in Hidden Markov Models

Minimization of fraction function penalty in compressed sensing

A Note on The Enumeration of Euclidean Self-Dual Skew-Cyclic Codes over Finite Fields

Data-Centric Mobile Crowdsensing

Target Type Identification for Entity-Bearing Queries

Joint Learning from Earth Observation and OpenStreetMap Data to Get Faster Better Semantic Maps

Pitfalls and Best Practices in Algorithm Configuration

On the maximum degree of path-pairable planar graphs

Rank 3 Inhabitation of Intersection Types Revisited (Extended Version)

Millimeter Wave Communications for Future Mobile Networks

Superfast Line Spectral Estimation

A Variational Reconstruction Method for Undersampled Dynamic X-ray Tomography based on Physical Motion Models

Robust Registration of Gaussian Mixtures for Colour Transfer

Upper bounds on the growth rate of Diffusion Limited Aggregation

Symplectic Geometry of Constrained Optimization

Unlabeled Data for Morphological Generation With Character-Based Sequence-to-Sequence Models

Robust Sum Secrecy Rate Optimization for MIMO Two-way Full Duplex Systems

Geometrical features of time series provide new perspectives on collective fluctuations in driven disordered systems

Cutoff for a stratified random walk on the hypercube

A general framework for solving convex optimization problems involving the sum of three convex functions

Two-Sample Tests for Large Random Graphs using Network Statistics

Zeros of Loschmidt echo in the presence of Anderson localization

Self-stabilising Byzantine Clock Synchronisation is Almost as Easy as Consensus

Spectrum degeneracy and impact on extrapolation and sampling for functions on branching lines

Co-clustering through Optimal Transport

Iteration-complexity analysis of a generalized alternating direction method of multipliers

Infinite combinatorics plain and simple

Six Degree-of-Freedom Localization of Endoscopic Capsule Robots using Recurrent Neural Networks embedded into a Convolutional Neural Network

Data Access for LIGO on the OSG

An Investigation of Newton-Sketch and Subsampled Newton Methods

Mixing MACs: An Introduction to Hybrid Radio Wireless Virtualization

Principal Component Analysis for Functional Data on Riemannian Manifolds and Spheres

Learning to Represent Haptic Feedback for Partially-Observable Tasks

On the threshold of spread-out voter model percolation

Optimal Ramp Schemes and Related Combinatorial Objects

Regression models for replicated marked point processes

A deep level set method for image segmentation

A D-vine copula based model for repeated measurements extending linear mixed models with homogeneous correlation structure

Utility of general and specific word embeddings for classifying translational stages of research

Deep Diagnostics: Applying Convolutional Neural Networks for Vessels Defects Detection

A Parallel Solver for Graph Laplacians

On the Nonexistence of Some Generalized Folkman Numbers

A Review on Bilevel Optimization: From Classical to Evolutionary Approaches and Applications

Fast Snapshottable Concurrent Braun Heaps
