Magister Dixit

“Twitter is probably the best place to start conversations about data science.” John Foreman


Book Memo: “Algorithm Engineering”

Selected Results and Surveys
Algorithm Engineering is a methodology for algorithmic research that combines theory with implementation and experimentation in order to obtain better algorithms with high practical impact. Traditionally, the study of algorithms was dominated by mathematical (worst-case) analysis. In Algorithm Engineering, algorithms are also implemented and experiments conducted in a systematic way, sometimes resembling the experimentation processes known from fields such as biology, chemistry, or physics. This helps in counteracting an otherwise growing gap between theory and practice.

If you did not already know: “Stochastic Frontier Models”

Stochastic frontier models allow to analyse technical inefficiency in the framework of production functions. Production units (firms, regions, countries, etc.) are assumed to produce according to a common technology, and reach the frontier when they produce the maximum possible output for a given set of inputs. Inefficiencies can be due to structural problems or market imperfections and other factors which cause countries to produce below their maximum attainable output. Over time, production units can become less inefficient and catch up to the frontier. It is also possible that the frontier shifts, indicating technical progress. In addition, production units can move along the frontier by changing input quantities. Finally, there can be some combinations of these three effects. The stochastic frontier method allows to decompose growth into changes in input use, changes in technology and changes in efficiency, thus extending the widely used growth accounting method. … Stochastic Frontier Models google

Distilled News

A survey of itemset mining

Itemset mining is an important subfield of data mining, which consists of discovering interesting and useful patterns in transaction databases. The traditional task of frequent itemset mining is to discover groups of items (itemsets) that appear frequently together in transactions made by customers. Although itemset mining was designed for market basket analysis, it can be viewed more generally as the task of discovering groups of attribute values frequently cooccurring in databases. Because of its numerous applications in domains such as bioinformatics, text mining, product recommendation, e-learning, and web click stream analysis, itemset mining has become a popular research area. This study provides an up-to-date survey that can serve both as an introduction and as a guide to recent advances and opportunities in the field. The problem of frequent itemset mining and its applications are described. Moreover, main approaches and strategies to solve itemset mining problems are presented, as well as their characteristics are provided. Limitations of traditional frequent itemset mining approaches are also highlighted, and extensions of the task of itemset mining are presented such as high-utility itemset mining, rare itemset mining, fuzzy itemset mining, and uncertain itemset mining. This study also discusses research opportunities and the relationship to other popular pattern mining problems, such as sequential pattern mining, episode mining, subgraph mining, and association rule mining. Main open-source libraries of itemset mining implementations are also briefly presented.

The Value of Exploratory Data Analysis

From the outside, data science is often thought to consist wholly of advanced statistical and machine learning techniques. However, there is another key component to any data science endeavor that is often undervalued or forgotten: exploratory data analysis (EDA). At a high level, EDA is the practice of using visual and quantitative methods to understand and summarize a dataset without making any assumptions about its contents. It is a crucial step to take before diving into machine learning or statistical modeling because it provides the context needed to develop an appropriate model for the problem at hand and to correctly interpret its results.

Logistic regressions (in R)

Logistic regressions are a great tool for predicting outcomes that are categorical. They use a transformation function based on probability to perform a linear regression. This makes them easy to interpret and implement in other systems. Logistic regressions can be used to perform a classification for things like determining whether someone needs to go for a biopsy. They can also be used for a more nuanced view by using the probabilities of an outcome for thinks like prioritising interventions based on likelihood to default on a loan.

SQL Server 2017 to add Python support

One of the major announcements from yesterday’s Data Amp event was that SQL Server 2017 will add Python as a supported language. Just as with the continued R support, SQL Server 2017 will allow you to process data in the database using any Python function or package without needing to export the data from the database, and use SQL Server itself as an operationalization platform for production applications using Python code. In addition, the high-performance and distributed statistical and machine learning functions from the RevoScaleR and MicrosoftML packages in Microsoft R will be available as Python functions for use within SQL Server.

User Defined Functions in R Exercises (Part 1)

In the Exercises we will discuss User Defined Function in R

Bland-Altman Plot for Age Comparisons?

Last week I posted about a modified age bias plot. In this post I began looking more deeply at an alternative plot called the Bland-Altman plot. Below, I describe this plot, demonstrate how to construct it in R, give a mild critique of its use for compare fish age estimates, and develop an alternative that is mean to correct what I see as some of the shortcomings of using the Bland-Altman plot for comparing age estimates. This is large a “thinking out loud” exercise so I am open to any suggestions that you may have.

Document worth reading: “Generative Deep Neural Networks for Dialogue: A Short Review”

Researchers have recently started investigating deep neural networks for dialogue applications. In particular, generative sequence-to-sequence (Seq2Seq) models have shown promising results for unstructured tasks, such as word-level dialogue response generation. The hope is that such models will be able to leverage massive amounts of data to learn meaningful natural language representations and response generation strategies, while requiring a minimum amount of domain knowledge and hand-crafting. An important challenge is to develop models that can effectively incorporate dialogue context and generate meaningful and diverse responses. In support of this goal, we review recently proposed models based on generative encoder-decoder neural network architectures, and show that these models have better ability to incorporate long-term dialogue history, to model uncertainty and ambiguity in dialogue, and to generate responses with high-level compositional structure. Generative Deep Neural Networks for Dialogue: A Short Review

Book Memo: “Big Graph Analytics Platforms”

The growing need to deal with massive graphs in real-life applications has led to a surge in the development of big graph analytics platforms. Tens of such big graph systems have already been developed, and more are expected to emerge in the near future. Although several experimental studies have been conducted in recent years that compare the performance of several big graph systems, Big Graph Analytics Platforms is the first text to provide a comprehensive survey that clearly summarizes the key features and techniques developed in existing systems. It aims to help readers get a systematic picture of the landscape of recent big graph systems, focusing not just on the systems themselves, but also on the key innovations and design philosophies underlying them. In addition to the popular vertex-centric systems which espouse a think-like-a-vertex paradigm for developing parallel graph applications, Big Graph Analytics Platforms also covers other programming and computation models, contrasts those against each other, and provides a vision for future research in the field.

R Packages worth a look

Goodness-of-Fit Tests for Capture-Recapture Models (R2ucare)
Performs goodness-of-fit tests for capture-recapture models. Also contains several functions to process capture-recapture data.

Regularized Autoregressive Hidden Semi Markov Model (rarhsmm)
Fit Gaussian hidden Markov (or semi-Markov) models with / without autoregressive coefficients and with / without regularization. The fitting algorithm for the hidden Markov model is illustrated by Rabiner (1989) <doi:10.1109/5.18626>. The shrinkage estimation on the covariance matrices is based on the graphical lasso method by Freedman (2007) <doi:10.1093/biostatistics/kxm045>. The shrinkage estimation on the autoregressive coefficients uses the elastic net shrinkage detailed in Zou (2005) <doi:10.1111/j.1467-9868.2005.00503.x>.

Tools to Easily Read/Write NetCDF Files into/from Multidimensional R Arrays (easyNCDF)
Set of wrappers for the ‘ncdf4’ package to simplify and extend its reading/writing capabilities into/from multidimensional R arrays.

R Client for the YouTube Analytics and Reporting API (tubern)
Get statistics and reports from YouTube. To learn more about the YouTube Analytics and Reporting API, see <https://…/>.

Parses LaTeX Documents for Errors (TeXCheckR)
Checks LaTeX documents and .bib files for typing errors, such as spelling errors, incorrect quotation marks. Also provides useful functions for parsing and linting bibliography files.

Whats new on arXiv

End-to-End Multi-View Networks for Text Classification

We propose a multi-view network for text classification. Our method automatically creates various views of its input text, each taking the form of soft attention weights that distribute the classifier’s focus among a set of base features. For a bag-of-words representation, each view focuses on a different subset of the text’s words. Aggregating many such views results in a more discriminative and robust representation. Through a novel architecture that both stacks and concatenates views, we produce a network that emphasizes both depth and width, allowing training to converge quickly. Using our multi-view architecture, we establish new state-of-the-art accuracies on two benchmark tasks.

An Interpretable Knowledge Transfer Model for Knowledge Base Completion

Knowledge bases are important resources for a variety of natural language processing tasks but suffer from incompleteness. We propose a novel embedding model, \emph{ITransF}, to perform knowledge base completion. Equipped with a sparse attention mechanism, ITransF discovers hidden concepts of relations and transfer statistical strength through the sharing of concepts. Moreover, the learned associations between relations and concepts, which are represented by sparse attention vectors, can be interpreted easily. We evaluate ITransF on two benchmark datasets—WN18 and FB15k for knowledge base completion and obtains improvements on both the mean rank and Hits@10 metrics, over all baselines that do not use additional information.

SAFS: A Deep Feature Selection Approach for Precision Medicine

In this paper, we propose a new deep feature selection method based on deep architecture. Our method uses stacked auto-encoders for feature representation in higher-level abstraction. We developed and applied a novel feature learning approach to a specific precision medicine problem, which focuses on assessing and prioritizing risk factors for hypertension (HTN) in a vulnerable demographic subgroup (African-American). Our approach is to use deep learning to identify significant risk factors affecting left ventricular mass indexed to body surface area (LVMI) as an indicator of heart damage risk. The results show that our feature learning and representation approach leads to better results in comparison with others.

Temporal Clustering

We study the problem of clustering sequences of unlabeled point sets taken from a common metric space. Such scenarios arise naturally in applications where a system or process is observed in distinct time intervals, such as biological surveys and contagious disease surveillance. In this more general setting existing algorithms for classical (i.e.~static) clustering problems are not applicable anymore. We propose a set of optimization problems which we collectively refer to as ‘temporal clustering’. The quality of a solution to a temporal clustering instance can be quantified using three parameters: the number of clusters k, the spatial clustering cost r, and the maximum cluster displacement \delta between consecutive time steps. We consider spatial clustering costs which generalize the well-studied k-center, discrete k-median, and discrete k-means objectives of classical clustering problems. We develop new algorithms that achieve trade-offs between the three objectives k, r, and \delta. Our upper bounds are complemented by inapproximability results.

Latent Mixture Modeling for Clustered Data

This article proposes a mixture modeling approach to estimating cluster-wise conditional distributions in clustered (grouped) data. We adapt the mixture-of-experts model to the latent distributions, and propose a model in which each cluster-wise density is represented as a mixture of latent experts with cluster-wise mixing proportions distributed as Dirichlet distribution. The model parameters are estimated by maximizing the marginal likelihood function using a newly developed Monte Carlo Expectation-Maximization algorithm. We also extend the model such that the distribution of cluster-wise mixing proportions depends on some cluster-level covariates. The finite sample performance of the proposed model is compared with some existing mixture modeling approaches as well as linear mixed model through the simulation studies. The proposed model is also illustrated with the posted land price data in Japan.

Knowledge Fusion via Embeddings from Text, Knowledge Graphs, and Images

We present a baseline approach for cross-modal knowledge fusion. Different basic fusion methods are evaluated on existing embedding approaches to show the potential of joining knowledge about certain concepts across modalities in a fused concept representation.

Dynamic Graph Convolutional Networks

Many different classification tasks need to manage structured data, which are usually modeled as graphs. Moreover, these graphs can be dynamic, meaning that the vertices/edges of each graph may change during time. Our goal is to jointly exploit structured data and temporal information through the use of a neural network model. To the best of our knowledge, this task has not been addressed using these kind of architectures. For this reason, we propose two novel approaches, which combine Long Short-Term Memory networks and Graph Convolutional Networks to learn long short-term dependencies together with graph structure. The quality of our methods is confirmed by the promising results achieved.

On Covering Monotonic Paths with Simple Random Walk

A matrix generalization of a theorem of Fine

Self-avoiding walks and connective constants

A Note on the Concentration of Spectral Measure of Wigner’s Matrices

The Complexity of Tree Partitioning

On fast bounded locality sensitive hashing

A Coalition Formation Algorithm for Multi-Robot Task Allocation in Large-Scale Natural Disasters

The Disk is a Local Maximum in Hall’s Conjecture

Piggybacking Codes for Network Coding: The High/Low SNR Regime

Application of Econometric Data Analysis Methods to Physics Software

Model Order Selection Rules For Covariance Structure Classification

Deciding some Maltsev conditions in finite idempotent algebras

Surprising Examples of Manifolds in Toric Topology!

Global Stabilization of Triangular Systems with Time-Delayed Dynamic Input Perturbations

HPatches: A benchmark and evaluation of handcrafted and learned local descriptors

Guaranteed Fault Detection and Isolation for Switched Affine Models

Semi-supervised classification for dynamic Android malware detection

Unassisted Quantitative Evaluation Of Despeckling Filters

Global Relation Embedding for Relation Extraction

SLAM with Objects using a Nonparametric Pose Graph

Monte Carlo Tree Search with Sampled Information Relaxation Dual Bounds

SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours

Call Attention to Rumors: Deep Attention Based Recurrent Neural Networks for Early Rumor Detection

Cross-domain Semantic Parsing via Paraphrasing

Accelerating microscopy: incorporating half-lies in imaging protocols

Retrospective Higher-Order Markov Processes for User Trails

Strategic Arrivals to Queues Offering Priority Service

Thresholds For Detecting An Anomalous Path From Noisy Environments

Subspace Designs based on Algebraic Function Fields

Edge Connectivity, Packing Spanning Trees, and eigenvalues of Graphs

An Expectation Maximization Algorithm for High-Dimensional Model Selection for the Ising Model with Misclassified States

On the Success Probability of the Box-Constrained Rounding and Babai Detectors

Non-Coherent Direction-of-Arrival Estimation Using Partly Calibrated Arrays

Fast Generation for Convolutional Autoregressive Models

The minimum Q-index of strongly connected bipartite digraphs with complete bipartite subdigraphs

Fractional Moment Methods for Anderson Localization with SAW Representation

BranchConnect: Large-Scale Visual Recognition with Learned Branch Connections

Graph-based Joint Signal / Power Restoration for Energy Harvesting Wireless Sensor Networks

Genetic Algorithm Based Floor Planning System

PAFit: An R Package for Modeling and Estimating Preferential Attachment and Node Fitness in Temporal Complex Networks

A Fuzzy Brute Force Matching Method for Binary Image Features

Enhancing Person Re-identification in a Self-trained Subspace

H-relative error estimation approach for multiplicative regression model with random effect

Performance Limits of Stochastic Sub-Gradient Learning, Part II: Multi-Agent Case

Predicting Cognitive Decline with Deep Learning of Brain Metabolism and Amyloid Imaging

Edge fluctuations of limit shapes

End-to-end representation learning for Correlation Filter based tracking

On Level-1 Consensus Ensuring Stable Social Choice

Understanding the Mechanisms of Deep Transfer Learning for Medical Images

Improvement of PolSAR Decomposition Scattering Powers Using a Relative Decorrelation Measure

Multidimensional random walk with reflections

Critical Gaussian chaos: convergence and uniqueness in the derivative normalisation

Multi-view Probability Linear Discrimination Analysis for Multi-view Vector Based Text Dependent Speaker Verification

Every Untrue Label is Untrue in its Own Way: Controlling Error Type with the Log Bilinear Loss

A note on MCMC for nested multilevel regression models via belief propagation

End-to-End Unsupervised Deformable Image Registration with a Convolutional Neural Network

Certification of Compact Low-Stretch Routing Schemes

Quenched Central Limit Theorem for Random Walks in Doubly Stochastic Random Environment

A Geometric Approach to Covariance Matrix Estimation and its Applications to Radar Problems

Name Independent Fault Tolerant Routing Scheme

How Bandwidth Affects the $CONGEST$ Model

Independent transversal domination number of a graph

The Dependent Doors Problem: An Investigation into Sequential Decisions without Feedback

How close are time series to power tail Lévy diffusions?

Neural End-to-End Learning for Computational Argumentation Mining

First-Principles Prediction of Densities of Amorphous Materials: The Case of Amorphous Silicon

Using Mise-En-Scène Visual Features based on MPEG-7 and Deep Learning for Movie Recommendation

Exploratory and Confirmatory Factor Analyses of Religiosity. A Four-Factor Conceptual Model

An Achievable Rate for an Optical Channel with Finite Memory

BB_twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs

Learning to Acquire Information

Overpartitions Identities Involving Gaps and Weights

Analysis of Newton-Raphson Consensus for multi-agent convex optimization under asynchronous and lossy communications

Optimal Query Time for Encoding Range Majority

Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data

Positive Affirmation of Non-Algorithmic Information Processing

Halfspace depths for scatter, concentration and shape matrices

A Counterexample to the Vector Generalization of Costa’s EPI, and Partial Resolution

Boolean quadric polytopes are faces of linear ordering polytopes

Segmentation of the Proximal Femur from MR Images using Deep Convolutional Neural Networks

Exploring epoch-dependent stochastic residual networks

Spectral tail processes and max-stable approximations of multivariate regularly varying time series

On the DoF of Parallel MISO BCs with Partial CSIT: Total Order and Separability

Cell-Probe Lower Bounds from Online Communication Complexity

Training object class detectors with click supervision

Softmax GAN

Intrusion Prevention and Detection in Grid Computing – The ALICE Case

Improved Neural Relation Detection for Knowledge Base Question Answering

An analytical Lieb-Sokal lemma

Independence times for iid sequences, random walks and Lévy processes

On conditional cuts for Stochastic Dual Dynamic Programming

ADMM Penalty Parameter Selection by Residual Balancing

Weighted regression and meta-analysis are computationally efficient ways to fit mixed models to electronic health records from the Clinical Practice Research Datalink

On Singleton Arc Consistency for Natural CSPs Defined by Forbidden Patterns

Reinforcement Learning with External Knowledge and Two-Stage Q-functions for Predicting Popular Reddit Threads

Discrete configuration spaces of squares and hexagons

Temporal Action Detection with Structured Segment Networks

Large-sample approximations for variance-covariance matrices of high-dimensional time series

‘Wrong’ Side Interpolation by Low Degree Positive Real rational Functions

On monotone circuits with local oracles and clique lower bounds

Towards Large-Pose Face Frontalization in the Wild

The Nu Class of Low-Degree-Truncated Rational Multifunctions. Ia. MINOS for IMSPE Evaluation and Optimal-IMSPE-Design Search

Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency

On the gonality, treewidth, and orientable genus of a graph

Robust Wirtinger Flow for Phase Retrieval with Arbitrary Corruption