# Magister Dixit

“Twitter is probably the best place to start conversations about data science.” John Foreman

# Book Memo: “Algorithm Engineering”

Selected Results and Surveys

Algorithm Engineering is a methodology for algorithmic research that combines theory with implementation and experimentation in order to obtain better algorithms with high practical impact. Traditionally, the study of algorithms was dominated by mathematical (worst-case) analysis. In Algorithm Engineering, algorithms are also implemented and experiments conducted in a systematic way, sometimes resembling the experimentation processes known from fields such as biology, chemistry, or physics. This helps in counteracting an otherwise growing gap between theory and practice.

# If you did not already know: “Stochastic Frontier Models”

Stochastic frontier models allow one to analyse technical inefficiency in the framework of production functions. Production units (firms, regions, countries, etc.) are assumed to produce according to a common technology, and reach the frontier when they produce the maximum possible output for a given set of inputs. Inefficiencies can be due to structural problems or market imperfections and other factors which cause countries to produce below their maximum attainable output. Over time, production units can become less inefficient and catch up to the frontier. It is also possible that the frontier shifts, indicating technical progress. In addition, production units can move along the frontier by changing input quantities. Finally, there can be combinations of these three effects. The stochastic frontier method allows growth to be decomposed into changes in input use, changes in technology and changes in efficiency, thus extending the widely used growth accounting method. … Stochastic Frontier Models
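A minimal sketch of what fitting such a model can look like in R, using the frontier package; the data frame and variable names (firms, output, capital, labour) are assumptions made up for illustration:

```r
# Hedged sketch: fit a Cobb-Douglas stochastic production frontier with the
# 'frontier' package. The data frame 'firms' and its columns are hypothetical.
library(frontier)

# log(output) = b0 + b1*log(capital) + b2*log(labour) + v - u,
# where v is symmetric noise and u >= 0 captures technical inefficiency.
fit <- sfa(log(output) ~ log(capital) + log(labour), data = firms)

summary(fit)        # frontier coefficients and variance parameters
efficiencies(fit)   # unit-level technical efficiency scores in (0, 1]
```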

# Distilled News

Itemset mining is an important subfield of data mining, which consists of discovering interesting and useful patterns in transaction databases. The traditional task of frequent itemset mining is to discover groups of items (itemsets) that appear frequently together in transactions made by customers. Although itemset mining was designed for market basket analysis, it can be viewed more generally as the task of discovering groups of attribute values frequently co-occurring in databases. Because of its numerous applications in domains such as bioinformatics, text mining, product recommendation, e-learning, and web click stream analysis, itemset mining has become a popular research area. This study provides an up-to-date survey that can serve both as an introduction and as a guide to recent advances and opportunities in the field. The problem of frequent itemset mining and its applications are described. Moreover, the main approaches and strategies for solving itemset mining problems are presented, along with their characteristics. Limitations of traditional frequent itemset mining approaches are also highlighted, and extensions of the task of itemset mining are presented, such as high-utility itemset mining, rare itemset mining, fuzzy itemset mining, and uncertain itemset mining. This study also discusses research opportunities and the relationship to other popular pattern mining problems, such as sequential pattern mining, episode mining, subgraph mining, and association rule mining. The main open-source libraries of itemset mining implementations are also briefly presented.
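As a quick illustration of the basic task (not tied to any particular library covered in the survey), frequent itemsets can be mined in R with the arules package; the toy baskets below are made up:

```r
# Sketch of frequent itemset mining with the 'arules' package; toy data only.
library(arules)

baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("beer", "bread"),
  c("beer", "butter", "milk")
)
trans <- as(baskets, "transactions")

# Find all itemsets that appear in at least half of the transactions.
freq <- apriori(trans, parameter = list(supp = 0.5, target = "frequent itemsets"))
inspect(freq)
```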
From the outside, data science is often thought to consist wholly of advanced statistical and machine learning techniques. However, there is another key component to any data science endeavor that is often undervalued or forgotten: exploratory data analysis (EDA). At a high level, EDA is the practice of using visual and quantitative methods to understand and summarize a dataset without making any assumptions about its contents. It is a crucial step to take before diving into machine learning or statistical modeling because it provides the context needed to develop an appropriate model for the problem at hand and to correctly interpret its results.
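A minimal sketch of such a first EDA pass in base R, with the built-in iris data standing in for a real dataset:

```r
# First-pass EDA in base R; 'iris' is only a stand-in for a real dataset.
data(iris)

str(iris)               # column types and a preview of the values
summary(iris)           # per-column summaries
colSums(is.na(iris))    # missing values per column

hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")  # one variable's distribution
pairs(iris[, 1:4], col = iris$Species)                       # pairwise relationships
```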
Logistic regressions are a great tool for predicting outcomes that are categorical. They apply a probability-based transformation to a linear regression, which makes them easy to interpret and to implement in other systems. Logistic regressions can be used to perform a classification for things like determining whether someone needs to go for a biopsy. They can also be used for a more nuanced view by using the probabilities of an outcome for things like prioritising interventions based on the likelihood of defaulting on a loan.
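For instance, a basic logistic regression fit and predicted probability in R might look like the following; the outcome and predictor are simulated stand-ins:

```r
# Hedged sketch: logistic regression with base R's glm(); simulated data only.
set.seed(1)
n    <- 200
age  <- rnorm(n, mean = 50, sd = 10)
risk <- rbinom(n, 1, plogis(-10 + 0.2 * age))   # binary outcome depending on age

fit <- glm(risk ~ age, family = binomial)       # logit link by default
summary(fit)

# Predicted probability for a new case: threshold it for a hard classification,
# or use the raw probability to rank and prioritise cases.
predict(fit, newdata = data.frame(age = 62), type = "response")
```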
One of the major announcements from yesterday’s Data Amp event was that SQL Server 2017 will add Python as a supported language. Just as with the continued R support, SQL Server 2017 will allow you to process data in the database using any Python function or package without needing to export the data from the database, and use SQL Server itself as an operationalization platform for production applications using Python code. In addition, the high-performance and distributed statistical and machine learning functions from the RevoScaleR and MicrosoftML packages in Microsoft R will be available as Python functions for use within SQL Server.
In the Exercises we will discuss User Defined Functions in R.
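As a small preview of the topic, a user defined function in R is simply a function() object assigned to a name; the example below is a made-up illustration:

```r
# A simple user defined function: standardise a numeric vector.
standardise <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

standardise(c(2, 4, 6, 8))   # returns values with mean 0 and sd 1
```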
Last week I posted about a modified age bias plot. In this post I begin looking more deeply at an alternative plot called the Bland-Altman plot. Below, I describe this plot, demonstrate how to construct it in R, give a mild critique of its use for comparing fish age estimates, and develop an alternative that is meant to correct what I see as some of the shortcomings of using the Bland-Altman plot for comparing age estimates. This is largely a “thinking out loud” exercise, so I am open to any suggestions that you may have.
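For reference, a basic Bland-Altman plot (mean of the two estimates against their difference) can be drawn in base R as below; the paired readings are simulated stand-ins for two sets of age estimates:

```r
# Hedged sketch of a Bland-Altman plot in base R; the readings are simulated.
set.seed(42)
true_age <- sample(1:10, 100, replace = TRUE)
reader1  <- true_age + rnorm(100, sd = 0.5)
reader2  <- true_age + rnorm(100, sd = 0.5)

means <- (reader1 + reader2) / 2   # x-axis: mean of the paired estimates
diffs <- reader1 - reader2         # y-axis: difference between the estimates

plot(means, diffs, xlab = "Mean of estimates", ylab = "Difference (reader1 - reader2)")
abline(h = mean(diffs))                                        # mean bias
abline(h = mean(diffs) + c(-1.96, 1.96) * sd(diffs), lty = 2)  # limits of agreement
```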

# Document worth reading: “Generative Deep Neural Networks for Dialogue: A Short Review”

Researchers have recently started investigating deep neural networks for dialogue applications. In particular, generative sequence-to-sequence (Seq2Seq) models have shown promising results for unstructured tasks, such as word-level dialogue response generation. The hope is that such models will be able to leverage massive amounts of data to learn meaningful natural language representations and response generation strategies, while requiring a minimum amount of domain knowledge and hand-crafting. An important challenge is to develop models that can effectively incorporate dialogue context and generate meaningful and diverse responses. In support of this goal, we review recently proposed models based on generative encoder-decoder neural network architectures, and show that these models have better ability to incorporate long-term dialogue history, to model uncertainty and ambiguity in dialogue, and to generate responses with high-level compositional structure. Generative Deep Neural Networks for Dialogue: A Short Review

# Book Memo: “Big Graph Analytics Platforms”

 The growing need to deal with massive graphs in real-life applications has led to a surge in the development of big graph analytics platforms. Tens of such big graph systems have already been developed, and more are expected to emerge in the near future. Although several experimental studies have been conducted in recent years that compare the performance of several big graph systems, Big Graph Analytics Platforms is the first text to provide a comprehensive survey that clearly summarizes the key features and techniques developed in existing systems. It aims to help readers get a systematic picture of the landscape of recent big graph systems, focusing not just on the systems themselves, but also on the key innovations and design philosophies underlying them. In addition to the popular vertex-centric systems which espouse a think-like-a-vertex paradigm for developing parallel graph applications, Big Graph Analytics Platforms also covers other programming and computation models, contrasts those against each other, and provides a vision for future research in the field.

# R Packages worth a look

Goodness-of-Fit Tests for Capture-Recapture Models (R2ucare)
Performs goodness-of-fit tests for capture-recapture models. Also contains several functions to process capture-recapture data.

Regularized Autoregressive Hidden Semi Markov Model (rarhsmm)
Fit Gaussian hidden Markov (or semi-Markov) models with / without autoregressive coefficients and with / without regularization. The fitting algorithm for the hidden Markov model is illustrated by Rabiner (1989) <doi:10.1109/5.18626>. The shrinkage estimation on the covariance matrices is based on the graphical lasso method by Freedman (2007) <doi:10.1093/biostatistics/kxm045>. The shrinkage estimation on the autoregressive coefficients uses the elastic net shrinkage detailed in Zou (2005) <doi:10.1111/j.1467-9868.2005.00503.x>.

Tools to Easily Read/Write NetCDF Files into/from Multidimensional R Arrays (easyNCDF)
Set of wrappers for the ‘ncdf4’ package to simplify and extend its reading/writing capabilities into/from multidimensional R arrays.

R Client for the YouTube Analytics and Reporting API (tubern)

Parses LaTeX Documents for Errors (TeXCheckR)
Checks LaTeX documents and .bib files for typing errors, such as spelling errors and incorrect quotation marks. Also provides useful functions for parsing and linting bibliography files.

# If you did not already know: “Metric”

In mathematics, a metric or distance function is a function that defines a distance between each pair of elements of a set. A set with a metric is called a metric space. A metric induces a topology on a set, but not all topologies can be generated by a metric. A topological space whose topology can be described by a metric is called metrizable. … Metric
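Spelled out, a metric on a set $X$ is a function $d$ satisfying, for all $x, y, z \in X$:

```latex
d : X \times X \to [0, \infty), \qquad
\begin{aligned}
& d(x, y) = 0 \iff x = y           && \text{(identity of indiscernibles)} \\
& d(x, y) = d(y, x)                && \text{(symmetry)} \\
& d(x, z) \le d(x, y) + d(y, z)    && \text{(triangle inequality)}
\end{aligned}
```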

# What’s new on arXiv

We propose a multi-view network for text classification. Our method automatically creates various views of its input text, each taking the form of soft attention weights that distribute the classifier’s focus among a set of base features. For a bag-of-words representation, each view focuses on a different subset of the text’s words. Aggregating many such views results in a more discriminative and robust representation. Through a novel architecture that both stacks and concatenates views, we produce a network that emphasizes both depth and width, allowing training to converge quickly. Using our multi-view architecture, we establish new state-of-the-art accuracies on two benchmark tasks.
Knowledge bases are important resources for a variety of natural language processing tasks but suffer from incompleteness. We propose a novel embedding model, \emph{ITransF}, to perform knowledge base completion. Equipped with a sparse attention mechanism, ITransF discovers hidden concepts of relations and transfers statistical strength through the sharing of concepts. Moreover, the learned associations between relations and concepts, which are represented by sparse attention vectors, can be interpreted easily. We evaluate ITransF on two benchmark datasets, WN18 and FB15k, for knowledge base completion, and obtain improvements on both the mean rank and Hits@10 metrics over all baselines that do not use additional information.
In this paper, we propose a new deep feature selection method based on deep architecture. Our method uses stacked auto-encoders for feature representation in higher-level abstraction. We developed and applied a novel feature learning approach to a specific precision medicine problem, which focuses on assessing and prioritizing risk factors for hypertension (HTN) in a vulnerable demographic subgroup (African-American). Our approach is to use deep learning to identify significant risk factors affecting left ventricular mass indexed to body surface area (LVMI) as an indicator of heart damage risk. The results show that our feature learning and representation approach leads to better results in comparison with others.
We study the problem of clustering sequences of unlabeled point sets taken from a common metric space. Such scenarios arise naturally in applications where a system or process is observed in distinct time intervals, such as biological surveys and contagious disease surveillance. In this more general setting existing algorithms for classical (i.e.~static) clustering problems are not applicable anymore. We propose a set of optimization problems which we collectively refer to as ‘temporal clustering’. The quality of a solution to a temporal clustering instance can be quantified using three parameters: the number of clusters $k$, the spatial clustering cost $r$, and the maximum cluster displacement $\delta$ between consecutive time steps. We consider spatial clustering costs which generalize the well-studied $k$-center, discrete $k$-median, and discrete $k$-means objectives of classical clustering problems. We develop new algorithms that achieve trade-offs between the three objectives $k$, $r$, and $\delta$. Our upper bounds are complemented by inapproximability results.
This article proposes a mixture modeling approach to estimating cluster-wise conditional distributions in clustered (grouped) data. We adapt the mixture-of-experts model to the latent distributions, and propose a model in which each cluster-wise density is represented as a mixture of latent experts with cluster-wise mixing proportions distributed according to a Dirichlet distribution. The model parameters are estimated by maximizing the marginal likelihood function using a newly developed Monte Carlo Expectation-Maximization algorithm. We also extend the model such that the distribution of cluster-wise mixing proportions depends on some cluster-level covariates. The finite sample performance of the proposed model is compared with some existing mixture modeling approaches as well as a linear mixed model through simulation studies. The proposed model is also illustrated with posted land price data from Japan.
We present a baseline approach for cross-modal knowledge fusion. Different basic fusion methods are evaluated on existing embedding approaches to show the potential of joining knowledge about certain concepts across modalities in a fused concept representation.
Many different classification tasks need to manage structured data, which are usually modeled as graphs. Moreover, these graphs can be dynamic, meaning that the vertices/edges of each graph may change over time. Our goal is to jointly exploit structured data and temporal information through the use of a neural network model. To the best of our knowledge, this task has not been addressed using this kind of architecture. For this reason, we propose two novel approaches, which combine Long Short-Term Memory networks and Graph Convolutional Networks to learn long short-term dependencies together with graph structure. The quality of our methods is confirmed by the promising results achieved.