“Twitter is probably the best place to start conversations about data science.” John Foreman

# Magister Dixit

**23**
*Sunday*
Apr 2017

Posted Magister Dixit

in
Advertisements

**23**
*Sunday*
Apr 2017

Posted Magister Dixit

in“Twitter is probably the best place to start conversations about data science.” John Foreman

Advertisements

**22**
*Saturday*
Apr 2017

Posted Books

in
**22**
*Saturday*
Apr 2017

Posted What is ...

inStochastic frontier models allow to analyse technical inefficiency in the framework of production functions. Production units (firms, regions, countries, etc.) are assumed to produce according to a common technology, and reach the frontier when they produce the maximum possible output for a given set of inputs. Inefficiencies can be due to structural problems or market imperfections and other factors which cause countries to produce below their maximum attainable output. Over time, production units can become less inefficient and catch up to the frontier. It is also possible that the frontier shifts, indicating technical progress. In addition, production units can move along the frontier by changing input quantities. Finally, there can be some combinations of these three effects. The stochastic frontier method allows to decompose growth into changes in input use, changes in technology and changes in efficiency, thus extending the widely used growth accounting method. … Stochastic Frontier Models

**22**
*Saturday*
Apr 2017

Posted Distilled News

inItemset mining is an important subfield of data mining, which consists of discovering interesting and useful patterns in transaction databases. The traditional task of frequent itemset mining is to discover groups of items (itemsets) that appear frequently together in transactions made by customers. Although itemset mining was designed for market basket analysis, it can be viewed more generally as the task of discovering groups of attribute values frequently cooccurring in databases. Because of its numerous applications in domains such as bioinformatics, text mining, product recommendation, e-learning, and web click stream analysis, itemset mining has become a popular research area. This study provides an up-to-date survey that can serve both as an introduction and as a guide to recent advances and opportunities in the field. The problem of frequent itemset mining and its applications are described. Moreover, main approaches and strategies to solve itemset mining problems are presented, as well as their characteristics are provided. Limitations of traditional frequent itemset mining approaches are also highlighted, and extensions of the task of itemset mining are presented such as high-utility itemset mining, rare itemset mining, fuzzy itemset mining, and uncertain itemset mining. This study also discusses research opportunities and the relationship to other popular pattern mining problems, such as sequential pattern mining, episode mining, subgraph mining, and association rule mining. Main open-source libraries of itemset mining implementations are also briefly presented.

**The Value of Exploratory Data Analysis**

From the outside, data science is often thought to consist wholly of advanced statistical and machine learning techniques. However, there is another key component to any data science endeavor that is often undervalued or forgotten: exploratory data analysis (EDA). At a high level, EDA is the practice of using visual and quantitative methods to understand and summarize a dataset without making any assumptions about its contents. It is a crucial step to take before diving into machine learning or statistical modeling because it provides the context needed to develop an appropriate model for the problem at hand and to correctly interpret its results.

Logistic regressions are a great tool for predicting outcomes that are categorical. They use a transformation function based on probability to perform a linear regression. This makes them easy to interpret and implement in other systems. Logistic regressions can be used to perform a classification for things like determining whether someone needs to go for a biopsy. They can also be used for a more nuanced view by using the probabilities of an outcome for thinks like prioritising interventions based on likelihood to default on a loan.

**SQL Server 2017 to add Python support**

One of the major announcements from yesterday’s Data Amp event was that SQL Server 2017 will add Python as a supported language. Just as with the continued R support, SQL Server 2017 will allow you to process data in the database using any Python function or package without needing to export the data from the database, and use SQL Server itself as an operationalization platform for production applications using Python code. In addition, the high-performance and distributed statistical and machine learning functions from the RevoScaleR and MicrosoftML packages in Microsoft R will be available as Python functions for use within SQL Server.

**User Defined Functions in R Exercises (Part 1)**

In the Exercises we will discuss User Defined Function in R

**Bland-Altman Plot for Age Comparisons?**

Last week I posted about a modified age bias plot. In this post I began looking more deeply at an alternative plot called the Bland-Altman plot. Below, I describe this plot, demonstrate how to construct it in R, give a mild critique of its use for compare fish age estimates, and develop an alternative that is mean to correct what I see as some of the shortcomings of using the Bland-Altman plot for comparing age estimates. This is large a “thinking out loud” exercise so I am open to any suggestions that you may have.

**22**
*Saturday*
Apr 2017

Posted Documents

inResearchers have recently started investigating deep neural networks for dialogue applications. In particular, generative sequence-to-sequence (Seq2Seq) models have shown promising results for unstructured tasks, such as word-level dialogue response generation. The hope is that such models will be able to leverage massive amounts of data to learn meaningful natural language representations and response generation strategies, while requiring a minimum amount of domain knowledge and hand-crafting. An important challenge is to develop models that can effectively incorporate dialogue context and generate meaningful and diverse responses. In support of this goal, we review recently proposed models based on generative encoder-decoder neural network architectures, and show that these models have better ability to incorporate long-term dialogue history, to model uncertainty and ambiguity in dialogue, and to generate responses with high-level compositional structure. Generative Deep Neural Networks for Dialogue: A Short Review

**22**
*Saturday*
Apr 2017

Posted Books

in
**22**
*Saturday*
Apr 2017

Posted Magister Dixit

in“Models are mathematical Equations.” Rapid Insight ( 2015 )

**22**
*Saturday*
Apr 2017

Posted R Packages

in* Goodness-of-Fit Tests for Capture-Recapture Models* (

Performs goodness-of-fit tests for capture-recapture models. Also contains several functions to process capture-recapture data.

Fit Gaussian hidden Markov (or semi-Markov) models with / without autoregressive coefficients and with / without regularization. The fitting algorithm for the hidden Markov model is illustrated by Rabiner (1989) <doi:10.1109/5.18626>. The shrinkage estimation on the covariance matrices is based on the graphical lasso method by Freedman (2007) <doi:10.1093/biostatistics/kxm045>. The shrinkage estimation on the autoregressive coefficients uses the elastic net shrinkage detailed in Zou (2005) <doi:10.1111/j.1467-9868.2005.00503.x>.

Set of wrappers for the ‘ncdf4’ package to simplify and extend its reading/writing capabilities into/from multidimensional R arrays.

Get statistics and reports from YouTube. To learn more about the YouTube Analytics and Reporting API, see <https://…/>.

Checks LaTeX documents and .bib files for typing errors, such as spelling errors, incorrect quotation marks. Also provides useful functions for parsing and linting bibliography files.

**21**
*Friday*
Apr 2017

Posted What is ...

inIn mathematics, a metric or distance function is a function that defines a distance between each pair of elements of a set. A set with a metric is called a metric space. A metric induces a topology on a set, but not all topologies can be generated by a metric. A topological space whose topology can be described by a metric is called metrizable. … Metric

**21**
*Friday*
Apr 2017

Posted arXiv Papers

in**End-to-End Multi-View Networks for Text Classification**

We propose a multi-view network for text classification. Our method automatically creates various views of its input text, each taking the form of soft attention weights that distribute the classifier’s focus among a set of base features. For a bag-of-words representation, each view focuses on a different subset of the text’s words. Aggregating many such views results in a more discriminative and robust representation. Through a novel architecture that both stacks and concatenates views, we produce a network that emphasizes both depth and width, allowing training to converge quickly. Using our multi-view architecture, we establish new state-of-the-art accuracies on two benchmark tasks.

**An Interpretable Knowledge Transfer Model for Knowledge Base Completion**

Knowledge bases are important resources for a variety of natural language processing tasks but suffer from incompleteness. We propose a novel embedding model, \emph{ITransF}, to perform knowledge base completion. Equipped with a sparse attention mechanism, ITransF discovers hidden concepts of relations and transfer statistical strength through the sharing of concepts. Moreover, the learned associations between relations and concepts, which are represented by sparse attention vectors, can be interpreted easily. We evaluate ITransF on two benchmark datasets—WN18 and FB15k for knowledge base completion and obtains improvements on both the mean rank and Hits@10 metrics, over all baselines that do not use additional information.

**SAFS: A Deep Feature Selection Approach for Precision Medicine**

In this paper, we propose a new deep feature selection method based on deep architecture. Our method uses stacked auto-encoders for feature representation in higher-level abstraction. We developed and applied a novel feature learning approach to a specific precision medicine problem, which focuses on assessing and prioritizing risk factors for hypertension (HTN) in a vulnerable demographic subgroup (African-American). Our approach is to use deep learning to identify significant risk factors affecting left ventricular mass indexed to body surface area (LVMI) as an indicator of heart damage risk. The results show that our feature learning and representation approach leads to better results in comparison with others.

We study the problem of clustering sequences of unlabeled point sets taken from a common metric space. Such scenarios arise naturally in applications where a system or process is observed in distinct time intervals, such as biological surveys and contagious disease surveillance. In this more general setting existing algorithms for classical (i.e.~static) clustering problems are not applicable anymore. We propose a set of optimization problems which we collectively refer to as ‘temporal clustering’. The quality of a solution to a temporal clustering instance can be quantified using three parameters: the number of clusters , the spatial clustering cost , and the maximum cluster displacement between consecutive time steps. We consider spatial clustering costs which generalize the well-studied -center, discrete -median, and discrete -means objectives of classical clustering problems. We develop new algorithms that achieve trade-offs between the three objectives , , and . Our upper bounds are complemented by inapproximability results.

**Latent Mixture Modeling for Clustered Data**

This article proposes a mixture modeling approach to estimating cluster-wise conditional distributions in clustered (grouped) data. We adapt the mixture-of-experts model to the latent distributions, and propose a model in which each cluster-wise density is represented as a mixture of latent experts with cluster-wise mixing proportions distributed as Dirichlet distribution. The model parameters are estimated by maximizing the marginal likelihood function using a newly developed Monte Carlo Expectation-Maximization algorithm. We also extend the model such that the distribution of cluster-wise mixing proportions depends on some cluster-level covariates. The finite sample performance of the proposed model is compared with some existing mixture modeling approaches as well as linear mixed model through the simulation studies. The proposed model is also illustrated with the posted land price data in Japan.

**Knowledge Fusion via Embeddings from Text, Knowledge Graphs, and Images**

We present a baseline approach for cross-modal knowledge fusion. Different basic fusion methods are evaluated on existing embedding approaches to show the potential of joining knowledge about certain concepts across modalities in a fused concept representation.

**Dynamic Graph Convolutional Networks**

Many different classification tasks need to manage structured data, which are usually modeled as graphs. Moreover, these graphs can be dynamic, meaning that the vertices/edges of each graph may change during time. Our goal is to jointly exploit structured data and temporal information through the use of a neural network model. To the best of our knowledge, this task has not been addressed using these kind of architectures. For this reason, we propose two novel approaches, which combine Long Short-Term Memory networks and Graph Convolutional Networks to learn long short-term dependencies together with graph structure. The quality of our methods is confirmed by the promising results achieved.

• On Covering Monotonic Paths with Simple Random Walk

• A matrix generalization of a theorem of Fine

• Self-avoiding walks and connective constants

• A Note on the Concentration of Spectral Measure of Wigner’s Matrices

• The Complexity of Tree Partitioning

• On fast bounded locality sensitive hashing

• A Coalition Formation Algorithm for Multi-Robot Task Allocation in Large-Scale Natural Disasters

• The Disk is a Local Maximum in Hall’s Conjecture

• Piggybacking Codes for Network Coding: The High/Low SNR Regime

• Application of Econometric Data Analysis Methods to Physics Software

• Model Order Selection Rules For Covariance Structure Classification

• Deciding some Maltsev conditions in finite idempotent algebras

• Surprising Examples of Manifolds in Toric Topology!

• Global Stabilization of Triangular Systems with Time-Delayed Dynamic Input Perturbations

• HPatches: A benchmark and evaluation of handcrafted and learned local descriptors

• Guaranteed Fault Detection and Isolation for Switched Affine Models

• Semi-supervised classification for dynamic Android malware detection

• Unassisted Quantitative Evaluation Of Despeckling Filters

• Global Relation Embedding for Relation Extraction

• SLAM with Objects using a Nonparametric Pose Graph

• Monte Carlo Tree Search with Sampled Information Relaxation Dual Bounds

• SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours

• Call Attention to Rumors: Deep Attention Based Recurrent Neural Networks for Early Rumor Detection

• Cross-domain Semantic Parsing via Paraphrasing

• Accelerating microscopy: incorporating half-lies in imaging protocols

• Retrospective Higher-Order Markov Processes for User Trails

• Strategic Arrivals to Queues Offering Priority Service

• Thresholds For Detecting An Anomalous Path From Noisy Environments

• Subspace Designs based on Algebraic Function Fields

• Edge Connectivity, Packing Spanning Trees, and eigenvalues of Graphs

• An Expectation Maximization Algorithm for High-Dimensional Model Selection for the Ising Model with Misclassified States

• On the Success Probability of the Box-Constrained Rounding and Babai Detectors

• Non-Coherent Direction-of-Arrival Estimation Using Partly Calibrated Arrays

• Fast Generation for Convolutional Autoregressive Models

• The minimum Q-index of strongly connected bipartite digraphs with complete bipartite subdigraphs

• Fractional Moment Methods for Anderson Localization with SAW Representation

• BranchConnect: Large-Scale Visual Recognition with Learned Branch Connections

• Graph-based Joint Signal / Power Restoration for Energy Harvesting Wireless Sensor Networks

• Genetic Algorithm Based Floor Planning System

• PAFit: An R Package for Modeling and Estimating Preferential Attachment and Node Fitness in Temporal Complex Networks

• A Fuzzy Brute Force Matching Method for Binary Image Features

• Enhancing Person Re-identification in a Self-trained Subspace

• H-relative error estimation approach for multiplicative regression model with random effect

• Performance Limits of Stochastic Sub-Gradient Learning, Part II: Multi-Agent Case

• Predicting Cognitive Decline with Deep Learning of Brain Metabolism and Amyloid Imaging

• Edge fluctuations of limit shapes

• End-to-end representation learning for Correlation Filter based tracking

• On Level-1 Consensus Ensuring Stable Social Choice

• Understanding the Mechanisms of Deep Transfer Learning for Medical Images

• Improvement of PolSAR Decomposition Scattering Powers Using a Relative Decorrelation Measure

• Multidimensional random walk with reflections

• Critical Gaussian chaos: convergence and uniqueness in the derivative normalisation

• Multi-view Probability Linear Discrimination Analysis for Multi-view Vector Based Text Dependent Speaker Verification

• Every Untrue Label is Untrue in its Own Way: Controlling Error Type with the Log Bilinear Loss

• A note on MCMC for nested multilevel regression models via belief propagation

• End-to-End Unsupervised Deformable Image Registration with a Convolutional Neural Network

• Certification of Compact Low-Stretch Routing Schemes

• Quenched Central Limit Theorem for Random Walks in Doubly Stochastic Random Environment

• A Geometric Approach to Covariance Matrix Estimation and its Applications to Radar Problems

• Name Independent Fault Tolerant Routing Scheme

• How Bandwidth Affects the $CONGEST$ Model

• Independent transversal domination number of a graph

• The Dependent Doors Problem: An Investigation into Sequential Decisions without Feedback

• How close are time series to power tail Lévy diffusions?

• Neural End-to-End Learning for Computational Argumentation Mining

• First-Principles Prediction of Densities of Amorphous Materials: The Case of Amorphous Silicon

• Using Mise-En-Scène Visual Features based on MPEG-7 and Deep Learning for Movie Recommendation

• Exploratory and Confirmatory Factor Analyses of Religiosity. A Four-Factor Conceptual Model

• An Achievable Rate for an Optical Channel with Finite Memory

• BB_twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs

• Learning to Acquire Information

• Overpartitions Identities Involving Gaps and Weights

• Analysis of Newton-Raphson Consensus for multi-agent convex optimization under asynchronous and lossy communications

• Optimal Query Time for Encoding Range Majority

• Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data

• Positive Affirmation of Non-Algorithmic Information Processing

• Halfspace depths for scatter, concentration and shape matrices

• A Counterexample to the Vector Generalization of Costa’s EPI, and Partial Resolution

• Boolean quadric polytopes are faces of linear ordering polytopes

• Segmentation of the Proximal Femur from MR Images using Deep Convolutional Neural Networks

• Exploring epoch-dependent stochastic residual networks

• Spectral tail processes and max-stable approximations of multivariate regularly varying time series

• On the DoF of Parallel MISO BCs with Partial CSIT: Total Order and Separability

• Cell-Probe Lower Bounds from Online Communication Complexity

• Training object class detectors with click supervision

• Softmax GAN

• Intrusion Prevention and Detection in Grid Computing – The ALICE Case

• Improved Neural Relation Detection for Knowledge Base Question Answering

• An analytical Lieb-Sokal lemma

• Independence times for iid sequences, random walks and Lévy processes

• On conditional cuts for Stochastic Dual Dynamic Programming

• ADMM Penalty Parameter Selection by Residual Balancing

• Weighted regression and meta-analysis are computationally efficient ways to fit mixed models to electronic health records from the Clinical Practice Research Datalink

• On Singleton Arc Consistency for Natural CSPs Defined by Forbidden Patterns

• Reinforcement Learning with External Knowledge and Two-Stage Q-functions for Predicting Popular Reddit Threads

• Discrete configuration spaces of squares and hexagons

• Temporal Action Detection with Structured Segment Networks

• Large-sample approximations for variance-covariance matrices of high-dimensional time series

• ‘Wrong’ Side Interpolation by Low Degree Positive Real rational Functions

• On monotone circuits with local oracles and clique lower bounds

• Towards Large-Pose Face Frontalization in the Wild

• The Nu Class of Low-Degree-Truncated Rational Multifunctions. Ia. MINOS for IMSPE Evaluation and Optimal-IMSPE-Design Search

• Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency

• On the gonality, treewidth, and orientable genus of a graph

• Robust Wirtinger Flow for Phase Retrieval with Arbitrary Corruption