# Distilled News

A common scenario that data analysts in general encounter is what I like to describe as ‘data denialism’. Often, and especially while consulting, an analyst will find that the data tells a different story than what the customer holds to be true. It is also often the case that, when presenting this finding, the customer will outright deny the evidence, asserting that either the data or the analysis must be wrong. For example, it may be that a retailer focused on the low-end market is getting most of its sales from high-end customers, and such a fact upends months -maybe even years- of marketing planning and strategy. (This may, or may not, be based on one of my previous consulting experiences) It is of course part of the analyst’s job to present and discuss such controversial findings carefully and in a way that they can be understood an accepted, or tell a story that is compelling enough to be believable. Of course, too, some discussion about findings is definitely healthy and desirable. But even if the customer is convinced that the analyst did their job right, there’s still the matter of the data itself, for how can the customer be assured that the data is correct? After the myriad transformations, schema modifications, unifications and predictive tasks, how can even the analyst be sure that everything went right?
A lot of marketing research is aimed at uncovering why consumers do what they do and not just predicting what they’ll do next. Marketing scientist Kevin Gray asks Harvard Professor Tyler VanderWeele about causal analysis, arguably the next frontier in analytics.
RStudio is excited to announce the availability of RStudio Server Pro on the Google Cloud Platform.
I have been going through the deep learning literature for quite some time now. I have also participated in a few challenges to get my hands dirty. But what I enjoy the most is to apply deep learning in a real life problem. A real life problem which encompasses my daily life. This is partly why I picked up this problem of chair count recognition, to finally solve a problem which was unsolved till now! In this article, I will cover how I defined the problem. I will also mention what were the steps I took to solve the problem. Consider it as a raw uncut version of my experience as I tried to solve the problem ??
in this post I discuss clustering: techniques that form this method and some peculiarities of using clustering in practice. This post continues previous one about the OPERA.
When I start designing a map I consider: How do I want the viewer to read the information on my map? Do I want them to see how a measurement varies across a geographic area at a glance? Do I want to show the level of variability within a specific region? Or do I want to indicate busy pockets of activity or the relative volume/density within an area?
On this set of exercises, we are going to explore some of the probability functions in R with practical applications. Basic probability knowledge is required. Note: We are going to use random number functions and random process functions in R such as runif, a problem with these functions is that every time you run them you will obtain a different value. To make your results reproducible you can specify the value of the seed using set.seed(‘any number’) before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random numbers). For this set of exercises we will use set.seed(1), don’t forget to specify it before every random exercise.

# If you did not already know

Parallel Data Assimilation Framework (PDAF)
The Parallel Data Assimilation Framework – PDAF – is a software environment for ensemble data assimilation. PDAF simplifies the implementation of the data assimilation system with existing numerical models. With this, users can obtain a data assimilation system with less work and can focus on applying data assimilation. PDAF provides fully implemented and optimized data assimilation algorithms, in particular ensemble-based Kalman filters like LETKF and LSEIK. It allows users to easily test different assimilation algorithms and observations. PDAF is optimized for the application with large-scale models that usually run on big parallel computers and is applicable for operational applications. However, it is also well suited for smaller models and even toy models. PDAF provides a standardized interface that separates the numerical model from the assimilation routines. This allows to perform the further development of the assimilation methods and the model independently. New algorithmic developments can be readily made available through the interface such that they can be immediately applied with existing implementations. The test suite of PDAF provides small models for easy testing of algorithmic developments and for teaching data assimilation. PDAF is an open-source project. Its functionality will be further extended by input from research projects. In addition, users are welcome to contribute to the further enhancement of PDAF, e.g. by contributing additional assimilation methods or interface routines for different numerical models. …

Data Structure Graph
A Data Structure Graph is a group of atomic entities that are related to each other, stored in a repository, then moved from one persistence layer to another, rendered as a Graph. …

Statistical Distance
In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two statistical objects, which can be two random variables, or two probability distributions or samples, or the distance can be between an individual sample point and a population or a wider sample of points. A distance between populations can be interpreted as measuring the distance between two probability distributions and hence they are essentially measures of distances between probability measures. Where statistical distance measures relate to the differences between random variables, these may have statistical dependence, and hence these distances are not directly related to measures of distances between probability measures. Again, a measure of distance between random variables may relate to the extent of dependence between them, rather than to their individual values. Statistical distance measures are mostly not metrics and they need not be symmetric. Some types of distance measures are referred to as (statistical) divergences. …

# Whats new on arXiv

This paper contributes improvements on both the effectiveness and efficiency of Matrix Factorization (MF) methods for implicit feedback. We highlight two critical issues of existing works. First, due to the large space of unobserved feedback, most existing works resort to assign a uniform weight to the missing data to reduce computational complexity. However, such a uniform assumption is invalid in real-world settings. Second, most methods are also designed in an offline setting and fail to keep up with the dynamic nature of online data. We address the above two issues in learning MF models from implicit feedback. We first propose to weight the missing data based on item popularity, which is more effective and flexible than the uniform-weight assumption. However, such a non-uniform weighting poses efficiency challenge in learning the model. To address this, we specifically design a new learning algorithm based on the element-wise Alternating Least Squares (eALS) technique, for efficiently optimizing a MF model with variably-weighted missing data. We exploit this efficiency to then seamlessly devise an incremental update strategy that instantly refreshes a MF model given new feedback. Through comprehensive experiments on two public datasets in both offline and online protocols, we show that our eALS method consistently outperforms state-of-the-art implicit MF methods. Our implementation is available at https://…/sigir16-eals.
Principal component analysis continues to be a powerful tool in dimension reduction of high dimensional data. We assume a variance-diverging model and use the high-dimension, low-sample-size asymptotics to show that even though the principal component directions are not consistent, the sample and prediction principal component scores can be useful in revealing the population structure. We further show that these scores are biased, and the bias is asymptotically decomposed into rotation and scaling parts. We propose methods of bias-adjustment that are shown to be consistent and work well in the finite but high dimensional situations with small sample sizes. The potential advantage of bias-adjustment is demonstrated in a classification setting.
Many predictive tasks of web applications need to model categorical variables, such as user IDs and demographics like genders and occupations. To apply standard machine learning techniques, these categorical predictors are always converted to a set of binary features via one-hot encoding, making the resultant feature vector highly sparse. To learn from such sparse data effectively, it is crucial to account for the interactions between features. Factorization Machines (FMs) are a popular solution for efficiently using the second-order feature interactions. However, FM models feature interactions in a linear way, which can be insufficient for capturing the non-linear and complex inherent structure of real-world data. While deep neural networks have recently been applied to learn non-linear feature interactions in industry, such as the Wide&Deep by Google and DeepCross by Microsoft, the deep structure meanwhile makes them difficult to train. In this paper, we propose a novel model Neural Factorization Machine (NFM) for prediction under sparse settings. NFM seamlessly combines the linearity of FM in modelling second-order feature interactions and the non-linearity of neural network in modelling higher-order feature interactions. Conceptually, NFM is more expressive than FM since FM can be seen as a special case of NFM without hidden layers. Empirical results on two regression tasks show that with one hidden layer only, NFM significantly outperforms FM with a 7.3% relative improvement. Compared to the recent deep learning methods Wide&Deep and DeepCross, our NFM uses a shallower structure but offers better performance, being much easier to train and tune in practice.
In recent years, deep neural network exhibits its powerful superiority on information discrimination in many computer vision applications. However, the capacity of deep neural network architecture is still a mystery to the researchers. Intuitively, larger capacity of neural network can always deposit more information to improve the discrimination ability of the model. But, the learnable parameter scale is not feasible to estimate the capacity of deep neural network. Due to the overfitting, directly increasing hidden nodes number and hidden layer number are already demonstrated not necessary to effectively increase the network discrimination ability. In this paper, we propose a novel measurement, named ‘total valid bits’, to evaluate the capacity of deep neural networks for exploring how to quantitatively understand the deep learning and the insights behind its super performance. Specifically, our scheme to retrieve the total valid bits incorporates the skilled techniques in both training phase and inference phase. In the network training, we design decimal weight regularization and 8-bit forward quantization to obtain the integer-oriented network representations. Moreover, we develop adaptive-bitwidth and non-uniform quantization strategy in the inference phase to find the neural network capacity, total valid bits. By allowing zero bitwidth, our adaptive-bitwidth quantization can execute the model reduction and valid bits finding simultaneously. In our extensive experiments, we first demonstrate that our total valid bits is a good indicator of neural network capacity. We also analyze the impact on network capacity from the network architecture and advanced training skills, such as dropout and batch normalization.
In recent years, deep neural networks have yielded immense success on speech recognition, computer vision and natural language processing. However, the exploration of deep neural networks on recommender systems has received relatively less scrutiny. In this work, we strive to develop techniques based on neural networks to tackle the key problem in recommendation — collaborative filtering — on the basis of implicit feedback. Although some recent work has employed deep learning for recommendation, they primarily used it to model auxiliary information, such as textual descriptions of items and acoustic features of musics. When it comes to model the key factor in collaborative filtering — the interaction between user and item features, they still resorted to matrix factorization and applied an inner product on the latent features of users and items. By replacing the inner product with a neural architecture that can learn an arbitrary function from data, we present a general framework named NCF, short for Neural network-based Collaborative Filtering. NCF is generic and can express and generalize matrix factorization under its framework. To supercharge NCF modelling with non-linearities, we propose to leverage a multi-layer perceptron to learn the user-item interaction function. Extensive experiments on two real-world datasets show significant improvements of our proposed NCF framework over the state-of-the-art methods. Empirical evidence shows that using deeper layers of neural networks offers better recommendation performance.
A distributed semantic graph processing system that provides locality control, indexing, graph query, and parallel processing capabilities is presented.
Natural language processing (NLP) has recently gained much attention for representing and analysing human language computationally. It has spread its applications in various fields such as machine translation, email spam detection, information extraction, summarization, medical, and question answering etc. The paper distinguishes four phases by discussing different levels of NLP and components of Natural Language Generation (NLG) followed by presenting the history and evolution of NLP, state of the art presenting the various applications of NLP and current trends and challenges.
Data streams typically have items of large number of dimensions. We study the fundamental heavy-hitters problem in this setting. Formally, the data stream consists of $d$-dimensional items $x_1,\ldots,x_m \in [n]^d$. A $k$-dimensional subcube $T$ is a subset of distinct coordinates $\{ T_1,\cdots,T_k \} \subseteq [d]$. A subcube heavy hitter query ${\rm Query}(T,v)$, $v \in [n]^k$, outputs YES if $f_T(v) \geq \gamma$ and NO if $f_T(v) < \gamma/4$, where $f_T$ is the ratio of number of stream items whose coordinates $T$ have joint values $v$. The all subcube heavy hitters query ${\rm AllQuery}(T)$ outputs all joint values $v$ that return YES to ${\rm Query}(T,v)$. The one dimensional version of this problem where $d=1$ was heavily studied in data stream theory, databases, networking and signal processing. The subcube heavy hitters problem is applicable in all these cases. We present a simple reservoir sampling based one-pass streaming algorithm to solve the subcube heavy hitters problem in $\tilde{O}(kd/\gamma)$ space. This is optimal up to poly-logarithmic factors given the established lower bound. In the worst case, this is $\Theta(d^2/\gamma)$ which is prohibitive for large $d$, and our goal is to circumvent this quadratic bottleneck. Our main contribution is a model-based approach to the subcube heavy hitters problem. In particular, we assume that the dimensions are related to each other via the Naive Bayes model, with or without a latent dimension. Under this assumption, we present a new two-pass, $\tilde{O}(d/\gamma)$-space algorithm for our problem, and a fast algorithm for answering ${\rm AllQuery}(T)$ in $O(k/\gamma^2)$ time. Our work develops the direction of model-based data stream analysis, with much that remains to be explored.
We investigate statistical properties of a clustering algorithm that receives level set estimates from a kernel density estimator and then estimates the first split in the density level cluster tree if such a split is present or detects the absence of such a split. Key aspects of our analysis include finite sample guarantees, consistency, rates of convergence, and an adaptive data-driven strategy for chosing the kernel bandwidth. For the rates and the adaptivity we do not need continuity assumptions on the density such as H\’older continuity, but only require intuitive geometric assumptions of non-parametric nature.

# Document worth reading: “Temporal anomaly detection: calibrating the surprise”

We propose a hybrid approach to temporal anomaly detection in user-database access data — or more generally, any kind of subject-object co-occurrence data. Our methodology allows identifying anomalies based on a single stationary model, instead of requiring a full temporal one, which would be prohibitive in our setting. We learn our low-rank stationary model from the high-dimensional training data, and then fit a regression model for predicting the expected likelihood score of normal access patterns in the future. The disparity between the predicted and the observed likelihood scores is used to assess the ‘surprise’. This approach enables calibration of the anomaly score so that time-varying normal behavior patterns are not considered anomalous. We provide a detailed description of the algorithm, including a convergence analysis, and report encouraging empirical results. One of the datasets we tested is new for the public domain. It consists of two months’ worth of database access records from a live system. This dataset will be made publicly available, and is provided in the supplementary material. Temporal anomaly detection: calibrating the surprise

# R Packages worth a look

Graph Matching Library for R (RGraphM)
This is a wrapper package for the graph matching library ‘graphm’. The original ‘graphm’ C/C++ library can be found in <http://…/> . Latest version ( 0.52 ) of this library is slightly modified to fit ‘Rcpp’ usage and included in the source package. The development version of the package is also available at <https://…/RGraphM> .

Doubly-Robust Nonparametric Estimation and Inference (drtmle)
Targeted minimum loss-based estimators of counterfactual means and causal effects that are doubly-robust with respect both to consistency and asymptotic normality (van der Laan (2014), <doi:10.1515/ijb-2012-0038>).

Sparse and Regularized Discriminant Analysis (sparsediscrim)
A collection of sparse and regularized discriminant analysis methods intended for small-sample, high-dimensional data sets. The package features the High-Dimensional Regularized Discriminant Analysis classifier.

# Book Memo: “Neural Network Driven Artificial Intelligence”

 Decision Making Based on Fuzzy Logic With today’s growing and overloading volume of information, it is becoming tremendously difficult to analyze the huge amounts of data that contain this information. It makes it very strenuous and inconvenient to introduce an appropriate methodology of decision-making fast enough to the point that it can be considered as real-time. The demand for real-time processing information and related data both structured and unstructured is on the rise and consequently makes it harder and harder to implement correct decision making at the enterprise level to keep the organization robust and resilient against either manmade threats or natural disasters. Neural networking and fuzzy systems combined show how Artificial Intelligence (AI) can be driven by these combinations as a trainable system that is more dynamic than static when it comes to machine and deep learning language to deal with both adversary and friendly events in real-time. Dynamic systems of AI that are built around such an innovative approach allows the robots of the future to be more adaptive with mechanisms such as principle adoption, self-organization, and the convergence of global stability from the viewpoint of business and intelligence security needed in today’s cyber world. To deal with uncertainty, vagueness, and imprecision, Lofti A. Zadeh introduced fuzzy sets and fuzzy logic. In the present book, fuzzy classification is applied to extend portfolio analysis, scoring methods, customer segmentation and performance measurement, and thus improves managerial decisions. As an integral part of the book, case studies show how, fuzzy classification – with its query facilities – can extend customer equity, enable mass customization, and refine marketing campaigns

# Distilled News

In the last few blog posts of this series, we discussed simple linear regression model. We discussed multivariate regression model and methods for selecting the right model. Fernando has now created a better model.
Microsoft Cognitive Services provides several APIs for image recognition, but if you want to build your own recognizer (or create one that works offline), you can use the new Image Featurizer capabilities of Microsoft R Server. The process of training an image recognition system requires LOTS of images — millions and millions of them. The process involves feeding those images into a deep neural network, and during that process the network generates ‘features’ from the image. These features might be versions of the image including just the outlines, or maybe the image with only the green parts. You could further boil those features down into a single number, say the length of the outline or the percentage of the image that is green. With enough of these ‘features’, you could use them in a traditional machine learning model to classify the images, or perform other recognition tasks.
Data wrangling, is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process which is estimated to take about 60-80% of analyst’s time. In this series we will go through this process. It will be a brief series with goal to craft the reader’s skills on the data wrangling task. This is the fourth part of the series and it aims to cover the cleaning of data used. At previous parts we learned how to import, reshape and transform data. The rest of the series will be dedicated to the data cleansing process. On this post we will go through the regular expressions, a sequence of characters that define a search pattern, mainly for use in pattern matching with text strings.In particular, we will cover the foundations of regular expression syntax.
Aim In this post, we will give an intuition on why model validation as approximating generalization error of a model fit and detection of overfitting can not be resolved simultaneously on a single model. We will work on a concrete example workflow in understanding overfitting, overtraining and a typical final model building stage after some conceptual introduction. We will avoid giving a reference to the Bayesian interpretations and regularisation and restrict the post to regression and cross-validation. While regularisation has different ramification due to its mathematical properties and prior distributions have different implications in Bayesian statistics. We assume an introductory background in machine learning, so this is not a beginners tutorial.
Shiny 1.0.4 is now available on CRAN. For most Shiny users, the most exciting news is that file inputs now support dragging and dropping. It is now possible to add and remove tabs from a tabPanel, with the new functions insertTab(), appendTab(), prependTab(), and removeTab(). It is also possible to hide and show tabs with hideTab() and showTab(). Shiny also has a new a function, onStop(), which registers a callback function that will execute when the application exits. (Note that this is different from the existing onSessionEnded(), which registers a callback that executes when a user’s session ends. An application can serve multiple sessions.) This can be useful for cleaning up resources when an application exits, such as database connections. This release of Shiny also has many minor new features and bug fixes. For a the full set of changes, see the changelog.
Learning rate is the rate at which the accumulation of information in a neural network progresses over time. The learning rate determines how quickly (and whether at all) the network reaches the optimum, most conducive location in the network for the specific output desired. In plain Stochastic Gradient Descent (SGD), the learning rate is not related to the shape of the error gradient because a global learning rate is used, which is independent of the error gradient. However, there are many modifications that can be made to the original SGD update rule that relates the learning rate to the magnitude and orientation of the error gradient.
Take a visit to most malls today and you’ll be a witness to an industry under siege. Retailers with physical stores have been struggling to compete with online competition as customers equipped with mobile phones check prices, product reviews, and do other research to help their shopping efforts. Customers are moving at top speed. Physical stores have a tough time keeping up.
Generative adversarial networks (GANs) are a class of neural networks that are used in unsupervised machine learning. They help to solve such tasks as image generation from descriptions, getting high resolution images from low resolution ones, predicting which drug could treat a certain disease, retrieving images that contain a given pattern, etc. The Statsbot team asked a data scientist, Anton Karazeev, to make the introduction to GANs engine and their applications in everyday life.
Artificial Intelligence (A.I.) will soon be at the heart of every major technological system in the world including: cyber and homeland security, payments, financial markets, biotech, healthcare, marketing, natural language processing, computer vision, electrical grids, nuclear power plants, air traffic control, and Internet of Things (IoT). While A.I. seems to have only recently captured the attention of humanity, the reality is that A.I. has been around for over 60 years as a technological discipline. In the late 1950’s, Arthur Samuel wrote a checkers playing program that could learn from its mistakes and thus, over time, became better at playing the game. MYCIN, the first rule-based expert system, was developed in the early 1970’s and was capable of diagnosing blood infections based on the results of various medical tests. The MYCIN system was able to perform better than non-specialist doctors. While Artificial Intelligence is becoming a major staple of technology, few people understand the benefits and shortcomings of A.I. and Machine Learning technologies.
The first winter occurred in the 1970s, followed by another one in 1980s for some reason or the other, but majorly due to less resources. I agree that there have been many major breakthroughs but here’s my attempt to illustrate the timeline of major events…
The recent but noticeable shift from CPUs to GPUs is mainly due to the unique benefits they bring to sectors like AdTech, finance, telco, retail, or security/IT . We examine where GPU databases shine.

# If you did not already know

Boosting
Boosting is a machine learning meta-algorithm for reducing bias in supervised learning. Boosting is based on the question posed by Kearns: Can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
Schapire’s affirmative answer to Kearns’ question has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting.

Geometric Mean Metric Learning
We revisit the task of learning a Euclidean metric from data. We approach this problem from first principles and formulate it as a surprisingly simple optimization problem. Indeed, our formulation even admits a closed form solution. This solution possesses several very attractive properties: (i) an innate geometric appeal through the Riemannian geometry of positive definite matrices; (ii) ease of interpretability; and (iii) computational speed several orders of magnitude faster than the widely used LMNN and ITML methods. Furthermore, on standard benchmark datasets, our closed-form solution consistently attains higher classification accuracy. …

Docker
Build, Ship and RunAny App, Anywhere. Docker – An open platform for distributed applications for developers and sysadmins.
Docker is a relatively new open source application and service, which is seeing interest across a number of areas. It uses recent Linux kernel features (containers, namespaces) to shield processes. While its use (superficially) resembles that of virtual machines, it is much more lightweight as it operates at the level of a single process (rather than an emulation of an entire OS layer). This also allows it to start almost instantly, require very little resources and hence permits an order of magnitude more deployments per host than a virtual machine.
Docker offers a standard interface to creation, distribution and deployment. The shipping container analogy is apt: just how shipping containers (via their standard size and “interface”) allow global trade to prosper, Docker is aiming for nothing less for deployment. A Dockerfile provides a concise, extensible, and executable description of the computational environment. Docker software then builds a Docker image from the Dockerfile. Docker images are analogous to virtual machine images, but smaller and built in discrete, extensible and reuseable layers. Images can be distributed and run on any machine that has Docker software installed—including Windows, OS X and of course Linux. Running instances are called Docker containers. A single machine can run hundreds of such containers, including multiple containers running the same image.

# Magister Dixit

“There’s been a lot of talk about trying to make AI work on existing infrastructure. But the sad reality is that you’re always going to end up with something that’s far less than state-of-the-art. And I don’t mean it will be 30 or 40 percent slower. It’s more likely to be a thousand times slower” Rao Naveen ( 2017 )