If you did not already know

Zeros Ones Inflated Proportional google
The ZOIP distribution (Zeros and Ones Inflated Proportional) is a distribution for proportional data inflated with zeros and/or ones. It is defined on top of the best-known distributions for proportional data, the beta and the simplex distribution, Jørgensen and Barndorff-Nielsen (1991) <doi:10.1016/0047-259X(91)90008-P>, and also admits different parameterizations of the beta distribution, Ferrari and Cribari-Neto (2004) <doi:10.1080/0266476042000214501>, Rigby and Stasinopoulos (2005) <doi:10.18637/jss.v023.i07>. The ZOIP distribution has four parameters: two correspond to the proportions of zeros and ones, and the other two to the proportional-data distribution of your choice. The ‘ZOIP’ package fits fixed- and mixed-effects regression models for proportional data inflated with zeros and/or ones. …
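The mixture structure just described is easy to state directly: point masses at 0 and 1, plus a continuous component on (0, 1). Below is a minimal sketch with a beta component (the function and parameter names are mine for illustration, not the ‘ZOIP’ package’s API):

```python
from scipy.stats import beta

def zoib_density(y, p0, p1, a, b):
    """Density of a zero-and-one-inflated beta distribution (illustrative).

    p0, p1 : probabilities of observing exactly 0 and exactly 1
    a, b   : shape parameters of the beta component on (0, 1)
    """
    if y == 0:
        return p0
    if y == 1:
        return p1
    # Remaining mass (1 - p0 - p1) is spread over (0, 1) by the beta density.
    return (1 - p0 - p1) * beta.pdf(y, a, b)

# The two point masses and the continuous part together integrate to one:
# p0 + p1 + (1 - p0 - p1) * 1 = 1
print(zoib_density(0, 0.1, 0.05, 2, 5))    # mass at zero: 0.1
```

Swapping the beta density for a simplex density gives the other parameterization the package supports.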

Fuzzy Supervised Learning with Binary Meta-Feature (FSL-BM) google
This paper introduces a novel real-time Fuzzy Supervised Learning with Binary Meta-Feature (FSL-BM) method for big-data classification. The study of real-time algorithms addresses several major concerns, namely accuracy, memory consumption, the ability to relax assumptions, and time complexity. Attaining a fast computational model that provides fuzzy logic and supervised learning is one of the main challenges in machine learning. In this paper, we present the FSL-BM algorithm as an efficient solution for supervised learning with fuzzy-logic processing, using a binary meta-feature representation together with the Hamming distance and a hash function to relax assumptions. While many studies over the last decade have focused on reducing time complexity and increasing accuracy, the novel contribution of the proposed solution is the integration of the Hamming distance, a hash function, binary meta-features, and binary classification into a real-time supervised method. The Hash Table (HT) component gives fast access to existing indices and therefore generation of new indices in constant time, which supersedes existing fuzzy supervised algorithms with better or comparable results. To summarize, the main contribution of this technique for real-time fuzzy supervised learning is to represent hypotheses through binary input as a meta-feature space and to create a fuzzy supervised hash table to train and validate the model. …
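The core mechanics (constant-time hash lookups with a Hamming-distance fallback for near-miss binary codes) can be sketched as follows; the class and method names here are hypothetical, not taken from the paper:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two equal-length binary codes packed as ints."""
    return bin(a ^ b).count("1")

class FuzzyHashIndex:
    """Toy index: exact hash hits in O(1), fuzzy fallback by Hamming distance."""

    def __init__(self):
        self.table = {}          # code -> label; dict lookup is O(1) on average

    def add(self, code: int, label: str) -> None:
        self.table[code] = label

    def fuzzy_lookup(self, code: int, max_dist: int = 1):
        """Exact hit first; otherwise the nearest stored code, if close enough."""
        if code in self.table:
            return self.table[code]
        best = min(self.table, key=lambda c: hamming(c, code))
        return self.table[best] if hamming(best, code) <= max_dist else None

idx = FuzzyHashIndex()
idx.add(0b1011, "spam")
idx.add(0b0100, "ham")
print(idx.fuzzy_lookup(0b1010))   # one bit away from 0b1011 -> "spam"
```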

Norm google
In linear algebra, functional analysis and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space – save possibly for the zero vector, which is assigned a length of zero. A seminorm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero vector). A norm must also satisfy certain properties pertaining to scalability and additivity which are given in the formal definition below. A simple example is the 2-dimensional Euclidean space R2 equipped with the Euclidean norm. Elements in this vector space (e.g., (3, 7)) are usually drawn as arrows in a 2-dimensional Cartesian coordinate system starting at the origin (0, 0). The Euclidean norm assigns to each vector the length of its arrow. Because of this, the Euclidean norm is often known as the magnitude. A vector space on which a norm is defined is called a normed vector space. Similarly, a vector space with a seminorm is called a seminormed vector space. It is often possible to supply a norm for a given vector space in more than one way. …
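For instance, the Euclidean norm of the vector (3, 7) mentioned above is the length of its arrow from the origin:

```python
import math

def euclidean_norm(v):
    """The 2-norm: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

# The vector (3, 7) from the text has length sqrt(3^2 + 7^2) = sqrt(58).
print(euclidean_norm((3, 7)))

# The zero vector is the only vector a (true) norm maps to zero.
print(euclidean_norm((0, 0)))   # 0.0
```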


R Packages worth a look

Query Search Interfaces (searcher)
Provides a search interface to look up terms on ‘Google’, ‘Bing’, ‘DuckDuckGo’, ‘StackOverflow’, ‘GitHub’, and ‘BitBucket’. Upon searching, a browser window will open with the aforementioned search results.

Composite Likelihood Estimation for Spatial Data (clespr)
A composite likelihood approach is implemented to estimate statistical models for spatial ordinal and proportional data, based on Feng et al. (2014) <doi:10.1002/env.2306>. Parameter estimates are obtained by maximizing composite log-likelihood functions using the limited-memory BFGS optimization algorithm with bounding constraints, while standard errors are obtained by estimating the Godambe information matrix.
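The estimation strategy described (maximize a log-likelihood with L-BFGS-B under box constraints) can be illustrated on a toy, non-spatial normal likelihood; this is only the optimizer set-up, not the clespr model itself:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data from a normal distribution with known true parameters.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_loglik(theta):
    """Negative normal log-likelihood; minimized = likelihood maximized."""
    mu, sigma = theta
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (data - mu)**2 / sigma**2)

# L-BFGS-B supports bounds: sigma is constrained to stay strictly positive.
res = minimize(neg_loglik, x0=[0.0, 1.0], method="L-BFGS-B",
               bounds=[(None, None), (1e-6, None)])
print(res.x)   # close to the true (2.0, 1.5)
```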

Adjusted Prediction Model Performance Estimation (APPEstimation)
Calculates predictive model performance measures adjusted for predictor distributions using the density ratio method (Sugiyama et al., 2012, ISBN:9781139035613). L1 and L2 errors for continuous outcomes and the C-statistic for binomial outcomes are computed.

‘Rcpp’ Bindings for the ‘Corpus Workbench’ (‘CWB’) (RcppCWB)
‘Rcpp’ bindings for the C code of the ‘Corpus Workbench’ (‘CWB’), an indexing and query engine to efficiently analyze large corpora. ‘RcppCWB’ is licensed under the GNU GPL-3, in line with the GPL-3 license of the ‘CWB’ (<https://…/GPL-3>). The ‘CWB’ relies on ‘pcre’ (BSD license, see <https://…/licence.txt>) and ‘GLib’ (LGPL license, see <https://…/lgpl-3.0.en.html>). See the file LICENSE.note for further information.

An Integrated Regression Model for Normalizing ‘NanoString nCounter’ Data (RCRnorm)
‘NanoString nCounter’ is a medium-throughput platform that measures gene or microRNA expression levels. The platform is introduced in Malkov (2009) <doi:10.1186/1756-0500-2-80>, and detailed information can be found on its webpage <https://…/ncounter-technology>. It has important clinical applications, such as cancer diagnosis and prognosis. The package implements an integrated random-coefficient hierarchical regression model to normalize data from the ‘NanoString nCounter’ platform so that noise from various sources can be removed.

Book Memo: “Classification in BioApps”

Automation of Decision Making
This book on classification in biomedical image applications presents original and valuable research work on advances in this field, which covers the taxonomy of both supervised and unsupervised models, standards, algorithms, applications and challenges. Further, the book highlights recent scientific research on artificial neural networks in biomedical applications, addressing the fundamentals of artificial neural networks, support vector machines and other advanced classifiers, as well as their design and optimization. In addition to exploring recent endeavours in the multidisciplinary domain of sensors, the book introduces readers to basic definitions and features, signal filters and processing, biomedical sensors and automation of biomeasurement systems. The target audience includes researchers and students at engineering and medical schools, researchers and engineers in the biomedical industry, medical doctors and healthcare professionals.

Document worth reading: “Neural Networks for Information Retrieval”

Machine learning plays a role in many aspects of modern IR systems, and deep learning is applied in all of them. The fast pace of modern-day research has given rise to many different approaches for many different IR problems. The amount of information available can be overwhelming both for junior students and for experienced researchers looking for new research topics and directions. Additionally, it is interesting to see what key insights into IR problems the new technologies are able to give us. The aim of this full-day tutorial is to give a clear overview of current tried-and-trusted neural methods in IR and how they benefit IR research. It covers key architectures, as well as the most promising future directions. Neural Networks for Information Retrieval

Distilled News

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective

Machine learning sits at the core of many essential products and services at Facebook. This paper describes the hardware and software infrastructure that supports machine learning at global scale. Facebook’s machine learning workloads are extremely diverse: services require many different types of models in practice. This diversity has implications at all layers in the system stack. In addition, a sizable fraction of all data stored at Facebook flows through machine learning pipelines, presenting significant challenges in delivering data to high-performance distributed training flows. Computational requirements are also intense, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference. Addressing these and other emerging challenges continues to require diverse efforts that span machine learning algorithms, software, and hardware design.

Amazing New AI Innovations Unveiled at CES 2018 in Las Vegas

• The future of Healthcare
• L’Oreal’s Thumbnail-sized Sensor
• Cocoon Cam Clarity
• Rinseed Snap
• Toyota’s e-Palette Concept Car
• Google Assistant is taking on Amazon’s Alexa, in a BIG way
• YouTube’s Recommendations Keep Getting Better

A Simple Introduction to ANOVA (with applications in Excel)

Buying a new product or testing a new technique but not sure how it stacks up against the alternatives? It’s an all too familiar situation for most of us. Most of the options sound similar to each other so picking the best out of the lot is a challenge. Consider a scenario where we have three medical treatments to apply on patients with similar diseases. Once we have the test results, one approach is to assume that the treatment which took the least time to cure the patients is the best among them. What if some of these patients had already been partially cured, or if any other medication was already working on them? In order to make a confident and reliable decision, we will need evidence to support our approach. This is where the concept of ANOVA comes into play. In this article, I’ll introduce you to the different ANOVA techniques used for making the best decisions. We’ll take a few cases and try to understand the techniques for getting the results. We will also be leveraging the use of Excel to understand these concepts. You must know the basics of statistics to understand this topic. Knowledge of t-tests and Hypothesis testing would be an additional benefit.
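The article works the examples in Excel; as a preview, the medical-treatment scenario above reduces to a one-way ANOVA, which can be sketched in a few lines of Python (the recovery times below are made up purely for illustration):

```python
from scipy.stats import f_oneway

# Recovery times (in days) under three hypothetical treatments.
treatment_a = [12, 15, 14, 16, 13]
treatment_b = [10, 9, 11, 10, 12]
treatment_c = [14, 15, 16, 15, 14]

# One-way ANOVA: does at least one treatment's mean recovery time differ?
f_stat, p_value = f_oneway(treatment_a, treatment_b, treatment_c)
print(f_stat, p_value)
# A small p-value is evidence against "all three treatments are equivalent".
```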

Putting AI-enhanced analytics at the heart of retail customer experience

Last Sunday, my husband and I went to visit our daughter. As we drove, my cell informed me that we were 30 minutes from our destination. How did it know? I hadn’t told it where we were going – there wasn’t an appointment on my calendar. The cell had worked out this was a trip we regularly take on a Sunday and was able to provide us with useful information based on that knowledge. This is just an everyday example of how quickly Artificial Intelligence (AI) is becoming a normal part of our lives. It’s something that’s beginning to shape retail customer experience. In this blog, I want to look at how AI and analytics together can deliver the highly targeted and personalized experience that customers demand. The holiday season has just passed and, if you’re like me, you’ll be giving thanks to Amazon (other online shopping services are available!). Going online is quick and convenient. Personally, I like shopping in the mall but our busy lives often make this practically impossible. What’s more, the personalization and recommendations engines of services such as Amazon are now so sophisticated that it really does feel that I’m receiving an individual service that understands my wants and preferences. This level of personal service is something that every retailer must aspire to.

A survey of incremental high-utility itemset mining

Traditional association rule mining has been widely studied. But it is unsuitable for real-world applications where factors such as unit profits of items and purchase quantities must be considered. High-utility itemset mining (HUIM) is designed to find highly profitable patterns by considering both the purchase quantities and unit profits of items. However, most HUIM algorithms are designed to be applied to static databases. But in real-world applications such as market basket analysis and business decision-making, databases are often dynamically updated by inserting new data such as customer transactions. Several researchers have proposed algorithms to discover high-utility itemsets (HUIs) in dynamically updated databases. Unlike batch algorithms, which always process a database from scratch, incremental high-utility itemset mining (iHUIM) algorithms incrementally update and output HUIs, thus reducing the cost of discovering HUIs. This paper provides an up-to-date survey of the state-of-the-art iHUIM algorithms, including Apriori-based, tree-based, and utility-list-based approaches. To the best of our knowledge, this is the first survey on the mining task of incremental high-utility itemset mining. The paper also identifies several important issues and research challenges for iHUIM.
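The notion of "utility" being mined can be made concrete. The sketch below uses illustrative items, profits, and quantities, not data from any cited paper:

```python
# Utility of an itemset in a transaction = sum over its items of
# (purchase quantity x unit profit), as described above.

unit_profit = {"bread": 1, "cheese": 4, "wine": 9}

# Each transaction maps item -> purchase quantity.
transactions = [
    {"bread": 2, "cheese": 1},
    {"bread": 1, "wine": 2, "cheese": 2},
    {"wine": 1},
]

def utility(itemset, txn):
    """Utility of `itemset` in one transaction; 0 if any item is absent."""
    if not set(itemset) <= txn.keys():
        return 0
    return sum(txn[i] * unit_profit[i] for i in itemset)

def total_utility(itemset):
    """Utility of an itemset across the whole database."""
    return sum(utility(itemset, t) for t in transactions)

print(total_utility(("cheese", "wine")))   # only txn 2 has both: 2*4 + 2*9 = 26
```

An incremental algorithm would update these totals as new transactions arrive, instead of rescanning the database from scratch.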

Design Patterns for Deep Learning Architectures?

Deep Learning Architecture can be described as a new method or style of building machine learning systems. Deep learning is more than likely to lead to more advanced forms of artificial intelligence. The evidence for this is in the sheer number of breakthroughs that have occurred since the beginning of this decade. There is a newfound optimism in the air, and we are now again in a new AI spring. Unfortunately, the current state of deep learning appears in many ways to be akin to alchemy. Everybody seems to have their own black-magic methods of designing architectures. The field thus needs to move forward and strive towards chemistry, or perhaps even a periodic table for deep learning. Although deep learning is still in its infancy, this book strives towards some kind of unification of the ideas in deep learning. It leverages a method of description called pattern languages. Pattern languages are languages derived from entities called patterns that, when combined, form solutions to complex problems. Each pattern describes a problem and offers alternative solutions. Pattern languages are a way of expressing complex solutions that were derived from experience. The benefit of an improved language of expression is that other practitioners are able to gain a much better understanding of the complex subject as well as a better way of expressing a solution to problems.

The Bayesian Approach to Sample Size Calculations

During a clinical trial, we want to make inferences about the value of some endpoint of interest, which in this article we will call θ. In order for these inferences to be meaningful, we need to make sure that we study enough subjects so that the estimate of the effect size is sufficiently precise. On the other hand, we do not want too many subjects, because it would be unethical to expose subjects to the possibly harmful effects of the treatment or to the risk of not receiving the standard of care.

Is Learning Rate Useful in Artificial Neural Networks?

This article will help you understand why we need the learning rate and whether or not it is useful for training an artificial neural network. Using a very simple Python implementation of a single-layer perceptron, the learning rate value will be varied to illustrate its effect. An obstacle for newcomers to artificial neural networks is the learning rate. I have been asked many times about the effect of the learning rate on the training of artificial neural networks (ANNs). Why do we use a learning rate? What is the best value for it? In this article, I will try to make things simpler by providing an example that shows how the learning rate is useful for training an ANN. I will start by explaining our example with Python code before working with the learning rate.
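As a preview of the idea, here is a minimal single-layer perceptron (my own toy example, not the article's code) in which `lr` scales how far each mistake moves the weights; try 0.01 versus 1.0 to see how the step size changes training:

```python
# Single-layer perceptron on a toy linearly separable problem (logical AND).

def train_perceptron(data, lr, epochs=50):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, target in data:
            pred = 1 if w[0]*x[0] + w[1]*x[1] + b > 0 else 0
            err = target - pred
            # The learning rate scales the size of each weight update.
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data, lr=0.1)
print([(1 if w[0]*x[0] + w[1]*x[1] + b > 0 else 0) for x, _ in and_data])
# [0, 0, 0, 1] once the perceptron has converged
```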

Generalized additive models with principal component analysis: an application to time series of respiratory disease and air pollution data

Environmental epidemiological studies of the health effects of air pollution frequently utilize the generalized additive model (GAM) as the standard statistical methodology, considering the ambient air pollutants as explanatory covariates. Although exposure to air pollutants is multi-dimensional, the majority of these studies consider only a single pollutant as a covariate in the GAM model. This model restriction may be because the pollutant variables have not only serial dependence but also interdependence between themselves. In an attempt to convey a more realistic model, we propose here the hybrid generalized additive model-principal component analysis-vector auto-regressive (GAM-PCA-VAR) model, which is a combination of PCA and GAMs along with a VAR process. The PCA is used to eliminate the multicollinearity between the pollutants, whereas the VAR model is used to handle the serial correlation of the data to produce white noise processes as covariates in the GAM. Some theoretical and simulation results of the methodology proposed are discussed, with special attention to the effect of time correlation of the covariates on the PCA and, consequently, on the estimates of the parameters in the GAM and on the relative risk, which is a commonly used statistical quantity to measure the effect of the covariates, especially the pollutants, on population health. As the main motivation for the methodology, a real data set is analysed with the aim of quantifying the association between respiratory disease and air pollution concentrations, especially particulate matter PM10, sulphur dioxide, nitrogen dioxide, carbon monoxide and ozone. The empirical results show that the GAM-PCA-VAR model can remove the auto-correlations from the principal components. In addition, this method produces estimates of the relative risk, for each pollutant, which are not affected by the serial correlation in the data. This, in general, leads to more pronounced values of the estimated risk compared with the standard GAM model, indicating, for this study, an increase of almost 5.4% in the risk associated with PM10, one of the most important pollutants, which is usually associated with adverse effects on human health.

Multiclass vector auto-regressive models for multistore sales data

Retailers use the vector auto-regressive (VAR) model as a standard tool to estimate the effects of prices, promotions and sales in one product category on the sales of another product category. Besides, these price, promotion and sales data are available not just for one store, but for a whole chain of stores. We propose to study cross-category effects by using a multiclass VAR model: we jointly estimate cross-category effects for several distinct but related VAR models, one for each store. Our methodology encourages effects to be similar across stores, while still allowing for small differences between stores to account for store heterogeneity. Moreover, our estimator is sparse: unimportant effects are estimated as exactly 0, which facilitates the interpretation of the results. A simulation study shows that the multiclass estimator proposed improves estimation accuracy by borrowing strength across classes. Finally, we provide three visual tools showing clustering of stores with similar cross-category effects, networks of product categories and similarity matrices of shared cross-category effects across stores.

P-values from random effects linear regression models

lme4::lmer is a useful frequentist approach to hierarchical/multilevel linear regression modelling. For good reason, the model output only includes t-values and doesn’t include p-values (partly due to the difficulty in estimating the degrees of freedom, as discussed here). Yes, p-values are evil and we should continue to try and expunge them from our analyses. But I keep getting asked about this. So here is a simple bootstrap method to generate two-sided parametric p-values on the fixed effects coefficients. Interpret with caution.
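The parametric-bootstrap idea behind that post is language-agnostic. The toy below uses plain OLS in Python rather than lme4, purely to show the two-sided p-value construction: simulate data under the null, refit, and see how extreme the observed coefficient is.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)          # true slope 0.5

def slope(x, y):
    """OLS slope estimate via covariance / variance."""
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

beta_hat = slope(x, y)
resid = y - beta_hat * x

# Bootstrap under the *null* (slope = 0): build y from resampled residuals
# only, refit, and collect the null distribution of the slope estimate.
boot = np.array([slope(x, rng.choice(resid, size=n, replace=True))
                 for _ in range(2000)])
p_value = 2 * min((boot >= beta_hat).mean(), (boot <= beta_hat).mean())
print(beta_hat, p_value)
```

With a strong effect the observed slope can exceed every bootstrap draw, in which case the p-value is best reported as "< 1/B" rather than exactly zero. Interpret with caution, as the post says.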

Setting up RStudio Server quickly on Amazon EC2

I have recently been working on projects using Amazon EC2 (elastic compute cloud), and RStudio Server. I thought I would share some of my working notes. Amazon EC2 supplies near instant access to on-demand disposable computing in a variety of sizes (billed in hours). RStudio Server supplies an interactive user interface to your remote R environment that is nearly indistinguishable from a local RStudio console. The idea is: for a few dollars you can work interactively on R tasks requiring hundreds of GB of memory and tens of CPUs and GPUs. If you are already an Amazon EC2 user with some Unix experience it is very easy to quickly stand up a powerful R environment, which is what I will demonstrate in this note.

Fitting a TensorFlow Linear Classifier with tfestimators

In a recent post, I mentioned three avenues for working with TensorFlow from R:
• The keras package, which uses the Keras API for building scalable deep learning models
• The tfestimators package, which wraps Google’s Estimators API for fitting models with pre-built estimators
• The tensorflow package, which provides an interface to Google’s low-level TensorFlow API
In this post, Edgar and I use the linear_classifier() function, one of six pre-built models currently in the tfestimators package, to train a linear classifier using data from the titanic package.

R Packages worth a look

Computation of the Sparse Inverse Subset (sparseinv)
Creates a wrapper for the ‘SuiteSparse’ routines that execute the Takahashi equations. These equations compute the elements of the inverse of a sparse matrix at locations where its Cholesky factor is non-zero. The resulting matrix is known as a sparse inverse subset. Some helper functions are also implemented. Support for spam matrices is currently limited and will be implemented in the future. See Rue and Martino (2007) <doi:10.1016/j.jspi.2006.07.016> and Zammit-Mangion and Rougier (2017) <arXiv:1707.00892> for the application of these equations to statistics.

Gaussian Process Modeling of Multi-Response Datasets (GPM)
Provides a general and efficient tool for fitting a response surface to datasets via Gaussian processes. The dataset can have multiple responses. The package is based on the work of Bostanabad, R., Kearney, T., Tao, S., Apley, D. W. & Chen, W. Leveraging the nugget parameter for efficient Gaussian process modeling (2017) <doi:10.1002/nme.5751>.

Fast Generators and Iterators for Permutations, Combinations and Partitions (arrangements)
Fast generators and iterators for permutations, combinations and partitions. The iterators allow users to generate arrangements in a memory-efficient manner, and the generated arrangements are in lexicographical (dictionary) order. Permutations and combinations can be drawn with/without replacement and support multisets. It has been demonstrated that ‘arrangements’ outperforms most existing packages of a similar kind. Some benchmarks can be found at <https://…/benchmark.html>.
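Python's itertools gives the same lexicographic-order guarantee described above when the input is sorted (shown here only for comparison; it does not match ‘arrangements’ for speed, and multisets need manual deduplication):

```python
from itertools import combinations, permutations

# Arrangements come out in dictionary order when the input is sorted.
print(list(permutations("abc", 2)))
# [('a','b'), ('a','c'), ('b','a'), ('b','c'), ('c','a'), ('c','b')]

print(list(combinations([1, 2, 3, 4], 2)))
# [(1,2), (1,3), (1,4), (2,3), (2,4), (3,4)]

# Combinations of a multiset (repeated elements) need deduplication:
print(sorted(set(combinations([1, 1, 2], 2))))
# [(1, 1), (1, 2)]
```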

Calculating and Visualizing ROC Curves Across Multi-Class Classifications (multiROC)
Tools to solve real-world problems with multiple classes by computing the areas under ROC curve via micro-averaging and macro-averaging. The methodology is described in V. Van Asch (2013) <https://…/microaverage.pdf> and Pedregosa et al. (2011) <http://…/plot_roc.html>.
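The difference between the two averaging schemes can be seen on a toy three-class example: macro-averaging means each per-class AUC first and micro-averaging pools all class decisions before computing one AUC. The numbers below are invented, and scikit-learn is used here rather than multiROC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

y_true = np.array([0, 0, 1, 1, 2, 2])
y_score = np.array([      # one score column per class
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.2, 0.3],      # a deliberately mis-scored class-1 example
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
])
y_bin = label_binarize(y_true, classes=[0, 1, 2])   # one-vs-rest indicator

macro = roc_auc_score(y_bin, y_score, average="macro")  # mean of per-class AUCs
micro = roc_auc_score(y_bin, y_score, average="micro")  # pool all decisions first
print(macro, micro)   # the two averages disagree on this example
```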

Palettes and graphics matching your RStudio editor (editheme)
The editheme package provides a collection of color palettes designed to match the different themes available in RStudio. It also includes functions to customize ‘base’ and ‘ggplot2’ graph styles in order to harmonize the look of your favorite IDE.

If you did not already know

Partial Membership Latent Dirichlet Allocation (PM-LDA) google
For many years, topic models (e.g., pLSA, LDA, SLDA) have been widely used for segmenting and recognizing objects in imagery simultaneously. However, these models are confined to the analysis of categorical data, forcing a visual word to belong to one and only one topic. There are many images in which some regions cannot be assigned a crisp categorical label (e.g., transition regions between a foggy sky and the ground or between sand and water at a beach). In these cases, a visual word is best represented with partial memberships across multiple topics. To address this, we present a partial membership latent Dirichlet allocation (PM-LDA) model and associated parameter estimation algorithms. PM-LDA defines a novel partial membership model for word and document generation. We employ Gibbs sampling for parameter estimation. Experimental results on two natural image datasets and one SONAR image dataset show that PM-LDA can produce both crisp and soft semantic image segmentations; a capability existing methods do not have. …

Error-Robust Multi-View Clustering (EMVC) google
In the era of big data, data may come from multiple sources, known as multi-view data. Multi-view clustering aims at generating better clusters by exploiting complementary and consistent information from multiple views rather than relying on an individual view. Due to inevitable system errors caused by data-capturing sensors or other sources, the data in each view may be erroneous. Various types of errors behave differently and inconsistently in each view. More precisely, errors can manifest as noise or corruption in practice. Unfortunately, none of the existing multi-view clustering approaches handle all of these error types. Consequently, their clustering performance is dramatically degraded. In this paper, we propose a novel Markov chain method for Error-Robust Multi-View Clustering (EMVC). By decomposing each view into a shared transition probability matrix and an error matrix and imposing structured sparsity-inducing norms on the error matrices, we characterize and handle typical types of errors explicitly. To solve the challenging optimization problem, we propose a new efficient algorithm based on Augmented Lagrangian Multipliers and prove its convergence rigorously. Experimental results on various synthetic and real-world datasets show the superiority of the proposed EMVC method over the baseline methods and its robustness against different types of errors. …

Measure Differential Equations (MDE) google
A new type of differential equations for probability measures on Euclidean spaces, called Measure Differential Equations (briefly MDEs), is introduced. MDEs correspond to Probability Vector Fields, which map measures on a Euclidean space to measures on its tangent bundle. Solutions are intended in the weak sense, and existence, uniqueness and continuous-dependence results are proved under suitable conditions. The latter are expressed in terms of the Wasserstein metric on the base and fiber of the tangent bundle. MDEs represent a natural measure-theoretic generalization of Ordinary Differential Equations via a monoid morphism mapping sums of vector fields to fiber convolution of the corresponding Probability Vector Fields. Various examples, including finite-speed diffusion and concentration, are shown, together with relationships to Partial Differential Equations. Finally, MDEs are also natural mean-field limits of multi-particle systems, with convergence results extending the classical Dobrushin approach. …

Document worth reading: “Active Learning for Visual Question Answering: An Empirical Study”

We present an empirical study of active learning for Visual Question Answering, where a deep VQA model selects informative question-image pairs from a pool and queries an oracle for answers to maximally improve its performance under a limited query budget. Drawing analogies from human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a fast and effective goal-driven active learning scoring function to pick question-image pairs for deep VQA models under the Bayesian Neural Network framework. We find that deep VQA models need large amounts of training data before they can start asking informative questions. But once they do, all three approaches outperform the random selection baseline and achieve significant query savings. For the scenario where the model is allowed to ask generic questions about images but is evaluated only on specific questions (e.g., questions whose answer is either yes or no), our proposed goal-driven scoring function performs the best. Active Learning for Visual Question Answering: An Empirical Study
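The entropy ("cramming") criterion mentioned above can be sketched in a few lines: score each pool item by the entropy of the model's predicted answer distribution and query the most uncertain one. The pool items and probabilities below are invented for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical model outputs over the answer vocabulary for three pool items.
pool = {
    "q1": [0.98, 0.01, 0.01],   # model already confident -> low value
    "q2": [0.34, 0.33, 0.33],   # model maximally unsure  -> high value
    "q3": [0.70, 0.20, 0.10],
}

# The highest-entropy question-image pair is sent to the oracle for labelling.
query = max(pool, key=lambda k: entropy(pool[k]))
print(query)   # "q2"
```

The expected-model-change and expected-error-reduction criteria replace the entropy score with their respective acquisition functions but keep the same argmax-over-the-pool loop.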

Magister Dixit

“An agile environment is one that’s adaptive and promotes evolutionary development and continuous improvement. It fosters flexibility and champions fast failures. Perhaps most importantly, it helps software development teams build and deliver optimal solutions as rapidly as possible. That’s because in today’s competitive market chock-full of tech-savvy customers used to new apps and app updates every day and copious amounts of data with which to work, IT teams can no longer respond to IT requests with months-long development cycles. It doesn’t matter if the request is from a product manager looking to map the next rev’s upgrade or a data scientist asking for a new analytics model.” Tom Phelan ( February 10, 2015 )