# Magister Dixit

Saturday, 24 Mar 2018
Posted in Magister Dixit

“You shouldn’t be collecting Big Data under the premise that more data is better, cooler, sexier, etc.” Pradyumna S. Upadrashta (February 13, 2015)

Saturday, 24 Mar 2018
Posted in R Packages

*‘C’ and ‘Java’ Source Code Generator for Fitted Glm Objects*

Provides two functions that generate source code implementing the predict function of fitted glm objects. In this version, code can be generated for either ‘C’ or ‘Java’. The idea is to provide a tool for the easy and fast deployment of glm predictive models into production. The source code generated by this package implements two functions/methods: one implements the equivalent of predict(type=’response’), while the second implements predict(type=’link’). Source code is written to disk as a .c or .java file in the specified path. In the case of ‘C’, an .h file is also generated.
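The two prediction types that the generated ‘C’/‘Java’ code mirrors can be illustrated in R itself; a minimal sketch using the built-in `mtcars` data (the fitted model here is only an example, not part of the package):

```r
# Logistic regression: transmission type as a function of weight and horsepower.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

eta <- predict(fit, type = "link")      # linear predictor: X %*% beta
p   <- predict(fit, type = "response")  # inverse link applied to eta

# For the binomial/logit family, the response is 1 / (1 + exp(-eta)).
all.equal(unname(p), unname(1 / (1 + exp(-eta))))  # TRUE
```

The generated .c or .java file hard-codes exactly these two computations, so the fitted coefficients can be evaluated in production without an R runtime.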

Bayesian synthetic likelihood (BSL, Price et al. (2018) <doi:10.1080/10618600.2017.1302882>) is an alternative to standard, non-parametric approximate Bayesian computation (ABC). BSL assumes a multivariate normal distribution for the summary statistic likelihood and it is suitable when the distribution of the model summary statistics is sufficiently regular. This package provides a Metropolis Hastings Markov chain Monte Carlo implementation of BSL and BSL with graphical lasso (BSLasso, An et al. (2018) <https://…/> ), which is computationally more efficient when the dimension of the summary statistic is high. Extensions to this package are planned.

Counterparts to R string manipulation functions that account for the effects of ANSI text formatting control sequences.

Core package containing all the tools for simple and advanced source extraction. This is used to create inputs for ‘ProFit’, or for source detection, extraction and photometry in its own right.

Create and combine HTML and PDF reports from within R. Tables and listings can be designed for reporting, and R plots can be included.

Saturday, 24 Mar 2018
Posted in Distilled News

**Aspiring Data Scientists! Start to learn Statistics with these 6 books!**

1. You Are Not So Smart — by David McRaney

2. Think Like a Freak — by Dubner & Levitt

3. Innumeracy — by John Allen Paulos

4. Naked Statistics — by Charles Wheelan

5. Practical Statistics for Data Scientists — by Andrew & Peter Bruce

6. Think Stats — by Allen B. Downey


**Scalable Select of Random Rows in SQL**

If you’re new to the big data world and are migrating from tools like Google Analytics or Mixpanel for your web analytics, you have probably noticed performance differences. Google Analytics can show you predefined reports in seconds, while the same query over the same data in your data warehouse can take several minutes or more. Such performance boosts are achieved by sampling, i.e. running the query over a random subset of rows. Let’s learn how to select random rows in SQL.
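A minimal sketch of the idea, assuming PostgreSQL and a hypothetical `events` table:

```sql
-- Naive approach: shuffles the entire table before picking n rows,
-- so every query pays a full sort and cannot scale:
SELECT * FROM events ORDER BY random() LIMIT 10000;

-- Scalable approach: keep each row independently with probability 1%;
-- a single scan, no global sort:
SELECT * FROM events WHERE random() < 0.01;

-- PostgreSQL also offers block-level sampling built in:
SELECT * FROM events TABLESAMPLE SYSTEM (1);
```

Note that `random()` semantics differ across engines (SQLite’s `RANDOM()`, for instance, returns a large integer rather than a value in [0, 1)), so the threshold trick needs adapting per database.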

**Human Involvement Helps Researchers Perfect New Algorithms to Train Robots**

Many underestimate the role of humans in the successful deployment of AI solutions. The Alegion engine produces AI training data and enables content moderation, sentiment analysis, data enrichment, tagging, categorization, and more.

**CatBoost vs. Light GBM vs. XGBoost**

I recently participated in a Kaggle competition (the WiDS Datathon by Stanford), where I was able to land in the Top 10 using various boosting algorithms. Since then, I have been very curious about the inner workings of each model, including parameter tuning and pros and cons, and hence decided to write this blog. Despite the recent re-emergence and popularity of neural networks, I am focusing on boosting algorithms because they are still more useful in the regime of limited training data, little training time, and little expertise for parameter tuning.

**The most prolific package maintainers on CRAN**

During a discussion with some other members of the R Consortium, the question came up: who maintains the most packages on CRAN? DataCamp maintains a list of most active maintainers by downloads, but in this case we were interested in the total number of packages by maintainer. Fortunately, this is pretty easy to figure out thanks to the CRAN repository tools now included in R, and a little dplyr (see the code below) gives the answer quickly.
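A sketch of how such a count can be computed (this is my reconstruction, not necessarily the author’s exact code; `tools::CRAN_package_db()` downloads the current package metadata, so it needs network access):

```r
library(dplyr)

# One row per CRAN package, including a Maintainer field of the
# form "Name <email>".
db <- tools::CRAN_package_db()

db %>%
  mutate(Maintainer = sub(" *<.*>", "", Maintainer)) %>%  # strip the email part
  count(Maintainer, sort = TRUE) %>%
  head(10)
```

The `count(…, sort = TRUE)` step tallies packages per maintainer and orders the result, so the top of the output answers the question directly.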

**Comparing additive and multiplicative regressions using AIC in R**

One of the basic things students are taught in statistics classes is that the comparison of models using information criteria can only be done when the models have the same response variable. This means, for example, that when you model log(y_t) and calculate AIC, this value is not comparable with the AIC from a model of y_t, because the scales of the variables differ. But there is a way to make the criteria comparable in these two cases: both variables need to be transformed to the original scale, and we need to understand what the distributions of these variables are on that scale.
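A worked sketch of the correction on simulated data (my own illustration of the change-of-variable argument, not the article’s code): if log(y) is modelled as normal, the density of y on the original scale picks up a Jacobian factor 1/y, so the log-likelihood loses sum(log(y)) and the AIC gains 2 * sum(log(y)).

```r
set.seed(1)
x <- runif(200, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(200, sd = 0.2))  # multiplicative errors

fit_add  <- lm(y ~ x)        # additive model, response y
fit_mult <- lm(log(y) ~ x)   # multiplicative model, response log(y)

# AIC(fit_mult) lives on the log scale; adding the Jacobian term
# 2 * sum(log(y)) moves it to the scale of y, making the two comparable:
AIC(fit_add)
AIC(fit_mult) + 2 * sum(log(y))
```

On this data, generated with multiplicative errors, the corrected AIC of the log model comes out lower, as expected.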

**Machine Learning Modelling in R :: Cheat Sheet**

I came across this excellent article lately “Machine learning at central banks” which I decided to use as a basis for a new cheat sheet called Machine Learning Modelling in R. The cheat sheet can be downloaded from RStudio cheat sheets repository.

There was a recent blog post on mental models for deep learning, drawing parallels from optics. We all have intuitions for a few models, but it is hard to put them into words; I believe it is necessary to work collectively on these mental models.

Saturday, 24 Mar 2018
Posted in Documents

**Static and Dynamic Robust PCA via Low-Rank + Sparse Matrix Decomposition: A Review**

Principal Components Analysis (PCA) is one of the most widely used dimension reduction techniques. Robust PCA (RPCA) refers to the problem of PCA when the data may be corrupted by outliers. Recent work by Candes, Wright, Li, and Ma defined RPCA as the problem of decomposing a given data matrix into the sum of a low-rank matrix (true data) and a sparse matrix (outliers). The column space of the low-rank matrix then gives the PCA solution. This simple definition has led to a large amount of interesting new work on provably correct, fast, and practically useful solutions to the RPCA problem. More recently, the dynamic (time-varying) version of the RPCA problem has been studied and a series of provably correct, fast, and memory-efficient tracking solutions have been proposed. Dynamic RPCA (or robust subspace tracking) is the problem of tracking data lying in a (slowly) changing subspace while being robust to sparse outliers. This article provides an exhaustive review of the last decade of literature on RPCA and its dynamic counterpart (robust subspace tracking), along with describing their theoretical guarantees, discussing the pros and cons of various approaches, and providing empirical comparisons of performance and speed.

Friday, 23 Mar 2018
Posted in What is ...

**Algorithmic Social Intervention**

Social and behavioral interventions are a critical tool for governments and communities to tackle deep-rooted societal challenges such as homelessness, disease, and poverty. However, real-world interventions are almost always plagued by limited resources and limited data, which creates a computational challenge: how can we use algorithmic techniques to enhance the targeting and delivery of social and behavioral interventions? The goal of my thesis is to provide a unified study of such questions, collectively considered under the name ‘algorithmic social intervention’. This proposal introduces algorithmic social intervention as a distinct area with characteristic technical challenges, presents my published research in the context of these challenges, and outlines open problems for future work. A common technical theme is decision making under uncertainty: how can we find actions which will impact a social system in desirable ways under limitations of knowledge and resources? The primary application area for my work thus far is public health, e.g. HIV or tuberculosis prevention. For instance, I have developed a series of algorithms which optimize social network interventions for HIV prevention. Two of these algorithms have been pilot-tested in collaboration with LA-area service providers for homeless youth, with preliminary results showing substantial improvement over status-quo approaches. My work also spans other topics in infectious disease prevention and underlying algorithmic questions in robust and risk-aware submodular optimization. …

**Competitive Intelligence (CI)**

Competitive intelligence is the action of defining, gathering, analyzing, and distributing intelligence about products, customers, competitors, and any aspect of the environment needed to support executives and managers making strategic decisions for an organization. Competitive intelligence essentially means understanding and learning what’s happening in the world outside your business so one can be as competitive as possible. It means learning as much as possible, as soon as possible, about one’s industry in general, one’s competitors, or even one’s county’s particular zoning rules. In short, it empowers you to anticipate and face challenges head on. A more focused definition of CI regards it as the organizational function responsible for the early identification of risks and opportunities in the market before they become obvious. Experts also call this process early signal analysis. This definition focuses attention on the difference between the dissemination of widely available factual information (such as market statistics, financial reports, newspaper clippings) performed by functions such as libraries and information centers, and competitive intelligence, which is a perspective on developments and events aimed at yielding a competitive edge.

Competitive Intelligence and 6 Tips for Its Effective Use …

**Border-Peeling Clustering**

In this paper, we present a novel non-parametric clustering technique, which is based on an iterative algorithm that peels off layers of points around the clusters. Our technique is based on the notion that each latent cluster is comprised of layers that surround its core, where the external layers, or border points, implicitly separate the clusters. Analyzing the K-nearest neighbors of the points makes it possible to identify the border points and associate them with points of inner layers. Our clustering algorithm iteratively identifies border points, peels them, and separates the latent clusters. We show that the peeling process adapts to the local density and successfully separates adjacent clusters. A notable quality of the Border-Peeling algorithm is that it does not require any parameter tuning in order to outperform state-of-the-art finely-tuned non-parametric clustering methods, including Mean-Shift and DBSCAN. We further assess our technique on high-dimensional datasets that vary in size and characteristics. In particular, we analyze the space of deep features that were trained by a convolutional neural network. …

Friday, 23 Mar 2018
Posted in Magister Dixit

“Hadoop has an irreparably fractured ecosystem.” Joey Zwicker (February 12, 2015)