If you did not already know

Tube Convolutional Neural Network (T-CNN) google
Deep learning has been demonstrated to achieve excellent results for image classification and object detection. However, the impact of deep learning on video analysis (e.g. action detection and recognition) has been limited due to the complexity of video data and the lack of annotations. Previous convolutional neural network (CNN) based video action detection approaches usually consist of two major steps: frame-level action proposal detection and association of proposals across frames. These methods also employ a two-stream CNN framework to handle spatial and temporal features separately. In this paper, we propose an end-to-end deep network called Tube Convolutional Neural Network (T-CNN) for action detection in videos. The proposed architecture is a unified network that is able to recognize and localize actions based on 3D convolution features. A video is first divided into equal-length clips, and for each clip a set of tube proposals is then generated based on 3D Convolutional Network (ConvNet) features. Finally, the tube proposals of different clips are linked together using network flow, and spatio-temporal action detection is performed on these linked video proposals. Extensive experiments on several video datasets demonstrate the superior performance of T-CNN for classifying and localizing actions in both trimmed and untrimmed videos compared to state-of-the-art methods. …
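The linking step can be illustrated with a toy greedy sketch. The paper solves the association with network flow; this simplification, with made-up boxes and actionness scores rather than T-CNN outputs, only shows the idea of chaining per-clip proposals into a video-level tube:

```python
# Toy illustration of linking per-clip tube proposals into a video-level tube.
# T-CNN uses network flow; here we greedily pick, for each clip transition,
# the proposal maximizing actionness + overlap with the tube so far.
# All boxes and scores below are made-up numbers, not T-CNN outputs.

def overlap(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

def link_tubes(clips):
    """clips: list of per-clip proposal lists, each entry (box, actionness)."""
    box, _ = max(clips[0], key=lambda p: p[1])
    tube = [box]
    for proposals in clips[1:]:
        # Extend the tube with the proposal that best trades off its own
        # actionness against spatial overlap with the last linked box.
        box, _ = max(proposals,
                     key=lambda p: p[1] + overlap(tube[-1], p[0]))
        tube.append(box)
    return tube

clips = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.4)],
    [((1, 1, 11, 11), 0.5), ((48, 49, 61, 62), 0.8)],
]
print(link_tubes(clips))  # follows the spatially consistent proposal
```

The greedy choice can break on occlusions or crossing actors, which is why the paper casts the association as a global network-flow problem instead.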

spaCy google
spaCy, “Industrial-strength NLP”, is a library for advanced natural language processing in Python and Cython.
spaCy is built on the very latest research, but it isn’t researchware. It was designed from day 1 to be used in real products. You can buy a commercial license, or you can use it under the AGPL. Features:
• Labelled dependency parsing (91.8% accuracy on OntoNotes 5)
• Named entity recognition (82.6% accuracy on OntoNotes 5)
• Part-of-speech tagging (97.1% accuracy on OntoNotes 5)
• Easy to use word vectors
• All strings mapped to integer IDs
• Export to numpy data arrays
• Alignment maintained to original string, ensuring easy mark up calculation
• Range of easy-to-use orthographic features.
• No pre-processing required. spaCy takes raw text as input, warts and newlines and all.

Streaming Variational Bayes (SVB) google
We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data – a case where SVI may be applied – and in the streaming setting, where SVI does not apply. …


Document worth reading: “EDISON Data Science Framework”

The EDISON Data Science Framework is a collection of documents that define the Data Science profession. Freely available, these documents have been developed to guide educators and trainers, employers and managers, and Data Scientists themselves. Collectively, they break down the complexity of the skills and competences needed to define Data Science as a professional practice. EDISON Data Science Framework

R Packages worth a look

Additional Tools for Developing Spatially Explicit Discrete Event Simulation (SpaDES) Models (
Provides GIS/map utilities and additional modeling tools for developing cellular automata and agent based models in ‘SpaDES’.

Quiver Plots for ‘ggplot2’ (ggquiver)
An extension of ‘ggplot2’ to provide quiver plots to visualise vector fields. This functionality is implemented using a geom to produce a new graphical layer, which allows aesthetic options. This layer can be overlaid on a map to improve visualisation of mapped data.

Data from Surveys Conducted by Forwards (forwards)
Anonymized data from surveys conducted by Forwards <http://…/>, the R Foundation task force on women and other under-represented groups. Currently, a single data set of responses to a survey of attendees at useR! 2016 <http://…/>, the R user conference held at Stanford University, Stanford, California, USA, June 27 – June 30 2016.

Book Memo: “Now You See It: Simple Visualization Techniques for Quantitative Analysis”

Now You See It: Simple Visualization Techniques for Quantitative Analysis teaches simple, practical means to explore and analyze quantitative data–techniques that rely primarily on using your eyes. This book features graphical techniques that can be applied to a broad range of software tools, including Microsoft Excel, because so many people have nothing else, but also more powerful visual analysis tools that can dramatically extend your analytical reach. You’ll learn to make sense of quantitative data by discerning the meaningful patterns, trends, relationships, and exceptions that measure your organization’s performance, identify potential problems and opportunities, and reveal what will likely happen in the future. Now You See It is not just for those with ‘analyst’ in their titles, but for everyone who’s interested in discovering the stories in their data that reveal their organization’s performance and how it can be improved.

Document worth reading: “Copy the dynamics using a learning machine”

Is it possible to generally construct a dynamical system to simulate a black system without recovering the equations of motion of the latter? Here we show that this goal can be approached by a learning machine. Trained on a set of input-output responses or a segment of time series of a black system, a learning machine can serve as a copy system to mimic the dynamics of various black systems. It can not only behave as the black system at the parameter set where the training data were generated, but also recur the evolution history of the black system. As a result, the learning machine provides an effective way for prediction, and enables one to probe the global dynamics of a black system. These findings have significance for practical systems whose equations of motion cannot be approached accurately. Examples of copying the dynamics of an artificial neural network, the Lorenz system, and a variable star are given. Our idea paves a possible way towards copying a living brain. Copy the dynamics using a learning machine

If you did not already know

Significance-Offset Convolutional Neural Network google
We propose ‘Significance-Offset Convolutional Neural Network’, a deep convolutional network architecture for multivariate time series regression. The model is inspired by standard autoregressive (AR) models and gating mechanisms used in recurrent neural networks. It involves an AR-like weighting system, where the final predictor is obtained as a weighted sum of sub-predictors while the weights are data-dependent functions learnt through a convolutional network. The architecture was designed for applications on asynchronous time series with low signal-to-noise ratio and hence is evaluated on such datasets: a hedge fund proprietary dataset of over 2 million quotes for a credit derivative index and an artificially generated noisy autoregressive series. The proposed architecture achieves promising results compared to convolutional and recurrent neural networks. The code for the numerical experiments and the architecture implementation will be shared online to make the research reproducible. …
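The AR-like weighting can be sketched in a few lines. This is not the paper's network: the data-dependent weights here come from a hand-coded recency heuristic passed through a softmax, standing in for the convolutional "significance" network, and the sub-predictors simply echo the lagged values:

```python
import math

# Toy version of the AR-like weighting in SOCNN: the final prediction is a
# weighted sum of sub-predictors, with weights that depend on the input.
# In the paper the weights are learnt by a convolutional network; here a
# hand-coded recency heuristic (an assumption for illustration) replaces it.

def softmax(xs):
    m = max(xs)                       # subtract the max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def predict(lags):
    """lags: most-recent-last window of a univariate series."""
    sub_predictors = lags             # sub-predictor i just echoes lag i
    significance = list(range(len(lags)))   # favour recent lags (assumed)
    weights = softmax(significance)
    return sum(w * p for w, p in zip(weights, sub_predictors))

print(predict([1.0, 2.0, 4.0]))  # weighted toward the most recent lag
```

The point of the construction is that the weighting, not just the sub-predictors, is a function of the data, which is what distinguishes it from a fixed-coefficient AR model.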

QICD (QICD) google
QICD is an extremely fast iterative coordinate descent algorithm for high-dimensional nonconvex penalized quantile regression. The algorithm combines coordinate descent in the inner iteration with a majorization-minimization step in the outer iteration. For each inner univariate minimization problem, we only need to compute a one-dimensional weighted median, which ensures fast computation. Tuning parameter selection is based on two different methods: cross validation and the BIC for the quantile regression model. Details are described in Peng, B. and Wang, L. (2015), linked to via the URL below, with <DOI:10.1080/10618600.2014.913516>. …
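The one-dimensional weighted median that the inner step reduces to is cheap to compute. A minimal sketch (a generic sort-and-scan implementation, not the QICD package code):

```python
# The inner step of QICD reduces each univariate minimization to a
# one-dimensional weighted median, computable in O(n log n) by sorting.

def weighted_median(values, weights):
    """Smallest v whose cumulative weight reaches half the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v

print(weighted_median([3, 1, 4, 2], [1, 1, 1, 5]))  # 2: the heavy weight wins
print(weighted_median([1, 2, 3], [1, 1, 1]))        # 2: ordinary median
```

With equal weights this is the usual sample median; the data-dependent weights are what let the quantile-regression subproblem be solved exactly in one pass.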

Dirichlet Process Mixture Model (DPMM) google
The Dirichlet process is a family of non-parametric Bayesian models which are commonly used for density estimation, semi-parametric modelling and model selection/averaging. Dirichlet processes are non-parametric in the sense that they have an infinite number of parameters. Since they are treated in a Bayesian framework, we can construct large models with infinitely many parameters, which we integrate out to avoid overfitting. …
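The "infinite number of parameters" can be made concrete via the Chinese Restaurant Process view of the Dirichlet process: each observation joins an existing cluster with probability proportional to its size, or opens a new cluster with probability proportional to the concentration parameter. A minimal sketch (cluster assignments only, no component densities):

```python
import random

# Chinese Restaurant Process sampling of cluster assignments under a
# Dirichlet process: table k attracts a new customer with weight counts[k],
# and a new table opens with weight alpha. The number of clusters is not
# fixed in advance - it grows with the data.

def crp_assignments(n, alpha, rng):
    counts = []                       # customers per table (cluster sizes)
    assignments = []
    for _ in range(n):
        weights = counts + [alpha]    # existing tables, then a new table
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(1)          # open a new cluster
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

rng = random.Random(0)
labels, sizes = crp_assignments(100, alpha=1.0, rng=rng)
print(len(sizes), "clusters for 100 points")
```

In a full DPMM each table would additionally carry a parameter drawn from the base distribution, and inference (e.g. Gibbs sampling) would resample these assignments given the data.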

Distilled News

Data Lineage: The History of your Data

A common scenario that data analysts encounter is what I like to describe as ‘data denialism’. Often, and especially while consulting, an analyst will find that the data tells a different story than what the customer holds to be true. It is also often the case that, when the analyst presents this finding, the customer will outright deny the evidence, asserting that either the data or the analysis must be wrong. For example, it may be that a retailer focused on the low-end market is getting most of its sales from high-end customers, and such a fact upends months (maybe even years) of marketing planning and strategy. (This may, or may not, be based on one of my previous consulting experiences.) It is of course part of the analyst’s job to present and discuss such controversial findings carefully and in a way that they can be understood and accepted, or to tell a story that is compelling enough to be believable. Of course, too, some discussion about findings is definitely healthy and desirable. But even if the customer is convinced that the analyst did their job right, there’s still the matter of the data itself, for how can the customer be assured that the data is correct? After the myriad transformations, schema modifications, unifications and predictive tasks, how can even the analyst be sure that everything went right?

Causation: The Why Beneath The What

A lot of marketing research is aimed at uncovering why consumers do what they do and not just predicting what they’ll do next. Marketing scientist Kevin Gray asks Harvard Professor Tyler VanderWeele about causal analysis, arguably the next frontier in analytics.

RStudio Server Pro is ready for BigQuery on the Google Cloud Platform

RStudio is excited to announce the availability of RStudio Server Pro on the Google Cloud Platform.

Finding chairs the data scientist way! (Hint: using Deep Learning) – Part I

I have been going through the deep learning literature for quite some time now. I have also participated in a few challenges to get my hands dirty. But what I enjoy the most is applying deep learning to a real-life problem, one that touches my daily life. This is partly why I picked this problem of chair count recognition: to finally solve a problem which was unsolved till now! In this article, I will cover how I defined the problem and the steps I took to solve it. Consider it a raw, uncut version of my experience as I tried to solve the problem.

Clustering applied to showers in the OPERA

In this post I discuss clustering: the techniques behind this method and some peculiarities of using clustering in practice. This post continues the previous one about the OPERA.

Dots vs. polygons: How I choose the right visualization

When I start designing a map I consider: How do I want the viewer to read the information on my map? Do I want them to see how a measurement varies across a geographic area at a glance? Do I want to show the level of variability within a specific region? Or do I want to indicate busy pockets of activity or the relative volume/density within an area?

Probability Functions Beginner

In this set of exercises, we are going to explore some of the probability functions in R with practical applications. Basic probability knowledge is required. Note: We are going to use random number functions and random process functions in R such as runif. A problem with these functions is that every time you run them you will obtain a different value. To make your results reproducible you can specify the value of the seed using set.seed(‘any number’) before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random numbers.) For this set of exercises we will use set.seed(1); don’t forget to specify it before every random exercise.
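The exercises use R's set.seed(), but the reproducibility idea is language-agnostic. The same pattern in Python (a rough analogue, not a translation of the exercises: random.seed and random.uniform play the roles of set.seed and runif):

```python
import random

# Seeding the generator makes every "random" draw repeatable: two runs
# started from the same seed produce identical sequences.

random.seed(1)
first_run = [random.uniform(0, 1) for _ in range(3)]   # like runif(3) in R

random.seed(1)                                         # reset the seed
second_run = [random.uniform(0, 1) for _ in range(3)]

print(first_run == second_run)  # True: same seed, same draws
```

Note that seeding must happen immediately before each draw you want to reproduce; any intervening call to the generator changes what comes next.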

If you did not already know

Parallel Data Assimilation Framework (PDAF) google
The Parallel Data Assimilation Framework – PDAF – is a software environment for ensemble data assimilation. PDAF simplifies the implementation of a data assimilation system with existing numerical models. With this, users can obtain a data assimilation system with less work and can focus on applying data assimilation. PDAF provides fully implemented and optimized data assimilation algorithms, in particular ensemble-based Kalman filters like LETKF and LSEIK. It allows users to easily test different assimilation algorithms and observations. PDAF is optimized for use with large-scale models that usually run on big parallel computers and is suitable for operational applications. However, it is also well suited for smaller models and even toy models. PDAF provides a standardized interface that separates the numerical model from the assimilation routines. This allows the assimilation methods and the model to be developed further independently of each other. New algorithmic developments can be readily made available through the interface such that they can be immediately applied with existing implementations. The test suite of PDAF provides small models for easy testing of algorithmic developments and for teaching data assimilation. PDAF is an open-source project. Its functionality will be further extended by input from research projects. In addition, users are welcome to contribute to the further enhancement of PDAF, e.g. by contributing additional assimilation methods or interface routines for different numerical models. …
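The core ensemble analysis step that frameworks like PDAF implement can be shown on a scalar toy problem. This is not the PDAF interface, just a minimal stochastic ensemble Kalman filter update with made-up numbers:

```python
import random

# Toy scalar ensemble Kalman filter analysis step, the kind of
# ensemble-based update (cf. LETKF/LSEIK) that PDAF provides for
# large models. All numbers are illustrative, not a PDAF API.

def enkf_update(ensemble, obs, obs_var, rng):
    n = len(ensemble)
    mean = sum(ensemble) / n
    var = sum((x - mean) ** 2 for x in ensemble) / (n - 1)
    gain = var / (var + obs_var)      # Kalman gain for a scalar state
    # Update each member against a stochastically perturbed observation,
    # which keeps the analysis ensemble spread statistically consistent.
    return [x + gain * (obs + rng.gauss(0, obs_var ** 0.5) - x)
            for x in ensemble]

rng = random.Random(42)
ensemble = [rng.gauss(0.0, 1.0) for _ in range(50)]   # forecast ensemble
analysis = enkf_update(ensemble, obs=1.0, obs_var=0.25, rng=rng)
print(sum(analysis) / len(analysis))  # analysis mean pulled toward the obs
```

A real system alternates this analysis step with forecasts by the numerical model; PDAF's contribution is the standardized interface and parallelization around exactly this loop.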

Data Structure Graph google
A Data Structure Graph is a group of atomic entities that are related to each other, stored in a repository, moved from one persistence layer to another, and rendered as a graph. …
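A minimal adjacency-list sketch of the idea, with hypothetical entity names for illustration: atomic entities as nodes, their relationships as edges, traversable as a graph:

```python
# Minimal adjacency-list representation: entities are nodes, the "related
# to each other" relation is an undirected edge set per node.

class DataStructureGraph:
    def __init__(self):
        self.edges = {}

    def relate(self, a, b):
        """Record that entities a and b are related (undirected)."""
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def neighbours(self, entity):
        return sorted(self.edges.get(entity, set()))

g = DataStructureGraph()
g.relate("customer", "order")     # hypothetical entities
g.relate("order", "product")
print(g.neighbours("order"))      # ['customer', 'product']
```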

Statistical Distance google
In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two statistical objects, which can be two random variables, or two probability distributions or samples, or the distance can be between an individual sample point and a population or a wider sample of points. A distance between populations can be interpreted as measuring the distance between two probability distributions and hence they are essentially measures of distances between probability measures. Where statistical distance measures relate to the differences between random variables, these may have statistical dependence, and hence these distances are not directly related to measures of distances between probability measures. Again, a measure of distance between random variables may relate to the extent of dependence between them, rather than to their individual values. Statistical distance measures are mostly not metrics and they need not be symmetric. Some types of distance measures are referred to as (statistical) divergences. …
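Two of the distances described above are easy to compute for discrete distributions, and they illustrate the metric/divergence distinction: total variation is symmetric and a metric, while the Kullback-Leibler divergence is not symmetric. A small sketch:

```python
import math

# Total variation distance (a metric) and KL divergence (asymmetric,
# hence a "divergence") between two discrete distributions given as
# probability vectors over the same support.

def total_variation(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    # Convention: terms with pi == 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(total_variation(p, q))   # 0.4
print(kl_divergence(p, q))     # differs from kl_divergence(q, p)
```

Swapping the arguments leaves total variation unchanged but changes the KL value, which is exactly why KL is called a divergence rather than a distance.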