Statistical Inference: A Tidy Approach

(Slides)


Datasets and deep learning methodologies to extend image-based applications to videos

In today’s tutorial, we will extend image-based applications to videos, covering pose estimation, captioning, and video generation.


Python For Loops and If Statements Combined (Python for Data Science Basics #6)

Last time I wrote about Python For Loops and If Statements. Today we will talk about how to combine them. In this article, I’ll show you – through a few practical examples – how to combine a for loop with another for loop and/or with an if statement!
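As a taste of what the article walks through, here is a minimal sketch of a for loop combined with an if statement; the list and threshold are made up for illustration:

```python
# Count how many values in a list exceed a threshold by
# combining a for loop with an if statement.
numbers = [3, 8, 1, 12, 7, 5]
count = 0
for n in numbers:
    if n > 5:
        count += 1
print(count)  # prints 3 (8, 12, and 7 exceed the threshold)
```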


A Must-Read Introduction to Sequence Modelling (with use cases)

Artificial Neural Networks (ANN) were supposed to replicate the architecture of the human brain, yet until about a decade ago, the only common feature between ANNs and our brain was the nomenclature of their entities (for instance, the neuron). These neural networks were of limited use, as they had low predictive power and few practical applications. But thanks to the rapid advancement of technology over the last decade, that gap has been bridged to the extent that ANN architectures have become extremely useful across industries. In this article, we will look at the two main advances in the field of artificial neural networks that have made ANNs more like the human brain.


LogRatio Regression – A Simple Way to Model Compositional Data

Compositional data are proportions of mutually exclusive groups that sum to unity. Statistical models for compositional data are applicable in a number of areas, e.g. the product or channel mix in marketing research and the asset allocation of an investment portfolio. In the example below, I will show how to model compositional outcomes with a simple LogRatio regression. The underlying idea is very simple. With a D-dimensional outcome [p_1, p_2, …, p_D], we can derive a (D-1)-dimensional outcome [log(p_2 / p_1), …, log(p_D / p_1)] and then estimate a multivariate regression based on the new outcome.
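The transform itself is a one-liner. Here is a minimal sketch in Python of the log-ratio step described above; the composition matrix is invented for illustration:

```python
import numpy as np

# A sketch of the log-ratio transform described above; the composition
# matrix is made up, with each row summing to 1.
P = np.array([
    [0.2, 0.3, 0.5],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
])

# Use the first component as the reference: log(p_j / p_1) for j = 2..D.
logratio = np.log(P[:, 1:] / P[:, [0]])
print(logratio.shape)  # (3, 2): each observation now has D-1 = 2 outcomes
```

A multivariate regression can then be estimated on these D-1 columns, exactly as the post describes.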


Imputing missing values in parallel using {furrr}


K-Means Clustering in R – Exercises

K-means is efficient and perhaps the most popular clustering method. It is a way of finding natural groups in otherwise unlabeled data. You specify the number of clusters you want and the algorithm minimizes the total within-cluster variance. In this exercise, we will play around with R’s built-in kmeans function on some labeled data.
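For readers following along in Python rather than R, here is an analogous sketch with scikit-learn; the iris data stands in for whatever labeled data the exercises use, so treat the details as assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Fit k-means with a chosen number of clusters and inspect the total
# within-cluster variance the algorithm minimizes (inertia).
X, y = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.inertia_)      # total within-cluster sum of squares
print(km.labels_[:10])  # cluster assignments, to compare against the labels
```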


How do I interpret the AIC?

My student asked today how to interpret the AIC (Akaike’s Information Criterion) statistic for model selection. We ended up bashing out some R code to demonstrate how to calculate the AIC for a simple GLM (generalized linear model). I always think that if you can understand the derivation of a statistic, it is much easier to remember how to use it. Now, if you google the derivation of the AIC, you are likely to run into a lot of math. But the principles are really not that complex. So here we will fit some simple GLMs, then derive a means to choose the ‘best’ one. Skip to the end if you just want to go over the basic principles. Before we can understand the AIC, though, we need to understand the statistical methodology of likelihoods.
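To make the likelihood connection concrete, here is a minimal sketch in Python with statsmodels (the post itself uses R, and the data here are simulated): AIC = 2k - 2·log-likelihood, where k is the number of estimated parameters, and the model with the lower AIC is preferred.

```python
import numpy as np
import statsmodels.api as sm

# Simulate data with a linear signal, then fit two GLMs of different
# complexity and compare AIC = 2k - 2*log-likelihood.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 + 3 * x + rng.normal(size=100)

X1 = sm.add_constant(x)                           # intercept + slope
X2 = sm.add_constant(np.column_stack([x, x**2]))  # adds a needless quadratic

m1 = sm.GLM(y, X1, family=sm.families.Gaussian()).fit()
m2 = sm.GLM(y, X2, family=sm.families.Gaussian()).fit()

for name, m in [("linear", m1), ("quadratic", m2)]:
    k = len(m.params)  # note: software differs on whether the scale counts
    print(name, m.aic, 2 * k - 2 * m.llf)  # the two columns should agree
```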


Writing better R functions part three

In my last post I worked on two functions that took pairs of variables from a dataset and produced some nice, useful ggplot plots from them. We started with the simplest case, plotting counts of how two variables cross-tabulate. Then we worked our way up to automating the process of plotting lots of pairings of variables from the same dataframe. We added a feature to change the plot type and tested for general functionality. To be honest, I thought I was in great shape until I started trying the function on a much larger dataset. Performance was terrible, and I had made at least a couple of mistakes. Today I’ll fix those problems and combine our two functions into one.


Tutorial: Multistep Forecasting with Seasonal ARIMA in Python

When trend and seasonality are present in a time series, instead of decomposing it manually to fit an ARMA model using the Box-Jenkins method, another very popular approach is the seasonal autoregressive integrated moving average (SARIMA) model, a generalization of the ARMA model. SARIMA models are denoted SARIMA(p,d,q)(P,D,Q)[S], where p and q are the non-seasonal autoregressive and moving average orders, d is the degree of differencing (the number of times the data have had past values subtracted), S refers to the number of periods in each season, and the uppercase P, D, and Q refer to the autoregressive, differencing, and moving average terms for the seasonal part of the model.
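As a quick illustration of the notation, here is a hedged sketch of fitting a SARIMA model with statsmodels and producing a multistep forecast; the synthetic series and the orders chosen are illustrative, not the tutorial’s own data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic series with a linear trend and a 12-period seasonal cycle.
rng = np.random.default_rng(0)
t = np.arange(120)
y = pd.Series(0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(size=120))

# SARIMA(1,1,1)(1,1,1)[12]: the final 12 is S, the periods per season.
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)

print(fit.forecast(steps=24))  # 24-step-ahead (multistep) forecast
```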


Machine Learning with Signal Processing Techniques

Stochastic signal analysis is a field of science concerned with the processing, modification, and analysis of (stochastic) signals. Anyone with a background in physics or engineering knows to some degree what signal analysis techniques are and how they can be used to analyze, model, and classify signals. Data scientists coming from different fields, like computer science or statistics, might not be aware of the analytical power these techniques bring. In this blog post, we will look at how stochastic signal analysis techniques can be combined with traditional machine learning classifiers for accurate classification and modelling of time series and signals. By the end of the post, you should understand the various signal-processing techniques, such as the FFT, that can be used to retrieve features from signals, and be able to classify ECG signals (and even identify a person by their ECG signal), predict seizures from EEG signals, classify and identify targets in radar signals, and identify patients with neuropathy or myopathy from EMG signals.
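As a small taste of the approach, here is a sketch of FFT-based feature extraction on a synthetic signal; the sampling rate and component frequencies are invented, and in practice the extracted peaks would be fed to an ordinary classifier:

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

# Build a synthetic signal with 5 Hz and 40 Hz components.
fs = 250                           # sampling rate in Hz (illustrative)
t = np.arange(0, 2, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

# FFT the signal and locate its dominant spectral peaks.
spectrum = np.abs(rfft(signal))
freqs = rfftfreq(len(signal), d=1 / fs)

top = np.argsort(spectrum)[-2:]    # indices of the two largest peaks
features = np.concatenate([freqs[top], spectrum[top]])
print(features)  # peak frequencies (~5 and ~40 Hz) and their magnitudes
```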


Google plays doctor by identifying your medical symptoms

An update to Google’s mobile app and website lets you search for symptoms and receive a list of possible conditions and other details.


Topic Modeling in Python with NLTK and Gensim

In this post, we will learn how to identify which topic is discussed in a document, a task called topic modeling. In particular, we will cover Latent Dirichlet Allocation (LDA), a widely used topic modelling technique, and we will apply LDA to convert a set of research papers to a set of topics.
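Here is a minimal sketch of the Gensim side of that pipeline on a toy corpus; the documents are invented, and the post also uses NLTK, presumably for preprocessing (tokenization, stopword removal) before this step:

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus of already-tokenized "documents".
docs = [
    ["neural", "network", "training", "gradient"],
    ["bayesian", "inference", "posterior", "prior"],
    ["gradient", "descent", "network", "loss"],
    ["prior", "posterior", "sampling", "bayesian"],
]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               passes=10, random_state=1)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # top words per discovered topic
```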