March Machine Learning Mania, 4th Place Winner’s Interview: Erik Forseth

The annual March Machine Learning Mania competition, which ran on Kaggle from February to April, challenged Kagglers to predict the outcome of the 2017 NCAA men’s basketball tournament. Unlike your typical bracket, competitors relied on historical data to call the winners of all possible team match-ups. In this winner’s interview, Kaggler Erik Forseth explains how he came in fourth place using a combination of logistic regression, neural networks, and a little luck.

Shiny Application Layouts Exercises (Part-5)

In the fifth part of our series, we will apply the kmeans() function to the iris dataset to build a Shiny application. The difference this time is that the results will be displayed vertically.

Forecasting: ARIMAX Model Exercises (Part-5)

The standard ARIMA (autoregressive integrated moving average) model produces forecasts based only on past values of the forecast variable. It assumes that future values of a variable depend linearly on its past values and on past (stochastic) shocks. The ARIMAX model extends ARIMA by also including other independent (predictor) variables. It is also referred to as the vector ARIMA or the dynamic regression model. The ARIMAX model is similar to a multivariate regression model, but it can exploit autocorrelation that may be present in the regression residuals to improve forecast accuracy. This set of exercises provides practice in using the auto.arima function from the forecast package to make forecasts with the ARIMAX model. A function from the lmtest package is also used to check the statistical significance of the regression coefficients.
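As a rough sketch of the model described above, it can be written as a regression whose error term follows an ARIMA process. The symbols here are notational assumptions and are not taken from the exercise set: y_t is the forecast variable, x_{k,t} are the predictors, B is the backshift operator, d is the order of differencing, and \varepsilon_t is white noise.

```latex
\begin{align}
  y_t &= \beta_0 + \sum_{k=1}^{K} \beta_k x_{k,t} + \eta_t, \\
  \phi(B)\,(1 - B)^d\,\eta_t &= \theta(B)\,\varepsilon_t,
\end{align}
% \phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p       (autoregressive part)
% \theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q  (moving average part)
```

With all \beta_k set to zero, this reduces to an ordinary ARIMA(p, d, q) model for y_t. With \eta_t as plain white noise, it reduces to an ordinary regression.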

Graphical Presentation of Missing Data; VIM Package

Missing data is a problem that challenges data analysis, both methodologically and computationally, in medical research. Patients in clinical trials and cohort studies may drop out of the study and therefore generate missing data. The data could be missing at random when participants who drop out of the study are not different from those who remain in it. For example, in a study of body mass index and cholesterol levels, participants whose blood cholesterol was not measured have a body mass index comparable to that of participants whose blood cholesterol was measured. To handle missing data, researchers often choose to conduct the analysis among participants without missing data (i.e., complete case analysis), but sometimes they prefer to impute the data. In the previous tutorials (Tutorial 1, Tutorial 2) published at DataScience+ we have shown how to impute missing data by using the MICE package. In this tutorial I will show how to present missing data graphically, with only one purpose: to find out whether the data are missing at random. To do so, we will build plots by using the function marginplot from the VIM package. This is a short “how to” tutorial and is not intended to explain the types of missing data. You can learn more about types of missing data, such as missing completely at random, missing at random, and not missing at random, from the book Statistical Analysis with Missing Data.
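The comparison in the body mass index and cholesterol example can be sketched numerically. The snippet below is plain Python, not the VIM package or its marginplot function. The records and the helper names split_by_missingness and summarize are hypothetical, made up for illustration. It only shows the idea of comparing one variable between participants with and without a missing value in another.

```python
from statistics import mean, stdev

# Hypothetical records: (BMI, cholesterol); None marks a missing measurement.
records = [
    (22.1, 180.0), (27.4, None), (24.8, 210.0), (30.2, 240.0),
    (23.5, None), (26.9, 195.0), (28.3, None), (25.0, 205.0),
]


def split_by_missingness(rows):
    """Split BMI values by whether cholesterol was observed or missing."""
    observed = [bmi for bmi, chol in rows if chol is not None]
    missing = [bmi for bmi, chol in rows if chol is None]
    return observed, missing


def summarize(values):
    """Return the count, mean and standard deviation of a list of numbers."""
    return {
        "n": len(values),
        "mean": round(mean(values), 2),
        "sd": round(stdev(values), 2),
    }


bmi_observed, bmi_missing = split_by_missingness(records)
summary_observed = summarize(bmi_observed)
summary_missing = summarize(bmi_missing)

print("BMI where cholesterol was measured:", summary_observed)
print("BMI where cholesterol is missing:  ", summary_missing)
```

If the two summaries look alike, that is consistent with the missing-at-random reading in the example, although it does not prove it. A clear gap between them would be a reason to look more closely before running a complete case analysis.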