Artificial intelligence has reached peak hype. News outlets report that companies have replaced workers with IBM Watson and that algorithms are beating doctors at diagnoses. New AI startups pop up every day, claiming to solve all your personal and business problems with machine learning. Ordinary objects like juicers and Wi-Fi routers suddenly advertise themselves as “powered by AI.” Not only can smart standing desks remember your height settings, they can also order you lunch. Much of the AI hubbub is generated by reporters who’ve never trained a neural network and by startups hoping to be acqui-hired for engineering talent despite not having solved any real business problems. No wonder there are so many misconceptions about what AI can and cannot do.
Visualizing historical sports results is one way to compare champions’ strengths and weaknesses. In this tutorial, we show which kinds of plots help with comparing champions’ performances, visualizing timelines, and exploring player-to-player and player-to-tournament relationships. We use the Tennis Grand Slam tournament results from the ESPN site’s tennis history table, which have been made available as a tab-delimited file at the following link: tennis-grand-slam-winners
I don’t do a lot of plotting in my job, but I recently heard about a website called Plotly that provides a plotting service for anyone’s data. They even have a plotly package for Python (among others)! So in this article we will be learning how to plot with their package. Let’s have some fun making graphs!
Kaggle’s annual March Machine Learning Mania competition returned once again to challenge Kagglers to predict the outcomes of the 2017 NCAA Men’s Basketball tournament. This year, 442 teams competed to forecast outcomes of all possible match-ups. In this winner’s interview, Kaggler Scott Kellert describes how he came in second place by calculating team quality statistics to account for opponent strength for each game. Ultimately, he discovered his final linear regression model beat out a more complex neural network ensemble.
While programming languages will never be completely obsolete, a growing number of programmers (and data scientists) prefer working with frameworks and view them as the more modern and cutting-edge option for a number of reasons.
In ensemble methods, the more diverse the models used, the more robust the final result will be.
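One way to see why diversity helps is to look at what averaging does to model errors. The sketch below is an illustration under simplifying assumptions that are not from the article: every model's error has the same variance, every pair of models has the same error correlation, and the ensemble is a plain average. The function name is hypothetical.

```python
def ensemble_error_variance(sigma2: float, rho: float, k: int) -> float:
    """Variance of the average of k model errors.

    Assumes each error has variance sigma2 and every pair of errors
    has the same correlation rho (a simplifying assumption).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Var(mean) = sigma2 / k + (k - 1) / k * rho * sigma2
    #           = sigma2 * (rho + (1 - rho) / k)
    return sigma2 * (rho + (1.0 - rho) / k)


# Ten models with unit error variance: less correlated (more diverse)
# models leave less error variance in the averaged prediction.
for rho in (0.9, 0.5, 0.0):
    print(f"rho={rho:.1f} -> variance={ensemble_error_variance(1.0, rho, 10):.3f}")
```

With highly correlated models, adding more of them barely reduces the variance; with uncorrelated ones, it shrinks in proportion to the number of models.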
We will develop a classification exercise using the C5.0 decision tree algorithm. The exercise was originally published in ‘Machine Learning with R’ by Brett Lantz, Packt Publishing, 2015 (Open Source: Community Experience Distilled). The example we will develop is about identifying risky bank loans. We will carry out the exercise verbatim as published in the aforementioned reference.
In the previous exercises of this series, forecasts were based only on an analysis of the forecast variable. Another approach to forecasting is to use external variables, which serve as predictors. This set of exercises focuses on forecasting with the standard multivariate linear regression.
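As a rough illustration of forecasting with an external predictor, here is a minimal sketch that fits an ordinary least squares line with a single predictor and uses it to forecast. The exercises themselves are not reproduced here; the variable names and the toy numbers (temperature and sales) are hypothetical, and a real multivariate model would include several predictors.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: an external predictor and the forecast variable.
temperature = [10.0, 15.0, 20.0, 25.0]
sales = [21.0, 31.0, 41.0, 51.0]

intercept, slope = fit_simple_ols(temperature, sales)

# Forecasting requires a known (or separately forecast) predictor value.
next_temperature = 30.0
forecast = intercept + slope * next_temperature
print(f"intercept={intercept:.2f}, slope={slope:.2f}, forecast={forecast:.2f}")
```

Note the practical catch this makes visible: the forecast is only as good as the future value of the predictor that you plug in.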
One of my more popular answers on StackOverflow concerns the issue of prediction intervals for a generalized linear model (GLM). Comments, even on StackOverflow, aren’t a good place for a discussion so I thought I’d post something here on my blog that went into a bit more detail as to why, for some common types of GLMs, prediction intervals aren’t that useful and require a lot more thinking about what they mean and how they should be calculated. I’ve broken it into two and in this, the second part, I look at Poisson models. The second example (purely because I happen to have it handy from teaching this semester) is from Korner-Nievergelt et al. (2015), and concerns the number of breeding pairs of the common whitethroat (Sylvia communis). This species likes to inhabit field margins and fallow lands and has been adversely affected by intensive agricultural activities reducing these types of habitat on the landscape. As a mitigation effort, wildflower fields are sown and left largely unmanaged for several years. The data come from a study looking at how the number of breeding pairs of common whitethroat changes as the composition and structure of the plant community change over time. The data are in the blmeco package available on CRAN.
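To make the idea of an interval on the count scale concrete, here is a minimal sketch that takes a fitted Poisson mean and returns the central range of counts the model would generate. It is not the post's method: it ignores uncertainty in the estimated mean, the function names are hypothetical, and the mean of 4.0 is an invented value rather than a result from the whitethroat data.

```python
import math


def poisson_quantile(mu: float, p: float) -> int:
    """Smallest count k whose Poisson(mu) cumulative probability is >= p.

    Intended for moderate mu; exp(-mu) underflows for very large means.
    """
    if mu <= 0 or not 0 < p < 1:
        raise ValueError("need mu > 0 and 0 < p < 1")
    k = 0
    pmf = math.exp(-mu)  # P(Y = 0)
    cdf = pmf
    while cdf < p:
        k += 1
        pmf *= mu / k  # P(Y = k) from P(Y = k - 1)
        cdf += pmf
    return k


def poisson_interval(mu: float, level: float = 0.95):
    """Central interval of counts for a Poisson with known mean mu."""
    alpha = 1.0 - level
    return poisson_quantile(mu, alpha / 2), poisson_quantile(mu, 1 - alpha / 2)


# Hypothetical fitted mean of 4 breeding pairs.
print(poisson_interval(4.0))
```

Because counts are discrete, the interval covers at least the nominal level rather than exactly 95%, which is one of the reasons such intervals need careful interpretation.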
One of my more popular answers on StackOverflow concerns the issue of prediction intervals for a generalized linear model (GLM). My answer really only addresses how to compute confidence intervals for parameters but in the comments I discuss the more substantive points raised by the OP in their question. Lately there’s been a bit of back and forth between Jarrett Byrnes and myself about what a prediction “interval” for a GLM might mean. Comments, even on StackOverflow, aren’t a good place for a discussion so I thought I’d post something here that went into a bit more detail as to why, for some common types of GLMs, prediction intervals aren’t that useful and require a lot more thinking about what they mean and how they should be calculated. For illustration, I thought I’d use some small teaching example data sets, but whilst writing the post it started to get a little on the long side. So, I’ve broken it into two and in this part I look at logistic regression. The first example concerns a small experiment on the rare insectivorous pitcher plant Darlingtonia californica (the cobra lily) used as an example in Gotelli and Ellison (2013) and originally reported in Dixon et al. (2005). Darlingtonia grows leaves that are modified to form a pitcher trap, which is filled with nectar that attracts insects, in particular vespulid wasps (Vespula atropilosa). The observations in the data set are on the height of pitcher traps (leafHeight) and whether or not the leaf was visited by a wasp (visited). The code chunk below downloads the data from the book’s website and loads it into R ready for use.
We will develop a classification exercise using the Naive Bayes algorithm. The exercise was originally published in ‘Machine Learning with R’ by Brett Lantz, Packt Publishing, 2015 (Open Source: Community Experience Distilled). Naive Bayes is a probabilistic classification algorithm that can be applied to problems such as text classification (for example, spam filtering), detection of intrusions or network anomalies, and diagnosis of medical conditions given a set of symptoms, among others. The exercise we will develop is about separating spam from ham SMS messages. We will carry out the exercise verbatim as published in the aforementioned reference.
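To give a feel for how Naive Bayes separates spam from ham, here is a minimal sketch of a multinomial Naive Bayes classifier with Laplace (add-one) smoothing. It is not the book's R code: the function names are hypothetical, the four training messages are invented, and tokenization is a plain lowercase split on whitespace.

```python
import math
from collections import Counter


def train_naive_bayes(messages):
    """Count classes and per-class words from (label, text) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, text in messages:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set.
training = [
    ("spam", "win cash prize now"),
    ("spam", "free prize claim now"),
    ("ham", "are we meeting for lunch"),
    ("ham", "see you at lunch tomorrow"),
]
model = train_naive_bayes(training)
print(classify(model, "claim your free prize"))
print(classify(model, "lunch tomorrow"))
```

The "naive" part is the assumption that words occur independently given the class, which is why the score is simply a sum of per-word log probabilities.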