Recently I came across the classic 1983 paper A note on screening regression equations by David Freedman. Freedman demonstrates impressively the dangers of data reuse in statistical analyses. The potentially dangerous scenarios include those where the results of one statistical procedure performed on the data are fed into another procedure performed on the same data. As a concrete example, Freedman considers the practice of performing variable selection first, and then fitting another model using only the identified variables on the same data that was used to identify them in the first place. Because of the unexpectedly high severity of the problem, this phenomenon became known as “Freedman’s paradox”. In his paper Freedman also derives asymptotic estimates for the resulting errors. The 1983 paper presents a simulation with only 10 repetitions, but in the present day it is very easy (both in terms of computational time and implementation effort) to reproduce the simulation with many more repetitions (even my phone probably has more computational power than the high-performance computer Freedman used in the 80s). We also have more convenient ways to visualize the results than in the 80s. So let’s do it. I am going to use a few R packages (most notably broom, to fit and analyze many, many linear models in a single step).
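Freedman's screening setup can be sketched in a few lines of base R. This is a minimal sketch, assuming 100 observations, 50 pure-noise predictors and a 25% screening level (illustrative values in the spirit of the original simulation):

```r
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)   # pure-noise predictors
y <- rnorm(n)                     # response unrelated to X

# Step 1: screen, keeping predictors with p-value < 0.25 in the full fit
full  <- lm(y ~ X)
pvals <- summary(full)$coefficients[-1, 4]
keep  <- which(pvals < 0.25)

# Step 2: refit using only the screened predictors, on the SAME data
screened <- lm(y ~ X[, keep])
# The refit tends to look deceptively "significant" even though y is noise
summary(screened)$fstatistic
```

Wrapping the two steps in a function and repeating them thousands of times (for instance collecting the coefficients with broom::tidy()) reproduces the paradox at a scale Freedman could not.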
The standard OLS (Ordinary Least Squares) model explains the relationship between independent variables and the conditional mean of the dependent variable. In contrast, quantile regression models this relationship for different quantiles of the dependent variable. In this exercise set we will use the quantreg package to implement quantile regression in R.
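As a first taste (assuming quantreg is installed), rq() is the quantile analogue of lm(); here it is on the engel dataset that ships with the package:

```r
library(quantreg)

data(engel)   # household food expenditure vs. income, bundled with quantreg
# Fit the 25th, 50th (median) and 75th percentile regressions at once
fit <- rq(foodexp ~ income, tau = c(0.25, 0.5, 0.75), data = engel)
coef(fit)     # one column of coefficients per quantile
```

Fitting several values of tau in a single call makes it easy to compare how the income effect changes across the conditional distribution.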
Frankly, in R (especially once you add many packages) there is usually more than one way of doing things.
When you start with R and try to estimate a standard ANOVA, which is relatively simple in commercial software like SPSS, R kind of sucks. Especially for unbalanced or repeated-measures designs, replicating the results from such software in base R may require considerable effort. For a newcomer (and even an old-timer) this can be somewhat off-putting. After I had gained experience developing my first package and was once again struggling with R and ANOVA, I had enough and decided to develop afex. If you know this feeling, afex is also for you.
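To illustrate the kind of one-liner afex aims for, here is a sketch using aov_ez() on the obk.long example data bundled with the package (a mixed between/within design; Type-III sums of squares, matching SPSS, are the default):

```r
library(afex)

data(obk.long, package = "afex")   # example dataset shipped with afex
# One call handles the full mixed design, including the repeated measures
a1 <- aov_ez(id = "id", dv = "value", data = obk.long,
             between = c("treatment", "gender"),
             within  = c("phase", "hour"))
a1
```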
The previous post described how the deeply nested JSON data on flights were parsed and stored in an R-friendly database structure. However, looking into the data, the information is not yet ready for statistical analysis and visualization, and some further processing is necessary before extracting insights and producing nice plots. In the parsed batch, the redundant structure of the data is clearly visible, with the flight id repeated for each segment of each flight. This is also confirmed by the following simple check: the dataframe has more rows than there are unique elements in the id column.
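The check itself is one line; here it is on a toy stand-in for the parsed batch (the real dataframe and its id column are as described in the previous post):

```r
# Toy stand-in: two flights, the first split into two segments
flights_df <- data.frame(id      = c("F1", "F1", "F2"),
                         segment = c(1, 2, 1))

# More rows than unique ids confirms the redundant, per-segment structure
nrow(flights_df) > length(unique(flights_df$id))
```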
We’ve been talking about data science and data scientists for a decade now. While there’s always been some debate over what “data scientist” means, we’ve reached the point where many universities, online academies, and bootcamps offer data science programs: master’s degrees, certifications, you name it. The world was a simpler place when we only had statistics. But simplicity isn’t always healthy, and the diversity of data science programs demonstrates nothing if not the demand for data scientists.
What separates ‘traditional’ applied statistics from machine learning? Is statistics the foundation on top of which machine learning is built? Is machine learning a superset of ‘traditional’ statistics? Do these two concepts have a third unifying concept in common? So, in that vein… is regression analysis actually a form of machine learning?
R.R. Donnelley is a Fortune 500 marketing and business communications firm that became much more than that. One of the 150-year-old company’s key services is transporting massive amounts of documents, marketing support materials, and an array of other items, large and small, for demanding clients. For R.R. Donnelley, logistics cost optimization is not just a good idea — it’s essential to long-term competitiveness. ‘We found that we were actually quite good at managing independent shipping organizations, and we expanded that business to ship much more than just communications material, to ship all kinds of products,’ explains CIO Ken O’Brien. ‘Usually nothing in the refrigerated space, but everything from dog biscuits to refrigerators.’ As R.R. Donnelley grew its shipping business, it increasingly found itself in an excellent position to capture a wide range of market segments. ‘But a lot of our potential was driven by our ability to provide accurate rates for our customers,’ O’Brien says. ‘Not just accurate, but compelling rates for our customers, and speed.’ It quickly became apparent that obtaining an ability to provide a quick turnaround on bids and estimates for potential shipping jobs was the key to winning contracts. ‘And that was really the catalyst behind the development of our new technology platform,’ O’Brien says.
Overheard after class: “doesn’t the Bias-Variance Tradeoff sound like the name of a treaty from a history documentary?” Ok, that’s fair… but it’s also one of the most important concepts to understand for supervised machine learning and predictive modeling. Unfortunately, because it’s often taught through dense math formulas, it’s earned a tough reputation. But as you’ll see in this guide, it’s not that bad. In fact, the Bias-Variance Tradeoff has simple, practical implications around model complexity, over-fitting, and under-fitting.
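Those practical implications are easy to see in a small simulation: fit polynomials of increasing degree to data whose true relationship is quadratic, and compare held-out error (all numbers below are made up for illustration):

```r
set.seed(42)
n <- 200
x <- runif(n, -2, 2)
y <- x^2 + rnorm(n, sd = 0.5)   # true signal is quadratic
train <- 1:100; test <- 101:200

# Held-out mean squared error for a polynomial fit of the given degree
mse <- function(deg) {
  fit <- lm(y ~ poly(x, deg), subset = train)
  mean((y[test] - predict(fit, data.frame(x = x[test])))^2)
}

# Degree 1 underfits (high bias); high degrees risk overfitting (high variance)
sapply(c(1, 2, 10), mse)
```

The straight line cannot capture the curvature no matter how much data it sees, while the degree-10 fit chases the noise in the training half; the middle ground wins.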
In the first part, I introduced the weather dataset and outlined its exploratory analysis. In this second part of the tutorial, we are going to build multiple logistic regression models to predict the weather. Specifically, we intend to produce the following forecasts:
• tomorrow’s weather forecast, issued at 9am of the current day
• tomorrow’s weather forecast, issued at 3pm of the current day
• tomorrow’s weather forecast, issued in the late evening of the current day
For each of the above tasks, a specific subset of variables is available, precisely:
• 9am: MinTemp, WindSpeed9am, Humidity9am, Pressure9am, Cloud9am, Temp9am
• 3pm: (9am variables) + Humidity3pm, Pressure3pm, Cloud3pm, Temp3pm, MaxTemp
• evening: (3pm variables) + Evaporation, Sunshine, WindGustDir, WindGustSpeed
We assume MinTemp is already available at 9am, as the overnight minimum temperature is determined by then. We assume MaxTemp is already available at 3pm, as it is determined during the central hours of the day. Further, we assume Sunshine, Evaporation, WindGustDir and WindGustSpeed are final only by late evening. The other variables are explicitly bound to a specific time of day.
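The 9am model can then be sketched with glm(). The dataframe below is simulated stand-in data (the tutorial uses the real weather dataset), so only the shape of the call matters here:

```r
set.seed(1)
n <- 300
# Simulated stand-in for the real weather data
weather <- data.frame(MinTemp      = rnorm(n, 10, 5),
                      WindSpeed9am = rpois(n, 10),
                      Humidity9am  = runif(n, 40, 100),
                      Pressure9am  = rnorm(n, 1018, 6),
                      Cloud9am     = sample(0:8, n, replace = TRUE),
                      Temp9am      = rnorm(n, 15, 5))
weather$RainTomorrow <- rbinom(n, 1, plogis((weather$Humidity9am - 70) / 15))

vars_9am <- c("MinTemp", "WindSpeed9am", "Humidity9am",
              "Pressure9am", "Cloud9am", "Temp9am")
fit_9am  <- glm(reformulate(vars_9am, response = "RainTomorrow"),
                data = weather, family = binomial)
summary(fit_9am)
```

The 3pm and evening models differ only in the vector of variable names passed to reformulate().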
Before we start, have a look at the examples below.
1. You open Google and search for a news article on the ongoing Champions Trophy and get hundreds of search results about it.
2. Nate Silver analysed millions of tweets and correctly predicted the results of 49 out of 50 states in the 2008 U.S. Presidential Election.
3. You type an English sentence into Google Translate and get an equivalent Chinese translation.
So what do the above examples have in common?
You possibly guessed it right – TEXT processing. All three scenarios above deal with humongous amounts of text to perform a range of tasks: clustering in the Google search example, classification in the second, and machine translation in the third.
Humans can deal with text quite intuitively, but given that millions of documents are generated in a single day, we cannot have humans performing the above three tasks. It is neither scalable nor effective.
Does this sound familiar to you? To get an idea of how to choose a parameter for a given classifier, you have to cross-reference a number of papers or books, which often turn out to present competing arguments for or against a certain parameterization choice, but with few applications to real-world problems. For example, you may find a few papers discussing the optimal selection of K in K-Nearest Neighbours: one supporting the so-called square-root-of-sample-size-N method, another talking about selecting K based on how well the classifier performs on its cross-validation samples. Parameterization choices have significant impacts on classifier performance, so it’s important to get them right. As shown in the paper below, the performance of each of the 8 most popular classification algorithms can differ significantly depending on parameterization.
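The two rules of thumb can be compared directly in R; here is a sketch on the iris data, using knn.cv() from the class package (which ships with standard R installations) for leave-one-out cross-validation:

```r
library(class)   # provides knn.cv()
set.seed(123)

# Rule 1: square root of the sample size N
k_sqrt <- round(sqrt(nrow(iris)))

# Rule 2: pick the K with the best leave-one-out cross-validation accuracy
acc <- sapply(1:20, function(k) {
  pred <- knn.cv(iris[, 1:4], iris$Species, k = k)
  mean(pred == iris$Species)
})
k_cv <- which.max(acc)

c(sqrt_rule = k_sqrt, cv_rule = k_cv)
```

The two rules need not agree, which is exactly the kind of conflict the literature leaves unresolved; the cross-validation route at least ties the choice to measured performance on your own data.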
Machine learning is cool. There is no denying that. In this post we will try to make it a little uncool; well, it will still be cool, but you may start looking at it differently. Machine learning is not a black box. It is intuitive, and this post aims to convey just that. If I give you the function f(x) = x^2 + log(x) and ask you what f(2) will be, you will first laugh at me and then run away to do something important. This is trivial for you, right? If there is a function that maps inputs to outputs, then it is very easy to get the output for any new input. Machine learning helps you get a function that can map the input to the output. How does it do that? What is this function? We will try to answer such questions in the paragraphs below, using a problem that can be solved with machine learning. Assume you are a technical recruiter. You have been running a recruitment firm for the last 3 years. Being tech-savvy, you follow the latest trends in technology, and you came to know about machine learning. You understand that machine learning can be used to predict the future, given that you have data from the past.
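For the record, the trivial direction looks like this (log here is the natural logarithm, as in R):

```r
# The known function: apply it to any new input directly
f <- function(x) x^2 + log(x)
f(2)   # 2^2 + log(2) = 4.6931...
```

Machine learning goes the other way: starting from example (input, output) pairs, it recovers an approximation of f.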