Putting the science back in data science

One of the key tenets of science (physics, chemistry, etc.), or at least of its theoretical ideal, is reproducibility. Truly “scientific” results should not be accepted by the community unless they can be clearly reproduced and have undergone peer review. Of course, things get messy in practice for both academic scientists and data scientists, and many workflows employed by data scientists are far from reproducible.


Support Vector Machines Simplified using R

This tutorial describes the theory and practical application of Support Vector Machines (SVM) with R code. SVM is a popular supervised learning algorithm (i.e., it classifies or predicts a target variable) that works for both classification and regression problems, and it is one of the most sought-after machine learning algorithms, widely used in data science competitions.
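For a flavor of what this looks like in practice, here is a minimal R sketch using the e1071 package and the built-in iris data; these are my own choices for illustration, not necessarily the package or dataset the tutorial uses:

```r
# Minimal SVM classification sketch using e1071 (one common R interface
# to libsvm; assumed here, not confirmed by the tutorial).
library(e1071)

data(iris)
set.seed(42)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit a classifier with a radial (RBF) kernel
model <- svm(Species ~ ., data = train, kernel = "radial", cost = 1)

# Predict on held-out data and report accuracy
pred <- predict(model, test)
mean(pred == test$Species)
```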


Comprehensive Guide on t-SNE algorithm with implementation in R & Python

Imagine you get a dataset with hundreds of features (variables) and have little understanding of the domain the data belongs to. You are expected to explore and analyze the dataset and identify hidden patterns in it. And not just that: you have to find out whether there is a pattern in the data at all – is it signal, or is it just noise? Does that thought make you uncomfortable? It made my hands sweat when I came across this situation for the first time. Do you wonder how to explore a multidimensional dataset? It is one of the questions data scientists ask most frequently. In this article, I will take you through a very powerful way to do exactly this.
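As a taste of the technique, here is a minimal R sketch using the Rtsne package (one common t-SNE implementation; the article may use another) on the built-in iris data:

```r
# Minimal t-SNE sketch using the Rtsne package (assumed choice).
library(Rtsne)

data(iris)
keep   <- !duplicated(iris[, 1:4])   # Rtsne rejects duplicate rows
X      <- iris[keep, 1:4]
labels <- iris$Species[keep]

set.seed(42)
tsne <- Rtsne(as.matrix(X), dims = 2, perplexity = 30)

# Plot the 2-D embedding, colored by class, to look for structure
plot(tsne$Y, col = labels, pch = 19,
     xlab = "t-SNE 1", ylab = "t-SNE 2")
```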


Adapting an Algorithm to Real-time Applications

Below are some images generated by an algorithm that I originally created to study stocks; the stock in this case is Yahoo! in the early 2000s. I realize that none of these images resemble conventional stock market patterns. Keep in mind that algorithms by nature interact with the data; they don’t simply restate the numbers in graphical form. An algorithm designed to deal with historical stock trading data can be modified to deal with data in real time. However, there is no inherent reason to limit such an algorithm to applications like the original.


Metro Systems Over Time: Part 1

Metro systems are an interesting way to learn more about the growth of a city over time. You can see things like how the city expanded as public transit spread farther and farther from the original city limits. You can also see how the city center moved from certain neighborhoods to others. One example of this is the city of Paris, where I currently live, which started off with metro stops just along the river and then quickly spread out into a more circular shape. The GIF below shows that progression: blue dots are metro stops, and the red dot is the center of the metro system.
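As a rough illustration of how one frame of such a plot could be drawn, here is a hypothetical R sketch. The `stops` data frame and the choice of the coordinate mean as the “center” are my own assumptions, not the author’s actual data or method:

```r
# Hypothetical stop data: longitude, latitude, and year each stop opened
stops <- data.frame(
  lon    = c(2.33, 2.35, 2.29, 2.37, 2.32),
  lat    = c(48.86, 48.85, 48.87, 48.84, 48.88),
  opened = c(1900, 1900, 1910, 1925, 1935)
)

year <- 1930
open_now <- stops[stops$opened <= year, ]

# Blue dots: stops open by `year`; red dot: their centroid
plot(open_now$lon, open_now$lat, col = "blue", pch = 19,
     xlab = "Longitude", ylab = "Latitude",
     main = paste("Metro stops through", year))
points(mean(open_now$lon), mean(open_now$lat),
       col = "red", pch = 19, cex = 2)
```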


More Dots – Syntactic Loop Fusion in Julia

After a lengthy design process and preliminary foundations in Julia 0.5, Julia 0.6 includes new facilities for writing code in the “vectorized” style (familiar from Matlab, Numpy, R, etcetera) while avoiding the overhead that this style of programming usually imposes: multiple vectorized operations can now be “fused” into a single loop, without allocating any extraneous temporary arrays.
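As a brief illustration of the syntax (in Julia, since that is what the post describes), the following sketch shows a fused, in-place vectorized update; it assumes Julia 0.6 or later:

```julia
x = rand(1_000_000)
y = similar(x)

# Every dotted operation broadcasts elementwise; the whole right-hand
# side fuses into a single loop with no temporary arrays, and `.=`
# writes the result into `y` in place.
y .= 3 .* x.^2 .+ 5 .* sin.(x)

# The @. macro adds the dots for you; this line is equivalent.
@. y = 3x^2 + 5sin(x)
```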


Trumpworld Analysis: Ownership Relations in his Business Network

You do not need a machine learning algorithm to predict that the presidency of Donald Trump will be controversial. One of the most discussed aspects of his upcoming reign is the massive potential for conflicts of interest. Trump’s complex business empire is entangled with many aspects of national and international politics.


Visualizing “The Best”

It’s not immediately clear. Because “The Best” is incredibly vague and subjective. My “Best” is not the same as your “Best”. And our “Best”s can converge and diverge depending on what we are measuring and how we measure it. I think “The Best” is often the wrong question. Usually when we’re looking for “The Best” (are you sick of me saying “The Best” yet?) we’re really just trying to find “The Better”. Answers to questions like “Who is the best player in the NBA?” or “What is the best city in the world?” or “Which Pokemon is the best?” are fraught with caveats and asterisks and clarifications. And they have to be! What do you mean “Best”? Best Shooter? Best of All Time? Best Last Year? Best in terms of Quality of Living? Best on measures of Entertainment? Best Speeds? Best Attacks? Best Best Best Best Best! Aggh! To answer these “Best” questions we have to narrow down the problem and convert them into “Better” questions. By slimming down the pool of possible options we can actually start to make some progress!


Principal Component Analysis

Often, it is not helpful or informative to simply inspect all the variables in a dataset for correlations or covariances. A preferable approach is to derive new variables from the original ones that preserve most of the information carried by their variances. Principal component analysis (PCA) is a widely used statistical method for reducing data with many dimensions (variables) by projecting the data onto fewer dimensions using linear combinations of the variables, known as principal components. The new projected variables (principal components) are uncorrelated with each other and are ordered so that the first few retain most of the variation present in the original variables. Thus, PCA is also useful when the independent variables are correlated with each other, and it can be employed in exploratory data analysis or in building predictive models. Principal component analysis can also reveal important features of the data, such as outliers and departures from a multivariate normal distribution.
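As a minimal illustration, here is a sketch using R’s built-in prcomp function on the iris data; the example dataset is my choice, not necessarily the article’s:

```r
# Minimal PCA sketch using base R's prcomp.
data(iris)

# Center and scale, since the variables are on different scales
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance retained by each principal component
summary(pca)

# Project the data onto the first two components and plot;
# these typically retain most of the variation
plot(pca$x[, 1], pca$x[, 2], col = iris$Species, pch = 19,
     xlab = "PC1", ylab = "PC2")
```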