This week, in the course on unsupervised techniques for data science, we’ve been using a dataset with candidates for the 2002 presidential elections (one per row) and newspapers (one per column). To visualize that dataset, consider three candidates and three newspapers.
Most of us have limited knowledge of regression, with linear and logistic regression being our favorites. Interestingly, regression has extended capabilities for dealing with different types of variables. Did you know that regression can also handle multi-level dependent variables? I’m sure you didn’t. Neither did I, until I was pushed to explore this aspect of regression. For multi-level dependent variables, there are many machine learning algorithms that can do the job for you, such as naive Bayes, decision trees, and random forests. For starters, these algorithms can be a bit difficult to understand. But if you understand logistic regression well, mastering this new aspect of regression should be easy for you! In this article, I’ve explained the method of using multinomial and ordinal regression, and for practical purposes, I’ve demonstrated the algorithms step by step in R.
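The article’s R walkthrough isn’t reproduced here, but the core of multinomial regression is a softmax over per-class linear scores. As a minimal sketch (numpy only, synthetic data, with variable names of my own rather than the article’s), here is softmax regression fitted by plain gradient descent on three toy classes:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy 3-class data: 2 features, one well-separated cluster per class (hypothetical)
X = np.vstack([rng.normal(m, 0.5, size=(30, 2)) for m in (-2, 0, 2)])
y = np.repeat([0, 1, 2], 30)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

Y = np.eye(3)[y]                          # one-hot targets
W = np.zeros((2, 3))                      # one weight column per class
b = np.zeros(3)
for _ in range(500):                      # plain batch gradient descent
    P = softmax(X @ W + b)                # predicted class probabilities
    W -= 0.5 * (X.T @ (P - Y)) / len(X)   # gradient of the cross-entropy loss
    b -= 0.5 * (P - Y).mean(axis=0)

pred = softmax(X @ W + b).argmax(axis=1)
accuracy = (pred == y).mean()
```

On well-separated clusters like these, the fitted model recovers the class labels almost perfectly; the same mechanics underlie the multinomial models the article builds in R.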
When we talk about regression, we often end up discussing linear and logistic regression. But that’s not the end of the story. Did you know there are seven types of regression? Linear and logistic regression are just the most loved members of the regression family. Last week, I watched a recorded talk at NYC Data Science Academy by Owen Zhang, currently ranked 3rd on Kaggle and Chief Product Officer at DataRobot. He said, ‘if you are using regression without regularization, you have to be very special!’ I hope you get what a person of his stature was referring to. I understood it very well and decided to explore regularization techniques in detail. In this article, I have explained the science behind ridge regression and lasso regression, the most fundamental regularization techniques, which sadly are still not used by many.
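To give a flavor of what regularization does before reading the article: ridge regression adds an L2 penalty, which has a closed-form solution and shrinks coefficients toward zero. A hypothetical numpy illustration on synthetic data (not the article’s code; lasso is omitted because its L1 penalty has no closed form and needs an iterative solver):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])     # two truly irrelevant features
y = X @ true_w + rng.normal(scale=0.5, size=n)

def ridge(X, y, lam):
    # closed-form ridge solution: (X'X + lam*I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge(X, y, 0.0)      # lam = 0 recovers ordinary least squares
w_ridge = ridge(X, y, 50.0)   # a heavier penalty shrinks every coefficient
shrunk = np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```

The shrinkage trades a little bias for lower variance, which is exactly the effect the article motivates.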
The for-loop in R can be very slow in its raw, un-optimised form, especially when dealing with larger data sets. There are a number of ways to make your logic run faster, but you will be really surprised at how fast you can actually go. This post shows a number of approaches, including simple tweaks to logic design, parallel processing and Rcpp, increasing the speed by several orders of magnitude, so you can comfortably process data sets of 100 million rows and more.
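The post is about R, but its central trick, replacing an interpreted element-by-element loop with a single vectorised call into compiled code, translates directly to other languages. A minimal Python/numpy sketch of the same principle (my own toy example, not from the post):

```python
import numpy as np

x = np.arange(100_000, dtype=float)

# naive element-by-element loop: each iteration pays interpreter overhead
def loop_square_sum(x):
    total = 0.0
    for v in x:
        total += v * v
    return total

# vectorised equivalent: one call into numpy's optimised C routines
def vec_square_sum(x):
    return float(np.dot(x, x))

loop_result = loop_square_sum(x)
vec_result = vec_square_sum(x)
```

Both return the same sum, but the vectorised version runs orders of magnitude faster on large arrays, which is the same effect `vapply`/vectorisation and Rcpp achieve in R.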
If you’ve ever done churn analysis using Cox regression with time-dependent covariates, you know that the hardest part of that type of research is building your base data set. You have to divide each customer’s lifetime into ‘chunks’ to which the changing values of a host of different predictor variables apply. I’ve coded this in SQL before, and it gets ugly. Fast.
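That ‘chunking’ step can also be sketched outside SQL. Here is a hypothetical Python helper (schema and names invented for illustration, not the author’s code) that splits one customer’s lifetime into counting-process intervals at each covariate change, with the churn event flagged only on the final interval:

```python
def chunk_lifetime(customer_id, end, changes, churned):
    """Split [0, end) into intervals at covariate changes.

    changes: sorted list of (time, value), starting with the value at time 0.
    Returns rows of (id, start, stop, value, event) in Cox start/stop format.
    """
    rows = []
    for (start, value), nxt in zip(changes, changes[1:] + [(end, None)]):
        stop = nxt[0]
        event = churned and stop == end   # event can only occur when the lifetime ends
        rows.append((customer_id, start, stop, value, int(event)))
    return rows

# a customer on the 'basic' plan who upgrades at month 4 and churns at month 10
rows = chunk_lifetime("c42", 10, [(0, "basic"), (4, "premium")], churned=True)
```

Each output row carries the covariate value in force over that interval, which is the long format time-dependent Cox models expect.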
Folks know that gradient-boosted trees generally perform better than random forests, although there is a price for that: GBTs have a few hyperparameters to tune, while random forests are practically tuning-free. Let’s look at what the literature says about comparing the two methods.
The General Linear Model (GLM) is a tool used to understand and analyse linear relationships among variables. It is an umbrella term for many techniques taught in most statistics courses: ANOVA, multiple regression, etc. In its simplest form it describes the relationship between two variables: “y” (the dependent variable, outcome, and so on) and “x” (the independent variable, predictor, etc.). These variables could both be categorical (how many?), both continuous (how much?), or one of each.
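In the simplest continuous-by-continuous case, fitting the model amounts to least squares on a design matrix with an intercept column. A minimal numpy sketch with synthetic data (the true coefficients 2.0 and 1.5 are hypothetical, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=200)   # y = b0 + b1*x + noise

# design matrix [1, x]; least-squares solve for (b0, b1)
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The fitted `beta` recovers the intercept and slope; swapping the continuous predictor for dummy-coded categories turns the same machinery into ANOVA, which is why the GLM is an umbrella term.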
With fewer than 500 North Atlantic right whales left in the world’s oceans, knowing the health and status of each whale is integral to the efforts of researchers working to protect the species from extinction. In the NOAA Right Whale Recognition challenge, 470 players on 364 teams competed to build a model that could identify any individual living North Atlantic right whale from aerial photographs. The deepsense.io team entered the competition spurred by recent improvements in their image recognition skills and ended up taking 1st place. In this blog, they share their pipeline, their solution’s ‘most valuable player’, and what they’ve taken away from the competition experience.
Kaggle’s annual Santa optimization competition wrapped up in early January with a nail-biting finish. When the dust settled, team Woshialex and Weezy had managed to take 2nd place in the competition and also take home the Rudolph prize. (This prize is awarded to the team that held 1st place on the leaderboard for the longest period of time.) In this blog, the data scientists on the team, Mirsad and Qi, share the details of their simulated annealing algorithm, what worked and what didn’t, and why they benefited from teaming up.
There is a feature I really like in Apache Spark: it can process data that doesn’t fit in memory on my local machine, even without a cluster. Good news for those who process data sets bigger than the memory they currently have. From time to time, I run into this issue when I work with hypothesis testing. For hypothesis testing, I usually use statistical bootstrapping techniques. This method requires little statistical knowledge, is very easy to understand, and is very simple to implement. There are no normal or Student’s t distributions from your statistics courses, only some basic coding skills. Good news for those who don’t like statistics. Spark and bootstrapping are a very powerful combination that can help you test hypotheses at large scale.
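The bootstrap itself needs only a few lines. A minimal sketch in plain numpy (no Spark; synthetic data and names of my own) computing a 95% percentile confidence interval for a mean:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.0, scale=2.0, size=500)   # the observed data (hypothetical)

# bootstrap: resample with replacement many times and recompute the statistic
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```

The spread of `boot_means` estimates the sampling variability of the mean without any distributional formulas; scaling the resampling loop out is where Spark comes in.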
Sometimes a correlation means absolutely nothing and is purely accidental (especially when you compute millions of correlations among thousands of variables), or it can be explained by confounding factors. For instance, the fact that the cost of electricity is correlated with how much people spend on education is explained by a confounding factor: inflation, which makes both electricity and education costs grow over time. This confounding factor has a bigger influence than true causal factors, such as more administrators and government-funded student loans boosting college tuition.
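This is easy to demonstrate with simulated data: two independent series that share a time trend correlate strongly, but the correlation vanishes once the trend (the confounder) is regressed out. A hypothetical numpy sketch, with the trend standing in for inflation:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(200, dtype=float)          # time index acting as the confounder
electricity = 10 + 0.5 * t + rng.normal(scale=2.0, size=200)
education   = 50 + 1.2 * t + rng.normal(scale=5.0, size=200)

raw_corr = np.corrcoef(electricity, education)[0, 1]   # spuriously high

# remove the shared trend and correlate what remains
def detrend(y, t):
    X = np.column_stack([np.ones_like(t), t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

resid_corr = np.corrcoef(detrend(electricity, t), detrend(education, t))[0, 1]
```

`raw_corr` comes out near 1 while `resid_corr` hovers near 0, even though the two noise processes are completely independent.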
Julia is a high-level dynamic programming language designed to address the requirements of high-performance numerical and scientific computing. It has been discussed as one of the languages that could be the future of high-performance data analytics because of its performance capabilities, with benchmarks comparable to C.
The hashtag is the new “paralanguage” of Twitter. What started as a way for people to connect with others, organize similar tweets together, propagate ideas, and promote specific people or topics has now grown into a language of its own. Since hashtags are created by people on their own, any new event or topic can be referred to by a variety of hashtags. This linguistic innovation is a very special feature of Twitter that has become immensely popular, has been widely adopted by other social media such as Facebook and Google+, and has been studied extensively by researchers analyzing competition dynamics, adoption rates and popularity scores. One of the interesting and prevalent linguistic phenomena in today’s world of brief expressions and chats is hashtag compounding, where new hashtags are formed by combining two or more hashtags while the form of each individual hashtag remains intact.
Deep learning is all the rage these days. What exactly is deep learning? Well, it all boils down to neural networks. Neural networks have been around for decades; it’s just that no one used to call them deep networks back then. Now we have all sorts of different flavors of neural networks: deep belief networks (DBNs), convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and more. There is also a ton of different learning algorithms for them nowadays. It used to be just backpropagation, but now you’ve got contrastive divergence, dropout, DropConnect, and all sorts of other modifications to the vanilla gradient descent algorithm.
No matter what you do, you can’t avoid Excel, so you may as well dive in and tame the beast. Here are five Excel add-ins that every data scientist should install.
In this post, we take a look at what deep convolutional neural networks (convnets) really learn, and how they understand the images we feed them. We will use Keras to visualize inputs that maximize the activation of the filters in different layers of the VGG16 architecture, trained on ImageNet. All of the code used in this post can be found on GitHub.
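The post’s Keras/VGG16 code isn’t reproduced here, but the underlying mechanism, gradient ascent on the input to maximize a unit’s activation, can be shown with a toy numpy analogue of my own construction: for a single linear “filter”, the optimal unit-norm input is the filter itself, and the ascent converges to it.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16)                 # stand-in for one convolutional filter
x = rng.normal(size=16)                 # start from a random "image"
x = x / np.linalg.norm(x)

# gradient ascent on the input; for the linear response w.x, the gradient is w
for _ in range(300):
    grad = w / (np.linalg.norm(w) + 1e-8)   # normalised gradient step
    x = x + 0.1 * grad
    x = x / np.linalg.norm(x)               # constrain the input to the unit sphere

# measure alignment between the optimised input and the filter
cosine = float(x @ w / (np.linalg.norm(x) * np.linalg.norm(w)))
```

The cosine similarity approaches 1, i.e. the maximizing input “looks like” the filter, which is the intuition behind the far richer patterns the post extracts from VGG16’s non-linear layers.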
Deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics. Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep learning methods are representation learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. Deep learning discovers intricate structure in large datasets by using the back-propagation algorithm to indicate how a machine should change the internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about dramatic improvements in processing images, video, speech and audio, while recurrent nets have shone on sequential data such as text and speech. This tutorial will introduce the fundamentals of deep learning, discuss applications, and close with challenges ahead.
Today we are excited to announce the release of Census Analyzer, a free web-based tool for data analysis, visualization and report sharing by JetBrains.
In a major breakthrough for artificial intelligence, a computing system developed by Google researchers in Great Britain has beaten a top human player at the game of Go, the ancient Eastern contest of strategy and intuition that has bedeviled AI experts for decades.
The business world is full of streams of items that need to be filtered or evaluated: parts on an assembly line, résumés in an application pile, emails in a delivery queue, transactions awaiting processing. Machine learning techniques are increasingly being used to make such processes more efficient: image processing to flag bad parts, text analysis to surface good candidates, spam filtering to sort email, fraud detection to lower transaction costs, etc. In this article, I show how you can take business factors into account when using machine learning to solve these kinds of problems with binary classifiers. Specifically, I show how the concept of expected utility from the field of economics maps onto the Receiver Operating Characteristic (ROC) space often used by machine learning practitioners to compare and evaluate models for binary classification. I begin with a parable illustrating the dangers of not taking such factors into account. This concrete story is followed by a more formal mathematical look at the use of indifference curves in ROC space to avoid this kind of problem and guide model development. I wrap up with some recommendations for successfully using binary classifiers to solve business problems.
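The key formula behind those indifference curves is standard, though the notation below is mine rather than necessarily the article’s: expected utility is linear in the ROC coordinates (FPR, TPR), so iso-utility sets are straight lines whose slope depends on the positive-class prior $p$ and the utilities $U$ of the four outcomes.

```latex
% Expected utility of a classifier at ROC operating point (FPR, TPR):
\mathrm{EU} = p\bigl[\mathrm{TPR}\, U_{TP} + (1-\mathrm{TPR})\, U_{FN}\bigr]
            + (1-p)\bigl[\mathrm{FPR}\, U_{FP} + (1-\mathrm{FPR})\, U_{TN}\bigr]

% Setting d\,EU = 0 gives the slope of an indifference (iso-utility) line:
\frac{d\,\mathrm{TPR}}{d\,\mathrm{FPR}}
  = \frac{(1-p)\,(U_{TN}-U_{FP})}{p\,(U_{TP}-U_{FN})}
```

Rare positives or cheap false positives flatten or steepen this line, which is why the business-optimal operating point is generally not the one nearest the top-left corner of the ROC plot.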
Deep learning is tricky in several respects. Not only can the math and theory quickly lead to hairballs of gradient formulas and update equations, but deep learning models are also complex and often finicky pieces of software. Recently, we’ve seen promising projects like TDB for TensorFlow, which promises online visualization of neural networks and control flow interruption during training and inference, helping the developer diagnose the behavior of a neural network that isn’t working.