An End-to-End Guide to Understand the Math behind XGBoost

Ever since its introduction in 2014, XGBoost has been lauded as the holy grail of machine learning hackathons and competitions. From predicting ad click-through rates to classifying high energy physics events, XGBoost has proved its mettle in terms of performance – and speed. I always turn to XGBoost as my first algorithm of choice in any ML hackathon. The accuracy it consistently gives, and the time it saves, demonstrate how useful it is. But how does it actually work? What kind of mathematics powers XGBoost? We'll figure out the answers to these questions soon.

Ensemble Learning in Python

In this tutorial, you’ll learn what ensemble learning is and how it improves the performance of a machine learning model.

An ethics checklist for data scientists

deon is a command line tool that allows you to easily add an ethics checklist to your data science projects. We support creating a new, standalone checklist file or appending a checklist to an existing analysis in many common formats.

Guide to Getting Started with TensorFlow

Including video and written tutorials, beginner code examples, useful tricks, helpful communities, books, jobs and more – this is the ultimate guide to getting started with TensorFlow.

Essential Math for Data Science – ‘Why’ and ‘How’

Mathematics is the bedrock of any contemporary discipline of science. It is no surprise, then, that almost all the techniques of modern data science (including all of machine learning) have some deep mathematical underpinning. In this article, we discuss the essential math topics to master to become a better data scientist in all aspects.

Data Science Cheatsheet

This cheatsheet is currently a 9-page reference in basic data science that covers basic concepts in probability, statistics, statistical learning, machine learning, big data frameworks and SQL. The cheatsheet is loosely based off of The Data Science Design Manual by Steven S. Skiena and An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

Don’t Use Dropout in Convolutional Networks.

If you are wondering how to implement dropout, here is your answer. … I have noticed that there is an abundance of resources for learning the what and why of deep learning. Unfortunately, when it comes time to make a model, there are very few resources explaining the when and how. I am writing this article for other data scientists trying to implement deep learning, so you don't have to troll through research articles and Reddit discussions like I did. In this article you will learn why dropout is falling out of favor in convolutional architectures.

Sample size and class balance on model performance

This post shows the relationship between sample size and accuracy in a classification model. I hope this little piece of research helps you with your own classification problems. An LSTM model created in Keras was used to produce the results. The metric we are tracking is categorical_accuracy (equivalent to accuracy for multi-class), which is biased towards the classes that are more heavily represented. For example, if we are predicting fraud, which occurs only 1 time in 1000 (0.1%), then labelling 100% of the cases as ‘Non-fraud’ will make us correct 99.9% of the time. High, but utterly useless.
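
The fraud example above can be checked with a few lines of plain Python. This is only a sketch with made-up data (1,000 hypothetical transactions, exactly one of them fraudulent) and a stand-in "model" that always predicts the majority class; it is not the LSTM or the Keras metric used in the post. It also computes recall on the fraud class, a measure the post does not mention, to show what the accuracy figure hides.

```python
# Hypothetical data: 1,000 transactions, exactly 1 of which is fraud.
labels = ["fraud"] + ["non-fraud"] * 999

# A stand-in "model" that ignores its input and always predicts the majority class.
predictions = ["non-fraud"] * len(labels)

# Accuracy: share of all cases labelled correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the fraud class: share of actual fraud cases that were caught.
fraud_total = sum(y == "fraud" for y in labels)
fraud_caught = sum(p == "fraud" and y == "fraud" for p, y in zip(predictions, labels))
fraud_recall = fraud_caught / fraud_total

print(f"accuracy: {accuracy:.1%}")          # accuracy: 99.9%
print(f"fraud recall: {fraud_recall:.1%}")  # fraud recall: 0.0%
```

The accuracy is high only because the majority class dominates the count; not a single fraud case is detected.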

Get Started with R (For Free) in IBM Watson Studio

As you may have noticed, I blog a lot about R. I just can’t help it y’all, I’m like a moth to a flame with these fancy R packages. Since I try to make my blogs beginner friendly, I usually begin with a little talk about your options for running R code. As such, I wanted to dedicate a whole blog to explaining your R options within IBM Watson Studio. Why? Well, first and foremost, I use it a lot and I want to share the benefits. Even better, I can share it because the service has a free tier! Watson Studio is a hosted, full-service, scalable data science platform. It allows us to integrate a variety of languages, products, techniques and data assets all within one place. As an R user, I like it because my colleagues and I can leverage the collaboration options and work in the same project space but use different languages or tools. The fact that it’s hosted means that I can access it from any website (I’m talking iPads, folks). Finally, it has a lot of great (and free) integrations like SPSS, Cognos dashboards and a variety of embedded AI services like Watson Visual Recognition and Natural Language Classifier.

Why GPUs?

It is no secret in the Deep Learning community that GPU enabled machines and clusters will dramatically speed up the time it takes to train neural networks. In this article, we will look at the running time of Gradient Descent and where GPUs reduce this time complexity.

Statistics: The Collection, Analysis and Inference of Data (Part I)

Statistics are numbers that summarise raw facts and figures in some meaningful way. They present key ideas that may not be immediately apparent by just looking at the raw data, and by data, we mean facts or figures from which we can draw conclusions.

Demystifying Optimizations for machine learning

Optimization is the most essential ingredient in the recipe of machine learning algorithms. It starts with defining some kind of loss function (or cost function) and ends with minimizing it using one optimization routine or another. The choice of optimization algorithm can make the difference between getting good accuracy in hours or in days. The applications of optimization are limitless, and it is a widely researched topic in industry as well as academia. In this article we'll walk through several optimization algorithms used in the realm of deep learning. (You can go through this article to understand the basics of loss functions)
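
As a rough illustration of the "define a loss, then minimize it" loop described above, here is a minimal sketch of plain gradient descent. The toy quadratic loss, the starting point, the learning rate and the step count are all assumptions chosen for the example; none of them come from the linked article, which covers more elaborate optimizers.

```python
def loss(w):
    # Toy loss (an assumption for this sketch): smallest at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of the toy loss with respect to w.
    return 2.0 * (w - 3.0)


w = 0.0               # arbitrary starting point
learning_rate = 0.1   # hypothetical step size

for step in range(100):
    # Move against the gradient, i.e. downhill on the loss surface.
    w -= learning_rate * grad(w)

print(f"w = {w:.6f}, loss = {loss(w):.2e}")
```

With this step size the distance to the minimum shrinks by a constant factor on every iteration, so `w` ends up very close to 3. A poorly chosen learning rate would converge more slowly or diverge, which is one reason the choice of optimizer matters.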

Apache Druid

Druid is primarily used to store, query, and analyze large event streams. Examples of event streams include user generated data such as clickstreams, application generated data such as performance metrics, and machine generated data such as network flows and server metrics. Druid is optimized for sub-second queries to slice-and-dice, drill down, search, filter, and aggregate this data. Druid is commonly used to power interactive applications where performance, concurrency, and uptime are important.

An introduction to Druid, your Interactive Analytics at (big) Scale

I discovered Druid approximately 2 years ago. I was working at SuperAwesome at that time, and we needed a solution to replace our existing reporting system based on Mongo, which was showing its fatigue. Our MongoDB implementation didn't scale well due to the high cardinality of the data, and the storage cost made us think it wasn't the best tool for the job. At the time, we were handling approximately 100 million events per day, and some of our reports were taking 30 seconds to generate. We currently handle billions of events per day, and the reporting takes less than 1 second most of the time. We migrated the data we had in MongoDB: it was using approximately 60GB of disk space there, and once indexed inside Druid, the same data took only 600MB. Yep. 100x less storage! This post will explain what Druid is, why you should care, a high-level overview of how it works, and some information on how to get started and achieve less than 1 second query time!

Simple Implementation of Densely Connected Convolutional Networks in PyTorch.

In this post I will try to explain the implementation of Densely Connected Convolutional Networks using the PyTorch library. Dense Networks are a relatively recent kind of Convolutional Neural Network that expands on the idea proposed for Residual Networks, which have become a standard choice for feature extraction on image data.