A Guide To Conduct Analysis Using Non-Parametric Statistical Tests

The average salary package of an economics honors graduate at Hansraj College at the end of the 1980s was around INR 1,000,000 p.a. That figure is significantly higher than for those graduating in the early 80s or early 90s. What could be the reason for such a high average? Well, one of the highest-paid Indian celebrities, Shahrukh Khan, graduated from Hansraj College in 1988, where he was pursuing economics honors. This, and many examples like it, tells us that the average is not always a good indicator of the center of the data: it can be heavily influenced by outliers. In such cases, looking at the median is a better choice. It is a better indicator of the center of the data because half of the data lies below the median and the other half lies above it.
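To see the effect concretely, here is a minimal R sketch with invented salary figures (the numbers are illustrative, not real data): a single outlier drags the mean far from the typical value, while the median barely moves.

```r
# Hypothetical salaries (INR p.a.) for a small graduating class
salaries <- c(450000, 500000, 520000, 480000, 510000)
mean(salaries)    # 492,000 -- a fair summary of the center
median(salaries)  # 500,000

# Add one celebrity-sized outlier
salaries_outlier <- c(salaries, 50000000)
mean(salaries_outlier)    # ~8,743,333 -- dragged far above every typical salary
median(salaries_outlier)  # 505,000 -- barely moves
```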


AutoML for large scale image classification and object detection

A few months ago, we introduced AutoML, an approach that automates the design of machine learning models. While we found that AutoML can design small neural networks that perform on par with neural networks designed by human experts, these results were constrained to small academic datasets like CIFAR-10 and Penn Treebank. We became curious how this method would perform on larger, more challenging datasets, such as ImageNet image classification and COCO object detection. Many state-of-the-art machine learning architectures have been invented by humans to tackle these datasets in academic competitions.


Latest Innovations in TensorFlow Serving

Since initially open-sourcing TensorFlow Serving in February 2016, we’ve made some major enhancements. Let’s take a look back at where we started, review our progress, and share where we are headed next. Before TensorFlow Serving, users of TensorFlow inside Google had to create their own serving system from scratch. Although serving might appear easy at first, one-off serving solutions quickly grow in complexity. Machine Learning (ML) serving systems need to support model versioning (for model updates with a rollback option) and multiple models (for experimentation via A/B testing), while ensuring that concurrent models achieve high throughput on hardware accelerators (GPUs and TPUs) with low latency. So we set out to create a single, general TensorFlow Serving software stack. We decided to make it open-sourceable from the get-go, and development started in September 2015. Within a few months we had created the initial end-to-end working system, and our open-source release followed in February 2016.
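To give a flavor of the versioning and multi-model support described above: serving behavior is driven by a model config file. Below is a hypothetical sketch in TensorFlow Serving’s text-protobuf config format (the model name and path are invented); it keeps two versions loaded at once, which is what enables rollback and side-by-side A/B comparison.

```
model_config_list {
  config {
    name: "ranker"               # hypothetical model name
    base_path: "/models/ranker"  # each numeric subdirectory holds one version
    model_platform: "tensorflow"
    model_version_policy {
      specific {
        versions: 7   # keep the previous version available for rollback
        versions: 8   # serve the new version alongside it
      }
    }
  }
}
```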


Process Mining with R: Introduction

In recent years, several niche tools have appeared to mine organizational business processes. In this article, we’ll show you that it is possible to get started with “process mining” using well-known data science programming languages as well.
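As a taste of what that looks like, here is a minimal sketch assuming the bupaR family of R packages (bupaR, edeaR, eventdataR, processmapR), which provide event-log objects, descriptive process metrics, and process maps:

```r
library(bupaR)       # event-log infrastructure
library(edeaR)       # descriptive process metrics
library(eventdataR)  # example event logs
library(processmapR) # process map visualization

# 'patients' is an example event log shipped with eventdataR
patients %>% activity_frequency(level = "activity")  # how often does each activity occur?
patients %>% trace_coverage(level = "trace")         # which activity sequences dominate?

# Draw the discovered process map, annotated with absolute frequencies
patients %>% process_map(type = frequency("absolute"))
```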


Problems In Estimating GARCH Parameters in R

I’m leaving this post up, though, as a warning to others to avoid fGarch in the future. This was news to me: books often refer to fGarch, so this post could serve as a resource for those working with GARCH models in R on why not to use fGarch.
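The warning is about fGarch specifically, not about GARCH modeling in R in general. As one commonly suggested alternative, here is a minimal sketch fitting a GARCH(1,1) with the rugarch package on a simulated series (purely illustrative; substitute real log-returns):

```r
library(rugarch)

set.seed(42)
returns <- rnorm(1000, sd = 0.01)  # placeholder series

# Standard GARCH(1,1) with a constant mean
spec <- ugarchspec(
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
  mean.model     = list(armaOrder = c(0, 0), include.mean = TRUE)
)

fit <- ugarchfit(spec = spec, data = returns)
coef(fit)  # mu, omega, alpha1, beta1 estimates
```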


Role Playing with Probabilities: The Importance of Distributions

I have a confession to make. I am not just a statistics nerd; I am also a role-playing games geek. I have been playing Dungeons and Dragons (DnD) and its variants since high school. While playing with my friends the other day, it occurred to me that DnD may have some lessons to share for my job as a data scientist. Hidden in its dice-rolling mechanics is a perfect little experiment demonstrating at least one reason why practitioners may resist statistical methods even when those methods show better average performance than previous approaches. It is all about distributions. While our averages may be higher, the distribution of individual data points can be disastrous.
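As a toy illustration of the point (my numbers, not the article’s): a d20 and the sum of three d6 have exactly the same average, 10.5, but very different spreads, so the chance of a disastrously low roll differs enormously.

```r
set.seed(20)
n <- 100000

d20    <- sample(1:20, n, replace = TRUE)                        # one 20-sided die
sum3d6 <- rowSums(replicate(3, sample(1:6, n, replace = TRUE)))  # three 6-sided dice

mean(d20); mean(sum3d6)  # both ~10.5: identical averages
sd(d20);   sd(sum3d6)    # ~5.77 vs ~2.96: very different spreads

# Probability of a disastrous roll (5 or less)
mean(d20 <= 5)     # 0.25   -- one roll in four
mean(sum3d6 <= 5)  # ~0.046 -- roughly one in twenty
```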


A ggplot-based Marimekko/Mosaic plot

One of my first baby steps into the open-source world was when I answered this SO question over four years ago. Recently I revisited the post and saw that Z.Lin had done a very nice and more modern implementation, using dplyr and faceting in ggplot2. I decided to merge her ideas with mine to create a general function that makes MM plots. I also added two features: counts, proportions, or percentages as text in the cells, and highlighting cells by a condition. For those of you unfamiliar with this type of plot: it graphs the joint distribution of two categorical variables. x is plotted in bins, with the bin widths reflecting its marginal distribution. The fill of the bins is based on y. Each bin is filled according to the co-occurrence of its x and y values. When x and y are independent, all the bins are filled in (approximately) the same way. The nice feature of the MM plot is that it shows both the joint distribution and the marginal distributions of x and y.
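For readers who want the mechanics rather than the full function, here is a minimal sketch of the underlying idea in ggplot2 (toy data, not the post’s actual implementation): the marginal distribution of x gives the bin widths, the conditional distribution of y within each bin gives the fill heights, and geom_rect draws the result.

```r
library(dplyr)
library(ggplot2)

# Toy categorical data
set.seed(1)
df <- data.frame(
  x = sample(c("A", "B", "C"), 500, replace = TRUE, prob = c(0.5, 0.3, 0.2)),
  y = sample(c("low", "high"), 500, replace = TRUE)
)

rects <- df %>%
  count(x, y) %>%                       # joint counts
  group_by(x) %>%
  mutate(x_total = sum(n),              # marginal count of x
         ymax    = cumsum(n / x_total), # stack y within each x bin
         ymin    = ymax - n / x_total) %>%
  ungroup() %>%
  mutate(x_prop = x_total / sum(n))     # marginal proportion of x = bin width

widths <- rects %>%
  distinct(x, x_prop) %>%
  mutate(xmax = cumsum(x_prop), xmin = xmax - x_prop)

ggplot(left_join(rects, widths, by = c("x", "x_prop")),
       aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = y)) +
  geom_rect(colour = "white") +
  labs(x = "x (bin width = marginal share)", y = "proportion of y within bin")
```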


Want to know how Deep Learning works? Here’s a quick guide for everyone.

Artificial Intelligence (AI) and Machine Learning (ML) are some of the hottest topics right now. The term “AI” is thrown around casually every day. You hear aspiring developers saying they want to learn AI. You also hear executives saying they want to implement AI in their services. But quite often, many of these people don’t understand what AI is. Once you’ve read this article, you will understand the basics of AI and ML. More importantly, you will understand how Deep Learning, the most popular type of ML, works. This guide is intended for everyone, so no advanced mathematics will be involved.


How to identify risky bank loans using C5.0 decision trees

The global financial crisis of 2007-2008 highlighted the importance of transparency and rigor in banking practices. As the availability of credit became limited, banks tightened their lending systems and turned to machine learning to more accurately identify risky loans. Decision trees are widely used in the banking industry due to their high accuracy and their ability to express a statistical model in plain language. Since government organizations in many countries carefully monitor lending practices, executives must be able to explain why one applicant was rejected for a loan while others were approved. This information is also useful for customers hoping to determine why their credit rating is unsatisfactory. Automated credit scoring models are likely employed to instantly approve credit applications over the telephone and web. In this tutorial, we will develop a simple credit approval model using C5.0 decision trees. We will also see how the results of the model can be tuned to minimize errors that result in a financial loss for the institution.
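As a hedged sketch of that workflow using the C50 package (credit_train and its default column are placeholders for whatever labeled loan data you have):

```r
library(C50)

# credit_train: hypothetical data frame of applicant attributes,
# with a factor column 'default' ("no"/"yes") as the class label

# Baseline tree; summary() prints the rules in plain language
credit_model <- C5.0(default ~ ., data = credit_train)
summary(credit_model)

# Penalize the costly error: predicting "no default" for an applicant who defaults.
# Rows are predicted classes, columns are actual classes; a missed default costs 4x.
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
                     dimnames = list(predicted = c("no", "yes"),
                                     actual    = c("no", "yes")))
credit_cost <- C5.0(default ~ ., data = credit_train, costs = error_cost)
```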