How to do common Excel and SQL tasks in Python

Data practitioners have many tools that they use to slice and dice data. Some people use Excel, some people use SQL — and some people use Python. The advantages of using Python are obvious when it comes to certain tasks. You can process much bigger datasets at much faster speeds. You can use open source machine learning libraries built on top of Python. You can easily import and export data in different formats. Python can become an essential part of any data analyst’s toolbox due to its versatility. However, it can be hard to get started. Most data analysts are probably familiar with either SQL or Excel. This tutorial is structured to help you transfer over skills and techniques from those two programs to Python. First, let’s get you set up on Python. The easiest way to get started is to use Jupyter Notebook and Anaconda. This visual interface will allow you to plug Python code in and immediately see the output of your results. It’ll make it easy for you to follow along with the rest of this tutorial as well.

Empirical Bayes for multiple sample sizes

Here’s a data problem I encounter all the time. Let’s say I’m running a website where users can submit movie ratings on a continuous 1-10 scale. For the sake of argument, let’s say that the users who rate each movie are an unbiased random sample from the population of users. I’d like to compute the average rating for each movie so that I can create a ranked list of the best movies.

Updating Google Maps with Deep Learning and Street View

Every day, Google Maps provides useful directions, real-time traffic information and information on businesses to millions of people. In order to provide the best experience for our users, this information has to constantly mirror an ever-changing world. While Street View cars collect millions of images daily, it is impossible to manually analyze more than 80 billion high resolution images collected to date in order to find new, or updated, information for Google Maps. One of the goals of the Google’s Ground Truth team is to enable the automatic extraction of information from our geo-located imagery to improve Google Maps. In “Attention-based Extraction of Structured Information from Street View Imagery”, we describe our approach to accurately read street names out of very challenging Street View images in many countries, automatically, using a deep neural network. Our algorithm achieves 84.2% accuracy on the challenging French Street Name Signs (FSNS) dataset, significantly outperforming the previous state-of-the-art systems. Importantly, our system is easily extensible to extract other types of information out of Street View images as well, and now helps us automatically extract business names from store fronts. We are excited to announce that this model is now publicly available!

timekit: Time Series Forecast Applications Using Data Mining

The timekit package contains a collection of tools for working with time series in R. There’s a number of benefits. One of the biggest is the ability to use a time series signature to predict future values (forecast) through data mining techniques. While this post is geared toward exposing the user to the timekit package, there are examples showing the power of data mining a time series as well as how to work with time series in general. A number of timekit functions will be discussed and implemented in the post. The first group of functions works with the time series index, and these include functions tk_index(), tk_get_timeseries_signature(), tk_augment_timeseries_signature() and tk_get_timeseries_summary(). We’ll spend the bulk of this post introducing you to these. The next function deals with creating a future time series from an existing index, tk_make_future_timeseries(). The last set of functions deal with coercion to and from the major time series classes in R, tk_tbl(), tk_xts(), tk_zoo() (and tk_zooreg()), and tk_ts().

Technical Foundations of Informatics: A modern introduction to R

Informatics (or Information Science) is the practice of creating, storing, finding, manipulating and sharing information. These are all tasks that the R language was designed for, and so Technical Foundations of Informatics, the online course guide for the University of Washington course of the same name, also provides an excellent resource for learning those skills using R.

Machine Learning Classification with 1R and RIPPER Rule Learners (Edible/Poisonous Mushrooms)

We will develop a classification example using 1R and RIPPER rule learners algorithms. The exercise was originally published in ‘Machine Learning in R’ by Brett Lantz, PACKT publishing 2015 (open source community experience destilled). The example we will develop is about classifying edible and poisonous mushrooms.

Data Science for Operational Excellence (Part-5)

Operations need to have demand forecasts in order to establish optimal resource allocation policies. But, when we make predictions the only thing that we assure is the occurrence of prediction errors. Fortunately, there is no need to be 100% accurate to succeed, we just need to perform better than our competitors. In this exercise we will learn a practical approach to predict using the forecast package.

Creating Graphs with Python and GooPyCharts

Last summer, I came across an interesting plotting library called GooPyCharts which is a Python wrapper for the Google Charts API. In this article, we will spend a few minutes learning how to use this interesting package. GooPyCharts follows syntax that

Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing

It can be difficult to demonstrate the importance of data visualization. Some people are of the impression that charts are simply ‘pretty pictures’, while all of the important information can be divined through statistical analysis. An effective (and often used) tool used to demonstrate that visualizing your data is in fact important is Anscome’s Quartet. Developed by F.J. Anscombe in 1973, Anscombe’s Quartet is a set of four datasets, where each produces the same summary statistics (mean, standard deviation, and correlation), which could lead one to believe the datasets are quite similar. However, after visualizing (plotting) the data, it becomes clear that the datasets are markedly different. The effectiveness of Anscombe’s Quartet is not due to simply having four different datasets which generate the same statistical properties, it is that four clearly different and visually distinct datasets are producing the same statistical properties. In contrast the ‘Unstructured Quartet’ on the right in Figure 1 also shares the same statistical properties as Anscombe’s Quartet, however without any obvious underlying structure to the individual datasets, this quartet is not nearly as effective at demonstrating the importance of visualizing your data.

Every single Machine Learning course on the internet, ranked by your reviews

For this guide, I spent a dozen hours trying to identify every online machine learning course offered as of May 2017, extracting key bits of information from their syllabi and reviews, and compiling their ratings. For this task, I turned to none other than the open source Class Central community, and its database of thousands of course ratings and reviews.

How to Fail with Artificial Intelligence: 9 creative ways to make your AI startup fail

#1 Cut R&D spending to save money
#2 Operate in a technology bubble
#3 Prioritize technology over business strategy
#5 Develop without addressing business needs
#6 Cultivate a “we’re the best” attitude
#7 Get caught in a never-ending development loop
#8 Assume your customers are like developers
#9 Assume the AI hype is enough to succeed

Data preparation in the age of deep learning

In this episode of the Data Show, I spoke with Lukas Biewald, co-founder and chief data scientist at CrowdFlower. In a previous episode we covered how the rise of deep learning is fueling the need for large labeled data sets and high-performance computing systems. CrowdFlower has a service that many leading companies have come to rely on to provide them with labeled data sets to train machine learning models. As deep learning models get larger and more complex, they require training data sets that are bigger than those required by other machine learning techniques.

How the TensorFlow team handles open source support

Open-sourcing is more than throwing code over the wall and hoping somebody uses it. I knew this in theory, but being part of the TensorFlow team at Google has opened my eyes to how many different elements you need to build a community around a piece of software.

Estimating the Size of a Demonstration

Inspired by the recent March For Science we look into methods for the statistical estimation of the number of people participating in a demonstration organized as a march. In particular, we provide R code to reproduce the two on-the-spot counting method analysis of Yip et al. (2010) for the data of the July 1 March in Hong Kong 2006.