TensorFlow vs. Theano vs. Torch

In this study, I evaluate some popular deep learning frameworks. The candidates are listed in alphabetical order: TensorFlow, Theano, and Torch. This is a dynamic document, and the evaluation is based on the current state of their code, not on what the authors claim in white papers. The evaluation is mostly technical, to the best of my knowledge, and doesn’t take into account community size or who uses what. If you find something wrong or incomplete, please help fix it by creating an issue or sending a pull request.


Normality Tests for Continuous Data

We use normality tests when we want to understand whether a given sample of continuous data could have come from a Gaussian distribution (also called the normal distribution). Normality tests are a prerequisite for some inferential statistics, especially the generation of confidence intervals and hypothesis tests such as 1-sample and 2-sample t-tests. The normality assumption is also important when performing ANOVA, which compares multiple samples of data with one another to determine whether they come from the same population. Normality tests are themselves a form of hypothesis test, used to make an inference about the population from which a sample was collected. A number of normality tests are available for R, and all of them fundamentally assess the same pair of hypotheses. The first is the null hypothesis, which states that there is no difference between the data set and the normal distribution; the alternative is that the data depart from normality.
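The post works in R; as an illustrative analogue only, here is a minimal Python sketch that runs a Shapiro-Wilk normality test via scipy. The synthetic sample and the 0.05 threshold are assumptions for the example, not taken from the post.

    import numpy as np
    from scipy import stats

    # Synthetic sample standing in for real continuous data
    # (invented for illustration; the post itself uses R).
    rng = np.random.default_rng(42)
    sample = rng.normal(loc=10.0, scale=2.0, size=200)

    # Shapiro-Wilk test. H0: the sample came from a normal distribution.
    stat, p_value = stats.shapiro(sample)
    print(f"W = {stat:.4f}, p = {p_value:.4f}")

    # At the usual 0.05 level, a large p-value means we fail to reject H0:
    # the data are consistent with normality.
    if p_value > 0.05:
        print("Fail to reject H0: consistent with a normal distribution")
    else:
        print("Reject H0: evidence of departure from normality")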


Comparing 7 Python data visualization tools

The Python scientific stack is fairly mature, with libraries for a wide variety of use cases, including machine learning and data analysis. Data visualization is an important part of exploring data and communicating results, but it is an area where Python has historically lagged behind tools such as R.
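As a point of reference for what these libraries do, here is a small, hypothetical example of the kind of exploratory chart they all render, written with matplotlib, the foundation that several of the higher-level tools build on. The data is invented for illustration.

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up data for a quick exploratory scatter plot.
    rng = np.random.default_rng(1)
    x = rng.normal(size=300)
    y = 0.8 * x + rng.normal(scale=0.5, size=300)

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(x, y, alpha=0.5)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title("Exploratory scatter plot")
    plt.show()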


Introduction to Spark with Python

After lots of groundbreaking work led by the UC Berkeley AMPLab, Spark was developed to use distributed, in-memory data structures to improve data processing speeds over Hadoop for most workloads. In this post, we’re going to cover the architecture of Spark and basic transformations and actions using a real dataset. If you want to write and run your own Spark code, check out the interactive version of this post on Dataquest.
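As a taste of those transformations and actions, here is a minimal PySpark sketch; it substitutes a tiny in-memory word-count dataset for the real dataset used in the post.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-sketch")

    # Transformations (flatMap, map, reduceByKey) only build a lazy lineage;
    # nothing runs until an action (take, collect, count) is invoked.
    lines = sc.parallelize([
        "spark keeps intermediate data in memory",
        "hadoop mapreduce writes intermediate data to disk",
    ])
    counts = (lines
              .flatMap(lambda line: line.split())   # transformation
              .map(lambda word: (word, 1))          # transformation
              .reduceByKey(lambda a, b: a + b))     # transformation

    print(counts.take(5))                           # action: triggers the job
    sc.stop()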


Happy 10th Birthday, Google Analytics!

Today marks the 10th anniversary of the launch of Google Analytics. So much has changed over the last decade! Think back ten years: the most popular smartphone was the BlackBerry, and a 128-megabyte flash drive cost about $30. Today you can get 250X the storage for half the price. Over that time, the field we now refer to as “digital analytics” has changed significantly. Here are ten of the capabilities launched along the way:
1. Event Tracking
2. Real-Time Reporting
3. Multi-Channel Funnels, Attribution Modeling, and Data-Driven Attribution
4. Tag Management
5. Analytics Academy
6. Universal Analytics
7. Measurement Protocol
8. Mobile App Analytics
9. Enhanced Ecommerce
10. Remarketing


Avoiding Tunnel Vision in Peer Comparisons

Comparing yourself to peers – also known as benchmarking – lets you understand how you’re doing, identify performance gaps and opportunities to improve, and highlight peer achievements you could emulate, or your own achievements worth celebrating. As long as data is available, peer comparison can potentially accomplish all of these goals. The opportunities for peer comparison are growing rapidly thanks to cloud and other services that generate data as a by-product of serving customers.


Applied Statistical Theory: Quantile Regression

This is part two of the ‘applied statistical theory’ series, which covers the bare essentials of various statistical techniques. As analysts, we need to know enough about what we’re doing to be dangerous, and to be able to explain our approaches to others. It’s not enough to say “I used X because the misclassification rate was low.”
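As a quick illustration of the technique (not code from the series), here is a sketch of quantile regression in Python with statsmodels, fit to made-up heteroscedastic data.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Made-up data where the spread of y grows with x: exactly the
    # situation where quantile regression adds value over OLS.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 500)
    y = 2.0 * x + rng.normal(0, 0.5 + 0.3 * x, 500)
    df = pd.DataFrame({"x": x, "y": y})

    # Each fit models a different conditional quantile of y, not its mean.
    median_fit = smf.quantreg("y ~ x", df).fit(q=0.5)   # conditional median
    upper_fit = smf.quantreg("y ~ x", df).fit(q=0.9)    # 90th percentile
    print(median_fit.params)
    print(upper_fit.params)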


A Statistical View of Deep Learning

Over the past six months, I’ve been writing a series of posts (one each month) on a statistical view of deep learning, with two principal motivations in mind. The first was a personal exercise: to make concrete, and to test the limits of, the way I think about and use deep learning in my everyday work. The second was to highlight important statistical connections and implications of deep learning that I do not see being made in the popular courses, reviews, and books on the subject, but which are extremely important to keep in mind.


Big Data Analytics – Nine Easy Steps to Unlock Breakthrough Results

An earlier post addressed one of the more perplexing challenges in managing an analytic community of any size: the irresistible urge to cling to what everyone else seems to be doing, without thinking carefully about what is needed rather than merely wanted. This has become more important and urgent with the breathtaking speed of Big Data adoption in the analytic community. Older management styles and obsolete thinking have created needless friction between the business and its supporting IT organizations. Unlocking breakthrough results requires a deep understanding of why this friction occurs and what can be done to reduce it, so everyone can get back to the work at hand.


Graph from Sparse Adjacency Matrix

I spent a decent chunk of my morning trying to figure out how to construct a sparse adjacency matrix for use with graph.adjacency(). I’d have thought this would be rather straightforward, but I tripped over a few subtle issues with the Matrix package. My biggest problem (which in retrospect seems rather trivial) was that the elements of my adjacency matrix were occupied by the pipe symbol.
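Those pipe symbols are typically how R’s Matrix package prints a pattern matrix, i.e. one constructed without explicit numeric entries; supplying actual values usually resolves it. As a rough analogue in Python rather than the post’s R code, here is a sketch that builds a sparse adjacency matrix with scipy.sparse and turns it into a graph with networkx.

    import numpy as np
    import scipy.sparse as sp
    import networkx as nx

    # Explicit numeric values for the nonzero entries; the R pitfall in the
    # post arises when the Matrix package stores only an occupancy pattern
    # (printed as "|") instead of numbers.
    rows = np.array([0, 0, 1, 2])
    cols = np.array([1, 2, 2, 0])
    data = np.ones(len(rows))
    adj = sp.coo_matrix((data, (rows, cols)), shape=(3, 3))

    # Build a directed graph from the nonzero coordinates.
    g = nx.DiGraph()
    g.add_nodes_from(range(adj.shape[0]))
    g.add_edges_from(zip(*adj.nonzero()))
    print(sorted(g.edges()))  # [(0, 1), (0, 2), (1, 2), (2, 0)]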