Vega makes visualizing BIG data easy

We’re delighted to announce the availability of Vega, the JSON specification for creating custom visualizations of large datasets. Using Vega, you can create server-rendered visualizations in both the community and enterprise versions of MapD.

New Data Studio Data Control

Today we’ve made it dramatically easier to view your Google Analytics data in Data Studio using the new Data Control. When a report is created using the Data Control, all viewers can see their own data in the report without creating anything.

Machine Learning Exercises in Python: An Introductory Tutorial Series

This post presents a summary of a series of tutorials covering the exercises from Andrew Ng’s machine learning class on Coursera. Instead of implementing the exercises in Octave, the author has opted to do so in Python, and provide commentary along the way.

The truth about priors and overfitting

Have you ever thought about how strong a prior is compared to observed data? It’s not an entirely easy thing to conceptualize. To make this easier, I will take you through some simulation exercises. These are meant as food for thought and not necessarily as a recommendation. However, many of the considerations we will run through will be directly applicable to your everyday work of applying Bayesian methods to your specific domain. We will start out by creating some data generated from a known process. The process is the following. …
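The post's own data-generating process is not reproduced in this excerpt. As a separate, minimal illustration of the opening question, the sketch below assumes a Beta prior on a success probability and binomial data (a standard conjugate pair, chosen here for illustration only). The function name `posterior_mean`, the prior parameters, and the 80% observed rate are all hypothetical and do not come from the post.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a Beta(prior_a, prior_b) prior after binomial data.

    Conjugate update: successes are added to prior_a, failures to prior_b.
    """
    post_a = prior_a + successes
    post_b = prior_b + (trials - successes)
    return post_a / (post_a + post_b)


# Hypothetical setup: both priors are centered on 0.5,
# while the observed success rate is 0.8.
weak_prior = (1, 1)      # behaves like 2 pseudo-observations
strong_prior = (50, 50)  # behaves like 100 pseudo-observations

for trials in (10, 100, 1000):
    successes = trials * 8 // 10  # 80% observed success rate
    weak = posterior_mean(*weak_prior, successes, trials)
    strong = posterior_mean(*strong_prior, successes, trials)
    print(f"n={trials:4d}  weak prior: {weak:.3f}  strong prior: {strong:.3f}")
```

Under these assumptions, the weak prior is overwhelmed after a handful of observations, while the strong prior keeps pulling the estimate toward 0.5 until the sample size is well beyond its 100 pseudo-observations.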

Revolutionizing Data Science Package Management, July 25

Learn how Anaconda solves one of the most headache-inducing problems in data science, the package dependency nightmare, through the power of conda in this webinar on July 25.

Summary of Unintuitive Properties of Neural Networks

Neural networks work really well on many problems, including language, image and speech recognition. However, understanding how they work is not simple. Here is a summary of their unusual and counterintuitive properties.

When not to use deep learning

I know it’s weird to start a blog post with a negative, but there was a wave of discussion in the last few days that I think serves as a good hook for some topics on which I’ve been thinking recently. It all started with a post in the Simply Stats blog by Jeff Leek on the caveats of using deep learning in the small sample size regime. In sum, he argues that when the sample size is small (which happens a lot in the bio domain), linear models with few parameters perform better than deep nets even with a modicum of layers and hidden units. He goes on to show that a very simple linear predictor, with the top ten most informative features, performs better than a simple deep net when trying to classify zeros and ones in the MNIST dataset using only 80 or so samples. This prompted Andrew Beam to write a rebuttal in which a properly trained deep net was able to beat the simple linear model, even with very few training samples. This back-and-forth comes at a time when more and more researchers in biomedical informatics are adopting deep learning for various problems. Is the hype real or are linear models really all we need? The answer, as always, is that it depends. In this post, I want to visit use cases in machine learning where using deep learning does not really make sense, as well as tackle preconceptions that I think prevent deep learning from being used effectively, especially by newcomers.

A lesson in prescriptive modeling

For the data professional, the first step to mastering prescriptive modeling is to understand simulation. In this excerpt from the O’Reilly video Hands-On Techniques for Business Model Simulation, I’ll walk you through a practical case study: simulating the cross-breeding of a new species of iris, and new business models for the resulting flowers. Using published open source code, viewers learn to generate a new species of iris, find interesting new characteristics, and search through business model simulations for profitable ways of bringing the new flowers to market. It takes a lot of knowledge and skill to create useful simulations of the real world. That information is often hidden by obscure techniques or confusing explanations. In the full O’Reilly Learning Path, Creating Simulations to Discover New Business Models, I take viewers through a straightforward approach to learning prescriptive model simulation by treating it like a foreign language. We start by learning key terms and intuitive definitions. We assemble those terms into meaningful ideas, and we complete the Learning Path with the Iris example shown in the video excerpt in this post.

Thinking with data with “Modern Data Science with R”

One of the biggest challenges educators face is how to teach statistical thinking integrated with data and computing skills to allow our students to fluidly think with data. Contemporary data science requires a tight integration of knowledge from statistics, computer science, mathematics, and a domain of application. For example, how can one model high earnings as a function of other features that might be available for a customer? How do the results of a decision tree compare to a logistic regression model? How does one assess whether the underlying assumptions of a chosen model are appropriate? How are the results interpreted and communicated?