RStudio v0.99 Preview: Code Snippets
We’re getting close to shipping the next version of RStudio (v0.99), and this week we continue our series of posts describing the major new features of the release (previous posts have already covered code completion, the revamped data viewer, and improvements to vim mode). Note that if you want to try out any of the new features now, you can do so by downloading the RStudio Preview Release.

Research papers that changed the world of Big Data
If you are looking for some of the most influential research papers that, in a short span of 10 years, revolutionised the way we gather, aggregate, analyze and store increasing volumes of data, you are in the right place! These papers were shortlisted based on recommendations from big data enthusiasts and experts around the globe, gathered from various social media channels. In case we’ve missed any important paper, please let us know.

Bring Your Own Data – Analyzing Wine Market
So what does determine a given store’s wine selection? Do managers already stock the popular wines most people drink and, therefore, try to ‘spice up’ their menu with bottles other stores will definitely not have? Or maybe they study customers and select bottles based on consumer preferences? My big question is: what makes wine sell?

The Perils Of Marketing Attribution
One of the hottest topics in analytics today is marketing attribution. Attribution, for those unfamiliar, is the process of assigning credit to various marketing efforts when a sale is generated. In the modern world, this is no easy task. There are myriad ways to touch a customer today and the goal of attribution is to tease out the impact that each touch had in convincing you to make a purchase. Was it the email you were sent? Or the Google link you clicked? Or the banner ad you clicked when visiting a different site? Or the ad you saw with your video on YouTube? Or one of many other potential touch points? Or is it a mix? It is quite common today for a customer to have been exposed to multiple influences in the lead up to a purchase. How do you attribute the relationship?
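To make the idea of assigning credit concrete, here is a minimal Python sketch of two simple attribution rules. It is not taken from the linked post: the function names, the channel names, and the choice of "last touch" and "linear" rules are assumptions made purely for illustration.

```python
from collections import defaultdict


def last_touch(path):
    """Give all credit to the final touch point before the sale."""
    return {path[-1]: 1.0}


def linear(path):
    """Split credit equally across every touch point in the path."""
    credit = defaultdict(float)
    share = 1.0 / len(path)
    for channel in path:
        credit[channel] += share
    return dict(credit)


# Hypothetical customer journey that ended in a purchase.
journey = ["email", "search", "banner", "video"]

print(last_touch(journey))  # all credit goes to the last channel
print(linear(journey))      # each channel receives an equal share
```

The two rules give very different answers for the same journey, which is exactly why attribution is hard: the result depends on the rule you choose, not only on the data.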

Deep Dive into Spark SQL’s Catalyst Optimizer
Spark SQL is one of the newest and most technically involved components of Spark. It powers both SQL queries and the new DataFrame API. At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer. We recently published a paper on Spark SQL that will appear in SIGMOD 2015 (co-authored with Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, and Ali Ghodsi). In this blog post we are republishing a section in the paper that explains the internals of the Catalyst optimizer for broader consumption.
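As a rough illustration of what a pattern-matching rule in a query optimizer does, the sketch below folds constant additions in a tiny expression tree. It is written in Python rather than Scala and does not use Spark's API; the `Literal`, `Attribute`, and `Add` classes and the `fold_constants` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(node):
    """Rewrite the tree bottom-up, replacing Add(Literal, Literal) with a single Literal."""
    if isinstance(node, Add):
        left = fold_constants(node.left)
        right = fold_constants(node.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    # Leaves (literals and attributes) are returned unchanged.
    return node


# Hypothetical expression: x + (1 + 2)
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
print(fold_constants(tree))
```

The rule only matches the one shape it cares about and leaves everything else alone, so an optimizer built this way can be extended by adding more small rules of the same kind.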

Wrangling Complex Spreadsheet Column Headers
Something I’ve been exploring lately is ‘external spreadsheet data source’ wrappers for the pandas Python library: wrappers around frequently released spreadsheets that offer a simple (?!) interface for pulling the data from the spreadsheet into a pandas dataframe.

Mapping Flows in R … with data.table and lattice
Some days ago James Cheshire published the post Mapping Flows in R. I have implemented an alternative (faster) version using data.table to read and join the datasets (and lattice to display the results). If you are new to data.table you should read this wiki and this cheatsheet.

A new interactive interface for learning R online, for free
Using the open-source swirl project and RStudio server, DataCamp has created an exciting new online learning interface. Discover a fun collection of free interactive R tutorials covering basic R functions, the apply-family, base graphics, data structures and much more. http://www.datacamp.com/swirl-r-tutorial

Hash Table Performance in R: Part II
In Part I of this series, I explained how R hashed environments are superior to vectors or lists when you need a hash table for your work. I also teased that in this post I would explain the caveats associated with that choice, but I’m saving that for later as I want to share with you the fastest ways of operating on them.
