1.1 Billion Taxi Rides with Spark 2.2 & 3 Raspberry Pi 3 Model Bs

The Raspberry Pi is a £29, UK-built, single-board computer. To date more than 12.5 million units have been sold. In this benchmark I’ll use three Raspberry Pis, a few Micro SD cards and an old 7200 RPM hard drive and see what sort of query performance Spark 2.2 can achieve on a cluster of these devices. The dataset I’ll be using is a data dump I’ve produced of 1.1 billion taxi trips conducted in New York City over a six-year period. The Billion Taxi Rides in Redshift blog post goes into detail on how I put the dataset together. This is the same dataset I’ve used to benchmark Amazon Athena, BigQuery, BrytLyt, ClickHouse, Elasticsearch, EMR, kdb+/q, MapD, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks for comparison.


Snowflake Introduces the Cloud Data Warehouse Built for Financial Services

Snowflake Computing, the data warehouse built for the cloud, announced Virtual Private Snowflake (VPS) – the ideal solution for industries such as financial services that demand the highest level of security. VPS is the most advanced edition of Snowflake. It delivers the most secure solution so financial services enterprises can easily and efficiently derive all the insight from all their data. Thanks to VPS, financial services companies can experience the most innovative architecture the cloud has to offer, with the administrative controls required by large enterprises.


Model Non-Negative Numeric Outcomes with Zeros

As mentioned in the previous post (https://…perational-loss-directly-with-tweedie-glm ), we often need to model non-negative numeric outcomes with zeros in operational loss model development. Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are the Poisson sum of Gamma outcomes, an assumption that might not be well supported empirically from the data generation standpoint.
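
To make the "Poisson sum of Gamma outcomes" assumption concrete, here is a minimal simulation sketch. It is not taken from the post: the function names and the parameter values in the usage note are hypothetical, and it only illustrates the data-generating process that a Tweedie GLM assumes, not the fitting of the model itself.

```python
import random


def poisson_draw(rng, lam):
    """Draw a Poisson(lam) count by counting arrivals of a rate-lam process in [0, 1)."""
    count = 0
    t = rng.expovariate(lam)
    while t < 1.0:
        count += 1
        t += rng.expovariate(lam)
    return count


def simulate_losses(n_obs, lam, shape, scale, seed=0):
    """Simulate n_obs aggregated losses.

    Each loss is the sum of K Gamma(shape, scale) severities, where
    K ~ Poisson(lam). When K == 0 the aggregated loss is exactly zero,
    which is how the point mass at zero arises.
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(n_obs):
        k = poisson_draw(rng, lam)
        losses.append(sum((rng.gammavariate(shape, scale) for _ in range(k)), 0.0))
    return losses
```

With, say, `simulate_losses(20000, lam=0.5, shape=2.0, scale=100.0)`, the share of exact zeros should be close to exp(-0.5) and the mean loss close to lam × shape × scale. Comparing those two quantities with the observed data is one rough way to see whether the assumption is plausible.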


R with remote databases Exercises (Part-2)

It is common for the source of your data to be a remote database. The usual ways to cope with this in R are either to load all the data into R or to perform the heaviest joins and aggregations in SQL before loading the data. Both have drawbacks: the former is limited by memory capacity and may be very slow, while the latter forces you to use two technologies and is therefore more complicated and more error-prone. A solution to these problems is to use dplyr with dbplyr to communicate with the database backend. This lets the user write dplyr code that is translated to SQL and executed on the database server. One can say that this combines the advantages of the two standard approaches and gets rid of their disadvantages.
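
The underlying idea, running the aggregation inside the database and pulling back only the small result, can be sketched outside R as well. The Python example below is not dplyr or dbplyr: it uses the standard-library sqlite3 module, with an in-memory database standing in for the remote server and a hypothetical `orders` table.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory SQLite database standing in for a remote server."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table, used only for illustration.
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("alice", 10.0), ("alice", 20.0), ("bob", 8.0)],
    )
    return conn


def mean_amount_by_customer(conn):
    """Aggregate inside the database and fetch only one row per customer."""
    rows = conn.execute(
        "SELECT customer, AVG(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
    return dict(rows)
```

Only the grouped summary crosses the connection, so the client's memory use depends on the number of groups rather than the number of rows. dbplyr aims for the same effect while letting you stay in dplyr syntax instead of writing the SQL by hand.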


HR Analytics: Using Machine Learning to Predict Employee Turnover

Employee turnover (attrition) is a major cost to an organization, and predicting turnover is at the forefront of the needs of Human Resources (HR) in many organizations. Until now the mainstream approach has been to use logistic regression or survival curves to model employee attrition. However, with advancements in machine learning (ML), we can now get both better predictive performance and better explanations of what critical features are linked to employee attrition. In this post, we’ll use two cutting-edge techniques. First, we’ll use the h2o package’s new FREE automatic machine learning algorithm, h2o.automl(), to develop a predictive model that is in the same ballpark as commercial products in terms of ML accuracy. Then we’ll use the new lime package, which enables the breakdown of complex, black-box machine learning models into variable importance plots. We can’t stress enough how excited we are to share this post because it’s a much-needed step towards machine learning in business applications!!! Enjoy.


6 Common Probability Distributions every data science professional should know

Suppose you are a teacher in a university. After checking assignments for a week, you graded all the students. You gave these graded papers to a data entry guy in the university and told him to create a spreadsheet containing the grades of all the students. But the guy only stored the grades and not the corresponding students. He made another blunder: in his hurry he missed a couple of entries, and we have no idea whose grades are missing. Let’s find a way to solve this.


NumPy Cheat Sheet


Keras Tutorial: Recognizing Tic-Tac-Toe Winners with Neural Networks

In this tutorial, we will build a neural network with Keras to determine whether or not tic-tac-toe games have been won by player X for given endgame board configurations. Introductory neural network concerns are covered.
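
As a rough illustration of the classification target, the sketch below computes the "X has won" label with explicit rules and one-hot encodes a board as a network input. The board encoding (a 9-character string of `x`, `o` and `b` for blank, read row by row) and the function names are assumptions made for this example. The tutorial's own preprocessing may differ.

```python
# The eight winning lines on a 3x3 board, as indices into a row-major string.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def x_has_won(board):
    """Return True if player X occupies any complete line on the board."""
    return any(all(board[i] == "x" for i in line) for line in WIN_LINES)


def one_hot(board):
    """Encode each of the 9 cells as three 0/1 values (x, o, blank): 27 inputs."""
    symbols = "xob"
    return [1 if cell == s else 0 for cell in board for s in symbols]
```

A network trained on `one_hot(board)` inputs with `x_has_won(board)` labels has to learn this rule from examples alone, which is part of what makes the problem a compact test case.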


Mapping Fall Foliage with sf

Since there aren’t nearly enough sf and geom_sf examples out on the wild, wild #rstats web, here’s a short one that shows how to do basic sf operations, including how to plot sf objects in ggplot2 and animate a series of them with magick. I’m hoping someone riffs off of this to make an interactive version with Shiny. If you do, definitely drop a link+note in the comments!