Cool Datasets

A place to find cool datasets.

Algorithms Tour – How data science is woven into the fabric of Stitch Fix

At Stitch Fix, we’re transforming the way people find what they love. Our clients want the perfect clothes for their individual preferences—yet without the burden of search or having to keep up with current trends. Our merchandise is curated from the market and augmented with our own designs to fill in the gaps. It’s kept current and extremely vast and diverse—ensuring something for everyone. Rich data on both sides of this ‘market’ enables Stitch Fix to be a matchmaker, connecting clients with styles they love (and never would’ve found on their own).

Graph Databases and the Connected Enterprise

In this special guest feature, Emil Eifrem, Founder and CEO of Neo Technology, suggests that in order to achieve connected enterprise status and realize the significance of the graph database, companies must understand the database options that are available. Right now, the landscape consists of three categories, which he outlines below, including where he sees their growth in the next few years. Emil sketched what today is known as the property graph model on a flight to Mumbai in 2000. As the CEO of Neo Technology, co-founder of Neo4j and a co-author of the O’Reilly book Graph Databases, he’s devoted his professional life to building and evangelizing graph databases. Committed to sustainable open source, Emil guides Neo along a balanced path between free availability and commercial reliability. He plans to save the world with graphs and own Larry’s yacht by the end of the decade.

Visualizing Time-Series Change

Time-series data visualizations are everywhere. While these charts are understood amongst individuals of all professions, effectively communicating change over time can present unexpected challenges. When creating any type of visualization, it is important to first determine the message you would like to communicate. The increased popularity of exploratory data visualization tools such as Tableau and Microsoft Power BI makes it easy to forget this step. These tools provide users with the ability to connect to databases and click around until they find the prettiest visualization. Unfortunately, the exploratory nature of these tools can often lead to ineffective visualizations with no explicit purpose.

It seems dplyr is overtaking correlation heatmaps

For a long time, my correlation heatmap with ggplot2 was the most viewed post on this blog. It still leads the overall top list, but by far the most searched and visited post nowadays is this one about dplyr (followed by its sibling about plyr).

Hierarchical Clustering Nearest Neighbors Algorithm in R

Hierarchical clustering is a widely used and popular tool in statistics and data mining for grouping data into ‘clusters’ that expose similarities or dissimilarities in the data. There are many approaches to hierarchical clustering, as it is not possible to investigate all clustering possibilities. One set of approaches to hierarchical clustering is known as agglomerative, whereby in each step of the clustering process an observation or cluster is merged into another cluster. The first approach we will explore is known as the single linkage method, also known as nearest neighbors.
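
To make the agglomerative idea concrete, here is a minimal sketch of single linkage. It is written in plain Python rather than R so that it needs no packages, and it is not the linked post's own code. The function name `single_linkage`, the `num_clusters` stopping rule, and the sample points are all assumptions made for illustration. At each step it merges the two clusters whose closest members are nearest to each other, measured by Euclidean distance.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def single_linkage(points, num_clusters):
    """Agglomerative clustering with single linkage (hypothetical helper).

    Returns the final clusters (lists of point indices) and the merge
    history as (cluster_a, cluster_b, linkage_distance) tuples.
    """
    # Start with every observation in its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        # Merge cluster b into cluster a; a < b, so deleting b is safe.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


# Illustrative data only: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = single_linkage(points, 2)
print(clusters)
```

With these sample points, the two tight pairs merge first, then the pairs merge with each other, leaving the distant point in a cluster of its own. The brute-force search over all cluster pairs keeps the sketch short but scales poorly; for real data, a library routine (for example `hclust(d, method = "single")` in R) would be the usual choice.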