t-Distributed Stochastic Neighbor Embedding (t-SNE) is a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The technique can be implemented via Barnes-Hut approximations, allowing it to be applied to large real-world datasets. We have applied it to data sets with up to 30 million examples.
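
As a rough illustration of the idea, here is a minimal sketch that embeds high-dimensional points into two dimensions with the Barnes-Hut variant of t-SNE. It assumes scikit-learn and NumPy are installed, which the text above does not mention; the synthetic two-cluster data and the parameter values are made up for the example.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic data (an assumption for illustration): two Gaussian
# clusters of 100 points each in a 50-dimensional space.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(100, 50))
cluster_b = rng.normal(loc=8.0, scale=1.0, size=(100, 50))
X = np.vstack([cluster_a, cluster_b])

# method="barnes_hut" selects the approximate gradient computation;
# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, method="barnes_hut", perplexity=30.0, random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # one 2-D point per input row
```

Each row of `embedding` can then be drawn on a scatter plot. Results vary with `perplexity` and the random seed, so it is worth trying a few settings before reading much into a picture.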

How to read most commonly used file formats in Data Science (using Python)?

If you have worked in the data industry, you know the challenge of working with different data types. Different formats, different compression, different parsing on different systems: you could quickly be pulling your hair out! Oh, and I have not even mentioned unstructured or semi-structured data yet. For any data scientist or data engineer, dealing with different formats can become a tedious task. In the real world, people rarely get neat tabular data. Thus, it is essential for any data scientist (or data engineer) to be aware of the different file formats, the common challenges in handling them, and the best and most efficient ways to handle this data in real life. This article covers the common formats a data scientist or data engineer must be aware of. I will first introduce you to the common file formats used in the industry. Later, we’ll see how to read these file formats in Python.
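
To give a flavor of what reading these formats looks like, here is a small sketch using only the Python standard library. The two helper functions and the sample data are hypothetical and are not taken from the article. CSV and JSON stand in for the wider list of formats it covers.

```python
import csv
import io
import json

# Made-up sample content; in practice these would be files on disk.
csv_text = "name,age\nAda,36\nGrace,45\n"
json_text = '{"name": "Ada", "languages": ["Python", "R"]}'


def read_csv_records(handle):
    """Return the rows of a CSV file as dictionaries keyed by the header."""
    return list(csv.DictReader(handle))


def read_json_document(handle):
    """Parse a JSON file into nested Python dicts and lists."""
    return json.load(handle)


# io.StringIO stands in for open("some_file.csv", newline="").
csv_rows = read_csv_records(io.StringIO(csv_text))
json_doc = read_json_document(io.StringIO(json_text))

print(csv_rows[0]["name"], json_doc["languages"])
```

Note that the csv module returns every value as a string, so numeric columns such as `age` need an explicit conversion, while JSON keeps its own types.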

Visualizing Commute Disruptions in New York City

Alphabet subsidiary Sidewalk Labs has partnered with the New York City transportation advocacy group Transportation Alternatives to create an interactive map that allows New Yorkers to predict how plans to close the L train will affect their daily commutes. Damage stemming from Hurricane Sandy will force the interborough train to shut down for 18 months beginning in 2019, affecting a quarter of a million daily riders. The map uses real-time data from New York City's Metropolitan Transportation Authority on all bus and subway lines, as well as the Staten Island Ferry, to allow users to identify how long it will take to get between any two points in the city. Users can enter their start and end points, their preferred mode of transportation, and their time constraints to find the best possible route.

Using MongoDB with R

MongoDB is a NoSQL database program that uses JSON-like documents with schemas. It is a free and open-source, cross-platform database. MongoDB, the top NoSQL database engine in use today, could be a good data storage alternative when analyzing large volumes of data. To use MongoDB with R, we first have to download and install MongoDB.


Bowtie is a library for writing dashboards in Python. No need to know web frameworks or JavaScript, focus on building functionality in Python. Interactively explore your data in new ways! Deploy and share with others!

How the General Data Protection Regulation (GDPR) Expands Privacy Data Scope and Provides New Rights of Data Control to Customers

This is the second contributed article in our series of GDPR-related posts. In the first post we discussed the extra-territorial reach of the regulation and why U.S. companies need to understand GDPR. The final blog post will discuss GDPR’s approach to accountability and data security.

Zaloni Continues to Redefine the Data Lake

With the latest release of its Bedrock Data Lake Management Platform and its Mica Self-service Data Platform, Zaloni continues to establish itself as a leader in the space by pushing the boundaries of what defines a “data lake,” expanding beyond Hadoop to encompass a more holistic, enterprise-wide approach. Zaloni’s vision is a “logical” data lake architecture versus a physical one, which gives companies transparency into all of their data regardless of its location, enables application of enterprise-wide governance capabilities, and allows for expanded, controlled access for self-serve business users across the organization.

I ranked every Intro to Data Science course on the internet, based on thousands of data points

A year ago, I dropped out of one of the best computer science programs in Canada. I started creating my own data science master’s program using online resources. I realized that I could learn everything I needed through edX, Coursera, and Udacity instead. And I could learn it faster, more efficiently, and for a fraction of the cost. I’m almost finished now. I’ve taken many data science-related courses and audited portions of many more. I know the options out there, and what skills are needed for learners preparing for a data analyst or data scientist role. A few months ago, I started creating a review-driven guide that recommends the best courses for each subject within data science. For the first guide in the series, I recommended a few coding classes for the beginner data scientist. Then it was statistics and probability classes.

Giving a Thematic Touch to your Interactive Chart

Usually (mainly at work) I make a chart, and when I present it nobody cares about the style: whether the chart comes from an Excel spreadsheet, Paint, or an interactive chart, or about the colors, labels, fonts, and other things I like to care about. That’s sad for me, but it’s fine: the data/story behind the chart and how you present it are what matter. And surely I’m overreacting. But hey! That doesn’t mean you must always make clean charts or Tufte-style plots. Sometimes you can play with the topic of your chart and give it a thematic touch.

Predicting the length of a hospital stay, with R

I haven’t been admitted to hospital many times in my life, but every time the only thing I really cared about was: when am I going to get out? It’s also a question that weighs heavily on hospital managers: by knowing ahead of time how long each patient’s stay is likely to be, they can better manage facilities and staff, and know whether the hospital is likely to reach maximum capacity in the near future.