22 must-watch talks on Python for Deep Learning, Machine Learning & Data Science (from PyData 2017, Amsterdam)

Python is increasingly popular among machine learning and data science communities across the world, and for good reason. It has arguably the most developed ecosystem for deep learning, a collection of excellent libraries such as pandas and scikit-learn, and a great community. PyData is a community of developers and users of open source data tools, and it also runs several conferences. I recently came across some amazing talks from PyData Amsterdam 2017. Even though I wanted to attend the conference, it was difficult for me to travel. Thankfully, PyData released all the videos on their YouTube channel. The spread of the talks is impressive: whether you are a novice, intermediate, or expert Python user, PyData had something for you. To help the community, I have summarized the best talks from a data science perspective in this article, with a short summary of each video for your convenience. The videos are grouped into four categories: Deep Learning, Big Data, Data Science, and Natural Language Processing.

For data scientists, the big money is in open source

Big data means big compensation for data scientists. But the kind of data scientist you are largely determines just how big your paycheck will be. As a new O’Reilly survey reveals, data scientists who focus on open source technologies make more money than those still working with proprietary technologies. The more open source software you know, the more money you stand to make in big data.

Call Detail Record Analysis – K-means Clustering with R

A Call Detail Record (CDR) is the information captured by telecom companies during the call, SMS, and Internet activity of a customer. Combined with customer demographics, this information provides deeper insights into a customer’s needs. Most telecom companies use CDR information for fraud detection by clustering user profiles, for reducing customer churn based on usage activity, and for targeting profitable customers via RFM analysis. In this blog, we discuss clustering customer activity over the 24 hours of the day using the unsupervised K-means algorithm. The goal is to understand customer segments with respect to their usage by hour. For example, a segment with high overall activity may generate more revenue, while a segment with high activity in the night hours might contain fraudulent accounts.
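The linked post works in R, but the same idea can be sketched in Python with scikit-learn. A minimal sketch, assuming scikit-learn is installed; the hourly activity data below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic hourly profiles (hypothetical data): each row is one customer's
# call counts for the 24 hours of a day.
day_users = rng.poisson(lam=[1] * 8 + [8] * 12 + [2] * 4, size=(50, 24))    # daytime-heavy
night_users = rng.poisson(lam=[9] * 6 + [1] * 14 + [7] * 4, size=(50, 24))  # night-heavy
profiles = np.vstack([day_users, night_users]).astype(float)

# Cluster the 24-hour usage vectors; k=2 here just to separate the two patterns.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)

# Interpret each segment by its average activity per hour: a cluster whose
# peak falls in the night hours may warrant a closer fraud review.
for label in range(2):
    centroid = km.cluster_centers_[label]
    print(f"cluster {label}: peak hour = {centroid.argmax()}")
```

In practice you would choose k by inspecting something like the elbow curve or silhouette scores rather than fixing it up front.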

15 Timeless Data Science Articles

This is our first post of a new series featuring articles published long ago. We manually selected articles that were most popular or overlooked and time-insensitive (for instance, we eliminated articles about data science products, because software packages and platforms have evolved so much over the last few years), and we kept only articles that still make sense and are useful today. The first one below has been kept for historical reasons: it was published in February 2008, four years before DSC was even created, and it is our very first article – still alive on Analyticbridge today!

SQL for Data Analysis – Tutorial for Beginners – ep1

SQL is a must if you want to be a Data Analyst or a Data Scientist. I have worked with many online businesses in the last few years – from five-person startups to multinational companies with 5,000+ employees – and I haven’t seen a single one that did not use SQL for data analysis (and for many other things) in some way.
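To make the point concrete, here is the kind of one-query analysis the tutorial is about, run against Python’s built-in sqlite3 module; the `orders` table and its rows are hypothetical:

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 10.0), ("alice", 20.0), ("carol", 5.0)],
)

# A typical analysis query: total revenue per customer, highest first.
query = """
    SELECT customer, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer
    ORDER BY revenue DESC
"""
for customer, revenue in conn.execute(query):
    print(customer, revenue)
```

The same `GROUP BY` / aggregate pattern carries over unchanged to PostgreSQL, MySQL, and the other engines the tutorial series covers.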

Top 15 Python Libraries for Data Science in 2017

1. NumPy
2. SciPy
3. Pandas
4. Matplotlib
5. Seaborn
6. Bokeh
7. Plotly
8. SciKit-Learn
9. Theano
10. TensorFlow
11. Keras
12. NLTK
13. Gensim
14. Scrapy
15. Statsmodels

Kullback-Leibler Divergence Explained

In this post we’re going to take a look at a way of comparing two probability distributions, called the Kullback-Leibler divergence (often shortened to just KL divergence). Very often in probability and statistics we replace observed data or a complex distribution with a simpler, approximating distribution. KL divergence helps us measure just how much information we lose when we choose that approximation.
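For discrete distributions, the quantity in question is D_KL(P‖Q) = Σᵢ pᵢ log(pᵢ/qᵢ). A minimal sketch in plain Python, with made-up numbers for the two distributions:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats.

    Terms with p_i == 0 contribute 0; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Approximating a skewed distribution with a uniform one (illustrative numbers):
p = [0.7, 0.2, 0.1]           # observed distribution
q = [1 / 3, 1 / 3, 1 / 3]     # uniform approximation
print(kl_divergence(p, q))    # ~0.30 nats
```

Note that KL divergence is not symmetric: D_KL(P‖Q) and D_KL(Q‖P) generally differ, so it is not a distance metric in the mathematical sense.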

Dstl Satellite Imagery Competition, 3rd Place Winners’ Interview: Vladimir & Sergey

In their satellite imagery competition, the Defence Science and Technology Laboratory (Dstl) challenged Kagglers to apply novel techniques to ‘train an eye in the sky’. From December 2016 to March 2017, 419 teams competed in this image segmentation challenge to detect and label 10 classes of objects including waterways, vehicles, and buildings. In this winners’ interview, Vladimir and Sergey provide detailed insight into their 3rd place solution.

The Guerrilla Guide to Machine Learning with R

This post is a lean look at learning machine learning with R. It is a complete, if very short, course for the quick-study hacker with no time (or patience) to spare.

Shiny Applications Layouts Exercises (Part-6)

In the sixth part of our journey through Shiny app layouts, we will meet absolutely-positioned panels. These are panels that you can place anywhere in the interface and, optionally, make draggable. Moreover, you can put anything in them, including inputs and outputs.

Load a Python/pandas data frame from an HDF5 file into R

The title is self-descriptive, so I will not dwell on the issue at length before showing the code. Just a small note: to my knowledge, there is only one public snippet out there that addresses this particular problem. It uses the Bioconductor package rhdf5, and you can find it here. The main problem is that it only works when the HDF5 file contains a single data frame, which is not very useful.

Studying CRAN package names

Choosing a name for a CRAN package is an intimate process. Out of an infinite range of possibilities, an idea for a package comes to you, and you spend at least a couple of days writing and testing your code before submitting it to CRAN. Once you set the name of the package, you cannot change it. Your choice indexes your effort, and it shouldn’t be a surprise that the name of a package can affect its impact.