Often, we need fast answers with limited resources. We have to make judgements in a world full of uncertainty. We can’t measure everything. We can’t run all the experiments we’d like. You may not have the resources to model a product or the impact of a decision. How do you find a balance between finding fast answers and finding correct answers? How do you minimize uncertainty with limited resources?
It always amazes me how I can hear a statement uttered in the space of a few seconds about some aspect of machine learning that then takes me countless hours to understand. I first heard about Gaussian Processes on an episode of the Talking Machines podcast and thought it sounded like a really neat idea. I promptly procured myself a copy of the classic text on the subject, Gaussian Processes for Machine Learning by Rasmussen and Williams, but my tenuous grasp on the Bayesian approach to machine learning meant I got stumped pretty quickly. That’s when I began the journey I described in my last post, From both sides now: the math of linear regression.
A curated list of 50+ awesome TensorFlow resources including tutorials, books, libraries, projects and more. If you know of any awesome TensorFlow resources that you think should be added to this list, please let me know in the comments section. And be sure to check out our other awesome lists of the best computer vision resources and free machine learning books.
Time series analysis has been around for ages. Even though it sometimes does not receive the attention it deserves in the current data science and big data hype, it is one of those problems almost every data scientist will encounter at some point in their career. Time series problems can actually be quite hard to solve, as you deal with a relatively small sample size most of the time. This usually means an increase in the uncertainty of your parameter estimates or model predictions.
The Matthews Correlation Coefficient (MCC) has a range of -1 to 1 where -1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Using the MCC allows one to gauge how well their classification model/function is performing. Another method for evaluating classifiers is known as the ROC curve.
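As a minimal sketch of how the MCC can be computed from the four confusion-matrix counts, the snippet below uses the standard definition, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name `mcc_from_counts` is a hypothetical helper chosen for illustration, and returning 0.0 when the denominator is zero is an assumed convention, not something the excerpt specifies.

```python
from math import sqrt


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Hypothetical helper for illustration only.
    """
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: if any marginal total is zero the MCC is
    # undefined, so report 0.0 (no better than chance).
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Every prediction correct
    print(mcc_from_counts(tp=5, tn=5, fp=0, fn=0))
    # Every prediction wrong
    print(mcc_from_counts(tp=0, tn=0, fp=5, fn=5))
```

The two calls at the bottom exercise the endpoints of the range described above: a classifier that is right on every example and one that is wrong on every example.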
fastText is a library for efficient learning of word representations and sentence classification.
In a previous post, we talked about how Elasticsearch approaches some of the fundamental challenges of a distributed system. In this post, we review aspects of Elasticsearch, such as near real-time search and the trade-offs it makes when calculating search relevance, that Insight Data Engineering Fellows have leveraged while building data platforms. Mainly, we will look at:
• Near real-time search
• Why deep pagination in distributed search can be dangerous
• Trade-offs in calculating search relevance
Everywhere you go these days, you hear about deep learning’s impressive advancements. New deep learning libraries, tools, and products get announced on a regular basis, making the average data scientist feel like they’re missing out if they don’t hop on the deep learning bandwagon. However, as Kamil Bartocha put it in his post The Inconvenient Truth About Data Science, 95% of tasks do not require deep learning. This is obviously a made-up number, but it’s probably an accurate representation of the everyday reality of many data scientists. This post discusses an often-overlooked area of study that is of much higher relevance to most data scientists than deep learning: causality.
Here are some notebooks I have made. You can click on a notebook title to view it in the browser, or on ‘(download)’ to get a copy that you can run and modify on your computer (assuming you have Jupyter installed).
I’m always on the lookout for ideas that can improve how I tackle data analysis projects. I particularly favor approaches that translate to tools I can use repeatedly. Most of the time, I find these tools on my own, by trial and error, or by consulting other practitioners. I also have an affinity for academics and academic research, and I often tweet about research papers that I come across and am intrigued by. Often, academic research results don’t immediately translate to what I do, but I recently came across ideas from several research projects that are worth sharing with a wider audience. The collection of ideas I’ve presented in this post addresses problems that come up frequently. In my mind, these ideas also reinforce the notion of data science as comprising data pipelines, not just machine learning algorithms. These ideas also have implications for engineers trying to build artificial intelligence (AI) applications.
The Facebook V: Predicting Check Ins data science competition, where the goal was to predict which place a person would like to check in to, has just ended. I participated with the goal of learning as much as possible and perhaps finishing in the top 10%, since this was my first serious Kaggle competition attempt. I managed to exceed all expectations and finished 1st out of 1212 participants! In this post, I’ll explain my approach, covering every step from the raw data to the winning submission.