One of the most common questions we get on Analytics Vidhya is: ‘How much maths do I need to learn to be a data scientist?’ Even though the question sounds simple, there is no simple answer to it. Usually, we say that you need to know basic descriptive and inferential statistics to start. That is a good place to start. But once you have covered the basic concepts in machine learning, you will need to learn some more math. You need it to understand how these algorithms work, what their limitations are, and whether they make any underlying assumptions. Now, there could be a lot of areas to study, including algebra, calculus, statistics, 3-D geometry, etc. If you get confused (like I did) and ask experts what you should learn at this stage, most of them would suggest that you go ahead with Linear Algebra. But the problem does not stop there. The next challenge is to figure out how to learn Linear Algebra. You can get lost in the detailed mathematics and derivations, and learning them would not help as much! I went through that journey myself and hence decided to write this comprehensive guide. If you have faced this question about how to learn and what to learn in Linear Algebra, you are at the right place. Just follow this guide.
Our Two Sigma Financial Modeling Challenge ran from December 2016 to March 2017. Asked to search for signal in financial markets data with limited hardware and computational time, over 2000 competitors took part. In this winners’ interview, 2nd place winners Nima and Chahhou describe how paying close attention to unreliable engineered features was important to building a successful model.
It’s been a while since my last post on some TB WHO data. A lot has happened since then, including the opportunity to attend the Open Data Science Conference (ODSC) East held in Boston, MA. Over a two-day period I had the opportunity to listen to a number of leaders in various industries and fields. It was inspiring to learn about the wide variety of data science applications ranging from finance and marketing to genomics and even the refugee crisis. One of the workshops at ODSC was on text analytics, which includes basic text processing, dendrograms, natural language processing and sentiment analysis. This gave me the idea of applying some text analytics to visualize some data I was working on last summer. In this post I’m going to walk through how I used regular expressions to label classification codes in a large dataset (NHAMCS) representing emergency department visits in the United States and eventually visualize the data.
As described in Cormen et al. (2009, p. 65), the divide-and-conquer paradigm in algorithm design takes a recursive approach in which:
• The main problem is divided into smaller sub-problems (divide),
• The sub-problems are solved (conquer),
• And the solutions to the sub-problems are combined to solve the original, “bigger” problem (combine).
Instead of constructing an indefinite number of nested loops, which hurts both the readability of the code and the performance of execution, the “recursive” way uses just one block of code that calls itself (hence the term “recursive”) on the smaller problem. The main point is to define a “stop” rule, so that the function does not sink into infinite recursion. While nested loops modify the same object (or address space, in the low-level sense), recursion moves the “stack pointer”, so each recursion depth uses a different part of the stack (a copy of the objects is created for each recursive call). This illustrates a well-known trade-off in algorithm design, memory versus performance: recursion enhances performance at the expense of using more memory.
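The divide, conquer and combine steps can be sketched with merge sort. This is only an illustration: merge sort is chosen here as a familiar example and is not named in the text, and the function name `merge_sort` is hypothetical.

```python
def merge_sort(items):
    """Return a new sorted list using divide-and-conquer (merge sort)."""
    # "Stop" rule: a list with 0 or 1 elements is already sorted,
    # so the recursion ends here instead of going on forever.
    if len(items) <= 1:
        return list(items)

    # Divide: split the problem into two smaller sub-problems.
    mid = len(items) // 2

    # Conquer: the same block of code calls itself on each half.
    # Each call works on its own copy (the slices), which is where
    # the extra memory use comes from.
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Combine: merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort([5, 2, 9, 1, 5, 6]))
```

Note that the input list is left untouched: every recursion level builds new lists, which mirrors the memory cost described above.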
This is a keynote highlight from the Strata Data Conference in London 2017.
In this episode of the Data Show, I spoke with Jeremy Stanley, VP of data science at Instacart, a popular grocery delivery service that is expanding rapidly. As Stanley describes it, Instacart operates a four-sided marketplace comprised of retail stores, products within the stores, shoppers assigned to the stores, and customers who order from Instacart. The objective is to get fresh groceries from popular retailers delivered to customers in a timely fashion. Instacart’s goals land them in the center of the many opportunities and challenges involved in building high-impact data products.
This is a highlight from Ted Malaska’s Introduction to Apache Spark for Java and Scala developers.
DataScience.com’s new Python library, Skater, uses a combination of model interpretation algorithms to identify how models leverage data to make predictions.
Data science and machine learning are iterative processes. It is never possible to successfully complete a data science project in a single pass. A data scientist constantly tries new ideas and changes steps of their pipeline:
1. extract new features and accidentally find noise in the data
2. clean up the noise, find one more promising feature
3. extract the new feature
4. rebuild and validate the model, realize that the learning algorithm parameters are not perfect for the new feature set
5. change machine learning algorithm parameters and retrain the model
6. find the ineffective feature subset and remove it from the feature set
7. try a few more new features
8. try another ML algorithm. And then a data format change is required.
This is only a small episode in a data scientist’s daily life and it is what makes our job different from a regular engineering job.
The machine learning revolution leaves no stone unturned. Natural language processing is yet another field that underwent a small revolution thanks to the second coming of artificial neural networks. Let’s just briefly discuss two advances in the natural language processing toolbox made thanks to artificial neural networks and deep learning techniques.
Data science is an interdisciplinary field where scientific techniques from statistics, mathematics, and computer science are used to analyze data and solve problems more accurately and effectively. It is no wonder, then, that languages such as R and Python, with their extensive packages and libraries that support statistical methods and machine learning algorithms, are cornerstones of the data science revolution. Beginners often find it hard to decide which language to learn first. This guide will help you make that decision.
The simplest solutions are usually the most powerful ones, and Naive Bayes is good proof of that. In spite of the great advances in Machine Learning in recent years, it has proven to be not only simple but also fast, accurate and reliable. It has been successfully used for many purposes, but it works particularly well with natural language processing (NLP) problems. Naive Bayes is a family of probabilistic algorithms that take advantage of probability theory and Bayes’ Theorem to predict the category of a sample (like a piece of news or a customer review). They are probabilistic, which means that they calculate the probability of each category for a given sample, and then output the category with the highest probability. They get these probabilities by using Bayes’ Theorem, which describes the probability of a feature, based on prior knowledge of conditions that might be related to that feature. We’re going to be working with an algorithm called Multinomial Naive Bayes. We’ll walk through the algorithm applied to NLP with an example, so by the end you will know not only how this method works, but also why it works. Then, we’ll lay out a few advanced techniques that can make Naive Bayes competitive with more complex Machine Learning algorithms, such as SVM and neural networks.
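To make the "probability of each category, then pick the highest" idea concrete, here is a minimal sketch of Multinomial Naive Bayes for text in plain Python. It is not the post's own code. The function names `train` and `predict`, the whitespace tokenizer, the add-one (Laplace) smoothing and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():  # assumed tokenizer: split on spaces
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest (log) posterior score for text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # Prior: share of training documents carrying this label.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Likelihood with add-one smoothing (an assumption), so a word
            # unseen for this label does not zero out the whole product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, for illustration only.
docs = [
    ("a great game", "sports"),
    ("the election was over", "not sports"),
    ("very clean match", "sports"),
    ("a clean but forgettable game", "sports"),
    ("it was a close election", "not sports"),
]
model = train(docs)
print(predict(model, "a very close game"))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer texts.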
Welcome to Part 2 of our tour through modern machine learning algorithms. In this part, we’ll cover methods for Dimensionality Reduction, further broken into Feature Selection and Feature Extraction. In general, these tasks are rarely performed in isolation. Instead, they’re often preprocessing steps to support other tasks.
Brian Hopkins of Forrester Research recently penned an excellent blog post about why companies are getting disrupted and why they realize it so late. The post draws from Ray Kurzweil’s Law Of Accelerating Returns and speaks to the fact that the human brain doesn’t do well with exponential growth.
This post outlines an entire 6-part tutorial series on the MXNet deep learning library and its Python API. In-depth and descriptive, this is a great guide for anyone looking to start leveraging this powerful neural network library.
If you build a model and never update it you’re missing a trick. Behaviours change so your model will tend to perform worse over time. You’ve got to regularly refresh it, whether that’s adjusting the existing model to fit the latest data (recalibration) or building a whole new model (retraining), but this means you’ve got new versions of your model that you have to handle. You need to think about your methodology for versioning R model objects, ideally before you lose any versions. You could store models with ye olde YYYYMMDD style of versioning but that means regularly changing your code to use the latest model version. I’m too lazy for that! If we’re storing our R model objects in SQL Server then we can utilise another SQL Server capability, temporal tables, to take the pain out of versioning and make it super simple. Temporal tables will track changes automatically so you would overwrite the previous model with the new one and it would keep a copy of the old one automagically in a history table. You get to always use the latest version via the main table but you can then write temporal queries to extract any version of the model that’s ever been implemented. Super neat! For some of you, if you’re not interested in the technical details you can drop off now with the knowledge that you can store your models in a non-destructive but easy to use way in SQL Server if you need to. If you want to see how it’s done, read on!
In this special guest feature, Irshad Raihan, Product Marketing Manager at Red Hat Storage, discusses how organizations can save money and realize greater flexibility by moving data with lower business value to a more affordable storage solution. Irshad Raihan is a product manager at Red Hat Storage, responsible for product strategy, messaging, and go-to-market activities. Previously, he held senior product marketing and product management positions at HP and IBM, where he was responsible for big data and data management products. Irshad holds a Master’s in Computer Science from Clemson University and an MBA from Carnegie Mellon University.
This is a short blog post to introduce the concept of an ontology for those who are unfamiliar with the term, or who have previously encountered explanations that make little or no sense, as I have. I’m aiming to “democratise knowledge of this topic” as one of my colleagues put it.