I was talking to one of my friends who happens to be an operations manager at one of the Supermarket chains in India. Over our discussion, we started talking about the amount of preparation the store chain needs to do before the Indian festive season (Diwali) kicks in. He told me how critical it is for them to estimate / predict which product will sell like hot cakes and which would not prior to the purchase. A bad decision can leave your customers to look for offers and products in the competitor stores. The challenge does not finish there – you need to estimate the sales of products across a range of different categories for stores in varied locations and with consumers having different consumption techniques. While my friend was describing the challenge, the data scientist in me started smiling! Why? I just figured out a potential topic for my next article. In today’s article, I will tell you everything you need to know about regression models and how they can be used to solve prediction problems like the one mentioned above.
Among the many R packages, there is the outbreaks package. It contains datasets on epidemics, on of which is from the 2013 outbreak of influenza A H7N9 in China, as analysed by Kucharski et al (2014). I will be using their data as an example to test whether we can use Machine Learning algorithms for predicting disease outcome. To do so, I selected and extracted features from the raw data, including age, days between onset and outcome, gender, whether the patients were hospitalised, etc. Missing values were imputed and different model algorithms were used to predict outcome (death or recovery). The prediction accuracy, sensitivity and specificity. The thus prepared dataset was devided into training and testing subsets. The test subset contained all cases with an unknown outcome. Before I applied the models to the test data, I further split the training data into validation subsets. The tested modeling algorithms were similarly successful at predicting the outcomes of the validation data. To decide on final classifications, I compared predictions from all models and defined the outcome “Death” or “Recovery” as a function of all models, whereas classifications with a low prediction probability were flagged as “uncertain”. Accounting for this uncertainty led to a 100% correct classification of the validation test set. The training cases with unknown outcome were then classified based on the same algorithms. From 57 unknown cases, 14 were classified as “Recovery”, 10 as “Death” and 33 as uncertain.
This morning we announced the release of Ayasdi Envision – a framework for accelerating the development of intelligent applications based on our award winning machine intelligence platform. Envision does a lot of innovative things and while I won’t recount the press release here, I do want to expand on a few of them.
Machine learning gets a lot of buzz these days, usually in connection with big data and artificial intelligence (AI). But what exactly is it? Broadly speaking, machine learners are computer algorithms designed for pattern recognition, curve fitting, classification and clustering. The word learning in the term stems from the ability to learn from data. Machine learning is also widely used in data mining and predictive analytics, which some commentators loosely call big data. It also is used for consumer survey analytics and is not restricted to high-volume, high-velocity data or unstructured data and need not have any connection with AI. In fact, many methods marketing researchers are well acquainted with, such as regression and k-means clustering, are also frequently called machine learners. For examples, see Apache Spark’s machine learning library or the books I cite in the last section of this article. To keep things simple, I will refer to well-known statistical techniques like regression and factor analysis as older machine learners and methods such as artificial neural networks as newer machine learners since they are generally less familiar to marketing researchers.
Real-world data is almost always incomplete or innaccurate in some way. This means that the uncertain conclusions we draw from it are only meaningful if we can answer the question: how uncertain? One way to do this is using Bayesian inference. But, while Bayesian inference is conceptually simple, it can be analytically and computationally difficult in practice. Probabilistic programming is a paradigm that abstracts away some of this complexity. There are many probabilistic programming systems. Perhaps the most advanced is Stan, and the most accessible to non-statistician programmers is PyMC3. At Fast Forward Labs, we recently shared with our clients a detailed report on the technology and uses of probabilistic programming in startups and enterprises. But in this article, rather than use either of these advanced comprehensive systems, we’re going to build our own extremely simple system from from scratch.
The Windows edition of the Data Science Virtual Machine (DSVM), the all-in-one virtual machine image with a wide-collection of open-source and Microsoft data science tools, has been updated to the Windows Server 2016 platform. This update brings built-in support for Docker containers and GPU-based deep learning.
Recently I looked at fitting a smoother to a time-series using Bayesian modelling. Now I will look at how you can control the smoothness by using more or less informative priors on the precision (1/variance) of the random effect.
Analytic administrator is a role that data scientists assume when they onboard new tools, deploy solutions, support existing standards, or train other data scientists. It is a role that works closely with IT to maintain, upgrade, and scale analytic environments. Analytic admins have a multiplier effect – as they go about their work, they influence others in the organization to be more effective. If you are a data scientist using R, you might consider filling the role of analytic admin for your organization. Consider the data scientist who wants to make R a legitimate part of their organization. This person has to introduce a new technology and help IT build the architecture around it. In this role, the data scientist – acting as an analytic admin – influences their entire organization.
In this episode of the Data Show, I spoke with Michael Freedman, CTO of Timescale and professor of computer science at Princeton University. When I first heard that Freedman and his collaborators were building a time-series database, my immediate reaction was: “Don’t we have enough options already?” The early incarnation of Timescale was a startup focused on IoT, and it was while building tools for the IoT problem space that Freedman and the rest of the Timescale team came to realize that the database they needed wasn’t available (at least out in open source). Specifically, they wanted a database that could easily support complex queries and the sort of real-time applications many have come to associate with streaming platforms. Based on early reactions to TimescaleDB, many users concur.