Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research

Given the recent explosion of interest in streaming data and online algorithms, clustering of time series subsequences, extracted via a sliding window, has received much attention. In this work we make a surprising claim. Clustering of time series subsequences is meaningless. More concretely, clusters extracted from these time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature. We can justify calling our claim surprising, since it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work. Although the primary contribution of our work is to draw attention to the fact that an apparent solution to an important problem is incorrect and should no longer be used, we also introduce a novel method which, based on the concept of time series motifs, is able to meaningfully cluster subsequences on some time series datasets.
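To make the setup concrete, here is a minimal R sketch (not the authors' code) of the procedure the paper critiques: subsequences are extracted with a sliding window and handed to k-means. The window length, number of clusters, and the synthetic AR(1) series are arbitrary assumptions for illustration.

```r
# Minimal sketch: sliding-window subsequence extraction plus k-means clustering.
# Window length w, cluster count k and the synthetic series are arbitrary choices.
set.seed(42)
ts_data <- as.numeric(arima.sim(model = list(ar = 0.8), n = 1000))  # synthetic series

w <- 32                                       # sliding window length
n <- length(ts_data) - w + 1                  # number of subsequences
subseqs <- t(sapply(1:n, function(i) ts_data[i:(i + w - 1)]))

# z-normalise each subsequence, as is common practice before clustering
subseqs <- t(apply(subseqs, 1, function(x) (x - mean(x)) / sd(x)))

fit <- kmeans(subseqs, centers = 3, nstart = 10)

# The paper's claim: these cluster centres tend toward smooth, near-sinusoidal
# shapes regardless of the input data, i.e. they carry little information about it.
matplot(t(fit$centers), type = "l", xlab = "offset within window", ylab = "value")
```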


Decoding Machine Learning Methods

Machine Learning, thinking systems, expert systems, knowledge engineering, decision systems, neural networks – loosely woven, near-synonymous terms in the evolving fabric of Artificial Intelligence. Of these, Machine Learning (ML) and Artificial Intelligence (AI) are the two most often debated and used interchangeably. Broadly speaking, AI refers to the futuristic goal of truly self-aware, learning machines, but for most practical purposes today we are dealing with ML. In abstract terms, ML is a structured approach for deriving meaningful predictions and insights from both structured and unstructured data. ML methods employ algorithms that learn from data, history and patterns. The field of data science continues to scale new heights, enabled by the exponential growth in computing power over the last decade. Data scientists are exploring new models and methods every day, and it can be hard just to keep pace with the trends. To keep matters simple, here is a clean starting point.


Data Science Best Practices An Enterprise’s Blue Print to Operationalize Analytics

In today’s interconnected world, a large amount of data is being continuously generated, compelling businesses to explore how they can gain actionable insights from these different data sets. Data analytics is a key element of a successful business. Today, most enterprises are striving to find new ways, tools and platforms to derive value from their business data in order to drive productivity, grow revenue and solve business problems such as churn and customer satisfaction. However, many organizations are still struggling to derive significant ROI from their analytics investments. The winning formula lies in an enterprise’s ability to identify specific business use cases and problems, and to form an implementation strategy to achieve those goals. This whitepaper explains the key pillars of success and best practices for analytics implementations.


Handling ‘Happy’ vs ‘Not Happy’: Better sentiment analysis with sentimentr in R

Sentiment analysis is one of the first things data analysts with unlabelled text data (no score, no rating) turn to when trying to extract insights from it, and it remains an active research area for NLP (Natural Language Processing) enthusiasts. For an analyst, though, it can be a pain in the neck, because most of the basic packages and libraries that handle sentiment analysis perform a simple dictionary lookup and calculate a final composite score based on the number of occurrences of positive and negative words. That often produces a lot of false positives, the most obvious case being ‘happy’ vs ‘not happy’ – negations, and valence shifters in general.
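As a quick, hedged illustration (assuming the sentimentr package is installed; the exact scores depend on its dictionaries and version), the example below scores a ‘happy’ vs a ‘not happy’ sentence, showing how valence shifters are taken into account:

```r
# sentimentr weighs valence shifters (negators, amplifiers, de-amplifiers)
# around each polarised word, so "not happy" scores negative rather than
# positive as a plain dictionary lookup would suggest.
library(sentimentr)

texts <- c("I am happy with the product.",
           "I am not happy with the product.")

sentiment_by(texts)   # average sentiment per element of the text vector
```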


Genetic Algorithms in Data Science

We often hear of optimization as a subject more relevant to Operations Research, but as a data science student I got an opportunity to explore Genetic Algorithms in the field of data mining and machine learning. It is interesting that Darwin’s theory of evolution describes an optimization process, and a great deal of research and bio-inspired computing work has been done on Genetic Algorithms. This blog discusses what a GA is, why it acts as an optimization process, and how it is used in data science.
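As a toy sketch of the idea (not code from the blog), the base-R snippet below runs a simple real-valued GA – selection, crossover, mutation – to maximise a one-dimensional test function; all parameter values are arbitrary assumptions.

```r
# Toy genetic algorithm: maximise f(x) = x * sin(10 * pi * x) + 1 on [0, 1].
fitness <- function(x) x * sin(10 * pi * x) + 1

pop_size <- 50; generations <- 100; mutation_sd <- 0.05
pop <- runif(pop_size)                                  # initial population

for (g in 1:generations) {
  fit <- sapply(pop, fitness)
  # selection: keep the better half (truncation selection)
  parents <- pop[order(fit, decreasing = TRUE)][1:(pop_size / 2)]
  # crossover: average two random parents to create each child
  children <- replicate(pop_size / 2, mean(sample(parents, 2)))
  # mutation: small Gaussian perturbation, clipped back to [0, 1]
  children <- pmin(pmax(children + rnorm(length(children), sd = mutation_sd), 0), 1)
  pop <- c(parents, children)
}

pop[which.max(sapply(pop, fitness))]   # best solution found
```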


Probabilistic Graphical Models Tutorial – Part 2

In the previous part of this probabilistic graphical models tutorial for the Statsbot team, we looked at the two types of graphical models, namely Bayesian networks and Markov networks. We also explored the problem setting, conditional independences, and an application to the Monty Hall problem. In this post, we will cover parameter estimation and inference, and look at another application.
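As a hedged taste of what parameter estimation means here (not an example from the tutorial itself): for discrete Bayesian networks, the maximum-likelihood estimate of a node’s conditional probability table is just the normalised conditional counts, which a few lines of base R can show.

```r
# Simulated data for a two-node network parent -> child; the variable names
# and probabilities are made up for illustration.
set.seed(1)
parent <- sample(c("rain", "dry"), 500, replace = TRUE, prob = c(0.3, 0.7))
child  <- ifelse(parent == "rain",
                 sample(c("wet", "not_wet"), 500, replace = TRUE, prob = c(0.9, 0.1)),
                 sample(c("wet", "not_wet"), 500, replace = TRUE, prob = c(0.1, 0.9)))

# Maximum-likelihood CPT estimate: P(child | parent) = N(parent, child) / N(parent)
prop.table(table(parent, child), margin = 1)
```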


Understanding Objective Functions in Neural Networks

This blog post is targeted towards people who have experience with machine learning and want to get a better intuition for the different objective functions used to train neural networks.
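For a concrete, if minimal and hedged, example of what “objective function” means in practice, here are two of the most common ones written out in base R: mean squared error for regression and binary cross-entropy for classification; the numbers are made up.

```r
mse <- function(y, y_hat) mean((y - y_hat)^2)
binary_cross_entropy <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

y_true <- c(1, 0, 1, 1)
p_hat  <- c(0.9, 0.2, 0.7, 0.6)

mse(y_true, p_hat)                   # treats all errors symmetrically
binary_cross_entropy(y_true, p_hat)  # penalises confident wrong predictions heavily
```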


Uncertainty in Deep Learning

I organised the already published results on how to obtain uncertainty in deep learning, and collected lots of bits and pieces of new research I had lying around (which I hadn’t had the time to publish yet). The questions I got about the work over the past year were a great help in guiding my writing, with the greatest influence on my writing, I reckon, being the work of Professor Sir David MacKay (and his thesis specifically). Weirdly enough, I would consider David’s writing style to be the equivalent of modern blogging, and would highly recommend reading his thesis. I attempted to follow David’s writing style in my own writing, explaining topics through examples and remarks, resulting in what almost looks like a 120-page-long blog post. So hopefully it can now be seen as a more complete body of work, accessible to as large an audience as possible, and also acting as an introduction to the field of what people today refer to as Bayesian Deep Learning. One of the interesting results I will demonstrate below touches on uncertainty visualisation in Bayesian neural networks. It’s something that almost looks trivial, yet it has gone unnoticed for quite some time! But before that, I’ll quickly review some of the new bits and pieces in the thesis for people already familiar with the work. For others I would suggest starting with the introduction: The Importance of Knowing What We Don’t Know.


Tips for A/B Testing with R

Which layout of an advertisement leads to more clicks? Would a different color or position of the purchase button lead to a higher conversion rate? Does a special offer really attract more customers – and which of two phrasings would be better? For a long time, people trusted their gut feeling to answer these questions. Today, all of these questions can be answered by conducting an A/B test. For this purpose, visitors of a website are randomly assigned to one of two groups, between which the target metric (e.g. click-through rate or conversion rate) can then be compared. Thanks to this randomization, the groups do not systematically differ in any other relevant dimension. This means: if your target metric takes a significantly higher value in one group, you can be quite sure that it is because of your treatment and not because of some other variable.
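As a minimal, hedged sketch of that final comparison step (the counts are invented, and this is not necessarily the approach the post takes), a two-sample proportion test in base R is one straightforward way to compare conversion rates between the two groups:

```r
# Compare conversion rates of two randomly assigned variants with prop.test().
conversions <- c(A = 120, B = 150)    # converting visitors per variant
visitors    <- c(A = 2400, B = 2300)  # total visitors per variant

prop.test(x = conversions, n = visitors)
# A small p-value suggests the difference in conversion rate between the
# variants is unlikely to be due to chance alone.
```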