Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop

Big data has become popular for processing, storing and managing massive volumes of data. The clustering of datasets has become a challenging issue in the field of big data analytics. The K-means algorithm is best suited for finding similarities between entities based on distance measures with small datasets. Existing clustering algorithms require scalable solutions to manage large datasets. This study presents two approaches to the clustering of large datasets using MapReduce. The first approach, K-Means Hadoop MapReduce (KM-HMR), focuses on the MapReduce implementation of standard K-means. The second approach, KM-I2C, enhances the quality of clusters to produce clusters with minimum intra-cluster and maximum inter-cluster distances for large datasets. The results of the proposed approaches show significant improvements in the efficiency of clustering in terms of execution times. Experiments conducted on standard K-means and proposed solutions show that the KM-I2C approach is both effective and efficient.
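To make the MapReduce formulation of standard K-means concrete, here is a minimal single-machine sketch of one iteration. It is not the paper's KM-HMR or KM-I2C code: the function names (map_point, reduce_cluster, kmeans_iteration) are hypothetical, and the grouping by key that Hadoop would do in its shuffle phase is imitated here with a dictionary.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centroids):
    """Map step (assumed form): emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step (assumed form): average all points that share one key."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map -> group-by-key -> reduce pass and return new centroids."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # A centroid that attracted no points is kept unchanged.
    return [
        reduce_cluster(groups[i]) if groups[i] else tuple(centroids[i])
        for i in range(len(centroids))
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
new_centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
```

In a real Hadoop job the map calls would run in parallel over input splits, and the driver would repeat the iteration until the centroids stop moving.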

This is how Netflix’s top-secret recommendation system works

More than 80 per cent of the TV shows people watch on Netflix are discovered through the platform’s recommendation system. That means the majority of what you decide to watch on Netflix is the result of decisions made by a mysterious, black box of an algorithm. Intrigued? Here’s how it works. Netflix uses machine learning and algorithms to help break viewers’ preconceived notions and find shows that they might not have initially chosen. To do this, it looks at nuanced threads within the content, rather than relying on broad genres to make its predictions. This explains how, for example, one in eight people who watch one of Netflix’s Marvel shows are completely new to comic book-based stuff on Netflix.

Distributed K-Means with R-Hadoop

In this article, an R-hadoop (with rmr2) implementation of Distributed KMeans Clustering will be described with a sample 2-d dataset.

A 10x developer is someone who is literally 10 times more productive than the average developer. A 10x developer not only produces more code per hour than the average developer, but also debugs like a boss, introduces fewer bugs because they test their code, mentors junior developers, writes their own documentation, and has many other broad skills that go beyond knowing how to code.

Hot data meets big data to make real-time, real-world decisions

“Hot data” is the most recent snapshot of the real world. It’s the real-time data continuously streaming in from IoT device sensors, user clickstreams, and mobile game-play activity. Hot data becomes big data when it comes to rest in a data warehouse, and that data warehouse is traditionally where data science happens. Machine learning models are typically trained on batches of big data at rest, but many operational use cases require hot data. If you are serving video ads to mobile gamers, supporting sales people walking into a meeting, or operating an oil drill, using the latest data is crucial for success. Machine learning models must be combined with real-time data to make many real-world decisions, and real-time data needs a real-time data infrastructure.

Further considerations of a hidden process underlying categorical responses

In my previous post, I described a continuous data generating process that can be used to generate discrete, categorical outcomes. In that post, I focused largely on binary outcomes and simple logistic regression just because things are always easier to follow when there are fewer moving parts. Here, I am going to focus on a situation where we have multiple outcomes, but with a slight twist – these groups of interest can be interpreted in an ordered way. This conceptual latent process can provide another perspective on the models that are typically applied to analyze these types of outcomes.
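As a rough illustration of the idea, the sketch below draws a continuous latent value and cuts it at fixed thresholds to obtain an ordered category. The logistic noise, the threshold values, and the function names (draw_logistic, ordered_outcome) are assumptions made for this example and are not taken from the post.

```python
import math
import random


def draw_logistic(rng, location=0.0):
    """Draw from a logistic distribution by inverting its CDF (assumed noise model)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() can return exactly 0.0
        u = rng.random()
    return location + math.log(u / (1.0 - u))


def ordered_outcome(latent, thresholds):
    """Return the 1-based ordered category for a latent value.

    thresholds must be sorted in ascending order; k thresholds give k + 1 categories.
    """
    category = 1
    for cut in thresholds:
        if latent > cut:
            category += 1
    return category


rng = random.Random(2017)
thresholds = [-1.0, 0.5]  # hypothetical cut points giving three ordered groups
sample = [ordered_outcome(draw_logistic(rng), thresholds) for _ in range(1000)]
```

Shifting the location of the latent distribution (for example, by a covariate effect) moves probability mass across all thresholds at once, which is what makes the categories interpretable as ordered.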

A guide to parallelism in R

In this post, I will talk about parallelism in R. This post will likely be biased towards the solutions I use. For example, I never use mclapply or clusterApply. I prefer to always use foreach. In this post, we will focus on how to parallelize R code on your computer. I will mainly use silly examples, just to show one point at a time.

Writing and Publishing my first R package

One of the themes at useR 2017 in Brussels was “Get involved”. People were encouraged to contribute to the community, even when they did not consider themselves R specialists (yet). This could be by writing a package or a blog post, but also by simply correcting typos through pull requests, or sending a tweet about a successful analysis. Bottom line: get your stuff out in the open. Share your work! I felt this urge to get involved already a year ago, at the useR conference in 2016. Hearing all these people speak about the great work they had done was really inspiring, and I wanted to be a part of it. I wanted to find out what it was like to develop a package. I knew that what I could do was only minor; I have a full-time job as a data scientist, and contributing to the community is not part of the job description. However, I could free up one day a week by working a few hours more on the other days of the week and I could do a little on the train to work every day. I was not sure if I was up to developing a full R package, but I could at least try.

Keras for R

We are excited to announce that the keras package is now available on CRAN. The package provides an R interface to Keras, a high-level neural networks API developed with a focus on enabling fast experimentation.