Neural networks in retail industry

Neural networks have wide applications in industry. This post is an attempt to explain intuitively one application of word2vec in the retail industry. Natural language processing is an exciting field: quite a few new algorithms are being developed, resulting in innovative ways of solving traditional problems. One problem researchers have been working on is identifying words similar to a given word. With that capability, we can tell whether two sentences refer to a similar context, and perform a variety of other tasks.
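As a minimal sketch of the "find similar words" idea, here is a nearest-neighbour lookup over word vectors using cosine similarity. The words and vector values below are made up purely for illustration; in practice the vectors would come from a trained word2vec model.

```python
import numpy as np

# Toy word vectors (hypothetical values; a real word2vec model would supply these).
vectors = {
    "shirt":  np.array([0.9, 0.8, 0.1]),
    "jacket": np.array([0.85, 0.75, 0.2]),
    "banana": np.array([0.1, 0.2, 0.9]),
    "apple":  np.array([0.15, 0.1, 0.95]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, k=1):
    """Return the k words whose vectors are closest (by cosine) to `word`."""
    target = vectors[word]
    scores = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

print(most_similar("shirt"))   # "jacket" should rank first
```

The same lookup is what lets a retail application treat "shirt" and "jacket" as related products while keeping "banana" far away.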

Breakthrough Argyle Data – Application Successfully Predicts Mobile Subscriber Creditworthiness in Multiple Operator Trials

AI/machine learning company Argyle Data™ has successfully concluded a series of trials with European and Latin American operators, using new algorithms and neural network architectures that analyze real carrier data to accurately predict subscribers’ intention and ability to pay their monthly service bills.

Compacting your Shared Libraries

Welcome to the ninth post in the recognisably rancid R randomness series, or R4 for short. Following on the heels of last week’s post, we look into the shared libraries created by R. We love the R build process. It is robust, cross-platform, reliable and rather predictable. It. Just. Works. One minor issue, though, which has come up once or twice in the past, is the (in)ability to fully control all compilation options. R will always recall the CFLAGS, CXXFLAGS, etc. used when it was compiled. These often include the -g flag for debugging, which can seriously inflate the size of the generated object code. And once stored in ${R_HOME}/etc/Makeconf, we cannot override these values on the fly. But there is always a way. Sometimes even two.

Reproducibility: A cautionary tale from data journalism

Timo Grossenbacher, data journalist with Swiss Radio and TV in Zurich, had a bit of a surprise when he attempted to recreate the results of one of the R Markdown scripts published by SRF Data to accompany their data journalism story about vested interests of Swiss members of parliament. Upon re-running the analysis in R last week, the results differed from those published in August 2015. There was no change to the R scripts or data in the intervening two-year period, so what caused the results to differ?

Big Data Solutions: A/B t-test

@drsimonj here to share my code for using Welch’s t-test to compare group means using summary statistics.
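The original post works from summary statistics in R; the same idea can be sketched in Python with SciPy's `ttest_ind_from_stats`, which accepts group means, standard deviations, and sizes directly. The numbers below are made up for illustration.

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical summary statistics for two experimental groups.
t, p = ttest_ind_from_stats(
    mean1=21.0, std1=4.5, nobs1=50,   # group A
    mean2=19.2, std2=6.1, nobs2=65,   # group B
    equal_var=False,                  # equal_var=False gives Welch's t-test
)
print(f"t = {t:.3f}, p = {p:.3f}")
```

Setting `equal_var=False` is what distinguishes Welch's test from Student's t-test: it does not assume the two groups share a common variance.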

A Stan case study, sort of: The probability my son will be stung by a bumblebee

The Stan project for statistical computation has a great collection of curated case studies which anybody can contribute to, maybe even me, I was thinking. But I don’t have time to worry about that right now because I’m on vacation, on the yearly visit to my old family home in the north of Sweden. What I do worry about is that my son will be stung by a bumblebee. His name is Torsten, he’s almost two years old, and he loves running around barefoot on the big lawn. Which has its fair share of bumblebees. Maybe I should put shoes on him so he won’t step on one, but what are the chances, really? Well, what are the chances?
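A back-of-the-envelope version of that question can be sketched without Stan: put a Beta prior on the daily probability of a sting, update it with observed data, and simulate the rest of the summer. The counts below (30 barefoot days, 0 stings, 60 days remaining) are hypothetical, not from the post.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 30 barefoot lawn-days so far, 0 stings observed.
stings, days = 0, 30

# Beta(1, 1) prior on the daily sting probability; the conjugate update
# gives a Beta(1 + stings, 1 + days - stings) posterior.
posterior = rng.beta(1 + stings, 1 + days - stings, size=10_000)

# Monte Carlo estimate of P(at least one sting) over 60 more days.
p_at_least_one = np.mean(1 - (1 - posterior) ** 60)
print(f"P(at least one sting this summer) ~ {p_at_least_one:.2f}")
```

The closed-form answer here is 1 - 31/91 ≈ 0.66, so even with zero stings so far, the posterior leaves a substantial chance of one over a long summer; the simulation should land close to that.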

Treating your data: The old school vs tidyverse modern tools

When I first started using R there was no such thing as the tidyverse. Although some of the tidyverse packages were available independently, I learned to treat my data mostly by brute force, combining pieces of information from several sources. It is very interesting to compare this old school programming with tidyverse code written using the magrittr package. Even if you want to stay old school, the tidyverse is here to stay, and it is the first tool taught in many R-based data science courses. My objective is to show a very simple example comparing the two ways of writing. There are several ways to do what I propose here, but I think this example is enough to capture the main differences between old school code and magrittr plus the tidyverse. Magrittr is not new, but it seems to me that it is more popular now because of the tidyverse.
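The post itself is about R, but the old-school-versus-pipeline contrast translates to Python: intermediate variables at every step versus one readable chain, the pandas analogue of a magrittr pipe. The toy data below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "sales": [10, 12, 7, 30],
})

# "Old school": a temporary variable for every step.
tmp = df[df["sales"] > 8]
tmp = tmp.groupby("store", as_index=False)["sales"].mean()
old = tmp.sort_values("sales")

# Pipeline style: the same steps as one method chain, read top to bottom.
new = (
    df[df["sales"] > 8]
    .groupby("store", as_index=False)["sales"].mean()
    .sort_values("sales")
)

print(new)
```

Both produce identical results; the pipeline version simply removes the bookkeeping of naming each intermediate object, which is the main appeal of magrittr's `%>%` as well.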

The Secret to AI Could be Little-Known Transfer Learning

As more enterprises discover AI-based business applications, the concept of transfer learning could help to level the playing field.

A Developer’s Guide to Launching a Machine Learning Startup

This post is part of an insideHPC guide that explores how to successfully launch a machine learning startup. The complete report, available here, covers how to get started, how to choose a framework, how to decide what applications and technology to use, and more.

Comparing Distance Measurements with Python and SciPy

This post introduces five perfectly valid ways of measuring distances between data points, along with a simple demonstration and comparison using Python and the SciPy library.
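As a taste of what `scipy.spatial.distance` offers, here are several common metrics evaluated on one pair of points. The two points are made-up values chosen so the results are easy to check by hand.

```python
from scipy.spatial import distance

# Two illustrative 2-D points (hypothetical values).
u, v = [0.0, 3.0], [4.0, 0.0]

print(distance.euclidean(u, v))       # straight-line: sqrt(4^2 + 3^2) = 5.0
print(distance.cityblock(u, v))       # Manhattan: |4| + |3| = 7
print(distance.chebyshev(u, v))       # largest coordinate gap: max(4, 3) = 4
print(distance.minkowski(u, v, p=3))  # generalises the first two via p
print(distance.cosine(u, v))          # 1 - cos(angle); orthogonal here -> 1.0
```

Note that cosine distance ignores magnitude entirely and compares only direction, which is why it behaves very differently from the other four on the same pair of points.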

Support Vector Machines Tutorial

If you have used machine learning to perform classification, you might have heard about Support Vector Machines (SVM). Introduced a little more than 50 years ago, they have evolved over time and have also been adapted to various other problems like regression, outlier analysis, and ranking. SVMs are a favorite tool in the arsenal of many machine learning practitioners. We too use them to solve a variety of problems. In this post, we will try to gain a high-level understanding of how SVMs work. I’ll focus on developing intuition rather than rigor. What that essentially means is we will skip as much of the math as possible and develop a strong intuition of the working principle.
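To make the high-level picture concrete, here is a minimal SVM classification sketch using scikit-learn (an assumption on my part; the post itself stays math- and intuition-focused). The six points are a made-up, linearly separable toy problem.

```python
from sklearn.svm import SVC

# Tiny, linearly separable toy data: two clusters of three points each.
X = [[0, 0], [1, 1], [0, 1], [4, 4], [5, 5], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

# A linear-kernel SVM finds the separating line with the widest margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))  # one point near each cluster
print(clf.support_vectors_)                   # the points that pin the margin
```

The `support_vectors_` attribute shows the handful of boundary points that actually determine the decision surface, which is the core intuition behind the method's name.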

Curve Smoothing Using Moving Averages

Working with highly granular data presents a number of unique challenges to the entire range of data users. Sometimes the data is stored incorrectly, creating the need for modification upon import, and sometimes the nature of the data makes it difficult to work with. Take the following example: an analyst is presented with two possibly related datasets, cargo tonnage data from the two largest airports in the New York City metropolitan area and passenger enplanement data from the same airports. The analyst is asked to create a relational model between the two datasets for a major airline looking to expand its presence at either LaGuardia or John F. Kennedy Airport in order to accommodate the maximum number of passengers and amount of cargo. The first dataset, collected by the Port Authority of New York and New Jersey, is aggregated monthly, while the second dataset, from the New York Department of Transportation, is aggregated annually. The analyst knows that the data must be modified somehow. But using a schema-based parser to aggregate the Port Authority cargo data would destroy the fine granularity created by monthly collection over the entire observation period, and using an average baseline calculated over the entire observation period would neglect current trends, because four decades' worth of data would be weighted equally.
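A moving average threads that needle: each output point averages only a recent window, so the monthly granularity is preserved while noise is damped and old data does not swamp current trends. A minimal NumPy sketch, with made-up monthly tonnage figures standing in for the Port Authority data:

```python
import numpy as np

def moving_average(x, window):
    """Simple (unweighted) moving average via convolution.

    mode="valid" returns only full windows, so the output has
    len(x) - window + 1 points.
    """
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

# Hypothetical monthly cargo tonnage with a noisy upward trend.
monthly = np.array([100, 120, 90, 130, 110, 150, 140, 160, 155, 170], float)

smoothed = moving_average(monthly, window=3)
print(smoothed)
```

A wider `window` gives a smoother curve at the cost of responsiveness; a three-month window here keeps seasonal movement visible, where averaging over the whole series would flatten it entirely.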