On the deep learning R&D team at SVDS, we have investigated Recurrent Neural Networks (RNNs) for exploring time series and developing speech recognition capabilities. Many products today rely on deep neural networks that implement recurrent layers, including products made by companies like Google, Baidu, and Amazon. However, when developing our own RNN pipelines, we did not find many simple and straightforward examples of using neural networks for sequence learning applications like speech recognition. Many examples were either powerful but quite complex, like the actively developed DeepSpeech project from Mozilla under the Mozilla Public License, or too simple and abstract to be used on real data. In this post, we’ll provide a short tutorial for training an RNN for speech recognition; we’re including code snippets throughout, and you can find the accompanying GitHub repository here. The software we’re using is a mix of code borrowed from, and inspired by, existing open source projects. Below is a video example of machine speech recognition on a 1906 Edison Phonograph advertisement. The video includes a running trace of sound amplitude, extracted spectrogram, and predicted text.
This is the last video of a three-part introduction to Bayesian data analysis, aimed at those who aren’t necessarily well-versed in probability theory but do know a little bit of programming. If you haven’t watched the other parts yet, I really recommend you do that first: Part 1 & Part 2. This third video covers the how of Bayesian data analysis: how to do it efficiently and how to do it in practice. But “covers” is a big word; “briefly introduces” is more appropriate. Along the way I will briefly introduce Markov chain Monte Carlo, parameter spaces and the computational framework Stan.
Building machine learning and statistical models often requires pre- and post-transformation of the input and/or response variables, prior to training (or fitting) the models. For example, a model may require training on the logarithm of the response and input variables. As a consequence, fitting and then generating predictions from these models requires repeated application of transformation and inverse-transformation functions – to go from the domain of the original input variables to the domain of the original output variables (via the model). This is usually quite a laborious and repetitive process that leads to messy code and notebooks.
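To make the transform and inverse-transform round trip concrete, here is a minimal sketch in plain Python. It is not code from the post: the function names `fit_log_log` and `predict` and the toy data are assumptions made for illustration. It fits a straight line to the logarithms of the input and response, then maps predictions back to the original scale with the exponential.

```python
import math


def fit_log_log(xs, ys):
    """Least-squares fit of log(y) = intercept + slope * log(x).

    Hypothetical helper: the forward transformation (log) is applied
    to both the input and the response before fitting.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(x_new, intercept, slope):
    """Predict on the original scale.

    The input is log-transformed, the linear model is applied, and the
    result is inverse-transformed with exp.
    """
    return math.exp(intercept + slope * math.log(x_new))


# Toy data (assumed for illustration) generated from y = 3 * x^2.
xs = [1.0, 2.0, 4.0, 8.0]
ys = [3.0 * x ** 2 for x in xs]

intercept, slope = fit_log_log(xs, ys)
prediction = predict(16.0, intercept, slope)
```

Even in this small case, every call site has to remember which transformation goes with which variable, which is the bookkeeping the paragraph above describes as laborious.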
A week or so ago, I came up with a new chart type – race concordance charts – for looking at a motor circuit race from the on-track perspective of a particular driver.
We will develop a forecasting example using model tree and regression tree algorithms. The exercise was originally published in ‘Machine Learning in R’ by Brett Lantz, Packt Publishing 2015 (open source community experience distilled). The example we will develop is about predicting wine quality.
At Mango we often give R training in locations where a reliable WiFi connection is not guaranteed, so if we need trainees to download packages from CRAN it can be a show-stopper. Here are a couple of code snippets for downloading packages from CRAN onto a USB stick when you have a good connection, and then installing them on site from the USB stick should you need to.
R is my language of choice for data science but a good data scientist should have some knowledge of all of the great tools available to them. Recently, I have been gleefully using Python for machine learning problems (specifically pandas and the wonderful scikit-learn). However, for all its greatness, I couldn’t help but feel it lacks a bit in the data visualisation department. Don’t get me wrong, matplotlib can be used to produce some very nice visualisations but I think the code is a bit messy and quite unintuitive when compared to Hadley Wickham’s ggplot2. I’m a huge fan of the ggplot2 package and was delighted to discover that there has been an attempt to replicate its style in Python via the ggplot package. I wanted to compare the two packages and see just how well ggplot matches up to ggplot2. Both packages contain built-in datasets and I will use the mtcars data to build a series of plots to see how they compare, both visually and syntactically.
Ever wondered how to make an rmarkdown title dynamic? Maybe you wanted to use a parameter in multiple locations, or to pass through a publication date? Advanced use of YAML headers can help! Normally, when we write rmarkdown, we might use something like the basic YAML header that the rmarkdown template gives us.
Continuum Analytics, H2O.ai, and MapD Technologies have announced the formation of the GPU Open Analytics Initiative (GOAI) to create common data frameworks enabling developers and statistical researchers to accelerate data science on GPUs. GOAI will foster the development of a data science ecosystem on GPUs by allowing resident applications to interchange data seamlessly and efficiently. BlazingDB, Graphistry and Gunrock from UC Davis led by CUDA Fellow John Owens have joined the founding members to contribute their technical expertise.
The primary aim of any marketing campaign is to effectively engage with the target audience and encourage them to perform the desired set of actions.
Traditionally, this is where a lot of the following come in: conditional random fields, bag-of-words, TF-IDFs, WordNet, statistical analysis, but also a lot of manual work done by linguists and domain experts for the creation of synonym lists, skill taxonomies, job title hierarchies, knowledge bases or ontologies. That may seem like a bombardment of jargon, but the point is simple: while these concepts are very valuable for the problem we are trying to solve, they also require a certain amount of manual feature engineering and human expertise. This expertise is certainly a factor that makes these techniques valuable, but in what follows we would like to explore more automated ways of extracting knowledge from recruitment data to complement these more traditional approaches. While some of the key concepts of deep learning have been around for some time, it was only more recently that everyone got excited: in 2012 researchers from the University of Toronto won the ImageNet image classification challenge using a convolutional neural network (CNN), obtaining an error rate that was about 40% lower than the second-best entry. On a personal level, I got excited after reading the 2011 “Natural language processing (almost) from scratch” paper by Collobert et al., which applies a CNN to textual information. I also enjoyed the Hellinger PCA paper by Lebret and Collobert, and then there was of course the famous Word2Vec paper by Mikolov, Sutskever et al. So what is the practical value of these deep learning methods for recruitment data? The aim of this project was to get an idea of their usefulness for extracting knowledge about the job space by leveraging a large amount of job vacancy data. We applied minimal data cleaning and basically started from the raw job descriptions. No feature engineering was done. We set out to discover what we could expect from a relatively simple but effective deep learning approach and whether it could provide us with some insights about our data.