Julia: Installation and Editors

If you have been following this blog, you may have noticed that I haven’t posted an update in more than a year. The reason is that I’ve been busy with my research and my work, and I promised not to share anything here until I finished my degree (Master of Science in Statistics). Anyway, I think it’s time to share what I’ve learned in the past year. It has been a good year for statistics, especially in the Philippines: on November 15, 2016, a team of local data scientists made a huge step in big data by organizing the first-ever conference on the topic. Months before that, the 13th National Convention on Statistics, organized by the Philippine Statistics Authority, invited a keynote speaker from Paris21 to tackle big data and its use in government. So without further ado, in this post I would like to share a programming language I’ve been using for several months now: Julia. It is by far my favorite; it’s a well-thought-out language, as many would say, for many reasons. The first, of course, is speed, the second is the grammar, and there are many more. I can’t list them all here, but I suggest you visit the official website and try it for yourself.

Which algorithm takes the crown: Light GBM vs XGBOOST?

If you are an active member of the Machine Learning community, you must be aware of Boosting Machines and their capabilities. The development of Boosting Machines runs from ADABOOST to today’s favourite, XGBOOST. XGBOOST has become a de facto algorithm for winning competitions at Analytics Vidhya and Kaggle, simply because it is extremely powerful. But given lots and lots of data, even XGBOOST takes a long time to train. Enter Light GBM. Many of you might not be familiar with Light Gradient Boosting, but you will be after reading this article. The most natural questions that will come to your mind are: why another boosting machine algorithm? Is it superior to XGBOOST?

Weather forecast with regression models – part 3

In the previous part of this tutorial, we built several models based on logistic regression. One aspect still to be considered is tuning the decision threshold, which may help in reaching better accuracy. We are going to show a procedure that determines an optimal value for this purpose. ROC plots will be introduced as well.
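As a rough illustration of what threshold tuning involves, here is a minimal Python sketch (the tutorial itself works in R, and this is not its procedure). It assumes a model has already produced predicted probabilities; the names `accuracy_at`, `best_threshold` and `roc_points`, the candidate grid and the toy data are all invented for the example.

```python
def accuracy_at(threshold, probs, labels):
    """Accuracy when predicting 1 for every probability >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def best_threshold(probs, labels, candidates=None):
    """Return the first candidate threshold with the highest accuracy."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    return max(candidates, key=lambda t: accuracy_at(t, probs, labels))


def roc_points(probs, labels):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


# Made-up predicted probabilities and true labels, for illustration only.
probs = [0.10, 0.35, 0.40, 0.55, 0.62, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1]

tuned = best_threshold(probs, labels)
print("default 0.5 accuracy:", accuracy_at(0.5, probs, labels))
print("tuned threshold:", tuned, "accuracy:", accuracy_at(tuned, probs, labels))
print("ROC points:", roc_points(probs, labels))
```

In practice the threshold should be chosen on a validation set rather than on the data used to report accuracy, and accuracy can be swapped for another criterion if false positives and false negatives carry different costs.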

Marching neural network

In this demonstration you can play with a simple neural network in 3 spatial dimensions and visualize the functions the network produces (those are quite interesting despite the simplicity of the network; just click the ‘randomize weights’ button several times). Technically, the visualization employs a variation of the raymarching technique from computer graphics, so everything is computed with shaders only (and the neural network is calculated with shaders as well). The animated surfaces are level surfaces of the neural network. You can stop the animation and choose the surface level yourself. Note that the demo shows surfaces for all levels that differ by an integer: f(x) = … , level-1, level, level+1, … which is why the animation is periodic. The sparks that follow the surfaces mark regions of rapid change (large gradient), and their color indicates the level of the surface they are following (red is higher). As the level changes, the color of the sparks also changes (from blue to red).
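The integer-spaced levels are what make the animation periodic, which a small CPU-side sketch in Python can show (the demo itself runs in shaders, and this is not its code). The network size, the tanh activation and the names `make_network`, `network` and `offset_to_nearest_surface` are assumptions made for illustration.

```python
import math
import random


def make_network(hidden=4, seed=0):
    """Random weights for a tiny 3 -> hidden -> 1 network (illustrative only)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(hidden)]
    b1 = [rng.gauss(0, 1) for _ in range(hidden)]
    w2 = [rng.gauss(0, 1) for _ in range(hidden)]
    return w1, b1, w2


def network(point, params):
    """Scalar output f(x) for a 3D point."""
    w1, b1, w2 = params
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, point)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden))


def offset_to_nearest_surface(point, level, params):
    """How far f(x) is from the nearest value in {..., level-1, level, level+1, ...}.

    A point lies on one of the displayed surfaces when this offset is zero.
    """
    d = network(point, params) - level
    return abs(d - round(d))


params = make_network(seed=1)
point = (0.2, -0.5, 0.7)

# Shifting the level by a whole number gives the same set of surfaces,
# so the two offsets agree up to floating-point error.
print(offset_to_nearest_surface(point, 0.3, params))
print(offset_to_nearest_surface(point, 1.3, params))
```

A raymarcher would step along each view ray and look for points where this offset drops to zero; that part is omitted here.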

Counting Objects with Faster R-CNN

Accurately counting object instances in a given image or video frame is a hard problem to solve in machine learning. A number of solutions have been developed to count people, cars and other objects, and none of them is perfect. Of course, we are talking about image processing here, so a neural network seems to be a good tool for the job. Below you can find a description of different approaches, common problems, challenges and the latest solutions in the field of object counting with neural networks. As a proof of concept, an existing model for the Faster R-CNN network will be used to count objects on the street, with video examples given at the end of the post.
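Once a detector has produced its output, counting reduces to filtering and tallying, as this minimal Python sketch shows. It does not run Faster R-CNN; it assumes the detections are already available as a list of dictionaries with `label` and `score` keys, a hypothetical format chosen for the example, and the data and the 0.5 cutoff are made up.

```python
from collections import Counter


def count_objects(detections, min_score=0.5):
    """Count detections per class, ignoring those below the confidence cutoff."""
    return Counter(d["label"] for d in detections if d["score"] >= min_score)


# Hypothetical detector output for a single frame.
detections = [
    {"label": "car", "score": 0.92},
    {"label": "car", "score": 0.81},
    {"label": "person", "score": 0.77},
    {"label": "car", "score": 0.30},
    {"label": "person", "score": 0.45},
]

counts = count_objects(detections, min_score=0.5)
print(dict(counts))
```

The cutoff matters: lowering it counts more objects but admits more false detections, which is one reason counts from a detector are rarely exact.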

Advantages of Using R Notebooks For Data Analysis Instead of Jupyter Notebooks

Jupyter Notebooks, formerly known as IPython Notebooks, are ubiquitous in modern data analysis. The Notebook format allows statistical code and its output to be viewed on any computer in a logical and reproducible manner, avoiding both the confusion caused by unclear code and the inevitable “it only works on my system” curse.

Setting up your GPU TensorFlow platform

If you want to install TensorFlow with GPU support, you currently have two choices: either do your own manual install (good luck with that) or use a Docker image. If you install the TensorFlow GPU Docker image, it is almost plug and play: you can start coding, almost immediately, in a Jupyter notebook with all TensorFlow GPU libraries installed. But what if you want to use your own libraries on top of it? And how do you access your own files? In this post I will explain how to use GPU TensorFlow with Docker in a flexible way, i.e. one that allows you to use your own libraries. I assume the reader has Docker already installed and knows the very basics of Docker (for example, Part I and Part II of the Docker Get Started guide). Docker has several advantages: it is portable, it allows for version control and reproducibility, and it is very efficient (more so than a virtual machine). However, it requires some setup which is not straightforward. Once you have installed the Docker image for GPU, following this intro, you would like to be able to run the Docker image in bash mode rather than in Jupyter, with access to your files and to the corresponding ports. Inside that image, you would like to install your own libraries and be able to recover them in the future, as Docker containers are ephemeral and isolated. I will explain how to do all that.

Data Visualization with googleVis exercises part 2

In the second part of our series we are going to meet three more googleVis charts: the Area Chart, the Stepped Area Chart and the Combo Chart. Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let’s begin!

Machine Learning Powered Biological Network Analysis

Metabolomic network analysis can be used to interpret experimental results within a variety of contexts, including biochemical relationships, structural and spectral similarity, and empirical correlation. Machine learning is useful for modeling relationships in the context of pattern recognition, clustering, classification and regression-based predictive modeling. The combination of developed metabolomic networks and machine-learning-based predictive models offers a unique method to visualize empirical relationships while testing key experimental hypotheses. The following presentation focuses on the data analysis, visualization, machine learning and network mapping approaches used to create richly mapped metabolomic networks. Learn more at http://www.createdatasol.com

Data Manipulation in R Exercises

Managing intermediate results when using R/sparklyr

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr.

Data Science for Business – Time Series Forecasting Part 2: Forecasting with timekit

In time series forecasting, we use models to predict future time points based on past observations. As mentioned in timekit’s vignette, “as with most machine learning applications, the prediction is only as good as the patterns in the data. Forecasting using this approach may not be suitable when patterns are not present or when the future is highly uncertain (i.e. past is not a suitable predictor of future performance).” And while this is certainly true, we don’t always have data with a strong regular pattern. And, I would argue, data that has very obvious patterns doesn’t need a complicated model to generate forecasts – we can already guess the future curve just by looking at it. So, if we think of use-cases for businesses, who want to predict e.g. product sales, forecasting models are especially relevant in cases where we can’t make predictions manually or based on experience. The packages I am using are timekit for forecasting, tidyverse for data wrangling and visualization, caret for additional modeling functions, tidyquant for its ggplot theme, broom and modelr for (tidy) modeling.