Misuses of Statistics: Examples and Solutions

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, Hadoop, decision trees, ensembles, correlation, ouliers, regression Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, time series, cross-validation, model fitting, dataviz, and many more. To keep receiving these articles, sign up on DSC. Previous entries are listed below the picture.

The Best R Packages for Machine Learning

There is no doubt R is language of choice for the majority of data scientists who want to understand data, especially those looking to leverage its great machine learning packages.

Massively-parallel computations on Azure clusters with R, made easy with doAzureParallel

For users of the R language, scaling up their work to take advantage of cloud-based computing has generally been a complex undertaking. We are therefore excited to announce doAzureParallel, a lightweight R package built on Azure Batch that allows you to easily use Azure’s flexible compute resources right from your R session. The doAzureParallel package complements Microsoft R Server and provides the infrastructure you need to run massively parallel simulations on Azure directly from R. The doAzureParallel package is a parallel backend for the popular foreach package, making it possible to execute multiple processes across a cluster of Azure virtual machines with just a few lines of R code. The package helps you create and manage the cluster in Azure, and register it as a parallel backend to be used with foreach.

Coordinatized Data: A Fluid Data Specification

It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).