How not to be biased?

#1: Do not underestimate the amount of stupidity around you!
#2: Always ask about the research method!
#3: Do your own analyses and research!
#4: Ask smarter people!
#5: Think!


Data Science: (not) the preferred nomenclature

Data science is new and exciting. Data scientist has been called the sexiest job of the 21st century. But what, exactly, is data science? There is no shortage of position papers, Venn diagrams and white papers offering a perspective on, if not a definition of, data science. But I feel many of them ignore one critical distinction: the difference between the Science OF Data and doing Science WITH Data. The Science OF Data is an academic subject that studies data in all its manifestations, together with methods and algorithms to manipulate, analyse, visualise and enrich data. It is methodologically close to computer science and statistics, combining theoretical, algorithmic and empirical work.


Feature Selection with the R Package MXM: Discovering Statistically Equivalent Feature Subsets

The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian networks. Most of the currently available feature selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple, predictive feature subsets whose performances are statistically equivalent. In that respect the SES algorithm subsumes and extends previous feature selection algorithms, such as the max-min parents and children (MMPC) algorithm. The SES algorithm is implemented in a homonymous function included in the R package MXM, standing for mens ex machina, meaning ‘mind from the machine’ in Latin. The MXM implementation of SES handles several data analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm and its implementation, and provide examples of using the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real-world data.
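For readers who want to try it, here is a minimal sketch of calling SES on simulated data. It assumes the SES(target, dataset, max_k, threshold) interface described in the paper and the CRAN documentation; the slot names used to inspect the result are taken from that documentation and should be treated as assumptions.

```r
# Minimal sketch: SES on simulated continuous data.
# Assumes MXM is installed and that SES() accepts (target, dataset, max_k, threshold).
library(MXM)

set.seed(1)
x <- matrix(rnorm(200 * 50), nrow = 200)   # 200 samples, 50 candidate features
y <- x[, 1] - 2 * x[, 2] + rnorm(200)      # outcome driven by two of them

fit <- SES(target = y, dataset = x, max_k = 3, threshold = 0.05)

fit@selectedVars   # one representative, maximally predictive feature subset (assumed slot name)
fit@signatures     # all statistically equivalent signatures found (assumed slot name)
```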


Mining Your Routine Data for Reference Intervals: Hoffman, Bhattacharya and Maximum Likelihood

When you look at histograms of routine clinical data from all-comers, on some occasions the data will form a bimodal-looking distribution made up of the putatively sick and the well. If you could statistically determine the distribution of the well subjects, then you could, in principle, determine the reference interval without performing a reference interval study. We can all dream, right?
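The maximum-likelihood flavour of this idea can be sketched with a two-component Gaussian mixture: fit the mixture to the all-comers data, pick the component that plausibly represents the well, and take its central 95%. The sketch below uses simulated data and the mixtools package; treating the larger component as the well population is my assumption, not something the post prescribes.

```r
# Sketch: estimate a reference interval from a bimodal all-comers distribution
# via a maximum-likelihood two-component Gaussian mixture.
library(mixtools)   # provides normalmixEM()

set.seed(42)
well <- rnorm(5000, mean = 140, sd = 3)   # simulated healthy results (illustrative units)
sick <- rnorm(1000, mean = 128, sd = 5)   # simulated abnormal results
x <- c(well, sick)

fit <- normalmixEM(x, k = 2)

well_comp <- which.max(fit$lambda)        # assume the larger component is the well population
qnorm(c(0.025, 0.975),
      mean = fit$mu[well_comp],
      sd   = fit$sigma[well_comp])        # central 95% of the "well" component = reference interval
```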


cdparcoord: Parallel Coordinates Plots for Categorical Data

The idea behind both packages, freqparcoord and cdparcoord, is to remedy the “black screen problem” in parallel coordinates plots, in which so many lines are plotted that the screen fills and no patterns are discernible. We avoid this by plotting only the most “typical” lines, as defined by estimated nonparametric density value in freqparcoord and by simple counts in cdparcoord.
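As a rough illustration of the freqparcoord side of this idea, the sketch below plots only a handful of the most “typical” rows of a simulated data frame. It assumes the freqparcoord(x, m, dispcols) interface, where m is the number of lines to draw; the cdparcoord interface differs and is not shown.

```r
# Sketch of the "plot only the most typical lines" idea with freqparcoord.
library(freqparcoord)

set.seed(1)
dat <- data.frame(x1 = rnorm(500),
                  x2 = rnorm(500),
                  x3 = rnorm(500),
                  x4 = rnorm(500))

# Draw only the 10 rows with the highest estimated nonparametric density,
# rather than all 500, to avoid the black screen problem.
freqparcoord(dat, m = 10)
```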


Readability Redux

I recently posted about using a Python module to convert HTML to usable text. Since then, a new package dubbed htm2txt has hit CRAN; it is 100% R and uses regular expressions to strip tags from HTML.
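The regex idea is easy to sketch in base R; the htm2txt() call in the comment assumes the package exposes a function of the same name.

```r
# Base-R illustration of regex-based tag stripping.
html <- "<p>Readability <b>redux</b>: strip the <a href='https://example.com'>tags</a>.</p>"

# Crude tag removal with a regular expression
gsub("<[^>]+>", "", html)
#> "Readability redux: strip the tags."

# Packaged equivalent (assumed interface):
# library(htm2txt)
# htm2txt(html)
```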


Tidyverse practice: mapping large European cities

As noted in several recent posts, when you’re learning R and R’s Tidyverse packages, it’s important to break everything down into small units that you can learn. What that means is that you need to identify the most important tools and functions of the Tidyverse, and then practice them until you are fluent. But once you have mastered the essential functions as isolated units, you need to put them together. By putting the individual piece together, you solidify your knowledge of how they work individually but also begin to learn how you can combine small tools together to create novel effects. With that in mind, I want to show you another small project. Here, we’re going to use a fairly small set of functions to create a map of the largest cities in Europe.
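A hedged sketch of that kind of project: a world basemap from ggplot2's map_data() plus a small hand-typed tibble of cities. The coordinates and populations are approximate and purely illustrative, and the column names are my own, not taken from the original post.

```r
# Sketch: plot a few large European cities on a world basemap.
library(tidyverse)
library(maps)   # supplies the "world" polygons used by map_data()

europe <- map_data("world")

cities <- tribble(
  ~city,    ~lon,   ~lat,  ~pop_m,   # approximate city-proper populations, millions
  "London", -0.13,  51.51, 8.9,
  "Berlin", 13.40,  52.52, 3.6,
  "Madrid", -3.70,  40.42, 3.2,
  "Rome",   12.50,  41.90, 2.9,
  "Paris",   2.35,  48.86, 2.1
)

ggplot() +
  geom_polygon(data = europe, aes(long, lat, group = group),
               fill = "grey90", colour = "white") +
  geom_point(data = cities, aes(lon, lat, size = pop_m), colour = "red") +
  coord_cartesian(xlim = c(-12, 30), ylim = c(35, 60)) +
  labs(size = "Population (millions)")
```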


Cross-Validation Code Visualization: Kind of Fun

Let us say you are writing nice, clean machine learning code (e.g. linear regression). Your code is OK: first you divide your dataset into two parts, a training set and a test set, as usual with a function like train_test_split and some random seed. Your predictions could be slightly underfit or overfit, as in the figures below.
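Since this digest leans R, here is an R analogue of that first step (train_test_split itself is Python's scikit-learn); the 80/20 split and the model formula are arbitrary choices for illustration.

```r
# Sketch: a random train/test split and a simple linear regression in base R.
set.seed(123)                          # the "random seed": fix it for reproducibility

n        <- nrow(mtcars)
test_idx <- sample(n, size = round(0.2 * n))

train <- mtcars[-test_idx, ]           # 80% of rows for training
test  <- mtcars[ test_idx, ]           # 20% held out for testing

fit  <- lm(mpg ~ wt + hp, data = train)   # e.g. a linear regression
pred <- predict(fit, newdata = test)      # evaluate on the held-out rows
```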


A Vision for Making Deep Learning Simple

In this blog post, we introduced Deep Learning Pipelines, a new library that makes deep learning drastically easier to use and scale. While this is just the beginning, we believe Deep Learning Pipelines has the potential to accomplish what Spark did for big data: make the deep learning “superpower” approachable for everybody. Future posts in the series will cover the various tools in the library in more detail: image manipulation at scale, transfer learning, prediction at scale, and making deep learning available in SQL.