List of Must-Read Free Books for Data Science

Earlier, we came up with a list of some of the best Machine Learning books you should consider going through. In this article, we have come up with yet another list of the recommended books for Data Science.

Evaluation and comparison of open source software suites for data mining and knowledge discovery

The growing interest in extracting useful knowledge from data for the benefit of the data owner is giving rise to multiple data mining tools. The research community is especially aware of the importance of open source data mining software in ensuring and easing the dissemination of novel data mining algorithms. The availability of these tools at no cost, together with the chance to understand the approaches better by examining their source code, gives the research community an opportunity to tune and improve the algorithms. Documentation, updating, variety of algorithms, extensibility, and interoperability, among others, can be major factors that motivate users to opt for a specific open source data mining tool. The aim of this paper is to evaluate 19 open source data mining tools and to provide the research community with an extensive study based on a wide set of features that any tool should satisfy. The evaluation is carried out by following two methodologies. The first is based on scores provided by experts to produce a subjective judgment of each tool. The second performs an objective analysis of which features are satisfied by each tool. The ultimate aim of this work is to provide the research community with an extensive study of the different features included in any data mining tool, from both a subjective and an objective point of view. Results reveal that RapidMiner, Konstanz Information Miner, and Waikato Environment for Knowledge Analysis are the tools that include the highest percentage of these features.

Task-based End-to-end Model Learning

As machine learning techniques have become more ubiquitous, it has become common to see machine learning prediction algorithms operating within some larger process. However, the criteria by which we train machine learning algorithms often differ from the ultimate criteria on which we evaluate them. This paper proposes an end-to-end approach for learning probabilistic machine learning models within the context of stochastic programming, in a manner that directly captures the ultimate task-based objective for which they will be used. We then present two experimental evaluations of the proposed approach, one as applied to a generic inventory stock problem and the second to a real-world electrical grid scheduling task. In both cases, we show that the proposed approach can outperform both a traditional modeling approach and a purely black-box policy optimization approach.

Improved Python-style Logging in R

Last August, in Python-style Logging in R, we described using an R script as a wrapper around the futile.logger package to generate log files for an operational R data processing script. Today, we highlight an improved, documented version that can be sourced by your R scripts or dropped into your package’s R/ directory to provide easy file and console logging.
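For readers unfamiliar with the pattern being imitated, the following is a minimal Python sketch of logging to both a file and the console using the standard `logging` module. It is not the R script described above; the helper names `make_logger` and `demo`, the logger name, the message text, and the format string are all hypothetical choices made for illustration.

```python
import logging
import os
import sys
import tempfile


def make_logger(name, log_path, level=logging.INFO):
    """Return a logger that writes each record to a file and to stderr.

    Hypothetical helper: the name, format, and defaults are illustrative only.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:  # do not attach handlers twice
        formatter = logging.Formatter("%(levelname)s [%(asctime)s] %(message)s")
        handlers = (logging.FileHandler(log_path), logging.StreamHandler(sys.stderr))
        for handler in handlers:
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger


def demo():
    """Log two messages to a temporary file and return the file's contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.log")
        logger = make_logger("demo", path)
        logger.info("processing started")
        logger.warning("3 records skipped")
        # Close and detach handlers so the temporary file can be removed.
        for handler in list(logger.handlers):
            handler.close()
            logger.removeHandler(handler)
        with open(path) as log_file:
            return log_file.read()


if __name__ == "__main__":
    print(demo())
```

Each call such as `logger.info(...)` produces one formatted line in both destinations, which is the behavior an operational script typically wants: a persistent file for later inspection and immediate console feedback.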

Ensemble Methods are Doomed to Fail in High Dimensions

By ensemble methods, I (Bob, not Andrew) mean approaches that scatter points in parameter space and then make moves by interpolating or extrapolating among subsets of them.
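One way to see why interpolation might struggle as dimension grows is the sketch below. It assumes a standard normal target distribution, which the excerpt above does not state, and the function names are hypothetical. It compares the average distance from the origin of independent draws with that of the midpoints between pairs of draws. Draws sit at a radius near the square root of the dimension, while midpoints sit near the square root of half the dimension, so in high dimensions an interpolated point lands well inside the shell where the draws themselves concentrate.

```python
import math
import random


def norm(vector):
    """Euclidean distance of a point from the origin."""
    return math.sqrt(sum(x * x for x in vector))


def draw(dim, rng):
    """One draw from a standard normal in `dim` dimensions (assumed target)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def interpolate(a, b, t=0.5):
    """Point a fraction `t` of the way from `a` to `b`."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def mean_radii(dim, n_pairs=200, seed=0):
    """Average radius of the draws and of the midpoints between pairs of draws."""
    rng = random.Random(seed)
    draw_total = 0.0
    midpoint_total = 0.0
    for _ in range(n_pairs):
        a, b = draw(dim, rng), draw(dim, rng)
        draw_total += (norm(a) + norm(b)) / 2.0
        midpoint_total += norm(interpolate(a, b))
    return draw_total / n_pairs, midpoint_total / n_pairs


if __name__ == "__main__":
    for dim in (1, 10, 100, 1000):
        r_draw, r_mid = mean_radii(dim)
        print(f"dim={dim:5d}  draws: {r_draw:6.2f}  midpoints: {r_mid:6.2f}")
```

The gap between the two radii widens in absolute terms as the dimension increases, while the spread of the draws around their typical radius stays roughly constant. This is only an illustration under the stated assumption, not a reproduction of the post's own argument.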

Data Structures Exercises

There are five important basic data structures in R: vector, matrix, array, list and data frame. They can be 1-dimensional (vector and list), 2-dimensional (matrix and data frame) or multidimensional (array). They also differ in the homogeneity of the elements they can contain: all elements of a vector, matrix or array must be of the same type, whereas a list or data frame can contain multiple types. In this set of exercises we shall practice casting between these data structures, together with some basic operations on them. You can find more about data structures on the Advanced R – Data structures page.

Mapping Housing Data with R

What is my home worth? Many homeowners in America ask themselves this question, and many have an answer. What does the market think, though? The best way to estimate a property’s value is by looking at other, similar properties that have sold recently in the same area – the comparable sales approach. In an effort to allow homeowners to do some exploring (and because I needed a new project), I developed a small Shiny app with R.
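As a rough illustration of the comparable sales idea, the sketch below estimates a value from the median price per square foot of recent nearby sales. This is one common simplification and an assumption here; the excerpt does not say how the Shiny app computes its estimates. The function name and all figures are hypothetical.

```python
from statistics import median


def estimate_value(subject_sqft, comparable_sales):
    """Estimate a home's value from comparable sales.

    comparable_sales: list of (sale_price, square_feet) tuples for recent,
    similar sales in the same area. Uses the median price per square foot,
    which is less sensitive to a single unusual sale than the mean.
    """
    if not comparable_sales:
        raise ValueError("at least one comparable sale is required")
    price_per_sqft = [price / sqft for price, sqft in comparable_sales]
    return median(price_per_sqft) * subject_sqft


if __name__ == "__main__":
    # Hypothetical sales data, invented purely for illustration.
    comps = [(300_000, 1500), (330_000, 1500), (250_000, 1000)]
    print(estimate_value(1600, comps))
```

A fuller treatment would also adjust for differences such as lot size, condition, and sale date, which this sketch ignores.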