Book Memo: “Big Data, Little Data, No Data”

Scholarship in the Networked World
“Big Data” is on the covers of Science, Nature, the Economist, and Wired magazines, on the front pages of the Wall Street Journal and the New York Times. But despite the media hyperbole, as Christine Borgman points out in this examination of data and scholarly research, having the right data is usually better than having more data; little data can be just as valuable as big data. In many cases, there are no data—because relevant data don’t exist, cannot be found, or are not available. Moreover, data sharing is difficult, incentives to do so are minimal, and data practices vary widely across disciplines. Borgman, an often-cited authority on scholarly communication, argues that data have no value or meaning in isolation; they exist within a knowledge infrastructure—an ecology of people, practices, technologies, institutions, material objects, and relationships. After laying out the premises of her investigation—six “provocations” meant to inspire discussion about the uses of data in scholarship—Borgman offers case studies of data practices in the sciences, the social sciences, and the humanities, and then considers the implications of her findings for scholarly practice and research policy. To manage and exploit data over the long term, Borgman argues, requires massive investment in knowledge infrastructures; at stake is the future of scholarship.

Magister Dixit

“When a new, powerful tool comes along, there’s a tendency to think it can solve more problems than it actually can. Computers have not made offices paperless, for example, and Predator drones haven’t made a significant dent in the annual number of terrorist acts. … While big data tools are, indeed, very powerful, the results they deliver tend to be only as good as the strategy behind their deployment. A closer look at successful big data projects offers clues as to why they are successful … and why others fall short of the mark.” Patrick Marshall ( 08.07.2014 )

Document worth reading: “Experimental Analysis of Design Elements of Scalarizing Functions-based Multiobjective Evolutionary Algorithms”

In this paper we systematically study the importance, i.e., the influence on performance, of the main design elements that differentiate scalarizing functions-based multiobjective evolutionary algorithms (MOEAs). This class of MOEAs includes Multiobjective Genetic Local Search (MOGLS) and Multiobjective Evolutionary Algorithm Based on Decomposition (MOEA/D) and has proved to be very successful in multiple computational experiments and practical applications. The two algorithms share the same common structure and differ only in two main aspects. Using three different multiobjective combinatorial optimization problems, i.e., the multiobjective symmetric traveling salesperson problem, the traveling salesperson problem with profits, and the multiobjective set covering problem, we show that the main differentiating design element is the mechanism for parent selection, while the selection of weight vectors, either random or uniformly distributed, is practically negligible if the number of uniform weight vectors is sufficiently large. Experimental Analysis of Design Elements of Scalarizing Functions-based Multiobjective Evolutionary Algorithms
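
To make the idea of a scalarizing function concrete, here is a minimal sketch in Python. It is not taken from the paper and does not reproduce MOGLS or MOEA/D; it only illustrates how a weight vector turns several objectives into a single value to minimize. The function names, the toy candidate set, and the reference point are all hypothetical.

```python
def weighted_sum(objectives, weights):
    """Linear scalarization: weighted sum of the objective values."""
    return sum(w * f for w, f in zip(weights, objectives))


def chebyshev(objectives, weights, reference):
    """Weighted Chebyshev scalarization relative to a reference point."""
    return max(w * abs(f - z) for w, f, z in zip(weights, objectives, reference))


def uniform_weights(count):
    """Evenly spaced weight vectors for two objectives (requires count >= 2)."""
    return [(i / (count - 1), 1 - i / (count - 1)) for i in range(count)]


def best_for_weights(candidates, weights, reference):
    """Pick the candidate that minimizes the Chebyshev value for these weights."""
    return min(candidates, key=lambda f: chebyshev(f, weights, reference))


# Hypothetical objective vectors of three solutions (both objectives minimized).
candidates = [(1, 9), (4, 4), (9, 1)]
reference = (0, 0)  # assumed ideal point for this toy example

for weights in uniform_weights(3):
    print(weights, best_for_weights(candidates, weights, reference))
```

Each weight vector defines its own single-objective subproblem, so different weights favor different trade-offs among the candidates.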

Document worth reading: “Deep Architectures for Modulation Recognition”

We survey the latest advances in machine learning with deep neural networks by applying them to the task of radio modulation recognition. Results show that radio modulation recognition is not limited by network depth and further work should focus on improving learned synchronization and equalization. Advances in these areas will likely come from novel architectures designed for these tasks or through novel training methods. Deep Architectures for Modulation Recognition

Book Memo: “Matrix and Tensor Factorization Techniques for Recommender Systems”

This book presents the algorithms used to provide recommendations by exploiting matrix factorization and tensor decomposition techniques. It highlights well-known decomposition methods for recommender systems, such as Singular Value Decomposition (SVD), UV-decomposition, and Non-negative Matrix Factorization (NMF), and describes in detail the pros and cons of each method for matrices and tensors. The book provides a detailed theoretical and mathematical background of matrix/tensor factorization techniques and a step-by-step analysis of each method on the basis of an integrated toy example that runs throughout all its chapters and helps the reader understand the key differences among methods. It also contains two chapters in which different matrix and tensor methods are compared experimentally on real data sets, such as Epinions, GeoSocialRec, and BibSonomy, and which provide further insights into the advantages and disadvantages of each method.

If you did not already know: “mapnik”

Mapnik is a high-powered rendering library that can take GIS data from a number of sources (ESRI shapefiles, PostGIS databases, etc.) and use them to render beautiful 2-dimensional maps. It’s used as the underlying rendering solution for a lot of online mapping services, most notably including MapQuest and the OpenStreetMap project, so it’s a truly production-quality framework. And, despite being written in C++, it comes with bindings for Python and Node, so you can leverage it in the language of your choice.
Render Google Maps Tiles with Mapnik and Python

R Packages worth a look

Uncertainty Propagation Analysis (spup)
Uncertainty propagation analysis in spatial environmental modelling following methodology described in Heuvelink et al. (2017) <doi:10.1080/13658810601063951> and Brown and Heuvelink (2007) <doi:10.1016/j.cageo.2006.06.015>. The package provides functions for examining the uncertainty propagation starting from input data and model parameters, via the environmental model, onto model outputs. The functions include uncertainty model specification, stochastic simulation, and propagation of uncertainty using Monte Carlo (MC) techniques. Uncertain variables are described by probability distributions. Both numerical and categorical data types are handled. Spatial auto-correlation within an attribute and cross-correlation between attributes are accommodated. The MC realizations may be used as input to environmental models called from R, or externally.
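
As a rough illustration of Monte Carlo uncertainty propagation in general (not of the spup API, which is an R package), the following Python sketch draws realizations of uncertain inputs from their distributions, runs a model on each, and summarizes the spread of the outputs. The toy `runoff` model, the input distributions, and all names are hypothetical, and spatial correlation is ignored entirely.

```python
import random
import statistics


def propagate_uncertainty(model, input_samplers, n_realizations=1000, seed=42):
    """Run `model` on Monte Carlo realizations of its uncertain inputs."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_realizations):
        # Draw one realization of every uncertain input from its distribution.
        realization = {name: draw(rng) for name, draw in input_samplers.items()}
        outputs.append(model(**realization))
    return outputs


# Hypothetical toy model: runoff = rainfall * runoff coefficient.
def runoff(rainfall, coefficient):
    return rainfall * coefficient


# Assumed uncertainty models for the two inputs (illustrative values only).
samplers = {
    "rainfall": lambda rng: rng.gauss(100.0, 10.0),
    "coefficient": lambda rng: rng.uniform(0.3, 0.5),
}

outputs = propagate_uncertainty(runoff, samplers, n_realizations=5000)
print("mean:", statistics.mean(outputs))
print("stdev:", statistics.stdev(outputs))
```

The standard deviation of the outputs is the propagated uncertainty: it reflects how much the model result varies when the inputs vary according to their assumed distributions.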

An Easy Way to Report ROC Analysis (reportROC)
Provides an easy way to report the results of ROC analysis, including: 1. an ROC curve. 2. the value of Cutoff, SEN (sensitivity), SPE (specificity), AUC (Area Under Curve), AUC.SE (the standard error of AUC), PLR (positive likelihood ratio), NLR (negative likelihood ratio), PPV (positive predictive value), NPV (negative predictive value).
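
The quantities listed above all derive from a 2x2 table at a given cutoff, plus a ranking statistic for the AUC. The Python sketch below shows the standard definitions; it is not the reportROC implementation. The cutoff is supplied by the caller, the function names and sample data are hypothetical, and the code assumes both classes appear on each side of the cutoff so that no denominator is zero.

```python
def diagnostic_metrics(scores, labels, cutoff):
    """Treat score >= cutoff as a positive call and summarize the 2x2 table."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= cutoff and y == 1)
    fp = sum(1 for s, y in pairs if s >= cutoff and y == 0)
    fn = sum(1 for s, y in pairs if s < cutoff and y == 1)
    tn = sum(1 for s, y in pairs if s < cutoff and y == 0)
    sen = tp / (tp + fn)  # sensitivity
    spe = tn / (tn + fp)  # specificity
    return {
        "SEN": sen,
        "SPE": spe,
        "PLR": sen / (1 - spe),   # positive likelihood ratio
        "NLR": (1 - sen) / spe,   # negative likelihood ratio
        "PPV": tp / (tp + fp),    # positive predictive value
        "NPV": tn / (tn + fn),    # negative predictive value
    }


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test scores and true labels (1 = diseased, 0 = healthy).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

metrics = diagnostic_metrics(scores, labels, cutoff=0.5)
print(metrics)
print("AUC:", auc(scores, labels))
```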

Estimating Finite Population Total (fpest)
Given the values of sampled units and their selection probabilities, the desraj function in the package computes the estimated value of the population total as well as its estimated variance.
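
For readers unfamiliar with this kind of estimator, here is a Python sketch of the classical Des Raj ordered estimator for sampling with unequal probabilities without replacement. It assumes, based only on the function's name, that this is the method behind desraj; the R function's actual signature and behavior are not shown in the description, and the names and sample values below are hypothetical.

```python
def des_raj_estimate(y, p):
    """Estimate a population total from an ordered unequal-probability sample.

    y: sampled values in the order they were drawn (at least two units).
    p: initial selection probabilities of those units, in the same order.
    Returns (estimated total, estimated variance of that estimate).
    """
    n = len(y)
    t = []
    cum_y = 0.0  # sum of values drawn before the current unit
    cum_p = 0.0  # sum of selection probabilities drawn before the current unit
    for yi, pi in zip(y, p):
        # t_i = y_1 + ... + y_{i-1} + (y_i / p_i) * (1 - p_1 - ... - p_{i-1})
        t.append(cum_y + (yi / pi) * (1.0 - cum_p))
        cum_y += yi
        cum_p += pi
    total = sum(t) / n
    variance = sum((ti - total) ** 2 for ti in t) / (n * (n - 1))
    return total, variance


# Hypothetical sample of three units.
total, variance = des_raj_estimate([12, 20, 9], [0.2, 0.4, 0.1])
print(total, variance)
```

A quick sanity check on the formula: when every value is exactly proportional to its selection probability, each t_i equals the true total, so the estimated variance is zero.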

Regression Analysis Based on Win Loss Endpoints (WLreg)
Use various regression models for the analysis of win loss endpoints adjusting for non-binary and multivariate covariates.

Smoothed Bootstrap and Random Generation from Kernel Densities (kernelboot)
Smoothed bootstrap and functions for random generation from univariate and multivariate kernel densities. It does not estimate kernel densities.
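
The smoothed bootstrap itself is simple to sketch: resample the data with replacement, then add noise drawn from the kernel. The Python example below does this for univariate data with a Gaussian kernel. It is an illustration of the technique, not of the kernelboot API; the function name, the fixed bandwidth, and the data are hypothetical.

```python
import random


def smoothed_bootstrap(data, bandwidth, n_draws, seed=0):
    """Draw from a Gaussian kernel density over `data` without evaluating it."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        center = rng.choice(data)                 # ordinary bootstrap step
        noise = bandwidth * rng.gauss(0.0, 1.0)   # kernel noise around that point
        draws.append(center + noise)
    return draws


# Hypothetical observations and an assumed bandwidth.
data = [2.1, 2.4, 3.0, 3.3, 4.8]
sample = smoothed_bootstrap(data, bandwidth=0.3, n_draws=1000)
print(len(sample), min(sample), max(sample))
```

Sampling this way is equivalent to drawing from the kernel density estimate with that bandwidth, which is why no density ever needs to be computed. With a bandwidth of zero it reduces to the ordinary bootstrap.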

R Packages worth a look

Dynamic Correlation Analysis for High Dimensional Data (DCA)
Finding dominant latent signals that regulate dynamic correlation between many pairs of variables.

Encrypt and Decrypt Strings, R Objects and Files (safer)
A consistent interface to encrypt and decrypt strings, R objects and files using symmetric key encryption.

Create and Evaluate NONMEM Models in a Project Context (nonmemica)
Systematically creates and modifies NONMEM(R) control streams. Harvests NONMEM output, builds run logs, creates derivative data, generates diagnostics. NONMEM (ICON Development Solutions <http://…/> ) is software for nonlinear mixed effects modeling. See ‘package?nonmemica’.

Tabular Reporting API (flextable)
Create pretty tables for ‘Microsoft Word’, ‘Microsoft PowerPoint’ and ‘HTML’ documents. Functions are provided to let users create tables, modify and format their content. It extends package ‘officer’ that does not contain any feature for customized tabular reporting. Function ‘tabwid’ produces an ‘htmlwidget’ ready to be used in ‘Shiny’ or ‘R Markdown (*.Rmd)’ documents. See the ‘flextable’ website for more information.

Connect to ‘DocuSign’ API (docuSignr)
Connect to the ‘DocuSign’ Rest API <https://…/RESTAPIGuide.htm>, which supports embedded signing, and sending of documents.

Document worth reading: “A Review on Algorithms for Constraint-based Causal Discovery”

Causal discovery studies the problem of mining causal relationships between variables from data, which is of primary interest in science. During the past decades, significant progress has been made toward this fundamental data mining paradigm. In recent years, with the availability of abundant large and complex observational data, constraint-based approaches have gradually attracted a lot of interest and have been widely applied to many diverse real-world problems, owing to their fast running speed and easy generalization to the problem of causal insufficiency. In this paper, we aim to review the constraint-based causal discovery algorithms. Firstly, we discuss the learning paradigm of the constraint-based approaches. Secondly and primarily, the state-of-the-art constraint-based causal inference algorithms are surveyed with detailed analysis. Thirdly, several related open-source software packages and benchmark data repositories are briefly summarized. In conclusion, some open problems in constraint-based causal discovery are outlined for future research. A Review on Algorithms for Constraint-based Causal Discovery