A System for an Accountable Data Analysis Process in R

Efficiently producing transparent analyses may be difficult for beginners or tedious for the experienced. This implies a need for computing systems and environments that can efficiently satisfy reproducibility and accountability standards. To this end, we have developed a system, R package, and R Shiny application called adapr (Accountable Data Analysis Process in R) that is built on the principle of accountable units. An accountable unit is a data file (statistic, table or graphic) that can be associated with a provenance, meaning how it was created, when it was created and who created it. This is similar to the 'verifiable computational results' (VCR) concept proposed by Gavish and Donoho. Both accountable units and VCRs are version controlled, sharable, and can be incorporated into a collaborative project. However, unlike VCRs, accountable units use file hashes and do not involve watermarking or public repositories. Reproducing collaborative work may be highly complex, requiring repeating computations on multiple systems from multiple authors; however, determining the provenance of each unit is simpler, requiring only a search using file hashes and version control systems.
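
The idea of looking up provenance by file hash can be illustrated with a minimal sketch. It is written in Python rather than R and is not adapr's API: the record layout and the function names `file_hash`, `provenance_record` and `find_by_hash` are assumptions made only for this example, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
from datetime import datetime, timezone


def file_hash(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(result_path, script_path, author):
    """Hypothetical record: how (script), who (author), when (timestamp)."""
    return {
        "file": str(result_path),
        "hash": file_hash(result_path),
        "created_by_script": str(script_path),
        "author": author,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def find_by_hash(records, path):
    """Return the stored records whose hash matches the file's current content."""
    target = file_hash(path)
    return [record for record in records if record["hash"] == target]
```

If a result file is edited after its record was stored, its hash changes and `find_by_hash` returns an empty list, which signals that the file no longer matches any recorded provenance.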


Support Vector Machines for Survival Analysis with R

This article introduces the R package survivalsvm, implementing support vector machines for survival analysis. Three approaches are available in the package: The regression approach takes censoring into account when formulating the inequality constraints of the support vector problem. In the ranking approach, the inequality constraints set the objective to maximize the concordance index for comparable pairs of observations. The hybrid approach combines the regression and ranking constraints in a single model. We describe survival support vector machines and their implementation, provide examples and compare the prediction performance with the Cox proportional hazards model, random survival forests and gradient boosting using several real datasets. On these datasets, survival support vector machines perform on par with the reference methods.
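
For readers unfamiliar with the concordance index that the ranking approach targets, the following is a minimal sketch of the usual definition for right-censored data (Harrell's C). It is written in Python for illustration and is not code from survivalsvm; the function name and argument layout are assumptions. A pair of observations is treated as comparable when the one with the shorter time had its event observed.

```python
def concordance_index(times, events, predicted):
    """Harrell's C-index for right-censored data.

    times:     observed times
    events:    1 if the event was observed, 0 if the observation is censored
    predicted: predicted survival times (larger means longer survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: i has the shorter time and i's event was observed.
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if predicted[i] < predicted[j]:
                    concordant += 1.0  # predicted order agrees with observed order
                elif predicted[i] == predicted[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1 means every comparable pair is ordered correctly, and 0.5 corresponds to random ordering.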


Simple Features for R: Standardized Support for Spatial Vector Data

Simple features are a standardized way of encoding spatial vector data (points, lines, polygons) in computers. The sf package implements simple features in R, and has roughly the same capacity for spatial vector data as packages sp, rgeos and rgdal. We describe the need for this package, its place in the R package ecosystem, and its potential to connect R to other computer systems. We illustrate this with examples of its use.


How far can we forecast?

Forecasts are often made for several consecutive periods ahead. For example, Consensus Economics collects quarterly forecasts for up to six quarters into the future from research institutes and other professional forecasters. Yet longer-term forecasts in particular may not provide any information beyond that contained in the long-run mean of the target variable. Such forecasts are deemed to be uninformative. Therefore, it is desirable to be able to determine the largest horizon for which informative forecasts can be made. Up to now, only descriptive methods have been available for this purpose.
Forecasts are useless whenever the forecast error variance fails to be smaller than the unconditional variance of the target variable. This paper develops tests for the null hypothesis that forecasts become uninformative beyond some limiting forecast horizon h. Following Diebold and Mariano (DM, 1995), we propose a test based on the comparison of the mean-squared error of the forecast and the sample variance. We show that the resulting test does not possess a limiting normal distribution and suggest two simple modifications of the DM-type test with different limiting null distributions. Furthermore, a forecast encompassing test is developed that tends to better control the size of the test. In our empirical analysis, we apply our tests to macroeconomic forecasts from the survey of Consensus Economics. Our results suggest that forecasts of key macroeconomic variables are barely informative beyond 2 to 4 quarters ahead.
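
The comparison underlying the abstract can be sketched descriptively: a forecast is informative only if its mean-squared error is below the variance of the target. The Python sketch below computes that ratio per horizon. It is not the paper's test and carries no inferential meaning; the function names and the input layout are assumptions made for this example.

```python
from statistics import fmean, pvariance


def mse_to_variance_ratio(actual, forecast):
    """Ratio of forecast MSE to the variance of the target (1/n normalisation).

    Values below 1 suggest the forecast beats the sample mean;
    forecasting the sample mean itself gives a ratio of exactly 1.
    """
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return fmean(squared_errors) / pvariance(actual)


def largest_informative_horizon(by_horizon):
    """by_horizon maps horizon h to (actual values, forecasts made h periods ahead).

    Returns the largest h such that every horizon up to h has a ratio below 1,
    or 0 if even the shortest horizon fails.
    """
    last_informative = 0
    for h in sorted(by_horizon):
        actual, forecast = by_horizon[h]
        if mse_to_variance_ratio(actual, forecast) >= 1.0:
            break
        last_informative = h
    return last_informative
```

Because the ratio is a sample quantity, a value slightly below 1 may be due to chance alone, which is the gap the formal tests are meant to close.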


How to Install and Use Homebrew

Discover Homebrew for data science: learn how you can use this package manager to install, update, and remove technologies such as Apache Spark and Graphviz.


Zomato Web Scraping with BeautifulSoup in Python

Data science projects start with the collection of data. Data can be collected from databases, from the internet, or from offline sources. These days most information is available online, and to extract it data engineers and data scientists use web scraping. In this article we will learn about web scraping and how it is done in Python using openly available tools.


Create your Machine Learning library from scratch with R ! (3/5) – KNN

This is the second post of the “Create your Machine Learning library from scratch with R !” series. Today, we will see how you can implement K nearest neighbors (KNN) using only the linear algebra available in R. Previously, we managed to implement PCA, and next time we will deal with SVM and decision trees.
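
As a rough illustration of the technique, here is a minimal KNN classifier. It is sketched in Python with the standard library rather than in R, so it is not the post's implementation; the function name `knn_predict`, the Euclidean distance and the majority vote are assumptions of this example.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train_x: sequence of coordinate tuples
    train_y: sequence of labels, one per training point
    """
    # Sort training points by Euclidean distance to the query and keep the k closest.
    neighbors = sorted(
        zip(train_x, train_y),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

For example, with three points labelled "a" near the origin and three labelled "b" near (5, 5), a query at (0.2, 0.2) is assigned "a" for k = 3.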


Intelligent Development Environment and Software Knowledge Graph

Software intelligent development has become one of the most important research trends in software engineering. In this paper, we put forward two key concepts, intelligent development environment (IntelliDE) and software knowledge graph, for the first time. IntelliDE is an ecosystem in which software big data are aggregated, mined and analyzed to provide intelligent assistance in the life cycle of software development. We present its architecture and discuss its key research issues and challenges. Software knowledge graph is a software knowledge representation and management framework, which plays an important role in IntelliDE. We study its concept and introduce some concrete details and examples to show how it could be constructed and leveraged.