Git is a distributed version-control system. Many people have written eloquently about why it is a good idea to use version control, not only when you collaborate in a team but also when you work on your own; one example is this article from RStudio’s Support pages. In short, version control lets you keep track of the changes you make to your code. It also keeps a history of all the changes you have made in the past and allows you to go back to a specific version if you make a major mistake. And Git makes collaborating on the same code very easy. Many R packages are also hosted on GitHub. You can read their R code in the repositories to get a deeper understanding of the functions, and you can install the latest development versions of packages, or packages that are not on CRAN. GitHub’s issue tracker also makes it easy to report and respond to problems with your code.
The myriad uses of big data continue to unfold as new methods of generating, parsing and combining it evolve. Consider the data generated by Internet of Things (IoT) sensors, produced through artificial intelligence (AI), or existing as historical records. Modern industry can now capture data from these and other sources to create a “digital twin” – a virtual model that is essentially the intelligent counterpart of an actual, physical object. By monitoring the status of an object or process and using multiple real-time streams of data to study its digital twin, engineers gain insight into how to improve product lifecycles, streamline maintenance and hone optimization.
Decision trees are among the most respected algorithms in machine learning and data science. They are transparent, easy to understand, robust in nature and widely applicable. You can actually see what the algorithm is doing and what steps it performs to get to a solution. This trait is particularly important in a business context when it comes to explaining a decision to stakeholders. This skill test was specially designed to test your knowledge of decision tree techniques. More than 750 people registered for the test. If you are one of those who missed out on the skill test, here are the questions and solutions.
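The transparency claim is easy to demonstrate: a fitted tree can be printed as plain if/else rules that a stakeholder can read. A minimal sketch, using scikit-learn and the iris data as stand-ins (neither is from the original post):

```python
# Fit a shallow decision tree, then print its decision rules verbatim.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the tree as human-readable nested if/else rules,
# one line per split -- this is "seeing what the algorithm is doing".
rules = export_text(tree, feature_names=load_iris().feature_names)
print(rules)
```

Each split threshold and leaf class is visible in the printout, which is exactly the kind of explanation one can hand to a non-technical audience.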
If you were to ask me for the two most intuitive algorithms in machine learning, they would be k-Nearest Neighbours (kNN) and tree-based algorithms. Both are simple to understand, easy to explain and perfect to demonstrate to people. Interestingly, we held skill tests for both of these algorithms last month. If you are new to machine learning, make sure you test your understanding of both. They are simple, but immensely powerful and used extensively in industry. This skill test will help you assess your knowledge of kNN and its applications.
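To show just how intuitive kNN is, the whole algorithm fits in a few lines: label a new point by majority vote among its k closest training points. A from-scratch sketch with invented toy data (Euclidean distance, k = 3):

```python
# Minimal kNN classifier: majority vote among the k nearest neighbours.
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    # distance from the query point x to every training point
    dists = sorted((math.dist(p, x), label) for p, label in zip(train_X, train_y))
    # vote among the k closest labels
    votes = [label for _, label in dists[:k]]
    return Counter(votes).most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (0.5, 0.5)))  # → a
print(knn_predict(train_X, train_y, (5.5, 5.5)))  # → b
```

No training step at all: the "model" is the data itself, which is why kNN is such a good first algorithm to explain.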
If you read this blog, you are very likely involved in some kind of data collection, manipulation or analysis. When not performed wisely, your analysis will lead you to incorrect conclusions. Alex Reinhart, in his book Statistics Done Wrong, covers several concepts that are key when analysing data, such as statistical power, correlation versus causation, and publication bias. The book provides interesting advice and warnings related to research papers, and Alex clearly explains how people currently use statistics, with examples of misuse. Statistics Done Wrong offers plenty of examples of statistical misinterpretation – even by statisticians. It covers what I would call insidious topics, such as the base rate fallacy and the problem of testing several hypotheses, which inflates the rate of false positives among significant p-values. The concept of statistical power – how you can miss an effect if your sample size is inadequate – is also discussed. My only regret is that, from Chapter 9 onwards, the book suddenly targets an academic audience with topics related to publication. Apart from these last chapters, the book is truly dedicated to practitioners in data-related fields. Any data scientist should read this book.
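The multiple-testing problem the book warns about is easy to see in a quick simulation (the simulation is mine, not from the book): even when there is no real effect anywhere, each test at p < 0.05 has a 5% chance of a "discovery", so running many tests guarantees spurious findings.

```python
# Simulate 1000 two-sample t-tests where both groups come from the
# SAME distribution, i.e. the null hypothesis is true every time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
false_positives = 0
for _ in range(1000):
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of 1000 tests -- dozens of "significant" results from pure noise.
print(false_positives)
```

This is exactly why testing several hypotheses without correction inflates the false-positive rate among significant p-values.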
This article is meant to help R users enhance their skill set and learn Python for data science from scratch. After all, R and Python are the two most important programming languages a data scientist must know. Python is a supremely powerful, multi-purpose programming language that has grown phenomenally in the last few years. It is used for web development, game development, and now data analysis and machine learning. Data analysis and machine learning are relatively new branches in Python, and for a beginner in data science, learning Python for data analysis can be really painful. Why? Google ‘learn python’ and you’ll get tons of tutorials aimed only at web development. How do you find your way then? In this tutorial, we’ll explore the basics of Python for performing data manipulation tasks. Alongside, we’ll also look at how you would do the same in R. This parallel comparison will help you relate the tasks you do in R to their Python equivalents. And in the end, we’ll take up a data set and practice our newly acquired Python skills.
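A taste of the kind of R-to-Python parallels the tutorial draws, using pandas (the usual Python counterpart to R data frames); the toy data here is mine, not the tutorial's:

```python
# Common R data-manipulation tasks and their pandas equivalents.
import pandas as pd

df = pd.DataFrame({"species": ["cat", "dog", "cat", "dog"],
                   "weight":  [4.0, 10.0, 5.0, 12.0]})

# R: head(df)             -> Python: df.head()
# R: df[df$weight > 5, ]  -> Python: df[df["weight"] > 5]
# R: aggregate(weight ~ species, df, mean)
#                         -> Python: df.groupby("species")["weight"].mean()
means = df.groupby("species")["weight"].mean()
print(means)
```

Once the pairings click, most dplyr/base-R idioms map onto a pandas one-liner.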
In machine learning, cross-validation is a resampling method used for model evaluation, designed to avoid testing a model on the same dataset on which it was trained. Evaluating on the training data is a common mistake, especially when a separate testing dataset is not available, and it usually leads to inaccurate performance measures: the model will achieve an almost perfect score because it is being tested on the very data it was trained on. To avoid this kind of mistake, cross-validation is usually preferred. The concept is actually simple: instead of using the whole dataset for training and then testing on the same data, we randomly divide the data into training and testing datasets.
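The splitting idea can be sketched from scratch as k-fold cross-validation: each fold serves exactly once as the test set while the remaining folds form the training set, so no point is ever scored on data it was trained on. A minimal sketch (the helper name is mine):

```python
# Split n samples into k non-overlapping test folds of near-equal size.
def kfold_indices(n, k):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(10, 3)
print(folds)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

for test_idx in folds:
    train_idx = [i for i in range(10) if i not in test_idx]
    # fit the model on train_idx, evaluate on test_idx,
    # then average the k scores for the final performance estimate
```

In practice one would shuffle the indices first; libraries such as scikit-learn provide this machinery ready-made.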
Google Trends shows the changes in the popularity of search terms over a given time period (i.e., the number of hits over time). It can be used to find search terms with growing or shrinking popularity, or to review periodic variations from the past, such as seasonality. Google Trends search data can be added to other analyses, manipulated and explored in more detail in R. This post describes how you can use R to download data from Google Trends and then include it in a chart or other analysis. We’ll first discuss how to get overall (global) data on a search term (query), how to plot it as a simple line chart, and then how to break the data down by geographical region. The first example I will look at is the rise and fall of Blu-ray.
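The post's basic workflow, once the data is downloaded, is "interest over time in, line chart out". Since a live download needs a Google Trends client, here is the charting step sketched with an invented yearly hits series (a Python stand-in; all numbers are illustrative, not real Trends data):

```python
# Toy "hits over time" series standing in for downloaded Trends data.
import pandas as pd

hits = pd.Series(
    [20, 25, 30, 55, 90, 60, 35, 30],          # invented yearly interest
    index=pd.period_range("2005", periods=8, freq="Y"),
    name="blu-ray",
)

peak = hits.idxmax()       # the peak year of a rise-and-fall pattern
print(peak)
# hits.plot()              # would draw the simple line chart (needs matplotlib)
```

With real Trends data in this shape, the same `idxmax`/`plot` calls reveal the rise-and-fall story directly.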
A common experience for those in the social sciences migrating to R from SPSS or Stata is that some procedures that used to happen at the click of a button now require more work, or are too obscured by the unfamiliar language to see how to accomplish them. One such procedure I’ve experienced is calculating the marginal effects of a generalized linear model. In this exercise set, we will explore calculating marginal effects for linear, logistic, and probit regression models in R.
LASSO has been a popular algorithm for variable selection and is extremely effective with high-dimensional data. However, it often tends to “over-regularize”, producing a model that is overly compact and therefore under-predictive. The elastic net addresses this “over-regularization” by balancing the LASSO and ridge penalties. In particular, a hyper-parameter, namely Alpha, mixes the two penalties such that the model becomes a LASSO when Alpha = 1 and a ridge when Alpha = 0. In practice, Alpha can be tuned easily by cross-validation. Below is a demonstration of the elastic net with the R glmnet package and a comparison with LASSO and ridge models.
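The same comparison can be sketched in Python with scikit-learn (a stand-in for glmnet; the data is simulated by me). Note the naming mismatch: scikit-learn's `l1_ratio` plays the role of glmnet's Alpha (1 = LASSO, 0 = ridge), while scikit-learn's `alpha` is the overall penalty strength, glmnet's lambda.

```python
# Compare the sparsity of elastic net, LASSO and ridge fits.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta = np.zeros(20)
beta[:3] = [3.0, -2.0, 1.5]              # only 3 of 20 predictors matter
y = X @ beta + rng.normal(size=100)

enet  = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # between the two
lasso = Lasso(alpha=0.1).fit(X, y)                     # pure L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)                     # pure L2 penalty

# LASSO and elastic net zero out weak coefficients; ridge only shrinks them.
print((lasso.coef_ == 0).sum(), (enet.coef_ == 0).sum(), (ridge.coef_ == 0).sum())
```

Tuning `l1_ratio` by cross-validation (e.g. `ElasticNetCV`) mirrors tuning Alpha in glmnet.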
Control calculation ping-pong is the process of iteratively improving a statistical analysis by comparing the results of two independent analysis approaches until they agree. We use the daff package to simplify the comparison of the two results, and illustrate its use in a case study where two statisticians ping-pong an analysis using dplyr and SQL, respectively.
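The ping-pong idea in miniature: the same summary computed two independent ways, then diffed until they agree. A Python sketch (daff and dplyr are R; pandas and sqlite3 stand in here, and a plain DataFrame comparison stands in for the daff diff — the toy data is mine):

```python
# One summary, two independent implementations: pandas vs. SQL.
import pandas as pd
import sqlite3

df = pd.DataFrame({"region": ["N", "N", "S", "S", "S"],
                   "sales":  [10, 20, 5, 5, 10]})

# Statistician 1: dplyr-style grouping, done in pandas
res1 = df.groupby("region", as_index=False)["sales"].sum()

# Statistician 2: the same aggregation written in SQL
con = sqlite3.connect(":memory:")
df.to_sql("sales", con, index=False)
res2 = pd.read_sql(
    "SELECT region, SUM(sales) AS sales FROM sales GROUP BY region ORDER BY region",
    con)

# The "diff" step: any discrepancy triggers another round of ping-pong.
agree = res1.equals(res2)
print(agree)
```

When the two results disagree, the diff pinpoints the offending rows and the next iteration begins; agreement ends the game.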