Choosing between R and Python has always been a matter of debate. The Machine Learning world has been divided over the preference for one language over the other. But with the explosion of Deep Learning, the balance shifted towards Python, as it had an enormous list of Deep Learning libraries and frameworks which R lacked (until now). I personally switched from R to Python simply because I wanted to dive into the Deep Learning space, which was almost impossible with R. But not anymore! With the launch of Keras in R, this fight is back at the center. Python was slowly becoming the de facto language for Deep Learning models. But with the release of the Keras library in R, with TensorFlow (CPU and GPU compatible) as the backend, it is likely that R will again fight Python for the podium, even in the Deep Learning space. Below we will see how to install Keras with TensorFlow in R and build our first Neural Network model on the classic MNIST dataset in RStudio.
In ordinary linear regression, we estimate the mean of some variable y, conditional on the values of the independent variables X. When we fit an ordinary least squares regression model to the data, we make a key assumption about the random error term in the linear model: that the error term has constant variance across the values of the independent variables X.
• What happens when this assumption is no longer true?
• Also, instead of estimating the mean of our dependent variable, can we estimate the median, or the 0.3 quantile or the 0.8 quantile, of our dependent variable?
This is where Quantile Regression comes to our rescue.
Let us write some code to better understand this. Let us create some data and plot it.
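As a rough illustration of the idea, here is a minimal Python sketch (standard library only, no plotting). Everything in it is an assumption made for illustration: the fan-shaped toy data, the helper names `pinball_loss` and `fit_quantile_line`, and the brute-force fitting strategy, which simply tries every line through two data points and keeps the one with the smallest check (pinball) loss. That works for a single predictor because an optimal quantile line can always be found among those candidates, but it is not how a statistics package would do it.

```python
from itertools import combinations


def pinball_loss(residual, tau):
    """Check (pinball) loss: tau * r if r >= 0, else (tau - 1) * r."""
    return tau * residual if residual >= 0 else (tau - 1) * residual


def fit_quantile_line(xs, ys, tau):
    """Brute-force quantile regression of y on a single x.

    Tries every line through two data points and returns the
    (intercept, slope) pair with the smallest total pinball loss.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line cannot be written as a + b * x
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = sum(
            pinball_loss(y - intercept - slope * x, tau)
            for x, y in zip(xs, ys)
        )
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Hypothetical toy data: the spread of y around the line 1 + 2x grows
# with x, so the constant-variance assumption is violated on purpose.
spread = [-1.0, -0.5, 0.0, 0.5, 1.0]
xs = [float(i) for i in range(1, 51)]
ys = [1.0 + 2.0 * x + spread[i % 5] * x for i, x in enumerate(xs, start=1)]

fits = {tau: fit_quantile_line(xs, ys, tau) for tau in (0.1, 0.5, 0.9)}
for tau, (intercept, slope) in fits.items():
    print(f"tau={tau}: y = {intercept:.2f} + {slope:.2f} * x")
```

On this toy data the fitted slopes differ across quantiles (1, 2 and 3 for the 0.1, 0.5 and 0.9 quantiles), which is exactly the information a single mean regression line would hide.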
For over a year, we surveyed thousands of companies, across all types of industries and levels of data science advancement, on how they managed to overcome these difficulties, and we analyzed the results. Here are the key things to keep in mind when you’re working on your design-to-production pipeline.
Developers who have ever written an S3 method for the print() function probably know what a top-level R expression means, but this is a very confusing concept to non-developers. I have to explain this every now and then, so I decided to write a short post about it. Yesterday I saw a GitHub issue in the rmarkdown repository, and you can see that there are still users confused by the fact that ggplot2 plots are not rendered in certain cases. I have seen similar questions perhaps hundreds of times. Such questions have been answered in the R FAQ 7.22 “Why do lattice/trellis graphics not work?”, but the answer didn’t explain the root reason in detail.
Until very recently, only a very limited class of feasible non-Gaussian time series models was available. For example, one could use extensions of state space models to non-Gaussian environments (see, for example, Durbin and Koopman (2012)), but extensive Monte Carlo simulation is required to numerically evaluate the conditional densities that define the estimation process of such models. The technical difficulty of implementing these algorithms and their accompanying computational cost have hindered their widespread use by practitioners. On the other hand, different attempts to extend ARMA-type models with conditional non-Gaussian distributions have been more successful. Examples include the use of GARCH-type models to deal with heavy-tailed distributions in finance (Engle and Bollerslev (1986)), the Autoregressive Conditional Duration (ACD) model of Engle and Russell (1998) to tackle asymmetric distributions in time durations, and the Poisson count models of Davis et al. (2003) for the modelling of discrete events in time. But, so far, these extensions have lacked a unifying framework that would allow the specification, estimation and forecasting of a model based on an arbitrary non-Gaussian distribution. The recently proposed Generalized Autoregressive Score (GAS) models of Creal et al. (2008, 2013), or dynamic conditional score (DCS) models of Harvey (2013), offer a unifying framework to derive and estimate time series models with any conditional non-Gaussian distribution, either discrete or continuous, univariate or multivariate. More specifically, in GAS models, conditional on past observations, a proper probability model, possibly non-Gaussian, is chosen for the response variable at time t. Then, by construction, time-varying parameters can be accommodated according to an updating mechanism that uses the score as its driving force.
The use of the score for updating time-varying parameters is intuitive, given that it defines the steepest ascent direction for improving the model’s local fit in terms of the likelihood or density at time t, given the current parameter position. In such an updating mechanism, information from the whole density is used to track the evolution of the time-varying parameters. In this post I will briefly explain the estimation framework of such models for our community. However, I strongly encourage you, our fellow “insighteR”, to pay a visit to gasmodel.com, where you can find a whole section devoted to papers on GAS models and see for yourself the diversity of applications.
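To make the score-driven updating mechanism a little more concrete, here is a minimal Python sketch of a GAS(1,1)-style filter. It rests on several assumptions that are not spelled out in the text above: the observations are taken to be Gaussian with zero mean and a time-varying variance f, the score is scaled by the inverse Fisher information, and the update has the form f_next = omega + a * scaled_score + b * f. The function names and parameter values are hypothetical. Under these particular choices the scaled score collapses to y^2 - f, so the filter coincides with a GARCH(1,1) recursion, which the second function is there to check.

```python
def gas_gaussian_variance(ys, omega, a, b, f0):
    """Score-driven filter for the variance of a zero-mean Gaussian.

    Returns the filtered variances f_1, ..., f_n, where f_t only uses
    observations before time t.
    """
    f = f0
    path = []
    for y in ys:
        path.append(f)
        # Score of the Gaussian log density with respect to the variance f.
        score = (y * y - f) / (2.0 * f * f)
        # Fisher information of the variance parameter.
        fisher_info = 1.0 / (2.0 * f * f)
        # Inverse-information scaling (an assumed, common choice).
        scaled_score = score / fisher_info  # equals y^2 - f
        f = omega + a * scaled_score + b * f
    return path


def garch_variance(ys, omega, alpha, beta, f0):
    """Plain GARCH(1,1) variance recursion, used only for comparison."""
    f = f0
    path = []
    for y in ys:
        path.append(f)
        f = omega + alpha * y * y + beta * f
    return path


# Hypothetical observations and parameters, chosen only for illustration.
ys = [0.3, -1.2, 0.8, 2.5, -0.4, 0.1, -1.9, 0.6]
omega, a, b = 0.1, 0.1, 0.9

gas_path = gas_gaussian_variance(ys, omega, a, b, f0=1.0)
garch_path = garch_variance(ys, omega, alpha=a, beta=b - a, f0=1.0)

for f_gas, f_garch in zip(gas_path, garch_path):
    print(f"GAS: {f_gas:.6f}   GARCH: {f_garch:.6f}")
```

Swapping in a different conditional density changes only the `score` and `fisher_info` lines; the rest of the recursion stays the same, which is the sense in which the framework is unifying.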
Choropleths are a common approach to visualizing data on geographic maps. But choropleths — by design or necessity — aggregate individual data points into a single geographic region (like a country or census tract), which is all shaded a single colour. This can introduce interpretability issues (are we seeing changes in the variable of interest, or just population density?) and can fail to express the richness of the underlying data. For an alternative approach, take a look at the recent Culture of Insight blog post which provides a tutorial on creating dot-density maps in R. The chart below is based on UK Census data. Each point represents 10 London residents, with the colour representing one of five ethnic categories. Now, the UK census only reports ethnic ratios on a borough-by-borough basis, so the approach here is to simulate the individual resident data (which is not available) by randomly distributing points across the borough following the reported distribution. In a way, this is suggesting a level of precision which isn’t available in the source data, but it does provide a visualization of London’s ethnic diversity that isn’t confounded with the underlying population distribution.
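The simulation step described above can be sketched in a few lines of Python. This is not the code from the Culture of Insight tutorial (which uses R); it is a standard-library illustration under stated assumptions. The borough outline, the category names and the resident counts are all hypothetical, and the points are placed by simple rejection sampling inside the polygon's bounding box.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Does this edge cross the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def dot_density(polygon, counts, people_per_dot, rng):
    """Scatter one dot per `people_per_dot` residents inside the polygon.

    `counts` maps a category name to its resident count. Dots are drawn
    uniformly from the bounding box and kept only if they fall inside.
    """
    min_x = min(p[0] for p in polygon)
    max_x = max(p[0] for p in polygon)
    min_y = min(p[1] for p in polygon)
    max_y = max(p[1] for p in polygon)
    dots = []
    for category, residents in counts.items():
        for _ in range(residents // people_per_dot):
            while True:
                x = rng.uniform(min_x, max_x)
                y = rng.uniform(min_y, max_y)
                if point_in_polygon(x, y, polygon):
                    dots.append((x, y, category))
                    break
    return dots


# Hypothetical borough outline and census counts (not real data).
borough = [(0.0, 0.0), (4.0, 0.0), (5.0, 3.0), (2.0, 5.0), (-1.0, 2.0)]
counts = {"group_a": 1200, "group_b": 430, "group_c": 95}

rng = random.Random(42)  # fixed seed so the layout is reproducible
dots = dot_density(borough, counts, people_per_dot=10, rng=rng)
print(len(dots), "dots generated")
```

Each dot carries its category, so a plotting library can colour the points by group. The positions within the borough are random, which is why the resulting map hints at more spatial precision than the source data actually contains.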