Anyone who has worked on a Click Prediction problem or Recommendation systems would have faced a similar situation. Since the datasets are huge, doing predictions for these datasets becomes challenging with limited computation resources. However, in most cases these datasets are sparse (only a few variables for each training example are non zero) due to which there are several features which are not important for prediction, this is where factorization helps to extract the most important latent or hidden features from the existing raw ones. Factorization helps in representing approximately the same relationship between the target and predictors using a lower dimension dense matrix. In this article, I discuss Factorization Machines(FM) and Field Aware Factorization Machines(FFM) which allows us to take advantage of factorization in a regression/classification problem with an implementation using python.
Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data. I’ll provide an overview along with information to help you choose. I organize the types of regression by the different kinds of dependent variable. If you’re not sure which procedure to use, determine which type of dependent variable you have, and then focus on that section in this post. This process should help narrow the choices! I’ll cover regression models that are appropriate for dependent variables that measure continuous, categorical, and count data.
In this article, we introduce an Uber forecasting model that combines historical data and external factors to more precisely predict extreme events, highlighting its new architecture and how it compares to our previous model.
This is the third tutorial on the Spark RDDs Vs DataFrames vs SparkSQL blog post series. The first one is available here. In the first part, we saw how to retrieve, sort and filter data using Spark RDDs, DataFrames and SparkSQL. In the second part (here), we saw how to work with multiple tables in Spark the RDD way, the DataFrame way and with SparkSQL. In this third part of the blog post series, we will perform web server log analysis using real-world text-based production logs. Log data can be used monitoring servers, improving business and customer intelligence, building recommendation systems, fraud detection, and much more. Server log analysis is a good use case for Spark. It’s a very large, common data source and contains a rich set of information.
Exponential Smoothing is an old technique, but it can perform extremely well on real time series, as discussed in Hyndman, Koehler, Ord & Snyder (2008)), when Gardner (2005) appeared, many believed that exponential smoothing should be disregarded because it was either a special case of ARIMA modeling or an ad hoc procedure with no statistical rationale. As McKenzie (1985) observed, this opinion was expressed in numerous references to my paper. Since 1985, the special case argument has been turned on its head, and today we know that exponential smoothing methods are optimal for a very general class of state-space models that is in fact broader than the ARIMA class.
Quantum Machine Learning (Quantum ML) is the interdisciplinary area combining Quantum Physics and Machine Learning(ML). It is a symbiotic association- leveraging the power of Quantum Computing to produce quantum versions of ML algorithms, and applying classical ML algorithms to analyze quantum systems. Read this article for an introduction to Quantum ML.
This is the considerably belated second part of my blog series on fitting diffusion models (or better, the 4-parameter Wiener model) with brms. The first part discusses how to set up the data and model. This second part is concerned with perhaps the most important steps in each model based data analysis, model diagnostics and the assessment of model fit. Note, the code in the part is completely self sufficient and can be run without running the code of part I.
In this post, I’ll describe a fun visualization of Anscombe’s quartet I whipped up recently. If you aren’t familiar with Anscombe’s quartet, here’s a brief description from its Wikipedia entry: ‘Anscombe’s quartet comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. He described the article as being intended to counter the impression among statisticians that ‘numerical calculations are exact, but graphs are rough.’ ‘ In essence, there are 4 different datasets with quite different patterns in the data. Fitting a linear regression model through each dataset yields (nearly) identical regression coefficients, while graphing the data makes it clear that the underlying patterns are very different. What’s amazing to me is how these simple data sets (and accompanying graphs) make immediately intuitive the importance of data visualization, and drive home the point of how well-constructed graphs can help the analyst understand the data he or she is working with.
The R package wrapr now has a neat new feature: “wrapr_applicable”. Wraprs This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique and allowed us to quickly implement a convenient new ad hoc query mode for rquery. A small effort in making a package “wrapr aware” appears to have a fairly large payoff.
In this post, I will demonstrate the use of nonlinear models for time series analysis, and contrast to linear models. I will use a (simulated) noisy and nonlinear time series of sales data, use multiple linear regression and a small neural network to fit training data, then predict 90 days forward. I implemented all of this in R, although it could be done in a number of coding environments. (Specifically, I used R 3.4.2 in RStudio 1.1.183 in Windows 10).