Almost everyone loves to spend leisure time watching movies with family and friends. We have all had the same experience: we sit down on the couch to choose a movie for the next two hours, and 20 minutes later we still have not found one. It is so disappointing. We definitely need a computer agent that recommends movies when we have to choose one and saves us time. Movie recommendation agents have already become an essential part of our lives. According to Data Science Central, ‘Although hard data is difficult to come by, many informed sources estimate that, for the major ecommerce platforms like Amazon and Netflix, that recommenders may be responsible for as much as 10% to 25% of incremental revenue.’ In this project, I study some basic algorithms for movie recommendation and also try to integrate deep learning into my movie recommendation system.
Any time series classification or regression forecast involves predicting Y at time ‘t+n’ given the X and Y information available up to time t. No data scientist or statistician should deploy such a system without backtesting and validating the model’s performance on historical data. Using actual future information in the training data, which is termed ‘look-ahead bias’, is probably the gravest mistake a data scientist can make. Even though the sentence “we cannot make use of future data in training” sounds obvious and simple in theory, anyone can unknowingly introduce look-ahead bias in complex forecasting problems.
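One common way to keep future information out of training during a backtest is a walk-forward split, where every training observation comes strictly before every test observation. The sketch below is only an illustration of that idea: the function name `walk_forward_splits` and its parameters are hypothetical, and it assumes the samples are already sorted by time. It does not protect against features that were themselves computed from future values.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, initial_train: int, horizon: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs over time-ordered samples.

    Every training index is strictly earlier than every test index,
    so the model is never fitted on observations from its own future.
    """
    train_end = initial_train
    while train_end + horizon <= n_samples:
        train_idx = list(range(0, train_end))
        test_idx = list(range(train_end, train_end + horizon))
        yield train_idx, test_idx
        train_end += horizon  # expand the training window forward in time


# Hypothetical sizes, chosen only to keep the printout short.
splits = list(walk_forward_splits(n_samples=10, initial_train=4, horizon=2))
for train_idx, test_idx in splits:
    print("train up to index", train_idx[-1], "-> test on", test_idx)
```

Each fold refits on everything seen so far and evaluates on the next block only, which mirrors how the model would be used in production.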
Variable reduction is a crucial step for accelerating model building without losing the potential predictive power of the data. With the advent of Big Data and sophisticated data mining techniques, the number of variables encountered is often tremendous, making variable selection or dimension reduction techniques imperative for producing models with acceptable accuracy and generalization. The temptation to build an ecological model using all available information (i.e., all variables) is hard to resist. Ample time and money are spent gathering data and supporting information. Analytical limitations, however, require us to think carefully about the variables we choose to model, rather than adopting a naive approach in which we blindly use all information to understand complexity. The purpose of this post is to illustrate some techniques for effectively managing the selection of explanatory variables, leading to a parsimonious model with the highest possible prediction accuracy. Note that, depending on the data, the following techniques may or may not be applied in the given order. The basic step before applying them is to run a univariate analysis of all variables to obtain the observation frequency counts as well as the missing value counts. Variables with a large proportion of missing values can be dropped up front from further analysis.
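The preliminary screening step described above can be sketched with pandas. The data frame, its column names, and the 50% cutoff are all made up for illustration; the post does not state a specific threshold, so `MAX_MISSING_SHARE` is an assumption to be tuned per project.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame(
    {
        "age": [34, 51, np.nan, 29, 45],
        "income": [np.nan, np.nan, np.nan, 52000.0, np.nan],
        "region": ["N", "S", "S", None, "N"],
    }
)

# Univariate summary: observed count, missing count and missing share.
summary = pd.DataFrame(
    {
        "n_observed": df.notna().sum(),
        "n_missing": df.isna().sum(),
        "missing_share": df.isna().mean(),
    }
)
print(summary)

MAX_MISSING_SHARE = 0.5  # assumed cutoff, not taken from the post

# Drop variables whose share of missing values exceeds the cutoff.
to_drop = summary.index[summary["missing_share"] > MAX_MISSING_SHARE].tolist()
reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)
```

Here only `income` is removed, because four of its five values are missing; the other two columns lose a single value each and stay in the analysis.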
Analytical challenges in multivariate data analysis and predictive modeling include identifying redundant and irrelevant variables. A recommended approach is to address redundancy first, which can be achieved by identifying groups of variables that are as correlated as possible among themselves and as uncorrelated as possible with the other variable groups in the same data set. Relevancy, on the other hand, concerns the potential predictor variables and involves understanding the relationship between the target variable and the input variables.
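To make the redundancy idea concrete, here is a rough sketch that groups variables by absolute pairwise correlation. It uses a simple greedy, seed-based grouping rather than a proper variable clustering procedure, and the synthetic data, the column names, and the 0.9 threshold are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)

# Synthetic variables: two near-duplicate pairs plus one unrelated column.
df = pd.DataFrame(
    {
        "a": a,
        "a_copy": a + 0.05 * rng.normal(size=n),
        "b": b,
        "b_scaled": 3.0 * b + 0.05 * rng.normal(size=n),
        "c": rng.normal(size=n),
    }
)

THRESHOLD = 0.9  # assumed cutoff for calling two variables redundant
corr = df.corr().abs()

# Greedy grouping: take the next unassigned variable as a seed and attach
# every remaining variable that is strongly correlated with that seed.
groups = []
unassigned = list(df.columns)
while unassigned:
    seed = unassigned.pop(0)
    members = [seed] + [v for v in unassigned if corr.loc[seed, v] > THRESHOLD]
    unassigned = [v for v in unassigned if v not in members]
    groups.append(members)

print(groups)
```

One representative per group could then be carried forward to the relevancy step, where each candidate is compared against the target variable.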
The internet age has brought unfathomably massive amounts of information to the fingertips of billions – if only we had time to read it. Though our lives have been transformed by ready access to limitless data, we also find ourselves ensnared by information overload. For this reason, automatic text summarization – the task of automatically condensing a piece of text to a shorter version – is becoming increasingly vital.
Parallelism is a good idea when a task can be divided into sub-tasks that execute independently of each other, without communication or shared resources. Even then, an efficient implementation is key to achieving the benefits of parallelization.
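As a small illustration of independent sub-tasks, the sketch below maps a function over a list of inputs with a worker pool from the standard library. The `count_words` function and the sample documents are made up. A thread pool is used only to keep the example short; for CPU-bound pure-Python work on standard CPython, a process pool is usually the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(text: str) -> int:
    """An independent sub-task: it needs no data from any other sub-task."""
    return len(text.split())


# Made-up inputs; each one can be processed on its own.
documents = [
    "parallelism is a good idea",
    "when sub-tasks are independent",
    "and share no resources",
]

# map() hands each document to a worker and returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(count_words, documents))

# The only combining step happens after all sub-tasks have finished.
total = sum(counts)
print(counts, total)
```

Because the sub-tasks never talk to each other, the only coordination cost is collecting the results at the end, which is what makes this kind of workload a good candidate for parallel execution.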
This post outlines some very basic methods for performing financial data analysis using Python, Pandas, and Matplotlib, focusing mainly on stock price data. A good place for beginners to start.
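For a flavour of what such an analysis can look like, here is a minimal sketch that computes daily returns and a simple moving average with Pandas. The prices are invented, and the choice of these two calculations and of a 3-day window is an assumption, not necessarily what the post itself covers.

```python
import pandas as pd

# Made-up closing prices, for illustration only.
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 107.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="close",
)

# Daily return: (price today / price yesterday) - 1.
daily_return = prices.pct_change()

# 3-day simple moving average; the first two rows have no full window.
moving_avg = prices.rolling(window=3).mean()

summary = pd.DataFrame(
    {"close": prices, "daily_return": daily_return, "ma_3": moving_avg}
)
print(summary)
```

With Matplotlib installed, `summary[["close", "ma_3"]].plot()` would draw the price next to its moving average.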
A curated list of the most cited deep learning papers (since 2012). We believe that there exist classic deep learning papers which are worth reading regardless of their application domain. Rather than providing an overwhelming number of papers, we would like to provide a curated list of the awesome deep learning papers which are considered must-reads in certain research domains.
TensorFlow is by far the most popular deep learning software package available today. This training covers all of the essentials of TensorFlow and provides you with hands-on experience building a deep learning model using the TensorFlow library. Every line of code written during the course is analyzed to help you understand…
There are several Python libraries which provide solid implementations of a range of machine learning algorithms. One of the best known is Scikit-Learn, a package that provides efficient versions of a large number of common algorithms. Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation. A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward. This section provides an overview of the Scikit-Learn API; a solid understanding of these API elements will form the foundation for understanding the deeper practical discussion of machine learning algorithms and approaches in the following chapters. We will start by covering data representation in Scikit-Learn, then cover the Estimator API, and finally go through a more interesting example of using these tools to explore a set of images of hand-written digits.
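The uniform pattern mentioned above can be sketched in a few lines on the bundled digits data set. The choice of `GaussianNB`, the default train/test split, and `random_state=0` are assumptions made for this illustration, not necessarily what the section itself uses.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Data representation: a 2D feature matrix X and a 1D target array y.
digits = load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()            # 1. choose a model class and instantiate it
model.fit(X_train, y_train)     # 2. fit the model to the training data
y_pred = model.predict(X_test)  # 3. predict labels for unseen data

accuracy = accuracy_score(y_test, y_pred)
print(X.shape, round(accuracy, 3))
```

Swapping in a different classifier would change only the import and the line that instantiates the model; the `fit` and `predict` calls stay the same.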
I recently delivered a day of training at SQLBits and I really upped my game in terms of infrastructure for it. The resulting solution was super smooth and mitigated all the install issues and preparation for attendees. This meant we got to spend the whole day doing R, instead of troubleshooting. I’m so happy with this online R training environment that I want to share it, so you can take it and use it when you need to deliver training.
The classification decisions made by machine learning models are usually difficult, if not impossible, for humans to understand. The complexity of some of the most accurate classifiers, like neural networks, is what makes them perform so well, often with better results than those achieved by humans. But it also makes them inherently hard to explain, especially to non-data scientists. In particular, if we aim to develop machine learning models for medical diagnostics, high accuracy on test samples might not be enough to sell them to clinicians. Doctors and patients alike will be less inclined to trust a decision made by a model that they don’t understand. Therefore, we would like to be able to explain in concrete terms why a model classified a case with a certain label, e.g. why one breast mass sample was classified as “malignant” and not as “benign”.
R is a very fluid language amenable to meta-programming, or alterations of the language itself. This has allowed the late user-driven introduction of a number of powerful features such as magrittr pipes, the foreach system, futures, data.table, and dplyr. Please read on for some small meta-programming effects we have been experimenting with.
Yesterday, I had the honour of presenting at The Data Science Conference in Chicago. My topic was Reproducible Data Science with R, and while the specific practices in the talk are aimed at R users, my intent was to make a general argument for doing data science within a reproducible workflow.