A Practical Introduction to K-Nearest Neighbors Algorithm for Regression (with Python code)

Out of all the machine learning algorithms I have come across, KNN has easily been the simplest to pick up. Despite it´s simplicity, it has proven to be incredibly effective at certain tasks (as you will see in this article). And even better? It can be used for both classification and regression problems! It´s far more popularly used for classification problems, however. I have seldom seen KNN being implemented on any regression task. My aim here is to illustrate and emphasize how KNN can be equally effective when the target variable is continuous in nature.


Support Vector Machines in R

In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. However, they are mostly used in classification problems. In this tutorial, we will try to gain a high-level understanding of how SVMs work and then implement them using R. I´ll focus on developing intuition rather than rigor. What that essentially means is we will skip as much of the math as possible and develop a strong intuition of the working principle.


Analysis of Los Angeles Crime with R

City of Los Angeles or ‘The Birthplace of Jazz’ is one of the most populous city in the United States of America with the population estimated over four million. With the city of this size, it is worth the effort to explore the crime rate in this city. The current project is aimed to explore the crime rate in the year 2017. The dataset used in this project is found in this link which is provided by the Los Angeles Police Department.


Sensor Time-Series of Aircraft Engines

Can one predict when an engine or device breaks down? This seems a pure engineering question. But nowadays it is also a data science question. More concretely, it is an important question everywhere engines and devices use data to guide maintenance, such as aircraft engines [1], windturbine engines [2], and rotating machinery [3]. With regards to human safety and logistic planning, it is not a good idea to just wait until an engine breaks down. It is essential to pro-actively plan maintenance to avoid costly downtime. Proactive planning is enabled by multiple sensor technologies that provide powerful data about the state of machines and devices. Sensors include data such as temperature, fan and core speed, static pressure etc. Can we use these data to predict within certain margins how long an aircraft engine will be functioning without failure? And if so, how to do it? This is the question the concept of remaining useful life (RUL) attempts to answer. It aims to estimate the remaining time an item, component, or system is able to function in accordance with its intended purpose before warranting replacement. The present blog shows how to use deep learning in Python Keras to predict the RUL. It is meant to provide an example case study, not an exhaustive and ultimate solution. There is a lack of real data to answer this question. However, data simulations have been made and provide a unique resource. One such a fascinating simulation is provided by the C-MAPSS data [1]. It provides train data that show sensor-based time-series until the timepoint the engine breaks down. In contrast, the test data constitute of sensor-based time-series a ‘random’ time before the endpoint. The key aim is to estimate the RUL of the testset, that is, how much time is left after the last recorded time-point.


9 Things You Should Know About TensorFlow

This morning I found myself summarizing my favorite bits from a talk that I enjoyed at Google Cloud Next in San Francisco, What´s New with TensorFlow? Then I thought about it for a moment and couldn´t see a reason not to share my super-short summary with you (except maybe that you might not watch the video – you totally should check it out, the speaker is awesome) so here goes…
#1 It´s a powerful machine learning framework
#2 The bizarre approach is optional
#3 You can build neural networks line-by-line
#4 It´s not only Python
#5 You can do everything in the browser
#6 There´s a Lite version for tiny devices
#7 Specialized hardware just got better
#8 The new data pipelines are much improved
#9 You don´t need to start from scratch


Leveraging Agent-based Models (ABM) and Digital Twins to Prevent Injuries

Both athletes and machines deal with inter-twined complex systems (where the interactions of one complex system can have a ripple effect on others) that can have significant impact on their operational effectiveness.


Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting

A major challenge in instrumental variable (IV) analysis is to find instruments that are valid, or have no direct effect on the outcome and are ignorable. Typically one is unsure whether all of the putative IVs are in fact valid. We propose a general inference procedure in the presence of invalid IVs, called two-stage hard thresholding with voting. The procedure uses two hard thresholding steps to select strong instruments and to generate candidate sets of valid IVs. Voting takes the candidate sets and uses majority and plurality rules to determine the true set of valid IVs. In low dimensions with invalid instruments, our proposal correctly selects valid IVs, consistently estimates the causal effect, produces valid confidence intervals for the causal effect and has oracle optimal width, even if the so-called 50% rule or the majority rule is violated. In high dimensions, we establish nearly identical results without oracle optimality. In simulations, our proposal outperforms traditional and recent methods in the invalid IV literature. We also apply our method to reanalyse the causal effect of education on earnings.


Modelling non-stationary multivariate time series of counts via common factors

We develop a new parameter-driven model for multivariate time series of counts. The time series is not necessarily stationary. We model the mean process as the product of modulating factors and unobserved stationary processes. The former characterizes the long-run movement in the data, whereas the latter is responsible for rapid fluctuations and other unknown or unavailable covariates. The unobserved stationary processes evolve independently of the past observed counts and might interact with each other. We express the multivariate unobserved stationary processes as a linear combination of possibly low dimensional factors that govern the contemporaneous and serial correlation within and across the observed counts. Regression coefficients in the modulating factors are estimated via pseudo-maximum-likelihood estimation, and identification of common factor(s) is carried out through eigenanalysis on a positive definite matrix that pertains to the autocovariance of the observed counts at non-zero lags. Theoretical validity of the two-step estimation procedure is documented. In particular, we establish consistency and asymptotic normality of the pseudo-maximum-likelihood estimator in the first step and the convergence rate of the second-step estimator. We also present an exhaustive simulation study to examine the finite sample performance of the estimators, and numerical results corroborate our theoretical findings. Finally, we illustrate the use of the proposed model through an application to the numbers of National Science Foundation fundings awarded to seven research universities from January 2001 to December 2012.


Auxiliary gradient-based sampling algorithms

We introduce a new family of Markov chain Monte Carlo samplers that combine auxiliary variables, Gibbs sampling and Taylor expansions of the target density. Our approach permits the marginalization over the auxiliary variables, yielding marginal samplers, or the augmentation of the auxiliary variables, yielding auxiliary samplers. The well-known Metropolis-adjusted Langevin algorithm MALA and preconditioned Crank-Nicolson-Langevin algorithm pCNL are shown to be special cases. We prove that marginal samplers are superior in terms of asymptotic variance and demonstrate cases where they are slower in computing time compared with auxiliary samplers. In the context of latent Gaussian models we propose new auxiliary and marginal samplers whose implementation requires a single tuning parameter, which can be found automatically during the transient phase. Extensive experimentation shows that the increase in efficiency (measured as the effective sample size per unit of computing time) relative to (optimized implementations of) pCNL, elliptical slice sampling and MALA ranges from tenfold in binary classification problems to 25 fold in log-Gaussian Cox processes to 100 fold in Gaussian process regression, and it is on a par with Riemann manifold Hamiltonian Monte Carlo sampling in an example where that algorithm has the same complexity as the aforementioned algorithms. We explain this remarkable improvement in terms of the way that alternative samplers try to approximate the eigenvalues of the target. We introduce a novel Markov chain Monte Carlo sampling scheme for hyperparameter learning that builds on the auxiliary samplers. The MATLAB code for reproducing the experiments in the paper is publicly available and an on-line supplement to this paper contains additional experiments and implementation details.
Advertisements