Introduction to Kurtosis

Kurtosis is a measure of the degree to which portfolio returns appear in the tails of our distribution. A normal distribution has a kurtosis of 3, which follows from the fact that a normal distribution does have some of its mass in its tails. A distribution with a kurtosis greater than 3 has more returns out in its tails than the normal, and one with kurtosis less than 3 has fewer returns in its tails than the normal. That matters to investors because more bad returns out in tails means that our portfolio might be at risk of a rare but huge downside. The terminology is a bit confusing. Negative kurtosis is considered less risky because it has fewer returns out in the tails. Negative == less risky? We’re not used to that in finance.

Deep Learning from first principles in Python, R and Octave – Part 1

This is the first in the series of posts, I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton also on Coursera. In this post I implement Logistic regression with a 2 layer Neural Network i.e. a Neural Network that just has an input layer and an output layer and with no hidden layer.I am certain that any self-respecting Deep Learning/Neural Network would consider a Neural Network without hidden layers as no Neural Network at all!

Die Markov Kette: Wie berechne ich die Gleichgewichtsverteilung?

In diesem Artikel möchten wir Ihnen das Konzept der Markov Kette vorstellen, dessen Grundlagen veranschaulichen und Ihnen mehrere mögliche Anwendungsbereiche aufzeigen, in denen Sie mit einer gezielten statistischen Programmierung von Markov Ketten profitieren können. Eine Markov Kette ist ein stochastischer Prozess mit den vielfältigsten Anwendungsbereichen aus der Natur, Technik und Wirtschaft. Klassische Anwendungsmöglichkeiten aus der Wirtschaft sind die Modellierung von Warteschlangen und Wechselkursen. Auch im Bereich der Technik gibt es zahlreiche Einsatzgebiete wie zum Beispiel die Modellierung des Verhaltens von Talsperren und von Geschwindigkeitsregelanlagen bei Kraftfahrzeugen, sowie die analytische Bewertung von Mobilitätsalgorithmen wie dem Random Walk. In der Naturwissenschaft verwendet man diesen Prozess in der Populationsdynamik zur Vorhersage des Bevölkerungswachstums von Menschen und Tieren und in einer abgewandelten Form für die Brownsche Molekularbewegung. Für die Praxis besonders relevant ist die statistische Programmierung und Simulation der Gleichgewichtsverteilung mit der Statistik Software R, welche im Folgenden anhand eines anschaulichen Beispiels durchgeführt wird. Doch zunächst werden die für die Berechnung erforderlichen Begriffe erläutert.

Divide and parallelize large data problems with Rcpp

Got stuck with too large a dataset? R speed drives you mad? Divide, parallelize and go with Rcpp! One of the frustrating moments while working with data is when you need results urgently, but your dataset is large enough to make it impossible. This happens often when we need to use algorithm with high computational complexity. I will demonstrate it on the example I’ve been working with.

How to extract data from a PDF file with R

In this post, taken from the book R Data Mining by Andrea Cirillo, we’ll be looking at how to scrape PDF files using R. It’s a relatively straightforward way to look at text mining – but it can be challenging if you don’t know exactly what you’re doing.

How to build a Successful Advanced Analytics Department

This article presents our opinions and suggestions on how an Advanced Analytics department should operate. The post is not intended to be a comprehensive list of steps, but rather, a list of tips and warnings. We hope this will be useful for those who want to implement analytics work in their company, as well as for existing departments. The post is divided into 3 parts. The first provides a list of the most important aspects to be aware of when leading AA work in your organization. If you are already leading such a department, then you are already aware of these but it can still prove useful when presenting your case higher in the hierarchy or to another department. The second part will include a list of the most important elements that need to be addressed and cared for in such a department. Lastly, we caution you of the most common issues we have seen in failed initiatives.

Predictive Analytics Path to Mainstream Adoption

Hold on to your hats data scientists, you’re in for another wild ride. A few months ago, our beloved field of predictive analytics was taken down a peg by the 2017 Hype Cycle for Analytics and Business Intelligence. In the latest report, predictive analytics moved from the “Peak of Inflated Expectations” to the “Trough of Disillusionment”. Don’t despair, this is a good thing! The transition means that the silver bullet seekers are likely moving on to the next craze and the technology is moving one step closer to long term productive use. Gartner estimates approximately 2-5 years to mainstream adoption.

Intuition behind Bias-Variance trade-off, Lasso and Ridge Regression

Linear regression uses Ordinary Least square method to find the best coefficient estimates. One of the assumptions of Linear regression is that the variables are not correlated with each other. However, when the multicollinearity exists in the dataset (two or more variables are highly correlated with each other) Ordinary Least square method cannot be that effective. In this blog, we will talk about two methods which are slightly better than Ordinary Least Square method – Lasso and Ridge regression. Lasso and Ridge regressions are closely related to each other and they are called shrinkage methods. We use Lasso and Ridge regression when we have a huge number of variables in the dataset and when the variables are highly correlated.

Big data analytics – A review of data-mining models for small and medium enterprises in the transportation sector

The need for small and medium enterprises (SMEs) to adopt data analytics has reached a critical point, given the surge of data implied by the advancement of technology. Despite data mining (DM) being widely used in the transportation sector, it is staggering to note that there are minimal research case studies being done on the application of DM by SMEs, specifically in the transportation sector. From the extensive review conducted, the three most common DM models used by large enterprises in the transportation sector are identified, namely “Knowledge Discovery in Database,” “Sample, Explore, Modify, Model and Assess” (SEMMA), and “CRoss Industry Standard Process for Data Mining” (CRISP-DM). The same finding was revealed in the SMEs’ context across the various industries. It was also uncovered that among the three models, CRISP-DM had been widely applied commercially. However, despite CRISP-DM being the de facto DM model in practice, a study carried out to assess the strengths and weakness of the models reveals that they have several limitations with respect to SMEs. This paper concludes that there is a critical need for a novel model to be developed in order to cater to the SMEs’ prerequisite, especially so in the transportation sector context.

Multilabel feature selection: A comprehensive review and guiding experiments

Feature selection has been an important issue in machine learning and data mining, and is unavoidable when confronting with high-dimensional data. With the advent of multilabel (ML) datasets and their vast applications, feature selection methods have been developed for dimensionality reduction and improvement of the classification performance. In this work, we provide a comprehensive review of the existing multilabel feature selection (ML-FS) methods, and categorize these methods based on different perspectives. As feature selection and data classification are closely related to each other, we provide a review on ML learning algorithms as well. Also, to facilitate research in this field, a section is provided for setup and benchmarking that presents evaluation measures, standard datasets, and existing software for ML data. At the end of this survey, we discuss some challenges and open problems in this field that can be pursued by researchers in future.

How to perform Logistic Regression, LDA, & QDA in R

Classification algorithm defines set of rules to identify a category or group for an observation. There is various classification algorithm available like Logistic Regression, LDA, QDA, Random Forest, SVM etc. Here I am going to discuss Logistic regression, LDA, and QDA. The classification model is evaluated by confusion matrix. This matrix is represented by a table of Predicted True/False value with Actual True/False Value. The confusion matrix is shown as below. This list down the TRUE/FALSE for Predicted and Actual Value in a 2X2 table.

The convergence of AI and Blockchain: what’s the deal?

It is undeniable that AI and blockchain are two of the major technologies that are catalyzing the pace of innovation and introducing radical shifts in every industry. Each technology has its own degree of technical complexity as well as business implications but the joint use of the two may be able to redesign the entire technological (and human) paradigm from scratch. This article wants to give a flavor of the potentialities realized at the intersection of AI and Blockchain and discuss standard definitions, challenges, and benefits of this alliance, as well as about some interesting player in this space.

Interactive Workflows for C++ with Jupyter

Scientists, educators and engineers not only use programming languages to build software systems, but also in interactive workflows, using the tools available to explore a problem and reason about it.
Running some code, looking at a visualization, loading data, and running more code. Quick iteration is especially important during the exploratory phase of a project.
For this kind of workflow, users of the C++ programming language currently have no choice but to use a heterogeneous set of tools that don’t play well with each other, making the whole process cumbersome, and difficult to reproduce.