If you did not already know

Temporal Overdrive Recurrent Neural Network
In this work we present a novel recurrent neural network architecture designed to model systems characterized by multiple characteristic timescales in their dynamics. The proposed network is composed by several recurrent groups of neurons that are trained to separately adapt to each timescale, in order to improve the system identification process. We test our framework on time series prediction tasks and we show some promising, preliminary results achieved on synthetic data. To evaluate the capabilities of our network, we compare the performance with several state-of-the-art recurrent architectures. …

Extreme Value Learning (EVL)
The novel unseen classes can be formulated as the extreme values of known classes. This inspired the recent works on open-set recognition \cite{Scheirer_2013_TPAMI,Scheirer_2014_TPAMIb,EVM}, which however can have no way of naming the novel unseen classes. To solve this problem, we propose the Extreme Value Learning (EVL) formulation to learn the mapping from visual feature to semantic space. To model the margin and coverage distributions of each class, the Vocabulary-informed Learning (ViL) is adopted by using vast open vocabulary in the semantic space. Essentially, by incorporating the EVL and ViL, we for the first time propose a novel semantic embedding paradigm — Vocabulary-informed Extreme Value Learning (ViEVL), which embeds the visual features into semantic space in a probabilistic way. The learned embedding can be directly used to solve supervised learning, zero-shot and open set recognition simultaneously. Experiments on two benchmark datasets demonstrate the effectiveness of proposed frameworks. …

Skip-Gram Model
A technique where by n-grams are still stored to model language, but they allow for tokens to be skipped. …

Distilled News

This pilot project collects problems and metrics/datasets from the AI research literature, and tracks progress on them. You can use this Notebook to see how things are progressing in specific subfields or AI/ML as a whole, as a place to report new results you’ve obtained, as a place to look for problems that might benefit from having new datasets/metrics designed for them, or as a source to build on for data science projects. At EFF, we’re ultimately most interested in how this data can influence our understanding of the likely implications of AI. To begin with, we’re focused on gathering it.
As data analysts, we’re frequently presented with comma-separated value files and tasked with reporting insights. While it’s tempting to import that data directly into R or Python in order to perform data munging and exploratory data analysis, there are also a number of utilities to examine, fix, slice, transform, and summarize data through the command line. In particular, Csvkit is a suite of python based utilities for working with CSV files from the terminal. For this post, we will grab data using wget, subset rows containing a particular value, and summarize the data in different ways. The goal is to take data on criminal activity, group by a particular offense type, and develop counts to understand the frequency distribution.
TLDR: Most Julia programmers also use Python. However, among all languages R is the one whose users are most likely to also develop in Julia. Recently Stack Overflow has made public the results of Developer Survey 2017. It is definitely an interesting data set. In this post I analyzed the answers to the question ‘Which of the following languages have you done extensive development work in over the past year, and which do you want to work in over the next year?’ from the perspective of Julia language against other programming languages. Actually we get two variables of interest: 1) what was used and 2) what is planned to be used.
I’m pleased to announce the release of the dbplyr package, which now contains all dplyr code related to connecting to databases. This shouldn’t affect you-as-a-user much, but it makes dplyr simpler, and makes it easier to release improvements just for database related code.
This post summarizes and links to a great multi-part tutorial series on learning the TensorFlow API for building a variety of neural networks, as well as a bonus tutorial on backpropagation from the beginning.
Below is an example showing how to fit a Generalized Linear Model with H2O in R. The output is much more comprehensive than the one generated by the generic R glm().

R Packages worth a look

Nearest Centroid (NC) Sampling (NCSampling)
Provides functionality for performing Nearest Centroid (NC) Sampling. The NC sampling procedure was developed for forestry applications and selects plots for ground measurement so as to maximize the efficiency of imputation estimates. It uses multiple auxiliary variables and multivariate clustering to search for an optimal sample. Further details are given in Melville G. & Stone C. (2016) <doi:10.1080/00049158.2016.1218265>.

SQL Server R Database Interface (DBI) and ‘dplyr’ SQL Backend (RSQLServer)
Utilises The ‘jTDS’ project’s ‘JDBC’ 3.0 ‘SQL Server’ driver to extend ‘DBI’ classes and methods. The package also implements a ‘SQL’ backend to the ‘dplyr’ package.

A Way of Writing Functions that Quote their Arguments (quotedargs)
A facility for writing functions that quote their arguments, may sometimes evaluate them in the environment where they were quoted, and may pass them as quoted to other functions.

Easily Build and Evaluate Machine Learning Models (easyml)
Easily build and evaluate machine learning models on a dataset. Machine learning models supported include penalized linear models, penalized linear models with interactions, random forest, support vector machines, neural networks, and deep neural networks.

Sequential Invariant Causal Prediction (seqICP)
Contains an implementation of invariant causal prediction for sequential data. The main function in the package is ‘seqICP’, which performs linear sequential invariant causal prediction and has guaranteed type I error control. For non-linear dependencies the package also contains a non-linear method ‘seqICPnl’, which allows to input any regression procedure and performs tests based on a permutation approach that is only approximately correct. In order to test whether an individual set S is invariant the package contains the subroutines ‘seqICP.s’ and ‘seqICPnl.s’ corresponding to the respective main methods.

Whats new on arXiv

We introduce and study exterior distance function (EDF) and correspondent exterior point method (EPM) for convex optimization. The EDF is a classical Lagrangian for an equivalent problem obtained from the initial one by monotone transformation of both the objective function and the constraints. The constraints transformation is scaled by a positive scaling parameter. Thus, the EDF is a particular realization of the Nonlinear Rescaling (NR) principle. Along with the ‘center’, the EDF has two extra tools: the barrier (scaling) parameter and the vector of Lagrange multipliers. We show that EPM generates primal – dual sequence, which converges to the primal – dual solution in value under minimum assumption on the input data. Moreover, the convergence is taking place under any fixed interior point as a ‘center’ and any fixed positive scaling parameter, just due to the Lagrange multipliers update. If the second order sufficient optimality condition is satisfied, then the EPM converges with Q-linear rate under any fixed interior point as a ‘center’ and any fixed, but large enough positive scaling parameter.
Simulation Optimization (SO) refers to the optimization of an objective function subject to constraints, both of which can be evaluated through a stochastic simulation. To address specific features of a particular simulation—discrete or continuous decisions, expensive or cheap simulations, single or multiple outputs, homogeneous or heterogeneous noise—various algorithms have been proposed in the literature. As one can imagine, there exist several competing algorithms for each of these classes of problems. This document emphasizes the difficulties in simulation optimization as compared to mathematical programming, makes reference to state-of-the-art algorithms in the field, examines and contrasts the different approaches used, reviews some of the diverse applications that have been tackled by these methods, and speculates on future directions in the field.
Optimization solvers routinely utilize presolve techniques, including model simplification, reformulation and domain reduction techniques. Domain reduction techniques are especially important in speeding up convergence to the global optimum for challenging nonconvex nonlinear programming (NLP) and mixed-integer nonlinear programming (MINLP) optimization problems. In this work, we survey the various techniques used for domain reduction of NLP and MINLP optimization problems. We also present a computational analysis of the impact of these techniques on the performance of various widely available global solvers on a collection of 1740 test problems.
Video prediction has been an active topic of research in the past few years. Many algorithms focus on pixel-level predictions, which generates results that blur and disintegrate within a few frames. In this project, we use a hierarchical approach for long-term video prediction. We aim at estimating high-level structure in the input frame first, then predict how that structure grows in the future. Finally, we use an image analogy network to recover a realistic image from the predicted structure. Our method is largely adopted from the work by Villegas et al. The method is built with a combination of LSTMs and analogy-based convolutional auto-encoder networks. Additionally, in order to generate more realistic frame predictions, we also adopt adversarial loss. We evaluate our method on the Penn Action dataset, and demonstrate good results on high-level long-term structure prediction.
Recent neural IR models have demonstrated deep learning’s utility in ad-hoc information retrieval. However, deep models have a reputation for being black boxes, and the roles of a neural IR model’s components may not be obvious at first glance. In this work, we attempt to shed light on the inner workings of a recently proposed neural IR model, namely the PACRR model, by visualizing the output of intermediate layers and by investigating the relationship between intermediate weights and the ultimate relevance score produced. We highlight several insights, hoping that such insights will be generally applicable.
Compared to LiDAR-based localization methods, which provide high accuracy but rely on expensive sensors, visual localization approaches only require a camera and thus are more cost-effective while their accuracy and reliability typically is inferior to LiDAR-based methods. In this work, we propose a vision-based localization approach that learns from LiDAR-based localization methods by using their output as training data, thus combining a cheap, passive sensor with an accuracy that is on-par with LiDAR-based localization. The approach consists of two deep networks trained on visual odometry and topological localization, respectively, and a successive optimization to combine the predictions of these two networks. We evaluate the approach on a new challenging pedestrian-based dataset captured over the course of six months in varying weather conditions with a high degree of noise. The experiments demonstrate that the localization errors are up to 10 times smaller than with traditional vision-based localization methods.
We view a neural network as a distributed system of which neurons can fail independently, and we evaluate its robustness in the absence of any (recovery) learning phase. We give tight bounds on the number of neurons that can fail without harming the result of a computation. To determine our bounds, we leverage the fact that neural activation functions are Lipschitz-continuous. Our bound is on a quantity, we call the \textit{Forward Error Propagation}, capturing how much error is propagated by a neural network when a given number of components is failing, computing this quantity only requires looking at the topology of the network, while experimentally assessing the robustness of a network requires the costly experiment of looking at all the possible inputs and testing all the possible configurations of the network corresponding to different failure situations, facing a discouraging combinatorial explosion. We distinguish the case of neurons that can fail and stop their activity (crashed neurons) from the case of neurons that can fail by transmitting arbitrary values (Byzantine neurons). Interestingly, as we show in the paper, our bound can easily be extended to the case where synapses can fail. We show how our bound can be leveraged to quantify the effect of memory cost reduction on the accuracy of a neural network, to estimate the amount of information any neuron needs from its preceding layer, enabling thereby a boosting scheme that prevents neurons from waiting for unnecessary signals. We finally discuss the trade-off between neural networks robustness and learning cost.
We study the problem of learning latent variables in Gaussian graphical models. Existing methods for this problem assume that the precision matrix of the observed variables is the superposition of a sparse and a low-rank component. In this paper, we focus on the estimation of the low-rank component, which encodes the effect of marginalization over the latent variables. We introduce fast, proper learning algorithms for this problem. In contrast with existing approaches, our algorithms are manifestly non-convex. We support their efficacy via a rigorous theoretical analysis, and show that our algorithms match the best possible in terms of sample complexity, while achieving computational speed-ups over existing methods. We complement our theory with several numerical experiments.
With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.

Document worth reading: “The ALAMO approach to machine learning”

ALAMO is a computational methodology for leaning algebraic functions from data. Given a data set, the approach begins by building a low-complexity, linear model composed of explicit non-linear transformations of the independent variables. Linear combinations of these non-linear transformations allow a linear model to better approximate complex behavior observed in real processes. The model is refined, as additional data are obtained in an adaptive fashion through error maximization sampling using derivative-free optimization. Models built using ALAMO can enforce constraints on the response variables to incorporate first-principles knowledge. The ability of ALAMO to generate simple and accurate models for a number of reaction problems is demonstrated. The error maximization sampling is compared with Latin hypercube designs to demonstrate its sampling efficiency. ALAMO’s constrained regression methodology is used to further refine concentration models, resulting in models that perform better on validation data and satisfy upper and lower bounds placed on model outputs. The ALAMO approach to machine learning

Book Memo: “A Primer on Process Mining”

 Practical Skills with Python and Graphviz The main goal of this book is to explain the core ideas of process mining, and to demonstrate how they can be implemented using just some basic tools that are available to any computer scientist or data scientist. It describes how to analyze event logs in order to discover the behavior of real-world business processes. The end result can often be visualized as a graph, and the book explains how to use Python and Graphviz to render these graphs intuitively. Overall, it enables the reader to implement process mining techniques on his or her own, independently of any specific process mining tool. An introduction to two popular process mining tools, namely Disco and ProM, is also provided. The book will be especially valuable for self-study or as a precursor to a more advanced text. Practitioners and students will be able to follow along on their own, even if they have no prior knowledge of the topic. After reading this book, they will be able to more confidently proceed to the research literature if needed.

Book Memo: “Guide to Convolutional Neural Networks”

 A Practical Application to Traffic-Sign Detection and Classification This must-read text/reference introduces the fundamental concepts of convolutional neural networks (ConvNets), offering practical guidance on using libraries to implement ConvNets in applications of traffic sign detection and classification. The work presents techniques for optimizing the computational efficiency of ConvNets, as well as visualization techniques to better understand the underlying processes. The proposed models are also thoroughly evaluated from different perspectives, using exploratory and quantitative analysis. Topics and features: explains the fundamental concepts behind training linear classifiers and feature learning; discusses the wide range of loss functions for training binary and multi-class classifiers; illustrates how to derive ConvNets from fully connected neural networks, and reviews different techniques for evaluating neural networks; presents a practical library for implementing ConvNets, explaining how to use a Python interface for the library to create and assess neural networks; describes two real-world examples of the detection and classification of traffic signs using deep learning methods; examines a range of varied techniques for visualizing neural networks, using a Python interface; provides self-study exercises at the end of each chapter, in addition to a helpful glossary, with relevant Python scripts supplied at an associated website. This self-contained guide will benefit those who seek to both understand the theory behind deep learning, and to gain hands-on experience in implementing ConvNets in practice. As no prior background knowledge in the field is required to follow the material, the book is ideal for all students of computer vision and machine learning, and will also be of great interest to practitioners working on autonomous cars and advanced driver assistance systems.

R Packages worth a look

RStudio Addin for Searching Packages in CRAN Database Based on Keywords (CRANsearcher)
One of the strengths of R is its vast package ecosystem. Indeed, R packages extend from visualization to Bayesian inference and from spatial analyses to pharmacokinetics (<https://…/> ). There is probably not an area of quantitative research that isn’t represented by at least one R package. At the time of this writing, there are more than 10,000 active CRAN packages. Because of this massive ecosystem, it is important to have tools to search and learn about packages related to your personal R needs. For this reason, we developed an RStudio addin capable of searching available CRAN packages directly within RStudio.

Parses Web Pages using Postlight Mercury (postlightmercury)
This is a wrapper for the Mercury Parser API. The Mercury Parser is a single API endpoint that takes a URL and gives you back the content reliably and easily. With just one API request, Mercury takes any web article and returns only the relevant content — headline, author, body text, relevant images and more — free from any clutter. It’s reliable, easy-to-use and free. See the webpage here: <https://…/>.

Analysis of Time-Ordered Event Data with Missed Observations (intRvals)
Calculates event rates and compares means and variances of groups of interval data corrected for missed arrival observations.

Utilities for Delaying Function Execution (later)
Executes arbitrary R or C functions some time after the current time, after the R execution stack has emptied.

Simplification of Surface Triangular Meshes with Associated Distributed Data (meshsimp)
Iterative simplification strategy for surface triangular meshes (2.5D meshes) with associated data. Each iteration corresponds to an edge collapse where the selection of the edge to contract is driven by a cost functional that depends both on the geometry of the mesh than on the distribution of the data locations over the mesh. The library can handle both zero and higher genus surfaces. The package has been designed to be fully compatible with the R package ‘fdaPDE’, which implements regression models with partial differential regularizations, making use of the Finite Element Method. In the future, the functionalities provided by the current package may be directly integrated into ‘fdaPDE’.

Whats new on arXiv

We present and apply a general-purpose, multi-start algorithm for improving the performance of low-energy samplers used for solving optimization problems. The algorithm iteratively fixes the value of a large portion of the variables to values that have a high probability of being optimal. The resulting problems are smaller and less connected, and samplers tend to give better low-energy samples for these problems. The algorithm is trivially parallelizable, since each start in the multi-start algorithm is independent, and could be applied to any heuristic solver that can be run multiple times to give a sample. We present results for several classes of hard problems solved using simulated annealing, path-integral quantum Monte Carlo, parallel tempering with isoenergetic cluster moves, and a quantum annealer, and show that the success metrics as well as the scaling are improved substantially. When combined with this algorithm, the quantum annealer’s scaling was substantially improved for native Chimera graph problems. In addition, with this algorithm the scaling of the time to solution of the quantum annealer is comparable to the Hamze–de Freitas–Selby algorithm on the weak-strong cluster problems introduced by Boixo et al. Parallel tempering with isoenergetic cluster moves was able to consistently solve 3D spin glass problems with 8000 variables when combined with our method, whereas without our method it could not solve any.
In this paper, we present a semi-supervised learning algorithm for classification of text documents. A method of labeling unlabeled text documents is presented. The presented method is based on the principle of divide and conquer strategy. It uses recursive K-means algorithm for partitioning both labeled and unlabeled data collection. The K-means algorithm is applied recursively on each partition till a desired level partition is achieved such that each partition contains labeled documents of a single class. Once the desired clusters are obtained, the respective cluster centroids are considered as representatives of the clusters and the nearest neighbor rule is used for classifying an unknown text document. Series of experiments have been conducted to bring out the superiority of the proposed model over other recent state of the art models on 20Newsgroups dataset.
In the last decade, driven also by the availability of an unprecedented computational power and storage capabilities in cloud environments we assisted to the proliferation of new algorithms, methods, and approaches in two areas of artificial intelligence: knowledge representation and machine learning. On the one side, the generation of a high rate of structured data on the Web led to the creation and publication of the so-called knowledge graphs. On the other side, deep learning emerged as one of the most promising approaches in the generation and training of models that can be applied to a wide variety of application fields. More recently, autoencoders have proven their strength in various scenarios, playing a fundamental role in unsupervised learning. In this paper, we instigate how to exploit the semantic information encoded in a knowledge graph to build connections between units in a Neural Network, thus leading to a new method, SEM-AUTO, to extract and weigh semantic features that can eventually be used to build a recommender system. As adding content-based side information may mitigate the cold user problems, we tested how our approach behave in the presence of a few rating from a user on the Movielens 1M dataset and compare results with BPRSLIM.
Convolutional kernels are basic and vital components of deep Convolutional Neural Networks (CNN). In this paper, we equip convolutional kernels with shape attributes to generate the deep Irregular Convolutional Neural Networks (ICNN). Compared to traditional CNN applying regular convolutional kernels like ${3\times3}$, our approach trains irregular kernel shapes to better fit the geometric variations of input features. In other words, shapes are learnable parameters in addition to weights. The kernel shapes and weights are learned simultaneously during end-to-end training with the standard back-propagation algorithm. Experiments for semantic segmentation are implemented to validate the effectiveness of our proposed ICNN.
This paper provides an entry point to the problem of interpreting a deep neural network model and explaining its predictions. It is based on a tutorial given at ICASSP 2017. It introduces some recently proposed techniques of interpretation, along with theory, tricks and recommendations, to make most efficient use of these techniques on real data. It also discusses a number of practical applications.
This paper introduces a novel deep learning framework including a lexicon-based approach for sentence-level prediction of sentiment label distribution. We propose to first apply semantic rules and then use a Deep Convolutional Neural Network (DeepCNN) for character-level embeddings in order to increase information for word-level embedding. After that, a Bidirectional Long Short-Term Memory Network (Bi-LSTM) produces a sentence-wide feature representation from the word-level embedding. We evaluate our approach on three Twitter sentiment classification datasets. Experimental results show that our model can improve the classification accuracy of sentence-level sentiment analysis in Twitter social networking.
We investigate the problem of inferring the causal variables of a response $Y$ from a set of $d$ predictors $(X^1,\dots,X^d)$. Classical ordinary least squares regression includes all predictors that reduce the variance of $Y$. Using only the causal parents instead leads to models that have the advantage of remaining invariant under interventions, i.e., loosely speaking they lead to invariance across different ‘environments’ or ‘heterogeneity patterns’. More precisely, the conditional distribution of $Y$ given its causal variables remains constant for all observations. Recent work exploit such a stability to infer causal relations from data with different but known environments. We show here that even without having knowledge of the environments or heterogeneity pattern, inferring causal relations is possible for time-ordered (or any other type of sequentially ordered) data. In particular, this then allows to detect instantaneous causal relations in multivariate linear time series, in contrast to the concept of Granger causality. Besides novel methodology, we provide statistical confidence bounds and asymptotic detection results for inferring causal variables, and we present an application to monetary policy in macro economics.
In this paper we provide a conceptual overview of latent variable models within a probabilistic modeling framework, an overview that emphasizes the compositional nature and the interconnectedness of the seemingly disparate models commonly encountered in statistical practice.
We propose a simple and efficient approach to learning sparse models. Our approach consists of (1) projecting the data into a lower dimensional space, (2) learning a dense model in the lower dimensional space, and then (3) recovering the sparse model in the original space via compressive sensing. We apply this approach to Non-negative Matrix Factorization (NMF), tensor decomposition and linear classification—showing that it obtains $10\times$ compression with negligible loss in accuracy on real data, and obtains up to $5\times$ speedups. Our main theoretical contribution is to show the following result for NMF: if the original factors are sparse, then their projections are the sparsest solutions to the projected NMF problem. This explains why our method works for NMF and shows an interesting new property of random projections: they can preserve the solutions of non-convex optimization problems such as NMF.
The practice of evidence-based medicine (EBM) urges medical practitioners to utilise the latest research evidence when making clinical decisions. Because of the massive and growing volume of published research on various medical topics, practitioners often find themselves overloaded with information. As such, natural language processing research has recently commenced exploring techniques for performing medical domain-specific automated text summarisation (ATS) techniques– targeted towards the task of condensing large medical texts. However, the development of effective summarisation techniques for this task requires cross-domain knowledge. We present a survey of EBM, the domain-specific needs for EBM, automated summarisation techniques, and how they have been applied hitherto. We envision that this survey will serve as a first resource for the development of future operational text summarisation techniques for EBM.
Recognizing entity synonyms from text has become a crucial task in many entity-leveraging applications. However, discovering entity synonyms from domain-specific text corpora (e.g., news articles, scientific papers) is rather challenging. Current systems take an entity name string as input to find out other names that are synonymous, ignoring the fact that often times a name string can refer to multiple entities (e.g., ‘apple’ could refer to both Apple Inc and the fruit apple). Moreover, most existing methods require training data manually created by domain experts to construct supervised-learning systems. In this paper, we study the problem of automatic synonym discovery with knowledge bases, that is, identifying synonyms for knowledge base entities in a given domain-specific corpus. The manually-curated synonyms for each entity stored in a knowledge base not only form a set of name strings to disambiguate the meaning for each other, but also can serve as ‘distant’ supervision to help determine important features for the task. We propose a novel framework, called DPE, to integrate two kinds of mutually-complementing signals for synonym discovery, i.e., distributional features based on corpus-level statistics and textual patterns based on local contexts. In particular, DPE jointly optimizes the two kinds of signals in conjunction with distant supervision, so that they can mutually enhance each other in the training stage. At the inference stage, both signals will be utilized to discover synonyms for the given entities. Experimental results prove the effectiveness of the proposed framework.
Do GANS (Generative Adversarial Nets) actually learn the target distribution? The foundational paper of (Goodfellow et al 2014) suggested they do, if they were given sufficiently large deep nets, sample size, and computation time. A recent theoretical analysis in Arora et al (to appear at ICML 2017) raised doubts whether the same holds when discriminator has finite size. It showed that the training objective can approach its optimum value even if the generated distribution has very low support —in other words, the training objective is unable to prevent mode collapse. The current note reports experiments suggesting that such problems are not merely theoretical. It presents empirical evidence that well-known GANs approaches do learn distributions of fairly low support, and thus presumably are not learning the target distribution. The main technical contribution is a new proposed test, based upon the famous birthday paradox, for estimating the support size of the generated distribution.
This paper introduces a framework for speeding up Bayesian inference conducted in presence of large datasets. We design a Markov chain whose transition kernel uses an {unknown} fraction of {fixed size} of the available data that is randomly refreshed throughout the algorithm. Inspired by the Approximate Bayesian Computation (ABC) literature, the subsampling process is guided by the fidelity to the observed data, as measured by summary statistics. The resulting algorithm, Informed Sub-Sampling MCMC, is a generic and flexible approach which, contrarily to existing scalable methodologies, preserves the simplicity of the Metropolis-Hastings algorithm. Even though exactness is lost, i.e. the chain distribution approximates the target, we study and quantify theoretically this bias and show on a diverse set of examples that it yields excellent performances when the computational budget is limited. If available and cheap to compute, we show that setting the summary statistics as the maximum likelihood estimator is supported by theoretical arguments.
Today, massive amounts of streaming data from smart devices need to be analyzed automatically to realize the Internet of Things. The Complex Event Processing (CEP) paradigm promises low-latency pattern detection on event streams. However, CEP systems need to be extended with Machine Learning (ML) capabilities such as online training and inference in order to be able to detect fuzzy patterns (e.g., outliers) and to improve pattern recognition accuracy during runtime using incremental model training. In this paper, we propose a distributed CEP system denoted as StreamLearner for ML-enabled complex event detection. The proposed programming model and data-parallel system architecture enable a wide range of real-world applications and allow for dynamically scaling up and out system resources for low-latency, high-throughput event processing. We show that the DEBS Grand Challenge 2017 case study (i.e., anomaly detection in smart factories) integrates seamlessly into the StreamLearner API. Our experiments verify scalability and high event throughput of StreamLearner.
Bayesian neural networks (BNNs) with latent variables are probabilistic models which can automatically identify complex stochastic patterns in the data. We describe and study in these models a decomposition of predictive uncertainty into its epistemic and aleatoric components. First, we show how such a decomposition arises naturally in a Bayesian active learning scenario by following an information theoretic approach. Second, we use a similar decomposition to develop a novel risk sensitive objective for safe reinforcement learning (RL). This objective minimizes the effect of model bias in environments whose stochastic dynamics are described by BNNs with latent variables. Our experiments illustrate the usefulness of the resulting decomposition in active learning and safe RL settings.