# R Packages worth a look

Density Estimation and Random Number Generation with Distribution Element Trees (detpack)
Density estimation for possibly large data sets and conditional/unconditional random number generation with distribution element trees. For more details on distribution element trees see: Meyer, D.W. (2016) <arXiv:1610.00345> or Meyer, D.W., Statistics and Computing (2017) <doi:10.1007/s11222-017-9751-9> and Meyer, D.W. (2017) <arXiv:1711.04632>.

Tidy Temporal Data Frames and Tools (tsibble)
Provides a ‘tbl_ts’ class (the ‘tsibble’) to store and manage temporal-context data in a data-centric format, which is built on top of the ‘tibble’. The ‘tsibble’ aims at manipulating and analysing temporal data in a tidy and modern manner, including easily interpolate missing values, aggregate over calendar periods, performing rolling window calculations, and etc.

An Easy SVG Basic Elements Generator (easySVG)
This SVG elements generator can easily generate SVG elements such as rect, line, circle, ellipse, polygon, polyline, text and group. Also, it can combine and output SVG elements into a SVG file.

Automatic Generation of Interactive Visualizations for Popular Statistical Results (autoplotly)
Functionalities to automatically generate interactive visualizations for popular statistical results supported by ‘ggfortify’, such as time series, PCA, clustering and survival analysis, with ‘plotly.js’ <https://plot.ly/> and ‘ggplot2’ style. The generated visualizations can also be easily extended using ‘ggplot2’ and ‘plotly’ syntax while staying interactive.

Post Processing of (Half-)Hourly Eddy-Covariance Measurements (REddyProc)
Standard and extensible Eddy-Covariance data post-processing includes uStar-filtering, gap-filling, and flux-partitioning. The Eddy-Covariance (EC) micrometeorological technique quantifies continuous exchange fluxes of gases, energy, and momentum between an ecosystem and the atmosphere. It is important for understanding ecosystem dynamics and upscaling exchange fluxes. (Aubinet et al. (2012) <doi:10.1007/978-94-007-2351-1>). This package inputs pre-processed (half-)hourly data and supports further processing. First, a quality-check and filtering is performed based on the relationship between measured flux and friction velocity (uStar) to discard biased data (Papale et al. (2006) <doi:10.5194/bg-3-571-2006>). Second, gaps in the data are filled based on information from environmental conditions (Reichstein et al. (2005) <doi:10.1111/j.1365-2486.2005.001002.x>). Third, the net flux of carbon dioxide is partitioned into its gross fluxes in and out of the ecosystem by night-time based and day-time based approaches (Lasslop et al. (2010) <doi:10.1111/j.1365-2486.2009.02041.x>).

# Whats new on arXiv

The functional ANOVA expansion of a multivariate mapping plays a fundamental role in statistics. The expansion is unique once a unique distribution is assigned to the covariates. Recent investigations in the environmental and climate sciences show that analysts may not be in a position to assign a unique distribution in realistic applications. We offer a systematic investigation of existence, uniqueness, orthogonality, monotonicity and ultramodularity of the functional ANOVA expansion of a multivariate mapping when a multiplicity of distributions is assigned to the covariates. In particular, we show that a multivariate mapping can be associated with a core of probability measures that guarantee uniqueness. We obtain new results for variance decomposition and dimension distribution under mixtures. Implications for the global sensitivity analysis of computer experiments are also discussed.
Correlation remains to be one of the most widely used statistical tools for assessing the strength of relationships between data series. This paper presents a novel compositional correlation method for detecting linear and nonlinear relationships by considering the averages of all parts of all possible compositions of the data series instead of considering the averages of the whole series. The approach enables cumulative contribution of all local associations to the resulting correlation value. The method is applied on two different datasets: a set of four simple nonlinear polynomial functions and the expression time series data of 4381 budding yeast (saccharomyces cerevisiae) genes. The obtained results show that the introduced compositional correlation method is capable of determining real direct and inverse linear, nonlinear and monotonic relationships. Comparisons with Pearson’s correlation, Spearman’s correlation, distance correlation and the simulated annealing genetic algorithm maximal information coefficient (SGMIC) have shown that the presented method is capable of detecting important associations which were not detected by the compared methods.
By exploiting a causality property of the nonlinear Fourier transform, a novel decision-feedback detection strategy for nonlinear frequency-division multiplexing (NFDM) systems is introduced. The performance of the proposed strategy is investigated both by simulations and by theoretical bounds and approximations, showing that it achieves a considerable performance improvement compared to previously adopted techniques in terms of Q-factor. The obtained improvement demonstrates that, by tailoring the detection strategy to the peculiar properties of the nonlinear Fourier transform, it is possible to boost the performance of NFDM systems and overcome current limitations imposed by the use of more conventional detection techniques suitable for the linear regime.
Internet of things (IoT) applications have become increasingly popular in recent years, with applications ranging from building energy monitoring to personal health tracking and activity recognition. In order to leverage these data, automatic knowledge extraction – whereby we map from observations to interpretable states and transitions – must be done at scale. As such, we have seen many recent IoT data sets include annotations with a human expert specifying states, recorded as a set of boundaries and associated labels in a data sequence. These data can be used to build automatic labeling algorithms that produce labels as an expert would. Here, we refer to human-specified boundaries as breakpoints. Traditional changepoint detection methods only look for statistically-detectable boundaries that are defined as abrupt variations in the generative parameters of a data sequence. However, we observe that breakpoints occur on more subtle boundaries that are non-trivial to detect with these statistical methods. In this work, we propose a new unsupervised approach, based on deep learning, that outperforms existing techniques and learns the more subtle, breakpoint boundaries with a high accuracy. Through extensive experiments on various real-world data sets – including human-activity sensing data, speech signals, and electroencephalogram (EEG) activity traces – we demonstrate the effectiveness of our algorithm for practical applications. Furthermore, we show that our approach achieves significantly better performance than previous methods.
We examine Deep Canonically Correlated LSTMs as a way to learn nonlinear transformations of variable length sequences and embed them into a correlated, fixed dimensional space. We use LSTMs to transform multi-view time-series data non-linearly while learning temporal relationships within the data. We then perform correlation analysis on the outputs of these neural networks to find a correlated subspace through which we get our final representation via projection. This work follows from previous work done on Deep Canonical Correlation (DCCA), in which deep feed-forward neural networks were used to learn nonlinear transformations of data while maximizing correlation.
Big Data architectures allow to flexibly store and process heterogeneous data, from multiple sources, in their original format. The structure of those data, commonly supplied by means of REST APIs, is continuously evolving. Thus data analysts need to adapt their analytical processes after each API release. This gets more challenging when performing an integrated or historical analysis. To cope with such complexity, in this paper, we present the Big Data Integration ontology, the core construct to govern the data integration process under schema evolution by systematically annotating it with information regarding the schema of the sources. We present a query rewriting algorithm that, using the annotated ontology, converts queries posed over the ontology to queries over the sources. To cope with syntactic evolution in the sources, we present an algorithm that semi-automatically adapts the ontology upon new releases. This guarantees ontology-mediated queries to correctly retrieve data from the most recent schema version as well as correctness in historical queries. A functional and performance evaluation on real-world APIs is performed to validate our approach.
The MOOC Replication Framework (MORF) is a novel software system for feature extraction, model training/testing, and evaluation of predictive dropout models in Massive Open Online Courses (MOOCs). MORF makes large-scale replication of complex machine-learned models tractable and accessible for researchers, and enables public research on privacy-protected data. It does so by focusing on the high-level operations of an \emph{extract-train-test-evaluate} workflow, and enables researchers to encapsulate their implementations in portable, fully reproducible software containers which are executed on data with a known schema. MORF’s workflow allows researchers to use data in analysis without providing them access to the underlying data directly, preserving privacy and data security. During execution, containers are sandboxed for security and data leakage and parallelized for efficiency, allowing researchers to create and test new models rapidly, on large-scale multi-institutional datasets that were previously inaccessible to most researchers. MORF is provided both as a Python API (the MORF Software), for institutions to use on their own MOOC data) or in a platform-as-a-service (PaaS) model with a web API and a high-performance computing environment (the MORF Platform).
Feature engineering is one of the most important but tedious tasks in data science projects. This work studies automation of feature learning for relational data. We first theoretically proved that learning relevant features from relational data for a given predictive analytics problem is NP-hard. However, it is possible to empirically show that an efficient rule based approach predefining transformations as a priori based on heuristics can extract very useful features from relational data. Indeed, the proposed approach outperformed the state of the art solutions with a significant margin. We further introduce a deep neural network which automatically learns appropriate transformations of relational data into a representation that predicts the target variable well instead of being predefined as a priori by users. In an extensive experiment with Kaggle competitions, the proposed methods could win late medals. To the best of our knowledge, this is the first time an automation system could win medals in Kaggle competitions with complex relational data.
Topic modeling enables exploration and compact representation of a corpus. The CaringBridge (CB) dataset is a massive collection of journals written by patients and caregivers during a health crisis. Topic modeling on the CB dataset, however, is challenging due to the asynchronous nature of multiple authors writing about their health journeys. To overcome this challenge we introduce the Dynamic Author-Persona topic model (DAP), a probabilistic graphical model designed for temporal corpora with multiple authors. The novelty of the DAP model lies in its representation of authors by a persona — where personas capture the propensity to write about certain topics over time. Further, we present a regularized variational inference algorithm, which we use to encourage the DAP model’s personas to be distinct. Our results show significant improvements over competing topic models — particularly after regularization, and highlight the DAP model’s unique ability to capture common journeys shared by different authors.
This paper introduces grouped latent heterogeneity in panel data quantile regression. More precisely, we assume that the observed individuals come from a heterogeneous population with an unknown, finite number of types. The number of types and group membership is not assumed to be known in advance and is estimated by means of a convex optimization problem. We provide conditions under which group membership is estimated consistently and establish asymptotic normality of the resulting estimators.
We present our framework DKVF that enables one to quickly prototype and evaluate new protocols for key-value stores and compare them with existing protocols based on selected benchmarks. Due to limitations of CAP theorem, new protocols must be developed that achieve the desired trade-off between consistency and availability for the given application at hand. Hence, both academic and industrial communities focus on developing new protocols that identify a different (and hopefully better in one or more aspect) point on this trade-off curve. While these protocols are often based on a simple intuition, evaluating them to ensure that they indeed provide increased availability, consistency, or performance is a tedious task. Our framework, DKVF, enables one to quickly prototype a new protocol as well as identify how it performs compared to existing protocols for pre-specified benchmarks. Our framework relies on YCSB (Yahoo! Cloud Servicing Benchmark) for benchmarking. We demonstrate DKVF by implementing four existing protocols –eventual consistency, COPS, GentleRain and CausalSpartan– with it. We compare the performance of these protocols against different loading conditions. We find that the performance is similar to our implementation of these protocols from scratch. And, the comparison of these protocols is consistent with what has been reported in the literature. Moreover, implementation of these protocols was much more natural as we only needed to translate the pseudocode into Java (and add the necessary error handling). Hence, it was possible to achieve this in just 1-2 days per protocol. Finally, our framework is extensible. It is possible to replace individual components in the framework (e.g., the storage component).
Motion blur is a fundamental problem in computer vision as it impacts image quality and hinders inference. Traditional deblurring algorithms leverage the physics of the image formation model and use hand-crafted priors: they usually produce results that better reflect the underlying scene, but present artifacts. Recent learning-based methods implicitly extract the distribution of natural images directly from the data and use it to synthesize plausible images. Their results are impressive, but they are not always faithful to the content of the latent image. We present an approach that bridges the two. Our method fine-tunes existing deblurring neural networks in a self-supervised fashion by enforcing that the output, when blurred based on the optical flow between subsequent frames, matches the input blurry image. We show that our method significantly improves the performance of existing methods on several datasets both visually and in terms of image quality metrics. The supplementary material is https://goo.gl/nYPjEQ
Partially inspired by successful applications of variational recurrent neural networks, we propose a novel variational recurrent neural machine translation (VRNMT) model in this paper. Different from the variational NMT, VRNMT introduces a series of latent random variables to model the translation procedure of a sentence in a generative way, instead of a single latent variable. Specifically, the latent random variables are included into the hidden states of the NMT decoder with elements from the variational autoencoder. In this way, these variables are recurrently generated, which enables them to further capture strong and complex dependencies among the output translations at different timesteps. In order to deal with the challenges in performing efficient posterior inference and large-scale training during the incorporation of latent variables, we build a neural posterior approximator, and equip it with a reparameterization technique to estimate the variational lower bound. Experiments on Chinese-English and English-German translation tasks demonstrate that the proposed model achieves significant improvements over both the conventional and variational NMT models.
In practice, most spoken language understanding systems process user input in a pipelined manner; first domain is predicted, then intent and semantic slots are inferred according to the semantic frames of the predicted domain. The pipeline approach, however, has some disadvantages: error propagation and lack of information sharing. To address these issues, we present a unified neural network that jointly performs domain, intent, and slot predictions. Our approach adopts a principled architecture for multitask learning to fold in the state-of-the-art models for each task. With a few more ingredients, e.g. orthography-sensitive input encoding and curriculum training, our model delivered significant improvements in all three tasks across all domains over strong baselines, including one using oracle prediction for domain detection, on real user data of a commercial personal assistant.
Data-stream processing has continuously risen in importance as the amount of available data has been steadily increasing over the last decade. Besides traditional domains such as data-center monitoring and click analytics, there is an increasing number of network-enabled production machines that generate continuous streams of data. Due to their continuous nature, queries on data-streams can be more complex, and distinctly harder to understand then database queries. As users have to consider operational details, maintenance and debugging become challenging. Current approaches model data-streams as sequences, because this is the way they are physically received. These models result in an implementation-focused perspective. We explore an alternate way of modeling datastreams by focusing on time-slicing semantics. This focus results in a model based on functions, which is better suited for reasoning about query semantics. By adapting the definitions of relevant concepts in stream processing to our model, we illustrate the practical useful- ness of our approach. Thereby, we link data-streams and query primitives to concepts in functional programming and mathematics. Most noteworthy, we prove that data-streams are monads, and show how to derive monad definitions for current data-stream models. We provide an abstract, yet practical perspective on data- stream related subjects based on a sound, consistent query model. Our work can serve as solid foundation for future data-stream query-languages.
A fundamental task in numerical computation is the solution of large linear systems. The conjugate gradient method is an iterative method which offers rapid convergence to the solution, particularly when an effective preconditioner is employed. However, for more challenging systems a substantial error can be present even after many iterations have been performed. The estimates obtained in this case are of little value unless further information can be provided about the numerical error. In this paper we propose a novel statistical model for this numerical error set in a Bayesian framework. Our approach is a strict generalisation of the conjugate gradient method, which is recovered as the posterior mean for a particular choice of prior. The estimates obtained are analysed with Krylov subspace methods and a contraction result for the posterior is presented. The method is then analysed in a simulation study as well as being applied to a challenging problem in medical imaging.
The computational complexity of leveraging deep neural networks for extracting deep feature representations is a significant barrier to its widespread adoption, particularly for use in embedded devices. One particularly promising strategy to addressing the complexity issue is the notion of evolutionary synthesis of deep neural networks, which was demonstrated to successfully produce highly efficient deep neural networks while retaining modeling performance. Here, we further extend upon the evolutionary synthesis strategy for achieving efficient feature extraction via the introduction of a stress-induced evolutionary synthesis framework, where stress signals are imposed upon the synapses of a deep neural network during training to induce stress and steer the synthesis process towards the production of more efficient deep neural networks over successive generations and improved model fidelity at a greater efficiency. The proposed stress-induced evolutionary synthesis approach is evaluated on a variety of different deep neural network architectures (LeNet5, AlexNet, and YOLOv2) on different tasks (object classification and object detection) to synthesize efficient StressedNets over multiple generations. Experimental results demonstrate the efficacy of the proposed framework to synthesize StressedNets with significant improvement in network architecture efficiency (e.g., 40x for AlexNet and 33x for YOLOv2) and speed improvements (e.g., 5.5x inference speed-up for YOLOv2 on an Nvidia Tegra X1 mobile processor).
Humans can quickly learn new visual concepts, perhaps because they can easily visualize or imagine what novel objects look like from different views. Incorporating this ability to hallucinate novel instances of new concepts might help machine vision systems perform better low-shot learning, i.e., learning concepts from few examples. We present a novel approach to low-shot learning that uses this idea. Our approach builds on recent progress in meta-learning (‘learning to learn’) by combining a meta-learner with a ‘hallucinator’ that produces additional training examples, and optimizing both models jointly. Our hallucinator can be incorporated into a variety of meta-learners and provides significant gains: up to a 6 point boost in classification accuracy when only a single training example is available, yielding state-of-the-art performance on the challenging ImageNet low-shot classification benchmark.

# Book Memo: “Randomness and Hyper-randomness”

 The monograph compares two approaches that describe the statistical stability phenomenon – one proposed by the probability theory that ignores violations of statistical stability and another proposed by the theory of hyper-random phenomena that takes these violations into account. There are five parts. The first describes the phenomenon of statistical stability. The second outlines the mathematical foundations of probability theory. The third develops methods for detecting violations of statistical stability and presents the results of experimental research on actual processes of different physical nature that demonstrate the violations of statistical stability over broad observation intervals. The fourth part outlines the mathematical foundations of the theory of hyper-random phenomena. The fifth part discusses the problem of how to provide an adequate description of the world. The monograph should be interest to a wide readership: from university students on a first course majoring in physics, engineering, and mathematics to engineers, post-graduate students, and scientists carrying out research on the statistical laws of natural physical phenomena, developing and using statistical methods for high-precision measurement, prediction, and signal processing over broad observation intervals. To read the book, it is sufficient to be familiar with a standard first university course on mathematics.

# Document worth reading: “Reductions for Frequency-Based Data Mining Problems”

Studying the computational complexity of problems is one of the – if not the – fundamental questions in computer science. Yet, surprisingly little is known about the computational complexity of many central problems in data mining. In this paper we study frequency-based problems and propose a new type of reduction that allows us to compare the complexities of the maximal frequent pattern mining problems in different domains (e.g. graphs or sequences). Our results extend those of Kimelfeld and Kolaitis [ACM TODS, 2014] to a broader range of data mining problems. Our results show that, by allowing constraints in the pattern space, the complexities of many maximal frequent pattern mining problems collapse. These problems include maximal frequent subgraphs in labelled graphs, maximal frequent itemsets, and maximal frequent subsequences with no repetitions. In addition to theoretical interest, our results might yield more efficient algorithms for the studied problems. Reductions for Frequency-Based Data Mining Problems

# Distilled News

Machines getting the better of humans is no longer a surprise. It started with IBM’s Deep Blue program beating Garry Kasparov in a game of chess more than 20 years ago and with the increasing breakthroughs in the world of machine and deep learning, machines continue to become powerful tools. Yesterday, Alibaba developed a model that beat out any human competition in Stanford’s reading comprehension competition. The dataset consists of more than 100,000 questions sourced from more than 500 Wikipedia articles. The purpose of the quiz is to see how long it takes the machine learning models to process all the information, train themselves and then provide precise or accurate answers. Alibaba used a deep learning framework to build a neural network model. It’s based on the “Hierarchical Attention Network”, which according to the company, works by identifying first paragraphs, then sentences and finally words. The underlying technology has been used previously by Alibaba, in it’s AI-powered chatbot – Dian Xiaomi.
“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are faster than R code.” The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.
One of the features of the functions in smooth package is the ability to use exogenous (aka “external”) variables. This potentially leads to the increase in the forecasting accuracy (given that you have a good estimate of the future exogenous variable). For example, in retail this can be a binary variable for promotions and we may know when the next promotion will happen. Or we may have an idea about the temperature for the next day and include it as an exogenous variable in the model. While arima() function from stats package allows inserting exogenous variables, ets() function from forecast package does not. That was one of the original motivations of developing an alternative function for ETS. It is worth noting that all the forecasting functions in smooth package (except for sma()) allow using exogenous variables, so this feature is not restricted with es() only.
Choosing a machine learning (ML) library to solve predictive use cases is easier said than done. There are many to choose from, and each have their own niche and benefits that are good for specific use cases. Even for someone with decent experience in ML and data science, it can be an ordeal to vet all the varied solutions. Where do you start? At Salesforce Einstein, we have to constantly research the market to stay on top of it. Here are some observations on the top five characteristics of ML libraries that developers should consider when deciding what library to use:
I started working with R around about 5 years ago. Parts of the R world have changed substantially over that time, while other parts remain largely the same. One thing that hasn’t changed however, is that there has never been a simple, high-level text to introduce newcomers to the ecosystem. I believe this is especially important now that the ecosystem has grown so much. It’s no longer enough to just know about R itself. Those working with, or even around R, must now understand the ecosystem as a whole in order to best manage and support its use. Hopefully the Field Guide to the R Ecosystem goes some way towards filling this gap.

# Whats new on arXiv

Current machine learning systems operate, almost exclusively, in a statistical, or model-free mode, which entails severe theoretical limits on their power and performance. Such systems cannot reason about interventions and retrospection and, therefore, cannot serve as the basis for strong AI. To achieve human level intelligence, learning machines need the guidance of a model of reality, similar to the ones used in causal inference tasks. To demonstrate the essential role of such models, I will present a summary of seven tasks which are beyond reach of current machine learning systems and which have been accomplished using the tools of causal modeling.
Some deep convolutional neural networks were proposed for time-series classification and class imbalanced problems. However, those models performed degraded and even failed to recognize the minority class of an imbalanced temporal sequences dataset. Minority samples would bring troubles for temporal deep learning classifiers due to the equal treatments of majority and minority class. Until recently, there were few works applying deep learning on imbalanced time-series classification (ITSC) tasks. Here, this paper aimed at tackling ITSC problems with deep learning. An adaptive cost-sensitive learning strategy was proposed to modify temporal deep learning models. Through the proposed strategy, classifiers could automatically assign misclassification penalties to each class. In the experimental section, the proposed method was utilized to modify five neural networks. They were evaluated on a large volume, real-life and imbalanced time-series dataset with six metrics. Each single network was also tested alone and combined with several mainstream data samplers. Experimental results illustrated that the proposed cost-sensitive modified networks worked well on ITSC tasks. Compared to other methods, the cost-sensitive convolution neural network and residual network won out in the terms of all metrics. Consequently, the proposed cost-sensitive learning strategy can be used to modify deep learning classifiers from cost-insensitive to cost-sensitive. Those cost-sensitive convolutional networks can be effectively applied to address ITSC issues.
Over the past decade, multivariate time series classification has been receiving a lot of attention. We propose augmenting the existing univariate time series classification models, LSTM-FCN and ALSTM-FCN with a squeeze and excitation block to further improve performance. Our proposed models outperform most of the state of the art models while requiring minimum preprocessing. The proposed models work efficiently on various complex multivariate time series classification tasks such as activity recognition or action recognition. Furthermore, the proposed models are highly efficient at test time and small enough to deploy on memory constrained systems.
Brain Electroencephalography (EEG) classification is widely applied to analyze cerebral diseases in recent years. Unfortunately, invalid/noisy EEGs degrade the diagnosis performance and most previously developed methods ignore the necessity of EEG selection for classification. To this end, this paper proposes a novel maximum weight clique-based EEG selection approach, named mwcEEGs, to map EEG selection to searching maximum similarity-weighted cliques from an improved Fr\'{e}chet distance-weighted undirected EEG graph simultaneously considering edge weights and vertex weights. Our mwcEEGs improves the classification performance by selecting intra-clique pairwise similar and inter-clique discriminative EEGs with similarity threshold $\delta$. Experimental results demonstrate the algorithm effectiveness compared with the state-of-the-art time series selection algorithms on real-world EEG datasets.
The problem of quickest detection of a change in distribution is considered under the assumption that the pre-change distribution is known, and the post-change distribution is only known to belong to a family of distributions distinguishable from a discretized version of the pre-change distribution. A sequential change detection procedure is proposed that partitions the sample space into a finite number of bins, and monitors the number of samples falling into each of these bins to detect the change. A test statistic that approximates the generalized likelihood ratio test is developed. It is shown that the proposed test statistic can be efficiently computed using a recursive update scheme, and a procedure for choosing the number of bins in the scheme is provided. Various asymptotic properties of the test statistic are derived to offer insights into its performance trade-off between average detection delay and average run length to a false alarm. Testing on synthetic and real data demonstrates that our approach is comparable or better in performance to existing non-parametric change detection methods.
Machine Learning (ML) and Deep Learning (DL) are two technologies used to extract representations of the data for a specific purpose. ML algorithms take a set of data as input to generate one or several predictions. To define the final version of one model, usually there is an initial step devoted to train the algorithm (get the right final values of the parameters of the model). There are several techniques, from supervised learning to reinforcement learning, which have different requirements. On the market, there are some frameworks or APIs that reduce the effort for designing a new ML model. In this report, using the benchmark DLBENCH, we will analyse the performance and the execution modes of some well-known ML frameworks on the Finis Terrae II supercomputer when supervised learning is used. The report will show that placement of data and allocated hardware can have a large influence on the final timeto-solution.
Density estimation is an interdisciplinary topic at the intersection of statistics, theoretical computer science and machine learning. We review some old and new techniques for bounding sample complexity of estimating densities of continuous distributions, focusing on the class of mixtures of Gaussians and its subclasses.
In recent years, there have been tremendous advancements in the field of machine learning. These advancements have been made through both academic as well as industrial research. Lately, a fair amount of research has been dedicated to the usage of generative models in the field of computer vision and image classification. These generative models have been popularized through a new framework called Generative Adversarial Networks. Moreover, many modified versions of this framework have been proposed in the last two years. We study the original model proposed by Goodfellow et al. as well as modifications over the original model and provide a comparative analysis of these models.
We present a noise-injected version of the Expectation-Maximization (EM) algorithm: the Noisy Expectation Maximization (NEM) algorithm. The NEM algorithm uses noise to speed up the convergence of the EM algorithm. The NEM theorem shows that injected noise speeds up the average convergence of the EM algorithm to a local maximum of the likelihood surface if a positivity condition holds. The generalized form of the noisy expectation-maximization (NEM) algorithm allow for arbitrary modes of noise injection including adding and multiplying noise to the data. We demonstrate these noise benefits on EM algorithms for the Gaussian mixture model (GMM) with both additive and multiplicative NEM noise injection. A separate theorem (not presented here) shows that the noise benefit for independent identically distributed additive noise decreases with sample size in mixture models. This theorem implies that the noise benefit is most pronounced if the data is sparse. Injecting blind noise only slowed convergence.
Multiple imputation is a straightforward method for handling missing data in a principled fashion. This paper presents an overview of multiple imputation, including important theoretical results and their practical implications for generating and using multiple imputations. A review of strategies for generating imputations follows, including recent developments in flexible joint modeling and sequential regression/chained equations/fully conditional specification approaches. Finally, we compare and contrast different methods for generating imputations on a range of criteria before identifying promising avenues for future research.
We argue that the estimation of the mutual information between high dimensional continuous random variables is achievable by gradient descent over neural networks. This paper presents a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size. MINE is back-propable and we prove that it is strongly consistent. We illustrate a handful of applications in which MINE is succesfully applied to enhance the property of generative models in both unsupervised and supervised settings. We apply our framework to estimate the information bottleneck, and apply it in tasks related to supervised classification problems. Our results demonstrate substantial added flexibility and improvement in these settings.
Information collection is a fundamental problem in big data, where the size of sampling sets plays a very important role. This work considers the information collection process by taking message importance into account. Similar to differential entropy, we define differential message importance measure (DMIM) as a measure of message importance for continuous random variable. It is proved that the change of DMIM can describe the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to Kolmogorov-Smirnov statistic, but it offers a new way to characterize the distribution goodness-of-fit. Numerical results show some basic properties of DMIM and the accuracy of the proposed approximate values. Furthermore, it is also obtained that the empirical distribution approaches the real distribution with decreasing of the DMIM deviation, which contributes to the selection of suitable sampling points in actual system.
Information transfer which reveals the state variation of variables can play a vital role in big data analytics and processing. In fact, the measure for information transfer can reflect the system change from the statistics by using the variable distributions, similar to KL divergence and Renyi divergence. Furthermore, in terms of the information transfer in big data, small probability events dominate the importance of the total message to some degree. Therefore, it is significant to design an information transfer measure based on the message importance which emphasizes the small probability events. In this paper, we propose the message importance divergence (MID) and investigate its characteristics and applications on three aspects. First, the message importance transfer capacity based on MID is presented to offer an upper bound for the information transfer with disturbance. Then, we utilize the MID to guide the queue length selection, which is the fundamental problem considered to have higher social or academic value in the caching operation of mobile edge computing. Finally, we extend the MID to the continuous case and discuss the robustness by using it to measuring information distance.
Salient object detection is a fundamental problem and has been received a great deal of attentions in computer vision. Recently deep learning model became a powerful tool for image feature extraction. In this paper, we propose a multi-scale deep neural network (MSDNN) for salient object detection. The proposed model first extracts global high-level features and context information over the whole source image with recurrent convolutional neural network (RCNN). Then several stacked deconvolutional layers are adopted to get the multi-scale feature representation and obtain a series of saliency maps. Finally, we investigate a fusion convolution module (FCM) to build a final pixel level saliency map. The proposed model is extensively evaluated on four salient object detection benchmark datasets. Results show that our deep model significantly outperforms other 12 state-of-the-art approaches.
This paper proposes a fully connected neural network model to map samples from a uniform distribution to samples of any explicitly known probability density function. During the training, the Jensen-Shannon divergence between the distribution of the model’s output and the target distribution is minimized. We experimentally demonstrate that our model converges towards the desired state. It provides an alternative to existing sampling methods such as inversion sampling, rejection sampling, Gaussian mixture models and Markov-Chain-Monte-Carlo. Our model has high sampling efficiency and is easily applied to any probability distribution, without the need of further analytical or numerical calculations. It can produce correlated samples, such that the output distribution converges faster towards the target than for independent samples. But it is also able to produce independent samples, if single values are fed into the network and the input values are independent as well. We focus on one-dimensional sampling, but additionally illustrate a two-dimensional example with a target distribution of dependent variables.
Automated decision making systems are increasingly being used in real-world applications. In these systems for the most part, the decision rules are derived by minimizing the training error on the available historical data. Therefore, if there is a bias related to a sensitive attribute such as gender, race, religion, etc. in the data, say, due to cultural/historical discriminatory practices against a certain demographic, the system could continue discrimination in decisions by including the said bias in its decision rule. We present an information theoretic framework for designing fair predictors from data, which aim to prevent discrimination against a specified sensitive attribute in a supervised learning setting. We use equalized odds as the criterion for discrimination, which demands that the prediction should be independent of the protected attribute conditioned on the actual label. To ensure fairness and generalization simultaneously, we compress the data to an auxiliary variable, which is used for the prediction task. This auxiliary variable is chosen such that it is decontaminated from the discriminatory attribute in the sense of equalized odds. The final predictor is obtained by applying a Bayesian decision rule to the auxiliary variable.
Going deeper and wider in neural architectures improves the accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need change to less desired network architectures, or nontrivially dissect a network across multiGPUs. These distract DL practitioners from concentrating on their original machine learning tasks. We present SuperNeurons: a dynamic GPU memory scheduling runtime to enable the network training far beyond the GPU DRAM capacity. SuperNeurons features 3 memory optimizations, \textit{Liveness Analysis}, \textit{Unified Tensor Pool}, and \textit{Cost-Aware Recomputation}, all together they effectively reduce the network-wide peak memory usage down to the maximal memory usage among layers. We also address the performance issues in those memory saving techniques. Given the limited GPU DRAM, SuperNeurons not only provisions the necessary memory for the training, but also dynamically allocates the memory for convolution workspaces to achieve the high performance. Evaluations against Caffe, Torch, MXNet and TensorFlow have demonstrated that SuperNeurons trains at least 3.2432 deeper network than current ones with the leading performance. Particularly, SuperNeurons can train ResNet2500 that has $10^4$ basic network layers on a 12GB K40c.
ConvNets have been very effective in many applications where it is required to learn invariances to within-class nuisance transformations. However, through their architecture, ConvNets only enforce invariance to translation. In this paper, we introduce a new class of convolutional architectures called Non-Parametric Transformation Networks (NPTNs) which can learn general invariances and symmetries directly from data. NPTNs are a direct and natural generalization of ConvNets and can be optimized directly using gradient descent. They make no assumption regarding structure of the invariances present in the data and in that aspect are very flexible and powerful. We also model ConvNets and NPTNs under a unified framework called Transformation Networks which establishes the natural connection between the two. We demonstrate the efficacy of NPTNs on natural data such as MNIST and CIFAR 10 where it outperforms ConvNet baselines with the same number of parameters. We show it is effective in learning invariances unknown apriori directly from data from scratch. Finally, we apply NPTNs to Capsule Networks and show that they enable them to perform even better.
Text Mining is a field that aims at extracting information from textual data. One of the challenges of such field of study comes from the pre-processing stage in which a vector (and structured) representation should be extracted from unstructured data. The common extraction creates large and sparse vectors representing the importance of each term to a document. As such, this usually leads to the curse-of-dimensionality that plagues most machine learning algorithms. To cope with this issue, in this paper we propose a new supervised feature extraction and reduction algorithm, named DCDistance, that creates features based on the distance between a document to a representative of each class label. As such, the proposed technique can reduce the features set in more than 99% of the original set. Additionally, this algorithm was also capable of improving the classification accuracy over a set of benchmark datasets when compared to traditional and state-of-the-art features selection algorithms.
Learning a classifier with control on the false-positive rate plays a critical role in many machine learning applications. Existing approaches either introduce prior knowledge dependent label cost or tune parameters based on traditional classifiers, which lack consistency in methodology because they do not strictly adhere to the false-positive rate constraint. In this paper, we propose a novel scoring-thresholding approach, tau-False Positive Learning (tau-FPL) to address this problem. We show the scoring problem which takes the false-positive rate tolerance into accounts can be efficiently solved in linear time, also an out-of-bootstrap thresholding method can transform the learned ranking function into a low false-positive classifier. Both theoretical analysis and experimental results show superior performance of the proposed tau-FPL over existing approaches.
The growth of big data in domains such as Earth Sciences, Social Networks, Physical Sciences, etc. has lead to an immense need for efficient and scalable linear algebra operations, e.g. Matrix inversion. Existing methods for efficient and distributed matrix inversion using big data platforms rely on LU decomposition based block-recursive algorithms. However, these algorithms are complex and require a lot of side calculations, e.g. matrix multiplication, at various levels of recursion. In this paper, we propose a different scheme based on Strassen’s matrix inversion algorithm (mentioned in Strassen’s original paper in 1969), which uses far fewer operations at each level of recursion. We implement the proposed algorithm, and through extensive experimentation, show that it is more efficient than the state of the art methods. Furthermore, we provide a detailed theoretical analysis of the proposed algorithm, and derive theoretical running times which match closely with the empirically observed wall clock running times, thus explaining the U-shaped behaviour w.r.t. block-sizes.
Multi-relation Question Answering is a challenging task, due to the requirement of elaborated analysis on questions and reasoning over multiple fact triples in knowledge base. In this paper, we present a novel model called Interpretable Reasoning Network that employs an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decides which part of an input question should be analyzed at each hop; predicts a relation that corresponds to the current parsed results; utilizes the predicted relation to update the question representation and the state of the reasoning process; and then drives the next-hop reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model can offer traceable and observable intermediate predictions for reasoning analysis and failure diagnosis.
Learning similarity functions between image pairs with deep neural networks yields highly correlated activations of embeddings. In this work, we show how to improve the robustness of such embeddings by exploiting the independence within ensembles. To this end, we divide the last embedding layer of a deep network into an embedding ensemble and formulate training this ensemble as an online gradient boosting problem. Each learner receives a reweighted training sample from the previous learners. Further, we propose two loss functions which increase the diversity in our ensemble. These loss functions can be applied either for weight initialization or during training. Together, our contributions leverage large embedding sizes more effectively by significantly reducing correlation of the embedding and consequently increase retrieval accuracy of the embedding. Our method works with any differentiable loss function and does not introduce any additional parameters during test time. We evaluate our metric learning method on image retrieval tasks and show that it improves over state-of-the-art methods on the CUB 200-2011, Cars-196, Stanford Online Products, In-Shop Clothes Retrieval and VehicleID datasets.
We propose Machines Talking To Machines (M2M), a framework combining automation and crowdsourcing to rapidly bootstrap end-to-end dialogue agents for goal-oriented dialogues in arbitrary domains. M2M scales to new tasks with just a task schema and an API client from the dialogue system developer, but it is also customizable to cater to task-specific interactions. Compared to the Wizard-of-Oz approach for data collection, M2M achieves greater diversity and coverage of salient dialogue flows while maintaining the naturalness of individual utterances. In the first phase, a simulated user bot and a domain-agnostic system bot converse to exhaustively generate dialogue ‘outlines’, i.e. sequences of template utterances and their semantic parses. In the second phase, crowd workers provide contextual rewrites of the dialogues to make the utterances more natural while preserving their meaning. The entire process can finish within a few hours. We propose a new corpus of 3,000 dialogues spanning 2 domains collected with M2M, and present comparisons with popular dialogue datasets on the quality and diversity of the surface forms and dialogue flows.
Database applications are typically written using a mixture of imperative languages and declarative frameworks for data processing. Application logic gets distributed across the declarative and imperative parts of a program. Often, there is more than one way to implement the same program, whose efficiency may depend on a number of parameters. In this paper, we propose a framework that automatically generates all equivalent alternatives of a given program using a given set of program transformations, and chooses the least cost alternative. We use the concept of program regions as an algebraic abstraction of a program and extend the Volcano/Cascades framework for optimization of algebraic expressions, to optimize programs. We illustrate the use of our framework for optimizing database applications. We show through experimental results, that our framework has wide applicability in real world applications and provides significant performance benefits.

# Book Memo: “Mining the Social Web”

 Want to tap the tremendous amount of valuable social data in Facebook, Twitter, LinkedIn, GitHub, Instagram, and Google+? This new edition helps you discover who’s making connections with social media, what they’re talking about, and where they’re located. You’ll learn how to combine social web data, analysis techniques, and visualization to find what you’ve been looking for in the social haystack—as well as useful information you didn’t know existed. • Get a straightforward synopsis of the social web landscape • Use adaptable scripts on GitHub to harvest data from social network APIs. • Learn how to employ easy-to-use Python tools to slice and dice the data you collect • Explore social connections in microformats with the XHTML Friends Network • Apply advanced mining techniques such as TF-IDF, cosine similarity, collocation analysis, document summarization, clique detection, and image recognition • Build interactive visualizations with web technologies based upon HTML5 and JavaScript toolkits

# If you did not already know

Zeros Ones Inflated Proportional
The ZOIP distribution (Zeros Ones Inflated Proportional) is a proportional data distribution inflated with zeros and/or ones, this distribution is defined on the most known proportional data distributions, the beta and simplex distribution, Jørgensen and Barndorff-Nielsen (1991) <doi:10.1016/0047-259X(91)90008-P>, also allows it to have different parameterizations of the beta distribution, Ferrari and Cribari-Neto (2004) <doi:10.1080/0266476042000214501>, Rigby and Stasinopoulos (2005) <doi:10.18637/jss.v023.i07>. The ZOIP distribution has four parameters, two of which correspond to the proportion of zeros and ones, and the other two correspond to the distribution of the proportional data of your choice. The ‘ZOIP’ package allows adjustments of regression models for fixed and mixed effects for proportional data inflated with zeros and/or ones. …

Fuzzy Supervised Learning with Binary Meta-Feature (FSL-BM)
This paper introduces a novel real-time Fuzzy Supervised Learning with Binary Meta-Feature (FSL-BM) for big data classification task. The study of real-time algorithms addresses several major concerns, which are namely: accuracy, memory consumption, and ability to stretch assumptions and time complexity. Attaining a fast computational model providing fuzzy logic and supervised learning is one of the main challenges in the machine learning. In this research paper, we present FSL-BM algorithm as an efficient solution of supervised learning with fuzzy logic processing using binary meta-feature representation using Hamming Distance and Hash function to relax assumptions. While many studies focused on reducing time complexity and increasing accuracy during the last decade, the novel contribution of this proposed solution comes through integration of Hamming Distance, Hash function, binary meta-features, binary classification to provide real time supervised method. Hash Tables (HT) component gives a fast access to existing indices; and therefore, the generation of new indices in a constant time complexity, which supersedes existing fuzzy supervised algorithms with better or comparable results. To summarize, the main contribution of this technique for real-time Fuzzy Supervised Learning is to represent hypothesis through binary input as meta-feature space and creating the Fuzzy Supervised Hash table to train and validate model. …

Norm
In linear algebra, functional analysis and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space – save possibly for the zero vector, which is assigned a length of zero. A seminorm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero vector). A norm must also satisfy certain properties pertaining to scalability and additivity which are given in the formal definition below. A simple example is the 2-dimensional Euclidean space R2 equipped with the Euclidean norm. Elements in this vector space (e.g., (3, 7)) are usually drawn as arrows in a 2-dimensional cartesian coordinate system starting at the origin (0, 0). The Euclidean norm assigns to each vector the length of its arrow. Because of this, the Euclidean norm is often known as the magnitude. A vector space on which a norm is defined is called a normed vector space. Similarly, a vector space with a seminorm is called a seminormed vector space. It is often possible to supply a norm for a given vector space in more than one way. …

# R Packages worth a look

Query Search Interfaces (searcher)
Provides a search interface to look up terms on ‘Google’, ‘Bing’, ‘DuckDuckGo’, ‘StackOverflow’, ‘GitHub’, and ‘BitBucket’. Upon searching, a browser window will open with the aforementioned search results.

Composite Likelihood Estimation for Spatial Data (clespr)
Composite likelihood approach is implemented to estimating statistical models for spatial ordinal and proportional data based on Feng et al. (2014) <doi:10.1002/env.2306>. Parameter estimates are identified by maximizing composite log-likelihood functions using the limited memory BFGS optimization algorithm with bounding constraints, while standard errors are obtained by estimating the Godambe information matrix.

Adjusted Prediction Model Performance Estimation (APPEstimation)
Calculating predictive model performance measures adjusted for predictor distributions using density ratio method (Sugiyama et al., (2012, ISBN:9781139035613)). L1 and L2 error for continuous outcome and C-statistics for binomial outcome are computed.

Rcpp’ Bindings for the ‘Corpus Workbench’ (‘CWB’) (RcppCWB)
Rcpp’ Bindings for the C code of the ‘Corpus Workbench’ (‘CWB’), an indexing and query engine to efficiently analyze large corpora (<http://cwb.sourceforge.net> ). ‘RcppCWB’ is licensed under the GNU GPL-3, in line with the GPL-3 license of the ‘CWB’ (<https://…/GPL-3> ). The ‘CWB’ relies on ‘pcre’ (BSD license, see <https://…/licence.txt> ) and ‘GLib’ (LGPL license, see <https://…/lgpl-3.0.en.html> ). See the file LICENSE.note for further information.

An Integrated Regression Model for Normalizing ‘NanoString nCounter’ Data (RCRnorm)
NanoString nCounter’ is a medium-throughput platform that measures gene or microRNA expression levels. Here is a publication that introduces this platform: Malkov (2009) <doi:10.1186/1756-0500-2-80>. Here is the webpage of ‘NanoString nCounter’ where you can find detailed information about this platform <https://…/ncounter-technology>. It has great clinical application, such as diagnosis and prognosis of cancer. Implements integrated system of random-coefficient hierarchical regression model to normalize data from ‘NanoString nCounter’ platform so that noise from various sources can be removed.