Over past several years, deep learning has achieved huge successes in various applications. However, such a data-driven approach is often criticized for lack of interpretability. Recently, we proposed artificial quadratic neural networks consisting of second-order neurons in potentially many layers. In each second-order neuron, a quadratic function is used in the place of the inner product in a traditional neuron, and then undergoes a nonlinear activation. With a single second-order neuron, any fuzzy logic operation, such as XOR, can be implemented. In this sense, any deep network constructed with quadratic neurons can be interpreted as a deep fuzzy logic system. Since traditional neural networks and second-order counterparts can represent each other and fuzzy logic operations are naturally implemented in second-order neural networks, it is plausible to explain how a deep neural network works with a second-order network as the system model. In this paper, we generalize and categorize fuzzy logic operations implementable with individual second-order neurons, and then perform statistical/information theoretic analyses of exemplary quadratic neural networks.
Unsupervised domain adaptation techniques have been successful for a wide range of problems where supervised labels are limited. The task is to classify an unlabeled target’ dataset by leveraging a labeled source’ dataset that comes from a slightly similar distribution. We propose metric-based adversarial discriminative domain adaptation (M-ADDA) which performs two main steps. First, it uses a metric learning approach to train the source model on the source dataset by optimizing the triplet loss function. This results in clusters where embeddings of the same label are close to each other and those with different labels are far from one another. Next, it uses the adversarial approach (as that used in ADDA \cite{2017arXiv170205464T}) to make the extracted features from the source and target datasets indistinguishable. Simultaneously, we optimize a novel loss function that encourages the target dataset’s embeddings to form clusters. While ADDA and M-ADDA use similar architectures, we show that M-ADDA performs significantly better on the digits adaptation datasets of MNIST and USPS. This suggests that using metric-learning for domain adaptation can lead to large improvements in classification accuracy for the domain adaptation task. The code is available at \url{https://…/M-ADDA}.
Novelty detection is the problem of identifying whether a new data point is considered to be an inlier or an outlier. We assume that training data is available to describe only the inlier distribution. Recent approaches primarily leverage deep encoder-decoder network architectures to compute a reconstruction error that is used to either compute a novelty score or to train a one-class classifier. While we too leverage a novel network of that kind, we take a probabilistic approach and effectively compute how likely is that a sample was generated by the inlier distribution. We achieve this with two main contributions. First, we make the computation of the novelty probability feasible because we linearize the parameterized manifold capturing the underlying structure of the inlier distribution, and show how the probability factorizes and can be computed with respect to local coordinates of the manifold tangent space. Second, we improved the training of the autoencoder network. An extensive set of results show that the approach achieves state-of-the-art results on several benchmark datasets.
Recent advances in the field of network representation learning are mostly attributed to the application of the skip-gram model in the context of graphs. State-of-the-art analogues of skip-gram model in graphs define a notion of neighbourhood and aim to find the vector representation for a node, which maximizes the likelihood of preserving this neighborhood. In this paper, we take a drastic departure from the existing notion of neighbourhood of a node by utilizing the idea of coreness. More specifically, we utilize the well-established idea that nodes with similar core numbers play equivalent roles in the network and hence induce a novel and an organic notion of neighbourhood. Based on this idea, we propose core2vec, a new algorithmic framework for learning low dimensional continuous feature mapping for a node. Consequently, the nodes having similar core numbers are relatively closer in the vector space that we learn. We further demonstrate the effectiveness of core2vec by comparing word similarity scores obtained by our method where the node representations are drawn from standard word association graphs against scores computed by other state-of-the-art network representation techniques like node2vec, DeepWalk and LINE. Our results always outperform these existing methods
Today’s software industry requires individuals who are proficient in as many programming languages as possible. Structured query language (SQL), as an adopted standard, is no exception, as it is the most widely used query language to retrieve and manipulate data. However, the process of learning SQL turns out to be challenging. The need for a computer-aided solution to help users learn SQL and improve their proficiency is vital. In this study, we present a new approach to help users conceptualize basic building blocks of the language faster and more efficiently. The adaptive design of the proposed approach aids users in learning SQL by supporting their own path to the solution and employing successful previous attempts, while not enforcing the ideal solution provided by the instructor. Furthermore, we perform an empirical evaluation with 93 participants and demonstrate that the employment of hints is successful, being especially beneficial for users with lower prior knowledge.
The bootstrap, introduced by Efron (1982), has become a very popular method for estimating variances and constructing confidence intervals. A key insight is that one can approximate the properties of estimators by using the empirical distribution function of the sample as an approximation for the true distribution function. This approach views the uncertainty in the estimator as coming exclusively from sampling uncertainty. We argue that for causal estimands the uncertainty arises entirely, or partially, from a different source, corresponding to the stochastic nature of the treatment received. We develop a bootstrap procedure that accounts for this uncertainty, and compare its properties to that of the classical bootstrap.
A Bayesian approach termed BAyesian Least Squares Optimization with Nonnegative L1-norm constraint (BALSON) is proposed. The error distribution of data fitting is described by Gaussian likelihood. The parameter distribution is assumed to be a Dirichlet distribution. With the Bayes rule, searching for the optimal parameters is equivalent to finding the mode of the posterior distribution. In order to explicitly characterize the nonnegative L1-norm constraint of the parameters, we further approximate the true posterior distribution by a Dirichlet distribution. We estimate the statistics of the approximating Dirichlet posterior distribution by sampling methods. Four sampling methods have been introduced. With the estimated posterior distributions, the original parameters can be effectively reconstructed in polynomial fitting problems, and the BALSON framework is found to perform better than conventional methods.
The paper extends Bayesian networks (BNs) by a mechanism for dynamic changes to the probability distributions represented by BNs. One application scenario is the process of knowledge acquisition of an observer interacting with a system. In particular, the paper considers condition/event nets where the observer’s knowledge about the current marking is a probability distribution over markings. The observer can interact with the net to deduce information about the marking by requesting certain transitions to fire and observing their success or failure. Aiming for an efficient implementation of dynamic changes to probability distributions of BNs, we consider a modular form of networks that form the arrows of a free PROP with a commutative comonoid structure, also known as term graphs. The algebraic structure of such PROPs supplies us with a compositional semantics that functorially maps BNs to their underlying probability distribution and, in particular, it provides a convenient means to describe structural updates of networks.
We analyze the time series obtained from different dynamical regimes of the logistic map by constructing their equivalent time series (TS) networks, using the visibility algorithm. The regimes analyzed include both periodic and chaotic regimes, as well as intermittent regimes and the Feigenbaum attractor at the edge of chaos. We use the methods of algebraic topology to define the simplicial characterizers, which can analyse the simplicial structure of the networks at both the global and local levels. The simplicial characterisers bring out the hierarchical levels of complexity at various topological levels. These hierarchical levels of complexity find the skeleton of the local dynamics embedded in the network which influence the global dynamical properties of the system, and also permit the identification of dominant motifs. We also analyze the same networks using conventional network characterizers such as average path lengths and clustering coefficients. We see that the simplicial characterizers are capable of distinguishing between different dynamical regimes and can pick up subtle differences in dynamical behavior, whereas the usual characterizers provide a coarser characterization. However, the two taken in conjunction, can provide information about the dynamical behavior of the time series, as well as the correlations in the evolving system. Our methods can therefore provide powerful tools for the analysis of dynamical systems.
This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other. It is widely known in machine learning that these two formalisms are closely related; for instance, the estimator of kernel ridge regression is identical to the posterior mean of Gaussian process regression. However, they have been studied and developed almost independently by two essentially separate communities, and this makes it difficult to seamlessly transfer results between them. Our aim is to overcome this potential difficulty. To this end, we review several old and new results and concepts from either side, and juxtapose algorithmic quantities from each framework to highlight close similarities. We also provide discussions on subtle philosophical and theoretical differences between the two approaches.
Bayesian optimization is an approach to optimizing objective functions that take a long time (minutes or hours) to evaluate. It is best-suited for optimization over continuous domains of less than 20 dimensions, and tolerates stochastic noise in function evaluations. It builds a surrogate for the objective and quantifies the uncertainty in that surrogate using a Bayesian machine learning technique, Gaussian process regression, and then uses an acquisition function defined from this surrogate to decide where to sample. In this tutorial, we describe how Bayesian optimization works, including Gaussian process regression and three common acquisition functions: expected improvement, entropy search, and knowledge gradient. We then discuss more advanced techniques, including running multiple function evaluations in parallel, multi-fidelity and multi-information source optimization, expensive-to-evaluate constraints, random environmental conditions, multi-task Bayesian optimization, and the inclusion of derivative information. We conclude with a discussion of Bayesian optimization software and future research directions in the field. Within our tutorial material we provide a generalization of expected improvement to noisy evaluations, beyond the noise-free setting where it is more commonly applied. This generalization is justified by a formal decision-theoretic argument, standing in contrast to previous ad hoc modifications.
Deep learning and deep architectures are emerging as the best machine learning methods so far in many practical applications such as reducing the dimensionality of data, image classification, speech recognition or object segmentation. In fact, many leading technology companies such as Google, Microsoft or IBM are researching and using deep architectures in their systems to replace other traditional models. Therefore, improving the performance of these models could make a strong impact in the area of machine learning. However, deep learning is a very fast-growing research domain with many core methodologies and paradigms just discovered over the last few years. This thesis will first serve as a short summary of deep learning, which tries to include all of the most important ideas in this research area. Based on this knowledge, we suggested, and conducted some experiments to investigate the possibility of improving the deep learning based on automatic programming (ADATE). Although our experiments did produce good results, there are still many more possibilities that we could not try due to limited time as well as some limitations of the current ADATE version. I hope that this thesis can promote future work on this topic, especially when the next version of ADATE comes out. This thesis also includes a short analysis of the power of ADATE system, which could be useful for other researchers who want to know what it is capable of.
A new forecasting method based on the concept of the profile predictive the likelihood function is proposed for discrete-valued processes. In particular, generalized autoregressive and moving average (GARMA) models for Poisson distributed data are explored in details. Highest density regions are used to construct forecasting regions. The proposed forecast estimates and regions are coherent. Large sample results are derived for the forecasting distribution. Numerical studies using simulations and a real data set are used to establish the performance of the proposed forecasting method. Robustness of the proposed method to possible misspecification in the model is also studied.
We explore solutions for automated labeling of content in bug trackers and customer support systems. In order to do that, we classify content in terms of several criteria, such as priority or product area. In the first part of the paper, we provide an overview of existing methods used for text classification. These methods fall into two categories – the ones that rely on neural networks and the ones that don’t. We evaluate results of several solutions of both kinds. In the second part of the paper we present our own recurrent neural network solution based on hierarchical attention paradigm. It consists of several Hierarchical Attention network blocks with varying Gated Recurrent Unit cell sizes and a complementary shallow network that goes alongside. Lastly, we evaluate above-mentioned methods when predicting fields from two datasets – Arch Linux bug tracker and Chromium bug tracker. Our contributions include a comprehensive benchmark between a variety of methods on relevant datasets; a novel solution that outperforms previous generation methods; and two new datasets that are made public for further research.
Model interpretability is an increasingly important component of practical machine learning. Some of the most common forms of interpretability systems are example-based, local, and global explanations. One of the main challenges in interpretability is designing explanation systems that can capture aspects of each of these explanation types, in order to develop a more thorough understanding of the model. We address this challenge in a novel model called SLIM that uses local linear modeling techniques along with a dual interpretation of random forests (both as a supervised neighborhood approach and as a feature selection method). SLIM has two fundamental advantages over existing interpretability systems. First, while it is effective as a black-box explanation system, SLIM itself is a highly accurate predictive model that provides faithful self explanations, and thus sidesteps the typical accuracy-interpretability trade-off. Second, SLIM provides both example- based and local explanations and can detect global patterns, which allows it to diagnose limitations in its local explanations.
We address the problem of causal discovery from data, making use of the recently proposed causal modeling framework of modular structural causal models (mSCM) to handle cycles, latent confounders and non-linearities. We introduce {\sigma}-connection graphs ({\sigma}-CG), a new class of mixed graphs (containing undirected, bidirected and directed edges) with additional structure, and extend the concept of {\sigma}-separation, the appropriate generalization of the well-known notion of d-separation in this setting, to apply to {\sigma}-CGs. We prove the closedness of {\sigma}-separation under marginalisation and conditioning and exploit this to implement a test of {\sigma}-separation on a {\sigma}-CG. This then leads us to the first causal discovery algorithm that can handle non-linear functional relations, latent confounders, cyclic causal relationships, and data from different (stochastic) perfect interventions. As a proof of concept, we show on synthetic data how well the algorithm recovers features of the causal graph of modular structural causal models.
Given a malfunctioning system, sequential diagnosis aims at identifying the root cause of the failure in terms of abnormally behaving system components. As initial system observations usually do not suffice to deterministically pin down just one explanation of the system’s misbehavior, additional system measurements can help to differentiate between possible explanations. The goal is to restrict the space of explanations until there is only one (highly probable) explanation left. To achieve this with a minimal-cost set of measurements, various (active learning) heuristics for selecting the best next measurement have been proposed. We report preliminary results of extensive ongoing experiments with a set of selection heuristics on real-world diagnosis cases. In particular, we try to answer questions such as ‘Is some heuristic always superior to all others ‘, ‘On which factors does the (relative) performance of the particular heuristics depend ‘ or ‘Under which circumstances should I use which heuristic ‘
We present NMT-Keras, a flexible toolkit for training deep learning models, which puts a particular emphasis on the development of advanced applications of neural machine translation systems, such as interactive-predictive translation protocols and long-term adaptation of the translation system via continuous learning. NMT-Keras is based on an extended version of the popular Keras library, and it runs on Theano and Tensorflow. State-of-the-art neural machine translation models are deployed and used following the high-level framework provided by Keras. Given its high modularity and flexibility, it also has been extended to tackle different problems, such as image and video captioning, sentence classification and visual question answering.
Internet of Things (IoT) data are increasingly viewed as a new form of massively distributed and large scale digital assets, which are continuously generated by millions of connected devices. The real value of such assets can only be realized by allowing IoT data trading to occur on a marketplace that rewards every single producer and consumer, at a very granular level. Crucially, we believe that such a marketplace should not be owned by anybody, and should instead fairly and transparently self-enforce a well defined set of governance rules. In this paper we address some of the technical challenges involved in realizing such a marketplace. We leverage emerging blockchain technologies to build a decentralized, trusted, transparent and open architecture for IoT traffic metering and contract compliance, on top of the largely adopted IoT brokered data infrastructure. We discuss an Ethereum-based prototype implementation and experimentally evaluate the overhead cost associated with Smart Contract transactions, concluding that a viable business model can indeed be associated with our technical approach.
Training of elite athletes requires regular physiological and medical monitoring to plan the schedule, intensity and volume of training, and subsequent recovery. In sports medicine, ECG-based analyses are well established. However, they rarely consider the correspondence of respiratory and cardiac activity. Given such mutual influence, we hypothesize that athlete monitoring might be developed with causal inference and that detailed, time-related techniques should be preceded by a more general, time-independent approach that considers the whole group of participants and parameters describing whole signals. The aim of this study was to discover general causal paths among basic cardiac and respiratory variables in elite athletes in two body positions (supine and standing), at rest. ECG and impedance pneumography signals were obtained from 100 elite athletes to reveal cardiac and respiratory activity, respectively. The RR intervals and tidal volumes were established, and heart rate, the root-mean-square difference of successive R-R intervals, respiratory rate, breathing regularity, and best shift between signals were estimated. The causal discovery framework comprised an absolute residuals criterion, generalized correlations, and causal additive modeling. We found that the average activity rate seems to affect the indicator responsible for diversity (effect of HR on RMSSD, or RR on BR). There are different graphical structures and directions in the two body positions. In several cases, causal mediation analysis supports those findings only for supine body position. The presented approach allows data-driven and time-independent analysis of various groups of athletes, without considering prior knowledge. However, the results seem to be consistent with the medical background. Therefore, in the next step, we plan to expand the study using time-related causality analyses.