In recent years, there has been an increasing interest in extending traditional stream processing engines with logical, rule-based, reasoning capabilities. This poses significant theoretical and practical challenges since rules can derive new information and propagate it both towards past and future time points; as a result, streamed query answers can depend on data that has not yet been received, as well as on data that arrived far in the past. Stream reasoning algorithms, however, must be able to stream out query answers as soon as possible, and can only keep a limited number of previous input facts in memory. In this paper, we propose novel reasoning problems to deal with these challenges, and study their computational properties on Datalog extended with a temporal sort and the successor function (a core rule-based language for stream reasoning applications).
In designing personalized ranking algorithms, it is desirable to encourage a high precision at the top of the ranked list. Existing methods either seek a smooth convex surrogate for a non-smooth ranking metric or directly modify updating procedures to encourage top accuracy. In this work we point out that these methods do not scale well to a large-scale setting, and this is partly due to the inaccurate pointwise or pairwise rank estimation. We propose a new framework for personalized ranking. It uses batch-based rank estimators and smooth rank-sensitive loss functions. This new batch learning framework leads to more stable and accurate rank approximations compared to previous work. Moreover, it enables explicit use of parallel computation to speed up training. We conduct empirical evaluation on three item recommendation tasks. Our method shows consistent accuracy improvements over state-of-the-art methods. Additionally, we observe time efficiency advantages when data scale increases.
We propose to study the problem of few-shot learning with the prism of inference on a partially observed graphical model, constructed from a collection of input images whose label can be either observed or not. By assimilating generic message-passing inference algorithms with their neural-network counterparts, we define a graph neural network architecture that generalizes several of the recently proposed few-shot learning models. Besides providing improved numerical performance, our framework is easily extended to variants of few-shot learning, such as semi-supervised or active learning, demonstrating the ability of graph-based models to operate well on ‘relational’ tasks.
We introduce an adversarial learning framework, which we named KBGAN, to improve the performances of a wide range of existing knowledge graph embedding models. Because knowledge graph datasets typically only contain positive facts, sampling useful negative training examples is a non-trivial task. Replacing the head or tail entity of a fact with a uniformly randomly selected entity is a conventional method for generating negative facts used by many previous works, but the majority of negative facts generated in this way can be easily discriminated from positive facts, and will contribute little towards the training. Inspired by generative adversarial networks (GANs), we use one knowledge graph embedding model as a negative sample generator to assist the training of our desired model, which acts as the discriminator in GANs. The objective of the generator is to generate difficult negative samples that can maximize their likeliness determined by the discriminator, while the discriminator minimizes its training loss. This framework is independent of the concrete form of generator and discriminator, and therefore can utilize a wide variety of knowledge graph embedding models as its building blocks. In experiments, we adversarially train two translation-based models, TransE and TransD, each with assistance from one of the two probability-based models, DistMult and ComplEx. We evaluate the performances of KBGAN on the link prediction task, using three knowledge base completion datasets: FB15k-237, WN18 and WN18RR. Experimental results show that adversarial training substantially improves the performances of target embedding models under various settings.
Recent advances in language modeling such as word2vec motivate a number of graph embedding approaches by treating random walk sequences as sentences to encode structural proximity in a graph. However, most of the existing principles of neural graph embedding do not incorporate auxiliary information such as node content flexibly. In this paper we take a matrix factorization perspective of graph embedding which generalizes to structural embedding as well as content embedding in a natural way. For structure embedding, we validate that the matrix we construct and factorize preserves the high-order proximities of the graph. Label information can be further integrated into the matrix via the process of random walk sampling to enhance the quality of embedding. In addition, we generalize the Skip-Gram Negative Sampling model to integrate the content of the graph in a matrix factorization framework. As a consequence, graph embedding can be learned in a unified framework integrating graph structure and node content as well as label information simultaneously. We demonstrate the efficacy of the proposed model with the tasks of semi-supervised node classification and link prediction on a variety of real-world benchmark network datasets.
Factor analysis is widely used in many application areas. The first step, choosing the number of factors, remains a serious challenge. One of the most popular methods is parallel analysis (PA), which compares the observed factor strengths to simulated ones under a noise-only model. % Abstracts are commonly just one paragraph. This paper presents a deterministic version of PA (DPA), which is faster and more reproducible than PA. We show that DPA selects large factors and does not select small factors just like [Dobriban, 2017] shows for PA. Both PA and DPA are prone to a shadowing phenomenon in which a strong factor makes it hard to detect smaller but more interesting factors. We develop a deflated version of DPA (DDPA) that counters shadowing. By raising the decision threshold in DDPA, a new method (DDPA+) also improves estimation accuracy. We illustrate our methods on data from the Human Genome Diversity Project (HGDP). There PA and DPA select seemingly too many factors, while DDPA+ selects only a few. A Matlab implementation is available.
Correlation filters are special classifiers designed for shift-invariant object recognition, which are robust to pattern distortions. The recent literature shows that combining a set of sub-filters trained based on a single or a small group of images obtains the best performance. The idea is equivalent to estimating variable distribution based on the data sampling (bagging), which can be interpreted as finding solutions (variable distribution approximation) directly from sampled data space. However, this methodology fails to account for the variations existed in the data. In this paper, we introduce an intermediate step — solution sampling — after the data sampling step to form a subspace, in which an optimal solution can be estimated. More specifically, we propose a new method, named latent constrained correlation filters (LCCF), by mapping the correlation filters to a given latent subspace, and develop a new learning framework in the latent subspace that embeds distribution-related constraints into the original problem. To solve the optimization problem, we introduce a subspace based alternating direction method of multipliers (SADMM), which is proven to converge at the saddle point. Our approach is successfully applied to three different tasks, including eye localization, car detection and object tracking. Extensive experiments demonstrate that LCCF outperforms the state-of-the-art methods. The source code will be publicly available. https://…/.
Detecting correlation structures in large networks arises in many domains. Such detection problems are often studied independently of the underlying data acquisition process, rendering settings in which data acquisition policies (e.g., the sample-size) are pre-specified. Motivated by the advantages of data-adaptive sampling, this paper treats the inherently coupled problems of data acquisition and decision-making for correlation detection. Specifically, this paper considers a Markov network of agents generating random variables, and designs a quickest sequential strategy for discerning the correlation model governing the Markov network. By abstracting the Markov network as an undirected graph,designing the quickest sampling strategy becomes equivalent to sequentially and data-adaptively identifying and sampling a sequence of vertices in the graph. The existing data-adaptive approaches fail to solve this problem since the node selection actions are dependent. Hence, a novel sampling strategy is proposed to incorporate the correlation structures into the decision rules for minimizing the average delay in forming a confident decision. The proposed procedure involves searching over all possible future decisions which becomes computationally prohibitive for large networks. However, by leveraging the Markov properties it is shown that the search horizon can be reduced to the neighbors of each node in the dependency graph without compromising the performance. Performance analyses and numerical evaluations demonstrate the gains of the proposed sequential approach over the existing ones.
Modeling informal inference in natural language is very challenging. With the recent availability of large annotated data, it has become feasible to train complex models such as neural networks to perform natural language inference (NLI), which have achieved state-of-the-art performance. Although there exist relatively large annotated data, can machines learn all knowledge needed to perform NLI from the data? If not, how can NLI models benefit from external knowledge and how to build NLI models to leverage it? In this paper, we aim to answer these questions by enriching the state-of-the-art neural natural language inference models with external knowledge. We demonstrate that the proposed models with external knowledge further improve the state of the art on the Stanford Natural Language Inference (SNLI) dataset.
Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data, text documents. Researchers have published many articles in the field of topic modeling and applied in various fields such as software engineering, political science, medical and linguistic science, etc. There are various methods for topic modeling, which Latent Dirichlet allocation (LDA) is one of the most popular methods in this field. Researchers have proposed various models based on the LDA in topic modeling. According to previous work, this paper can be very useful and valuable for introducing LDA approaches in topic modeling. In this paper, we investigated scholarly articles highly (between 2003 to 2016) related to Topic Modeling based on LDA to discover the research development, current trends and intellectual structure of topic modeling. Also, we summarize challenges and introduce famous tools and datasets in topic modeling based on LDA.
Effective training of neural networks requires much data. In the low-data regime, parameters are underdetermined, and learnt networks generalise poorly. Data Augmentation \cite{krizhevsky2012imagenet} alleviates this by using existing data more effectively. However standard data augmentation produces only limited plausible alternative data. Given there is potential to generate a much broader set of augmentations, we design and train a generative model to do data augmentation. The model, based on image conditional Generative Adversarial Networks, takes data from a source domain and learns to take any data item and generalise it to generate other within-class data items. As this generative process does not depend on the classes themselves, it can be applied to novel unseen classes of data. We show that a Data Augmentation Generative Adversarial Network (DAGAN) augments standard vanilla classifiers well. We also show a DAGAN can enhance few-shot learning systems such as Matching Networks. We demonstrate these approaches on Omniglot, on EMNIST having learnt the DAGAN on Omniglot, and VGG-Face data. In our experiments we can see over 13\% increase in accuracy in the low-data regime experiments in Omniglot (from 69\% to 82\%), EMNIST (73.9\% to 76\%) and VGG-Face (4.5\% to 12\%); in Matching Networks for Omniglot we observe an increase of 0.5\% (from 96.9\% to 97.4\%) and an increase of 1.8\% in EMNIST (from 59.5\% to 61.3\%).
We propose a new class of distribution-based clustering algorithms, called k-groups, based on energy distance between samples. The energy distance clustering criterion assigns observations to clusters according to a multi-sample energy statistic that measures the distance between distributions. The energy distance determines a consistent test for equality of distributions, and it is based on a population distance that characterizes equality of distributions. The k-groups procedure therefore generalizes the k-means method, which separates clusters that have different means. We propose two k-groups algorithms: k-groups by first variation; and k-groups by second variation. The implementation of k-groups is partly based on Hartigan and Wong’s algorithm for k-means. The algorithm is generalized from moving one point on each iteration (first variation) to moving $m$ $(m > 1)$ points. For univariate data, we prove that Hartigan and Wong’s k-means algorithm is a special case of k-groups by first variation. The simulation results from univariate and multivariate cases show that our k-groups algorithms perform as well as Hartigan and Wong’s k-means algorithm when clusters are well-separated and normally distributed. Moreover, both k-groups algorithms perform better than k-means when data does not have a finite first moment or data has strong skewness and heavy tails. For non–spherical clusters, both k-groups algorithms performed better than k-means in high dimension, and k-groups by first variation is consistent as dimension increases. In a case study on dermatology data with 34 features, both k-groups algorithms performed better than k-means.
While anomaly detection in static networks has been extensively studied, only recently, researchers have focused on dynamic networks. This trend is mainly due to the capacity of dynamic networks in representing complex physical, biological, cyber, and social systems. This paper proposes a new methodology for modeling and monitoring of dynamic attributed networks for quick detection of temporal changes in network structures. In this methodology, the generalized linear model (GLM) is used to model static attributed networks. This model is then combined with a state transition equation to capture the dynamic behavior of the system. Extended Kalman filter (EKF) is used as an online, recursive inference procedure to predict and update network parameters over time. In order to detect changes in the underlying mechanism of edge formation, prediction residuals are monitored through an Exponentially Weighted Moving Average (EWMA) control chart. The proposed modeling and monitoring procedure is examined through simulations for attributed binary and weighted networks. The email communication data from the Enron corporation is used as a case study to show how the method can be applied in real-world problems.
Neural networks have recently had a lot of success for many tasks. However, neural network architectures that perform well are still typically designed manually by experts in a cumbersome trial-and-error process. We propose a new method to automatically search for well-performing CNN architectures based on a simple hill climbing procedure whose operators apply network morphisms, followed by short optimization runs by cosine annealing. Surprisingly, this simple method yields competitive results, despite only requiring resources in the same order of magnitude as training a single network. E.g., on CIFAR-10, our method designs and trains networks with an error rate below 6% in only 12 hours on a single GPU; training for one day reduces this error further, to almost 5%.
Artificial Neural Networks are powerful function approximators capable of modelling solutions to a wide variety of problems, both supervised and unsupervised. As their size and expressivity increases, so too does the variance of the model, yielding a nearly ubiquitous overfitting problem. Although mitigated by a variety of model regularisation methods, the common cure is to seek large amounts of training data—which is not necessarily easily obtained—that sufficiently approximates the data distribution of the domain we wish to test on. In contrast, logic programming methods such as Inductive Logic Programming offer an extremely data-efficient process by which models can be trained to reason on symbolic domains. However, these methods are unable to deal with the variety of domains neural networks can be applied to: they are not robust to noise in or mislabelling of inputs, and perhaps more importantly, cannot be applied to non-symbolic domains where the data is ambiguous, such as operating on raw pixels. In this paper, we propose a Differentiable Inductive Logic framework ($\partial$ILP), which can not only solve tasks which traditional ILP systems are suited for, but shows a robustness to noise and error in the training data which ILP cannot cope with. Furthermore, as it is trained by backpropagation against a likelihood objective, it can be hybridised by connecting it with neural networks over ambiguous data in order to be applied to domains which ILP cannot address, while providing data efficiency and generalisation beyond what neural networks on their own can achieve.
Large volumes of spatio-temporal data are increasingly collected and studied in diverse domains including, climate science, social sciences, neuroscience, epidemiology, transportation, mobile health, and Earth sciences. Spatio-temporal data differs from relational data for which computational approaches are developed in the data mining community for multiple decades, in that both spatial and temporal attributes are available in addition to the actual measurements/attributes. The presence of these attributes introduces additional challenges that needs to be dealt with. Approaches for mining spatio-temporal data have been studied for over a decade in the data mining community. In this article we present a broad survey of this relatively young field of spatio-temporal data mining. We discuss different types of spatio-temporal data and the relevant data mining questions that arise in the context of analyzing each of these datasets. Based on the nature of the data mining problem studied, we classify literature on spatio-temporal data mining into six major categories: clustering, predictive learning, change detection, frequent pattern mining, anomaly detection, and relationship mining. We discuss the various forms of spatio-temporal data mining problems in each of these categories.
An important task for any large-scale organization is to prepare forecasts of key performance metrics. Often these organizations are structured in a hierarchical manner and for operational reasons, projections of these metrics may have been obtained independently from one another at each level of the hierarchy by specialists focusing on certain areas within the business. There is no guarantee that when combined, these aggregates will be consistent with projections produced directly at other levels of the hierarchy. We propose a Bayesian hierarchical method that treats the initial forecasts as observed data which are then combined with prior information and historical predictive accuracy to infer a probability distribution of revised forecasts. When used to create point estimates, this method can reflect preferences for increased accuracy at specific levels in the hierarchy. We present simulated and real data studies to demonstrate when our approach results in improved inferences over alternative methods.
Inferring causal effects from an observational study is challenging because participants are not randomized to treatment. Observational studies in infectious disease research present the additional challenge that one participant’s treatment may affect another participant’s outcome, i.e., there may be interference. In this paper recent approaches to defining causal effects in the presence of interference are considered, and new causal estimands designed specifically for use with observational studies are proposed. Previously defined estimands target counterfactual scenarios in which individuals independently select treatment with equal probability. However, in settings where there is interference between individuals within clusters, it may be unlikely that treatment selection is independent between individuals in the same cluster. The proposed causal estimands instead describe counterfactual scenarios in which the treatment selection correlation structure is the same as in the observed data distribution, allowing for within-cluster dependence in the individual treatment selections. These estimands may be more relevant for policy-makers or public health officials who desire to quantify the effect of increasing the proportion of treated individuals in a population. Inverse probability-weighted estimators for these estimands are proposed. The large-sample properties of the estimators are derived, and a simulation study demonstrating the finite-sample performance of the estimators is presented. The proposed methods are illustrated by analyzing data from a study of cholera vaccination in over 100,000 individuals in Bangladesh.
We propose a new Integral Probability Metric (IPM) between distributions: the Sobolev IPM. The Sobolev IPM compares the mean discrepancy of two distributions for functions (critic) restricted to a Sobolev ball defined with respect to a dominant measure $\mu$. We show that the Sobolev IPM compares two distributions in high dimensions based on weighted conditional Cumulative Distribution Functions (CDF) of each coordinate on a leave one out basis. The Dominant measure $\mu$ plays a crucial role as it defines the support on which conditional CDFs are compared. Sobolev IPM can be seen as an extension of the one dimensional Von-Mises Cram\’er statistics to high dimensional distributions. We show how Sobolev IPM can be used to train Generative Adversarial Networks (GANs). We then exploit the intrinsic conditioning implied by Sobolev IPM in text generation. Finally we show that a variant of Sobolev GAN achieves competitive results in semi-supervised learning on CIFAR-10, thanks to the smoothness enforced on the critic by Sobolev GAN which relates to Laplacian regularization.