For distributed computing environments, we consider the canonical machine learning problem of empirical risk minimization (ERM) with quadratic regularization, and we propose a distributed and communication-efficient Newton-type optimization method. At every iteration, each worker locally finds an Approximate NewTon (ANT) direction, and then it sends this direction to the main driver. The driver, then, averages all the ANT directions received from workers to form a Globally Improved ANT (GIANT) direction. GIANT naturally exploits the trade-offs between local computations and global communications in that more local computations result in fewer overall rounds of communications. GIANT is highly communication efficient in that, for $d$-dimensional data uniformly distributed across $m$ workers, it has $4$ or $6$ rounds of communication and $O (d \log m)$ communication complexity per iteration. Theoretically, we show that GIANT’s convergence rate is faster than first-order methods and existing distributed Newton-type methods. From a practical point-of-view, a highly beneficial feature of GIANT is that it has only one tuning parameter—the iterations of the local solver for computing an ANT direction. This is indeed in sharp contrast with many existing distributed Newton-type methods, as well as popular first order methods, which have several tuning parameters, and whose performance can be greatly affected by the specific choices of such parameters. In this light, we empirically demonstrate the superior performance of GIANT compared with other competing methods.
KnowNER is a multilingual Named Entity Recognition (NER) system that leverages different degrees of external knowledge. A novel modular framework divides the knowledge into four categories according to the depth of knowledge they convey. Each category consists of a set of features automatically generated from different information sources (such as a knowledge-base, a list of names or document-specific semantic annotations) and is used to train a conditional random field (CRF). Since those information sources are usually multilingual, KnowNER can be easily trained for a wide range of languages. In this paper, we show that the incorporation of deeper knowledge systematically boosts accuracy and compare KnowNER with state-of-the-art NER approaches across three languages (i.e., English, German and Spanish) performing amongst state-of-the art systems in all of them.
Inspired by the generation power of generative adversarial networks (GANs) in image domains, we introduce a novel hierarchical architecture for learning characteristic topological features from a single arbitrary input graph via GANs. The hierarchical architecture consisting of multiple GANs preserves both local and global topological features and automatically partitions the input graph into representative stages for feature learning. The stages facilitate reconstruction and can be used as indicators of the importance of the associated topological structures. Experiments show that our method produces subgraphs retaining a wide range of topological features, even in early reconstruction stages (unlike a single GAN, which cannot easily identify such features, let alone reconstruct the original graph). This paper is firstline research on combining the use of GANs and graph topological analysis.
We consider the problem of detecting a few targets among a large number of hierarchical data streams. The data streams are modeled as random processes with unknown and potentially heavy-tailed distributions. The objective is an active inference strategy that determines, sequentially, which data stream to collect samples from in order to minimize the sample complexity under a reliability constraint. We propose an active inference strategy that induces a biased random walk on the tree-structured hierarchy based on confidence bounds of sample statistics. We then establish its order optimality in terms of both the size of the search space (i.e., the number of data streams) and the reliability requirement. The results find applications in hierarchical heavy hitter detection, noisy group testing, and adaptive sampling for active learning, classification, and stochastic root finding.
We develop a framework and methodology for causal inference with linked data. In particular, we describe conceptually how errors in linking can impact causal estimates in observational studies when using propensity score stratification as a causal inference technique. We present a simple procedure for deciding which record pairs to use in estimation of causal effects. Using simulation studies, we demonstrate that the procedure can result in improved accuracy in estimates of treatment effects from linked data.
A classic approach toward zero-shot learning (ZSL) is to map the input domain to a set of semantically meaningful attributes that could be used later on to classify unseen classes of data (e.g. visual data). In this paper, we propose to learn a visual feature dictionary that has semantically meaningful atoms. Such dictionary is learned via joint dictionary learning for the visual domain and the attribute domain, while enforcing the same sparse coding for both dictionaries. Our novel attribute aware formulation provides an algorithmic solution to the domain shift/hubness problem in ZSL. Upon learning the joint dictionaries, images from unseen classes can be mapped into the attribute space by finding the attribute aware joint sparse representation using solely the visual data. We demonstrate that our approach provides superior or comparable performance to that of the state of the art on benchmark datasets.
Message identification (M-I) divergence is an important measure of the information distance between probability distributions, similar to Kullback-Leibler (K-L) and Renyi divergence. In fact, M-I divergence with a variable parameter can make an effect on characterization of distinction between two distributions. Furthermore, by choosing an appropriate parameter of M-I divergence, it is possible to amplify the information distance between adjacent distributions while maintaining enough gap between two nonadjacent ones. Therefore, M-I divergence can play a vital role in distinguishing distributions more clearly. In this paper, we first define a parametric M-I divergence in the view of information theory and then present its major properties. In addition, we design a M-I divergence estimation algorithm by means of the ensemble estimator of the proposed weight kernel estimators, which can improve the convergence of mean squared error from ${O(\varGamma^{-j/d})}$ to ${O(\varGamma^{-1})}$ $({j\in (0,d]})$. We also discuss the decision with M-I divergence for clustering or classification, and investigate its performance in a statistical sequence model of big data for the outlier detection problem.
In this paper, we propose a recurrent neural network (RNN) with residual attention (RRA) to learn long-range dependencies from sequential data. We propose to add residual connections across timesteps to RNN, which explicitly enhances the interaction between current state and hidden states that are several timesteps apart. This also allows training errors to be directly back-propagated through residual connections and effectively alleviates gradient vanishing problem. We further reformulate an attention mechanism over residual connections. An attention gate is defined to summarize the individual contribution from multiple previous hidden states in computing the current state. We evaluate RRA on three tasks: the adding problem, pixel-by-pixel MNIST classification and sentiment analysis on the IMDB dataset. Our experiments demonstrate that RRA yields better performance, faster convergence and more stable training compared to a standard LSTM network. Furthermore, RRA shows highly competitive performance to the state-of-the-art methods.
The goal of this paper is to propose novel strategies for adaptive learning of signals defined over graphs, which are observed over a (randomly time-varying) subset of vertices. We recast two classical adaptive algorithms in the graph signal processing framework, namely, the least mean squares (LMS) and the recursive least squares (RLS) adaptive estimation strategies. For both methods, a detailed mean-square analysis illustrates the effect of random sampling on the adaptive reconstruction capability and the steady-state performance. Then, several probabilistic sampling strategies are proposed to design the sampling probability at each node in the graph, with the aim of optimizing the tradeoff between steady-state performance, graph sampling rate, and convergence rate of the adaptive algorithms. Finally, a distributed RLS strategy is derived and is shown to be convergent to its centralized counterpart. Numerical simulations carried out over both synthetic and real data illustrate the good performance of the proposed sampling and reconstruction strategies for (possibly distributed) adaptive learning of signals defined over graphs.
Deep neural networks (DNNs) have shown the state-of-the-art level of performances in wide range of complicated tasks. In recent years, the studies have been actively conducted to analyze the black box characteristics of DNNs and to grasp the learning behaviours, tendency, and limitations of DNNs. In this paper, we investigate the limitation of DNNs in image classification task and verify it with the method inspired by cognitive psychology. Through analyzing the failure cases of ImageNet classification task, we hypothesize that the DNNs do not sufficiently learn to associate related classes of objects. To verify how DNNs understand the relatedness between object classes, we conducted experiments on the image database provided in cognitive psychology. We applied the ImageNet-trained DNNs to the database consisting of pairs of related and unrelated object images to compare the feature similarities and determine whether the pairs match each other. In the experiments, we observed that the DNNs show limited performance in determining relatedness between object classes. In addition, the DNNs present somewhat improved performance in discovering relatedness based on similarity, but they perform weaker in discovering relatedness based on association. Through these experiments, a novel analysis of learning behaviour of DNNs is provided and the limitation which needs to be overcome is suggested.
We introduce an open-source toolkit for neural machine translation (NMT) to support research into model architectures, feature representations, and source modalities, while maintaining competitive performance, modularity and reasonable training requirements.
We propose in this paper a novel approach to tackle the problem of mode collapse encountered in generative adversarial network (GAN). Our idea is intuitive but proven to be very effective, especially in addressing some key limitations of GAN. In essence, it combines the Kullback-Leibler (KL) and reverse KL divergences into a unified objective function, thus it exploits the complementary statistical properties from these divergences to effectively diversify the estimated density in capturing multi-modes. We term our method dual discriminator generative adversarial nets (D2GAN) which, unlike GAN, has two discriminators; and together with a generator, it also has the analogy of a minimax game, wherein a discriminator rewards high scores for samples from data distribution whilst another discriminator, conversely, favoring data from the generator, and the generator produces data to fool both two discriminators. We develop theoretical analysis to show that, given the maximal discriminators, optimizing the generator of D2GAN reduces to minimizing both KL and reverse KL divergences between data distribution and the distribution induced from the data generated by the generator, hence effectively avoiding the mode collapsing problem. We conduct extensive experiments on synthetic and real-world large-scale datasets (MNIST, CIFAR-10, STL-10, ImageNet), where we have made our best effort to compare our D2GAN with the latest state-of-the-art GAN’s variants in comprehensive qualitative and quantitative evaluations. The experimental results demonstrate the competitive and superior performance of our approach in generating good quality and diverse samples over baselines, and the capability of our method to scale up to ImageNet database.
The sample complexity of learning a Boolean-valued function class is precisely characterized by its Rademacher complexity. This has little bearing, however, on the sample complexity of \emph{efficient} agnostic learning. We introduce \emph{refutation complexity}, a natural computational analog of Rademacher complexity of a Boolean concept class and show that it exactly characterizes the sample complexity of \emph{efficient} agnostic learning. Informally, refutation complexity of a class $\mathcal{C}$ is the minimum number of example-label pairs required to efficiently distinguish between the case that the labels correlate with the evaluation of some member of $\mathcal{C}$ (\emph{structure}) and the case where the labels are i.i.d. Rademacher random variables (\emph{noise}). The easy direction of this relationship was implicitly used in the recent framework for improper PAC learning lower bounds of Daniely and co-authors via connections to the hardness of refuting random constraint satisfaction problems. Our work can be seen as making the relationship between agnostic learning and refutation implicit in their work into an explicit equivalence. In a recent, independent work, Salil Vadhan discovered a similar relationship between refutation and PAC-learning in the realizable (i.e. noiseless) case.
Statistically sound pattern discovery harnesses the rigour of statistical hypothesis testing to overcome many of the issues that have hampered standard data mining approaches to pattern discovery. Most importantly, application of appropriate statistical tests allows precise control over the risk of false discoveries — patterns that are found in the sample data but do not hold in the wider population from which the sample was drawn. Statistical tests can also be applied to filter out patterns that are unlikely to be useful, removing uninformative variations of the key patterns in the data. This tutorial introduces the key statistical and data mining theory and techniques that underpin this fast developing field.
Advertisements