Determining entropy rates of stochastic processes is a fundamental and difficult problem, with closed-form solutions known only for specific cases. This paper pushes the state-of-the-art by solving the problem for Hidden Markov Models (HMMs) and Renyi entropies. While the problem for Markov chains reduces to studying the growth of a matrix product, computations for HMMs involve \emph{products of random matrices}. As a result, this case is much harder and no explicit formulas have been known so far. We show how to circumvent this issue for Renyi entropy of integer orders, reducing the problem again to a \emph{single matrix products} where the matrix is formed from transition and emission probabilities by means of tensor product. To obtain results in the asymptotic setting, we use a novel technique for determining the growth of non-negative matrix powers. The classical approach is the Frobenius-Perron theory, but it requires positivity assumptions; we instead work directly with the spectral formula. As a consequence, our results do not suffer from limitations such as irreducibility and aperiodicity. This improves our understanding of the entropy rate even for standard (unhidden) Markov chains. A recently published side-channel attack against RSA was proven effective using our result, specialized to order 2.
In typical latent-space network models, nodes have latent positions, which are all drawn independently from a common distribution. As a consequence, the number of edges in a network scales quadratically with the number of nodes, resulting in a dense graph sequence as the number of nodes grows. We propose an adjustment to latent-space network models which allows the number edges to scale linearly with the number of nodes, to scale quadratically, or at any intermediate rate. Our models also form projective families, making statistical inference and prediction well-defined. Built through point processes, our models are related to both the Poisson random connection model and the graphex framework.
Previous studies have demonstrated the empirical success of word embeddings in various applications. In this paper, we investigate the problem of learning distributed representations for text documents which many machine learning algorithms take as input for a number of NLP tasks. We propose a neural network model, KeyVec, which learns document representations with the goal of preserving key semantics of the input text. It enables the learned low-dimensional vectors to retain the topics and important information from the documents that will flow to downstream tasks. Our empirical evaluations show the superior quality of KeyVec representations in two different document understanding tasks.
The central aim of most fields of data analysis and experimental scientific investigation is to draw valid conclusions from a given data set. But when past inferences guide future inquiries into the same dataset, reaching valid inferences becomes significantly more difficult. In addition to avoiding the overfitting that can result from adaptive analysis, a data analyst often wants to use as little time and data as possible. A recent line of work in the theory community has established mechanisms that provide low generalization error on adaptive queries, yet there remain large gaps between established theoretical results and how data analysis is done in practice. Many practitioners, for instance, successfully employ bootstrapping and related sampling approaches in order to maintain validity and speed up analysis, but prior to this work, no theoretical analysis existed to justify employing such techniques in this adaptive setting. In this paper, we show how these techniques can be used to provably guarantee validity while speeding up analysis. Through this investigation, we initiate the study of sub-linear time mechanisms to answer adaptive queries into datasets. Perhaps surprisingly, we describe mechanisms that provide an exponential speed-up per query over previous mechanisms, without needing to increase the total amount of data needed for low generalization error. We also provide a method for achieving statistically-meaningful responses even when the mechanism is only allowed to see a constant number of samples from the data per query.
Parallel sentence extraction is a task addressing the data sparsity problem found in multilingual natural language processing applications. We propose an end-to-end deep neural network approach to detect translational equivalence between sentences in two different languages. In contrast to previous approaches, which typically rely on multiples models and various word alignment features, by leveraging continuous vector representation of sentences we remove the need of any domain specific feature engineering. Using a siamese bidirectional recurrent neural networks, our results against a strong baseline based on a state-of-the-art parallel sentence extraction system show a significant improvement in both the quality of the extracted parallel sentences and the translation performance of statistical machine translation systems. We believe this study is the first one to investigate deep learning for the parallel sentence extraction task.
Pareto Local Search (PLS) is a basic building block in many multiobjective metaheuristics. In this paper, Parallel Pareto Local Search based on Decomposition (PPLS/D) is proposed. PPLS/D decomposes the original search space into L subregions and executes L parallel search processes in these subregions simultaneously. Inside each subregion, the PPLS/D process is first guided by a scalar objective function to approach the Pareto set quickly, then it finds non-dominated solutions in this subregion. Our experimental studies on the multiobjective Unconstrained Binary Quadratic Programming problems (mUBQPs) with two to four objectives demonstrate the efficiency of PPLS/D. We investigate the behavior of PPLS/D to understand its working mechanism. Moreover, we propose a variant of PPLS/D called PPLS/D with Adaptive Expansion (PPLS/D-AE), in which each process can search other subregions after it converges in its own subregion. Its advantages and disadvantages have been studied.
An introduction to the theory of sliding window detection processes, used as alternatives to optimal Neyman-Pearson based radar detectors, is presented. Included is an outline of their historical development, together with an explanation for the resurgence of interest in such detectors for operation in modern maritime surveillance radar clutter. In particular, recent research has developed criteria that enables one to construct such detection processes with the desired constant false alarm rate property for a comprehensive class of clutter model. The chapter also includes some examples of the performance of such detectors in a specific interference environment.
Generative Adversarial Networks (GANs) have shown impressive performance in generating photo-realistic images. They fit generative models by minimizing certain distance measure between the real image distribution and the generated data distribution. Several distance measures have been used, such as Jensen-Shannon divergence, $f$-divergence, and Wasserstein distance, and choosing an appropriate distance measure is very important for training the generative network. In this paper, we choose to use the maximum mean discrepancy (MMD) as the distance metric, which has several nice theoretical guarantees. In fact, generative moment matching network (GMMN) (Li, Swersky, and Zemel 2015) is such a generative model which contains only one generator network $G$ trained by directly minimizing MMD between the real and generated distributions. However, it fails to generate meaningful samples on challenging benchmark datasets, such as CIFAR-10 and LSUN. To improve on GMMN, we propose to add an extra network $F$, called mapper. $F$ maps both real data distribution and generated data distribution from the original data space to a feature representation space $\mathcal{R}$, and it is trained to maximize MMD between the two mapped distributions in $\mathcal{R}$, while the generator $G$ tries to minimize the MMD. We call the new model generative adversarial mapping networks (GAMNs). We demonstrate that the adversarial mapper $F$ can help $G$ to better capture the underlying data distribution. We also show that GAMN significantly outperforms GMMN, and is also superior to or comparable with other state-of-the-art GAN based methods on MNIST, CIFAR-10 and LSUN-Bedrooms datasets.
Bio-inspired algorithms have received a significant amount of attention in both academic and engineering societies. In this paper, based on the observation of two major survival rules of a species of woodlice, i.e., porcellio scaber, we design and propose an algorithm called the porcellio scaber algorithm (PSA) for solving optimization problems, including differentiable and non-differential ones as well as the case with local optimums. Numerical results based on benchmark problems are presented to validate the efficacy of PSA.
The reliable measurement of confidence in classifiers’ predictions is very important for many applications and is, therefore, an important part of classifier design. Yet, although deep learning has received tremendous attention in recent years, not much progress has been made in quantifying the prediction confidence of neural network classifiers. Bayesian models offer a mathematically grounded framework to reason about model uncertainty, but usually come with prohibitive computational costs. In this paper we propose a simple, scalable method to achieve a reliable confidence score, based on the data embedding derived from the penultimate layer of the network. We investigate two ways to achieve desirable embeddings, by using either a distance-based loss or Adversarial Training. We then test the benefits of our method when used for classification error prediction, weighting an ensemble of classifiers, and novelty detection. In all tasks we show significant improvement over traditional, commonly used confidence scores.
In order to maximize the applicability of sentiment analysis results, it is necessary to not only classify the overall sentiment (positive/negative) of a given document but also to identify the main words that contribute to the classification. However, most datasets for sentiment analysis only have the sentiment label for each document or sentence. In other words, there is no information about which words play an important role in sentiment classification. In this paper, we propose a method for identifying key words discriminating positive and negative sentences by using a weakly supervised learning method based on a convolutional neural network (CNN). In our model, each word is represented as a continuous-valued vector and each sentence is represented as a matrix whose rows correspond to the word vector used in the sentence. Then, the CNN model is trained using these sentence matrices as inputs and the sentiment labels as the output. Once the CNN model is trained, we implement the word attention mechanism that identifies high-contributing words to classification results with a class activation map, using the weights from the fully connected layer at the end of the learned CNN model. In order to verify the proposed methodology, we evaluated the classification accuracy and inclusion rate of polarity words using two movie review datasets. Experimental result show that the proposed model can not only correctly classify the sentence polarity but also successfully identify the corresponding words with high polarity scores.
Convolutional Neural Network (CNN) image classifiers are traditionally designed to have sequential convolutional layers with a single output layer. This is based on the assumption that all target classes should be treated equally and exclusively. However, some classes can be more difficult to distinguish than others, and classes may be organized in a hierarchy of categories. At the same time, a CNN is designed to learn internal representations that abstract from the input data based on its hierarchical layered structure. So it is natural to ask if an inverse of this idea can be applied to learn a model that can predict over a classification hierarchy using multiple output layers in decreasing order of class abstraction. In this paper, we introduce a variant of the traditional CNN model named the Branch Convolutional Neural Network (B-CNN). A B-CNN model outputs multiple predictions ordered from coarse to fine along the concatenated convolutional layers corresponding to the hierarchical structure of the target classes, which can be regarded as a form of prior knowledge on the output. To learn with B-CNNs a novel training strategy, named the Branch Training strategy (BT-strategy), is introduced which balances the strictness of the prior with the freedom to adjust parameters on the output layers to minimize the loss. In this way we show that CNN based models can be forced to learn successively coarse to fine concepts in the internal layers at the output stage, and that hierarchical prior knowledge can be adopted to boost CNN models’ classification performance. Our models are evaluated to show that the B-CNN extensions improve over the corresponding baseline CNN on the benchmark datasets MNIST, CIFAR-10 and CIFAR-100.
Global registration of multi-view robot data is a challenging task. Appearance-based global localization approaches often fail under drastic view-point changes, as representations have limited view-point invariance. This work is based on the idea that human-made environments contain rich semantics which can be used to disambiguate global localization. Here, we present X-View, a Multi-View Semantic Global Localization system. X-View leverages semantic graph descriptor matching for global localization, enabling localization under drastically different view-points. While the approach is general in terms of the semantic input data, we present and evaluate an implementation on visual data. We demonstrate the system in experiments on the publicly available SYNTHIA dataset, and a realistic urban dataset recorded with a simulator. On these data, our findings show that X-View is able to globally localize aerial-to-ground, and ground-to-ground robot data of drastically different view-points. Our approach achieves an accuracy of up to 85 % on global localizations in the multi-view case, while the benchmarked traditional appearance based method reaches up to 50 %.
Using machine learning algorithms, including deep learning, we studied the prediction of personal attributes from the text of tweets, such as gender, occupation, and age groups. We applied word2vec to construct word vectors, which were then used to vectorize tweet blocks. The resulting tweet vectors were used as inputs for training models, and the prediction accuracy of those models was examined as a function of the dimension of the tweet vectors and the size of the tweet blacks. The results showed that the machine learning algorithms could predict the three personal attributes of interest with 60-70% accuracy.
Pedestrian analysis plays a vital role in intelligent video surveillance and is a key component for security-centric computer vision systems. Despite that the convolutional neural networks are remarkable in learning discriminative features from images, the learning of comprehensive features of pedestrians for fine-grained tasks remains an open problem. In this study, we propose a new attention-based deep neural network, named as HydraPlus-Net (HP-net), that multi-directionally feeds the multi-level attention maps to different feature layers. The attentive deep features learned from the proposed HP-net bring unique advantages: (1) the model is capable of capturing multiple attentions from low-level to semantic-level, and (2) it explores the multi-scale selectiveness of attentive features to enrich the final feature representations for a pedestrian image. We demonstrate the effectiveness and generality of the proposed HP-net for pedestrian analysis on two tasks, i.e. pedestrian attribute recognition and person re-identification. Intensive experimental results have been provided to prove that the HP-net outperforms the state-of-the-art methods on various datasets.
Nowadays, most recommender systems exploit user-provided ratings to infer their preferences. However, the growing popularity of social and e-commerce websites has encouraged users to also share comments and opinions through textual reviews. In this paper, we introduce a new recommendation approach which exploits the semantic annotation of user reviews to extract useful and non-trivial information about the items to recommend. It also relies on the knowledge freely available in the Web of Data, notably in DBpedia and Wikidata, to discover other resources connected with the annotated entities. We evaluated our approach in three domains, using both DBpedia and Wikidata. The results showed that our solution provides a better ranking than another recommendation method based on the Web of Data, while it improves in novelty with respect to traditional techniques based on ratings. Additionally, our method achieved a better performance with Wikidata than DBpedia.
In real-world application scenarios, the identification of groups poses a significant challenge due to possibly occurring outliers and existing noise variables. Therefore, there is a need for a clustering method which is capable of revealing the group structure in data containing both outliers and noise variables without any pre-knowledge. In this paper, we propose a $k$-means-based algorithm incorporating a weighting function which leads to an automatic weight assignment for each observation. In order to cope with noise variables, a lasso-type penalty is used in an objective function adjusted by observation weights. We finally introduce a framework for selecting both the number of clusters and variables based on a modified gap statistic. The conducted experiments on simulated and real-world data demonstrate the advantage of the method to identify groups, outliers, and informative variables simultaneously.
In this paper we investigate the role of the dependency tree in a named entity recognizer upon using a set of GCN. We perform a comparison among different NER architectures and show that the grammar of a sentence positively influences the results. Experiments on the ontonotes dataset demonstrate consistent performance improvements, without requiring heavy feature engineering nor additional language-specific knowledge.
Class imbalance problems manifest in domains such as financial fraud detection or network intrusion analysis, where the prevalence of one class is much higher than another. Typically, practitioners are more interested in predicting the minority class than the majority class as the minority class may carry a higher misclassification cost. However, classifier performance deteriorates in the face of class imbalance as oftentimes classifiers may predict every point as the majority class. Methods for dealing with class imbalance include cost-sensitive learning or resampling techniques. In this paper, we introduce DeepBalance, an ensemble of deep belief networks trained with balanced bootstraps and random feature selection. We demonstrate that our proposed method outperforms baseline resampling methods such as SMOTE and under- and over-sampling in metrics such as AUC and sensitivity when applied to a highly imbalanced financial transaction data. Additionally, we explore performance and training time implications of various model parameters. Furthermore, we show that our model is easily parallelizable, which can reduce training times. Finally, we present an implementation of DeepBalance in R.
The long-tail phenomenon tells us that there are many items in the tail. However, not all tail items are the same. Each item acquires different kinds of users. Some items are loved by the general public, while some items are consumed by eccentric fans. In this paper, we propose a novel metric, item eccentricity, to incorporate this difference between consumers of the items. Eccentric items are defined as items that are consumed by eccentric users. We used this metric to analyze two real-world datasets of music and movies and observed the characteristics of items in terms of eccentricity. The results showed that our defined eccentricity of an item does not change much over time, and classified eccentric and noneccentric items present significantly distinct characteristics. The proposed metric effectively separates the eccentric and noneccentric items mixed in the tail, which could not be done with the previous measures, which only consider the popularity of items.
Data cube materialization is a classical database operator introduced in Gray et al.~(Data Mining and Knowledge Discovery, Vol.~1), which is critical for many analysis tasks. Nandi et al.~(Transactions on Knowledge and Data Engineering, Vol.~6) first studied cube materialization for large scale datasets using the MapReduce framework, and proposed a sophisticated modification of a simple broadcast algorithm to handle a dataset with a 216GB cube size within 25 minutes with 2k machines in 2012. We take a different approach, and propose a simple MapReduce algorithm which (1) minimizes the total number of copy-add operations, (2) leverages locality of computation, and (3) balances work evenly across machines. As a result, the algorithm shows excellent performance, and materialized a real dataset with a cube size of 35.0G tuples and 1.75T bytes in 54 minutes, with 0.4k machines in 2014.