Businesses are naturally interested in detecting anomalies in their internal processes, because these can be indicators for fraud and inefficiencies. Within the domain of business intelligence, classic anomaly detection is not very frequently researched. In this paper, we propose a method, using autoencoders, for detecting and analyzing anomalies occurring in the execution of a business process. Our method does not rely on any prior knowledge about the process and can be trained on a noisy dataset already containing the anomalies. We demonstrate its effectiveness by evaluating it on 700 different datasets and testing its performance against three state-of-the-art anomaly detection methods. This paper is an extension of our previous work from 2016 [30]. Compared to the original publication we have further refined the approach in terms of performance and conducted an elaborate evaluation on more sophisticated datasets including real-life event logs from the Business Process Intelligence Challenges of 2012 and 2017. In our experiments our approach reached an F1 score of 0.87, whereas the best unaltered state-of-the-art approach reached an F1 score of 0.72. Furthermore, our approach can be used to analyze the detected anomalies in terms of which event within one execution of the process causes the anomaly.
In this paper, we consider the estimation and inference of the covariance and precision matrices of a rich class of non-stationary time series. Under mild conditions, we propose an efficient method to consistently estimate the underlying high-dimensional covariance matrix and its precision matrix, assuming that only one time series is available. Based on our estimation procedure, we provide some useful statistics to test the structure of the covariance and precision matrices of the time series. Monte Carlo experiments are designed to verify the finite-sample accuracy of our estimation and the statistical power of our tests. We also prove a high-dimensional Gaussian approximation for the quadratic form of the non-stationary time series and give a consistent estimator of the large long-run covariance matrix, both of which are of independent interest.
This paper presents a practical approach for detecting non-stationarity in time series prediction. The method, called SAFE, works by monitoring the evolution of the spectral contents of the time series through a distance function. It is designed to work in real time in combination with state-of-the-art machine learning methods, informing the online predictors to perform the necessary adaptation when non-stationarity is present. We also propose an algorithm that proportionally includes some past data in the adaptation process to overcome the Catastrophic Forgetting problem. To validate our hypothesis and test the effectiveness of our approach, we present comprehensive experiments on different elements of the approach, involving artificial and real-world datasets. The experiments show that the proposed method significantly saves computational resources in terms of processor or GPU cycles while maintaining high prediction performance.
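The idea of monitoring spectral contents through a distance function can be illustrated with a minimal sketch. It is not the SAFE algorithm itself: the naive DFT, the Euclidean distance, the non-overlapping windows, and the `window` and `threshold` values are all assumptions chosen for illustration.

```python
import cmath
import math


def magnitude_spectrum(window):
    """Naive DFT magnitude spectrum (non-negative frequencies), normalized to sum to 1."""
    n = len(window)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window))
        mags.append(abs(coeff))
    total = sum(mags) or 1.0
    return [m / total for m in mags]


def spectral_distance(spec_a, spec_b):
    """Euclidean distance between two normalized magnitude spectra."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(spec_a, spec_b)))


def detect_changes(series, window=32, threshold=0.3):
    """Start indices of windows whose spectrum moved more than `threshold` from the previous window."""
    flagged = []
    previous = None
    for start in range(0, len(series) - window + 1, window):
        spectrum = magnitude_spectrum(series[start:start + window])
        if previous is not None and spectral_distance(previous, spectrum) > threshold:
            flagged.append(start)
        previous = spectrum
    return flagged


# Synthetic series: a slow sine wave followed by a faster one (hypothetical data).
slow = [math.sin(2 * math.pi * 2 * t / 32) for t in range(64)]
fast = [math.sin(2 * math.pi * 8 * t / 32) for t in range(64)]
flags = detect_changes(slow + fast)
print(flags)
```

In an online setting, a flagged window would be the signal for the predictor to adapt; windows with no flag need no retraining, which is where the savings in compute would come from.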
In its simplest form, the traffic flow prediction problem is restricted to predicting a single time-step into the future. Multi-step traffic flow prediction extends this set-up to the case where predicting multiple time-steps into the future based on some finite history is of interest. This problem is significantly more difficult than its single-step variant and is known to suffer from degradation in predictions as the time step increases. In this paper, two approaches to improve multi-step traffic flow prediction performance in recursive and multi-output settings are introduced. In particular, a model that allows recursive prediction approaches to take into account the temporal context in terms of the time-step index when making predictions is introduced. In addition, a conditional generative adversarial network-based data augmentation method is proposed to improve prediction performance in the multi-output setting. The experiments on a real-world traffic flow dataset show that the two methods improve on multi-step traffic flow prediction in recursive and multi-output settings, respectively.
Finding the right cloud configuration for workloads is an essential step to ensure good performance and contain running costs. A poor choice of cloud configuration decreases application performance and increases running cost significantly. While Bayesian Optimization is effective and applicable to any workload, it is fragile because performance and workload are hard to model (and hence to predict). In this paper, we propose a novel method, SCOUT. The central insight of SCOUT is that using prior measurements, even those for different workloads, improves search performance and reduces search cost. At its core, SCOUT extracts search hints (inference of resource requirements) from low-level performance metrics. Such hints enable SCOUT to navigate the search space more efficiently, because only the spotlight region is searched. We evaluate SCOUT with 107 workloads on Apache Hadoop and Spark. The experimental results demonstrate that our approach finds better cloud configurations with a lower search cost than state-of-the-art methods. Based on this work, we conclude that (i) low-level performance information is necessary for finding the right cloud configuration in an effective, efficient and reliable way, and (ii) a search method can be guided by historical data, thereby reducing cost and improving performance.
In this paper, we focus on general-purpose Distributed Stream Data Processing Systems (DSDPSs), which process unbounded streams of continuous data at scale, in a distributed manner, in real or near-real time. A fundamental problem in a DSDPS is the scheduling problem, with the objective of minimizing average end-to-end tuple processing time. A widely used solution is to distribute the workload evenly over the machines in the cluster in a round-robin manner, which is inefficient because it does not consider communication delay. Model-based approaches do not work well either, due to the high complexity of the system environment. We aim to develop a novel model-free approach that can learn to control a DSDPS well from its experience rather than from accurate and mathematically solvable system models, just as a human learns a skill (such as cooking, driving, or swimming). Specifically, we propose, for the first time, to leverage emerging Deep Reinforcement Learning (DRL) to enable model-free control in DSDPSs, and we present the design, implementation and evaluation of a novel and highly effective DRL-based control framework, which minimizes average end-to-end tuple processing time by jointly learning the system environment from very limited runtime statistics and making decisions under the guidance of powerful Deep Neural Networks. To validate and evaluate the proposed framework, we implemented it on a widely used DSDPS, Apache Storm, and tested it with three representative applications. Extensive experimental results show that 1) compared to Storm’s default scheduler and the state-of-the-art model-based method, the proposed framework reduces average tuple processing time by 33.5% and 14.0% respectively on average; and 2) the proposed framework can quickly reach a good scheduling solution during online learning, which justifies its practicability for online control in DSDPSs.
Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., a classification task). A given data pre-processing operator (e.g., a transformation) can have a positive, negative or zero impact on the final result of the analysis. Expert users have the required knowledge to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of pre-processing operators, and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only ‘syntactically’ applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim to assist non-expert users by recommending data pre-processing operators that are ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of 5 different classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.
Research has shown that it is difficult to obtain proximity while maintaining diversity when solving many-objective optimization problems (MaOPs). The complexities of the true Pareto Front (PF) also pose serious challenges for prevalent algorithms, which lack the ability to adapt to the characteristics of the true PF without prior knowledge. This paper proposes a cascade Clustering and reference point incremental Learning based Interactive Algorithm (CLIA) for many-objective optimization. In the cascade clustering process, using reference lines provided by the learning process, individuals are clustered and sorted within each cluster in a bi-level cascade style for better proximity and diversity. In the reference point incremental learning process, using the feedback from the clustering process, the proper generation of reference points is gradually obtained by incremental learning and the reference lines are repositioned accordingly. The advantages of the proposed interactive algorithm CLIA lie not only in obtaining proximity and maintaining diversity but also in its versatility across diverse PFs, which relies only on the interactions between the two processes without incurring extra evaluations. The experimental studies on the CEC’2018 MaOP benchmark functions show that CLIA covers the true PFs satisfactorily and is competitive, stable and efficient compared with the state-of-the-art algorithms.
For the first time, we introduce the ‘Scaling invariable Benford distance’ and the ‘Benford cyclic graph’, which can be used to analyze any data set. Using the quantity and the graph, we analyze some data sets with common distributions, such as the normal and exponential distributions, and find that different data sets have markedly different values of the ‘Scaling invariable Benford distance’ and different features in the ‘Benford cyclic graph’. We also explore the influence of data size on the ‘Scaling invariable Benford distance’, and find that it first decreases as the data size increases, then approaches a fixed value when the size is large enough.
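The abstract does not define the ‘Scaling invariable Benford distance’, so the sketch below does not attempt to reproduce it. As background only, it shows the classical comparison that Benford-type quantities build on: observed leading-digit frequencies against Benford's law, P(d) = log10(1 + 1/d). The Euclidean distance and the two example data sets are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Benford's law: expected frequency of each leading digit d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    """Leading decimal digit of a nonzero integer."""
    return int(str(abs(n))[0])


def benford_distance(values):
    """Euclidean distance between observed leading-digit frequencies and Benford's law.

    This is a plain first-digit distance, NOT the paper's scaling-invariant quantity.
    """
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return math.sqrt(sum((counts.get(d, 0) / total - BENFORD[d]) ** 2 for d in range(1, 10)))


powers_of_two = [2 ** k for k in range(1, 201)]   # known to follow Benford's law closely
uniform_range = list(range(1, 1000))              # leading digits are exactly uniform

d_powers = benford_distance(powers_of_two)
d_uniform = benford_distance(uniform_range)
print(d_powers, d_uniform)
```

The powers of two give a much smaller distance than the uniform range, which is the kind of contrast between data sets that the abstract describes.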
We consider the problem of exploration in meta reinforcement learning. Two new meta reinforcement learning algorithms are suggested: E-MAML and E-$\text{RL}^2$. Results are presented on a novel environment we call ‘Krazy World’ and a set of maze environments. We show E-MAML and E-$\text{RL}^2$ deliver better performance on tasks where exploration is important.
The purpose of this paper is to construct confidence intervals for the regression coefficients in high-dimensional Cox proportional hazards regression models where the number of covariates may be larger than the sample size. Our debiased estimator construction is similar to those in Zhang and Zhang (2014) and van de Geer et al. (2014), but the time-dependent covariates and censored risk sets introduce considerable additional challenges. Our theoretical results, which provide conditions under which our confidence intervals are asymptotically valid, are supported by extensive numerical experiments.
In this paper, we introduce a new two-parameter lifetime distribution, called the exponential-generalized truncated logarithmic (EGTL) distribution, by compounding the exponential and generalized truncated logarithmic distributions. Our procedure generalizes the exponential-logarithmic (EL) distribution modelling the reliability of systems by the use of first-order concepts, where the minimum lifetime is considered (Tahmasbi 2008). In our approach, we assume that a system fails if a given number k of the components fails and then, we consider the kth-smallest value of lifetime instead of the minimum lifetime. The reliability and failure rate functions as well as their properties are presented for some special cases. The estimation of the parameters is attained by the maximum likelihood, the expectation maximization algorithm, the method of moments and the Bayesian approach, with a simulation study performed to illustrate the different methods of estimation. The application study is illustrated based on two real data sets used in many applications of reliability.
Deep learning has demonstrated tremendous success in a variety of application domains in the past few years. This new field of machine learning has been growing rapidly and has been applied in most application domains, along with some new modalities of application, which helps to open up new opportunities. Different methods have been proposed for different categories of learning approaches, including supervised, semi-supervised and unsupervised learning. The experimental results show state-of-the-art performance of deep learning over traditional machine learning approaches in the fields of Image Processing, Computer Vision, Speech Recognition, Machine Translation, Art, Medical Imaging, Medical Information Processing, Robotics and Control, Bioinformatics, Natural Language Processing (NLP), Cyber Security, and many more. This report presents a brief survey of the development of DL approaches, including Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) including Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU), Auto-Encoder (AE), Deep Belief Network (DBN), Generative Adversarial Network (GAN), and Deep Reinforcement Learning (DRL). In addition, we include recent developments of advanced variants of DL techniques based on the mentioned DL approaches. Furthermore, DL approaches that have been explored and evaluated in different application domains are also included in this survey. We also cover recently developed frameworks, SDKs, and benchmark datasets that are used for implementing and evaluating deep learning approaches. Some surveys have been published on Deep Learning in Neural Networks [1, 38], as well as a survey on RL [234]. However, those papers have not discussed the individual advanced techniques for training large-scale deep learning models and the recently developed methods of generative models [1].
In many applications the process of generating label information is expensive and time consuming. We present a new method that combines active and semi-supervised deep learning to achieve high generalization performance from a deep convolutional neural network with as few known labels as possible. In a setting where a small amount of labeled data as well as a large amount of unlabeled data is available, our method first learns the labeled data set. This initialization is followed by an expectation maximization algorithm, where further training reduces classification entropy on the unlabeled data by targeting a low entropy fit which is consistent with the labeled data. In addition the algorithm asks at a specified frequency an oracle for labels of data with entropy above a certain entropy quantile. Using this active learning component we obtain an agile labeling process that achieves high accuracy, but requires only a small amount of known labels. For the MNIST dataset we report an error rate of 2.06% using only 300 labels and 1.06% for 1000 labels. These results are obtained without employing any special network architecture or data augmentation.
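The active learning step, asking an oracle for labels of data whose entropy lies above a certain quantile, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the quantile convention, and the example predictions are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_queries(predictions, quantile=0.9):
    """Indices of samples whose predictive entropy is at or above the given quantile.

    The cutoff is read off the sorted entropies; other quantile
    conventions would work equally well for this sketch.
    """
    entropies = [entropy(p) for p in predictions]
    ranked = sorted(entropies)
    cutoff = ranked[min(int(quantile * len(ranked)), len(ranked) - 1)]
    return [i for i, e in enumerate(entropies) if e >= cutoff]


# Hypothetical softmax outputs of a classifier on four unlabeled samples.
predictions = [
    [0.98, 0.01, 0.01],   # confident
    [0.34, 0.33, 0.33],   # very uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.50, 0.00],
]
to_label = select_queries(predictions, quantile=0.75)
print(to_label)
```

Samples returned by `select_queries` would be sent to the oracle, while the remaining unlabeled samples continue to be used in the entropy-reducing training step.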
The revolution in World Wide Web (WWW) and smartphone technologies has been the key factor behind the remarkable success of social networks. With the ready availability of check-in data, location-based social networks (LBSNs) (e.g., Facebook) have been heavily explored in the past decade for Point-of-Interest (POI) recommendation. Though many POI recommenders have been defined, most of them have focused on recommending a single location or an arbitrary list that is not contextually coherent. It is cumbersome to rely on such systems when one needs a contextually coherent list of locations that can be used for various day-to-day activities, e.g., itinerary planning. This paper proposes a model termed CAPS (Context-Aware Personalized POI Sequence Recommender System) that generates contextually coherent POI sequences relevant to user preferences. To the best of our knowledge, CAPS is the first attempt to formulate contextual POI sequence modeling by extending the Recurrent Neural Network (RNN) and its variants. CAPS extends the RNN by incorporating multiple contexts into the hidden layer and by incorporating global context (sequence features) into the hidden layers and the output layer. It extends the variants of the RNN (e.g., Long Short-Term Memory (LSTM)) by incorporating multiple contexts and global features in the gate update relations. The major contributions of this paper are: (i) it models the contextual POI sequence problem by incorporating personalized user preferences through multiple constraints (e.g., categorical, social, temporal, etc.); (ii) it extends the RNN to incorporate the contexts of individual items and of the whole sequence, and extends the gated functionality of RNN variants to incorporate the multiple contexts; and (iii) it evaluates the proposed models on two real-world data sets.
Nonnegative matrix factorization (NMF) has become a workhorse for signal and data analytics, triggered by its model parsimony and interpretability. Perhaps a bit surprisingly, the understanding of its model identifiability (the major reason behind the interpretability in many applications such as topic mining and hyperspectral imaging) had been rather limited until recent years. Beginning in the 2010s, the identifiability research of NMF has progressed considerably: many interesting and important results have been discovered by the signal processing (SP) and machine learning (ML) communities. NMF identifiability has a great impact on many aspects in practice, such as ill-posed formulation avoidance and performance-guaranteed algorithm design. On the other hand, there is no tutorial paper that introduces NMF from an identifiability viewpoint. In this paper, we aim to fill this gap by offering a comprehensive and deep tutorial on model identifiability of NMF as well as the connections to algorithms and applications. This tutorial will help researchers and graduate students grasp the essence and insights of NMF, thereby avoiding typical ‘pitfalls’ that are often due to unidentifiable NMF formulations. This paper will also help practitioners pick or design suitable factorization tools for their own problems.
Deep learning is formulated as a discrete-time optimal control problem. This allows one to characterize necessary conditions for optimality and develop training algorithms that do not rely on gradients with respect to the trainable parameters. In particular, we introduce the discrete-time method of successive approximations (MSA), which is based on Pontryagin’s maximum principle, for training neural networks. A rigorous error estimate for the discrete MSA is obtained, which sheds light on its dynamics and the means to stabilize the algorithm. The developed methods are applied to train, in a rather principled way, neural networks with weights that are constrained to take values in a discrete set. We obtain competitive performance and, interestingly, very sparse weights in the case of ternary networks, which may be useful for model deployment on low-memory devices.
It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption, and recapitulate evidence for and against this postulate. We also report the results of an evaluation in a crowd-sourcing study, which does not reveal a strong preference for simple rules, whereas we can observe a weak preference for longer rules in some domains. We then continue to review criteria for interpretability from the psychological literature, evaluate some of them, and briefly discuss their potential use in machine learning.
To train an inference network jointly with a deep generative topic model, making it both scalable to big corpora and fast in out-of-sample prediction, we develop Weibull hybrid autoencoding inference (WHAI) for deep latent Dirichlet allocation, which infers posterior samples via a hybrid of stochastic-gradient MCMC and autoencoding variational Bayes. The generative network of WHAI has a hierarchy of gamma distributions, while the inference network of WHAI is a Weibull upward-downward variational autoencoder, which integrates a deterministic-upward deep neural network and a stochastic-downward deep generative model based on a hierarchy of Weibull distributions. The Weibull distribution can closely approximate a gamma distribution with an analytic Kullback-Leibler divergence, and has a simple reparameterization via uniform noise, which helps to efficiently compute the gradients of the evidence lower bound with respect to the parameters of the inference network. The effectiveness and efficiency of WHAI are illustrated with experiments on big corpora.
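The reparameterization via uniform noise mentioned above follows from inverting the Weibull CDF: if u is uniform on (0, 1), then scale * (-ln(1 - u))^(1/shape) is Weibull distributed. The sketch below shows only this sampling step; the parameter values and the sanity check against the analytic mean, scale * Γ(1 + 1/shape), are illustrative and not taken from the paper.

```python
import math
import random


def sample_weibull(shape, scale, rng):
    """Draw one Weibull(shape, scale) sample by transforming uniform noise.

    The randomness lives entirely in `u`, so the sample is a smooth
    function of `shape` and `scale`, which is what makes gradient-based
    training through the sample possible.
    """
    u = rng.random()  # uniform on [0, 1)
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


rng = random.Random(0)
shape, scale = 2.0, 1.5  # hypothetical parameter values
samples = [sample_weibull(shape, scale, rng) for _ in range(100_000)]

sample_mean = sum(samples) / len(samples)
exact_mean = scale * math.gamma(1.0 + 1.0 / shape)
print(sample_mean, exact_mean)
```

In an autodiff framework the same expression would be written with tensor operations, so that gradients flow to the network outputs that produce `shape` and `scale`.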
Comprehending meaning from natural language is a primary objective of Natural Language Processing (NLP), and text comprehension is the cornerstone of this objective, on which other applications such as chatbots and language translation build. We report a Summary-Attentive Reader that we designed to better emulate the human reading process, along with a dictionary-based solution for out-of-vocabulary (OOV) words in the data, to generate answers based on machine comprehension of reading passages and questions from the SQuAD benchmark. Our implementation of these features with two popular models (Match LSTM and Dynamic Coattention) came close to matching the results obtained by humans.
Robotic grasp detection is still a challenging task, particularly for novel objects. With the recent advances in deep learning, there have been several works on detecting robotic grasps using neural networks. Typically, regression-based grasp detection methods have outperformed classification-based detection methods in computational complexity while retaining excellent accuracy. However, classification-based robotic grasp detection still seems to have merits, such as intermediate step observability and a straightforward back propagation routine for end-to-end training. In this work, we propose a novel classification-based robotic grasp detection method with multiple-stage spatial transformer networks (STN). Our proposed method achieved state-of-the-art accuracy with real-time computation. Additionally, unlike regression-based grasp detection methods, our proposed method allows partial observation of intermediate results, such as grasp location and orientation, for a number of grasp configuration candidates.
The Block Tree (BT) is a novel compact data structure designed to compress sequence collections. It obtains compression ratios close to Lempel-Ziv and supports efficient direct access to any substring. The BT divides the text recursively into fixed-size blocks and those appearing earlier are represented with pointers. On repetitive collections, a few blocks can represent all the others, and thus the BT reduces the size by orders of magnitude. In this paper we extend the BT to two dimensions, to exploit repetitiveness in collections of images, graphs, and maps. This two-dimensional Block Tree divides the image regularly into subimages and replaces some of them by pointers to other occurrences thereof. We develop a specific variant aimed at compressing the adjacency matrices of Web graphs, obtaining space reductions of up to 50\% compared with the $k^2$-tree, which is the best alternative supporting direct and reverse navigation in the graph.
Past. Data curation – the process of discovering, integrating, and cleaning data – is one of the oldest data management problems. Unfortunately, it is still the most time-consuming and least enjoyable work of data scientists. So far, successful data curation stories are mainly ad-hoc solutions that are either domain-specific (for example, ETL rules) or task-specific (for example, entity resolution). Present. The power of current data curation solutions is not keeping up with the ever-changing data ecosystem in terms of volume, velocity, variety and veracity, mainly due to the high human cost, rather than machine cost, of providing the ad-hoc solutions mentioned above. Meanwhile, deep learning is achieving remarkable successes in areas such as image recognition, natural language processing, and speech recognition. This is largely due to its ability to learn features that are neither domain-specific nor task-specific. Future. Data curation solutions need to keep pace with the fast-changing data ecosystem, where the main hope is to devise domain-agnostic and task-agnostic solutions. To this end, we start a new research project, called AutoDC, to unleash the potential of deep learning towards self-driving data curation. We will discuss how different deep learning concepts can be adapted and extended to solve various data curation problems. We showcase some low-hanging fruit from the early encounters between deep learning and data curation happening in AutoDC. We believe that the directions pointed out by this work will not only drive AutoDC towards democratizing data curation, but also serve as a cornerstone for researchers and practitioners to move to a new realm of data curation solutions.
Despite significant empirical and theoretically supported evidence that non-static parameter choices can be strongly beneficial in evolutionary computation, the question of how best to adjust parameter values plays only a marginal role in contemporary research on discrete black-box optimization. This has led to the unsatisfactory situation in which feedback-free parameter selection rules such as the cooling schedule of Simulated Annealing are predominant in state-of-the-art heuristics, while, at the same time, we understand very well that such time-dependent selection rules can only perform worse than adjustment rules that do take into account the evolution of the optimization process. A number of adaptive and self-adaptive parameter control strategies have been proposed in the literature, but they have not (yet) reached a broader audience. A key obstacle seems to lie in their rather complex update rules. The purpose of our work is to demonstrate that high-performing online parameter selection rules do not have to be very complicated. More precisely, we experiment with a multiplicative, comparison-based update rule to adjust the mutation probability of a (1+1) Evolutionary Algorithm. We show that this simple self-adjusting rule outperforms the best static unary unbiased black-box algorithm on LeadingOnes, achieving an almost optimal speedup of about $18\%$.
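A multiplicative, comparison-based update rule for the mutation probability of a (1+1) Evolutionary Algorithm on LeadingOnes can be sketched in a few lines. The update factors `up` and `down`, the bounds on the mutation probability, and the evaluation budget are assumptions for illustration; the abstract does not state the constants used in the paper.

```python
import random


def leading_ones(bits):
    """Number of consecutive 1-bits at the start of the bit string."""
    count = 0
    for b in bits:
        if b != 1:
            break
        count += 1
    return count


def self_adjusting_ea(n, rng, up=2.0, down=0.5, max_evals=200_000):
    """(1+1) EA whose mutation probability p is rescaled after every comparison.

    If the offspring is at least as good as the parent, p is multiplied
    by `up`; otherwise it is multiplied by `down` (both factors assumed).
    """
    parent = [rng.randint(0, 1) for _ in range(n)]
    parent_fitness = leading_ones(parent)
    p = 1.0 / n
    evals = 0
    while parent_fitness < n and evals < max_evals:
        # Standard bit mutation: flip each bit independently with probability p.
        child = [1 - b if rng.random() < p else b for b in parent]
        child_fitness = leading_ones(child)
        evals += 1
        if child_fitness >= parent_fitness:
            parent, parent_fitness = child, child_fitness
            p = min(up * p, 0.5)              # success: mutate more aggressively
        else:
            p = max(down * p, 1.0 / n ** 2)   # failure: mutate more cautiously
    return parent_fitness, evals


best, evals = self_adjusting_ea(30, random.Random(1))
print(best, evals)
```

The rule needs no model of the problem: the only feedback is whether the offspring was accepted, which is what keeps the update simple.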
Neural networks are known to be vulnerable to adversarial examples. Carefully chosen perturbations to real images, while imperceptible to humans, induce misclassification and threaten the reliability of deep learning systems in the wild. To guard against adversarial examples, we take inspiration from game theory and cast the problem as a minimax zero-sum game between the adversary and the model. In general, for such games, the optimal strategy for both players requires a stochastic policy, also known as a mixed strategy. In this light, we propose Stochastic Activation Pruning (SAP), a mixed strategy for adversarial defense. SAP prunes a random subset of activations (preferentially pruning those with smaller magnitude) and scales up the survivors to compensate. We can apply SAP to pretrained networks, including adversarially trained models, without fine-tuning, providing robustness against adversarial examples. Experiments demonstrate that SAP confers robustness against attacks, increasing accuracy and preserving calibration.
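The pruning step described above, preferentially dropping small-magnitude activations at random and scaling up the survivors, can be sketched for a single activation vector. The specific scheme used here (drawing indices with replacement with probability proportional to magnitude, then dividing each survivor by its probability of being kept) is an assumption of this sketch, as are the function name and the `num_draws` parameter; the abstract does not give these details.

```python
import random


def stochastic_activation_pruning(activations, num_draws, rng):
    """Randomly prune activations, favoring large magnitudes, and rescale survivors."""
    total = sum(abs(a) for a in activations)
    if total == 0:
        return list(activations)
    # Sampling probability proportional to activation magnitude.
    probs = [abs(a) / total for a in activations]
    kept = set(rng.choices(range(len(activations)), weights=probs, k=num_draws))
    pruned = []
    for i, a in enumerate(activations):
        if i in kept:
            # Probability that index i is drawn at least once in num_draws draws.
            keep_prob = 1.0 - (1.0 - probs[i]) ** num_draws
            pruned.append(a / keep_prob)  # rescale so the expected value equals a
        else:
            pruned.append(0.0)
    return pruned


acts = [0.1, -2.0, 0.0, 0.5, 3.0]  # hypothetical layer activations
print(stochastic_activation_pruning(acts, num_draws=3, rng=random.Random(0)))
```

Because each surviving activation is divided by its keep probability, the pruned vector matches the original in expectation, while any single forward pass is randomized.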
Clustering high-dimensional datasets is hard because interpoint distances become less informative in high-dimensional spaces. We present a clustering algorithm that performs nonlinear dimensionality reduction and clustering jointly. The data is embedded into a lower-dimensional space by a deep autoencoder. The autoencoder is optimized as part of the clustering process. The resulting network produces clustered data. The presented approach does not rely on prior knowledge of the number of ground-truth clusters. Joint nonlinear dimensionality reduction and clustering are formulated as optimization of a global continuous objective. We thus avoid discrete reconfigurations of the objective that characterize prior clustering algorithms. Experiments on datasets from multiple domains demonstrate that the presented algorithm outperforms state-of-the-art clustering schemes, including recent methods that use deep networks.
We consider the canonical $L_0$-regularized least squares problem (aka best subsets) which is generally perceived as a ‘gold standard’ for many sparse learning regimes. In spite of worst-case computational intractability results, recent work has shown that advances in mixed integer optimization can be used to obtain near-optimal solutions to this problem for instances where the number of features $p \approx 10^3$. While these methods lead to estimators with excellent statistical properties, often there is a price to pay in terms of a steep increase in computation times, especially when compared to highly efficient popular algorithms for sparse learning (e.g., based on $L_1$-regularization) that scale to much larger problem sizes. Bridging this gap is a main goal of this paper. We study the computational aspects of a family of $L_0$-regularized least squares problems with additional convex penalties. We propose a hierarchy of necessary optimality conditions for these problems. We develop new algorithms, based on coordinate descent and local combinatorial optimization schemes, and study their convergence properties. We demonstrate that the choice of an algorithm determines the quality of solutions obtained; and local combinatorial optimization-based algorithms generally result in solutions of superior quality. We show empirically that our proposed framework is relatively fast for problem instances with $p\approx 10^6$ and works well, in terms of both optimization and statistical properties (e.g., prediction, estimation, and variable selection), compared to simpler heuristic algorithms. A version of our algorithm reaches up to a three-fold speedup (with $p$ up to $10^6$) when compared to state-of-the-art schemes for sparse learning such as glmnet and ncvreg.
Most recent approaches use the sequence-to-sequence model for paraphrase generation. The existing sequence-to-sequence model tends to memorize the words and the patterns in the training dataset instead of learning the meaning of the words. Therefore, the generated sentences are often grammatically correct but semantically improper. In this work, we introduce a novel model based on the encoder-decoder framework, called Word Embedding Attention Network (WEAN). Our proposed model generates words by querying distributed word representations (i.e., neural word embeddings), with the aim of capturing the meaning of the corresponding words. Following previous work, we evaluate our model on two paraphrase-oriented tasks, namely text simplification and short text abstractive summarization. Experimental results show that our model outperforms the sequence-to-sequence baseline by 6.3 and 5.5 BLEU points on two English text simplification datasets, and by 5.7 ROUGE-2 F1 points on a Chinese summarization dataset. Moreover, our model achieves state-of-the-art performance on these three benchmark datasets.
We introduce Recurrent Predictive State Policy (RPSP) networks, a recurrent architecture that brings insights from predictive state representations to reinforcement learning in partially observable environments. Predictive state policy networks consist of a recursive filter, which keeps track of a belief about the state of the environment, and a reactive policy that directly maps beliefs to actions, to maximize the cumulative reward. The recursive filter leverages predictive state representations (PSRs) (Rosencrantz and Gordon, 2004; Sun et al., 2016) by modeling predictive state– a prediction of the distribution of future observations conditioned on history and future actions. This representation gives rise to a rich class of statistically consistent algorithms (Hefny et al., 2018) to initialize the recursive filter. Predictive state serves as an equivalent representation of a belief state. Therefore, the policy component of the RPSP-network can be purely reactive, simplifying training while still allowing optimal behaviour. Moreover, we use the PSR interpretation during training as well, by incorporating prediction error in the loss function. The entire network (recursive filter and reactive policy) is still differentiable and can be trained using gradient based methods. We optimize our policy using a combination of policy gradient based on rewards (Williams, 1992) and gradient descent based on prediction error. We show the efficacy of RPSP-networks under partial observability on a set of robotic control tasks from OpenAI Gym. We empirically show that RPSP-networks perform well compared with memory-preserving networks such as GRUs, as well as finite memory models, being the overall best performing method.
Recent advances in object detection are mainly driven by deep learning with large-scale detection benchmarks. However, the fully-annotated training set is often limited for a target detection task, which may deteriorate the performance of deep detectors. To address this challenge, we propose a novel low-shot transfer detector (LSTD) in this paper, where we leverage rich source-domain knowledge to construct an effective target-domain detector with very few training examples. The main contributions are described as follows. First, we design a flexible deep architecture of LSTD to alleviate transfer difficulties in low-shot detection. This architecture can integrate the advantages of both SSD and Faster RCNN in a unified deep framework. Second, we introduce a novel regularized transfer learning framework for low-shot detection, where the transfer knowledge (TK) and background depression (BD) regularizations are proposed to leverage object knowledge respectively from source and target domains, in order to further enhance fine-tuning with a few target images. Finally, we examine our LSTD on a number of challenging low-shot detection experiments, where LSTD outperforms other state-of-the-art approaches. The results demonstrate that LSTD is a preferable deep detector for low-shot scenarios.
Distance metric learning is a successful way to enhance the performance of the nearest neighbor classifier. In most cases, however, the distribution of data does not obey a regular form and may change in different parts of the feature space. To address this, this paper proposes a novel local distance metric learning method, namely Local Mahalanobis Distance Learning (LMDL), in order to enhance the performance of the nearest neighbor classifier. LMDL considers the neighborhood influence and learns multiple distance metrics for a reduced set of input samples. The elements of the reduced set are called prototypes, which try to preserve local discriminative information as much as possible. The proposed LMDL can be kernelized very easily, which is highly desirable in the case of highly nonlinear data. The quality as well as the efficiency of the proposed method are assessed through a set of different experiments on various datasets, and the obtained results show that LMDL as well as the kernelized version is superior to the other related state-of-the-art methods.
The goal in extreme multi-label classification is to learn a classifier which can assign a small subset of relevant labels to an instance from an extremely large set of target labels. Datasets in extreme classification exhibit a long tail of labels which have a small number of positive training instances. In this work, we pose the learning task in extreme classification with a large number of tail labels as learning in the presence of adversarial perturbations. This view motivates a robust optimization framework and equivalence to a corresponding regularized objective. Under the proposed robustness framework, we demonstrate the efficacy of Hamming loss for tail-label detection in extreme classification. The equivalent regularized objective, in combination with proximal gradient based optimization, performs better than state-of-the-art methods on propensity scored versions of precision@k and nDCG@k (up to 20% relative improvement over PFastreXML, a leading tree-based approach, and 60% relative improvement over SLEEC, a leading label-embedding approach). Furthermore, we also highlight the sub-optimality of a sparse solver in a widely used package for large-scale linear classification, which is interesting in its own right. We also investigate the spectral properties of label graphs for providing novel insights towards understanding the conditions governing the performance of the Hamming loss based one-vs-rest scheme vis-à-vis label embedding methods.
We describe N-body networks, a neural network architecture for learning the behavior and properties of complex many body physical systems. Our specific application is to learn atomic potential energy surfaces for use in molecular dynamics simulations. Our architecture is novel in that (a) it is based on a hierarchical decomposition of the many body system into subsystems, (b) the activations of the network correspond to the internal state of each subsystem, (c) the ‘neurons’ in the network are constructed explicitly so as to guarantee that each of the activations is covariant to rotations, (d) the neurons operate entirely in Fourier space, and the nonlinearities are realized by tensor products followed by Clebsch-Gordan decompositions. As part of the description of our network, we give a characterization of the ways in which the weights of the network may interact with the activations so as to ensure that the covariance property is maintained.
Supervised deep learning methods have shown promising results for the task of monocular depth estimation; but acquiring ground truth is costly, and prone to noise as well as inaccuracies. While synthetic datasets have been used to circumvent the above problems, the resultant models do not generalize well to natural scenes due to the inherent domain shift. Recent adversarial approaches for domain adaptation have performed well in mitigating the differences between the source and target domains. But these methods are mostly limited to a classification setup and do not scale well for fully-convolutional architectures. In this work, we propose AdaDepth, an unsupervised domain adaptation strategy for the pixel-wise regression task of monocular depth estimation. The proposed approach avoids the above limitations through a) adversarial learning and b) explicit imposition of content consistency on the adapted target representation. Our unsupervised approach performs competitively with other established approaches on depth estimation tasks and achieves state-of-the-art results in a semi-supervised setting.
Many real-world complex systems are well represented as multilayer networks; predicting interactions in those systems is one of the most pressing problems in predictive network science. To address this challenge, we introduce two stochastic block models for multilayer and temporal networks; one of them uses nodes as its fundamental unit, whereas the other focuses on links. We also develop scalable algorithms for inferring the parameters of these models. Because our models describe all layers simultaneously, our approach takes full advantage of the information contained in the whole network when making predictions about any particular layer. We illustrate the potential of our approach by analyzing two empirical datasets—a temporal network of email communications, and a network of drug interactions for treating different cancer types. We find that modeling all layers simultaneously does result, in general, in more accurate link prediction. However, the most predictive model depends on the dataset under consideration; whereas the node-based model is more appropriate for predicting drug interactions, the link-based model is more appropriate for predicting email communication.
Collaborative Filtering (CF) is a widely adopted technique in recommender systems. Traditional CF models mainly focus on predicting a user’s preference to the items in a single domain such as the movie domain or the music domain. A major challenge for such models is the data sparsity problem, and especially, CF cannot make accurate predictions for the cold-start users who have no ratings at all. Although Cross-Domain Collaborative Filtering (CDCF) is proposed for effectively transferring users’ rating preference across different domains, it is still difficult for existing CDCF models to tackle the cold-start users in the target domain due to the extreme data sparsity. In this paper, we propose a Cross-Domain Latent Feature Mapping (CDLFM) model for cold-start users in the target domain. Firstly, in order to better characterize users in sparse domains, we take the users’ similarity relationship on rating behaviors into consideration and propose the Matrix Factorization by incorporating User Similarities (MFUS) in which three similarity measures are proposed. Next, to perform knowledge transfer across domains, we propose a neighborhood based gradient boosting trees method to learn the cross-domain user latent feature mapping function. For each cold-start user, we learn his/her feature mapping function based on the latent feature pairs of those linked users who have similar rating behaviors with the cold-start user in the auxiliary domain. And the preference of the cold-start user in the target domain can be predicted based on the mapping function and his/her latent features in the auxiliary domain. Experimental results on two real data sets extracted from Amazon transaction data demonstrate the superiority of our proposed model against other state-of-the-art methods.
Advances in numerical optimization have supported breakthroughs in several areas of signal processing. This paper focuses on the recent enhanced variants of the proximal gradient numerical optimization algorithm, which combine quasi-Newton methods with forward-adjoint oracles to tackle large-scale problems and reduce the computational burden of many applications. These proximal gradient algorithms are here described in an easy-to-understand way, illustrating how they are able to address a wide variety of problems arising in signal processing. A new high-level modeling language is presented which is used to demonstrate the versatility of the presented algorithms in a series of signal processing application examples such as sparse deconvolution, total variation denoising, audio de-clipping and others.
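To make the proximal gradient iteration discussed in the abstract above concrete, here is a minimal sketch of the basic (unaccelerated) method applied to the lasso problem $\min_x \frac{1}{2}\|Ax-b\|_2^2 + \lambda\|x\|_1$. This is an illustration under stated assumptions, not the enhanced quasi-Newton variants or the modeling language the abstract refers to; the names `soft_threshold` and `proximal_gradient_lasso` are hypothetical. The two matrix-vector products in the loop play the role of the forward and adjoint evaluations of the linear operator, and the step size uses the squared Frobenius norm of $A$ as a simple upper bound on the Lipschitz constant of the gradient.

```python
from typing import List


def soft_threshold(v: float, t: float) -> float:
    """Proximal operator of t*|.|: shrink v towards zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_gradient_lasso(A: List[List[float]], b: List[float],
                            lam: float, iters: int = 500) -> List[float]:
    """Basic proximal gradient for 0.5*||Ax - b||^2 + lam*||x||_1 (illustrative)."""
    m, n = len(A), len(A[0])
    # ||A||_F^2 bounds the largest eigenvalue of A^T A, so 1/||A||_F^2 is a safe step.
    lipschitz_bound = sum(a * a for row in A for a in row)
    if lipschitz_bound == 0.0:
        return [0.0] * n  # A is zero, so x = 0 minimizes the objective
    step = 1.0 / lipschitz_bound
    x = [0.0] * n
    for _ in range(iters):
        # Forward evaluation: residual r = A x - b.
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # Adjoint evaluation: gradient of the smooth part, g = A^T r.
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step on the smooth term, then proximal step on the L1 term.
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x


if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0]]
    b = [3.0, 0.5]
    print(proximal_gradient_lasso(A, b, lam=1.0))
```

For an identity matrix $A$ the minimizer is the soft-thresholded vector $b$, which gives a quick sanity check of the iteration.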
There currently exists a wide range of techniques to model and evolve artificial players for games. Existing techniques range from black box neural networks to entirely hand-designed solutions. In this paper, we demonstrate the feasibility of a genetic programming framework using human controller input to derive meaningful artificial players which can, later on, be optimised by hand. The current state of the art in game character design relies heavily on human designers to manually create and edit scripts and rules for game characters. To address this manual editing bottleneck, current computational intelligence techniques approach the issue with fully autonomous character generators, replacing most of the design process using black box solutions such as neural networks or the like. Our GP approach to this problem creates character controllers which can be further authored and developed by a designer. It also allows designers to include their play style without the need to use a programming language. This keeps the designer in the loop while reducing repetitive manual labour. Our system also provides insights into how players express themselves in games and into deriving appropriate models for representing those insights. We present our framework, supporting findings and open challenges.
String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set $\Gamma\subseteq [1..n]$ is a $k$-attractor for a string $S\in[1..\sigma]^n$ if and only if every distinct substring of $S$ of length at most $k$ has an occurrence straddling at least one of the positions in $\Gamma$. Finding the smallest $k$-attractor is NP-hard for $k\geq3$, but polylogarithmic approximations can be found using reductions from dictionary compressors. It is easy to reduce the $k$-attractor problem to a set-cover instance where the string's positions are interpreted as sets of substrings. The main result of this paper is a much more powerful reduction based on the truncated suffix tree. Our new characterization of the problem leads to more efficient algorithms for string attractors: we show how to check the validity and minimality of a $k$-attractor in near-optimal time and how to quickly compute exact and approximate solutions. For example, we prove that a minimum $3$-attractor can be found in optimal $O(n)$ time when $\sigma\in O(\sqrt[3+\epsilon]{\log n})$ for any constant $\epsilon>0$, and a $2.45$-approximation can be computed in $O(n)$ time on general alphabets. To conclude, we introduce and study the complexity of the closely-related sharp-$k$-attractor problem: to find the smallest set of positions capturing all distinct substrings of length exactly $k$. We show that the problem is in P for $k=1,2$ and is NP-complete for constant $k\geq 3$.
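The definition of a $k$-attractor given in the abstract above can be checked directly by brute force, as in the following sketch. This is only a literal transcription of the definition for small inputs, with running time far from the near-optimal bounds the abstract mentions, and it does not use the truncated suffix tree; the function name `is_k_attractor` is hypothetical. Positions are 1-based, matching $\Gamma\subseteq[1..n]$.

```python
from typing import Dict, Iterable


def is_k_attractor(s: str, gamma: Iterable[int], k: int) -> bool:
    """Return True if gamma (1-based positions) is a k-attractor for s."""
    n = len(s)
    positions = set(gamma)
    # For each distinct substring of length <= k, record whether at least
    # one of its occurrences straddles a position in gamma.
    covered: Dict[str, bool] = {}
    for length in range(1, min(k, n) + 1):
        for start in range(1, n - length + 2):  # 1-based start of the occurrence
            end = start + length - 1            # 1-based inclusive end
            sub = s[start - 1:end]
            hit = any(start <= p <= end for p in positions)
            covered[sub] = covered.get(sub, False) or hit
    return all(covered.values())


if __name__ == "__main__":
    print(is_k_attractor("abab", {2, 3}, 4))  # True
    print(is_k_attractor("abab", {2}, 4))     # False: no occurrence of "a" covers position 2
```

In the second call the substring "a" occurs only at positions 1 and 3, so the single position 2 cannot capture it.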
We propose a new class of spatio-temporal models with unknown and banded autoregressive coefficient matrices. The setting represents a sparse structure for high-dimensional spatial panel dynamic models when panel members represent economic (or other types of) individuals at many different locations. The structure is practically meaningful when the order of panel members is arranged appropriately. Note that the implied autocovariance matrices are unlikely to be banded, and therefore, the proposal is radically different from the existing literature on the inference for high-dimensional banded covariance matrices. Due to the innate endogeneity, we apply the least squares method based on a Yule-Walker equation to estimate autoregressive coefficient matrices. The estimators based on multiple Yule-Walker equations are also studied. A ratio-based method for determining the bandwidth of autoregressive matrices is also proposed. Some asymptotic properties of the inference methods are established. The proposed methodology is further illustrated using both simulated and real data sets.
We investigate the effects of initialization and architecture on the start of training in deep ReLU nets. We identify two common failure modes for early training in which the mean and variance of activations are poorly behaved. For each failure mode, we give a rigorous proof of when it occurs at initialization and how to avoid it. The first failure mode, exploding/vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in. The second failure mode, exponentially large variance of activation length, can be avoided by keeping constant the sum of the reciprocals of layer widths. We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allow much deeper networks to be trained.
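The two criteria stated in the abstract above can be expressed in a few lines, as in the following sketch. It is an illustration under stated assumptions rather than the experimental code of the paper: a zero-mean Gaussian is used as one example of a symmetric distribution with variance 2/fan-in, and the helper names `he_normal_weights` and `sum_reciprocal_widths` are hypothetical. The second helper computes the sum of reciprocals of layer widths, the quantity the abstract says should be kept constant.

```python
import math
import random
from typing import List


def he_normal_weights(fan_in: int, fan_out: int,
                      rng: random.Random) -> List[List[float]]:
    """Draw a fan_out x fan_in weight matrix with i.i.d. N(0, 2/fan_in) entries."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def sum_reciprocal_widths(hidden_widths: List[int]) -> float:
    """Sum of 1/width over the given layer widths."""
    return sum(1.0 / w for w in hidden_widths)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = he_normal_weights(fan_in=200, fan_out=200, rng=rng)
    flat = [w for row in weights for w in row]
    print("empirical variance:", sum(w * w for w in flat) / len(flat))
    print("target variance:   ", 2.0 / 200)

    # Doubling the depth while doubling the width leaves the sum unchanged.
    print(sum_reciprocal_widths([100] * 10))
    print(sum_reciprocal_widths([200] * 20))
```

The last two lines show one way to hold the sum fixed: a network with 10 layers of width 100 and one with 20 layers of width 200 give the same value.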
A first line of attack in exploratory data analysis is data visualization, i.e., generating a 2-dimensional representation of data that makes clusters of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The t-SNE heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the de facto standard for visualization in a wide range of applications. This work gives a formal framework for the problem of data visualization – finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the ‘ground-truth’ clusters (similar to conditions assumed in earlier analyses of clustering) in the underlying data. These are the first provable guarantees on t-SNE for constructing good data visualizations. We show that our deterministic condition is satisfied by considerably general probabilistic generative models for clusterable data such as mixtures of well-separated log-concave distributions. Finally, we give theoretical evidence that t-SNE provably succeeds in partially recovering cluster structure even when the above deterministic condition is not met.
Many online applications, such as online social networks or knowledge bases, are often attacked by malicious users who commit different types of actions such as vandalism on Wikipedia or fraudulent reviews on eBay. Currently, most of the fraud detection approaches require a training dataset that contains records of both benign and malicious users. However, in practice, there are often no or very few records of malicious users. In this paper, we develop one-class adversarial nets (OCAN) for fraud detection using training data with only benign users. OCAN first uses LSTM-Autoencoder to learn the representations of benign users from their sequences of online activities. It then detects malicious users by training a discriminator with a complementary GAN model that is different from the regular GAN model. Experimental results show that our OCAN outperforms the state-of-the-art one-class classification models and achieves comparable performance with the latest multi-source LSTM model that requires both benign and malicious users in the training phase.
We address the problem of finding realistic geometric corrections to a foreground object such that it appears natural when composited into a background image. To achieve this, we propose a novel Generative Adversarial Network (GAN) architecture that utilizes Spatial Transformer Networks (STNs) as the generator, which we call Spatial Transformer GANs (ST-GANs). ST-GANs seek image realism by operating in the geometric warp parameter space. In particular, we exploit an iterative STN warping scheme and propose a sequential training strategy that achieves better results compared to naive training of a single generator. One of the key advantages of ST-GAN is its applicability to high-resolution images indirectly since the predicted warp parameters are transferable between reference frames. We demonstrate our approach in two applications: (1) visualizing how indoor furniture (e.g. from product images) might be perceived in a room, (2) hallucinating how accessories like glasses would look when matched with real portraits.