This paper describes ANN-Benchmarks, a tool for evaluating the performance of in-memory approximate nearest neighbor algorithms. It provides a standard interface for measuring the performance and quality achieved by nearest neighbor algorithms on different standard data sets. It supports several different ways of integrating $k$-NN algorithms, and its configuration system automatically tests a range of parameter settings for each algorithm. Algorithms are compared with respect to many different (approximate) quality measures, and adding more is easy and fast; the included plotting front-ends can visualise these as images, $\LaTeX$ plots, and websites with interactive plots. ANN-Benchmarks aims to provide a constantly updated overview of the current state of the art of $k$-NN algorithms. In the short term, this overview allows users to choose the correct $k$-NN algorithm and parameters for their similarity search task; in the longer term, algorithm designers will be able to use this overview to test and refine automatic parameter tuning. The paper gives an overview of the system, evaluates the results of the benchmark, and points out directions for future work. Interestingly, very different approaches to $k$-NN search yield comparable quality-performance trade-offs. The system is available at http://ann-benchmarks.com.
Fuzzing has become the de facto standard technique for finding software vulnerabilities. However, even the state-of-the-art fuzzers are not very efficient at finding hard-to-trigger software bugs. Coverage-guided evolutionary fuzzers, while fast and scalable, often get stuck at fruitless sequences of random mutations. By contrast, more systematic techniques like symbolic and concolic execution incur significant performance overhead and struggle to scale to larger programs. We design, implement, and evaluate NEUZZ, an efficient fuzzer that guides the fuzzing input generation process using deep neural networks. NEUZZ efficiently learns a differentiable neural approximation of the target program logic. The differentiability of the surrogate neural program, unlike the original target program, allows us to use efficient optimization techniques like gradient descent to identify promising mutations that are more likely to trigger hard-to-reach code in the target program. We evaluate NEUZZ on 10 popular real-world programs and demonstrate that NEUZZ consistently outperforms AFL, a state-of-the-art evolutionary fuzzer, both at finding new bugs and achieving higher edge coverage. In total, NEUZZ found 36 previously unknown bugs that AFL failed to find and achieved, on average, 70 more edge coverage than AFL. Our results also demonstrate that NEUZZ can achieve average 9 more edge coverage while taking 16 less training time than other learning-enabled fuzzers.
Despite the somewhat different techniques used in developing search engines and recommender systems, they both follow the same goal: helping people to get the information they need at the right time. Due to this common goal, search and recommendation models can potentially benefit from each other. The recent advances in neural network technologies make them effective and easily extendable for various tasks, including retrieval and recommendation. This raises the possibility of jointly modeling and optimizing search ranking and recommendation algorithms, with potential benefits to both. In this paper, we present theoretical and practical reasons to motivate joint modeling of search and recommendation as a research direction. We propose a general framework that simultaneously learns a retrieval model and a recommendation model by optimizing a joint loss function. Our preliminary results on a dataset of product data indicate that the proposed joint modeling substantially outperforms the retrieval and recommendation models trained independently. We list a number of future directions for this line of research that can potentially lead to development of state-of-the-art search and recommendation models.
This paper describes a data reduction technique in case of a markov chain of specified order. Instead of observing all the transitions in a markov chain we record only a few of them and treat the remaining part as missing. The decision about which transitions to be filtered is taken before the observation process starts. Based on the filtered chain we try to estimate the parameters of the markov model using EM algorithm. In the first half of the paper we characterize a class of filtering mechanism for which all the parameters remain identifiable. In the later half we explain methods of estimation and testing about the transition probabilities of the markov chain based on the filtered data. The methods are first developed assuming a simple markov model with each probability of transition positive, but then generalized for models with structural zeroes in the transition probability matrix. Further extension is also done for multiple markov chains. The performance of the developed method of estimation is studied using simulated data along with a real life data.
Wang and Tchetgen Tchetgen (2017) studied identification and estimation of the average treatment effect when some confounders are unmeasured. Under their identification condition, they showed that the semiparametric efficient influence function depends on five unknown functionals. They proposed to parameterize all functionals and estimate the average treatment effect from the efficient influence function by replacing the unknown functionals with estimated functionals. They established that their estimator is consistent when certain functionals are correctly specified and attains the semiparametric efficiency bound when all functionals are correctly specified. In applications, it is likely that those functionals could all be misspecified. Consequently their estimator could be inconsistent or consistent but not efficient. This paper presents an alternative estimator that does not require parameterization of any of the functionals. We establish that the proposed estimator is always consistent and always attains the semiparametric efficiency bound. A simple and intuitive estimator of the asymptotic variance is presented, and a small scale simulation study reveals that the proposed estimation outperforms the existing alternatives in finite samples.
Your computer is continuously executing programs, but does it really understand them Not in any meaningful sense. That burden falls upon human knowledge workers, who are increasingly asked to write and understand code. They would benefit greatly from intelligent tools that reveal the connections between their code and its subject matter. Towards this prospect, we develop an AI system that forms semantic representations of computer programs, using techniques from knowledge representation and program analysis. We focus on code written for data science, although our method is more generally applicable. The semantic representations are created through a novel algorithm for the semantic enrichment of dataflow graphs. This algorithm is undergirded by a new ontology language for modeling computer programs and a new ontology about data science, written in this language.
The minor probability events detection is a crucial problem in Big data. Such events tend to include rarely occurring phenomenons which should be detected and monitored carefully. Given the prior probabilities of separate events and the conditional distributions of observations on the events, the Bayesian detection can be applied to estimate events behind the observations. It has been proved that Bayesian detection has the smallest overall testing error in average sense. However, when detecting an event with very small prior probability, the conditional Bayesian detection would result in high miss testing rate. To overcome such a problem, a modified detection approach is proposed based on Bayesian detection and message importance measure, which can reduce miss testing rate in conditions of detecting events with minor probability. The result can help to dig minor probability events in big data.
In dataflow representations for signal processing systems, applications are represented as directed graphs in which vertices represent computations and edges correspond to buffers that store data as it passes between computations. The buffers are single-input, single-output components that manage data in a first-in, first-out (FIFO) fashion. In this paper, we generalize the concept of dataflow buffers with a concept called ‘passive blocks’. Like dataflow buffers, passive blocks are used to store data during the intervals between its generation by producing actors, and its use by consuming actors. However, passive blocks can have multiple inputs and multiple outputs, and can incorporate operations on and rearrangements of the stored data subject to certain constraints. We define a form of flowgraph representation that is based on replacing dataflow edges with the proposed concept of passive blocks. We present a structured design methodology for utilizing this new form of signal processing flowgraph, and demonstrate its utility in improving memory management efficiency, and execution time performance.
One of the most common needs in manufacturing plants is rejecting products not coincident with the standards as anomalies. Accurate and automatic anomaly detection improves product reliability and reduces inspection cost. Probabilistic models have been employed to detect test samples with lower likelihoods as anomalies in unsupervised manner. Recently, a probabilistic model called deep generative model (DGM) has been proposed for end-to-end modeling of natural images and already achieved a certain success. However, anomaly detection of machine components with complicated structures is still challenging because they produce a wide variety of normal image patches with low likelihoods. For overcoming this difficulty, we propose unregularized score for the DGM. As its name implies, the unregularized score is the anomaly score of the DGM without the regularization terms. The unregularized score is robust to the inherent complexity of a sample and has a smaller risk of rejecting a sample appearing less frequently but being coincident with the standards.
The recently proposed adversarial training methods show the robustness to both adversarial and original examples and achieve state-of-the-art results in supervised and semi-supervised learning. All the existing adversarial training methods consider only how the worst perturbed examples (i.e., adversarial examples) could affect the model output. Despite their success, we argue that such setting may be in lack of generalization, since the output space (or label space) is apparently less informative. In this paper, we propose a novel method, called Manifold Adversarial Training (MAT). MAT manages to build an adversarial framework based on how the worst perturbation could affect the distributional manifold rather than the output space. Particularly, a latent data space with the Gaussian Mixture Model (GMM) will be first derived. On one hand, MAT tries to perturb the input samples in the way that would rough the distributional manifold the worst. On the other hand, the deep learning model is trained trying to promote in the latent space the manifold smoothness, measured by the variation of Gaussian mixtures (given the local perturbation around the data point). Importantly, since the latent space is more informative than the output space, the proposed MAT can learn better a robust and compact data representation, leading to further performance improvement. The proposed MAT is important in that it can be considered as a superset of one recently-proposed discriminative feature learning approach called center loss. We conducted a series of experiments in both supervised and semi-supervised learning on three benchmark data sets, showing that the proposed MAT can achieve remarkable performance, much better than those of the state-of-the-art adversarial approaches.
Machine learning models leak information about the datasets on which they are trained. An adversary can build an algorithm to trace the individual members of a model’s training dataset. As a fundamental inference attack, he aims to distinguish between data points that were part of the model’s training set and any other data points from the same distribution. This is known as the tracing (and also membership inference) attack. In this paper, we focus on such attacks against black-box models, where the adversary can only observe the output of the model, but not its parameters. This is the current setting of machine learning as a service in the Internet. We introduce a privacy mechanism to train machine learning models that provably achieve membership privacy: the model’s predictions on its training data are indistinguishable from its predictions on other data points from the same distribution. We design a strategic mechanism where the privacy mechanism anticipates the membership inference attacks. The objective is to train a model such that not only does it have the minimum prediction error (high utility), but also it is the most robust model against its corresponding strongest inference attack (high privacy). We formalize this as a min-max game optimization problem, and design an adversarial training algorithm that minimizes the classification loss of the model as well as the maximum gain of the membership inference attack against it. This strategy, which guarantees membership privacy (as prediction indistinguishability), acts also as a strong regularizer and significantly generalizes the model. We evaluate our privacy mechanism on deep neural networks using different benchmark datasets. We show that our min-max strategy can mitigate the risk of membership inference attacks (close to the random guess) with a negligible cost in terms of the classification error.
Collaborative Filtering (CF) is one of the most commonly used recommendation methods. CF consists in predicting whether, or how much, a user will like (or dislike) an item by leveraging the knowledge of the user’s preferences as well as that of other users. In practice, users interact and express their opinion on only a small subset of items, which makes the corresponding user-item rating matrix very sparse. Such data sparsity yields two main problems for recommender systems: (1) the lack of data to effectively model users’ preferences, and (2) the lack of data to effectively model item characteristics. However, there are often many other data sources that are available to a recommender system provider, which can describe user interests and item characteristics (e.g., users’ social network, tags associated to items, etc.). These valuable data sources may supply useful information to enhance a recommendation system in modeling users’ preferences and item characteristics more accurately and thus, hopefully, to make recommenders more precise. For various reasons, these data sources may be managed by clusters of different data centers, thus requiring the development of distributed solutions. In this paper, we propose a new distributed collaborative filtering algorithm, which exploits and combines multiple and diverse data sources to improve recommendation quality. Our experimental evaluation using real datasets shows the effectiveness of our algorithm compared to state-of-the-art recommendation algorithms.
Deep Reinforcement Learning (DRL) has achieved impressive success in many applications. A key component of many DRL models is a neural network representing a Q function, to estimate the expected cumulative reward following a state-action pair. The Q function neural network contains a lot of implicit knowledge about the RL problems, but often remains unexamined and uninterpreted. To our knowledge, this work develops the first mimic learning framework for Q functions in DRL. We introduce Linear Model U-trees (LMUTs) to approximate neural network predictions. An LMUT is learned using a novel on-line algorithm that is well-suited for an active play setting, where the mimic learner observes an ongoing interaction between the neural net and the environment. Empirical evaluation shows that an LMUT mimics a Q function substantially better than five baseline methods. The transparent tree structure of an LMUT facilitates understanding the network’s learned knowledge by analyzing feature influence, extracting rules, and highlighting the super-pixels in image inputs.
Survival analysis in the presence of multiple possible adverse events, i.e., competing risks, is a pervasive problem in many industries (healthcare, finance, etc.). Since only one event is typically observed, the incidence of an event of interest is often obscured by other related competing events. This nonidentifiability, or inability to estimate true cause-specific survival curves from empirical data, further complicates competing risk survival analysis. We introduce Siamese Survival Prognosis Network (SSPN), a novel deep learning architecture for estimating personalized risk scores in the presence of competing risks. SSPN circumvents the nonidentifiability problem by avoiding the estimation of cause-specific survival curves and instead determines pairwise concordant time-dependent risks, where longer event times are assigned lower risks. Furthermore, SSPN is able to directly optimize an approximation to the C-discrimination index, rather than relying on well-known metrics which are unable to capture the unique requirements of survival analysis with competing risks.
We reinterpreting the variational inference in a new perspective. Via this way, we can easily prove that EM algorithm, VAE, GAN, AAE, ALI(BiGAN) are all special cases of variational inference. The proof also reveals the loss of standard GAN is incomplete and it explains why we need to train GAN cautiously. From that, we find out a regularization term to improve stability of GAN training.
Gradient-based meta-learning techniques are both widely applicable and proficient at solving challenging few-shot learning and fast adaptation problems. However, they have the practical difficulties of operating in high-dimensional parameter spaces in extreme low-data regimes. We show that it is possible to bypass these limitations by learning a low-dimensional latent generative representation of model parameters and performing gradient-based meta-learning in this space with latent embedding optimization (LEO), effectively decoupling the gradient-based adaptation procedure from the underlying high-dimensional space of model parameters. Our evaluation shows that LEO can achieve state-of-the-art performance on the competitive 5-way 1-shot miniImageNet classification task.
This is a writeup of lectures on ‘statistics’ that have evolved from the 2009 Hadron Collider Physics Summer School at CERN to the forthcoming 2018 school at Fermilab. The emphasis is on foundations, using simple examples to illustrate the points that are still debated in the professional statistics literature. The three main approaches to interval estimation (Neyman confidence, Bayesian, likelihood ratio) are discussed and compared in detail, with and without nuisance parameters. Hypothesis testing is discussed mainly from the frequentist point of view, with pointers to the Bayesian literature. Various foundational issues are emphasized, including the conditionality principle and the likelihood principle.