To assure cyber security of an enterprise, typically SIEM (Security Information and Event Management) system is in place to normalize security event from different preventive technologies and flag alerts. Analysts in the security operation center (SOC) investigate the alerts to decide if it is truly malicious or not. However, generally the number of alerts is overwhelming with majority of them being false positive and exceeding the SOC’s capacity to handle all alerts. There is a great need to reduce the false positive rate as much as possible. While most previous research focused on network intrusion detection, we focus on risk detection and propose an intelligent Deep Belief Network machine learning system. The system leverages alert information, various security logs and analysts’ investigation results in a real enterprise environment to flag hosts that have high likelihood of being compromised. Text mining and graph based method are used to generate targets and create features for machine learning. In the experiment, Deep Belief Network is compared with other machine learning algorithms, including multi-layer neural network, random forest, support vector machine and logistic regression. Results on real enterprise data indicate that the deep belief network machine learning system performs better than other algorithms for our problem and is six times more effective than current rule-based system. We also implement the whole system from data collection, label creation, feature engineering to host score generation in a real enterprise production environment.
Recommender systems play a crucial role in mitigating the problem of information overload by suggesting users’ personalized items or services. The vast majority of traditional recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during the interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users’ feedbacks. In particular, we introduce an online user-agent interacting environment simulator, which can pre-train and evaluate model parameters offline before applying the model online. Moreover, we validate the importance of list-wise recommendations during the interactions between users and agent, and develop a novel approach to incorporate them into the proposed framework LIRD for list-wide recommendations. The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.
In this work, methods to detect one or several change points in multivariate time series are reviewed. They include retrospective (off-line) procedure such as maximum likelihood estimation, regression, kernel methods, etc. In this large area of research, applications are numerous and diverse; many different models and operational constraints (on precision, complexity,…) exist. A formal framework for change point detection is introduced to give sens to this significant body of work. Precisely, all methods are described as a collection of three elements: a cost function, a search method and a constraint on the number of changes to detect. For a given method, we detail the assumed signal model, the associated algorithm, theoretical guarantees (if any) and the application domain. This approach is intended to facilitate prototyping of change point detection methods: for a given segmentation task, one can appropriately choose among the described elements to design an algorithm.
Identifying causal relationships from observation data is difficult, in large part, due to the presence of hidden common causes. In some cases, where just the right patterns of conditional independence and dependence lie in the data—for example, Y-structures—it is possible to identify cause and effect. In other cases, the analyst deliberately makes an uncertain assumption that hidden common causes are absent, and infers putative causal relationships to be tested in a randomized trial. Here, we consider a third approach, where there are sufficient clues in the data such that hidden common causes can be inferred.
Data cleaning, or the process of detecting and repairing inaccurate or corrupt records in the data, is inherently human-driven. State of the art systems assume cleaning experts can access the data (or a sample of it) to tune the cleaning process. However, in many cases, privacy constraints disallow unfettered access to the data. To address this challenge, we observe and provide empirical evidence that data cleaning can be achieved without access to the sensitive data, but with access to a (noisy) query interface that supports a small set of linear counting query primitives. Motivated by this, we present DPClean, a first of a kind system that allows engineers tune data cleaning workflows while ensuring differential privacy. In DPClean, a cleaning engineer can pose sequences of aggregate counting queries with error tolerances. A privacy engine translates each query into a differentially private mechanism that returns an answer with error matching the specified tolerance, and allows the data owner track the overall privacy loss. With extensive experiments using human and simulated cleaning engineers on blocking and matching tasks, we demonstrate that our approach is able to achieve high cleaning quality while ensuring a reasonable privacy loss.
Deep learning is at the heart of the current rise of machine learning and artificial intelligence. In the field of Computer Vision, it has become the workhorse for applications ranging from self-driving cars to surveillance and security. Whereas deep neural networks have demonstrated phenomenal success (often beyond human capabilities) in solving complex problems, recent studies show that they are vulnerable to adversarial attacks in the form of subtle perturbations to inputs that lead a model to predict incorrect outputs. For images, such perturbations are often too small to be perceptible, yet they completely fool the deep learning models. Adversarial attacks pose a serious threat to the success of deep learning in practice. This fact has lead to a large influx of contributions in this direction. This article presents the first comprehensive survey on adversarial attacks on deep learning in Computer Vision. We review the works that design adversarial attacks, analyze the existence of such attacks and propose defenses against them. To emphasize that adversarial attacks are possible in practical conditions, we separately review the contributions that evaluate adversarial attacks in the real-world scenarios. Finally, we draw on the literature to provide a broader outlook of the research direction.
Partially identified models occur commonly in economic applications. A common problem in this literature is a regression problem with bracketed (interval-censored) outcome variable Y, which creates a set-identified parameter of interest. The recent studies have only considered finite-dimensional linear regression in such context. To incorporate more complex controls into the problem, we consider a partially linear projection of Y on the set functions that are linear in treatment/policy variables and nonlinear in the controls. We characterize the identified set for the linear component of this projection and propose an estimator of its support function. Our estimator converges at parametric rate and has asymptotic normality properties. It may be useful for labor economics applications that involve bracketed salaries and rich, high-dimensional demographic data about the subjects of the study.
Production distributed systems are challenging to formally verify, in particular when they are based on distributed protocols that are not rigorously described or fully understood. In this paper, we derive models and properties for two core distributed protocols used in eventually consistent production key-value stores such as Riak and Cassandra. We propose a novel modeling called certified program models, where complete distributed systems are captured as programs written in traditional systems languages such as concurrent C. Specifically, we model the read-repair and hinted-handoff recovery protocols as concurrent C programs, test them for conformance with real systems, and then verify that they guarantee eventual consistency, modeling precisely the specification as well as the failure assumptions under which the results hold.
Research in analogical reasoning suggests that higher-order cognitive functions such as abstract reasoning, far transfer, and creativity are founded on recognizing structural similarities among relational systems. Here we integrate theories of analogy with the computational framework of reinforcement learning (RL). We propose a psychology theory that is a computational synergy between analogy and RL, in which analogical comparison provides the RL learning algorithm with a measure of relational similarity, and RL provides feedback signals that can drive analogical learning. Simulation results support the power of this approach.
The extraction of useful deep features is important for many computer vision tasks. Deep features extracted from classification networks have proved to perform well in those tasks. To obtain features of greater usefulness, end-to-end distance metric learning (DML) has been applied to train the feature extractor directly. However, in these DML studies, there were no equitable comparisons between features extracted from a DML-based network and those from a softmax-based network. In this paper, by presenting objective comparisons between these two approaches under the same network architecture, we show that the softmax-based features perform competitive, or even better, to the state-of-the-art DML features when the size of the dataset, that is, the number of training samples per class, is large. The results suggest that softmax-based features should be properly taken into account when evaluating the performance of deep features.
An algorithm is proposed for solving stochastic and finite sum minimization problems. Based on a trust region methodology, the algorithm employs normalized steps, at least as long as the norms of the stochastic gradient estimates are within a user-defined interval. The complete algorithm—which dynamically chooses whether or not to employ normalized steps—is proved to have convergence guarantees that are similar to those possessed by a traditional stochastic gradient approach under various sets of conditions related to the accuracy of the stochastic gradient estimates and choice of stepsize sequence. The results of numerical experiments are presented when the method is employed to minimize convex and nonconvex machine learning test problems, illustrating that the method can outperform a traditional stochastic gradient approach.
In many applications such as color image processing, data has more than one piece of information associated with each spatial coordinate, and in such cases the classical optimal mass transport (OMT) must be generalized to handle vector-valued or matrix-valued densities. In this paper, we discuss the vector and matrix optimal mass transport and present three contributions. We first present a rigorous mathematical formulation for these setups and provide analytical results including existence of solutions and strong duality. Next, we present a simple, scalable, and parallelizable methods to solve the vector and matrix-OMT problems. Finally, we implement the proposed methods on a CUDA GPU and present experiments and applications.
In this letter, we present a unified Bayesian inference framework for generalized linear models (GLM) which iteratively reduces the GLM problem to a sequence of standard linear model (SLM) problems. This framework provides new perspectives on some established GLM algorithms derived from SLM ones and also suggests novel extensions for some other SLM algorithms. Specific instances elucidated under such framework are the GLM versions of approximate message passing (AMP), vector AMP (VAMP), and sparse Bayesian learning (SBL). It is proved that the resultant GLM version of AMP is equivalent to the well-known generalized approximate message passing (GAMP). Numerical results for 1-bit quantized compressed sensing (CS) demonstrate the effectiveness of this unified framework.
A graph database is a database where the data structures for the schema and/or instances are modeled as a (labeled)(directed) graph or generalizations of it, and where querying is expressed by graph-oriented operations and type constructors. In this article we present the basic notions of graph databases, give an historical overview of its main development, and study the main current systems that implement them.
In this work, we have proposed a system theoretic method to compute sensitivities of different lines for $N-k$ contingency analysis in power network. We have formulated the $N-k$ contingency analysis as the stability problem of power network with uncertain links. We have derived a necessary condition for stochastic stability of the power network with the link uncertainty. The necessary condition is then used to rank order the contingencies. We have shown due to interaction between different uncertainties the ranking can substantially change. The state of the art $N-k$ contingency analysis does not consider the possibility of interference between link uncertainties and rank the links according to the severity of $N-1$ contingencies. We have presented simulation results for New England $39$ bus system as a support of our claim.
Learning probability distributions on the weights of neural networks (NNs) has recently proven beneficial in many applications. Bayesian methods, such as Stein variational gradient descent (SVGD), offer an elegant framework to reason about NN model uncertainty. However, by assuming independent Gaussian priors for the individual NN weights (as often applied), SVGD does not impose prior knowledge that there is often structural information (dependence) among weights. We propose efficient posterior learning of structural weight uncertainty, within an SVGD framework, by employing matrix variate Gaussian priors on NN parameters. We further investigate the learned structural uncertainty in sequential decision-making problems, including contextual bandits and reinforcement learning. Experiments on several synthetic and real datasets indicate the superiority of our model, compared with state-of-the-art methods.
We introduce an efficient algorithmic framework for model selection in online learning, also known as parameter-free online learning. Departing from previous work, which has focused on highly structured function classes such as nested balls in Hilbert space, we propose a generic meta-algorithm framework that achieves online model selection oracle inequalities under minimal structural assumptions. We give the first computationally efficient parameter-free algorithms that work in arbitrary Banach spaces under mild smoothness assumptions; previous results applied only to Hilbert spaces. We further derive new oracle inequalities for matrix classes, non-nested convex sets, and $\mathbb{R}^{d}$ with generic regularizers. Finally, we generalize these results by providing oracle inequalities for arbitrary non-linear classes in the online supervised learning model. These results are all derived through a unified meta-algorithm scheme using a novel ‘multi-scale’ algorithm for prediction with expert advice based on random playout, which may be of independent interest.
The discovery of community structures in social networks has gained significant attention since it is a fundamental problem in understanding the networks’ topology and functions. However, most social network data are collected from partially observable networks with both missing nodes and edges. In this paper, we address a new problem of detecting overlapping community structures in the context of such an incomplete network, where communities in the network are allowed to overlap since nodes belong to multiple communities at once. To solve this problem, we introduce KroMFac, a new framework that conducts community detection via regularized nonnegative matrix factorization (NMF) based on the Kronecker graph model. Specifically, from a generative parameter matrix acquired by the expectation-maximization (EM) algorithm, we first estimate the missing part of the network. As our major contribution to the proposed framework, to improve community detection accuracy, we then characterize and select influential nodes (which tend to have high degrees) by ranking, and add them to the existing graph. Finally, we uncover the community structures by solving the regularized NMF-aided optimization problem in terms of maximizing the likelihood of the underlying graph. Furthermore, adopting normalized mutual information (NMI), we empirically show superiority of our KroMFac approach over two baseline schemes.
A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave near zero stable minima of the empirical error as gradient system in a quadratic potential with degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient. Our proposition provides the extension to deep networks of key properties of gradient descent methods for linear networks, that as, suggested in (1), can be the key to understand generalization. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, and asymptotically converging to the minimum norm solution. This implies that there is usually an optimum early stopping that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution which guarantees good classification error for ‘low noise’ datasets. The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.
Inspired by coarse-graining approaches used in physics, we show how similar algorithms can be adapted for data. The resulting algorithms are based on layered tree tensor networks and scale linearly with both the dimension of the input and the training set size. Computing most of the layers with an unsupervised algorithm, then optimizing just the top layer for supervised classification of the MNIST and fashion-MNIST data sets gives very good results. We also discuss mixing a prior guess for supervised weights together with an unsupervised representation of the data, yielding a smaller number of features nevertheless able to give good performance.
This paper presents the integration of Dijkstra’s algorithm within a Blackboard framework to optimize the selection of web services from service providers. In addition, methods are presented how dynamic changes during the workflow execution can be handled; specifically, how changes of the service parameters have effects on the system. For justification of our approach, and to show practical feasibility, a sample implementation is presented.
Discovering significant itemsets is one of the fundamental problems in data mining. It has recently been shown that constraint programming is a flexible way to tackle data mining tasks. With a constraint programming approach, we can easily express and efficiently answer queries with users constraints on items. However, in many practical cases it is possible that queries also express users constraints on the dataset itself. For instance, asking for a particular itemset in a particular part of the dataset. This paper presents a general constraint programming model able to handle any kind of query on the items or the dataset for itemset mining.
Online job boards are one of the central components of modern recruitment industry. With millions of candidates browsing through job postings everyday, the need for accurate, effective, meaningful, and transparent job recommendations is apparent more than ever. While recommendation systems are successfully advancing in variety of online domains by creating social and commercial value, the job recommendation domain is less explored. Existing systems are mostly focused on content analysis of resumes and job descriptions, relying heavily on the accuracy and coverage of the semantic analysis and modeling of the content in which case, they end up usually suffering from rigidity and the lack of implicit semantic relations that are uncovered from users’ behavior and could be captured by Collaborative Filtering (CF) methods. Few works which utilize CF do not address the scalability challenges of real-world systems and the problem of cold-start. In this paper, we propose a scalable item-based recommendation system for online job recommendations. Our approach overcomes the major challenges of sparsity and scalability by leveraging a directed graph of jobs connected by multi-edges representing various behavioral and contextual similarity signals. The short lived nature of the items (jobs) in the system and the rapid rate in which new users and jobs enter the system make the cold-start a serious problem hindering CF methods. We address this problem by harnessing the power of deep learning in addition to user behavior to serve hybrid recommendations. Our technique has been leveraged by CareerBuilder.com which is one of the largest job boards in the world to generate high-quality recommendations for millions of users.
In the era of big data, data may come from multiple sources, known as multi-view data. Multi-view clustering aims at generating better clusters by exploiting complementary and consistent information from multiple views rather than relying on the individual view. Due to inevitable system errors caused by data-captured sensors or others, the data in each view may be erroneous. Various types of errors behave differently and inconsistently in each view. More precisely, error could exhibit as noise and corruptions in reality. Unfortunately, none of the existing multi-view clustering approaches handle all of these error types. Consequently, their clustering performance is dramatically degraded. In this paper, we propose a novel Markov chain method for Error-Robust Multi-View Clustering (EMVC). By decomposing each view into a shared transition probability matrix and error matrix and imposing structured sparsity-inducing norms on error matrices, we characterize and handle typical types of errors explicitly. To solve the challenging optimization problem, we propose a new efficient algorithm based on Augmented Lagrangian Multipliers and prove its convergence rigorously. Experimental results on various synthetic and real-world datasets show the superiority of the proposed EMVC method over the baseline methods and its robustness against different types of errors.
Text representation using neural word embeddings has proven efficacy in many NLP applications. Recently, a lot of research interest goes beyond word embeddings by adapting the traditional word embedding models to learn vectors of multiword expressions (concepts/entities). However, current methods are limited to textual knowledge bases only (e.g., Wikipedia). In this paper, we propose a novel approach for learning concept vectors from two large scale knowledge bases (Wikipedia, and Probase). We adapt the skip-gram model to seamlessly learn from the knowledge in Wikipedia text and Probase concept graph. We evaluate our concept embedding models intrinsically on two tasks: 1) analogical reasoning where we achieve a state-of-the-art performance of 91% on semantic analogies, 2) concept categorization where we achieve a state-of-the-art performance on two benchmark datasets achieving categorization accuracy of 100% on one and 98% on the other. Additionally, we present a case study to extrinsically evaluate our model on unsupervised argument type identification for neural semantic parsing. We demonstrate the competitive accuracy of our unsupervised method and its ability to better generalize to out of vocabulary entity mentions compared to the tedious and error prone methods which depend on gazetteers and regular expressions.
Restricted Boltzmann machines (RBMs) and their extensions, often called ‘deep-belief networks’, are very powerful neural networks that have found widespread applicability in the fields of machine learning and big data. The standard way to training these models resorts to an iterative unsupervised procedure based on Gibbs sampling, called ‘contrastive divergence’, and additional supervised tuning via back-propagation. However, this procedure has been shown not to follow any gradient and can lead to suboptimal solutions. In this paper, we show a very efficient alternative to contrastive divergence by means of simulations of digital memcomputing machines (DMMs). We test our approach on pattern recognition using the standard MNIST data set of hand-written numbers. DMMs sample very effectively the vast phase space defined by the probability distribution of RBMs over the test sample inputs, and provide a very good approximation close to the optimum. This efficient search significantly reduces the number of generative pre-training iterations necessary to achieve a given level of accuracy in the MNIST data set, as well as a total performance gain over the traditional approaches. In fact, the acceleration of the pre-training achieved by simulating DMMs is comparable to, in number of iterations, the recently reported hardware application of the quantum annealing method on the same network and data set. Notably, however, DMMs perform far better than the reported quantum annealing results in terms of quality of the training. Our approach is agnostic about the connectivity of the network. Therefore, it can be extended to train full Boltzmann machines, and even deep networks at once.
Multimodal models have been proven to outperform text-based models on learning semantic word representations. Almost all previous multimodal models typically treat the representations from different modalities equally. However, it is obvious that information from different modalities contributes differently to the meaning of words. This motivates us to build a multimodal model that can dynamically fuse the semantic representations from different modalities according to different types of words. To that end, we propose three novel dynamic fusion methods to assign importance weights to each modality, in which weights are learned under the weak supervision of word association pairs. The extensive experiments have demonstrated that the proposed methods outperform strong unimodal baselines and state-of-the-art multimodal models.
Although deep learning has historical roots going back decades, neither the term ‘deep learning’ nor the approach was popular just over five years ago, when the field was reignited by papers such as Krizhevsky, Sutskever and Hinton’s now classic (2012) deep network model of Imagenet. What has the field discovered in the five subsequent years? Against a background of considerable progress in areas such as speech recognition, image recognition, and game playing, and considerable enthusiasm in the popular press, I present ten concerns for deep learning, and suggest that deep learning must be supplemented by other techniques if we are to reach artificial general intelligence.
How should one perform matching in observational studies when the units are text documents? The lack of randomized assignment of documents into treatment and control groups may lead to systematic differences between groups on high-dimensional and latent features of text such as topical content and sentiment. Standard balance metrics, used to measure the quality of a matching method, fail in this setting. We decompose text matching methods into two parts: (1) a text representation, and (2) a distance metric, and present a framework for measuring the quality of text matches experimentally using human subjects. We consider 28 potential methods, and find that representing text as term vectors and matching on cosine distance significantly outperform alternative representations and distance metrics. We apply our chosen method to a substantive debate in the study of media bias using a novel data set of front page news articles from thirteen news sources. Media bias is composed of topic selection bias and presentation bias; using our matching method to control for topic selection, we find that both components contribute significantly to media bias, though some news sources rely on one component more than the other.
Predictive modelling and supervised learning are central to modern data science. With predictions from an ever-expanding number of supervised black-box strategies – e.g., kernel methods, random forests, deep learning aka neural networks – being employed as a basis for decision making processes, it is crucial to understand the statistical uncertainty associated with these predictions. As a general means to approach the issue, we present an overarching framework for black-box prediction strategies that not only predict the target but also their own predictions’ uncertainty. Moreover, the framework allows for fair assessment and comparison of disparate prediction strategies. For this, we formally consider strategies capable of predicting full distributions from feature variables, so-called probabilistic supervised learning strategies. Our work draws from prior work including Bayesian statistics, information theory, and modern supervised machine learning, and in a novel synthesis leads to (a) new theoretical insights such as a probabilistic bias-variance decomposition and an entropic formulation of prediction, as well as to (b) new algorithms and meta-algorithms, such as composite prediction strategies, probabilistic boosting and bagging, and a probabilistic predictive independence test. Our black-box formulation also leads (c) to a new modular interface view on probabilistic supervised learning and a modelling workflow API design, which we have implemented in the newly released skpro machine learning toolbox, extending the familiar modelling interface and meta-modelling functionality of sklearn. The skpro package provides interfaces for construction, composition, and tuning of probabilistic supervised learning strategies, together with orchestration features for validation and comparison of any such strategy – be it frequentist, Bayesian, or other.
Advertisements