Recently, convex formulations of low-rank matrix factorization problems have received considerable attention in machine learning. However, such formulations often require solving for a matrix of the size of the data matrix, making it challenging to apply them to large scale datasets. Moreover, in many applications the data can display structures beyond simply being low-rank, e.g., images and videos present complex spatio-temporal structures that are largely ignored by standard low-rank methods. In this paper we study a matrix factorization technique that is suitable for large datasets and captures additional structure in the factors by using a particular form of regularization that includes well-known regularizers such as total variation and the nuclear norm as particular cases. Although the resulting optimization problem is non-convex, we show that if the size of the factors is large enough, under certain conditions, any local minimizer for the factors yields a global minimizer. A few practical algorithms are also provided to solve the matrix factorization problem, and bounds on the distance from a given approximate solution of the optimization problem to the global optimum are derived. Examples in neural calcium imaging video segmentation and hyperspectral compressed recovery show the advantages of our approach on high-dimensional datasets.
We investigate methods for combining multiple self-supervised tasks–i.e., supervised tasks where data can be collected without manual labeling–in order to train a single visual representation. First, we provide an apples-to-apples comparison of four different self-supervised tasks using the very deep ResNet-101 architecture. We then combine tasks to jointly train a network. We also explore lasso regularization to encourage the network to factorize the information in its representation, and methods for ‘harmonizing’ network inputs in order to learn a more unified representation. We evaluate all methods on ImageNet classification, PASCAL VOC detection, and NYU depth prediction. Our results show that deeper networks work better, and that combining tasks–even via a naive multi-head architecture–always improves performance. Our best joint network nearly matches the PASCAL performance of a model pre-trained on ImageNet classification, and matches the ImageNet network on NYU depth prediction.
In recent years, many deep-learning based models are proposed for text classification. This kind of models well fits the training set from the statistical point of view. However, it lacks the capacity of utilizing instance-level information from individual instances in the training set. In this work, we propose to enhance neural network models by allowing them to leverage information from $k$-nearest neighbor (kNN) of the input text. Our model employs a neural network that encodes texts into text embeddings. Moreover, we also utilize $k$-nearest neighbor of the input text as an external memory, and utilize it to capture instance-level information from the training set. The final prediction is made based on features from both the neural network encoder and the kNN memory. Experimental results on several standard benchmark datasets show that our model outperforms the baseline model on all the datasets, and it even beats a very deep neural network model (with 29 layers) in several datasets. Our model also shows superior performance when training instances are scarce, and when the training set is severely unbalanced. Our model also leverages techniques such as semi-supervised training and transfer learning quite well.
Dependency graph, as a heterogeneous graph representing the intrinsic relationships between different pairs of system entities, is essential to many data analysis applications, such as root cause diagnosis, intrusion detection, etc. Given a well-trained dependency graph from a source domain and an immature dependency graph from a target domain, how can we extract the entity and dependency knowledge from the source to enhance the target? One way is to directly apply a mature dependency graph learned from a source domain to the target domain. But due to the domain variety problem, directly using the source dependency graph often can not achieve good performance. Traditional transfer learning methods mainly focus on numerical data and are not applicable. In this paper, we propose ACRET, a knowledge transfer based model for accelerating dependency graph learning from heterogeneous categorical event streams. In particular, we first propose an entity estimation model to filter out irrelevant entities from the source domain based on entity embedding and manifold learning. Only the entities with statistically high correlations are transferred to the target domain. On the surviving entities, we propose a dependency construction model for constructing the unbiased dependency relationships by solving a two-constraint optimization problem. The experimental results on synthetic and real-world datasets demonstrate the effectiveness and efficiency of ACRET. We also apply ACRET to a real enterprise security system for intrusion detection. Our method is able to achieve superior detection performance at least 20 days lead lag time in advance with more than 70% accuracy.
An important objective for analyzing real-world graphs is to achieve scalable performance on large, streaming graphs. A challenging and relevant example is the graph partition problem. As a combinatorial problem, graph partition is NP-hard, but existing relaxation methods provide reasonable approximate solutions that can be scaled for large graphs. Competitive benchmarks and challenges have proven to be an effective means to advance state-of-the-art performance and foster community collaboration. This paper describes a graph partition challenge with a baseline partition algorithm of sub-quadratic complexity. The algorithm employs rigorous Bayesian inferential methods based on a statistical model that captures characteristics of the real-world graphs. This strong foundation enables the algorithm to address limitations of well-known graph partition approaches such as modularity maximization. This paper describes various aspects of the challenge including: (1) the data sets and streaming graph generator, (2) the baseline partition algorithm with pseudocode, (3) an argument for the correctness of parallelizing the Bayesian inference, (4) different parallel computation strategies such as node-based parallelism and matrix-based parallelism, (5) evaluation metrics for partition correctness and computational requirements, (6) preliminary timing of a Python-based demonstration code and the open source C++ code, and (7) considerations for partitioning the graph in streaming fashion. Data sets and source code for the algorithm as well as metrics, with detailed documentation are available at GraphChallenge.org.
There are so many vehicles in the world and the number of vehicles is increasing rapidly. To alleviate the parking problems caused by that, the smart parking system has been developed. The parking planning is one of the most important parts of it. An effective parking planning strategy makes the better use of parking resources possible. In this paper, we present a feasible method to do parking planning. We transform the parking planning problem into a kind of linear assignment problem. We take vehicles as jobs and parking spaces as agents. We take distances between vehicles and parking spaces as costs for agents doing jobs. Then we design an algorithm for this particular assignment problem and solve the parking planning problem. The method proposed can give timely and efficient guide information to vehicles for a real time smart parking system. Finally, we show the effectiveness of the method with experiments over some data, which can simulate the situation of doing parking planning in the real world.
Multivariate time series (MTS) have become increasingly common in healthcare domains where human vital signs and laboratory results are collected for predictive diagnosis. Recently, there have been increasing efforts to visualize healthcare MTS data based on star charts or parallel coordinates. However, such techniques might not be ideal for visualizing a large MTS dataset, since it is difficult to obtain insights or interpretations due to the inherent high dimensionality of MTS. In this paper, we propose ‘m-TSNE’: a simple and novel framework to visualize high-dimensional MTS data by projecting them into a low-dimensional (2-D or 3-D) space while capturing the underlying data properties. Our framework is easy to use and provides interpretable insights for healthcare professionals to understand MTS data. We evaluate our visualization framework on two real-world datasets and demonstrate that the results of our m-TSNE show patterns that are easy to understand while the other methods’ visualization may have limitations in interpretability.
Sales forecast is an essential task in E-commerce and has a crucial impact on making informed business decisions. It can help us to manage the workforce, cash flow and resources such as optimizing the supply chain of manufacturers etc. Sales forecast is a challenging problem in that sales is affected by many factors including promotion activities, price changes, and user preferences etc. Traditional sales forecast techniques mainly rely on historical sales data to predict future sales and their accuracies are limited. Some more recent learning-based methods capture more information in the model to improve the forecast accuracy. However, these methods require case-by-case manual feature engineering for specific commercial scenarios, which is usually a difficult, time-consuming task and requires expert knowledge. To overcome the limitations of existing methods, we propose a novel approach in this paper to learn effective features automatically from the structured data using the Convolutional Neural Network (CNN). When fed with raw log data, our approach can automatically extract effective features from that and then forecast sales using those extracted features. We test our method on a large real-world dataset from CaiNiao.com and the experimental results validate the effectiveness of our method.
Analyzing large-scale graphs provides valuable insights in different application scenarios. While many graph processing systems working on top of distributed infrastructures have been proposed to deal with big graphs, the tasks of profiling and debugging their massive computations remain time consuming and error-prone. This paper presents GiViP, a visual profiler for distributed graph processing systems based on a Pregel-like computation model. GiViP captures the huge amount of messages exchanged throughout a computation and provides an interactive user interface for the visual analysis of the collected data. We show how to take advantage of GiViP to detect anomalies related to the computation and to the infrastructure, such as slow computing units and anomalous message patterns.
The recent, remarkable growth of machine learning has led to intense interest in the privacy of the data on which machine learning relies, and to new techniques for preserving privacy. However, older ideas about privacy may well remain valid and useful. This note reviews two recent works on privacy in the light of the wisdom of some of the early literature, in particular the principles distilled by Saltzer and Schroeder in the 1970s.
As more people move back into densely populated cities, bike sharing is emerging as an important mode of urban mobility. In a typical bike sharing system, riders arrive at a station and take a bike if it is available. After retrieving a bike, they ride it for a while, then return it to a station near their final destinations. Since space is limited in cities, each station has a finite capacity of docks, which cannot hold more bikes than its capacity. In this paper, we study bike sharing systems with stations having a finite capacity. By an appropriate scaling of our stochastic model, we prove a central limit theorem for an empirical process of the number of stations with $k$ bikes. The central limit theorem provides insight on the variance, and sample path dynamics of large scale bike sharing systems. We also leverage our results to estimate confidence intervals for various performance measures such as the proportion of empty stations, the proportion of full stations, and the number of bikes in circulation. These performance measures have the potential to inform the operations and design of future bike sharing systems.
Wireless sensor networks usually comprise a large number of sensors monitoring changes in variables. These changes in variables represent changes in physical quantities. The changes can occur for various reasons; these reasons are highlighted in this work. Outliers are unusual measurements. Outliers are important; they are information-bearing occurrences. This work seeks to identify them based on an approach presented in [1]. A critical review of most previous works in this area has been presented in [2], and few more are considered here just to set the stage. The main work can be described as this; given a set of measurements from sensors that represent a normal situation, [1] proceeds by first estimating the probability density function (pdf) of the set using a data-split approach, then estimate the entropy of the set using the arithmetic mean as an approximation for the expectation. The increase in entropy that occurs when strange data is recorded is used to detect unusual measurements in the test set depending on the desired confidence interval or false alarm rate. The results presented in [1] have been confirmed for different test signals such as the Gaussian, Beta, in one dimension and beta in two dimensions, and a beta and uniform mixture distribution in two dimensions. Finally, the method was confirmed on real data and the results are presented. The major drawbacks of [1] were identified, and a method that seeks to mitigate this using the Bhattacharyya distance is presented. This method detects more subtle anomalies, especially the type that would pass as normal in [1]. Finally, recommendations for future research are presented: the subject of interpretability, especially for subtle measurements, being the most elusive as of today.
Social media datasets, especially Twitter tweets, are popular in the field of text classification. Tweets are a valuable source of micro-text (sometimes referred to as ‘micro-blogs’), and have been studied in domains such as sentiment analysis, recommendation systems, spam detection, clustering, among others. Tweets often include keywords referred to as ‘Hashtags’ that can be used as labels for the tweet. Using tweets encompassing 50 labels, we studied the impact of word versus character-level feature selection and extraction on different learners to solve a multi-class classification task. We show that feature extraction of simple character-level groups performs better than simple word groups and pre-processing methods like normalizing using Porter’s Stemming and Part-of-Speech (‘POS’)-Lemmatization.
It is known that the common factors in a large panel of data can be consistently estimated by the method of principal components, and principal components can be constructed by iterative least squares regressions. Replacing least squares with ridge regressions turns out to have the effect of shrinking the singular values of the common component and possibly reducing its rank. The method is used in the machine learning literature to recover low-rank matrices. We study the procedure from the perspective of estimating a minimum-rank approximate factor model. We show that the constrained factor estimates are biased but can be more efficient in terms of mean-squared errors. Rank consideration suggests a data-dependent penalty for selecting the number of factors. The new criterion is more conservative in cases when the nominal number of factors is inflated by the presence of weak factors or large measurement noise. The framework is extended to incorporate a priori linear constraints on the loadings. We provide asymptotic results that can be used to test economic hypotheses.
Despite significant progress of deep learning in the field of computer vision, there has not been a software library that covers these methods in a unifying manner. We introduce ChainerCV, a software library that is intended to fill this gap. ChainerCV supports numerous neural network models as well as software components needed to conduct research in computer vision. These implementations emphasize simplicity, flexibility and good software engineering practices. The library is designed to perform on par with the results reported in published papers and its tools can be used as a baseline for future research in computer vision. Our implementation includes sophisticated models like Faster R-CNN and SSD, and covers tasks such as object detection and semantic segmentation.
Large scale image dataset and deep convolutional neural network (DCNN) are two primary driving forces for the rapid progress made in generic object recognition tasks in recent years. While lots of network architectures have been continuously designed to pursue lower error rates, few efforts are devoted to enlarge existing datasets due to high labeling cost and unfair comparison issues. In this paper, we aim to achieve lower error rate by augmenting existing datasets in an automatic manner. Our method leverages both Web and DCNN, where Web provides massive images with rich contextual information, and DCNN replaces human to automatically label images under guidance of Web contextual information. Experiments show our method can automatically scale up existing datasets significantly from billions web pages with high accuracy, and significantly improve the performance on object recognition tasks by using the automatically augmented datasets, which demonstrates that more supervisory information has been automatically gathered from the Web. Both the dataset and models trained on the dataset are made publicly available.
We propose new methods for Support Vector Machines (SVMs) using tree architecture for multi-class classification. In each node of the tree, we select an appropriate binary classifier using entropy and generalization error estimation, then group the examples into positive and negative classes based on the selected classifier and train a new classifier for use in the classification phase. The proposed methods can work in time complexity between O(log2N) to O(N) where N is the number of classes. We compared the performance of our proposed methods to the traditional techniques on the UCI machine learning repository using 10-fold cross-validation. The experimental results show that our proposed methods are very useful for the problems that need fast classification time or problems with a large number of classes as the proposed methods run much faster than the traditional techniques but still provide comparable accuracy.
Time series often contain outliers and level shifts or structural changes. These unexpected events are of the utmost importance in fraud detection, as they may pinpoint suspicious transactions. The presence of such unusual events can easily mislead conventional time series analysis and yield erroneous conclusions. In this paper we provide a unified framework for detecting outliers and level shifts in short time series that may have a seasonal pattern. The approach combines ideas from the FastLTS algorithm for robust regression with alternating least squares. The double wedge plot is proposed, a graphical display which indicates outliers and potential level shifts. The methodology is illustrated on real and artificial time series.
In school, a teacher plays an important role in various classroom teaching patterns. Likewise to this human learning activity, the learning using privileged information (LUPI) paradigm provides additional information generated by the teacher to ‘teach’ learning algorithms during the training stage. Therefore, this novel learning paradigm is a typical Teacher-Student Interaction mechanism. This paper is the first to present a random vector functional link network based on the LUPI paradigm, called RVFL+. Rather than simply combining two existing approaches, the newly-derived RVFL+ fills the gap between neural networks and the LUPI paradigm, which offers an alternative way to train RVFL networks. Moreover, the proposed RVFL+ can perform in conjunction with the kernel trick for highly complicated nonlinear feature learning, which is termed KRVFL+. Furthermore, the statistical property of the proposed RVFL+ is investigated, and we derive a sharp and high-quality generalization error bound based on the Rademacher complexity. Competitive experimental results on 14 real-world datasets illustrate the great effectiveness and efficiency of the novel RVFL+ and KRVFL+, which can achieve better generalization performance than state-of-the-art algorithms.
With the availability of large databases and recent improvements in deep learning methodology, the performance of AI systems is reaching or even exceeding the human level on an increasing number of complex tasks. Impressive examples of this development can be found in domains such as image classification, sentiment analysis, speech understanding or strategic game playing. However, because of their nested non-linear structure, these highly successful machine learning and artificial intelligence models are usually applied in a black box manner, i.e., no information is provided about what exactly makes them arrive at their predictions. Since this lack of transparency can be a major drawback, e.g., in medical applications, the development of methods for visualizing, explaining and interpreting deep learning models has recently attracted increasing attention. This paper summarizes recent developments in this field and makes a plea for more interpretability in artificial intelligence. Furthermore, it presents two approaches to explaining predictions of deep learning models, one method which computes the sensitivity of the prediction with respect to changes in the input and one approach which meaningfully decomposes the decision in terms of the input variables. These methods are evaluated on three classification tasks.
The areas of machine learning and communication technology are converging. Today’s communications systems generate a huge amount of traffic data, which can help to significantly enhance the design and management of networks and communication components when combined with advanced machine learning methods. Furthermore, recently developed end-to-end training procedures offer new ways to jointly optimize the components of a communication system. Also in many emerging application fields of communication technology, e.g., smart cities or internet of things, machine learning methods are of central importance. This paper gives an overview over the use of machine learning in different areas of communications and discusses two exemplar applications in wireless networking. Furthermore, it identifies promising future research topics and discusses their potential impact.
In a real-world setting, visual recognition systems can be brought to make predictions for images belonging to previously unknown class labels. In order to make semantically meaningful predictions for such inputs, we propose a two-step approach that utilizes information from knowledge graphs. First, a knowledge-graph representation is learned to embed a large set of entities into a semantic space. Second, an image representation is learned to embed images into the same space. Under this setup, we are able to predict structured properties in the form of relationship triples for any open-world image. This is true even when a set of labels has been omitted from the training protocols of both the knowledge graph and image embeddings. Furthermore, we append this learning framework with appropriate smoothness constraints and show how prior knowledge can be incorporated into the model. Both these improvements combined increase performance for visual recognition by a factor of six compared to our baseline. Finally, we propose a new, extended dataset which we use for experiments.
Machine learning is widely used in security applications, particularly in the form of statistical classification aimed at distinguishing benign from malicious entities. Recent research has shown that such classifiers are often vulnerable to evasion attacks, whereby adversaries change behavior to be categorized as benign while preserving malicious functionality. Research into evasion attacks has followed two paradigms: attacks in problem space, where the actual malicious instance is modified, and attacks in feature space, where the attack is abstracted into modifying numerical features of an instance to evade a classifier. In contrast, research into designing evasion-robust classifiers generally relies on feature space attack models. We make several contributions to address this gap, using PDF malware detection as a case study. First, we present a systematic retraining procedure which uses an automated problem space attack generator to design a more robust PDF malware detector. Second, we demonstrate that replacing problem space attacks with feature space attacks dramatically reduces the robustness of the resulting classifier, severely undermining feature space defense methods to date. Third, we demonstrate the existence of conserved (or invariant) features, and show how these can be leveraged to design evasion- robust classifiers that are nearly as effective, and far more efficient, than those relying on the problem space attack. Finally, we present a general approach for identifying conserved features.
Operating System-level virtualization technology, or containers as they are commonly known, represents the next generation of light-weight virtualization, and is primarily represented by Docker. However, Docker’s current design does not complement the SLAs from Docker-based container cloud offerings promising both reliability and high availability. The tight coupling between the containers and the Docker daemon proves fatal for the containers’ uptime during daemon’s unavailability due to either failure or upgrade. We present the design and implementation of HYDRA, which fundamentally isolates the containers from the running daemon. Our evaluation shows that HYDRA imposes only moderate overheads even under load, while achieving much higher container availability.