The growth of Internet commerce has stimulated the use of collaborative filtering (CF) algorithms as recommender systems. A CF algorithm recommends items of interest to the target user by leveraging the votes given by other similar users. In a standard CF framework, it is assumed that the credibility of every voting user is exactly the same with respect to the target user. This assumption frequently fails to hold and thus may lead to misleading recommendations in many practical applications. A natural countermeasure is to design a trust-aware CF (TaCF) algorithm, which can take account of the difference in the credibilities of the voting users when performing CF. To this end, this paper presents a trust inference approach, which can predict the implicit trust of the target user in every voting user from a sparse explicit trust matrix. Then an improved CF algorithm termed iTrace is proposed, which takes advantage of both the explicit and the predicted implicit trust to provide recommendations within the CF framework. An empirical evaluation on a public dataset demonstrates that the proposed algorithm provides a significant improvement in recommendation quality in terms of mean absolute error (MAE).
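As a rough illustration of how trust can enter a CF prediction, the sketch below uses trust values as the weights in a mean-centred weighted average and scores predictions with MAE. This is a generic textbook-style formulation, not the iTrace algorithm itself; the function names and the `(trust, rating, neighbour_mean)` tuple layout are assumptions made for the example.

```python
def predict_rating(target_mean, neighbours):
    """Trust-weighted, mean-centred prediction for one item.

    neighbours: list of (trust, rating, neighbour_mean) tuples, where
    trust is the (explicit or inferred) trust of the target user in
    that voting user. Hypothetical layout chosen for this sketch.
    """
    numerator = sum(t * (r - m) for t, r, m in neighbours)
    denominator = sum(abs(t) for t, _, _ in neighbours)
    if denominator == 0:
        # No trusted voter rated the item: fall back to the user's mean.
        return target_mean
    return target_mean + numerator / denominator


def mean_absolute_error(predicted, actual):
    """MAE: average absolute gap between predicted and true ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

Setting every trust value to the same constant recovers the equal-credibility assumption of standard CF.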
Google uses continuous streams of data from industry partners in order to deliver accurate results to users. Unexpected drops in traffic can indicate an underlying issue and serve as an early warning that remedial action may be necessary. Detecting such drops is non-trivial because streams are variable and noisy, with roughly regular spikes (in many different shapes) in traffic data. We investigated the question of whether or not we can predict anomalies in these data streams. Our goal is to utilize Machine Learning and statistical approaches to classify anomalous drops in periodic, but noisy, traffic patterns. Since we do not have a large body of labeled examples to directly apply supervised learning for anomaly classification, we approached the problem in two parts. First, we used TensorFlow to train our various models, including DNNs, RNNs, and LSTMs, to perform regression and predict the expected value in the time series. Second, we created anomaly detection rules that compared the actual values to predicted values. Since the problem requires finding sustained anomalies, rather than just short delays or momentary inactivity in the data, our two detection methods focused on continuous sections of activity rather than just single points. We tried multiple combinations of our models and rules and found that using the intersection of our two anomaly detection methods proved to be an effective method of detecting anomalies on almost all of our models. In the process we also found that not all data fell within our experimental assumptions, as one data stream had no periodicity, and therefore no time-based model could predict it.
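The idea of flagging sustained drops, rather than single points, by comparing actual values to predicted values can be sketched as follows. The abstract does not specify the two detection rules, so the relative-drop threshold, the minimum run length, and all names below are hypothetical choices for illustration; predictions are assumed to be positive.

```python
def sustained_drops(actual, predicted, rel_drop=0.5, min_len=3):
    """Return (start, end) index pairs, end exclusive, of runs in which
    the actual value stays below predicted * (1 - rel_drop) for at
    least min_len consecutive points."""
    runs = []
    start = None
    n = min(len(actual), len(predicted))
    for i in range(n):
        below = actual[i] < predicted[i] * (1.0 - rel_drop)
        if below and start is None:
            start = i                      # a candidate run begins
        elif not below and start is not None:
            if i - start >= min_len:       # keep only sustained runs
                runs.append((start, i))
            start = None
    if start is not None and n - start >= min_len:
        runs.append((start, n))            # run reaches the end of the series
    return runs


def flagged_points(runs):
    """Expand runs into the set of flagged indices."""
    return {i for s, e in runs for i in range(s, e)}


# Intersection of two (hypothetical) rules: keep points flagged by both.
actual = [10, 10, 2, 2, 2, 10, 1, 10]
predicted = [10] * 8
rule_a = flagged_points(sustained_drops(actual, predicted, 0.5, 3))
rule_b = flagged_points(sustained_drops(actual, predicted, 0.7, 2))
both = rule_a & rule_b
```

The single-point dip at index 6 is ignored by both rules because it is not sustained.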
Classifying pages or text lines into font categories aids transcription because single-font Optical Character Recognition (OCR) is generally more accurate than omni-font OCR. We present a simple framework based on Convolutional Neural Networks (CNNs), where a CNN is trained to classify small patches of text into predefined font classes. To classify page or line images, we average the CNN predictions over densely extracted patches. We show that this method achieves state-of-the-art performance on a challenging dataset of 40 Arabic computer fonts with 98.8% line-level accuracy. This same method also achieves the highest reported accuracy of 86.6% in predicting paleographic scribal script classes at the page level on medieval Latin manuscripts. Finally, we analyze what features are learned by the CNN on Latin manuscripts and find evidence that the CNN is both learning the defining morphological differences between scribal script classes and overfitting to class-correlated nuisance factors. We propose a novel form of data augmentation that improves robustness to text darkness, further increasing classification performance.
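The patch-averaging step can be illustrated with a minimal sketch: given per-patch class probabilities (here plain lists standing in for CNN outputs, which are not reproduced), the line or page label is the class with the highest mean probability. The function name and input layout are assumptions for the example.

```python
def classify_by_patch_average(patch_probs):
    """patch_probs: list of per-patch probability vectors, one entry
    per font class. Returns (predicted_class_index, mean_probs)."""
    n_patches = len(patch_probs)
    n_classes = len(patch_probs[0])
    # Average each class probability over all patches.
    mean_probs = [
        sum(p[c] for p in patch_probs) / n_patches for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: mean_probs[c])
    return best, mean_probs
```

Averaging lets many weak patch-level votes outweigh a few confidently wrong patches.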
This paper introduces Deep Incremental Boosting, a new technique derived from AdaBoost, specifically adapted to work with Deep Learning methods, that reduces the required training time and improves generalisation. We draw inspiration from Transfer of Learning approaches to reduce the start-up time to training each incremental Ensemble member. We show a set of experiments that outlines some preliminary results on some common Deep Learning datasets and discuss the potential improvements Deep Incremental Boosting brings to traditional Ensemble methods in Deep Learning.
Outlier detection is an inevitable step in most statistical data analyses. However, the mere detection of an outlying case does not always answer all scientific questions associated with that data point. Outlier detection techniques, classical and robust alike, will typically flag the entire case as outlying, or attribute a specific case weight to the entire case. In practice, particularly in high-dimensional data, the outlier will most likely not be outlying along all of its variables, but just along a subset of them. If so, the scientific question of why the case has been flagged as an outlier becomes of interest. In this article, a fast and efficient method is proposed to detect the variables that contribute most to an outlier’s outlyingness. It thereby helps the analyst understand why an outlier lies out. The approach pursued in this work is to estimate the univariate direction of maximal outlyingness. It is shown that the problem of estimating that direction can be rewritten as the normed solution of a classical least squares regression problem. Identifying the subset of variables contributing most to outlyingness can thus be achieved by estimating the associated least squares problem in a sparse manner. From a practical perspective, sparse partial least squares (SPLS) regression, preferably by the fast sparse NIPALS (SNIPLS) algorithm, is suggested to tackle that problem. The proposed methodology is shown to perform well on both simulated data and real-life examples.
Over the past few years, softmax and SGD have become a commonly used component and the default training strategy in CNN frameworks, respectively. However, when optimizing CNNs with SGD, the saturation behavior of softmax often gives the illusion that training is going well and is therefore overlooked. In this paper, we first emphasize that the early saturation behavior of softmax impedes the exploration of SGD, which is sometimes a reason for the model converging to a bad local minimum. We then propose Noisy Softmax, which mitigates this early saturation issue by injecting annealed noise into softmax during each iteration. This noise injection aims at postponing the early saturation and maintaining continuous gradient propagation, so as to significantly encourage the SGD solver to be more exploratory and help it find a better local minimum. This paper empirically verifies the superiority of early softmax desaturation, and our method indeed improves the generalization ability of the CNN model through regularization. We experimentally find that this early desaturation helps optimization in many tasks, yielding state-of-the-art or competitive results on several popular benchmark datasets.
Matrix factorization has now become a dominant solution for personalized recommendation on the Social Web. To alleviate the cold start problem, previous approaches have incorporated various additional sources of information into traditional matrix factorization models. These upgraded models, however, achieve only ‘marginal’ enhancements on the performance of personalized recommendation. Therefore, inspired by the recent development of deep-semantic modeling, we propose a hybrid deep-semantic matrix factorization (HDMF) model to further improve the performance of tag-aware personalized recommendation by integrating the techniques of deep-semantic modeling, hybrid learning, and matrix factorization. Experimental results show that HDMF significantly outperforms the state-of-the-art baselines in tag-aware personalized recommendation, in terms of all evaluation metrics, e.g., its mean reciprocal rank (resp., mean average precision) is 1.52 (resp., 1.66) times as high as that of the best baseline.
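For readers unfamiliar with the two ranking metrics quoted above, the following sketch computes reciprocal rank and average precision for one user and averages them over users. It uses one common convention (average precision normalised by the number of relevant items); the paper's exact evaluation protocol is not given in the abstract, so treat the details as assumptions.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0 if none appears."""
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each position k holding a relevant item."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant)


def mean_over_users(metric, rankings, relevants):
    """MRR or MAP: the per-user metric averaged over all users."""
    scores = [metric(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)
```

Calling `mean_over_users(reciprocal_rank, ...)` gives MRR and `mean_over_users(average_precision, ...)` gives MAP.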
We introduce a new technique for reducing the dimension of the ambient space of low-degree polynomials in the Gaussian space while preserving their relative correlation structure, analogous to the Johnson-Lindenstrauss lemma. As applications, we address the following problems: 1. Computability of Approximately Optimal Noise Stable function over Gaussian space: The goal is to find a partition of $\mathbb{R}^n$ into $k$ parts that maximizes the noise stability. A $\delta$-optimal partition is one which is within additive $\delta$ of the optimal noise stability. De, Mossel & Neeman (CCC 2017) raised the question of proving a computable bound on the dimension $n_0(\delta)$ in which we can find a $\delta$-optimal partition. While De et al. provide such a bound, using our new technique, we obtain improved explicit bounds on the dimension $n_0(\delta)$. 2. Decidability of Non-Interactive Simulation of Joint Distributions: A ‘non-interactive simulation’ problem is specified by two distributions $P(x,y)$ and $Q(u,v)$: The goal is to determine if two players that observe sequences $X^n$ and $Y^n$ respectively, where $\{(X_i, Y_i)\}_{i=1}^n$ are drawn i.i.d. from $P(x,y)$, can generate pairs $U$ and $V$ respectively (without communicating with each other) with a joint distribution that is arbitrarily close in total variation to $Q(u,v)$. Even when $P$ and $Q$ are extremely simple, it is open in several cases whether $P$ can simulate $Q$. In the special case where $Q$ is a joint distribution over $\{0,1\} \times \{0,1\}$, Ghazi, Kamath and Sudan (FOCS 2016) proved a computable bound on the number of samples $n_0(\delta)$ that can be drawn from $P(x,y)$ to get $\delta$-close to $Q$ (if it is possible at all). Recently De, Mossel & Neeman obtained such bounds when $Q$ is a distribution over $[k] \times [k]$ for any $k \ge 2$. We recover this result with improved explicit bounds on $n_0(\delta)$.
Despite the large improvements in performance attained by using deep learning in computer vision, one can often further improve results with some additional post-processing that exploits the geometric nature of the underlying task. This commonly involves displacing the posterior distribution of a CNN in a way that makes it more appropriate for the task at hand, e.g. better aligned with local image features, or more compact. In this work we integrate this geometric post-processing within a deep architecture, introducing a differentiable and probabilistically sound counterpart to the common geometric voting technique used for evidence accumulation in vision. We refer to the resulting neural models as Mass Displacement Networks (MDNs), and apply them to human pose estimation in two distinct setups: (a) landmark localization, where we collapse a distribution to a point, allowing for precise localization of body keypoints and (b) communication across body parts, where we transfer evidence from one part to the other, allowing for a globally consistent pose estimate. We evaluate on large-scale pose estimation benchmarks, such as MPII Human Pose and COCO datasets, and report systematic improvements when compared to strong baselines.
We present a novel coreset construction algorithm for solving classification tasks using Support Vector Machines (SVMs) in a computationally efficient manner. A coreset is a weighted subset of the original data points that provably approximates the original set. We show that coresets of size polylogarithmic in $n$ and polynomial in $d$ exist for a set of $n$ input points with $d$ features and present an $(\epsilon,\delta)$-FPRAS for constructing coresets for scalable SVM training. Our method leverages the insight that data points are often redundant and uses an importance sampling scheme based on the sensitivity of each data point to construct coresets efficiently. We evaluate the performance of our algorithm in accelerating SVM training against real-world data sets and compare our algorithm to state-of-the-art coreset approaches. Our empirical results show that our approach outperforms a state-of-the-art coreset approach and uniform sampling in enabling computational speedups while achieving low approximation error.
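The sampling step of a sensitivity-based coreset can be sketched as below: points are drawn with probability proportional to a per-point sensitivity score and reweighted by the inverse of their expected sampling frequency. How the sensitivities are computed for SVMs is the paper's contribution and is not reproduced here; the scores are simply taken as a given, positive input, and the function name is hypothetical.

```python
import random


def sample_coreset(points, sensitivities, m, seed=0):
    """Draw m points (with replacement) by importance sampling.

    Point i is chosen with probability p_i = s_i / sum(s) and receives
    weight 1 / (m * p_i), so the total weight equals len(points) in
    expectation. Returns a list of (point, weight) pairs.
    """
    rng = random.Random(seed)
    total = sum(sensitivities)
    probs = [s / total for s in sensitivities]
    chosen = rng.choices(range(len(points)), weights=probs, k=m)
    return [(points[i], 1.0 / (m * probs[i])) for i in chosen]
```

With equal sensitivities this reduces to uniform sampling, the baseline mentioned in the abstract; non-uniform scores shift samples toward points that matter more for the objective.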
A fundamental question in data analysis, machine learning and signal processing is how to compare data points. The choice of the distance metric is especially challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real gene expression data for lung adenocarcinomas (lung cancer). By using the proposed metric we found a partition of subjects into risk groups with a good separation between their Kaplan-Meier survival plots.
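For reference, the standard Mahalanobis distance between two samples $x$ and $y$ is recalled below. Here $\Sigma$ denotes a generic positive-definite covariance matrix; how the paper estimates it from clusters of coordinates is not stated in the abstract, so the symbol is an assumption of this note rather than the paper's notation.

```latex
\begin{equation*}
d_M(x, y) = \sqrt{(x - y)^{\top} \Sigma^{-1} (x - y)}
\end{equation*}
```

With $\Sigma = I$ the expression reduces to the ordinary Euclidean distance.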
In this study, we formulate the concept of ‘mining maximal-size frequent subgraphs’ in the challenging domain of visual data (images and videos). In general, visual knowledge can usually be modeled as attributed relational graphs (ARGs) with local attributes representing local parts and pairwise attributes describing the spatial relationship between parts. Thus, from a practical perspective, such mining of maximal-size subgraphs can be regarded as a general platform for discovering and modeling the common objects within cluttered and unlabeled visual data. Then, from a theoretical perspective, visual graph mining should encode and overcome the great fuzziness of messy data collected from complex real-world situations, which conflicts with the conventional theoretical basis of graph mining designed for tabular data. Common subgraphs hidden in these ARGs usually have soft attributes, with considerable inter-graph variation. More importantly, we should also discover the latent pattern space, including similarity metrics for the pattern and hidden node relations, during the mining process. In this study, we redefine the visual subgraph pattern that encodes all of these challenges in a general way, and propose an approximate but efficient solution to graph mining. We conduct five experiments to evaluate our method with different kinds of visual data, including videos and RGB/RGB-D images. These experiments demonstrate the generality of the proposed method.
Modern multiscale type segmentation methods are known to detect multiple change-points with high statistical accuracy, while allowing for fast computation. Underpinning theory has been developed mainly for models which assume the signal to be an unknown piecewise constant function. In this paper this will be extended to certain function classes beyond step functions in a nonparametric regression setting, revealing certain multiscale segmentation methods as robust to deviation from such piecewise constant functions. Although these methods are designed for step functions, our main finding is their adaptation over such function classes for a universal thresholding. On the one hand, this includes nearly optimal convergence rates for step functions with increasing number of jumps. On the other hand, for models which are characterized by certain approximation spaces, we obtain nearly optimal rates as well. This includes bounded variation functions, and (piecewise) H\"{o}lder functions of smoothness order $0 < \alpha \le 1$. All results are formulated in terms of $L^p$-loss ($0 < p < \infty$) both almost surely and in expectation. Theoretical findings are examined by various numerical simulations.
In this paper new tests for the independence of two high-dimensional vectors are investigated. We consider the case where the dimension of the vectors increases with the sample size and propose multivariate analysis of variance-type statistics for the hypothesis of a block diagonal covariance matrix. The asymptotic properties of the new test statistics are investigated under the null hypothesis and the alternative hypothesis using random matrix theory. For this purpose we study the weak convergence of linear spectral statistics of central and (conditionally) non-central Fisher matrices. In particular, a central limit theorem for linear spectral statistics of large dimensional (conditionally) non-central Fisher matrices is derived, which is then used to analyse the power of the tests under the alternative. The theoretical results are illustrated by means of a simulation study where we also compare the new tests with several alternatives, in particular with the commonly used corrected likelihood ratio test. It is demonstrated that the latter test does not keep its nominal level if the dimension of one sub-vector is relatively small compared to the dimension of the other sub-vector. On the other hand, the tests proposed in this paper provide a reasonable approximation of the nominal level in such situations. Moreover, we observe that one of the proposed tests is most powerful under a variety of correlation scenarios.
A word embedding is a low-dimensional, dense and real-valued vector representation of a word. Word embeddings have been used in many NLP tasks. They are usually generated from a large text corpus. The embedding of a word captures both its syntactic and semantic aspects. Tweets are short, noisy and have unique lexical and semantic features that are different from other types of text. Therefore, it is necessary to have word embeddings learned specifically from tweets. In this paper, we present ten word embedding data sets. In addition to the data sets learned from just tweet data, we also built embedding sets from the general data and the combination of tweets with the general data. The general data consist of news articles, Wikipedia data and other web data. These ten embedding models were learned from about 400 million tweets and 7 billion words from the general text. In this paper, we also present two experiments demonstrating how to use the data sets in some NLP tasks, such as tweet sentiment analysis and tweet topic classification tasks.
Word embeddings are representations of individual words of a text document in a vector space and they are often useful for performing natural language processing tasks. Current state-of-the-art algorithms for learning word embeddings learn vector representations from large corpora of text documents in an unsupervised fashion. This paper introduces SWESA (Supervised Word Embeddings for Sentiment Analysis), an algorithm for sentiment analysis via word embeddings. SWESA leverages document label information to learn vector representations of words from a modest corpus of text documents by solving an optimization problem that minimizes a cost function with respect to both word embeddings as well as classification accuracy. Analysis reveals that SWESA provides an efficient way of estimating the dimension of the word embeddings that are to be learned. Experiments on several real-world data sets show that SWESA has superior performance when compared to previously suggested approaches to word embeddings and sentiment analysis tasks.
Spectral graph wavelets introduce a notion of scale in networks, and are thus used to obtain a local view of the network from each node. By carefully constructing a wavelet filter function for these wavelets, a multi-scale community detection method for monoplex networks has already been developed. This construction takes advantage of the partitioning properties of the network Laplacian. In this paper we elaborate on a novel method which uses spectral graph wavelets to detect multi-scale communities in temporal networks. To do this we extend the definition of spectral graph wavelets to temporal networks by adopting a multilayer framework. We use arguments from Perturbation Theory to investigate the spectral properties of the supra-Laplacian matrix for clustering purposes in temporal networks. Using these properties, we construct a new wavelet filter function, which attenuates the influence of uninformative eigenvalues and centres the filter around eigenvalues which contain information on the coarsest description of prevalent community structures over time. We use the spectral graph wavelets as feature vectors in a connectivity-constrained clustering procedure to detect multi-scale communities at different scales, and refer to this method as Temporal Multi-Scale Community Detection (TMSCD). We validate the performance of TMSCD and a competing methodology on various benchmarks. The advantage of TMSCD is the automated selection of relevant scales at which communities should be sought.
This paper studies the Tensor Robust Principal Component Analysis (TRPCA) problem, which extends the known Robust PCA \cite{RPCA} to the tensor case. Our model is based on a new tensor Singular Value Decomposition (t-SVD) \cite{kilmer2011factorization} and its induced tensor tubal rank and tensor nuclear norm. Consider that we have a 3-way tensor $\bm{\mathcal{X}}\in\mathbb{R}^{n_1\times n_2\times n_3}$ such that $\bm{\mathcal{X}}=\bm{\mathcal{L}}_0+\bm{\mathcal{S}}_0$, where $\bm{\mathcal{L}}_0$ has low tubal rank and $\bm{\mathcal{S}}_0$ is sparse. Is it possible to recover both components? In this work, we prove that under certain suitable assumptions, we can recover both the low-rank and the sparse components exactly by simply solving a convex program whose objective is a weighted combination of the tensor nuclear norm and the $\ell_1$-norm, i.e., \begin{align*} \min_{\bm{\mathcal{L}},\bm{\mathcal{E}}} \ \|{\bm{\mathcal{L}}}\|_*+\lambda\|{\bm{\mathcal{E}}}\|_1, \ \text{s.t.} \ \bm{\mathcal{X}}=\bm{\mathcal{L}}+\bm{\mathcal{E}}, \end{align*} where $\lambda= {1}/{\sqrt{\max(n_1,n_2)n_3}}$. Interestingly, TRPCA includes RPCA as a special case when $n_3=1$ and thus it is a simple and elegant tensor extension of RPCA. Numerical experiments verify our theory, and an application to image denoising demonstrates the effectiveness of our method.
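Two small pieces of the convex program above are easy to make concrete: the weight $\lambda$ as a function of the tensor dimensions, and entrywise soft-thresholding, which is the proximal operator of a scaled $\ell_1$-norm. The abstract does not say which solver is used, so the soft-thresholding helper is shown only as a standard building block that such solvers commonly rely on; the tensor nuclear norm step (via the t-SVD) is not reproduced, and both function names are hypothetical.

```python
import math


def trpca_lambda(n1, n2, n3):
    """Weight from the abstract: lambda = 1 / sqrt(max(n1, n2) * n3)."""
    return 1.0 / math.sqrt(max(n1, n2) * n3)


def soft_threshold(x, tau):
    """Shrink a scalar toward zero by tau, keeping its sign.

    Applied to every entry of a tensor, this is the proximal operator
    of tau * ||.||_1.
    """
    return math.copysign(max(abs(x) - tau, 0.0), x)
```

For $n_3 = 1$ the weight becomes $1/\sqrt{\max(n_1, n_2)}$, consistent with the statement that the matrix case is recovered as a special case.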