We present an asymptotic criterion to determine the optimal number of clusters in k-means. We consider k-means as data compression, and propose to adopt the number of clusters that minimizes the estimated description length after compression. Here we report two types of compression ratio based on two ways to quantify the description length of data after compression. This approach further offers a way to evaluate whether clusters obtained with k-means have a hierarchical structure by examining whether multi-stage compression can further reduce the description length. We applied our criteria to determine the number of clusters to synthetic data and empirical neuroimaging data to observe the behavior of the criteria across different types of data set and suitability of the two types of criteria for different datasets. We found that our method can offer reasonable clustering results that are useful for dimension reduction. While our numerical results revealed dependency of our criteria on the various aspects of dataset such as the dimensionality, the description length approach proposed here provides a useful guidance to determine the number of clusters in a principled manner when underlying properties of the data are unknown and only inferred from observation of data.
In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH), as well as its practical variant SARAH+, as a novel approach to the finite-sum minimization problems. Different from the vanilla SGD and other modern stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple recursive framework for updating stochastic gradient estimates; when comparing to SAG/SAGA, SARAH does not require a storage of past gradients. The linear convergence rate of SARAH is proven under strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of SARAH, the property that SVRG does not possess. Numerical experiments demonstrate the efficiency of our algorithm.
Collecting quality data from software projects can be time-consuming and expensive. Hence, some researchers explore ‘unsupervised’ approaches to quality prediction that does not require labelled data. An alternate technique is to use ‘supervised’ approaches that learn models from project data labelled with, say, ‘defective’ or ‘not-defective’.Most researchers use these supervised models since, it is argued, they can exploit more knowledge of the projects. At FSE’16, Yang et al. reported startling results where unsupervised defect predictors outperformed supervised predictors for effort-aware just-in-time defect prediction. If confirmed, these results would lead to a dramatic simplification of a seemingly complex task (data mining) that is widely explored in the SE literature. This paper repeats and refutes those results as follows. (1)There is much variability in the efficacy of the Yang et al. models so even with their approach, some supervised data is required to prune weaker models. (2)Their findings were grouped across $N$ projects. When we repeat their analysis on a project-by-project basis, supervised predictors are seen to work better. Even though this paper rejects the specific conclusions of Yang et al., we still endorse their general goal. In our our experiments, supervisedpredictors did not perform outstandingly better than unsupervised ones. Hence, they may indeed be some combination of unsupervisedlearners to achieve comparable performance to supervised. We therefore encourage others to work in this promising area.
While deep learning is an exciting new technique, the benefits of this method need to be assessed w.r.t. its computational cost. This is particularly important for deep learning since these learners need hours (to weeks) to train the model. Such long CPU times limit the ability of (a) a researcher to test the stability of their conclusion via repeated runs with different random seeds; and (b)other researchers to repeat, improve, or even refute that original work. For example, recently, deep learning was used to find which questions in the Stack Overflow programmer discussion forum can be linked together. That system took 14 hours to execute. We show here that a very simple optimizer called DE (differential evolution) can achieve similar (and sometimes better). The DE approach terminated in 10 minutes; i.e. 84 times faster hours than deep learning. We offer these results as a cautionary tale to the software analytics community and suggest that not every new innovation should be applied without critical analysis. If researchers deploy some new and expensive process, that work should be baselined against some simpler and faster alternatives.
This paper deals with the non-uniqueness of the solutions of an analysis-Lasso regularization. Most of previous works in this area is concerned with the case where the solution set is a singleton, or to derive guarantees to enforce uniqueness. Our main contribution consists in providing a geometrical interpretation of a solution with a maximal D-support, namely the fact that such a solution lives in the relative interior of the solution set. With this result in hand, we also provide a way to exhibit a maximal solution using a primal-dual interior point algorithm.
We review the relationship between Kalman filtering and Amari’s natural gradient in statistical learning. Namely, using an online natural gradient descent on data log-likelihood to evaluate the parameter of a probabilistic model given a series of observations, is exactly equivalent to using an extended Kalman filter to estimate the parameter (assumed to have constant dynamics). In the non-recurrent case, this relation is a consequence of the ‘information filter’ phrasing of the Kalman filter. In the recurrent case, we prove that the joint Kalman filter over states and parameters is a natural gradient on top of real-time recurrent learning (RTRL), a classical algorithm to train recurrent models. This correspondence provides relevant settings for natural gradient hyperparameters such as learning rates or initialization and regularization of the Fisher information matrix.
The FLAME methodology makes it possible to derive provably correct algorithms from a formal description of a linear algebra problem. So far, the methodology has been successfully used to automate the derivation of direct algorithms such as the Cholesky decomposition and the solution of Sylvester equations. In this thesis, we present an extension of the FLAME methodology to tackle iterative methods such as Conjugate Gradient. As a starting point, we use a formal description of the iterative method in matrix form. The result is a family of provably correct pseudocode algorithms. We argue that all the intermediate steps are sufficiently systematic to be fully automated.
Boosting is a popular ensemble algorithm that generates more powerful learners by linearly combining base models from a simpler hypothesis class. In this work, we investigate the problem of adapting batch gradient boosting for minimizing convex loss functions to online setting where the loss at each iteration is i.i.d sampled from an unknown distribution. To generalize from batch to online, we first introduce the definition of online weak learning edge with which for strongly convex and smooth loss functions, we present an algorithm, Streaming Gradient Boosting (SGB) with exponential shrinkage guarantees in the number of weak learners. We further present an adaptation of SGB to optimize non-smooth loss functions, for which we derive a O(ln N/N) convergence rate. We also show that our analysis can extend to adversarial online learning setting under a stronger assumption that the online weak learning edge will hold in adversarial setting. We finally demonstrate experimental results showing that in practice our algorithms can achieve competitive results as classic gradient boosting while using less computation.
Sophisticated gated recurrent neural network architectures like LSTMs and GRUs have been shown to be highly effective in a myriad of applications. We develop an un-gated unit, the statistical recurrent unit (SRU), that is able to learn long term dependencies in data by only keeping moving averages of statistics. The SRU’s architecture is simple, un-gated, and contains a comparable number of parameters to LSTMs; yet, SRUs perform favorably to more sophisticated LSTM and GRU alternatives, often outperforming one or both in various tasks. We show the efficacy of SRUs as compared to LSTMs and GRUs in an unbiased manner by optimizing respective architectures’ hyperparameters in a Bayesian optimization scheme for both synthetic and real-world tasks.
Most exact methods for k-nearest neighbour search suffer from the curse of dimensionality; that is, their query times exhibit exponential dependence on either the ambient or the intrinsic dimensionality. Dynamic Continuous Indexing (DCI) offers a promising way of circumventing the curse by avoiding space partitioning and achieves a query time that grows sublinearly in the intrinsic dimensionality. In this paper, we develop a variant of DCI, which we call Prioritized DCI, and show a further improvement in the dependence on the intrinsic dimensionality compared to standard DCI, thereby improving the performance of DCI on datasets with high intrinsic dimensionality. We also demonstrate empirically that Prioritized DCI compares favourably to standard DCI and Locality-Sensitive Hashing (LSH) both in terms of running time and space consumption at all levels of approximation quality. In particular, relative to LSH, Prioritized DCI reduces the number of distance evaluations by a factor of 5 to 30 and the space consumption by a factor of 47 to 55.
Learning to Optimize is a recently proposed framework for learning optimization algorithms using reinforcement learning. In this paper, we explore learning an optimization algorithm for training shallow neural nets. Such high-dimensional stochastic optimization problems present interesting challenges for existing reinforcement learning algorithms. We develop an extension that is suited to learning optimization algorithms in this setting and demonstrate that the learned optimization algorithm consistently outperforms other known optimization algorithms even on unseen tasks and is robust to changes in stochasticity of gradients and the neural net architecture. More specifically, we show that an optimization algorithm trained with the proposed method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on the Toronto Faces Dataset, CIFAR-10 and CIFAR-100.