L  L is a high-level, open-source, general-purpose and system programming language which emphasizes readability, simplicity, extensibility, conciseness and performance. The L compiler features native code generation through LLVM, and is fully documented in a literate programming style. The language and compiler are usable, but are under heavy development as new features are being implemented.
L1-Norm Batch Normalization (L1BN) 
Batch Normalization (BN) has been proven to be quite effective at accelerating and improving the training of deep neural networks (DNNs). However, BN brings additional computation, consumes more memory and generally slows down the training process by a large margin, which aggravates the training effort. Furthermore, the nonlinear square and root operations in BN also impede low bit-width quantization techniques, which draw much attention in the deep learning hardware community. In this work, we propose an L1-norm BN (L1BN) with only linear operations in both the forward and the backward propagations during training. L1BN is shown to be approximately equivalent to the original L2-norm BN (L2BN) after multiplying by a scaling factor. Experiments on various convolutional neural networks (CNNs) and generative adversarial networks (GANs) reveal that L1BN maintains almost the same accuracies and convergence rates as L2BN but with higher computational efficiency. On an FPGA platform, the proposed signum and absolute operations in L1BN achieve a 1.5$\times$ speedup and save 50\% power consumption, compared with the original costly square and root operations, respectively. This hardware-friendly normalization method not only surpasses L2BN in speed, but also simplifies the hardware design of ASIC accelerators with higher energy efficiency. Last but not least, L1BN promises a fully quantized training of DNNs, which is crucial to future adaptive terminal devices.
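The relation between the two normalizations can be illustrated with a minimal sketch. It assumes roughly Gaussian activations, for which the standard deviation equals the mean absolute deviation times sqrt(pi/2). The function names and the `eps` value are illustrative, and the learnable scale and shift of a real BN layer are omitted.

```python
import math
import random

def l2_batch_norm(x, eps=1e-5):
    """Standard BN statistic: centre, then divide by the standard deviation."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def l1_batch_norm(x, eps=1e-5):
    """L1 variant: centre, then divide by the mean absolute deviation.

    The sqrt(pi/2) factor is the assumed scaling that makes the L1
    statistic match the standard deviation for Gaussian data.
    """
    mu = sum(x) / len(x)
    mad = sum(abs(v - mu) for v in x) / len(x)
    scale = math.sqrt(math.pi / 2) * mad
    return [(v - mu) / (scale + eps) for v in x]

random.seed(0)
batch = [random.gauss(3.0, 2.0) for _ in range(100_000)]
y2 = l2_batch_norm(batch)
y1 = l1_batch_norm(batch)
max_gap = max(abs(a - b) for a, b in zip(y1, y2))
print(f"largest difference between L1BN and L2BN outputs: {max_gap:.4f}")
```

Note that the L1 version needs only subtraction, absolute value and averaging to compute its statistic, which is the property the entry highlights for low bit-width hardware.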
L1-Norm Kernel PCA  We present the first model and algorithm for L1-norm kernel PCA. While L2-norm kernel PCA has been widely studied, there has been no work on L1-norm kernel PCA. For this nonconvex and nonsmooth problem, we offer geometric understandings through reformulations and present an efficient algorithm where the kernel trick is applicable. To attest to the efficiency of the algorithm, we provide a convergence analysis including a linear rate of convergence. Moreover, we prove that the output of our algorithm is a locally optimal solution to the L1-norm kernel PCA problem. We also numerically show its robustness when extracting principal components in the presence of influential outliers, as well as its runtime comparability to L2-norm kernel PCA. Lastly, we introduce its application to outlier detection and show that the L1-norm kernel PCA based model outperforms competing approaches, especially for high-dimensional data.
L1-Penalized Censored Gaussian Graphical Model  Graphical lasso is one of the most used estimators for inferring genetic networks. Despite its widespread use, there are several fields in applied research where the limits of detection of modern measurement technologies make the use of this estimator theoretically unfounded, even when the assumption of a multivariate Gaussian distribution is satisfied. Typical examples are data generated by polymerase chain reactions and flow cytometry. The combination of censoring and high dimensionality makes inference of the underlying genetic networks from these data very challenging. In this paper we propose an $\ell_1$-penalized Gaussian graphical model for censored data and derive two EM-like algorithms for inference. By an extensive simulation study, we evaluate the computational efficiency of the proposed algorithms and show that our proposal outperforms existing competitors when censored data are available. We apply the proposed method to gene expression data coming from microfluidic RT-qPCR technology in order to make inference on the regulatory mechanisms of blood development.
L2-Nonexpansive Neural Network  This paper proposes a class of well-conditioned neural networks in which a unit amount of change in the inputs causes at most a unit amount of change in the outputs or any of the internal layers. We develop the known methodology of controlling Lipschitz constants to realize its full potential in maximizing robustness: our linear and convolution layers subsume those in the previous Parseval networks as a special case and allow greater degrees of freedom; aggregation, pooling, splitting and other operators are adapted in new ways, and a new loss function is proposed, all for the purpose of improving robustness. With MNIST and CIFAR-10 classifiers, we demonstrate a number of advantages. Without needing any adversarial training, the proposed classifiers exceed the state of the art in robustness against white-box L2-bounded adversarial attacks. Their outputs are quantitatively more meaningful than ordinary networks and indicate levels of confidence. They are also free of exploding gradients, among other desirable properties.
Label Augmentation  A major impediment to the application of deep learning to real-world problems is the scarcity of labeled data. Small training sets are in fact of no use to deep networks as, due to the large number of trainable parameters, they will very likely be subject to overfitting phenomena. On the other hand, the increment of the training set size through further manual or semi-automatic labellings can be costly, and at times not even possible. Thus, the standard techniques to address this issue are transfer learning and data augmentation, which consists of applying some sort of ‘transformation’ to existing labeled instances to let the training set grow in size. Although this approach works well in applications such as image classification, where it is relatively simple to design suitable transformation operators, it is not obvious how to apply it in more structured scenarios. Motivated by the observation that in virtually all application domains it is easy to obtain unlabeled data, in this paper we take a different perspective and propose a \emph{label augmentation} approach. We start from a small, curated labeled dataset and let the labels propagate through a larger set of unlabeled data using graph transduction techniques. This allows us to naturally use (second-order) similarity information which resides in the data, a source of information which is typically neglected by standard augmentation techniques. In particular, we show that by using known game theoretic transductive processes we can create larger and accurate enough labeled datasets whose use results in better trained neural networks. Preliminary experiments are reported which demonstrate a consistent improvement over standard image classification datasets.
Label Embedding Network  We propose a method, called Label Embedding Network, which can learn label representation (label embedding) during the training process of deep networks. With the proposed method, the label embedding is adaptively and automatically learned through back propagation. The original one-hot represented loss function is converted into a new loss function with soft distributions, such that the originally unrelated labels have continuous interactions with each other during the training process. As a result, the trained model can achieve substantially higher accuracy and faster convergence. Experimental results based on competitive tasks demonstrate the effectiveness of the proposed method, and the learned label embedding is reasonable and interpretable. The proposed method achieves comparable or even better results than the state-of-the-art systems. The source code is available at \url{https://…/LabelEmb}.
Labeled Latent Dirichlet Allocation (LLDA) 
Labeled Latent Dirichlet Allocation (LLDA) is an extension of the standard unsupervised Latent Dirichlet Allocation (LDA) algorithm, to address multi-label learning tasks. Previous work has shown it to perform on par with other state-of-the-art multi-label methods. Nonetheless, with increasing label set sizes LLDA encounters scalability issues. In this work, we introduce Subset LLDA, a simple variant of the standard LLDA algorithm, that not only can effectively scale up to problems with hundreds of thousands of labels but also improves over the LLDA state of the art. We conduct extensive experiments on eight data sets, with label set sizes ranging from hundreds to hundreds of thousands, comparing our proposed algorithm with the previously proposed LLDA algorithms (Prior–LDA, Dep–LDA), as well as the state of the art in extreme multi-label classification. The results show a steady advantage of our method over the other LLDA algorithms and competitive results compared to the extreme multi-label classification algorithms.
Laconic  We motivate a method for transparently identifying ineffectual computations in unmodified Deep Learning models without affecting accuracy. Specifically, we show that if we decompose multiplications down to the bit level, the amount of work performed during inference for image classification models can be consistently reduced by two orders of magnitude. In the best case studied, a sparse variant of AlexNet, this approach can ideally reduce computation work by more than 500x. We present Laconic, a hardware accelerator that implements this approach to improve execution time and energy efficiency for inference with Deep Learning Networks. Laconic judiciously gives up some of the work reduction potential to yield a low-cost, simple, and energy-efficient design that outperforms other state-of-the-art accelerators. For example, a Laconic configuration that uses a weight memory interface with just 128 wires outperforms a conventional accelerator with a 2K-wire weight memory interface by 2.3x on average while being 2.13x more energy efficient on average. A Laconic configuration that uses a 1K-wire weight memory interface outperforms the 2K-wire conventional accelerator by 15.4x and is 1.95x more energy efficient. Laconic does not require but rewards advances in model design such as a reduction in precision, the use of alternate numeric representations that reduce the number of bits that are ‘1’, or an increase in weight or activation sparsity.
LAD Regression  
Ladder  The organizer of a machine learning competition faces the problem of maintaining an accurate leaderboard that faithfully represents the quality of the best submission of each competing team. What makes this estimation problem particularly challenging is its sequential and adaptive nature. As participants are allowed to repeatedly evaluate their submissions on the leaderboard, they may begin to overfit to the holdout data that supports the leaderboard. Few theoretical results give actionable advice on how to design a reliable leaderboard. Existing approaches therefore often resort to poorly understood heuristics such as limiting the bit precision of answers and the rate of resubmission. In this work, we introduce a notion of leaderboard accuracy tailored to the format of a competition. We introduce a natural algorithm called the Ladder and demonstrate that it simultaneously supports strong theoretical guarantees in a fully adaptive model of estimation, withstands practical adversarial attacks, and achieves high utility on real submission files from an actual competition hosted by Kaggle. 
Lagrange Multiplier  In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints. 
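As a small worked example (the objective and constraint are chosen purely for illustration), consider minimizing $f(x, y) = x^2 + y^2$ subject to $x + y = 1$. Setting the gradient of the Lagrangian to zero gives:

```latex
\mathcal{L}(x, y, \lambda) = x^2 + y^2 - \lambda\,(x + y - 1)

\frac{\partial \mathcal{L}}{\partial x} = 2x - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = 2y - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 1) = 0

\Rightarrow \quad x = y = \tfrac{1}{2}, \qquad \lambda = 1, \qquad f\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{2}
```

The constrained minimum is the point on the line $x + y = 1$ closest to the origin.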
Lagrange Policy Gradient  Most algorithms for reinforcement learning work by estimating action-value functions. Here we present a method that uses Lagrange multipliers, the costate equation, and multilayer neural networks to compute policy gradients. We show that this method can find solutions to time-optimal control problems, driving nonlinear mechanical systems quickly to a target configuration. On these tasks its performance is comparable to that of deep deterministic policy gradient, a recent action-value method.
Lambda Architecture  Lambda Architecture proposes a simpler, elegant paradigm that is designed to tame complexity while being able to store and effectively process large amounts of data. The Lambda Architecture was originally presented by Nathan Marz, who is well known in the big data community for his work on the Storm project.
LambdaMART  At a high level, LambdaMART is an algorithm that uses gradient boosting to directly optimize Learning to Rank specific cost functions such as NDCG. 
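For reference, the NDCG metric mentioned above can be computed as in the following sketch. It uses one common formulation (exponential gain `2^rel - 1` with a `log2` position discount); other variants exist, and the function names are illustrative. This shows only the metric, not the LambdaMART training procedure.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1, 0])   # documents already in the best order
reversed_ = ndcg([0, 1, 2, 3])  # worst ordering of the same documents
print(perfect, reversed_)
```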
Lambert W Function  In mathematics, the Lambert W function, also called the omega function or product logarithm, is a set of functions, namely the branches of the inverse relation of the function z = f(W) = We^W where e^W is the exponential function and W is any complex number. In other words, the defining equation for W(z) is: z = W(z)e^{W(z)} for any complex number z. http://…/LambertWFunction.html lamW 
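The defining equation can be solved numerically. The sketch below applies Newton's method to $we^w - z = 0$ and is restricted, as a simplifying assumption, to the principal branch for real $z \ge 0$; the function name, tolerance and starting guess are illustrative choices.

```python
import math

def lambert_w0(z, tol=1e-12, max_iter=100):
    """Principal branch W0(z) for real z >= 0, via Newton's method."""
    if z < 0:
        raise ValueError("this sketch only handles z >= 0")
    w = math.log1p(z)  # starting guess at or to the right of the root
    for _ in range(max_iter):
        ew = math.exp(w)
        # Newton step for f(w) = w * e^w - z, with f'(w) = e^w * (w + 1)
        step = (w * ew - z) / (ew * (w + 1))
        w -= step
        if abs(step) < tol:
            break
    return w

w = lambert_w0(10.0)
print(w, w * math.exp(w))  # the second value should reproduce z = 10
```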
Lanczos Latent Factor Recommender (LLFR) 
The purpose of this master’s thesis is to study and develop a new algorithmic framework for Collaborative Filtering to produce recommendations in the top-N recommendation problem. Thus, we propose Lanczos Latent Factor Recommender (LLFR), a novel ‘big data friendly’ collaborative filtering algorithm for top-N recommendation. Using a computationally efficient Lanczos-based procedure, LLFR builds a low-dimensional item similarity model that can be readily exploited to produce personalized ranking vectors over the item space. A number of experiments on real datasets indicate that LLFR outperforms other state-of-the-art top-N recommendation methods from a computational as well as a qualitative perspective. Our experimental results also show that its relative performance gains, compared to competing methods, increase as the data get sparser, as in the Cold Start Problem. More specifically, this is true both when the sparsity is generalized (as in the New Community Problem, a very common problem faced by real recommender systems in their beginning stages, when there is not a sufficient number of ratings for the collaborative filtering algorithms to uncover similarities between items or users) and in the very interesting case where the sparsity is localized in a small fraction of the dataset (as in the New Users Problem, where new users are introduced to the system, have not rated many items, and thus the CF algorithm cannot yet make reliable personalized recommendations).
Lanczos Method  The Lanczos algorithm is a direct algorithm devised by Cornelius Lanczos that is an adaptation of power methods to find the most useful eigenvalues and eigenvectors of an $n^{th}$-order linear system with a limited number of operations, $m$, where $m$ is much smaller than $n$. Although computationally efficient in principle, the method as initially formulated was not useful, due to its numerical instability. In 1970, Ojalvo and Newman showed how to make the method numerically stable and applied it to the solution of very large engineering structures subjected to dynamic loading. This was achieved using a method for purifying the vectors to any degree of accuracy, which when not performed, produced a series of vectors that were highly contaminated by those associated with the lowest natural frequencies. In their original work, these authors also suggested how to select a starting vector (i.e. use a random number generator to select each element of the starting vector) and suggested an empirically determined method for determining $m$, the reduced number of vectors (i.e. it should be selected to be approximately 1½ times the number of accurate eigenvalues desired). Soon thereafter their work was followed by Paige, who also provided an error analysis. In 1988, Ojalvo produced a more detailed history of this algorithm and an efficient eigenvalue error test. Currently, the method is widely used in a variety of technical fields and has seen a number of variations.
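A minimal sketch of the iteration for a small real symmetric matrix follows. It assumes the textbook formulation, in which $m$ steps produce an $m \times m$ tridiagonal matrix with diagonal `alphas` and off-diagonal `betas`, and it uses full reorthogonalization as a simple stand-in for the vector "purifying" step described above. All names and the example matrix are illustrative, not taken from the cited works.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lanczos(A, v0, m):
    """Run up to m Lanczos steps on symmetric A.

    Returns (alphas, betas, Q): the diagonal and off-diagonal of the
    tridiagonal matrix, and the list of orthonormal Lanczos vectors.
    """
    norm = math.sqrt(dot(v0, v0))
    Q = [[x / norm for x in v0]]
    alphas, betas = [], []
    for j in range(m):
        w = matvec(A, Q[j])
        alphas.append(dot(w, Q[j]))
        # Full reorthogonalization: remove components along every
        # previous Lanczos vector, not just the last two.
        for q in Q:
            c = dot(w, q)
            w = [wk - c * qk for wk, qk in zip(w, q)]
        beta = math.sqrt(dot(w, w))
        if j == m - 1 or beta < 1e-12:
            break  # done, or the Krylov subspace is exhausted
        betas.append(beta)
        Q.append([x / beta for x in w])
    return alphas, betas, Q

A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 4.0]]
alphas, betas, Q = lanczos(A, [1.0, 1.0, 1.0], 3)
print(alphas, betas)
```

In practice $m$ is chosen much smaller than $n$ and the eigenvalues of the small tridiagonal matrix approximate the extreme eigenvalues of `A`; here $m = n = 3$ only so that the result is easy to check (the trace of the tridiagonal matrix equals the trace of `A`).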
LanczOs Variance Estimates (LOVE) 
One of the most compelling features of Gaussian process (GP) regression is its ability to provide well-calibrated posterior distributions. Recent advances in inducing point methods have drastically sped up marginal likelihood and posterior mean computations, leaving posterior covariance estimation and sampling as the remaining computational bottlenecks. In this paper we address this shortcoming by using the Lanczos decomposition algorithm to rapidly approximate the predictive covariance matrix. Our approach, which we refer to as LOVE (LanczOs Variance Estimates), substantially reduces the time and space complexity over any previous method. In practice, it can compute predictive covariances up to 2,000 times faster and draw samples 18,000 times faster than existing methods, all without sacrificing accuracy.
Landmark Retracing Network (LRN) 
Since a convolutional neural network (CNN) lacks an inherent mechanism to handle large scale variations, we always need to compute feature maps multiple times for multi-scale object detection, which creates a computational bottleneck in practice. To address this, we devise a recurrent scale approximation (RSA) that computes the feature map only once and approximates the maps on the other levels from this single map. At the core of RSA is the recursive rolling out mechanism: given an initial map on a particular scale, it generates the prediction on a smaller scale that is half the size of the input. To further increase efficiency and accuracy, we (a) design a scale-forecast network to globally predict potential scales in the image, since there is no need to compute maps on all levels of the pyramid, and (b) propose a landmark retracing network (LRN) to retrace back locations of the regressed landmarks and generate a confidence score for each landmark; LRN can effectively alleviate false positives due to the accumulated error in RSA. The whole system can be trained end-to-end in a unified CNN framework. Experiments demonstrate that our proposed algorithm is superior to state-of-the-art methods on face detection benchmarks and achieves comparable results for generic proposal generation. The source code of RSA is available at github.com/sciencefans/RSAforobjectdetection.
Langevin Monte Carlo  
Language Model  A statistical language model assigns a probability to a sequence of m words by means of a probability distribution. Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval.
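As an illustration, the sketch below builds one simple kind of statistical language model: a bigram model with maximum-likelihood estimates and no smoothing. The toy corpus, the `<s>` and `</s>` boundary tokens and the function names are assumptions made for the example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over sentences padded with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def sequence_probability(words, contexts, bigrams):
    """P(w_1..w_m) as a product of P(w_i | w_{i-1}) (chain rule, bigram assumption)."""
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
contexts, bigrams = train_bigram(corpus)
p_seen = sequence_probability(["the", "cat", "sat"], contexts, bigrams)
p_unseen = sequence_probability(["the", "dog", "ran"], contexts, bigrams)
print(p_seen, p_unseen)
```

Real systems add smoothing or use neural models so that unseen word sequences do not receive probability zero.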
Laplacian Power Network  Deep Neural Networks often suffer from lack of robustness to adversarial noise. To mitigate this drawback, authors have proposed different approaches, such as adding regularizers or training using adversarial examples. In this paper we propose a new regularizer built upon the Laplacian of similarity graphs obtained from the representation of training data at each intermediate representation. This regularizer penalizes large changes (across consecutive layers in the architecture) in the distance between examples of different classes. We provide theoretical justification for this regularizer and demonstrate its effectiveness when facing adversarial noise on classical supervised learning vision datasets. 
Large Deviation Principles  We establish the Large Deviation Principles for a topological Markov shift on a countably infinite number of alphabets which satisfies a strong combinatorial assumption called ‘finite primitiveness’ by Mauldin and Urbański. More precisely, we assume the existence of a Gibbs measure for a potential $\phi$ in the sense of Bowen, and prove the level-2 Large Deviation Principles for the distribution of Birkhoff averages under the Gibbs measure, as well as that of weighted periodic points and iterated preimages. The rate function is common to all of these and is written in terms of the pressure and the free energy associated with the potential $\phi$. The Gibbs measure is not assumed to be an equilibrium state for the potential $\phi$, nor is the existence of an equilibrium state assumed. We provide a sufficient condition for minimizers of the rate function to be equilibrium states. We apply our results to the Gauss transformation and obtain a global limit theorem on the frequency of digits in the regular continued fraction expansion.
Large Margin Deep Network  We present a formulation of deep learning that aims at producing a large margin classifier. The notion of margin, minimum distance to a decision boundary, has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large margin algorithms are applicable only to shallow models with a preset feature representation; and conventional margin methods for neural networks only enforce margin at the output layer. Such methods are therefore not well suited for deep networks. In this work, we propose a novel loss function to impose a margin on any chosen set of layers of a deep network (including input and hidden layers). Our formulation allows choosing any norm on the metric measuring the margin. We demonstrate that the decision boundary obtained by our loss has nice properties compared to standard classification loss functions. Specifically, we show improved empirical results on the MNIST, CIFAR-10 and ImageNet datasets on multiple tasks: generalization from small training sets, corrupted labels, and robustness against adversarial perturbations. The resulting loss is general and complementary to existing data augmentation (such as random/adversarial input transform) and regularization techniques (such as weight decay, dropout, and batch norm).
Large Vocabulary Continuous Speech Recognition System (LVCSR) 
The search problem in LVCSR can be simply stated: find the most probable sequence of words given a sequence of acoustic observations, an acoustic model and a language model. This is a demanding problem since word boundary information is not available in continuous speech and each word in the dictionary may be hypothesized to start at each frame of acoustic data. The problem is further complicated by the vocabulary size (typically 65,000 words) and the structure imposed on the search space by the language model. Direct evaluation of all the possible word sequences is impossible (given the large vocabulary) and an efficient search algorithm will consider only a very small subset of all possible utterance models. Typically, the effective size of the search space is reduced through pruning of unlikely hypotheses and/or the elimination of repeated computations. 
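The search problem stated above is commonly written as a Bayes decision rule. In the formula below, the symbols $W$ (a candidate word sequence) and $O$ (the sequence of acoustic observations) are notation introduced here for illustration; $P(O \mid W)$ is supplied by the acoustic model and $P(W)$ by the language model.

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \arg\max_{W} P(O \mid W)\, P(W)
```

The denominator $P(O)$ does not depend on $W$, so it can be dropped from the maximization.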
Large-Scale Information Network Embedding (LINE) 
This paper studies the problem of embedding very large information networks into low-dimensional vector spaces, which is useful in many tasks such as visualization, node classification, and link prediction. Most existing graph embedding methods do not scale for real-world information networks, which usually contain millions of nodes. In this paper, we propose a novel network embedding method called the ‘LINE,’ which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted. The method optimizes a carefully designed objective function that preserves both the local and global network structures. An edge-sampling algorithm is proposed that addresses the limitation of the classical stochastic gradient descent and improves both the effectiveness and the efficiency of the inference. Empirical experiments prove the effectiveness of the LINE on a variety of real-world information networks, including language networks, social networks, and citation networks. The algorithm is very efficient: it is able to learn the embedding of a network with millions of vertices and billions of edges in a few hours on a typical single machine. The source code of the LINE is available online.
Largest Gaps  In this paper, the algorithm Largest Gaps is introduced, for simultaneously clustering both rows and columns of a matrix to form homogeneous blocks. The definition of clustering is model-based: clusters and data are generated under the Latent Block Model (LBM). In comparison with algorithms designed for this model, the major advantage of the Largest Gaps algorithm is to cluster using only some marginals of the matrix, the size of which is much smaller than the whole matrix. The procedure is linear with respect to the number of entries and thus much faster than the classical algorithms. It simultaneously selects the number of classes as well, and the estimation of the parameters is then made very easily once the classification is obtained. Moreover, the paper proves the procedure to be consistent under the LBM, and it illustrates the statistical performance with some numerical experiments.
Lasagne  In this work we propose Lasagne, a methodology to learn locality and structure aware graph node embeddings in an unsupervised way. In particular, we show that the performance of existing random-walk-based approaches depends strongly on the structural properties of the graph, e.g., the size of the graph, whether the graph has a flat or upward-sloping Network Community Profile (NCP), whether the graph is expander-like, whether the classes of interest are more k-core-like or more peripheral, etc. For larger graphs with flat NCPs that are strongly expander-like, existing methods lead to random walks that expand rapidly, touching many dissimilar nodes, thereby leading to lower-quality vector representations that are less useful for downstream tasks. Rather than relying on global random walks or neighbors within fixed hop distances, Lasagne exploits strongly local Approximate Personalized PageRank stationary distributions to more precisely engineer local information into node embeddings. This leads, in particular, to more meaningful and more useful vector representations of nodes in poorly-structured graphs. We show that Lasagne leads to significant improvement in downstream multi-label classification for larger graphs with flat NCPs, that it is comparable for smaller graphs with upward-sloping NCPs, and that it is comparable to existing methods for link prediction tasks.
Lasso Penalized Sparse Asymmetric Least Squares (SALES) 
SALES 
Lasso Regression  
Lassoing Eigenvalues (elasso) 
The properties of penalized sample covariance matrices depend on the choice of the penalty function. In this paper, we introduce a class of nonsmooth penalty functions for the sample covariance matrix, and demonstrate how this method results in a grouping of the estimated eigenvalues. We refer to this method as ‘lassoing eigenvalues’ or as the ‘elasso’. 
Last Observation Projection  
LaSVM (LaSVM) 
LASVM is an approximate SVM solver that uses online approximation. It reaches accuracies similar to those of a real SVM after performing a single sequential pass through the training examples. Further benefits can be achieved using selective sampling techniques to choose which example should be considered next. As shown in the graph, LASVM requires considerably less memory than a regular SVM solver. This becomes a considerable speed advantage for large training sets. In fact, LASVM has been used to train a 10-class SVM classifier with 8 million examples on a single processor. lasvmR
Latent Association Mining in Binary Data (LAMB) 
We consider the problem of identifying groups of mutually associated variables in moderate or high dimensional data. In many cases, ordinary Pearson correlation provides useful information concerning the linear relationship between variables. However, for binary data, ordinary correlation may lose power and may lack interpretability. In this paper, we develop and investigate a new method called Latent Association Mining in Binary Data (LAMB). The LAMB method is built on the assumption that the binary observations represent a random thresholding of a latent continuous variable that may have a complex correlation structure. We consider a new measure of association, latent correlation, that is designed to assess association in the underlying continuous variable, without bias due to the mediating effects of the thresholding procedure. The full LAMB procedure makes use of iterative hypothesis testing to identify groups of latently correlated variables. LAMB is shown to improve power over existing methods in simulated settings, to be computationally efficient for large datasets, and to uncover new meaningful results from common real data types. 
Latent Attention Network  Deep neural networks are able to solve tasks across a variety of domains and modalities of data. Despite many empirical successes, we lack the ability to clearly understand and interpret the learned internal mechanisms that contribute to such effective behaviors or, more critically, failure modes. In this work, we present a general method for visualizing an arbitrary neural network’s inner mechanisms and their power and limitations. Our dataset-centric method produces visualizations of how a trained network attends to components of its inputs. The computed ‘attention masks’ support improved interpretability by highlighting which input attributes are critical in determining output. We demonstrate the effectiveness of our framework on a variety of deep neural network architectures in domains from computer vision, natural language processing, and reinforcement learning. The primary contribution of our approach is an interpretable visualization of attention that provides unique insights into the network’s underlying decision-making process irrespective of the data modality.
Latent Autoregressive Count Models  See Pedeli and Varin (2018) <arXiv:1805.10865> for details. lacm 
Latent Class Analysis (LCA) 
Latent class analysis (LCA) identifies unobservable subgroups within a population. 
Latent Class Model (LCM) 
In statistics, a latent class model (LCM) relates a set of observed (usually discrete) multivariate variables to a set of latent variables. It is a type of latent variable model. It is called a latent class model because the latent variable is discrete. A class is characterized by a pattern of conditional probabilities that indicate the chance that variables take on certain values. Latent Class Analysis (LCA) is a subset of structural equation modeling, used to find groups or subtypes of cases in multivariate categorical data. These subtypes are called “latent classes”. Confronted with a situation as follows, a researcher might choose to use LCA to understand the data: Imagine that symptoms a-d have been measured in a range of patients with diseases X, Y and Z, and that disease X is associated with the presence of symptoms a, b, and c, disease Y with symptoms b, c, and d, and disease Z with symptoms a, c and d. The LCA will attempt to detect the presence of latent classes (the disease entities), creating patterns of association in the symptoms. As in factor analysis, the LCA can also be used to classify cases according to their maximum likelihood class membership. The criterion for solving the LCA is to achieve latent classes within which there is no longer any association of one symptom with another, because the class is the disease which causes their association. Since the set of diseases a patient has (or the class a case is a member of) causes the symptom association, the symptoms will be “conditionally independent”, i.e., conditional on class membership, they are no longer related.
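The classification step can be sketched using the disease example. The class proportions and per-class symptom probabilities below are invented for illustration and are assumed to be already fitted (fitting, usually done with an EM algorithm, is not shown). Conditional independence is what allows the per-symptom probabilities to be multiplied within a class.

```python
# Hypothetical fitted parameters: class proportions and
# P(symptom present | class) for symptoms a-d.
priors = {"X": 0.5, "Y": 0.3, "Z": 0.2}
p_symptom = {
    "X": {"a": 0.9, "b": 0.9, "c": 0.9, "d": 0.1},
    "Y": {"a": 0.1, "b": 0.9, "c": 0.9, "d": 0.9},
    "Z": {"a": 0.9, "b": 0.1, "c": 0.9, "d": 0.9},
}

def class_posterior(observed):
    """Posterior class probabilities for one case, given its symptoms."""
    joint = {}
    for cls, prior in priors.items():
        likelihood = prior
        # Conditional independence: multiply per-symptom probabilities.
        for symptom, present in observed.items():
            p = p_symptom[cls][symptom]
            likelihood *= p if present else 1 - p
        joint[cls] = likelihood
    total = sum(joint.values())
    return {cls: value / total for cls, value in joint.items()}

patient = {"a": True, "b": True, "c": True, "d": False}
posterior = class_posterior(patient)
best = max(posterior, key=posterior.get)
print(best, posterior)
```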
Latent Constrained Correlation Filter (LCCF) 
Correlation filters are special classifiers designed for shift-invariant object recognition, which are robust to pattern distortions. The recent literature shows that combining a set of sub-filters trained based on a single or a small group of images obtains the best performance. The idea is equivalent to estimating variable distribution based on the data sampling (bagging), which can be interpreted as finding solutions (variable distribution approximation) directly from sampled data space. However, this methodology fails to account for the variations that exist in the data. In this paper, we introduce an intermediate step, solution sampling, after the data sampling step to form a subspace, in which an optimal solution can be estimated. More specifically, we propose a new method, named latent constrained correlation filters (LCCF), by mapping the correlation filters to a given latent subspace, and develop a new learning framework in the latent subspace that embeds distribution-related constraints into the original problem. To solve the optimization problem, we introduce a subspace-based alternating direction method of multipliers (SADMM), which is proven to converge at the saddle point. Our approach is successfully applied to three different tasks, including eye localization, car detection and object tracking. Extensive experiments demonstrate that LCCF outperforms the state-of-the-art methods. The source code will be publicly available. https://…/.
Latent Constraints  Deep generative neural networks have proven effective at both conditional and unconditional modeling of complex data distributions. Conditional generation enables interactive control, but creating new controls often requires expensive retraining. In this paper, we develop a method to condition generation without retraining the model. By post-hoc learning latent constraints, value functions that identify regions in latent space that generate outputs with desired attributes, we can conditionally sample from these regions with gradient-based optimization or amortized actor functions. Combining attribute constraints with a universal ‘realism’ constraint, which enforces similarity to the data distribution, we generate realistic conditional images from an unconditional variational autoencoder. Further, using gradient-based optimization, we demonstrate identity-preserving transformations that make the minimal adjustment in latent space to modify the attributes of an image. Finally, with discrete sequences of musical notes, we demonstrate zero-shot conditional generation, learning latent constraints in the absence of labeled data or a differentiable reward function. Code with dedicated cloud instance has been made publicly available (https://goo.gl/STGMGx). 
Latent Dirichlet Allocation (LDA) 
In natural language processing, latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics. LDA is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael Jordan in 2003. LDAvis 
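The generative story in this entry (each document is a mixture of topics, and each word is drawn from one of the document's topics) can be sketched as follows. This is an illustrative sampler only, not an inference procedure: the two topics in `TOPICS`, the prior `alpha`, and the function names are hypothetical, and the topic-word distributions are simply fixed here rather than learned or drawn from their own prior.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topics, alpha, rng):
    """Generate one document: a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # this document's mixture over topics
    words = []
    for _ in range(n_words):
        # Pick the topic responsible for this word, then a word from that topic.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        vocab = list(topics[k])
        weights = [topics[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Hypothetical topics: each maps words to probabilities that sum to 1.
TOPICS = [
    {"goal": 0.5, "team": 0.4, "vote": 0.1},
    {"vote": 0.5, "party": 0.4, "team": 0.1},
]

rng = random.Random(0)
theta, words = generate_document(8, TOPICS, alpha=[0.5, 0.5], rng=rng)
```

Choosing `alpha` values below 1 tends to produce documents dominated by few topics, which matches the "small number of topics" assumption in the description.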
Latent Dirichlet allocation Gibbs Newton (LDAGN) 
Hyperparameters play a major role in the learning and inference process of latent Dirichlet allocation (LDA). In order to begin the LDA latent variables learning process, these hyperparameter values need to be predetermined. We propose an extension for LDA that we call ‘Latent Dirichlet allocation Gibbs Newton’ (LDAGN), which places non-informative priors over these hyperparameters and uses Gibbs sampling to learn appropriate values for them. At the heart of LDAGN is our proposed ‘Gibbs-Newton’ algorithm, which is a new technique for learning the parameters of multivariate Polya distributions. We report Gibbs-Newton performance results compared with two prominent existing approaches to the latter task: Minka’s fixed-point iteration method and the Moments method. We then evaluate LDAGN in two ways: (i) by comparing it with standard LDA in terms of the ability of the resulting topic models to generalize to unseen documents; (ii) by comparing it with standard LDA in its performance on a binary classification task. 
Latent Factor Interpretation (LFI) 
Many machine learning systems utilize latent factors as internal representations for making predictions. However, since these latent factors are largely uninterpreted, predictions made using them are opaque. Collaborative filtering via matrix factorization is a prime example of such an algorithm that uses uninterpreted latent features, and yet has seen widespread adoption for many recommendation tasks. We present Latent Factor Interpretation (LFI), a method for interpreting models by leveraging interpretations of latent factors in terms of human-understandable features. The interpretation of latent factors can then replace the uninterpreted latent factors, resulting in a new model that expresses predictions in terms of interpretable features. This new model can then be interpreted using recently developed model explanation techniques. In this paper, we develop LFI for collaborative filtering based recommender systems, which are particularly challenging from an interpretation perspective. We illustrate the use of LFI interpretations on the MovieLens dataset demonstrating that latent factors can be predicted with enough accuracy for accurately replicating the predictions of the true model. Further, we demonstrate the accuracy of interpretations by applying the methodology to a collaborative recommender system using DB tropes and IMDB data and synthetic user preferences. 
Latent Feature Relational Model (LFRM) 
We present a discriminative nonparametric latent feature relational model (LFRM) for link prediction to automatically infer the dimensionality of latent features. Under the generic RegBayes (regularized Bayesian inference) framework, we handily incorporate the prediction loss with probabilistic inference of a Bayesian model; set distinct regularization parameters for different types of links to handle the imbalance issue in real networks; and unify the analysis of both the smooth logistic log-loss and the piecewise linear hinge loss. For the non-conjugate posterior inference, we present a simple Gibbs sampler via data augmentation, without making restricting assumptions as done in variational methods. We further develop an approximate sampler using stochastic gradient Langevin dynamics to handle large networks with hundreds of thousands of entities and millions of links, orders of magnitude larger than what existing LFRM models can process. Extensive studies on various real networks show promising performance. 
Latent Gaussian Process Regression  We introduce Latent Gaussian Process Regression which is a latent variable extension allowing modelling of nonstationary processes using stationary GP priors. The approach is built on extending the input space of a regression problem with a latent variable that is used to modulate the covariance function over the input space. We show how our approach can be used to model nonstationary processes but also how multimodal or nonfunctional processes can be described where the input signal cannot fully disambiguate the output. We exemplify the approach on a set of synthetic data and provide results on real data from geostatistics. 
Latent Order Logistic (LOLOG) 
Full probability models are critical for the statistical modeling of complex networks, and yet there are few general, flexible and widely applicable generative methods. We propose a new family of probability models motivated by the idea of network growth, which we call the Latent Order Logistic (LOLOG) model. LOLOG is a fully general framework capable of describing any probability distribution over graph configurations, though not all distributions are easily expressible or estimable as a LOLOG. We develop inferential procedures based on Monte Carlo Method of Moments, Generalized Method of Moments and variational inference. To show the flexibility of the model framework, we show how so-called scale-free networks can be modeled as LOLOGs via preferential attachment. The advantages of LOLOG in terms of avoidance of degeneracy, ease of sampling, and model flexibility are illustrated. Connections with the popular Exponential-family Random Graph model (ERGM) are also explored, and we find that they are identical in the case of dyadic independence. Finally, we apply the model to a social network of collaboration within a corporate law firm, a friendship network among adolescent students, and the friendship relations in an online social network. 
Latent Profile Analysis (LPA) 
tidyLPA 
Latent RANSAC  We present a method that can evaluate a RANSAC hypothesis in constant time, i.e. independent of the size of the data. A key observation here is that correct hypotheses are tightly clustered together in the latent parameter domain. In a manner similar to the generalized Hough transform we seek to find this cluster, only that we need as few as two votes for a successful detection. Rapidly locating such pairs of similar hypotheses is made possible by adapting the recent ‘Random Grids’ range-search technique. We only perform the usual (costly) hypothesis verification stage upon the discovery of a close pair of hypotheses. We show that this event rarely happens for incorrect hypotheses, enabling a significant speedup of the RANSAC pipeline. The suggested approach is applied and tested on three robust estimation problems: camera localization, 3D rigid alignment and 2D-homography estimation. We perform rigorous testing on both synthetic and real datasets, demonstrating an improvement in efficiency without a compromise in accuracy. Furthermore, we achieve state-of-the-art 3D alignment results on the challenging ‘Redwood’ loop-closure challenge. 
Latent Semantic Analysis (LSA) 
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. Words are then compared by taking the cosine of the angle between the two vectors formed by any two rows. Values close to 1 represent very similar words while values close to 0 represent very dissimilar words. 
Latent Sequence Decompositions (LSD) 
We present the Latent Sequence Decompositions (LSD) framework. LSD decomposes sequences with variable-length output units as a function of both the input sequence and the output sequence. We present a training algorithm which samples valid extensions and an approximate decoding algorithm. We experiment with the Wall Street Journal speech recognition task. Our LSD model achieves 12.9% WER compared to a character baseline of 14.8% WER. When combined with a convolutional network on the encoder, we achieve 9.2% WER. 
Latent Structure Analysis (LSA) 
Latent structure analysis (LSA) is a broad category that subsumes several individual methods, including latent class analysis (LCA) and latent trait analysis (LTA). The purpose of LSA is to infer, from observed variables (manifest variables), the structure of other, more fundamental variables that are not directly observed (latent variables). Both manifest variables and latent variables can be binary, nominal, ordered-categorical, or interval/continuous, leading to a large number of different combinations and different methods. For example, classical latent class analysis considers binary, nominal, or ordered-categorical manifest variables and nominal latent variables, and latent trait analysis considers binary or ordered-categorical variables and continuous latent variables. 
Latent Structure Learning (LSL) 
lsl 
Latent Trait Analysis (LTA) 
Latent Trait Analysis (LTA), a form of latent structure analysis (Lazarsfeld & Henry, 1968), is used for the analysis of categorical data. The simplest way to understand it is that LTA is a form of factor analysis for binary (dichotomous) or ordered-category data. In the area of educational testing and psychological measurement, latent trait analysis is termed Item Response Theory (IRT). There is so much overlap between LTA and IRT that these terms are basically interchangeable. 
Latent Transition Analysis (LTA) 
Latent transition analysis (LTA) and latent class analysis (LCA) are closely related methods. LCA identifies unobservable (latent) subgroups within a population based on individuals’ responses to multiple observed variables. LTA is an extension of LCA that uses longitudinal data to identify movement between the subgroups over time. 
Latent Tree Models  Latent tree models are graphical models defined on trees, in which only a subset of variables is observed. They were first discussed by Judea Pearl as tree-decomposable distributions to generalise star-decomposable distributions such as the latent class model. Latent tree models, or their submodels, are widely used in: phylogenetic analysis, network tomography, computer vision, causal modeling, and data clustering. They also contain other well-known classes of models like hidden Markov models, Brownian motion tree model, the Ising model on a tree, and many popular models used in phylogenetics. This article offers a concise introduction to the theory of latent tree models. We emphasise the role of tree metrics in the structural description of this model class, in designing learning algorithms, and in understanding fundamental limits of what and when can be learned. 
Latent Tree Variational Autoencoder (LTVAE) 
Recently, deep learning based clustering methods have been shown to be superior to traditional ones by jointly conducting representation learning and clustering. These methods rely on the assumptions that the number of clusters is known, and that there is one single partition over the data and all attributes define that partition. However, in real-world applications, prior knowledge of the number of clusters is usually unavailable and there are multiple ways to partition the data based on subsets of attributes. To resolve these issues, we propose the latent tree variational autoencoder (LTVAE), which simultaneously performs representation learning and multidimensional clustering. LTVAE learns latent embeddings from data, discovers multi-facet clustering structures based on subsets of latent features, and automatically determines the number of clusters in each facet. Experiments show that the proposed method achieves state-of-the-art clustering performance and reveals reasonable multi-facet structures of the data. 
Latent Variable Mixture Model (LVMM) 
Latent variable mixture modeling (LVMM) is a flexible analytic tool that allows researchers to investigate questions about patterns of data and to determine the extent to which identified patterns relate to important variables. For example, • Do patterns of co-occurring developmental and medical diagnoses influence the severity of pediatric feeding problems (Berlin, Lobato, Pinkos, Cerezo, & LeLeiko, 2011)? • Do differential longitudinal trajectories of glycemic control exist among youth with type 1 diabetes (Helgeson et al., 2010)? • Do differential trajectories of adherence among youth newly diagnosed with epilepsy exist (Modi, Rausch, & Glauser, 2011), and if so, • Do psychosocial and demographic variables predict these patterns? • Do patterns of perceived stressors among youth with type 1 diabetes differentially affect glycemic control (Berlin, Rabideau, & Hains, 2012)? http://…cgi?article=1093&context=famconfacpub http://…/latentvariablemixturemodelslvmm.html 
Latent Variable Model  A latent variable model is a statistical model that relates a set of variables (so-called manifest variables) to a set of latent variables. It is assumed that the responses on the indicators or manifest variables are the result of an individual’s position on the latent variable(s), and that the manifest variables have nothing in common after controlling for the latent variable (local independence). Different types of the latent variable model can be grouped according to whether the manifest and latent variables are categorical or continuous. 
Latin Hypercube Design  MOLHD 
Latitude  Nonnegative matrix factorization (NMF) is one of the most frequently used matrix factorization models in data analysis. A significant reason for the popularity of NMF is its interpretability and the ‘parts of whole’ interpretation of its components. Recently, max-times, or subtropical, matrix factorization (SMF) has been introduced as an alternative model with an equally interpretable ‘winner takes it all’ interpretation. In this paper we propose a new mixed linear–tropical model, and a new algorithm, called Latitude, that combines NMF and SMF, being able to smoothly alternate between the two. In our model, the data is modeled using the latent factors and latent parameters that control whether the factors are interpreted as NMF or SMF features, or their mixtures. We present an algorithm for our novel matrix factorization. Our experiments show that our algorithm improves over both baselines, and can yield interpretable results that reveal more of the latent structure than either NMF or SMF alone. 
Lavaan Project  The lavaan package is developed to provide useRs, researchers and teachers a free, open-source, but commercial-quality package for latent variable modeling. You can use lavaan to estimate a large variety of multivariate statistical models, including path analysis, confirmatory factor analysis, structural equation modeling and growth curve models. The official reference to the lavaan package is the following paper: Yves Rosseel (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1-36. URL http://…/i02 lavaan,lavaan.shiny,lavaanPlot,blavaan 
Law of Likelihood  ‘If hypothesis A implies that the probability that a random variable X takes the value x is pA(x), while hypothesis B implies that the probability is pB(x), then the observation X = x is evidence supporting A over B if and only if pA(x) > pB(x), and the likelihood ratio, pA(x)/pB(x), measures the strength of that evidence.’ ‘This says simply that if an event is more probable under hypothesis A than hypothesis B, then the occurrence of that event is evidence supporting A over B – the hypothesis that did the better job of predicting the event is better supported by its occurrence.’ Moreover, the likelihood ratio is the exact factor by which the probability ratio (the ratio of the prior probabilities of A and B) is changed by the observation. 
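As a small worked illustration of the law, consider two hypotheses about a coin. The numbers are hypothetical: hypothesis A says the probability of heads is 0.7, hypothesis B says it is 0.5, and 7 heads are observed in 10 tosses. The function names are likewise assumptions made for this sketch.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def likelihood_ratio(k, n, p_a, p_b):
    """pA(x) / pB(x) for the observation x = 'k successes in n trials'."""
    return binomial_pmf(k, n, p_a) / binomial_pmf(k, n, p_b)

# Hypothetical data: 7 heads in 10 tosses; A: p = 0.7, B: p = 0.5.
lr = likelihood_ratio(k=7, n=10, p_a=0.7, p_b=0.5)
supports_a = lr > 1  # the observation favours A over B exactly when the ratio exceeds 1

# Posterior odds = prior odds * likelihood ratio (here with even prior odds).
prior_odds = 1.0
posterior_odds = prior_odds * lr
```

With these numbers the ratio is a little above 2, so the data are modest evidence for A over B; starting from even prior odds, the posterior odds equal the likelihood ratio itself.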
Layered SelfOrganizing Map (LSOM) 
This paper defines a new learning architecture, Layered Self-Organizing Maps (LSOMs), that uses the SOM and supervised-SOM learning algorithms. The architecture is validated with the MNIST database of handwritten digit images. LSOMs are similar to convolutional neural nets (covnets) in the way they sample data, but different in the way they represent features and learn. LSOMs analyze (or generate) image patches with maps of exemplars determined by the SOM learning algorithm rather than feature maps from filter-banks learned via backprop. LSOMs provide an alternative to features derived from covnets. Multi-layer LSOMs are trained bottom-up, without the use of backprop and therefore may be of interest as a model of the visual cortex. The results show organization at multiple levels. The algorithm appears to be resource efficient in learning, classifying and generating images. Although LSOMs can be used for classification, their validation accuracy for these exploratory runs was well below the state of the art. The goal of this article is to define the architecture and display the structures resulting from its application to the MNIST images. 
Layered Tree-Based Pipeline Optimization Tool (Layered TPOT) 
With the demand for machine learning increasing, so does the demand for tools which make it easier to use. Automated machine learning (AutoML) tools have been developed to address this need, such as the Tree-Based Pipeline Optimization Tool (TPOT) which uses genetic programming to build optimal pipelines. We introduce Layered TPOT, a modification to TPOT which aims to create pipelines equally good as the original, but in significantly less time. This approach evaluates candidate pipelines on increasingly large subsets of the data according to their fitness, using a modified evolutionary algorithm to allow for separate competition between pipelines trained on different sample sizes. Empirical evaluation shows that, on sufficiently large datasets, Layered TPOT indeed finds better models faster. ➘ “Tree-Based Pipeline Optimization Tool” 
Layer-Wise Relevance Propagation (LRP) 
Despite the tremendous achievements of deep convolutional neural networks (CNNs) in most computer vision tasks, understanding how they actually work remains a significant challenge. In this paper, we propose a novel two-step visualization method that aims to shed light on how deep CNNs recognize images and the objects therein. We start out with a layer-wise relevance propagation (LRP) step which estimates a pixel-wise relevance map over the input image. Next, we construct a context-aware saliency map from the LRP-generated map which predicts regions close to the foci of attention. We show that our algorithm clearly and concisely identifies the key pixels that contribute to the underlying neural network’s comprehension of images. Experimental results using the ILSVRC2012 validation dataset in conjunction with two well-established deep CNNs demonstrate that combining the LRP with the visual salience estimation can give great insight into how a CNN model perceives and understands a presented scene, in relation to what it has learned in the prior training phase. 
Lazy Bayesian Rules (LBR) 
The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. A number of approaches have sought to alleviate this problem. A Bayesian tree learning algorithm builds a decision tree, and generates a local naive Bayesian classifier at each leaf. The tests leading to a leaf can alleviate attribute inter-dependencies for the local naive Bayesian classifier. However, Bayesian tree learning still suffers from the small disjunct problem of tree learning. While inferred Bayesian trees demonstrate low average prediction error rates, there is reason to believe that error rates will be higher for those leaves with few training examples. This paper proposes the application of lazy learning techniques to Bayesian tree induction and presents the resulting lazy Bayesian rule learning algorithm, called Lbr. This algorithm can be justified by a variant of Bayes’ theorem which supports a weaker conditional attribute independence assumption than is required by naive Bayes. For each test example, it builds a most appropriate rule with a local naive Bayesian classifier as its consequent. It is demonstrated that the computational requirements of Lbr are reasonable in a wide cross-section of natural domains. Experiments with these domains show that, on average, this new algorithm obtains lower error rates significantly more often than the reverse in comparison to a naive Bayesian classifier, C4.5, a Bayesian tree learning algorithm, a constructive Bayesian classifier that eliminates attributes and constructs new attributes using Cartesian products of existing nominal attributes, and a lazy decision tree learning algorithm. It also outperforms, although the result is not statistically significant, a selective naive Bayesian classifier. http://…/ZhengWebbTing99.pdf http://…/CRPITV87Xie.pdf 
Lazy Learning  In artificial intelligence, lazy learning is a learning method in which generalization beyond the training data is delayed until a query is made to the system, as opposed to eager learning, where the system tries to generalize the training data before receiving queries. The main advantage gained in employing a lazy learning method, such as case-based reasoning, is that the target function will be approximated locally, such as in the k-nearest neighbor algorithm. Because the target function is approximated locally for each query to the system, lazy learning systems can simultaneously solve multiple problems and deal successfully with changes in the problem domain. The disadvantages of lazy learning include the large space requirement to store the entire training dataset. Particularly noisy training data increases the case base unnecessarily, because no abstraction is made during the training phase. Another disadvantage is that lazy learning methods are usually slower to evaluate, though this is coupled with a faster training phase. Lazy classifiers are most useful for large datasets with few attributes. 
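The trade-off described above is easy to see in a minimal k-nearest-neighbor sketch. The class name `LazyKNN` and the toy data are hypothetical; the point is only that `fit` stores the examples without any abstraction, and all the work happens per query in `predict`.

```python
from collections import Counter
from math import dist

class LazyKNN:
    """k-nearest-neighbor classifier: no generalization until a query arrives."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []

    def fit(self, points, labels):
        # "Training" only stores the data, so it is fast but memory-hungry.
        self.examples = list(zip(points, labels))
        return self

    def predict(self, query):
        # The target function is approximated locally, around this one query.
        nearest = sorted(self.examples, key=lambda ex: dist(ex[0], query))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups in the plane.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
model = LazyKNN(k=3).fit(points, labels)

near_origin = model.predict((0.2, 0.2))
near_far_group = model.predict((5.5, 5.5))
```

Each call to `predict` scans every stored example, which is why evaluation is slower than training and why storage grows with the training set.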
Lazy Stochastic Principal Component Analysis (Lazy SPCA) 
Stochastic principal component analysis (SPCA) has become a popular dimensionality reduction strategy for large, high-dimensional datasets. We derive a simplified algorithm, called Lazy SPCA, which has reduced computational complexity and is better suited for large-scale distributed computation. We prove that SPCA and Lazy SPCA find the same approximations to the principal subspace, and that the pairwise distances between samples in the lower-dimensional space are invariant to whether SPCA is executed lazily or not. Empirical studies find downstream predictive performance to be identical for both methods, and superior to random projections, across a range of predictive models (linear regression, logistic lasso, and random forests). In our largest experiment with 4.6 million samples, Lazy SPCA reduced 43.7 hours of computation to 9.9 hours. Overall, Lazy SPCA relies exclusively on matrix multiplications, besides an operation on a small square matrix whose size depends only on the target dimensionality. 
lda2vec  Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she ‘used to wear scrubs to work’, and distill ‘taking a trip’ into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I’ll try to convince you that word vectors give us a simple and flexible platform for understanding text while speaking about word2vec, LDA, and introduce our hybrid algorithm lda2vec. 
Leader Clustering Algorithm  The leader clustering algorithm provides a means for clustering a set of data points. Unlike many other clustering algorithms it does not require the user to specify the number of clusters, but instead requires the approximate radius of a cluster as its primary tuning parameter. The package provides a fast implementation of this algorithm in n dimensions using Lp-distances (with special cases for p=1,2, and infinity) as well as for spatial data using the Haversine formula, which takes latitude/longitude pairs as inputs and clusters based on great circle distances. leaderCluster 
Leaders and Subleaders Algorithm  An efficient hierarchical clustering algorithm, suitable for large data sets, is proposed for effective clustering and prototype selection for pattern classification. It is another simple and efficient technique which uses incremental clustering principles to generate a hierarchical structure for finding the subgroups/subclusters within each cluster. As an example, a two-level clustering algorithm, Leaders-Subleaders, an extension of the leader algorithm, is presented. Classification accuracy (CA) obtained using the representatives generated by the Leaders-Subleaders method is found to be better than that of using leaders as representatives. Even if a larger number of prototypes is generated, classification time is lower, as only a part of the hierarchical structure is searched. 
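A minimal sketch of the classic single-pass leader algorithm is shown below, using Euclidean distance. It is an illustration of the general idea only and is not the leaderCluster package's implementation, which may differ (for example in how points are matched to leaders or in the distance options); the function name and toy data are hypothetical.

```python
from math import dist

def leader_cluster(points, radius):
    """Single-pass leader clustering with Euclidean distance.

    Each point joins the first existing leader within `radius`;
    otherwise it becomes the leader of a new cluster.
    Returns (leaders, assignments), where assignments[i] indexes into leaders.
    """
    leaders = []
    assignments = []
    for p in points:
        for idx, leader in enumerate(leaders):
            if dist(p, leader) <= radius:
                assignments.append(idx)
                break
        else:
            # No leader is close enough: start a new cluster at this point.
            leaders.append(p)
            assignments.append(len(leaders) - 1)
    return leaders, assignments

# Hypothetical toy data: two groups far apart; the cluster count is not given up front.
points = [(0, 0), (0.5, 0), (10, 10), (10.2, 9.9), (0.1, 0.3)]
leaders, assignments = leader_cluster(points, radius=1.0)
```

Only the radius is supplied; the number of clusters falls out of the data. Note that the result of this simple variant depends on the order in which points are visited.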
Lead-like Recognizer (LeadR) 
A competitive baseline in sentence-level extractive summarization of news articles is the Lead-3 heuristic, where only the first 3 sentences are extracted. The success of this method is due to the tendency for writers to implement progressive elaboration in their work by writing the most important content at the beginning. In this paper, we introduce the Lead-like Recognizer (LeadR) to show how the Lead heuristic can be extended to summarize multi-section documents where it would not usually work well. This is done by introducing a neural model which produces a probability distribution over positions for sentences, so that we can locate sentences with introduction-like qualities. To evaluate the performance of our model, we use the task of summarizing multi-section documents. LeadR outperforms several baselines on this task, including a simple extension of the Lead heuristic designed for the task. Our work suggests that predicted position is a strong feature to use when extracting summaries. 
leaflet  Leaflet is a modern open-source JavaScript library for mobile-friendly interactive maps. It is developed by Vladimir Agafonkin with a team of dedicated contributors. Weighing just about 33 KB of JS, it has all the features most developers ever need for online maps. Leaflet is designed with simplicity, performance and usability in mind. It works efficiently across all major desktop and mobile platforms out of the box, taking advantage of HTML5 and CSS3 on modern browsers while still being accessible on older ones. It can be extended with a huge number of plugins, has a beautiful, easy-to-use and well-documented API and a simple, readable source code that is a joy to contribute to. http://…neo4jspatialandleafletjswithmapbox Leaflet: Interactive web maps with R leaflet 
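For reference, the Lead-3 baseline mentioned above amounts to very little code. The sketch below is an assumption-laden illustration: the function name `lead_k` is hypothetical, and the regex sentence splitter is deliberately naive (it breaks on abbreviations, for instance). It shows only the baseline heuristic, not the LeadR model.

```python
import re

def lead_k(text, k=3):
    """Return the first k sentences of `text` as an extractive summary.

    Sentences are split naively after '.', '!' or '?' followed by whitespace.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return sentences[:k]

article = "A b. C d! E f? G h. I j."
summary = lead_k(article)          # Lead-3: the first three sentences
short = lead_k("Only one sentence here.")
```

The heuristic works when the most important content comes first, which is exactly the assumption that breaks down for multi-section documents.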
Lean Analytics  Lean Analytics is about measuring the right thing, in the right way, to produce the change the business needs the most at that point in time. With that in mind, here’s some background on metrics that matter. 
Learnable Histogram  Statistical features, such as histogram, Bag-of-Words (BoW) and Fisher Vector, were commonly used with hand-crafted features in conventional classification methods, but have attracted less attention since deep learning methods became popular. In this paper, we propose a learnable histogram layer, which learns histogram features within deep neural networks in end-to-end training. Such a layer is able to back-propagate (BP) errors, learn optimal bin centers and bin widths, and be jointly optimized with other layers in deep networks during training. Two vision problems, semantic segmentation and object detection, are explored by integrating the learnable histogram layer into deep networks, which show that the proposed layer could be well generalized to different applications. In-depth investigations are conducted to provide insights on the newly introduced layer. 
Learning Active Learning (LAL) 
In this paper, we suggest a novel data-driven approach to active learning: Learning Active Learning (LAL). The key idea behind LAL is to train a regressor that predicts the expected error reduction for a potential sample in a particular learning state. By treating the query selection procedure as a regression problem we are not restricted to dealing with existing AL heuristics; instead, we learn strategies based on experience from previous active learning experiments. We show that LAL can be learnt from a simple artificial 2D dataset and yields strategies that work well on real data from a wide range of domains. Moreover, if some domain-specific samples are available to bootstrap active learning, the LAL strategy can be tailored for a particular problem. 
Learning Analytics (LA) 
Learning analytics is the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimising learning and the environments in which it occurs. A related field is educational data mining. For general audience introductions, see: • The Educause Learning Initiative Briefing • The Educause Review on Learning analytics • And the UNESCO “Learning Analytics Policy Brief” (2012) 
Learning Automata Based SVM (LASVM) 
As an indispensable defensive measure of network security, the intrusion detection is a process of monitoring the events occurring in a computer system or network and analyzing them for signs of possible incidents. It is a classifier to judge the event is normal or malicious. The information used for intrusion detection contains some redundant features which would increase the difficulty of training the classifier for intrusion detection and increase the time of making predictions. To simplify the training process and improve the efficiency of the classifier, it is necessary to remove these dispensable features. in this paper, we propose a novel LASVM scheme to automatically remove redundant features focusing on intrusion detection. This is the first application of learning automata for solving dimension reduction problems. The simulation results indicate that the LASVM scheme achieves a higher accuracy and is more efficient in making predictions compared with traditional SVM. 
Learning by Association  In many real-world scenarios, labeled data for a specific machine learning task is costly to obtain. Semi-supervised training methods make use of abundantly available unlabeled data and a smaller number of labeled examples. We propose a new framework for semi-supervised training of deep neural networks inspired by learning in humans. ‘Associations’ are made from embeddings of labeled samples to those of unlabeled ones and back. The optimization schedule encourages correct association cycles that end up at the same class from which the association was started and penalizes wrong associations ending at a different class. The implementation is easy to use and can be added to any existing end-to-end training setup. We demonstrate the capabilities of learning by association on several data sets and show that it can improve performance on classification tasks tremendously by making use of additionally available unlabeled data. In particular, for cases with few labeled data, our training scheme outperforms the current state of the art on SVHN. 
Learning Classifier System (LCS) 
Learning classifier systems, or LCS, are a paradigm of rule-based machine learning methods that combine a discovery component (typically a genetic algorithm) with a learning component (performing either supervised learning, reinforcement learning, or unsupervised learning).[2] Learning classifier systems seek to identify a set of context-dependent rules that collectively store and apply knowledge in a piecewise manner in order to make predictions (e.g. behavior modeling,[3] classification,[4][5] data mining,[5][6][7] regression,[8] function approximation,[9] or game strategy). This approach allows complex solution spaces to be broken up into smaller, simpler parts. The founding concepts behind learning classifier systems came from attempts to model complex adaptive systems, using rule-based agents to form an artificial cognitive system (i.e. artificial intelligence). 
Learning Curve  Plots relating performance to experience are widely used in machine learning. Performance is the error rate or accuracy of the learning system, while experience may be the number of training examples used for learning or the number of iterations used in optimizing the system model parameters. The machine learning curve is useful for many purposes including comparing different algorithms, choosing model parameters during design, adjusting optimization to improve convergence, and determining the amount of data used for training. 
learning from Subgraphs, Embeddings, and Attributes for Link prediction (SEAL) 
Traditional methods for link prediction can be categorized into three main types: graph structure feature-based, latent feature-based, and explicit feature-based. Graph structure feature methods leverage some handcrafted node proximity scores, e.g., common neighbors, to estimate the likelihood of links. Latent feature methods rely on factorizing networks’ matrix representations to learn an embedding for each node. Explicit feature methods train a machine learning model on two nodes’ explicit attributes. Each of the three types of methods has its unique merits. In this paper, we propose SEAL (learning from Subgraphs, Embeddings, and Attributes for Link prediction), a new framework for link prediction which combines the power of all three types into a single graph neural network (GNN). A GNN is a new type of neural network which directly accepts graphs as input and outputs their labels. In SEAL, the input to the GNN is a local subgraph around each target link. We prove theoretically that our local subgraphs also preserve a great deal of high-order graph structure features related to link existence. Another key feature is that our GNN can naturally incorporate latent features and explicit features. This is achieved by concatenating node embeddings (latent features) and node attributes (explicit features) in the node information matrix for each subgraph, thus combining the three types of features to enhance GNN learning. Through extensive experiments, SEAL shows unprecedentedly strong performance against a wide range of baseline methods, including various link prediction heuristics and network embedding methods. 
Learning M-Way Tree (LMW-tree) 
LMW-tree is a generic template library written in C++ that implements several algorithms that use the m-way nearest neighbor tree structure to store their data. See the related PhD thesis for more details on m-way nn trees. The algorithms and data structures are generic to support different data representations such as dense real valued and bit vectors, and sparse vectors. Additionally, it can index any object type that can form a prototype representation of a set of objects. The algorithms are primarily focussed on computationally efficient clustering. Clustering is an unsupervised machine learning process that finds interesting patterns in data. It places similar items into clusters and dissimilar items into different clusters. The data structures and algorithms can also be used for nearest neighbor search, supervised learning and other machine learning applications. The package includes EM-tree, K-tree, k-means, TSVQ, repeated k-means, clustering, random projections, random indexing, hashing, bit signatures. See the related PhD thesis for more details on these algorithms and representations. LMW-tree is licensed under the BSD license. 
Learning Solving Procedure  It is expected that progress toward true artificial intelligence will be achieved through the emergence of a system that integrates representation learning and complex reasoning (LeCun et al. 2015). In response to this prediction, research has been conducted on implementing the symbolic reasoning of a von Neumann computer in an artificial neural network (Graves et al. 2016; Graves et al. 2014; Reed et al. 2015). However, these studies have many limitations in realizing neural-symbolic integration (Jaeger 2016). Here, we present a new learning paradigm: a learning solving procedure (LSP) that learns the procedure for solving complex problems. This is not accomplished merely by learning input-output data, but by learning algorithms through a solving procedure that obtains the output as a sequence of tasks for a given input problem. The LSP neural network system not only learns simple problems of addition and multiplication, but also the algorithms of complicated problems, such as complex arithmetic expression, sorting, and Hanoi Tower. To realize this, the LSP neural network structure consists of a deep neural network and long short-term memory, which are recursively combined. Through experimentation, we demonstrate the efficiency and scalability of LSP and its validity as a mechanism of complex reasoning. 
Learning Through Deterministic Assignment of Hidden Parameters (LtDaHP) 
Supervised learning frequently boils down to determining hidden and bright parameters in a parameterized hypothesis space based on finite input-output samples. The hidden parameters determine the attributions of hidden predictors or the nonlinear mechanism of an estimator, while the bright parameters characterize how hidden predictors are linearly combined or the linear mechanism. In the traditional learning paradigm, hidden and bright parameters are not distinguished and are trained simultaneously in one learning process. Such one-stage learning (OSL) eases theoretical analysis but suffers from a high computational burden. To overcome this difficulty, a two-stage learning (TSL) scheme, featured by learning through deterministic assignment of hidden parameters (LtDaHP), was proposed, which suggests deterministically generating the hidden parameters by using minimal Riesz energy points on a sphere and equally spaced points in an interval. We theoretically show that with such deterministic assignment of hidden parameters, LtDaHP with a neural network realization has almost the same generalization performance as OSL. We also present a series of simulations and application examples to support the outperformance of LtDaHP. 
Learning to Coordinate and Teach Reinforcement (LeCTR) 
We present a framework and algorithm for peer-to-peer teaching in cooperative multiagent reinforcement learning. Our algorithm, Learning to Coordinate and Teach Reinforcement (LeCTR), trains advising policies by using students’ learning progress as a teaching reward. Agents using LeCTR learn to assume the role of a teacher or student at the appropriate moments, exchanging action advice to accelerate the entire learning process. Our algorithm supports teaching heterogeneous teammates, advising under communication constraints, and learns both what and when to advise. LeCTR is demonstrated to outperform the final performance and rate of learning of prior teaching methods on multiple benchmark domains. To our knowledge, this is the first approach for learning to teach in a multiagent setting. 
Learning to Multitask (L2MT) 
Multitask learning has shown promising performance in many applications and many multitask models have been proposed. In order to identify an effective multitask model for a given multitask problem, we propose a learning framework called learning to multitask (L2MT). To achieve this goal, L2MT exploits historical multitask experience, which is organized as a training set consisting of several tuples, each of which contains a multitask problem with multiple tasks, a multitask model, and the relative test error. Based on such a training set, L2MT first uses a proposed layerwise graph neural network to learn task embeddings for all the tasks in a multitask problem and then learns an estimation function to estimate the relative test error based on task embeddings and the representation of the multitask model based on a unified formulation. Given a new multitask problem, the estimation function is used to identify a suitable multitask model. Experiments on benchmark datasets show the effectiveness of the proposed L2MT framework. 
Learning to Teach  Teaching plays a very important role in our society, by spreading human knowledge and educating our next generations. A good teacher will select appropriate teaching materials, impart suitable methodologies, and set up targeted examinations, according to the learning behaviors of the students. In the field of artificial intelligence, however, the role of teaching has not been fully explored, and most attention is paid to machine learning. In this paper, we argue that equal attention, if not more, should be paid to teaching, and furthermore, an optimization framework (instead of heuristics) should be used to obtain good teaching strategies. We call this approach ‘learning to teach’. In the approach, two intelligent agents interact with each other: a student model (which corresponds to the learner in traditional machine learning algorithms), and a teacher model (which determines the appropriate data, loss function, and hypothesis space to facilitate the training of the student model). The teacher model leverages the feedback from the student model to optimize its own teaching strategies by means of reinforcement learning, so as to achieve teacher-student co-evolution. To demonstrate the practical value of our proposed approach, we take the training of deep neural networks (DNN) as an example, and show that by using the learning to teach techniques, we are able to use much less training data and fewer iterations to achieve almost the same accuracy for different kinds of DNN models (e.g., multi-layer perceptron, convolutional neural networks and recurrent neural networks) under various machine learning tasks (e.g., image classification and text understanding). 
Learning Under Privileged Information (LUPI) 
Conformal Prediction in Learning Under Privileged Information Paradigm with Applications in Drug Discovery 
Learning Vector Quantization (LVQ) 
In computer science, learning vector quantization (LVQ) is a prototype-based supervised classification algorithm. LVQ is the supervised counterpart of vector quantization systems. 
Learning with Counts  
Learning with Opponent-Learning Awareness (LOLA) 
Multi-agent settings are quickly gathering importance in machine learning. Beyond a plethora of recent work on deep multi-agent reinforcement learning, hierarchical reinforcement learning, generative adversarial networks and decentralized optimization can all be seen as instances of this setting. However, the presence of multiple learning agents in these settings renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method that reasons about the anticipated learning of the other agents. The LOLA learning rule includes an additional term that accounts for the impact of the agent’s policy on the anticipated parameter update of the other agents. We show that the LOLA update rule can be efficiently calculated using an extension of the likelihood ratio policy gradient update, making the method suitable for model-free reinforcement learning. This method thus scales to large parameter and input spaces and nonlinear function approximators. Preliminary results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the infinitely iterated prisoners’ dilemma, while independent learning does not. In this domain, LOLA also receives higher payouts compared to a naive learner, and is robust against exploitation by higher order gradient-based methods. Applied to infinitely repeated matching pennies, only LOLA agents converge to the Nash equilibrium. We also apply LOLA to a grid world task with an embedded social dilemma using deep recurrent policies. Again, by considering the learning of the other agent, LOLA agents learn to cooperate out of selfish interests. 
Learning-Based Visual Saliency Fusion Method for HDR Content (LVBSHDR) 
Saliency prediction for Standard Dynamic Range (SDR) videos has been well explored in the last decade. However, limited studies are available on High Dynamic Range (HDR) Visual Attention Models (VAMs). Considering that the characteristics of HDR content in terms of dynamic range and color gamut are quite different from those of SDR content, it is essential to identify the importance of different saliency attributes of HDR videos for designing a VAM and to understand how to combine these features. To this end we propose a learning-based visual saliency fusion method for HDR content (LVBSHDR) to combine various visual saliency features. In our approach various conspicuity maps are extracted from HDR data, and then, to fuse the conspicuity maps, a Random Forests algorithm is used to train a model based on the data collected from an eye-tracking experiment. Performance evaluations demonstrate the superiority of the proposed fusion method over other existing fusion methods. 
Least Absolute Deviations (LAD) 
Least absolute deviations (LAD), also known as Least Absolute Errors (LAE), Least Absolute Value (LAV), Least Absolute Residual (LAR), or the L1 norm problem, is a statistical optimization technique similar to the popular least squares technique that attempts to find a function which closely approximates a set of data. In the simple case of a set of (x,y) data, the approximation function is a simple ‘trend line’ in two-dimensional Cartesian coordinates. The method minimizes the sum of absolute errors (SAE) (the sum of the absolute values of the vertical ‘residuals’ between points generated by the function and corresponding points in the data). The least absolute deviations estimate also arises as the maximum likelihood estimate if the errors have a Laplace distribution. 
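To make the contrast with least squares concrete, here is a minimal sketch in Python for the trend-line case. It assumes a brute-force search over pairs of data points, which relies on the known property that some optimal LAD line passes through at least two data points; the function names and the small data set are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def sum_abs_errors(points, slope, intercept):
    """Sum of absolute vertical residuals (SAE) of a line over the data."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points)

def lad_line(points):
    """Exact LAD trend line by brute force over pairs of data points.

    Assumes at least two distinct x values; only practical for small data.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line is not a function of x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        sae = sum_abs_errors(points, slope, intercept)
        if best is None or sae < best[0]:
            best = (sae, slope, intercept)
    return best[1], best[2]

def least_squares_line(points):
    """Ordinary least squares trend line, for comparison."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: y = 2x + 1 with one gross outlier at x = 5.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 40)]
lad_slope, lad_intercept = lad_line(data)
ols_slope, ols_intercept = least_squares_line(data)
```

On this data the LAD line stays on the five inliers, while the least squares slope is pulled strongly toward the outlier, which is the usual reason for preferring LAD when robustness matters.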
Least Absolute Deviations Estimator (LADE) 
This paper provides an entire inference procedure for the autoregressive model under (conditional) heteroscedasticity of unknown form with a finite variance. We first establish the asymptotic normality of the weighted least absolute deviations estimator (LADE) for the model. Second, we develop the random weighting (RW) method to estimate its asymptotic covariance matrix, leading to the implementation of the Wald test. Third, we construct a portmanteau test for model checking, and use the RW method to obtain its critical values. As a special weighted LADE, the feasible adaptive LADE (ALADE) is proposed and proved to have the same efficiency as its infeasible counterpart. The importance of our entire methodology based on the feasible ALADE is illustrated by simulation results and the real data analysis on three U.S. economic data sets. 
Least Absolute Shrinkage and Screening Operator (LASSO) 
Slide 31: ‘Tibshirani (1996): LASSO = Least Absolute Shrinkage and Selection Operator new translation: LASSO = Least Absolute Shrinkage and Screening Operator’ 
Least Absolute Shrinkage and Selection Operator (LASSO) 
The Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients. It has connections to soft-thresholding of wavelet coefficients, forward stagewise regression, and boosting methods. 
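The link to soft-thresholding can be sketched with cyclic coordinate descent. This is a minimal illustration, not a reference implementation: it assumes the penalized form of the problem, minimizing (1/2n)·||y − Xb||² + lam·||b||₁, rather than the bounded form described above (the two are equivalent for matching values of the bound and `lam`), and the names `soft_threshold`, `lasso_cd` and the toy data are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta

# Hypothetical design with orthogonal columns; y = 3*x1 + 0.1*x2.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [3.1, 2.9, -2.9, -3.1]
beta = lasso_cd(X, y, lam=0.5)
```

With `lam=0.5` the large coefficient is shrunk from 3 to about 2.5 and the small one is set exactly to zero, which shows both the shrinkage and the selection behaviour in one run.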
Least Square Projection (LSP) 
The problem of projecting multidimensional data into lower dimensions has been pursued by many researchers due to its potential application to data analysis of various kinds. This paper presents a novel multidimensional projection technique based on least square approximations. The approximations compute the coordinates of a set of projected points based on the coordinates of a reduced number of control points with defined geometry. We name the technique Least Square Projections (LSP). 
Least Squares Deep Q-Network (LS-DQN) 
Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN) have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly attributed to the power of deep neural networks to learn rich domain representations for approximating the value function or policy. Batch reinforcement learning methods with linear representations, on the other hand, are more stable and require less hyperparameter tuning. Yet, substantial feature engineering is necessary to achieve good results. In this work we propose a hybrid approach, the Least Squares Deep Q-Network (LS-DQN), which combines rich feature representations learned by a DRL algorithm with the stability of a linear least squares method. We do this by periodically retraining the last hidden layer of a DRL network with a batch least squares update. Key to our approach is a Bayesian regularization term for the least squares update, which prevents overfitting to the more recent data. We tested LS-DQN on five Atari games and demonstrate significant improvement over vanilla DQN and Double-DQN. We also investigated the reasons for the superior performance of our method. Interestingly, we found that the performance improvement can be attributed to the large batch size used by the LS method when optimizing the last layer. 
Least-Angle Regression (LARS) 
In statistics, least-angle regression (LARS) is a regression algorithm for high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. Suppose we expect a response variable to be determined by a linear combination of a subset of potential covariates. Then the LARS algorithm provides a means of producing an estimate of which variables to include, as well as their coefficients. Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the L1 norm of the parameter vector. The algorithm is similar to forward stepwise regression, but instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one’s correlations with the residual. 
Leave-One-Out  In this paper, we introduce a powerful technique, Leave-One-Out, to the analysis of low-rank matrix completion problems. Using this technique, we develop a general approach for obtaining fine-grained, entrywise bounds on iterative stochastic procedures. We demonstrate the power of this approach in analyzing two of the most important algorithms for matrix completion: the non-convex approach based on Singular Value Projection (SVP), and the convex relaxation approach based on nuclear norm minimization (NNM). In particular, we prove for the first time that the original form of SVP, without re-sampling or sample splitting, converges linearly in the infinity norm. We further apply our leave-one-out approach to an iterative procedure that arises in the analysis of the dual solutions of NNM. Our results show that NNM recovers the true $d$-by-$d$ rank-$r$ matrix with $\mathcal{O}(\mu^2 r^3 d \log d)$ observed entries, which has optimal dependence on the dimension and is independent of the condition number of the matrix. To the best of our knowledge, this is the first sample complexity result for a tractable matrix completion algorithm that satisfies these two properties simultaneously. 
Leave-One-Out Cross Validation (LOOCV) 
Leave-one-out cross-validation (LOOCV) is a particular case of leave-p-out cross-validation with p = 1. loo 
Leave-p-Out Cross Validation (LpOCV) 
As the name suggests, leave-p-out cross-validation (LpO CV) involves using p observations as the validation set and the remaining observations as the training set. This is repeated for every way of splitting the original sample into a validation set of p observations and a training set. LpO cross-validation requires learning and validating $\binom{n}{p}$ times (where n is the number of observations in the original sample). So as soon as n is quite big it becomes impossible to calculate. 
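The enumeration of splits can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `leave_p_out_splits` is hypothetical, and the number of splits equals the number of ways to choose p of the n observations.

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n, p):
    """Yield (training_indices, validation_indices) for every choice of p of n."""
    indices = range(n)
    for validation in combinations(indices, p):
        held_out = set(validation)
        training = [i for i in indices if i not in held_out]
        yield training, list(validation)

# Leave-one-out is the special case p = 1: one split per observation.
loo_splits = list(leave_p_out_splits(3, 1))

# The number of splits grows combinatorially with n.
n_splits_small = comb(5, 2)
n_splits_large = comb(100, 30)
```

For n = 5 and p = 2 there are only 10 splits, but for n = 100 and p = 30 the count already exceeds 10^25, which is why exhaustive leave-p-out is rarely run for p > 1 on anything but tiny samples.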
Lecture Hall Tableaux  We introduce lecture hall tableaux, which are fillings of a skew Young diagram satisfying certain conditions. Lecture hall tableaux generalize both lecture hall partitions and anti-lecture hall compositions, and also contain reverse semistandard Young tableaux as a limit case. We show that the coefficients in the Schur expansion of multivariate little $q$-Jacobi polynomials are generating functions for lecture hall tableaux. Using a Selberg-type integral we show that moments of multivariate little $q$-Jacobi polynomials, which are equal to generating functions for lecture hall tableaux of a Young diagram, have a product formula. We also explore various combinatorial properties of lecture hall tableaux. 
Lemmatization  Lemmatisation (or lemmatization) in linguistics is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language) it can be a hard task to implement a lemmatiser for a new language. In many languages, words appear in several inflected forms. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. The combination of the base form with the part of speech is often called the lexeme of the word. Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications. 
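The difference between a lemmatiser and a stemmer can be illustrated with a toy sketch. Everything here is hypothetical: the tiny hand-written lexicon, the part-of-speech tags, and the crude suffix-stripping rule stand in for the much larger resources and rule sets real tools use.

```python
# Hypothetical toy lexicon: (inflected form, part of speech) -> lemma.
LEXICON = {
    ("walk", "VERB"): "walk",
    ("walks", "VERB"): "walk",
    ("walked", "VERB"): "walk",
    ("walking", "VERB"): "walk",
    ("meeting", "VERB"): "meet",      # "we are meeting tomorrow"
    ("meeting", "NOUN"): "meeting",   # "the meeting was long"
}

def lemmatize(word, pos):
    """Look up the lemma using the word AND its part of speech."""
    word = word.lower()
    return LEXICON.get((word, pos), word)

def naive_stem(word):
    """Strip a common suffix from a single word, with no knowledge of context."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

verb_lemma = lemmatize("meeting", "VERB")
noun_lemma = lemmatize("meeting", "NOUN")
stem = naive_stem("meeting")
```

The lemmatiser returns different results for the verb and noun readings of the same string, while the stemmer, which sees only the word itself, maps both to the same stem.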
Lempel–Ziv–Oberhumer (LZO) 
Lempel–Ziv–Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed. 
Lenstra Lenstra Lovasz (LLL) 
The Lenstra–Lenstra–Lovász (LLL) lattice basis reduction algorithm is a polynomial time lattice reduction algorithm invented by Arjen Lenstra, Hendrik Lenstra and László Lovász in 1982. 
Levenberg-Marquardt Algorithm (LMA) 
In mathematics and computing, the Levenberg-Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. These minimization problems arise especially in least squares curve fitting. The LMA interpolates between the Gauss-Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be a bit slower than the GNA. LMA can also be viewed as Gauss-Newton using a trust region approach. The LMA is a very popular curve-fitting algorithm used in many software applications for solving generic curve-fitting problems. However, as for many fitting algorithms, the LMA finds only a local minimum, which is not necessarily the global minimum. onls 
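The interpolation between Gauss-Newton and gradient descent can be sketched for a two-parameter model. This is a minimal illustration under stated assumptions, not a production fitter: the model y = a·exp(b·x), Marquardt's diagonal scaling of the damping term, the factor-of-ten damping schedule, and all names are choices made here for the example.

```python
import math

def levenberg_marquardt(xs, ys, a, b, lam=1e-2, n_iter=100):
    """Fit y = a * exp(b * x) by damped least squares (illustrative sketch)."""
    def sse(a, b):
        return sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))

    cost = sse(a, b)
    for _ in range(n_iter):
        # Accumulate J^T J and J^T r, where r = y - f and J holds df/da, df/db.
        g11 = g12 = g22 = h1 = h2 = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            r = y - a * e
            ja, jb = e, a * x * e
            g11 += ja * ja
            g12 += ja * jb
            g22 += jb * jb
            h1 += ja * r
            h2 += jb * r
        # Damped normal equations: (J^T J + lam * diag(J^T J)) delta = J^T r.
        a11, a22 = g11 * (1 + lam), g22 * (1 + lam)
        det = a11 * a22 - g12 * g12
        if det == 0:
            break
        da = (a22 * h1 - g12 * h2) / det
        db = (a11 * h2 - g12 * h1) / det
        new_cost = sse(a + da, b + db)
        if new_cost < cost:
            # Step accepted: move and trust the Gauss-Newton direction more.
            a, b, cost = a + da, b + db, new_cost
            lam /= 10
        else:
            # Step rejected: increase damping, i.e. take a shorter, safer step.
            lam *= 10
    return a, b

# Hypothetical noiseless data generated from a = 2, b = 0.5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_fit, b_fit = levenberg_marquardt(xs, ys, a=1.0, b=0.1)
```

A small `lam` makes the update close to a Gauss-Newton step, while a large `lam` shrinks it toward a short scaled gradient step; adapting `lam` after each trial is what gives the method its robustness to poor starting values.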
Levenshtein Distance  In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. The phrase edit distance is often used to refer specifically to Levenshtein distance. It is named after Vladimir Levenshtein, who considered this distance in 1965. It is closely related to pairwise string alignments. 
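The definition translates directly into a dynamic program over prefixes of the two words. The following is a minimal sketch in Python that keeps only two rows of the table; the function name is an illustrative choice.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b), # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

distance = levenshtein("kitten", "sitting")
```

For "kitten" and "sitting" the distance is 3: substitute k with s, substitute e with i, and insert g at the end.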
Lexical Dispersion Plot  A Lexical Dispersion Plot shows the position of words in a given text. The y axis lists the words to be looked at and the x axis shows the position in the text. The highest value on the x axis is therefore the length of the text. qdap 
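The data behind such a plot is just a list of token offsets per word. The sketch below computes those offsets in plain Python; the whitespace tokenisation, the punctuation stripping, and the function name are simplifying assumptions, and drawing the plot itself is left to any plotting tool.

```python
def dispersion_positions(text, words):
    """Return ({word: [token offsets]}, text length in tokens)."""
    tokens = text.lower().split()
    positions = {word.lower(): [] for word in words}
    for offset, token in enumerate(tokens):
        token = token.strip(".,;:!?\"'()")
        if token in positions:
            positions[token].append(offset)  # x coordinate for this word's row
    return positions, len(tokens)

sample = "The cat saw the dog, and the cat ran."
positions, length = dispersion_positions(sample, ["cat", "dog"])
```

Each key becomes one row on the y axis, each offset one mark on the x axis, and `length` gives the upper end of the x axis.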
Lexical Table  
Lexis Surface Map  
LexVec  In this paper, we propose LexVec, a new method for generating distributed word representations that uses low-rank, weighted factorization of the Positive Point-wise Mutual Information matrix via stochastic gradient descent, employing a weighting scheme that assigns heavier penalties for errors on frequent co-occurrences while still accounting for negative co-occurrence. Evaluation on word similarity and analogy tasks shows that LexVec matches and often outperforms state-of-the-art methods on many of these tasks. 
libDirectional  In this paper, we present libDirectional, a MATLAB library for directional statistics and directional estimation. It supports a variety of commonly used distributions on the unit circle, such as the von Mises, wrapped normal, and wrapped Cauchy distributions. Furthermore, various distributions on higher-dimensional manifolds such as the unit hypersphere and the hypertorus are available. Based on these distributions, several recursive filtering algorithms in libDirectional allow estimation on these manifolds. The functionality is implemented in a clear, well-documented, and object-oriented structure that is both easy to use and easy to extend. 
libFM  Factorization machines (FM) are a generic approach that can mimic most factorization models through feature engineering. This way, factorization machines combine the generality of feature engineering with the superiority of factorization models in estimating interactions between categorical variables of large domain. libFM is a software implementation for factorization machines that features stochastic gradient descent (SGD) and alternating least squares (ALS) optimization as well as Bayesian inference using Markov Chain Monte Carlo (MCMC). 
LibLinear  LibLinear is a linear classifier for data with millions of instances and features. It supports • L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR) • L1-regularized classifiers (after version 1.4): L2-loss linear SVM and logistic regression (LR) • L2-regularized support vector regression (after version 1.9): L2-loss linear SVR and L1-loss linear SVR. 
Library for Online Learning (LIBOL) 
LIBOL is an open-source library for large-scale online learning, which consists of a large family of efficient and scalable state-of-the-art online learning algorithms for large-scale online classification tasks. We have offered easy-to-use command-line tools and examples for users and developers, and also have made comprehensive documents available for both beginners and advanced users. LIBOL is not only a machine learning toolbox, but also a comprehensive experimental platform for conducting online learning research. http://…/LIBOL_manual.pdf. http://libol.stevenhoi.org 
LIDIOMS  In this paper, we describe the LIDIOMS data set, a multilingual RDF representation of idioms currently containing five languages: English, German, Italian, Portuguese, and Russian. The data set is intended to support natural language processing applications by providing links between idioms across languages. The underlying data was crawled and integrated from various sources. To ensure the quality of the crawled data, all idioms were evaluated by at least two native speakers. Herein, we present the model devised for structuring the data. We also provide the details of linking LIDIOMS to well-known multilingual data sets such as BabelNet. The resulting data set complies with best practices according to the Linguistic Linked Open Data Community. 
Lifelong Learning (LL) 
This paper proposes a novel lifelong learning (LL) approach to sentiment classification. LL mimics the human continuous learning process, i.e., retaining the knowledge learned from past tasks and using it to help future learning. In this paper, we first discuss LL in general and then LL for sentiment classification in particular. The proposed LL approach adopts a Bayesian optimization framework based on stochastic gradient descent. Our experimental results show that the proposed method outperforms baseline methods significantly, which demonstrates that lifelong learning is a promising research direction. 
Lift  In data mining and association rule learning, lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response. For example, suppose a population has an average response rate of 5%, but a certain model (or rule) has identified a segment with a response rate of 20%. Then that segment would have a lift of 4.0 (20%/5%). Typically, the modeller seeks to divide the population into quantiles, and rank the quantiles by lift. Organizations can then consider each quantile, and by weighing the predicted response rate (and associated financial benefit) against the cost, they can decide whether to market to that quantile or not. Lift is analogous to information retrieval’s average precision metric, if one treats the precision (fraction of the positives that are true positives) as the target response probability. The lift curve can also be considered a variation on the receiver operating characteristic (ROC) curve, and is also known in econometrics as the Lorenz or power curve. The difference between the lifts observed on two different subgroups is called the uplift. The subtraction of two lift curves forms the uplift curve, which is a metric used in uplift modelling. It is important to note that in general marketing practice the term Lift is also defined as the difference in response rate between the treatment and control groups, indicating the causal impact of a marketing program (versus not having it as in the control group). As a result, ‘no lift’ often means there is no statistically significant effect of the program. 
On top of this, uplift modelling is a predictive modeling technique to improve (up) lift over control. lift 
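The ratio described in the Lift entry can be sketched in a few lines. This is only an illustration: the function name `lift` and the variable `example_lift` are made up here, and the numbers are the 20% and 5% response rates from the example in the text.

```python
def lift(segment_response_rate: float, average_response_rate: float) -> float:
    """Lift = response rate inside the targeted segment / average response rate."""
    if average_response_rate <= 0:
        raise ValueError("average response rate must be positive")
    return segment_response_rate / average_response_rate


# The example from the text: a segment responding at 20% in a population
# that responds at 5% on average.
example_lift = lift(0.20, 0.05)
print(f"lift = {example_lift:.1f}")
```

A lift above 1 means the segment responds better than the population as a whole; a lift of exactly 1 means the model does no better than random targeting.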
Lift Chart  The lift chart provides a visual summary of the usefulness of the information provided by one or more statistical models for predicting a binomial (categorical) outcome variable (dependent variable); for multinomial (multiple-category) outcome variables, lift charts can be computed for each category. Specifically, the chart summarizes the utility that we may expect by using the respective predictive models, as compared to using baseline information only. The lift chart is applicable to most statistical methods that compute predictions (predicted classifications) for binomial or multinomial responses. Let us start with an example. A marketing agency is planning to send advertisements to selected households with the goal of boosting sales of a product. The agency has a list of all households, where each household is described by a set of attributes. Each advertisement sent costs a few pennies, but it pays off well if the customer buys the product. The agency therefore wants to minimize the number of advertisements sent while maximizing the number of products sold, by reaching only the consumers who will actually buy the product. To that end it develops a classifier that predicts the probability that a household is a potential customer. The lift chart can be used to fit this classifier and to express the dependency between the costs and the expected benefit. The number of all potential customers P is often unknown, so the TP rate cannot be computed and the ROC curve cannot be used, but the lift chart is still useful in such settings. The TP is also often hard to measure in practice; one might have just a few measurements from a sales analysis. Even in such cases the lift chart can help the agency select the number of most promising households to which an advertisement should be sent. Of course, lift charts are also useful for many other similar problems. 
http://…/vuk.pdf A lift chart, sometimes called a cumulative gains chart, or a banana chart, is a measure of model performance. It shows how responses, (i.e., to a direct mail solicitation, or a surgical treatment for instance) are changed by applying the model. This change ratio, which is hopefully, the increase in response rate, is called the ‘lift’. A lift chart indicates which subset of the dataset contains the greatest possible proportion of positive responses. The higher the lift curve is from the baseline, the better the performance of the model since the baseline represents the null model, which is no model at all. To explain a lift chart, suppose we had a twoclass prediction where the outcomes were yes (a positive response) or no (a negative response). To create a lift chart, instances in the dataset are sorted in descending probability order according to the predicted probability of a positive response. When the data is plotted, we can see a graphical depiction of the various probabilities. While the example shown in Figure 10 plots the results of different datasets for a single model, a lift chart can also be used to plot the results of a single dataset for different models. Note that the best model is not the one with the highest lift when it is being built. It is the model that performs the best on unseen, future data. http://…/dm_c_ov.pdf http://…/lift_chart.html gains 
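The construction described above (sort cases by predicted probability, then see how many positive responses each top slice captures) can be sketched as follows. The function name `cumulative_lift`, the bin count and the toy scores and labels are illustrative assumptions, not data from the cited sources.

```python
def cumulative_lift(scores, labels, n_bins=10):
    """Return (fraction contacted, fraction of positives captured, lift) per bin.

    scores: predicted probabilities of a positive response
    labels: 1 for an actual positive response, 0 otherwise
    """
    n = len(scores)
    if n != len(labels) or n < n_bins:
        raise ValueError("need equally long inputs with at least n_bins cases")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("lift is undefined without any positive response")

    # Sort case indices by descending predicted probability.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)

    rows = []
    for b in range(1, n_bins + 1):
        cutoff = round(b * n / n_bins)           # size of the top slice
        captured = sum(labels[i] for i in order[:cutoff])
        contacted = cutoff / n                   # x-axis of the chart
        gain = captured / total_positives        # cumulative gains curve
        rows.append((contacted, gain, gain / contacted))
    return rows


# Toy data: ten households, three of which actually respond.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_rows = cumulative_lift(toy_scores, toy_labels, n_bins=5)
for contacted, gain, lift_value in toy_rows:
    print(f"top {contacted:.0%}: gain {gain:.2f}, lift {lift_value:.2f}")
```

The last row always has a lift of 1, because contacting everyone is the same as using no model; this is the baseline the chart is read against.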
Lifted Neural Network  We describe a novel family of models of multi-layer feedforward neural networks in which the activation functions are encoded via penalties in the training problem. Our approach is based on representing a non-decreasing activation function as the argmin of an appropriate convex optimization problem. The new framework allows for algorithms such as block-coordinate descent methods to be applied, in which each step is composed of a simple (no hidden layer) supervised learning problem that is parallelizable across data points and/or layers. Experiments indicate that the proposed models provide excellent initial guesses for weights for standard neural networks. In addition, the model provides avenues for interesting extensions, such as robustness against noisy inputs and optimizing over parameters in activation functions. 
Lifting  The great advances of learningbased approaches in image processing and computer vision are largely based on deeply nested networks that compose linear transfer functions with suitable nonlinearities. Interestingly, the most frequently used nonlinearities in imaging applications (variants of the rectified linear unit) are uncommon in low dimensional approximation problems. In this paper we propose a novel nonlinear transfer function, called lifting, which is motivated from a related technique in convex optimization. A lifting layer increases the dimensionality of the input, naturally yields a linear spline when combined with a fully connected layer, and therefore closes the gap between low and high dimensional approximation problems. Moreover, applying the lifting operation to the loss layer of the network allows us to handle nonconvex and flat (zerogradient) cost functions. We analyze the proposed lifting theoretically, exemplify interesting properties in synthetic experiments and demonstrate its effectiveness in deep learning approaches to image classification and denoising. 
Light Recurrent Neural Networks (LightRNN) 
Recurrent neural networks (RNNs) have achieved state-of-the-art performance in many natural language processing tasks, such as language modeling and machine translation. However, when the vocabulary is large, the RNN model becomes very big (e.g., possibly beyond the memory capacity of a GPU device) and its training becomes very inefficient. In this work, we propose a novel technique to tackle this challenge. The key idea is to use a 2-Component (2C) shared embedding for word representations. We allocate every word in the vocabulary into a table, each row of which is associated with a vector, and each column associated with another vector. Depending on its position in the table, a word is jointly represented by two components: a row vector and a column vector. Since the words in the same row share the row vector and the words in the same column share the column vector, we only need $2 \sqrt{V}$ vectors to represent a vocabulary of $V$ unique words, which is far fewer than the $V$ vectors required by existing approaches. Based on the 2-Component shared embedding, we design a new RNN algorithm and evaluate it using the language modeling task on several benchmark datasets. The results show that our algorithm significantly reduces the model size and speeds up the training process, without sacrificing accuracy (it achieves similar, if not better, perplexity compared to state-of-the-art language models). Remarkably, on the One-Billion-Word benchmark dataset, our algorithm achieves comparable perplexity to previous language models, whilst reducing the model size by a factor of 40-100, and speeding up the training process by a factor of 2. We name our proposed algorithm \emph{LightRNN} to reflect its very small model size and very high training speed. 
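The table lookup behind the shared embedding can be sketched as below. This is a minimal illustration of the storage scheme only: the function names are hypothetical, the vectors are random rather than trained, and words are placed in the table in simple row-major order, which is an assumption made here and not necessarily how the paper allocates words.

```python
import math
import random


def build_2c_table(vocab_size: int, dim: int, seed: int = 0):
    """Create row and column vectors for a square table holding vocab_size words."""
    side = math.isqrt(vocab_size - 1) + 1        # ceil(sqrt(V)) for V >= 1
    rng = random.Random(seed)
    row_vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(side)]
    col_vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(side)]
    return side, row_vectors, col_vectors


def word_components(word_id: int, side: int, row_vectors, col_vectors):
    """Return the (row vector, column vector) pair that jointly represents a word."""
    row, col = divmod(word_id, side)             # assumed row-major placement
    return row_vectors[row], col_vectors[col]


side, rows, cols = build_2c_table(vocab_size=10_000, dim=4)
print("vectors stored:", len(rows) + len(cols), "instead of", 10_000)

# Words 1200 and 1234 sit in the same row, so they share the row vector.
shared_row = word_components(1200, side, rows, cols)[0] is word_components(1234, side, rows, cols)[0]
```

For a vocabulary of 10,000 words the table is 100 by 100, so only 200 vectors are stored, which matches the $2 \sqrt{V}$ count given in the abstract.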
LightFM Model (LightFM) 
I present a hybrid matrix factorisation model representing users and items as linear combinations of their content features’ latent factors. The model outperforms both collaborative and content-based models in cold-start or sparse interaction data scenarios (using both user and item metadata), and performs at least as well as a pure collaborative matrix factorisation model where interaction data is abundant. Additionally, feature embeddings produced by the model encode semantic information in a way reminiscent of word embedding approaches, making them useful for a range of related tasks such as tag recommendations. 
Lightweight Convolutional Neural Network (LCNN) 
Edge computing efficiently extends the realm of information technology beyond the boundary defined by the cloud computing paradigm. By performing computation near the source and destination, edge computing promises to address the challenges in many delay-sensitive applications, such as real-time surveillance. Leveraging the ubiquitously connected cameras and smart mobile devices, it enables video analytics at the edge. However, traditional human object detection and tracking approaches are still computationally too expensive for edge devices. Aiming at intelligent surveillance as an edge network service, this work explores the feasibility of two popular human object detection schemes, Haar Cascade and SVM, at the edge. Given the constraints of these algorithms, a lightweight Convolutional Neural Network (LCNN) is proposed using depthwise separable convolution. The proposed LCNN considerably reduces the number of parameters without affecting the quality of the output, which makes it well suited to edge devices. Trained with the Single Shot MultiBox Detector (SSD) to pinpoint the location of each human object, it outputs the coordinates of a bounding box around the object. We implemented and tested LCNN on an edge device, a Raspberry Pi 3. An intensive experimental comparison study validated that the proposed LCNN is a feasible design for real-time human object detection as an edge service. 
Lightweight Probabilistic Deep Network  Even though probabilistic treatments of neural networks have a long history, they have not found widespread use in practice. Sampling approaches are often too slow already for simple networks. The size of the inputs and the depth of typical CNN architectures in computer vision only compound this problem. Uncertainty in neural networks has thus been largely ignored in practice, despite the fact that it may provide important information about the reliability of predictions and the inner workings of the network. In this paper, we introduce two lightweight approaches to making supervised learning with probabilistic deep networks practical: First, we suggest probabilistic output layers for classification and regression that require only minimal changes to existing networks. Second, we employ assumed density filtering and show that activation uncertainties can be propagated in a practical fashion through the entire network, again with minor changes. Both probabilistic networks retain the predictive power of the deterministic counterpart, but yield uncertainties that correlate well with the empirical error induced by their predictions. Moreover, the robustness to adversarial examples is significantly increased. 
Lightweight Pyramid of Networks (LPNet) 
Existing deep convolutional neural networks have found major success in image deraining, but at the expense of an enormous number of parameters. This limits their potential application, for example in mobile devices. In this paper, we propose a lightweight pyramid of networks (LPNet) for single image deraining. Instead of designing complex network structures, we use domain-specific knowledge to simplify the learning process. Specifically, we find that by introducing the mature Gaussian-Laplacian image pyramid decomposition technology to the neural network, the learning problem at each pyramid level is greatly simplified and can be handled by a relatively shallow network with few parameters. We adopt recursive and residual network structures to build the proposed LPNet, which has fewer than 8K parameters while still achieving state-of-the-art performance on rain removal. We also discuss the potential value of LPNet for other low- and high-level vision tasks. 
Likelihood  Likelihood is a funny concept. It’s not a probability, but it is proportional to a probability. The likelihood of a hypothesis (H) given some data (D) is proportional to the probability of obtaining D given that H is true, multiplied by an arbitrary positive constant (K). In other words, L(H|D) = K · P(D|H). Since a likelihood isn’t actually a probability, it doesn’t obey various rules of probability. For example, likelihoods need not sum to 1. A critical difference between probability and likelihood is in the interpretation of what is fixed and what can vary. In the case of a conditional probability, P(D|H), the hypothesis is fixed and the data are free to vary. Likelihood, however, is the opposite. The likelihood of a hypothesis, L(H|D), conditions on the data as if they are fixed while allowing the hypotheses to vary. The distinction is subtle, so I’ll say it again. For conditional probability, the hypothesis is treated as a given and the data are free to vary. For likelihood, the data are a given and the hypotheses vary. 
Likelihood Function  In statistics, a likelihood function (often simply the likelihood) is a function of the parameters of a statistical model. The likelihood of a set of parameter values, theta, given outcomes x, is equal to the probability of those observed outcomes given those parameter values, that is L(theta|x) = P(x|theta). Likelihood functions play a key role in statistical inference, especially methods of estimating a parameter from a set of statistics. In informal contexts, “likelihood” is often used as a synonym for “probability.” But in statistical usage, a distinction is made depending on the roles of the outcome or parameter. Probability is used when describing a function of the outcome given a fixed parameter value. For example, if a coin is flipped 10 times and it is a fair coin, what is the probability of it landing heads-up every time? Likelihood is used when describing a function of a parameter given an outcome. For example, if a coin is flipped 10 times and it has landed heads-up 10 times, what is the likelihood that the coin is fair? 
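The coin example can be made concrete with a binomial model, which is an assumption introduced here for illustration (the entry itself does not name a model). The function name `binomial_likelihood` and the grid of candidate values are likewise illustrative.

```python
from math import comb


def binomial_likelihood(p_heads: float, heads: int, flips: int) -> float:
    """Probability of observing `heads` heads in `flips` flips when P(heads) = p_heads.

    Read as a function of the data with p_heads fixed, this is a probability;
    read as a function of p_heads with the data fixed, it is a likelihood.
    """
    tails = flips - heads
    return comb(flips, heads) * p_heads**heads * (1 - p_heads) ** tails


# Probability question: a fair coin, 10 flips, all heads.
prob_all_heads_if_fair = binomial_likelihood(0.5, heads=10, flips=10)

# Likelihood question: 10 heads were observed; how do candidate coins compare?
grid = [i / 10 for i in range(11)]
likelihoods = [binomial_likelihood(p, heads=10, flips=10) for p in grid]
most_likely_p = max(grid, key=lambda p: binomial_likelihood(p, 10, 10))

print("P(10 heads | fair coin) =", prob_all_heads_if_fair)
print("sum of likelihoods over the grid =", sum(likelihoods))
```

The likelihoods over the grid of candidate values do not add up to 1, which illustrates why a likelihood is not itself a probability distribution over the parameter.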
Likelihood Ratio Similarity (LiRa) 
Recommender system data presents unique challenges to the data mining, machine learning, and algorithms communities. The high missing data rate, in combination with the large scale and high dimensionality that is typical of recommender systems data, requires new tools and methods for efficient data analysis. Here, we address the challenge of evaluating similarity between two users in a recommender system, where for each user only a small set of ratings is available. We present a new similarity score, that we call LiRa, based on a statistical model of user similarity, for largescale, discrete valued data with many missing values. We show that this score, based on a ratio of likelihoods, is more effective at identifying similar users than traditional similarity scores in userbased collaborative filtering, such as the Pearson correlation coefficient. We argue that our approach has significant potential to improve both accuracy and scalability in collaborative filtering. 
Likelihood-Ratio Test (LRT) 
In statistics, a likelihood ratio test is a statistical test used to compare the fit of two models, one of which (the null model) is a special case of the other (the alternative model). The test is based on the likelihood ratio, which expresses how many times more likely the data are under one model than the other. This likelihood ratio, or equivalently its logarithm, can then be used to compute a p-value, or compared to a critical value to decide whether to reject the null model in favour of the alternative model. When the logarithm of the likelihood ratio is used, the statistic is known as a log-likelihood ratio statistic, and the probability distribution of this test statistic, assuming that the null model is true, can be approximated using Wilks’s theorem. In the case of distinguishing between two models, each of which has no unknown parameters, use of the likelihood ratio test can be justified by the Neyman-Pearson lemma, which demonstrates that such a test has the highest power among all competitors. tsc 
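A small worked case may help. The sketch below assumes a coin-flip (binomial) setting chosen for illustration: the null model fixes the probability of heads at 0.5, and the alternative leaves it free, so the two models differ by one parameter and the chi-square approximation from Wilks’s theorem has one degree of freedom. The function names are hypothetical, and the p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)).

```python
import math


def log_likelihood(p: float, heads: int, flips: int) -> float:
    """Binomial log-likelihood without the binomial coefficient (it cancels in the ratio)."""
    tails = flips - heads
    value = 0.0
    if heads:
        value += heads * math.log(p)
    if tails:
        value += tails * math.log(1 - p)
    return value


def likelihood_ratio_test(heads: int, flips: int, p_null: float = 0.5):
    """Return the log-likelihood ratio statistic and its approximate p-value."""
    p_hat = heads / flips                        # maximum likelihood estimate
    statistic = 2 * (log_likelihood(p_hat, heads, flips)
                     - log_likelihood(p_null, heads, flips))
    statistic = max(statistic, 0.0)              # guard against rounding below zero
    p_value = math.erfc(math.sqrt(statistic / 2))  # chi-square tail, 1 degree of freedom
    return statistic, p_value


lrt_statistic, lrt_p_value = likelihood_ratio_test(heads=60, flips=100)
print(f"statistic = {lrt_statistic:.3f}, p-value = {lrt_p_value:.4f}")
```

The approximation is asymptotic, so for small samples an exact test may be preferable.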
Likert Scale  A Likert scale is a psychometric scale commonly involved in research that employs questionnaires. It is the most widely used approach to scaling responses in survey research, such that the term is often used interchangeably with rating scale, or more accurately the Likerttype scale, even though the two are not synonymous. The scale is named after its inventor, psychologist Rensis Likert. Likert distinguished between a scale proper, which emerges from collective responses to a set of items (usually eight or more), and the format in which responses are scored along a range. Technically speaking, a Likert scale refers only to the former. likert,Scale 
LIMESUP  Supervised Machine Learning (SML) algorithms such as Gradient Boosting, Random Forest, and Neural Networks have become popular in recent years due to their increased predictive performance over traditional statistical methods. This is especially true with large data sets (millions or more observations and hundreds to thousands of predictors). However, the complexity of the SML models makes them opaque and hard to interpret without additional tools. There has been a lot of interest recently in developing global and local diagnostics for interpreting and explaining SML models. In this paper, we propose locally interpretable models and effects based on supervised partitioning (trees) referred to as LIMESUP. This is in contrast with the KLIME approach that is based on clustering the predictor space. We describe LIMESUP based on fitting trees to the fitted response (LIMSUPR) as well as the derivatives of the fitted response (LIMESUPD). We compare the results with KLIME and describe its advantages using simulation and real data. 
Limit Deterministic Buchi Automaton (LDBA) 
LogicallyCorrect Reinforcement Learning 
Limited Memory Steepest Descent Method (LMSD) 
The possibilities inherent in steepest descent methods have been considerably amplified by the introduction of the Barzilai-Borwein choice of step size, and other related ideas. These methods have proved to be competitive with conjugate gradient methods for large-dimension unconstrained minimization problems. This paper suggests a method which is able to take advantage of the availability of a few additional ‘long’ vectors of storage to achieve a significant improvement in performance, both for quadratic and non-quadratic objective functions. It makes use of certain Ritz values related to the Lanczos process (Lanczos in J Res Nat Bur Stand 45:255-282, 1950). Some underlying theory is provided, and numerical evidence is set out showing that the new method provides a competitive and simpler alternative to the state-of-the-art L-BFGS limited memory method. 
Limited-memory BFGS (L-BFGS) 
Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning. 
Lindy Effect  The Lindy effect is a theory of the life expectancy of nonperishable things that posits for a certain class of nonperishables, like a technology or an idea, every additional day may imply a longer (remaining) life expectancy: the mortality rate decreases with time. This contrasts with living creatures and mechanical things, which instead follow a bathtub curve, where every additional day in its life translates into a shorter additional life expectancy (though longer overall life expectancy, due to surviving this far): after childhood, the mortality rate increases with time. 
Line Map  linemap 
Linear Additive Markov Process (LAMP) 
We introduce LAMP: the Linear Additive Markov Process. Transitions in LAMP may be influenced by states visited in the distant history of the process, but unlike higher-order Markov processes, LAMP retains an efficient parametrization. LAMP also allows the specific dependence on history to be learned efficiently from data. We characterize some theoretical properties of LAMP, including its steady state and mixing time. We then give an algorithm based on alternating minimization to learn LAMP models from data. Finally, we perform a series of real-world experiments to show that LAMP is more powerful than first-order Markov processes, and even holds its own against deep sequential models (LSTMs) with a negligible increase in parameter complexity. 
Linear Algebra Package (LAPACK) 
LAPACK (Linear Algebra Package) is a software library for numerical linear algebra. It provides routines for solving systems of linear equations and linear least squares, eigenvalue problems, and singular value decomposition. It also includes routines to implement the associated matrix factorizations such as LU, QR, Cholesky and Schur decomposition. 
Linear Analog SelfAssessment (LASA) 
ordinalCont 
Linear Centralization Classifier (LCC) 
A classification algorithm, called the Linear Centralization Classifier (LCC), is introduced. The algorithm seeks to find a transformation that best maps instances from the feature space to a space where they concentrate towards the center of their own classes, while maximizing the distance between class centers. We formulate the classifier as a quadratic program with quadratic constraints. We then simplify this formulation to a linear program that can be solved effectively using a linear programming solver (e.g., simplex-dual). We extend the formulation for LCC to enable the use of kernel functions for non-linear classification applications. We compare our method with two standard classification methods (support vector machine and linear discriminant analysis) and four state-of-the-art classification methods when they are applied to eight standard classification datasets. Our experimental results show that LCC is able to classify instances more accurately (based on the area under the receiver operating characteristic) in comparison to other tested methods on the chosen datasets. We also report the results for LCC with a particular kernel to solve for synthetic non-linear classification problems. 
Linear Congruential Generator (LCG) 
A linear congruential generator (LCG) is an algorithm that yields a sequence of pseudorandomized numbers calculated with a discontinuous piecewise linear equation. The method represents one of the oldest and best-known pseudorandom number generator algorithms. The theory behind LCGs is relatively easy to understand, and they are easily implemented and fast, especially on computer hardware which can provide modulo arithmetic by storage-bit truncation. 
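The recurrence behind an LCG is state = (a * state + c) mod m. The sketch below is a minimal illustration with a modulus of 2^32, so that the modulo can be done by masking off the upper bits, mirroring the bit truncation mentioned above. The class name is hypothetical; the default multiplier and increment are the widely quoted values from Numerical Recipes, used here only as an example parameter set. An LCG like this is not suitable for cryptographic use.

```python
class LinearCongruentialGenerator:
    """state = (a * state + c) mod 2**bits, with the modulo done by bit masking."""

    def __init__(self, seed: int, a: int = 1664525, c: int = 1013904223, bits: int = 32):
        self.a = a
        self.c = c
        self.mask = (1 << bits) - 1              # keeps only the lowest `bits` bits
        self.state = seed & self.mask

    def next_int(self) -> int:
        """Advance the generator and return the new state as an integer."""
        self.state = (self.a * self.state + self.c) & self.mask
        return self.state

    def next_float(self) -> float:
        """Return a float in [0, 1) derived from the next integer."""
        return self.next_int() / (self.mask + 1)


generator = LinearCongruentialGenerator(seed=0)
first_values = [generator.next_int() for _ in range(3)]
print(first_values)
```

Two generators created with the same seed and parameters produce the same sequence, which is what makes the output pseudorandom rather than random.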
Linear Dimension Reduction  Methods: 1. Principal component analysis (PCA) 2. Canonical correlation analysis (CCA) 3. Linear discriminant analysis (LDA) 4. Nonnegative matrix factorization (NMF) 5. Independent component analysis (ICA) LDRTools 
Linear Discriminant Analysis (LDA) 
Linear discriminant analysis (LDA) and the related Fisher’s linear discriminant are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification. Fisher’s Linear Discriminant Analysis 
Linear Discriminant Generative Adversarial Networks (LDGAN) 
We develop a novel method for training of GANs for unsupervised and class conditional generation of images, called Linear Discriminant GAN (LDGAN). The discriminator of an LDGAN is trained to maximize the linear separability between distributions of hidden representations of generated and targeted samples, while the generator is updated based on the decision hyperplanes computed by performing LDA over the hidden representations. LDGAN provides a concrete metric of separation capacity for the discriminator, and we experimentally show that it is possible to stabilize the training of LDGAN simply by calibrating the update frequencies between generators and discriminators in the unsupervised case, without employment of normalization methods and constraints on weights. In the class conditional generation tasks, the proposed method shows improved training stability together with better generalization performance compared to WGAN that employs an auxiliary classifier. 
Linear Mixed Effects Model  CLME,lmenssp 
Linear Mixed Model (LMM) 
A statistical model containing both fixed effects and random effects, that is: mixed effects. LMM is a kind of regression analysis. 
Linear Programming (LP) 
Linear programming (LP; also called linear optimization) is a method to achieve the best outcome (such as maximum profit or lowest cost) in a mathematical model whose requirements are represented by linear relationships. Linear programming is a special case of mathematical programming (mathematical optimization). More formally, linear programming is a technique for the optimization of a linear objective function, subject to linear equality and linear inequality constraints. Its feasible region is a convex polyhedron, which is a set defined as the intersection of finitely many half spaces, each of which is defined by a linear inequality. Its objective function is a realvalued affine function defined on this polyhedron. A linear programming algorithm finds a point in the polyhedron where this function has the smallest (or largest) value if such a point exists. 
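For a tiny two-variable problem the geometry described above can be shown directly: when the feasible polygon is bounded and non-empty, an optimum is attained at one of its corners, so it is enough to intersect every pair of constraint lines and keep the best feasible intersection. The sketch below does exactly that. It is a teaching illustration only, not how practical LP solvers work, and it gives wrong answers for unbounded problems; the function name and the example problem are made up here.

```python
from itertools import combinations


def solve_2d_lp(objective, constraints, tol=1e-9):
    """Maximize objective[0]*x + objective[1]*y subject to a*x + b*y <= r.

    constraints: list of (a, b, r) tuples; sign constraints such as x >= 0
    are written as (-1, 0, 0). Assumes a bounded, non-empty feasible region.
    """
    best_point, best_value = None, None
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - b1 * a2
        if abs(det) < tol:
            continue                             # parallel lines: no unique corner
        # Cramer's rule for the intersection of the two boundary lines.
        x = (r1 * b2 - b1 * r2) / det
        y = (a1 * r2 - r1 * a2) / det
        # Keep the corner only if it satisfies every constraint.
        if all(a * x + b * y <= r + tol for a, b, r in constraints):
            value = objective[0] * x + objective[1] * y
            if best_value is None or value > best_value:
                best_point, best_value = (x, y), value
    return best_point, best_value


# Hypothetical problem: maximize 3x + 2y with
# x + y <= 4, x + 3y <= 6, x <= 3, x >= 0, y >= 0.
lp_point, lp_value = solve_2d_lp(
    objective=(3, 2),
    constraints=[(1, 1, 4), (1, 3, 6), (1, 0, 3), (-1, 0, 0), (0, -1, 0)],
)
print("optimal corner:", lp_point, "objective value:", lp_value)
```

Enumerating corners grows combinatorially with the number of constraints, which is why real solvers use the simplex method or interior-point methods instead.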
Linear Quadratic Estimation (LQE) 
Kalman filtering, also known as linear quadratic estimation (LQE), is an algorithm that uses a series of measurements observed over time, containing noise (random variations) and other inaccuracies, and produces estimates of unknown variables that tend to be more precise than those based on a single measurement alone. More formally, the Kalman filter operates recursively on streams of noisy input data to produce a statistically optimal estimate of the underlying system state. The filter is named after Rudolf (Rudy) E. Kálmán, one of the primary developers of its theory. The Kalman filter has numerous applications in technology. A common application is for guidance, navigation and control of vehicles, particularly aircraft and spacecraft. Furthermore, the Kalman filter is a widely applied concept in time series analysis used in fields such as signal processing and econometrics. Kalman filters also are one of the main topics in the field of Robotic motion planning and control, and sometimes included in Trajectory optimization. 
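The recursive predict-and-update cycle can be sketched for the simplest case, a single scalar quantity that is assumed to stay roughly constant between measurements. This one-dimensional model, the function name `kalman_1d` and all parameter values are assumptions chosen for illustration; practical filters usually track a vector state with matrix-valued uncertainties.

```python
def kalman_1d(measurements, process_var, measurement_var,
              initial_estimate=0.0, initial_var=1.0):
    """Scalar Kalman filter for a slowly drifting value.

    process_var: variance added to the state uncertainty at each step
    measurement_var: variance of the noise on each measurement
    Returns the list of estimates and the final estimate variance.
    """
    estimate, variance = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the state is modeled as unchanged, but uncertainty grows.
        variance += process_var
        # Update: blend prediction and measurement according to their uncertainties.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1 - gain)
        estimates.append(estimate)
    return estimates, variance


# Repeated readings of a quantity whose true value is 5.0 (noise-free here for clarity).
kalman_estimates, kalman_variance = kalman_1d([5.0] * 50, process_var=0.0, measurement_var=1.0)
print(f"final estimate = {kalman_estimates[-1]:.3f}, variance = {kalman_variance:.4f}")
```

Each update shrinks the variance of the estimate, which is the sense in which the filtered value is more precise than any single measurement.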
Linear Regression  In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. (This term should be distinguished from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.) In linear regression, data are modeled using linear predictor functions, and unknown model parameters are estimated from the data. Such models are called linear models. Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile of the conditional distribution of y given X, is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis. Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine. Linear regression has many practical uses. Most applications fall into one of the following two broad categories: 
• If the goal is prediction, or forecasting, or reduction, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y. 
• Given a variable y and a number of variables X1, …, Xp that may be related to y, linear regression analysis can be applied to quantify the strength of the relationship between y and the Xj, to assess which Xj may have no relationship with y at all, and to identify which subsets of the Xj contain redundant information about y. 
Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the ‘lack of fit’ in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares loss function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms ‘least squares’ and ‘linear model’ are closely linked, they are not synonymous. 
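For simple linear regression (one explanatory variable), the least squares fit has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below shows this; the function names and the toy data are illustrative assumptions.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predict y for a new x using the fitted line."""
    return intercept + slope * x


# Toy data lying exactly on y = 1 + 2x.
fit_intercept, fit_slope = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
print("intercept:", fit_intercept, "slope:", fit_slope)
print("prediction at x = 10:", predict(fit_intercept, fit_slope, 10))
```

This covers the prediction use case from the first bullet: once the line is fitted, a new value of X alone is enough to produce a predicted y.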
Linear Superiorization (LinSup) 
Linear superiorization (abbreviated: LinSup) considers linear programming (LP) problems wherein the constraints as well as the objective function are linear. It steers the iterates of a feasibility-seeking iterative process toward feasible points that have lower (not necessarily minimal) values of the objective function than points that would have been reached by the same feasibility-seeking iterative process without superiorization. Using a feasibility-seeking iterative process that converges even if the linear feasible set is empty, LinSup generates an iterative sequence that converges to a point that minimizes a proximity function which measures the violation of the linear constraints. In addition, due to LinSup’s repeated objective function reduction steps, such a point will most probably have a reduced objective function value. We present an exploratory experimental result that illustrates the behavior of LinSup on an infeasible LP problem. 
Linear Unified LASSO (LLASSO) 
We propose a rescaled LASSO, obtained by premultiplying the LASSO with a matrix term, namely the linear unified LASSO (LLASSO), for multicollinear situations. Our numerical study has shown that the LLASSO is comparable with other sparse modeling techniques and often outperforms the LASSO and elastic net. Our findings open new perspectives on the continued use of the LASSO for sparse modeling and variable selection. We conclude our study by pointing out that the LLASSO can be solved by the same efficient algorithm used for solving the LASSO, and we suggest following the same construction technique for other penalized estimators. 
Linearized Binary Regression  Probit regression was first proposed by Bliss in 1934 to study mortality rates of insects. Since then, an extensive body of work has analyzed and used probit or related binary regression methods (such as logistic regression) in numerous applications and fields. This paper provides a fresh angle to such wellestablished binary regression methods. Concretely, we demonstrate that linearizing the probit model in combination with linear estimators performs on par with stateoftheart nonlinear regression methods, such as posterior mean or maximum aposteriori estimation, for a broad range of realworld regression problems. We derive exact, closedform, and nonasymptotic expressions for the meansquared error of our linearized estimators, which clearly separates them from nonlinear regression methods that are typically difficult to analyze. We showcase the efficacy of our methods and results for a number of synthetic and realworld datasets, which demonstrates that linearized binary regression finds potential use in a variety of inference, estimation, signal processing, and machine learning applications that deal with binaryvalued observations or measurements. 
LinearTime Clustering Algorithm (Ksets+) 
In this paper, we first propose a new iterative algorithm, called the Ksets+ algorithm for clustering data points in a semimetric space, where the distance measure does not necessarily satisfy the triangular inequality. We show that the Ksets+ algorithm converges in a finite number of iterations and it retains the same performance guarantee as the Ksets algorithm for clustering data points in a metric space. We then extend the applicability of the Ksets+ algorithm from data points in a semimetric space to data points that only have a symmetric similarity measure. Such an extension leads to great reduction of computational complexity. In particular, for an n * n similarity matrix with m nonzero elements in the matrix, the computational complexity of the Ksets+ algorithm is O((Kn + m)I), where I is the number of iterations. The memory complexity to achieve that computational complexity is O(Kn + m). As such, both the computational complexity and the memory complexity are linear in n when the n * n similarity matrix is sparse, i.e., m = O(n). We also conduct various experiments to show the effectiveness of the Ksets+ algorithm by using a synthetic dataset from the stochastic block model and a real network from the WonderNetwork website. 
LinearTime Detection of NonLinear Changes (LIGHT) 
Change detection in multivariate time series has applications in many domains, including health care and network monitoring. A common approach to detect changes is to compare the divergence between the distributions of a reference window and a test window. When the number of dimensions is very large, however, the naïve approach has both quality and efficiency issues: to ensure robustness the window size needs to be large, which not only leads to missed alarms but also increases runtime. To this end, we propose LIGHT, a linear-time algorithm for robustly detecting nonlinear changes in massively high dimensional time series. Importantly, LIGHT provides high flexibility in choosing the window size, allowing the domain expert to fit the level of detail required. To do so, we 1) perform scalable PCA to reduce dimensionality, 2) perform scalable factorization of the joint distribution, and 3) scalably compute divergences between these lower dimensional distributions. Extensive empirical evaluation on both synthetic and real-world data shows that LIGHT outperforms the state of the art with up to 100% improvement in both quality and efficiency. 
Linguistic Descriptions of Complex Phenomena (LDCP) 
Linguistic Descriptions of Complex Phenomena (LDCP) is an architecture and methodology that allows us to model complex phenomena, interpreting input data, and generating automatic text reports customized to the user needs (see <doi:10.1016/j.ins.2016.11.002> and <doi:10.1007/s0050001624305> ). rLDCP 
Link Function  In GLM, the link function provides the relationship between the linear predictor and the mean of the distribution function. There are many commonly used link functions, and their choice can be somewhat arbitrary. It makes sense to try to match the domain of the link function to the range of the distribution function’s mean. 
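As a hedged illustration of the entry above, the sketch below pairs two commonly used link functions with their inverses: the logit link for means in (0, 1) and the log link for positive means. The coefficients `b0` and `b1` are hypothetical values chosen only to show how a linear predictor is mapped back to the scale of the mean.

```python
import math

def logit(mu):
    """Logit link: maps a mean in (0, 1) onto the whole real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse logit: maps a linear predictor back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_link(mu):
    """Log link: maps a positive mean onto the whole real line."""
    return math.log(mu)

def inv_log_link(eta):
    """Inverse log link: maps a linear predictor back to a positive mean."""
    return math.exp(eta)

# Hypothetical coefficients of a linear predictor eta = b0 + b1 * x.
b0, b1 = -1.0, 0.5
for x in (0.0, 2.0, 4.0):
    eta = b0 + b1 * x
    # The same eta yields a probability under the logit link
    # and a positive rate under the log link.
    print(x, inv_logit(eta), inv_log_link(eta))
```

The logit inverse always lands in (0, 1), which matches the range of a Bernoulli mean, while the log inverse is always positive, which matches a Poisson mean.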
Link Prediction  Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future? We formalize this question as the link prediction problem, and develop approaches to link prediction based on measures for analyzing the “proximity” of nodes in a network. Experiments on large co-authorship networks suggest that information about future interactions can be extracted from network topology alone, and that fairly subtle measures for detecting node proximity can outperform more direct measures. 
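A minimal sketch of proximity-based link prediction follows. The entry does not say which proximity measures are used, so the two scores here (common neighbours and Adamic-Adar) are assumptions picked as familiar examples, and the toy edge list is made up.

```python
import math
from itertools import combinations

def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbour sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def common_neighbors(adj, u, v):
    """Direct measure: number of neighbours shared by u and v."""
    return len(adj[u] & adj[v])

def adamic_adar(adj, u, v):
    """Subtler measure: shared neighbours weighted by 1 / log(degree).

    A neighbour shared by two distinct nodes has degree >= 2, so the log is positive.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])

def rank_candidates(adj, score):
    """Rank all currently unlinked pairs by a proximity score, best first."""
    pairs = [(u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]]
    return sorted(pairs, key=lambda p: score(adj, *p), reverse=True)

# Toy snapshot of a network (hypothetical data).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)
print(rank_candidates(adj, common_neighbors))
print(rank_candidates(adj, adamic_adar))
```

Pairs at the top of the ranking are the predicted future links; both scores use network topology only.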
Linked Data Ranking Algorithm (LDRANK) 
The advances of the Linked Open Data (LOD) initiative are giving rise to a more structured Web of data. Indeed, a few datasets act as hubs (e.g., DBpedia) connecting many other datasets. They also made possible new Web services for entity detection inside plain text (e.g., DBpedia Spotlight), thus allowing for new applications that can benefit from a combination of the Web of documents and the Web of data. To ease the emergence of these new applications, we propose a querybiased algorithm (LDRANK) for the ranking of web of data resources with associated textual data. Our algorithm combines link analysis with dimensionality reduction. We use crowdsourcing for building a publicly available and reusable dataset for the evaluation of querybiased ranking of Web of data resources detected in Web pages. We show that, on this dataset, LDRANK outperforms the state of the art. Finally, we use this algorithm for the construction of semantic snippets of which we evaluate the usefulness with a crowdsourcingbased approach. 
Linked Matrix Factorization (LMF) 
In recent years, a number of methods have been developed for the dimension reduction and decomposition of multiple linked highcontent data matrices. Typically these methods assume that just one dimension, rows or columns, is shared among the data sources. This shared dimension may represent common features that are measured for different sample sets (i.e., horizontal integration) or a common set of samples with measurements for different feature sets (i.e., vertical integration). In this article we introduce an approach for simultaneous horizontal and vertical integration, termed Linked Matrix Factorization (LMF), for the more general situation where some matrices share rows (e.g., features) and some share columns (e.g., samples). Our motivating application is a cytotoxicity study with accompanying genomic and molecular chemical attribute data. In this data set, the toxicity matrix (cell lines $\times$ chemicals) shares its sample set with a genotype matrix (cell lines $\times$ SNPs), and shares its feature set with a chemical molecular attribute matrix (chemicals $\times$ attributes). LMF gives a unified lowrank factorization of these three matrices, which allows for the decomposition of systematic variation that is shared among the three matrices and systematic variation that is specific to each matrix. This may be used for efficient dimension reduction, exploratory visualization, and the imputation of missing data even when entire rows or columns are missing from a constituent data matrix. We present theoretical results concerning the uniqueness, identifiability, and minimal parametrization of LMF, and evaluate it with extensive simulation studies. 
Linked Micromaps  Linked Micromaps is a graphing program written in Java. It allows users to view multiple variables interactively and compare statistics across regions (states, counties, registries, hospitals) as well as across time. It supports six types of graph: • bar graphs; • box plots; • raw data tables; • point graphs; • point graphs with arrow; and • point graphs with confidence intervals. 
Links  We present a novel algorithm, called Links, designed to perform online clustering on unit vectors in a highdimensional Euclidean space. The algorithm is appropriate when it is necessary to cluster data efficiently as it streams in, and is to be contrasted with traditional batch clustering algorithms that have access to all data at once. For example, Links has been successfully applied to embedding vectors generated from face images or voice recordings for the purpose of recognizing people, thereby providing realtime identification during video or audio capture. 
Liquid Analytics  Liquid analytics. That’s the part that automatically updates and refines the training sets, rules, inferences, confidence intervals and predictions, every day, as mutating data keeps pouring nonstop in the databases (be it NoSQL or not). While this (most of the time) still ends up being coded in production mode by software engineers or developers, the framework and logical architecture is designed by data scientists. Because of this, data science is to data floods what statistical science is to frozen data. 
LISAL  Most environmental phenomena, such as wind profiles, ozone concentration and sunlight distribution under a forest canopy, exhibit nonstationary dynamics, i.e., the variation of the phenomenon changes depending on the location and time of occurrence. Nonstationary dynamics pose both theoretical and practical challenges to statistical machine learning algorithms aiming to accurately capture the complexities governing the evolution of such processes. In this paper, we address the sampling aspects of the problem of learning nonstationary spatiotemporal models, and propose an efficient yet simple algorithm, LISAL. The core idea in LISAL is to learn two models using Gaussian processes (GPs), wherein the first is a nonstationary GP directly modeling the phenomenon. The second model uses a stationary GP representing a latent space corresponding to changes in dynamics, or the nonstationarity characteristics of the first model. LISAL involves adaptively sampling the latent space dynamics using information-theoretic quantities to reduce the computational cost during the learning phase. The relevance of LISAL is extensively validated using multiple real-world datasets. Efficiently Learning Nonstationary Gaussian Processes 
ListOps  Latent tree learning models learn to parse a sentence without syntactic supervision, and use that parse to build the sentence representation. Existing work on such models has shown that, while they perform well on tasks like sentence classification, they do not learn grammars that conform to any plausible semantic or syntactic formalism (Williams et al., 2018a). Studying the parsing ability of such models in natural language can be challenging due to the inherent complexities of natural language, like having several valid parses for a single sentence. In this paper we introduce ListOps, a toy dataset created to study the parsing ability of latent tree models. ListOps sequences are in the style of prefix arithmetic. The dataset is designed to have a single correct parsing strategy that a system needs to learn to succeed at the task. We show that the current leading latent tree models are unable to learn to parse and succeed at ListOps. These models achieve accuracies worse than purely sequential RNNs. 
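To make "prefix arithmetic" concrete, here is a small sketch of an evaluator for bracketed prefix sequences. The token format (`[MAX ... ]`) and the operator set (`MAX`, `MIN`, and `SM` for sum modulo 10) are assumptions for illustration and are not specified in the entry above.

```python
# Assumed operator set; each operator takes a list of single-digit integers.
OPS = {
    "MAX": max,
    "MIN": min,
    "SM": lambda xs: sum(xs) % 10,  # sum modulo 10
}

def evaluate(tokens):
    """Evaluate a bracketed prefix expression with an explicit stack."""
    stack = []
    for tok in tokens:
        if tok.startswith("["):
            stack.append(tok[1:])        # push the operator name as a marker
        elif tok == "]":
            args = []
            while not isinstance(stack[-1], str):
                args.append(stack.pop())  # collect operands back to the marker
            op = stack.pop()
            stack.append(OPS[op](args))   # replace the subtree by its value
        else:
            stack.append(int(tok))
    if len(stack) != 1:
        raise ValueError("unbalanced expression")
    return stack[0]

print(evaluate("[MAX 2 9 [MIN 4 7 ] 0 ]".split()))
print(evaluate("[SM 5 [MAX 1 8 ] 9 ]".split()))
```

The nesting of brackets is the single correct parse: a model that groups tokens differently cannot compute the right value, which is what makes such sequences a test of parsing ability.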
LISTwise ExplaiNer (LISTEN) 
There is an increasing demand for algorithms to explain their outcomes. So far, there is no method that explains the rankings produced by a ranking algorithm. To address this gap we propose LISTEN, a LISTwise ExplaiNer, to explain rankings produced by a ranking algorithm. To efficiently use LISTEN in production, we train a neural network to learn the underlying explanation space created by LISTEN; we call this model QLISTEN. We show that LISTEN produces faithful explanations and that QLISTEN is able to learn these explanations. Moreover, we show that LISTEN is safe to use in a real world environment: users of a news recommendation system do not behave significantly differently when they are exposed to explanations generated by LISTEN instead of manually generated explanations. 
Literate Programming  Literate programming is an approach to programming introduced by Donald Knuth in which a program is given as an explanation of the program logic in a natural language, such as English, interspersed with snippets of macros and traditional source code, from which a compilable source code can be generated. 
littler (“little R”)  littler provides the r program, a simplified commandline interface for GNU R. This allows direct execution of commands, use in piping where the output of one program supplies the input of the next, as well as adding the ability for writing hashbang scripts, i.e. creating executable files starting with, say, #!/usr/bin/r. GNU R, a language and environment for statistical computing and graphics, provides a wonderful system for ‘programming with data’ as well as interactive exploratory analysis, often involving graphs. Sometimes, however, simple scripts are desired. While R can be used in batch mode, and while socalled here documents can be crafted, a longstanding need for a scripting frontend has often been expressed by the R Community. littler (pronounced little R and written r) aims to fill this need. 
LjungBox Test  The LjungBox test (named for Greta M. Ljung and George E. P. Box) is a type of statistical test of whether any of a group of autocorrelations of a time series are different from zero. Instead of testing randomness at each distinct lag, it tests the ‘overall’ randomness based on a number of lags, and is therefore a portmanteau test. This test is sometimes known as the LjungBox Q test, and it is closely connected to the BoxPierce test (which is named after George E. P. Box and David A. Pierce). In fact, the LjungBox test statistic was described explicitly in the paper that led to the use of the BoxPierce statistic, and from which that statistic takes its name. The BoxPierce test statistic is a simplified version of the LjungBox statistic for which subsequent simulation studies have shown poor performance. The LjungBox test is widely applied in econometrics and other applications of time series analysis. http://…/ljungboxtest 
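The following sketch computes the Ljung-Box statistic in its usual form, Q = n(n+2) Σ_{k=1..h} ρ̂_k² / (n − k), where ρ̂_k is the sample autocorrelation at lag k and h is the number of lags tested. The function names and the toy series are illustrative; under the null hypothesis of no autocorrelation, Q for a raw series is compared against a chi-squared distribution with h degrees of freedom.

```python
def autocorrelation(x, k):
    """Sample autocorrelation of the series x at lag k."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    return num / denom

def ljung_box_q(x, h):
    """Ljung-Box Q statistic over lags 1..h (a single 'overall' test)."""
    n = len(x)
    return n * (n + 2) * sum(
        autocorrelation(x, k) ** 2 / (n - k) for k in range(1, h + 1)
    )

# A strongly alternating series (hypothetical data) has large lag-1 autocorrelation.
series = [1.0, -1.0] * 10
q = ljung_box_q(series, 1)
# 3.841 is the 5% critical value of the chi-squared distribution with 1 degree of freedom.
print(q, q > 3.841)
```

Because all lags are pooled into one statistic, a single comparison replaces h separate per-lag tests.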
LloydMax  In computer science and electrical engineering, Lloyd’s algorithm, also known as Voronoi iteration or relaxation, is an algorithm named after Stuart P. Lloyd for finding evenly spaced sets of points in subsets of Euclidean spaces, and partitions of these subsets into wellshaped and uniformly sized convex cells. Like the closely related kmeans clustering algorithm, it repeatedly finds the centroid of each set in the partition, and then repartitions the input according to which of these centroids is closest. However, Lloyd’s algorithm differs from kmeans clustering in that its input is a continuous geometric region rather than a discrete set of points. Thus, when repartitioning the input, Lloyd’s algorithm uses Voronoi diagrams rather than simply determining the nearest center to each of a finite set of points as the kmeans algorithm does. Although the algorithm may be applied most directly to the Euclidean plane, similar algorithms may also be applied to higherdimensional spaces or to spaces with other nonEuclidean metrics. Lloyd’s algorithm can be used to construct close approximations to centroidal Voronoi tessellations of the input, which can be used for quantization, dithering, and stippling. Other applications of Lloyd’s algorithm include smoothing of triangle meshes in the finite element method. ➚ “Compressive Kmeans” 
LMKLNet  In this paper we propose solving localized multiple kernel learning (LMKL) using LMKLNet, a feedforward deep neural network. In contrast to previous works, as a learning principle we propose parameterizing both the gating function for learning kernel combination weights and the multiclass classifier in LMKL using an attentional network (AN) and a multilayer perceptron (MLP), respectively. In this way we can learn the (nonlinear) decision function in LMKL (approximately) by sequential applications of AN and MLP. Empirically, on benchmark datasets we demonstrate that overall LMKLNet can not only outperform the state-of-the-art MKL solvers in terms of accuracy, but also be trained about two orders of magnitude faster with a much smaller memory footprint for large-scale learning. 
Local Average Treatment Effect (LATE) 

Local Binary Convolution (LBC) 
We propose local binary convolution (LBC), an efficient alternative to convolutional layers in standard convolutional neural networks (CNN). The design principles of LBC are motivated by local binary patterns (LBP). The LBC layer comprises a set of fixed, sparse, predefined binary convolutional filters that are not updated during the training process, a nonlinear activation function, and a set of learnable linear weights. The linear weights combine the activated filter responses to approximate the corresponding activated filter responses of a standard convolutional layer. The LBC layer affords significant parameter savings, 9x to 169x in the number of learnable parameters compared to a standard convolutional layer. Furthermore, the lower model complexity and the sparse and binary nature of the weights also result in up to 9x to 169x savings in model size compared to a standard convolutional layer. We demonstrate both theoretically and experimentally that our local binary convolution layer is a good approximation of a standard convolutional layer. Empirically, CNNs with LBC layers, called local binary convolutional neural networks (LBCNN), reach state-of-the-art performance on a range of visual datasets (MNIST, SVHN, CIFAR-10, and a subset of ImageNet) while enjoying significant computational savings. 
Local Binary Pattern Network (LBPNet) 
Memory and computation efficient deep learning architectures are crucial to continued proliferation of machine learning capabilities to new platforms and systems. Binarization of operations in convolutional neural networks has shown promising results in reducing model size and improving computing efficiency. In this paper, we tackle the problem using a strategy different from the existing literature by proposing local binary pattern networks, or LBPNet, which are able to learn and perform binary operations in an end-to-end fashion. LBPNet uses local binary comparisons and random projection in place of conventional convolution (or approximation of convolution) operations. These operations can be implemented efficiently on different platforms, including direct hardware implementation. We applied LBPNet and its variants on standard benchmarks. The results are promising across benchmarks while providing an important means to improve memory and speed efficiency that is particularly suited for small footprint devices and hardware accelerators. 
Local Error Driven and Associative Biologically Realistic Algorithm (leabra) 
The algorithm Leabra (local error driven and associative biologically realistic algorithm) allows for the construction of artificial neural networks that are biologically realistic and balance supervised and unsupervised learning within a single framework. leabRa 
Local Expansion via Minimum One Norm (LEMON) 
We propose a novel approach for finding overlapping communities called LEMON (Local Expansion via Minimum One Norm). The algorithm finds the community by seeking a sparse vector in the span of the local spectra such that the seeds are in its support. We show that LEMON can achieve the highest detection accuracy among stateoftheart proposals. The running time depends on the size of the community rather than that of the entire graph. The algorithm is easy to implement, and is highly parallelizable. 
Local False Discovery Rate (LFDR) 
➚ “False Discovery Rate” LFDR.MLE 
Local Fisher Discriminant Analysis (LFDA) 
lfda 
LOcal Group Graphical Lasso Estimation (loggle) 
In this paper, we study timevarying graphical models based on data measured over a temporal grid. Such models are motivated by the needs to describe and understand evolving interacting relationships among a set of random variables in many real applications, for instance the study of how stocks interact with each other and how such interactions change over time. We propose a new model, LOcal Group Graphical Lasso Estimation (loggle), under the assumption that the graph topology changes gradually over time. Specifically, loggle uses a novel local grouplasso type penalty to efficiently incorporate information from neighboring time points and to impose structural smoothness of the graphs. We implement an ADMM based algorithm to fit the loggle model. This algorithm utilizes blockwise fast computation and pseudolikelihood approximation to improve computational efficiency. An R package loggle has also been developed. We evaluate the performance of loggle by simulation experiments. We also apply loggle to S&P 500 stock price data and demonstrate that loggle is able to reveal the interacting relationships among stocks and among industrial sectors in a time period that covers the recent global financial crisis. 
Local Interpretable ModelAgnostic Explanations (LIME) 
Machine learning is at the core of many recent advances in science and technology. With computers beating professionals in games like Go, many people have started asking if machines would also make for better drivers or even better doctors. In many applications of machine learning, users are asked to trust a model to help them make decisions. A doctor will certainly not operate on a patient simply because “the model said so.” Even in lowerstakes situations, such as when choosing a movie to watch from Netflix, a certain measure of trust is required before we surrender hours of our time based on a model. Despite the fact that many machine learning models are black boxes, understanding the rationale behind the model’s predictions would certainly help users decide when to trust or not to trust their predictions. An example is shown in Figure 1, in which a model predicts that a certain patient has the flu. The prediction is then explained by an ‘explainer’ that highlights the symptoms that are most important to the model. With this information about the rationale behind the model, the doctor is now empowered to trust the model—or not. 
Local Mahalanobis Distance Learning (LMDL) 
Distance metric learning is a successful way to enhance the performance of the nearest neighbor classifier. In most cases, however, the distribution of data does not obey a regular form and may change in different parts of the feature space. In view of this, this paper proposes a novel local distance metric learning method, namely Local Mahalanobis Distance Learning (LMDL), in order to enhance the performance of the nearest neighbor classifier. LMDL considers the neighborhood influence and learns multiple distance metrics for a reduced set of input samples. The members of the reduced set are called prototypes, which try to preserve local discriminative information as much as possible. The proposed LMDL can be kernelized very easily, which is highly desirable in the case of highly nonlinear data. The quality as well as the efficiency of the proposed method are assessed through a set of different experiments on various datasets, and the obtained results show that LMDL as well as the kernelized version is superior to the other related state-of-the-art methods. 
Local Outlier Factor (LOF) 
In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, HansPeter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours. LOF shares some concepts with DBSCAN and OPTICS such as the concepts of ‘core distance’ and ‘reachability distance’, which are used for local density estimation. Rlof 
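The sketch below works through the quantities behind LOF: the k-distance of each point, the reachability distance, the local reachability density (lrd), and the final ratio of neighbour densities to a point's own density. It is a simplified illustration that assumes exactly k neighbours per point (ties at the k-distance are not expanded) and no duplicate points; the function name and sample data are made up.

```python
import math

def lof_scores(points, k):
    """Return one LOF score per point; values well above 1 suggest outliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_dist = [], []
    for i in range(n):
        # Indices of the k nearest other points, closest first.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        neighbors.append(order)
        k_dist.append(dist[i][order[-1]])  # distance to the k-th neighbour

    def reach_dist(i, j):
        # Reachability distance of i from j: never smaller than j's k-distance.
        return max(k_dist[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]

    # LOF: average density of the neighbours relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Four points on a unit square plus one isolated point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

Points inside the square score close to 1 because their density matches that of their neighbours, while the isolated point scores far higher.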
Local Projections  In this paper, we propose a novel approach for outlier detection, called local projections, which is based on concepts of Local Outlier Factor (LOF) (Breunig et al., 2000) and RobPCA (Hubert et al., 2005). By using aspects of both methods, our algorithm is robust towards noise variables and is capable of performing outlier detection in multi-group situations. We are further not reliant on a specific underlying data distribution. For each observation of a dataset, we identify a local group of dense nearby observations, which we call a core, based on a modification of the k-nearest neighbours algorithm. By projecting the dataset onto the space spanned by those observations, two aspects are revealed. First, we can analyze the distance from an observation to the center of the core within the projection space in order to provide a measure of quality of description of the observation by the projection. Second, we consider the distance of the observation to the projection space in order to assess the suitability of the core for describing the outlyingness of the observation. These novel interpretations lead to a univariate measure of outlyingness based on aggregations over all local projections, which outperforms LOF and RobPCA as well as other popular methods like PCOut (Filzmoser et al., 2008) and subspace-based outlier detection (Kriegel et al., 2009) in our simulation setups. Experiments in the context of real-world applications employing datasets of various dimensionality demonstrate the advantages of local projections. 
Local Regression (LOESS, LOWESS) 
LOESS and LOWESS (locally weighted scatterplot smoothing) are two strongly related nonparametric regression methods that combine multiple regression models in a knearestneighborbased metamodel. “LOESS” is a later generalization of LOWESS; although it is not a true initialism, it may be understood as standing for “LOcal regrESSion”. 
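As a rough sketch of the idea of combining local regression models over nearest neighbours, the function below fits a weighted straight line through the k points nearest to a query value and evaluates it there. The tricube weight function is a common choice but is an assumption here, as are the function names; the robustness iterations used by LOWESS are left out.

```python
def tricube(u):
    """Tricube kernel: weight 1 at the query point, 0 at scaled distance >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_linear_fit(xs, ys, x0, k):
    """Estimate y at x0 from a weighted line through the k nearest points."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    px = [xs[i] for i in nearest]
    py = [ys[i] for i in nearest]
    bandwidth = abs(px[-1] - x0)  # distance to the k-th nearest neighbour
    w = [tricube((x - x0) / bandwidth) if bandwidth > 0 else 0.0 for x in px]
    sw = sum(w)
    if sw == 0.0:
        # Degenerate neighbourhood (all weights zero): use the plain mean.
        return sum(py) / len(py)
    swx = sum(wi * x for wi, x in zip(w, px))
    swy = sum(wi * y for wi, y in zip(w, py))
    swxx = sum(wi * x * x for wi, x in zip(w, px))
    swxy = sum(wi * x * y for wi, x, y in zip(w, px, py))
    det = sw * swxx - swx * swx
    if abs(det) < 1e-12:
        # Not enough spread in x for a slope: use the weighted mean.
        return swy / sw
    slope = (sw * swxy - swx * swy) / det
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x0

# Hypothetical data lying exactly on y = 2x + 1.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
print(local_linear_fit(xs, ys, 4.5, k=5))
```

Evaluating the function over a grid of query values traces out the smoothed curve; a larger k gives a smoother result.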
Local Reparameterization Network (LRNets) 
Recent breakthroughs in computer vision make use of large deep neural networks, utilizing the substantial speedup offered by GPUs. For applications running on limited hardware, however, high-precision real-time processing can still be a challenge. One approach to solving this problem is learning networks with binary or ternary weights, thus removing the need to calculate multiplications and significantly reducing memory size and access. In this work we introduce LRnets (Local reparameterization networks), a new method for training neural networks with discrete weights using stochastic parameters. We show how a simple modification to the local reparameterization trick, previously used to train Gaussian distributed weights, allows us to train discrete weights. We tested our method on MNIST, CIFAR-10 and ImageNet, achieving state-of-the-art results compared to previous binary and ternary models. 
Local Shrunk Discriminant Analysis (LSDA) 
Dimensionality reduction is a crucial step for pattern recognition and data mining tasks to overcome the curse of dimensionality. Principal component analysis (PCA) is a traditional technique for unsupervised dimensionality reduction, which is often employed to seek a projection that best represents the data in a least-squares sense, but if the original data has a nonlinear structure, the performance of PCA drops quickly. A supervised dimensionality reduction algorithm called linear discriminant analysis (LDA) seeks an embedding transformation, which can work well with Gaussian-distributed or single-modal data, but for non-Gaussian or multimodal data it gives undesired results. What is worse, the dimension of LDA cannot be more than the number of classes. In order to solve these issues, local shrunk discriminant analysis (LSDA) is proposed in this work to process non-Gaussian or multimodal data; it not only incorporates both the linear and nonlinear structures of the original data, but also learns the pattern shrinking to make the data more flexible to fit the manifold structure. Further, LSDA has stronger generalization performance: its objective function becomes local LDA or traditional LDA when different extreme parameters are used. What is more, a new efficient optimization algorithm is introduced to solve the nonconvex objective function with low computational cost. Compared with other related approaches, such as PCA, LDA and local LDA, the proposed method can derive a subspace which is more suitable for non-Gaussian distributions and real data. Promising experimental results on different kinds of data sets demonstrate the effectiveness of the proposed approach. 
Locality Sensitive Hashing (LSH) 
Localitysensitive hashing (LSH) is a method of performing probabilistic dimension reduction of highdimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items). This is different from the conventional hash functions, such as those used in cryptography, as in the LSH case the goal is to maximize probability of ‘collision’ of similar items rather than avoid collisions. Note how localitysensitive hashing, in many ways, mirrors data clustering and Nearest neighbor search. http://…/LSH http://…descriptionoflocalitysensitivehashing 
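One way to picture the bucket idea is random-hyperplane hashing, an LSH family for cosine similarity; this family is an assumption chosen for illustration, since the entry does not name one. Each hash bit records on which side of a random hyperplane a vector falls, so vectors pointing in similar directions tend to share a signature. All names and data below are hypothetical.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(p_i * v_i for p_i, v_i in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def build_index(vectors, planes):
    """Group item names into buckets keyed by their signature."""
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(signature(vec, planes), []).append(name)
    return buckets

planes = make_hyperplanes(dim=3, n_bits=8)
vectors = {
    "u": [1.0, 2.0, 3.0],
    "u_scaled": [2.0, 4.0, 6.0],   # same direction as u, so same signature
    "w": [-3.0, 0.5, -1.0],
}
index = build_index(vectors, planes)
# Candidate near neighbours of u are the other members of its bucket.
print(index[signature(vectors["u"], planes)])
```

A lookup only compares the query against items in its own bucket, which is where the saving over an exhaustive nearest neighbor search comes from; using more bits makes buckets smaller and collisions between dissimilar items rarer.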
Localized Information Privacy (LIP) 
In this paper, localized information privacy (LIP) is proposed as a new privacy definition, which allows statistical aggregation while protecting users’ privacy without relying on a trusted third party. The notion of context-awareness is incorporated in LIP by the introduction of priors, which enables the design of privacy-preserving data aggregation with knowledge of priors. We show that LIP relaxes the Localized Differential Privacy (LDP) notion by explicitly modeling the adversary’s knowledge. However, it is stricter than $2\epsilon$-LDP and $\epsilon$-mutual information privacy. The incorporation of local priors allows LIP to achieve higher utility compared to other approaches. We then present an optimization framework for privacy-preserving data aggregation, with the goal of minimizing the expected squared error while satisfying the LIP privacy constraints. Utility-privacy tradeoffs are obtained under several models in closed form. We then validate our analysis by numerical analysis using both synthetic and real-world data. Results show that our LIP mechanism provides better utility-privacy tradeoffs than LDP, and when the prior is not uniformly distributed, the advantage of LIP is even more significant. 
Locally Estimated Scatterplot Smoothing (LOESS) 
LOESS and LOWESS (locally weighted scatterplot smoothing) are two strongly related nonparametric regression methods that combine multiple regression models in a knearestneighborbased metamodel. “LOESS” is a later generalization of LOWESS; although it is not a true initialism, it may be understood as standing for “LOcal regrESSion”. 
Locally Linear Embedding (LLE) 
Locally Linear Embedding (LLE) is an unsupervised learning algorithm that computes low dimensional, neighborhood preserving embeddings of high dimensional data. LLE attempts to discover nonlinear structure in high dimensional data by exploiting the local symmetries of linear reconstructions. Notably, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations – though capable of generating highly nonlinear embeddings – do not involve local minima. http://…/2323.full.pdf 
Locally Smoothed Neural Network (LSNN) 
Convolutional Neural Networks (CNN) and the locally connected layer are limited in capturing the importance and relations of different local receptive fields, which are often crucial for tasks such as face verification, visual question answering, and word sequence prediction. To tackle the issue, we propose a novel locally smoothed neural network (LSNN) in this paper. The main idea is to represent the weight matrix of the locally connected layer as the product of the kernel and the smoother, where the kernel is shared over different local receptive fields, and the smoother determines the importance and relations of different local receptive fields. Specifically, a multivariate Gaussian function is utilized to generate the smoother, for modeling the location relations among different local receptive fields. Furthermore, the content information can also be leveraged by setting the mean and precision of the Gaussian function according to the content. Experiments on some variants of MNIST clearly show our advantages over CNN and the locally connected layer. 
LocateLinkVisualize (LocLinkVis) 
In this paper we present LocLinkVis (Locate-Link-Visualize), a system which supports exploratory information access to a document collection based on geo-referencing and visualization. It uses a gazetteer which contains representations of places ranging from countries to buildings, and which is used to recognize toponyms, disambiguate them into places, and visualize the resulting spatial footprints. 
Location Determination Problem (LDP) 

Log Gaussian Cox Process Network  We generalize the log Gaussian Cox process (LGCP) framework to model multiple correlated point data jointly. The resulting log Gaussian Cox process network (LGCPN) considers the observations as realizations of multiple LGCPs, whose log intensities are given by linear combinations of latent functions drawn from Gaussian process priors. The coefficients of these linear combinations are also drawn from Gaussian processes and can incorporate additional dependencies a priori. We derive closedform expressions for the moments of the intensity functions in our model and use them to develop an efficient variational inference algorithm that is orders of magnitude faster than competing deterministic and stochastic approximations of multivariate LGCP and coregionalization models. Our approach outperforms the state of the art in jointly estimating multiple bovine tuberculosis incidents in Cornwall, UK, and multiple crime type intensities across New York city. 
LogicallyCorrect Reinforcement Learning  We propose a novel Reinforcement Learning (RL) algorithm to synthesize policies for a Markov Decision Process (MDP), such that a linear time property is satisfied. We convert the property into a Limit Deterministic Buchi Automaton (LDBA), then construct a product MDP between the automaton and the original MDP. A reward function is then assigned to the states of the product automaton, according to accepting conditions of the LDBA. With this reward function, RL synthesizes a policy that satisfies the property: as such, the policy synthesis procedure is ‘constrained’ by the given specification. Additionally, we show that the RL procedure sets up an online value iteration method to calculate the maximum probability of satisfying the given property, at any given state of the MDP – a convergence proof for the procedure is provided. Finally, the performance of the algorithm is evaluated via a set of numerical examples. We observe an improvement of one order of magnitude in the number of iterations required for the synthesis compared to existing approaches. 
Logistic Regression  In statistics, logistic regression, or logit regression, is a type of probabilistic statistical classification model. It is used to predict the outcome of a categorical dependent variable (i.e., a class label), most often a binary response, based on one or more predictor variables (features). That is, it is used in estimating empirical values of the parameters in a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. Frequently (and subsequently in this entry) “logistic regression” refers specifically to the problem in which the dependent variable is binary (that is, the number of available categories is two), while problems with more than two categories are referred to as multinomial logistic regression or, if the multiple categories are ordered, as ordered logistic regression. 
logit  The logit function is the inverse of the sigmoidal “logistic” function used in mathematics, especially in statistics. When the function’s parameter represents a probability p, the logit function gives the log-odds, or the logarithm of the odds p/(1-p). 
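As a minimal sketch of the definition above, the following computes the log-odds and its inverse. The function names `logit` and `logistic` are chosen here for illustration and are not taken from any particular library.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p that lies strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def logistic(x: float) -> float:
    """Inverse of logit: maps any real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

print(logit(0.5))            # even odds, so the log-odds are 0.0
print(logistic(logit(0.8)))  # round trip, approximately 0.8
```

Probabilities above 0.5 give positive log-odds and probabilities below 0.5 give negative log-odds.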
LogitBoost Autoregressive Networks  Multivariate binary distributions can be decomposed into products of univariate conditional distributions. Recently popular approaches have modeled these conditionals through neural networks with sophisticated weightsharing structures. It is shown that stateoftheart performance on several standard benchmark datasets can actually be achieved by training separate probability estimators for each dimension. In that case, model training can be trivially parallelized over data dimensions. On the other hand, complexity control has to be performed for each learned conditional distribution. Three possible methods are considered and experimentally compared. The estimator that is employed for each conditional is LogitBoost. Similarities and differences between the proposed approach and autoregressive models based on neural networks are discussed in detail. 
Log-Likelihood  For many applications, the natural logarithm of the likelihood function, called the log-likelihood, is more convenient to work with. Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself, and hence the log-likelihood can be used in place of the likelihood in maximum likelihood estimation and related techniques. Finding the maximum of a function often involves taking the derivative of a function and solving for the parameter being maximized, and this is often easier when the function being maximized is a log-likelihood rather than the original likelihood function. For example, some likelihood functions are for the parameters that explain a collection of statistically independent observations. In such a situation, the likelihood function factors into a product of individual likelihood functions. The logarithm of this product is a sum of individual logarithms, and the derivative of a sum of terms is often easier to compute than the derivative of a product. In addition, several common distributions have likelihood functions that contain products of factors involving exponentiation. The logarithm of such a function is a sum of products, again easier to differentiate than the original function. In phylogenetics the log-likelihood ratio is sometimes termed support and the log-likelihood function support function. However, given the potential for confusion with the mathematical meaning of ‘support’ this terminology is rarely used outside this field. 
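The point about independent observations can be illustrated with a small sketch. It assumes a Bernoulli model and a made-up sample of eight 0/1 outcomes; the grid search stands in for solving the derivative equation analytically.

```python
import math

def bernoulli_log_likelihood(theta, data):
    """Sum of per-observation log-probabilities for independent 0/1 data."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 0]            # hypothetical sample: 5 successes out of 8
grid = [i / 1000 for i in range(1, 1000)]  # candidate parameter values inside (0, 1)

# The log-likelihood and the likelihood peak at the same parameter value.
theta_hat = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))
print(theta_hat, sum(data) / len(data))
```

For this model the maximizer coincides with the sample mean, which is what setting the derivative of the log-likelihood to zero yields.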
LogLinear Model  A loglinear model is a mathematical model that takes the form of a function whose logarithm is a firstdegree polynomial function of the parameters of the model, which makes it possible to apply (possibly multivariate) linear regression. 
Logrank Test  In statistics, the logrank test is a hypothesis test to compare the survival distributions of two samples. It is a nonparametric test and appropriate to use when the data are right skewed and censored (technically, the censoring must be non-informative). It is widely used in clinical trials to establish the efficacy of a new treatment in comparison with a control treatment when the measurement is the time to event (such as the time from initial treatment to a heart attack). The test is sometimes called the Mantel-Cox test, named after Nathan Mantel and David Cox. The logrank test can also be viewed as a time-stratified Cochran-Mantel-Haenszel test. glrt 
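A minimal sketch of the two-sample logrank statistic follows, using the standard observed-minus-expected form with a hypergeometric variance at each event time. The function name, argument layout and example data are assumptions made for illustration; the result is referred to a chi-square distribution with one degree of freedom.

```python
def logrank_statistic(times_a, events_a, times_b, events_b):
    """Chi-square logrank statistic (1 degree of freedom) for two samples.

    times_*  : observed follow-up times
    events_* : 1 if the event occurred at that time, 0 if right-censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)                # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)   # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e == 1)     # events at time t
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0

# Hypothetical data: every event in arm A occurs before any event in arm B.
print(logrank_statistic([1, 2], [1, 1], [3, 4], [1, 1]))
```

Two samples with identical times and censoring patterns give a statistic of zero, since observed and expected event counts agree at every event time.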
LoIDE  Logic-based paradigms are nowadays widely used in many different fields, thanks in part to the availability of robust tools and systems that allow the development of real-world and industrial applications. In this work we present LoIDE, an advanced and modular web editor for logic-based languages that also integrates with state-of-the-art solvers. 
LOKI  Imitation learning (IL) consists of a set of tools that leverage expert demonstrations to quickly learn policies. However, if the expert is suboptimal, IL can yield policies with inferior performance compared to reinforcement learning (RL). In this paper, we aim to provide an algorithm that combines the best aspects of RL and IL. We accomplish this by formulating several popular RL and IL algorithms in a common mirror descent framework, showing that these algorithms can be viewed as a variation on a single approach. We then propose LOKI, a strategy for policy learning that first performs a small but random number of IL iterations before switching to a policy gradient RL method. We show that if the switching time is properly randomized, LOKI can learn to outperform a suboptimal expert and converge faster than running policy gradient from scratch. Finally, we evaluate the performance of LOKI experimentally in several simulated environments. 
Long- and Short-Term Time-Series Network (LSTNet) 
Multivariate time series forecasting is an important machine learning problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situations. Temporal data arising in these real-world applications often involve a mixture of long-term and short-term patterns, for which traditional approaches such as autoregressive models and Gaussian processes may fail. In this paper, we propose a novel deep learning framework, namely Long- and Short-term Time-series Network (LSTNet), to address this open challenge. LSTNet uses a Convolutional Neural Network (CNN) to extract short-term local dependency patterns among variables, and a Recurrent Neural Network (RNN) to discover long-term patterns and trends. In our evaluation on real-world data with complex mixtures of repetitive patterns, LSTNet achieved significant performance improvements over several state-of-the-art baseline methods. 
Long Short Term Memory (LSTM) 
Long short term memory (LSTM) is a recurrent neural network (RNN) architecture (an artificial neural network) published in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. Like most RNNs, an LSTM network is universal in the sense that given enough network units it can compute anything a conventional computer can compute, provided it has the proper weight matrix, which may be viewed as its program. (Of course, finding such a weight matrix is more challenging with some problems than with others.) Unlike traditional RNNs, an LSTM network is wellsuited to learn from experience to classify, process and predict time series when there are very long time lags of unknown size between important events. This is one of the main reasons why LSTM outperforms alternative RNNs and Hidden Markov Models and other sequence learning methods in numerous applications. For example, LSTM achieved the best known results in unsegmented connected handwriting recognition, and in 2009 won the ICDAR handwriting competition. LSTM networks have also been used for automatic speech recognition, and were a major component of a network that recently achieved a record 17.7% phoneme error rate on the classic TIMIT natural speech dataset. 
Longitudinal Study  A longitudinal survey is a correlational research study that involves repeated observations of the same variables over long periods of time — often many decades. It is a type of observational study. Longitudinal studies are often used in psychology to study developmental trends across the life span, and in sociology to study life events throughout lifetimes or generations. The reason for this is that, unlike crosssectional studies, in which different individuals with same characteristics are compared, longitudinal studies track the same people, and therefore the differences observed in those people are less likely to be the result of cultural differences across generations. Because of this benefit, longitudinal studies make observing changes more accurate, and they are applied in various other fields. In medicine, the design is used to uncover predictors of certain diseases. In advertising, the design is used to identify the changes that advertising has produced in the attitudes and behaviors of those within the target audience who have seen the advertising campaign. Because most longitudinal studies are observational, in the sense that they observe the state of the world without manipulating it, it has been argued that they may have less power to detect causal relationships than experiments. But because of the repeated observation at the individual level, they have more power than crosssectional observational studies, by virtue of being able to exclude timeinvariant unobserved individual differences, and by virtue of observing the temporal order of events. Some of the disadvantages of longitudinal study include the fact that they take a lot of time and are very expensive. Therefore, they are not very convenient. Longitudinal studies allow social scientists to distinguish short from longterm phenomena, such as poverty. 
If the poverty rate is 10% at a point in time, this may mean that 10% of the population are always poor, or that the whole population experiences poverty for 10% of the time. It is impossible to conclude which of these possibilities is the case using oneoff crosssectional studies. Types of longitudinal studies include cohort studies and panel studies. Cohort studies sample a cohort, defined as a group experiencing some event (typically birth) in a selected time period, and studying them at intervals through time. Panel studies sample a crosssection, and survey it at (usually regular) intervals. A retrospective study is a longitudinal study that looks back in time. For instance, a researcher may look up the medical records of previous years to look for a trend. 
LongRange Dependency (LRD) 
Longrange dependency (LRD), also called long memory or longrange persistence, is a phenomenon that may arise in the analysis of spatial or time series data. It relates to the rate of decay of statistical dependence, with the implication that this decays more slowly than an exponential decay, typically a powerlike decay. LRD is often related to selfsimilar processes or fields. LRD has been used in various fields such as internet traffic modelling, econometrics, hydrology, linguistics and the earth sciences. Different mathematical definitions of LRD are used for different contexts and purposes. 
LookupBased Convolutional Neural Network (LCNN) 
Porting state of the art deep learning algorithms to resource constrained compute platforms (e.g. VR, AR, wearables) is extremely challenging. We propose a fast, compact, and accurate model for convolutional neural networks that enables efficient learning and inference. We introduce LCNN, a lookupbased convolutional neural network that encodes convolutions by few lookups to a dictionary that is trained to cover the space of weights in CNNs. Training LCNN involves jointly learning a dictionary and a small set of linear combinations. The size of the dictionary naturally traces a spectrum of tradeoffs between efficiency and accuracy. Our experimental results on ImageNet challenge show that LCNN can offer 3.2x speedup while achieving 55.1% top1 accuracy using AlexNet architecture. Our fastest LCNN offers 37.6x speed up over AlexNet while maintaining 44.3% top1 accuracy. LCNN not only offers dramatic speed ups at inference, but it also enables efficient training. In this paper, we show the benefits of LCNN in fewshot learning and fewiteration learning, two crucial aspects of ondevice training of deep learning models. 
Lord’s Paradox  In statistics, Lord’s paradox raises the issue of when it is appropriate to control for baseline status. In three papers, Frederic Lord noted that different results obtain if researchers adjust for preexisting differences. The paradox was resolved by Paul Holland and Donald Rubin using the Rubin causal model. The most famous formulation of Lord’s paradox was in his 1967 paper and was phrased in terms of weight change over freshman year of college in two different dormitories because Lord did not want readers to assume that measurement error was responsible for the paradox. “A large university is interested in investigating the effects on the students of the diet provided in the university dining halls and any sex differences in these effects. Various types of data are gathered. In particular, the weight of each student at the time of his arrival in September and his weight the following June are recorded.” (Lord 1967, p. 304) Resolving the Lord’s Paradox 
Lorenz Curve  In economics, the Lorenz curve is a graphical representation of the cumulative distribution function of the empirical probability distribution of wealth or income, and was developed by Max O. Lorenz in 1905 for representing inequality of the wealth distribution. The curve is a graph showing the proportion of overall income or wealth assumed by the bottom x% of the people, although this is not rigorously true for a finite population (see below). It is often used to represent income distribution, where it shows for the bottom x% of households, what percentage (y%) of the total income they have. The percentage of households is plotted on the xaxis, the percentage of income on the yaxis. It can also be used to show distribution of assets. In such use, many economists consider it to be a measure of social inequality. The concept is useful in describing inequality among the size of individuals in ecology and in studies of biodiversity, where the cumulative proportion of species is plotted against the cumulative proportion of individuals. It is also useful in business modeling: e.g., in consumer finance, to measure the actual percentage y% of delinquencies attributable to the x% of people with worst risk scores. 
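The construction described above can be sketched as follows: sort the incomes, accumulate them, and pair each cumulative population share with the matching cumulative income share. The function name and the four-household sample are hypothetical.

```python
def lorenz_curve(incomes):
    """Return (population share, income share) points of the Lorenz curve."""
    values = sorted(incomes)          # poorest first
    total = sum(values)
    n = len(values)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, v in enumerate(values, start=1):
        running += v
        points.append((i / n, running / total))
    return points

incomes = [40, 10, 30, 20]            # hypothetical household incomes
for x, y in lorenz_curve(incomes):
    print(f"bottom {x:.0%} of households hold {y:.0%} of total income")
```

With perfectly equal incomes every point would lie on the diagonal y = x; the further the curve sags below the diagonal, the more unequal the distribution.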
Loss Distribution Approach (LDA) 
While AMA does not specify the use of any particular modeling technique, one common approach taken in the banking industry is the Loss Distribution Approach (LDA). With LDA, a bank first segments operational losses into homogeneous segments, called units of measure (UoMs). For each unit of measure, the bank then constructs a loss distribution that represents its expectation of total losses that can materialize over a one-year horizon. Given that data sufficiency is a major challenge for the industry, the annual loss distribution cannot be built directly using annual loss figures. Instead, a bank will develop a frequency distribution that describes the number of loss events in a given year, and a severity distribution that describes the loss amount of a single loss event. The frequency and severity distributions are assumed to be independent. The convolution of these two distributions then gives rise to the (annual) loss distribution. 
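The combination of frequency and severity can be sketched with a Monte Carlo simulation. The choice of a Poisson frequency, a lognormal severity, and all parameter values below are assumptions made purely for illustration; the entry itself does not prescribe particular distributions.

```python
import math
import random

def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def simulate_annual_losses(lam, mu, sigma, n_years, seed=0):
    """Simulate total yearly loss as a random number of independent severities."""
    rng = random.Random(seed)
    losses = []
    for _ in range(n_years):
        n_events = sample_poisson(rng, lam)   # frequency draw for this year
        losses.append(sum(rng.lognormvariate(mu, sigma) for _ in range(n_events)))
    return losses

# Hypothetical unit of measure: about 3 events per year, lognormal severities.
losses = simulate_annual_losses(lam=3.0, mu=0.0, sigma=0.5, n_years=20000)
mean_loss = sum(losses) / len(losses)
quantile_999 = sorted(losses)[int(0.999 * len(losses))]
print(mean_loss, quantile_999)
```

The simulated mean should sit close to the product of the expected frequency and the expected severity, while a high quantile of the simulated annual losses is the kind of figure typically read off such a distribution.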
Loss Function  In mathematical optimization, statistics, decision theory and machine learning, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some ‘cost’ associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its negative (sometimes called a reward function or a utility function), in which case it is to be maximized. In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th Century. In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s. In optimal control the loss is the penalty for failing to achieve a desired value. In financial risk management the function is precisely mapped to a monetary loss. 
Loss Rank Mining (LRM) 
Modern object detectors usually suffer from low accuracy issues, as foregrounds always drown in tons of backgrounds and become hard examples during training. Compared with those proposalbased ones, realtime detectors are in far more serious trouble since they renounce the use of regionproposing stage which is used to filter a majority of backgrounds for achieving realtime rates. Though foregrounds as hard examples are in urgent need of being mined from tons of backgrounds, a considerable number of stateoftheart realtime detectors, like YOLO series, have yet to profit from existing hard example mining methods, as using these methods need detectors fit series of prerequisites. In this paper, we propose a general hard example mining method named Loss Rank Mining (LRM) to fill the gap. LRM is a general method for realtime detectors, as it utilizes the final feature map which exists in all realtime detectors to mine hard examples. By using LRM, some elements representing easy examples in final feature map are filtered and detectors are forced to concentrate on hard examples during training. Extensive experiments validate the effectiveness of our method. With our method, the improvements of YOLOv2 detector on autodriving related dataset KITTI and more general dataset PASCAL VOC are over 5% and 2% mAP, respectively. In addition, LRM is the first hard example mining strategy which could fit YOLOv2 perfectly and make it better applied in series of real scenarios where both realtime rates and accurate detection are strongly demanded. 
Lotka’s Law  Lotka’s law, named after Alfred J. Lotka, is one of a variety of special applications of Zipf’s law. It describes the frequency of publication by authors in any given field. It states that the number of authors making n contributions is about 1/n^{a} of those making one contribution, where a nearly always equals two. More plainly, the number of authors publishing a certain number of articles is a fixed ratio to the number of authors publishing a single article. As the number of articles published increases, authors producing that many publications become less frequent. There are 1/4 as many authors publishing two articles within a specified time period as there are singlepublication authors, 1/9 as many publishing three articles, 1/16 as many publishing four articles, etc. Though the law itself covers many disciplines, the actual ratios involved (as a function of ‘a’) are very disciplinespecific. LotkasLaw 
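The ratios quoted above follow directly from the 1/n^a rule. In this small sketch the function name and the figure of 1600 single-publication authors are made up for illustration, and a = 2 is the typical exponent mentioned in the entry.

```python
def lotka_expected_authors(single_paper_authors, n, a=2.0):
    """Expected number of authors with n publications under Lotka's law."""
    return single_paper_authors / n ** a

# Hypothetical field with 1600 authors who published exactly one article.
for n in range(1, 5):
    print(n, lotka_expected_authors(1600, n))
```

With a = 2 the counts for two, three and four articles are one quarter, one ninth and one sixteenth of the single-publication count.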
Lottery Ticket Hypothesis  Recent work on neural network pruning indicates that, at training time, neural networks need to be significantly larger in size than is necessary to represent the eventual functions that they learn. This paper articulates a new hypothesis to explain this phenomenon. This conjecture, which we term the ‘lottery ticket hypothesis,’ proposes that successful training depends on lucky random initialization of a smaller subcomponent of the network. Larger networks have more of these ‘lottery tickets,’ meaning they are more likely to luck out with a subcomponent initialized in a configuration amenable to successful optimization. This paper conducts a series of experiments with XOR and MNIST that support the lottery ticket hypothesis. In particular, we identify these fortuitouslyinitialized subcomponents by pruning lowmagnitude weights from trained networks. We then demonstrate that these subcomponents can be successfully retrained in isolation so long as the subnetworks are given the same initializations as they had at the beginning of the training process. Initialized as such, these small networks reliably converge successfully, often faster than the original network at the same level of accuracy. However, when these subcomponents are randomly reinitialized or rearranged, they perform worse than the original network. In other words, large networks that train successfully contain small subnetworks with initializations conducive to optimization. The lottery ticket hypothesis and its connection to pruning are a step toward developing architectures, initializations, and training strategies that make it possible to solve the same problems with much smaller networks. 
Louvain Method  Our method, which we call the Louvain Method (because, even though the co-authors now hold positions in Paris, London and Louvain, the method was devised when they all were in Louvain), outperforms other methods in terms of computation time, which allows us to analyze networks of unprecedented size (e.g. the analysis of a typical network of 2 million nodes only takes 2 minutes). The Louvain method has also been shown to be very accurate by focusing on ad-hoc networks with known community structure. Moreover, due to its hierarchical structure, which is reminiscent of renormalization methods, it allows one to look at communities at different resolutions. ➚ “Community Detection” 
Louvain Modularity  The Louvain Method for community detection is a method to extract communities from large networks created by Vincent Blondel. The method is a greedy optimization method that appears to run in time O(n log n). http://…/0803.0476v2.pdf 
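The quantity that the Louvain method greedily increases is the modularity of a partition. The sketch below only evaluates modularity for a given partition of an undirected, unweighted graph without self-loops; it does not implement the Louvain search itself, and the function name and toy graph are assumptions for illustration.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges     : list of (u, v) pairs, each undirected edge listed once
    community : dict mapping each node to its community label
    """
    m = len(edges)
    internal = {}     # edges with both endpoints in the same community
    degree_sum = {}   # total degree of the nodes in each community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than lumping all nodes together, which is the kind of comparison a greedy optimizer makes when deciding whether to move a node between communities.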
Lovasz Convolutional Network (LCN) 
Semi-supervised learning on graph structured data has received significant attention with the recent introduction of graph convolution networks (GCN). While traditional methods have focused on optimizing a loss augmented with a Laplacian regularization framework, GCNs perform an implicit Laplacian type regularization to capture local graph structure. In this work, we propose Lovasz convolutional networks (LCNs), which are capable of incorporating global graph properties. LCNs achieve this by utilizing Lovasz’s orthonormal embeddings of the nodes. We analyse local and global properties of graphs and demonstrate settings where LCNs tend to work better than GCNs. We validate the proposed method on standard random graph models such as stochastic block models (SBM) and certain community structure based graphs where LCNs outperform GCNs and learn more intuitive embeddings. We also perform extensive binary and multiclass classification experiments on real world datasets to demonstrate the effectiveness of LCNs. In addition to simple graphs, we also demonstrate the use of LCNs on hypergraphs by identifying settings where they are expected to work better than GCNs. 
Low Algebraic Dimension Matrix Completion (LADMC) 
In the low rank matrix completion (LRMC) problem, the low rank assumption means that the columns (or rows) of the matrix to be completed are points on a lowdimensional linear algebraic variety. This paper extends this thinking to cases where the columns are points on a lowdimensional nonlinear algebraic variety, a problem we call Low Algebraic Dimension Matrix Completion (LADMC). Matrices whose columns belong to a union of subspaces (UoS) are an important special case. We propose a LADMC algorithm that leverages existing LRMC methods on a tensorized representation of the data. For example, a secondorder tensorization representation is formed by taking the outer product of each column with itself, and we consider higher order tensorizations as well. This approach will succeed in many cases where traditional LRMC is guaranteed to fail because the data are lowrank in the tensorized representation but not in the original representation. We also provide a formal mathematical justification for the success of our method. In particular, we show bounds of the rank of these data in the tensorized representation, and we prove sampling requirements to guarantee uniqueness of the solution. Interestingly, the sampling requirements of our LADMC algorithm nearly match the information theoretic lower bounds for matrix completion under a UoS model. We also provide experimental results showing that the new approach significantly outperforms existing stateoftheart methods for matrix completion in many situations. 
Low Complexity Neural Network (LCNN) 
Modern neural network architectures for largescale learning tasks have substantially higher model complexities, which makes understanding, visualizing and training these architectures difficult. Recent contributions to deep learning techniques have focused on architectural modifications to improve parameter efficiency and performance. In this paper, we derive a continuous and differentiable error functional for a neural network that minimizes its empirical error as well as a measure of the model complexity. The latter measure is obtained by deriving a differentiable upper bound on the VapnikChervonenkis (VC) dimension of the classifier layer of a class of deep networks. Using standard backpropagation, we realize a training rule that tries to minimize the error on training samples, while improving generalization by keeping the model complexity low. We demonstrate the effectiveness of our formulation (the Low Complexity Neural Network – LCNN) across several deep learning algorithms, and a variety of large benchmark datasets. We show that hidden layer neurons in the resultant networks learn features that are crisp, and in the case of image datasets, quantitatively sharper. Our proposed approach yields benefits across a wide range of architectures, in comparison to and in conjunction with methods such as Dropout and Batch Normalization, and our results strongly suggest that deep learning techniques can benefit from model complexity control methods such as the LCNN learning rule. 
Low Dimensional Manifold Regularized Neural Network (LDMNet) 
Deep neural networks have proved very successful on archetypal tasks for which large training sets are available, but when the training data are scarce, their performance suffers from overfitting. Many existing methods of reducing overfitting are dataindependent, and their efficacy is often limited when the training set is very small. Datadependent regularizations are mostly motivated by the observation that data of interest lie close to a manifold, which is typically hard to parametrize explicitly and often requires human input of tangent vectors. These methods typically only focus on the geometry of the input data, and do not necessarily encourage the networks to produce geometrically meaningful features. To resolve this, we propose a new framework, the LowDimensionalManifoldregularized neural Network (LDMNet), which incorporates a feature regularization method that focuses on the geometry of both the input data and the output features. In LDMNet, we regularize the network by encouraging the combination of the input data and the output features to sample a collection of low dimensional manifolds, which are searched efficiently without explicit parametrization. To achieve this, we directly use the manifold dimension as a regularization term in a variational functional. The resulting EulerLagrange equation is a LaplaceBeltrami equation over a point cloud, which is solved by the point integral method without increasing the computational complexity. We demonstrate two benefits of LDMNet in the experiments. First, we show that LDMNet significantly outperforms widelyused network regularizers such as weight decay and DropOut. Second, we show that LDMNet can be designed to extract common features of an object imaged via different modalities, which proves to be very useful in realworld applications such as crossspectral face recognition. 
Lowest Posterior Loss (LPL) 
This paper defines intrinsic credible regions, a method to produce objective Bayesian credible regions which only depends on the assumed model and the available data. Lowest posterior loss (LPL) regions are defined as Bayesian credible regions which contain values of minimum posterior expected loss: they depend both on the loss function and on the prior specification. An invariant, informationtheory based loss function, the intrinsic discrepancy is argued to be appropriate for scientific communication. Intrinsic credible regions are the lowest posterior loss regions with respect to the intrinsic discrepancy loss and the appropriate reference prior. The proposed procedure is completely general, and it is invariant under both reparametrization and marginalization. The exact derivation of intrinsic credible regions often requires numerical integration, but good analytical approximations are provided. Special attention is given to onedimensional intrinsic credible intervals; their coverage properties show that they are always approximate (and sometimes exact) frequentist confidence intervals. 
Lowest Posterior Loss Interval (LPLI) 
The Lowest Posterior Loss (LPL) interval (Bernardo, 2005), or LPLI, is a probability interval based on intrinsic discrepancy loss between prior and posterior distributions. The expected posterior loss is the loss associated with using a particular value theta[i] in theta of the parameter as the unknown true value of theta (Bernardo, 2005). Parameter values with smaller expected posterior loss should always be preferred. The LPL interval includes a region in which all parameter values have smaller expected posterior loss than those outside the region. Although any loss function could be used, the loss function should be invariant under reparameterization. Any intrinsic loss function is invariant under reparameterization, but not necessarily invariant under one-to-one transformations of data x. When a loss function is also invariant under one-to-one transformations, it is usually also invariant when reduced to a sufficient statistic. Only an intrinsic loss function that is invariant when reduced to a sufficient statistic should be considered. The intrinsic discrepancy loss is easily a superior loss function to the overused quadratic loss function, and is more appropriate than other popular measures, such as Hellinger distance, Kullback-Leibler divergence (KLD), and Jeffreys logarithmic divergence. The intrinsic discrepancy loss is also an information-theory related divergence measure. Intrinsic discrepancy loss is a symmetric, non-negative loss function, and is a continuous, convex function. Intrinsic discrepancy loss was introduced by Bernardo and Rueda (2002) in a different context: hypothesis testing. Formally, it is: delta f(p[2],p[1]) = min[kappa(p[2] | p[1]), kappa(p[1] | p[2])] where delta is the discrepancy, kappa is the KLD, and p[1] and p[2] are the probability distributions. The intrinsic discrepancy loss is the loss function, and the expected posterior loss is the mean of the directed divergences. 
The LPL interval is also called an intrinsic credible interval or intrinsic probability interval, and the area inside the interval is often called an intrinsic credible region or intrinsic probability region. In practice, whether a reference prior or weakly informative prior (WIP) is used, the LPL interval is usually very close to the HPD interval, though the posterior losses may be noticeably different. If LPL used a zero-one loss function, then the HPD interval would be produced. An advantage of the LPL interval over the HPD interval (see p.interval) is that the LPL interval is invariant to reparameterization. This is due to the invariant reparameterization property of reference priors. The quantile-based probability interval is also invariant to reparameterization. The LPL interval enjoys the same advantage as the HPD interval does over the quantile-based probability interval: it does not produce equal tails when inappropriate. Compared with probability intervals, the LPL interval is slightly less convenient to calculate. Although the prior distribution is specified within the Model specification function, the user must specify it for the LPL.interval function as well. A comparison of the quantile-based probability interval, HPD interval, and LPL interval is available here: http://…/credible. 
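The "equal tails" point can be seen on a skewed sample. The sketch below is an assumption-laden illustration, not the LPL.interval function: it compares a quantile-based interval with a sample-based HPD interval, taken here as the narrowest window covering the requested share of the sorted draws, which is only sensible for a unimodal posterior. The function names and the simulated exponential draws are hypothetical stand-ins for real posterior samples.

```python
import math
import random

def quantile_interval(samples, prob=0.95):
    """Equal-tailed interval: cut (1 - prob) / 2 from each end."""
    xs = sorted(samples)
    n = len(xs)
    tail = (1.0 - prob) / 2.0
    lower = xs[int(math.floor(tail * (n - 1)))]
    upper = xs[int(math.ceil((1.0 - tail) * (n - 1)))]
    return lower, upper

def hpd_interval(samples, prob=0.95):
    """Narrowest interval containing ceil(prob * n) of the sorted samples."""
    xs = sorted(samples)
    n = len(xs)
    k = int(math.ceil(prob * n))          # points the interval must cover
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

# Simulated right-skewed "posterior" draws (an assumption for illustration).
random.seed(0)
draws = [random.expovariate(1.0) for _ in range(5000)]

q_lo, q_hi = quantile_interval(draws)
h_lo, h_hi = hpd_interval(draws)
```

For a right-skewed sample like this one, the quantile interval discards 2.5% of the draws near zero even though that is where the density is highest, while the HPD interval starts near zero and is narrower.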
Low-pass Recurrent Neural Network  Reinforcement learning (RL) agents performing complex tasks must be able to remember observations and actions across sizable time intervals. This is especially true during the initial learning stages, when exploratory behaviour can increase the delay between specific actions and their effects. Many new or popular approaches for learning these distant correlations employ backpropagation through time (BPTT), but this technique requires storing observation traces long enough to span the interval between cause and effect. Besides memory demands, learning dynamics like vanishing gradients and slow convergence due to infrequent weight updates can reduce BPTT’s practicality; meanwhile, although online recurrent network learning is a developing topic, most approaches are not efficient enough to use as replacements. We propose a simple, effective memory strategy that can extend the window over which BPTT can learn without requiring longer traces. We explore this approach empirically on a few tasks and discuss its implications. 
Low-Rank Kernel Subspace Clustering  Most state-of-the-art subspace clustering methods only work with linear (or affine) subspaces. In this paper, we present a kernel subspace clustering method that can handle non-linear models. While an arbitrary kernel can non-linearly map data into high-dimensional Hilbert feature space, the data in the resulting feature space are very unlikely to have the desired subspace structures. By contrast, we propose to learn a low-rank kernel mapping, with which the mapped data in feature space are not only low-rank but also self-expressive, such that the low-dimensional subspace structures are present and manifested in the high-dimensional feature space. We have evaluated the proposed method extensively on both motion segmentation and image clustering benchmarks, and obtained superior results, outperforming the kernel subspace clustering method that uses standard kernels \cite{patel2014kernel} and other state-of-the-art linear subspace clustering methods. 
Low-Shot Transfer Detector (LSTD) 
Recent advances in object detection are mainly driven by deep learning with large-scale detection benchmarks. However, the fully-annotated training set is often limited for a target detection task, which may deteriorate the performance of deep detectors. To address this challenge, we propose a novel low-shot transfer detector (LSTD) in this paper, where we leverage rich source-domain knowledge to construct an effective target-domain detector with very few training examples. The main contributions are described as follows. First, we design a flexible deep architecture of LSTD to alleviate transfer difficulties in low-shot detection. This architecture can integrate the advantages of both SSD and Faster R-CNN in a unified deep framework. Second, we introduce a novel regularized transfer learning framework for low-shot detection, where the transfer knowledge (TK) and background depression (BD) regularizations are proposed to leverage object knowledge from the source and target domains respectively, in order to further enhance fine-tuning with a few target images. Finally, we examine our LSTD on a number of challenging low-shot detection experiments, where LSTD outperforms other state-of-the-art approaches. The results demonstrate that LSTD is a preferable deep detector for low-shot scenarios. 
Lp Space  In mathematics, the Lp spaces are function spaces defined using a natural generalization of the p-norm for finite-dimensional vector spaces. They are sometimes called Lebesgue spaces, named after Henri Lebesgue (Dunford & Schwartz 1958, III.3), although according to the Bourbaki group (Bourbaki 1987) they were first introduced by Frigyes Riesz (Riesz 1910). Lp spaces form an important class of Banach spaces in functional analysis, and of topological vector spaces. Lebesgue spaces have applications in physics, statistics, finance, engineering, and other disciplines. 
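For the finite-dimensional case that the Lp spaces generalize, the p-norm of a vector x is (sum of |x_i|^p)^(1/p) for p >= 1, with the maximum absolute entry as the limiting case p = infinity. A minimal sketch follows; the function name `p_norm` is chosen here for illustration only.

```python
import math

def p_norm(x, p):
    """p-norm of a finite sequence of numbers, for 1 <= p <= infinity."""
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a norm")
    if math.isinf(p):
        return max(abs(v) for v in x)      # limiting case: largest entry
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

v = [3, -4]
norm_1 = p_norm(v, 1)            # sum of absolute values
norm_2 = p_norm(v, 2)            # Euclidean length
norm_inf = p_norm(v, math.inf)   # largest absolute entry
```

For a fixed vector the p-norm does not increase as p grows, which the three values above reflect.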
Lua  Lua is a powerful, efficient, lightweight, embeddable scripting language. It supports procedural programming, object-oriented programming, functional programming, data-driven programming, and data description. Lua combines simple procedural syntax with powerful data description constructs based on associative arrays and extensible semantics. Lua is dynamically typed, runs by interpreting bytecode with a register-based virtual machine, and has automatic memory management with incremental garbage collection, making it ideal for configuration, scripting, and rapid prototyping. 
Luigi  Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in. Luigi is an open source Python-based data framework for building data pipelines. Instead of using an XML/YAML configuration of some sort, all the jobs and their dependencies are written as Python programs. Because it’s Python, developers can backtrack to figure out exactly how data is processed. The framework makes it easier to build large data pipelines, with built-in checkpointing, failure recovery, parallel execution, command line integration, etc. Since it’s a Python program, any Python library can be reused. The Luigi framework itself is a couple of thousand lines, so it’s also easy to understand the entire mechanism. Facebook built a similar internal system called Dataswarm, which allows developers to manage the entire data pipeline on Git + Python. While Luigi was originally invented for Spotify’s internal needs, companies such as Foursquare, Stripe, and Asana are using it in production. 
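The idea of jobs and dependencies written as Python programs, with checkpointing through output files, can be sketched in plain Python. The code below does not use Luigi and is not its implementation: the `Task` base class, the `build` helper, and the two example tasks are hypothetical and only loosely mirror the requires/output/run convention that Luigi tasks follow.

```python
import os
import tempfile

class Task:
    """Hypothetical stand-in for a pipeline task (not Luigi's own class)."""
    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []                      # tasks that must finish first

    def output(self):
        raise NotImplementedError      # path of the file this task produces

    def run(self):
        raise NotImplementedError      # the actual work

    def complete(self):
        # Checkpointing: a task counts as done once its output file exists.
        return os.path.exists(self.output())

class Extract(Task):
    def output(self):
        return os.path.join(self.workdir, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n1\n2\n")

class SortNumbers(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "sorted.txt")

    def run(self):
        with open(self.requires()[0].output()) as f:
            numbers = sorted(int(line) for line in f)
        with open(self.output(), "w") as f:
            f.write("\n".join(map(str, numbers)) + "\n")

def build(task, log):
    """Run dependencies first, skipping any task whose output already exists."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)

workdir = tempfile.mkdtemp()
log = []
build(SortNumbers(workdir), log)   # runs Extract, then SortNumbers
build(SortNumbers(workdir), log)   # does nothing: the output already exists
```

Because completion is tied to the output file, re-running the pipeline after a failure repeats only the tasks whose outputs are missing.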
Lurking Variable  Lurking variables represent hidden information, and preclude a full understanding of phenomena of interest. Detection is usually based on serendipity — visual detection of unexplained, systematic variation. However, these approaches are doomed to fail if the lurking variables do not vary. 