A/B Testing In marketing, A/B testing is a simple randomized experiment with two variants, A and B, which are the control and treatment in the controlled experiment. It is a form of statistical hypothesis testing. Other names include randomized controlled experiments, online controlled experiments, and split testing. In online settings, such as web design (especially user experience design), the goal is to identify changes to web pages that increase or maximize an outcome of interest (e.g., click-through rate for a banner advertisement).
Why does designing a simple A/B test seem so complicated?
ABC Analysis In materials management, the ABC analysis (or Selective Inventory Control) is an inventory categorization technique. ABC analysis divides an inventory into three categories- “A items” with very tight control and accurate records, “B items” with less tightly controlled and good records, and “C items” with the simplest controls possible and minimal records. The ABC analysis provides a mechanism for identifying items that will have a significant impact on overall inventory cost, while also providing a mechanism for identifying different categories of stock that will require different management and controls. The ABC analysis suggests that inventories of an organization are not of equal value. Thus, the inventory is grouped into three categories (A, B, and C) in order of their estimated importance.
‘A’ items are very important for an organization. Because of the high value of these ‘A’ items, frequent value analysis is required. In addition to that, an organization needs to choose an appropriate order pattern (e.g. ‘Just- in- time’) to avoid excess capacity. ‘B’ items are important, but of course less important than ‘A’ items and more important than ‘C’ items. Therefore ‘B’ items are intergroup items. ‘C’ items are marginally important.
Abstract Meaning Representation
We describe AbstractMeaning Representation (AMR), a semantic representation language in which we are writing down the meanings of thousands of English sentences. We hope that a sembank of simple, whole-sentence semantic structures will spur new work in statistical natural language understanding and generation, like the Penn Treebank encouraged work on statistical parsing. This paper gives an overviewof AMR and tools associated with it.
Abstract Meaning Representation (AMR) Annotation Release 1.0, Linguistic Data Consortium (LDC) Catalog Number LDC2014T12 and ISBN 1-58563-677-0, was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado’s Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 13,000 English natural language sentences from newswire, weblogs and web discussion forums.
AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
AMR Specification
Abstract Meaning Representation: A survey
ABtree An Algorithm for Subgroup-Based Treatment Assignment. Given two possible treatments, there may exist subgroups who benefit greater from one treatment than the other. This problem is relevant to the field of marketing, where treatments may correspond to different ways of selling a product. It is similarly relevant to the field of public policy, where treatments may correspond to specific government programs
Accelerated Destructive Degradation Test
Degradation data analysis is a powerful tool for reliability assessment. Useful reliability information is available from degradation data when there are few or even no failures. For some applications the degradation measurement process destroys or changes the physical/mechanical characteristics of test units. In such applications, only one meaningful measurement can be can be taken on each test unit. This is known as ‘destructive degradation’. Degradation tests are often accelerated by testing at higher than usual levels of accelerating variables like temperature.
Accelerated Distributed Nesterov Gradient Descent
This paper considers the distributed optimization problem over a network, where the objective is to optimize a global function formed by a sum of local functions, using only local computation and communication. We develop an Accelerated Distributed Nesterov Gradient Descent (Acc-DNGD) method. When the objective function is convex and $L$-smooth, we show that it achieves a $O(\frac{1}{t^{1.4-\epsilon}})$ convergence rate for all $\epsilon\in(0,1.4)$. We also show the convergence rate can be improved to $O(\frac{1}{t^2})$ if the objective function is a composition of a linear map and a strongly-convex and smooth function. When the objective function is $\mu$-strongly convex and $L$-smooth, we show that it achieves a linear convergence rate of $O([ 1 – O( (\frac{\mu}{L})^{5/7} )]^t)$, where $\frac{L}{\mu}$ is the condition number of the objective.
Accelerated Failure Time Model
In the statistical area of survival analysis, an accelerated failure time model (AFT model) is a parametric model that provides an alternative to the commonly used proportional hazards models. Whereas a proportional hazards model assumes that the effect of a covariate is to multiply the hazard by some constant, an AFT model assumes that the effect of a covariate is to accelerate or decelerate the life course of a disease by some constant. This is especially appealing in a technical context where the ‘disease’ is a result of some mechanical process with a known sequence of intermediary stages.
Accelerated Gradient Descent
In the world of optimization, we have a space and a convex objective function f we wish to minimize. We have seen that gradient descent is a simple greedy algorithm that works to minimize the objective function at some convergence rate (in this post we shall remain in discrete time). But the world is always stranger than we think. Indeed, there is a phenomenon of acceleration in convex optimization, in which we can boost the performance of some gradient-based algorithms by subtly modifying their implementation. In particular, we will discuss accelerated gradient descent, proposed by Yurii Nesterov in 1983, which achieves a faster – and optimal – convergence rate under the same assumption as gradient descent. Acceleration has received renewed research interests in recent years, leading to many proposed interpretations and further generalizations. Nevertheless, there is still a sense of mystery to what acceleration is doing and why it works; these are the questions that we want to understand better.
Accelerated Hierarchical Density Clustering We present an accelerated algorithm for hierarchical density based clustering. Our new algorithm improves upon HDBSCAN*, which itself provided a significant qualitative improvement over the popular DBSCAN algorithm. The accelerated HDBSCAN* algorithm provides comparable performance to DBSCAN, while supporting variable density clusters, and eliminating the need for the difficult to tune distance scale parameter. This makes accelerated HDBSCAN* the default choice for density based clustering. Library available at: https://…/hdbscan
Accelerated Mini-Batch k-Means We propose an accelerated Mini-Batch k-means algorithm which combines three key improvements. The first is a modified center update which results in convergence to a local minimum in fewer iterations. The second is an adaptive increase of batchsize to meet an increasing requirement for centroid accuracy. The third is the inclusion of distance bounds based on the triangle inequality, which are used to eliminate distance calculations along the same lines as Elkan’s algorithm. The combination of the two latter constitutes a very powerful scheme to reuse computation already done over samples until statistical accuracy requires the use of additional data points.
Accelerated Proximal Gradient
More Efficient Accelerated Proximal Algorithm for Nonconvex Problems
Acceptance-Rejection Method “Rejection Sampling”
Accuracy In the fields of science, engineering, industry, and statistics, the accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity’s actual (true) value.
Accuracy and Precision Accuracy and precision are defined in terms of systematic and random errors. The more common definition associates accuracy with systematic errors and precision with random errors. Another definition, advanced by ISO, associates trueness with systematic errors and precision with random errors, and defines accuracy as the combination of both trueness and precision.
Accuracy Paradox The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. It may be better to avoid the accuracy metric in favor of other metrics such as precision and recall.
Accuracy is often the starting point for analyzing the quality of a predictive model, as well as an obvious criterion for prediction. Accuracy measures the ratio of correct predictions to the total number of cases evaluated. It may seem obvious that the ratio of correct predictions to cases should be a key metric. A predictive model may have high accuracy, but be useless.
Activation Ensemble Many activation functions have been proposed in the past, but selecting an adequate one requires trial and error. We propose a new methodology of designing activation functions within a neural network at each layer. We call this technique an ‘activation ensemble’ because it allows the use of multiple activation functions at each layer. This is done by introducing additional variables, $\alpha$, at each activation layer of a network to allow for multiple activation functions to be active at each neuron. By design, activations with larger $\alpha$ values at a neuron is equivalent to having the largest magnitude. Hence, those higher magnitude activations are ‘chosen’ by the network. We implement the activation ensembles on a variety of datasets using an array of Feed Forward and Convolutional Neural Networks. By using the activation ensemble, we achieve superior results compared to traditional techniques. In addition, because of the flexibility of this methodology, we more deeply explore activation functions and the features that they capture.
Active Information Store
The goal of the AIS project is to provide a scalable information repository to support data-intensive information worker and decision support applications. AIS can help businesses to enhance, structure and navigate data to discover information and relationships that are in existing data stores but not readily available or obvious. (see also HANA Graph Engine)
Active Learning Active learning is a special case of semi-supervised machine learning in which a learning algorithm is able to interactively query the user (or some other information source) to obtain the desired outputs at new data points. In statistics literature it is sometimes also called optimal experimental design.
There are situations in which unlabeled data is abundant but manually labeling is expensive. In such a scenario, learning algorithms can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning.
Active Long Term Memory Network
Continual Learning in artificial neural networks suffers from interference and forgetting when different tasks are learned sequentially. This paper introduces the Active Long Term Memory Networks (A-LTM), a model of sequential multi-task deep learning that is able to maintain previously learned association between sensory input and behavioral output while acquiring knew knowledge. A-LTM exploits the non-convex nature of deep neural networks and actively maintains knowledge of previously learned, inactive tasks using a distillation loss. Distortions of the learned input-output map are penalized but hidden layers are free to transverse towards new local optima that are more favorable for the multi-task objective. We re-frame the McClelland’s seminal Hippocampal theory with respect to Catastrophic Inference (CI) behavior exhibited by modern deep architectures trained with back-propagation and inhomogeneous sampling of latent factors across epochs. We present empirical results of non-trivial CI during continual learning in Deep Linear Networks trained on the same task, in Convolutional Neural Networks when the task shifts from predicting semantic to graphical factors and during domain adaptation from simple to complex environments. We present results of the A-LTM model’s ability to maintain viewpoint recognition learned in the highly controlled iLab-20M dataset with 10 object categories and 88 camera viewpoints, while adapting to the unstructured domain of Imagenet with 1,000 object categories.
Active Rotating Filters
Deep Convolution Neural Networks (DCNNs) are capable of learning unprecedentedly effective image representations. However, their ability in handling significant local and global image rotations remains limited. In this paper, we propose Active Rotating Filters (ARFs) that actively rotate during convolution and produce feature maps with location and orientation explicitly encoded. An ARF acts as a virtual filter bank containing the filter itself and its multiple unmaterialised rotated versions. During back-propagation, an ARF is collectively updated using errors from all its rotated versions. DCNNs using ARFs, referred to as Oriented Response Networks (ORNs), can produce within-class rotation-invariant deep features while maintaining inter-class discrimination for classification tasks. The oriented response produced by ORNs can also be used for image and object orientation estimation tasks. Over multiple state-of-the-art DCNN architectures, such as VGG, ResNet, and STN, we consistently observe that replacing regular filters with the proposed ARFs leads to significant reduction in the number of network parameters and improvement in classification performance. We report the best results on several commonly used benchmarks.
Active Sampler Recent years have witnessed amazing outcomes from ‘Big Models’ trained by ‘Big Data’. Most popular algorithms for model training are iterative. Due to the surging volumes of data, we can usually afford to process only a fraction of the training data in each iteration. Typically, the data are either uniformly sampled or sequentially accessed. In this paper, we study how the data access pattern can affect model training. We propose an Active Sampler algorithm, where training data with more ‘learning value’ to the model are sampled more frequently. The goal is to focus training effort on valuable instances near the classification boundaries, rather than evident cases, noisy data or outliers. We show the correctness and optimality of Active Sampler in theory, and then develop a light-weight vectorized implementation. Active Sampler is orthogonal to most approaches optimizing the efficiency of large-scale data analytics, and can be applied to most analytics models trained by stochastic gradient descent (SGD) algorithm. Extensive experimental evaluations demonstrate that Active Sampler can speed up the training procedure of SVM, feature selection and deep learning, for comparable training quality by 1.6-2.2x.
ActiveClean Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors such as inconsistent, out-of-date, or outlier data. Identifying dirty data is often a manual and iterative process, and can be challenging on large datasets. However, many data cleaning workflows can introduce subtle biases into the training processes due to violation of independence assumptions. We propose ActiveClean, a progressive cleaning approach where the model is updated incrementally instead of re-training and can guarantee accuracy on partially cleaned data. ActiveClean supports a popular class of models called convex loss models (e.g., linear regression and SVMs). ActiveClean also leverages the structure of a user’s model to prioritize cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets UCI Adult, UCI EEG, MNIST, Dollars For Docs, and WorldBank with both real and synthetic errors. Our results suggest that our proposed optimizations can improve model accuracy by up-to 2.5x for the same amount of data cleaned. Furthermore for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.
Actuarial Science Actuarial science is the discipline that applies mathematical and statistical methods to assess risk in insurance, finance and other industries and professions. Actuaries are professionals who are qualified in this field through education and experience. In many countries, actuaries must demonstrate their competence by passing a series of rigorous professional examinations. Actuarial science includes a number of interrelated subjects, including probability, mathematics, statistics, finance, economics, financial economics, and computer programming. Historically, actuarial science used deterministic models in the construction of tables and premiums. The science has gone through revolutionary changes during the last 30 years due to the proliferation of high speed computers and the union of stochastic actuarial models with modern financial theory (Frees 1990). Many universities have undergraduate and graduate degree programs in actuarial science. In 2010, a study published by job search website CareerCast ranked actuary as the #1 job in the United States (Needleman 2010). The study used five key criteria to rank jobs: environment, income, employment outlook, physical demands, and stress. A similar study by U.S. News & World Report in 2006 included actuaries among the 25 Best Professions that it expects will be in great demand in the future (Nemko 2006).
AdaGAN Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) are an effective method for training generative models of complex data such as natural images. However, they are notoriously hard to train and can suffer from the problem of missing modes where the model is not able to produce examples in certain regions of the space. We propose an iterative procedure, called AdaGAN, where at every step we add a new component into a mixture model by running a GAN algorithm on a reweighted sample. This is inspired by boosting algorithms, where many potentially weak individual predictors are greedily aggregated to form a strong composite predictor. We prove that such an incremental procedure leads to convergence to the true distribution in a finite number of steps if each step is optimal, and convergence at an exponential rate otherwise. We also illustrate experimentally that this procedure addresses the problem of missing modes.
ADaptive Algorithm for Betweenness via Random Approximation
We present KADABRA, a new algorithm to approximate betweenness centrality in directed and undirected graphs, which significantly outperforms all previous approaches on real-world complex networks. The efficiency of the new algorithm relies on two new theoretical contribution, of independent interest. The first contribution focuses on sampling shortest paths, a subroutine used by most algorithms that approximate betweenness centrality. We show that, on realistic random graph models, we can perform this task in time $|E|^{\frac{1}{2}+o(1)}$ with high probability, obtaining a significant speedup with respect to the $\Theta(|E|)$ worst-case performance. We experimentally show that this new technique achieves similar speedups on real-world complex networks, as well. The second contribution is a new rigorous application of the adaptive sampling technique. This approach decreases the total number of shortest paths that need to be sampled to compute all betweenness centralities with a given absolute error, and it also handles more general problems, such as computing the $k$ most central nodes. Furthermore, our analysis is general, and it might be extended to other settings, as well.
Adaptive BAll COver for Classification
Stream mining poses unique challenges to machine learning: predictive models are required to be scalable, incrementally trainable, must remain bounded in size (even when the data stream is arbitrarily long), and be nonparametric in order to achieve high accuracy even in complex and dynamic environments. Moreover, the learning system must be parameterless —traditional tuning methods are problematic in streaming settings— and avoid requiring prior knowledge of the number of distinct class labels occurring in the stream. In this paper, we introduce a new algorithmic approach for nonparametric learning in data streams. Our approach addresses all above mentioned challenges by learning a model that covers the input space using simple local classifiers. The distribution of these classifiers dynamically adapts to the local (unknown) complexity of the classification problem, thus achieving a good balance between model complexity and predictive accuracy. We design four variants of our approach of increasing adaptivity. By means of an extensive empirical evaluation against standard nonparametric baselines, we show state-of-the-art results in terms of accuracy versus model size. For the variant that imposes a strict bound on the model size, we show better performance against all other methods measured at the same model size value. Our empirical analysis is complemented by a theoretical performance guarantee which does not rely on any stochastic assumption on the source generating the stream.
Adaptive Boosting
AdaBoost, short for “Adaptive Boosting”, is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire who won the prestigious “Gödel Prize” in 2003 for their work. It can be used in conjunction with many other types of learning algorithms to improve their performance. The output of the other learning algorithms (‘weak learners’) is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems, however, it can be less susceptible to the overfitting problem than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing (i.e., their error rate is smaller than 0.5 for binary classification), the final model can be proven to converge to a strong learner. While every learning algorithm will tend to suit some problem types better than others, and will typically have many different parameters and configurations to be adjusted before achieving optimal performance on a dataset, AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative ‘hardness’ of each training sample is fed into the tree growing algorithm such that later trees tend to focus on harder to classify examples.
Adaptive Coordinate Descent Adaptive coordinate descent is an extension of the coordinate descent algorithm to non-separable optimization. The adaptive coordinate descent approach gradually builds a transformation of the coordinate system such that the new coordinates are as decorrelated as possible with respect to the objective function. The adaptive coordinate descent was shown to be competitive to the state-of-the-art evolutionary algorithms and has the following invariance properties:
1. Invariance with respect to monotonous transformations of the function (scaling)
2. Invariance with respect to orthogonal transformations of the search space (rotation).
CMA-like Adaptive Encoding Update mostly based on principal component analysis is used to extend the coordinate descent method to the optimization of non-separable problems. The adaptation of an appropriate coordinate system allows adaptive coordinate descent to outperform coordinate descent on non-separable functions.
Adaptive Density Peak Detection
ADPclust clustering procedures (Fast Clustering Using Adaptive Density Peak Detection). The work is built and improved upon Rodriguez and Laio’s idea. ADPclust clusters data by finding density peaks in a density-distance plot generated from local multivariate Gaussian density estimation. It includes an automatic centroids selection and parameter optimization algorithm, which finds the number of clusters and cluster centroids by comparing average silhouettes on a grid of testing clustering results; It also includes an user interactive algorithm that allows the user to manually selects cluster centroids from a two dimensional ‘density-distance plot’.
Adaptive Generalized PCA
(adaptive gPCA)
When working with large biological data sets, exploratory analysis is an important first step for understanding the latent structure and for generating hypotheses to be tested in subsequent analyses. However, when the number of variables is large compared to the number of samples, standard methods such as principal components analysis give results which are unstable and difficult to interpret. To mitigate these problems, we have developed a method which allows the analyst to incorporate side information about the relationships between the variables in a way that encourages similar variables to have similar loadings on the principal axes. This leads to a low-dimensional representation of the samples which both describes the latent structure and which has axes which are interpretable in terms of groups of closely related variables. The method is derived by putting a prior encoding the relationships between the variables on the data and following through the analysis on the posterior distributions of the samples. We show that our method does well at reconstructing true latent structure in simulated data and we also demonstrate the method on a dataset investigating the effects of antibiotics on the composition of bacteria in the human gut.
Adaptive gPCA When working with large biological data sets, exploratory analysis is an important first step for understanding the latent structure and for generating hypotheses to be tested in subsequent analyses. However, when the number of variables is large compared to the number of samples, standard methods such as principal components analysis give results which are unstable and difficult to interpret. To mitigate these problems, we have developed a method which allows the analyst to incorporate side information about the relationships between the variables in a way that encourages similar variables to have similar loadings on the principal axes. This leads to a low-dimensional representation of the samples which both describes the latent structure and which has axes which are interpretable in terms of groups of closely related variables. The method is derived by putting a prior encoding the relationships between the variables on the data and following through the analysis on the posterior distributions of the samples. We show that our method does well at reconstructing true latent structure in simulated data and we also demonstrate the method on a dataset investigating the effects of antibiotics on the composition of bacteria in the human gut.
Adaptive Huber Regression Big data are often contaminated by outliers and heavy-tailed errors, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension and moments for optimal tradeoff between bias and robustness. Our framework is able to handle heavy-tailed data with bounded $(1 \! + \! \delta)$-th moment for any $\delta\!>\!0$. We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when $\delta \!\geq\! 1$, the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime $0 \!<\! \delta \!<\! 1$ and the transition is smooth and optimal. Moreover, a nonasymptotic Bahadur representation for finite-sample inference is derived when the variance is finite. Numerical experiments lend further support to our obtained theory.
Adaptive Learning Rate
Automatically set learning rate for each neuron in a neural network based on it´s training history.
Adaptive Linear Neuron
ADALINE (Adaptive Linear Neuron or later Adaptive Linear Element) is an early single-layer artificial neural network and the name of the physical device that implemented this network. The network uses memistors. It was developed by Professor Bernard Widrow and his graduate student Ted Hoff at Stanford University in 1960. It is based on the McCulloch–Pitts neuron. It consists of a weight, a bias and a summation function. The difference between Adaline and the standard (McCulloch–Pitts) perceptron is that in the learning phase the weights are adjusted according to the weighted sum of the inputs (the net). In the standard perceptron, the net is passed to the activation (transfer) function and the function’s output is used for adjusting the weights. There also exists an extension known as Madaline.
Adaptive Mixture Discriminant Analysis
In supervised learning, an important issue usually not taken into account by classical methods is the possibility of having in the test set individuals belonging to a class which has not been observed during the learning phase. Classical supervised algorithms will automatically label such observations as belonging to one of the known classes in the training set and will not be able to detect new classes. This work introduces a model-based discriminant analysis method, called adaptive mixture discriminant analysis (AMDA), which is able to detect several unobserved groups of points and to adapt the learned classifier to the new situation. Two EM-based procedures are proposed for parameter estimation and model selection criteria are used for selecting the actual number of classes. Experiments on artificial and real data demonstrate the ability of the proposed method to deal with complex and real word problems. The proposed approach is also applied to the detection of unobserved communities in social network analysis.
Adaptive Moving Average
The Adaptive Moving Average (AMA) study is similar to the exponential moving average (EMA), except the AMA uses a scalable constant instead of a fixed constant for smoothing the data.
Adaptive Moving Self-organizing Map
Self-Organizing Map (SOM) is a neural network model which is used to obtain a topology-preserving mapping from the (usually high dimensional) input/feature space to an output/map space of fewer dimensions (usually two or three in order to facilitate visualization). Neurons in the output space are connected with each other but this structure remains fixed throughout training and learning is achieved through the updating of neuron reference vectors in feature space. Despite the fact that growing variants of SOM overcome the fixed structure limitation they increase computational cost and also do not allow the removal of a neuron after its introduction. In this paper, a variant of SOM is proposed called AMSOM (Adaptive Moving Self-Organizing Map) that on the one hand creates a more flexible structure where neuron positions are dynamically altered during training and on the other hand tackles the drawback of having a predefined grid by allowing neuron addition and/or removal during training. Experiments using multiple literature datasets show that the proposed method improves training performance of SOM, leads to a better visualization of the input dataset and provides a framework for determining the optimal number and structure of neurons.
Adaptive p-value Thresholding
We consider the problem of multiple hypothesis testing with generic side information: for each hypothesis $H_i$ we observe both a p-value $p_i$ and some predictor $x_i$ encoding contextual information about the hypothesis. For large-scale problems, adaptively focusing power on the more promising hypotheses (those more likely to yield discoveries) can lead to much more powerful multiple testing procedures. We propose a general iterative framework for this problem, called the Adaptive p-value Thresholding (AdaPT) procedure, which adaptively estimates a Bayes-optimal p-value rejection threshold and controls the false discovery rate (FDR) in finite samples. At each iteration of the procedure, the analyst proposes a rejection threshold and observes partially censored p-values, estimates the false discovery proportion (FDP) below the threshold, and either stops to reject or proposes another threshold, until the estimated FDP is below $\alpha$. Our procedure is adaptive in an unusually strong sense, permitting the analyst to use any statistical or machine learning method she chooses to estimate the optimal threshold, and to switch between different models at each iteration as information accrues.
Adaptive Resonance Theory
Adaptive resonance theory (ART) is a theory developed by Stephen Grossberg and Gail Carpenter on aspects of how the brain processes information. It describes a number of neural network models which use supervised and unsupervised learning methods, and address problems such as pattern recognition and prediction.
Adaptive Robust Control In this paper we propose a new methodology for solving an uncertain stochastic Markovian control problem in discrete time. We call the proposed methodology the adaptive robust control. We demonstrate that the uncertain control problem under consideration can be solved in terms of associated adaptive robust Bellman equation. The success of our approach is to the great extend owed to the recursive methodology for construction of relevant confidence regions. We illustrate our methodology by considering an optimal portfolio allocation problem, and we compare results obtained using the adaptive robust control method with some other existing methods.
Adaptive Software http://…/Adaptive_software_development
Adaptive System The term adaptation is used in biology in relation to how living beings adapt to their environments, but with two different meanings. First, the continuous adaptation of an organism to its environment, so as to maintain itself in a viable state, through sensory feedback mechanisms. Second, the development (through evolutionary steps) of an adaptation (an anatomic structure, physiological process or behavior characteristic) that increases the probability of an organism reproducing itself (although sometimes not directly). Generally speaking, an adaptive system is a set of interacting or interdependent entities, real or abstract, forming an integrated whole that together are able to respond to environmental changes or changes in the interacting parts. Feedback loops represent a key feature of adaptive systems, allowing the response to changes; examples of adaptive systems include: natural ecosystems, individual organisms, human communities, human organizations, and human families. Some artificial systems can be adaptive as well; for instance, robots employ control systems that utilize feedback loops to sense new conditions in their environment and adapt accordingly.
Adaptive Thouless-Anderson-Palmer Mean Field Approach
We develop an advanced mean field method for approximating averages in probabilistic data models that is based on the TAP approach of disorder physics. In contrast to conventional TAP, where the knowledge of the distribution of couplings between the random variables is required, our method adapts to the concrete couplings. We demonstrate the validity of our approach, which is sofar restricted to models with non-glassy behaviour, by replica calculations for a wide class of models as well as by simulations for a real data set.
adaQN Recurrent Neural Networks (RNNs) are powerful models that achieve unparalleled performance on several pattern recognition problems. However, training of RNNs is a computationally difficult task owing to the well-known ‘vanishing/exploding’ gradient problems. In recent years, several algorithms have been proposed for training RNNs. These algorithms either: exploit no (or limited) curvature information and have cheap per-iteration complexity; or attempt to gain significant curvature information at the cost of increased per-iteration cost. The former set includes diagonally-scaled first-order methods such as ADAM and ADAGRAD while the latter consists of second-order algorithms like Hessian-Free Newton and K-FAC. In this paper, we present an novel stochastic quasi-Newton algorithm (adaQN) for training RNNs. Our approach retains a low per-iteration cost while allowing for non-diagonal scaling through a stochastic L-BFGS updating scheme. The method is judicious in storing and retaining L-BFGS curvature pairs which is indirectly used as a means of controlling the quality of the steps. We present numerical experiments on two language modeling tasks and show that adaQN performs at par, if not better, than popular RNN training algorithms. These results suggest that quasi-Newton algorithms have the potential to be a viable alternative to first- and second-order methods for training RNNs.
Additive Noise Models
The core idea of these so-called Additive Noise Models (ANM) is that if X->Y, then the variability observed in Y will be either explained by X, or by some noise that is independent of X.
Additive Polynomial Design Matrix
An implementation of the additive polynomial (AP) design matrix. It constructs and appends an AP design matrix to a data frame for use with longitudinal data subject to seasonality.
Additive Principal Components
Additive principal components are a nonlinear generalization of linear principal components.
Additive Smoothing In statistics, additive smoothing, also called Laplace smoothing (not to be confused with Laplacian smoothing), or Lidstone smoothing, is a technique used to smooth categorical data.
Adjacency Diagram The adjacency diagram is a space-filling variant of the node-link diagram; rather than drawing a link between parent and child in the hierarchy, nodes are drawn as solid areas (either arcs or bars), and their placement relative to adjacent nodes reveals their position in the hierarchy. The icicle layout in figure 4D is similar to the first node-link diagram in that the root node appears at the top, with child nodes underneath. Because the nodes are now space-filling, however, we can use a length encoding for the size of software classes and packages. This reveals an additional dimension that would be difficult to show in a node-link diagram.
Adjacency Matrix In mathematics and computer science, an adjacency matrix is a means of representing which vertices (or nodes) of a graph are adjacent to which other vertices. Another matrix representation for a graph is the incidence matrix.
Admixture Models The core of the admixture model, proposed by Smith (1953) to deal with heterogeneous traits in genetic linkage analysis, is the hypothesis that a fraction lambda of pedigrees is linked to the marker locus being tested while a fraction 1 – lambda is unlinked. This model leads naturally to the vexing problem of testing for a mixture and has given rise to a large literature.
Adometry Adometry by Google solves the complex challenge of integrating, measuring, and optimizing marketing data across all channels — both online and offline — so you can generate actionable insights that improve ROI.
Advanced Analytics There is an increasing use of the term advanced analytics, typically used to describe the technical aspects of analytics, especially predictive modeling, machine learning techniques, and neural networks.
Adversarial Eliminating With GAN
Although Neural networks could achieve state-of-the-art performance while recongnizing images, they often suffer a tremendous defeat from adversarial examples–inputs generated by utilizing imperceptible but intentional perturbations to samples from the datasets. How to defense against adversarial examples is an important problem which is well worth to research. So far, only two well-known methods adversarial training and defensive distillation have provided a significant defense. In contrast to existing methods mainly based on model itself, we address the problem purely based on the adversarial examples itself. In this paper, a novel idea and the first framework based Generative Adversarial Nets named AE-GAN capable of resisting adversarial examples are proposed. Extensive experiments on benchmark datasets indicate that AE-GAN is able to defense against adversarial examples effectively.
Adversarial Feature Learning
The ability of the Generative Adversarial Networks (GANs) framework to learn generative models mapping from simple latent distributions to arbitrarily complex data distributions has been demonstrated empirically, with compelling results showing generators learn to ‘linearize semantics’ in the latent space of such models. Intuitively, such latent spaces may serve as useful feature representations for auxiliary problems where semantics are relevant. However, in their existing form, GANs have no means of learning the inverse mapping — projecting data back into the latent space. We propose Bidirectional Generative Adversarial Networks (BiGANs) as a means of learning this inverse mapping, and demonstrate that the resulting learned feature representation is useful for auxiliary supervised discrimination tasks, competitive with contemporary approaches to unsupervised and self-supervised feature learning.
Adversarial Generator-Encoder Networks We present a new autoencoder-type architecture, that is trainable in an unsupervised mode, sustains both generation and inference, and has the quality of conditional and unconditional samples boosted by adversarial learning. Unlike previous hybrids of autoencoders and adversarial networks, the adversarial game in our approach is set up directly between the encoder and the generator, and no external mappings are trained in the process of learning. The game objective compares the divergences of each of the real and the generated data distributions with the canonical distribution in the latent space. We show that direct generator-vs-encoder game leads to a tight coupling of the two components, resulting in samples and reconstructions of a comparable quality to some recently-proposed more complex architectures.
Adversarial Machine Learning
Adversarial machine learning is the formal name for studying what happens when conceding even a slightly more realistic alternative to assumptions of these types (harmlessly called “relaxing assumptions” ….
Adversarial Machine Learning
Adversarial Transformation Networks Multiple different approaches of generating adversarial examples have been proposed to attack deep neural networks. These approaches involve either directly computing gradients with respect to the image pixels, or directly solving an optimization on the image pixels. In this work, we present a fundamentally new method for generating adversarial examples that is fast to execute and provides exceptional diversity of output. We efficiently train feed-forward neural networks in a self-supervised manner to generate adversarial examples against a target network or set of networks. We call such a network an Adversarial Transformation Network (ATN). ATNs are trained to generate adversarial examples that minimally modify the classifier’s outputs given the original input, while constraining the new classification to match an adversarial target class. We present methods to train ATNs and analyze their effectiveness targeting a variety of MNIST classifiers as well as the latest state-of-the-art ImageNet classifier Inception ResNet v2.
Adversary Model In computer science, an online algorithm measures its competitiveness against different adversary models. For deterministic algorithms, the adversary is the same, the adaptive offline adversary. For randomized online algorithms competitiveness can depend upon the adversary model used.
Affect-LM Human verbal communication includes affective messages which are conveyed through use of emotionally colored words. There has been a lot of research in this direction but the problem of integrating state-of-the-art neural language models with affective information remains an area ripe for exploration. In this paper, we propose an extension to an LSTM (Long Short-Term Memory) language model for generating conversational text, conditioned on affect categories. Our proposed model, Affect-LM enables us to customize the degree of emotional content in generated sentences through an additional design parameter. Perception studies conducted using Amazon Mechanical Turk show that Affect-LM generates naturally looking emotional sentences without sacrificing grammatical correctness. Affect-LM also learns affect-discriminative word representations, and perplexity experiments show that additional affective information in conversational text can improve language model prediction.
Affinity Analysis Affinity analysis is a data analysis and data mining technique that discovers co-occurrence relationships among activities performed by (or recorded about) specific individuals or groups. In general, this can be applied to any process where agents can be uniquely identified and information about their activities can be recorded. In retail, affinity analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans.
“Market Basket Analysis”
agate agate is a Python data analysis library that is optimized for humans instead of machines. It is an alternative to numpy and pandas that helps you solve real-world problems
Age Period Cohort Model
Age-Period-Cohort models is a class of models for demographic rates (mortality/morbidity/fertility/…) observed for a broad age range over a reasonably long time period, and classified by age and date of follow-up (period) and date of birth (cohort). This type of follow-up can be shown in a Lexis-diagram; a coordinate system with data of follow-up along the x-axis, and age along the y-axis. A single persons life-trajectory is therefore a straight line with slope 1 (as calender time and age advance at the same pace). Tabulated data enumerates the number of events and the risk time (sum of lengths of life-trajectories) in some subsets of the Lexis diagram, usually subsets classified by age and period in equally long intervals. Individual life-lines can be shown with colouring according to states, or the diagram can just be shown to indicate what ages and periods are covered, and what subsets are used for classification of events and risk time. The Age-Period-Cohort model describes the (log)rates as a sum of (non-linear) age- period- and cohort-effects. The three variables age (at follow-up), a, period (i.e. date of follow-up), p, and cohort (date of birth), c, are related by a=p-c – any one person’s age is calculated by subtracting the date of birth from the current date. Hence the three variables used to describe rates are linearly related, and the model can therefore be parametrized in different ways, and still produce the same estimated rates. In popular terms you can say that it is possible to move two linear effects around between the three terms, because the age-terms contains the linear effect of age, the period-terms contains the linear effect of period and the cohort effect contains the linear effect of cohort. An illustration of this phenomenon is in this little “film” of APC-effects on testis cancer rates in Denmark. All sets of estimates will yield the same set of fitted rates.
Agent Based Model
An agent-based model (ABM) is a class of computational models for simulating the actions and interactions of autonomous agents (both individual or collective entities such as organizations or groups) with a view to assessing their effects on the system as a whole. It combines elements of game theory, complex systems, emergence, computational sociology, multi-agent systems, and evolutionary programming. Monte Carlo Methods are used to introduce randomness. Particularly within ecology, ABMs are also called individual-based models (IBMs), and individuals within IBMs may be simpler than than fully autonomous agents within ABMs. A review of recent literature on individual-based models, agent-based models, and multiagent systems shows that ABMs are used on non-computing related scientific domains including biology, ecology and social science. Agent-based modeling is related to, but distinct from, the concept of multi-agent systems or multi-agent simulation in that the goal of ABM is to search for explanatory insight into the collective behavior of agents obeying simple rules, typically in natural systems, rather than in designing agents or solving specific practical or engineering problems.
Agent-Based Modeling and Simulation
“Agent Based Model”
Agent-Based Modeling and Simulation
Agglomerative Hierarchical Clustering
Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC. Top-down clustering requires a method for splitting a cluster. It proceeds by splitting clusters recursively until individual documents are reached.
Agglomerative Info-Clustering An agglomerative clustering of random variables is proposed, where clusters of random variables sharing the maximum amount of multivariate mutual information are merged successively to form larger clusters. Compared to the previous info-clustering algorithms, the agglomerative approach allows the computation to stop earlier when clusters of desired size and accuracy are obtained. An efficient algorithm is also derived based on the submodularity of entropy and the duality between the principal sequence of partitions and the principal sequence for submodular functions.
Aggregated Lexical Table
Correspondence Analysis on Generalised Aggregated Lexical Tables
Aggregated Wasserstein We propose a framework, named Aggregated Wasserstein, for computing a dissimilarity measure or distance between two Hidden Markov Models with state conditional distributions being Gaussian. For such HMMs, the marginal distribution at any time spot follows a Gaussian mixture distribution, a fact exploited to softly match, aka register, the states in two HMMs. We refer to such HMMs as Gaussian mixture model-HMM (GMM-HMM). The registration of states is inspired by the intrinsic relationship of optimal transport and the Wasserstein metric between distributions. Specifically, the components of the marginal GMMs are matched by solving an optimal transport problem where the cost between components is the Wasserstein metric for Gaussian distributions. The solution of the optimization problem is a fast approximation to the Wasserstein metric between two GMMs. The new Aggregated Wasserstein distance is a semi-metric and can be computed without generating Monte Carlo samples. It is invariant to relabeling or permutation of the states. This distance quantifies the dissimilarity of GMM-HMMs by measuring both the difference between the two marginal GMMs and the difference between the two transition matrices. Our new distance is tested on the tasks of retrieval and classification of time series. Experiments on both synthetic data and real data have demonstrated its advantages in terms of accuracy as well as efficiency in comparison with existing distances based on the Kullback-Leibler divergence.
Aggregation Operator Aggregation operators are mathematical functions that are used to combine information. That is, they are used to combine N data (for example, N numerical values) in a single datum. The arithmetic mean and the weighted mean are the most well-known aggregation operators. The median and the mode can also been as aggregation operators. The main difference between the arithmetic mean and the weighted mean is that the latter permits us to weight the different data according to their relevance. There exist a large number of different aggregation operators that differ on the assumptions on the data (data types) and about the type of information that we can incorporate in the model. For example, fuzzy integrals permits us to assign relevance to sets of information sources and not only to individual sources as for the weighted mean.
Agnostic Disambiguation of Named Entities Using Linked Open Data
AGDISTIS is an Open Source Named Entity Disambiguation Framework able to link entities against every Linked Data Knowledge Base. The ongoing transition from the current Web of unstructured data to the Data Web yet requires scalable and accurate approaches for the extraction of structured data in RDF (Resource Description Framework). One of the key steps towards extracting RDF from natural-language corpora is the disambiguation of named entities. AGDISTIS combines the HITS algorithm with label expansion strategies and string similarity measures. Based on this combination, it can efficiently detect the correct URIs for a given set of named entities within an input text. Furthermore, AGDISTIS is agnostic of the underlying knowledge base. AGDISTIS has been evaluated on different datasets against state-of-the-art named entity disambiguation frameworks.
AIDE In this paper, we present two new communication-efficient methods for distributed minimization of an average of functions. The first algorithm is an inexact variant of the DANE algorithm that allows any local algorithm to return an approximate solution to a local subproblem. We show that such a strategy does not affect the theoretical guarantees of DANE significantly. In fact, our approach can be viewed as a robustification strategy since the method is substantially better behaved than DANE on data partition arising in practice. It is well known that DANE algorithm does not match the communication complexity lower bounds. To bridge this gap, we propose an accelerated variant of the first method, called AIDE, that not only matches the communication lower bounds but can also be implemented using a purely first-order oracle. Our empirical results show that AIDE is superior to other communication efficient algorithms in settings that naturally arise in machine learning applications.
AIXIjs Reinforcement learning is a general and powerful framework with which to study and implement artificial intelligence. Recent advances in deep learning have enabled RL algorithms to achieve impressive performance in restricted domains such as playing Atari video games (Mnih et al., 2015) and, recently, the board game Go (Silver et al., 2016). However, we are still far from constructing a generally intelligent agent. Many of the obstacles and open questions are conceptual: What does it mean to be intelligent? How does one explore and learn optimally in general, unknown environments? What, in fact, does it mean to be optimal in the general sense? The universal Bayesian agent AIXI (Hutter, 2005) is a model of a maximally intelligent agent, and plays a central role in the sub-field of general reinforcement learning (GRL). Recently, AIXI has been shown to be flawed in important ways; it doesn’t explore enough to be asymptotically optimal (Orseau, 2010), and it can perform poorly with certain priors (Leike and Hutter, 2015). Several variants of AIXI have been proposed to attempt to address these shortfalls: among them are entropy-seeking agents (Orseau, 2011), knowledge-seeking agents (Orseau et al., 2013), Bayes with bursts of exploration (Lattimore, 2013), MDL agents (Leike, 2016a), Thompson sampling (Leike et al., 2016), and optimism (Sunehag and Hutter, 2015). We present AIXIjs, a JavaScript implementation of these GRL agents. This implementation is accompanied by a framework for running experiments against various environments, similar to OpenAI Gym (Brockman et al., 2016), and a suite of interactive demos that explore different properties of the agents, similar to REINFORCEjs (Karpathy, 2015). We use AIXIjs to present numerous experiments illustrating fundamental properties of, and differences between, these agents.
Akaike Information Criterion
The Akaike information criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data. As such, AIC provides a means for model selection. AIC deals with the trade-off between the goodness of fit of the model and the complexity of the model. It is founded on information theory: it offers a relative estimate of the information lost when a given model is used to represent the process that generates the data. AIC does not provide a test of a model in the sense of testing a null hypothesis; i.e. AIC can tell nothing about the quality of the model in an absolute sense. If all the candidate models fit poorly, AIC will not give any warning of that.
Akid Neural networks are a revolutionary but immature technique that is fast evolving and heavily relies on data. To benefit from the newest development and newly available data, we want the gap between research and production as small as possibly. On the other hand, differing from traditional machine learning models, neural network is not just yet another statistic model, but a model for the natural processing engine — the brain. In this work, we describe a neural network library named {\texttt akid}. It provides higher level of abstraction for entities (abstracted as blocks) in nature upon the abstraction done on signals (abstracted as tensors) by Tensorflow, characterizing the dataism observation that all entities in nature processes input and emit out in some ways. It includes a full stack of software that provides abstraction to let researchers focus on research instead of implementation, while at the same time the developed program can also be put into production seamlessly in a distributed environment, and be production ready. At the top application stack, it provides out-of-box tools for neural network applications. Lower down, akid provides a programming paradigm that lets user easily build customized models. The distributed computing stack handles the concurrency and communication, thus letting models be trained or deployed to a single GPU, multiple GPUs, or a distributed environment without affecting how a model is specified in the programming paradigm stack. Lastly, the distributed deployment stack handles how the distributed computing is deployed, thus decoupling the research prototype environment with the actual production environment, and is able to dynamically allocate computing resources, so development (Devs) and operations (Ops) could be separated. Please refer to http://…/latest for documentation.
Akismet Akismet or Automattic Kismet is a spam filtering service. It attempts to filter link spam from blog comments and spam TrackBack pings. The filter works by combining information about spam captured on all participating blogs, and then using those spam rules to block future spam. Akismet is offered by Automattic, the company behind Launched on October 25, 2005, Akismet is said to have captured over 100 billion spam comments and pings as of October 2013.
akka Akka is a toolkit and runtime for building highly concurrent, distributed, and fault tolerant event-driven applications on the JVM.
ALAMO ALAMO is a computational methodology for leaning algebraic functions from data. Given a data set, the approach begins by building a low-complexity, linear model composed of explicit non-linear transformations of the independent variables. Linear combinations of these non-linear transformations allow a linear model to better approximate complex behavior observed in real processes. The model is refined, as additional data are obtained in an adaptive fashion through error maximization sampling using derivative-free optimization. Models built using ALAMO can enforce constraints on the response variables to incorporate first-principles knowledge. The ability of ALAMO to generate simple and accurate models for a number of reaction problems is demonstrated. The error maximization sampling is compared with Latin hypercube designs to demonstrate its sampling efficiency. ALAMO’s constrained regression methodology is used to further refine concentration models, resulting in models that perform better on validation data and satisfy upper and lower bounds placed on model outputs.
Algebraic Statistics Algebraic statistics is the use of algebra to advance statistics. Algebra has been useful for experimental design, parameter estimation, and hypothesis testing. Traditionally, algebraic statistics has been associated with the design of experiments and multivariate analysis (especially time series). In recent years, the term “algebraic statistics” has been sometimes restricted, sometimes being used to label the use of algebraic geometry and commutative algebra in statistics.
Algebraic Subspace Clustering
Algebraic Subspace Clustering (ASC) is a simple and elegant method based on polynomial fitting and differentiation for clustering noiseless data drawn from an arbitrary union of subspaces. In practice, however, ASC is limited to equi-dimensional subspaces because the estimation of the subspace dimension via algebraic methods is sensitive to noise. This paper proposes a new ASC algorithm that can handle noisy data drawn from subspaces of arbitrary dimensions. The key ideas are (1) to construct, at each point, a decreasing sequence of subspaces containing the subspace passing through that point; (2) to use the distances from any other point to each subspace in the sequence to construct a subspace clustering affinity, which is superior to alternative affinities both in theory and in practice. Experiments on the Hopkins 155 dataset demonstrate the superiority of the proposed method with respect to sparse and low rank subspace clustering methods.
Algorithm Quasi-Optimal Learning
The algorithm quasi-optimal (AQ) is a powerful machine learning methodology aimed at learning symbolic decision rules from a set of examples and counterexamples. It was first proposed in the late 1960s to solve the Boolean function satisfiability problem and further refined over the following decade to solve the general covering problem. In its newest implementations, it is a powerful but yet little explored methodology for symbolic machine learning classification. It has been applied to solve several problems from different domains, including the generation of individuals within an evolutionary computation framework. The current article introduces the main concepts of the AQ methodology and describes AQ for source detection(AQ4SD), a tailored implementation of the AQ methodology to solve the problem of finding the sources of atmospheric releases using distributed sensor measurements.
Algorithmia An Open Marketplace For Algorithms: We’re building a community around state-of-the-art algorithm development. Users can create, share, and build on other algorithms and then instantly make them available as a web service.
Algorithmic Complexity
The information content or complexity of an object can be measured by the length of its shortest description. For instance the string “01010101010101010101010101010101” has the short description “16 repetitions of 01”, while “11001000011000011101111011101100” presumably has no simpler description other than writing down the string itself. More formally, the Algorithmic “Kolmogorov” Complexity (AC) of a string x is defined as the length of the shortest program that computes or outputs x , where the program is run on some fixed reference universal computer.
Algorithmic Transparency What information is your black-box analytics system really relying on, and should it?
Aligned Rank Transform
Nonparametric data from multi-factor experiments arise often in human-computer interaction (HCI). Examples may include error counts, Likert responses, and preference tallies. But because multiple factors are involved, common nonparametric tests (e.g., Friedman) are inadequate, as they are unable to examine interaction effects. While some statistical techniques exist to handle such data, these techniques are not widely available and are complex. To address these concerns, we present the Aligned Rank Transform (ART) for nonparametric factorial data analysis in HCI. The ART relies on a preprocessing step that “aligns” data before applying averaged ranks, after which point common ANOVA procedures can be used, making the ART accessible to anyone familiar with the F-test. Unlike most articles on the ART, which only address two factors, we generalize the ART to N factors. We also provide ARTool and ARTweb, desktop and Web-based programs for aligning and ranking data. Our re-examination of some published HCI results exhibits advantages of the ART.
ALINE Algorithm
The ALINE algorithm (Kondrak, 2000) assigns a similarity score to pairs of phonetically-transcribed words on the basis of the decomposition of phonemes into elementary phonetic features. The algorithm was originally designed to identify and align cognates in vocabularies of related languages. Nevertheless, thanks to its grounding in universal phonetic principles, the algorithm can be used for estimating the similarity of any pair of words. The principal component of ALINE is a function that calculates the similarity of two phonemes that are expressed in terms of about a dozen multi-valued phonetic features (Place, Manner, Voice, etc.). The phonetic features are assigned salience weights that express their relative importance. Feature values are encoded as oating-point numbers in the range [0;1]. For example, the feature Manner can take any of the following seven values: stop = 1.0, affricate = 0.9, fricative = 0.8, approximant = 0.6, high vowel = 0.4, mid vowel = 0.2, and low vowel = 0.0. The numerical values re ect the distances between vocal organs during speech production. The overall similarity score is the sum of individual similarity scores between pairs of phonemes in an optimal alignment of two words, which is computed by a dynamic programming algorithm (Wagner and Fischer, 1974). A constant insertion/deletion penalty is applied for each unaligned phoneme. Another constant penalty is set to reduce relative importance of vowel—as opposed to consonant— phoneme matches. The similarity value is normalized by the length of the longer word. ALINE’s behavior is controlled by a number of parameters: the maximum phonemic score, the insertion/ deletion penalty, the vowel penalty, and the feature salience weights. The parameters have default settings for the cognate matching task, but these settings can be optimized (tuned) on a development set that includes both positive and negative examples of similar words.
Greg Kondrak
Evaluation of Several Phonetic Similarity Algorithms on the Task of Cognate Identification
Allan Variance
The Allan variance (AVAR), also known as two-sample variance, is a measure of frequency stability in clocks, oscillators and amplifiers. It is named after David W. Allan. The Allan variance is intended to estimate stability due to noise processes and not that of systematic errors or imperfections such as frequency drift or temperature effects. The Allan variance and Allan deviation describe frequency stability, i.e. the stability in frequency. See also the section entitled ‘Interpretation of value’ below. There are also different adaptations or alterations of Allan variance, notably the modified Allan variance MAVAR or MVAR, the total variance, and the Hadamard variance. There also exist time stability variants such as time deviation TDEV or time variance TVAR. Allan variance and its variants have proven useful outside the scope of timekeeping and are a set of improved statistical tools to use whenever the noise processes are not unconditionally stable, thus a derivative exists. The general M-sample variance remains important since it allows dead time in measurements and bias functions allows conversion into Allan variance values. Nevertheless, for most applications the special case of 2-sample, or ‘Allan variance’ with T = tau is of greatest interest.
All-Pairs Testing In computer science, all-pairs testing or pairwise testing is a combinatorial method of software testing that, for each pair of input parameters to a system (typically, a software algorithm), tests all possible discrete combinations of those parameters. Using carefully chosen test vectors, this can be done much faster than an exhaustive search of all combinations of all parameters, by “parallelizing” the tests of parameter pairs.
Alluvial Diagram Alluvial diagrams are a type of flow diagram originally developed to represent changes in network structure over time. In allusion to both their visual appearance and their emphasis on flow, alluvial diagrams are named after alluvial fans that are naturally formed by the soil deposited from streaming water.
ALOJA This article presents the ALOJA project and its analytics tools, which leverages machine learning to interpret Big Data benchmark performance data and tuning. ALOJA is part of a long-term collaboration between BSC and Microsoft to automate the characterization of cost-effectiveness on Big Data deployments, currently focusing on Hadoop. Hadoop presents a complex run-time environment, where costs and performance depend on a large number of configuration choices. The ALOJA project has created an open, vendor-neutral repository, featuring over 40,000 Hadoop job executions and their performance details. The repository is accompanied by a test-bed and tools to deploy and evaluate the cost-effectiveness of different hardware configurations, parameters and Cloud services. Despite early success within ALOJA, a comprehensive study requires automation of modeling procedures to allow an analysis of large and resource-constrained search spaces. The predictive analytics extension, ALOJA-ML, provides an automated system allowing knowledge discovery by modeling environments from observed executions. The resulting models can forecast execution behaviors, predicting execution times for new configurations and hardware choices. That also enables model-based anomaly detection or efficient benchmark guidance by prioritizing executions. In addition, the community can benefit from ALOJA data-sets and framework to improve the design and deployment of Big Data applications.
alpha-Algorithm The alpha-algorithm is an algorithm used in process mining, aimed at reconstructing causality from a set of sequences of events. It was first put forward by van der Aalst, Weijters and Maruster. Several extensions or modifications of it have since been presented, which will be listed below.
It constructs P/T nets with special properties (workflow nets) from event logs (as might be collected by an ERP system). Each transition in the net corresponds to an observed task.
Alpha-Outliers Crossover,crossdes
Alternating Direction Method of Multipliers
The alternating direction method of multipliers (ADMM) is a variant of the augmented Lagrangian scheme that uses partial updates for the dual variables.
Alternating Directions Dual Decomposition
We present AD3, a new algorithm for approximate maximum a posteriori (MAP) inference on factor graphs, based on the alternating directions method of multipliers. Like other dual decomposition algorithms, AD3 has a modular architecture, where local subproblems are solved independently, and their solutions are gathered to compute a global update. The key characteristic of AD3 is that each local subproblem has a quadratic regularizer, leading to faster convergence, both theoretically and in practice. We provide closed-form solutions for these AD3 subproblems for binary pairwise factors and factors imposing rst-order logic constraints. For arbitrary factors (large or combinatorial), we introduce an active set method which requires only an oracle for computing a local MAP con guration, making AD3 applicable to a wide range of problems. Experiments on synthetic and real-world problems show that AD3 compares favorably with the state-of-the-art.
Alternating Least Squares
In ALS you’re minimizing the entire loss function at once, but, only twiddling half the parameters. That’s because the optimization has an easy algebraic solution – if half your parameters are fixed. So you fix half, recompute the other half, and repeat. There is no gradient in the optimization step since each optimization problem is convex and doesn’t need an approximate approach. But, each problem you’re solving is not the “real” optimization problem – you fixed half the parameters.
Altham-Poisson Distribution The multiplicative binomial model was introduced as a generalization of the binomial distribution for modelling correlated binomial data. This distribution has not been extensively explored and is revisited in the present study. Some properties of the multiplicative binomial distribution, such as, expressions for the factorial moments and the information matrix, are investigated. The distribution is also extended to accommodate data arising from a Wadley’s problem setting which is frequently encountered in dose-mortality studies and is one in which the number of organisms initially treated with a drug is unobserved. The Altham-Poisson distribution is introduced by modelling the unobserved initial number of organisms, as specified by Formula in the multiplicative binomial model, with a Poisson distribution and its suitability for overdispersed data from a Wadley’s problem setting is explored.
Amazon Machine Learning Amazon Machine Learning is a new service that makes it easy for developers of all skill levels to use machine learning technology. Amazon Machine Learning provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology. Once your models are ready, Amazon Machine Learning makes it easy to get predictions for your application using simple APIs, without having to implement custom prediction generation code, or manage any infrastructure. Amazon Machine Learning is based on the same proven, highly scalable, ML technology used for years by Amazon’s internal data scientist community. The service uses powerful algorithms to create ML models by finding patterns in your existing data. Then, Amazon Machine Learning uses these models to process new data and generate predictions for your application. Amazon Machine Learning is highly scalable and can generate billions of predictions daily, and serve those predictions in real-time and at high throughput. With Amazon Machine Learning, there is no upfront hardware or software investment, and you pay as you go, so you can start small and scale as your application grows.
Ambient Diagnostics People can usually sense troubles in a car from noises, vibrations, or smells. An experienced driver can even tell where the problem is. We call this kind of skill ‘Ambient Diagnostics’. Ambient Diagnostics is an emerging field that is aimed at detecting abnormities from seemly disconnected ambient data that we take for granted. For example, the human body is a rich ambient data source: temperature, pulses, gestures, sound, forces, moisture, et al. Also, many electronic devices provide pervasive ambient data streams, such as mobile phones, surveillance cameras, satellite images, personal data assistants, wireless networks and so on.
Analysis of Covariance
Covariance is a measure of how much two variables change together and how strong the relationship is between them. Analysis of covariance (ANCOVA) is a general linear model which blends ANOVA and regression. ANCOVA evaluates whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV), while statistically controlling for the effects of other continuous variables that are not of primary interest, known as covariates (CV). Therefore, when performing ANCOVA, the DV means are adjusted to what they would be if all groups were equal on the CV.
Analysis of Means
The analysis of means (ANOM) is a graphical procedure for comparing a collection of means, rates, or proportions to see if any of them differ significantly from the overall mean, rate, or proportion. The ANOM is a type of multiple comparison procedure. The results of the analysis are summarized in an ANOM decision chart. This chart is similar in appearance to a control chart. It has a centerline, located at the overall mean (rate, or proportion), and upper and lower decision limits. Group means (or rates, or proportions) are plotted on this chart and if one falls beyond a decision limit then that group is said to be statistically different from the overal mean (rate or proportion).
In many situations, you need to examine differences among more than two groups. When you are interested in comparing multiple group means, you can use the analysis of means (ANOM) as an alternative to the one-way analysis of variance F test. The ANOM provides a “confidence interval type of approach” that allows you to determine which, if any, of the c groups has a mean significantly different from the overall average of all the group means combined. Instead of looking at the lower and upper limits of a confidence interval, you will be studying which of the c group means are not contained in an interval formed between a lower decision line and an upper decision line. Any individual group mean not contained in this interval is deemed significantly higher than the overall average of all groups if it lies above the upper decision line. Similarly, any group mean that falls below the lower decision line is declared significantly lower than the overall group average.
Analytic Hierarchy Process
The analytic hierarchy process (AHP) is a structured technique for organizing and analyzing complex decisions, based on mathematics and psychology. It was developed by Thomas L. Saaty in the 1970s and has been extensively studied and refined since then. It has particular application in group decision making, and is used around the world in a wide variety of decision situations, in fields such as government, business, industry, healthcare, shipbuilding and education. Rather than prescribing a ‘correct’ decision, the AHP helps decision makers find one that best suits their goal and their understanding of the problem. It provides a comprehensive and rational framework for structuring a decision problem, for representing and quantifying its elements, for relating those elements to overall goals, and for evaluating alternative solutions. Users of the AHP first decompose their decision problem into a hierarchy of more easily comprehended sub-problems, each of which can be analyzed independently. The elements of the hierarchy can relate to any aspect of the decision problem—tangible or intangible, carefully measured or roughly estimated, well or poorly understood—anything at all that applies to the decision at hand. Once the hierarchy is built, the decision makers systematically evaluate its various elements by comparing them to each other two at a time, with respect to their impact on an element above them in the hierarchy. In making the comparisons, the decision makers can use concrete data about the elements, but they typically use their judgments about the elements’ relative meaning and importance. It is the essence of the AHP that human judgments, and not just the underlying information, can be used in performing the evaluations. The AHP converts these evaluations to numerical values that can be processed and compared over the entire range of the problem. A numerical weight or priority is derived for each element of the hierarchy, allowing diverse and often incommensurable elements to be compared to one another in a rational and consistent way. This capability distinguishes the AHP from other decision making techniques. In the final step of the process, numerical priorities are calculated for each of the decision alternatives. These numbers represent the alternatives’ relative ability to achieve the decision goal, so they allow a straightforward consideration of the various courses of action.
Analytical Data Mart Data Mart is a table or a collection of tables containing only the information which the analysts require to do their job. This data is pulled from multiple sources, processed in a uniform manner, documented and optimized. Data Marts frequently contain data aggregated on the customer level, such as the average number of transactions in last 6 months, number of cash loans drawn by the client during the last 12 months, etc. When the computed aggregate values are available, it is much easier to prepare reports. Data Marts are built only once, at the start of the analytical process, but they are cyclically and automatically updated, so as to contain all the relevant information pertaining to customers/products/transactions in a given period of time.
“Analytics Data Store”
Analytics Analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favors data visualization to communicate insight.
Analytics Data Store
The ADS (which can be a distributed data store on a Cloud somewhere) supports the data requirements of statistical analysis. Here the data is organised, structured, integrated and enriched to meet the ongoing and occasionally volatile needs of the statisticians and data scientists focusing on data mining. Data in the ADS can be accumulative or completely refreshed. It can have a short life span or have a significantly long life-time. The ADS is the logistics centre for analytics data. Not only can it be used to provide data into the statistical analysis process, but it can also be used to provide persistent long term storage for analysis outcomes and scenarios, and for future analysis, hence the ability to ‘write back’. The data and information in the ADS may also be augmented with data derived from data stored in the data warehouse, it may also benefit from having its own dedicated Data Mart specifically designed for this purpose. Results of statistical analysis on the ADS data may also result in feedback being used to tune the data reduction, filtering and enrichment rules further downstream, either in smart data analytics, complex event and discrimination adapters or in ET(AL) job streams.
AnchorNet Despite significant progress of deep learning in recent years, state-of-the-art semantic matching methods still rely on legacy features such as SIFT or HoG. We argue that the strong invariance properties that are key to the success of recent deep architectures on the classification task make them unfit for dense correspondence tasks, unless a large amount of supervision is used. In this work, we propose a deep network, termed AnchorNet, that produces image representations that are well-suited for semantic matching. It relies on a set of filters whose response is geometrically consistent across different object instances, even in the presence of strong intra-class, scale, or viewpoint variations. Trained only with weak image-level labels, the final representation successfully captures information about the object structure and improves results of state-of-the-art semantic matching methods such as the deformable spatial pyramid or the proposal flow methods. We show positive results on the cross-instance matching task where different instances of the same object category are matched as well as on a new cross-category semantic matching task aligning pairs of instances each from a different object class.
Ancillary Statistic In statistics, an ancillary statistic is a statistic whose sampling distribution does not depend on the parameters of the model. An ancillary statistic is a pivotal quantity that is also a statistic. Ancillary statistics can be used to construct prediction intervals. This concept was introduced by the statistical geneticist Sir Ronald Fisher.
https://… Lesson of the Day – Ancillary Statistics
Angle-Based Outlier Detection
Detecting outliers in a large set of data objects is a major data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in the full-dimensional Euclidean data space. In high-dimensional data, these approaches are bound to deteriorate due to the notorious ‘curse of dimensionality’. In this paper, we propose a novel approach named ABOD (Angle-Based Outlier Detection) and some variants assessing the variance in the angles between the difference vectors of a point to the other points. This way, the effects of the ‘curse of dimensionality’ are alleviated compared to purely distance-based approaches. A main advantage of our new approach is that our method does not rely on any parameter selection influencing the quality of the achieved ranking. In a thorough experimental evaluation, we compare ABOD to the well-established distance-based method LOF for various artificial and a real world data set and show ABOD to perform especially well on high-dimensional data.
Annabell A cognitive neural architecture able to learn and communicate through natural language
Annealed Generative Adversarial Networks We introduce a novel framework for adversarial training where the target distribution is annealed between the uniform distribution and the data distribution. We posited a conjecture that learning under continuous annealing in the nonparametric regime is stable irrespective of the divergence measures in the objective function and proposed an algorithm, dubbed {\ss}-GAN, in corollary. In this framework, the fact that the initial support of the generative network is the whole ambient space combined with annealing are key to balancing the minimax game. In our experiments on synthetic data, MNIST, and CelebA, {\ss}-GAN with a fixed annealing schedule was stable and did not suffer from mode collapse.
Annotation Chart Annotation charts are interactive time series line charts that support annotations.
Annotation Query Language
Annotation Query Language (AQL) is the language for developing text analytics extractors in the InfoSphere BigInsights Text Analytics system. An extractor is a program written in AQL that extracts structured information from unstructured or semistructured text. AQL is a declarative language. The syntax of AQL is similar to that of Structured Query Language (SQL), but with several important differences.
Annoy Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
Anomaly Detection In data mining, anomaly detection (or outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset. Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or finding errors in text. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions.
Anscombe’s Quartet Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.


Answer Set Programming – Reinforcement Learning
Non-stationary domains, where unforeseen changes happen, present a challenge for agents to find an optimal policy for a sequential decision making problem. This work investigates a solution to this problem that combines Markov Decision Processes (MDP) and Reinforcement Learning (RL) with Answer Set Programming (ASP) in a method we call ASP(RL). In this method, Answer Set Programming is used to find the possible trajectories of an MDP, from where Reinforcement Learning is applied to learn the optimal policy of the problem. Results show that ASP(RL) is capable of efficiently finding the optimal solution of an MDP representing non-stationary domains.
Ant Colony Optimization
In computer science and operations research, the ant colony optimization algorithm (ACO) is a probabilistic technique for solving computational problems which can be reduced to finding good paths through graphs. This algorithm is a member of the ant colony algorithms family, in swarm intelligence methods, and it constitutes some metaheuristic optimizations. Initially proposed by Marco Dorigo in 1992 in his PhD thesis, the first algorithm was aiming to search for an optimal path in a graph, based on the behavior of ants seeking a path between their colony and a source of food. The original idea has since diversified to solve a wider class of numerical problems, and as a result, several problems have emerged, drawing on various aspects of the behavior of ants.
Anticipatory Analytics In fact, the application of predictive analytics to intelligence often focuses on what is missing in the data or how certain observations diverge from the expected norm. Some experts insist on calling this process not predictive analytics but anticipatory analytics, in order to distinguish it from similar processes in other realms.
Apache Accumulo The Apache Accumulo sorted, distributed key/value store is based on Google’s BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift. It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.
Apache Ambari Apache Ambari makes Hadoop cluster provisioning, managing, and monitoring dead simple.
Apache Beam Apache Beam provides an advanced unified programming model, allowing you to implement batch and streaming data processing jobs that can run on any execution engine. Apache Beam is:
• UNIFIED – Use a single programming model for both batch and streaming use cases.
• PORTABLE – Execute pipelines on multiple execution environments, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
• EXTENSIBLE – Write and share new SDKs, IO connectors, and transformation libraries.
Apache Bigtop Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem. The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects. In short we strive to be for Hadoop what Debian is to Linux.
Apache Cassandra / Cassandra Query Language
Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. Cassandra also places a high value on performance. In 2012, University of Toronto researchers studying NoSQL systems concluded that “In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments.”
Apache Chukwa Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
Apache Clerezza Clerezza allows to easily develop semantic web applications by providing tools to manipulate RDF data, create RESTful Web Services and Renderlets using ScalaServerPages. Contents are stored as triples based on W3C RDF specification. These triples are stored via Clerezza’s Smart Content Binding (SCB). SCB defines a technology-agnostic layer to access and modify triple stores. It provides a java implementation of the graph data model specified by W3C RDF and functionalities to operate on that data model. SCB offers a service interface to access multiple named graphs and it can use various providers to manage RDF graphs in a technology specific manner, e.g., using Jena or Sesame. It also provides for adaptors that allow an application to use various APIs (including the Jena api) to process RDF graphs. Furthermore, SCB offers a serialization and a parsing service to convert a graph into a certain representation (format) and vice versa.
Apache CloudStack Apache CloudStack is open source software designed to deploy and manage large networks of virtual machines, as a highly available, highly scalable Infrastructure as a Service (IaaS) cloud computing platform. CloudStack is used by a number of service providers to offer public cloud services, and by many companies to provide an on-premises (private) cloud offering, or as part of a hybrid cloud solution. CloudStack is a turnkey solution that includes the entire “stack” of features most organizations want with an IaaS cloud: compute orchestration, Network-as-a-Service, user and account management, a full and open native API, resource accounting, and a first-class User Interface (UI). CloudStack currently supports the most popular hypervisors: VMware, KVM, XenServer and Xen Cloud Platform (XCP). Users can manage their cloud with an easy to use Web interface, command line tools, and / or a full-featured RESTful API. In addition, CloudStack provides an API that’s compatible with AWS EC2 and S3 for organizations that wish to deploy hybrid clouds.
Apache Commons Mathematics Library Commons Math is a library of lightweight, self-contained mathematics and statistics components addressing the most common problems not available in the Java programming language or Commons Lang.
Apache Crunch The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
Apache Curator New users of ZooKeeper are surprised to learn that a significant amount of connection management must be done manually. For example, when the ZooKeeper client connects to the ensemble it must negotiate a new session, etc. This takes some time. If you use a ZooKeeper client API before the connection process has completed, ZooKeeper will throw an exception. These types of exceptions are referred to as “recoverable” errors. Curator automatically handles connection management, greatly simplifying client code. Instead of directly using the ZooKeeper APIs you use Curator APIs that internally check for connection completion and wrap each ZooKeeper API in a retry loop. Curator uses a retry mechanism to handle recoverable errors and automatically retry operations. The method of retry is customizable. Curator comes bundled with several implementations (ExponentialBackoffRetry, etc.) or custom implementations can be written.
Apache DataFu Apache DataFu consists of two libraries: Apache DataFu Pig is a collection of useful user-defined functions for data analysis in Apache Pig. Apache DataFu Hourglass is a library for incrementally processing data using Apache Hadoop MapReduce. This library was inspired by the prevelance of sliding window computations over daily tracking data. Computations such as these typically happen at regular intervals (e.g. daily, weekly), and therefore the sliding nature of the computations means that much of the work is unnecessarily repeated. DataFu’s Hourglass was created to make these computations more efficient, yielding sometimes 50-95% reductions in computational resources.
Apache Drill Apache Drill is an open source, low latency SQL query engine for Hadoop and NoSQL. It is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google’s Dremel.
Apache Falcon Apache Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.
Apache Flink Apache Flink is an open source platform for scalable batch and stream data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink includes several APIs for creating applications that use the Flink engine:
1. DataSet API for static data embedded in Java, Scala, and Python,
2. DataStream API for unbounded streams embedded in Java and Scala, and
3. Table API with a SQL-like expression language embedded in Java and Scala.
Flink also bundles libraries for domain-specific use cases:
1. Machine Learning library, and
2. Gelly, a graph processing API and library.
You can integrate Flink easily with other well-known open source systems both for data input and output as well as deployment.
Apache Flume Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store
Apache Forrest Apache Forrest software is a publishing framework that transforms input from various sources into a unified presentation in one or more output formats. The modular and extensible plug-in architecture of Apache Forrest is based on Apache Cocoon and the relevant industry standards that separate presentation from content. Forrest can generate static documents, or be used as a dynamic server, or be deployed by its automated facility.
Apache Giraph Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections.
Apache Hadoop Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.
The Apache Hadoop framework is composed of the following modules:
• Hadoop Common – contains libraries and utilities needed by other Hadoop modules
• Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
• Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
• Hadoop MapReduce – a programming model for large scale data processing.
Hadoop is being regarded as one of the best platforms for storing and managing big data. It owes its success to its high data storage and processing scalability, low price/performance ratio, high performance, high availability, high schema flexibility, and its capability to handle all types of data.
Apache Hama The Apache Hama is an efficient and scalable general-purpose BSP computing engine which can be used to speed up a large variety of compute-intensive analytics applications.
Apache HBase Use Apache HBase software when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Apache Hive The Apache Hive (TM) data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop (TM), it provides * tools to enable easy data extract/transform/load (ETL) * a mechanism to impose structure on a variety of data formats * access to files stored either directly in Apache HDFS (TM) or in other data storage systems such as Apache HBase (TM) * query execution via MapReduce Hive defines a simple SQL-like query language, called HiveQL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. HiveQL can also be extended with custom scalar functions (UDF’s), aggregations (UDAF’s), and table functions (UDTF’s).
Apache Ignite Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies. Apache Ignite In-Memory Data Fabric is designed to deliver uncompromised performance for a wide set of in-memory computing use cases from high performance computing, to the industry most advanced data grid, highly available service grid, and streaming.
Apache Jena Apache Jena provides a complete framework for building Semantic Web and Linked Data applications in Java, and provides: parsers for RDF/XML, Turtle and N-triples; a Java programming API; a complete implementation of the SPARQL query language; a rule-based inference engine for RDFS and OWL entailments; TDB (a non-SQL persistent triple store); SDB (a persistent triples store built on a relational store) and Fuseki, an RDF server using web protocols. Jena complies with all relevant recommendations for RDF and related technologies from the W3C.
Apache Kafka Apache Kafka is an open-source message broker project developed by the Apache Software Foundation written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transactions logs.
Apache Lucene The goal of Apache Lucene and Solr is to provide world class search capabilities. The Apache Lucene project develops open-source search software, including:
• Lucene Core, our flagship sub-project, provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.
• Solr is a high performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.
• Open Relevance Project is a subproject with the aim of collecting and distributing free materials for relevance testing and performance.
• PyLucene is a Python port of the Core project.
Apache Lucy The Apache Lucy search engine library provides full-text search for dynamic programming languages.
Apache Mahout Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification. Many of the implementations use the Apache Hadoop platform. Mahout also provides Java libraries for common math operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; the number of implemented algorithms has grown quickly, but various algorithms are still missing.
Apache Object Oriented Data Technology
Metadata for middleware (and vice versa):
• Transparent access to distributed resources
• Data discovery and query optimization
• Distributed processing and virtual archives
But it’s not just for science! It’s also a software architecture:
• Models for information representation
• Solutions to knowledge capture problems
• Unification of technology, data, and metadata
Apache Oozie Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
Apache OpenNLP Apache OpenNLP software supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning..
Apache Phoenix Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store. Phoenix provides a JDBC driver that hides the intricacies of the noSQL store enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; insert and delete rows singly and in bulk; and query data through SQL. Phoenix compiles queries and other statements into native noSQL store APIs rather than using MapReduce enabling the building of low latency applications on top of noSQL stores.
Apache Pig Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMS systems. Pig Latin can be extended using UDF (User Defined Functions) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad-hoc way of creating and executing map-reduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.
Apache REEF REEF, (Retainable Evaluator Execution Framework), is our approach to simplify and unify the lower layers of big data systems on modern resource managers. For managers like Apache YARN, Apache Mesos, Google Omega, and Facebook Corona, REEF provides a centralized control plane abstraction that can be used to build a decentralized data plane for supporting big data systems. Special consideration is given to graph computation and machine learning applications, both of which require data retention on allocated resources to execute multiple passes over the data. More broadly, applications that run on YARN will have the need for a variety of data-processing tasks e.g., data shuffle, group communication, aggregation, checkpointing, and many more. Rather than reimplement these for each application, REEF aims to provide them in a library form, so that they can be reused by higher-level applications and tuned for a specific domain problem e.g., Machine Learning. In that sense, our long-term vision is that REEF will mature into a Big Data Application Server, that will host a variety of tool kits and applications, on modern resource managers.
Apache S4
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. S4 fills the gap between complex proprietary systems and batch-oriented open source computing platforms. We aim to develop a high performance computing platform that hides the complexity inherent in parallel processing system from the application programmer. The core platform is written in Java. The implementation is modular and pluggable, and S4 applications can be easily and dynamically combined for creating more sophisticated stream processing systems.
Apache SAMOA
Apache SAMOA is a distributed streaming machine learning (ML) framework that contains a programing abstraction for distributed streaming ML algorithms. Apache SAMOA enables development of new ML algorithms without dealing with the complexity of underlying streaming processing engines (SPE, such as Apache Storm and Apache S4). Apache SAMOA users can develop distributed streaming ML algorithms once and execute the algorithms in multiple SPEs, i.e., code the algorithms once and execute them in multiple SPEs.
Apache Samza Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
Apache SINGA SINGA is a general distributed deep learning platform for training big deep learning models over large datasets. It is designed with an intuitive programming model based on the layer abstraction. A variety of popular deep learning models are supported, namely feed-forward models including convolutional neural networks (CNN), energy models like restricted Boltzmann machine (RBM), and recurrent neural networks (RNN). Many built-in layers are provided for users. SINGA architecture is sufficiently flexible to run synchronous, asynchronous and hybrid training frameworks. SINGA also supports different neural net partitioning schemes to parallelize the training of large models, namely partitioning on batch dimension, feature dimension or hybrid partitioning.
Apache Spark
Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce, for certain applications. Spark provides primitives for in-memory cluster computing that allows user programs to load data into a cluster’s memory and query it repeatedly, making it well suited to machine learning algorithms.
Apache Spatial Information System Apache SIS provides data structures for geographic data and associated metadata along with methods to manipulate those data structures. The library is an implementation of GeoAPI interfaces and can be used for desktop or server applications.
Apache Sqoop Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Apache Stanbol Apache Stanbol provides a set of reusable components for semantic content management. Apache Stanbol’s intended use is to extend traditional content management systems with semantic services. Other feasible use cases include: direct usage from web applications (e.g. for tag extraction/suggestion; or text completion in search fields), ‘smart’ content workflows or email routing based on extracted entities, topics, etc Apache Stanbol – the Semantic Engine. In order to be used as a semantic engine via its services, all components offer their functionalities in terms of a RESTful web service API. Apache Stanbol’s main features are:
• Content Enhancement: Services that add semantic information to ‘non-semantic’ pieces of content.
• Reasoning: Services that are able to retrieve additional semantic information about the content based on the semantic information retrieved via content enhancement.
• Knowledge Models: Services that are used to define and manipulate the data models (e.g. ontologies) that are used to store the semantic information.
• Persistence: Services that store (or cache) semantic information, i.e. enhanced content, entities, facts, and make it searchable.
Apache Storm
Storm is a distributed computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter. It uses custom created “spouts” and “bolts” to define information sources and manipulations to allow batch, distributed processing of streaming data. The initial release was on September 17, 2011.
A Storm application is designed as a topology of interfaces which create a “stream” of transformations. It provides similar functionality as a MapReduce job with the exception that the topology will theoretically run indefinitely until it is manually terminated.
In 2013 the Apache Software Foundation has accepted Storm into its incubator program.
Apache Tajo A big data warehouse system on Hadoop. Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities.
Apache Texen Texen is a general purpose text generating utility. It is capable of producing almost any sort of text output. Driven by Ant, essentially an Ant Task, Texen uses a control template, an optional set of worker templates, and control context to govern the generated output. Although TexenTask can be used directly, it is usually subclassed to initialize your control context before generating any output.
Apache Tez Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.
Apache Tika The Apache Tika toolkit detects and extracts metadata and text content from various documents – from PPT to CSV to PDF – using existing parser libraries. Tika unifies these parsers under a single interface to allow you to easily parse over a thousand different file types. Tika is useful for search engine indexing, content analysis, translation, and much more.
Apache Twill Apache Twill is an abstraction over Apache Hadoop YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic. Apache Twill allows you to use YARN’s distributed capabilities with a programming model that is similar to running threads.
Apache UIMA Unstructured Information Management Applications (UIMA) are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at. UIMA is made of many things UIMA enables applications to be decomposed into components, for example ‘language identification’ => ‘language specific segmentation’ => ‘sentence boundary detection’ => ‘entity detection (person/place names etc.)’. Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes. Apache UIMA is an Apache-licensed open source implementation of the UIMA specification (that specification is, in turn, being developed concurrently by a technical committee within OASIS , a standards organization).
Apache Yarn
We present the next generation of Hadoop compute platform known as YARN, which departs from its familiar, monolithic architecture. By separating resource management functions from the programming model, YARN delegates many scheduling-related functions to per-job components. In this new context, MapReduce is just one of the applications running on top of YARN. This separation provides a great deal of flexibility in the choice of programming framework. Examples of alternative programming models that are becoming available on YARN are: Dryad, Giraph, Hoya, REEF, Spark, Storm and Tez.
Apache Zeppelin A web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.
Apache ZooKeeper Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.
Application Function Library
SAP HANA Library References: The SAP HANA Business Function Library (BFL) Reference describes the Business Function Library (BFL) delivered with SAP HANA. This application function library (AFL) contains pre-built financial functions implemented in C++. The SAP HANA Predictive Analysis Library (PAL) Reference describes the Predictive Analysis Library (PAL) delivered with SAP HANA. This application function library (AFL) defines functions that can be called from within SAP HANA SQLScript procedures to perform analytic algorithms.
Application Function Modeler
The Application Function Modeler 2.0 (AFM 2) is a graphical editor for complex data analysis pipelines in the HANA Studio. This tool is based on the HANA Data Scientist prototype developed at the HANA Platform Innovation Center in Potsdam, Germany. It is planned to be the next generation of the existing HANA Studio Application Function Modeler which was developed at the TIP CE&SP Algorithm Labs in Shanghai, China. The AFM 2 team consists of original and new developers from both locations.
Applied Mathematics Applied mathematics is a branch of mathematics that deals with mathematical methods that find use in science, engineering, business, computer science, and industry. Thus, ‘applied mathematics’ is a mathematical science with specialized knowledge. The term ‘applied mathematics’ also describes the professional specialty in which mathematicians work on practical problems by formulating and studying mathematical models. In the past, practical applications have motivated the development of mathematical theories, which then became the subject of study in pure mathematics where abstract concepts are studied for their own sake. The activity of applied mathematics is thus intimately connected with research in pure mathematics.
Approximate Bayesian Computation
Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics. In all model-based statistical inference, the likelihood function is of central importance, since it expresses the probability of the observed data under a particular statistical model, and thus quantifies the support data lend to particular values of parameters and to choices among different models. For simple models, an analytical formula for the likelihood function can typically be derived. However, for more complex models, an analytical formula might be elusive or the likelihood function might be computationally very costly to evaluate. ABC methods bypass the evaluation of the likelihood function. In this way, ABC methods widen the realm of models for which statistical inference can be considered. ABC methods are mathematically well-founded, but they inevitably make assumptions and approximations whose impact needs to be carefully assessed. Furthermore, the wider application domain of ABC exacerbates the challenges of parameter estimation and model selection. ABC has rapidly gained popularity over the last years and in particular for the analysis of complex problems arising in biological sciences, e.g. in population genetics, ecology, epidemiology, and systems biology.
Approximate Message Passing
The standard linear regression (SLR) problem is to recover a vector $\mathbf{x}^0$ from noisy linear observations $\mathbf{y}=\mathbf{Ax}^0+\mathbf{w}$. The approximate message passing (AMP) algorithm recently proposed by Donoho, Maleki, and Montanari is a computationally efficient iterative approach to SLR that has a remarkable property: for large i.i.d.\ sub-Gaussian matrices $\mathbf{A}$, its per-iteration behavior is rigorously characterized by a scalar state-evolution whose fixed points, when unique, are Bayes optimal. AMP, however, is fragile in that even small deviations from the i.i.d.\ sub-Gaussian model can cause the algorithm to diverge. This paper considers a ‘vector AMP’ (VAMP) algorithm and shows that VAMP has a rigorous scalar state-evolution that holds under a much broader class of large random matrices $\mathbf{A}$: those that are right-rotationally invariant. After performing an initial singular value decomposition (SVD) of $\mathbf{A}$, the per-iteration complexity of VAMP can be made similar to that of AMP. In addition, the fixed points of VAMP’s state evolution are consistent with the replica prediction of the minimum mean-squared error recently derived by Tulino, Caire, Verd\’u, and Shamai. The effectiveness and state evolution predictions of VAMP are confirmed in numerical experiments.
Approximate Nearest Neighbors
In some applications it may be acceptable to retrieve a “good guess” of the nearest neighbor. In those cases, we can use an algorithm which doesn’t guarantee to return the actual nearest neighbor in every case, in return for improved speed or memory savings. Often such an algorithm will find the nearest neighbor in a majority of cases, but this depends strongly on the dataset being queried. Algorithms that support the approximate nearest neighbor search include locality-sensitive hashing, best bin first and balanced box-decomposition tree based search.
Approximate Semantic Matching Event-based systems have loose coupling within space, time and synchronization, providing a scalable infrastructure for information exchange and distributed workflows. However, event-based systems are tightly coupled, via event subscriptions and patterns, to the semantics of the underlying event schema and values. The high degree of semantic heterogeneity of events in large and open deployments such as smart cities and the sensor web makes it difficult to develop and maintain event-based systems. In order to address semantic coupling within event-based systems, we propose vocabulary free subscriptions together with the use of approximate semantic matching of events. This paper examines the requirement of event semantic decoupling and discusses approximate semantic event matching and the consequences it implies for event processing systems. We introduce a semantic event matcher and evaluate the suitability of an approximate hybrid matcher based on both thesauri-based and distributional semantics-based similarity and relatedness measures. The matcher is evaluated over a structured representation of Wikipedia and Freebase events. Initial evaluations show that the approach matches events with a maximal combined precision-recall F1 score of 75.89% on average in all experiments with a subscription set of 7 subscriptions. The evaluation shows how a hybrid approach to semantic event matching outperforms a single similarity measure approach.
Towards an Inexact Semantic Complex Event Processing Framework
Apriori Algorithm Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
Arc Diagram In graph drawing, an arc diagram is a style of graph drawing, in which the vertices of a graph are placed along a line in the Euclidean plane, with edges being drawn as semicircles in one of the two halfplanes bounded by the line, or as smooth curves formed by sequences of semicircles. In some cases, line segments of the line itself are also allowed as edges, as long as they connect only vertices that are consecutive along the line.
ArcGIS Esri’s ArcGIS is a geographic information system (GIS) for working with maps and geographic information. It is used for: creating and using maps; compiling geographic data; analyzing mapped information; sharing and discovering geographic information; using maps and geographic information in a range of applications; and managing geographic information in a database. The system provides an infrastructure for making maps and geographic information available throughout an organization, across a community, and openly on the Web.
Archetypal Analysis Archetypal analysis (Cutler and Breiman, 1994) has the aim to find a few, not necessarily observed, extremal observations (the archetypes) in a multivariate data set such that:
1. all the observations are approximated by convex combinations of the archetypes, and
2. all the archetypes are convex combinations of the observations.
Arcsine Law In probability theory, the arcsine laws are a collection of results for one-dimensional random walks and Brownian motion (the Wiener process). The best known of these is attributed to Paul Lévy (1939). All three laws relate path properties of the Wiener process to the arcsine distribution. The Lévy Arcsine Law states that the proportion of time that the one-dimensional Wiener process is positive follows an arcsine distribution.
Area under Curve
Sometimes, the ROC is used to generate a summary statistic. A common version is the area under the ROC curve, or “AUC” (“Area Under Curve”), or A’ (pronounced “a-prime”), or “c-statistic”.
Argument Component Detection
Argument component detection (ACD) is an important sub-task in argumentation mining. ACD aims at detecting and classifying different argument components in natural language texts.
Argumentation Mining The goal of argumentation mining, an evolving research field in computational linguistics, is to design methods capable of analyzing people’s argumentation.
Artificial Continuous Prediction Market
We propose the Artificial Continuous Prediction Market (ACPM) as a means to predict a continuous real value, by integrating a range of data sources and aggregating the results of different machine learning (ML) algorithms. ACPM adapts the concept of the (physical) prediction market to address the prediction of real values instead of discrete events. Each ACPM participant has a data source, a ML algorithm and a local decision-making procedure that determines what to bid on what value. The contributions of ACPM are: (i) adaptation to changes in data quality by the use of learning in: (a) the market, which weights each market participant to adjust the influence of each on the market prediction and (b) the participants, which use a Q-learning based trading strategy to incorporate the market prediction into their subsequent predictions, (ii) resilience to a changing population of low- and high-performing participants. We demonstrate the effectiveness of ACPM by application to an influenza-like illnesses data set, showing ACPM out-performs a range of well-known regression models and is resilient to variation in data source quality.
Artificial General Intelligence
Artificial general intelligence (AGI) is the intelligence of a (hypothetical) machine that could successfully perform any intellectual task that a human being can. It is a primary goal of artificial intelligence research and an important topic for science fiction writers and futurists. Artificial general intelligence is also referred to as “strong AI”, “full AI” or as the ability to perform “general intelligent action”. AGI is associated with traits such as consciousness, sentience, sapience, and self-awareness observed in living beings.
Artificial Intelligence
Artificial intelligence (AI) is the human-like intelligence exhibited by machines or software. It is also an academic field of study. Major AI researchers and textbooks define the field as “the study and design of intelligent agents”, where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success. John McCarthy, who coined the term in 1955, defines it as “the science and engineering of making intelligent machines”.
Aspect-Oriented Software Development
In computing, Aspect-oriented software development (AOSD) is an emerging software development technology that seeks new modularizations of software systems in order to isolate secondary or supporting functions from the main program’s business logic. AOSD allows multiple concerns to be expressed separately and automatically unified into working systems. Aspect-Oriented Software Development focuses on the identification, specification and representation of cross-cutting concerns and their modularization into separate functional units as well as their automated composition into a working system.
ASReml ASReml is a statistical software package for fitting linear mixed models using restricted maximum likelihood, a technique commonly used in plant and animal breeding and quantitative genetics as well as other fields. It is notable for its ability to fit very large and complex data sets efficiently, due to its use of the average information algorithm and sparse matrix methods. It was originally developed by Arthur Gilmour. ASREML can be used in Windows, Linux, and as an add-on to S-PLUS and R.
Association Analysis
Association for Uncertainty in Artificial Intelligence
The Association for Uncertainty in Artificial Intelligence is a non-profit organization focused on organizing the annual Conference on Uncertainty in Artificial Intelligence (UAI) and, more generally, on promoting research in pursuit of advances in knowledge representation, learning and reasoning under uncertainty. Principles and applications developed within the UAI community have been at the forefront of research in Artificial Intelligence. The UAI community and annual meeting have been primary sources of advances in graphical models for representing and reasoning with uncertainty.
Association Mapping Association mapping (genetics), also known as “linkage disequilibrium mapping”, is a method of mapping quantitative trait loci (QTLs) that takes advantage of historic linkage disequilibrium to link phenotypes (observable characteristics) to genotypes (the genetic constitution of organisms).
Association Rule Classification
Association Rule Learning Association rule learning is a popular and well researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product placements. In addition to the above example from market basket analysis association rules are employed today in many application areas including Web usage mining, intrusion detection, Continuous production, and bioinformatics. As opposed to sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.
Association Rule Mining
Asymmetric Convolution Bidirectional Long Short-Term Memory Networks
Recently deeplearning models have been shown to be capable of making remarkable performance in sentences and documents classification tasks. In this work, we propose a novel framework called AC-BLSTM for modeling setences and documents, which combines the asymmetric convolution neural network (ACNN) with the Bidirectional Long Short-Term Memory network (BLSTM). Experiment results demonstrate that our model achieves state-of-the-art results on all six tasks, including sentiment analysis, question type classification, and subjectivity classification.
Asynchronous Advantage Actor-Critic
Asynchronous Distributed Gibbs
Gibbs sampling is a widely used Markov Chain Monte Carlo (MCMC) method for numerically approximating integrals of interest in Bayesian statistics and other mathematical sciences. It is widely believed that MCMC methods do not extend easily to parallel implementations, as their inherently sequential nature incurs a large synchronization cost. This means that new solutions are needed to bring Bayesian analysis fully into the era of large-scale computation. In this paper, we present a novel scheme – Asynchronous Distributed Gibbs (ADG) sampling – that allows us to perform MCMC in a parallel fashion with no synchronization or locking, avoiding the typical performance bottlenecks of parallel algorithms. Our method is especially attractive in settings, such as hierarchical random-effects modeling in which each observation has its own random effect, where the problem dimension grows with the sample size. We prove convergence under some basic regularity conditions, and discuss the proof for similar parallelization schemes for other iterative algorithms. We provide three examples that illustrate some of the algorithm’s properties with respect to scaling. Because our hardware resources are bounded, we have not yet found a limit to the algorithm’s scaling, and thus its true capabilities remain unknown.
Asynchronous Parallel SAGA
We describe Asaga, an asynchronous parallel version of the incremental gradient algorithm Saga that enjoys fast linear convergence rates. We highlight a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently proposed ‘perturbed iterate’ framework that resolves it. We thereby prove that Asaga can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedup as well as the hardware overhead.
Asynchronous Parallel Stochastic Coordinate Descent
An asynchronous parallel stochastic coordinate descent algorithm for minimizing smooth unconstrained or separably constrained functions. The method achieves a linear convergence rate on functions that satisfy an essential strong convexity property and a sublinear rate (1=K) on general convex functions. Nearlinear speedup on a multicore system can be expected if the number of processors is O(n1=2) in unconstrained optimization and O(n1=4) in the separable-constrained case, where n is the number of variables.
Atomic Information Resource Data Model
Entities are simply formed by grouping Attributes together. One or more Attributes are shared between two or more Entities. According to the AtomicDB terminology, shared attributes are called bridge concepts and this is the equivalent of the relationship that is implemented with primary and foreign keys on two tables but here we are completely independent to mix and match them. For example PartColor, could be merged with another attribute from a different table in another relational database.
Atomic Triangular Matrix An atomic (upper or lower) triangular matrix is a special form of unitriangular matrix, where all of the off-diagonal entries are zero, except for the entries in a single column. Such a matrix is also called a Gauss matrix or a Gauss transformation matrix.
Atrain Distributed System
A special type of distributed system called by ‘Atrain Distributed System’ (ADS) which is very suitable for processing big data using the heterogeneous data structures r-atrain or the homogeneous data structure r-train. A simple ‘Atrain Distributed System’ is called an uni-tier ADS. The ‘Multi-tier Atrain Distributed System’ is an extension of the uni-tier ADS. The ADS is scalable upto any extent as many times as required. Two new type of network topologies are defined for ADS called by ‘multi-horse cart’ topology and ‘cycle’ topology which can support increasing volume of big data. Where r-atrain and r-train data structures are introduced for the processing of big data, the data structures ‘heterogeneous data structure MA’ and ‘homogeneous data structure MT’ are introduced for the processing of big data including temporal big data too. Both MA and MT can be well implemented in multi-tier ADS. We define cyclic train and cyclic atrain, and then doubly linked train/atrain. A method is proposed on how to implement Solid Matrices, n-dimensional arrays, n-dimensional larrays etc. in a computer memory using the data structures MT and MA.
Attention Economy Attention economics is an approach to the management of information that treats human attention as a scarce commodity, and applies economic theory to solve various information management problems.
Attentional Multi-agent Predictive Modeling
Multi-agent predictive modeling is an essential step for understanding physical, social and team-play systems. Recently, Interaction Networks (INs) were proposed for the task of modeling multi-agent physical systems, INs scale with the number of interactions in the system (typically quadratic or higher order in the number of agents). In this paper we introduce VAIN, a novel attentional architecture for multi-agent predictive modeling that scales linearly with the number of agents. We show that VAIN is effective for multi-agent predictive modeling. Our method is evaluated on tasks from challenging multi-agent prediction domains: chess and soccer, and outperforms competing multi-agent approaches.
Attribution Modeling Attribution modelling, in essence, means reporting on the impact of communication activity using metrics like:
• Turnover
• Profit
• Customer retention
• Volume of sales
Instead of metrics like:
• Share of voice
• Web visits
• Click through rate (CTR)
• Impressions
There’s a big difference between these two lists. The second list contains important metrics, but businesses could survive without ever increasing them. The business metrics in the first list, however, are essential for all companies that want to survive and thrive.
Understanding the impact of communications on business metrics is – rightly – more important to senior executives. This is the primary objective of attribution modelling; to provide holistic, accurate information about the financial return activities are delivering so you can refine them, adjust what you’re doing, and use the same budget to deliver more value to your business and your customers.
Augmented Dickey-Fuller Test
In statistics and econometrics, an augmented Dickey-Fuller test (ADF) is a test for a unit root in a time series sample. It is an augmented version of the Dickey-Fuller test for a larger and more complicated set of time series models. The augmented Dickey-Fuller (ADF) statistic, used in the test, is a negative number. The more negative it is, the stronger the rejection of the hypothesis that there is a unit root at some level of confidence.
Augmented Intelligence
Augmented Interval Markov Chains
In this paper we propose augmented interval Markov chains (AIMCs): a generalisation of the familiar interval Markov chains (IMCs) where uncertain transition probabilities are in addition allowed to depend on one another. This new model preserves the flexibility afforded by IMCs for describing stochastic systems where the parameters are unclear, for example due to measurement error, but also allows us to specify transitions with probabilities known to be identical, thereby lending further expressivity. The focus of this paper is reachability in AIMCs. We study the qualitative, exact quantitative and approximate reachability problem, as well as natural subproblems thereof, and establish several upper and lower bounds for their complexity. We prove the exact reachability problem is at least as hard as the famous square-root sum problem, but, encouragingly, the approximate version lies in $\mathbf{NP}$ if the underlying graph is known, whilst the restriction of the exact problem to a constant number of uncertain edges is in $\mathbf{P}$. Finally, we show that uncertainty in the graph structure affects complexity by proving $\mathbf{NP}$-completeness for the qualitative subproblem, in contrast with an easily-obtained upper bound of $\mathbf{P}$ for the same subproblem with known graph structure.
Augmented k-means Identifying a set of homogeneous clusters in a heterogeneous dataset is one of the most important classes of problems in statistical modeling. In the realm of unsupervised partitional clustering, k-means is a very important algorithm for this. In this technical report, we develop a new k-means variant called Augmented k-means, which is a hybrid of k-means and logistic regression. During each iteration, logistic regression is used to predict the current cluster labels, and the cluster belonging probabilities are used to control the subsequent re-estimation of cluster means. Observations which can’t be firmly identified into clusters are excluded from the re-estimation step. This can be valuable when the data exhibit many characteristics of real datasets such as heterogeneity, non-sphericity, substantial overlap, and high scatter. Augmented k-means frequently outperforms k-means by more accurately classifying observations into known clusters and / or converging in fewer iterations. We demonstrate this on both simulated and real datasets. Our algorithm is implemented in Python and will be available with this report.
Augmented Lagrangian Method
Augmented Lagrangian methods are a certain class of algorithms for solving constrained optimization problems. They have similarities to penalty methods in that they replace a constrained optimization problem by a series of unconstrained problems; the difference is that the augmented Lagrangian method adds an additional term to the unconstrained objective. This additional term is designed to mimic a Lagrange multiplier. The augmented Lagrangian is not the same as the method of Lagrange multipliers. Viewed differently, the unconstrained objective is the Lagrangian of the constrained problem, with an additional penalty term (the augmentation).
Author Rank Author Rank is a new aspect of Google’s search algorithm that will score online content creators. Similar to SEO rankings for sites and pages, authors will now have an associated ranking based on a few contributing factors, including, but not limited to:
• Social sharing of your Google+ posts
• Quality of backlinks to your content
• Interactions with your content (comments and shares)
• Timely and topical content
• Reputation and authority on other social networks
• PageRank
In short, Google will be assessing your reputation, authority, and the general reception of your content to determine just how valuable you are as an author. This ranking methodology provides writers with a greater incentive to not only build out their Google Plus profiles (smart move, Google), but also to ensure their online presence is streamlined and connected across all social networks and blogs to which they contribute.
A user who is well-connected, well-informed, produces great content, and is seen by the larger community as valuable, will without question reap the rewards of a high author ranking.
A Guide on Google’s Author Rank
Auto Regressive TIme VArying
Biological networks are highly dynamic in response to environmental and physiological cues. This variability is in contrast to conventional analyses of biological networks, which have overwhelmingly employed static graph models which stay constant over time to describe biological systems and their underlying molecular interactions. To overcome these limitations, we propose here a new statistical modelling framework, the ARTIVA formalism (Auto Regressive TIme VArying models), and an associated inferential procedure that allows us to learn temporally varying gene-regulation networks from biological time-course expression data. ARTIVA simultaneously infers the topology of a regulatory network and how it changes over time. It allows us to recover the chronology of regulatory associations for individual genes involved in a specific biological process (development, stress response, etc.). ARTIVA does recover essential temporal dependencies in biological systems from transcriptional data, and provide a natural starting point to learn and investigate their dynamics in greater detail.
Autocorrelation Autocorrelation, also known as serial correlation or cross-autocorrelation, is the cross-correlation of a signal with itself at different points in time (that is what the cross stands for). Informally, it is the similarity between observations as a function of the time lag between them. It is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals.
AutoCorrelation Function
The auto-correlation function measures the correlation of a signal x(t) with itself shifted by some time delay tau. The auto-correlation function can be used to detect repeats or periodicity in a signal. E.g. to use the auto-correlation to assess the effect of fluctuations (noise) on a periodic signal.
Autodependogram In this paper the serial dependences between the observed time series and the lagged series, taken into account one-by-one, are graphically analyzed by what we have chosen to call the ‘autodependogram’. This tool, is a sort of natural nonlinear counterpart of the well-known autocorrelogram used in the linear context. The simple idea, instead of using autocorrelations at varying time lags, exploits the c2-test statistics applied to convenient contingency tables. The usefulness of this graphical device is confirmed by simulations from certain classes of well-known models, characterized by randomness and also by different kinds of linear and nonlinear dependences. The autodependogram is also applied to both environmental and economic real data. In this way its ability to detect nonlinear features is highlighted.
Autoencoded Variational Inference For Topic Model
Topic models are one of the most popular methods for learning representations of text, but a major challenge is that any change to the topic model requires mathematically deriving a new inference algorithm. A promising approach to address this problem is autoencoding variational Bayes (AEVB), but it has proven difficult to apply to topic models in practice. We present what is to our knowledge the first effective AEVB based inference method for latent Dirichlet allocation (LDA), which we call Autoencoded Variational Inference For Topic Model (AVITM). This model tackles the problems caused for AEVB by the Dirichlet prior and by component collapsing. We find that AVITM matches traditional methods in accuracy with much better inference time. Indeed, because of the inference network, we find that it is unnecessary to pay the computational cost of running variational optimization on test data. Because AVITM is black box, it is readily applied to new topic models. As a dramatic illustration of this, we present a new topic model called ProdLDA, that replaces the mixture model in LDA with a product of experts. By changing only one line of code from LDA, we find that ProdLDA yields much more interpretable topics, even if LDA is trained via collapsed Gibbs sampling.
Autoencoder An autoencoder, autoassociator or Diabolo network is an artificial neural network used for learning efficient codings. The aim of an auto-encoder is to learn a compressed, distributed representation (encoding) for a set of data, typically for the purpose of dimensionality reduction.
Auto-Encoding Variational Bayes How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
Automated Machine Learning
The Current State of Automated Machine Learning
Automated Model Order Selection Algorithm
One of the longstanding problems in spectral graph clustering (SGC) is the so-called model order selection problem: automated selection of the correct number of clusters. This is equivalent to the problem of finding the number of connected components or communities in an undirected graph. In this paper, we propose AMOS, an automated model order selection algorithm for SGC. Based on a recent analysis of clustering reliability for SGC under the random interconnection model, AMOS works by incrementally increasing the number of clusters, estimating the quality of identified clusters, and providing a series of clustering reliability tests. Consequently, AMOS outputs clusters of minimal model order with statistical clustering reliability guarantees. Comparing to three other automated graph clustering methods on real-world datasets, AMOS shows superior performance in terms of multiple external and internal clustering metrics.
Automated Predictive Library
Automated Predictive Library (APL) is a HANA Application Function Library (AFL) which is meant to expose the features of InfiniteInsight (aka Kxen) inside HANA.
Automatic Interactive Data Exploration
In this paper, we argue that database systems be augmented with an automated data exploration service that methodically steers users through the data in a meaningful way. Such an automated system is crucial for deriving insights from complex datasets found in many big data applications such as scientific and healthcare applications as well as for reducing the human effort of data exploration. Towards this end, we present AIDE, an Automatic Interactive Data Exploration framework that assists users in discovering new interesting data patterns and eliminate expensive ad-hoc exploratory queries. AIDE relies on a seamless integration of classification algorithms and data management optimization techniques that collectively strive to accurately learn the user interests based on his relevance feedback on strategically collected samples. We present a number of exploration techniques as well as optimizations that minimize the number of samples presented to the user while offering interactive performance. AIDE can deliver highly accurate query predictions for very common conjunctive queries with small user effort while, given a reasonable number of samples, it can predict with high accuracy complex disjunctive queries. It provides interactive performance as it limits the user wait time per iteration of exploration to less than a few seconds.
Automatic Machine Learning
The availability of large data sets and computational resources have encouraged the development of machine learning and data-driven models which pose an interesting alternative to explicit and fully structured models of behaviour. A battery of tools are now available which can automatically learn interesting mappings and simulate complex phenomena. These include techniques such as Hidden Markov Models, Neural Networks, Support Vector Machines, Mixture Models, Decision Trees and Bayesian Network Inference . In general, these tools have fallen into two classes: discriminative and generative models. The first attempts to optimize the learning for a particular task while the second models a phenomenon in its entirety. This difference between the two approaches will be addressed in this thesis in particular detail and the above probabilistic formalisms will be employed in deriving a machine learning system for our purposes.
ChaLearn Automatic Machine Learning Challenge
<a href="'Automate” target=”top”>https://…/
Automatic Multiscale-Based Peak Detection
A method for automatic detection of peaks in noisy periodic and quasi-periodic signals. This method, called automatic multiscale-based peak detection (AMPD), is based on the calculation and analysis of the local maxima scalogram, a matrix comprising the scale-dependent occurrences of local maxima.
Automatic Query Expansion
Query expansion (QE) is the process of reformulating a seed query to improve retrieval performance in information retrieval operations. In the context of search engines, query expansion involves evaluating a user’s input (what words were typed into the search query area, and sometimes other types of data) and expanding the search query to match additional documents. Query expansion involves techniques such as:
• Finding synonyms of words, and searching for the synonyms as well
• Finding all the various morphological forms of words by stemming each word in the search query
• Fixing spelling errors and automatically searching for the corrected form or suggesting it in the results
• Re-weighting the terms in the original query
Query expansion is a methodology studied in the field of computer science, particularly within the realm of natural language processing and information retrieval.
Automatic Short Answer Grading
Automatic short answer grading (ASAG) techniques are designed to automatically assess short answers to questions in natural language, having a length of a few words to a few sentences. Supervised ASAG techniques have been demonstrated to be effective but suffer from a couple of key practical limitations. They are greatly reliant on instructor provided model answers and need labeled training data in the form of graded student answers for every assessment task. To overcome these, in this paper, we introduce an ASAG technique with two novel features. We propose an iterative technique on an ensemble of (a) a text classifier of student answers and (b) a classifier using numeric features derived from various similarity measures with respect to model answers. Second, we employ canonical correlation analysis based transfer learning on a common feature representation to build the classifier ensemble for questions having no labelled data. The proposed technique handsomely beats all winning supervised entries on the SCIENTSBANK dataset from the Student Response Analysis task of SemEval 2013. Additionally, we demonstrate generalizability and benefits of the proposed technique through evaluation on multiple ASAG datasets from different subject topics and standards.
Automatic Speech Recognition
In computer science and electrical engineering, speech recognition (SR) is the translation of spoken words into text. It is also known as ‘automatic speech recognition’ (ASR), ‘computer speech recognition’, or just ‘speech to text’ (STT). Some SR systems use ‘training’ (also called ‘enrolment’) where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person’s specific voice and uses it to fine-tune the recognition of that person’s speech, resulting in increased accuracy. Systems that do not use training are called ‘speaker independent’ systems. Systems that use training are called ‘speaker dependent’. Speech recognition applications include voice user interfaces such as voice dialling (e.g. ‘Call home’), call routing (e.g. ‘I would like to make a collect call’), domotic appliance control, search (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), speech-to-text processing (e.g., word processors or emails), and aircraft (usually termed Direct Voice Input). The term voice recognition or speaker identification refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person’s voice or it can be used to authenticate or verify the identity of a speaker as part of a security process. From the technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data. The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the world-wide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems. These speech industry players include Microsoft, Google, IBM, Baidu (China), Apple, Amazon, Nuance, IflyTek (China), many of which have publicized the core technology in their speech recognition systems being based on deep learning.
Automatic Summarization Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. As the problem of information overload has grown, and as the quantity of data has increased, so has interest in automatic summarization. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. An example of the use of summarization technology is search engines such as Google. Document summarization is another. Generally, there are two approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate. Such a summary might contain words not explicitly present in the original. Research into abstractive methods is an increasingly important and active research area, however due to complexity constraints, research to date has focused primarily on extractive methods.
Autoregressive Conditional Duration
In financial econometrics, an autoregressive conditional duration (ACD, Engle and Russell (1998)) model considers irregularly spaced and autocorrelated intertrade durations. ACD is analogous to GARCH. Indeed, in a continuous double auction (a common trading mechanism in many financial markets) waiting times between two consecutive trades vary at random.
“Generalized Autoregressive Conditional Heteroscedasticity”
Autoregressive Conditional Heteroskedasticity
In econometrics, autoregressive conditional heteroskedasticity (ARCH) models are used to characterize and model observed time series. They are used whenever there is reason to believe that, at any point in a series, the error terms will have a characteristic size or variance. In particular ARCH models assume the variance of the current error term or innovation to be a function of the actual sizes of the previous time periods’ error terms: often the variance is related to the squares of the previous innovations. Such models are often called ARCH models (Engle, 1982), although a variety of other acronyms are applied to particular structures of model which have a similar basis. ARCH models are employed commonly in modeling financial time series that exhibit time-varying volatility clustering, i.e. periods of swings followed by periods of relative calm. ARCH-type models are sometimes considered to be part of the family of stochastic volatility models but strictly this is incorrect since at time t the volatility is completely pre-determined (deterministic) given previous values.
Autoregressive Conditional Poisson
When modelling count data that comes in the form of a time series, the static Poisson regression and standard time series models are often not appropriate. A current study therefore involves the evaluation of several observation-driven and parameter-driven time series models for count data. In the observation-driven class of models, a fairly simple model is the Autoregressive Conditional Poisson (ACP) model.
Autoregressive Distributed Lag Model
A distributed-lag model is a dynamic model in which the effect of a regressor x on y occurs over time rather than all at once. In the simple case of one explanatory variable and a linear relationship. This form is very similar to the infinite-moving-average representation of an ARMA process, except that the lag polynomial on the right-hand side is applied to the explanatory variable x rather than to a white-noise process e. The individual coefficients ßs are called lag weights and the collectively comprise the lag distribution. They define the pattern of how x affects y over time.
Autoregressive Fractionally Integrated Moving Average
In statistics, autoregressive fractionally integrated moving average models are time series models that generalize ARIMA (autoregressive integrated moving average) models by allowing non-integer values of the differencing parameter. These models are useful in modeling time series with long memory – that is, in which deviations from the long-run mean decay more slowly than an exponential decay. The acronyms “ARFIMA” or “FARIMA” are often used, although it is also conventional to simply extend the “ARIMA(p,d,q)” notation for models, by simply allowing the order of differencing, d, to take fractional values.
Autoregressive Integrated Moving Average
In statistics and econometrics, and in particular in time series analysis, an autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model. These models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). They are applied in some cases where data show evidence of non-stationarity, where an initial differencing step (corresponding to the “integrated” part of the model) can be applied to remove the non-stationarity.
The model is generally referred to as an ARIMA(p,d,q) model where parameters p, d, and q are non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model respectively. ARIMA models form an important part of the Box-Jenkins approach to time-series modelling.
When one of the three terms is zero, it is usual to drop “AR”, “I” or “MA” from the acronym describing the model. For example, ARIMA(0,1,0) is I(1), and ARIMA(0,0,1) is MA(1).
Autoregressive Integrated Moving Average With Explanatory Variable
Matlab: ARIMAX Model Specifications
ARIMA vs. ARIMAX – which approach is better to analyze and forecast macroeconomic time series?
Auto-Sklearn auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator. auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advantages in Bayesian optimization, meta-learning and ensemble construction. Learn more about the technology behind auto-sklearn by reading this paper published at the NIPS 2015 .
Auxiliary Inference Divergence Estimator
Approximate probabilistic inference algorithms are central to many fields. Examples include sequential Monte Carlo inference in robotics, variational inference in machine learning, and Markov chain Monte Carlo inference in statistics. A key problem faced by practitioners is measuring the accuracy of an approximate inference algorithm on a specific dataset. This paper introduces the auxiliary inference divergence estimator (AIDE), an algorithm for measuring the accuracy of approximate inference algorithms. AIDE is based on the observation that inference algorithms can be treated as probabilistic models and the random variables used within the inference algorithm can be viewed as auxiliary variables. This view leads to a new estimator for the symmetric KL divergence between the output distributions of two inference algorithms. The paper illustrates application of AIDE to algorithms for inference in regression, hidden Markov, and Dirichlet process mixture models. The experiments show that AIDE captures the qualitative behavior of a broad class of inference algorithms and can detect failure modes of inference algorithms that are missed by standard heuristics.
Average Hazard Ratio
“Hazard Ratio”
Estimation of the Average Hazard Ratio
Average Nearest Neighbor Degree
Average Nearest Neighbor Rank
The average nearest neighbor degree (ANND) of a node of degree $k$, as a function of $k$, is often used to characterize dependencies between degrees of a node and its neighbors in a network. We study the limiting behavior of the ANND in undirected random graphs with general i.i.d. degree sequences and arbitrary joint degree distribution of neighbor nodes, when the graph size tends to infinity. When the degree distribution has finite variance, the ANND converges to a deterministic function and we prove that for the configuration model, where nodes are connected at random, this, naturally, is a constant. For degree distributions with infinite variance, the ANND in the configuration model scales with the size of the graph and we prove a central limit theorem that characterizes this behavior. As a result, the ANND is uninformative for graphs with infinite variance degree distributions. We propose an alternative measure, the average nearest neighbor rank (ANNR) and prove its convergence to a deterministic function whenever the degree distribution has finite mean. In addition to our theoretical results we provide numerical experiments to show the convergence of both functions in the configuration model and the erased configuration model, where self-loops and multiple edges are removed. These experiments also shed new light on the well-known `structural negative correlations’, or `finite-size effects’, that arise in simple graphs, because large nodes can only have a limited number of large neighbors. In particular we show that the majority of such effects for regularly varying distributions are due to a sampling bias.
“Average Nearest Neighbor Degree”
Average Shifted Histogram
A simple device has been proposed for eliminating the bin edge problem of the frequency polygon while retaining many of the computational advantages of a density estimate based on bin counts. Scott (1983, 1985b) considered the problem of choosing among the collection of multivariate frequency polygons, each with the same smoothing parameter but di ering bin origins. Rather than choosing the \smoothest’ such curve or surface, he proposed averaging several of the shifted frequency polygons. As the average of piecewise linear curves is also piecewise linear, the resulting curve appears to be a frequency polygon as well. If the weights are nonnegative and sum to 1, the resulting \averaged shifted frequency polygon’ (ASFP) is nonnegative and integrates to 1. A nearly equivalent device is to average several shifted histograms, which is just as general but simpler to describe and analyze. The result is the \averaged shifted histogram’ (ASH). Since the average of piecewise constant functions such as the histogram is also piecewise constant, the ASH appears to be a histogram as well. In practice, the ASH is made continuous using either of the linear interpolation schemes described for the frequency polygon in Chapter 4 and will be referred to as the frequency polygon of the averaged shifted histogram (FP-ASH). The ASH is the practical choice for computationally and statistically e cient density estimation.
Average Shifted Visual Predictive Checks
“Visual Predictive Checks”
Average Top-k Loss In this work, we introduce the average top-$k$ (AT$_k$) loss as a new ensemble loss for supervised learning, which is the average over the $k$ largest individual losses over a training dataset. We show that the AT$_k$ loss is a natural generalization of the two widely used ensemble losses, namely the average loss and the maximum loss, but can combines their advantages and mitigate their drawbacks to better adapt to different data distributions. Furthermore, it remains a convex function over all individual losses, which can lead to convex optimization problems that can be solved effectively with conventional gradient-based method. We provide an intuitive interpretation of the AT$_k$ loss based on its equivalent effect on the continuous individual loss functions, suggesting that it can reduce the penalty on correctly classified data. We further give a learning theory analysis of MAT$_k$ learning on the classification calibration of the AT$_k$ loss and the error bounds of AT$_k$-SVM. We demonstrate the applicability of minimum average top-$k$ learning for binary classification and regression using synthetic and real datasets.
Average Treatment Effect
The average treatment effect (ATE) is a measure used to compare treatments (or interventions) in randomized experiments, evaluation of policy interventions, and medical trials. The ATE measures the difference in mean (average) outcomes between units assigned to the treatment and units assigned to the control. In a randomized trial (i.e., an experimental study), the average treatment effect can be estimated from a sample using a comparison in mean outcomes for treated and untreated units. However, the ATE is generally understood as a causal parameter (i.e., an estimand or property of a population) that a researcher desires to know, defined without reference to the study design or estimation procedure. Both observational and experimental study designs may enable one to estimate an ATE in a variety of ways.
Averaged One-Dependence Estimators
Averaged one-dependence estimators (AODE) is a probabilistic classification learning technique. It was developed to address the attribute-independence problem of the popular naive Bayes classifier. It frequently develops substantially more accurate classifiers than naive Bayes at the cost of a modest increase in the amount of computation.
Averaged Shifted Frequency Polygon
Awk AWK is an interpreted programming language designed for text processing and typically used as a data extraction and reporting tool. It is a standard feature of most Unix-like operating systems. AWK was created at Bell Labs in the 1970s, and its name is derived from the family names of its authors – Alfred Aho, Peter Weinberger, and Brian Kernighan. The acronym is pronounced the same as the name of the bird, auk (which acts as an emblem of the language such as on The AWK Programming Language book cover – the book is often referred to by the abbreviation TAPL). When written in all lowercase letters, as awk, it refers to the Unix or Plan 9 program that runs scripts written in the AWK programming language. The AWK language is a data-driven scripting language consisting of a set of actions to be taken against streams of textual data – either run directly on files or used as part of a pipeline – for purposes of extracting or transforming text, such as producing formatted reports. The language extensively uses the string datatype, associative arrays (that is, arrays indexed by key strings), and regular expressions. While AWK has a limited intended application domain, and was especially designed to support one-liner programs, the language is Turing-complete, and even the early Bell Labs users of AWK often wrote well-structured large AWK programs.
Azure Machine Learning Azure Machine Learning, a fully-managed service in the cloud that enables you to easily build, deploy and share advanced analytics solutions. No software to download, no servers to manage – all you need to start is a browser and internet connectivity.