S3Pool Feature pooling layers (e.g., max pooling) in convolutional neural networks (CNNs) serve the dual purpose of providing increasingly abstract representations as well as yielding computational savings in subsequent convolutional layers. We view the pooling operation in CNNs as a two-step procedure: first, a pooling window (e.g., $2\times 2$) slides over the feature map with stride one which leaves the spatial resolution intact, and second, downsampling is performed by selecting one pixel from each non-overlapping pooling window in an often uniform and deterministic (e.g., top-left) manner. Our starting point in this work is the observation that this regularly spaced downsampling arising from non-overlapping windows, although intuitive from a signal processing perspective (which has the goal of signal reconstruction), is not necessarily optimal for \emph{learning} (where the goal is to generalize). We study this aspect and propose a novel pooling strategy with stochastic spatial sampling (S3Pool), where the regular downsampling is replaced by a more general stochastic version. We observe that this general stochasticity acts as a strong regularizer, and can also be seen as doing implicit data augmentation by introducing distortions in the feature maps. We further introduce a mechanism to control the amount of distortion to suit different datasets and architectures. To demonstrate the effectiveness of the proposed approach, we perform extensive experiments on several popular image classification benchmarks, observing excellent improvements over baseline models. Experimental code is available at https://…/s3pool.
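The two-step view of pooling described above can be sketched in plain Python. This is a simplified illustration, not the authors' released code: the function names, the `grid` parameter, and the clipping of windows at the border are assumptions made for the example.

```python
import random

def max_pool_stride1(fmap, k=2):
    """Step 1: k x k max pooling with stride one. Windows are clipped at the
    border (an assumption) so the output keeps the input's height and width."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[a][b]
                 for a in range(i, min(i + k, h))
                 for b in range(j, min(j + k, w)))
             for j in range(w)]
            for i in range(h)]

def sample_indices(length, grid, stride, rng):
    """Pick grid // stride indices at random from each strip of `grid`
    consecutive indices, keeping them in increasing order."""
    chosen = []
    for start in range(0, length, grid):
        strip = range(start, min(start + grid, length))
        keep = min(len(strip), grid // stride)
        chosen.extend(sorted(rng.sample(strip, keep)))
    return chosen

def s3pool(fmap, stride=2, grid=2, rng=random):
    """Step 2: replace regular downsampling with random row/column selection."""
    pooled = max_pool_stride1(fmap, k=stride)
    rows = sample_indices(len(pooled), grid, stride, rng)
    cols = sample_indices(len(pooled[0]), grid, stride, rng)
    return [[pooled[i][j] for j in cols] for i in rows]

# Toy 4 x 4 feature map; the result is a 2 x 2 map whose entries vary with the seed.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
print(s3pool(feature_map, stride=2, grid=2, rng=random.Random(0)))
```

In this sketch, a `grid` equal to the stride keeps one row and one column per strip, while a larger `grid` lets the kept rows and columns land further from their regular positions, which is one way to picture the control over distortion mentioned above.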
SaberLDA Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear in the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that implements a sparsity-aware algorithm to achieve sublinear time complexity and scales well to learn a large number of topics. To address the challenges introduced by sparsity, we propose a novel data layout, a new warp-based sampling kernel, and an efficient sparse count matrix updating algorithm that improves locality, makes efficient utilization of GPU warps, and reduces memory consumption. Experiments show that SaberLDA can learn from billions-token-scale data with up to 10,000 topics, which is almost two orders of magnitude larger than that of the previous GPU-based systems. With a single GPU card, SaberLDA is able to learn 10,000 topics from a dataset of billions of tokens in a few hours, which was previously achievable only with clusters of tens of machines.
SafePredict SafePredict is a novel meta-algorithm that works with any base prediction algorithm for online data to guarantee an arbitrarily chosen correctness rate, $1-\epsilon$, by allowing refusals. Allowing refusals means that the meta-algorithm may refuse to emit a prediction produced by the base algorithm on occasion so that the error rate on non-refused predictions does not exceed $\epsilon$. The SafePredict error bound does not rely on any assumptions on the data distribution or the base predictor. When the base predictor happens not to exceed the target error rate $\epsilon$, SafePredict refuses only a finite number of times. When the error rate of the base predictor changes through time, SafePredict makes use of a weight-shifting heuristic that adapts to these changes without knowing when the changes occur yet still maintains the correctness guarantee. Empirical results show that (i) SafePredict compares favorably with state-of-the-art confidence-based refusal mechanisms which fail to offer robust error guarantees; and (ii) combining SafePredict with such refusal mechanisms can in many cases further reduce the number of refusals. Our software (currently in Python) is included in the supplementary material.
Sammon Mapping Sammon mapping or Sammon projection is an algorithm that maps a high-dimensional space to a space of lower dimensionality by trying to preserve the structure of inter-point distances in high-dimensional space in the lower-dimension projection. It is particularly suited for use in exploratory data analysis. The method was proposed by John W. Sammon in 1969. It is considered a non-linear approach as the mapping cannot be represented as a linear combination of the original variables, as is possible in techniques such as principal component analysis, which also makes it more difficult to use for classification applications.
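How well a projection preserves inter-point distances is usually measured with Sammon's stress, which sums the squared differences between original and projected distances, each divided by the original distance. The sketch below only evaluates that stress for a given projection; it does not perform the iterative minimisation, and the function name and toy points are illustrative assumptions.

```python
import math
from itertools import combinations

def sammon_stress(high, low):
    """Sammon's stress between points `high` and their projections `low`.
    Assumes all points in `high` are distinct (no zero distances)."""
    pairs = list(combinations(range(len(high)), 2))
    d_high = [math.dist(high[i], high[j]) for i, j in pairs]
    d_low = [math.dist(low[i], low[j]) for i, j in pairs]
    scale = sum(d_high)
    return sum((dh - dl) ** 2 / dh for dh, dl in zip(d_high, d_low)) / scale

# Corners of a tetrahedron-like shape in 3-D, projected by dropping the z axis.
points_3d = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
projected = [p[:2] for p in points_3d]
print(sammon_stress(points_3d, projected))  # positive: some distances are distorted

# Points that already lie in a plane lose nothing when z is dropped.
flat = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
print(sammon_stress(flat, [p[:2] for p in flat]))  # zero stress
```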
Sample Size Determination
Sample size determination is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. In practice, the sample size used in a study is determined based on the expense of data collection and the need to have sufficient statistical power. In complicated studies there may be several different sample sizes involved: for example, in a stratified survey there would be a different sample size for each stratum. In a census, data are collected on the entire population, hence the sample size is equal to the population size. In experimental design, where a study may be divided into different treatment groups, there may be different sample sizes for each group.
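As one concrete case of sizing a study for statistical power, the sketch below uses the common normal-approximation formula for comparing two group means, n = 2((z_{1-α/2} + z_{power}) σ / δ)² per group. The choice of this particular formula, the function name, and the parameter values are assumptions for illustration; other designs need other formulas.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Observations needed in each of two groups to detect a mean difference
    `delta` with a two-sided test, assuming a common known standard deviation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

# Detecting half a standard deviation at 5% significance and 80% power.
print(n_per_group(delta=0.5, sigma=1.0))   # 63
# Halving the detectable difference roughly quadruples the required size.
print(n_per_group(delta=0.25, sigma=1.0))  # 252
```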
Sample, Explore, Modify, Model and Assess
SEMMA is an acronym that stands for Sample, Explore, Modify, Model and Assess. It is a list of sequential steps developed by SAS Institute Inc., one of the largest producers of statistics and business intelligence software. It guides the implementation of data mining applications. Although SEMMA is often considered to be a general data mining methodology, SAS claims that it is “rather a logical organisation of the functional tool set of” one of their products, SAS Enterprise Miner, “for carrying out the core tasks of data mining”.
Sampled Weighted Min-Hashing
We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SWMH topics are ordered subsets of such vocabulary. Interestingly, the topics mined by SWMH underlie themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7 K documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora. Additionally, we compare the quality of SWMH with Online LDA topics for document representation in classification.
Sampling Error In statistics, sampling error is incurred when the statistical characteristics of a population are estimated from a subset, or sample, of that population. Since the sample does not include all members of the population, statistics on the sample, such as means and quantiles, generally differ from parameters on the entire population. For example, if one measures the height of a thousand individuals from a country of one million, the average height of the thousand is typically not the same as the average height of all one million people in the country. Since sampling is typically done to determine the characteristics of a whole population, the difference between the sample and population values is considered a sampling error. Exact measurement of sampling error is generally not feasible since the true population values are unknown; however, sampling error can often be estimated by probabilistic modeling of the sample.
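The height example can be simulated directly. The sketch below uses a synthetic population (scaled down to 100,000 people to keep it quick) whose mean and spread are invented for illustration, then compares the sample mean with the population mean and with the usual standard-error estimate computed from the sample alone.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic population of heights in cm (mean and spread are made up).
population = [rng.gauss(170.0, 10.0) for _ in range(100_000)]
sample = rng.sample(population, 1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# The sampling error is known here only because the population was simulated.
sampling_error = sample_mean - population_mean

# In practice only the sample is available, so the error is estimated instead.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"sampling error:           {sampling_error:+.3f} cm")
print(f"estimated standard error: {standard_error:.3f} cm")
```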
Sampling Importance Resampling
Sampling Importance Resampling allows us to sample from the posterior distribution, p(theta|data), where p(theta|data) ∝ L(theta;data) × p(theta), by resampling from a series of draws from the prior, p(theta). Denote one of those n draws from the prior distribution, p(theta), as theta_i. Then draw i from the prior sample is drawn with replacement into the posterior sample with probability q_i …
Samsara Apache Mahout introduces a new math environment we call Samsara, for its theme of universal renewal. It reflects a fundamental rethinking of how scalable machine learning algorithms are built and customized. Mahout-Samsara is here to help people create their own math while providing some off-the-shelf algorithm implementations. At its core are general linear algebra and statistical operations along with the data structures to support them. You can use it as a library or customize it in Scala with Mahout-specific extensions that look something like R. Mahout-Samsara comes with an interactive shell that runs distributed operations on a Spark cluster. This makes prototyping or task submission much easier and allows users to customize algorithms with a whole new degree of freedom.
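A minimal sketch of the resampling step, assuming the usual choice of weights in which each prior draw is weighted by its likelihood normalised over all draws. The definition of the weights is cut off in the text above, so this choice is an assumption; the coin-toss data, the uniform prior, and the function names are also invented for illustration.

```python
import random

def sir(prior_draw, likelihood, n_draws, n_keep, rng):
    """Sampling importance resampling with the prior as the proposal."""
    thetas = [prior_draw(rng) for _ in range(n_draws)]
    weights = [likelihood(t) for t in thetas]
    total = sum(weights)
    q = [w / total for w in weights]  # assumed resampling probabilities
    return rng.choices(thetas, weights=q, k=n_keep)  # draws with replacement

# Hypothetical data: 7 heads in 10 tosses, uniform prior on the heads probability.
heads, tosses = 7, 10

def binomial_likelihood(theta):
    return theta ** heads * (1 - theta) ** (tosses - heads)

rng = random.Random(0)
posterior = sir(lambda r: r.random(), binomial_likelihood, 20_000, 5_000, rng)
posterior_mean = sum(posterior) / len(posterior)

# The exact posterior here is Beta(8, 4), whose mean is 8 / 12.
print(f"SIR posterior mean: {posterior_mean:.3f} (exact: {8 / 12:.3f})")
```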


Sankey Diagram Sankey diagrams are a specific type of flow diagram, in which the width of the arrows is shown proportionally to the flow quantity. They are typically used to visualize energy or material or cost transfers between processes.
SAP HANA has completely transformed the database industry by combining database, data processing, and application platform capabilities in a single in-memory platform. The platform also provides libraries for predictive, planning, text processing, spatial, and business analytics – all on the same architecture. This makes it possible for applications and analytics to be rethought without information processing latency, and sense-and-response solutions can work on massive quantities of real-time data for immediate answers without building pre-aggregates. Simply put – this makes SAP HANA the platform for building and deploying next-generation, real-time applications and analytics.
SAP River River is a programming model and a programming language where you define your application (Data Model, Queries & Business Logic) and upon deployment every run-time artifact is deployed onto a DB (such as HANA) and a run-time container (such as XS to run the JavaScript which handles the business logic side).
SAP River is an easy way to make SAP HANA Applications. Develop and test an application backend, in a matter of minutes, that runs on SAP HANA – SAP’s in-memory database and application platform.
SAP River is a new way of developing native applications on SAP HANA. River consists of a language, a programming model and a set of tools, which allow the developer to focus on the business intent of the application, and largely ignore issues of implementation and optimization. These aspects are taken care of automatically by the language tools, which choose, on compilation, the most appropriate run-time context for each part of the application.
River allows a developer to specify the data model, the application business logic as well as access control, all in a single integrated specification. River is compatible with existing SAP HANA objects, like tables, views, stored procedures and XSJS procedures. River code is in fact cross-compiled into these same native runtime objects, which are automatically exposed via an OData API.
The result is a simpler development process, increased developer productivity, and application code that is easier to understand and to maintain.
SAX Transformation “Symbolic Aggregate Approximation”
Scala Scala is an object-functional programming language for general software applications. Scala has full support for functional programming and a very strong static type system. This allows programs written in Scala to be very concise and thus smaller in size than other general-purpose programming languages. Many of Scala’s design decisions were inspired by criticism of the shortcomings of Java. Scala source code is intended to be compiled to Java bytecode, so that the resulting executable code runs on a Java virtual machine. Java libraries may be used directly in Scala code and vice versa (language interoperability). Like Java, Scala is object-oriented, and uses a curly-brace syntax reminiscent of the C programming language. Unlike Java, Scala has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching. It also has an advanced type system supporting algebraic data types, covariance and contravariance, higher-order types, and anonymous types. Other features of Scala not present in Java include operator overloading, optional parameters, named parameters, raw strings, and no checked exceptions. The name Scala is a portmanteau of ‘scalable’ and ‘language’, signifying that it is designed to grow with the demands of its users.
Scalable Advanced Massive Online Analysis
SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm, S4, and Samza. SAMOA is written in Java, is open source, and is available under the Apache Software License version 2.0.
Scalable Bayesian Multi-relational Factorization with Side Information using MCMC
We propose Macau, a powerful and flexible Bayesian factorization method for heterogeneous data. Our model can factorize any set of entities and relations that can be represented by a relational model, including tensors and also multiple relations for each entity. Macau can also incorporate side information, specifically entity and relation features, which are crucial for predicting sparsely observed relations. Macau scales to millions of entity instances, hundreds of millions of observations, and sparse entity features with millions of dimensions. To achieve the scale-up, we specially designed a sampling procedure for entity and relation features that relies primarily on noise injection in linear regressions. We show the performance and advanced features of Macau in a set of experiments, including a challenging drug-protein activity prediction task.
Scalable Machine Learning Scalability has become one of those core concepts slash buzzwords of Big Data. It’s all about scaling out, web scale, and so on. In principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast. The terms “scalable” and “large scale” were used in machine learning circles long before there was Big Data. There had always been certain problems which led to a large amount of data, for example in bioinformatics, or when dealing with large numbers of text documents. So finding learning algorithms, or more generally data analysis algorithms, which can deal with a very large set of data was always a relevant question.
Scalable Online Learning
SOL is an open-source library for scalable online learning algorithms, and is particularly suitable for learning with high-dimensional data. The library provides a family of regular and sparse online learning algorithms for large-scale binary and multi-class classification tasks with high efficiency, scalability, portability, and extensibility. SOL is implemented in C++ and comes with a collection of easy-to-use command-line tools, Python wrappers and library calls for users and developers, as well as comprehensive documentation for both beginners and advanced users. SOL is not only a practical machine learning toolbox, but also a comprehensive experimental platform for online learning research. Experiments demonstrate that SOL is highly efficient and scalable for large-scale machine learning with high-dimensional data.
Scalding Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.
Scale Adaptive Dictionary Learning
Dictionary learning has been widely used in many image processing tasks. In most of these methods, the number of basis vectors is either set by experience or coarsely evaluated empirically. In this paper we propose a new Scale Adaptive Dictionary Learning (SADL) framework, which jointly estimates suitable scales and corresponding atoms in an adaptive fashion according to the training data, without the need of prior information. We design an atom counting function and develop a reliable numerical scheme to solve the challenging optimization problem. Extensive experiments on texture and video datasets demonstrate quantitatively and visually that our method can estimate the scale, without damaging the sparse reconstruction ability.
Scaled Cayley Orthogonal Recurrent Neural Network
Recurrent Neural Networks (RNNs) are designed to handle sequential data but suffer from vanishing or exploding gradients. Recent work on Unitary Recurrent Neural Networks (uRNNs) has been used to address this issue and, in some cases, to exceed the capabilities of Long Short-Term Memory networks (LSTMs). We propose a simpler and novel update scheme to maintain orthogonal recurrent weight matrices without using complex-valued matrices. This is done by parametrizing with a skew-symmetric matrix using the Cayley transform. Such a parametrization is unable to represent matrices with negative one eigenvalues, but this limitation is overcome by scaling the recurrent weight matrix by a diagonal matrix consisting of ones and negative ones. The proposed training scheme involves a straightforward gradient calculation and update step. In several experiments, the proposed scaled Cayley orthogonal recurrent neural network (scoRNN) achieves superior results with fewer trainable parameters than other unitary RNNs.
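The parametrization can be illustrated on a 2 × 2 matrix. The sketch below takes a skew-symmetric matrix A, forms the Cayley transform (I + A)⁻¹(I − A), and then multiplies by a diagonal matrix D of ones and negative ones. The order of multiplication with D, the 2 × 2 restriction, and all helper names are assumptions for the example; no training step is shown.

```python
def matmul(a, b):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(m):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def transpose(m):
    return [[m[j][i] for j in range(2)] for i in range(2)]

def cayley(skew):
    """Cayley transform (I + A)^-1 (I - A) of a skew-symmetric matrix A."""
    eye = [[1.0, 0.0], [0.0, 1.0]]
    plus = [[eye[i][j] + skew[i][j] for j in range(2)] for i in range(2)]
    minus = [[eye[i][j] - skew[i][j] for j in range(2)] for i in range(2)]
    return matmul(inverse(plus), minus)

A = [[0.0, 0.5], [-0.5, 0.0]]   # skew-symmetric: A^T = -A
D = [[1.0, 0.0], [0.0, -1.0]]   # diagonal scaling with one negative entry

W = cayley(A)                   # orthogonal, determinant +1 (a rotation)
W_scaled = matmul(W, D)         # still orthogonal, determinant -1

print(W)
print(matmul(transpose(W_scaled), W_scaled))  # approximately the identity
```

An orthogonal 2 × 2 matrix with determinant −1 has −1 as an eigenvalue, so the scaled matrix reaches a case the plain Cayley transform cannot represent.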
Scaled Sparse Linear Regression Scaled sparse linear regression jointly estimates the regression coefficients and noise level in a linear model. It chooses an equilibrium with a sparse regression method by iteratively estimating the noise level via the mean residual square and scaling the penalty in proportion to the estimated noise level. The iterative algorithm costs little beyond the computation of a path or grid of the sparse regression estimator for penalty levels above a proper threshold. For the scaled lasso, the algorithm is a gradient descent in a convex minimization of a penalized joint loss function for the regression coefficients and noise level. Under mild regularity conditions, we prove that the scaled lasso simultaneously yields an estimator for the noise level and an estimated coefficient vector satisfying certain oracle inequalities for prediction, the estimation of the noise level and the regression coefficients. These inequalities provide sufficient conditions for the consistency and asymptotic normality of the noise level estimator, including certain cases where the number of variables is of greater order than the sample size. Parallel results are provided for the least squares estimation after model selection by the scaled lasso. Numerical results demonstrate the superior performance of the proposed methods over an earlier proposal of joint convex minimization.
Scatterplot Smoothing In statistics, several scatterplot smoothing methods are available to fit a function through the points of a scatterplot to best represent the relationship between the variables. Scatterplots may be smoothed by fitting a line to the data points in a diagram. This line attempts to display the non-random component of the association between the variables in a 2D scatter plot. Smoothing attempts to separate the non-random behaviour in the data from the random fluctuations, removing or reducing these fluctuations, and allows prediction of the response based on the value of the explanatory variable.
Scientific Data Mining Scientific data mining is defined as data mining applied to scientific problems, rather than database marketing, finance, or business-driven applications. Scientific data mining distinguishes itself in the sense that the nature of the datasets is often very different from traditional market-driven data mining applications. The datasets now might involve vast amounts of precise and continuous data, and accounting for underlying system nonlinearities can be extremely challenging from a machine learning point of view.
scikit scikit-learn: Machine Learning in Python
• Simple and efficient tools for data mining and data analysis
• Accessible to everybody, and reusable in various contexts
• Built on NumPy, SciPy, and matplotlib
• Open source, commercially usable – BSD license
Score Function In statistics, the score, score function, efficient score or informant indicates how sensitively a likelihood function L(theta,X) depends on its parameter theta. Explicitly, the score for theta is the gradient of the log-likelihood with respect to theta. The score plays an important role in several aspects of inference. For example:
• in formulating a test statistic for a locally most powerful test;
• in approximating the error in a maximum likelihood estimate;
• in demonstrating the asymptotic sufficiency of a maximum likelihood estimate;
• in the formulation of confidence intervals;
• in demonstrations of the Cramér-Rao inequality.
The score function also plays an important role in computational statistics, as it can play a part in the computation of maximum likelihood estimates.
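As a worked case, take independent normal observations with unknown mean mu and known standard deviation sigma. The score for mu is then the sum of (x − mu) divided by sigma squared, and it vanishes at the maximum likelihood estimate, the sample mean. The sketch below checks this numerically; the data values and function names are invented for illustration.

```python
import math

def log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of i.i.d. normal data with known sigma."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def score(mu, data, sigma=1.0):
    """Derivative of the log-likelihood with respect to mu."""
    return sum(x - mu for x in data) / sigma ** 2

data = [4.1, 5.3, 4.8, 5.9, 5.0]   # hypothetical observations
mle = sum(data) / len(data)        # sample mean maximises the likelihood

# Central finite difference of the log-likelihood as an independent check.
h = 1e-6
numeric = (log_likelihood(4.0 + h, data) - log_likelihood(4.0 - h, data)) / (2 * h)

print(f"score at mu = 4.0: {score(4.0, data):.4f} (finite difference: {numeric:.4f})")
print(f"score at the MLE:  {score(mle, data):.2e}")
```

A positive score at mu = 4.0 says the log-likelihood is still increasing there, which is why gradient-based routines for computing maximum likelihood estimates move mu upward from that point.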
Scoring Rule In decision theory, a score function, or scoring rule, measures the accuracy of probabilistic predictions. It is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes. The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one (where each individual probability is in the range of 0 to 1). A score can be thought of as either a measure of the “calibration” of a set of probabilistic predictions, or as a “cost function” or “loss function”.
If a cost is levied in proportion to a proper scoring rule, the minimal expected cost corresponds to reporting the true set of probabilities. Proper scoring rules are used in meteorology, finance, and pattern classification where a forecaster or algorithm will attempt to minimize the average score to yield refined, calibrated probabilities (i.e. accurate probabilities). Various scoring rules have also been used to assess the predictive accuracy of football forecast models.
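The property that honest reporting minimises expected cost can be checked numerically for two well-known proper scoring rules, the Brier (quadratic) score and the logarithmic score. The sketch assumes a binary outcome with a true probability of 0.7 and searches a grid of reported probabilities; the numbers and function names are illustrative.

```python
import math

def brier_score(probs, outcome):
    """Sum of squared differences between the forecast and the realised outcome."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2
               for k, p in enumerate(probs))

def log_score(probs, outcome):
    """Negative log of the probability assigned to the realised outcome."""
    return -math.log(probs[outcome])

def expected_score(rule, reported, true_probs):
    """Average cost of a report when outcomes follow `true_probs`."""
    return sum(t * rule(reported, k) for k, t in enumerate(true_probs))

true_probs = [0.7, 0.3]
grid = [i / 100 for i in range(1, 100)]  # candidate reports for outcome 0

best_brier = min(grid, key=lambda q: expected_score(brier_score, [q, 1 - q], true_probs))
best_log = min(grid, key=lambda q: expected_score(log_score, [q, 1 - q], true_probs))

# Both rules are minimised by reporting the true probability.
print(best_brier, best_log)
```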
Scott-Knott Scott-Knott is a hierarchical clustering algorithm used in the application of ANOVA, when the researcher is comparing treatment means, with a very important characteristic: it does not present any overlapping in its grouping results. We wrote code in R that performs this algorithm starting from vectors, matrix, data.frame, aov or aov.list objects. The results are presented with letters representing groups, as well as through graphics using different colors to differentiate distinct groups. This R package, named ScottKnott, is the main topic of this article.
Scree Plot A scree plot is a graphical display of the variance of each component in the dataset which is used to determine how many components should be retained in order to explain a high percentage of the variation in the data. The plot shows the variance for the first component and then for the subsequent components, it shows the additional variance that each component is adding.
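The quantities behind a scree plot are simple to compute once the component variances are known. The sketch below starts from a made-up list of variances (in practice these would come from a principal component analysis), derives the share and cumulative share of each component, and prints a rough text version of the plot; the 90% threshold and function names are illustrative choices.

```python
def explained_variance(component_variances):
    """Share of total variance per component, plus the running total."""
    total = sum(component_variances)
    ratios = [v / total for v in component_variances]
    cumulative, running = [], 0.0
    for r in ratios:
        running += r
        cumulative.append(running)
    return ratios, cumulative

def components_needed(cumulative, threshold):
    """Smallest number of leading components whose share reaches the threshold."""
    for k, c in enumerate(cumulative, start=1):
        if c >= threshold:
            return k
    return len(cumulative)

variances = [4.0, 2.0, 1.0, 0.5, 0.5]  # hypothetical, sorted in decreasing order
ratios, cumulative = explained_variance(variances)

# Text scree plot: one bar per component, with the cumulative share alongside.
for k, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{k} {'#' * round(r * 40):<20} {r:6.1%}  cumulative {c:6.1%}")

print(components_needed(cumulative, 0.90))  # 4
```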
SE3-Nets We introduce SE3-Nets, which are deep networks designed to model rigid body motion from raw point cloud data. Based only on pairs of depth images along with an action vector and pointwise data associations, SE3-Nets learn to segment affected object parts and predict their motion resulting from the applied force. Rather than learning pointwise flow vectors, SE3-Nets predict SE3 transformations for different parts of the scene. Using simulated depth data of a table top scene and a robot manipulator, we show that the structure underlying SE3-Nets enables them to generate a far more consistent prediction of object motion than traditional flow-based networks.
Seaborn Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
Search Partition Analysis
SEARNN We propose SEARNN, a novel training algorithm for recurrent neural networks (RNNs) inspired by the ‘learning to search’ (L2S) approach to structured prediction. RNNs have been widely successful in structured prediction applications such as machine translation or parsing, and are commonly trained using maximum likelihood estimation (MLE). Unfortunately, this training loss is not always an appropriate surrogate for the test error: by only maximizing the ground truth probability, it fails to exploit the wealth of information offered by structured losses. Further, it introduces discrepancies between training and predicting (such as exposure bias) that may hurt test performance. Instead, SEARNN leverages test-alike search space exploration to introduce global-local losses that are closer to the test error. We demonstrate improved performance over MLE on three different tasks: OCR, spelling correction and text chunking. Finally, we propose a subsampling strategy to enable SEARNN to scale to large vocabulary sizes.
Seasonal ARIMA
Often time series possess a seasonal component that repeats every s observations. For monthly observations s = 12 (12 in 1 year), for quarterly observations s = 4 (4 in 1 year). In order to deal with seasonality, ARIMA processes have been generalized: SARIMA models have then been formulated.
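One ingredient of the seasonal generalisation is seasonal differencing, which subtracts from each observation the value s steps earlier. The sketch below applies it to an invented quarterly series made of a linear trend plus a repeating pattern; it illustrates only this differencing step, not the fitting of a full SARIMA model.

```python
def seasonal_difference(series, s):
    """Return y[t] - y[t - s] for every t that has a value s steps earlier."""
    return [series[t] - series[t - s] for t in range(s, len(series))]

# Hypothetical quarterly data: trend of +1 per quarter plus a fixed seasonal pattern.
pattern = [10, -5, 3, -8]
quarterly = [t + pattern[t % 4] for t in range(16)]

deseasonalised = seasonal_difference(quarterly, s=4)

# The repeating pattern cancels, leaving only the yearly trend increment of 4.
print(deseasonalised)
```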
Seasonal Decomposition of Time Series by Loess
Decompose a time series into seasonal, trend and irregular components using loess.
Seasonal Hybrid ESD
The primary algorithm, Seasonal Hybrid ESD (S-H-ESD), builds upon the Generalized ESD test for detecting anomalies. S-H-ESD can be used to detect both global and local anomalies. This is achieved by employing time series decomposition and using robust statistical metrics, viz., median together with ESD. In addition, for long time series such as 6 months of minutely data, the algorithm employs piecewise approximation. This is rooted in the fact that trend extraction in the presence of anomalies is non-trivial for anomaly detection.
Second-Order Convolutional Neural Networks Convolutional Neural Networks (CNNs) have been successfully applied to many computer vision tasks, such as image classification. By performing linear combinations and element-wise nonlinear operations, these networks can be thought of as extracting solely first-order information from an input image. In the past, however, second-order statistics computed from handcrafted features, e.g., covariances, have proven highly effective in diverse recognition tasks. In this paper, we introduce a novel class of CNNs that exploit second-order statistics. To this end, we design a series of new layers that (i) extract a covariance matrix from convolutional activations, (ii) compute a parametric, second-order transformation of a matrix, and (iii) perform a parametric vectorization of a matrix. These operations can be assembled to form a Covariance Descriptor Unit (CDU), which replaces the fully-connected layers of standard CNNs. Our experiments demonstrate the benefits of our new architecture, which outperforms the first-order CNNs, while relying on up to 90% fewer parameters.
Sedano We present Sedano, a system for processing and indexing a continuous stream of business-related news. Sedano defines pipelines whose stages analyze and enrich news items (e.g., newspaper articles and press releases). News data coming from several content sources are stored, processed and then indexed in order to be consumed by Atoka, our business intelligence product. Atoka users can retrieve news about specific companies, filtering according to various facets. Sedano features both an entity-linking phase, which finds mentions of companies in news, and a classification phase, which classifies news according to a set of business events. Its flexible architecture allows Sedano to be deployed on commodity machines while being scalable and fault-tolerant.
Seemingly Unrelated Regression
In econometrics, the seemingly unrelated regressions (SUR) or seemingly unrelated regression equations (SURE) model, proposed by Arnold Zellner in 1962, is a generalization of a linear regression model that consists of several regression equations, each having its own dependent variable and potentially different sets of exogenous explanatory variables. Each equation is a valid linear regression on its own and can be estimated separately, which is why the system is called seemingly unrelated, although some authors suggest that the term seemingly related would be more appropriate, since the error terms are assumed to be correlated across the equations. The model can be estimated equation-by-equation using standard ordinary least squares (OLS). Such estimates are consistent, but generally not as efficient as the SUR method, which amounts to feasible generalized least squares with a specific form of the variance-covariance matrix. Two important cases when SUR is in fact equivalent to OLS are when the error terms are uncorrelated between the equations (so that they are truly unrelated) and when each equation contains exactly the same set of regressors on the right-hand side. The SUR model can be viewed as either the simplification of the general linear model where certain coefficients in matrix $\mathrm{B}$ are restricted to be equal to zero, or as the generalization of the general linear model where the regressors on the right-hand side are allowed to be different in each equation. The SUR model can be further generalized into the simultaneous equations model, where the right-hand side regressors are allowed to be the endogenous variables as well.
Segmental Recurrent Neural Network
We introduce segmental recurrent neural networks (SRNNs) which define, given an input sequence, a joint probability distribution over segmentations of the input and labelings of the segments. Representations of the input segments (i.e., contiguous subsequences of the input) are computed by encoding their constituent tokens using bidirectional recurrent neural nets, and these ‘segment embeddings’ are used to define compatibility scores with output labels. These local compatibility scores are integrated using a global semi-Markov conditional random field. Both fully supervised training – in which segment boundaries and labels are observed – as well as partially supervised training – in which segment boundaries are latent – are straightforward. Experiments on handwriting recognition and joint Chinese word segmentation/POS tagging show that, compared to models that do not explicitly represent segments such as BIO tagging schemes and connectionist temporal classification (CTC), SRNNs obtain substantially higher accuracies.
Segmented Linear Regression Segmented linear regression with two segments separated by a breakpoint can be useful to quantify an abrupt change of the response function (Yr) of a varying influential factor (x). The breakpoint can be interpreted as a critical, safe, or threshold value beyond or below which (un)desired effects occur. The breakpoint can be important in decision making.
Segmented Regression Segmented regression, also known as piecewise regression or ‘broken-stick regression’, is a method in regression analysis in which the independent variable is partitioned into intervals and a separate line segment is fit to each interval. Segmented regression analysis can also be performed on multivariate data by partitioning the various independent variables. Segmented regression is useful when the independent variables, clustered into different groups, exhibit different relationships between the variables in these regions. The boundaries between the segments are breakpoints. Segmented linear regression is segmented regression whereby the relations in the intervals are obtained by linear regression.
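As a minimal sketch of the idea, assuming a single breakpoint and one independent variable, the following fits a separate least-squares line on each side of every candidate split and keeps the split with the lowest total squared error. The function names and the toy data are hypothetical, and the two lines are not forced to meet at the breakpoint.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def fit_two_segments(xs, ys, min_points=2):
    """Try every split of the sorted data and keep the one with the lowest total SSE."""
    pairs = sorted(zip(xs, ys))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for k in range(min_points, len(xs) - min_points + 1):
        left = fit_line(xs[:k], ys[:k])
        right = fit_line(xs[k:], ys[k:])
        total = left[2] + right[2]
        if best is None or total < best["sse"]:
            best = {
                # Report the breakpoint midway between the two neighbouring x values.
                "breakpoint": (xs[k - 1] + xs[k]) / 2,
                "left": left[:2],    # (intercept, slope) below the breakpoint
                "right": right[:2],  # (intercept, slope) above the breakpoint
                "sse": total,
            }
    return best


# Toy data: flat response up to x = 4, then a different linear relation.
xs = list(range(10))
ys = [1.0] * 5 + [2.0 * x for x in range(5, 10)]
result = fit_two_segments(xs, ys)
```

A grid search over splits is the simplest choice; dedicated tools typically also test whether the breakpoint is statistically significant, which this sketch does not attempt.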
SegReg In statistics and data analysis the application software SegReg is a free and user-friendly tool for linear segmented regression analysis to determine the breakpoint where the relation between the dependent variable and the independent variable changes abruptly. Originally the method was developed for the analysis of the influence of soil salinity and depth of the watertable on growth of agricultural crops. However, it can be used for many other types of phenomena and relations, for example:
• the change of nutrient contents in plants with time
• the number of negative indicator responses at 30% upstream riparian harvest
• phosphorus and flow duration on the Saline River
Seldon Seldon is an open-source predictive analytics platform. Our proven machine learning algorithms and highly scalable platform serve recommendations to hundreds of millions of people across some of the world’s leading media and e-commerce brands. Seldon VM contains the entire platform pre-configured in a virtual machine for you to get started quickly, to test Seldon with your service data and a movie recommender demo.
SelectScript We introduce a new declarative language called SELECTSCRIPT. As its name suggests, it is a scripting language inspired primarily by SQL and its relational algebra. It is intended to be used for complex queries on different kinds of world models. Scripts can be dynamically generated and executed, or embedded into the code of foreign programming languages. A first interpreter was therefore developed for Python. Adapting the ideas of language-oriented programming, which enables developers to create their own domain-specific language, we developed a language stub that can be easily adapted and extended to comply with any (discrete) robotic world model or robotic simulator. We will further show how simple SELECT-statements can be used to extract any kind of valuable information in various return formats, thereby going beyond traditional SQL capabilities.
Reasoning in complex environments with the SelectScript declarative language
Self Organising Deltoids The self-organising deltoids dimension-squeezing algorithm is a simple algorithm that tries to find reasonable positions in m-dimensional space for a set of points in n dimensions (where m is smaller than n). Its main usage is to visualise n-dimensional data in 2 dimensions, but any dimensionality can be chosen. The algorithm simply takes a set of points in n dimensions, and then gradually squeezes out the excess dimensions using the errors in inter-node distances to arrange the nodes in the reduced dimensional space.
Self-Adaptive Systems
Self Adaptive Software evaluates its own behavior and changes behavior when the evaluation indicates that it is not accomplishing what the software is intended to do, or when better functionality or performance is possible.
Self-Exciting Model of Information Cascades
Here we focus on predicting the final size of an information cascade spreading through a network. We develop a statistical model based on the theory of self-exciting point processes. A point process indexed by time is called a counting process when it counts the number of instances (reshares, in our case) over time. In contrast to homogeneous Poisson processes which assume constant intensity over time, self-exciting processes assume that all the previous instances (i.e., reshares) influence the future evolution of the process. Self-exciting point processes are frequently used to model “rich get richer” phenomena. They are ideal for modeling information cascades in networks because every new reshare of a post not only increases its cumulative reshare count by one, but also exposes new followers who may further reshare the post. We develop SEISMIC (Self-Exciting Model of Information Cascades) for predicting the total number of reshares of a given post. In our model, each post is fully characterized by its infectiousness which measures the reshare probability. We allow the infectiousness to vary freely over time in agreement with the observation that the infectiousness can drop as the content gets stale. Moreover, our model is able to identify at each time point whether the cascade is in the supercritical or subcritical state, based on whether its infectiousness is above or below a critical threshold. A cascade in the supercritical state is going through an “explosion” period and its final size cannot be predicted accurately at the current time. On the contrary, a cascade is tractable if it is in subcritical state. In this case, we are able to predict its ultimate popularity accurately by modeling the future cascading behavior by a Galton-Watson tree. Our SEISMIC approach makes several contributions: Generative model: SEISMIC imposes no parametric assumptions and requires no expensive feature engineering.
Moreover, as complete social network structure may be hard to obtain, SEISMIC assumes minimal knowledge of the network: The only required input is the time history of reshares and the degrees of the resharing nodes.
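To make the two ingredients above concrete, here is a small sketch of a generic self-exciting (Hawkes-type) intensity and of the expected size of a subcritical Galton-Watson tree. This is not the SEISMIC estimator: the exponential kernel, the background rate `mu`, and all parameter values are assumptions chosen for illustration only.

```python
import math


def intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Generic self-exciting intensity:
    lambda(t) = mu + sum over past events of alpha * beta * exp(-beta * (t - t_i)).
    Every past event raises the current rate, and its influence decays with time.
    """
    return mu + sum(
        alpha * beta * math.exp(-beta * (t - s)) for s in event_times if s < t
    )


def expected_total_progeny(mean_offspring):
    """Expected size of a Galton-Watson tree grown from one node.

    With mean offspring m < 1 (subcritical), the expected total is
    1 + m + m^2 + ... = 1 / (1 - m). For m >= 1 the expectation is not finite.
    """
    if mean_offspring >= 1.0:
        raise ValueError("critical or supercritical: expected size is not finite")
    return 1.0 / (1.0 - mean_offspring)


reshare_times = [0.0, 0.4, 0.5]
rate_soon_after = intensity(0.6, reshare_times)
rate_much_later = intensity(6.0, reshare_times)
```

In this sketch `alpha` plays the role of the mean number of direct offspring per event, so `alpha < 1` corresponds to the subcritical regime in which a finite final size can be predicted.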
Self-Exciting Point Process Model seismic
Self-Normalizing Neural Network Deep Learning has revolutionized vision via convolutional neural networks (CNNs) and natural language processing via recurrent neural networks (RNNs). However, success stories of Deep Learning with standard feed-forward neural networks (FNNs) are rare. FNNs that perform well are typically shallow and therefore cannot exploit many levels of abstract representations. We introduce self-normalizing neural networks (SNNs) to enable high-level abstract representations. While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation functions of SNNs are ‘scaled exponential linear units’ (SELUs), which induce self-normalizing properties. Using the Banach fixed-point theorem, we prove that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance, even under the presence of noise and perturbations. This convergence property of SNNs makes it possible to (1) train deep networks with many layers, (2) employ strong regularization, and (3) make learning highly robust. Furthermore, for activations not close to unit variance, we prove an upper and lower bound on the variance; thus, vanishing and exploding gradients are impossible. We compared SNNs on (a) 121 tasks from the UCI machine learning repository, on (b) drug discovery benchmarks, and on (c) astronomy tasks with standard FNNs and other machine learning methods such as random forests and support vector machines. SNNs significantly outperformed all competing FNN methods at 121 UCI tasks, outperformed all competing methods at the Tox21 dataset, and set a new record at an astronomy data set. The winning SNN architectures are often very deep. Implementations are available at:
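The SELU activation can be sketched in a few lines. The two constants below are the commonly published SELU values and are not stated in the entry itself, so treat them as an assumption of this example; the check simply pushes standard normal samples through the function and measures the resulting mean and variance.

```python
import math
import random

# Commonly published SELU constants (not given in the entry above).
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit: linear for x > 0, scaled exponential otherwise."""
    return SCALE * x if x > 0 else SCALE * ALPHA * (math.exp(x) - 1.0)


# Feed zero-mean, unit-variance inputs through SELU and inspect the output moments.
rng = random.Random(0)
outputs = [selu(rng.gauss(0.0, 1.0)) for _ in range(100_000)]
mean = sum(outputs) / len(outputs)
var = sum((o - mean) ** 2 for o in outputs) / len(outputs)
```

With these constants the output mean stays near zero and the variance near one, which is the fixed point the entry refers to. The sketch covers a single activation only, not the weight initialisation a full SNN also relies on.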
Self-Organizing Map
A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps are different from other artificial neural networks in the sense that they use a neighborhood function to preserve the topological properties of the input space.
This makes SOMs useful for visualizing low-dimensional views of high-dimensional data, akin to multidimensional scaling. The model was first described as an artificial neural network by the Finnish professor Teuvo Kohonen, and is sometimes called a Kohonen map or network.
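A minimal sketch of SOM training follows, assuming a one-dimensional chain of nodes, a Gaussian neighborhood function, and linearly decaying learning rate and radius. All names and hyperparameters are hypothetical choices for illustration, not Kohonen's original settings.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to input x."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
    )


def train_som(data, n_nodes=5, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D chain of nodes; neighbours of the winner are pulled along with it."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays over time
        sigma = max(sigma0 * (1.0 - frac), 0.3)  # neighborhood radius shrinks
        for x in rng.sample(data, len(data)):    # visit the samples in random order
            bmu = best_matching_unit(weights, x)
            for i in range(n_nodes):
                # Gaussian neighborhood measured on the node grid, not in input space.
                h = math.exp(-((i - bmu) ** 2) / (2.0 * sigma ** 2))
                weights[i] = [w + lr * h * (v - w) for w, v in zip(weights[i], x)]
    return weights


def quantization_error(weights, data):
    """Mean distance from each sample to its best matching unit."""
    total = 0.0
    for x in data:
        w = weights[best_matching_unit(weights, x)]
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(w, x)))
    return total / len(data)


# Toy data: points along the diagonal of the unit square.
data = [(t / 49, t / 49) for t in range(50)]
untrained = train_som(data, epochs=0)  # same seed, so the same initial weights
trained = train_som(data)
```

Because the neighborhood is defined on the node grid, nodes that are adjacent in the chain end up representing nearby regions of the input, which is the topology-preserving property described above.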
Self-Organizing Systems The term Self-organizing Systems refers to a class of systems that are able to change their internal structure and their function in response to external circumstances. By self-organization it is understood that elements of a system are able to manipulate or organize other elements of the same system in a way that stabilizes either structure or function of the whole against external fluctuations. Over the last decades a variety of features have been identified as typical for self-organizing systems. Not all of these features are present in all systems able to self-organize. Self-organizing systems are dynamic, non-deterministic, open, exist far from equilibrium and sometimes employ autocatalytic amplification of fluctuations. Often, they are characterized by multiple time-scales of their internal and/or external interactions, they possess a hierarchy of structural and/or functional levels and they are able to react to external input in a variety of ways. Many self-organizing systems are non-teleological, i.e. they do not have a specific purpose except their own existence. As a consequence, self-maintenance is an important function of many self-organizing systems. Most of these systems are complex and use redundancy to achieve resilience against external perturbation tendencies. Self-organizing systems have been discovered in nature, both in the non-living (galaxies, stars) and the living world (cells, organisms, ecosystems), they have been found in man-made systems (societies, economies), and they have been identified in the world of ideas (world views, scientific beliefs, norm systems).
Self Organizing System
Self-Paced Learning
It is known that Boosting can be interpreted as a gradient descent technique to minimize an underlying loss function. Specifically, the underlying loss being minimized by the traditional AdaBoost is the exponential loss, which is proved to be very sensitive to random noise/outliers. Therefore, several Boosting algorithms, e.g., LogitBoost and SavageBoost, have been proposed to improve the robustness of AdaBoost by replacing the exponential loss with some designed robust loss functions. In this work, we present a new way to robustify AdaBoost, i.e., incorporating the robust learning idea of Self-paced Learning (SPL) into Boosting framework. Specifically, we design a new robust Boosting algorithm based on SPL regime, i.e., SPLBoost, which can be easily implemented by slightly modifying off-the-shelf Boosting packages. Extensive experiments and a theoretical characterization are also carried out to illustrate the merits of the proposed SPLBoost.
Self-Service Semantic Suite
The Self-Service Semantic Suite (S4) provides a set of services for low-cost, on-demand text analytics and metadata management on the cloud.
S4 provides the following services:
• Text analytics services for news, Life Science and social media that allow you to extract valuable meaning and insights used to manage your business
• On-demand, fast and reliable access to Linked Datasets, such as DBpedia, Freebase and GeoNames. These datasets provide facts you can use to enhance your semantic analysis.
• A self-managed or fully-managed scalable RDF database available as-a-service, so that you can search and update semantic facts loaded from Linked Open Data or your own documents
Semantic Analysis Semantic analysis may refer to:
• Semantic analysis (compilers)
• Semantic analysis (machine learning)
• Semantic analysis (knowledge representation)
• Semantic analysis (linguistics)
Semantic Analysis Approach for Recommendation
Recommendation systems are in common demand in daily life, and matrix completion is a widely adopted technique for this task. However, most matrix completion methods lack semantic interpretation and usually result in weak-semantic recommendations. To this end, this paper proposes a Semantic Analysis approach for Recommendation systems (SAR), which applies a two-level hierarchical generative process that assigns semantic properties and categories for user and item. SAR learns semantic representations of users/items merely from user ratings on items, which offers a new path to recommendation by semantic matching with the learned representations. Extensive experiments demonstrate SAR outperforms other state-of-the-art baselines substantially.
Semantic Analytics Semantic analytics is the use of ontologies to analyze content in web resources. This field of research combines text analytics and Semantic Web technologies like RDF.
Semantic Analytics Visualization
In this paper we present a new tool for semantic analytics through 3D visualization called ‘Semantic Analytics Visualization’ (SAV). It has the capability for visualizing ontologies and meta-data including annotated webdocuments, images, and digital media such as audio and video clips in a synthetic three-dimensional semi-immersive environment. More importantly, SAV supports visual semantic analytics, whereby an analyst can interactively investigate complex relationships between heterogeneous information. The tool is built using Virtual Reality technology which makes SAV a highly interactive system. The backend of SAV consists of a Semantic Analytics system that supports query processing and semantic association discovery. Using a virtual laser pointer, the user can select nodes in the scene and either play digital media, display images, or load annotated web documents. SAV can also display the ranking of web documents as well as the ranking of paths (sequences of links). SAV supports dynamic specification of sub-queries of a given graph and displays the results based on ranking information, which enables the users to find, analyze and comprehend the information presented quickly and accurately.
Semantic Correspondences Convolutional Neural Network
This paper addresses the problem of establishing semantic correspondences between images depicting different instances of the same object or scene category. Previous approaches focus on either combining a spatial regularizer with hand-crafted features, or learning a correspondence model for appearance only. We propose instead a convolutional neural network architecture, called SCNet, for learning a geometrically plausible model for semantic correspondence. SCNet uses region proposals as matching primitives, and explicitly incorporates geometric consistency in its loss function. It is trained on image pairs obtained from the PASCAL VOC 2007 keypoint dataset, and a comparative evaluation on several standard benchmarks demonstrates that the proposed approach substantially outperforms both recent deep learning architectures and previous methods based on hand-crafted features.
Semantic Differential Semantic differential is a type of a rating scale designed to measure the connotative meaning of objects, events, and concepts. The connotations are used to derive the attitude towards the given object, event or concept.
Osgood’s semantic differential was an application of his more general attempt to measure the semantics or meaning of words, particularly adjectives, and their referent concepts. The respondent is asked to choose where his or her position lies, on a scale between two bipolar adjectives (for example: “Adequate-Inadequate”, “Good-Evil” or “Valuable-Worthless”). Semantic differentials can be used to measure opinions, attitudes and values on a psychometrically controlled scale.
Semantic Entity Retrieval Toolkit
Unsupervised learning of low-dimensional, semantic representations of words and entities has recently gained attention. In this paper we describe the Semantic Entity Retrieval Toolkit (SERT) that provides implementations of our previously published entity representation models. The toolkit provides a unified interface to different representation learning algorithms, fine-grained parsing configuration and can be used transparently with GPUs. In addition, users can easily modify existing models or implement their own models in the framework. After model training, SERT can be used to rank entities according to a textual query and extract the learned entity/word representation for use in downstream algorithms, such as clustering or recommendation.
Semantic Evaluation
SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. While meaning is intuitive to humans, transferring those intuitions to computational analysis has proved elusive. This series of evaluations is providing a mechanism to characterize in more precise terms exactly what is necessary to compute in meaning. As such, the evaluations provide an emergent mechanism to identify the problems and solutions for computations with meaning. These exercises have evolved to articulate more of the dimensions that are involved in our use of language. They began with apparently simple attempts to identify word senses computationally. They have evolved to investigate the interrelationships among the elements in a sentence (e.g., semantic role labeling), relations between sentences (e.g., coreference), and the nature of what we are saying (semantic relations and sentiment analysis). The purpose of the SemEval exercises and SENSEVAL is to evaluate semantic analysis systems. ‘Semantic Analysis’ refers to a formal analysis of meaning, and ‘computational’ refers to approaches that in principle support effective implementation. The first three evaluations, Senseval-1 through Senseval-3, were focused on word sense disambiguation, each time growing in the number of languages offered in the tasks and in the number of participating teams. Beginning with the fourth workshop, SemEval-2007 (SemEval-1), the nature of the tasks evolved to include semantic analysis tasks outside of word sense disambiguation. Triggered by the conception of the *SEM conference, the SemEval community decided to hold the evaluation workshops yearly in association with the *SEM conference. It was also decided that not every evaluation task would be run every year, e.g. none of the WSD tasks were run in the SemEval-2012 workshop.
Semantic Integration Semantic integration is the process of interrelating information from diverse sources, for example calendars and to do lists, email archives, presence information (physical, psychological, and social), documents of all sorts, contacts (including social graphs), search results, and advertising and marketing relevance derived from them. In this regard, semantics focuses on the organization of and action upon information by acting as an intermediary between heterogeneous data sources, which may conflict not only by structure but also context or value.
Semantic Layer A semantic layer is a business representation of corporate data that helps end users access data autonomously using common business terms. Developed and patented by Business Objects, it maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization. By using common business terms, rather than data language, to access, manipulate, and organize information, it simplifies the complexity of business data. These business terms are stored as objects in a universe, accessed through business views. Universes enable business users to access and analyze data stored in a relational database and OLAP cubes. This is claimed to be core business intelligence (BI) technology that frees users from IT while ensuring correct results. Business Views is a multi-tier system that is designed to enable companies to build comprehensive and specific business objects that help report designers and end users access the information they require. Business Views is intended to enable people to add the necessary business context to their data islands and link them into a single organized Business View for their organization. Semantic layer maps tables to classes and columns to objects.
Semantic Learning Machine
In iterative supervised learning algorithms it is common to reach a point in the search where no further induction seems to be possible with the available data. If the search is continued beyond this point, the risk of overfitting increases significantly. Following the recent developments in inductive semantic stochastic methods, this paper studies the feasibility of using information gathered from the semantic neighborhood to decide when to stop the search. Two semantic stopping criteria are proposed and experimentally assessed in Geometric Semantic Genetic Programming (GSGP) and in the Semantic Learning Machine (SLM) algorithm (the equivalent algorithm for neural networks). The experiments are performed on real-world high-dimensional regression datasets. The results show that the proposed semantic stopping criteria are able to detect stopping points that result in a competitive generalization for both GSGP and SLM. This approach also yields computationally efficient algorithms as it allows the evolution of neural networks in less than 3 seconds on average, and of GP trees in at most 10 seconds. The usage of the proposed semantic stopping criteria in conjunction with the computation of optimal mutation/learning steps also results in small trees and neural networks.
Semantic Lexicon A semantic lexicon is a dictionary of words labeled with semantic classes so associations can be drawn between words that have not previously been encountered: it is a dictionary with a semantic network.
Semantic Matching Semantic matching is a technique used in computer science to identify information which is semantically related. Given any two graph-like structures, e.g. classifications, taxonomies, database or XML schemas and ontologies, matching is an operator which identifies those nodes in the two structures which semantically correspond to one another. For example, applied to file systems it can identify that a folder labeled “car” is semantically equivalent to another folder “automobile” because they are synonyms in English. This information can be taken from a linguistic resource like WordNet. In recent years many such matching operators have been offered. S-Match is an example of a semantic matching operator. It works on lightweight ontologies, namely graph structures where each node is labeled by a natural language sentence, for example in English. These sentences are translated into a formal logical formula (according to an artificial unambiguous language) codifying the meaning of the node taking into account its position in the graph. For example, if the folder “car” is under another folder “red”, we can say that the meaning of the folder “car” is “red car”. This is translated into the logical formula “red AND car”. The output of S-Match is a set of semantic correspondences called mappings, each attached with one of the following semantic relations: disjointness (⊥), equivalence (≡), more specific (⊑) and less specific (⊒). In our example the algorithm will return a mapping between “car” and “automobile” attached with an equivalence relation. Information semantically matched can also be used as a measure of relevance through a mapping of near-term relationships. Such use of S-Match technology is prevalent in the career space where it is used to gauge depth of skills through relational mapping of information found in applicant resumes.
Semantic matching represents a fundamental technique in many applications in areas such as resource discovery, data integration, data migration, query translation, peer to peer networks, agent communication, schema and ontology merging. Its use is also being investigated in other areas such as event processing. In fact, it has been proposed as a valid solution to the semantic heterogeneity problem, namely managing the diversity in knowledge. Interoperability among people of different cultures and languages, having different viewpoints and using different terminology has always been a huge problem. Especially with the advent of the Web and the consequential information explosion, the problem seems to be emphasized. People face the concrete problem of retrieving, disambiguating and integrating information coming from a wide variety of sources.
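The folder example from this entry can be mimicked with a toy matcher. This is not the S-Match algorithm: the synonym table is a hypothetical stand-in for a resource such as WordNet, node meanings are reduced to plain conjunctions of concepts, and disjointness is not inferred at all.

```python
# Hypothetical synonym table standing in for a linguistic resource such as WordNet.
SYNONYMS = {"automobile": "car", "auto": "car", "crimson": "red"}


def concept(label):
    """Normalise a label to a canonical concept name."""
    word = label.lower()
    return SYNONYMS.get(word, word)


def node_formula(path):
    """Meaning of a node as the conjunction of the concepts on its path from the root.

    ["red", "car"] stands for the formula "red AND car".
    """
    return frozenset(concept(label) for label in path)


def match(path_a, path_b):
    """Relation of node A to node B, or None when this toy model cannot decide."""
    a, b = node_formula(path_a), node_formula(path_b)
    if a == b:
        return "equivalence"
    if a > b:  # A carries extra conjuncts, so it denotes a narrower concept
        return "more specific"
    if a < b:
        return "less specific"
    return None


same = match(["car"], ["automobile"])
narrower = match(["red", "car"], ["automobile"])
```

Representing a conjunction as a set makes the subset test a direct stand-in for logical entailment between two conjunctions, which is enough for the “car” / “red car” example but far short of a general reasoner.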
Semantic Memory Semantic memory refers to a portion of long-term memory that processes ideas and concepts that are not drawn from personal experience. Semantic memory includes things that are common knowledge, such as the names of colors, the sounds of letters, the capitals of countries and other basic facts acquired over a lifetime. The concept of semantic memory is fairly new. It was introduced in 1972 as the result of collaboration between Endel Tulving of the University of Toronto and Wayne Donaldson of the University of New Brunswick on the impact of organization in human memory. Tulving outlined the separate systems of conceptualization of episodic and semantic memory in his book, ‘Elements of Episodic Memory.’ He noted that semantic and episodic differ in how they operate and the types of information they process.
Semantic Textual Similarity
The goal of the Semantic Textual Similarity (STS) task is to create a unified framework for the evaluation of semantic textual similarity modules and to characterize their impact on NLP applications. STS measures the degree of semantic equivalence. We are proposing the STS task as an attempt at creating a unified framework that allows for an extrinsic evaluation of multiple semantic components that otherwise have historically tended to be evaluated independently and without characterization of impact on NLP applications. STS is related to both Textual Entailment (TE) and Paraphrase, but differs in a number of ways and it is more directly applicable to a number of NLP tasks. STS is different from TE inasmuch as it assumes bidirectional graded equivalence between the pair of textual snippets. In the case of TE the equivalence is directional, e.g. a car is a vehicle, but a vehicle is not necessarily a car. STS also differs from both TE and Paraphrase in that, rather than being a binary yes/no decision (e.g. a vehicle is not a car), STS is a graded similarity notion (e.g. a vehicle and a car are more similar than a wave and a car). This graded bidirectional nature of STS is useful for NLP tasks such as MT evaluation, information extraction, question answering, and summarization. Current textual similarity systems are limited in the scope of similarity they can address, mostly lexical and syntactic similarity. Some other linguistic phenomena have rarely been addressed in isolated efforts, e.g. metaphorical or idiomatic language (John spilled his guts to Mary, vs. John told Mary all about his stories/life), scoping and under-specification (Every representative of the company saw every sample), sentences where the structure is very divergent (The annihilation of Rome in 2000 BC was incurred by an insurgency of the slaves. Vs. The slaves’ revolution 2 millennia before Christ destroyed the capital of the Roman Empire.), and various modality phenomena such as committed belief, permission or negation. The STS task would like to foster joint research efforts on these, to date, fragmented areas.
Semantic Vector Spaces
Semidefinite Programming
Semidefinite programming (SDP) is a subfield of convex optimization concerned with the optimization of a linear objective function (that is, a function to be maximized or minimized) over the intersection of the cone of positive semidefinite matrices with an affine space, i.e., a spectrahedron.
Semidefinite programming is a relatively new field of optimization which is of growing interest for several reasons. Many practical problems in operations research and combinatorial optimization can be modeled or approximated as semidefinite programming problems. In automatic control theory, SDPs are used in the context of linear matrix inequalities. SDPs are in fact a special case of cone programming and can be efficiently solved by interior point methods. All linear programs can be expressed as SDPs, and via hierarchies of SDPs the solutions of polynomial optimization problems can be approximated. Semidefinite programming has been used in the optimization of complex systems. In recent years, some quantum query complexity problems have been formulated in terms of semidefinite programs.
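For reference, one common way to write an SDP is the standard primal form below. The symbols ($C$, $A_i$, $b_i$, $X$, and $\mathbb{S}^n$ for the symmetric $n \times n$ matrices) are notation assumed here and are not taken from the entry.

```latex
\begin{aligned}
\min_{X \in \mathbb{S}^n} \quad & \langle C, X \rangle = \operatorname{tr}(C X) \\
\text{subject to} \quad & \langle A_i, X \rangle = b_i, \qquad i = 1, \dots, m, \\
& X \succeq 0 .
\end{aligned}
```

The linear equalities describe the affine space and $X \succeq 0$ the positive semidefinite cone, so the feasible set is their intersection. If $C$ and every $A_i$ are diagonal, only the diagonal of $X$ matters and $X \succeq 0$ reduces to entrywise nonnegativity of that diagonal, which is how a linear program appears as a special case.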
Semi-Supervised Active Clustering
We propose a framework for Semi-Supervised Active Clustering (SSAC), where the learner is allowed to interact with a domain expert, asking whether two given instances belong to the same cluster or not. We study the query and computational complexity of clustering in this framework. We consider a setting where the expert conforms to a center-based clustering with a notion of margin. We show that there is a trade-off between computational complexity and query complexity: we prove that for the case of $k$-means clustering (i.e., when the expert conforms to a solution of $k$-means), having access to relatively few such queries allows efficient solutions to otherwise NP-hard problems. In particular, we provide a probabilistic polynomial-time (BPP) algorithm for clustering in this setting that asks $O\big(k^2\log k + k\log n\big)$ same-cluster queries and runs with time complexity $O\big(kn\log n\big)$ (where $k$ is the number of clusters and $n$ is the number of instances). The success of the algorithm is guaranteed for data satisfying margin conditions under which, without queries, we show that the problem is NP-hard. We also prove a lower bound on the number of queries needed to have a computationally efficient clustering algorithm in this setting.
Semi-Supervised GAN
We introduce a new model for building conditional generative models in a semi-supervised setting to conditionally generate data given attributes by adapting the GAN framework. The proposed semi-supervised GAN (SS-GAN) model uses a pair of stacked discriminators to learn the marginal distribution of the data, and the conditional distribution of the attributes given the data respectively. In the semi-supervised setting, the marginal distribution (which is often harder to learn) is learned from the labeled + unlabeled data, and the conditional distribution is learned purely from the labeled data. Our experimental results demonstrate that this model performs significantly better compared to existing semi-supervised conditional GAN models.
Semi-Supervised Learning Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.
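One simple way to use unlabeled data is self-training: fit a model on the labeled points, label the unlabeled points with it, and refit on everything. The sketch below does this with a one-dimensional nearest-centroid classifier. Self-training is only one of many semi-supervised techniques, and the function names and data are hypothetical.

```python
def centroids(points, labels):
    """Mean of the points in each class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(cents, x):
    """Label of the nearest class centroid."""
    return min(cents, key=lambda y: abs(x - cents[y]))


def self_train(labeled, unlabeled, rounds=5):
    """Alternate between pseudo-labeling the unlabeled points and refitting centroids."""
    points = [x for x, _ in labeled]
    labels = [y for _, y in labeled]
    cents = centroids(points, labels)  # start from the labeled data only
    for _ in range(rounds):
        pseudo = [predict(cents, x) for x in unlabeled]
        cents = centroids(points + unlabeled, labels + pseudo)
    return cents


# Two labeled points that sit near the class boundary, plus six unlabeled points.
labeled = [(4.0, "a"), (6.0, "b")]
unlabeled = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
supervised_only = centroids([x for x, _ in labeled], [y for _, y in labeled])
semi_supervised = self_train(labeled, unlabeled)
```

With the labeled points alone the centroids sit at 4 and 6; after self-training they move toward the bulk of each cluster, which illustrates how a few labels plus many unlabeled points can sharpen a model. Wrong pseudo-labels can also reinforce themselves, so this approach assumes the classes are reasonably well separated.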
Semi-Supervised Novelty Detection
A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well-known problems in hypothesis testing. First, our results provide a general solution to the two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions.
SEmi-supervised VErification Network
Verification determines whether two samples belong to the same class or not, and has important applications such as face and fingerprint verification, where thousands or millions of categories are present but each category has scarce labeled examples, presenting two major challenges for existing deep learning models. We propose a deep semi-supervised model named SEmi-supervised VErification Network (SEVEN) to address these challenges. The model consists of two complementary components. The generative component addresses the lack of supervision within each category by learning general salient structures from a large amount of data across categories. The discriminative component exploits the learned general features to mitigate the lack of supervision within categories, and also directs the generative component to find more informative structures of the whole data manifold. The two components are tied together in SEVEN to allow an end-to-end training of the two components. Extensive experiments on four verification tasks demonstrate that SEVEN significantly outperforms other state-of-the-art deep semi-supervised techniques when labeled data are in short supply. Furthermore, SEVEN is competitive with fully supervised baselines trained with a larger amount of labeled data. It indicates the importance of the generative component in SEVEN.
SenGen We present a new topic model that generates documents by sampling a topic for one whole sentence at a time, and generating the words in the sentence using an RNN decoder that is conditioned on the topic of the sentence. We argue that this novel formalism will help us not only visualize and model the topical discourse structure in a document better, but also potentially lead to more interpretable topics since we can now illustrate topics by sampling representative sentences instead of bag of words or phrases. We present a variational auto-encoder approach for learning in which we use a factorized variational encoder that independently models the posterior over topical mixture vectors of documents using a feed-forward network, and the posterior over topic assignments to sentences using an RNN. Our preliminary experiments on two different datasets indicate early promise, but also expose many challenges that remain to be addressed.
Sensitivity Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as classification function. Sensitivity (also called the true positive rate, or the recall rate in some fields) measures the proportion of actual positives which are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition). Specificity (sometimes called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having the condition). These two measures are closely related to the concepts of type I and type II errors. A perfect predictor would be described as 100% sensitive (i.e. predicting all people from the sick group as sick) and 100% specific (i.e. not predicting anyone from the healthy group as sick); however, theoretically any predictor will possess a minimum error bound known as the Bayes error rate.
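As a small illustration of the two definitions above, the following sketch counts true/false positives and negatives and derives both rates. The function name and the toy labels are made up for this example; 1 is assumed to denote the positive (e.g. sick) class.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, flagged as sick
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged as sick
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Toy data: 4 sick and 6 healthy people (hypothetical).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Here three of the four sick people are detected and four of the six healthy people are cleared, so the two measures can differ considerably for the same test.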
Sensitivity Analysis Sensitivity analysis is the study of how the uncertainty in the output of a mathematical model or system (numerical or otherwise) can be apportioned to different sources of uncertainty in its inputs. A related practice is uncertainty analysis, which has a greater focus on uncertainty quantification and propagation of uncertainty. Ideally, uncertainty and sensitivity analysis should be run in tandem. Sensitivity analysis can be useful for a range of purposes, including: testing the robustness of the results of a model or system in the presence of uncertainty; increasing understanding of the relationships between input and output variables in a system or model; uncertainty reduction, that is, identifying model inputs that cause significant uncertainty in the output and should therefore be the focus of attention if the robustness is to be increased (perhaps by further research); searching for errors in the model (by encountering unexpected relationships between inputs and outputs); model simplification, that is, fixing model inputs that have no effect on the output, or identifying and removing redundant parts of the model structure; enhancing communication from modelers to decision makers (e.g. by making recommendations more credible, understandable, compelling or persuasive); and finding regions in the space of input factors for which the model output is either maximum or minimum or meets some optimum criterion. When calibrating models with a large number of parameters, a preliminary sensitivity test can ease the calibration stage by focusing on the sensitive parameters. Not knowing the sensitivity of parameters can result in time being uselessly spent on non-sensitive ones. Taking an example from economics, in any budgeting process there are always variables that are uncertain. Future tax rates, interest rates, inflation rates, headcount, operating expenses and other variables may not be known with great precision.
Sensitivity analysis answers the question, ‘if these variables deviate from expectations, what will the effect be (on the business, model, system, or whatever is being analyzed), and which variables are causing the largest deviations?’
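The budgeting question can be made concrete with a one-at-a-time (OAT) sketch: each input is increased by 10% while the others stay at their baseline, and the inputs are ranked by the size of the resulting change in the output. OAT is only one of several sensitivity analysis techniques, and the profit model, its parameter names, and the baseline figures below are all invented for illustration.

```python
def profit(params):
    """Toy budget model (hypothetical): after-tax profit."""
    revenue = params["headcount"] * params["revenue_per_head"]
    costs = params["operating_expenses"] * (1 + params["inflation_rate"])
    return (revenue - costs) * (1 - params["tax_rate"])


def one_at_a_time(model, baseline, rel_change=0.10):
    """Perturb each input by rel_change and record the change in the output."""
    base_out = model(baseline)
    effects = {}
    for name, value in baseline.items():
        perturbed = dict(baseline)            # copy so the baseline is untouched
        perturbed[name] = value * (1 + rel_change)
        effects[name] = model(perturbed) - base_out
    # Largest absolute effect first.
    return dict(sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True))


baseline = {
    "headcount": 100,
    "revenue_per_head": 50_000.0,
    "operating_expenses": 3_000_000.0,
    "inflation_rate": 0.03,
    "tax_rate": 0.25,
}
effects = one_at_a_time(profit, baseline)
for name, delta in effects.items():
    print(f"{name:>20}: {delta:+,.0f}")
```

In this toy model the inflation rate turns out to be the least influential input, so calibration effort would be better spent on headcount and revenue per head. OAT ignores interactions between inputs; variance-based methods are a common alternative when interactions matter.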
Sensor Transformation Attention Networks Recent work on encoder-decoder models for sequence-to-sequence mapping has shown that integrating both temporal and spatial attention mechanisms into neural networks increases the performance of the system substantially. In this work, we report on the application of an attentional signal not on temporal and spatial regions of the input, but instead as a method of switching among inputs themselves. We evaluate the particular role of attentional switching in the presence of dynamic noise in the sensors, and demonstrate how the attentional signal responds dynamically to changing noise levels in the environment to achieve increased performance on both audio and visual tasks in three commonly-used datasets: TIDIGITS, Wall Street Journal, and GRID. Moreover, the proposed sensor transformation network architecture naturally introduces a number of advantages that merit exploration, including ease of adding new sensors to existing architectures, attentional interpretability, and increased robustness in a variety of noisy environments not seen during training. Finally, we demonstrate that the sensor selection attention mechanism of a model trained only on the small TIDIGITS dataset can be transferred directly to a pre-existing larger network trained on the Wall Street Journal dataset, maintaining functionality of switching between sensors to yield a dramatic reduction of error in the presence of noise.
Sentic Computing Sentic computing is a multi-disciplinary approach to natural language processing and understanding at the crossroads between affective computing, information extraction, and common-sense computing, which exploits both computer and social sciences to better interpret and process information on the Web. In sentic computing, whose term derives from the Latin ‘sentire’ (root of words such as sentiment and sentience) and ‘sensus’ (as in common-sense), the analysis of natural language is based on affective ontologies and common-sense reasoning tools, which enable the analysis of text not only at document-, page- or paragraph-level, but also at sentence-, clause-, and concept-level. In particular, sentic computing involves the use of AI and Semantic Web techniques, for knowledge representation and inference; mathematics, for carrying out tasks such as graph mining and multi-dimensionality reduction; linguistics, for discourse analysis and pragmatics; psychology, for cognitive and affective modeling; sociology, for understanding social network dynamics and social influence; and finally ethics, for understanding related issues about the nature of mind and the creation of emotional machines. Sentic computing adopts the bag-of-concepts model instead of simply counting word co-occurrence frequencies in text. Working at concept-level entails preserving the meaning carried by multi-word expressions such as ‘cloud computing’, which represent semantic atoms that should never be broken down into single words. In the bag-of-words model, for example, the concept ‘cloud computing’ would be split into ‘computing’ and ‘cloud’, which may wrongly activate concepts related to the weather and, hence, compromise categorization accuracy.
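The difference between the two representations can be sketched with a toy tokenizer. The greedy longest-match lookup and the one-entry concept list below are assumptions made for this example; they are not the concept extraction actually used in sentic computing.

```python
from collections import Counter


def bag_of_words(text):
    """Count single whitespace-separated tokens."""
    return Counter(text.lower().split())


def bag_of_concepts(text, concepts):
    """Greedy longest-match against a (hypothetical) list of multi-word concepts."""
    tokens = text.lower().split()
    max_len = max((len(c.split()) for c in concepts), default=1)
    bag = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest span first; a single token always matches as a fallback.
        for n in range(max_len, 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or (i + n <= len(tokens) and candidate in concepts):
                bag[candidate] += 1
                i += n
                break
    return bag


text = "cloud computing lowers costs"
concepts = {"cloud computing"}
print(bag_of_words(text))               # 'cloud' and 'computing' counted separately
print(bag_of_concepts(text, concepts))  # 'cloud computing' kept as one unit
```

With the bag of words, a weather-related classifier could be triggered by the isolated token 'cloud'; the bag of concepts never produces that token for this sentence.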
Sentient Enterprise The continued explosion of data and the continued evolution of analytics capabilities might usher in the next analytics revolution beyond the Intelligent Enterprise: an evolution of analytics capabilities towards an ideal state called ‘The Sentient Enterprise’. The Sentient Enterprise is an enterprise that can listen to data, conduct analysis and make autonomous decisions at massive scale in real-time. The Sentient Enterprise can listen to data to sense micro-trends. It can act as one organism without being impeded by information silos. It can make autonomous decisions with little or no human intervention. It is always evolving, with emergent intelligence that becomes progressively more sophisticated.
Sequence and Set Similarity Measure
In many data mining applications, both classification and clustering algorithms require a distance/similarity measure. The central problem in similarity based clustering/classification comprising sequential data is deciding an appropriate similarity metric. The existing metrics like Euclidean, Jaccard, Cosine, and so forth do not exploit the sequential nature of data explicitly. In this paper, the authors propose a similarity preserving function called Sequence and Set Similarity Measure (S3M) that captures both the order of occurrence of items in sequences and the constituent items of sequences.
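The entry above does not state the formula, so the following is only a sketch of how a measure of this kind can be built: an order-sensitive term based on the longest common subsequence (LCS) is mixed with an order-insensitive Jaccard term on the item sets. The function names, the choice of LCS, and the weight `p` are assumptions of this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sequence_set_similarity(a, b, p=0.5):
    """Weighted mix of an order term (LCS-based) and a content term (Jaccard)."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    order_sim = lcs_length(a, b) / max(len(a), len(b))
    set_a, set_b = set(a), set(b)
    set_sim = len(set_a & set_b) / len(set_a | set_b)
    return p * order_sim + (1 - p) * set_sim


forward = ["a", "b", "c", "d"]
backward = ["d", "c", "b", "a"]
print(sequence_set_similarity(forward, forward))   # identical order and content
print(sequence_set_similarity(forward, backward))  # same items, reversed order
```

A purely set-based metric such as Jaccard would treat the reversed sequence as identical to the original; the order term is what pulls the combined score down.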
Sequence Graph Transform
The ubiquitous presence of sequence data across fields such as the web, healthcare, bioinformatics, and text mining has made sequence mining a vital research area. However, sequence mining is particularly challenging because of the absence of an accurate and fast approach to find (dis)similarity between sequences. As a measure of (dis)similarity, mainstream data mining methods like k-means, kNN, regression, etc., have proved distance between data points in a Euclidean space to be most effective. But a distance measure between sequences is not obvious due to their unstructuredness: they are arbitrary strings of arbitrary length. We therefore propose a new function, called Sequence Graph Transform (SGT), that extracts sequence features and embeds them in a finite-dimensional Euclidean space. It is scalable due to a low computational complexity and has a universal applicability on any sequence problem. We theoretically show that SGT can capture both short and long patterns in sequences, and provides an accurate distance-based measure of (dis)similarity between them. This is also validated experimentally. Finally, we show its real-world application for clustering, classification, search and visualization on different sequence problems.
Sequence Mixed Graphs
A mixed graph can be seen as a type of digraph containing some edges (two opposite arcs). Here we introduce the concept of sequence mixed graphs, which is a generalization of both sequence graphs and iterated line digraphs. These structures are proven to be useful in the problem of constructing dense graphs or digraphs, and this is related to the degree/diameter problem. Thus, our generalized approach gives rise to graphs that have also good ratio order/diameter. Moreover, we propose a general method for obtaining a sequence mixed digraph by identifying some vertices of a certain iterated line digraph. As a consequence, some results about distance-related parameters (mainly, the diameter and the average distance) of sequence mixed graphs are presented.
Sequential Adaptive Nonlinear Modeling of Vector Time Series
We propose a method for adaptive nonlinear sequential modeling of vector-time series data. Data is modeled as a nonlinear function of past values corrupted by noise, and the underlying non-linear function is assumed to be approximately expandable in a spline basis. We cast the modeling of data as finding a good fit representation in the linear span of multi-dimensional spline basis, and use a variant of l1-penalty regularization in order to reduce the dimensionality of representation. Using adaptive filtering techniques, we design our online algorithm to automatically tune the underlying parameters based on the minimization of the regularized sequential prediction error. We demonstrate the generality and flexibility of the proposed approach on both synthetic and real-world datasets. Moreover, we analytically investigate the performance of our algorithm by obtaining both bounds of the prediction errors, and consistency results for variable selection.
Sequential Analysis In statistics, sequential analysis or sequential hypothesis testing is statistical analysis where the sample size is not fixed in advance. Instead data is evaluated as it is collected, and further sampling is stopped in accordance with a pre-defined stopping rule as soon as significant results are observed. Thus a conclusion may sometimes be reached at a much earlier stage than would be possible with more classical hypothesis testing or estimation, at consequently lower financial and/or human cost.
Sequential Backward Selection
The Sequential Backward Selection (SBS) algorithm is very similar to the Sequential Forward Selection (SFS). The only difference is that we start with the complete feature set instead of the “null set” and remove features sequentially until we reach the number of desired features k. Note that features are never added back once they were removed, which (similar to SFS) is one of the biggest downsides of this algorithm.
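A minimal sketch of the backward procedure described above, assuming the criterion function is any callable that scores a feature subset (higher is better). The feature names and the toy criterion are invented for this example; in practice the criterion would typically be a cross-validated model score.

```python
def sequential_backward_selection(features, criterion, k):
    """Start from all features and greedily remove one at a time until k remain."""
    selected = list(features)
    while len(selected) > k:
        # Pick the feature whose removal leaves the best-scoring subset.
        to_drop = max(
            selected,
            key=lambda f: criterion([g for g in selected if g != f]),
        )
        selected.remove(to_drop)  # removed features are never added back
    return selected


# Hypothetical relevance scores; 'income_copy' is redundant with 'income'.
RELEVANCE = {"age": 5, "income": 4, "zip": 1, "income_copy": 3}


def toy_criterion(subset):
    score = sum(RELEVANCE[f] for f in subset)
    if "income" in subset and "income_copy" in subset:
        score -= 3  # no credit for a redundant copy
    return score


FEATURES = ["age", "income", "zip", "income_copy"]
print(sequential_backward_selection(FEATURES, toy_criterion, k=2))
```

Each pass evaluates one candidate subset per remaining feature, so reducing d features to k costs on the order of d² criterion evaluations in the worst case.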
Sequential Bayesian Additive Regression Trees
“Bayesian Additive Regression Trees”
Sequential Dynamical System
Sequential dynamical systems (SDSs) are a class of graph dynamical systems. They are discrete dynamical systems which generalize many aspects of, for example, classical cellular automata, and they provide a framework for studying asynchronous processes over graphs. The analysis of SDSs uses techniques from combinatorics, abstract algebra, graph theory, dynamical systems and probability theory.
Sequential Floating Backward Selection
Just as in the Sequential Floating Forward Selection (SFFS) algorithm, we have a conditional step: Here, we start with the whole feature subset and exclude features sequentially. Only if adding one of the previously excluded features back to a new feature subset improves the performance (assessed by the criterion function), we add it back in the Conditional Inclusion step.
Sequential Floating Forward Selection
The Sequential Floating Forward Selection (SFFS) algorithm can be considered an extension of the simpler Sequential Forward Selection (SFS) algorithm. In contrast to SFS, the SFFS algorithm can remove features once they were included, so that a larger number of feature subset combinations can be sampled. It is important to emphasize that the removal of included features is conditional, which makes it different from the +L -R algorithm. The Conditional Exclusion in SFFS only occurs if the resulting feature subset is assessed as “better” by the criterion function after removal of a particular feature.
Sequential Forward Selection
The Sequential Forward Selection (SFS) is one of the simplest and probably fastest Feature Selection algorithms. Let’s summarize its mechanics in words: SFS starts with an empty feature subset and sequentially adds features from the whole input feature space to this subset until the subset reaches a desired (user-specified) size. For every iteration (= inclusion of a new feature), the whole feature space is evaluated (except for the features that are already included in the new subset). The evaluation is done by the so-called criterion function which assesses the feature that leads to the maximum performance improvement of the feature subset if it is included. Note that included features are never removed, which is one of the biggest downsides of this algorithm.
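The mechanics described above can be sketched in a few lines, assuming the criterion function is any callable that scores a feature subset (higher is better). The feature names and the toy criterion are invented for this example; a real criterion would usually be a cross-validated model score.

```python
def sequential_forward_selection(features, criterion, k):
    """Start from the empty set and greedily add one feature at a time until k are chosen."""
    selected = []
    remaining = list(features)
    while len(selected) < k and remaining:
        # Evaluate every feature not yet included and keep the best addition.
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)   # included features are never removed
        remaining.remove(best)
    return selected


# Hypothetical relevance scores; 'income_copy' is redundant with 'income'.
RELEVANCE = {"age": 5, "income": 4, "zip": 1, "income_copy": 3}


def toy_criterion(subset):
    score = sum(RELEVANCE[f] for f in subset)
    if "income" in subset and "income_copy" in subset:
        score -= 3  # no credit for a redundant copy
    return score


FEATURES = ["age", "income", "zip", "income_copy"]
print(sequential_forward_selection(FEATURES, toy_criterion, k=3))
```

Because the search is greedy, a feature chosen early stays in the subset even if it later becomes redundant, which is the downside noted above.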
Sequential Input Selection Algorithm
In time series prediction, making accurate predictions is often the primary goal. At the same time, interpretability of the models would be desirable. For the latter goal, we have devised a sequential input selection algorithm (SISAL) to choose a parsimonious, or sparse, set of input variables. Our proposed algorithm is a sequential backward selection type algorithm based on a cross-validation resampling procedure. Our strategy is to use a filter approach in the prediction: first we select a sparse set of inputs using linear models and then the selected inputs are used in the nonlinear prediction conducted with multilayer-perceptron networks. Furthermore, we perform a sensitivity analysis by quantifying the importance of the individual input variables in the nonlinear models using a method based on partial derivatives. Experiments are done with the Santa Fe laser data set that exhibits very nonlinear behavior and a data set in a problem of electricity load prediction. The results in the prediction problems of varying difficulty highlight the range of applicability of our proposed algorithm. In summary, our SISAL yields accurate and parsimonious prediction models giving insight to the original problem.
Sequential Match Network We study response selection for multi-turn conversation in retrieval-based chatbots. Existing work either ignores relationships among utterances, or misses important information in the context when a response is finally matched with a highly abstract context vector. We propose a new session-based matching model to address both problems. The model first matches a response with each utterance on multiple granularities, and distills important matching information from each pair as a vector with convolution and pooling operations. The vectors are then accumulated in chronological order through a recurrent neural network (RNN) which models the relationships among the utterances. The final matching score is calculated with the hidden states of the RNN. An empirical study on two public data sets shows that our model can significantly outperform the state-of-the-art methods for response selection in multi-turn conversation.
Sequential Monte Carlo
Particle filters or Sequential Monte Carlo (SMC) methods are a set of on-line posterior density estimation algorithms that estimate the posterior density of the state-space by directly implementing the Bayesian recursion equations. The term ‘sequential Monte Carlo’ was first coined in Liu and Chen (1998). SMC methods use a sampling approach, with a set of particles to represent the posterior density. The state-space model can be non-linear and the initial state and noise distributions can take any form required. SMC methods provide a well-established methodology for generating samples from the required distribution without requiring assumptions about the state-space model or the state distributions. However, these methods do not perform well when applied to high-dimensional systems. SMC methods implement the Bayesian recursion equations directly by using an ensemble based approach. The samples from the distribution are represented by a set of particles; each particle has a weight assigned to it that represents the probability of that particle being sampled from the probability density function. Weight disparity leading to weight collapse is a common issue encountered in these filtering algorithms; however it can be mitigated by including a resampling step before the weights become too uneven. In the resampling step, the particles with negligible weights are replaced by new particles in the proximity of the particles with higher weights.
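The propagate, weight, and resample cycle can be illustrated with a bootstrap particle filter for a one-dimensional random walk observed with Gaussian noise. The state-space model, the parameter values, and the choice to resample at every step (rather than only when the weights become too uneven) are simplifying assumptions of this sketch.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              process_std=1.0, obs_std=1.0, seed=0):
    """Estimate the state of x_t = x_{t-1} + noise from y_t = x_t + noise."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate each particle through the state transition.
        particles = [x + rng.gauss(0.0, process_std) for x in particles]
        # 2. Weight each particle by the likelihood of the observation.
        weights = [math.exp(-0.5 * ((y - x) / obs_std) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:  # all likelihoods underflowed: fall back to uniform weights
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        # 3. Posterior mean estimate from the weighted particles.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # 4. Multinomial resampling: low-weight particles are replaced by
        #    copies of high-weight ones, which counters weight collapse.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


observations = [2.0] * 20
estimates = bootstrap_particle_filter(observations)
print(f"final estimate: {estimates[-1]:.2f}")
```

Resampling only when an effective-sample-size measure drops below a threshold is a common refinement that reduces the extra variance introduced by resampling at every step.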
Sequential Offsetted Regression
Sequential PAttern Discovery using Equivalence classes
In this paper we present SPADE, a new algorithm for fast discovery of Sequential Patterns. The existing solutions to this problem make repeated database scans, and use complex hash structures which have poor locality. SPADE utilizes combinatorial properties to decompose the original problem into smaller sub-problems that can be independently solved in main memory using efficient lattice search techniques and simple join operations. All sequences are discovered in only three database scans. Experiments show that SPADE outperforms the best previous algorithm by a factor of two, and by an order of magnitude with some pre-processed data. It also has linear scalability with respect to the number of input sequences, and a number of other database parameters.
Sequential Pattern Mining Sequential Pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. It is usually presumed that the values are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of structured data mining. There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members. In general, sequence mining problems can be classified as string mining which is typically based on string processing algorithms and itemset mining which is typically based on association rule learning.
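Extracting frequently occurring patterns rests on one basic operation: counting the fraction of sequences in a database that contain a candidate pattern as a (not necessarily contiguous) subsequence. The sketch below shows only that support computation on an invented toy database of single-item events; it is not a full mining algorithm.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    # 'item in it' advances the iterator, so each match must come after the previous one.
    return all(item in it for item in pattern)


def support(pattern, database):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database) / len(database)


# Hypothetical database: each sequence is an ordered list of events.
database = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a"],
    ["a", "b"],
]
print(support(["a", "b"], database))
print(support(["a"], database))
```

A pattern is called frequent when its support reaches a user-chosen minimum; mining algorithms differ mainly in how they enumerate candidates without testing every possible pattern this way.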
Sequential Principal Curves Analysis
This work includes all the technical details of the Sequential Principal Curves Analysis (SPCA) in a single document. SPCA is an unsupervised nonlinear and invertible feature extraction technique. The identified curvilinear features can be interpreted as a set of nonlinear sensors: the response of each sensor is the projection onto the corresponding feature. Moreover, it can be easily tuned for different optimization criteria (e.g. infomax, error minimization, decorrelation) by choosing the right way to measure distances along each curvilinear feature. Even though it was proposed and shown to work in multiple modalities in earlier publications, the SPCA framework has its original roots in an earlier nonlinear ICA algorithm. Later on, the SPCA philosophy for nonlinear generalization of PCA gave rise to substantially faster alternatives at the cost of introducing different constraints in the model, namely the Principal Polynomial Analysis (PPA) and the Dimensionality Reduction via Regression (DRR). This report illustrates the reasons why we developed this family of methods and is the appropriate technical companion for the details missing from those publications.
Sequential Probability Distribution fanplot
Sequential Probability Ratio Test
The sequential probability ratio test (SPRT) is a specific sequential hypothesis test, developed by Abraham Wald. Neyman and Pearson’s 1933 result inspired Wald to reformulate it as a sequential analysis problem. The Neyman-Pearson lemma, by contrast, offers a rule of thumb for when all the data is collected (and its likelihood ratio known). While originally developed for use in quality control studies in the realm of manufacturing, SPRT has been formulated for use in the computerized testing of human examinees as a termination criterion.
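As an illustration, the sketch below applies the SPRT to Bernoulli observations (for example pass/fail items), testing success probability p0 against p1. It accumulates the log-likelihood ratio and stops at Wald's approximate thresholds log((1 − β)/α) and log(β/(1 − α)); the function name, the default error rates, and the example probabilities are choices made for this example.

```python
import math


def sprt_bernoulli(samples, p0, p1, alpha=0.05, beta=0.05):
    """Sequentially test H0: p = p0 against H1: p = p1 on 0/1 observations.

    Returns (decision, number_of_samples_used).
    """
    upper = math.log((1 - beta) / alpha)  # cross it: accept H1
    lower = math.log(beta / (1 - alpha))  # cross it: accept H0
    llr = 0.0                             # cumulative log-likelihood ratio
    n = 0
    for n, x in enumerate(samples, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue sampling", n


print(sprt_bernoulli([1] * 20, p0=0.5, p1=0.8))
print(sprt_bernoulli([0] * 20, p0=0.5, p1=0.8))
print(sprt_bernoulli([1, 0, 1, 0], p0=0.5, p1=0.8))
```

With these settings a run of successes is enough to stop after 7 observations and a run of failures after 4, whereas a fixed-sample test would have consumed all 20.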
Sequential Subspace Optimization Boosting
We present SEBOOST, a technique for boosting the performance of existing stochastic optimization methods. SEBOOST applies a secondary optimization process in the subspace spanned by the last steps and descent directions. The method was inspired by the SESOP optimization method for large-scale problems, and has been adapted for the stochastic learning framework. It can be applied on top of any existing optimization method with no need to tweak the internal algorithm. We show that the method is able to boost the performance of different algorithms, and make them more robust to changes in their hyper-parameters. As the boosting steps of SEBOOST are applied between large sets of descent steps, the additional subspace optimization hardly increases the overall computational burden. We introduce two hyper-parameters that control the balance between the baseline method and the secondary optimization process. The method was evaluated on several deep learning tasks, demonstrating promising results.
Serial Correlation
Serial Dependence Diagrams
Service Mining Traditional service marketing and service science attempted to help companies understand what customers think and how companies deal with problems. However, a holistic framework and viewpoint to explore services differently is needed. Service mining provides a different perspective into the services industry. Professionals and practitioners also need various mindsets to investigate and analyze the evidence from services. According to the concept of service science, certain areas are involved such as economics, management, computer science, and engineering. This book provides a novel concept to combine the areas of social science and computer science in services. Service mining is a holistic concept covering a service’s lifecycle from design and experience to recovery and retention. Traditionally, the value of mining is to discover unknown and potential patterns from big data. Service mining focuses on the amount of data generated from the value co-creation process and features of services. The goal of service mining is to analyze any step in the service’s lifecycle and help enterprises reexamine each one. Companies can also utilize appropriate marketing or management methods to adjust biases and revise the errors of services.
Service mining is defined as ‘a systematical process including service discovery, service experience, service recovery and service retention to discover unique patterns and exceptional values within the existing service pool’. The goal of service mining is similar to data mining, text mining or web mining. All aim to ‘detect something new’ from the base being mined. Service mining targets the service pool. What distinguishes service mining from data or text mining is the concept service itself. Data is generally considered factual; text, though more nuanced in that words carry connotations, has a primary denotative quality which conveys meaning that text miners and the consumers of the mined text agree upon. Service, however, is trickier. It is a process of establishing a value proposition; and the value it represents is the joint creation of the provider and the customer, each of which offers a different perception in constructing the value proposition. Moreover, in the concept of service mining, the mining target is not only the traditional categories of services but also IT-based services. Under the big umbrella of service science, service mining is considered to be a branch of it.
Service With Delay Problem In this paper, we introduce the online service with delay problem. In this problem, there are $n$ points in a metric space that issue service requests over time, and a server that serves these requests. The goal is to minimize the sum of distance traveled by the server and the total delay in serving the requests. This problem models the fundamental tradeoff between batching requests to improve locality and reducing delay to improve response time, that has many applications in operations management, operating systems, logistics, supply chain management, and scheduling. Our main result is to show a poly-logarithmic competitive ratio for the online service with delay problem. This result is obtained by an algorithm that we call the preemptive service algorithm. The salient feature of this algorithm is a process called preemptive service, which uses a novel combination of (recursive) time forwarding and spatial exploration on a metric space. We hope this technique will be useful for related problems such as reordering buffer management, online TSP, vehicle routing, etc. We also generalize our results to $k > 1$ servers.
Shake-Shake Regularization The method introduced in this paper aims at helping deep learning practitioners faced with an overfit problem. The idea is to replace, in a multi-branch network, the standard summation of parallel branches with a stochastic affine combination. Applied to 3-branch residual networks, shake-shake regularization improves on the best single shot published results on CIFAR-10 and CIFAR-100 by reaching test errors of 2.86% and 15.85%. Experiments on architectures without skip connections or Batch Normalization show encouraging results and open the door to a large set of applications. Code is available at https://…/shake-shake.
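The idea of replacing the plain sum of two parallel branches with a stochastic affine combination can be sketched for the forward pass as follows. The sketch assumes a residual block with a skip connection and two branches, draws the mixing coefficient uniformly from [0, 1) during training, and uses the expected value 0.5 at evaluation time; it works on plain lists and leaves out the backward pass entirely.

```python
import random


def shake_shake_combine(skip, branch1, branch2, training, rng=random):
    """Combine two branch outputs with a random convex weight, plus the skip path."""
    # Training: random alpha per call; evaluation: fixed alpha = 0.5.
    alpha = rng.random() if training else 0.5
    return [s + alpha * a + (1 - alpha) * b
            for s, a, b in zip(skip, branch1, branch2)]


skip = [1.0, 1.0]
branch1 = [2.0, 4.0]
branch2 = [4.0, 0.0]

print(shake_shake_combine(skip, branch1, branch2, training=False))
print(shake_shake_combine(skip, branch1, branch2, training=True, rng=random.Random(0)))
```

Since the two weights sum to one, every training-time output lies between the outputs obtained by using either branch alone, which is what makes the perturbation a bounded form of noise.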
Shannon-Hartley Theorem In information theory, the Shannon-Hartley theorem tells the maximum rate at which information can be transmitted over a communications channel of a specified bandwidth in the presence of noise. It is an application of the noisy-channel coding theorem to the archetypal case of a continuous-time analog communications channel subject to Gaussian noise. The theorem establishes Shannon’s channel capacity for such a communication link, a bound on the maximum amount of error-free digital data (that is, information) that can be transmitted with a specified bandwidth in the presence of the noise interference, assuming that the signal power is bounded, and that the Gaussian noise process is characterized by a known power or power spectral density. The law is named after Claude Shannon and Ralph Hartley.
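In its usual form the theorem reads C = B · log2(1 + S/N), with C the capacity in bits per second, B the bandwidth in hertz, and S/N the signal-to-noise power ratio as a plain (linear) ratio. The helper names and the example figures below are chosen for illustration.

```python
import math


def db_to_linear(snr_db):
    """Convert a signal-to-noise ratio from decibels to a linear power ratio."""
    return 10 ** (snr_db / 10)


def channel_capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity in bits per second: C = B * log2(1 + S/N)."""
    return bandwidth_hz * math.log2(1 + snr_linear)


# A 3 kHz channel with a 30 dB signal-to-noise ratio (illustrative values).
capacity = channel_capacity(3000, db_to_linear(30))
print(f"{capacity:,.0f} bit/s")
```

Capacity grows linearly with bandwidth but only logarithmically with signal power, so doubling the power of an already strong signal adds roughly one extra bit per second per hertz.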
ShareLaTeX An easy to use, online, collaborative LaTeX editor.
Shark SHARK is a fast, modular, feature-rich open-source C++ machine learning library. It provides methods for linear and nonlinear optimization, kernel-based learning algorithms, neural networks, and various other machine learning techniques. It serves as a powerful toolbox for real world applications as well as research. Shark depends on Boost and CMake. It is compatible with Windows, Solaris, MacOS X, and Linux. Shark is licensed under the permissive GNU Lesser General Public License.
Shark Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users. By running on Spark, Shark can call complex analytics functions like machine learning right from SQL. Or call Shark inside your Spark jobs to load Hive data.
Sheffield Elicitation Framework
The SHeffield ELicitation Framework (SHELF) is a package of documents, templates and software to carry out elicitation of probability distributions for uncertain quantities from a group of experts. Elicitation is increasingly important for quantifying expert knowledge in situations where hard data are sparse. This is often the context in which difficult policy decisions are made. It is generally important to elicit from a group of experts, rather than a single expert, in order to synthesise the range of knowledge and opinions of the expert community. (However, SHELF may be used for a single expert with only trivial modification.)
ShiftCNN In this paper we introduce ShiftCNN, a generalized low-precision architecture for inference of multiplierless convolutional neural networks (CNNs). ShiftCNN is based on a power-of-two weight representation and, as a result, performs only shift and addition operations. Furthermore, ShiftCNN substantially reduces the computational cost of convolutional layers by precomputing convolution terms. Such an optimization can be applied to any CNN architecture with a relatively small codebook of weights and decreases the number of product operations by at least two orders of magnitude. The proposed architecture targets custom inference accelerators and can be realized on FPGAs or ASICs. Extensive evaluation on ImageNet shows that state-of-the-art CNNs can be converted without retraining into ShiftCNN with less than 1% drop in accuracy when the proposed quantization algorithm is employed. RTL simulations, targeting modern FPGAs, show that the power consumption of convolutional layers is reduced by a factor of 4 compared to conventional 8-bit fixed-point architectures.
Shortest Dependency Path – Long Short Term Memory
Relation classification is an important research arena in the field of natural language processing (NLP). In this paper, we present SDP-LSTM, a novel neural network to classify the relation of two entities in a sentence. Our neural architecture leverages the shortest dependency path (SDP) between two entities; multichannel recurrent neural networks, with long short term memory (LSTM) units, pick up heterogeneous information along the SDP. Our proposed model has several distinct features: (1) The shortest dependency paths retain most relevant information (to relation classification), while eliminating irrelevant words in the sentence. (2) The multichannel LSTM networks allow effective information integration from heterogeneous sources over the dependency paths. (3) A customized dropout strategy regularizes the neural network to alleviate overfitting. We test our model on the SemEval 2010 relation classification task, and achieve an $F_1$-score of 83.7\%, higher than competing methods in the literature.
Shortest Path Faster Algorithm
The Shortest Path Faster Algorithm (SPFA) is an improvement of the Bellman-Ford algorithm which computes single-source shortest paths in a weighted directed graph. The algorithm is believed to work well on random sparse graphs and is particularly suitable for graphs that contain negative-weight edges. However, the worst-case complexity of SPFA is the same as that of Bellman-Ford, so for graphs with nonnegative edge weights Dijkstra’s algorithm is preferred. The SPFA algorithm was published in 1994 by Fanding Duan.
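The queue-based relaxation behind SPFA can be sketched as follows. This is a minimal illustration under stated assumptions: the edge-list input format, the function name `spfa`, and the enqueue-count check for negative cycles are choices made here, not details taken from the entry above.

```python
from collections import deque


def spfa(num_vertices, edges, source):
    """Single-source shortest paths; edges is a list of (u, v, weight) triples."""
    adjacency = [[] for _ in range(num_vertices)]
    for u, v, w in edges:
        adjacency[u].append((v, w))

    dist = [float("inf")] * num_vertices
    dist[source] = 0
    in_queue = [False] * num_vertices
    enqueue_count = [0] * num_vertices

    queue = deque([source])
    in_queue[source] = True
    enqueue_count[source] = 1

    while queue:
        u = queue.popleft()
        in_queue[u] = False
        for v, w in adjacency[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                # Only vertices whose distance improved are revisited.
                if not in_queue[v]:
                    queue.append(v)
                    in_queue[v] = True
                    enqueue_count[v] += 1
                    # Too many re-entries implies a reachable negative cycle.
                    if enqueue_count[v] > num_vertices:
                        raise ValueError("negative cycle reachable from source")
    return dist


# Small hypothetical graph with one negative-weight edge.
example_edges = [(0, 1, 4), (0, 2, 5), (2, 1, -3), (1, 3, 2)]
distances = spfa(4, example_edges, source=0)
print(distances)  # [0, 2, 5, 4]
```

Unlike plain Bellman-Ford, which relaxes every edge in every pass, this variant only re-examines the outgoing edges of vertices whose distance has just changed, which is where the practical speedup on sparse graphs comes from.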
Shortest Probability Interval
ShotgunWSD In this paper, we present a novel unsupervised algorithm for word sense disambiguation (WSD) at the document level. Our algorithm is inspired by a widely-used approach in the field of genetics for whole genome sequencing, known as the Shotgun sequencing technique. The proposed WSD algorithm is based on three main steps. First, a brute-force WSD algorithm is applied to short context windows (up to 10 words) selected from the document in order to generate a short list of likely sense configurations for each window. In the second step, these local sense configurations are assembled into longer composite configurations based on suffix and prefix matching. Finally, the resulting configurations are ranked by their length, and the sense of each word is chosen based on a voting scheme that considers only the top k configurations in which the word appears. We compare our algorithm with other state-of-the-art unsupervised WSD algorithms and demonstrate better performance, sometimes by a very large margin. We also show that our algorithm can yield better performance than the Most Common Sense (MCS) baseline on one data set. Moreover, our algorithm has a very small number of parameters, is robust to parameter tuning, and, unlike other bio-inspired methods, it gives a deterministic solution (it does not involve random choices).
Shrinkage In statistics, shrinkage has two meanings:
• In relation to the general observation that, in regression analysis, a fitted relationship appears to perform less well on a new data set than on the data set used for fitting. In particular the value of the coefficient of determination ‘shrinks’. This idea is complementary to overfitting and, separately, to the standard adjustment made in the coefficient of determination to compensate for the subjunctive effects of further sampling, like controlling for the potential of new explanatory terms improving the model by chance: that is, the adjustment formula itself provides ‘shrinkage.’ But the adjustment formula yields an artificial shrinkage, in contrast to the first definition.
• To describe general types of estimators, or the effects of some types of estimation, whereby a naive or raw estimate is improved by combining it with other information. The term relates to the notion that the improved estimate is at a reduced distance from the value supplied by the ‘other information’ than is the raw estimate. In this sense, shrinkage is used to regularize ill-posed inference problems.
A common idea underlying both of these meanings is the reduction in the effects of sampling variation.
Shrinkage Estimator In statistics, a shrinkage estimator is an estimator that, either explicitly or implicitly, incorporates the effects of shrinkage. In loose terms this means that a naive or raw estimate is improved by combining it with other information. The term relates to the notion that the improved estimate is made closer to the value supplied by the ‘other information’ than the raw estimate. In this sense, shrinkage is used to regularize ill-posed inference problems.
Shrunken Centroids Regularized Discriminant Analysis
In this paper, we introduce a modified version of linear discriminant analysis, called ‘shrunken centroids regularized discriminant analysis’ (SCRDA). This method generalizes the idea of ‘nearest shrunken centroids’ (NSC) into the classical discriminant analysis. The SCRDA method is specially designed for classification problems in high dimension low sample size situations, for example, microarray data. Through both simulated data and real life data, it is shown that this method performs very well in multivariate classification problems, often outperforms the PAM method and can be as competitive as the SVM classifiers. It is also suitable for feature elimination purpose and can be used as gene selection method. The open source R package for SCRDA is available and will be added to the R libraries in the near future.
Shuffled Graph Shuffled Graphs are graphs with latent vertex labels.
ShuffleNet We introduce an extremely computation-efficient CNN architecture named ShuffleNet, designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two proposed operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top-1 error (absolute 6.7%) than the recent MobileNet system on ImageNet classification under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves approximately 13$\times$ actual speedup over AlexNet while maintaining comparable accuracy.
shuttleNet Despite the considerable research effort devoted to it in recent years, efficiently learning long-term dependencies from sequences remains a challenging task. As one of the key models for sequence learning, the recurrent neural network (RNN) and its variants such as long short term memory (LSTM) and gated recurrent unit (GRU) are still not powerful enough in practice. One possible reason is that they have only feedforward connections, unlike biological neural networks, which are typically composed of both feedforward and feedback connections. To address the problem, this paper proposes a biologically-inspired RNN structure, called shuttleNet, by introducing loop connections in the network and utilizing parameter sharing to prevent overfitting. Unlike traditional RNNs, the cells of shuttleNet are loop connected to mimic the brain’s feedforward and feedback connections. The structure is then stretched in the depth dimension to generate a deeper model with multiple information flow paths, while the parameters are shared so as to prevent shuttleNet from overfitting. The attention mechanism is then applied to select the best information path. Extensive experiments are conducted on two datasets for action recognition: UCF101 and HMDB51. We find that our model can outperform LSTMs and GRUs remarkably. Even when only replacing the LSTMs with our shuttleNet in a CNN-RNN network, we can still achieve state-of-the-art performance on both datasets.
Siamese Deep Forest
A Siamese Deep Forest (SDF) is proposed in the paper. It is based on the Deep Forest or gcForest proposed by Zhou and Feng and can be viewed as a gcForest modification. It can be also regarded as an alternative to the well-known Siamese neural networks. The SDF uses a modified training set consisting of concatenated pairs of vectors. Moreover, it defines the class distributions in the deep forest as the weighted sum of the tree class probabilities such that the weights are determined in order to reduce distances between similar pairs and to increase them between dissimilar points. We show that the weights can be obtained by solving a quadratic optimization problem. The SDF aims to prevent overfitting which takes place in neural networks when only limited training data are available. The numerical experiments illustrate the proposed distance metric method.
Sibyl A system for large scale supervised machine learning. Sibyl is an important research project underway at Google that implements machine learning primitives at scale and is widely used within Google. Large scale machine learning is playing an increasingly important role in improving the quality and monetization of Internet properties. A small number of techniques, such as regression, have proven to be widely applicable across Internet properties and applications.
sigma.js Sigma is a JavaScript library dedicated to graph drawing. It makes it easy to publish networks on Web pages, and allows developers to integrate network exploration in rich Web applications.
Sigma-Delta Networks Deep neural networks can be obscenely wasteful. When processing video, a convolutional network expends a fixed amount of computation for each frame with no regard to the similarity between neighbouring frames. As a result, it ends up repeatedly doing very similar computations. To put an end to such waste, we introduce Sigma-Delta networks. With each new input, each layer in this network sends a discretized form of its change in activation to the next layer. Thus the amount of computation that the network does scales with the amount of change in the input and layer activations, rather than the size of the network. We introduce an optimization method for converting any pre-trained deep network into an optimally efficient Sigma-Delta network, and show that our algorithm, if run on the appropriate hardware, could cut at least an order of magnitude from the computational cost of processing video data.
Sigmoid Function A sigmoid function is a mathematical function having an “S” shape (sigmoid curve). Often, sigmoid function refers to the special case of the logistic function.
SignalR ASP.NET SignalR is a new library for ASP.NET developers that makes it incredibly simple to add real-time web functionality to your applications. What is “real-time web” functionality? It’s the ability to have your server-side code push content to the connected clients as it happens, in real-time. You may have heard of WebSockets, a new HTML5 API that enables bi-directional communication between the browser and server. SignalR will use WebSockets under the covers when it’s available, and gracefully fall back to other techniques and technologies when it isn’t, while your application code stays the same. SignalR also provides a very simple, high-level API for doing server-to-client RPC (calling JavaScript functions in your clients’ browsers from server-side .NET code) in your ASP.NET application, as well as adding useful hooks for connection management, e.g. connect/disconnect events, grouping connections, authorization.
Signal-to-Noise Ratio
Signal-to-noise ratio (abbreviated SNR) is a measure used in science and engineering that compares the level of a desired signal to the level of background noise. It is defined as the ratio of signal power to the noise power, often expressed in decibels. A ratio higher than 1:1 (greater than 0 dB) indicates more signal than noise. While SNR is commonly quoted for electrical signals, it can be applied to any form of signal (such as isotope levels in an ice core or biochemical signaling between cells). The signal-to-noise ratio, the bandwidth, and the channel capacity of a communication channel are connected by the Shannon-Hartley theorem. Signal-to-noise ratio is sometimes used informally to refer to the ratio of useful information to false or irrelevant data in a conversation or exchange. For example, in online discussion forums and other online communities, off-topic posts and spam are regarded as ‘noise’ that interferes with the ‘signal’ of appropriate discussion.
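The decibel form of the definition above is 10 log10 of the power ratio, so equal signal and noise power gives 0 dB. A minimal sketch, in which the function name and the sample power values are assumptions made for illustration:

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from two power measurements."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical measurements in the same (arbitrary) power unit.
equal_power = snr_db(1.0, 1.0)      # ratio 1:1, i.e. 0 dB
strong_signal = snr_db(100.0, 1.0)  # ratio 100:1, i.e. 20 dB
print(equal_power, strong_signal)
```

When the inputs are amplitudes (for example voltages) rather than powers, the ratio is squared first, which is why the factor becomes 20 instead of 10 in that setting.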
Significance-Offset Convolutional Neural Network We propose ‘Significance-Offset Convolutional Neural Network’, a deep convolutional network architecture for multivariate time series regression. The model is inspired by standard autoregressive (AR) models and gating mechanisms used in recurrent neural networks. It involves an AR-like weighting system, where the final predictor is obtained as a weighted sum of sub-predictors while the weights are data-dependent functions learnt through a convolutional network. The architecture was designed for applications on asynchronous time series with low signal-to-noise ratio and hence is evaluated on such datasets: a hedge fund proprietary dataset of over 2 million quotes for a credit derivative index and an artificially generated noisy autoregressive series. The proposed architecture achieves promising results compared to convolutional and recurrent neural networks. The code for the numerical experiments and the architecture implementation will be shared online to make the research reproducible.
Silander-Myllymaki bnstruct
Silhouette Silhouette refers to a method of interpretation and validation of clusters of data. The technique provides a succinct graphical representation of how well each object lies within its cluster. It was first described by Peter J. Rousseeuw in 1986.
SimDex We present SimDex, a new technique for serving exact top-K recommendations on matrix factorization models that measures and optimizes for the similarity between users in the model. Previous serving techniques presume a high degree of similarity (e.g., L2 or cosine distance) among users and/or items in MF models; however, as we demonstrate, the most accurate models are not guaranteed to exhibit high similarity. As a result, brute-force matrix multiply outperforms recent proposals for top-K serving on several collaborative filtering tasks. Based on this observation, we develop SimDex, a new technique for serving matrix factorization models that automatically optimizes serving based on the degree of similarity between users, and outperforms existing methods in both the high-similarity and low-similarity regimes. SimDex first measures the degree of similarity among users via clustering and uses a cost-based optimizer to either construct an index on the model or defer to blocked matrix multiply. It leverages highly efficient linear algebra primitives in both cases to deliver predictions either from its index or from brute-force multiply. Overall, SimDex runs an average of 2x and up to 6x faster than highly optimized baselines for the most accurate models on several popular collaborative filtering datasets.
Simhash Algorithm Most hash functions are used to separate and obscure data, so that similar data hashes to very different keys. We propose to use hash functions for the opposite purpose: to detect similarities between data. Detecting similar files and classifying documents is a well-studied problem, but typically involves complex heuristics and/or $O(n^2)$ pair-wise comparisons. Using a hash function that hashed similar files to similar values, file similarity could be determined simply by comparing pre-sorted hash key values. The challenge is to find a similarity hash that minimizes false positives. We have implemented a family of similarity hash functions with this intent. We have further enhanced their performance by storing the auxiliary data used to compute our hash keys. This data is used as a second filter after a hash key comparison indicates that two files are potentially similar. We use these tests to explore the notion of “similarity.”
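To make the idea of a similarity-preserving hash concrete, the sketch below builds a bit-voting fingerprint over whitespace tokens and compares fingerprints by Hamming distance. This is a commonly used simhash-style construction and is an assumption here; it is not necessarily the hash family the authors implemented, and the tokenization, the use of SHA-256, and the 64-bit width are illustrative choices.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Bit-voting fingerprint; bits must be a multiple of 8 and at most 256."""
    votes = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            votes[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical documents: the first two share most of their tokens.
doc_a = simhash("the quick brown fox jumps over the lazy dog")
doc_b = simhash("the quick brown fox jumped over the lazy dog")
doc_c = simhash("stock prices fell sharply in early trading today")
print(hamming_distance(doc_a, doc_b), hamming_distance(doc_a, doc_c))
```

Documents that share most tokens tend to agree on most vote signs, so their fingerprints usually differ in few bits; a small Hamming distance is therefore a candidate signal for similarity rather than a guarantee, which is why a second filtering step remains useful.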
Similarity Ensemble Approach
SEA is based on the idea that two targets are similar if their ligand sets are similar to one another. The similarity of two ligand sets is computed as the sum of ligand pair similarities that exceed a certain threshold. The ligand pair similarity is measured by Tanimoto similarity. To correct for size or chemical composition bias, a correction technique is introduced that is based on the similarity obtained from randomly drawn ligand sets. This leads to z-scores for the similarity between the sets. It is argued that the z-scores conform to an extreme value distribution. Using this extreme value distribution, the probability that a compound is active on a certain target is calculated by assuming that one of the two ligand sets consists only of the compound to predict. We implemented the SEA method efficiently for use on a multi-core supercomputer, enabling us to compare it to the other target prediction methods.
Similarity Flooding Matching elements of two data schemas or two data instances plays a key role in data warehousing, e-business, or even biochemical applications. In this paper we present a matching algorithm based on a fixpoint computation that is usable across different scenarios. The algorithm takes two graphs (schemas, catalogs, or other data structures) as input, and produces as output a mapping between corresponding nodes of the graphs. Depending on the matching goal, a subset of the mapping is chosen using filters. After our algorithm runs, we expect a human to check and if necessary adjust the results. As a matter of fact, we evaluate the ‘accuracy’ of the algorithm by counting the number of needed adjustments. We conducted a user study, in which our accuracy metric was used to estimate the labor savings that the users could obtain by utilizing our algorithm to obtain an initial matching. Finally, we illustrate how our matching algorithm is deployed as one of several high-level operators in an implemented testbed for managing information models and mappings.
Similarity-Based Imbalanced Classification
When the training data in a two-class classification problem is overwhelmed by one class, most classification techniques fail to correctly identify the data points belonging to the underrepresented class. We propose Similarity-based Imbalanced Classification (SBIC) that learns patterns in the training data based on an empirical similarity function. To take the imbalanced structure of the training data into account, SBIC utilizes the concept of absent data, i.e. data from the minority class which can help better find the boundary between the two classes. SBIC simultaneously optimizes the weights of the empirical similarity function and finds the locations of absent data points. As such, SBIC uses an embedded mechanism for synthetic data generation which does not modify the training dataset, but alters the algorithm to suit imbalanced datasets. Therefore, SBIC uses the ideas of both major schools of thoughts in imbalanced classification: Like cost-sensitive approaches SBIC operates on an algorithm level to handle imbalanced structures; and similar to synthetic data generation approaches, it utilizes the properties of unobserved data points from the minority class. The application of SBIC to imbalanced datasets suggests it is comparable to, and in some cases outperforms, other commonly used classification techniques for imbalanced datasets.
Similarity-First Search Seriation
Simple Competitive Learning
Simple Logging Facade for Java
The Simple Logging Facade for Java (SLF4J) serves as a simple facade or abstraction for various logging frameworks (e.g. java.util.logging, logback, log4j) allowing the end user to plug in the desired logging framework at deployment time. Before you start using SLF4J, we highly recommend that you read the two-page SLF4J user manual. Note that SLF4J-enabling your library implies the addition of only a single mandatory dependency, namely slf4j-api.jar. If no binding is found on the class path, then SLF4J will default to a no-operation implementation. In case you wish to migrate your Java source files to SLF4J, consider our migrator tool which can migrate your project to use the SLF4J API in just a few minutes. In case an externally-maintained component you depend on uses a logging API other than SLF4J, such as commons logging, log4j or java.util.logging, have a look at SLF4J’s binary-support for legacy APIs.
Simple Temporal Point Process
A simple temporal point process (SPP) is an important class of time series, where the sample realization of the process is solely composed of the times at which events occur. Particular examples of point process data are neuronal spike patterns or spike trains, and a large number of distance and similarity metrics for those data have been proposed. A marked point process (MPP) is an extension of a simple temporal point process, in which a certain vector valued mark is associated with each of the temporal points in the SPP. Analyses of MPPs are of practical importance because instances of MPPs include recordings of natural disasters such as earthquakes and tornadoes.
Simplex Algorithm In mathematical optimization, Dantzig’s simplex algorithm (or simplex method) is a popular algorithm for linear programming.
Simplex Model
Simplified Probabilistic Linear Discriminant Analysis
Simplified Shotgun Stochastic Search
In p >> n settings, full posterior sampling using existing Markov chain Monte Carlo (MCMC) algorithms is highly inefficient and often not feasible from a practical perspective. To overcome this problem, we propose a scalable stochastic search algorithm called the Simplified Shotgun Stochastic Search (S5), aimed at rapidly exploring interesting regions of model space and finding the maximum a posteriori (MAP) model. S5 also provides an approximation of the posterior probability of each model (including the marginal inclusion probabilities).
Simpson’s Paradox In probability and statistics, Simpson’s paradox, or the Yule-Simpson effect, is a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. This result is often encountered in social-science and medical-science statistics, and is particularly confounding when frequency data are unduly given causal interpretations. Simpson’s Paradox disappears when causal relations are brought into consideration. Many statisticians believe that the mainstream public should be informed of the counter-intuitive results in statistics such as Simpson’s paradox.
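A short numeric check makes the reversal tangible. The counts below are patterned on a frequently cited kidney-stone treatment comparison and should be treated as illustrative; the variable and function names are assumptions introduced here.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone-size group; illustrative counts.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def group_rate(treatment: str, group: str) -> Fraction:
    """Success rate of one treatment within one group."""
    successes, patients = counts[treatment][group]
    return Fraction(successes, patients)


def pooled_rate(treatment: str) -> Fraction:
    """Success rate of one treatment with all groups combined."""
    successes = sum(s for s, _ in counts[treatment].values())
    patients = sum(n for _, n in counts[treatment].values())
    return Fraction(successes, patients)


for group in ("small", "large"):
    print(group, float(group_rate("A", group)), float(group_rate("B", group)))
print("pooled", float(pooled_rate("A")), float(pooled_rate("B")))
```

Treatment A has the higher success rate within each group, yet B has the higher pooled rate, because A was applied mostly to the harder (large-stone) cases. Group membership acts as a confounder, which is the causal consideration that resolves the paradox.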
SimRank SimRank is a general similarity measure, based on a simple and intuitive graph-theoretic model. SimRank is applicable in any domain with object-to-object relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects. Effectively, SimRank is a measure that says “two objects are considered to be similar if they are referenced by similar objects.”
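The recursive intuition quoted above is typically computed by fixed-point iteration: the similarity of two distinct objects is a decay factor times the average similarity of their in-neighbors. The sketch below is a naive version under stated assumptions; the dictionary-of-in-neighbors input, the decay value 0.8, the iteration count, and the toy graph are all illustrative choices.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive SimRank; in_neighbors maps every node to the list of nodes pointing at it."""
    nodes = sorted(in_neighbors)
    # Start from the identity: each object is only similar to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new_sim = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[a][b] = 1.0
                    continue
                refs_a, refs_b = in_neighbors[a], in_neighbors[b]
                if not refs_a or not refs_b:
                    # Objects that nothing references have no evidence of similarity.
                    new_sim[a][b] = 0.0
                    continue
                total = sum(sim[i][j] for i in refs_a for j in refs_b)
                new_sim[a][b] = decay * total / (len(refs_a) * len(refs_b))
        sim = new_sim
    return sim


# Hypothetical graph: univ points at both professors, who both point at student.
graph = {
    "univ": [],
    "profA": ["univ"],
    "profB": ["univ"],
    "student": ["profA", "profB"],
}
scores = simrank(graph)
print(scores["profA"]["profB"])
```

The two professors end up similar because they are referenced by the same object, while the student and a professor score zero here because their referrers share no similarity. The naive double loop is quadratic in the number of nodes per iteration, so it is only suitable for small graphs.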
Simulated Annealing
Simulated annealing (SA) is a generic probabilistic metaheuristic for the global optimization problem of locating a good approximation to the global optimum of a given function in a large search space. It is often used when the search space is discrete (e.g., all tours that visit a given set of cities). For certain problems, simulated annealing may be more efficient than exhaustive enumeration – provided that the goal is merely to find an acceptably good solution in a fixed amount of time, rather than the best possible solution.
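The acceptance rule that distinguishes simulated annealing from plain hill climbing can be sketched in a few lines. Everything specific below is an assumption made for illustration: the geometric cooling schedule, the parameter defaults, the function names, and the toy integer cost function.

```python
import math
import random


def simulated_annealing(cost, neighbor, start, t_start=10.0, t_end=1e-3,
                        cooling=0.95, steps_per_temp=50, seed=0):
    """Minimize cost(state); neighbor(state, rng) proposes a nearby state."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    temperature = t_start
    while temperature > t_end:
        for _ in range(steps_per_temp):
            candidate = neighbor(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / T), which shrinks as T cools.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temperature *= cooling
    return best, best_cost


def toy_cost(x: int) -> int:
    """Hypothetical discrete objective with its minimum at x = 7."""
    return (x - 7) ** 2


def step(x: int, rng: random.Random) -> int:
    """Move one unit left or right on the integer line."""
    return x + rng.choice((-1, 1))


best_state, best_value = simulated_annealing(toy_cost, step, start=50)
print(best_state, best_value)
```

Occasionally accepting worse candidates is what lets the search escape local optima early on; as the temperature falls, the procedure behaves more and more like greedy descent.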
Simultaneous Validation Over an Organized set of Hypotheses
Single Index Latent Variable Models
A semi-parametric, non-linear regression model in the presence of latent variables is introduced. These latent variables can correspond to unmodeled phenomena or unmeasured agents in a complex networked system. This new formulation allows joint estimation of certain non-linearities in the system, the direct interactions between measured variables, and the effects of unmodeled elements on the observed system. The particular form of the model is justified, and learning is posed as a regularized maximum likelihood estimation. This leads to classes of structured convex optimization problems with a ‘sparse plus low-rank’ flavor. Relations between the proposed model and several common model paradigms, such as those of Robust Principal Component Analysis (PCA) and Vector Autoregression (VAR), are established. Particularly in the VAR setting, the low-rank contributions can come from broad trends exhibited in the time series. Details of the algorithm for learning the model are presented. Experiments demonstrate the performance of the model and the estimation algorithm on simulated and real data.
Single-Linkage Clustering Single-linkage clustering is one of several methods of agglomerative hierarchical clustering. In the beginning of the process, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters, until all elements end up being in the same cluster. At each step, the two clusters separated by the shortest distance are combined. The definition of ‘shortest distance’ is what differentiates between the different agglomerative clustering methods. In single-linkage clustering, the link between two clusters is made by a single element pair, namely those two elements (one in each cluster) that are closest to each other. The shortest of these links that remains at any step causes the fusion of the two clusters whose elements are involved. The method is also known as nearest neighbour clustering. The result of the clustering can be visualized as a dendrogram, which shows the sequence of cluster fusion and the distance at which each fusion took place.
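The merging procedure described above can be sketched naively as follows. The function name, the merge-record format (ids of the two merged clusters plus the linking distance), and the one-dimensional sample points are assumptions made for illustration; the search over all cluster pairs is deliberately simple rather than efficient.

```python
def single_linkage(points, distance):
    """Return the merge sequence as (cluster_id_a, cluster_id_b, distance) triples."""
    # Every point starts in its own cluster; new clusters get fresh ids.
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: cluster distance is the closest cross pair.
                d = min(distance(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Hypothetical 1-D data set.
sample_points = [0.0, 1.0, 5.0, 6.5]
merge_sequence = single_linkage(sample_points, lambda p, q: abs(p - q))
print(merge_sequence)  # [(0, 1, 1.0), (2, 3, 1.5), (4, 5, 4.0)]
```

The returned triples are exactly the information a dendrogram displays: which clusters fused, in what order, and at what distance.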
Singular Spectrum Analysis
In time series analysis, singular spectrum analysis (SSA) is a nonparametric spectral estimation method. It combines elements of classical time series analysis, multivariate statistics, multivariate geometry, dynamical systems and signal processing. Its roots lie in the classical Karhunen (1946)-Loève (1945, 1978) spectral decomposition of time series and random fields and in the Mañé (1981)-Takens (1981) embedding theorem. SSA can be an aid in the decomposition of time series into a sum of components, each having a meaningful interpretation. The name “singular spectrum analysis” relates to the spectrum of eigenvalues in a singular value decomposition of a covariance matrix, and not directly to a frequency domain decomposition.
Singular Value Decomposition
In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix, with many useful applications in signal processing and statistics.
Singular Vector Canonical Correlation Analysis
With the continuing empirical successes of deep networks, it becomes increasingly important to develop better methods for understanding training of models and the representations learned within. In this paper we propose Singular Vector Canonical Correlation Analysis (SVCCA), a tool for quickly comparing two representations in a way that is both invariant to affine transform (allowing comparison between different layers and networks) and fast to compute (allowing more comparisons to be calculated than with previous methods). We deploy this tool to measure the intrinsic dimensionality of layers, showing in some cases needless over-parameterization; to probe learning dynamics throughout training, finding that networks converge to final representations from the bottom up; to show where class-specific information in networks is formed; and to suggest new training regimes that simultaneously save computation and overfit less.
Skellam Distribution The Skellam distribution is the discrete probability distribution of the difference $N_1 - N_2$ of two statistically independent random variables $N_1$ and $N_2$, each Poisson-distributed with respective expected values $\mu_1$ and $\mu_2$. It is useful in describing the statistics of the difference of two images with simple photon noise, as well as describing the point spread distribution in sports where all scored points are equal, such as baseball, hockey and soccer. The distribution is also applicable to a special case of the difference of dependent Poisson random variables, but just the obvious case where the two variables have a common additive random contribution which is cancelled by the differencing: see Karlis & Ntzoufras (2003) for details and an application.
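For reference, the probability mass function and first two moments can be written out explicitly. Here $K$ denotes the difference of the two Poisson variables and $I_{|k|}$ the modified Bessel function of the first kind; both symbols are notation introduced for this illustration.

```latex
P(K = k) = e^{-(\mu_1 + \mu_2)}
           \left(\frac{\mu_1}{\mu_2}\right)^{k/2}
           I_{|k|}\!\left(2\sqrt{\mu_1 \mu_2}\right),
\qquad k \in \mathbb{Z},
\qquad
\operatorname{E}[K] = \mu_1 - \mu_2,
\qquad
\operatorname{Var}[K] = \mu_1 + \mu_2 .
```

The mean and variance follow directly from independence: expectations subtract, while the variances of the two Poisson variables add.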
Sketch, Shingle, & Hashing
Similarity search on time series is a frequent operation in large-scale data-driven applications. Sophisticated similarity measures are standard for time series matching, as they are usually misaligned. Dynamic Time Warping or DTW is the most widely used similarity measure for time series because it combines alignment and matching at the same time. However, the alignment makes DTW slow. To speed up the expensive similarity search with DTW, branch and bound based pruning strategies are adopted. However, branch and bound based pruning is only useful for very short queries (low dimensional time series), and the bounds are quite weak for longer queries. Due to the loose bounds, the branch and bound pruning strategy boils down to a brute-force search. To circumvent this issue, we design SSH (Sketch, Shingle, & Hashing), an efficient and approximate hashing scheme which is much faster than the state-of-the-art branch and bound searching technique: the UCR suite. SSH uses a novel combination of sketching, shingling and hashing techniques to produce (probabilistic) indexes which align (near perfectly) with the DTW similarity measure. The generated indexes are then used to create hash buckets for sub-linear search. Our results show that SSH is very effective for longer time sequences and prunes around 95% of candidates, leading to a massive speedup in search with DTW. Empirical results on two large-scale benchmark time series datasets show that our proposed method can be around 20 times faster than the state-of-the-art package (UCR suite) without any significant loss in accuracy.
Sketched Subspace Clustering
The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of ground-truth labels, is an important tool for drawing inferences from data. Subspace clustering (SC) is a relatively recent method that is able to successfully classify nonlinearly separable data in a multitude of settings. In spite of their high clustering accuracy, SC methods incur prohibitively high computational complexity when processing large volumes of high-dimensional data. Inspired by random sketching approaches for dimensionality reduction, the present paper introduces a randomized scheme for SC, termed Sketch-SC, tailored for large volumes of high-dimensional data. Sketch-SC accelerates the computationally heavy parts of state-of-the-art SC approaches by compressing the data matrix across both dimensions using random projections, thus enabling fast and accurate large-scale SC. Performance analysis as well as extensive numerical tests on real data corroborate the potential of Sketch-SC and its competitive performance relative to state-of-the-art scalable SC approaches.
Skew Logistic Distribution A random variable X is said to have Azzalini’s skew-logistic distribution if its pdf is f(x)=2g(x)G(lambda*x), where g(·) and G(·), respectively, denote the pdf and cdf of the logistic distribution.
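The density above can be evaluated directly. The sketch below assumes the standard logistic distribution (location 0, scale 1) for g and G; the function names are chosen for the example only.

```python
import math

def logistic_pdf(x: float) -> float:
    """Standard logistic density g(x), written to avoid overflow for large |x|."""
    e = math.exp(-abs(x))
    return e / (1.0 + e) ** 2

def logistic_cdf(x: float) -> float:
    """Standard logistic distribution function G(x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def skew_logistic_pdf(x: float, lam: float) -> float:
    """Azzalini-type skew-logistic density f(x) = 2 g(x) G(lam * x)."""
    return 2.0 * logistic_pdf(x) * logistic_cdf(lam * x)
```

With lambda = 0 the factor G(0) = 1/2 cancels the 2 and the ordinary logistic density is recovered; positive lambda shifts mass to the right.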
Skill2vec Unsupervised word embeddings have seen tremendous success in numerous Natural Language Processing (NLP) tasks in recent years. The main contribution of this paper is to develop a technique called Skill2vec, which applies machine learning techniques in recruitment to enhance the search strategy to find the candidates who possess the right skills. Skill2vec is a neural network architecture, inspired by Word2vec (developed by Mikolov et al. in 2013), that transforms a skill into a new vector space. This vector space supports calculation on skills and represents the relationships between them. We conducted an experiment using A/B testing in a recruitment company to demonstrate the effectiveness of our approach.
Skip-Gram Model A technique whereby n-grams are still stored to model language, but tokens are allowed to be skipped.
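One common way to make this concrete is the k-skip-n-gram: every n-token subsequence in which at most k tokens in total are skipped. The sketch below follows that reading; the function name and parameters are assumptions made for the example.

```python
from itertools import combinations

def skip_grams(tokens, n, k):
    """Return all k-skip-n-grams: n-token subsequences skipping at most k tokens."""
    grams = []
    for i in range(len(tokens)):
        # The remaining n - 1 tokens come from the next n - 1 + k positions,
        # so the total number of skipped tokens never exceeds k.
        window = range(i + 1, min(i + n + k, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.append(tuple(tokens[j] for j in (i, *rest)))
    return grams
```

With k = 0 this reduces to ordinary n-grams; larger k adds pairs such as ("a", "c") that bridge a skipped token.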
Slate Markov Decision Processes
Many real-world problems come with action spaces represented as feature vectors. Although high-dimensional control is a largely unsolved problem, there has recently been progress for modest dimensionalities. Here we report on a successful attempt at addressing problems of dimensionality as high as $2000$, of a particular form. Motivated by important applications such as recommendation systems that do not fit the standard reinforcement learning frameworks, we introduce Slate Markov Decision Processes (slate-MDPs). A Slate-MDP is an MDP with a combinatorial action space consisting of slates (tuples) of primitive actions of which one is executed in an underlying MDP. The agent does not control the choice of this executed action and the action might not even be from the slate, e.g., for recommendation systems for which all recommendations can be ignored. We use deep Q-learning based on feature representations of both the state and action to learn the value of whole slates. Unlike existing methods, we optimize for both the combinatorial and sequential aspects of our tasks. The new agent’s superiority over agents that ignore either the combinatorial or the sequential long-term value aspect is demonstrated on a range of environments with dynamics from a real-world recommendation system. Further, we use deep deterministic policy gradients to learn a policy that, for each position of the slate, guides attention towards the part of the action space in which the value is the highest, and we only evaluate actions in this area. The attention is used within a sequentially greedy procedure leveraging submodularity. Finally, we show how introducing risk-seeking can dramatically improve the agent’s performance and ability to discover more far-reaching strategies.
Sliced Inverse Regression
Sliced inverse regression (SIR) is a tool for dimension reduction in the field of multivariate statistics. In statistics, regression analysis is a popular way of studying the relationship between a response variable y and its explanatory variable $\underline{x}$, which is a p-dimensional vector. There are several approaches which come under the term of regression. For example, parametric methods include multiple linear regression; non-parametric techniques include local smoothing. With high-dimensional data (as p grows), the number of observations needed to use local smoothing methods escalates exponentially. Reducing the number of dimensions makes the operation computable. Dimension reduction aims to show only the most important directions of the data. SIR uses the inverse regression curve, $E(\underline{x}\,|\,y)$, to perform a weighted principal component analysis, with which one identifies the effective dimension reducing directions.
Sliced Inverse Regression for Dimension Reduction
Slopegraphs An overview of Edward Tufte’s “slopegraphs”; their history; good and bad examples; when to use slopegraphs; slopegraph best practices. (from Charlie Park)
Slow Feature Analysis
Slow feature analysis (SFA) is an unsupervised learning algorithm for extracting slowly varying features from a quickly varying input signal. It has been successfully applied, e.g., to the self-organization of complex-cell receptive fields, the recognition of whole objects invariant to spatial transformations, the self-organization of place-cells, extraction of driving forces, and to nonlinear blind source separation.
Theoretical Analysis of the Optimal Free Responses of Graph-Based SFA for the Design of Training Graphs
Sluice Networks Multi-task learning is partly motivated by the observation that humans bring to bear what they know about related problems when solving new ones. Similarly, deep neural networks can profit from related tasks by sharing parameters with other networks. However, humans do not consciously decide to transfer knowledge between tasks (and are typically not aware of the transfer). In machine learning, it is hard to estimate if sharing will lead to improvements; especially if tasks are only loosely related. To overcome this, we introduce Sluice Networks, a general framework for multi-task learning where trainable parameters control the amount of sharing — including which parts of the models to share. Our framework goes beyond and generalizes over previous proposals in enabling hard or soft sharing of all combinations of subspaces, layers, and skip connections. We perform experiments on three task pairs from natural language processing, and across seven different domains, using data from OntoNotes 5.0, and achieve up to 15% average error reductions over common approaches to multi-task learning. We analyze when the architecture is particularly helpful, as well as its ability to fit noise. We show that a) label entropy is predictive of gains in sluice networks, confirming findings for hard parameter sharing, and b) while sluice networks easily fit noise, they are robust across domains in practice.
Small Area Estimation
Small area estimation is any of several statistical techniques involving the estimation of parameters for small sub-populations, generally used when the sub-population of interest is included in a larger survey. The term ‘small area’ in this context generally refers to a small geographical area such as a county. It may also refer to a ‘small domain’, i.e. a particular demographic within an area. If a survey has been carried out for the population as a whole (for example, a nation or state-wide survey), the sample size within any particular small area may be too small to generate accurate estimates from the data. To deal with this problem, it may be possible to use additional data (such as census records) that exists for these small areas in order to obtain estimates.
Smart Data
Smart Mining for Deep Metric Learning To solve deep metric learning problems and produce feature embeddings, current methodologies will commonly use a triplet model to minimise the relative distance between samples from the same class and maximise the relative distance between samples from different classes. Though successful, the training convergence of this triplet model can be compromised by the fact that the vast majority of the training samples will produce gradients with magnitudes that are close to zero. This issue has motivated the development of methods that explore the global structure of the embedding and other methods that explore hard negative/positive mining. The effectiveness of such mining methods is often associated with intractable computational requirements. In this paper, we propose a novel deep metric learning method that combines the triplet model and the global structure of the embedding space. We rely on a smart mining procedure that produces effective training samples for a low computational cost. In addition, we propose an adaptive controller that automatically adjusts the smart mining hyper-parameters and speeds up the convergence of the training process. We show empirically that our proposed method allows for fast and more accurate training of triplet ConvNets than other competing mining methods. Additionally, we show that our method achieves new state-of-the-art embedding results for CUB-200-2011 and Cars196 datasets.
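The vanishing-gradient issue mentioned above is easy to see with a generic hinge-style triplet loss. The sketch below is an assumption made for illustration (squared Euclidean distance, a fixed margin, hypothetical function names); the abstract does not specify the exact loss used, and the smart mining procedure itself is not shown.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic hinge triplet loss: zero once the negative is far enough away."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Any triplet whose negative is already farther from the anchor than the positive by at least the margin has zero loss and contributes no gradient, which is why selecting informative triplets matters.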
Smooth Imitation Learning
In Smooth Imitation Learning for online sequence prediction, the goal is to train a policy that can smoothly imitate demonstrated behavior in a dynamic and continuous environment in response to online, sequential context input.
Smoothly Clipped Absolute Deviation
Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions.
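The properties listed above (symmetric, singular at the origin, bounded by a constant) can be checked on the commonly cited piecewise form of the SCAD penalty. The formula and the default a = 3.7 are not given in the text above and are stated here as the usual textbook form; the function name is an assumption for the example.

```python
def scad_penalty(theta: float, lam: float, a: float = 3.7) -> float:
    """Commonly cited SCAD penalty for one coefficient (requires lam > 0, a > 2)."""
    t = abs(theta)  # symmetric in theta
    if t <= lam:
        return lam * t  # L1-like near the origin, which produces sparsity
    if t <= a * lam:
        # quadratic piece that smoothly flattens the penalty
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2  # constant for large coefficients, which limits bias
```

Unlike the L1 penalty, the value stops growing once |theta| exceeds a * lam, so large coefficients are not shrunk further.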
Snoogle Embedding small devices into everyday objects like toasters and coffee mugs creates a wireless network of objects. These embedded devices can contain a description of the underlying objects, or other user defined information. In this paper, we present Snoogle, a search engine for such a network. A user can query Snoogle to find a particular mobile object, or a list of objects that fit the description. Snoogle uses information retrieval techniques to index information and process user queries, and Bloom filters to reduce communication overhead. Security and privacy protections are also engineered into Snoogle to protect sensitive information. We have implemented a prototype of Snoogle using off-the-shelf sensor motes, and conducted extensive experiments to evaluate the system performance.
Snoogle Snoogle is a graphical, SWRL-based ontology mapper to assist in the task of OWL ontology alignment. It allows users to visualize ontologies and then draw mappings from one to another on a graphical canvas. Users draw mappings as they see them in their head, and then Snoogle turns these mappings into SWRL/RDF or SWRL/XML for use in a knowledge base.
Snowdoop Hadoop made it convenient to process data in very large distributed databases, and also convenient to create them, using the Hadoop Distributed File System. But eventually word got out that Hadoop is slow, and very limited in available data operations. Both of those shortcomings are addressed to a large extent by the new kid on the block, Spark. Spark is apparently much faster than Hadoop, sometimes dramatically so, due to strong caching ability and a wider variety of available operations. But even Spark suffers a very practical problem, shared by the others mentioned above: All of these systems are complicated. There is a considerable amount of configuration to do, worsened by dependence on infrastructure software such as Java or MPI, and in some cases by interface software such as rJava. Some of this requires systems knowledge that many R users may lack. And once they do get these systems set up, they may be required to design algorithms with world views quite different from R, even though technically they are coding in R. So, do we really need all that complicated machinery? Hadoop and Spark provide efficient distributed sort operations, but if one’s application does not depend on sorting, we have a cost-benefit issue here. Here is an alternative, more of a general approach rather than a package, which I call ‘Snowdoop.’ (The name alludes to the fact that it uses the section of the parallel package derived from the old snow package.) The idea is to retain the notion of chunking files into distributed mini-files, but (a) do this on one’s own, and (b) then process those files using ordinary R code, not fancy new functions like Hadoop and Spark require.
Sobel Operator The Sobel operator, sometimes called Sobel Filter, is used in image processing and computer vision, particularly within edge detection algorithms, and creates an image which emphasizes edges and transitions. It is named after Irwin Sobel, who presented the idea of an ‘Isotropic 3×3 Image Gradient Operator’ at a talk at the Stanford Artificial Intelligence Project (SAIP) in 1968. Technically, it is a discrete differentiation operator, computing an approximation of the gradient of the image intensity function. At each point in the image, the result of the Sobel operator is either the corresponding gradient vector or the norm of this vector. The Sobel operator is based on convolving the image with a small, separable, and integer valued filter in horizontal and vertical direction and is therefore relatively inexpensive in terms of computations. On the other hand, the gradient approximation that it produces is relatively crude, in particular for high frequency variations in the image. The Kayyali operator for edge detection is another operator generated from Sobel operator.
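A minimal pure-Python sketch of the gradient-magnitude computation described above, assuming a grayscale image given as a list of rows and leaving border pixels at zero. The kernels are applied as a correlation rather than a flipped convolution, which only changes the sign of each gradient component and not the magnitude; the names are chosen for the example.

```python
import math

# Standard 3x3 Sobel kernels for horizontal (GX) and vertical (GY) change.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(image):
    """Return the gradient magnitude at each interior pixel (borders stay 0)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(GX[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(GY[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)  # norm of the gradient vector
    return out
```

On an image with a vertical step edge the response is large next to the step and zero in flat regions.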
Sobolev Training At the heart of deep learning we aim to use neural networks as function approximators – training them to produce outputs from inputs in emulation of a ground truth function or data creation process. In many cases we only have access to input-output pairs from the ground truth; however, it is becoming more common to have access to derivatives of the target output with respect to the input – for example when the ground truth function is itself a neural network such as in network compression or distillation. Generally these target derivatives are not computed, or are ignored. This paper introduces Sobolev Training for neural networks, which is a method for incorporating these target derivatives in addition to the target values while training. By optimising neural networks to not only approximate the function’s outputs but also the function’s derivatives we encode additional information about the target function within the parameters of the neural network. Thereby we can improve the quality of our predictors, as well as the data-efficiency and generalization capabilities of our learned function approximation. We provide theoretical justifications for such an approach as well as examples of empirical evidence on three distinct domains: regression on classical optimisation datasets, distilling policies of an agent playing Atari, and on large-scale applications of synthetic gradients. In all three domains the use of Sobolev Training, employing target derivatives in addition to target values, results in models with higher accuracy and stronger generalisation.
Social Network Analysis
Social network analysis (SNA) is a strategy for investigating social structures through the use of network and graph theories. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties or edges (relationships or interactions) that connect them. Examples of social structures commonly visualized through social network analysis include social media networks, friendship and acquaintance networks, kinship, disease transmission, and sexual relationships. These networks are often visualized through sociograms in which nodes are represented as points and ties are represented as lines. Social network analysis has emerged as a key technique in modern sociology. It has also gained a significant following in anthropology, biology, communication studies, economics, geography, history, information science, organizational studies, political science, social psychology, development studies, and sociolinguistics and is now commonly available as a consumer tool.
Social Wi-Fi Many retailers offer Wi-Fi to attract and retain customers. Now some retailers hope to get more from wireless networking via Social Wi-Fi, in which customers get free connectivity by logging in to the retailer’s network using their credentials from a social network account, such as Facebook. The user gets free wireless connectivity. The retailer gets access to customer data for marketing purposes. For example, the retailer could use the data to tailor offers to the customer, such as an in-store coupon for a favorite brand.
SOCRATES A distributed semantic graph processing system that provides locality control, indexing, graph query, and parallel processing capabilities is presented.
Socratic Learning Modern machine learning techniques, such as deep learning, often use discriminative models that require large amounts of labeled data. An alternative approach is to use a generative model, which leverages heuristics from domain experts to train on unlabeled data. Domain experts often prefer to use generative models because they ‘tell a story’ about their data. Unfortunately, generative models are typically less accurate than discriminative models. Several recent approaches combine both types of model to exploit their strengths. In this setting, a misspecified generative model can hurt the performance of subsequent discriminative training. To address this issue, we propose a framework called Socratic learning that automatically uses information from the discriminative model to correct generative model misspecification. Furthermore, this process provides users with interpretable feedback about how to improve their generative model. We evaluate Socratic learning on real-world relation extraction tasks and observe an immediate improvement in classification accuracy that could otherwise require several weeks of effort by domain experts.
Soft Computing
Soft computing is a term applied to a field within computer science which is characterized by the use of inexact solutions to computationally hard tasks such as the solution of NP-complete problems, for which there is no known algorithm that can compute an exact solution in polynomial time. Soft computing differs from conventional (hard) computing in that, unlike hard computing, it is tolerant of imprecision, uncertainty, partial truth, and approximation. In effect, the role model for soft computing is the human mind.
Soft K-Means
(Fuzzy C-Means)
Soft Topographic Vector Quantization
We have developed an algorithm (STVQ) for the optimization of neighbourhood preserving maps by applying deterministic annealing to an energy function for topographic vector quantization. The combinatorial optimization problem is solved by introducing temperature dependent fuzzy assignments of data points to cluster centers and applying an EM-type algorithm at each temperature while annealing. The annealing process exhibits phase transitions in the cluster representation for which we calculate critical modes and temperatures expressed in terms of the neighbourhood function and the covariance matrix of the data. In particular, phase transitions corresponding to the automatic selection of feature dimensions are explored analytically and numerically for finite temperatures. Results are related to those obtained earlier for Kohonen’s SOM-algorithm which can be derived as an approximation to STVQ. The deterministic annealing approach makes it possible to use the neighbourhood function solely to encode desired neighbourhood relations. The working of the annealing process is visualized by showing the effects of ‘heating’ on the topological structure of a two-dimensional map of the plane.
SoftTarget Regularization Deep neural networks are learning models with a very high capacity and therefore prone to over-fitting. Many regularization techniques such as Dropout, DropConnect, and weight decay all attempt to solve the problem of over-fitting by reducing the capacity of their respective models (Srivastava et al., 2014), (Wan et al., 2013), (Krogh & Hertz, 1992). In this paper we introduce a new form of regularization that guides the learning problem in a way that reduces over-fitting without sacrificing the capacity of the model. The mistakes that models make in early stages of training carry information about the learning problem. By adjusting the labels of the current epoch of training through a weighted average of the real labels, and an exponential average of the past soft-targets we achieved a regularization scheme as powerful as Dropout without necessarily reducing the capacity of the model, and simplified the complexity of the learning problem. SoftTarget regularization proved to be an effective tool in various neural network architectures.
Sonnet It’s now nearly a year since DeepMind made the decision to switch the entire research organisation to using TensorFlow (TF). It’s proven to be a good choice – many of our models learn significantly faster, and the built-in features for distributed training have hugely simplified our code. Along the way, we found that the flexibility and adaptiveness of TF lends itself to building higher level frameworks for specific purposes, and we’ve written one for quickly building neural network modules with TF. We are actively developing this codebase, but what we have so far fits our research needs well, and we’re excited to announce that today we are open sourcing it. We call this framework Sonnet.
Soundex Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms (in part because it is a standard feature of popular database software such as DB2, PostgreSQL, MySQL, Ingres, MS SQL Server and Oracle) and is often used (incorrectly) as a synonym for “phonetic algorithm”. Improvements to Soundex are the basis for many modern phonetic algorithms.
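The encoding can be sketched as follows, using the widely described American Soundex rules (first letter kept, consonants mapped to digits, adjacent equal codes collapsed, H and W transparent, result padded to four characters). Database implementations differ in edge cases, so this is an illustration under those assumptions, not a reference implementation.

```python
# Digit codes for consonants; vowels, Y, H and W have no code.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}

def soundex(name: str) -> str:
    """Return the four-character Soundex code of an ASCII name ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]               # the first letter is kept as-is
    prev = CODES.get(letters[0], "")  # its code still suppresses a following duplicate
    for ch in letters[1:]:
        if ch in "HW":
            continue                  # H and W do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                   # a vowel resets prev, so repeats are encoded again
    return (result + "000")[:4]       # pad with zeros or truncate to length 4
```

Names that sound alike, such as "Robert" and "Rupert", map to the same code.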
spaCy spaCy, ‘Industrial-strength NLP’, is a library for advanced natural language processing in Python and Cython.
spaCy is built on the very latest research, but it isn’t researchware. It was designed from day 1 to be used in real products. You can buy a commercial license, or you can use it under the AGPL. Features:
• Labelled dependency parsing (91.8% accuracy on OntoNotes 5)
• Named entity recognition (82.6% accuracy on OntoNotes 5)
• Part-of-speech tagging (97.1% accuracy on OntoNotes 5)
• Easy to use word vectors
• All strings mapped to integer IDs
• Export to numpy data arrays
• Alignment maintained to original string, ensuring easy mark up calculation
• Range of easy-to-use orthographic features.
• No pre-processing required. spaCy takes raw text as input, warts and newlines and all.
Spaghetti Plot A spaghetti plot (also known as a spaghetti chart, spaghetti diagram, or spaghetti model) is a method of viewing data to visualize possible flows through systems. Flows depicted in this manner appear like noodles, hence the coining of this term. This method of statistics was first used to track routing through factories. Visualizing flow in this manner can reduce inefficiency within the flow of a system. In regards to animal populations and weather buoys drifting through the ocean, they are drawn to study distribution and migration patterns. Within meteorology, these diagrams can help determine confidence in a specific weather forecast, as well as positions and intensities of high and low pressure systems. They are composed of deterministic forecasts from atmospheric models or their various ensemble members. Within medicine, they can illustrate the effects of drugs on patients during drug trials.
Spark Python API
The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don’t know Scala. This guide will show how to use the Spark features described there in Python.
PySpark & Scikit-learn = Sparkit-learn
Sparkle Spark is an in-memory analytics platform that targets commodity server environments today. It relies on the Hadoop Distributed File System (HDFS) to persist intermediate checkpoint states and final processing results. In Spark, immutable data are used for storing data updates in each iteration, making it inefficient for long running, iterative workloads. A non-deterministic garbage collector further worsens this problem. Sparkle is a library that optimizes memory usage in Spark. It exploits large shared memory to achieve better data shuffling and intermediate storage. Sparkle replaces the current TCP/IP-based shuffle with a shared memory approach and proposes an off-heap memory store for efficient updates. We performed a series of experiments on scale-out clusters and scale-up machines. The optimized shuffle engine leveraging shared memory provides 1.3x to 6x faster performance relative to Vanilla Spark. The off-heap memory store along with the shared-memory shuffle engine provides more than 20x performance increase on a probabilistic graph processing workload that uses a large-scale real-world hyperlink graph. While Sparkle benefits most from running on large memory machines, it also achieves 1.6x to 5x performance improvements over a scale-out cluster with an equivalent hardware setting.
SparkNet Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster’s communication overhead, and we benchmark our system’s performance on the ImageNet dataset.
SPARQL Protocol and RDF Query Language
SPARQL (pronounced “sparkle”, a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language, that is, a semantic query language for databases, able to retrieve and manipulate data stored in Resource Description Framework format. It was made a standard by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is recognized as one of the key technologies of the semantic web. On 15 January 2008, SPARQL 1.0 became an official W3C Recommendation, and SPARQL 1.1 in March, 2013. SPARQL allows for a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns. Implementations for multiple programming languages exist. “SPARQL will make a huge difference” making the web machine-readable according to Sir Tim Berners-Lee in a May 2006 interview. There exist tools that allow one to connect and semi-automatically construct a SPARQL query for a SPARQL endpoint, for example ViziQuer. In addition, there exist tools that translate SPARQL queries to other query languages, for example to SQL and to XQuery.
Sparse Coding The sparse code is when each item is encoded by the strong activation of a relatively small set of neurons. For each item to be encoded, this is a different subset of all available neurons. As a consequence, sparseness may be focused on temporal sparseness (‘a relatively small number of time periods are active’) or on the sparseness in an activated population of neurons. In this latter case, this may be defined in one time period as the number of activated neurons relative to the total number of neurons in the population. This seems to be a hallmark of neural computations since compared to traditional computers, information is massively distributed across neurons. A major result in neural coding from Olshausen et al. is that sparse coding of natural images produces wavelet-like oriented filters that resemble the receptive fields of simple cells in the visual cortex. The capacity of sparse codes may be increased by simultaneous use of temporal coding, as found in the locust olfactory system. Given a potentially large set of input patterns, sparse coding algorithms (e.g. Sparse Autoencoder) attempt to automatically find a small number of representative patterns which, when combined in the right proportions, reproduce the original input patterns. The sparse coding for the input then consists of those representative patterns. For example, the very large set of English sentences can be encoded by a small number of symbols (i.e. letters, numbers, punctuation, and spaces) combined in a particular order for a particular sentence, and so a sparse coding for English would be those symbols.
“Dictionary Learning”
More Algorithms for Provable Dictionary Learning
Sparse Distributed Representations
Sparse Distributed Representations are binary representations of data comprised of many bits with a small percentage of the bits active (1’s). The bits in these representations have semantic meaning and that meaning is distributed across the bits.
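A small sketch of the idea, storing each representation as the set of its active bit positions. Using the number of shared active bits as a similarity score is an assumption made for this example (a common convention, not stated above), as are the function names.

```python
import random

def random_sdr(size: int, active: int, rng: random.Random) -> frozenset:
    """Pick `active` distinct bit positions out of `size` to be the 1-bits."""
    return frozenset(rng.sample(range(size), active))

def sdr_sparsity(sdr: frozenset, size: int) -> float:
    """Fraction of bits that are active."""
    return len(sdr) / size

def sdr_overlap(a: frozenset, b: frozenset) -> int:
    """Number of active bits shared by two representations."""
    return len(a & b)
```

For example, 40 active bits out of 2048 gives a sparsity of roughly 2%.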
Sparse Generalized Linear Models glmgraph
Sparse Linear Method
This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse LInear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned from SLIM by solving an l1-norm and l2-norm regularized optimization problem. W is demonstrated to produce high-quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods.
Sparse Matrix / Sparsity In numerical analysis, a sparse matrix is a matrix in which most of the elements are zero. By contrast, if most of the elements are nonzero, then the matrix is considered dense. The fraction of zero elements (non-zero elements) in a matrix is called the sparsity (density).
In a more general sense, sparsity also covers variable selection, total variation regularization, polynomial trend filtering, and others.
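The definition of sparsity and density for a matrix can be illustrated with a small sketch. The helper names `sparsity` and `to_coo` and the example matrix are assumptions made for illustration; the coordinate-list layout is one common way of storing only the non-zero entries, not something prescribed by the entry above.

```python
def sparsity(matrix):
    """Return the fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for x in row if x == 0)
    return zeros / total


def to_coo(matrix):
    """Keep only the non-zero entries as (row, column, value) triples."""
    return [(i, j, x)
            for i, row in enumerate(matrix)
            for j, x in enumerate(row)
            if x != 0]


# Hypothetical 4 x 3 matrix with three non-zero entries.
m = [[0, 0, 3],
     [0, 0, 0],
     [1, 0, 0],
     [0, 2, 0]]

s = sparsity(m)        # 9 zeros out of 12 entries
density = 1 - s        # fraction of non-zero entries
triples = to_coo(m)    # compact storage of the non-zero entries
```

Storing the triples instead of the full grid is what makes sparse matrices cheap to keep in memory when most entries are zero.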
Sparse Shrink Nowadays, it is still difficult to adapt Convolutional Neural Network (CNN) based models for deployment on embedded devices. The heavy computation and large memory footprint of CNN models become the main burden in real application. In this paper, we propose a ‘Sparse Shrink’ algorithm to prune an existing CNN model. By analyzing the importance of each channel via sparse reconstruction, the algorithm is able to prune redundant feature maps accordingly. The resulting pruned model thus directly saves computational resource. We have evaluated our algorithm on CIFAR-100. As shown in our experiments, we can reduce 56.77% parameters and 73.84% multiplication in total with only minor decrease in accuracy. These results have demonstrated the effectiveness of our ‘Sparse Shrink’ algorithm.
Sparse Spatial Generalized Linear Mixed Model
(reparameterizations of traditional models)
SparseStep Regression The SparseStep algorithm is presented for the estimation of a sparse parameter vector in the linear regression problem. The algorithm works by adding an approximation of the exact counting norm as a constraint on the model parameters and iteratively strengthening this approximation to arrive at a sparse solution. Theoretical analysis of the penalty function shows that the estimator yields unbiased estimates of the parameter vector. An iterative majorization algorithm is derived which has a straightforward implementation reminiscent of ridge regression. In addition, the SparseStep algorithm is compared with similar methods through a rigorous simulation study which shows it often outperforms existing methods in both model fit and prediction accuracy.
Sparsity Oriented Importance Learning
Sparsity Oriented Importance Learning (SOIL) provides an objective and informative profile of variable importances for high dimensional regression and classification models.
Spatial Position Model SpatialPosition
Spatial Random Sampling
Random column sampling is not guaranteed to yield data sketches that preserve the underlying structures of the data and may not sample sufficiently from less-populated data clusters. Also, adaptive sampling can often provide accurate low rank approximations, yet may fall short of producing descriptive data sketches, especially when the cluster centers are linearly dependent. Motivated by that, this paper introduces a novel randomized column sampling tool dubbed Spatial Random Sampling (SRS), in which data points are sampled based on their proximity to randomly sampled points on the unit sphere. The most compelling feature of SRS is that the corresponding probability of sampling from a given data cluster is proportional to the surface area the cluster occupies on the unit sphere, independently from the size of the cluster population. Although it is fully randomized, SRS is shown to provide descriptive and balanced data representations. The proposed idea addresses a pressing need in data science and holds potential to inspire many novel approaches for analysis of big data.
Spatial Sign Correlation A new robust correlation estimator based on the spatial sign covariance matrix (SSCM) is proposed. We derive its asymptotic distribution and influence function at elliptical distributions. Finite sample and robustness properties are studied and compared to other robust correlation estimators by means of numerical simulations.
Spatial Sign Covariance Matrix
The robust estimation of multivariate location and shape is one of the most challenging problems in statistics and crucial in many application areas. The objective is to find highly efficient, robust, computable and affine equivariant location and covariance matrix estimates. In this paper three different concepts of multivariate sign and rank are considered and their ability to carry information about the geometry of the underlying distribution (or data cloud) are discussed. New techniques for robust covariance matrix estimation based on different sign and rank concepts are proposed and algorithms for computing them outlined. In addition, new tools for evaluating the qualitative and quantitative robustness of a covariance estimator are proposed. The use of these tools is demonstrated on two rank based covariance matrix estimates. Finally, to illustrate the practical importance of the problem, a signal processing example where robust covariance matrix estimates are needed is given.
The Spatial Sign Covariance Matrix With Unknown Location
“Spatial Sign Correlation”
Spatial Simulated Annealing
Spatial simulated annealing uses slight perturbations of previous sampling designs and a random search technique to solve spatial optimization problems. Candidate measurement locations are iteratively moved around and optimized by minimizing the mean universal kriging variance. The approach relies on a known, pre-specified model for underlying spatial variation.
“Simulated Annealing”
Spatial Statistics Spatial analysis or spatial statistics includes any of the formal techniques which study entities using their topological, geometric, or geographic properties. The phrase properly refers to a variety of techniques, many still in their early development, using different analytic approaches and applied in fields as diverse as astronomy, with its studies of the placement of galaxies in the cosmos, to chip fabrication engineering, with its use of ‘place and route’ algorithms to build complex wiring structures. The phrase is often used in a more restricted sense to describe techniques applied to structures at the human scale, most notably in the analysis of geographic data. The phrase is even sometimes used to refer to a specific technique in a single area of research, for example, to describe geostatistics.
Spatial Stochastic Frontier Analysis
Spatial Stochastic Frontier Analysis (SSFA) is an original method for controlling the spatial heterogeneity in Stochastic Frontier Analysis (SFA) models by splitting the inefficiency term into three terms: the first one related to spatial peculiarities of the territory in which each single unit operates, the second one related to the specific production features and the third one representing the error term.


Spatially Compact Semantic Scan
Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. We describe Spatially Compact Semantic Scan (SCSS) that has been developed specifically to overcome the shortcomings of current methods in detecting new spatially compact events in text streams. SCSS employs alternating optimization between using semantic scan to estimate contrastive foreground topics in documents, and discovering spatial neighborhoods with high occurrence of documents containing the foreground topics. We evaluate our method on Emergency Department chief complaints dataset (ED dataset) to verify the effectiveness of our method in detecting real-world disease outbreaks from free-text ED chief complaint data.
spatstat spatstat is an R package for spatial statistics with a strong focus on analysing spatial point patterns in 2D (with some support for 3D and very basic support for space-time).
SPCALDA A new reduced-rank LDA method which works for high dimensional multi-class data.
Specificity Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as classification function. Sensitivity (also called the true positive rate, or the recall rate in some fields) measures the proportion of actual positives which are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition). Specificity (sometimes called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having the condition). These two measures are closely related to the concepts of type I and type II errors. A perfect predictor would be described as 100% sensitive (i.e. predicting all people from the sick group as sick) and 100% specific (i.e. not predicting anyone from the healthy group as sick); however, theoretically any predictor will possess a minimum error bound known as the Bayes error rate.
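As a minimal sketch of the two measures, the function below counts the four outcomes of a binary test and derives sensitivity and specificity from them. The function name and the label vectors are made up for illustration, with 1 standing for "sick" and 0 for "healthy".

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, flagged sick
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged sick
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical test results: 4 sick and 6 healthy people.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
# sens = 3 / 4 (true positive rate), spec = 4 / 6 (true negative rate)
```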
Spectral Clustering In multivariate statistics and the clustering of data, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset. In application to image segmentation, spectral clustering is known as segmentation-based object categorization.
Spectral Convolution Networks Previous research has shown that computation of convolution in the frequency domain provides a significant speedup versus traditional convolution network implementations. However, this performance increase comes at the expense of repeatedly computing the transform and its inverse in order to apply other network operations such as activation, pooling, and dropout. We show, mathematically, how convolution and activation can both be implemented in the frequency domain using either the Fourier or Laplace transformation. The main contributions are a description of spectral activation under the Fourier transform and a further description of an efficient algorithm for computing both convolution and activation under the Laplace transform. By computing both the convolution and activation functions in the frequency domain, we can reduce the number of transforms required, as well as reducing overall complexity. Our description of a spectral activation function, together with existing spectral analogs of other network functions may then be used to compose a fully spectral implementation of a convolution network.
Spectral Graph Clustering
“Spectral Clustering”
Speech Analytics Speech analytics is the process of analyzing recorded calls to gather information; it brings structure to customer interactions and exposes information buried in customer contact center interactions with an enterprise. Although it often includes elements of automatic speech recognition, where the identities of spoken words or phrases are determined, it may also include analysis of one or more of the following:
• the topic(s) being discussed
• the emotional character of the speech
• the amount and locations of speech versus non-speech (e.g. call hold time or periods of silence)
One use of speech analytics applications is to spot spoken keywords or phrases, either as real-time alerts on live audio or as a post-processing step on recorded speech. This technique is also known as audio mining. Other uses include categorization of speech, for example in the contact center environment, to identify calls from unsatisfied customers. Speech analytics in contact centers can be used to extract critical business intelligence that would otherwise be lost. By analyzing and categorizing recorded phone conversations between companies and their customers, useful information can be discovered relating to strategy, product, process, operational issues and contact center agent performance. This information gives decision-makers insight into what customers really think about their company so that they can quickly react. In addition, speech analytics can automatically identify areas in which contact center agents may need additional training or coaching, and can automatically monitor the customer service provided on calls.
Spherical Paragraph Model
Representing texts as fixed-length vectors is central to many language processing tasks. Most traditional methods build text representations based on the simple Bag-of-Words (BoW) representation, which loses the rich semantic relations between words. Recent advances in natural language processing have shown that semantically meaningful representations of words can be efficiently acquired by distributed models, making it possible to build text representations based on a better foundation called the Bag-of-Word-Embedding (BoWE) representation. However, existing text representation methods using BoWE often lack sound probabilistic foundations or cannot well capture the semantic relatedness encoded in word vectors. To address these problems, we introduce the Spherical Paragraph Model (SPM), a probabilistic generative model based on BoWE, for text representation. SPM has good probabilistic interpretability and can fully leverage the rich semantics of words, the word co-occurrence information as well as the corpus-wide information to help the representation learning of texts. Experimental results on topical classification and sentiment analysis demonstrate that SPM can achieve new state-of-the-art performances on several benchmark datasets.
Spiking Neural Nets
Spiking Neural Nets (SNNs) (also sometimes called Oscillatory NNs) are being developed from an examination of the fact that neurons do not constantly communicate with one another but rather in spikes of signals. We all have heard of alpha waves in the brain and these oscillations are only one manifestation of the irregular cyclic and spiking nature of communication among neurons.
So if individual neurons are activated only under specific circumstances in which the electrical potential exceeds a specific threshold, a spike, what might be the implication for designing neural nets? For one, there is the fundamental question of whether information is being encoded in the rate, amplitude, or even latency of the spikes. It appears this is so.
The SNNs that have been demonstrated thus far show the following characteristics:
• They can be developed with far fewer layers. If nodes only fire in response to a spike (actually a train of spikes) then one spiking neuron could replace many hundreds of hidden units on a sigmoidal NN.
• There are implications for energy efficiency. SNNs should require much lower power than CNNs.
• You could in theory route spikes like data packets further reducing layers. It’s tempting to say this reduces complexity and it’s true that layers go away, but are replaced by the complexity of interpreting and directing basically noisy spike trains.
• Training SNNs does not rely on gradient descent functions as do CNNs. Gradient descent which looks at the performance of the overall network can be led astray by unusual conditions at a layer like a non-differentiable activation function. The current and typical way to train SNNs is some variation on ‘Spike Timing Dependent Plasticity’ and is based on the timing, amplitude, or latency of the spike train.
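The threshold behaviour described above, where a neuron fires only once its potential exceeds a threshold, can be sketched with a leaky integrate-and-fire unit. This is a standard textbook simplification chosen here for illustration, not a model named in the entry; the function name, the parameter values and the input sequence are all assumptions.

```python
def simulate_lif(inputs, threshold=1.0, leak=0.9, reset=0.0):
    """Return the time steps at which a leaky integrate-and-fire unit spikes."""
    potential = reset
    spike_times = []
    for t, current in enumerate(inputs):
        # The stored potential decays a little, then the new input is added.
        potential = leak * potential + current
        if potential >= threshold:
            spike_times.append(t)   # emit a spike ...
            potential = reset       # ... and start over from the reset level
    return spike_times


# Hypothetical input current per time step.
spikes = simulate_lif([0.4, 0.4, 0.4, 0.0, 0.0, 0.6, 0.6])
```

The output is a list of spike times rather than a continuous activation, which is why information can be carried by the timing or rate of the spikes.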
Spinnaker Spinnaker is an open source, multi-cloud continuous delivery platform for releasing software changes with high velocity and confidence.
Spirtes Glymour Scheines Algorithm
A.) Form the complete undirected graph H on the vertex set V.
B.) For each pair of vertices A and B, if there exists a subset S of V such that A and B are d-separated given S, remove the edge between A and B from H.
C.) Let K be the undirected graph resulting from step B). For each triple of vertices A, B, and C such that the pair A and B and the pair B and C are each adjacent in K (written as A – B – C) but the pair A and C are not adjacent in K, orient A – B – C as A -> B <- C if and only if there is no subset S of {B} ∪ V that d-separates A and C.
D.) repeat
• If A -> B, B and C are adjacent, A and C are not adjacent, and there is no arrowhead at B, then orient B – C as B -> C.
• If there is a directed path from A to B, and an edge between A and B, then orient A – B as A -> B.
until no more edges can be oriented.
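Steps A to C above can be sketched as follows; the propagation loop of step D is left out. The d-separation test is passed in as a function, since the algorithm assumes those facts are known. The names `sgs_steps_a_to_c` and `oracle` are hypothetical, and step C is read here as "no conditioning set that contains B d-separates A and C", which is an interpretation of the wording above rather than a quotation.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items as a set, smallest first."""
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)


def sgs_steps_a_to_c(vertices, d_separated):
    # Step A: start from the complete undirected graph.
    edges = {frozenset(pair) for pair in combinations(vertices, 2)}

    # Step B: drop an edge when some conditioning set d-separates its ends.
    for a, b in combinations(vertices, 2):
        rest = [v for v in vertices if v not in (a, b)]
        if any(d_separated(a, b, s) for s in subsets(rest)):
            edges.discard(frozenset((a, b)))

    # Step C: orient A - B - C as A -> B <- C when A and C are not adjacent
    # and no conditioning set containing B d-separates them.
    colliders = set()
    for b in vertices:
        neighbours = [v for v in vertices if frozenset((v, b)) in edges]
        for a, c in combinations(neighbours, 2):
            if frozenset((a, c)) in edges:
                continue
            rest = [v for v in vertices if v not in (a, b, c)]
            if not any(d_separated(a, c, s | {b}) for s in subsets(rest)):
                colliders.add((a, b, c))
    return edges, colliders


def oracle(a, b, given):
    """Hypothetical d-separation facts for the graph X -> Z <- Y."""
    return {a, b} == {"X", "Y"} and "Z" not in given


skeleton, colliders = sgs_steps_a_to_c(["X", "Y", "Z"], oracle)
```

The search over all subsets is exponential in the number of vertices, so this sketch is only practical for very small graphs.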
Split-Apply-Combine Strategy In a split-apply-combine strategy you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.
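A minimal sketch of the three stages in plain Python, assuming the data are a list of records: split the records by a key, apply a function to each group independently, and combine the results into one dictionary. The function name `split_apply_combine` and the sales data are made up for illustration.

```python
from collections import defaultdict


def split_apply_combine(records, key, apply):
    # Split: collect the records into groups that share the same key.
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    # Apply the function to each group, then combine into one result.
    return {k: apply(group) for k, group in groups.items()}


# Hypothetical (region, amount) sales records.
sales = [("north", 10), ("south", 5), ("north", 30), ("south", 15)]

mean_by_region = split_apply_combine(
    sales,
    key=lambda record: record[0],
    apply=lambda rows: sum(amount for _, amount in rows) / len(rows),
)
```

Because each group is processed independently, the apply stage could be run in parallel without changing the result.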
Splitted Isotonic Regression
A limitation of many clustering algorithms is the requirement to tune adjustable parameters for each application or even for each dataset. Some algorithms require an \emph{a priori} estimate of the number of clusters while density-based techniques usually require a scale parameter. Other parametric methods, such as mixture modeling, make assumptions about the underlying cluster distributions. Here we introduce a non-parametric clustering method that does not involve tunable parameters and only assumes that clusters are unimodal, in the sense that they have a single point of maximal density when projected onto any line, and that clusters are separated from one another by a separating hyperplane of relatively lower density. The technique uses a non-parametric algorithm—isotonic regression—as the kernel operation repeated at every iteration. We carry out a rigorous hypothesis test for whether pairs of clusters should be merged based upon Monte Carlo sampling of a statistic. We compare the method against k-means++, DBSCAN, and Gaussian mixture algorithms and show in simulations that it performs better than these standard methods in many situations. The algorithm’s utility is also demonstrated in the context of ‘spike sorting’ of neural electrical recordings. The source code for the algorithm is freely available.
Spoken Dialogue System
A spoken dialog system is a computer system able to converse with a human with voice. It has two essential components that do not exist in a text dialog system: a speech recognizer and a text-to-speech module. It can be further distinguished from command and control speech systems that can respond to requests but do not attempt to maintain continuity over time.
Spontaneous Clustering We propose a new method for clustering based on the local minimization of the \gamma-divergence, which we call the spontaneous clustering. The greatest advantage of the proposed method is that it automatically detects the number of clusters that adequately reflect the data structure. In contrast, existing methods such as K-means, fuzzy c-means, and model based clustering need to prescribe the number of clusters. We detect all the local minimum points of the \gamma-divergence, which are defined as the centers of clusters. A necessary and sufficient condition for the \gamma-divergence to have the local minimum points is also derived in a simple setting. A simulation study and a real data analysis are performed to compare our proposal with existing methods.
Spotlight Analysis New name for an old way of interpreting an interaction between a continuous and a categorical grouping variable in a regression model. The basic idea of spotlight analysis is to compare the mean satisfaction score of the two groups at specific values of the continuous covariate.
spray spray is an open-source toolkit for building REST/HTTP-based integration layers on top of Scala and Akka. Being asynchronous, actor-based, fast, lightweight, modular and testable it’s a great way to connect your Scala applications to the world.
Spreading Activation Spreading activation is a method for searching associative networks, neural networks, or semantic networks. The search process is initiated by labeling a set of source nodes (e.g. concepts in a semantic network) with weights or “activation” and then iteratively propagating or “spreading” that activation out to other nodes linked to the source nodes. Most often these “weights” are real values that decay as activation propagates through the network. When the weights are discrete this process is often referred to as marker passing. Activation may originate from alternate paths, identified by distinct markers, and terminate when two alternate paths reach the same node.
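The propagation step can be sketched as follows, assuming real-valued activation that is multiplied by a decay factor on every hop and dropped once it falls below a cut-off. The function name, the `decay` and `threshold` values and the tiny semantic network are assumptions for illustration only.

```python
def spread_activation(graph, sources, decay=0.5, threshold=0.1, max_steps=10):
    """Propagate activation from the source nodes along outgoing links."""
    activation = dict(sources)   # best activation seen so far per node
    frontier = dict(sources)     # nodes whose activation changed last round
    for _ in range(max_steps):
        next_frontier = {}
        for node, value in frontier.items():
            passed = value * decay            # activation decays on each hop
            if passed < threshold:
                continue                      # too weak to spread further
            for neighbour in graph.get(node, []):
                if passed > activation.get(neighbour, 0.0):
                    activation[neighbour] = passed
                    next_frontier[neighbour] = passed
        if not next_frontier:
            break
        frontier = next_frontier
    return activation


# Hypothetical semantic network: node -> linked nodes.
graph = {"dog": ["animal", "bark"], "animal": ["cat"], "cat": [], "bark": []}

result = spread_activation(graph, {"dog": 1.0})
```

Nodes closer to the source end up with higher activation, which is what makes the result usable as a relevance ranking.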
Spreadmart A spreadmart (spreadsheet data mart) is a situation in which a company’s employees have inconsistent views of corporate data because each department relies on the data from its own spreadsheets.
Spring for Apache Hadoop Spring for Apache Hadoop simplifies developing Apache Hadoop by providing a unified configuration model and easy to use APIs for using HDFS, MapReduce, Pig, and Hive. It also provides integration with other Spring ecosystem projects such as Spring Integration and Spring Batch, enabling you to develop solutions for big data ingest/export and Hadoop workflow orchestration.
Spyre Spyre is a Web Application Framework for providing a simple user interface for Python data projects. Spyre runs on the minimalist Python web framework, cherrypy, with jinja2 templating. At its heart, spyre is about data and data visualization, so you’ll also need pandas and matplotlib.
SQLite SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is a relational database management system contained in a C programming library. In contrast to other database management systems, SQLite is not a separate process that is accessed from the client application, but an integral part of it. SQLite is ACID-compliant and implements most of the SQL standard, using a dynamically and weakly typed SQL syntax that does not guarantee the domain integrity. SQLite is a popular choice as embedded database for local/client storage in application software such as web browsers. It is arguably the most widely deployed database engine, as it is used today by several widespread browsers, operating systems, and embedded systems, among others. SQLite has bindings to many programming languages. The source code for SQLite is in the public domain.
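A short sketch using Python's standard `sqlite3` module shows two of the points above: the database runs inside the calling process (here entirely in memory), and column types are applied dynamically rather than enforced. The table and column names are made up for illustration.

```python
import sqlite3

# The engine runs in-process; ":memory:" keeps the database in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER)")

# The qty column is declared INTEGER, yet text values are accepted.
conn.executemany(
    "INSERT INTO items (qty) VALUES (?)",
    [(3,), ("42",), ("many",)],
)

# typeof() reports the storage class actually used for each value.
rows = conn.execute(
    "SELECT qty, typeof(qty) FROM items ORDER BY id"
).fetchall()
conn.close()
```

The text `"42"` is converted to an integer because it looks like one, while `"many"` is stored as text in the same column; this is the weak typing that the entry says does not guarantee domain integrity.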
SQLScript The motivation for SQLScript is to embed data-intensive application logic into the database. As of today, applications only offload very limited functionality into the database using SQL; most of the application logic is normally executed in an application server. This has the effect that data to be operated upon needs to be copied from the database into the application server and vice versa. When executing data-intensive logic, this copying of data is very expensive in terms of processor and data transfer time. Moreover, when using an imperative language like ABAP or Java for processing data, developers tend to write algorithms which follow one-tuple-at-a-time semantics (for example, looping over rows in a table). However, these algorithms are hard to optimize and parallelize compared to declarative set-oriented languages such as SQL. The SAP HANA database is optimized for modern technology trends and takes advantage of modern hardware, for example, by having data residing in main memory and allowing massive parallelization on multi-core CPUs. The goal of the SAP HANA database is to optimally support application requirements by leveraging such hardware. To this end, the SAP HANA database exposes a very sophisticated interface to the application consisting of many languages. The expressiveness of these languages far exceeds that attainable with OpenSQL. The set of SQL extensions for the SAP HANA database that allow developers to push data-intensive logic into the database is called SQLScript. Conceptually SQLScript is related to stored procedures as defined in the SQL standard, but SQLScript is designed to provide superior optimization possibilities. SQLScript should be used in cases where other modeling constructs of SAP HANA, for example analytic views or attribute views, are not sufficient. For more information on how to best exploit the different view types, see ‘Exploit Underlying Engine’.
The set of SQL extensions is the key to avoiding massive data copies to the application server and to leveraging sophisticated parallel execution strategies of the database. SQLScript addresses the following problems:
● Decomposing an SQL query can only be done using views. However when decomposing complex queries using views, all intermediate results are visible and must be explicitly typed. Moreover SQL views cannot be parameterized which limits their reuse. In particular they can only be used like tables and embedded into other SQL statements.
● SQL queries do not have features to express business logic (for example a complex currency conversion). As a consequence such a business logic cannot be pushed down into the database (even if it is mainly based on standard aggregations like SUM(Sales), etc.).
● An SQL query can only return one result at a time. As a consequence the computation of related result sets must be split into separate, usually unrelated, queries.
● Although SQLScript encourages developers to implement algorithms using a set-oriented paradigm rather than a one-tuple-at-a-time paradigm, imperative logic is sometimes still required, for example by iterative approximation algorithms. Thus it is possible to mix imperative constructs known from stored procedures with declarative ones.
Stability A learning system is said to be stable if no pattern in the training data changes its category after a finite number of learning iterations.
Stable Marriage Problem
In mathematics, economics, and computer science, the stable marriage problem (also stable matching problem or SMP) is the problem of finding a stable matching between two equally sized sets of elements given an ordering of preferences for each element. A matching is a mapping from the elements of one set to the elements of the other set. A matching is stable whenever it is not the case that both:
1. some given element A of the first matched set prefers some given element B of the second matched set over the element to which A is already matched, and
2. B also prefers A over the element to which B is already matched
In other words, a matching is stable when there does not exist any match (A, B) by which both A and B are individually better off than they would be with the element to which they are currently matched. The stable marriage problem is commonly stated in terms of heterosexual marriages and binary genders:
‘Given n men and n women, where each person has ranked all members of the opposite sex in order of preference, marry the men and women together such that there are no two people of opposite sex who would both rather have each other than their current partners. When there are no such pairs of people, the set of marriages is deemed stable.’
Algorithms for finding solutions to the stable marriage problem have applications in a variety of real-world situations, perhaps the best known of these being in the assignment of graduating medical students to their first hospital appointments. In 2012, the Nobel Prize in Economics was awarded to Lloyd S. Shapley and Alvin E. Roth ‘for the theory of stable allocations and the practice of market design.’
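One well-known algorithm for this problem is Gale-Shapley deferred acceptance, sketched below under the assumption that both sides are equally sized and every element ranks every element of the other side. The function names and the preference lists are made up for illustration; `is_stable` checks the two-part condition from the definition above.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Return a stable matching as a dict proposer -> receiver."""
    # rank[r][p] is the position of proposer p in receiver r's list.
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}   # next receiver to try
    engaged_to = {}                                # receiver -> proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(r)
        if current is None:
            engaged_to[r] = p                      # r was unmatched
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p                      # r trades up
            free.append(current)
        else:
            free.append(p)                         # r rejects p
    return {p: r for r, p in engaged_to.items()}


def is_stable(matching, proposer_prefs, receiver_prefs):
    """Check that no pair prefers each other to their current partners."""
    partner_of = {r: p for p, r in matching.items()}
    for p, prefs in proposer_prefs.items():
        for r in prefs:
            if r == matching[p]:
                break                              # remaining options are worse
            ranking = receiver_prefs[r]
            if ranking.index(p) < ranking.index(partner_of[r]):
                return False                       # p and r would both switch
    return True


# Hypothetical preference lists, most preferred first.
first = {"a": ["x", "y", "z"], "b": ["y", "x", "z"], "c": ["x", "y", "z"]}
second = {"x": ["b", "a", "c"], "y": ["a", "b", "c"], "z": ["a", "b", "c"]}

matching = gale_shapley(first, second)
```

The matching returned is the one most favourable to the proposing side, regardless of the order in which free proposers are processed.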
Stacked Autoencoders A stacked autoencoder is a neural network consisting of multiple layers of sparse autoencoders in which the outputs of each layer are wired to the inputs of the successive layer. The greedy layerwise approach for pretraining a deep network works by training each layer in turn. In this page, you will find out how autoencoders can be “stacked” in a greedy layerwise fashion for pretraining (initializing) the weights of a deep network.
Stacked Deconvolutional Network
Recent progress in semantic segmentation has been driven by improving the spatial resolution under Fully Convolutional Networks (FCNs). To address this problem, we propose a Stacked Deconvolutional Network (SDN) for semantic segmentation. In SDN, multiple shallow deconvolutional networks, which are called SDN units, are stacked one by one to integrate contextual information and guarantee the fine recovery of localization information. Meanwhile, inter-unit and intra-unit connections are designed to assist network training and enhance feature fusion since the connections improve the flow of information and gradient propagation throughout the network. Besides, hierarchical supervision is applied during the upsampling process of each SDN unit, which guarantees the discrimination of feature representations and benefits the network optimization. We carry out comprehensive experiments and achieve the new state-of-the-art results on three datasets, including PASCAL VOC 2012, CamVid, and GATECH. In particular, our best model without CRF post-processing achieves an intersection-over-union score of 86.6% in the test set.
Stacked Denoising Autoencoder
A stacked denoising autoencoder is to a denoising autoencoder what a deep-belief network is to a restricted Boltzmann machine. A key function of SDAs, and deep learning more generally, is unsupervised pre-training, layer by layer, as input is fed through. Once each layer is pre-trained to conduct feature selection and extraction on the input from the preceding layer, a second stage of supervised fine-tuning can follow. A word on stochastic corruption in SDAs: Denoising autoencoders shuffle data around and learn about that data by attempting to reconstruct it. The act of shuffling is the noise, and the job of the network is to recognize the features within the noise that will allow it to classify the input. When a network is being trained, it generates a model, and measures the distance between that model and the benchmark through a loss function. Its attempts to minimize the loss function involve resampling the shuffled inputs and re-reconstructing the data, until it finds those inputs which bring its model closest to what it has been told is true. The serial resamplings are based on a generative model to randomly provide data to be processed. This is known as a Markov Chain, and more specifically, a Markov Chain Monte Carlo algorithm that steps through the data set seeking a representative sampling of indicators that can be used to construct more and more complex features.
Stacked Generalization
Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although in practice, a single-layer logistic regression model is often used as the combiner. Stacking typically yields performance better than any single one of the trained models. It has been successfully used on both supervised learning tasks (regression) and unsupervised learning (density estimation). It has also been used to estimate bagging’s error rate. It has been reported to out-perform Bayesian model-averaging. The two top-performers in the Netflix competition utilized blending, which may be considered to be a form of stacking.
Stacked Generative Adversarial Networks
In this paper we aim to leverage the powerful bottom-up discriminative representations to guide a top-down generative model. We propose a novel generative model named Stacked Generative Adversarial Networks (SGAN), which is trained to invert the hierarchical representations of a discriminative bottom-up deep network. Our model consists of a top-down stack of GANs, each trained to generate ‘plausible’ lower-level representations, conditioned on higher-level representations. A representation discriminator is introduced at each feature hierarchy to encourage the representation manifold of the generator to align with that of the bottom-up discriminative network, providing intermediate supervision. In addition, we introduce a conditional loss that encourages the use of conditional information from the layer above, and a novel entropy loss that maximizes a variational lower bound on the conditional entropy of generator outputs. To the best of our knowledge, the entropy loss is the first attempt to tackle the conditional model collapse problem that is common in conditional GANs. We first train each GAN of the stack independently, and then we train the stack end-to-end. Unlike the original GAN that uses a single noise vector to represent all the variations, our SGAN decomposes variations into multiple levels and gradually resolves uncertainties in the top-down generative process. Experiments demonstrate that SGAN is able to generate diverse and high-quality images, as well as being more interpretable than a vanilla GAN.
Stan Stan is a probabilistic programming language implementing full Bayesian statistical inference with MCMC sampling (NUTS, HMC) and penalized maximum likelihood estimation with optimization (L-BFGS). Stan is coded in C++ and runs on all major platforms (Linux, Mac, Windows). Stan is freedom-respecting, open-source software (new BSD core, GPLv3 interfaces).
Standard Methodology for Analytical Models
In this document, the Standard Methodology for Analytical Models (SMAM) is described. The most frequently used methodology is the Cross-Industry Standard Process for Data Mining (CRISP-DM), which has several shortcomings that translate into frequent friction points with the business when practitioners start building analytical models.
Stanford DAWN Project Despite incredible recent advances in machine learning, building machine learning applications remains prohibitively time-consuming and expensive for all but the best-trained, best-funded engineering organizations. This expense comes not from a need for new and improved statistical models but instead from a lack of systems and tools for supporting end-to-end machine learning application development, from data preparation and labeling to productionization and monitoring. In this document, we outline opportunities for infrastructure supporting usable, end-to-end machine learning applications in the context of the nascent DAWN (Data Analytics for What’s Next) project at Stanford.
StarCraft II Learning Environment
This paper introduces SC2LE (StarCraft II Learning Environment), a reinforcement learning environment based on the StarCraft II game. This domain poses a new grand challenge for reinforcement learning, representing a more difficult class of problems than considered in most prior work. It is a multi-agent problem with multiple players interacting; there is imperfect information due to a partially observed map; it has a large action space involving the selection and control of hundreds of units; it has a large state space that must be observed solely from raw input feature planes; and it has delayed credit assignment requiring long-term strategies over thousands of steps. We describe the observation, action, and reward specification for the StarCraft II domain and provide an open source Python-based interface for communicating with the game engine. In addition to the main game maps, we provide a suite of mini-games focusing on different elements of StarCraft II gameplay. For the main game maps, we also provide an accompanying dataset of game replay data from human expert players. We give initial baseline results for neural networks trained from this data to predict game outcomes and player actions. Finally, we present initial baseline results for canonical deep reinforcement learning agents applied to the StarCraft II domain. On the mini-games, these agents learn to achieve a level of play that is comparable to a novice player. However, when trained on the main game, these agents are unable to make significant progress. Thus, SC2LE offers a new and challenging environment for exploring deep reinforcement learning algorithms and architectures.
STARTS Although researchers in clinical psychology routinely gather data in which many individuals respond at multiple times, there is not a standard way to analyze such data. A new approach for the analysis of such data is described. It is proposed that a person’s current standing on a variable is caused by 3 sources of variance: a term that does not change (trait), a term that changes (state), and a random term (error). It is shown how structural equation modeling can be used to estimate such a model. An extended example is presented in which the correlations between variables are quite different at the trait, state, and error levels.
Stata Stata is a complete, integrated statistical software package that provides everything you need for data analysis, data management, and graphics. With both a point-and-click interface and a powerful, intuitive command syntax, Stata is fast, accurate, and easy to use. All analyses can be reproduced and documented for publication and review. Version control ensures statistical programs will continue to produce the same results no matter when you wrote them.
State Space Model
State space model (SSM) refers to a class of probabilistic graphical models (Koller and Friedman, 2009) that describe the probabilistic dependence between the latent state variable and the observed measurement. The state or the measurement can be either continuous or discrete. The term “state space” originated in the 1960s in the area of control engineering (Kalman, 1960). SSM provides a general framework for analyzing deterministic and stochastic dynamical systems that are measured or observed through a stochastic process. The SSM framework has been successfully applied in engineering, statistics, computer science and economics to solve a broad range of dynamical systems problems. Other terms used to describe SSMs are hidden Markov models (HMMs) (Rabiner, 1989) and latent process models. The most well studied SSM is the Kalman filter, which defines an optimal algorithm for inferring linear Gaussian systems.
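As a small illustration of inference in a linear Gaussian state space model, the following sketch runs a scalar Kalman filter. It assumes the simplest possible model, a random-walk latent state $x_t = x_{t-1} + w_t$ observed as $y_t = x_t + v_t$; the function name `kalman_1d` and the parameters `q` (process noise variance) and `r` (measurement noise variance) are chosen here for illustration only.

```python
def kalman_1d(measurements, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for x_t = x_{t-1} + w_t, y_t = x_t + v_t,
    with Var(w_t) = q and Var(v_t) = r. Returns the filtered state means."""
    x, p = x0, p0              # current state estimate and its variance
    estimates = []
    for y in measurements:
        # Predict: the state mean is unchanged, uncertainty grows by q.
        p = p + q
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + r)
        x = x + k * (y - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates

estimates = kalman_1d([1.0, 1.0, 1.0], q=0.0, r=1.0)
# With q = 0 the filter reduces to a running average of the prior mean (0.0)
# and the observations: 0.5, 2/3, 0.75.
```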
SARSA (State-Action-Reward-State-Action) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was introduced in a technical note where the alternative name SARSA was only mentioned as a footnote.
This name simply reflects the fact that the main function for updating the Q-value depends on the current state of the agent “S1”, the action the agent chooses “A1”, the reward “R” the agent gets for choosing this action, the state “S2” that the agent will now be in after taking that action, and finally the next action “A2” the agent will choose in its new state. Taking every letter in the quintuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ yields the word SARSA.
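The update driven by this quintuple can be written as $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$. The sketch below applies it once to a dictionary-based Q-table; the learning rate `alpha`, discount `gamma`, and the state and action labels are illustrative assumptions.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: move Q(s, a) toward r + gamma * Q(s_next, a_next)."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)          # unseen (state, action) pairs start at 0.0
Q[("s2", "right")] = 1.0        # pretend this value was learned earlier
new_value = sarsa_update(Q, "s1", "left", r=0.5, s_next="s2", a_next="right")
# td_target = 0.5 + 0.9 * 1.0 = 1.4, so Q(s1, left) becomes 0.1 * 1.4 = 0.14
```

Because the target uses the action actually chosen in the next state, the value being learned is that of the policy the agent follows.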
Stated Preference Method The term “Stated Preference Methods” refers to a family of techniques which use individual respondents’ statements about their preferences in a set of options to estimate utility functions. The options are typically descriptions of situations or contexts constructed by the researcher. By their nature, stated preference methods require purpose-designed surveys for their collection of data. “Contingent Valuation” is often referred to as a stated preference model.
Stationary Process In mathematics and statistics, a stationary process (or strict(ly) stationary process or strong(ly) stationary process) is a stochastic process whose joint probability distribution does not change when shifted in time. Consequently, parameters such as the mean and variance, if they are present, also do not change over time and do not follow any trends.
Stationarity is used as a tool in time series analysis, where the raw data is often transformed to become stationary; for example, economic data are often seasonal and/or dependent on a non-stationary price level. An important type of non-stationary process that does not include a trend-like behavior is the cyclostationary process.
Note that a “stationary process” is not the same thing as a “process with a stationary distribution”. Indeed there are further possibilities for confusion with the use of “stationary” in the context of stochastic processes; for example a “time-homogeneous” Markov chain is sometimes said to have “stationary transition probabilities”. Besides, all stationary Markov random processes are time-homogeneous.
STATISTICA STATISTICA is a statistics and analytics software package developed by StatSoft. STATISTICA provides data analysis, data management, statistics, data mining, and data visualization procedures. STATISTICA product categories include Enterprise (for use across a site or organization), Web-Based (for use with a server and web browser), Concurrent Network Desktop, and Single-User Desktop.
Statistical Analysis System
SAS (Statistical Analysis System; not to be confused with SAP) is a software suite developed by SAS Institute for advanced analytics, business intelligence, data management, and predictive analytics. SAS was developed at North Carolina State University from 1966 until 1976, when SAS Institute was incorporated. SAS was further developed in the 1980s and 1990s with the addition of new statistical procedures, additional components and the introduction of JMP. A point-and-click interface was added in version 9 in 2004. A social media analytics product was added in 2010.
Statistical Archetypal Analysis
Statistical Archetypal Analysis (SAA) is introduced for the dimensional reduction of a collection of probability distributions known via samples. Applications include medical diagnosis from clinical data in the form of distributions (such as distributions of blood pressure or heart rates from different patients), the analysis of climate data such as temperature or wind speed at different locations, and the study of bifurcations in stochastic dynamical systems. Distributions can be embedded into a Hilbert space with a suitable metric, and then analyzed similarly to feature vectors in Euclidean space. However, most dimensional reduction techniques –such as Principal Component Analysis– are not interpretable for distributions, as neither the components nor the reconstruction of input data by components are themselves distributions. To obtain an interpretable result, Archetypal Analysis (AA) is extended to distributions, requiring the components to be mixtures of the input distributions and approximating the input distributions by mixtures of components.
Statistical Classification In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into “spam” or “non-spam” classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.). In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.
Statistical Data and Metadata eXchange
SDMX is an initiative to foster standards for the exchange of statistical information. It started in 2001 and aims at fostering standards for Statistical Data and Metadata eXchange (SDMX). The SDMX message formats have two basic expressions, SDMX-ML (using XML syntax) and SDMX-EDI (using EDIFACT syntax and based on the GESMES/TS statistical message). The standards also include additional specifications (e.g. registry specification, web services). Version 1.0 of the SDMX standard was recognised as an ISO standard in 2005. The latest version of the standard, SDMX 2.1, was released in April 2011. In 2013 SDMX was approved by ISO as an International Standard (ISO 17369:2013).
Statistical Decision Theory
Statistical Disclosure Control
The purpose of statistical disclosure control is to minimise the risk of releasing confidential information whilst maximising access to useful, high quality data. Statistical disclosure control (SDC) covers a range of ways of changing data which are used to control the risk of an intruder finding out confidential information about a person or unit (such as a household or business). Laws protect the confidentiality of data about living people and there is also a range of legislation for specific types of data, for example the census. Many surveys also carry a confidentiality assurance, which is an agreement between the respondent and the data collector about how the collected data will be used. In the last ten years there has been a large increase in the electronic storage of data and wider access to information on the internet, including data for small geographical areas. At the same time, computing expertise and access to computers with a large amount of processing power have also increased. This means that data publishers need to take increased steps so that released micro-data (data held as individual records) and tabulations do not reveal any identifiable or disclosive information about a person, household or business.
Statistical Disclosure Limitation
The Statistical Disclosure Limitation (SDL) problem involves modifying a data set in such a manner that statistical analysis on the modified data is reasonably close to that performed on the original data, while preserving the privacy of individuals in the data set. For instance, we might have a medical data set on which we want to allow researchers to do their statistical analyses but not violate the privacy of the patients in the study.
Statistical Distance In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two statistical objects, which can be two random variables, or two probability distributions or samples, or the distance can be between an individual sample point and a population or a wider sample of points. A distance between populations can be interpreted as measuring the distance between two probability distributions and hence they are essentially measures of distances between probability measures. Where statistical distance measures relate to the differences between random variables, these may have statistical dependence, and hence these distances are not directly related to measures of distances between probability measures. Again, a measure of distance between random variables may relate to the extent of dependence between them, rather than to their individual values. Statistical distance measures are mostly not metrics and they need not be symmetric. Some types of distance measures are referred to as (statistical) divergences.
Statistical Engineering Several authors, including the American Statistical Association (ASA), have noted the challenges facing statisticians when attacking large, complex, unstructured problems, as opposed to well-defined textbook problems. Clearly, the standard paradigm of selecting the one ‘correct’ statistical method for such problems is not sufficient; a new paradigm is needed. Statistical engineering has been proposed as a discipline that can provide a viable paradigm to attack such problems, used in conjunction with sound statistical science. Of course, in order to develop as a true discipline, statistical engineering needs a well-developed theory, not just a formal definition and successful case studies. This article documents and disseminates the current state of the underlying theory of statistical engineering. Our purpose is to provide a vehicle for applied statisticians to further enhance the practice of statistics, and for academics so interested to continue development of the underlying theory of statistical engineering.
Statistical Inference In statistics, statistical inference is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation. Initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations. Inferential statistics are used to test hypotheses and make estimations using sample data.
Statistical Learning Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, bioinformatics and baseball. It is the theoretical framework underlying support vector machines.
Statistical Model A statistical model is a formalization of relationships between variables in the form of mathematical equations. A statistical model describes how one or more random variables are related to one or more other variables. The model is statistical as the variables are not deterministically but stochastically related. In mathematical terms, a statistical model is frequently thought of as a pair $(Y, \mathcal{P})$, where $Y$ is the set of possible observations and $\mathcal{P}$ is the set of possible probability distributions on $Y$. It is assumed that there is a distinct element of $\mathcal{P}$ which generates the observed data. Statistical inference enables us to make statements about which element(s) of this set are likely to be the true one.
Most statistical tests can be described in the form of a statistical model. For example, the Student’s t-test for comparing the means of two groups can be formulated as seeing if an estimated parameter in the model is different from 0. Another similarity between tests and models is that there are assumptions involved. Error is assumed to be normally distributed in most models.
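To make the t-test example concrete, the sketch below computes the pooled two-sample t statistic directly and then again as the slope statistic of the regression $y = \beta_0 + \beta_1 g + \varepsilon$, where $g$ is a 0/1 group indicator. The data and the function names `pooled_t` and `slope_t` are invented for illustration; the equal-variance (pooled) form of the test is assumed.

```python
from math import sqrt
from statistics import mean

def pooled_t(a, b):
    """Classical two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    sp2 = ss / (na + nb - 2)
    return (mb - ma) / sqrt(sp2 * (1 / na + 1 / nb))

def slope_t(a, b):
    """t statistic for beta_1 in y = beta_0 + beta_1 * g, g = 0 for a, 1 for b."""
    g = [0.0] * len(a) + [1.0] * len(b)
    y = list(a) + list(b)
    n = len(y)
    gbar, ybar = mean(g), mean(y)
    sxx = sum((gi - gbar) ** 2 for gi in g)
    sxy = sum((gi - gbar) * (yi - ybar) for gi, yi in zip(g, y))
    beta1 = sxy / sxx                      # equals the difference in group means
    beta0 = ybar - beta1 * gbar
    rss = sum((yi - beta0 - beta1 * gi) ** 2 for gi, yi in zip(g, y))
    se = sqrt(rss / (n - 2) / sxx)         # standard error of the slope
    return beta1 / se

group_a = [4.1, 5.0, 4.7, 5.3]
group_b = [5.9, 6.4, 5.6, 6.8, 6.1]
t_classic = pooled_t(group_a, group_b)
t_model = slope_t(group_a, group_b)        # agrees with t_classic up to rounding
```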
Statistical Power The power of a statistical test is the probability that it correctly rejects the null hypothesis when the null hypothesis is false (i.e. the probability of not committing a Type II error).
It can be equivalently thought of as the probability of correctly accepting the alternative hypothesis when the alternative hypothesis is true – that is, the ability of a test to detect an effect, if the effect actually exists.
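For a concrete case, consider a one-sided z-test of $H_0: \mu = \mu_0$ against $\mu > \mu_0$ with known standard deviation $\sigma$. Its power at a true shift $\delta = \mu - \mu_0$ is $1 - \Phi(z_{1-\alpha} - \delta\sqrt{n}/\sigma)$. The sketch below evaluates this formula; the function name `z_test_power` and the example numbers are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test with known sigma.

    effect: true mean minus the null mean
    n:      sample size
    alpha:  significance level (Type I error rate)
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)     # rejection threshold under H0
    shift = effect * sqrt(n) / sigma           # mean of the z statistic under H1
    return 1 - std_normal.cdf(z_crit - shift)

power_small = z_test_power(effect=0.5, sigma=1.0, n=10)
power_large = z_test_power(effect=0.5, sigma=1.0, n=50)
```

With no true effect the function returns `alpha`, and for a fixed positive effect the power grows with the sample size.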
Statistical Process Control
Statistical process control (SPC) is a method of quality control which uses statistical methods. SPC is applied in order to monitor and control a process. Monitoring and controlling the process ensures that it operates at its full potential. At its full potential, the process can make as much conforming product as possible with a minimum (if not an elimination) of waste (rework or scrap). SPC can be applied to any process where the “conforming product” (product meeting specifications) output can be measured. Key tools used in SPC include control charts; a focus on continuous improvement; and the design of experiments. An example of a process where SPC is applied is manufacturing lines.
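A control chart can be sketched as follows. This is a deliberately simplified version that sets limits at the baseline mean plus or minus three sample standard deviations; charts used in practice often estimate the spread differently (for example from moving ranges or subgroups). The function names and measurements are hypothetical.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits from in-control baseline data."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center, center + k * spread

def out_of_control(values, limits):
    """Indices of new measurements that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]

baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
limits = control_limits(baseline)
flags = out_of_control([10.0, 10.3, 11.5, 9.9], limits)
# Only the third measurement (index 2) lies outside the limits.
```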
Statistical Protocol IDentification
Identifying which application layer protocol is being used within a network communication session is important when assigning Quality of Service priorities as well as when conducting network security monitoring. Currently most protocol identification is performed through signature matching algorithms that rely on strings or regular expressions as signatures. This report presents a protocol identification scheme called the Statistical Protocol Identification (SPID) algorithm, which reliably identifies the application layer protocol by using statistical measurements of flow data as well as application layer data. The SPID algorithm utilises Kullback-Leibler divergence measurements to compare probability vectors created from observed network traffic to probability vectors of known protocols.
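The comparison step can be illustrated with a small Kullback-Leibler computation. This is only a sketch of the general idea, not the SPID algorithm itself: the probability vectors, the protocol names, the direction of the divergence (observed relative to model), and the `eps` floor that avoids division by zero are all assumptions made here.

```python
from math import log

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def closest_protocol(observed, models):
    """Name of the model whose probability vector diverges least from observed."""
    return min(models, key=lambda name: kl_divergence(observed, models[name]))

# Hypothetical probability vectors over three traffic attributes.
models = {
    "proto_a": [0.7, 0.2, 0.1],
    "proto_b": [0.1, 0.3, 0.6],
}
observed = [0.6, 0.3, 0.1]
best = closest_protocol(observed, models)   # "proto_a"
```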
Statistical Ranking Color Scheme
The problem of comparing a new solution method against existing ones to find statistically significant differences arises very often in sciences and engineering. When the problem instance being solved is defined by several parameters, assessing a number of methods with respect to many problem configurations simultaneously becomes a hard task. Some visualization technique is required for presenting a large number of statistical significance results in an easily interpretable way. Here we review an existing color-based approach called Statistical Ranking Color Scheme (SRCS) for displaying the results of multiple pairwise statistical comparisons between several methods assessed separately on a number of problem configurations. We introduce an R package implementing SRCS, which performs all the pairwise statistical tests from user data and generates customizable plots. We demonstrate its applicability on two examples from the areas of dynamic optimization and machine learning, in which several algorithms are compared on many problem instances, each defined by a combination of parameters.
Statistical Recurrent Unit
Sophisticated gated recurrent neural network architectures like LSTMs and GRUs have been shown to be highly effective in a myriad of applications. We develop an un-gated unit, the statistical recurrent unit (SRU), that is able to learn long term dependencies in data by only keeping moving averages of statistics. The SRU’s architecture is simple, un-gated, and contains a comparable number of parameters to LSTMs; yet, SRUs perform favorably to more sophisticated LSTM and GRU alternatives, often outperforming one or both in various tasks. We show the efficacy of SRUs as compared to LSTMs and GRUs in an unbiased manner by optimizing respective architectures’ hyperparameters in a Bayesian optimization scheme for both synthetic and real-world tasks.
Statistical Relational Learning
Statistical relational learning (SRL) is a subdiscipline of artificial intelligence and machine learning that is concerned with models of domains that exhibit both uncertainty (which can be dealt with using statistical methods) and complex, relational structure. Typically, the knowledge representation formalisms developed in SRL use (a subset of) first-order logic to describe relational properties of a domain in a general manner (universal quantification) and draw upon probabilistic graphical models (such as Bayesian networks or Markov networks) to model the uncertainty; some also build upon the methods of inductive logic programming. Significant contributions to the field have been made since the late 1990s. As is evident from the characterization above, the field is not strictly limited to learning aspects; it is equally concerned with reasoning (specifically probabilistic inference) and knowledge representation. Therefore, alternative terms that reflect the main foci of the field include statistical relational learning and reasoning (emphasizing the importance of reasoning) and first-order probabilistic languages (emphasizing the key properties of the languages with which models are represented).
Statistical Theory The theory of statistics provides a basis for the whole range of techniques, in both study design and data analysis, that are used within applications of statistics. The theory covers approaches to statistical-decision problems and to statistical inference, and the actions and deductions that satisfy the basic principles stated for these different approaches. Within a given approach, statistical theory gives ways of comparing statistical procedures; it can find a best possible procedure within a given context for given statistical problems, or can provide guidance on the choice between alternative procedures. Apart from philosophical considerations about how to make statistical inferences and decisions, much of statistical theory consists of mathematical statistics, and is closely linked to probability theory, to utility theory, and to optimization.
Statistics Statistics is the study of the collection, organization, analysis, interpretation and presentation of data. It deals with all aspects of data including the planning of data collection in terms of the design of surveys and experiments. When analyzing data, one or both of two statistical methodologies can be used: descriptive statistics and inferential statistics.
statnet statnet is a suite of software packages for network analysis that implement recent advances in the statistical modeling of networks. The analytic framework is based on Exponential family Random Graph Models (ergm). statnet provides a comprehensive framework for ergm-based network modeling, including tools for model estimation, model evaluation, model-based network simulation, and network visualization. This broad functionality is powered by a central Markov chain Monte Carlo (MCMC) algorithm.
Stein Variational Autoencoder A new method for learning variational autoencoders is developed, based on an application of Stein’s operator. The framework represents the encoder as a deep nonlinear function through which samples from a simple distribution are fed. One need not make parametric assumptions about the form of the encoder distribution, and performance is further enhanced by integrating the proposed encoder with importance sampling. Example results are demonstrated across multiple unsupervised and semi-supervised problems, including semi-supervised analysis of the ImageNet data, demonstrating the scalability of the model to large datasets.
Stein’s Paradox Stein’s example (or phenomenon or paradox), in decision theory and estimation theory, is the phenomenon that when three or more parameters are estimated simultaneously, there exist combined estimators more accurate on average (that is, having lower expected mean-squared error) than any method that handles the parameters separately.
An intuitive explanation is that optimizing for the mean-squared error of a combined estimator is not the same as optimizing for the errors of separate estimators of the individual parameters. In practical terms, if the combined error is in fact of interest, then a combined estimator should be used, even if the underlying parameters are independent; this occurs in channel estimation in telecommunications, for instance (different factors affect overall channel performance). On the other hand, if one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse.
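The best-known combined estimator of this kind is the James-Stein estimator, which shrinks the vector of observations toward the origin. The sketch below assumes one observation per parameter with known common variance `sigma2`; the function name and input values are illustrative.

```python
def james_stein(x, sigma2=1.0):
    """James-Stein estimate of a mean vector from one observation x ~ N(theta, sigma2 * I).

    Shrinks x toward zero by the factor 1 - (p - 2) * sigma2 / ||x||^2.
    """
    p = len(x)
    if p < 3:
        raise ValueError("the James-Stein estimator needs at least 3 parameters")
    norm2 = sum(v * v for v in x)
    if norm2 == 0.0:
        return list(x)                     # nothing to shrink
    shrink = 1.0 - (p - 2) * sigma2 / norm2
    return [shrink * v for v in x]

estimate = james_stein([2.0, -1.0, 2.0])
# ||x||^2 = 9 and p - 2 = 1, so every coordinate is multiplied by 8/9.
```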
Stemming Stemming is the term used in linguistic morphology and information retrieval to describe the process for reducing inflected (or sometimes derived) words to their word stem, base or root form – generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers.
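A toy suffix-stripping stemmer shows the idea of mapping related words to a common stem. The suffix list and the minimum stem length of three characters are arbitrary choices for illustration; this is not the Porter stemmer or any other published algorithm.

```python
SUFFIXES = ("ing", "ed", "es", "s")   # checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["connected", "connecting", "connects", "connect"]}
# All four words map to the stem "connect".
```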
Stepwise Regression In statistics, stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure. Usually, this takes the form of a sequence of F-tests or t-tests, but other techniques are possible, such as adjusted R-square, Akaike information criterion, Bayesian information criterion, Mallows’s Cp, PRESS, or false discovery rate. The frequent practice of fitting the final selected model followed by reporting estimates and confidence intervals without adjusting them to take the model building process into account has led to calls to stop using stepwise model building altogether or to at least make sure model uncertainty is correctly reflected.
STN-OCR Detecting and recognizing text in natural scene images is a challenging, yet not completely solved task. In recent years several new systems that try to solve at least one of the two sub-tasks (text detection and text recognition) have been proposed. In this paper we present STN-OCR, a step towards semi-supervised neural networks for scene text recognition, that can be optimized end-to-end. In contrast to most existing works that consist of multiple deep neural networks and several pre-processing steps we propose to use a single deep neural network that learns to detect and recognize text from natural images in a semi-supervised way. STN-OCR is a network that integrates and jointly learns a spatial transformer network, that can learn to detect text regions in an image, and a text recognition network that takes the identified text regions and recognizes their textual content. We investigate how our model behaves on a range of different tasks (detection and recognition of characters, and lines of text). Experimental results on public benchmark datasets show the ability of our model to handle a variety of different tasks, without substantial changes in its overall network structure.
Stochastic In probability theory, a purely stochastic system is one whose state is non-deterministic (i.e., “random”) so that the subsequent state of the system is determined probabilistically. Any system or process that must be analyzed using probability theory is stochastic at least in part. Stochastic systems and processes play a fundamental role in mathematical models of phenomena in many fields of science, engineering, and economics. Stochastic comes from a Greek word, which means “aim”. It also denotes a target stick; the pattern of arrows around a target stick stuck in a hillside is representative of what is stochastic.
Stochastic Approximation of Expectation Maximization
The Stochastic Approximation Expectation Maximization (SAEM) algorithm (1) computes the maximum likelihood estimator of the population parameters, without any approximation of the model (linearisation, quadrature approximation, …); (2) provides standard errors for the maximum likelihood estimator; and (3) estimates the conditional modes, the conditional means and the conditional standard deviations of the individual parameters, using the Hastings-Metropolis algorithm. Several applications of SAEM in agronomy, animal breeding and PKPD analysis have been published by members of the Monolix group.
Stochastic Average Gradient Algorithm
In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method.
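The core SAGA iteration keeps a table with the most recently evaluated gradient of each data point and combines the fresh gradient, the stored gradient, and the table average. The sketch below applies it to a one-parameter least-squares problem with terms $f_i(w) = \tfrac{1}{2}(x_i w - y_i)^2$. It omits the proximal step for composite objectives, and the step size $1/(3L)$ with $L = \max_i x_i^2$, the data, and the function name are assumptions chosen for this toy case.

```python
import random

def saga_least_squares(xs, ys, steps=2000, seed=0):
    """Minimise (1/n) * sum_i 0.5 * (x_i * w - y_i)^2 over a scalar w with SAGA."""
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0

    def grad(i, w):
        return (xs[i] * w - ys[i]) * xs[i]

    table = [grad(i, w) for i in range(n)]     # last gradient seen per data point
    table_mean = sum(table) / n
    step = 1.0 / (3.0 * max(x * x for x in xs))

    for _ in range(steps):
        j = rng.randrange(n)
        g_new = grad(j, w)
        # SAGA direction: new gradient minus stored gradient plus table average.
        w -= step * (g_new - table[j] + table_mean)
        # Keep the running average in sync with the updated table entry.
        table_mean += (g_new - table[j]) / n
        table[j] = g_new
    return w

w_hat = saga_least_squares([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# The data satisfy y = 2x exactly, so w_hat should be very close to 2.
```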
Stochastic Block Model
The stochastic block model is a generative model for random graphs. This model tends to produce graphs containing communities, subsets characterized by being connected with one another with particular edge densities. For example, edges may be more common within communities than between communities. The stochastic block model is important in statistics, machine learning, and network science, where it serves as a useful benchmark for the task of recovering community structure in graph data.
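Sampling from the model is straightforward: each pair of nodes is connected independently with a probability that depends only on the blocks the two nodes belong to. The function name, block sizes, and probability matrix below are illustrative assumptions.

```python
import random

def sample_sbm(block_sizes, probs, seed=0):
    """Sample an undirected graph from a stochastic block model.

    block_sizes: number of nodes in each community
    probs:       symmetric matrix, probs[a][b] = edge probability between
                 a node in block a and a node in block b
    Returns (labels, edges) where labels[i] is the block of node i.
    """
    rng = random.Random(seed)
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < probs[labels[i]][labels[j]]:
                edges.append((i, j))
    return labels, edges

# Extreme case: certain edges within communities, none between them,
# which yields two disjoint triangles.
labels, edges = sample_sbm([3, 3], [[1.0, 0.0], [0.0, 1.0]])
```

Intermediate probabilities, for example 0.8 within and 0.05 between blocks, give the noisy community structure that recovery algorithms are benchmarked on.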
Stochastic Computing based Deep Convolutional Neural Networks
With recent advances in the Internet of Things (IoT), it has become very attractive to implement deep convolutional neural networks (DCNNs) on embedded/portable systems. Presently, executing software-based DCNNs requires high-performance server clusters in practice, restricting their widespread deployment on mobile devices. To overcome this issue, considerable research efforts have been conducted in the context of developing highly-parallel and specific DCNN hardware, utilizing GPGPUs, FPGAs, and ASICs. Stochastic Computing (SC), which uses a bit-stream to represent a number within [-1, 1] by counting the number of ones in the bit-stream, has a high potential for implementing DCNNs with high scalability and ultra-low hardware footprint. Since multiplications and additions can be calculated using AND gates and multiplexers in SC, significant reductions in power/energy and hardware footprint can be achieved compared to the conventional binary arithmetic implementations. The tremendous savings in power (energy) and hardware resources bring about immense design space for enhancing scalability and robustness for hardware DCNNs. This paper presents the first comprehensive design and optimization framework of SC-based DCNNs (SC-DCNNs). We first present the optimal designs of function blocks that perform the basic operations, i.e., inner product, pooling, and activation function. Then we propose the optimal design of four types of combinations of basic function blocks, named feature extraction blocks, which are in charge of extracting features from input feature maps. Besides, weight storage methods are investigated to reduce the area and power/energy consumption for storing weights. Finally, the whole SC-DCNN implementation is optimized, with feature extraction blocks carefully selected, to minimize area and power/energy consumption while maintaining a high network accuracy level.
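The idea of multiplying with an AND gate can be simulated in software. The sketch below assumes the simpler unipolar encoding, in which a value in [0, 1] is the probability that a bit equals one, rather than the [-1, 1] range mentioned above; the stream length, seed, and function names are illustrative, and the result is an approximation whose accuracy improves with stream length.

```python
import random

def to_bitstream(value, length, rng):
    """Encode a value in [0, 1] as random bits with P(bit = 1) = value."""
    return [1 if rng.random() < value else 0 for _ in range(length)]

def from_bitstream(bits):
    """Decode a bit-stream by counting the fraction of ones."""
    return sum(bits) / len(bits)

def sc_multiply(a, b, length=100_000, seed=0):
    """Approximate a * b by AND-ing two independent bit-streams."""
    rng = random.Random(seed)
    stream_a = to_bitstream(a, length, rng)
    stream_b = to_bitstream(b, length, rng)
    product = [x & y for x, y in zip(stream_a, stream_b)]   # one AND gate per bit
    return from_bitstream(product)

approx = sc_multiply(0.5, 0.4)   # close to 0.2, up to sampling noise
```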
Stochastic Configuration Networks
This paper contributes to the development of randomized methods for neural networks. The proposed learner model is generated incrementally by stochastic configuration (SC) algorithms, termed Stochastic Configuration Networks (SCNs). In contrast to the existing randomised learning algorithms for single layer feed-forward neural networks (SLFNNs), we randomly assign the input weights and biases of the hidden nodes in the light of a supervisory mechanism, and the output weights are analytically evaluated in either a constructive or a selective manner. As fundamentals of SCN-based data modelling techniques, we establish some theoretical results on the universal approximation property. Three versions of SC algorithms are presented for regression problems (applicable for classification problems as well) in this work. Simulation results concerning both function approximation and real world data regression indicate some remarkable merits of our proposed SCNs in terms of less human intervention on the network size setting, the scope adaptation of random parameters, fast learning and sound generalization.
Stochastic Decorrelation Loss
Multi-view learning aims to learn an embedding space where multiple views are either maximally correlated for cross-view recognition, or decorrelated for latent factor disentanglement. A key challenge for deep multi-view representation learning is scalability. To correlate or decorrelate multi-view signals, the covariance of the whole training set should be computed which does not fit well with the mini-batch based training strategy, and moreover (de)correlation should be done in a way that is free of SVD-based computation in order to scale to contemporary layer sizes. In this work, a unified approach is proposed for efficient and scalable deep multi-view learning. Specifically, a mini-batch based Stochastic Decorrelation Loss (SDL) is proposed which can be applied to any network layer to provide soft decorrelation of the layer’s activations. This reveals the connection between deep multi-view learning models such as Deep Canonical Correlation Analysis (DCCA) and Factorisation Autoencoder (FAE), and allows them to be easily implemented. We further show that SDL is superior to other decorrelation losses in terms of efficacy and scalability.
Stochastic Differential Equation
A stochastic differential equation (SDE) is a differential equation in which one or more of the terms is a stochastic process, resulting in a solution which is itself a stochastic process. SDEs are used to model diverse phenomena such as fluctuating stock prices or physical systems subject to thermal fluctuations. Typically, SDEs incorporate random white noise which can be thought of as the derivative of Brownian motion (or the Wiener process); however, it should be mentioned that other types of random fluctuations are possible, such as jump processes.
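A common way to approximate a sample path of an SDE numerically is the Euler-Maruyama scheme, which replaces the white-noise term by Gaussian increments of Brownian motion. The sketch below is illustrative only; the drift and diffusion functions and their parameters (a mean-reverting process with constant noise) are assumptions chosen for the example.

```python
import math
import random

def euler_maruyama(drift, diffusion, x0, t_end, n_steps, rng):
    """Simulate one path of dX = drift(X) dt + diffusion(X) dW on [0, t_end]."""
    dt = t_end / n_steps
    x = x0
    path = [x]
    for _ in range(n_steps):
        # Increment of Brownian motion over dt: normal with variance dt.
        dw = rng.gauss(0.0, math.sqrt(dt))
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return path

# Hypothetical mean-reverting example: dX = 1.5 * (0 - X) dt + 0.3 dW.
rng = random.Random(0)
path = euler_maruyama(lambda x: 1.5 * (0.0 - x), lambda x: 0.3,
                      x0=1.0, t_end=2.0, n_steps=2000, rng=rng)
```

Each call with a different random generator yields a different path, reflecting that the solution of an SDE is itself a stochastic process.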
Stochastic Dual Coordinate Ascent
Stochastic Frontier Analysis
Stochastic frontier analysis (SFA) is a method of economic modeling. It has its starting point in the stochastic production frontier models.
Stochastic Frontier Models Stochastic frontier models make it possible to analyse technical inefficiency in the framework of production functions. Production units (firms, regions, countries, etc.) are assumed to produce according to a common technology, and reach the frontier when they produce the maximum possible output for a given set of inputs. Inefficiencies can be due to structural problems or market imperfections and other factors which cause countries to produce below their maximum attainable output. Over time, production units can become less inefficient and catch up to the frontier. It is also possible that the frontier shifts, indicating technical progress. In addition, production units can move along the frontier by changing input quantities. Finally, there can be some combination of these three effects. The stochastic frontier method allows growth to be decomposed into changes in input use, changes in technology and changes in efficiency, thus extending the widely used growth accounting method.
Stochastic Gradient Descent
Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.
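As a minimal sketch of the idea, the example below fits a line by minimizing a sum of squared errors, updating the parameters with the gradient of one randomly chosen term at a time. The data, learning rate and function name are assumptions made for illustration.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.1, epochs=200, seed=0):
    """Minimize sum_i 0.5 * (w * x_i + b - y_i)^2 with stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)          # visit the terms in a random order
        for i in indices:
            err = w * xs[i] + b - ys[i]
            # Gradient of the single term 0.5 * err^2 with respect to (w, b).
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b

# Hypothetical noiseless data generated from y = 2x + 1.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
```

Each update touches only one term of the sum, so the cost per step does not grow with the number of data points.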
Stochastic Multidimensional Scaling Multidimensional scaling (MDS) is a popular dimensionality reduction technique that has been widely used for network visualization and cooperative localization. However, the traditional stress minimization formulation of MDS necessitates the use of batch optimization algorithms that are not scalable to large-sized problems. This paper considers an alternative stochastic stress minimization framework that is amenable to incremental and distributed solutions. A novel linear-complexity stochastic optimization algorithm is proposed that is provably convergent and simple to implement. The applicability of the proposed algorithm to localization and visualization tasks is also expounded. Extensive tests on synthetic and real datasets demonstrate the efficacy of the proposed algorithm.
Stochastic Neural Network Stochastic neural networks are a type of artificial neural network, a tool of artificial intelligence. They are built by introducing random variations into the network, either by giving the network’s neurons stochastic transfer functions, or by giving them stochastic weights. This makes them useful tools for optimization problems, since the random fluctuations help them escape from local minima. Stochastic neural networks that are built by using stochastic transfer functions are often called Boltzmann machines.
Stochastic Optimization
Stochastic optimization (SO) methods are optimization methods that generate and use random variables. For stochastic problems, the random variables appear in the formulation of the optimization problem itself, which involves random objective functions or random constraints, for example. Stochastic optimization methods also include methods with random iterates. Some stochastic optimization methods use random iterates to solve stochastic problems, combining both meanings of stochastic optimization. Stochastic optimization methods generalize deterministic methods for deterministic problems.
Stochastic Ordering In probability theory and statistics, a stochastic order quantifies the concept of one random variable being ‘bigger’ than another. These are usually partial orders, so that one random variable A may be neither stochastically greater than, less than nor equal to another random variable B. Many different orders exist, which have different applications.
An Introduction to Stochastic Orders
Stochastic Partial Differential Equation
Stochastic partial differential equations (SPDEs) are similar to ordinary stochastic differential equations. They are essentially partial differential equations that have random forcing terms and coefficients. They can be exceedingly difficult to solve. However, they have strong connections with quantum field theory and statistical mechanics.
Stochastic Process In probability theory, a stochastic process, or sometimes random process (widely used) is a collection of random variables, representing the evolution of some system of random values over time. This is the probabilistic counterpart to a deterministic process (or deterministic system). Instead of describing a process which can only evolve in one way (as in the case, for example, of solutions of an ordinary differential equation), in a stochastic or random process there is some indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely many) directions in which the process may evolve. In the simple case of discrete time, as opposed to continuous time, a stochastic process involves a sequence of random variables and the time series associated with these random variables (for example, see Markov chain, also known as discrete-time Markov chain). One approach to stochastic processes treats them as functions of one or several deterministic arguments (inputs; in most cases this will be the time parameter) whose values (outputs) are random variables: non-deterministic (single) quantities which have certain probability distributions. Random variables corresponding to various times (or points, in the case of random fields) may be completely different. The main requirement is that these different random quantities all take values in the same space (the codomain of the function). Although the random values of a stochastic process at different times may be independent random variables, in most commonly considered situations they exhibit complicated statistical correlations. Familiar examples of processes modeled as stochastic time series include stock market and exchange rate fluctuations, signals such as speech, audio and video, medical data such as a patient’s EKG, EEG, blood pressure or temperature, and random movement such as Brownian motion or random walks. 
A generalization, the random field, is defined by letting the variables’ parameters be members of a topological space instead of being limited to real values representing time. Examples of random fields include static images, random terrain (landscapes), wind waves or composition variations of a heterogeneous material.
Stochastic Programming In the field of mathematical optimization, stochastic programming is a framework for modeling optimization problems that involve uncertainty. Whereas deterministic optimization problems are formulated with known parameters, real world problems almost invariably include some unknown parameters. When the parameters are known only within certain bounds, one approach to tackling such problems is called robust optimization. Here the goal is to find a solution which is feasible for all such data and optimal in some sense. Stochastic programming models are similar in style but take advantage of the fact that probability distributions governing the data are known or can be estimated. The goal here is to find some policy that is feasible for all (or almost all) the possible data instances and maximizes the expectation of some function of the decisions and the random variables. More generally, such models are formulated, solved analytically or numerically, and analyzed in order to provide useful information to a decision-maker.
StochAstic Recursive grAdient algoritHm
In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH), as well as its practical variant SARAH+, as a novel approach to finite-sum minimization problems. Unlike vanilla SGD and other modern stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple recursive framework for updating stochastic gradient estimates; compared to SAG/SAGA, SARAH does not require storage of past gradients. The linear convergence rate of SARAH is proven under a strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of SARAH, a property that SVRG does not possess. Numerical experiments demonstrate the efficiency of our algorithm.
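The recursive gradient estimate can be sketched on a small least-squares problem. Each outer iteration starts from a full gradient; each inner step then updates the estimate as v_t = grad f_i(w_t) - grad f_i(w_{t-1}) + v_{t-1} for a randomly chosen index i. This is an illustrative sketch, not the authors' code: the data, step size and loop lengths are assumptions, and restarting the outer loop from the last inner iterate is a simplification made here.

```python
import random

def grad_i(theta, x, y):
    """Gradient of f_i(theta) = 0.5 * (w * x + b - y)^2, with theta = (w, b)."""
    w, b = theta
    err = w * x + b - y
    return (err * x, err)

def sarah(xs, ys, lr=0.05, outer=60, inner=100, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    theta = (0.0, 0.0)
    for _ in range(outer):
        # Outer step: v_0 is the full gradient at the current point.
        grads = [grad_i(theta, xs[i], ys[i]) for i in range(n)]
        v = (sum(g[0] for g in grads) / n, sum(g[1] for g in grads) / n)
        prev, theta = theta, (theta[0] - lr * v[0], theta[1] - lr * v[1])
        for _ in range(inner - 1):
            i = rng.randrange(n)
            g_new = grad_i(theta, xs[i], ys[i])
            g_old = grad_i(prev, xs[i], ys[i])
            # Recursive estimate: only the previous iterate is needed,
            # no table of past gradients is stored.
            v = (g_new[0] - g_old[0] + v[0], g_new[1] - g_old[1] + v[1])
            prev, theta = theta, (theta[0] - lr * v[0], theta[1] - lr * v[1])
    return theta

# Hypothetical noiseless data generated from y = 2x + 1.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 1.0 for x in xs]
w_hat, b_hat = sarah(xs, ys)
```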
Stochastic Self-Organising Map
Stochastic Simulation Algorithm
Stochastic simulation is a simulation that operates with variables that can change with a certain probability. Stochastic means that particular factors (values) are variable or random. With a stochastic model we create a projection which is based on a set of random values. Outputs are recorded and the projection is repeated with a new set of random (variable) values. These steps are repeated until a reasonable amount of data has been gathered (thousands or millions of runs). In the end, the distribution of the outputs shows the most probable estimates as well as a frame of expectations (outlier values dividing those we can still expect from the ones we should not).
StochasticNet Deep neural networks are a branch of machine learning that has seen a meteoric rise in popularity due to their powerful abilities to represent and model high-level abstractions in highly complex data. One area in deep neural networks that is ripe for exploration is neural connectivity formation. A pivotal study on the brain tissue of rats found that synaptic formation for specific functional connectivity in neocortical neural microcircuits can be surprisingly well modeled and predicted as a random formation. Motivated by this intriguing finding, we introduce the concept of StochasticNet, where deep neural networks are formed via stochastic connectivity between neurons. Such stochastic synaptic formations in a deep neural network architecture can potentially allow for efficient utilization of neurons for performing specific tasks. To evaluate the feasibility of such a deep neural network architecture, we train a StochasticNet using three image datasets. Experimental results show that a StochasticNet can be formed that provides comparable accuracy and reduced overfitting when compared to conventional deep neural networks with more than two times the number of neural connections.
Stone’s Paradox In technical jargon, he shows that ‘a finitely additive measure on the free group with two generators is nonconglomerable.’ In English: even for a simple problem with a discrete parameter space, flat priors can lead to surprises.
Stop Words In computing, stop words are words which are filtered out before or after processing of natural language data (text). There is no single definitive list of stop words that all tools use, and such a filter is not always used. Some tools specifically avoid removing them to support phrase search. Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as ‘The Who’, ‘The The’, or ‘Take That’. Other search engines remove some of the most common words, including lexical words such as “want”, from a query in order to improve performance.
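A stop-word filter can be sketched as a simple set lookup. The stop list below reuses the function words named above; the function name and the whitespace tokenization are assumptions made for illustration, and real tools use longer lists and more careful tokenizers.

```python
# Illustrative stop list built from the examples in the entry above.
STOP_WORDS = {"the", "is", "at", "which", "on"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lower-case the text, split on whitespace and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

tokens = remove_stop_words("The cat is on the mat")   # ['cat', 'mat']
band = remove_stop_words("The The")                   # []
```

The second call shows the phrase-search problem: a query consisting only of stop words is filtered down to nothing.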
Stopping Time In probability theory, in particular in the study of stochastic processes, a stopping time (also Markov time) is a specific type of “random time”: a random variable whose value is interpreted as the time at which a given stochastic process exhibits a certain behavior of interest. A stopping time is often defined by a stopping rule, a mechanism for deciding whether to continue or stop a process on the basis of the present position and past events, and which will almost always lead to a decision to stop at some finite time.
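A standard example of a stopping time is the first time a random walk reaches a given level: whether to stop at step t can be decided from the path up to t alone. The sketch below simulates this for a simple symmetric walk; the function name and the cap on the number of steps are assumptions made so that the simulation always terminates.

```python
import random

def first_hitting_time(target, max_steps, rng):
    """Return the first step at which a +/-1 random walk from 0 equals target.

    Returns None if the level is not reached within max_steps.
    """
    position = 0
    for t in range(1, max_steps + 1):
        position += rng.choice((-1, 1))
        # The stopping rule looks only at the present position,
        # never at future steps of the walk.
        if position == target:
            return t
    return None

hit = first_hitting_time(target=3, max_steps=10_000, rng=random.Random(7))
```

By contrast, "the last time the walk visits 3" is not a stopping time, because deciding it requires knowledge of the future path.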
Strang’s Diagram A diagram that shows the action of A, an m×n matrix, as a linear transformation from the space R^n to R^m. The diagram helps to understand the fundamental concepts of Linear Algebra in terms of the four subspaces by visually illustrating the actions of A on all these subspaces.
Stream Processing Stream processing is a computer programming paradigm, related to SIMD (single instruction, multiple data), that allows some applications to more easily exploit a limited form of parallel processing. Such applications can use multiple computational units, such as the FPUs on a GPU or field programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units. The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a set of data (a stream), a series of operations (kernel functions) is applied to each element in the stream. Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Kernel functions are usually pipelined, and local on-chip memory is reused to minimize external memory bandwidth. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks. Stream processing hardware can use scoreboarding, for example, to launch DMAs at runtime, when dependencies become known. The elimination of manual DMA management reduces software complexity, and the elimination of hardware caches reduces the amount of the area not dedicated to computational units such as ALUs. During the 1980s stream processing was explored within dataflow programming. An example is the language SISAL (Streams and Iteration in a Single Assignment Language).
StreamFlow StreamFlow is a stream processing tool designed to rapidly build and monitor processing workflows. The ultimate goal of StreamFlow is to make working with stream processing frameworks such as Apache Storm easier, faster, and with “enterprise” like management functionality. StreamFlow provides a graphical user interface for non-developers such as data scientists, analysts, or operational users to rapidly build scalable data flows and analytics.
Streamgraph A streamgraph, or stream graph, is a type of stacked area graph which is displaced around a central axis, resulting in a flowing, organic shape. Streamgraphs were developed by Lee Byron and popularized by their use in a February 2008 New York Times article on movie box office revenues.
Streaming Ensemble Algorithm
Ensemble methods have recently garnered a great deal of attention in the machine learning community. Techniques such as Boosting and Bagging have proven to be highly effective but require repeated resampling of the training data, making them inappropriate in a data mining context. The methods presented in this paper take advantage of plentiful data, building separate classifiers on sequential chunks of training points. These classifiers are combined into a fixed-size ensemble using a heuristic replacement strategy. The result is a fast algorithm for large-scale or streaming data that classifies as well as a single decision tree built on all the data, requires approximately constant memory, and adjusts quickly to concept drift.
Streaming Platform LAnguage Shell
Stream Processing LAnguage SHell (SPLASH) is a scripting language that brings extensibility to CCL (Continuous Computation Language), allowing you to create custom operators and functions that go beyond standard SQL. CCL is the primary event processing language of the Event Stream Processor. ESP projects are defined in CCL.
Streaming Processing Engines
Apache Storm, Apache S4
Streaming Variational Bayes
We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data – a case where SVI may be applied – and in the streaming setting, where SVI does not apply.
Streamulus Streamulus is a C++ library that makes it very easy to process event streams. You need to write code that handles a single event and the library turns this code into a data structure that handles infinite streams of such events. The stream operators you write can have side effects and they can maintain an internal state.
String Distance Algorithms String distance algorithms calculate the distance between two strings. For example, “WhatIs” can be compared to “wahtis” to measure how similar the two strings are, which allows fuzzy matching and helps to catch typos.
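One widely used string distance is the Levenshtein (edit) distance, the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. The sketch below is one possible implementation; lower-casing the inputs before comparing them is an assumption made here so that the example pair differs only by the swapped letters.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

distance = levenshtein("WhatIs".lower(), "wahtis".lower())   # 2
```

A small distance relative to the string length suggests a likely typo rather than a different word.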
Strongly Connected Component In the mathematical theory of directed graphs, a graph is said to be strongly connected if every vertex is reachable from every other vertex. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are themselves strongly connected. It is possible to test the strong connectivity of a graph, or to find its strongly connected components, in linear time.
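One linear-time method for finding strongly connected components is Kosaraju's two-pass algorithm: a depth-first search records finishing order, then a second search on the reversed graph collects one component per start vertex. The sketch below assumes the graph is given as a dictionary of successor lists and uses recursion, so it is only suitable for small graphs.

```python
def strongly_connected_components(graph):
    """graph: dict mapping each vertex to a list of its successors."""
    # Pass 1: record vertices in order of depth-first search completion.
    visited, order = set(), []

    def dfs(v):
        visited.add(v)
        for w in graph.get(v, []):
            if w not in visited:
                dfs(w)
        order.append(v)

    for v in graph:
        if v not in visited:
            dfs(v)

    # Build the reversed graph (every edge v -> w becomes w -> v).
    reverse = {v: [] for v in graph}
    for v, successors in graph.items():
        for w in successors:
            reverse.setdefault(w, []).append(v)

    # Pass 2: search the reversed graph in decreasing finishing order.
    assigned, components = set(), []

    def collect(v, component):
        assigned.add(v)
        component.append(v)
        for w in reverse.get(v, []):
            if w not in assigned:
                collect(w, component)

    for v in reversed(order):
        if v not in assigned:
            component = []
            collect(v, component)
            components.append(component)
    return components

# Hypothetical example: a 3-cycle feeding into a 2-cycle.
example = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"]}
components = strongly_connected_components(example)
```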
Structural Causal Model
Structural Data Structural data holds information about the relationship between events. Some key concerns include the following:
1) how these relationships should be conveyed from the user to the computer and from the computer back to the user;
2) how the data should be configured to allow for systemic evaluation;
3) how to effectively store large amounts of data while allowing for rapid access;
4) how to reliably transfer the details between computers regardless of operating platform;
5) how to process the data to achieve different strategic outcomes;
6) how to arrange the data in order to express the relevance between events; and
7) how to ensure current design does not limit future application.
Structural Equation Modeling
Structural equation modelling (SEM) is a statistical technique for testing and estimating causal relations using a combination of statistical data and qualitative causal assumptions. This definition of SEM was articulated by the geneticist Sewall Wright (1921), the economist Trygve Haavelmo (1943) and the cognitive scientist Herbert A. Simon (1953), and formally defined by Judea Pearl (2000) using a calculus of counterfactuals. Structural equation models (SEM) allow both confirmatory and exploratory modeling, meaning they are suited to both theory testing and theory development. Confirmatory modeling usually starts out with a hypothesis that gets represented in a causal model. The concepts used in the model must then be operationalized to allow testing of the relationships between the concepts in the model. The model is tested against the obtained measurement data to determine how well the model fits the data. The causal assumptions embedded in the model often have falsifiable implications which can be tested against the data. With an initial theory, SEM can be used inductively by specifying a corresponding model and using data to estimate the values of free parameters. Often the initial hypothesis requires adjustment in light of model evidence. SEM can be used purely for exploration; this would usually be a technique similar to exploratory factor analysis, a technique commonly used in psychometrics. Among the strengths of SEM is the ability to construct latent variables: variables that are not measured directly, but are estimated in the model from several measured variables, each of which is predicted to ‘tap into’ the latent variables. This allows the modeler to explicitly capture the unreliability of measurement in the model, which in theory allows the structural relations between latent variables to be accurately estimated. Factor analysis, path analysis and regression all represent special cases of SEM. 
In SEM, the qualitative causal assumptions are represented by the missing variables in each equation, as well as vanishing covariances among some error terms. These assumptions are testable in experimental studies and must be confirmed judgmentally in observational studies.
Structural Expectation-Maximization Algorithm In recent years there has been a flurry of works on learning probabilistic belief networks. Current state of the art methods have been shown to be successful for two learning scenarios: learning both network structure and parameters from complete data, and learning parameters for a fixed network from incomplete data – that is, in the presence of missing values or hidden variables. However, no method has yet been demonstrated to effectively learn network structure from incomplete data. In this paper, we propose a new method for learning network structure from incomplete data. This method is based on an extension of the Expectation-Maximization (EM) algorithm for model selection problems that performs search for the best structure inside the EM procedure. We prove the convergence of this algorithm, and adapt it for learning belief networks. We then describe how to learn networks in two scenarios: when the data contains missing values, and in the presence of hidden variables. We provide experimental results that show the effectiveness of our procedure in both scenarios.
The Bayesian Structural EM Algorithm
Structural Hamming Distance
In simple terms, this is the number of edge insertions, deletions or flips required to transform one graph into another graph.
“Hamming Distance”
“Structural Intervention Distance”
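The structural Hamming distance between two directed graphs can be sketched as a count over node pairs: a pair contributes one unit whenever its edge is present in only one graph or is oriented differently. The representation below (edges as `(source, target)` tuples, no self-loops) and the function name are assumptions made for illustration; published variants differ in how they score reversed edges.

```python
def structural_hamming_distance(edges_a, edges_b):
    """Count node pairs whose edge status differs between two directed graphs.

    Each graph is an iterable of (source, target) tuples without self-loops.
    A missing, extra or reversed edge counts as one operation.
    """
    a, b = set(edges_a), set(edges_b)
    pairs = {frozenset(edge) for edge in a | b}
    distance = 0
    for pair in pairs:
        u, v = tuple(pair)
        state_a = ((u, v) in a, (v, u) in a)
        state_b = ((u, v) in b, (v, u) in b)
        if state_a != state_b:
            distance += 1
    return distance

# Hypothetical graphs: edge 1->2 is flipped and edge 3->4 is added.
shd = structural_hamming_distance({(1, 2), (2, 3)}, {(2, 1), (2, 3), (3, 4)})   # 2
```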
Structural Intervention Distance
Causal inference relies on the structure of a graph, often a directed acyclic graph (DAG). Different graphs may result in different causal inference statements and different intervention distributions. To quantify such differences, we propose a (pre-) distance between DAGs, the structural intervention distance (SID). The SID is based on a graphical criterion only and quantifies the closeness between two DAGs in terms of their corresponding causal inference statements. It is therefore well-suited for evaluating graphs that are used for computing interventions. Instead of DAGs it is also possible to compare CPDAGs, completed partially directed acyclic graphs that represent Markov equivalence classes. Since it differs significantly from the popular Structural Hamming Distance (SHD), the SID constitutes a valuable additional measure.
Structural Learning and Integrative DEcomposition
The increased availability of the multi-view data (data on the same samples from multiple sources) has led to strong interest in models based on low-rank matrix factorizations. These models represent each data view via shared and individual components, and have been successfully applied for exploratory dimension reduction, association analysis between the views, and further learning tasks such as consensus clustering. Despite these advances, there remain significant challenges in modeling partially-shared components, and identifying the number of components of each type (shared/partially-shared/individual). In this work, we formulate a novel linked component model that directly incorporates partially-shared structures. We call this model SLIDE for Structural Learning and Integrative DEcomposition of multi-view data. We prove the existence of SLIDE decomposition and explicitly characterize the identifiability conditions. The proposed model fitting and selection techniques allow for joint identification of the number of components of each type, in contrast to existing sequential approaches. In our empirical studies, SLIDE demonstrates excellent performance in both signal estimation and component selection. We further illustrate the methodology on the breast cancer data from The Cancer Genome Atlas repository.
Structural Maxent Model We present a new class of density estimation models, Structural Maxent models, with feature functions selected from a union of possibly very complex sub-families and yet benefiting from strong learning guarantees. The design of our models is based on a new principle supported by uniform convergence bounds and taking into consideration the complexity of the different sub-families composing the full set of features. We prove new data-dependent learning bounds for our models, expressed in terms of the Rademacher complexities of these sub-families. We also prove a duality theorem, which we use to derive our Structural Maxent algorithm. We give a full description of our algorithm, including the details of its derivation, and report the results of several experiments demonstrating that its performance improves on that of existing L1-norm regularized Maxent algorithms. We further similarly define conditional Structural Maxent models for multi-class classification problems. These are conditional probability models also making use of a union of possibly complex feature subfamilies. We prove a duality theorem for these models as well, which reveals their connection with existing binary and multi-class deep boosting algorithms.
Structural Topic Model
The Structural Topic Model (STM) allows researchers to estimate a topic model which includes document-level meta-data. Statistical models of text have become increasingly popular in statistics and computer science as a method of exploring large document collections. Social scientists often want to move beyond exploration, to measurement and experimentation, and make inference about social and political processes that drive discourse and content. In this paper, we develop a model of text data that supports this type of substantive research. Our approach is to posit a hierarchical mixed membership model for analyzing topical content of documents, in which mixing weights are parameterized by observed covariates. In this model, topical prevalence and topical content are specified as a simple generalized linear model on an arbitrary number of document-level covariates, such as news source and time of release, enabling researchers to introduce elements of the experimental design that informed document collection into the model, within a generally applicable framework. We demonstrate the proposed methodology by analyzing a collection of news reports about China, where we allow the prevalence of topics to evolve over time and vary across newswire services. Our methods help quantify the effect of news wire source on both the frequency and nature of topic coverage. All the methods we describe are available as part of the open source R package stm.
Structured Attention Networks Attention networks have proven to be an effective approach for embedding categorical inference within a deep neural network. However, for many tasks we may want to model richer structural dependencies without abandoning end-to-end training. In this work, we experiment with incorporating richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees. We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. We further find that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention.
Structured Factored Inference
Reasoning on large and complex real-world models is a computationally difficult task, yet one that is required for effective use of many AI applications. A plethora of inference algorithms have been developed that work well on specific models or only on parts of general models. Consequently, a system that can intelligently apply these inference algorithms to different parts of a model for fast reasoning is highly desirable. We introduce a new framework called structured factored inference (SFI) that provides the foundation for such a system. Using models encoded in a probabilistic programming language, SFI provides a sound means to decompose a model into sub-models, apply an inference algorithm to each sub-model, and combine the resulting information to answer a query. Our results show that SFI is nearly as accurate as exact inference yet retains the benefits of approximate inference methods.
Structured Learning Structured prediction is a generalization of the standard paradigms of supervised learning, classification and regression. All of these can be thought of as finding a function that minimizes some loss over a training set. The differences are in the kind of functions that are used and the losses. In classification, the target domain is a set of discrete class labels, and the loss is usually the 0-1 loss, i.e. counting the misclassifications. In regression, the target domain is the real numbers, and the loss is usually mean squared error. In structured prediction, both the target domain and the loss are more or less arbitrary. This means the goal is not to predict a label or a number, but a possibly much more complicated object like a sequence or a graph. In structured prediction, we often deal with finite, but large output spaces Y. This situation could be dealt with using classification with a very large number of classes. The idea behind structured prediction is that we can do better than this, by making use of the structure of the output space.
Structured Sufficient Dimension Reduction
Structured Support Vector Machine The structured support vector machine is a machine learning algorithm that generalizes the Support Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels. As an example, a sample instance might be a natural language sentence, and the output label is an annotated parse tree. Training a classifier consists of showing it pairs of samples and their correct output labels. After training, the structured SVM model allows one to predict for new sample instances the corresponding output label; that is, given a natural language sentence, the classifier can produce the most likely parse tree.
“Support Vector Machine”
Subadditivity In mathematics, subadditivity is a property of a function that states, roughly, that evaluating the function for the sum of two elements of the domain always returns something less than or equal to the sum of the function’s values at each element. There are numerous examples of subadditive functions in various areas of mathematics, particularly norms and square roots. Additive maps are special cases of subadditive functions.
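The defining inequality, f(x + y) ≤ f(x) + f(y), can be spot-checked numerically. The sketch below is only an illustration: the helper name `is_subadditive`, the sample points, and the small floating-point tolerance are assumptions, and passing the check on a finite set of points does not prove subadditivity in general.

```python
import math
from itertools import product


def is_subadditive(f, points, tol=1e-12):
    """Return True if f(x + y) <= f(x) + f(y) for every pair of sample points.

    This is a finite spot check, not a proof; `tol` absorbs rounding error.
    """
    return all(f(x + y) <= f(x) + f(y) + tol for x, y in product(points, repeat=2))


samples = [0.0, 0.5, 1.0, 2.0, 9.0]

# The square root and the absolute value (a norm on the reals) pass the check.
print(is_subadditive(math.sqrt, samples))
print(is_subadditive(abs, samples + [-3.0]))

# Squaring fails: (1 + 1)^2 = 4 exceeds 1^2 + 1^2 = 2.
print(is_subadditive(lambda x: x * x, samples))
```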
subgraph2vec In this paper, we present subgraph2vec, a novel approach for learning latent representations of rooted subgraphs from large graphs inspired by recent advancements in Deep Learning and Graph Kernels. These latent representations encode semantic substructure dependencies in a continuous vector space, which is easily exploited by statistical models for tasks such as graph classification, clustering, link prediction and community detection. subgraph2vec leverages local information obtained from neighbourhoods of nodes to learn their latent representations in an unsupervised fashion. We demonstrate that subgraph vectors learnt by our approach could be used in conjunction with classifiers such as CNNs, SVMs and relational data clustering algorithms to achieve significantly superior accuracies. Also, we show that the subgraph vectors could be used for building a deep learning variant of Weisfeiler-Lehman graph kernel. Our experiments on several benchmark and large-scale real-world datasets reveal that subgraph2vec achieves significant improvements in accuracies over existing graph kernels on both supervised and unsupervised learning tasks. Specifically, on two real-world program analysis tasks, namely, code clone and malware detection, subgraph2vec outperforms state-of-the-art kernels by more than 17% and 4%, respectively.
Sublinear Algorithm
Submanifold Sparse Convolutional Network Convolutional networks are the de-facto standard for analysing spatio-temporal data such as images, videos, 3D shapes, etc. Whilst some of this data is naturally dense (for instance, photos), many other data sources are inherently sparse. Examples include pen-strokes forming on a piece of paper, or (colored) 3D point clouds that were obtained using a LiDAR scanner or RGB-D camera. Standard ‘dense’ implementations of convolutional networks are very inefficient when applied on such sparse data. We introduce a sparse convolutional operation tailored to processing sparse data that differs from prior work on sparse convolutional networks in that it operates strictly on submanifolds, rather than ‘dilating’ the observation with every layer in the network. Our empirical analysis of the resulting submanifold sparse convolutional networks shows that they perform on par with state-of-the-art methods whilst requiring substantially less computation.
Subsampled Double Bootstrap
Bayesian Bootstraps for Massive Data
SUBSCALE Rapid growth of high dimensional datasets in recent years has created an emergent need to extract the knowledge underlying them. Clustering is the process of automatically finding groups of similar data points in the space of the dimensions or attributes of a dataset. Finding clusters in the high dimensional datasets is an important and challenging data mining problem. Data group together differently under different subsets of dimensions, called subspaces. Quite often a dataset can be better understood by clustering it in its subspaces, a process called subspace clustering. But the exponential growth in the number of these subspaces with the dimensionality of data makes the whole process of subspace clustering computationally very expensive. There is a growing demand for efficient and scalable subspace clustering solutions in many Big data application domains like biology, computer vision, astronomy and social networking. Apriori based hierarchical clustering is a promising approach to find all possible higher dimensional subspace clusters from the lower dimensional clusters using a bottom-up process. However, the performance of the existing algorithms based on this approach deteriorates drastically with the increase in the number of dimensions. Most of these algorithms require multiple database scans and generate a large number of redundant subspace clusters, either implicitly or explicitly, during the clustering process. In this paper, we present SUBSCALE, a novel clustering algorithm to find non-trivial subspace clusters with minimal cost and it requires only k database scans for a k-dimensional data set. Our algorithm scales very well with the dimensionality of the dataset and is highly parallelizable. We present the details of the SUBSCALE algorithm and its evaluation in this paper.
Subspace Clustering
Subspace Outlier Degree
(see Definition 1)
Subspace Outlier Detection
Substochastic Monte Carlo
In this paper we introduce and formalize Substochastic Monte Carlo (SSMC) algorithms. These algorithms, originally intended to be a better classical foil to quantum annealing than simulated annealing, prove to be worthy optimization algorithms in their own right. In SSMC, a population of walkers is initialized according to a known distribution on an arbitrary search space and varied into the solution of some optimization problem of interest. The first argument of this paper shows how an existing classical algorithm, ‘Go-With-The-Winners’ (GWW), is a limiting case of SSMC when restricted to binary search and particular driving dynamics. Although limiting to GWW, SSMC is more general. We show that (1) GWW can be efficiently simulated within the SSMC framework, (2) SSMC can be exponentially faster than GWW, (3) by naturally incorporating structural information, SSMC can exponentially outperform the quantum algorithm that first inspired it, and (4) SSMC exhibits desirable search features in general spaces. Our approach combines ideas from genetic algorithms (GWW), theoretical probability (Fleming-Viot processes), and quantum computing. Not only do we demonstrate that SSMC is often more efficient than competing algorithms, but we also hope that our results connecting these disciplines will impact each independently. An implemented version of SSMC has previously enjoyed some success as a competitive optimization algorithm for Max-$k$-SAT.
Sufficient Dimension Reduction
In statistics, sufficient dimension reduction (SDR) is a paradigm for analyzing data that combines the ideas of dimension reduction with the concept of sufficiency. Dimension reduction has long been a primary goal of regression analysis. Given a response variable $y$ and a $p$-dimensional predictor vector $\mathbf{x}$, regression analysis aims to study the distribution of $y \mid \mathbf{x}$, the conditional distribution of $y$ given $\mathbf{x}$. A dimension reduction is a function $R(\mathbf{x})$ that maps $\mathbf{x}$ to a subset of $\mathbb{R}^k$, $k < p$, thereby reducing the dimension of $\mathbf{x}$. For example, $R(\mathbf{x})$ may be one or more linear combinations of $\mathbf{x}$. A dimension reduction $R(\mathbf{x})$ is said to be sufficient if the distribution of $y \mid R(\mathbf{x})$ is the same as that of $y \mid \mathbf{x}$. In other words, no information about the regression is lost in reducing the dimension of $\mathbf{x}$ if the reduction is sufficient.
Sufficient Factor Broadcasting
Matrix-parametrized models, including multiclass logistic regression and sparse coding, are used in machine learning (ML) applications ranging from computer vision to computational biology. When these models are applied to large-scale ML problems starting at millions of samples and tens of thousands of classes, their parameter matrix can grow at an unexpected rate, resulting in high parameter synchronization costs that greatly slow down distributed learning. To address this issue, we propose a Sufficient Factor Broadcasting (SFB) computation model for efficient distributed learning of a large family of matrix-parameterized models, which share the following property: the parameter update computed on each data sample is a rank-1 matrix, i.e., the outer product of two ‘sufficient factors’ (SFs). By broadcasting the SFs among worker machines and reconstructing the update matrices locally at each worker, SFB improves communication efficiency — communication costs are linear in the parameter matrix’s dimensions, rather than quadratic — without affecting computational correctness. We present a theoretical convergence analysis of SFB, and empirically corroborate its efficiency on four different matrix-parametrized ML models.
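The core idea of the entry above, sending the two sufficient factors instead of the full rank-1 update, can be sketched in a few lines. This is a toy single-process illustration, not the authors' system: the function names, the 3×4 matrix shape, and the example factor values are all assumptions made for the sketch.

```python
def outer(u, v):
    """Full rank-1 update matrix: the outer product of the two sufficient factors."""
    return [[ui * vj for vj in v] for ui in u]


def apply_sufficient_factors(weights, factors):
    """Reconstruct each rank-1 update locally from (u, v) and add it to weights in place."""
    for u, v in factors:
        for i, ui in enumerate(u):
            for j, vj in enumerate(v):
                weights[i][j] += ui * vj
    return weights


def floats_sent(u, v, use_sfb):
    """Numbers transmitted per update: J + K with SFB, J * K for the full matrix."""
    return len(u) + len(v) if use_sfb else len(u) * len(v)


# Hypothetical factors produced by two workers for a 3 x 4 parameter matrix.
factors = [
    ([1.0, 2.0, 0.0], [0.5, 0.0, -1.0, 2.0]),
    ([0.0, -1.0, 3.0], [1.0, 1.0, 0.0, 0.5]),
]

# A receiving worker rebuilds the updates from the factors alone.
local = apply_sufficient_factors([[0.0] * 4 for _ in range(3)], factors)

# Reference: sum the full update matrices, as if they had been sent directly.
reference = [[0.0] * 4 for _ in range(3)]
for u, v in factors:
    full = outer(u, v)
    for i in range(3):
        for j in range(4):
            reference[i][j] += full[i][j]

print(local == reference)                      # same parameters either way
print(floats_sent(*factors[0], use_sfb=True))  # 3 + 4 numbers
print(floats_sent(*factors[0], use_sfb=False)) # 3 * 4 numbers
```

The gap between J + K and J × K is small for this toy shape but grows quickly when both dimensions are in the tens of thousands, which is the regime the entry describes.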
Sufficient Statistic In statistics, a statistic is sufficient with respect to a statistical model and its associated unknown parameter if “no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter”. In particular, a statistic is sufficient for a family of probability distributions if the sample from which it is calculated gives no additional information than does the statistic, as to which of those probability distributions is that of the population from which the sample was taken.
Suffix Tree
(PAT Tree)
In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed ‘trie’ (digital tree) containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations. The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provide one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string’s suffix tree typically requires significantly more space than storing the string itself.
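To make the "suffixes as keys, positions as values" idea concrete, here is a minimal sketch of an uncompressed suffix trie. It is deliberately simpler than a real suffix tree: it takes quadratic time and space rather than linear, because edges are not compressed and no linear-time construction is used. The class and function names are assumptions for illustration.

```python
class TrieNode:
    """One node of an uncompressed suffix trie."""

    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.positions = []  # start positions of suffixes passing through this node


def build_suffix_trie(text):
    """Insert every suffix of text; quadratic, unlike a true suffix tree."""
    root = TrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, TrieNode())
            node.positions.append(start)
    return root


def find_all(root, pattern):
    """Return every position where pattern occurs, in increasing order."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.positions


trie = build_suffix_trie("banana")
print(find_all(trie, "ana"))  # → [1, 3]
print(find_all(trie, "nab"))  # → []
```

Once the structure is built, a lookup costs time proportional to the pattern length, independent of the text length, which is the kind of speedup the entry refers to.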
Sugeno Integral In mathematics, the Sugeno integral, named after M. Sugeno, is a type of integral with respect to a fuzzy measure.
Suite of Fast Incremental Algorithms for Machine Learning
The suite of fast incremental algorithms for machine learning (sofia-ml) can be used for training models for classification, regression, ranking, or combined regression and ranking. Several different techniques are available. This release is intended to aid researchers and practitioners who require fast methods for classification and ranking on large, sparse data sets. Supported classification, regression, and ranking learners include:
• Pegasos SVM
• Stochastic Gradient Descent (SGD) SVM
• Passive-Aggressive Perceptron
• Perceptron with Margins
• Logistic Regression (with Pegasos Projection)
This package provides a commandline utility for training models and using them to predict on new data, and also exposes an API for model training and prediction that can be used in new applications. The underlying libraries for data sets, weight vectors, and example vectors are also provided for researchers wishing to use these classes to implement other algorithms.
Sum of Powered Score
Sum Product Networks
Sum-Product Networks (SPNs) are recently introduced deep tractable probabilistic models by which several kinds of inference queries can be answered exactly and in a tractable time. Up to now, they have been largely used as black box density estimators, assessed only by comparing their likelihood scores. In this paper we explore and exploit the inner representations learned by SPNs. We do this with a threefold aim: first we want to get a better understanding of the inner workings of SPNs; secondly, we seek additional ways to evaluate one SPN model and compare it against other probabilistic models, providing diagnostic tools to practitioners; lastly, we want to empirically evaluate how good and meaningful the extracted representations are, as in a classic Representation Learning framework. In order to do so we revise their interpretation as deep neural networks and we propose to exploit several visualization techniques on their node activations and network outputs under different types of inference queries. To investigate these models as feature extractors, we plug some SPNs, learned in a greedy unsupervised fashion on image datasets, in supervised classification learning tasks. We extract several embedding types from node activations by filtering nodes by their type, by their associated feature abstraction level and by their scope. In a thorough empirical comparison we prove them to be competitive against those generated from popular feature extractors such as Restricted Boltzmann Machines. Finally, we investigate embeddings generated from random probabilistic marginal queries as means to compare other tractable probabilistic models on a common ground, extending our experiments to Mixtures of Trees.
Sum-Product Graphical Model
This paper introduces a new probabilistic architecture called Sum-Product Graphical Model (SPGM). SPGMs combine traits from Sum-Product Networks (SPNs) and Graphical Models (GMs): Like SPNs, SPGMs always enable tractable inference using a class of models that incorporate context specific independence. Like GMs, SPGMs provide a high-level model interpretation in terms of conditional independence assumptions and corresponding factorizations. Thus, the new architecture represents a class of probability distributions that combines, for the first time, the semantics of graphical models with the evaluation efficiency of SPNs. We also propose a novel algorithm for learning both the structure and the parameters of SPGMs. A comparative empirical evaluation demonstrates competitive performances of our approach in density estimation.
Summary Receiver Operating Characteristic
Sunburst Chart A ring chart, also known as a sunburst chart or a multilevel pie chart, is used to visualize hierarchical data, depicted by concentric circles. The circle in the centre represents the root node, with the hierarchy moving outward from the center. A segment of the inner circle bears a hierarchical relationship to those segments of the outer circle which lie within the angular sweep of the parent segment.
Superadditivity In mathematics, a sequence {a_n}, n ≥ 1, is called superadditive if it satisfies the inequality a_{n+m} ≥ a_n + a_m for all m and n. The major reason for the use of superadditive sequences is the following lemma due to Michael Fekete.
SuperPivot We present SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We show that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produce analysis results for more than 1000 languages, conducting – to the best of our knowledge – the largest crosslingual computational study performed to date. We extend existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: We only require that a linguistic feature is overtly marked in a few of thousands of languages as opposed to requiring that it be marked in all languages under investigation.
Superpixels Superpixels group perceptually similar pixels to create visually meaningful entities while heavily reducing the number of primitives. Because of these properties, superpixel algorithms have received much attention since their naming in 2003. By today, publicly available and well-understood superpixel algorithms have turned into standard tools in low-level vision. As such, and due to their quick adoption in a wide range of applications, appropriate benchmarks are crucial for algorithm selection and comparison. Until now, the rapidly growing number of algorithms as well as varying experimental setups hindered the development of a unifying benchmark. We present a comprehensive evaluation of 28 state-of-the-art superpixel algorithms utilizing a benchmark focussing on fair comparison and designed to provide new and relevant insights. To this end, we explicitly discuss parameter optimization and the importance of strictly enforcing connectivity. Furthermore, by extending well-known metrics, we are able to summarize algorithm performance independent of the number of generated superpixels, thereby overcoming a major limitation of available benchmarks. In addition, we discuss runtime, robustness against noise, blur and affine transformations, implementation details as well as aspects of visual quality. Finally, we present an overall ranking of superpixel algorithms which redefines the state-of-the-art and enables researchers to easily select appropriate algorithms and the corresponding implementations which themselves are made publicly available as part of our benchmark at
SuperSpike A vast majority of computation in the brain is performed by spiking neural networks. Despite the ubiquity of such spiking, we currently lack an understanding of how biological spiking neural circuits learn and compute in-vivo, as well as how we can instantiate such capabilities in artificial spiking circuits in-silico. Here we revisit the problem of supervised learning in temporally coding multi-layer spiking neural networks. First, by using a surrogate gradient approach, we derive SuperSpike, a nonlinear voltage-based three factor learning rule capable of training multi-layer networks of deterministic integrate-and-fire neurons to perform nonlinear computations on spatiotemporal spike patterns. Second, inspired by recent results on feedback alignment, we compare the performance of our learning rule under different credit assignment strategies for propagating output errors to hidden units. Specifically, we test uniform, symmetric and random feedback, finding that simpler tasks can be solved with any type of feedback, while more complex tasks require symmetric feedback. In summary, our results open the door to obtaining a better scientific understanding of learning and computation in spiking neural networks by advancing our ability to train them to solve nonlinear problems involving transformations between different spatiotemporal spike-time patterns.
Supervised Quantile Normalisation
Quantile normalisation is a popular normalisation method for data subject to unwanted variations such as images, speech, or genomic data. It applies a monotonic transformation to the feature values of each sample to ensure that after normalisation, they follow the same target distribution for each sample. Choosing a ‘good’ target distribution remains however largely empirical and heuristic, and is usually done independently of the subsequent analysis of normalised data. We propose instead to couple the quantile normalisation step with the subsequent analysis, and to optimise the target distribution jointly with the other parameters in the analysis. We illustrate this principle on the problem of estimating a linear model over normalised data, and show that it leads to a particular low-rank matrix regression problem that can be solved efficiently. We illustrate the potential of our method, which we term SUQUAN, on simulated data, images and genomic data, where it outperforms standard quantile normalisation.
Supervised Tensor Learning
Support Support is defined on itemsets and gives the proportion of transactions which contain an itemset X. It is used as a measure of significance (importance) of an itemset. Since it basically uses the count of transactions it is often called a frequency constraint. An itemset with a support greater than a set minimum support threshold, supp(X)>σ, is called a frequent or large itemset. Support's main feature is that it possesses the downward closure property (anti-monotonicity), which means that all subsets of a frequent set are also frequent. This property (actually, the fact that no superset of an infrequent set can be frequent) is used to prune the search space (usually thought of as a lattice or tree of itemsets with increasing size) in level-wise algorithms (e.g., the Apriori algorithm). The disadvantage of support is the rare item problem. Items that occur very infrequently in the data set are pruned although they would still produce interesting and potentially valuable rules. The rare item problem is important for transaction data which usually have a very uneven distribution of support for the individual items (typical is a power-law distribution where few items are used all the time and most items are rarely used).
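As a small illustration of the definition and of downward closure, the sketch below computes support over a handful of made-up transactions. The transaction contents, the function names, and the threshold value are assumptions chosen for the example.

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def is_frequent(itemset, transactions, min_support):
    """An itemset is frequent if its support exceeds the threshold."""
    return support(itemset, transactions) > min_support


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support({"bread"}, transactions))          # → 0.75
print(support({"bread", "milk"}, transactions))  # → 0.5

# Downward closure: adding items can never increase support, so every
# subset of a frequent itemset is itself frequent.
print(support({"bread", "milk"}, transactions) <= support({"bread"}, transactions))
print(is_frequent({"bread", "milk", "butter"}, transactions, min_support=0.4))
```

Level-wise algorithms rely on the last property in reverse: once an itemset falls below the threshold, none of its supersets needs to be counted.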
Support Tensor Machine
Support Vector Data Description
Data domain description concerns the characterization of a data set. A good description covers all target data but includes no superfluous space. The boundary of a dataset can be used to detect novel data or outliers. We will present the Support Vector Data Description (SVDD) which is inspired by the Support Vector Classifier. It obtains a spherically shaped boundary around a dataset and analogous to the Support Vector Classifier it can be made flexible by using other kernel functions. The method is made robust against outliers in the training set and is capable of tightening the description by using negative examples. We show characteristics of the Support Vector Data Descriptions using artificial and real data.
Sampling Method for Fast Training of Support Vector Data Description
Support Vector Machine
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
Surface Network
We study data-driven representations for three-dimensional triangle meshes, which are one of the prevalent objects used to represent 3D geometry. Recent works have developed models that exploit the intrinsic geometry of manifolds and graphs, namely the Graph Neural Networks (GNNs) and its spectral variants, which learn from the local metric tensor via the Laplacian operator. Despite offering excellent sample complexity and built-in invariances, intrinsic geometry alone is invariant to isometric deformations, making it unsuitable for many applications. To overcome this limitation, we propose several upgrades to GNNs to leverage extrinsic differential geometry properties of three-dimensional surfaces, increasing its modeling power. In particular, we propose to exploit the Dirac operator, whose spectrum detects principal curvature directions — this is in stark contrast with the classical Laplace operator, which directly measures mean curvature. We coin the resulting model the \emph{Surface Network (SN)}. We demonstrate the efficiency and versatility of SNs on two challenging tasks: temporal prediction of mesh deformations under non-linear dynamics and generative models using a variational autoencoder framework with encoders/decoders given by SNs.
Surrogate Variable Analysis
Modern high-throughput molecular biology experiments measure data for thousands of related features and seek to rank those features for association with some variables of experimental or clinical importance. The process of ranking features for association with primary variables is complicated by genetic, environmental, and technical factors that influence hundreds or thousands of features at a time. In high-dimensional experiments these factors are often unknown, unmeasured, or incapable of being tractably modeled. Consistent patterns of variation across features due to unmeasured or unmodeled factors can confound the relationship between the primary variables and the measured features. In this thesis we provide a statistical framework for modeling large-scale noise dependence caused by unmeasured or unmodeled factors in high-throughput data. We argue that estimating the sources of noise dependence is more appropriate than estimating the pairwise covariance between all features when the number of features is large. A direct connection is made with the well-studied problem of multiple testing dependence, which typically focuses on the distribution of P-values from multiple testing procedures. We introduce the concept of surrogate variables, estimable linear combinations of the true unmeasured or unmodeled factors causing noise dependence, that can be included when modeling the relationship between the primary variables and the feature level data. We also propose algorithms for estimating surrogate variables based on principal component analysis of relevant subsets of features. Under certain conditions accounting for the estimated surrogate variables asymptotically corrects the ranking and error rate estimation in high-throughput data analysis. We also discuss pathological situations when surrogate variables cannot be estimated.
To illustrate the power of this approach, we apply our estimates of the surrogate variables to improve reproducibility in a large clinical gene expression study of trauma related outcomes.
Survival Analysis Survival analysis is a branch of statistics which deals with the analysis of the time duration until one or more events happen, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, duration analysis or duration modeling in economics, and event history analysis in sociology. Survival analysis attempts to answer questions such as: what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?
Book: Survival Analysis
Book: Handbook of Survival Analysis
https://…/Survival Analysis with Plotly
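The first question in the Survival Analysis entry (what proportion of a population survives past a certain time) is commonly answered with the Kaplan-Meier estimator, which the entry itself does not name. The sketch below is a plain-Python illustration under simple assumptions: right-censoring only, an event flag of 1 for an observed event and 0 for a censored observation, and made-up data.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier survival curve as a list of (time, survival probability).

    durations: observed time for each subject.
    events:    1 if the event was observed at that time, 0 if censored.
    A step is recorded only at times where at least one event occurred.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            # Multiply by the fraction of the risk set that survives time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Everyone observed at t (event or censored) leaves the risk set.
        at_risk -= leaving
    return curve


# Hypothetical data: five subjects, two of them censored.
durations = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]

for time, prob in kaplan_meier(durations, events):
    print(time, round(prob, 3))
```

With these inputs the estimated survival probability drops to 0.8 after time 1, 0.6 after time 2, and 0.3 after time 3; the censored subjects shrink the risk set without producing a step.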
Svensson’s Method Svensson’s Method is a rank-invariant nonparametric method for the analysis of ordered scales which measures the level of change both from systematic and individual aspects. For the details, please refer to Svensson E. Analysis of systematic and random differences between paired ordinal categorical data [dissertation]. Stockholm: Almqvist & Wiksell International; 1993.
svg.js A lightweight library for manipulating and animating SVG.
svg-pan-zoom.js JavaScript library that enables panning and zooming of an SVG in an HTML document, with mouse events or custom JavaScript hooks.
Swapout We describe Swapout, a new stochastic training method that outperforms ResNets of identical network structure, yielding impressive results on CIFAR-10 and CIFAR-100. Swapout samples from a rich set of architectures including dropout, stochastic depth and residual architectures as special cases. When viewed as a regularization method swapout not only inhibits co-adaptation of units in a layer, similar to dropout, but also across network layers. We conjecture that swapout achieves strong regularization by implicitly tying the parameters across layers. When viewed as an ensemble training method, it samples a much richer set of architectures than existing methods such as dropout or stochastic depth. We propose a parameterization that reveals connections to existing architectures and suggests a much richer set of architectures to be explored. We show that our formulation suggests an efficient training method and validate our conclusions on CIFAR-10 and CIFAR-100, matching state of the art accuracy. Remarkably, our 32 layer wider model performs similarly to a 1001 layer ResNet model.
Swift A probabilistic program defines a probability measure over its semantic structures. One common goal of probabilistic programming languages (PPLs) is to compute posterior probabilities for arbitrary models and queries, given observed evidence, using a generic inference engine. Most PPL inference engines – even the compiled ones – incur significant runtime interpretation overhead, especially for contingent and open-universe models. This paper describes Swift, a compiler for the BLOG PPL. Swift-generated code incorporates optimizations that eliminate interpretation overhead, maintain dynamic dependencies efficiently, and handle memory management for possible worlds of varying sizes. Experiments comparing Swift with other PPL engines on a variety of inference problems demonstrate speedups ranging from 12x to 326x.
SWP Operator The sweep operator as defined in (Dempster, 1969), commonly referred to as the SWP operator, is a useful tool for a computational statistician working with covariance matrices. In particular, the SWP operator allows a statistician to quickly regress all variables against one specified variable, obtaining OLS estimates for regression coefficients and variances in a single application. Subsequent applications of the SWP operator allow for regressing against more variables.
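A minimal sketch of one sweep step follows. Sign conventions for the swept pivot differ between references; the version here (pivot replaced by −1/d) is one common choice and is an assumption, as are the function name and the toy data. It is applied to a cross-product matrix for a no-intercept regression of y on x.

```python
def sweep(a, k):
    """Sweep the symmetric matrix a (list of lists) on pivot index k.

    Convention assumed here, with d = a[k][k]:
      b[k][k] = -1 / d
      b[i][k] = b[k][i] = a[i][k] / d             for i != k
      b[i][j] = a[i][j] - a[i][k] * a[k][j] / d   for i, j != k
    """
    n = len(a)
    d = a[k][k]
    if d == 0:
        raise ValueError("cannot sweep on a zero pivot")
    b = [[a[i][j] - a[i][k] * a[k][j] / d for j in range(n)] for i in range(n)]
    for i in range(n):
        b[i][k] = a[i][k] / d
        b[k][i] = a[k][i] / d
    b[k][k] = -1.0 / d
    return b


# Toy data: y is exactly 2 * x, so the slope should be 2 and the residual sum 0.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
xx = sum(v * v for v in x)
xy = sum(u * v for u, v in zip(x, y))
yy = sum(v * v for v in y)

swept = sweep([[xx, xy], [xy, yy]], 0)
print(swept[0][1])  # OLS slope of y on x
print(swept[1][1])  # residual sum of squares
```

Sweeping the result on a further predictor index would add that predictor to the regression, which is the "subsequent applications" behaviour the entry describes.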
Sybase IQ SAP Sybase IQ is a highly optimized analytics server designed specifically to deliver superior performance for mission-critical business intelligence, analytics and data warehousing solutions on any standard hardware and operating system.
Symbiosis The 20th century paradigm of paper forms and typewriters lives on in most of today’s User Interfaces. This kind of UI is adequate for repeatable tasks, but not for highly dynamic, situation-driven activities. The ubiquity of new devices with amazing capabilities has opened the door for a completely new way of working with computers: Combining the respective strengths of human and computer by means of frictionless interaction.
Symbol-Concept Association Network
The natural world is infinitely diverse, yet this diversity arises from a relatively small set of coherent properties and rules, such as the laws of physics or chemistry. We conjecture that biological intelligent systems are able to survive within their diverse environments by discovering the regularities that arise from these rules primarily through unsupervised experiences, and representing this knowledge as abstract concepts. Such representations possess useful properties of compositionality and hierarchical organisation, which allow intelligent agents to recombine a finite set of conceptual building blocks into an exponentially large set of useful new concepts. This paper describes SCAN (Symbol-Concept Association Network), a new framework for learning such concepts in the visual domain. We first use the previously published beta-VAE (Higgins et al., 2017a) architecture to learn a disentangled representation of the latent structure of the visual world, before training SCAN to extract abstract concepts grounded in such disentangled visual primitives through fast symbol association. Our approach requires very few pairings between symbols and images and makes no assumptions about the choice of symbol representations. Once trained, SCAN is capable of multimodal bi-directional inference, generating a diverse set of image samples from symbolic descriptions and vice versa. It also allows for traversal and manipulation of the implicit hierarchy of compositional visual concepts through symbolic instructions and learnt logical recombination operations. Such manipulations enable SCAN to invent and learn novel visual concepts through recombination of the few learnt concepts.
Symbolic Aggregate Approximation
While there are literally hundreds of papers on discretizing (symbolizing, tokenizing, quantizing) time series, none of the techniques allows a distance measure that lower bounds a distance measure defined on the original time series. For this reason, the generic time series data mining approach illustrated in Table 1 is of little utility, since the approximate solution to the problem created in main memory may be arbitrarily dissimilar to the true solution that would have been obtained on the original data. If, however, one had a symbolic approach that allowed lower bounding of the true distance, one could take advantage of the generic time series data mining model, and of a host of other algorithms, definitions and data structures which are only defined for discrete data, including hashing, Markov models, and suffix trees. This is exactly the contribution of this paper. We call our symbolic representation of time series SAX (Symbolic Aggregate approXimation), and define it in the next section….
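The excerpt above stops before SAX is defined, so the following is only a minimal sketch of the commonly described procedure: z-normalise the series, average it over equal-length frames (piecewise aggregate approximation), and map each frame mean to a letter using breakpoints that cut the standard normal distribution into equiprobable regions. The function name `sax`, its parameters, and the requirement that the series length be a multiple of the number of segments are assumptions made for illustration.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Turn a numeric series into a SAX-style word (illustrative sketch)."""
    if len(series) % n_segments != 0:
        # Simplifying assumption: frames must have equal integer length.
        raise ValueError("series length must be a multiple of n_segments")

    # Step 1: z-normalise so values are comparable to a standard normal.
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]

    # Step 2: piecewise aggregate approximation (mean of each frame).
    frame = len(z) // n_segments
    paa = [mean(z[i * frame:(i + 1) * frame]) for i in range(n_segments)]

    # Step 3: breakpoints splitting N(0, 1) into len(alphabet) equiprobable bins.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]

    # Step 4: a frame's symbol index is the number of breakpoints it reaches.
    return "".join(alphabet[sum(v >= b for b in breakpoints)] for v in paa)


word = sax([0, 0, 1, 1, 2, 2, 3, 3], n_segments=4)
```

Because the word is a plain string, it can be fed directly to the discrete-data tools mentioned above, such as hashing or suffix trees.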
Symbolic Computation In mathematics and computer science, computer algebra, also called symbolic computation or algebraic computation, is a scientific area that refers to the study and development of algorithms and software for manipulating mathematical expressions and other mathematical objects. Although, properly speaking, computer algebra should be a subfield of scientific computing, they are generally considered distinct fields because scientific computing is usually based on numerical computation with approximate floating point numbers, while symbolic computation emphasizes exact computation with expressions containing variables that have no given value and are thus manipulated as symbols (hence the name symbolic computation).
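The contrast between approximate and exact computation can be illustrated with a small sketch that uses only the Python standard library. Polynomials in an unassigned variable x are stored as lists of exact rational coefficients (lowest degree first); the helper `poly_mul` is a hypothetical name introduced here and is not taken from any computer algebra system.

```python
from fractions import Fraction

# Floating point arithmetic is approximate: this comparison is False.
float_equal = (0.1 + 0.2 == 0.3)

# Exact rational arithmetic gives the expected result: this comparison is True.
exact_equal = (Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists, lowest degree first."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


# (x + 1/3) * (x - 1/3) = x^2 - 1/9, computed without ever assigning x a value.
product = poly_mul([Fraction(1, 3), Fraction(1)], [Fraction(-1, 3), Fraction(1)])
```

Here `product` holds the coefficients of x^2 - 1/9 exactly, with no rounding error.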
Symbolic Data Any data that take into account the variation inside classes of standard observations. The data descriptions of the units are called ‘symbolic’ when they are more complex than standard ones because they contain internal variation and are structured.
Symbolic Data Analysis
Symbolic data analysis (SDA) is an extension of standard data analysis where symbolic data tables are used as input and symbolic objects are produced as output. The data units are called symbolic since they are more complex than standard ones, as they not only contain values or categories, but also include internal variation and structure. SDA is based on four spaces: the space of individuals, the space of concepts, the space of descriptions, and the space of symbolic objects. The space of descriptions models individuals, while the space of symbolic objects models concepts.
Symbolic Multidimensional Scaling Symbolic multidimensional scaling aims to present relations between objects treated as hypercubes in multidimensional space. To allow interpretation and graphical representation of the results, a two-dimensional space is usually used. Most symbolic multidimensional scaling methods require an interval dissimilarity matrix as input. This matrix can be obtained from the opinions of n judges or from a dissimilarity measure for interval-valued variables that produces interval-valued dissimilarities.
Symbolic Multidimensional Scaling of Interval Dissimilarities
Multidimensional scaling aims at reconstructing dissimilarities between pairs of objects by distances in a low dimensional space. However, in some cases the dissimilarity itself is unknown, but the range of the dissimilarity is given. Such fuzzy data fall in the wider class of symbolic data (Bock & Diday, 2000). Denoeux and Masson (2002) have proposed to model an interval dissimilarity by a range of the distance defined as the minimum and maximum distance between two rectangles representing the objects. In this paper, we provide a new algorithm called SymScal that is based on iterative majorization. The advantage is that each iteration is guaranteed to improve the solution until no improvement is possible. In a simulation study, we investigate the quality of this algorithm. We discuss the use of SymScal on empirical dissimilarity intervals of sounds.
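The minimum and maximum distance between two rectangles can be computed per dimension, as in the sketch below. It assumes each object is an axis-aligned rectangle given by a center and a non-negative half-width (spread) in every dimension, and that distances are Euclidean; the function name `interval_distance` and this parameterisation are illustrative choices, not the SymScal implementation.

```python
import math


def interval_distance(center_a, spread_a, center_b, spread_b):
    """Return (min, max) Euclidean distance between two axis-aligned rectangles.

    Each rectangle is described by its center and its half-width per dimension.
    """
    lo_sq = 0.0
    hi_sq = 0.0
    for ca, ra, cb, rb in zip(center_a, spread_a, center_b, spread_b):
        gap = abs(ca - cb)   # distance between the centers in this dimension
        reach = ra + rb      # combined half-widths in this dimension
        # Closest points: the intervals may overlap, so clamp at zero.
        lo_sq += max(0.0, gap - reach) ** 2
        # Farthest points: opposite ends of the two intervals.
        hi_sq += (gap + reach) ** 2
    return math.sqrt(lo_sq), math.sqrt(hi_sq)


d_min, d_max = interval_distance((0.0, 0.0), (1.0, 1.0), (5.0, 0.0), (1.0, 1.0))
```

For the two unit-spread squares in the example, the closest points are 3 apart and the farthest points are sqrt(53) apart, so a given dissimilarity interval is reproduced well when it is close to [3, sqrt(53)].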
Synthesizing What I Mean
Modern programming frameworks come with large libraries, with diverse applications such as matching regular expressions, parsing XML files and sending email. Programmers often use search engines such as Google and Bing to learn about existing APIs. In this paper, we describe SWIM, a tool which suggests code snippets given API-related natural language queries such as ‘generate md5 hash code’. We translate user queries into the APIs of interest using clickthrough data from the Bing search engine. Then, based on patterns learned from open-source code repositories, we synthesize idiomatic code describing the use of these APIs. We introduce \emph{structured call sequences} to capture API-usage patterns. Structured call sequences are a generalized form of method call sequences, with if-branches and while-loops to represent conditional and repeated API usage patterns, and are simple to extract and amenable to synthesis. We evaluated SWIM with 30 common C# API-related queries received by Bing. For 70% of the queries, the first suggested snippet was a relevant solution, and a relevant solution was present in the top 10 results for all benchmarked queries. The online portion of the workflow is also very responsive, at an average of 1.5 seconds per snippet.
Syslog Syslog has been around for a number of decades and provides a protocol used for transporting event messages between computer systems and software applications. The protocol utilizes a layered architecture, which allows the use of any number of transport protocols for transmission of syslog messages. It also provides a message format that allows vendor-specific extensions to be provided in a structured way. Syslog has been standardized by the IETF in RFC 5424 since 2009, but it has been in use since the 1980s and for many years served as the de facto standard for logging without any authoritative published specification. Best practices often promote storing log messages on a centralized server that can provide a correlated view of all the log data generated by different system components. Otherwise, analyzing each log file separately and then manually linking related log messages is extremely time-consuming. As a result, forwarding local log messages to a remote log analytics server or service via Syslog has been commonly adopted as a standard industrial logging solution.
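As a rough illustration of the RFC 5424 message format, the sketch below assembles a message string from its header fields, optional structured data, and free-text message. The priority value is facility * 8 + severity, the version is 1, and a missing field is written as "-". The function names `format_syslog` and `escape_sd_value` are hypothetical helpers introduced here; the sketch does not validate field lengths or character sets and does not send anything over the network.

```python
NILVALUE = "-"  # RFC 5424 placeholder for an absent field


def escape_sd_value(value):
    """Escape the characters that RFC 5424 requires escaping in parameter values."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("]", "\\]")


def format_syslog(facility, severity, timestamp, hostname, app_name, msg,
                  procid=None, msgid=None, structured_data=None):
    """Build an RFC 5424 style message string (sketch, no validation)."""
    pri = facility * 8 + severity
    if structured_data:
        # structured_data maps an SD-ID to a dict of parameter names and values.
        sd = "".join(
            "[" + sd_id
            + "".join(f' {k}="{escape_sd_value(v)}"' for k, v in params.items())
            + "]"
            for sd_id, params in structured_data.items()
        )
    else:
        sd = NILVALUE
    header = (f"<{pri}>1 {timestamp} {hostname} {app_name} "
              f"{procid or NILVALUE} {msgid or NILVALUE}")
    return f"{header} {sd} {msg}"


line = format_syslog(4, 2, "2003-10-11T22:14:15.003Z", "mymachine.example.com",
                     "su", "'su root' failed", msgid="ID47")
```

The structured-data element is where vendor-specific extensions go: each element carries its own identifier, so a central log server can parse the fields it knows and ignore the rest.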
Systems Of Insight Systems of insight are the business discipline and technology to harness insights and turn data into action. Systems of insight deliver what big data cannot: effective action through insights-driven software. After all, that is the only thing firms really care about.