BabelNet  BabelNet is a multilingual lexicalized semantic network and ontology developed at the Linguistic Computing Laboratory in the Department of Computer Science of the Sapienza University of Rome. BabelNet was automatically created by linking the largest multilingual Web encyclopedia, Wikipedia, to the most popular computational lexicon of the English language, WordNet. The integration is performed by means of an automatic mapping and by filling in lexical gaps in resourcepoor languages with the aid of statistical machine translation. The result is an ‘encyclopedic dictionary’ that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations. Additional lexicalizations and definitions are added by linking to freelicense wordnets, OmegaWiki, the English Wiktionary and Wikidata. Similarly to WordNet, BabelNet groups words in different languages into sets of synonyms, called Babel synsets. For each Babel synset, BabelNet provides short definitions (called glosses) in many languages harvested from both WordNet and Wikipedia. BabelNet 
Backpropagation  Backpropagation, an abbreviation for “backward propagation of errors”, is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respects to all the weights in the network. The gradient is fed to the optimization method which in turn uses it to update the weights, in an attempt to minimize the loss function. Backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient. It is therefore usually considered to be a supervised learning method, although it is also used in some unsupervised networks such as autoencoders. It is a generalization of the delta rule to multilayered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. Backpropagation requires that the activation function used by the artificial neurons (or “nodes”) be differentiable. 
Backpropogation Through Time (BPTT) 
➘ “Predictive State Recurrent Neural Networks” 
Backtesting  Backtesting is jargon used in financial industries to refer to testing a trading strategy or predictive model using existing historic data. Backtesting is a special type of crossvalidation applied to time series data. 
Backwards Analysis  The idea of backwards analysis (or backward analysis) is a technique to analyze randomized algorithms by imagining as if it was running backwards in time, from output to input. Most of the more interesting applications of backward analysis are in Computational Geometry, but nevertheless, there are some other applications that are interesting and we survey some of them here. 
BadNet  Deep learningbased techniques have achieved stateoftheart performance on a wide variety of recognition and classification tasks. However, these networks are typically computationally expensive to train, requiring weeks of computation on many GPUs; as a result, many users outsource the training procedure to the cloud or rely on pretrained models that are then finetuned for a specific task. In this paper we show that outsourced training introduces new security risks: an adversary can create a maliciously trained network (a backdoored neural network, or a \emph{BadNet}) that has stateoftheart performance on the user’s training and validation samples, but behaves badly on specific attackerchosen inputs. We first explore the properties of BadNets in a toy example, by creating a backdoored handwritten digit classifier. Next, we demonstrate backdoors in a more realistic scenario by creating a U.S. street sign classifier that identifies stop signs as speed limits when a special sticker is added to the stop sign; we then show in addition that the backdoor in our US street sign detector can persist even if the network is later retrained for another task and cause a drop in accuracy of {25}\% on average when the backdoor trigger is present. These results demonstrate that backdoors in neural networks are both powerful and—because the behavior of neural networks is difficult to explicate—stealthy. This work provides motivation for further research into techniques for verifying and inspecting neural networks, just as we have developed tools for verifying and debugging software. 
Bag Of Centroids Model  https://…/Word2Vec_BagOfCentroids.py 
Bag of Little Bootstraps (BLB) 
Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. BLB is well suited to modern parallel and distributed computing architectures and furthermore retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate BLB’s favorable statistical performance via a theoretical analysis elucidating the procedure’s properties, as well as a simulation study comparing BLB to the bootstrap, the out of bootstrap, and subsampling. Introduction to Bag of Little Bootstrap 
Bag Of Words Model  The bagofwords model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words (matrix), disregarding grammar and even word order but keeping multiplicity. The bagofwords model is commonly used in methods of document classification, where the (frequency of) occurrence of each word is used as a feature for training a classifier. 
Bagging Hierarchical Clustering  Bagging (bootstrap aggregating) is usually used with supervised methods to improve their stability and accuracy. The idea is to bootstrap the sample, build a predictive model on each bootstrapped sample and then combine the results to produce for classification a vote on the predicted class and for the continuous case an average prediction. If we bootstrap sample our data and build a separate hierarchical clustering solution on each sample can we then combine the results to produce a more stable clustering solution. 
Balanced kMeans  Mesh partitioning is an indispensable tool for efficient parallel numerical simulations. Its goal is to minimize communication between the processes of a simulation while achieving load balance. Established graphbased partitioning tools yield a high solution quality; however, their scalability is limited. Geometric approaches usually scale better, but their solution quality may be unsatisfactory for `nontrivial’ mesh topologies. In this paper, we present a scalable version of $k$means that is adapted to yield balanced clusters. Balanced $k$means constitutes the core of our new partitioning algorithm Geographer. Bootstrapping of initial centers is performed with spacefilling curves, leading to fast convergence of the subsequent balanced kmeans algorithm. Our experiments with up to 16384 MPI processes on numerous benchmark meshes show the following: (i) Geographer produces partitions with a lower communication volume than stateoftheart geometric partitioners from the Zoltan package; (ii) Geographer scales well on large inputs; (iii) a Delaunay mesh with a few billion vertices and edges can be partitioned in a few seconds. 
Balancing GAN (BAGAN) 
Image classification datasets are often imbalanced, characteristic that negatively affects the accuracy of deeplearning classifiers. In this work we propose balancing GANs (BAGANs) as an augmentation tool to restore balance in imbalanced datasets. This is challenging because the few minorityclass images may not be enough to train a GAN. We overcome this issue by including during training all available images of majority and minority classes. The generative model learns useful features from majority classes and uses these to generate images for minority classes. We apply classconditioning in the latent space to drive the generation process towards a target class. Additionally, we couple GANs with autoencoding techniques to reduce the risk of collapsing toward the generation of few foolish examples. We compare the proposed methodology with stateoftheart GANs and demonstrate that BAGAN generates images of superior quality when trained with an imbalanced dataset. 
Banzhaf Random Forests (BRF) 
Random forests are a type of ensemble method which makes predictions by combining the results of several independent trees. However, the theory of random forests has long been outpaced by their application. In this paper, we propose a novel random forests algorithm based on cooperative game theory. Banzhaf power index is employed to evaluate the power of each feature by traversing possible feature coalitions. Unlike the previously used information gain rate of information theory, which simply chooses the most informative feature, the Banzhaf power index can be considered as a metric of the importance of each feature on the dependency among a group of features. More importantly, we have proved the consistency of the proposed algorithm, named Banzhaf random forests (BRF). This theoretical analysis takes a step towards narrowing the gap between the theory and practice of random forests for classification problems. Experiments on several UCI benchmark data sets show that BRF is competitive with stateoftheart classifiers and dramatically outperforms previous consistent random forests. Particularly, it is much more efficient than previous consistent random forests. 
Barista  In recent years, the importance of deep learning has significantly increased in pattern recognition, computer vision, and artificial intelligence research, as well as in industry. However, despite the existence of multiple deep learning frameworks, there is a lack of comprehensible and easytouse highlevel tools for the design, training, and testing of deep neural networks (DNNs). In this paper, we introduce Barista, an opensource graphical highlevel interface for the Caffe deep learning framework. While Caffe is one of the most popular frameworks for training DNNs, editing prototext files in order to specify the net architecture and hyper parameters can become a cumbersome and errorprone task. Instead, Barista offers a fully graphical user interface with a graphbased net topology editor and provides an endtoend training facility for DNNs, which allows researchers to focus on solving their problems without having to write code, edit text files, or manually parse logged data. 
Barnard’s Test  In statistics, Barnard’s test is an exact test used in the analysis of contingency tables. The test was first published by George Alfred Barnard (1945, 1947) who claimed this test is a more powerful alternative than Fisher’s exact test for 2×2 contingency tables. A previous barrier to the widespread use of Barnard’s test was likely the computational difficulty of calculating the pvalue; nowadays, computers can implement Barnard’s test. 
Basic Linear Algebra Subprograms (BLAS) 
The Basic Linear Algebra Subprograms (BLAS) are a specified set of lowlevel kernel subroutines that perform common linear algebra operations such as copying, vector scaling, vector dot products, linear combinations, and matrix multiplication. They were first published as a Fortran library in 1979 and are still used as a building block in higherlevel math programming languages and libraries, including LINPACK, LAPACK, MATLAB, Mathematica, NumPy and R. BLAS subroutines are a de facto standard API for linear algebra libraries and routines. Several BLAS library implementations have been tuned for specific computer architectures. Highly optimized implementations have been developed by hardware vendors such as Intel and AMD, as well as by other authors, e.g. GotoBLAS and ATLAS (a portable selfoptimizing BLAS). The LINPACK and HPL benchmarks rely heavily on DGEMM, a BLAS subroutine, for its performance measurements. 
Basic Recurrent Neural Network Model (bRNN) 
We present a model of a basic recurrent neural network (or bRNN) that includes a separate linear term with a slightly ‘stable’ fixed matrix to guarantee bounded solutions and fast dynamic response. We formulate a state space viewpoint and adapt the constrained optimization Lagrange Multiplier (CLM) technique and the vector Calculus of Variations (CoV) to derive the (stochastic) gradient descent. In this process, one avoids the commonly used reapplication of the circular chainrule and identifies the error backpropagation with the costate backward dynamic equations. We assert that this bRNN can successfully perform regression tracking of timeseries. Moreover, the ‘vanishing and exploding’ gradients are explicitly quantified and explained through the costate dynamics and the update laws. The adapted CoV framework, in addition, can correctly and principally integrate new loss functions in the network on any variable and for varied goals, e.g., for supervised learning on the outputs and unsupervised learning on the internal (hidden) states. 
Bass Diffusion Model  The Bass Model or Bass Diffusion Model was developed by Frank Bass and it consists of a simple differential equation that describes the process of how new products get adopted in a population. The model presents a rationale of how current adopters and potential adopters of a new product interact. The basic premise of the model is that adopters can be classified as innovators or as imitators and the speed and timing of adoption depends on their degree of innovativeness and the degree of imitation among adopters. The Bass model has been widely used in forecasting, especially new products’ sales forecasting and technology forecasting. Mathematically, the basic Bass diffusion is a Riccati equation with constant coefficients. Frank Bass published his paper “A new product growth for model consumer durables” in 1969 whose title indeed contained a typographical error. Prior to this, Everett Rogers published Diffusion of Innovations, a highly influential work that described the different stages of product adoption. Bass contributed some mathematical ideas to the concept. 
Batch Normalization  Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training minibatch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a stateoftheart image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batchnormalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top5 validation error (and 4.8% test error), exceeding the accuracy of human raters. 
BatchExpansion Training (BET) 
We propose BatchExpansion Training (BET), a framework for running a batch optimizer on a gradually expanding dataset. As opposed to stochastic approaches, batches do not need to be resampled i.i.d. at every iteration, thus making BET more resource efficient in a distributed setting, and when diskaccess is constrained. Moreover, BET can be easily paired with most batch optimizers, does not require any parametertuning, and compares favorably to existing stochastic and batch methods. We show that when the batch size grows exponentially with the number of outer iterations, BET achieves optimal $\tilde{O}(1/\epsilon)$ dataaccess convergence rate for strongly convex objectives. 
BatchPPO  An efficient implementation of the proximal policy optimization algorithm. ➘ “TensorFlow Agents” 
Battery Reduction  Battery reduction is used to select a subset of m variables from an original set of n variables (m < n) that reproduce a large proportion of the variance in the original set of n variables. There are a number of procedures for performing battery reduction analysis. A popular method involves performing a principal components analysis first to select m components, which account for the salient variance in the original data. GramSchmidt orthogonal rotations are then performed to determine the m variables that account for the largest proportion of variance. 
BaumWelch Algorithm  In electrical engineering, computer science, statistical computing and bioinformatics, the BaumWelch algorithm is used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forwardbackward algorithm and is named for Leonard E. Baum and Lloyd R. Welch. 
Bayes Factor  In statistics, the use of Bayes factors is a Bayesian alternative to classical hypothesis testing. Bayesian model comparison is a method of model selection based on Bayes factors. Bayes factors provide a numerical value that quantifies how well a hypothesis predicts the empirical data relative to a competing hypothesis. For example, if the BF is 4, this indicates: ‘This empirical data is 4 times more probable if H1 were true than if H0 were true.’. Hence, evidence points towards H1. A BF of 1 means that data are equally likely to be occured under both hypotheses. In this case, it would be impossible to decide between both. http://…/ashorttaxonomyofbayesfactors 
Bayes Point Machine  Kernelclassifiers comprise a powerful class of nonlinear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal realvalued output on the training sample, within the set of consistent hypotheses, the socalled version space. We suggest the Bayes point machine as a wellfounded improvement which approximates the Bayesoptimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm. It is shown how both algorithms can be extended to allow for softboundaries in order to admit training errors. Experimentally, we find that — for the zero training error case — Bayes point machines consistently outperform support vector machines on both surrogate data and realworld benchmark data sets. In the softboundary/softmargin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the realvalued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion. http://…/bayes%20point%20machine%20tutorial.aspx 
Bayes via Goodness of fit  The two key issues of modern Bayesian statistics are: (i) establishing principled approach for distilling statistical prior that is consistent with the given data from an initial believable scientific prior; and (ii) development of a Bayesfrequentist consolidated data analysis workflow that is more effective than either of the two separately. In this paper, we propose the idea of ‘Bayes via goodness of fit’ as a framework for exploring these fundamental questions, in a way that is general enough to embrace almost all of the familiar probability models. Several illustrative examples show the benefit of this new point of view as a practical data analysis tool. Relationship with other Bayesian cultures is also discussed. 
BayesDB  BayesDB is a probabilistic programming platform that enables users to query the probable implications of their data as directly as SQL databases enable them to query the data itself. The default modeling assumptions that BayesDB makes are suitable for a broad class of problems, but statisticians can customize these assumptions when necessary. BayesDB also enables domain experts that lack statistical expertise to perform qualitative model checking and encode simple forms of qualitative prior knowledge. BayesDB: A probabilistic programming system for querying the probable implications of data BayesDB 
Bayesian Additive Regression Tree (BART) 
Bayesian additive regression trees (BART) is a regression technique developed by Chipman et al. (2008). Its usefulness in standard regression settings has been clearly demonstrated, but it has not been applied to time series analysis as yet. We discuss the difficulties in applying this technique to time series analysis and demonstrate its superior predictive capabilities in the case of a well know time series: the Southern Oscillation Index. BayesTree,BART 
Bayesian Adjustment for Confounding (BAC) 

Bayesian Belief Network (BBN) 
A Bayesian belief network is a graphical representation of a probabilistic dependency model. It consists of a set of interconnected nodes, where each node represents a variable in the dependency model and the connecting arcs represent the causal relationships between these variables. Each node or variable may take one of a number of possible states or values. The belief in, or certainty of, each of these states is determined from the belief in each possible state of every node directly connected to it and its relationship with each of these nodes. The belief in each state of a node is updated whenever the belief in each state of any directly connected node changes. 
Bayesian Bootstrap  Bootstrapping can be interpreted in a Bayesian framework using a scheme that creates new datasets through reweighting the initial data. http://…parametricbootstrapasabayesianmodel http://…/1176345338 
Bayesian Bridge  Bridge regression is a regularized regression in which the regression coefficient’s prior is an exponential power distribution. BayesBridge 
Bayesian Causal Effect Estimation (BCEE) 
Estimating causal exposure effects in observational studies ideally requires the analyst to have a vast knowledge of the domain of application. Investigators often bypass difficulties related to the identification and selection of confounders through the use of fully adjusted outcome regression models. However, since such models likely contain more covariates than required, the variance of the regression coefficient for exposure may be unnecessarily large. Instead of using a fully adjusted model, model selection can be attempted. Most classical statistical model selection approaches, such as Bayesian model averaging, do not readily address causal effect estimation. We present a novel model averaged approach to causal inference, the Bayesian Causal Effect Estimation (BCEE) algorithm, which is motivated by the graphical framework for causal inference. BCEE aims to unbiasedly estimate the causal effect of a continuous exposure on a continuous outcome while being more efficient than a fully adjusted model. 
Bayesian Conditional Generative Adverserial Networks (BCGAN) 
Traditional GANs use a deterministic generator function (typically a neural network) to transform a random noise input $z$ to a sample $\mathbf{x}$ that the discriminator seeks to distinguish. We propose a new GAN called Bayesian Conditional Generative Adversarial Networks (BCGANs) that use a random generator function to transform a deterministic input $y’$ to a sample $\mathbf{x}$. Our BCGANs extend traditional GANs to a Bayesian framework, and naturally handle unsupervised learning, supervised learning, and semisupervised learning problems. Experiments show that the proposed BCGANs outperforms the stateofthearts. 
Bayesian Constrained Generalised Linear Model  See Meyer et al. (2011) <doi/10.1080/10485252.2011.597852> for more details. bcgam 
Bayesian Decision Theory  Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. It is considered the ideal case in which the probability structure underlying the categories is known perfectly. While this sort of stiuation rarely occurs in practice, it permits us to determine the optimal (Bayes) classifier against which we can compare all other classifiers. Moreover, in some problems it enables us to predict the error we will get when we generalize to novel patterns. This approach is based on quantifying the tradeoffs between various classification decisions using probability and the costs that accompany such decisions. It makes the assumption that the decision problem is posed in probabilistic terms, and that all of the relevant probability values are known. http://…/Bayesian_decision_theory http://…/bayesian.pdf 
Bayesian Estimation  
Bayesian Exponential Random Graph Models  Bergm 
Bayesian Gradient Descent  We suggest a novel approach for the estimation of the posterior distribution of the weights of a neural network, using an online version of the variational Bayes method. Having a confidence measure of the weights allows to combat several shortcomings of neural networks, such as their parameter redundancy, and their notorious vulnerability to the change of input distribution (‘catastrophic forgetting’). Specifically, We show that this approach helps alleviate the catastrophic forgetting phenomenon – even without the knowledge of when the tasks are been switched. Furthermore, it improves the robustness of the network to weight pruning – even without retraining. 
Bayesian Hierarchical Matrix Factorization (BHMF) 
BHPMF 
Bayesian Highest Posterior Density (HPD) 

Bayesian Hypernetworks  We propose Bayesian hypernetworks: a framework for approximate Bayesian inference in neural networks. A Bayesian hypernetwork, $h$, is a neural network which learns to transform a simple noise distribution, $p(\epsilon) = \mathcal{N}(0,I)$, to a distribution $q(\theta) \doteq q(h(\epsilon))$ over the parameters $\theta$ of another neural network (the ‘primary network’). We train $q$ with variational inference, using an invertible $h$ to enable efficient estimation of the variational lower bound on the posterior $p(\theta  \mathcal{D})$ via sampling. In contrast to most methods for Bayesian deep learning, Bayesian hypernets can represent a complex multimodal approximate posterior with correlations between parameters, while enabling cheap i.i.d. sampling of $q(\theta)$. We demonstrate these qualitative advantages of Bayesian hypernets, which also achieve competitive performance on a suite of tasks that demonstrate the advantage of estimating model uncertainty, including active learning and anomaly detection. 
Bayesian Information Criterion (BIC) 
In statistics, the Bayesian information criterion (BIC) or Schwarz criterion (also SBC, SBIC) is a criterion for model selection among a finite set of models. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC). When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC. The BIC was developed by Gideon E. Schwarz, who gave a Bayesian argument for adopting it. 
Bayesian Joint Matrix Decomposition (BJMD) 
Matrix decomposition is a popular and fundamental approach in machine learning and data mining. It has been successfully applied into various fields. Most matrix decomposition methods focus on decomposing a data matrix from one single source. However, it is common that data are from different sources with heterogeneous noise. A few of matrix decomposition methods have been extended for such multiview data integration and pattern discovery. While only few methods were designed to consider the heterogeneity of noise in such multiview data for data integration explicitly. To this end, we propose a joint matrix decomposition framework (BJMD), which models the heterogeneity of noise by Gaussian distribution in a Bayesian framework. We develop two algorithms to solve this model: one is a variational Bayesian inference algorithm, which makes full use of the posterior distribution; and another is a maximum a posterior algorithm, which is more scalable and can be easily paralleled. Extensive experiments on synthetic and realworld datasets demonstrate that BJMD considering the heterogeneity of noise is superior or competitive to the stateoftheart methods. 
Bayesian Kernel Projection Classifier (BKPC) 
Bayesian kernel projection classifier is a nonlinear multicategory classifier which performs the classification of the projections of the data to the principal axes of the feature space. A Gibbs sampler is implemented to find the posterior distributions of the parameters. BKPC 
Bayesian Latent Class Analysis  
Bayesian Latent Space Model (BLSM) 
BLSM 
Bayesian Linear Regression  In statistics, Bayesian linear regression is an approach to linear regression in which the statistical analysis is undertaken within the context of Bayesian inference. When the regression model has errors that have a normal distribution, and if a particular form of prior distribution is assumed, explicit results are available for the posterior probability distributions of the model’s parameters. 
Bayesian Masking  A common strategy for sparse linear regression is to introduce regularization, which eliminates irrelevant features by letting the corresponding weights be zeros. Regularization, however, often shrinks the estimator for relevant features, which leads incorrect feature selection. Motivated by the above issue, we propose Bayesian masking (BM), a sparse estimation method which imposes no regularization on the weights. The key concept of BM is to introduce binary latent variables that randomly mask features. Estimating the masking rates determines the relevances of the features automatically. We derive a variational Bayesian inference algorithm that maximizes a lower bound of the factorized information criterion (FIC), which is a recentlydeveloped asymptotic evaluation of the marginal loglikelihood. We also propose reparametrization that accelerates the convergence. We demonstrate that BM outperforms Lasso and automatic relevance determination (ARD) in terms of the sparsityshrinkage tradeoff. 
Bayesian Model Averaging  Bayesian Model Averaging is a technique designed to help account for the uncertainty inherent in the model selection process, something which traditional statistical analysis often neglects. By averaging over many different competing models, BMA incorporates model uncertainty into conclusions about parameters and prediction. BMA has been applied successfully to many statistical model classes including linear regression, generalized linear models, Cox regression models, and discrete graphical models, in all cases improving predictive performance. 
Bayesian Network (BN) 
A Bayesian network, Bayes network, belief network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. 
Bayesian Neural Network (BNN) 
This paper describes and discusses Bayesian Neural Network (BNN). The paper showcases a few different applications of them for classification and regression problems. BNNs are comprised of a Probabilistic Model and a Neural Network. The intent of such a design is to combine the strengths of Neural Networks and Stochastic modeling. Neural Networks exhibit continuous function approximator capabilities. Stochastic models allow direct specification of a model with known interaction between parameters to generate data. During the prediction phase, stochastic models generate a complete posterior distribution and produce probabilistic guarantees on the predictions. Thus BNNs are a unique combination of neural network and stochastic models with the stochastic model forming the core of this integration. BNNs can then produce probabilistic guarantees on it’s predictions and also generate the distribution of parameters that it has learnt from the observations. That means, in the parameter space, one can deduce the nature and shape of the neural network’s learnt parameters. These two characteristics makes them highly attractive to theoreticians as well as practitioners. Recently there has been a lot of activity in this area, with the advent of numerous probabilistic programming libraries such as: PyMC3, Edward, Stan etc. Further this area is rapidly gaining ground as a standard machine learning approach for numerous problems 
Bayesian Nonparametric Model  A Bayesian nonparametric model is a Bayesian model on an infinitedimensional parameter space. The parameter space is typically chosen as the set of all possible solutions for a given learning problem. For example, in a regression problem the parameter space can be the set of continuous functions, and in a density estimation problem the space can consist of all densities. A Bayesian nonparametric model uses only a finite subset of the available parameter dimensions to explain a finite sample of observations, with the set of dimensions chosen depending on the sample, such that the effective complexity of the model (as measured by the number of dimensions used) adapts to the data. Classical adaptive problems, such as nonparametric estimation and model selection, can thus be formulated as Bayesian inference problems. Popular examples of Bayesian nonparametric models include Gaussian process regression, in which the correlation structure is refined with growing sample size, and Dirichlet process mixture models for clustering, which adapt the number of clusters to the complexity of the data. Bayesian nonparametric models have recently been applied to a variety of machine learning problems, including regression, classification, clustering, latent variable modeling, sequential modeling, image segmentation, source separation and grammar induction. 
Bayesian Nonparametric Principal Component Analysis (BNPPCA) 
Principal component analysis (PCA) is very popular to perform dimension reduction. The selection of the number of significant components is essential but often based on some practical heuristics depending on the application. Only few works have proposed a probabilistic approach able to infer the number of significant components. To this purpose, this paper introduces a Bayesian nonparametric principal component analysis (BNPPCA). The proposed model projects observations onto a random orthogonal basis which is assigned a prior distribution defined on the Stiefel manifold. The prior on factor scores involves an Indian buffet process to model the uncertainty related to the number of components. The parameters of interest as well as the nuisance parameters are finally inferred within a fully Bayesian framework via Monte Carlo sampling. A study of the (in)consistence of the marginal maximum a posteriori estimator of the latent dimension is carried out. A new estimator of the subspace dimension is proposed. Moreover, for sake of statistical significance, a KolmogorovSmirnov test based on the posterior distribution of the principal components is used to refine this estimate. The behaviour of the algorithm is first studied on various synthetic examples. Finally, the proposed BNP dimension reduction approach is shown to be easily yet efficiently coupled with clustering or latent factor models within a unique framework. 
Bayesian PassiveAggressive Learning (BayesPA) 
Online PassiveAggressive (PA) learning is an effective framework for performing maxmargin online learning. But the deterministic formulation and estimated single largemargin model could limit its capability in discovering descriptive structures underlying complex data. This paper presents online Bayesian PassiveAggressive (BayesPA) learning, which subsumes the online PA and extends naturally to incorporate latent variables and perform nonparametric Bayesian inference, thus providing great flexibility for explorative analysis. We apply BayesPA to topic modeling and derive efficient online learning algorithms for maxmargin topic models. We further develop nonparametric methods to resolve the number of topics. Experimental results on real datasets show that our approaches significantly improve time efficiency while maintaining comparable results with the batch counterparts. 
Bayesian Robust Principal Component Regression  Principal component regression is a linear regression model with principal components as regressors. This type of modelling is particularly useful for prediction in settings with highdimensional covariates. Surprisingly, the existing literature treating of Bayesian approaches is relatively sparse. In this paper, we aim at filling some gaps through the following practical contribution: we introduce a Bayesian approach with detailed guidelines for a straightforward implementation. The approach features two characteristics that we believe are important. First, it effectively involves the relevant principal components in the prediction process. This is achieved in two steps. The first one is model selection; the second one is to average out the predictions obtained from the selected models according to model averaging mechanisms, allowing to account for model uncertainty. The model posterior probabilities are required for model selection and model averaging. For this purpose, we include a procedure leading to an efficient reversible jump algorithm. The second characteristic of our approach is whole robustness, meaning that the impact of outliers on inference gradually vanishes as they approach plus or minus infinity. The conclusions obtained are consequently consistent with the majority of observations (the bulk of the data). 
Bayesian Tensor Factorization Linked to External Data (BaTFLED) 
The vast majority of current machine learning algorithms are designed to predict single responses or a vector of responses, yet many types of response are more naturally organized as matrices or higherorder tensor objects where characteristics are shared across modes. We present a new machine learning algorithm BaTFLED (Bayesian Tensor Factorization Linked to External Data) that predicts values in a threedimensional response tensor using input features for each of the dimensions. BaTFLED uses a probabilistic Bayesian framework to learn projection matrices mapping input features for each mode into latent representations that multiply to form the response tensor. By utilizing a Tucker decomposition, the model can capture weights for interactions between latent factors for each mode in a small core tensor. Priors that encourage sparsity in the projection matrices and core tensor allow for feature selection and model regularization. This method is shown to far outperform elastic net and neural net models on ‘cold start’ tasks from data simulated in a threemode structure. Additionally, we apply the model to predict doseresponse curves in a panel of breast cancer cell lines treated with drug compounds that was used as a Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge. 
Bayesian YingYang Learning Algorithm (BYY) 
YingYang learning considers a learning system featured with two pathways between the external observation domain X and its inner representation domain R. The domain R and the pathway R→X is modeled by one subsystem system, while the domain X and the pathway X→R is modeled by another subsystem. From the view of the ancient YingYang philosophy, the former is called Ying and the latter is called Yang, and the two coordinately form a YingYang system, with the structure of Ying subject to a principle of compactness (least complexity) and the structure of Yang subject to a principle of proper vitality (matched dynamic range) with respect to the Ying. Moreover, all the rest unknowns in the YingYang system are learned under the guidance of a YingYang best harmony principle. http://…/indexpubbyy.html http://…/chaptersXu03a.pdf 
Behavior Tree  A Behavior Tree (BT) is a way to structure the switching between different tasks in an autonomous agent, such as a robot or a virtual entity in a computer game. BTs are a very efficient way of creating complex systems that are both modular and reactive. These properties are crucial in many applications, which has led to the spread of BT from computer game programming to many branches of AI and Robotics. 
Behavioral Analytics  Behavioral Analytics is a subset of business analytics that focuses on how and why users of eCommerce platforms, online games, & web applications behave. While business analytics has a more broad focus on the who, what, where and when of business intelligence, behavioral analytics narrows that scope, allowing one to take seemingly unrelated data points in order to extrapolate, predict and determine errors and future trends. It takes a more holistic and human view of data, connecting individual data points to tell us not only what is happening, but also how and why it is happening. Behavioral analytics utilizes user data captured while the web application, game, or website is in use by analytic platforms like Google Analytics. Platform traffic data like navigation paths, clicks, social media interactions, purchasing decisions and marketing responsiveness is all recorded. Also, other more specific advertising metrics like clicktoconversion time, and comparisons between other metrics like the monetary value of an order and the amount of time spent on the site. These data points are then compiled and analyzed, whether by looking at the timeline progression from when a user first entered the platform until a sale was made, or what other products a user bought or looked at before this purchase. Behavioral analysis allows future actions and trends to be predicted based on all the data collected. 
Behavioral Change Point Analysis (BCPA) 
The Behavioral Change Point Analysis (BCPA) is a method of identifying hidden shifts in the underlying parameters of a time series, developed specifically to be applied to animal movement data which is irregularly sampled. 
BELIEF  With the advent of Big Data era, data reduction methods are highly demanded given its ability to simplify huge data, and ease complex learning processes. Concretely, algorithms that are able to filter relevant dimensions from a set of millions are of huge importance. Although effective, these techniques suffer from the ‘scalability’ curse as well. In this work, we propose a distributed feature weighting algorithm, which is able to rank millions of features in parallel using large samples. This method, inspired by the wellknown RELIEF algorithm, introduces a novel redundancy elimination measure that provides similar schemes to those based on entropy at a much lower cost. It also allows smooth scale up when more instances are demanded in feature estimations. Empirical tests performed on our method show its estimation ability in manifold huge sets –both in number of features and instances–, as well as its simplified runtime cost (specially, at the redundancy detection step). 
Belief Functions  The theory of belief functions provides a nonBayesian way of using mathematical probability to quantify subjective judgements. Whereas a Bayesian assesses probabilities directly for the answer to a question of interest, a belieffunction user assesses probabilities for related questions and then considers the implications of these probabilities for the question of interest. 
Belief Network  see “Bayesian Network” or “Bayes Net” 
Belief Propagation  Belief propagation, also known as sumproduct message passing is a message passing algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. It calculates the marginal distribution for each unobserved node, conditional on any observed nodes. Belief propagation is commonly used in artificial intelligence and information theory and has demonstrated empirical success in numerous applications including lowdensity paritycheck codes, turbo codes, free energy approximation, and satisfiability. The algorithm was first proposed by Judea Pearl in 1982, who formulated this algorithm on trees, and was later extended to polytrees. It has since been shown to be a useful approximate algorithm on general graphs. 
Bellman Equation  A Bellman equation, named after its discoverer, Richard Bellman, also known as a dynamic programming equation, is a necessary condition for optimality associated with the mathematical optimization method known as dynamic programming. It writes the value of a decision problem at a certain point in time in terms of the payoff from some initial choices and the value of the remaining decision problem that results from those initial choices. This breaks a dynamic optimization problem into simpler subproblems, as Bellman’s Principle of Optimality prescribes. The Bellman equation was first applied to engineering control theory and to other topics in applied mathematics, and subsequently became an important tool in economic theory. Almost any problem which can be solved using optimal control theory can also be solved by analyzing the appropriate Bellman equation. However, the term ‘Bellman equation’ usually refers to the dynamic programming equation associated with discretetime optimization problems. In continuoustime optimization problems, the analogous equation is a partial differential equation which is usually called the HamiltonJacobiBellman equation. 
Benford’s Law  Benford’s Law, also called the FirstDigit Law, refers to the frequency distribution of digits in many (but not all) reallife sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time. Benford’s Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution. 
BentCable Regression  We use the socalled bentcable model to describe natural phenomena which exhibit a potentially sharp change in slope. The model comprises two linear segments, joined smoothly by a quadratic bend. The class of bent cables includes, as a limiting case, the popular piecewiselinear model (with a sharp kink), otherwise known as the broken stick. Associated with bentcable regression is the estimation of the bendwidth parameter, through which the abruptness of the underlying transition may be assessed. 
Berkeley Data Analytics Stack (BDAS) 
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the Berkeley AMPLab to make sense of Big Data. 
Bessels Correction  In statistics, Bessel’s correction, named after Friedrich Bessel, is the use of n – 1 instead of n in the formula for the sample variance and sample standard deviation, where n is the number of observations in a sample. This corrects the bias in the estimation of the population variance, and some (but not all) of the bias in the estimation of the population standard deviation, but often increases the mean squared error in these estimations. 
Best Friends For Ever (BFF) 
Graphs form a natural model for relationships and interactions between entities, for example, between people in social and cooperation networks, servers in computer networks,or tags and words in documents and tweets. But, which of these relationships or interactions are the most lasting ones? In this paper, given a set of graph snapshots, which may correspond to the state of a dynamic graph at different time instances, we look at the problem of identifying the set of nodes that are the most densely connected at all snapshots. We call this problem the Best Friends For Ever (Bff) problem. We provide definitions for density over multiple graph snapshots, that capture different semantics of connectedness over time, and we study the corresponding variants of the Bff problem. We then look at the OnOff Bff (O2Bff) problem that relaxes the requirement of nodes being connected in all snapshots, and asks for the densest set of nodes in at least k of a given set of graph snapshots. We show that this problem is NPcomplete for all definitions of density, and we propose a set of efficient algorithms. Finally, we present experiments with synthetic and real datasets that show both the efficiency of our algorithms and the usefulness of the Bff and the O2Bff problems. 
Best Linear Adaptive Enhancement (BLADE) 
The Rapid and Accurate Image Super Resolution (RAISR) method of Romano, Isidoro, and Milanfar is a computationally efficient image upscaling method using a trained set of filters. We describe a generalization of RAISR, which we name Best Linear Adaptive Enhancement (BLADE). This approach is a trainable edgeadaptive filtering framework that is general, simple, computationally efficient, and useful for a wide range of image processing problems. We show applications to denoising, compression artifact removal, demosaicing, and approximation of anisotropic diffusion equations. 
Best Linear Unbiased Estimator (BLUE) 
In statistics, the GaussMarkov theorem, named after Carl Friedrich Gauss and Andrey Markov, states that in a linear regression model in which the errors have expectation zero and are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator. Here ‘best’ means giving the lowest variance of the estimate, as compared to other unbiased, linear estimators. The errors don’t need to be normal, nor do they need to be independent and identically distributed (only uncorrelated and homoscedastic). The hypothesis that the estimator be unbiased cannot be dropped, since otherwise estimators better than OLS exist. See for examples the JamesStein estimator (which also drops linearity) or ridge regression. 
Best Linear Unbiased Prediction (BLUP) 
In statistics, best linear unbiased prediction (BLUP) is used in linear mixed models for the estimation of random effects. BLUP was derived by Charles Roy Henderson in 1950 but the term ‘best linear unbiased predictor’ (or ‘prediction’) seems not to have been used until 1962. ‘Best linear unbiased predictions’ (BLUPs) of random effects are similar to best linear unbiased estimates (BLUEs) () of fixed effects. The distinction arises because it is conventional to talk about estimating fixed effects but predicting random effects, but the two terms are otherwise equivalent. (This is a bit strange since the random effects have already been ‘realized’ − they already exist. The use of the term ‘prediction’ may be because in the field of animal breeding in which Henderson worked, the random effects were usually genetic merit, which could be used to predict the quality of offspring. However, the equations for the ‘fixed’ effects and for the random effects are different. In practice, it is often the case that the parameters associated with the random effect(s) term(s) are unknown; these parameters are the variances of the random effects and residuals. Typically the parameters are estimated and plugged into the predictor, leading to the Empirical Best Linear Unbiased Predictor (EBLUP). Notice that by simply plugging in the estimated parameter into the predictor, additional variability is unaccounted for, leading to overly optimistic prediction variances for the EBLUP. Best linear unbiased predictions are similar to empirical Bayes estimates of random effects in linear mixed models, except that in the latter case, where weights depend on unknown values of components of variance, these unknown variances are replaced by samplebased estimates. 
Best Subsets  Best Subsets Regression is a method used to help determine which predictor (independent) variables should be included in a multiple regression model. This method involves examining all of the models created from all possible combination of predictor variables. Best Subsets Regression uses R2 to check for the best model. It would not be fun or fast to compute this method without the use of a statistical software program. First, all models that have only one predictor variable included are checked and the two models with the highest R2 are selected. Then all models that have only two predictor variables included are checked and the two models with the highest R2 are chosen, again. This process continues until all combinations of all predictors variables have been taken into account. Fast Best Subset Selection: Coordinate Descent and Local Combinatorial Optimization Algorithms 
BestWorst Scaling  BestWorst Scaling (BWS) can be a method of data collection, and/or a theory of how respondents provide top and bottom ranked items from a list. BWS is increasingly used to obtain more choice data from individuals and/or to understand choice processes. The three casesof BWS are described, together with the intuition behind the models that are applied in each case. A summary of the main theoretical results is provided, including an exposition of the possible theoretical relationships between estimates from the di¤erent cases, and of the theoretical properties of best minus worst scores.BWS data can be analysed using relatively simple extensions to maximumlikelihood based methods used in discrete choice experiments. These are summarised, before the bene ts of simple functions of the best and 
Beta Regression  How should we one perform a regression analysis in which the dependent variable is restricted to the standard unit interval such as rates and proportions? Ferrari and CribariNeto, 2004 proposed a regression model for continuous variates that assume values in the standard unit interval, e.g., rates, proportions, or concentrations indices. The model is based on the assumption that the response is betadistributed, they called their model the beta regression model. The regression parameters are interpretable in terms of the mean of y (the variable of interest) and the model is naturally heteroskedastic and easily accommodates asymmetries. A variant of the beta regression model that allows for nonlinearities and variable dispersion was proposed by Simas et al., 2010. zoib: An R package for Bayesian Inference for Beta Regression and Zero/One Inflated Beta Regression A Short Course in Beta Regression Models 
Bhattacharyya Distance  In statistics, the Bhattacharyya distance measures the similarity of two discrete or continuous probability distributions. It is closely related to the Bhattacharyya coefficient which is a measure of the amount of overlap between two statistical samples or populations. Both measures are named after Anil Kumar Bhattacharya, a statistician who worked in the 1930s at the Indian Statistical Institute. The coefficient can be used to determine the relative closeness of the two samples being considered. It is used to measure the separability of classes in classification and it is considered to be more reliable than the Mahalanobis distance, as the Mahalanobis distance is a particular case of the Bhattacharyya distance when the standard deviations of the two classes are the same. Therefore, when two classes have similar means but different standard deviations, the Mahalanobis distance would tend to zero, however, the Bhattacharyya distance would grow depending on the difference between the standard deviations. 
Bias of an Estimator / Unbiasedness  In statistics, the bias (or bias function) of an estimator is the difference between this estimator’s expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased. In statistics, “bias” is an objective statement about a function, and while not a desired property, it is not pejorative, unlike the ordinary English use of the term “bias”. 
Bias/Variance Tradeoff  In machine learning, the biasvariance dilemma or biasvariance tradeoff is the problem of simultaneously minimizing the bias (how accurate a model is across different training sets) and variance of the model error (how sensitive the model is to small changes in training set). This tradeoff applies to all forms of supervised learning: classification, function fitting, and structured output learning. It has also been invoked to explain the effectiveness of heuristics in human learning. 
BiasCompensated Normalized Maximum Correntropy Criterion (BCNMCC) 
This paper proposed a biascompensated normalized maximum correntropy criterion (BCNMCC) algorithm charactered by its low steadystate misalignment for system identification with noisy input in an impulsive output noise environment. The normalized maximum correntropy criterion (NMCC) is derived from a correntropy based cost function, which is rather robust with respect to impulsive noises. To deal with the noisy input, we introduce a biascompensated vector (BCV) to the NMCC algorithm, and then an unbiasedness criterion and some reasonable assumptions are used to compute the BCV. Taking advantage of the BCV, the bias caused by the input noise can be effectively suppressed. System identification simulation results demonstrate that the proposed BCNMCC algorithm can outperform other related algorithms with noisy input especially in an impulsive output noise environment. 
BiBit  Binary datasets represent a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, different biclustering algorithms have been specially developed to be applied to binary datasets. Several approaches based on matrix factorization, suffix trees or divideandconquer techniques have been proposed to extract useful biclusters from binary data, and these approaches provide information about the distribution of patterns and intrinsic correlations. A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. The results obtained from different experiments with synthetic data reveal the excellent performance and the robustness of BiBit to density and size of input data. Also, BiBit is applied to a central nervous system embryonic tumor gene expression dataset to test the quality of the results. A novel gene expression preprocessing methodology, based on expression level layers, and the selective search performed by BiBit, based on a very fast bitpattern processing technique, provide very satisfactory results in quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results. ➘ “Biclustering” BiBitR 
Biclustering  Biclustering, coclustering, or twomode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. 
Bidirectional Conditional Generative Adversarial Network  Conditional variants of Generative Adversarial Networks (GANs), known as cGANs, are generative models that can produce data samples ($x$) conditioned on both latent variables ($z$) and known auxiliary information ($c$). Another GAN variant, Bidirectional GAN (BiGAN) is a recently developed framework for learning the inverse mapping from $x$ to $z$ through an encoder trained simultaneously with the generator and the discriminator of an unconditional GAN. We propose the Bidirectional Conditional GAN (BCGAN), which combines cGANs and BiGANs into a single framework with an encoder that learns inverse mappings from $x$ to both $z$ and $c$, trained simultaneously with the conditional generator and discriminator in an endtoend setting. We present crucial techniques for training BCGANs, which incorporate an extrinsic factor loss along with an associated dynamicallytuned importance weight. As compared to other encoderbased GANs, BCGANs not only encode $c$ more accurately but also utilize $z$ and $c$ more effectively and in a more disentangled way to generate data samples. 
Bidirectional Deep Echo State Network  In this work we propose a deep architecture for the classification of multivariate time series. By means of a recurrent and untrained reservoir we generate a vectorial representation that embeds the temporal relationships in the data. To overcome the limitations of the reservoir vanishing memory, we introduce a bidirectional reservoir, whose last state captures also the past dependencies in the input. We apply dimensionality reduction to the final reservoir states to obtain compressed fixed size representations of the time series. These are subsequently fed into a deep feedforward network, which is trained to perform the final classification. We test our architecture on benchmark datasets and on a realworld usecase of blood samples classification. Results show that our method performs better than a standard echo state network, and it can be trained much faster than a fullytrained recurrent network. 
Bidirectional Learning (BL) 
Bidirectional Learning for Robust Neural Networks 
BiDirectional Long Short Term Memory Network (BLSTM) 
Most existing methods for biomedical entity recognition task rely on explicit feature engineering where many features either are specific to a particular task or depends on output of other existing NLP tools. Neural architectures have been shown across various domains that efforts for explicit feature design can be reduced. In this work we propose an unified framework using bidirectional long short term memory network (BLSTM) for named entity recognition (NER) tasks in biomedical and clinical domains. Three important characteristics of the framework are as follows – (1) model learns contextual as well as morphological features using two different BLSTM in hierarchy, (2) model uses first order linear conditional random field (CRF) in its output layer in cascade of BLSTM to infer label or tag sequence, (3) model does not use any domain specific features or dictionary, i.e., in another words, same set of features are used in the three NER tasks, namely, disease name recognition (Disease NER), drug name recognition (Drug NER) and clinical entity recognition (Clinical NER). We compare performance of the proposed model with existing stateoftheart models on the standard benchmark datasets of the three tasks. We show empirically that the proposed framework outperforms all existing models. Further our analysis of CRF layer and wordembedding obtained using character based embedding show their importance. 
Bidirectional LSTM  Recurrent neural networks like long shortterm memory (LSTM) are important architectures for sequential prediction tasks. LSTMs (and RNNs in general) model sequences along the forward time direction. Bidirectional LSTMs (BiLSTMs) on the other hand model sequences along both forward and backward directions and are generally known to perform better at such tasks because they capture a richer representation of the data. In the training of BiLSTMs, the forward and backward paths are learned independently. 
Bidirectional Recurrent Imputation for Time Series (BRITS) 
Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contains many missing values. Given multiple correlated time series data, how to fill in missing values and to predict their class labels Existing imputation methods often impose strong assumptions of the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of RNN graph and can be effectively updated during the backpropagation.BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with nonlinear dynamics underlying; (c) it provides a datadriven imputation procedure and applies to general settings with missing data.We evaluate our model on three realworld datasets, including an air quality dataset, a healthcare data, and a localization data for human activity. Experiments show that our model outperforms the stateoftheart methods in both imputation and classification/regression accuracies. 
Big Data  “Big Data” is the term for a collection of data sets so large and complex that it becomes difficult to process using onhand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. 
Big Data Analytics (BDA) 
Big data analytics refers to the process of collecting, organizing and analyzing large sets of data (“big data”) to discover patterns and other useful information. Not only will big data analytics help you to understand the information contained within the data, but it will also help identify the data that is most important to the business and future business decisions. Big data analysts basically want the knowledge that comes from analyzing the data. 
Big Data Discovery  Big Data Discovery = {Big Data, Data Discovery, Data Science} 
Big Data Integration Ontology  Big Data architectures allow to flexibly store and process heterogeneous data, from multiple sources, in their original format. The structure of those data, commonly supplied by means of REST APIs, is continuously evolving. Thus data analysts need to adapt their analytical processes after each API release. This gets more challenging when performing an integrated or historical analysis. To cope with such complexity, in this paper, we present the Big Data Integration ontology, the core construct to govern the data integration process under schema evolution by systematically annotating it with information regarding the schema of the sources. We present a query rewriting algorithm that, using the annotated ontology, converts queries posed over the ontology to queries over the sources. To cope with syntactic evolution in the sources, we present an algorithm that semiautomatically adapts the ontology upon new releases. This guarantees ontologymediated queries to correctly retrieve data from the most recent schema version as well as correctness in historical queries. A functional and performance evaluation on realworld APIs is performed to validate our approach. 
Big Data Management (BDM) 
Big Data Management (BDM), an amalgam of old and new best practices, skills, teams, data types, and homegrown or vendorbuilt functionality. All of these are expanding and realigning so that businesses can fully leverage big data, not merely manage it. At the same time, big data must eventually find a permanent place in enterprise data management. BDM is well worth doing because managing big data leads to a number of benefits. According to this report’s survey, the business and technology tasks that improve most are analytic insights, the completeness of analytic data sets, business value drawn from big data, and all sales and marketing activities. BDM also has challenges, and common barriers include low organizational maturity relative to big data, weak business support, and the need to learn new technology approaches. 
Big O Notation  In mathematics, big O notation describes the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, BachmannLandau notation (after Edmund Landau and Paul Bachmann), or asymptotic notation. In computer science, big O notation is used to classify algorithms by how they respond (e.g., in their processing time or working space requirements) to changes in input size. In analytic number theory, it is used to estimate the ‘error committed’ while replacing the asymptotic size, or asymptotic mean size, of an arithmetical function, by the value, or mean value, it takes at a large finite argument. A famous example is the problem of estimating the remainder term in the prime number theorem. Big O notation characterizes functions according to their growth rates: different functions with the same growth rate may be represented using the same O notation. The letter O is used because the growth rate of a function is also referred to as order of the function. A description of a function in terms of big O notation usually only provides an upper bound on the growth rate of the function. Associated with big O notation are several related notations, using the symbols o, Ω, ω, and Θ, to describe other kinds of bounds on asymptotic growth rates. Big O notation is also used in many other fields to provide similar estimates. 
Big Workflow  Big Workflow is an industry term coined by Adaptive Computing that accelerates insights by more efficiently processing intense simulations and big data analysis. Adaptive Computing’s Big Workflow solution derives it name from its ability to solve big data challenges by streamlining the workflow to deliver valuable insights from massive quantities of data across multiple platforms, environments and locations. 
BigDataBench  As architecture, system, data management, and machine learning communities pay greater attention to innovative big data and datadriven artificial intelligence (in short, AI) algorithms, architecture, and systems, the pressure of benchmarking rises. However, complexity, diversity, frequently changed workloads, and rapid evolution of big data, especially AI systems raise great challenges in benchmarking. First, for the sake of conciseness, benchmarking scalability, portability cost, reproducibility, and better interpretation of performance data, we need understand what are the abstractions of frequentlyappearing units of computation, which we call dwarfs, among big data and AI workloads. Second, for the sake of fairness, the benchmarks must include diversity of data and workloads. Third, for codesign of software and hardware, the benchmarks should be consistent across different communities. Other than creating a new benchmark or proxy for every possible workload, we propose using dwarfbased benchmarks–the combination of eight dwarfs–to represent diversity of big data and AI workloads. The current version–BigDataBench 4.0 provides 13 representative realworld data sets and 47 big data and AI benchmarks, including seven workload types: online service, offline analytics, graph analytics, AI, data warehouse, NoSQL, and streaming. BigDataBench 4.0 is publicly available from http://…/BigDataBench. Also, for the first time, we comprehensively characterize the benchmarks of seven workload types in BigDataBench 4.0 in addition to traditional benchmarks like SPECCPU, PARSEC and HPCC in a hierarchical manner and drill down on five levels, using the TopDown analysis from an architecture perspective. 
BigDL  In this paper, we present BigDL, a distributed deep learning framework for Big Data platforms and workflows. It is implemented on top of Apache Spark, and allows users to write their deep learning applications as standard Spark programs (running directly on largescale big data clusters in a distributed fashion). It provides an expressive, ‘dataanalytics integrated’ deep learning programming model, so that users can easily build the endtoend analytics + AI pipelines under a unified programming paradigm; by implementing an AllReduce like operation using existing primitives in Spark (e.g., shuffle, broadcast, and inmemory data persistence), it also provides a highly efficient ‘parameter server’ style architecture, so as to achieve highly scalable, dataparallel distributed training. Since its initial open source release, BigDL users have built many analytics and deep learning applications (e.g., object detection, sequencetosequence generation, neural recommendations, fraud detection, etc.) on Spark. 
BigQuery  Querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. Google BigQuery solves this problem by enabling superfast, SQLlike queries against appendonly tables, using the processing power of Google’s infrastructure. Simply move your data into BigQuery and let us handle the hard work. You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data. You can access BigQuery by using a web UI or a commandline tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, PHP or Python. There are also a variety of thirdparty tools that you can use to interact with BigQuery, such as visualizing the data or loading the data. Get started now with creating an app, running a web query or using the commandline tool, or read on for more information about BigQuery fundamentals and how you can work with the product. BigQuery Big Data Visualization With D3.js 
Bilinear Attention Network (BAN) 
Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost to learn attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, coattention builds two separate attention distributions for each modality neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given visionlanguage information seamlessly. BAN considers bilinear interactions among two groups of input channels, while lowrank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit eightattention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new stateofthearts on both datasets. 
Binarized Back Propagation  
Binarized Deep Neural Network (BDNN) 
In this work we introduce a binarized deep neural network (BDNN) model. BDNNs are trained using a novel binarized back propagation algorithm (BBP), which uses binary weights and binary neurons during the forward and backward propagation, while retaining precision of the stored weights in which gradients are accumulated. At test phase, BDNNs are fully binarized and can be implemented in hardware with low circuit complexity. The proposed binarized networks can be implemented using binary convolutions and proxy matrix multiplications with only standard binary XNOR and population count (popcount) operations. BBP is expected to reduce energy consumption by at least two orders of magnitude when compared to the hardware implementation of existing training algorithms. We obtained near stateoftheart results with BDNNs on the permutationinvariant MNIST, CIFAR10 and SVHN datasets. 
Binary Weight and Hadamardtransformed Image Network (BWHIN) 
Deep learning has made significant improvements at many image processing tasks in recent years, such as image classification, object recognition and object detection. Convolutional neural networks (CNN), which is a popular deep learning architecture designed to process data in multiple array form, show great success to almost all detection \& recognition problems and computer vision tasks. However, the number of parameters in a CNN is too high such that the computers require more energy and larger memory size. In order to solve this problem, we propose a novel energy efficient model Binary Weight and Hadamardtransformed Image Network (BWHIN), which is a combination of Binary Weight Network (BWN) and Hadamardtransformed Image Network (HIN). It is observed that energy efficiency is achieved with a slight sacrifice at classification accuracy. Among all energy efficient networks, our novel ensemble model outperforms other energy efficient models. 
BinaryNet  We introduce BinaryNet, a method which trains DNNs with binary weights and activations when computing parameters’ gradient. We show that it is possible to train a Multi Layer Perceptron (MLP) on MNIST and ConvNets on CIFAR10 and SVHN with BinaryNet and achieve nearly stateoftheart results. At runtime, BinaryNet drastically reduces memory usage and replaces most multiplications by 1bit exclusivenotor (XNOR) operations, which might have a big impact on both generalpurpose and dedicated Deep Learning hardware. We wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST MLP 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for BinaryNet is available. 
Binning  Data binning is a data preprocessing technique used to reduce the effects of minor observation errors. The original data values which fall in a given small interval, a bin, are replaced by a value representative of that interval, often the central value. It is a form of quantization. Binning is the term used in scoring modeling for what is also known in Machine Learning as Discretization, the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by the this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard. http://…optimalbinningforscoringmodeling.html 
Binomial Options Pricing Model (BOPM) 
In finance, the binomial options pricing model (BOPM) provides a generalizable numerical method for the valuation of options. The binomial model was first proposed by Cox, Ross and Rubinstein in 1979. Essentially, the model uses a ‘discretetime’ (lattice based) model of the varying price over time of the underlying financial instrument. In general, binomial options pricing models do not have closedform solutions. (Binomial Tree Option Model) 
Bio7  The application Bio7 is an integrated development environment for ecological modelling and contains powerful tools for model creation, scientific image analysis and statistical analysis. The application itself is based on an RCPEclipseEnvironment (RichClientPlatform) which offers a huge flexibility in configuration and extensibility because of its plugin structure and the possibility of customization. Features: • Creation and analysis of simulation models. • Statistical analysis. • Advanced R Graphical User Interface with editor, spreadsheet, ImageJ plot device and debugging interface. • Spatial statistics (possibility to send values from a specialized panel to R). • Image Analysis (embedded ImageJ). • Fast transfer of image data from ImageJ to R and vice versa. • Fast communication between R and Java (with RServe) and the possibilty to use R methods inside Java. • Interpretation of Java and script creation (BeanShell, Groovy, Jython). • Dynamic compilation of Java. • Creation of methods for Java, BeanShell, Groovy, Jython and R (integrated editors for Java, R, BeanShell, Groovy, Jython). • Sensitivity analysis with an embedded flowchart editor in which scripts, macros and compiled code can be dragged and executed. • Creation of 3d OpenGL (Jogl) models. • Visualizations and simulations on an embedded 3d globe (World Wind Java SDK). • Creation of Graphical User Interfaces with the embedded JavaFX SceneBuilder. 
BioWorkbench  Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of largescale bioinformatics experiments. Because these experiments are computation and dataintensive, they require highperformance computing (HPC) techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems (SWfMS) and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, by using queries to a provenance and annotation database. Some of these queries are available as a prebuilt feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves highperformance, reducing up to 98% of the case studies execution time. We also show how the application of machine learning techniques can enrich the analysis process. 
Bipartite Graph  In the mathematical field of graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint sets U and V (that is, U and V are each independent sets) such that every edge connects a vertex in U to one in V. Equivalently, a bipartite graph is a graph that does not contain any oddlength cycles. 
Biplot  Biplots are a type of exploratory graph used in statistics, a generalization of the simple twovariable scatterplot. A biplot allows information on both samples and variables of a data matrix to be displayed graphically. Samples are displayed as points while variables are displayed either as vectors, linear axes or nonlinear trajectories. In the case of categorical variables, category level points may be used to represent the levels of a categorical variable. A generalised biplot displays information on both continuous and categorical variables. 
BiSeg  We present a simple and effective framework for simultaneous semantic segmentation and instance segmentation with Fully Convolutional Networks (FCNs). The method, called BiSeg, predicts instance segmentation as a posterior in Bayesian inference, where semantic segmentation is used as a prior. We extend the idea of positionsensitive score maps used in recent methods to a fusion of multiple score maps at different scales and partition modes, and adopt it as a robust likelihood for instance segmentation inference. As both Bayesian inference and map fusion are performed per pixel, BiSeg is a fully convolutional endtoend solution that inherits all the advantages of FCNs. We demonstrate stateoftheart instance segmentation accuracy on PASCAL VOC. 
Bit Stream Computing  In this study, we propose a novel computing paradigm ‘Bit Stream Computing’ that is constructed on the logic used in stochastic computing, but does not necessarily employ randomly or Binomially distributed bit streams as stochastic computing does. Any type of streams can be used either stochastic or deterministic. The proposed paradigm benefits from the area advantage of stochastic logic and the accuracy advantage of conventional binary logic. We implement accurate arithmetic multiplier and adder circuits, classified as asynchronous or synchronous; we also consider their suitability of processing successive streams. The proposed circuits are simulated both in gate level and in transistor level with AMS 0.35um CMOS technology to show the circuits’ potential for practical use. We thoroughly compare the proposed adders and multipliers with their predecessors in the literature, individually and in a neural network application. Comparisons made in terms of area and accuracy clearly favor the proposed designs. We believe that this study opens up new horizons for computing that enables us to implement much smaller yet accurate arithmetic circuits compared to the conventional binary and stochastic ones. 
Bitcoin  Bitcoin is a payment system invented by Satoshi Nakamoto in 2008 and introduced as opensource software in 2009. The system is peertopeer; all nodes verify transactions in a public distributed ledger called the block chain. The ledger uses its own unit of account, also called bitcoin. The system works without a central repository or single administrator, which has led the US Treasury to categorize it as a decentralized virtual currency. While bitcoin is not the first virtual currency, it is the first decentralized digital currency and cryptocurrency. It is the largest of its kind in terms of total market value. 
BitRegularized Deep Neural Network (BitNet) 
We present a novel regularization scheme for training deep neural networks. The parameters of neural networks are usually unconstrained and have a dynamic range dispersed over the real line. Our key idea is to control the expressive power of the network by dynamically quantizing the range and set of values that the parameters can take. We formulate this idea using a novel endtoend approach that regularizes the traditional classification loss function. Our regularizer is inspired by the Minimum Description Length principle. For each layer of the network, our approach optimizes a translation and scaling factor along with integervalued parameters. We empirically compare BitNet to an equivalent unregularized model on the MNIST and CIFAR10 datasets. We show that BitNet converges faster to a superior quality solution. Additionally, the resulting model is significantly smaller in size due to the use of integer parameters instead of floats. 
Bivariate Pareto Model  Bivariate.Pareto 
blabr  Scientific computing for the web. Create your own interactive computation directly in the browser. Share on the web. 
Black Hole Metric  In network science, there is often the need to sort the graph nodes. While the sorting strategy may be different, in general sorting is performed by exploiting the network structure. In particular, the metric PageRank has been used in the past decade in different ways to produce a ranking based on how many neighbors point to a specific node. PageRank is simple, easy to compute and effective in many applications, however it comes with a price: as PageRank is an application of the random walker, the arc weights need to be normalized. This normalization, while necessary, introduces a series of unwanted sideeffects. In this paper, we propose a generalization of PageRank named Black Hole Metric which mitigates the problem. We devise a scenario in which the sideeffects are particularily impactful on the ranking, test the new metric in both real and synthetic networks, and show the results. 
BlackBox Optimization Benchmarking (BBOB) 

BlackLitterman Model  In finance, the BlackLitterman model is a mathematical model for portfolio allocation developed in 1990 at Goldman Sachs by Fischer Black and Robert Litterman, and published in 1992. It seeks to overcome problems that institutional investors have encountered in applying modern portfolio theory in practice. The model starts with the equilibrium assumption that the asset allocation of a representative agent should be proportional to the market values of the available assets, and then modifies that to take into account the ‘views’ (i.e., the specific opinions about asset returns) of the investor in question to arrive at a bespoke asset allocation. 
BlackOut  We propose BlackOut, an approximation algorithm to efficiently train massive recurrent neural network language models (RNNLMs) with million word vocabularies. BlackOut is motivated by using a discriminative loss, and we describe a new sampling strategy which significantly reduces computation while improving stability, sample efficiency, and rate of convergence. One way to understand BlackOut is to view it as an extension of the DropOut strategy to the output layer, wherein we use a discriminative training loss and a weighted sampling scheme. We also establish close connections between BlackOut, importance sampling, and noise contrastive estimation (NCE). Our experiments, on the recently released one billion word language modeling benchmark, demonstrate scalability and accuracy of BlackOut; we outperform the stateofthe art, and achieve the lowest perplexity scores on this dataset. Moreover, unlike other established methods which typically require GPUs or CPU clusters, we show that a carefully implemented version of BlackOut requires only 110 days on a single machine to train a RNNLM with a million word vocabulary and billions of parameters on one billion of words. 
BlandAltman Plot  A BlandAltman plot (Difference plot) in analytical chemistry and biostatistics is a method of data plotting used in analyzing the agreement between two different assays. It is identical to a Tukey meandifference plot, the name by which it is known in other fields, but was popularised in medical statistics by J. Martin Bland and Douglas G. Altman. BlandAltmanLeh,agRee 
Bleach  In this paper we address the problem of rulebased stream data cleaning, which sets stringent requirements on latency, rule dynamics and ability to cope with the unbounded nature of data streams. We design a system, called Bleach, which achieves realtime violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state to repair data, using an incremental version of the equivalence class algorithm. Additionally, it supports rule dynamics and uses a ‘cumulative’ sliding window operation to improve cleaning accuracy. We evaluate a prototype of Bleach using a TPCDS derived dirty data stream and observe its high throughput, low latency and high cleaning accuracy, even with rule dynamics. Experimental results indicate superior performance of Bleach compared to a baseline system built on the microbatch streaming paradigm. 
BlinderOaxaca Decomposition  The BlinderOaxaca decomposition technique, or simply the Oaxaca decomposition, decomposes wage differentials into two components: a portion that arises because two comparison groups, on average, have different qualifications or credentials (e.g., years of schooling and experience in the labor market) when both groups receive the same treatment (explained component), and a portion that arises because one group is more favorably treated than the other given the same individual characteristics (unexplained component). The two portions are also called characteristics and coefficients effect using the terminology of regression analysis, which provides the basis of this decomposition technique. The coefficients effect is frequently interpreted as a measure of labor market discrimination. For a comprehensive review of issues related to labor market discrimination, see Joseph Altonji and Rebecca Blank (1999). oaxaca 
Blip  Edge environments offer a number of advantages for software developers including the ability to create services which can offer lower latency, better privacy, and reduced operational costs than traditional cloud hosted services. However large technical challenges exist, which prevent developers from utilising the Edge; complexities related to the heterogeneous nature of the Edge environment, issues with orchestration and application management and lastly, the inherent issues in creating decentralised distributed applications which operate at a large geographic scale. In this conceptual and architectural paper we envision a solution, Blip, which offers an easy to use programming and operational environment which addresses the these issues. It aims to remove the technical barriers which will inhibit the wider adoption Edge application development. This paper validates the Blip concept by demonstrating how it will deliver on the advantages of the Edge for a familiar scenario. 
Block Chain  A block chain is a transaction database shared by all nodes participating in a system based on the Bitcoin protocol. A full copy of a currency’s block chain contains every transaction ever executed in the currency. With this information, one can find out how much value belonged to each address at any point in history. Every block contains a hash of the previous block. This has the effect of creating a chain of blocks from the genesis block to the current block. Each block is guaranteed to come after the previous block chronologically because the previous block’s hash would otherwise not be known. Each block is also computationally impractical to modify once it has been in the chain for a while because every block after it would also have to be regenerated. These properties are what make doublespending of bitcoins very difficult. The block chain is the main innovation of Bitcoin. The block chain is a public ledger that records bitcoin transactions. A novel solution accomplishes this without any trusted central authority: maintenance of the block chain is performed by a network of communicating nodes running bitcoin software. Transactions of the form payer X sends Y bitcoins to payee Z are broadcast to this network using readily available software applications. Network nodes can validate transactions, add them to their copy of the ledger, and then broadcast these ledger additions to other nodes. The block chain is a distributed database; in order to independently verify the chain of ownership of any and every bitcoin (amount), each network node stores its own copy of the block chain. Approximately six times per hour, a new group of accepted transactions, a block, is created, added to the block chain, and quickly published to all nodes. This allows bitcoin software to determine when a particular bitcoin amount has been spent, which is necessary in order to prevent doublespending in an environment without central oversight. Whereas a conventional ledger records the transfers of actual bills or promissory notes that exist apart from it, the block chain is the only place that bitcoins can be said to exist in the form of unspent outputs of transactions. Blockchain Technology Explained 
Block Markov Chain (BMC) 
These Markov chains are characterized by a block structure in their transition matrix. More precisely, the $n$ possible states are divided into a finite number of $K$ groups or clusters, such that states in the same cluster exhibit the same transition rates to other states. One observes a trajectory of the Markov chain, and the objective is to recover, from this observation only, the (initially unknown) clusters. 
Block Point Process Model (BPPM) 
Many application settings involve the analysis of timestamped relations or events between a set of entities, e.g. messages between users of an online social network. Static and discretetime network models are typically used as analysis tools in these settings; however, they discard a significant amount of information by aggregating events over time to form network snapshots. In this paper, we introduce a block point process model (BPPM) for dynamic networks evolving in continuous time in the form of events at irregular time intervals. The BPPM is inspired by the wellknown stochastic block model (SBM) for static networks and is a simpler version of the recentlyproposed Hawkes infinite relational model (IRM). We show that networks generated by the BPPM follow an SBM in the limit of a growing number of nodes and leverage this property to develop an efficient inference procedure for the BPPM. We fit the BPPM to several real network data sets, including a Facebook network with over 3, 500 nodes and 130, 000 events, several orders of magnitude larger than the Hawkes IRM and other existing point process network models. 
Block Power Methods  This paper is concerned with the extension of the power method, used for finding the largest eigenvalue and associated eigenvector of a matrix, to its block from for computing the largest block eigenvalue and associated block eigenvector of a nonsymmetric matrix. Based on the developed block power method, several algorithms are developed for solving the complete set of solvents and spectral factors of a matrix polynomial, without prior knowledge of the latent roots of the matrix polynomial. Moreover, when any right/left solvent of a matrix polynomial is given, the proposed method can be used to determine the corresponding left/right solvent such that both right and left solvents have the same eigenspectra. The matrix polynomial of interest must have distinct block solvents and a corresponding nonsingular polynomial matrix. The established algorithms can be applied in the analysis and/or design of systems described by highdegree vector differential equations and/or matrix fraction descriptions. 
Block Term Network (BTnet) 
Recently, deep neural networks (DNNs) have been regarded as the stateoftheart classification methods in a wide range of applications, especially in image classification. Despite the success, the huge number of parameters blocks its deployment to situations with light computing resources. Researchers resort to the redundancy in the weights of DNNs and attempt to find how fewer parameters can be chosen while preserving the accuracy at the same time. Although several promising results have been shown along this research line, most existing methods either fail to significantly compress a welltrained deep network or require a heavy finetuning process for the compressed network to regain the original performance. In this paper, we propose the \textit{Block Term} networks (BTnets) in which the commonly used fullyconnected layers (FClayers) are replaced with block term layers (BTlayers). In BTlayers, the inputs and the outputs are reshaped into two lowdimensional highorder tensors, then blockterm decomposition is applied as tensor operators to connect them. We conduct extensive experiments on benchmark datasets to demonstrate that BTlayers can achieve a very large compression ratio on the number of parameters while preserving the representation power of the original FClayers as much as possible. Specifically, we can get a higher performance while requiring fewer parameters compared with the tensor train method. 
Block Tree (BT) 
The Block Tree (BT) is a novel compact data structure designed to compress sequence collections. It obtains compression ratios close to LempelZiv and supports efficient direct access to any substring. The BT divides the text recursively into fixedsize blocks and those appearing earlier are represented with pointers. On repetitive collections, a few blocks can represent all the others, and thus the BT reduces the size by orders of magnitude. 
BlockCNN  We present a general technique that performs both artifact removal and image compression. For artifact removal, we input a JPEG image and try to remove its compression artifacts. For compression, we input an image and process its 8 by 8 blocks in a sequence. For each block, we first try to predict its intensities based on previous blocks; then, we store a residual with respect to the input image. Our technique reuses JPEG’s legacy compression and decompression routines. Both our artifact removal and our image compression techniques use the same deep network, but with different training weights. Our technique is simple and fast and it significantly improves the performance of artifact removal and image compression. 
BlockSci  Analysis of blockchain data is useful for both scientific research and commercial applications. We present BlockSci, an opensource software platform for blockchain analysis. BlockSci is versatile in its support for different blockchains and analysis tasks. It incorporates an inmemory, analytical (rather than transactional) database, making it several hundred times faster than existing tools. We describe BlockSci’s design and present four analyses that illustrate its capabilities. This is a working paper that accompanies the first public release of BlockSci, available at https://…/BlockSci. We seek input from the community to further develop the software and explore other potential applications. 
Blockspring  Blockspring lets you dramatically scale analytics with limited technical resources. It’s a platform that makes distribution and consumption of technology simple within your organization. Here’s how it works: • Developers and data scientists post common company functions – queries, algorithms, visualizations, API calls, etc – to Blockspring in their favorite programming language. • Business users search Blockspring for the function they need, and easily use it in their spreadsheet. This model helps IT teams produced more functionality. Simultaneously, it lets business users find and use the tools they need, when they need them. 
BlockwiseMajorizationDescent  gglasso 
Bloom Filter  A Bloom filter is a spaceefficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In other words, a query returns either “possibly in set” or “definitely not in set”. Elements can be added to the set, but not removed (though this can be addressed with a “counting” filter). The more elements that are added to the set, the larger the probability of false positives. Bloom proposed the technique for applications where the amount of source data would require an impracticably large hash area in memory if “conventional” errorfree hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an errorfree hash could be used to eliminate all unnecessary disk accesses; on the other hand, with limited core memory, Bloom’s technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal errorfree hash still eliminates 85% of the disk accesses (Bloom (1970)). 
BLOSSOM  We develop the first Bayesian Optimization algorithm, BLOSSOM, which selects between multiple alternative acquisition functions and traditional local optimization at each step. This is combined with a novel stopping condition based on expected regret. This pairing allows us to obtain the best characteristics of both local and Bayesian optimization, making efficient use of function evaluations while yielding superior convergence to the global minimum on a selection of optimization problems, and also halting optimization once a principled and intuitive stopping condition has been fulfilled. 
Blossom Belief Propagation (BlossomBP) 
Maxproduct Belief Propagation (BP) is a popular messagepassing algorithm for computing a MaximumAPosteriori (MAP) assignment over a distribution represented by a Graphical Model (GM). It has been shown that BP can solve a number of combinatorial optimization problems including minimum weight matching, shortest path, network flow and vertex cover under the following common assumption: the respective Linear Programming (LP) relaxation is tight, i.e., no integrality gap is present. However, when LP shows an integrality gap, no model has been known which can be solved systematically via sequential applications of BP. In this paper, we develop the first such algorithm, coined BlossomBP, for solving the minimum weight matching problem over arbitrary graphs. Each step of the sequential algorithm requires applying BP over a modified graph constructed by contractions and expansions of blossoms, i.e., odd sets of vertices. Our scheme guarantees termination in O(n^2) of BP runs, where n is the number of vertices in the original graph. In essence, the BlossomBP offers a distributed version of the celebrated Edmonds’ Blossom algorithm by jumping at once over many substeps with a single BP. Moreover, our result provides an interpretation of the Edmonds’ algorithm as a sequence of LPs. 
BlueSky Statistics  Fully featured Statistics application and development framework built on the open source R project. Provides familiar powerful user interface available in mainstream statistical applications like SPSS, SAS etc. Unlocks the power of R for the analyst community by providing a rich GUI and output for several popular statistics, data mining, data manipulation and graphics commands, all out of the box… Provide a rich development framework for developing and deploying new statistical modules, applications or functions with rich graphical user interfaces and output, all through intuitive drag and drop user interfaces (No programming required). A quick look at BlueSky Statistics 
BlurRing  A code package, BlurRing, is developed as a method to allow for multidimensional likelihood visualisation. From the BlurRing visualisation additional information about the likelihood can be extracted. The spread in any direction of the overlaid likelihood curves gives information about the uncertainty on the confidence intervals presented in the twodimensional likelihood plots. 
Bokeh  Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, but also deliver this capability with highperformance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications. 
Bollinger Bands  The Bollinger Band was introduce by John Bollinger in 1980s. These Bands depict the volatility of stock as it increases or decreases. The bands are placed above and below the moving average line of the stocks. The wider the gap between the bands, higher is the degree of volatility. On the other hand, as the width within the band decreases, lower is the degree of volatility of the stock. At times, the width within the band is constant over a period of time, which shows the constant behavior of a certain stock over that period of time. There are three lines in the Bollinger Band, • The middle line with Nperiod moving average (MA); 20day SMA • An upper band at K times an Nperiod standard deviation above the moving average; 20day SMA + (20day standard deviation of price x 2) • A lower band at K times an Nperiod standard deviation below the moving average; 20day SMA – (20day standard deviation of price x 2) 
bolt  Bringing multidimensional arrays to distributed settings through a unified Python interface. Bolt is an open source library providing a Python interface to ndarrays backed by local or distributed implementations (currently targeting Spark). We want to make working with big array data in Python as easy and seamless as in local settings, while exploiting the speed of proven distributed engines. 
Boolean Satisfiability Problem (SAT) 
In computer science, the Boolean satisfiability problem (sometimes called Propositional Satisfiability Problem and abbreviated as SATISFIABILITY or SAT) is the problem of determining if there exists an interpretation that satisfies a given Boolean formula. In other words, it asks whether the variables of a given Boolean formula can be consistently replaced by the values TRUE or FALSE in such a way that the formula evaluates to TRUE. If this is the case, the formula is called satisfiable. On the other hand, if no such assignment exists, the function expressed by the formula is FALSE for all possible variable assignments and the formula is unsatisfiable. For example, the formula ‘a AND NOT b’ is satisfiable because one can find the values a = TRUE and b = FALSE, which make (a AND NOT b) = TRUE. In contrast, ‘a AND NOT a’ is unsatisfiable. rpicosat 
Boosting  Boosting is a machine learning metaalgorithm for reducing bias in supervised learning. Boosting is based on the question posed by Kearns: Can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily wellcorrelated with the true classification. Schapire’s affirmative answer to Kearns’ question has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting. 
Boosting Independent Embeddings Robustly (BIER) 
Learning similarity functions between image pairs with deep neural networks yields highly correlated activations of embeddings. In this work, we show how to improve the robustness of such embeddings by exploiting the independence within ensembles. To this end, we divide the last embedding layer of a deep network into an embedding ensemble and formulate training this ensemble as an online gradient boosting problem. Each learner receives a reweighted training sample from the previous learners. Further, we propose two loss functions which increase the diversity in our ensemble. These loss functions can be applied either for weight initialization or during training. Together, our contributions leverage large embedding sizes more effectively by significantly reducing correlation of the embedding and consequently increase retrieval accuracy of the embedding. Our method works with any differentiable loss function and does not introduce any additional parameters during test time. We evaluate our metric learning method on image retrieval tasks and show that it improves over stateoftheart methods on the CUB 2002011, Cars196, Stanford Online Products, InShop Clothes Retrieval and VehicleID datasets. 
Boosting Variational Inference  Variational Inference is a popular technique to approximate a possibly intractable Bayesian posterior with a more tractable one. Recently, Boosting Variational Inference has been proposed as a new paradigm to approximate the posterior by a mixture of densities by greedily adding components to the mixture. In the present work, we study the convergence properties of this approach from a modern optimization viewpoint by establishing connections to the classic FrankWolfe algorithm. Our analyses yields novel theoretical insights on the Boosting of Variational Inference regarding the sufficient conditions for convergence, explicit sublinear/linear rates, and algorithmic simplifications. 
BoostJet  Recommenders have become widely popular in recent years because of their broader applicability in many ecommerce applications. These applications rely on recommenders for generating advertisements for various offers or providing content recommendations. However, the quality of the generated recommendations depends on user features (like demography, temporality), offer features (like popularity, price), and useroffer features (like implicit or explicit feedback). Current stateoftheart recommenders do not explore such diverse features concurrently while generating the recommendations. In this paper, we first introduce the notion of Trackers which enables us to capture the abovementioned features and thus incorporate users’ online behaviour through statistical aggregates of different features (demography, temporality, popularity, price). We also show how to capture offertooffer relations, based on their consumption sequence, leveraging neural embeddings for offers in our Offer2Vec algorithm. We then introduce BoostJet, a novel recommender which integrates the Trackers along with the neural embeddings using MatrixNet, an efficient distributed implementation of gradient boosted decision tree, to improve the recommendation quality significantly. We provide an indepth evaluation of BoostJet on Yandex’s dataset, collecting online behaviour from tens of millions of online users, to demonstrate the practicality of BoostJet in terms of recommendation quality as well as scalability. 
Bootstrap Aggregating (Bagging) 
Bootstrap aggregating, also called bagging, is a machine learning ensemble metaalgorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach. 
Bootstrap CUSUM Test  Cumulative sum (CUSUM) statistics are widely used in the change point inference and identification. This paper studies the two problems for highdimensional mean vectors based on the supremum norm of the CUSUM statistics. For the problem of testing for the existence of a change point in a sequence of independent observations generated from the meanshift model, we introduce a Gaussian multiplier bootstrap to approximate critical values of the CUSUM test statistics in high dimensions. The proposed bootstrap CUSUM test is fully datadependent and it has strong theoretical guarantees under arbitrary dependence structures and mild moment conditions. Specifically, we show that with a boundary removal parameter the bootstrap CUSUM test enjoys the uniform validity in size under the null and it achieves the minimax separation rate under the sparse alternatives when the dimension $p$ can be larger than the sample size $n$. Once a change point is detected, we estimate the change point location by maximizing the supremum norm of the generalized CUSUM statistics at two different weighting scales. The first estimator is based on the covariance stationary CUSUM statistics at each data point, which is consistent in estimating the location at the nearly parametric rate $n^{1/2}$ for subexponential observations. The second estimator is a nonstationary CUSUM statistics, assigning less weights on the boundary data points. In the latter case, we show that it achieves the nearly best possible rate of convergence on the order $n^{1}$. In both cases, the dimension impacts the rate of convergence only through the logarithm factors, and therefore consistency of the CUSUM location estimators is possible when $p$ is much larger than $n$. 
Bootstrap Lasso + Partial Ridge (LPR) 
For highdimensional sparse linear models, how to construct confidence intervals for coefficients remains a difficult question. The main reason is the complicated limiting distributions of common estimators such as the Lasso. Several confidence interval construction methods have been developed, and Bootstrap Lasso+OLS is notable for its simple technicality, good interpretability, and comparable performance with other more complicated methods. However, Bootstrap Lasso+OLS depends on the betamin assumption, a theoretic criterion that is often violated in practice. In this paper, we introduce a new method called Bootstrap Lasso+Partial Ridge (LPR) to relax this assumption. LPR is a twostage estimator: first using Lasso to select features and subsequently using Partial Ridge to refit the coefficients. Simulation results show that Bootstrap LPR outperforms Bootstrap Lasso+OLS when there exist small but nonzero coefficients, a common situation violating the betamin assumption. For such coefficients, compared to Bootstrap Lasso+OLS, confidence intervals constructed by Bootstrap LPR have on average 50% larger coverage probabilities. Bootstrap LPR also has on average 35% shorter confidence interval lengths than the desparsified Lasso methods, regardless of whether linear models are misspecified. Additionally, we provide theoretical guarantees of Bootstrap LPR under appropriate conditions and implement it in the R package ‘HDCI.’ 
Bootstrap Percolation  The name percolation probably relates for most people to brewing coffee, where the water fumes go throw the coffee powder. This example is only one of many systems in which perculation phenomenon exists.To understand it a bit more, one can think of a mixture of glass and metal balls in a jar. up to a certain precentage of metal balls the mixture would behave as an insulator, that is there would be no group of metal balls touching each other that would reach from one side of the jar to the other. From a certain precentage called the percolation threshhold Pc such a group would exist and the mixture would behave as a conductor.The percolation threshold is defined as the probability below which no infinite cluster is found in the infinite system. A group of touching metal balls is named a cluster and the group that reaches from one end to the other is called the spanning cluster. 
BootstrapEnhanced Least Absolute Shrinkage Operator (Bolasso) 
Using the bootstrap and intersecting the supports, we actually get a consistent model estimate, without the consistency condition required by the regular Lasso. We refer to this new procedure as the Bolasso (bootstrapenhanced least absolute shrinkage operator). Finally, our Bolasso framework could be seen as a voting scheme applied to the supports of the bootstrap Lasso estimates; however, our procedure may rather be considered as a consensus combination scheme, as we keep the (largest) subset of variables on which all regressors agree in terms of variable selection, which is in our case provably consistent and also allows to get rid of a potential additional hyperparameter. 
Bootstrapping  In statistics, bootstrapping is a method for assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods. Generally, it falls in the broader class of resampling methods. 
BorderPeeling Clustering  In this paper, we present a novel nonparametric clustering technique, which is based on an iterative algorithm that peels off layers of points around the clusters. Our technique is based on the notion that each latent cluster is comprised of layers that surround its core, where the external layers, or border points, implicitly separate the clusters. Analyzing the Knearest neighbors of the points makes it possible to identify the border points and associate them with points of inner layers. Our clustering algorithm iteratively identifies border points, peels them, and separates the latent clusters. We show that the peeling process adapts to the local density and successfully separates adjacent clusters. A notable quality of the BorderPeeling algorithm is that it does not require any parameter tuning in order to outperform stateoftheart finelytuned nonparametric clustering methods, including MeanShift and DBSCAN. We further assess our technique on highdimensional datasets that vary in size and characteristics. In particular, we analyze the space of deep features that were trained by a convolutional neural network. 
Borealis  A generalized global update algorithm for Boolean optimization problems. Optimization problems with Boolean variables that fall into the nondeterministic polynomial (NP) class are of fundamental importance in computer science, mathematics, physics and industrial applications. Most notably, solving constraintsatisfaction problems, which are related to spinglasslike Hamiltonians in physics, remains a difficult numerical task. As such, there has been great interest in designing efficient heuristics to solve these computationally difficult problems. Inspired by parallel tempering Monte Carlo in conjunction with the rejectionfree isoenergetic cluster algorithm developed for Ising spin glasses, we present a generalized global update optimization heuristic that can be applied to different NPcomplete problems with Boolean variables. The global cluster updates allow for a widespread sampling of phase space, thus considerably speeding up optimization. By carefully tuning the pseudotemperature (needed to randomize the configurations) of the problem, we show that the method can efficiently tackle optimization problems with overconstraints or on topologies with a large sitepercolation threshold. We illustrate the efficiency of the heuristic on paradigmatic optimization problems, such as the maximum satisfiability problem and the vertex cover problem. 
BornAgain Network (BAN) 
Knowledge distillation (KD) consists of transferring knowledge from one machine learning model (the teacher}) to another (the student). Commonly, the teacher is a highcapacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student’s compactness. %we desire a compact model with performance close to the teacher’s. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these {BornAgain Networks (BANs), outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate stateoftheart performance on the CIFAR10 (3.5%) and CIFAR100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) ConfidenceWeighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating a role of the teacher outputs on both predicted and nonpredicted classes. We present experiments with students of various capacities, focusing on the underexplored case where students overpower teachers. Our experiments show significant advantages from transferring knowledge between DenseNets and ResNets in either direction. 
Boruta  Machine learning methods are often used to classify objects described by hundreds of attributes; in many applications of this kind a great fraction of attributes may be totally irrelevant to the classification problem. Even more, usually one cannot decide a priori which attributes are relevant. In this paper we present an improved version of the algorithm for identification of the full set of truly important variables in an information system. It is an extension of the random forest method which utilises the importance measure generated by the original algorithm. It compares, in the iterative fashion, the importances of original attributes with importances of their randomised copies. We analyse performance of the algorithm on several examples of synthetic data, as well as on a biologically important problem, namely on identification of the sequence motifs that are important for aptameric activity of short RNA sequences. Boruta 
Boundary Optimizing Network (BON) 
Despite all the success that deep neural networks have seen in classifying certain datasets, the challenge of finding optimal solutions that generalize well still remains. In this paper, we propose the Boundary Optimizing Network (BON), a new approach to generalization for deep neural networks when used for supervised learning. Given a classification network, we propose to use a collaborative generative network that produces new synthetic data points in the form of perturbations of original data points. In this way, we create a data support around each original data point which prevents decision boundaries to pass too close to the original data points, i.e. prevents overfitting. To prevent catastrophic forgetting during training, we propose to use a variation of Memory Aware Synapses to optimize the generative networks. On the Iris dataset, we show that the BON algorithm creates better decision boundaries when compared to a network regularized by the popular dropout scheme. 
Bowtie  Bowtie is a library for writing dashboards in Python. No need to know web frameworks or JavaScript, focus on building functionality in Python. Interactively explore your data in new ways! Deploy and share with others! 
Box–Muller Transform  The BoxMuller transform (by George Edward Pelham Box and Mervin Edgar Muller 1958) is a pseudorandom number sampling method for generating pairs of independent, standard, normally distributed (zero expectation, unit variance) random numbers, given a source of uniformly distributed random numbers. It is commonly expressed in two forms. The basic form as given by Box and Muller takes two samples from the uniform distribution on the interval (0, 1] and maps them to two standard, normally distributed samples. The polar form takes two samples from a different interval, [1, +1], and maps them to two normally distributed samples without the use of sine or cosine functions. The BoxMuller transform was developed as a more computationally efficient alternative to the inverse transform sampling method. The Ziggurat algorithm gives an even more efficient method. 
Boxplot  In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms boxandwhisker plot and boxandwhisker diagram. Outliers may be plotted as individual points. 
Brain2Text  Nowadays, the Internet represents a vast informational space, growing exponentially and the problem of search for relevant data becomes essential as never before. The algorithm proposed in the article allows to perform natural language queries on content of the document and get comprehensive meaningful answers. The problem is partially solved for English as SQuAD contains enough data to learn on, but there is no such dataset in Russian, so the methods used by scientists now are not applicable to Russian. Brain2 framework allows to cope with the problem – it stands out for its ability to be applied on small datasets and does not require impressive computing power. The algorithm is illustrated on Sberbank of Russia Strategy’s text and assumes the use of a neuromodel consisting of 65 mln synapses. The trained model is able to construct wordbyword answers to questions based on a given text. The existing limitations are its current inability to identify synonyms, pronoun relations and allegories. Nevertheless, the results of conducted experiments showed high capacity and generalisation ability of the suggested approach. 
Branch Convolutional Neural Network (BCNN) 
Convolutional Neural Network (CNN) image classifiers are traditionally designed to have sequential convolutional layers with a single output layer. This is based on the assumption that all target classes should be treated equally and exclusively. However, some classes can be more difficult to distinguish than others, and classes may be organized in a hierarchy of categories. At the same time, a CNN is designed to learn internal representations that abstract from the input data based on its hierarchical layered structure. So it is natural to ask if an inverse of this idea can be applied to learn a model that can predict over a classification hierarchy using multiple output layers in decreasing order of class abstraction. In this paper, we introduce a variant of the traditional CNN model named the Branch Convolutional Neural Network (BCNN). A BCNN model outputs multiple predictions ordered from coarse to fine along the concatenated convolutional layers corresponding to the hierarchical structure of the target classes, which can be regarded as a form of prior knowledge on the output. To learn with BCNNs a novel training strategy, named the Branch Training strategy (BTstrategy), is introduced which balances the strictness of the prior with the freedom to adjust parameters on the output layers to minimize the loss. In this way we show that CNN based models can be forced to learn successively coarse to fine concepts in the internal layers at the output stage, and that hierarchical prior knowledge can be adopted to boost CNN models’ classification performance. Our models are evaluated to show that the BCNN extensions improve over the corresponding baseline CNN on the benchmark datasets MNIST, CIFAR10 and CIFAR100. 
Breadthfirst Search (BFS) 
In graph theory, breadthfirst search (BFS) is a strategy for searching in a graph when search is limited to essentially two operations: (a) visit and inspect a node of a graph; (b) gain access to visit the nodes that neighbor the currently visited node. The BFS begins at a root node and inspects all the neighboring nodes. Then for each of those neighbor nodes in turn, it inspects their neighbor nodes which were unvisited, and so on. Compare BFS with the equivalent, but more memoryefficient Iterative deepening depthfirst search and contrast with depthfirst search. 
Break Down Plot  Break Down Plots are inspired by waterfall plots created by ‘xgboostExplainer’ package (see <https://…/xgboostExplainer> ). The idea behind Break Down Plots it to decompose model prediction for a single observation. Break Down Plots show the contribution of every variable present in the model. Such plots will work for binary classifiers and general regression models. breakDown 
Breakout  A breakout is typically characterized by two steady states and an intermediate transition period. Broadly speaking, breakouts have two flavors: 1. Mean shift: A sudden jump in the time series corresponds to a mean shift. A sudden jump in CPU utilization from 40% to 60% would exemplify a mean shift. 2. Ramp up: A gradual increase in the value of the metric from one steady state to another constitutes a ramp up. A gradual increase in CPU utilization from 40% to 60% would exemplify a ramp up. 
Bridge Sampling  (Bennett, 1976; Meng & Wong, 1996), a reliable and relatively straightforward sampling method that allows researchers to obtain the marginal likelihood for models of varying complexity. 
Broadcasting Convolutional Network  While convolutional neural networks (CNNs) are widely used for handling spatiotemporal scenes, there exist limitations in reasoning relations among spatial features caused by their inherent structures, which have been issued consistently in many studies. In this paper, we propose Broadcasting Convolutional Networks (BCN) that allow global receptive fields to share spatial information. BCNs are simple network modules that collect effective spatial features, embed location informations and broadcast them to the entire feature maps without much additional computational cost. This method gains great improvements in feature localization problems through efficiently extending the receptive fields, and can easily be implemented within any structure of CNNs. We further utilize BCN to propose MultiRelational Networks (multiRN) that greatly improve existing Relation Networks (RNs). In pixelbased relation reasoning problems, multiRN with BCNs implanted extends the concept of `pairwise relations’ from conventional RNs to `multiple relations’ by relating each object with multiple objects at once and not in pairs. This yields in O(n) complexity for n number of objects, which is a vast computational gain from RNs that take O(n^2). Through experiments, BCNs are proven for their usability on relation reasoning problems, which is due from their efficient handlings of spatial information. 
Broyden–Fletcher–Goldfarb–Shanno Algorithm (BFGS) 
In numerical optimization, the BroydenFletcherGoldfarbShanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems. The BFGS method approximates Newton’s method, a class of hillclimbing optimization techniques that seeks a stationary point of a (preferably twice continuously differentiable) function. For such problems, a necessary condition for optimality is that the gradient be zero. Newton’s method and the BFGS methods are not guaranteed to converge unless the function has a quadratic Taylor expansion near an optimum. These methods use both the first and second derivatives of the function. However, BFGS has proven to have good performance even for nonsmooth optimizations. 
BUbiNG  BUbiNG is an opensource Java fully distributed crawler; a single BUbiNG agent, using sizeable hardware, can crawl several thousands pages per second respecting strict politeness constraints, both host and IPbased. Unlike existing opensource distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG job distribution is based on modern highspeed protocols so to achieve very high throughput. 
Bucketization  
Budding Perceptron  Traditionally, deep learning algorithms update the network weights whereas the network architecture is chosen manually, using a process of trial and error. In this work, we propose two novel approaches that automatically update the network structure while also learning its weights. The novelty of our approach lies in our parameterization where the depth, or additional complexity, is encapsulated continuously in the parameter space through control parameters that add additional complexity. We propose two methods: In tunnel networks, this selection is done at the level of a hidden unit, and in budding perceptrons, this is done at the level of a network layer; updating this control parameter introduces either another hidden unit or another hidden layer. We show the effectiveness of our methods on the synthetic twospirals data and on two real data sets of MNIST and MIRFLICKR, where we see that our proposed methods, with the same set of hyperparameters, can correctly adjust the network complexity to the task complexity. 
Bumping  Bumping is a simple algorithm that can help your classifier escape from a local minimum. The idea behind bumping is that we can break the symmetry of the problem (or escape the local minimum) by training a decision tree on random subsample. This is similar to bagging. The hope is that in the subsample there will be a preferred split so the tree can pick it. We fit several trees on different bootstrap) samples (sampling with replacement) and choose the one with the best performance on the full training set as the winner. The more rounds of bumping we do, the more likely we are to escape. It costs more CPU time as well though. 
Bumps Chart  Bump charts got their name from ‘bumps race’, a term used to refer to a boat race where each boat tries to ‘bump’ the one in front and move up the chart. Bump charts have become quite common of late and are typically used to represent changes in the position of a given number of competing entities over a fixed time duration. 
Burning Number  We introduce a new graph parameter called the burning number, inspired by contact processes on graphs such as graph bootstrap percolation, and graph searching paradigms such as Firefighter. The burning number measures the speed of the spread of contagion in a graph; the lower the burning number, the faster the contagion spreads. We provide a number of properties of the burning number, including characterizations and bounds. The burning number is computed for several graph classes, and is derived for the graphs generated by the Iterated Local Transitivity model for social networks. 
Business Analysis Body of Knowledge (BABOK) 
A Guide to the Business Analysis Body of Knowledge (BABOK) is the written guide to the collection of business analysis knowledge reflecting current best practice, providing a framework that describes the areas of knowledge, with associated activities and tasks and techniques required. According to Capability Maturity Model Integration, organisations interested in process improvement need to adopt industry standards from the Business Analysis Body of Knowledge (and other associated references) to lift their project delivery from the ad hoc to the managed level. 
Business Analytics  Business analytics (BA) refers to the skills, technologies, applications and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning. Business analytics focuses on developing new insights and understanding of business performance based on data and statistical methods. In contrast, business intelligence traditionally focuses on using a consistent set of metrics to both measure past performance and guide business planning, which is also based on data and statistical methods. 
Business Function Library (BFL) 
The Business Function Library (BFL) is one of the SAP AFL (Application Function) Libraries. It contains prebuilt parameterdriven functions in the financial area. The functions are implemented by C++. This library helps you develop compound business algorithms that are fully compliant with the SAP HANA calculation engine. It offers you the flexibility and efficiency to develop HANAbased applications with incredible performance. 
Business Intelligence (BI) 
Business intelligence (BI) is a set of theories, methodologies, architectures, and technologies that transform raw data into meaningful and useful information for business purposes. BI can handle enormous amounts of unstructured data to help identify, develop and otherwise create new opportunities. BI, in simple words, makes interpreting voluminous data friendly. Making use of new opportunities and implementing an effective strategy can provide a competitive market advantage and longterm stability. 
Business Intelligence Competency Centers (BICC) 
A Business Intelligence Competency Center (BICC) is a crossfunctional organizational team that has defined tasks, roles, responsibilities and processes for supporting and promoting the effective use of Business Intelligence (BI) across an organization. As early as 2001, Gartner, an information technology research and advisory company, started advocating that companies need a BICC to develop and focus resources to be successful using business intelligence. Since then, the BICC concept has been further refined through practical implementations in organizations that have implemented BI and analytical software. In practice, the term ‘BICC’ is not well integrated into the nomenclature of business or public sector organizations and there are a large degree of variances in the organizational design for BICCs. Nevertheless, the popularity of the BICC concept has caused the creation of units that focus on ensuring the use of the information for decisionmaking from BI software and increasing the return on investment (ROI) of BI. A BICC coordinates the activities and resources to ensure that a factbased approach to decision making is systematically implemented throughout an organization. It has responsibility for the governance structure for BI and analytical programs, projects, practices, software, and architecture. It is responsible for building the plans, priorities, infrastructure, and competencies that the organization needs to take forwardlooking strategic decisions by using the BI and analytical software capabilities. A BICC’s influence transcends that of a typical business unit, playing a crucial central role in the organizational change and strategic process. Accordingly, the BICC’s purpose is to empower the entire organization to coordinate BI from all units. Through centralization, it ‘…ensures that information and best practices are communicated and shared through the entire organization so that everyone can benefit from successes and lessons learned.’ The BICC also plays an important organizational role facilitating interaction among the various cultures and units within the organization. Knowledge transfer, enhancement of analytic skills, coaching and training are central to the mandate of the BICC. A BICC should be pivotal in ensuring a high degree of information consumption and a ROI for BI. 
ButterflyNet  Deep networks, especially Convolutional Neural Networks (CNNs), have been successfully applied in various areas of machine learning as well as to challenging problems in other scientific and engineering fields. This paper introduces ButterflyNet, a lowcomplexity CNN with structured hardcoded weights and sparse acrosschannel connections, which aims at an optimal hierarchical function representation of the input signal. Theoretical analysis of the approximation power of ButterflyNet to the Fourier representation of input data shows that the error decays exponentially as the depth increases. Due to the ability of ButterflyNet to approximate Fourier and local Fourier transforms, the result can be used for approximation upper bound for CNNs in a large class of problems. The analysis results are validated in numerical experiments on the approximation of a 1D Fourier kernel and of solving a 2D Poisson’s equation. 
Byzantine Gradient Descent  We consider the problem of distributed statistical machine learning in adversarial settings, where some unknown and timevarying subset of working machines may be compromised and behave arbitrarily to prevent an accurate model from being learned. This setting captures the potential adversarial attacks faced by Federated Learning — a modern machine learning paradigm that is proposed by Google researchers and has been intensively studied for ensuring user privacy. Formally, we focus on a distributed system consisting of a parameter server and $m$ working machines. Each working machine keeps $N/m$ data samples, where $N$ is the total number of samples. The goal is to collectively learn the underlying true model parameter of dimension $d$. In classical batch gradient descent methods, the gradients reported to the server by the working machines are aggregated via simple averaging, which is vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine gradient descent method based on the geometric median of means of the gradients. We show that our method can tolerate $q \le (m1)/2$ Byzantine failures, and the parameter estimate converges in $O(\log N)$ rounds with an estimation error of $\sqrt{d(2q+1)/N}$, hence approaching the optimal error rate $\sqrt{d/N}$ in the centralized and failurefree setting. The total computational complexity of our algorithm is of $O((Nd/m) \log N)$ at each working machine and $O(md + kd \log^3 N)$ at the central server, and the total communication cost is of $O(m d \log N)$. We further provide an application of our general results to the linear regression problem. A key challenge arises in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. We prove that the aggregated gradient converges uniformly to the true gradient function. 
Byzantine Stochastic Gradient Descent (BSGD) 
This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$fraction are Byzantine, and can behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional minibatch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is informationtheoretically optimal both in terms of sampling complexity and time complexity. 
Advertisements