# Magister Dixit

“Predictive is the ‘killer app’ for Big Data.” Waqar Hasan

# If you did not already know

Speech2Vec
In this paper, we propose a novel deep neural network architecture, Speech2Vec, for learning fixed-length vector representations of audio segments excised from a speech corpus, where the vectors contain semantic information pertaining to the underlying spoken words, and are close to other vectors in the embedding space if their corresponding underlying spoken words are semantically similar. The proposed model can be viewed as a speech version of Word2Vec. Its design is based on an RNN Encoder-Decoder framework, and borrows the methodology of skipgrams or continuous bag-of-words for training. Learning word embeddings directly from speech enables Speech2Vec to make use of the semantic information carried by speech that does not exist in plain text. The learned word embeddings are evaluated and analyzed on 13 widely used word similarity benchmarks, and outperform word embeddings learned by Word2Vec from the transcriptions. …

An implementation of the additive polynomial (AP) design matrix. It constructs and appends an AP design matrix to a data frame for use with longitudinal data subject to seasonality. …

Echo State Network (ESN)
The echo state network (ESN) is a recurrent neural network with a sparsely connected hidden layer (with typically 1% connectivity). The connectivity and weights of hidden neurons are fixed and randomly assigned. The weights of output neurons can be learned so that the network can (re)produce specific temporal patterns. The main interest of this network is that although its behaviour is non-linear, the only weights that are modified during training are for the synapses that connect the hidden neurons to output neurons. Thus, the error function is quadratic with respect to the parameter vector, and setting its derivative to zero yields a linear system. Alternatively, one may consider a nonparametric Bayesian formulation of the output layer, under which: (i) a prior distribution is imposed over the output weights; and (ii) the output weights are marginalized out in the context of prediction generation, given the training data. This idea has been demonstrated using Gaussian priors, whereby a Gaussian process model with an ESN-driven kernel function is obtained. Such a solution was shown to outperform ESNs with trainable (finite) sets of weights in several benchmarks.
Deep Echo State Networks for Diagnosis of Parkinson’s Disease
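
The training idea described above (fixed random reservoir, trainable linear readout) can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a reference implementation: the function names, the `tanh` activation, the spectral-radius rescaling, the ridge term, and the sine-wave task are all choices made for this example.

```python
import numpy as np


def make_reservoir(n_inputs, n_hidden, connectivity=0.01, spectral_radius=0.9, seed=0):
    """Draw fixed random input and recurrent weights (never trained)."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, size=(n_hidden, n_inputs))
    w = rng.uniform(-0.5, 0.5, size=(n_hidden, n_hidden))
    # Keep roughly `connectivity` of the recurrent links, zero out the rest.
    w *= rng.random((n_hidden, n_hidden)) < connectivity
    # Assumption: rescale so the largest eigenvalue magnitude is below 1.
    radius = np.max(np.abs(np.linalg.eigvals(w)))
    if radius > 0:
        w *= spectral_radius / radius
    return w_in, w


def run_reservoir(w_in, w, inputs):
    """Collect the hidden state for every time step of `inputs` (shape T x n_inputs)."""
    x = np.zeros(w.shape[0])
    states = np.zeros((len(inputs), w.shape[0]))
    for t, u in enumerate(inputs):
        x = np.tanh(w_in @ u + w @ x)
        states[t] = x
    return states


def fit_readout(states, targets, ridge=1e-6):
    """Only the output weights are learned: the squared error is quadratic
    in them, so the minimiser solves a (ridge-regularised) linear system."""
    a = states.T @ states + ridge * np.eye(states.shape[1])
    b = states.T @ targets
    return np.linalg.solve(a, b)


# Toy task (hypothetical): predict the next sample of a sine wave.
signal = np.sin(0.1 * np.arange(1200))
inputs, targets = signal[:-1, None], signal[1:, None]

w_in, w = make_reservoir(n_inputs=1, n_hidden=200)
states = run_reservoir(w_in, w, inputs)
washout = 100  # discard the initial transient before fitting
w_out = fit_readout(states[washout:], targets[washout:])
pred = states[washout:] @ w_out
mse = float(np.mean((pred - targets[washout:]) ** 2))
print(f"training MSE: {mse:.2e}")
```

Because the reservoir is never updated, training reduces to a single call to `np.linalg.solve`; no backpropagation through time is involved.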

# R Packages worth a look

Working Examples for Reproducible Research Documents (stationery)
Templates, guides, and scripts for writing documents in ‘LaTeX’ and ‘R markdown’ to produce guides, slides, and reports. Special care is taken to illus …

The Entire Solution Paths for ROC-SVM (rocsvm.path)
We develop the entire solution paths for ROC-SVM presented by Rakotomamonjy. The ROC-SVM solution path algorithm greatly facilitates the tuning procedu …

Supervised NMF (SpNMF)
Non-negative Matrix Factorization (NMF) is a powerful tool for identifying the key features of microbial communities and a dimension-reduction method. W …

# Book Memo: “Machine Learning Risk Assessments in Criminal Justice Settings”

This book puts in one place and in accessible form Richard Berk’s most recent work on forecasts of re-offending by individuals already in criminal justice custody. Using machine learning statistical procedures trained on very large datasets, an explicit introduction of the relative costs of forecasting errors as the forecasts are constructed, and an emphasis on maximizing forecasting accuracy, the author shows how his decades of research on the topic improves forecasts of risk. Criminal justice risk forecasts anticipate the future behavior of specified individuals, rather than ‘predictive policing’ for locations in time and space, which is a very different enterprise that uses different data and different data analysis tools.

# Document worth reading: “Machine Learning and Applied Linguistics”

This entry introduces the topic of machine learning and provides an overview of its relevance for applied linguistics and language learning. The discussion will focus on giving an introduction to the methods and applications of machine learning in applied linguistics, and will provide references for further study. Machine Learning and Applied Linguistics

# Distilled News

This is the first blog post in a two-part series. The series expands on the Frontiers of Natural Language Processing session organized by Herman Kamper and me at the Deep Learning Indaba 2018. Slides of the entire session can be found here. This post will discuss major recent advances in NLP focusing on neural network-based methods. The second post will discuss open problems in NLP.
There’s a lot of conversation lately about all the possibilities of machines learning to do things humans currently do in our factories, warehouses, offices and homes. While the technology is evolving – quickly – along with fears and excitement, terms such as artificial intelligence, machine learning and deep learning may leave you perplexed. I hope that this simple guide will help sort out the confusion around deep learning and that the 8 practical examples will help to clarify the actual use of deep learning technology today.
1. Virtual assistants
2. Translations
3. Vision for driverless delivery trucks, drones and autonomous cars
4. Chatbots and service bots
5. Image colorization
6. Facial recognition
7. Medicine and pharmaceuticals
8. Personalized shopping and entertainment
Today’s machine learning systems are more advanced than ever, capable of automating increasingly complex tasks and serving as a critical tool for human operators. Despite recent advances, however, a critical component of Artificial Intelligence (AI) remains just out of reach – machine common sense. Defined as ‘the basic ability to perceive, understand, and judge things that are shared by nearly all people and can be reasonably expected of nearly all people without need for debate,’ common sense forms a critical foundation for how humans interact with the world around them. Possessing this essential background knowledge could significantly advance the symbiotic partnership between humans and machines. But articulating and encoding this obscure-but-pervasive capability is no easy feat. ‘The absence of common sense prevents an intelligent system from understanding its world, communicating naturally with people, behaving reasonably in unforeseen situations, and learning from new experiences,’ said Dave Gunning, a program manager in DARPA’s Information Innovation Office (I2O). ‘This absence is perhaps the most significant barrier between the narrowly focused AI applications we have today and the more general AI applications we would like to create in the future.’
E-commerce has revolutionized the way we shop. That phone you’ve been saving up to buy for months? It’s just a search and a few clicks away. Items are delivered within a matter of days (sometimes even the next day!). For online retailers, there are no constraints related to inventory management or space management. They can sell as many different products as they want. Brick and mortar stores can keep only a limited number of products due to the finite space they have available. I remember when I used to place orders for books at my local bookstore, and it used to take over a week for the book to arrive. It seems like a story from the ancient times now!
In this post, we explore some broad guidelines for selecting machine learning models.
The overall steps for Machine Learning/Deep Learning are:
• Collect data
• Check for anomalies, missing data and clean the data
• Perform statistical analysis and initial visualization
• Build models
• Check the accuracy
• Present the results
Machine learning tasks can be classified into:
• Supervised learning
• Unsupervised learning
• Semi-supervised learning
• Reinforcement learning
(In this document, we do not focus on the last two.)
Below are some approaches on choosing a model for Machine Learning/Deep Learning.
In this tutorial, learn how to implement decorators in Python. A decorator is a design pattern in Python that allows a user to add new functionality to an existing object without modifying its structure. Decorators are usually called before the definition of a function you want to decorate. In this tutorial, we’ll show the reader how they can use decorators in their Python functions.
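
As a small illustration of the pattern described in this teaser (not taken from the tutorial itself), here is a sketch of a decorator that counts how often a function is called. The names `count_calls` and `add` are hypothetical and chosen only for this example.

```python
import functools


def count_calls(func):
    """Decorator that adds a call counter without changing `func` itself."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls  # the decorator is applied just above the function definition
def add(a, b):
    """Return the sum of a and b."""
    return a + b
```

Calling `add(2, 3)` still returns `5`, while `add.calls` increases by one per call. Writing `@count_calls` above the definition is shorthand for `add = count_calls(add)`.
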
This tutorial takes course material from DataCamp’s Machine Learning Toolbox course and allows you to practice confusion matrices in R.
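
The tutorial itself works in R. Purely as a language-neutral illustration of what a confusion matrix tabulates, here is a short Python sketch; the labels and predictions are made-up toy data and the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Share of cases on the diagonal, i.e. predicted class equals actual class."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy data, invented for this example.
actual = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]

matrix = confusion_matrix(actual, predicted, labels=["pos", "neg"])
print(matrix)            # [[2, 1], [1, 2]]
print(accuracy(matrix))  # 4 of 6 predictions are correct
```
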
Decision trees are a highly useful visual aid in analyzing a series of predicted outcomes for a particular model. As such, they are often used as a supplement to (or even an alternative to) regression analysis in determining how a series of explanatory variables will impact the dependent variable.
As I allude here, my long-held impression is that no true anomaly-based network IDS (NIDS) has ever been successful commercially and/or operationally. There were some bits of success, to be sure (‘OMG WE CAN DETECT PORTSCANS!!!’), but in total, they (IMHO) don’t quite measure up to SUCCESS of the approach. In light of this opinion, here is a fun question: do you think the current generation of machine learning (ML) – and ‘AI’-based (why is AI in quotes?) systems will work better? Note that I am aiming at a really, really low bar: will they work better than – per the above statement – not at all? But my definition of ‘work’ includes ‘work in today’s messy and evolving real life networks.’ This is actually a harder question than it seems. Of course, ML and ‘AI’ aficionados (who, as I am hearing, are generally saner compared to the blockchain types … these are more akin to clowns, really) would claim that of course ‘now with ML, things are totally different’, ‘because cyber AI’ and ‘next next next generation deep learning just works.’ On the other hand, some of the rumors we are hearing mention that in noisy, flat, poorly managed networks anomaly detection devolves to … no, really! … to signatures and fixed activity thresholds where humans write rules about what is bad and/or not good.
Pipe notation is popular with a large league of R users, with magrittr being the dominant realization. However, this should not be enough to consider piping in R as a completely settled topic that is not subject to further discussion, experiments, or the possibility of improvement. To promote innovation opportunities we describe wrapr ‘dot-pipe’, a well behaved sequencing operator with S3 extensibility. In this article we include a number of examples of using this pipe to interact with and extend other R packages.

# R Packages worth a look

The Folding Test of Unimodality (Rfolding)
The basic algorithm to perform the folding test of unimodality. Given a dataset X (d dimensional, n samples), the test checks whether the distribution …

Multivariate Normal Variance Mixtures (Including Student’s t Distribution for Non-Integer Degrees of Freedom) (nvmix)
Functions for working with multivariate normal variance mixture distributions including evaluating their distribution functions, densities and random n …

Optimal Binning of Continuous and Categorical Variables (varbin)
Tool for easy and efficient discretization of continuous and categorical data. The package calculates the most optimal binning of a given explanatory v …

# What’s new on arXiv

Activity recognition from sensor data deals with various challenges, such as overlapping activities, activity labeling, and activity detection. Although each challenge in the field of recognition has great importance, the most important one is online activity recognition. The present study uses an online hierarchical hidden Markov model to detect activities in the stream of sensor data, so that the activity in the environment can be predicted at each sensor event. The activity recognition samples were labeled by statistical features such as the duration of activity. Tests of the proposed method on two real-world smart-home datasets showed that results on one dataset improved by 4%, reaching 59%, while results on the other dataset reached 64.6% using the best methods.
The banking industry is very important to the economic cycle of every country and provides a variety of services. With advances in technology and the rapidly increasing complexity of the business environment, it has become more competitive than in the past, so efficiency analysis in the banking industry has attracted much attention in recent years. From many aspects, such analyses at the branch level are more desirable. Evaluating branch performance with the purpose of eliminating deficiencies can be a crucial issue for branch managers who want to measure branch efficiency. This work can not only lead to a better understanding of bank branch performance but also give further information to enhance managerial decisions and recognize problematic areas. To achieve this purpose, this study presents an integrated approach based on Data Envelopment Analysis (DEA), clustering algorithms and a Polynomial Pattern Classifier for constructing a classifier to identify a class of bank branches. First, the efficiency estimates of individual branches are evaluated by using the DEA approach. Next, once the range and number of classes have been identified by experts, the number of clusters is identified by an agglomerative hierarchical clustering algorithm based on some statistical methods. Next, we divide our raw data into k clusters by means of self-organizing map (SOM) neural networks. Finally, all clusters are fed into the reduced multivariate polynomial model to predict the classes of data.
Conceptual Knowledge Markup Language (CKML) is an application of XML. Earlier versions of CKML followed rather exclusively the philosophy of Conceptual Knowledge Processing (CKP), a principled approach to knowledge representation and data analysis that ‘advocates methods and instruments of conceptual knowledge processing which support people in their rational thinking, judgment and acting and promote critical discussion.’ The new version of CKML continues to follow this approach, but also incorporates various principles, insights and techniques from Information Flow (IF), the logical design of distributed systems. Among other things, this allows diverse communities of discourse to compare their own information structures, as coded in logical theories, with that of other communities that share a common generic ontology. CKML incorporates the CKP ideas of concept lattice and formal context, along with the IF ideas of classification (= formal context), infomorphism, theory, interpretation and local logic. Ontology Markup Language (OML), a subset of CKML that is a self-sufficient markup language in its own right, follows the principles and ideas of Conceptual Graphs (CG). OML is used for structuring the specifications and axiomatics of metadata into ontologies. OML incorporates the CG ideas of concept, conceptual relation, conceptual graph, conceptual context, participants and ontology. The link from OML to CKML is the process of conceptual scaling, which is the interpretive transformation of ontologically structured knowledge to conceptual structured knowledge.
This paper presents a novel physics-informed regularization method for training of deep neural networks (DNNs). In particular, we focus on the DNN representation for the response of a physical or biological system, for which a set of governing laws are known. These laws often appear in the form of differential equations, derived from first principles, empirically-validated laws, and/or domain expertise. We propose a DNN training approach that utilizes these known differential equations in addition to the measurement data, by introducing a penalty term to the training loss function to penalize divergence from the governing laws. Through three numerical examples, we will show that the proposed regularization produces surrogates that are physically interpretable with smaller generalization errors, when compared to other common regularization methods.
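
To make the idea of such a penalty term concrete, here is a minimal Python sketch. It is not the paper's method: a small polynomial stands in for the DNN, and the governing law is assumed to be the decay equation dy/dt = -k·y, chosen only because its residual is easy to write down. All names are hypothetical.

```python
import numpy as np


def model(coeffs, t):
    """Stand-in surrogate: a polynomial in t (highest degree first)."""
    return np.polyval(coeffs, t)


def model_derivative(coeffs, t):
    """Time derivative of the surrogate, available in closed form here."""
    return np.polyval(np.polyder(coeffs), t)


def physics_informed_loss(coeffs, t_data, y_data, t_colloc, decay_rate, weight):
    """Data misfit plus a penalty on violating dy/dt = -decay_rate * y."""
    data_term = np.mean((model(coeffs, t_data) - y_data) ** 2)
    # Residual of the assumed governing law at the collocation points.
    residual = model_derivative(coeffs, t_colloc) + decay_rate * model(coeffs, t_colloc)
    physics_term = np.mean(residual ** 2)
    return data_term + weight * physics_term


# Measurements of y = exp(-t) on a short interval (toy data).
t = np.linspace(0.0, 0.5, 20)
y = np.exp(-t)

taylor = np.array([-1 / 6, 0.5, -1.0, 1.0])  # cubic Taylor expansion of exp(-t)
flat = np.array([0.0, 0.0, 0.0, 1.0])        # constant y = 1, ignores the decay law

loss_taylor = physics_informed_loss(taylor, t, y, t, decay_rate=1.0, weight=1.0)
loss_flat = physics_informed_loss(flat, t, y, t, decay_rate=1.0, weight=1.0)
print(loss_taylor, loss_flat)
```

The candidate that respects the decay law receives a much smaller loss than the constant one. In a real setting the same composite loss would be minimised over network weights, with the derivative obtained by automatic differentiation.
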
The heavy-tailed distributions of corrupted outliers and singular values of all channels in low-level vision have proven effective priors for many applications such as background modeling, photometric stereo and image alignment. And they can be well modeled by a hyper-Laplacian. However, the use of such distributions generally leads to challenging non-convex, non-smooth and non-Lipschitz problems, and makes existing algorithms very slow for large-scale applications. Together with the analytic solutions to lp-norm minimization with two specific values of p, i.e., p=1/2 and p=2/3, we propose two novel bilinear factor matrix norm minimization models for robust principal component analysis. We first define the double nuclear norm and Frobenius/nuclear hybrid norm penalties, and then prove that they are in essence the Schatten-1/2 and 2/3 quasi-norms, respectively, which lead to much more tractable and scalable Lipschitz optimization problems. Our experimental analysis shows that both our methods yield more accurate solutions than original Schatten quasi-norm minimization, even when the number of observations is very limited. Finally, we apply our penalties to various low-level vision problems, e.g., text removal, moving object detection, image alignment and inpainting, and show that our methods usually outperform the state-of-the-art methods.
Annotation guidelines used to guide the annotation of training and evaluation datasets can have a considerable impact on the quality of machine learning models. In this study, we explore the effects of annotation guidelines on the quality of app feature extraction models. As a main result, we propose several changes to the existing annotation guidelines with a goal of making the extracted app features more useful and informative to the app developers. We test the proposed changes via simulating the application of the new annotation guidelines and then evaluating the performance of the supervised machine learning models trained on datasets annotated with initial and simulated guidelines. While the overall performance of automatic app feature extraction remains the same as compared to the model trained on the dataset with initial annotations, the features extracted by the model trained on the dataset with simulated new annotations are less noisy and more informative to the app developers. Secondly, we are interested in what kind of annotated training data is necessary for training an automatic app feature extraction model. In particular, we explore whether the training set should contain annotated app reviews from those apps/app categories on which the model is subsequently planned to be applied, or whether it is sufficient to have annotated app reviews from any app available for training, even when these apps are from very different categories compared to the test app. Our experiments show that having annotated training reviews from the test app is not necessary, although including them in the training set helps to improve recall. Furthermore, we test whether augmenting the training set with annotated product reviews helps to improve the performance of app feature extraction. We find that the models trained on the augmented training set lead to improved recall but at the cost of a drop in precision.
Coreference resolution is an important task for natural language understanding, and the resolution of ambiguous pronouns a longstanding challenge. Nonetheless, existing corpora do not capture ambiguous pronouns in sufficient volume or diversity to accurately indicate the practical utility of models. Furthermore, we find gender bias in existing corpora and systems favoring masculine entities. To address this, we present and release GAP, a gender-balanced labeled corpus of 8,908 ambiguous pronoun-name pairs sampled to provide diverse coverage of challenges posed by real-world text. We explore a range of baselines which demonstrate the complexity of the challenge, the best achieving just 66.9% F1. We show that syntactic structure and continuous neural models provide promising, complementary cues for approaching the challenge.
Anomaly detection is often considered a challenging field of machine learning due to the difficulty of obtaining anomalous samples for training and the need to obtain a sufficient amount of training data. In recent years, autoencoders have been shown to be effective anomaly detectors that train only on ‘normal’ data. Generative adversarial networks (GANs) have been used to generate additional training samples for classifiers, thus making them more accurate and robust. However, in anomaly detection GANs are only used to reconstruct existing samples rather than to generate additional ones. This stems both from the small amount and lack of diversity of anomalous data in most domains. In this study we propose MDGAN, a novel GAN architecture for improving anomaly detection through the generation of additional samples. Our approach uses two discriminators: a dense network for determining whether the generated samples are of sufficient quality (i.e., valid) and an autoencoder that serves as an anomaly detector. MDGAN enables us to reconcile two conflicting goals: 1) generate high-quality samples that can fool the first discriminator, and 2) generate samples that can eventually be effectively reconstructed by the second discriminator, thus improving its performance. Empirical evaluation on a diverse set of datasets demonstrates the merits of our approach.
Data augmentation is commonly used to encode invariances in learning methods. However, this process is often performed in an inefficient manner, as artificial examples are created by applying a number of transformations to all points in the training set. The resulting explosion of the dataset size can be an issue in terms of storage and training costs, as well as in selecting and tuning the optimal set of transformations to apply. In this work, we demonstrate that it is possible to significantly reduce the number of data points included in data augmentation while realizing the same accuracy and invariance benefits of augmenting the entire dataset. We propose a novel set of subsampling policies, based on model influence and loss, that can achieve a 90% reduction in augmentation set size while maintaining the accuracy gains of standard data augmentation.
Multi-objective optimization is a crucial matter in computer systems design space exploration because real-world applications often rely on a trade-off between several objectives. Derivatives are usually not available or impractical to compute and the feasibility of an experiment can not always be determined in advance. These problems are particularly difficult when the feasible region is relatively small, and it may be prohibitive to even find a feasible experiment, let alone an optimal one. We introduce a new methodology and corresponding software framework, HyperMapper 2.0, which handles multi-objective optimization, unknown feasibility constraints, and categorical/ordinal variables. This new methodology also supports injection of user prior knowledge in the search when available. All of these features are common requirements in computer systems but rarely exposed in existing design space exploration systems. The proposed methodology follows a white-box model which is simple to understand and interpret (unlike, for example, neural networks) and can be used by the user to better understand the results of the automatic search. We apply and evaluate the new methodology to automatic static tuning of hardware accelerators within the recently introduced Spatial programming language, with minimization of design runtime and compute logic under the constraint of the design fitting in a target field programmable gate array chip. Our results show that HyperMapper 2.0 provides better Pareto fronts compared to state-of-the-art baselines, with better or competitive hypervolume indicator and with 8x improvement in sampling budget for most of the benchmarks explored.
Most existing studies in text-to-SQL tasks do not require generating complex SQL queries with multiple clauses or sub-queries, and generalizing to new, unseen databases. In this paper we propose SyntaxSQLNet, a syntax tree network to address the complex and cross-domain text-to-SQL generation task. SyntaxSQLNet employs a SQL specific syntax tree-based decoder with SQL generation path history and table-aware column attention encoders. We evaluate SyntaxSQLNet on the Spider text-to-SQL task, which contains databases with multiple tables and complex SQL queries with multiple SQL clauses and nested queries. We use a database split setting where databases in the test set are unseen during training. Experimental results show that SyntaxSQLNet can handle a significantly greater number of complex SQL examples than prior work, outperforming the previous state-of-the-art model by 8.3% in exact matching accuracy. We also show that SyntaxSQLNet can further improve the performance by an additional 8.1% using a cross-domain augmentation method, resulting in a 16.4% improvement in total. To our knowledge, we are the first to study this complex and cross-domain text-to-SQL task.
This paper considers stochastic optimization problems whose objective functions involve powers of random variables. For example, consider the classic Stochastic $l_p$ Load Balancing Problem (SLBp): There are $m$ machines and $n$ jobs, and known independent random variables $Y_{ij}$ describe the load incurred on machine $i$ if we assign job $j$ to it. The goal is to assign each job to a machine in order to minimize the expected $l_p$-norm of the total load on the machines. While convex relaxations represent one of the most powerful algorithmic tools, in problems such as SLBp the main difficulty is to capture the objective function in a way that only depends on each random variable separately. We show how to capture $p$-power-type objectives in such a separable way by using the $L$-function method, introduced by Latała to relate the moment of sums of random variables to the individual marginals. We show how this quickly leads to a constant-factor approximation for a very general subset selection problem with $p$-moment objective. Moreover, we give a constant-factor approximation for SLBp, improving on the recent $O(p/\ln p)$-approximation of [Gupta et al., SODA 18]. Here the application of the method is much more involved. In particular, we need to sharply connect the expected $l_p$-norm of a random vector with the $p$-moments of its marginals (machine loads), taking into account simultaneously the different scales of the loads that are incurred by an unknown assignment.
We present a sample path dependent measure of causal influence between time series. The proposed causal measure is a random sequence, a realization of which enables identification of specific patterns that give rise to high levels of causal influence. We show that these patterns cannot be identified by existing measures such as directed information (DI). We demonstrate how sequential prediction theory may be leveraged to estimate the proposed causal measure and introduce a notion of regret for assessing the performance of such estimators. We prove a finite sample bound on this regret that is determined by the worst case regret of the sequential predictors used in the estimator. Justification for the proposed measure is provided through a series of examples, simulations, and application to stock market data. Within the context of estimating DI, we show that, because joint Markovicity of a pair of processes does not imply the marginal Markovicity of individual processes, commonly used plug-in estimators of DI will be biased for a large subset of jointly Markov processes. We introduce a notion of DI with ‘stale history’, which can be combined with a plug-in estimator to upper and lower bound the DI when marginal Markovicity does not hold.
Ranking functions return ranked lists of items, and users often interact with these items. How to evaluate ranking functions using historical interaction logs, also known as off-policy evaluation, is an important but challenging problem. The commonly used Inverse Propensity Scores (IPS) approaches work better for the single item case, but suffer from extremely low data efficiency for the ranked list case. In this paper, we study how to improve the data efficiency of IPS approaches in the offline comparison setting. We propose two approaches, Trunc-match and Rand-interleaving, for offline comparison using uniformly randomized data. We show that these methods can improve the data efficiency and also the comparison sensitivity based on one of the largest email search engines.
In this paper we propose Aleph, a leaderless, fully asynchronous, Byzantine fault tolerant consensus protocol for ordering messages exchanged among processes. It is based on a distributed construction of a partially ordered set and the algorithm for reaching a consensus on its extension to a total order. To achieve the consensus, the processes perform computations based only on a local copy of the data structure; however, they are bound to end with the same results. Our algorithm uses a dual-threshold coin-tossing scheme as a randomization strategy and establishes the agreement in an expected constant number of rounds. In addition, we introduce a fast way of validating messages that can occur prior to determining the total ordering.
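
For readers unfamiliar with IPS, the following Python sketch shows the standard single-item estimator that the abstract takes as its starting point. It does not implement Trunc-match or Rand-interleaving; the log format, the policy function, and the toy data are assumptions made for this example.

```python
def ips_estimate(logs, target_policy):
    """Estimate the average reward of `target_policy` from logged data.

    Each log entry is (context, action, reward, propensity), where
    `propensity` is the probability with which the logging policy
    chose `action` in `context`.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        total += reward * target_policy(context, action) / propensity
    return total / len(logs)


def always_a(context, action):
    """Hypothetical deterministic target policy: always show item 'a'."""
    return 1.0 if action == "a" else 0.0


# Toy log collected under uniform randomisation over two items,
# so every propensity is 0.5.
logs = [
    ("q1", "a", 1.0, 0.5),
    ("q1", "b", 0.0, 0.5),
    ("q2", "a", 0.0, 0.5),
    ("q2", "b", 1.0, 0.5),
]

estimate = ips_estimate(logs, always_a)
print(estimate)  # 0.5
```

Here item 'a' earns reward 1 for `q1` and 0 for `q2`, so the policy's true average reward on these two queries is 0.5, which the estimate recovers. With ranked lists the number of possible actions grows combinatorially and the propensities become tiny, which is the data-efficiency problem the abstract refers to.
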
Network pruning is widely used for reducing the heavy computational cost of deep models. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all the six state-of-the-art pruning algorithms we examined, fine-tuning a pruned model only gives comparable or even worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for a wide variety of pruning algorithms with multiple network architectures, datasets, and tasks. Our results have several implications: 1) training a large, over-parameterized model is not necessary to obtain an efficient final model, 2) learned ‘important’ weights of the large model are not necessarily useful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited ‘important’ weights, is what leads to the efficiency benefit in the final model, which suggests that some pruning algorithms could be seen as performing network architecture search.
IOHprofiler is a new tool for analyzing and comparing iterative optimization heuristics. Given as input algorithms and problems written in C or Python, it provides as output a statistical evaluation of the algorithms’ performance by means of the distribution on the fixed-target running time and the fixed-budget function values. In addition, IOHprofiler also allows to track the evolution of algorithm parameters, making our tool particularly useful for the analysis, comparison, and design of (self-)adaptive algorithms. IOHprofiler is a ready-to-use software. It consists of two parts: an experimental part, which generates the running time data, and a post-processing part, which produces the summarizing comparisons and statistical evaluations. The experimental part is build on the COCO software, which has been adjusted to cope with optimization problems that are formulated as functions $f:\mathcal{S}^n \to \R$ with $\mathcal{S}$ being a discrete alphabet of integers. The post-processing part is our own work. It can be used as a stand-alone tool for the evaluation of running time data of arbitrary benchmark problems. It accepts as input files not only the output files of IOHprofiler, but also original COCO data files. The post-processing tool is designed for an interactive evaluation, allowing the user to chose the ranges and the precision of the displayed data according to his/her needs. IOHprofiler is available on GitHub at \url{https://…/IOHprofiler}.
We present online boosting algorithms for multiclass classification with bandit feedback, where the learner only receives feedback about the correctness of its prediction. We propose an unbiased estimate of the loss using a randomized prediction, allowing the model to update its weak learners with limited information. Using the unbiased estimate, we extend two full information boosting algorithms (Jung et al., 2017) to the bandit setting. We prove that the asymptotic error bounds of the bandit algorithms exactly match their full information counterparts. The cost of restricted feedback is reflected in the larger sample complexity. Experimental results also support our theoretical findings, and the performance of the proposed models is comparable to that of an existing bandit boosting algorithm, which is limited to using binary weak learners.
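The idea of an unbiased loss estimate under bandit feedback can be illustrated with a standard importance-weighted construction. The sketch below is an assumption made for illustration, not necessarily the estimator used in the paper: the learner samples a label from its own distribution `probs`, observes only whether that prediction was wrong, and scales the observed loss by the inverse sampling probability. The function names are hypothetical.

```python
import random


def estimate_loss(probs, true_label, rng):
    """Sample a prediction and build a 0-1 loss estimate from one bit of feedback.

    Only the sampled label can receive a non-zero entry, and that entry is
    divided by the probability of having sampled it.
    """
    k = len(probs)
    pred = rng.choices(range(k), weights=probs)[0]
    wrong = pred != true_label  # the only feedback the learner sees
    est = [0.0] * k
    if wrong:
        est[pred] = 1.0 / probs[pred]
    return pred, est


def expected_estimate(probs, true_label):
    """Exact expectation of the estimate, by enumerating every possible prediction."""
    k = len(probs)
    exp = [0.0] * k
    for pred in range(k):
        if pred != true_label:
            exp[pred] += probs[pred] * (1.0 / probs[pred])
    return exp


if __name__ == "__main__":
    rng = random.Random(0)
    print(estimate_loss([0.5, 0.3, 0.2], 1, rng))
    print(expected_estimate([0.5, 0.3, 0.2], 1))
```

Averaged over the learner's own randomness, the estimate equals the full 0-1 loss vector (1 for every wrong label, 0 for the true one), provided every label has non-zero sampling probability.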
The knowledge graph (KG), composed of entities with their descriptions and attributes, and relationships between entities, is finding more and more application scenarios in various natural language processing tasks. In a typical knowledge graph like Wikidata, entities usually have a large number of attributes, but it is difficult to know which ones are important. The importance of attributes can be a valuable piece of information in various applications spanning from information retrieval to natural language generation. In this paper, we propose a general method of using external user-generated text data to evaluate the relative importance of an entity’s attributes. To be more specific, we use word/sub-word embedding techniques to match the external textual data back to entities’ attribute names and values and rank the attributes by their matching cohesiveness. To the best of our knowledge, this is the first work applying vector-based semantic matching to important attribute identification, and our method outperforms the previous traditional methods. We also apply the outcome of the detected important attributes to a language generation task; compared with previously generated text, the new method generates much more customized and informative messages.
Making deep convolutional neural networks more accurate typically comes at the cost of increased computational and memory resources. In this paper, we exploit the fact that the importance of features computed by convolutional layers is highly input-dependent, and propose feature boosting and suppression (FBS), a new method to predictively amplify salient convolutional channels and skip unimportant ones at run-time. FBS introduces small auxiliary connections to existing convolutional layers. In contrast to channel pruning methods which permanently remove channels, it preserves the full network structures and accelerates convolution by dynamically skipping unimportant input and output channels. FBS-augmented networks are trained with conventional stochastic gradient descent, making it readily available for many state-of-the-art CNNs. We compare FBS to a range of existing channel pruning and dynamic execution schemes and demonstrate large improvements on ImageNet classification. Experiments show that FBS can accelerate VGG-16 by $5\times$ and improve the speed of ResNet-18 by $2\times$, both with less than $0.6\%$ top-5 accuracy loss.
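The run-time gating described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the FBS implementation: the per-channel `saliency` scores are taken as given (in the paper they are predicted by small auxiliary connections), a k-winners-take-all rule is assumed for choosing which channels to keep, and all function names are hypothetical.

```python
def top_k_gate(saliency, k):
    """Keep the k largest saliency scores and zero out the rest."""
    order = sorted(range(len(saliency)), key=lambda c: saliency[c], reverse=True)
    keep = set(order[:k])
    return [s if c in keep else 0.0 for c, s in enumerate(saliency)]


def gated_channels(features, saliency, k):
    """Amplify kept channels by their saliency; skip work on suppressed ones."""
    gate = top_k_gate(saliency, k)
    out = []
    for channel, g in zip(features, gate):
        if g == 0.0:
            # Suppressed channel: emit zeros without touching its values.
            out.append([0.0] * len(channel))
        else:
            out.append([g * v for v in channel])
    return out


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(gated_channels(feats, [0.2, 1.5, 0.0, 0.9], k=2))
```

Because the gate depends on the input, a different subset of channels can be kept for every sample, which is what distinguishes this kind of scheme from permanent channel pruning.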
Non-concave maximization has been the subject of much recent study in the optimization and machine learning communities, specifically in deep learning. Recent papers (Ge et al. 2015, Lee et al. 2017, and references therein) indicate that first order methods work well and avoid saddle points. Results as in (Lee et al. 2017), however, are limited to the *unconstrained* case or to cases where the critical points are in the interior of the feasibility set, which fail to capture some of the most interesting applications. In this paper we focus on *constrained* non-concave maximization. We analyze a variant of a well-established algorithm in machine learning called Multiplicative Weights Update (MWU) for the maximization problem $\max_{\mathbf{x} \in D} P(\mathbf{x})$, where $P$ is non-concave, twice continuously differentiable and $D$ is a product of simplices. We show that MWU converges almost always for small enough stepsizes to critical points that satisfy the second order KKT conditions. We combine techniques from dynamical systems with a recent connection between the Baum-Eagon inequality and MWU (Palaiopanos et al. 2017).
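To make the MWU iteration concrete, here is a minimal sketch on a single simplex. The update form is an assumption (the abstract does not spell it out): the commonly used linear variant $x_i \leftarrow x_i \, (1 + \epsilon\, \partial_i P(\mathbf{x})) / (1 + \epsilon \sum_j x_j\, \partial_j P(\mathbf{x}))$. The toy objective $P(\mathbf{x}) = \sum_i x_i^2$, which is non-concave with maxima at the vertices, and all function names are chosen purely for illustration.

```python
def mwu_step(x, grad, eps):
    """One linear MWU step; the normalizer keeps x on the simplex."""
    g = grad(x)
    avg = sum(xi * gi for xi, gi in zip(x, g))
    return [xi * (1.0 + eps * gi) / (1.0 + eps * avg) for xi, gi in zip(x, g)]


def P(x):
    """Toy non-concave objective: sum of squares."""
    return sum(xi * xi for xi in x)


def grad_P(x):
    """Gradient of the toy objective."""
    return [2.0 * xi for xi in x]


def run_mwu(x0, eps=0.1, steps=500):
    """Iterate MWU from x0 for a fixed number of steps."""
    x = list(x0)
    for _ in range(steps):
        x = mwu_step(x, grad_P, eps)
    return x


if __name__ == "__main__":
    x = run_mwu([0.5, 0.3, 0.2])
    print(x, P(x))
```

Starting from an interior point, the iterate stays on the simplex and drifts toward the vertex with the largest initial coordinate, a constrained critical point on the boundary that unconstrained analyses do not cover.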
Blockchain technology has attracted tremendous attention in both academia and the capital market. However, overwhelming speculation on thousands of available cryptocurrencies and numerous initial coin offering (ICO) scams have also sparked considerable controversy about this emerging technology. This paper traces the development of blockchain systems to reveal the importance of decentralized applications (dApps) and the future value of blockchain. We survey the state-of-the-art dApps and discuss the direction of blockchain development to fulfill the desirable characteristics of dApps. Readers will gain an overview of dApp research and become familiar with recent developments in blockchain.
Collaborative Filtering (CF) is one of the most widely used methods for recommender systems. Because of their Bayesian nature and non-linearity, deep generative models, e.g. the Variational Autoencoder (VAE), have been applied to the CF task and have achieved great performance. However, most VAE-based methods suffer from matrix sparsity and consider the prior of users’ latent factors to be the same, which leads to poor latent representations of users and items. Additionally, most existing methods model the latent factors of users only, not items, which makes them unable to recommend items to a new user. To tackle these problems, we propose Neural Variational Hybrid Collaborative Filtering. Specifically, we consider the generative processes of both users and items, and take the prior of the latent factors of users and items to be *side information-specific*, which enables our model to alleviate matrix sparsity and learn better latent representations of users and items. For inference, we derive a Stochastic Gradient Variational Bayes (SGVB) algorithm to analytically approximate the intractable distributions of the latent factors of users and items. Experiments conducted on two large datasets show that our methods significantly outperform state-of-the-art CF methods, including VAE-based methods.
The goal of a technology-assisted review is to achieve high recall with low human effort. Continuous active learning algorithms have demonstrated good performance in locating the majority of relevant documents in a collection; however, their performance reaches a plateau once 80%-90% of them have been found. Finding the last few relevant documents typically requires exhaustively reviewing the collection. In this paper, we propose a novel method to identify these last few, but significant, documents efficiently. Our method makes the hypothesis that entities carry vital information in documents, and that reviewers can answer questions about the presence or absence of an entity in the missing relevant documents. Based on this we devise a sequential Bayesian search method that selects the optimal sequence of questions to ask. The experimental results show that our proposed method can greatly improve performance while requiring less reviewing effort.
Transfer learning is a widely used strategy in medical image analysis. Instead of only training a network with a limited amount of data from the target task of interest, we can first train the network with other, potentially larger source datasets, creating a more robust model. The source datasets do not have to be related to the target task. For a classification task in lung CT images, we could use either head CT images or images of cats as the source. While head CT images appear more similar to lung CT images, the number and diversity of cat images might lead to a better model overall. In this survey we review a number of papers that have performed similar comparisons. Although the answer to which strategy is best seems to be ‘it depends’, we discuss a number of research directions we need to take as a community to gain more understanding of this topic.
Normalization methods are a central building block in the deep learning toolbox. They accelerate and stabilize training, while decreasing the dependence on manually tuned learning rate schedules. When learning from multi-modal distributions, the effectiveness of batch normalization (BN), arguably the most prominent normalization method, is reduced. As a remedy, we propose a more flexible approach: by extending the normalization to more than a single mean and variance, we detect modes of data on-the-fly, jointly normalizing samples that share common features. We demonstrate that our method outperforms BN and other widely used normalization techniques in several experiments, including single and multi-task datasets.
High-risk domains require reliable confidence estimates from predictive models. Deep latent variable models provide these, but suffer from the rigid variational distributions used for tractable inference, which err on the side of overconfidence. We propose Stochastic Quantized Activation Distributions (SQUAD), which imposes a flexible yet tractable distribution over discretized latent variables. The proposed method is scalable, self-normalizing and sample efficient. We demonstrate that the model fully utilizes the flexible distribution, learns interesting non-linearities, and provides predictive uncertainty of competitive quality.
Understanding the uncertainty of a neural network’s (NN) predictions is essential for many applications. The Bayesian framework provides a principled approach to this; however, applying it to NNs is challenging due to the large number of parameters and data. Ensembling NNs provides a practical and scalable method for uncertainty quantification. Its drawback is that its justification is heuristic rather than Bayesian. In this work we propose one modification to the usual ensembling process that does result in Bayesian behaviour: regularising parameters about values drawn from a prior distribution. Hence, we present an easily implementable, scalable technique for performing approximate Bayesian inference in NNs.
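The modification described above, regularising each ensemble member about its own draw from the prior, can be shown on a toy problem. The sketch below replaces the neural network with a one-parameter linear model $y = w x$ so that each member has a closed-form solution; this simplification, the zero-mean Gaussian prior, the penalty weight `lam = (noise_std / prior_std) ** 2`, and the function names are all assumptions made for illustration.

```python
import random


def fit_anchored(xs, ys, anchor, lam):
    """Minimize sum((y - w*x)^2) + lam * (w - anchor)^2 in closed form."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (sxy + lam * anchor) / (sxx + lam)


def anchored_ensemble(xs, ys, prior_std, noise_std, members, seed=0):
    """Fit one model per anchor, each anchor drawn from the N(0, prior_std^2) prior."""
    rng = random.Random(seed)
    lam = (noise_std / prior_std) ** 2
    anchors = [rng.gauss(0.0, prior_std) for _ in range(members)]
    return [fit_anchored(xs, ys, a, lam) for a in anchors]


if __name__ == "__main__":
    ws = anchored_ensemble([1.0, 2.0], [2.0, 4.0], prior_std=1.0,
                           noise_std=1.0, members=5)
    print(ws)
```

With little data the members stay close to their different anchors and disagree, while with more data the squared-error term dominates and they converge; the spread across members is what serves as the uncertainty estimate.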
Complex industrial systems are continuously monitored by a large number of heterogeneous sensors. The diversity of their operating conditions and the possible fault types make it impossible to collect enough data for learning all the possible fault patterns. The paper proposes an integrated automatic unsupervised feature learning approach for fault detection that uses only healthy-condition data for its training. The approach is based on stacked Extreme Learning Machines (namely Hierarchical, or HELM) and comprises stacked autoencoders performing unsupervised feature learning, and a one-class classifier monitoring the variations in the features to assess the health of the system. This study provides a comprehensive evaluation of HELM fault detection capability compared to other machine learning approaches, including Deep Belief Networks. The performance is first evaluated on a synthetic dataset with typical characteristics of condition monitoring data. Subsequently, the approach is evaluated on a real case study of a power plant fault. HELM demonstrates better performance, specifically in cases where several non-informative signals are included.
Many probabilistic models of interest in scientific computing and machine learning have expensive, black-box likelihoods that prevent the application of standard techniques for Bayesian inference, such as MCMC, which would require access to the gradient or a large number of likelihood evaluations. We introduce here a novel sample-efficient inference framework, Variational Bayesian Monte Carlo (VBMC). VBMC combines variational inference with Gaussian-process based, active-sampling Bayesian quadrature, using the latter to efficiently approximate the intractable integral in the variational objective. Our method produces both a nonparametric approximation of the posterior distribution and an approximate lower bound of the model evidence, useful for model selection. We demonstrate VBMC both on several synthetic likelihoods and on a neuronal model with data from real neurons. Across all tested problems and dimensions (up to $D = 10$), VBMC performs consistently well in reconstructing the posterior and the model evidence with a limited budget of likelihood evaluations, unlike other methods that work only in very low dimensions. Our framework shows great promise as a novel tool for posterior and model inference with expensive, black-box likelihoods.
Most agents that learn policies for tasks with reinforcement learning (RL) lack the ability to communicate with people, which makes human-agent collaboration challenging. We believe that, in order for RL agents to comprehend utterances from human colleagues, RL agents must infer the mental states that people attribute to them, because people sometimes infer an interlocutor’s mental states and communicate on the basis of this mental inference. This paper proposes the PublicSelf model, which is a model of a person who infers how the person’s own behavior appears to their colleagues. We implemented the PublicSelf model for an RL agent in a simulated environment and examined the inference of the model by comparing it with people’s judgment. The results showed that the intention that people attributed to the agent’s movement was correctly inferred by the model in scenes where people could find certain intentionality in the agent’s behavior.
The recent turn towards quantitative text-as-data approaches in IR brought new ways to study the discursive landscape of world politics. Here seen as complementary to qualitative approaches, quantitative assessments have the advantage of being able to order and make comprehensible vast amounts of text. However, the validity of unsupervised methods applied to the types of text available in large quantities needs to be established before they can speak to other studies relying on text and discourse as data. In this paper, we introduce a new text corpus of United Nations Security Council (UNSC) speeches on Afghanistan between 2001 and 2017; we study this corpus through unsupervised topic modeling (LDA) with the central aim of validating the topic categories that the LDA identifies; and we discuss the added value, and complementarity, of quantitative text-as-data approaches. We set up two tests using mixed-method approaches. Firstly, we evaluate the identified topics by assessing whether they conform with previous qualitative work on the development of the situation in Afghanistan. Secondly, we use network analysis to study the underlying social structures of what we will call ‘speaker-topic relations’ to see whether they correspond to known divisions and coalitions in the UNSC. In both cases we find that the unsupervised LDA indeed provides valid and valuable outputs. In addition, the mixed-method approaches themselves reveal interesting patterns deserving future qualitative research. Amongst these are the coalitions and dynamics around the ‘women and human rights’ topic as part of the UNSC debates on Afghanistan.
Deep reinforcement learning (DRL) has achieved outstanding results in recent years. This has led to a dramatic increase in the number of applications and methods. Recent works have explored learning beyond single-agent scenarios and have considered multiagent scenarios. Initial results report successes in complex multiagent domains, although there are several challenges to be addressed. In this context, first, this article provides a clear overview of current multiagent deep reinforcement learning (MDRL) literature. Second, it provides guidelines to complement this emerging area by (i) showcasing examples on how methods and algorithms from DRL and multiagent learning (MAL) have helped solve problems in MDRL and (ii) providing general lessons learned from these works. We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists in both areas (DRL and MAL) in a joint effort to promote fruitful research in the multiagent community.
This paper presents a technology for simple and computationally efficient improvements of a generic Artificial Intelligence (AI) system, including Multilayer and Deep Learning neural networks. The improvements are, in essence, small network ensembles constructed on top of the existing AI architectures. Theoretical foundations of the technology are based on Stochastic Separation Theorems and the ideas of the concentration of measure. We show that, subject to mild technical assumptions on statistical properties of internal signals in the original AI system, the technology enables instantaneous and computationally efficient removal of spurious and systematic errors with probability close to one on datasets which are exponentially large in dimension. The method is illustrated with numerical examples and a case study on recognizing ten digits in American Sign Language.
Addressing fairness in machine learning models has recently attracted a lot of attention, as it will ensure continued confidence of the general public in the deployment of machine learning systems. Here, we focus on mitigating harm of a biased system that offers much better quality outputs for certain groups than for others. We show that bias in the output can naturally be handled in Gaussian process classification (GPC) models by introducing a latent target output that will modulate the likelihood function. This simple formulation has several advantages: first, it is a unified framework for several notions of fairness (demographic parity, equalized odds, and equal opportunity); second, it allows encoding our knowledge of what the bias in outputs should be; and third, it can be solved by using off-the-shelf GPC packages.
We develop model-based methods for solving stochastic convex optimization problems, introducing the approximate-proximal point, or aProx, family, which includes stochastic subgradient, proximal point, and bundle methods. When the modeling approaches we propose are appropriately accurate, the methods enjoy stronger convergence and robustness guarantees than classical approaches, even though the model-based methods typically add little to no computational overhead over stochastic subgradient methods. For example, we show that improved models converge with probability 1 and enjoy optimal asymptotic normality results under weak assumptions; these methods are also adaptive to a natural class of what we term easy optimization problems, achieving linear convergence under appropriate strong growth conditions on the objective. Our substantial experimental investigation shows the advantages of more accurate modeling over standard subgradient methods across many smooth and non-smooth optimization problems.
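The robustness claim above can be illustrated by comparing a plain stochastic subgradient step with a step built on a more accurate model. The sketch below assumes a truncated linear model for a non-negative loss, which gives the update $x \leftarrow x - \min(\alpha, f(x)/\|g\|^2)\, g$; the abstract does not name this model, so treat it as an illustrative assumption. The single-sample least-squares loss $f(x) = \tfrac{1}{2}(a x - b)^2$ and the function names are likewise hypothetical.

```python
def sgd_step(x, a, b, alpha):
    """Plain (sub)gradient step on f(x) = 0.5 * (a*x - b)^2."""
    r = a * x - b
    return x - alpha * r * a


def truncated_step(x, a, b, alpha):
    """Step from a linear model truncated at the known lower bound f >= 0."""
    r = a * x - b
    loss = 0.5 * r * r
    g = r * a
    if g == 0.0:
        return x  # already at a minimizer of this sample's loss
    step = min(alpha, loss / (g * g))
    return x - step * g


def run(step_fn, x0, a, b, alpha, steps):
    """Apply the same step function repeatedly on one fixed sample."""
    x = x0
    for _ in range(steps):
        x = step_fn(x, a, b, alpha)
    return x


if __name__ == "__main__":
    # A deliberately oversized step size: alpha = 10.
    print(run(sgd_step, 0.0, 1.0, 3.0, 10.0, 5))
    print(run(truncated_step, 0.0, 1.0, 3.0, 10.0, 30))
```

With the oversized step size the plain update overshoots further on every iteration, while the truncated update caps the step so the model never predicts a negative loss and the iterate moves steadily toward the solution $x = 3$.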

# Book Memo: “Market Segmentation Analysis”

Understanding It, Doing It, and Making It Useful
This open access book offers something for everyone working with market segmentation: practical guidance for users of market segmentation solutions; organisational guidance on implementation issues; guidance for market researchers in charge of collecting suitable data; and guidance for data analysts with respect to the technical and statistical aspects of market segmentation analysis. Even market segmentation experts will find something new, including an approach to exploring data structure and choosing a suitable number of market segments, and a vast array of useful visualisation techniques that make interpretation of market segments and selection of target segments easier. The book talks the reader through every single step, every single potential pitfall, and every single decision that needs to be made to ensure market segmentation analysis is conducted as well as possible. All calculations are accompanied not only with a detailed explanation, but also with R code that allows readers to replicate any aspect of what is being covered in the book using R, the open-source environment for statistical computing and graphics.