Distilled News

JustML provides automatic machine learning model selection, training and deployment in the cloud.
The Internet of Things (IoT) refers to the network of numerous physical devices, also known as the Internet of objects, refers to the networked interconnection of everyday objects (20 billion by 2020, according to Gartner). Such devices will be an integral part of next-generation computing, additionally, these devices will produce astronomical data volume, catapulting us into the world of zettabytes and yottabytes. Data is a new Oil, which is a byproduct of doing operations and for others, same data can be a catalyst to capture newer insights, build AI models and drive innovation.
The build system rolled up R-3.5.0.tar.gz (codename ‘Joy in Playing’) this morning. The list below details the changes in this release. You can get the source code from http://…/R-3.5.0.tar.gz or wait for it to be mirrored at a CRAN site nearer to you. Binaries for various platforms will appear in due course.
I was surprised by greta. I had assumed that the tensorflow and reticulate packages would eventually enable R developers to look beyond deep learning applications and exploit the TensorFlow platform to create all manner of production-grade statistical applications. But I wasn’t thinking Bayesian. After all, Stan is probably everything a Bayesian modeler could want. Stan is a powerful, production-level probability distribution modeling engine with a slick R interface, deep documentation, and a dedicated development team. But greta lets users write TensorFlow-based Bayesian models directly in R! What could be more charming? greta removes the barrier of learning an intermediate modeling language while still promising to deliver high-performance MCMC models that run anywhere TensorFlow can go.
In this tutorial, you’ll learn about absolute and weighted word frequency in text mining and how to calculate it with defaultdict and pandas DataFrames.
Have you ever been embarrassed by the first iteration of one of your machine learning projects, where you didn’t include obvious and important features? In the practical hustle and bustle of trying to build models, we can often forget about the observation step in the scientific method and jump straight to hypothesis testing.
Easily apply any function to a pandas dataframe in the fastest available manner. Time is precious. There is absolutely no reason to be wasting it waiting for your function to be applied to your pandas series (1 column) or dataframe (>1 columns). Don’t get me wrong, pandas is an amazing tool for python users, and a majority of the time pandas operations are very quick. Here, I wish to take the pandas apply function under close inspection. This function is incredibly useful, because it lets you easily apply any function that you’ve specified to your pandas series or dataframe. But there is a cost — the apply function essentially acts as a for loop, and a slow one at that. This means that the apply function is a linear operation, processing your function at O(n) complexity.

R Packages worth a look

Graphical Tools of Histogram PCA (GraphPCA)
Histogram principal components analysis is the generalization of the PCA. Histogram data are adapted to design complex and big data which histograms used as variables (big data adapter). Functions implemented provides numerical and graphical tools of an extension of PCA. Sun Makosso Kallyth (2016) <doi:10.1002/sam.11270>. Sun Makosso Kallyth and Edwin Diday (2012) <doi:10.1007/s11634-012-0108-0>.

Convert Plot to ‘grob’ or ‘ggplot’ Object (ggplotify)
Convert plot function call (using expression or formula) to ‘grob’ or ‘ggplot’ object that compatible to the ‘grid’ and ‘ggplot2’ ecosystem. With this package, we are able to e.g. using ‘cowplot’ to align plots produced by ‘base’ graphics, ‘grid’, ‘lattice’, ‘vcd’ etc. by converting them to ‘ggplot’ objects.

Linear Model with Tree-Based Lasso Regularization for Rare Features (rare)
Implementation of an alternating direction method of multipliers algorithm for fitting a linear model with tree-based lasso regularization, which is proposed in Algorithm 1 of Yan and Bien (2018) <arXiv:1803.06675>. The package allows efficient model fitting on the entire 2-dimensional regularization path for large datasets. The complete set of functions also makes the entire process of tuning regularization parameters and visualizing results hassle-free.

Tools, Measures and Statistical Tests for Cultural Evolution (cultevo)
Provides tools for measuring the compositionality of signalling systems (in particular the information-theoretic measure due to Spike (2016) <http://…/25930> and the Mantel test for distance matrix correlation (after Dietz 1983) <doi:10.1093/sysbio/32.1.21>), functions for computing string and meaning distance matrices as well as an implementation of the Page test for monotonicity of ranks (Page 1963) <doi:10.1080/01621459.1963.10500843> with exact p-values up to k = 22.

Power Analysis for a SMART Design (smartsizer)
A set of tools for determining the necessary sample size in order to identify the optimal dynamic treatment regime in a sequential, multiple assignment, randomized trial (SMART). Utilizes multiple comparisons with the best methodology to adjust for multiple comparisons. Designed for an arbitrary SMART design. Please see Artman (2018) <arXiv:1804.04587> for more details.

If you did not already know

Takeuchi’s Information Criteria (TIC)
Takeuchi’s Information Criteria (TIC) is a linearization of maximum likelihood estimator bias which shrinks the model parameters towards the maximum entropy distribution, even when the model is mis-specified. In statistical machine learning, $L_2$ regularization (a.k.a. ridge regression) also introduces a parameterized bias term with the goal of minimizing out-of-sample entropy, but generally requires a numerical solver to find the regularization parameter. …

Latent Sequence Decompositions (LSD)
We present the Latent Sequence Decompositions (LSD) framework. LSD decomposes sequences with variable lengthed output units as a function of both the input sequence and the output sequence. We present a training algorithm which samples valid extensions and an approximate decoding algorithm. We experiment with the Wall Street Journal speech recognition task. Our LSD model achieves 12.9% WER compared to a character baseline of 14.8% WER. When combined with a convolutional network on the encoder, we achieve 9.2% WER. …

Domain Adaptation is a field associated with machine learning and transfer learning. This scenario arises when we aim at learning from a source data distribution a well performing model on a different (but related) target data distribution. For instance, one of the tasks of the common spam filtering problem consists in adapting a model from one user (the source distribution) to a new one who receives significantly different emails (the target distribution). Note that, when more than one source distribution is available the problem is referred to as multi-source domain adaptation.
Domain Adaptation with Randomized Expectation Maximization

Whats new on arXiv

Learning the structure of Bayesian networks from data is known to be a computationally challenging, NP-hard problem. The literature has long investigated how to perform structure learning from data containing large numbers of variables, following a general interest in high-dimensional applications (‘small n, large p’) in systems biology and genetics. More recently, data sets with large numbers of observations (the so-called ‘big data’) have become increasingly common; and many of these data sets are not high-dimensional, having only a few tens of variables. We revisit the computational complexity of Bayesian network structure learning in this setting, showing that the common choice of measuring it with the number of estimated local distributions leads to unrealistic time complexity estimates for the most common class of score-based algorithms, greedy search. We then derive more accurate expressions under common distributional assumptions. These expressions suggest that the speed of Bayesian network learning can be improved by taking advantage of the availability of closed form estimators for local distributions with few parents. Furthermore, we find that using predictive instead of in-sample goodness-of-fit scores improves both speed and accuracy at the same time. We demonstrate these results on large real-world environmental data and on reference data sets available from public repositories.
Estimating causal interactions in complex networks is an important problem encountered in many fields of current science. While a theoretical solution for detecting the graph of causal interactions has been previously formulated in the framework of prediction improvement, it generally requires the computation of high-dimensional information functionals — a situation invoking the curse of dimensionality with increasing network size. Recently, several methods have been proposed to alleviate this problem, based on iterative procedures for assessment of conditional (in)dependences. In the current work, we bring a comparison of several such prominent approaches. This is done both by theoretical comparison of the algorithms using a formulation in a common framework, and by numerical simulations including realistic complex coupling patterns. The theoretical analysis shows that the algorithms are strongly related; including one algorithm being in particular situations equivalent to the first phase of another. Moreover, numerical simulations suggest that the accuracy of most of the algorithms is under suitable parameter choice almost indistinguishable. However, particularly for large networks there are substantial differences in their computational demands, suggesting some of the algorithms are relatively more efficient in the case of sparse networks, while other perform better in the case of dense networks. The most recent variant of the algorithm by Runge et al. then provides a promising speedup particularly for large sparse networks, albeit appears to lead to a substantial decrease in accuracy in some scenarios. Based on the analysis of the reviewed algorithms, we propose a hybrid approach that provides competitive results both concerning computational efficiency and accuracy.
Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used so far in the LI literature. For describing the features and methods we introduce a unified notation. We discuss evaluation methods, applications of LI, as well as off-the-shelf LI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.
End-to-end dialog systems have become very popular because they hold the promise of learning directly from human to human dialog interaction. Retrieval and Generative methods have been explored in this area with mixed results. A key element that is missing so far, is the incorporation of a-priori knowledge about the task at hand. This knowledge may exist in the form of structured or unstructured information. As a first step towards this direction, we present a novel approach, Knowledge based end-to-end memory networks (KB-memN2N), which allows special handling of named entities for goal-oriented dialog tasks. We present results on two datasets, DSTC6 challenge dataset and dialog bAbI tasks.
Call centers’ managers are interested in obtaining accurate point and distributional forecasts of call arrivals in order to achieve an optimal balance between service quality and operating costs. We present a strategy for selecting forecast models of call arrivals which is based on three pillars: (i) flexibility of the loss function; (ii) statistical evaluation of forecast accuracy; (iii) economic evaluation of forecast performance using money metrics. We implement fourteen time series models and seven forecast combination schemes on three series of daily call arrivals. Although we focus mainly on point forecasts, we also analyze density forecast evaluation. We show that second moments modeling is important both for point and density forecasting and that the simple Seasonal Random Walk model is always outperformed by more general specifications. Our results suggest that call center managers should invest in the use of forecast models which describe both first and second moments of call arrivals.
In natural language understanding, many challenges require learning relationships between two sequences for various tasks such as similarity, relatedness, paraphrasing and question matching. Some of these challenges are inherently closer in nature, hence the knowledge acquired from one task to another is easier acquired and adapted. However, transferring all knowledge might be undesired and can lead to sub-optimal results due to \textit{negative} transfer. Hence, this paper focuses on the transferability of both instances and parameters across natural language understanding tasks using an ensemble-based transfer learning method to circumvent such issues. The primary contribution of this paper is the combination of both \textit{Dropout} and \textit{Bagging} for improved transferability in neural networks, referred to as \textit{Dropping} herein. Secondly, we present a straightforward yet novel approach to incorporating source \textit{Dropping} Networks to a target task for few-shot learning that mitigates \textit{negative} transfer. This is achieved by using a decaying parameter chosen according to the slope changes of a smoothed spline error curve at sub-intervals during training. We compare the approach over the hard parameter sharing, soft parameter sharing and single-task learning to compare its effectiveness. The aforementioned adjustment leads to improved transfer learning performance and comparable results to the current state of the art only using few instances from the target task.
This paper presents an R package to handle and represent measurements with errors in a very simple way. We briefly introduce the main concepts of metrology and propagation of uncertainty, and discuss related R packages. Building upon this, we introduce the ‘errors’ package, which provides a class for associating uncertainty metadata, automated propagation and reporting. Working with ‘errors’ enables transparent, lightweight, less error-prone handling and convenient representation of measurements with errors. Finally, we discuss the advantages, limitations and future work of computing with errors.
This work introduces a new abstraction technique for reducing the state space of large, discrete-time labelled Markov chains. The abstraction leverages the semantics of interval Markov decision processes and the existing notion of approximate probabilistic bisimulation. Whilst standard abstractions make use of abstract points that are taken from the state space of the concrete model and which serve as representatives for sets of concrete states, in this work the abstract structure is constructed considering abstract points that are not necessarily selected from the states of the concrete model, rather they are a function of these states. The resulting model presents a smaller one-step bisimulation error, when compared to a like-sized, standard Markov chain abstraction. We outline a method to perform probabilistic model checking, and show that the computational complexity of the new method is comparable to that of standard abstractions based on approximate probabilistic bisimulations.
Deep Reinforcement Learning (deep RL) has made several breakthroughs in recent years in applications ranging from complex control tasks in unmanned vehicles to game playing. Despite their success, deep RL still lacks several important capacities of human intelligence, such as transfer learning, abstraction and interpretability. Deep Symbolic Reinforcement Learning (DSRL) seeks to incorporate such capacities to deep Q-networks (DQN) by learning a relevant symbolic representation prior to using Q-learning. In this paper, we propose a novel extension of DSRL, which we call Symbolic Reinforcement Learning with Common Sense (SRL+CS), offering a better balance between generalization and specialization, inspired by principles of common sense when assigning rewards and aggregating Q-values. Experiments reported in this paper show that SRL+CS learns consistently faster than Q-learning and DSRL, achieving also a higher accuracy. In the hardest case, where agents were trained in a deterministic environment and tested in a random environment, SRL+CS achieves nearly 100% average accuracy compared to DSRL’s 70% and DQN’s 50% accuracy. To the best of our knowledge, this is the first case of near perfect zero-shot transfer learning using Reinforcement Learning.
Current neural network-based classifiers are susceptible to adversarial examples even in the black-box setting, where the attacker only has query access to the model. In practice, the threat model for real-world systems is often more restrictive than the typical black-box model of full query access. We define three realistic threat models that more accurately characterize many real-world classifiers: the query-limited setting, the partial-information setting, and the label-only setting. We develop new attacks that fool classifiers under these more restrictive threat models, where previous methods would be impractical or ineffective. We demonstrate that our methods are effective against an ImageNet classifier under our proposed threat models. We also demonstrate a targeted black-box attack against a commercial classifier, overcoming the challenges of limited query access, partial information, and other practical issues to attack the Google Cloud Vision API.

Book Memo: “Introduction to Nature-Inspired Optimization”

 Introduction to Nature-Inspired Optimization brings together many of the innovative mathematical methods for non-linear optimization that have their origins in the way various species behave in order to optimize their chances of survival. The book describes each method, examines their strengths and weaknesses, and where appropriate, provides the MATLAB code to give practical insight into the detailed structure of these methods and how they work. Nature-inspired algorithms emulate processes that are found in the natural world, spurring interest for optimization. Lindfield/Penny provide concise coverage to all the major algorithms, including genetic algorithms, artificial bee colony algorithms, ant colony optimization and the cuckoo search algorithm, among others. This book provides a quick reference to practicing engineers, researchers and graduate students who work in the field of optimization. • Applies concepts in nature and biology to develop new algorithms for nonlinear optimization • Offers working MATLAB® programs for the major algorithms described, applying them to a range of problems • Provides useful comparative studies of the algorithms, highlighting their strengths and weaknesses • Discusses the current state-of-the-field and indicates possible areas of future development

Book Memo: “Natural Language Processing with PyTorch”

 Build Intelligent Language Applications Using Deep Learning Natural Language Processing (NLP) offers unbounded opportunities for solving interesting problems in artificial intelligence, making it the latest frontier for developing intelligent, deep learning-based applications. If you’re a developer or researcher ready to dive deeper into this rapidly growing area of artificial intelligence, this practical book shows you how to use the PyTorch deep learning framework to implement recently discovered NLP techniques. To get started, all you need is a machine learning background and experience programming with Python. Authors Delip Rao and Goku Mohandas provide you with a solid grounding in PyTorch, and deep learning algorithms, for building applications involving semantic representation of text. Each chapter includes several code examples and illustrations.

Document worth reading: “One Big Net For Everything”

I apply recent work on ‘learning to think’ (2015) and on PowerPlay (2011) to the incremental training of an increasingly general problem solver, continually learning to solve new tasks without forgetting previous skills. The problem solver is a single recurrent neural network (or similar general purpose computer) called ONE. ONE is unusual in the sense that it is trained in various ways, e.g., by black box optimization / reinforcement learning / artificial evolution as well as supervised / unsupervised learning. For example, ONE may learn through neuroevolution to control a robot through environment-changing actions, and learn through unsupervised gradient descent to predict future inputs and vector-valued reward signals as suggested in 1990. User-given tasks can be defined through extra goal-defining input patterns, also proposed in 1990. Suppose ONE has already learned many skills. Now a copy of ONE can be re-trained to learn a new skill, e.g., through neuroevolution without a teacher. Here it may profit from re-using previously learned subroutines, but it may also forget previous skills. Then ONE is retrained in PowerPlay style (2011) on stored input/output traces of (a) ONE’s copy executing the new skill and (b) previous instances of ONE whose skills are still considered worth memorizing. Simultaneously, ONE is retrained on old traces (even those of unsuccessful trials) to become a better predictor, without additional expensive interaction with the enviroment. More and more control and prediction skills are thus collapsed into ONE, like in the chunker-automatizer system of the neural history compressor (1991). This forces ONE to relate partially analogous skills (with shared algorithmic information) to each other, creating common subroutines in form of shared subnetworks of ONE, to greatly speed up subsequent learning of additional, novel but algorithmically related skills. One Big Net For Everything

Whats new on arXiv

In recent years, it has been found that neural networks can be easily fooled by adversarial examples, which is a potential safety hazard in some safety-critical applications. Many researchers have proposed various method to make neural networks more robust to white-box adversarial attacks, but an effective method have not been found so far. In this short paper, we focus on the robustness of the features learned by neural networks. We show that the features learned by neural networks are not robust, and find that the robustness of the learned features is closely related to the resistance against adversarial examples of neural networks. We also find that adversarial training against fast gradients sign method (FGSM) does not make the leaned features very robust, even if it can make the trained networks very resistant to FGSM attack. Then we propose a method, which can be seen as an extension of adversarial training, to train neural networks to learn more robust features. We perform experiments on MNIST and CIFAR-10 to evaluate our method, and the experiment results show that this method greatly improves the robustness of the learned features and the resistance to adversarial attacks.
Reinforcement learning and symbolic planning have both been used to build intelligent autonomous agents. Reinforcement learning relies on learning from interactions with real world, which often requires an unfeasibly large amount of experience. Symbolic planning relies on manually crafted symbolic knowledge, which may not be robust to domain uncertainties and changes. In this paper we present a unified framework {\em PEORL} that integrates symbolic planning with hierarchical reinforcement learning (HRL) to cope with decision-making in a dynamic environment with uncertainties. Symbolic plans are used to guide the agent’s task execution and learning, and the learned experience is fed back to symbolic knowledge to improve planning. This method leads to rapid policy search and robust symbolic plans in complex domains. The framework is tested on benchmark domains of HRL.
Exposing the weaknesses of neural models is crucial for improving their performance and robustness in real-world applications. One common approach is to examine how input perturbations affect the output. Our analysis takes this to an extreme on natural language processing tasks by removing as many words as possible from the input without changing the model prediction. For question answering and natural language inference, this of- ten reduces the inputs to just one or two words, while model confidence remains largely unchanged. This is an undesireable behavior: the model gets the Right Answer for the Wrong Reason (RAWR). We introduce a simple training technique that mitigates this problem while maintaining performance on regular examples.
We propose a novel value-aware quantization which applies aggressively reduced precision to the majority of data while separately handling a small amount of large data in high precision, which reduces total quantization errors under very low precision. We present new techniques to apply the proposed quantization to training and inference. The experiments show that our method with 3-bit activations (with 2% of large ones) can give the same training accuracy as full-precision one while offering significant (41.6% and 53.7%) reductions in the memory cost of activations in ResNet-152 and Inception-v3 compared with the state-of-the-art method. Our experiments also show that deep networks such as Inception-v3, ResNet-101 and DenseNet-121 can be quantized for inference with 4-bit weights and activations (with 1% 16-bit data) within 1% top-1 accuracy drop.
We describe a set of techniques to generate queries automatically based on one or more ingested, input corpuses. These queries require no a priori domain knowledge, and hence no human domain experts. Thus, these auto-generated queries help address the epistemological question of how we know what we know, or more precisely in this case, how an AI system with ingested data knows what it knows. These auto-generated queries can also be used to identify and remedy problem areas in ingested material — areas for which the knowledge of the AI system is incomplete or even erroneous. Similarly, the proposed techniques facilitate tests of AI capability — both in terms of coverage and accuracy. By removing humans from the main learning loop, our approach also allows more effective scaling of AI and cognitive capabilities to provide (1) broader coverage in a single domain such as health or geology; and (2) more rapid deployment to new domains. The proposed techniques also allow ingested knowledge to be extended naturally. Our investigations are early, and this paper provides a description of the techniques. Assessment of their efficacy is our next step for future work.
We study the problem of adapting neural sentence embedding models to the domain of human activities to capture their relations in different dimensions. We introduce a novel approach, Sequential Network Transfer, and show that it largely improves the performance on all dimensions. We also extend this approach to other semantic similarity datasets, and show that the resulting embeddings outperform traditional transfer learning approaches in many cases, achieving state-of-the-art results on the Semantic Textual Similarity (STS) Benchmark. To account for the improvements, we provide some interpretation of what the networks have learned. Our results suggest that Sequential Network Transfer is highly effective for various sentence embedding models and tasks.
Deep neural networks trained over large datasets learn features that are both generic to the whole dataset, and specific to individual classes in the dataset. Learned features tend towards generic in the lower layers and specific in the higher layers of a network. Methods like fine-tuning are made possible because of the ability for one filter to apply to multiple target classes. Much like the human brain this behavior, can also be used to cluster and separate classes. However, to the best of our knowledge there is no metric for how applicable learned features are to specific classes. In this paper we propose a definition and metric for measuring the applicability of learned features to individual classes, and use this applicability metric to estimate input applicability and produce a new method of unsupervised learning we call the CactusNet.
A number of differences have emerged between modern and classic approaches to constituency parsing in recent years, with structural components like grammars and feature-rich lexicons becoming less central while recurrent neural network representations rise in popularity. The goal of this work is to analyze the extent to which information provided directly by the model structure in classical systems is still being captured by neural methods. To this end, we propose a high-performance neural model (92.08 F1 on PTB) that is representative of recent work and perform a series of investigative experiments. We find that our model implicitly learns to encode much of the same information that was explicitly provided by grammars and lexicons in the past, indicating that this scaffolding can largely be subsumed by powerful general-purpose neural machinery.
Propensity scores are often used for stratification of treatment and control groups of subjects in observational data to remove confounding bias when estimating of causal effect of the treatment on an outcome in so-called potential outcome causal modeling framework. In this article, we try to get some insights into basic behavior of the propensity scores in a probabilistic sense. We do a simple analysis of their usage confining to the case of discrete confounding covariates and outcomes. While making clear about behavior of the propensity score our analysis shows how the so-called prognostic score can be derived simultaneously. However the prognostic score is derived in a limited sense in the current literature whereas our derivation is more general and shows all possibilities of having the score. And we call it outcome score. We argue that application of both the propensity score and the outcome score is the most efficient way for reduction of dimension in the confounding covariates as opposed to current belief that the propensity score alone is the most efficient way.
Learning in adversarial settings is becoming an important task for application domains where attackers may inject malicious data into the training set to subvert normal operation of data-driven technologies. Feature selection has been widely used in machine learning for security applications to improve generalization and computational efficiency, although it is not clear whether its use may be beneficial or even counterproductive when training data are poisoned by intelligent attackers. In this work, we shed light on this issue by providing a framework to investigate the robustness of popular feature selection methods, including LASSO, ridge regression and the elastic net. Our results on malware detection show that feature selection methods can be significantly compromised under attack (we can reduce LASSO to almost random choices of feature sets by careful insertion of less than 5% poisoned training samples), highlighting the need for specific countermeasures.
Measuring strength or degree of statistical dependence between two random variables is a common problem in many domains. Pearson’s correlation coefficient $\rho$ is an accurate measure of linear dependence. We show that $\rho$ is a normalized, Euclidean type distance between joint probability distribution of the two random variables and that when their independence is assumed while keeping their marginal distributions. And the normalizing constant is the geometric mean of two maximal distances, each between the joint probability distribution when the full linear dependence is assumed while preserving respective marginal distribution and that when the independence is assumed. Usage of it is restricted to linear dependence because it is based on Euclidean type distances that are generally not metrics and considered full dependence is linear. Therefore, we argue that if a suitable distance metric is used while considering all possible maximal dependences then it can measure any non-linear dependence. But then, one must define all the full dependences. Hellinger distance that is a metric can be used as the distance measure between probability distributions and obtain a generalization of $\rho$ for the discrete case.
Well known Simpson’s paradox is puzzling and surprising for many, especially for the empirical researchers and users of statistics. However there is no surprise as far as mathematical details are concerned. A lot more is written about the paradox but most of them are beyond the grasp of such users. This short article is about explaining the phenomenon in an easy way to grasp using simple algebra and geometry. The mathematical conditions under which the paradox can occur are made explicit and a simple geometrical illustrations is used to describe it. We consider the reversal of the association between two binary variables, say, $X$ and $Y$ by a third binary variable, say, $Z$. We show that it is always possible to define $Z$ algebraically for non-extreme dependence between $X$ and $Y$, therefore occurrence of the paradox depends on identifying it with a practical meaning for it in a given context of interest, that is up to the subject domain expert. And finally we discuss the paradox in predictive contexts since in literature it is argued that the paradox is resolved using causal reasoning.
We study the problem of stock related question answering (StockQA): automatically generating answers to stock related questions, just like professional stock analysts providing action recommendations to stocks upon user’s requests. StockQA is quite different from previous QA tasks since (1) the answers in StockQA are natural language sentences (rather than entities or values) and due to the dynamic nature of StockQA, it is scarcely possible to get reasonable answers in an extractive way from the training data; and (2) StockQA requires properly analyzing the relationship between keywords in QA pair and the numerical features of a stock. We propose to address the problem with a memory-augmented encoder-decoder architecture, and integrate different mechanisms of number understanding and generation, which is a critical component of StockQA. We build a large-scale Chinese dataset containing over 180K StockQA instances, based on which various technique combinations are extensively studied and compared. Experimental results show that a hybrid word-character model with separate character components for number processing, achieves the best performance.\footnote{The data is publicly available at \url{http://…/}.}
The rapid development recently of Community Question Answering (CQA) satisfies users quest for professional and personal knowledge about anything. In CQA, one central issue is to find users with expertise and willingness to answer the given questions. Expert finding in CQA often exhibits very different challenges compared to traditional methods. Sparse data and new features violate fundamental assumptions of traditional recommendation systems. This paper focuses on reviewing and categorizing the current progress on expert finding in CQA. We classify all the existing solutions into four different categories: matrix factorization based models (MF-based models), gradient boosting tree based models (GBT-based models), deep learning based models (DL-based models) and ranking based models (R-based models). We find that MF-based models outperform other categories of models in the field of expert finding in CQA. Moreover, we use innovative diagrams to clarify several important concepts of ensemble learning, and find that ensemble models with several specific single models can further boosting the performance. Further, we compare the performance of different models on different types of matching tasks, including text vs. text, graph vs. text, audio vs. text and video vs. text. The results can help the model selection of expert finding in practice. Finally, we explore some potential future issues in expert finding research in CQA.
We introduce empirical equilibrium, the prediction in a game that selects the Nash equilibria that can be approximated by a sequence of payoff-monotone distributions, a well-documented proxy for empirically plausible behavior. Then, we reevaluate implementation theory based on this equilibrium concept. We show that in a partnership dissolution environment with complete information, two popular auctions that are essentially equivalent for the Nash equilibrium prediction, can be expected to differ in fundamental ways when they are operated. Besides the direct policy implications, two general consequences follow. First, a mechanism designer may not be constrained by typical invariance properties. Second, a mechanism designer who does not account for the empirical plausibility of equilibria may inadvertently design implicitly biased mechanisms.
Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations to correctly classified examples which can cause the network to misclassify. In the image domain, these perturbations can often be made virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. However, in the natural language domain, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we use a population-based optimization algorithm to generate semantically and syntactically similar adversarial examples. We demonstrate via a human study that 94.3% of the generated examples are classified to the original label by human evaluators, and that the examples are perceptibly quite similar. We hope our findings encourage researchers to pursue improving the robustness of DNNs in the natural language domain.
Many optimization problems in science and engineering are challenging to solve, and the current trend is to use swarm intelligence (SI) and SI-based algorithms to tackle such challenging problems. Some significant developments have been made in recent years, though there are still many open problems in this area. This paper provides a short but timely analysis about SI-based algorithms and their links with self-organization. Different characteristics and properties are analyzed here from both mathematical and qualitative perspectives. Future research directions are outlined and open questions are also highlighted.
Fine-grained entity typing is the task of assigning fine-grained semantic types to entity mentions. We propose a neural architecture which learns a distributional semantic representation that leverages a greater amount of semantic context — both document and sentence level information — than prior work. We find that additional context improves performance, with further improvements gained by utilizing adaptive classification thresholds. Experiments show that our approach without reliance on hand-crafted features achieves the state-of-the-art results on three benchmark datasets.
We design new differentially private algorithms for the Euclidean k-means problem, both in the centralized model and in the local model of differential privacy. In both models, our algorithms achieve significantly improved error rates over the previous state-of-the-art. In addition, in the local model, our algorithm significantly reduces the number of needed interactions. Although the problem has been widely studied in the context of differential privacy, all of the existing constructions achieve only super constant approximation factors. We present, for the first time, efficient private algorithms for the problem with constant multiplicative error.
Multi-modal data is becoming more common than before because of big data issues. Finding the semantically equal or similar objects from different data sources(called entity resolution) is one of the heart problem of multi-modal task. Current models for solving this problem usually needs much paired data to find the latent correlation between multi-modal data, which is of high cost. A new kind latent correlation is proposed in this article. With the correlation, multi-modal objects can be uniformly represented in a commonly shard space. A classifying based model is designed for multi-modal entity resolution task. With the proposed method, the demand of training data can be decreased much.
This paper describes a new algorithm for exact Bayesian inference that is based on a recently proposed compositional semantics of Bayesian networks in terms of channels. The paper concentrates on the ideas behind this algorithm, involving a linearisation (`stretching’) of the Bayesian network, followed by a combination of forward state transformation and backward predicate transformation, while evidence is accumulated along the way. The performance of a prototype implementation of the algorithm in Python is briefly compared to a standard implementation (pgmpy): first results show competitive performance.
Expert diagnostic support systems have been extensively studied. The practical application of these systems in real-world scenarios have been somewhat limited due to well-understood shortcomings such as extensibility. More recently, machine learned models for medical diagnosis have gained momentum since they can learn and generalize patterns found in very large datasets like electronic health records. These models also have shortcomings. In particular, there is no easy way to incorporate prior knowledge from existing literature or experts. In this paper, we present a method to merge both approaches by using expert systems as generative models that create simulated data on which models can be learned. We demonstrate that such a learned model not only preserve the original properties of the expert systems but also addresses some of their limitations. Furthermore, we show how this approach can also be used as the starting point to combine expert knowledge with knowledge extracted from other data sources such as electronic health records.
A major challenge in training deep neural networks is overfitting, i.e. inferior performance on unseen test examples compared to performance on training examples. To reduce overfitting, stochastic regularization methods have shown superior performance compared to deterministic weight penalties on a number of image recognition tasks. Stochastic methods such as Dropout and Shakeout, in expectation, are equivalent to imposing a ridge and elastic-net penalty on the model parameters, respectively. However, the choice of the norm of weight penalty is problem dependent and is not restricted to $\{L_1,L_2\}$. Therefore, in this paper we propose the Bridgeout stochastic regularization technique and prove that it is equivalent to an $L_q$ penalty on the weights, where the norm $q$ can be learned as a hyperparameter from data. Experimental results show that Bridgeout results in sparse model weights, improved gradients and superior classification performance compared to Dropout and Shakeout on synthetic and real datasets.
A competitive baseline in sentence-level extractive summarization of news articles is the Lead-3 heuristic, where only the first 3 sentences are extracted. The success of this method is due to the tendency for writers to implement progressive elaboration in their work by writing the most important content at the beginning. In this paper, we introduce the Lead-like Recognizer (LeadR) to show how the Lead heuristic can be extended to summarize multi-section documents where it would not usually work well. This is done by introducing a neural model which produces a probability distribution over positions for sentences, so that we can locate sentences with introduction-like qualities. To evaluate the performance of our model, we use the task of summarizing multi-section documents. LeadR outperforms several baselines on this task, including a simple extension of the Lead heuristic designed for the task. Our work suggests that predicted position is a strong feature to use when extracting summaries.
The encoder-decoder dialog model is one of the most prominent methods used to build dialog systems in complex domains. Yet it is limited because it cannot output interpretable actions as in traditional systems, which hinders humans from understanding its generation process. We present an unsupervised discrete sentence representation learning method that can integrate with any existing encoder-decoder dialog models for interpretable response generation. Building upon variational autoencoders (VAEs), we present two novel models, DI-VAE and DI-VST that improve VAEs and can discover interpretable semantics via either auto encoding or context predicting. Our methods have been validated on real-world dialog datasets to discover semantic representations and enhance encoder-decoder models with interpretable generation.
Inner product-based convolution has been a central component of convolutional neural networks (CNNs) and the key to learning visual representations. Inspired by the observation that CNN-learned features are naturally decoupled with the norm of features corresponding to the intra-class variation and the angle corresponding to the semantic difference, we propose a generic decoupled learning framework which models the intra-class variation and semantic difference independently. Specifically, we first reparametrize the inner product to a decoupled form and then generalize it to the decoupled convolution operator which serves as the building block of our decoupled networks. We present several effective instances of the decoupled convolution operator. Each decoupled operator is well motivated and has an intuitive geometric interpretation. Based on these decoupled operators, we further propose to directly learn the operator from data. Extensive experiments show that such decoupled reparameterization renders significant performance gain with easier convergence and stronger robustness.

Distilled News

One of the widely used natural language processing task in different business problems is “Text Classification”. The goal of text classification is to automatically classify the text documents into one or more defined categories. Some examples of text classification are:
• Understanding audience sentiment from social media,
• Detection of spam and non-spam emails,
• Auto tagging of customer queries, and
• Categorization of news articles into defined topics.
For people working in Artificial Intelligence, the term “Human-in-the-Loop” is familiar i.e. a human in the process to validate and improve the AI. There are many situations where it applies, as many as there are AI applications. However. there are still some distinct different ways it can be deployed even within the same application.
… But speech recognition has been around for decades, so why is it just now hitting the mainstream? The reason is that deep learning finally made speech recognition accurate enough to be useful outside of carefully controlled environments.
The goal is to improve the category classification performance for a set of text posts. The evaluation metric is the macro F1 score.
Deep learning brings multiple benefits in learning multiple levels of representation of natural language. Here we will cover the motivation of using deep learning and distributed representation for NLP, word embeddings and several methods to perform word embeddings, and applications.
I recently completed a course on NLP through Deep Learning (CS224N) at Stanford and loved the experience. Learnt a whole bunch of new things. For my final project I worked on a question answering model built on Stanford Question Answering Dataset (SQuAD). In this blog, I want to cover the main building blocks of a question answering model.
This post is long overdue. The information contained herein has been built up over years of deploying and hosting Shiny apps, particularly in production environments, and mainly where those Shiny apps are very large and contain a lot of code.
Weird but (sometimes) useful charts