In his seminal book ‘The Inmates are Running the Asylum: Why High-Tech Products Drive Us Crazy And How To Restore The Sanity’ [2004, Sams Indianapolis, IN, USA], Alan Cooper argues that a major reason why software is often poorly designed (from a user perspective) is that programmers are in charge of design decisions, rather than interaction designers. As a result, programmers design software for themselves, rather than for their target audience; a phenomenon he refers to as ‘the inmates running the asylum’. This paper argues that explainable AI risks a similar fate. While the re-emergence of explainable AI is positive, this paper argues that most of us as AI researchers are building explanatory agents for ourselves, rather than for the intended users. But explainable AI is more likely to succeed if researchers and practitioners understand, adopt, implement, and improve models from the vast and valuable bodies of research in philosophy, psychology, and cognitive science; and if evaluation of these models is focused more on people than on technology. From a light scan of literature, we demonstrate that there is considerable scope to infuse more results from the social and behavioural sciences into explainable AI, and present some key results from these fields that are relevant to explainable AI.
We propose a new approach for estimating causal effects when the exposure is measured with error and confounding adjustment is performed via a generalized propensity score (GPS). Using validation data, we propose a regression calibration (RC)-based adjustment for a continuous error-prone exposure combined with GPS to adjust for confounding (RC-GPS). The outcome analysis is conducted after transforming the corrected continuous exposure into a categorical exposure. We consider confounding adjustment in the context of GPS subclassification, inverse probability treatment weighting (IPTW) and matching. In simulations with varying degrees of exposure error and confounding bias, RC-GPS eliminates bias from exposure error and confounding compared to standard approaches that rely on the error-prone exposure. We applied RC-GPS to a rich data platform to estimate the causal effect of long-term exposure to fine particles ($PM_{2.5}$) on mortality in New England for the period from 2000 to 2012. The main study consists of $2,202$ zip codes covered by $217,660$ 1km $\times$ 1km grid cells with yearly mortality rates, yearly $PM_{2.5}$ averages estimated from a spatio-temporal model (error-prone exposure) and several potential confounders. The internal validation study includes a subset of 83 1km $\times$ 1km grid cells within 75 zip codes from the main study with error-free yearly $PM_{2.5}$ exposures obtained from monitor stations. Under assumptions of non-interference and weak unconfoundedness, using matching we found that exposure to moderate levels of $PM_{2.5}$ ($8 <$ $PM_{2.5}$ $\leq 10\ {\rm \mu g/m^3}$) causes a $2.8\%$ ($95\%$ CI: $0.6\%, 3.6\%$) increase in all-cause mortality compared to low exposure ($PM_{2.5}$ $\leq 8\ {\rm \mu g/m^3}$).
Independent causal mechanisms are a central concept in the study of causality with implications for machine learning tasks. In this work we develop an algorithm to recover a set of (inverse) independent mechanisms relating a distribution transformed by the mechanisms to a reference distribution. The approach is fully unsupervised and based on a set of experts that compete for data to specialize and extract the mechanisms. We test and analyze the proposed method on a series of experiments based on image transformations. Each expert successfully maps a subset of the transformed data to the original domain, and the learned mechanisms generalize to other domains. We discuss implications for domain transfer and links to recent trends in generative modeling.
A correspondence between database tuples as causes for query answers in databases and tuple-based repairs of inconsistent databases with respect to denial constraints has already been established. In this work, answer-set programs that specify repairs of databases are used as a basis for solving computational and reasoning problems about causes. Here, causes are also introduced at the attribute level by appealing to a repair semantics that is both null-based and attribute-based. The corresponding repair programs are presented, and they are used as a basis for computation and reasoning about attribute-level causes.
An intriguing property of deep neural networks is their inherent vulnerability to adversarial inputs, which significantly hinders their application in security-critical domains. Most existing detection methods attempt to use carefully engineered patterns to distinguish adversarial inputs from their genuine counterparts; these patterns, however, can often be circumvented by adaptive adversaries. In this work, we take a completely different route by leveraging the definition of adversarial inputs: while deceptive to deep neural networks, they are barely discernible to human vision. Building upon recent advances in interpretable models, we construct a new detection framework that contrasts an input’s interpretation against its classification. We validate the efficacy of this framework through extensive experiments using benchmark datasets and attacks. We believe that this work opens a new direction for designing adversarial input detection methods.
We propose a method for learning CNN structures that is more efficient than previous approaches: instead of using reinforcement learning (RL) or genetic algorithms (GA), we use a sequential model-based optimization (SMBO) strategy, in which we search for architectures in order of increasing complexity, while simultaneously learning a surrogate function to guide the search, similar to A* search. On the CIFAR-10 dataset, our method finds a CNN structure with the same classification accuracy (3.41% error rate) as the RL method of Zoph et al. (2017), but 2 times faster (in terms of number of models evaluated). It also outperforms the GA method of Liu et al. (2017), which finds a model with worse performance (3.63% error rate), and takes 5 times longer. Finally we show that the model we learned on CIFAR also works well at the task of ImageNet classification. In particular, we match the state-of-the-art performance of 82.9% top-1 and 96.1% top-5 accuracy.
We introduce MAgent, a platform to support research and development of many-agent reinforcement learning. Unlike previous research platforms on single or multi-agent reinforcement learning, MAgent focuses on supporting the tasks and the applications that require hundreds to millions of agents. Within the interactions among a population of agents, it enables not only the study of learning algorithms for agents’ optimal policies, but more importantly, the observation and understanding of individual agents’ behaviors and social phenomena emerging from the AI society, including communication languages, leadership, and altruism. MAgent is highly scalable and can host up to one million agents on a single GPU server. MAgent also provides flexible configurations for AI researchers to design their customized environments and agents. In this demo, we present three environments designed on MAgent and show collective intelligence that emerges by learning from scratch.
Recent studies have revealed the vulnerability of deep neural networks: a small adversarial perturbation that is imperceptible to humans can easily make a well-trained deep neural network misclassify. This makes it unsafe to apply neural networks in security-critical applications. In this paper, we propose a new defensive algorithm called Random Self-Ensemble (RSE) by combining two important concepts: ${\bf randomness}$ and ${\bf ensemble}$. To protect a targeted model, RSE adds random noise layers to the neural network to defend against state-of-the-art gradient-based attacks, and ensembles the prediction over random noises to stabilize the performance. We show that our algorithm is equivalent to ensembling an infinite number of noisy models $f_\epsilon$ without any additional memory overhead, and that the proposed training procedure based on noisy stochastic gradient descent can ensure the ensemble model has good predictive capability. Our algorithm significantly outperforms previous defense techniques on real datasets. For instance, on CIFAR-10 with a VGG network (which has $92\%$ accuracy without any attack), under the state-of-the-art C&W attack within a certain distortion tolerance, the accuracy of the unprotected model drops to less than $10\%$ and the best previous defense technique has $48\%$ accuracy, while our method still has $86\%$ prediction accuracy under the same level of attack. Finally, our method is simple and easy to integrate into any neural network.
Generative Adversarial Networks (GANs) have become one of the most successful frameworks for unsupervised generative modeling. As GANs are difficult to train, much research has focused on this. However, very little of this research has directly exploited game-theoretic techniques. We introduce Generative Adversarial Network Games (GANGs), which explicitly model a finite zero-sum game between a generator ($G$) and classifier ($C$) that use mixed strategies. The size of these games precludes exact solution methods; therefore, we define resource-bounded best responses (RBBRs), and a resource-bounded Nash Equilibrium (RB-NE) as a pair of mixed strategies such that neither $G$ nor $C$ can find a better RBBR. The RB-NE solution concept is richer than the notion of ‘local Nash equilibria’ in that it captures not only failures of escaping local optima of gradient descent, but applies to any approximate best response computations, including methods with random restarts. To validate our approach, we solve GANGs with the Parallel Nash Memory algorithm, which provably monotonically converges to an RB-NE. We compare our results to standard GAN setups, and demonstrate that our method deals well with typical GAN problems such as mode collapse, partial mode coverage and forgetting.
Deep generative models learned through adversarial training have become increasingly popular for their ability to generate naturalistic image textures. However, apart from the visual texture, the visual appearance of objects is significantly affected by their shape geometry, information which is not taken into account by existing generative models. This paper introduces the Geometry-Aware Generative Adversarial Network (GAGAN) for incorporating geometric information into the image generation process. Specifically, in GAGAN the generator samples latent variables from the probability space of a statistical shape model. By mapping the output of the generator to a canonical coordinate frame through a differentiable geometric transformation, we enforce the geometry of the objects and add an implicit connection from the prior to the generated object. Experimental results on face generation indicate that the GAGAN can generate realistic images of faces with arbitrary facial attributes such as facial expression, pose, and morphology, that are of better quality compared to current GAN-based methods. Finally, our method can be easily incorporated into and improve the quality of the images generated by any existing GAN architecture.
In object detection, an intersection over union (IoU) threshold is required to define positives and negatives. An object detector trained with a low IoU threshold, e.g. 0.5, usually produces noisy detections. However, detection performance tends to degrade as the IoU threshold increases. Two main factors are responsible for this: 1) overfitting during training, due to exponentially vanishing positive samples, and 2) inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, is proposed to address these problems. It consists of a sequence of detectors trained with increasing IoU thresholds, to be sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of a detector is a good distribution for training the next higher quality detector. The resampling of progressively improved hypotheses guarantees that all detectors have a positive set of examples of equivalent size, reducing the overfitting problem. The same cascade procedure is applied at inference, enabling a closer match between the hypotheses and the detector quality of each stage. A simple implementation of the Cascade R-CNN is shown to surpass all single-model object detectors on the challenging COCO dataset. Experiments also show that the Cascade R-CNN is widely applicable across detector architectures, achieving consistent gains independently of the baseline detector strength. The code will be made available at https://…/cascade-rcnn.
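As a rough illustration of the IoU thresholding described in the abstract above, the following minimal sketch computes the IoU of two axis-aligned boxes and labels proposals as positive or negative at a given threshold. The function names `iou` and `label_proposals` and the `(x1, y1, x2, y2)` box format are assumptions made for this example; they are not taken from the Cascade R-CNN code.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, gt_boxes, threshold):
    """Mark a proposal positive if it overlaps any ground-truth box with IoU >= threshold."""
    return [any(iou(p, g) >= threshold for g in gt_boxes) for p in proposals]


if __name__ == "__main__":
    gt = [(0, 0, 10, 10)]
    proposals = [(0, 0, 10, 10), (1, 0, 11, 10), (3, 0, 13, 10)]
    for u in (0.5, 0.6, 0.9):
        # Raising the threshold leaves fewer positives for training.
        print(u, label_proposals(proposals, gt, u))
```

In this toy case the number of positives shrinks as the threshold rises, which is the effect the abstract attributes to vanishing positive samples at high IoU.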
In online social networks people often express attitudes towards others, which forms massive sentiment links among users. Predicting the sign of sentiment links is a fundamental task in many areas such as personal advertising and public opinion analysis. Previous works mainly focus on textual sentiment classification. However, text information can only disclose the ‘tip of the iceberg’ about users’ true opinions, most of which are unobserved but implied by other sources of information such as social relations and users’ profiles. To address this problem, in this paper we investigate how to predict possibly existing sentiment links in the presence of heterogeneous information. First, due to the lack of explicit sentiment links in mainstream social networks, we establish a labeled heterogeneous sentiment dataset, built with an entity-level sentiment extraction method, which consists of users’ sentiment relations, social relations and profile knowledge. Then we propose a novel and flexible end-to-end Signed Heterogeneous Information Network Embedding (SHINE) framework to extract users’ latent representations from heterogeneous networks and predict the sign of unobserved sentiment links. SHINE utilizes multiple deep autoencoders to map each user into a low-dimension feature space while preserving the network structure. We demonstrate the superiority of SHINE over state-of-the-art baselines on link prediction and node recommendation in two real-world datasets. The experimental results also prove the efficacy of SHINE in the cold-start scenario.
We wish to contribute to the discussion of ‘Comparing Consensus Monte Carlo Strategies for Distributed Bayesian Computation’ by offering our views on the current best methods for Bayesian computation, both at big-data scale and with smaller data sets, as summarized in Table 1. This table is certainly an over-simplification of a highly complicated area of research in constant (present and likely future) flux, but we believe that constructing summaries of this type is worthwhile despite their drawbacks, if only to facilitate further discussion.
To realize human-like robot intelligence, a large-scale cognitive architecture is required for robots to understand the environment through a variety of sensors with which they are equipped. In this paper, we propose a novel framework named Serket that makes it easy to construct a large-scale generative model and perform inference on it by connecting sub-modules, allowing robots to acquire various capabilities through interaction with their environments and others. We consider that large-scale cognitive models can be constructed by connecting smaller fundamental models hierarchically while maintaining their programmatic independence. Moreover, connected modules are dependent on each other, and parameters are required to be optimized as a whole. Conventionally, the equations for parameter estimation have to be derived and implemented depending on the models. However, these equations become harder to derive and implement as the model grows larger. To solve these problems, in this paper, we propose a method for parameter estimation by communicating the minimal parameters between various modules while maintaining their programmatic independence. Therefore, Serket makes it easy to construct large-scale models and estimate their parameters via the connection of modules. Experimental results demonstrated that the model can be constructed by connecting modules, the parameters can be optimized as a whole, and they are comparable with the original models that we have proposed.
SSD (Single Shot Multibox Detector) is one of the best object detection algorithms with both high accuracy and fast speed. However, SSD’s feature pyramid detection method makes it hard to fuse the features from different scales. In this paper, we propose FSSD (Feature Fusion Single Shot Multibox Detector), an enhanced SSD with a novel and lightweight feature fusion module which can improve the performance significantly over SSD with just a little speed drop. In the feature fusion module, features from different layers with different scales are concatenated together, followed by some down-sampling blocks to generate a new feature pyramid, which will be fed to multibox detectors to predict the final detection results. On the Pascal VOC 2007 test, our network can achieve 82.7 mAP (mean average precision) at the speed of 65.8 FPS (frames per second) with the input size 300$\times$300 using a single Nvidia 1080Ti GPU. In addition, our result on COCO is also better than the conventional SSD by a large margin. Our FSSD outperforms a lot of state-of-the-art object detection algorithms in both accuracy and speed. Code will be made publicly available.
Drift analysis is one of the major tools for analysing evolutionary algorithms and nature-inspired search heuristics. In this chapter we give an introduction to drift analysis and give some examples of how to use it for the analysis of evolutionary algorithms.
Financial time-series forecasting has long been a challenging problem because of the inherently noisy and stochastic nature of the market. In High-Frequency Trading (HFT), forecasting for trading purposes is an even more challenging task since an automated inference system is required to be both accurate and fast. In this paper, we propose a neural network layer architecture that incorporates the idea of bilinear projection as well as an attention mechanism that enables the layer to detect and focus on crucial temporal information. The resulting network is highly interpretable, given its ability to highlight the importance and contribution of each temporal instance, thus allowing further analysis on the time instances of interest. Our experiments on a large-scale Limit Order Book (LOB) dataset show that a two-hidden-layer network utilizing our proposed layer outperforms by a large margin all existing state-of-the-art results coming from much deeper architectures while requiring far fewer computations.
Suffering from the extreme training data imbalance between seen and unseen classes, most existing state-of-the-art approaches fail to achieve satisfactory results for the challenging generalized zero-shot learning task. To circumvent the need for labeled examples of unseen classes, we propose a novel generative adversarial network (GAN) that synthesizes CNN features conditioned on class-level semantic information, offering a shortcut directly from a semantic descriptor of a class to a class-conditional feature distribution. Our proposed approach, pairing a Wasserstein GAN with a classification loss, is able to generate sufficiently discriminative CNN features to train softmax classifiers or any multimodal embedding method. Our experimental results demonstrate a significant boost in accuracy over the state of the art on five challenging datasets (CUB, FLO, SUN, AWA and ImageNet) in both the zero-shot learning and generalized zero-shot learning settings.
In recent years Deep Neural Networks (DNNs) have been rapidly developed in various applications, together with increasingly complex architectures. The performance gain of these DNNs generally comes with high computational costs and large memory consumption, which may not be affordable for mobile platforms. Deep model quantization can be used for reducing the computation and memory costs of DNNs, and deploying complex DNNs on mobile equipment. In this work, we propose an optimization framework for deep model quantization. First, we propose a measurement to estimate the effect of parameter quantization errors in individual layers on the overall model prediction accuracy. Then, we propose an optimization process based on this measurement for finding the optimal quantization bit-width for each layer. This is the first work that theoretically analyzes the relationship between parameter quantization errors of individual layers and model accuracy. Our new quantization algorithm outperforms previous quantization optimization methods, and achieves a 20-40% higher compression rate compared to equal bit-width quantization at the same model prediction accuracy.
One way to avoid overfitting in machine learning is to use model parameters distributed according to a Bayesian posterior given the data, rather than the maximum likelihood estimator. Stochastic gradient Langevin dynamics (SGLD) is one algorithm to approximate such Bayesian posteriors for large models and datasets. SGLD is a standard stochastic gradient descent to which is added a controlled amount of noise, specifically scaled so that the parameter converges in law to the posterior distribution [WT11, TTV16]. The posterior predictive distribution can be approximated by an ensemble of samples from the trajectory. Choice of the variance of the noise is known to impact the practical behavior of SGLD: for instance, noise should be smaller for sensitive parameter directions. Theoretically, it has been suggested to use the inverse Fisher information matrix of the model as the variance of the noise, since it is also the variance of the Bayesian posterior [PT13, AKW12, GC11]. But the Fisher matrix is costly to compute for large-dimensional models. Here we use the easily computed Fisher matrix approximations for deep neural networks from [MO16, Oll15]. The resulting natural Langevin dynamics combines the advantages of Amari’s natural gradient descent and Fisher-preconditioned Langevin dynamics for large neural networks. Small-scale experiments on MNIST show that Fisher matrix preconditioning brings SGLD close to dropout as a regularizing technique.
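To make the SGLD description in the abstract above concrete, here is a minimal sketch of plain (unpreconditioned) SGLD on a toy problem: sampling the posterior of a Gaussian mean with known unit variance. It is not the Fisher-preconditioned natural Langevin dynamics the abstract proposes. The toy model, the function names `sgld_gaussian_mean` and `exact_posterior_mean`, and all hyperparameter values are assumptions chosen for illustration.

```python
import math
import random


def sgld_gaussian_mean(data, n_steps=5000, step_size=1e-3, batch_size=10,
                       prior_std=10.0, seed=0):
    """Plain SGLD for the mean of a unit-variance Gaussian with a N(0, prior_std^2) prior."""
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        batch = rng.sample(data, batch_size)
        # Gradient of the log prior.
        grad_prior = -theta / prior_std ** 2
        # Minibatch estimate of the log-likelihood gradient, rescaled to the full dataset.
        grad_lik = (n / batch_size) * sum(x - theta for x in batch)
        # Injected Gaussian noise whose variance equals the step size.
        noise = rng.gauss(0.0, math.sqrt(step_size))
        theta += 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples.append(theta)
    return samples


def exact_posterior_mean(data, prior_std=10.0):
    """Closed-form posterior mean for the same conjugate toy model."""
    return sum(data) / (len(data) + 1.0 / prior_std ** 2)


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(2.0, 1.0) for _ in range(100)]
    kept = sgld_gaussian_mean(data)[500:]  # discard burn-in
    print("SGLD estimate:", sum(kept) / len(kept))
    print("exact posterior mean:", exact_posterior_mean(data))
```

Averaging the retained samples approximates the posterior mean, and the same trajectory could serve as the ensemble for a posterior predictive estimate. A preconditioned variant would rescale both the gradient and the noise by a matrix, which this sketch omits.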
We consider the problem of statistical inference for ranking data, specifically rank aggregation, under the assumption that samples are incomplete in the sense of not comprising all choice alternatives. In contrast to most existing methods, we explicitly model the process of turning a full ranking into an incomplete one, which we call the coarsening process. To this end, we propose the concept of rank-dependent coarsening, which assumes that incomplete rankings are produced by projecting a full ranking to a random subset of ranks. For a concrete instantiation of our model, in which full rankings are drawn from a Plackett-Luce distribution and observations take the form of pairwise preferences, we study the performance of various rank aggregation methods. In addition to predictive accuracy in the finite sample setting, we address the theoretical question of consistency, by which we mean the ability to recover a target ranking when the sample size goes to infinity, despite a potential bias in the observations caused by the (unknown) coarsening.
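The generative setup in the abstract above can be sketched as follows: draw a full ranking from a Plackett-Luce model, then project it onto a subset of ranks to obtain an incomplete observation. The function names are hypothetical, and choosing the pair of ranks uniformly at random is an assumption made for this example; the abstract only says the subset of ranks is random.

```python
import random


def sample_plackett_luce(weights, rng):
    """Draw a full ranking (best first) of items 0..n-1 from a Plackett-Luce model."""
    remaining = list(range(len(weights)))
    ranking = []
    while remaining:
        # Pick the next-ranked item with probability proportional to its weight.
        item = rng.choices(remaining, weights=[weights[i] for i in remaining])[0]
        remaining.remove(item)
        ranking.append(item)
    return ranking


def coarsen(ranking, ranks):
    """Project a full ranking onto a subset of rank positions, preserving order."""
    return [ranking[r] for r in sorted(ranks)]


def pairwise_observation(weights, rng):
    """Observe only two rank positions of a full ranking, i.e. one pairwise preference."""
    full = sample_plackett_luce(weights, rng)
    # Assumption: the two observed ranks are chosen uniformly at random.
    ranks = rng.sample(range(len(full)), 2)
    winner, loser = coarsen(full, ranks)
    return winner, loser


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [4.0, 2.0, 1.0, 1.0]
    print(sample_plackett_luce(weights, rng))
    print(pairwise_observation(weights, rng))
```

A non-uniform distribution over rank subsets, for example one that favours top ranks, would bias which comparisons are observed. That kind of bias is what the consistency question in the abstract is concerned with.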
The Euclidean k-means problem is arguably the most widely studied clustering problem in machine learning. While the k-means objective is NP-hard in the worst case, practitioners have enjoyed remarkable success in applying heuristics like Lloyd’s algorithm for this problem. To address this disconnect, we study the following question: what properties of real-world instances will enable us to design efficient algorithms and prove guarantees for finding the optimal clustering? We consider a natural notion called additive perturbation stability that we believe captures many practical instances. Stable instances have unique optimal k-means solutions that do not change even when each point is perturbed a little (in Euclidean distance). This captures the property that the k-means optimal solution should be tolerant to measurement errors and uncertainty in the points. We design efficient algorithms that provably recover the optimal clustering for instances that are additive perturbation stable. When the instance has some additional separation, we show an efficient algorithm with provable guarantees that is also robust to outliers. We complement these results by studying the amount of stability in real datasets and demonstrating that our algorithm performs well on these benchmark datasets.
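For reference, Lloyd’s algorithm, the heuristic mentioned in the abstract above, can be sketched in a few lines. This is the standard textbook heuristic, not the stability-based algorithm the abstract proposes. The function names and the choice to pass initial centers explicitly are assumptions made for this example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, centers, max_iters=100):
    """Alternate nearest-center assignment and centroid updates until labels stop changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    centers, labels = lloyd_kmeans(pts, [(0, 0), (10, 10)])
    print(centers, labels)
    # Perturbing every point slightly leaves the clustering unchanged on this toy instance.
    moved = [(x + 0.1, y - 0.1) for x, y in pts]
    print(lloyd_kmeans(moved, [(0, 0), (10, 10)])[1] == labels)
```

The perturbation check at the end only illustrates the intuition behind stability on a well-separated toy instance; it does not test the formal definition of additive perturbation stability.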