Causal influence measures for machine learnt classifiers shed light on the reasons behind classification, and aid in identifying influential input features and revealing their biases. However, such analyses involve evaluating the classifier using datapoints that may be atypical of its training distribution. Standard methods for training classifiers that minimize empirical risk do not constrain the behavior of the classifier on such datapoints. As a result, training to minimize empirical risk does not distinguish among classifiers that agree on predictions in the training distribution but have wildly different causal influences. We term this problem covariate shift in causal testing and formally characterize conditions under which it arises. As a solution to this problem, we propose a novel active learning algorithm that constrains the influence measures of the trained model. We prove that any two predictors whose errors are close on both the original training distribution and the distribution of atypical points are guaranteed to have causal influences that are also close. Further, we empirically demonstrate with synthetic labelers that our algorithm trains models that (i) have similar causal influences as the labeler’s model, and (ii) generalize better to out-of-distribution points while (iii) retaining their accuracy on in-distribution points.
Under the classical long-span asymptotic framework we develop a class of Generalized Laplace (GL) inference methods for the change-point dates in a linear time series regression model with multiple structural changes analyzed in, e.g., Bai and Perron (1998). The GL estimator is defined by an integration rather than optimization-based method and relies on the least-squares criterion function. It is interpreted as a classical (non-Bayesian) estimator and the inference methods proposed retain a frequentist interpretation. Since inference about the change-point dates is a nonstandard statistical problem, the original insight of Laplace to interpret a certain transformation of a least-squares criterion function as a statistical belief over the parameter space provides a better approximation about the uncertainty in the data about the change-points relative to existing methods. Simulations show that the GL estimator is in general more precise than the OLS estimator. On the theoretical side, depending on some input (smoothing) parameter, the class of GL estimators exhibits a dual limiting distribution; namely, the classical shrinkage asymptotic distribution of Bai and Perron (1998), or a Bayes-type asymptotic distribution.
Time-Series Classification (TSC) has attracted a lot of attention in pattern recognition, because wide range of applications from different domains such as finance and health informatics deal with time-series signals. Bag of Features (BoF) model has achieved a great success in TSC task by summarizing signals according to the frequencies of ‘feature words’ of a data-learned dictionary. This paper proposes embedding the Recurrence Plots (RP), a visualization technique for analysis of dynamic systems, in the BoF model for TSC. While the traditional BoF approach extracts features from 1D signal segments, this paper uses the RP to transform time-series into 2D texture images and then applies the BoF on them. Image representation of time-series enables us to explore different visual descriptors that are not available for 1D signals and to treats TSC task as a texture recognition problem. Experimental results on the UCI time-series classification archive demonstrates a significant accuracy boost by the proposed Bag of Recurrence patterns (BoR), compared not only to the existing BoF models, but also to the state-of-the art algorithms.
The Hilbert transform (HT) and associated Gabor analytic signal (GAS) representation are well-known and widely used mathematical formulations for modeling and analysis of signals in various applications. In this study, like the HT, to obtain quadrature component of a signal, we propose the novel discrete Fourier cosine quadrature transforms (FCQTs) and discrete Fourier sine quadrature transforms (FSQTs), designated as Fourier quadrature transforms (FQTs). Using these FQTs, we propose sixteen Fourier-Singh analytic signal (FSAS) representations with following properties: (1) real part of eight FSAS representations is the original signal and imaginary part is the FCQT of the real part, (2) imaginary part of eight FSAS representations is the original signal and real part is the FSQT of the real part, (3) like the GAS, Fourier spectrum of the all FSAS representations has only positive frequencies, however unlike the GAS, the real and imaginary parts of the proposed FSAS representations are not orthogonal to each other. The Fourier decomposition method (FDM) is an adaptive data analysis approach to decompose a signal into a set of small number of Fourier intrinsic band functions which are AM-FM components. This study also proposes a new formulation of the FDM using the discrete cosine transform (DCT) with the GAS and FSAS representations, and demonstrate its efficacy for improved time-frequency-energy representation and analysis of nonlinear and non-stationary time series.
Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of advanced dataflow oriented platforms, most prominently Apache Spark and Apache Flink. Those platforms not only aim to improve performance through improved in-memory processing, but in particular provide built-in high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop than with plain Hadoop MapReduce. But is this indeed the case? This paper compares three prominent distributed data processing platforms: Apache Hadoop MapReduce; Apache Spark; and Apache Flink, from a usability perspective. We report on the design, execution and results of a usability study with a cohort of masters students, who were learning and working with all three platforms in order to solve different use cases set in a data science context. Our findings show that Spark and Flink are preferred platforms over MapReduce. Among participants, there was no significant difference in perceived preference or development time between both Spark and Flink as platforms for batch-oriented big data analysis. This study starts an exploration of the factors that make big data platforms more – or less – effective for users in data science.
Informatics and technological advancements have triggered generation of huge volume of data with varied complexity in its management and analysis. Big Data analytics is the practice of revealing hidden aspects of such data and making inferences from it. Although storage, retrieval and management of Big Data seem possible through efficient algorithm and system development, concern about statistical consistency remains to be addressed in view of its specific characteristics. Since Big Data does not conform to standard analytics, we need proper modification of the existing statistical theory and tools. Here we propose, with illustrations, a general statistical framework and an algorithmic principle for Big Data analytics that ensure statistical accuracy of the conclusions. The proposed framework has the potential to push forward advancement of Big Data analytics in the right direction. The partition-repetition approach proposed here is broad enough to encompass all practical data analytic problems.
In recent years, a large scale of various wireless sensor networks have been deployed for basic scientific works. Massive data loss is so common that there is a great demand for data recovery. While data recovery methods fulfil the requirement of accuracy, the potential privacy leakage caused by them concerns us a lot. Thus the major challenge of sensory data recovery is the issue of effective privacy preservation. Existing algorithms can either accomplish accurate data recovery or solve privacy issue, yet no single design is able to address these two problems simultaneously. Therefore in this paper, we propose a novel approach Privacy-Preserving Compressive Sensing with Multi-Attribute Assistance (PPCS-MAA). It applies PPCS scheme to sensory data recovery, which can effectively encrypts sensory data without decreasing accuracy, because it maintains the homomorphic obfuscation property for compressive sensing. In addition, multiple environmental attributes from sensory datasets usually have strong correlation so that we design a MultiAttribute Assistance (MAA) component to leverage this feature for better recovery accuracy. Combining PPCS with MAA, the novel recovery scheme can provide reliable privacy with high accuracy. Firstly, based on two real datasets, IntelLab and GreenOrbs, we reveal the inherited low-rank features as the ground truth and find such multi-attribute correlation. Secondly, we develop a PPCS-MAA algorithm to preserve privacy and optimize the recovery accuracy. Thirdly, the results of real data-driven simulations show that the algorithm outperforms the existing solutions.
The susceptibility of deep learning to adversarial attack can be understood in the framework of the Renormalisation Group (RG) and the vulnerability of a specific network may be diagnosed provided the weights in each layer are known. An adversary with access to the inputs and outputs could train a second network to clone these weights and, having identified a weakness, use them to compute the perturbation of the input data which exploits it. However, the RG framework also provides a means to poison the outputs of the network imperceptibly, without affecting their legitimate use, so as to prevent such cloning of its weights and thereby foil the generation of adversarial data.
The recent successes of AI have captured the wildest imagination of both the scientific communities and the general public. Robotics and AI amplify human potentials, increase productivity and are moving from simple reasoning towards human-like cognitive abilities. Current AI technologies are used in a set area of applications, ranging from healthcare, manufacturing, transport, energy, to financial services, banking, advertising, management consulting and government agencies. The global AI market is around 260 billion USD in 2016 and it is estimated to exceed 3 trillion by 2024. To understand the impact of AI, it is important to draw lessons from it’s past successes and failures and this white paper provides a comprehensive explanation of the evolution of AI, its current status and future directions.
The rapid growth of population and the permanent increase in the number of vehicles engender several issues in transportation systems, which in turn call for an intelligent and cost-effective approach to resolve the problems in an efficient manner. Smart transportation is a framework that leverages the power of Information and Communication Technology for acquisition, management, and mining of traffic-related data sources, which, in this study, are categorized into: 1) traffic flow sensors, 2) video image processors, 3) probe people and vehicles based on Global Positioning Systems (GPS), mobile phone cellular networks, and Bluetooth, 4) location-based social networks, 5) transit data with the focus on smart cards, and 6) environmental data. For each data source, first, the operational mechanism of the technology for capturing the data is succinctly demonstrated. Secondly, as the most salient feature of this study, the transport-domain applications of each data source that have been conducted by the previous studies are reviewed and classified into the main groups. Thirdly, a number of possible future research directions are provided for all types of data sources. Moreover, in order to alleviate the shortcomings pertaining to each single data source and acquire a better understanding of mobility behavior in transportation systems, the data fusion architectures are introduced to fuse the knowledge learned from a set of heterogeneous but complementary data sources. Finally, we briefly mention the current challenges and their corresponding solutions in the smart transportation.
The success of current deep saliency detection methods heavily depends on the availability of large-scale supervision in the form of per-pixel labeling. Such supervision, while labor-intensive and not always possible, tends to hinder the generalization ability of the learned models. By contrast, traditional handcrafted features based unsupervised saliency detection methods, even though have been surpassed by the deep supervised methods, are generally dataset-independent and could be applied in the wild. This raises a natural question that ‘Is it possible to learn saliency maps without using labeled data while improving the generalization ability?’. To this end, we present a novel perspective to unsupervised saliency detection through learning from multiple noisy labeling generated by ‘weak’ and ‘noisy’ unsupervised handcrafted saliency methods. Our end-to-end deep learning framework for unsupervised saliency detection consists of a latent saliency prediction module and a noise modeling module that work collaboratively and are optimized jointly. Explicit noise modeling enables us to deal with noisy saliency maps in a probabilistic way. Extensive experimental results on various benchmarking datasets show that our model not only outperforms all the unsupervised saliency methods with a large margin but also achieves comparable performance with the recent state-of-the-art supervised deep saliency methods.
A reduction of a source distribution is a collection of smaller sized distributions that are collectively equivalent to the source distribution with respect to the property of decomposability. That is, an arbitrary language is decomposable with respect to the source distribution if and only if it is decomposable with respect to each smaller sized distribution (in the reduction). The notion of reduction of distributions has previously been proposed to improve the complexity of decomposability verification. In this work, we address the problem of generating (optimal) reductions of distributions automatically. A (partial) solution to this problem is provided, which consists of 1) an incremental algorithm for the production of candidate reductions and 2) a reduction validation procedure. In the incremental production stage, backtracking is applied whenever a candidate reduction that cannot be validated is produced. A strengthened substitution-based proof technique is used for reduction validation, while a fixed template of candidate counter examples is used for reduction refutation; put together, they constitute our (partial) solution to the reduction verification problem. In addition, we show that a recursive approach for the generation of (small) reductions is easily supported.
Constraint-based clustering algorithms exploit background knowledge to construct clusterings that are aligned with the interests of a particular user. This background knowledge is often obtained by allowing the clustering system to pose pairwise queries to the user: should these two elements be in the same cluster or not? Active clustering methods aim to minimize the number of queries needed to obtain a good clustering by querying the most informative pairs first. Ideally, a user should be able to answer a couple of these queries, inspect the resulting clustering, and repeat these two steps until a satisfactory result is obtained. We present COBRAS, an approach to active clustering with pairwise constraints that is suited for such an interactive clustering process. A core concept in COBRAS is that of a super-instance: a local region in the data in which all instances are assumed to belong to the same cluster. COBRAS constructs such super-instances in a top-down manner to produce high-quality results early on in the clustering process, and keeps refining these super-instances as more pairwise queries are given to get more detailed clusterings later on. We experimentally demonstrate that COBRAS produces good clusterings at fast run times, making it an excellent candidate for the iterative clustering scenario outlined above.
In this work we present a novel unsupervised framework for hard training example mining. The only input to the method is a collection of images relevant to the target application and a meaningful initial representation, provided e.g. by pre-trained CNN. Positive examples are distant points on a single manifold, while negative examples are nearby points on different manifolds. Both types of examples are revealed by disagreements between Euclidean and manifold similarities. The discovered examples can be used in training with any discriminative loss. The method is applied to unsupervised fine-tuning of pre-trained networks for fine-grained classification and particular object retrieval. Our models are on par or are outperforming prior models that are fully or partially supervised.
We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources. For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance. Comparisons are made with baselines that use word level transfer learning via pretrained word embeddings as well as baselines do not use any transfer learning. We find that transfer learning using sentence embeddings tends to outperform word level transfer. With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task. We obtain encouraging results on Word Embedding Association Tests (WEAT) targeted at detecting model bias. Our pre-trained sentence encoding models are made freely available for download and on TF Hub.
Textual grounding, i.e., linking words to objects in images, is a challenging but important task for robotics and human-computer interaction. Existing techniques benefit from recent progress in deep learning and generally formulate the task as a supervised learning problem, selecting a bounding box from a set of possible options. To train these deep net based approaches, access to a large-scale datasets is required, however, constructing such a dataset is time-consuming and expensive. Therefore, we develop a completely unsupervised mechanism for textual grounding using hypothesis testing as a mechanism to link words to detected image concepts. We demonstrate our approach on the ReferIt Game dataset and the Flickr30k data, outperforming baselines by 7.98% and 6.96% respectively.
We present a novel framework for iterative visual reasoning. Our framework goes beyond current recognition systems that lack the capability to reason beyond stack of convolutions. The framework consists of two core modules: a local module that uses spatial memory to store previous beliefs with parallel updates; and a global graph-reasoning module. Our graph module has three components: a) a knowledge graph where we represent classes as nodes and build edges to encode different types of semantic relationships between them; b) a region graph of the current image where regions in the image are nodes and spatial relationships between these regions are edges; c) an assignment graph that assigns regions to classes. Both the local module and the global module roll-out iteratively and cross-feed predictions to each other to refine estimates. The final predictions are made by combining the best of both modules with an attention mechanism. We show strong performance over plain ConvNets, \eg achieving an $8.4\%$ absolute improvement on ADE measured by per-class average precision. Analysis also shows that the framework is resilient to missing regions for reasoning.