This paper contributes to the interpretability of deep learning models for time-series applications. We propose a temporal attention layer that selects the relevant information for various tasks, including data completion, key-frame detection and classification. The method uses the whole input sequence to calculate an attention value for each time step, which results in more focused attention values and more plausible visualisations than previous methods. We apply the proposed method to three different tasks. Experimental results show that the proposed network produces results comparable to the state of the art. In addition, the network provides better interpretability of its decisions: it assigns larger attention weights to the relevant frames than similar earlier techniques.
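As a rough illustration of attention computed from the whole sequence, the sketch below scores each time step against a global summary of the input and normalises the scores over time. The scoring function, the mean-over-time context, and the names `temporal_attention`, `attend` and `w` are assumptions made for this example; the abstract does not specify the layer's actual form.

```python
import math

def temporal_attention(seq, w):
    """Return one attention weight per time step.

    seq: list of T feature vectors, each a list of D floats.
    w:   hypothetical scoring vector of length D (learned in a real model).
    """
    T, D = len(seq), len(seq[0])
    # Context summarising the whole sequence (assumption: mean over time).
    context = [sum(step[d] for step in seq) / T for d in range(D)]
    # Score each step by how it deviates from the global context.
    scores = [
        sum(w[d] * math.tanh(step[d] - context[d]) for d in range(D))
        for step in seq
    ]
    # Softmax over time, shifted by the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(seq, weights):
    """Attention-weighted sum of the sequence (one vector of length D)."""
    D = len(seq[0])
    return [sum(a * step[d] for a, step in zip(weights, seq)) for d in range(D)]

# Toy sequence in which the third frame stands out from the rest.
seq = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [0.1, 0.2]]
w = [1.0, 1.0]
weights = temporal_attention(seq, w)
summary = attend(seq, weights)
print(weights)
```

Because every score depends on the sequence-wide context, changing any frame changes all the weights, which is the property the abstract attributes to the proposed layer.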
Correlation matrices are a major type of multivariate data. To examine a property of a given correlation matrix, a common practice is to compare the corresponding quantity between the original matrix and reference correlation matrices, such as those derived from random matrix theory, that partially preserve properties of the original matrix. We propose a model to generate such reference correlation and covariance matrices for a given matrix. Correlation matrices are often analysed as networks, in which nodes differ in their total connectivity to the other nodes. Given this background, the present algorithm generates random networks that preserve the expected total connectivity of each node, akin to configuration models for conventional networks. Our algorithm is derived from the maximum entropy principle. We will apply the proposed algorithm to the measurement of clustering coefficients and to community detection, both of which require a null model to assess the statistical significance of the obtained results.
We propose a new solution to the blind source separation problem that factors mixed time-series signals into a sum of spatiotemporal modes, with the constraint that the temporal components are intrinsic mode functions (IMFs). The key motivation is that IMFs allow the computation of meaningful Hilbert transforms of non-stationary data, from which instantaneous time-frequency representations may be derived. Our spatiotemporal intrinsic mode decomposition (STIMD) method leverages spatial correlations to generalize the extraction of IMFs from one-dimensional signals, commonly performed using the empirical mode decomposition (EMD), to multi-dimensional signals. Further, this data-driven method enables future-state prediction. We demonstrate STIMD on several synthetic examples, comparing it to common matrix factorization techniques, namely singular value decomposition (SVD), independent component analysis (ICA), and dynamic mode decomposition (DMD). We show that STIMD outperforms these methods at reconstructing the data and at extracting interpretable modes. Next, we apply STIMD to analyze two real-world datasets: gravitational wave data and neural recordings from the rodent hippocampus.
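To make the Hilbert-transform step concrete, the following minimal sketch derives an instantaneous frequency from a single mode through its analytic signal. It does not implement STIMD or EMD. The function names, the naive DFT, and the test signal (a 4 Hz cosine sampled at 64 Hz) are assumptions chosen for illustration.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for a short demo)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def analytic_signal(x):
    """Analytic signal x + i*H[x], built by zeroing negative frequencies.

    Assumes len(x) is even.
    """
    n = len(x)
    spec = dft(x)
    gain = [0.0] * n
    gain[0] = 1.0          # keep the DC term
    gain[n // 2] = 1.0     # keep the Nyquist term
    for k in range(1, n // 2):
        gain[k] = 2.0      # double the positive frequencies
    return dft([g * s for g, s in zip(gain, spec)], inverse=True)

def instantaneous_frequency(x, fs):
    """Instantaneous frequency in Hz from sample-to-sample phase increments."""
    z = analytic_signal(x)
    # The angle of z[t+1] * conj(z[t]) is the phase step, so no unwrapping is needed.
    return [
        cmath.phase(z[t + 1] * z[t].conjugate()) * fs / (2.0 * math.pi)
        for t in range(len(z) - 1)
    ]

fs = 64.0  # sampling rate in Hz (assumed)
x = [math.cos(2.0 * math.pi * 4.0 * t / fs) for t in range(64)]
freqs = instantaneous_frequency(x, fs)
print(min(freqs), max(freqs))
```

For this single-tone input the estimate is constant at the tone's frequency. For a signal that mixes several oscillations the same computation generally yields a frequency that is hard to interpret, which is the motivation for first constraining the temporal components to be IMFs.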
Although deep learning techniques have been successfully applied to many tasks, interpreting deep neural network models remains a major challenge. Recently, much work has been done on visualizing and analyzing the mechanisms of deep neural networks in image processing and natural language processing. In this paper, we present our approaches to visualizing and understanding deep neural networks for a commercially important task: click-through rate (CTR) prediction. We conduct experiments on production data from our online advertising system, whose distribution varies daily. To understand the mechanism and the performance of the model, we inspect the model's internal state at the neuron level. In addition, we implement a probe approach to measure the layer-wise performance of the model. Moreover, to measure the influence of the input features, we calculate saliency scores based on the back-propagated gradients. We also discuss practical applications, for example in understanding, monitoring, diagnosing and refining models and algorithms.
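The gradient-based saliency idea can be sketched on a toy model. The example below back-propagates the predicted click probability of a hypothetical one-hidden-layer ReLU network to its inputs and takes the absolute gradient as the saliency score. The network size, the weights, and the function names are invented for illustration and are not taken from the paper's CTR model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, w2, b2):
    """Return hidden pre-activations and the predicted probability."""
    pre = [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W1, b1)]
    hidden = [max(0.0, p) for p in pre]  # ReLU
    logit = sum(wi * hi for wi, hi in zip(w2, hidden)) + b2
    return pre, sigmoid(logit)

def saliency(x, W1, b1, w2, b2):
    """Absolute gradient of the output probability with respect to each input."""
    pre, p = forward(x, W1, b1, w2, b2)
    d_logit = p * (1.0 - p)  # derivative of the sigmoid at the logit
    # Back-propagate through the output weights and the ReLU gates.
    d_pre = [d_logit * wi * (1.0 if pi > 0 else 0.0) for wi, pi in zip(w2, pre)]
    # Back-propagate through the first layer to the inputs.
    grad_x = [sum(d_pre[i] * W1[i][j] for i in range(len(W1))) for j in range(len(x))]
    return [abs(g) for g in grad_x]

# Hypothetical weights: the third input feature is not connected to any unit.
W1 = [[1.0, -2.0, 0.0], [0.5, 0.5, 0.0]]
b1 = [0.0, 0.1]
w2 = [1.0, -1.0]
b2 = 0.0
x = [1.0, 0.2, 3.0]

scores = saliency(x, W1, b1, w2, b2)
print(scores)
```

In this toy setting the unconnected third feature receives zero saliency regardless of its value, while the second feature, which has the largest combined weight, scores highest.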
Recently, Deep Neural Networks (DNNs) have emerged as the dominant model across various AI applications. In the era of IoT and mobile systems, the efficient deployment of DNNs on embedded platforms is vital to enable the development of intelligent applications. This paper summarises our recent work on the optimised mapping of DNNs in embedded settings. By covering such diverse topics as DNN-to-accelerator toolflows, high-throughput cascaded classifiers and domain-specific model design, the presented set of works aims to enable the deployment of sophisticated deep learning models on cutting-edge mobile and embedded systems.
There has been growing interest in developing accurate models that can also be explained to humans. Unfortunately, if there exist multiple distinct but accurate models for some dataset, current machine learning methods are unlikely to find them: standard techniques will likely recover a complex model that combines them. In this work, we introduce a way to identify a maximal set of distinct but accurate models for a dataset. We demonstrate empirically that, in situations where the data supports multiple accurate classifiers, we tend to recover simpler, more interpretable classifiers rather than more complex ones.
It is well known that over-parametrized deep neural networks (DNNs) are an overly expressive class of functions that can memorize even random data with $100\%$ training accuracy. This raises the question of why they do not easily overfit real data. To answer this question, we study deep networks using Fourier analysis. We show that deep networks with finite weights (or trained for a finite number of steps) are inherently biased towards representing smooth functions over the input space. Specifically, the magnitude of a particular frequency component ($k$) of a deep ReLU network function decays at least as fast as $\mathcal{O}(k^{-2})$, with width and depth helping polynomially and exponentially (respectively) in modeling higher frequencies. This shows, for instance, why DNNs cannot perfectly \textit{memorize} peaky delta-like functions. We also show that DNNs can exploit the geometry of low-dimensional data manifolds to approximate complex functions that exist along the manifold with simple functions when seen with respect to the input space. As a consequence, we find that all samples (including adversarial samples) classified by a network as belonging to a certain class are connected by a path along which the prediction of the network does not change. Finally, we find that DNN parameters corresponding to functions with higher frequency components occupy a smaller volume in the parameter space.
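The $k^{-2}$ decay can be checked numerically in the simplest setting. The sketch below takes $|x| = \mathrm{relu}(x) + \mathrm{relu}(-x)$ on $[-\pi, \pi]$ as a two-unit ReLU network and estimates its Fourier cosine coefficients, which for odd $k$ equal $-4/(\pi k^2)$ in closed form. The choice of function, the periodic one-dimensional setting, and the midpoint-rule integration are assumptions of this example, not the paper's general argument.

```python
import math

def relu(z):
    return max(0.0, z)

def f(x):
    # |x| written as a two-unit ReLU network: relu(x) + relu(-x).
    return relu(x) + relu(-x)

def cosine_coefficient(k, n=20000):
    """Midpoint-rule estimate of a_k = (1/pi) * integral of f(x) cos(kx) over [-pi, pi]."""
    h = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        x = -math.pi + (i + 0.5) * h
        total += f(x) * math.cos(k * x)
    return total * h / math.pi

# If |a_k| decays as k^-2, then k^2 * |a_k| should stay roughly constant
# (near 4/pi for odd k, by the closed form above).
for k in (1, 3, 5, 7, 9):
    a_k = cosine_coefficient(k)
    print(k, abs(a_k), k * k * abs(a_k))
```

The last printed column staying near $4/\pi$ across odd $k$ reflects the quadratic decay. A function with a jump discontinuity would instead decay only as $k^{-1}$, so the continuity of ReLU networks is what yields the faster rate in this example.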