Programming has been an important skill for researchers and practitioners in computer science and other related areas. To learn basic programing skills, a long-time systematic training is usually required for beginners. According to a recent market report, the computer software market is expected to continue expanding at an accelerating speed, but the market supply of qualified software developers can hardly meet such a huge demand. In recent years, the surge of text generation research works provides the opportunities to address such a dilemma through automatic program synthesis. In this paper, we propose to make our try to solve the program synthesis problem from a data mining perspective. To address the problem, a novel generative model, namely EgoCoder, will be introduced in this paper. EgoCoder effectively parses program code into abstract syntax trees (ASTs), where the tree nodes will contain the program code/comment content and the tree structure can capture the program logic flows. Based on a new unit model called Hsu, EgoCoder can effectively capture both the hierarchical and sequential patterns in the program ASTs. Extensive experiments will be done to compare EgoCoder with the state-of-the-art text generation methods, and the experimental results have demonstrated the effectiveness of EgoCoder in addressing the program synthesis problem.
This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, to perform high-throughput inference by exploiting the computation time-accuracy trade-off. Without the need for retraining, a two-stage architecture tailored for any given FPGA device is generated, consisting of a low- and a high-precision unit. A confidence evaluation unit is employed between them to identify misclassified cases at run time and forward them to the high-precision unit or terminate computation. Experiments demonstrate that CascadeCNN achieves a performance boost of up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy.
We propose a novel data-dependent structured gradient regularizer to increase the robustness of neural networks vis-a-vis adversarial perturbations. Our regularizer can be derived as a controlled approximation from first principles, leveraging the fundamental link between training with noise and regularization. It adds very little computational overhead during learning and is simple to implement generically in standard deep learning frameworks. Our experiments provide strong evidence that structured gradient regularization can act as an effective first line of defense against attacks based on low-level signal corruption.
In recent years, the Word2Vec model trained with the Negative Sampling loss function has shown state-of-the-art results in a number of machine learning tasks, including language modeling tasks, such as word analogy and word similarity, and in recommendation tasks, through Prod2Vec, an extension that applies to modeling user shopping activity and user preferences. Several methods that aim to improve upon the standard Negative Sampling loss have been proposed. In our paper we pursue more sophisticated Negative Sampling, by leveraging ideas from the field of Generative Adversarial Networks (GANs), and propose Adversarial Negative Sampling. We build upon the recent progress made in stabilizing the training objective of GANs in the discrete data setting, and introduce a new GAN-Word2Vec model.We evaluate our model on the task of basket completion, and show significant improvements in performance over Word2Vec trained using standard loss functions, including Noise Contrastive Estimation and Negative Sampling.
Combining Bayesian nonparametrics and a forward model selection strategy, we construct parsimonious Bayesian deep networks (PBDNs) that infer capacity-regularized network architectures from the data and require neither cross-validation nor fine-tuning when training the model. One of the two essential components of a PBDN is the development of a special infinite-wide single-hidden-layer neural network, whose number of active hidden units can be inferred from the data. The other one is the construction of a greedy layer-wise learning algorithm that uses a forward model selection criterion to determine when to stop adding another hidden layer. We develop both Gibbs sampling and stochastic gradient descent based maximum a posteriori inference for PBDNs, providing state-of-the-art classification accuracy and interpretable data subtypes near the decision boundaries, while maintaining low computational complexity for out-of-sample prediction.
In this paper we consider a ranking problem in which we would like to order a set of items by utility or relevance, while also considering the visibility of different groups of items. To solve this problem, we adopt a supervised learning to rank approach that learns a ranking function from a set of training examples, which are queries and ranked lists of documents for each query. We consider that the elements to be ranked are divided into two groups: protected and non-protected. Following long-standing empirical observations showing that users of information retrieval systems rarely look past the first few results, we consider that some items receive more exposure than others. Our objective is to produce a ranker that is able to reproduce the ordering of the training set, which is the standard objective in learning to rank, but that additionally gives protected elements sufficient exposure, compared to non-protected elements. We demonstrate how to describe this objective formally, how to achieve it effectively and implement it, and present an experimental study describing how large differences in exposure can be reduced without having to introduce large distortions in the ranking utility.
Building tools for code-mixed data is rapidly gaining popularity in the NLP research community as such data is exponentially rising on social media. Working with code-mixed data contains several challenges, especially due to grammatical inconsistencies and spelling variations in addition to all the previous known challenges for social media scenarios. In this article, we present a novel architecture focusing on normalizing phonetic typing variations, which is commonly seen in code-mixed data. One of the main features of our architecture is that in addition to normalizing, it can also be utilized for back-transliteration and word identification in some cases. Our model achieved an accuracy of 90.27% on the test data.
Conditional generative adversarial networks (cGAN) have led to large improvements in the task of conditional image generation, which lies at the heart of computer vision. The major focus so far has been on performance improvement, while there has been little effort in making cGAN more robust to noise or leveraging structure in the output space of the model. The end-to-end regression (of the generator) might lead to arbitrarily large errors in the output, which is unsuitable for the application of such networks to real-world systems. In this work, we introduce a novel conditional GAN, called RoCGAN, which adds implicit constraints to address the issue. Our proposed model augments the generator with an unsupervised pathway, which encourages the outputs of the generator to span the target manifold even in the presence of large amounts of noise. We prove that RoCGAN shares similar theoretical properties as GAN and experimentally verify that the proposed model outperforms existing state-of-the-art cGAN architectures by a large margin in a variety of domains including images from natural scenes and faces.
In this paper we propose solving localized multiple kernel learning (LMKL) using LMKL-Net, a feedforward deep neural network. In contrast to previous works, as a learning principle we propose {\em parameterizing} both the gating function for learning kernel combination weights and the multiclass classifier in LMKL using an attentional network (AN) and a multilayer perceptron (MLP), respectively. In this way we can learn the (nonlinear) decision function in LMKL (approximately) by sequential applications of AN and MLP. Empirically on benchmark datasets we demonstrate that overall LMKL-Net can not only outperform the state-of-the-art MKL solvers in terms of accuracy, but also be trained about {\em two orders of magnitude} faster with much smaller memory footprint for large-scale learning.
Image-level feature descriptors obtained from convolutional neural networks have shown powerful representation capabilities for image retrieval. In this paper, we present an unsupervised method to aggregate deep convolutional features into compact yet discriminative image vectors by simulating the dynamics of heat diffusion. A distinctive problem in image retrieval is that repetitive or bursty features tend to dominate feature representations, leading to less than ideal matches. We show that by leveraging elegant properties of the heat equation, our method is able to select informative features while avoiding over-representation of bursty features. We additionally present a theoretical time complexity analysis showing the efficiency of our method, which is further demonstrated in our experimental evaluation. Finally, we extensively evaluate the proposed approach with pre-trained and fine-tuned deep networks on common public benchmarks, and show superior performance compared to previous work.
We introduce algorithms for online, full-information prediction that are competitive with contextual tree experts of unknown complexity, in both probabilistic and adversarial settings. We show that by incorporating a probabilistic framework of structural risk minimization into existing adaptive algorithms, we can robustly learn not only the presence of stochastic structure when it exists (leading to constant as opposed to $\mathcal{O}(\sqrt{T})$ regret), but also the correct model order. We thus obtain regret bounds that are competitive with the regret of an optimal algorithm that possesses strong side information about both the complexity of the optimal contextual tree expert and whether the process generating the data is stochastic or adversarial. These are the first constructive guarantees on simultaneous adaptivity to the model and the presence of stochasticity.
In recent days different types of surveillance data are becoming available for public health reasons. In most cases several variables are monitored and events of different types are reported. As the amount of surveillance data increases, statistical methods that can effectively address multivariate surveillance scenarios are demanded. Even though research activity in this field is increasing rapidly in recent years, only a few approaches have simultaneously addressed the integer-valued property of the data and its correlation (both time correlation and cross correlation) structure. In this paper, we suggest a multivariate integer-valued autoregressive model that allows for both serial and cross correlation between the series and can easily accommodate overdispersion and covariate information. Moreover, its structure implies a natural decomposition into an endemic and an epidemic component, a common distinction in dynamic models for infectious disease counts. Detection of disease outbreaks is achieved through the comparison of surveillance data with one-step-ahead predictions obtained after fitting the suggested model to a set of historical data. The performance of the suggested model is illustrated on a trivariate series of syndromic surveillance data collected during Athens 2004 Olympic Games.
In low-rank tensor completion tasks, due to the underlying multiple large-scale singular value decomposition (SVD) operations and rank selection problem of the traditional methods, they suffer from high computational cost and high sensitivity of model complexity. In this paper, taking advantages of high compressibility of the recently proposed tensor ring (TR) decomposition, we propose a new model for tensor completion problem. This is achieved through introducing convex surrogates of tensor low-rank assumption on latent tensor ring factors, which makes it possible for the Schatten norm regularization based models to be solved at much smaller scale. We propose two algorithms which apply different structured Schatten norms on tensor ring factors respectively. By the alternating direction method of multipliers (ADMM) scheme, the tensor ring factors and the predicted tensor can be optimized simultaneously. The experiments on synthetic data and real-world data show the high performance and efficiency of the proposed approach.
Recurrent neural networks (RNNs) are difficult to train on sequence processing tasks, not only because input noise may be amplified through feedback, but also because any inaccuracy in the weights has similar consequences as input noise. We describe a method for denoising the hidden state during training to achieve more robust representations thereby improving generalization performance. Attractor dynamics are incorporated into the hidden state to `clean up’ representations at each step of a sequence. The attractor dynamics are trained through an auxillary denoising loss to recover previously experienced hidden states from noisy versions of those states. This state-denoised recurrent neural network {SDRNN} performs multiple steps of internal processing for each external sequence step. On a range of tasks, we show that the SDRNN outperforms a generic RNN as well as a variant of the SDRNN with attractor dynamics on the hidden state but without the auxillary loss. We argue that attractor dynamics—and corresponding connectivity constraints—are an essential component of the deep learning arsenal and should be invoked not only for recurrent networks but also for improving deep feedforward nets and intertask transfer.
The great success of deep learning shows that its technology contains profound truth, and understanding its internal mechanism not only has important implications for the development of its technology and effective application in various fields, but also provides meaningful insights into the understanding of human brain mechanism. At present, most of the theoretical research on deep learning is based on mathematics. This dissertation proposes that the neural network of deep learning is a physical system, examines deep learning from three different perspectives: microscopic, macroscopic, and physical world views, answers multiple theoretical puzzles in deep learning by using physics principles. For example, from the perspective of quantum mechanics and statistical physics, this dissertation presents the calculation methods for convolution calculation, pooling, normalization, and Restricted Boltzmann Machine, as well as the selection of cost functions, explains why deep learning must be deep, what characteristics are learned in deep learning, why Convolutional Neural Networks do not have to be trained layer by layer, and the limitations of deep learning, etc., and proposes the theoretical direction and basis for the further development of deep learning now and in the future. The brilliance of physics flashes in deep learning, we try to establish the deep learning technology based on the scientific theory of physics.
For neural networks (NNs) with rectified linear unit (ReLU) or binary activation functions, we show that their training can be accomplished in a reduced parameter space. Specifically, the weights in each neuron can be trained on the unit sphere, as opposed to the entire space, and the threshold can be trained in a bounded interval, as opposed to the real line. We show that the NNs in the reduced parameter space are mathematically equivalent to the standard NNs with parameters in the whole space. The reduced parameter space shall facilitate the optimization procedure for the network training, as the search space becomes (much) smaller. We demonstrate the improved training performance using numerical examples.
In this paper, we propose a novel maximum causal Tsallis entropy (MCTE) framework for imitation learning which can efficiently learn a sparse multi-modal policy distribution from demonstrations. We provide the full mathematical analysis of the proposed framework. First, the optimal solution of an MCTE problem is shown to be a sparsemax distribution, whose supporting set can be adjusted. The proposed method has advantages over a softmax distribution in that it can exclude unnecessary actions by assigning zero probability. Second, we prove that an MCTE problem is equivalent to robust Bayes estimation in the sense of the Brier score. Third, we propose a maximum causal Tsallis entropy imitation learning (MCTEIL) algorithm with a sparse mixture density network (sparse MDN) by modeling mixture weights using a sparsemax distribution. In particular, we show that the causal Tsallis entropy of an MDN encourages exploration and efficient mixture utilization while Boltzmann Gibbs entropy is less effective. We validate the proposed method in two simulation studies and MCTEIL outperforms existing imitation learning methods in terms of average returns and learning multi-modal policies.
The success of deep learning models is heavily tied to the use of massive amount of labeled data and excessively long training time. With the emergence of intelligent edge applications that use these models, the critical challenge is to obtain the same inference capability on a resource-constrained device while providing adaptability to cope with the dynamic changes in the data. We propose AgileNet, a novel lightweight dictionary-based few-shot learning methodology which provides reduced complexity deep neural network for efficient execution at the edge while enabling low-cost updates to capture the dynamics of the new data. Evaluations of state-of-the-art few-shot learning benchmarks demonstrate the superior accuracy of AgileNet compared to prior arts. Additionally, AgileNet is the first few-shot learning approach that prevents model updates by eliminating the knowledge obtained from the primary training. This property is ensured through the dictionaries learned by our novel end-to-end structured decomposition, which also reduces the memory footprint and computation complexity to match the edge device constraints.
The intrinsic error tolerance of neural network (NN) makes approximate computing a promising technique to improve the energy efficiency of NN inference. Conventional approximate computing focuses on balancing the efficiency-accuracy trade-off for existing pre-trained networks, which can lead to suboptimal solutions. In this paper, we propose AxTrain, a hardware-oriented training framework to facilitate approximate computing for NN inference. Specifically, AxTrain leverages the synergy between two orthogonal methods—one actively searches for a network parameters distribution with high error tolerance, and the other passively learns resilient weights by numerically incorporating the noise distributions of the approximate hardware in the forward pass during the training phase. Experimental results from various datasets with near-threshold computing and approximation multiplication strategies demonstrate AxTrain’s ability to obtain resilient neural network parameters and system energy efficiency improvement.
Causal inference from observational data requires assumptions. These assumptions range from measuring confounders to identifying instruments. Traditionally, these assumptions have focused on estimation in a single causal problem. In this work, we develop techniques for causal estimation in causal problems with multiple treatments. We develop two assumptions based on shared confounding between treatments and independence of treatments given the confounder. Together these assumptions lead to a confounder estimator regularized by mutual information. For this estimator, we develop a tractable lower bound. To fit the outcome model, we use the residual information in the treatments given the confounder. We validate on simulations and an example from clinical medicine.