We review common methods of building multi-class classifiers from binary ones and generalize them to a common framework. Since conditional probabilities are useful both for quantifying the accuracy of an estimate and for calibration purposes, these are a required part of the solution. There is some indication that the best solution for multi-class classification is dependent on the particular dataset. As such, we are particularly interested in data-driven solution design, whether based on a priori considerations or empirical examination of the data. Numerical results indicate that while a one-size-fits-all solution consisting of one-versus-one is appropriate for most datasets, a minority will benefit from a more customized approach. The techniques discussed in this paper allow for a large variety of multi-class configurations and solution methods to be explored so as to optimize classification accuracy, accuracy of conditional probabilities and speed.
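As an illustration of how one-versus-one binary outputs can be turned into multi-class conditional probabilities, the sketch below averages the pairwise estimates. This is a simple averaging heuristic chosen for illustration, not necessarily the method used in the paper; the function name `couple_pairwise` and the input layout are assumptions.

```python
def couple_pairwise(pairwise, n_classes):
    """Combine one-versus-one outputs into multi-class probabilities.

    pairwise[(i, j)] with i < j is a binary classifier's estimate of
    P(class i | class is i or j, x). Assumed layout, for illustration only.
    """
    scores = [0.0] * n_classes
    for (i, j), r in pairwise.items():
        scores[i] += r          # evidence for class i from the (i, j) classifier
        scores[j] += 1.0 - r    # complementary evidence for class j
    # Each of the K(K-1)/2 classifiers contributes exactly 1 in total,
    # so dividing by that count makes the scores sum to 1.
    total = n_classes * (n_classes - 1) / 2
    return [s / total for s in scores]


# Three classes, three pairwise classifiers (made-up numbers).
probs = couple_pairwise({(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.5}, n_classes=3)
predicted = max(range(3), key=lambda k: probs[k])
```

The averaged scores form a valid probability vector, so they can be used both for the class decision and as a rough confidence estimate.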
Fine-Grained Visual Classification (FGVC) is an important computer vision problem that involves small diversity within the different classes, and often requires expert annotators to collect data. Utilizing this notion of small visual diversity, we revisit Maximum-Entropy learning in the context of fine-grained classification, and provide a training routine that maximizes the entropy of the output probability distribution for training convolutional neural networks on FGVC tasks. We provide a theoretical as well as empirical justification of our approach, and achieve state-of-the-art performance across a variety of classification tasks in FGVC, that can potentially be extended to any fine-tuning task. Our method is robust to different hyperparameter values, amount of training data and amount of training label noise and can hence be a valuable tool in many similar problems.
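A minimal sketch of an entropy-maximizing training objective for a single example, assuming the common form "cross-entropy minus a weighted entropy term". The weight `gamma` and the function names are hypothetical, and the exact objective used by the authors may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_loss(logits, label, gamma=1.0):
    """Cross-entropy minus gamma times the output entropy.

    Minimizing this loss rewards higher-entropy (less peaked) predictions;
    gamma is an assumed trade-off hyperparameter.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[label])
    return cross_entropy - gamma * entropy(probs)
```

With `gamma=0.0` the function reduces to the ordinary cross-entropy, which makes the effect of the entropy term easy to isolate in experiments.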
We tackle the problem of multiscale regression for predictors that are spatially or temporally indexed, or with a pre-specified multiscale structure, with a Bayesian modular approach. The regression function at the finest scale is expressed as an additive expansion of coarse to fine step functions. Our Modular and Multiscale (M&M) methodology provides multiscale decomposition of high-dimensional data arising from very fine measurements. Unlike more complex methods for functional predictors, our approach provides easy interpretation of the results. Additionally, it provides a quantification of uncertainty on the data resolution, solving a common problem researchers encounter with simple models on down-sampled data. We show that our modular and multiscale posterior has an empirical Bayes interpretation, with a simple limiting distribution in large samples. An efficient sampling algorithm is developed for posterior computation, and the methods are illustrated through simulation studies and an application to brain image classification. Source code is available as an R package at https://…/bmms.
Class labels are often imperfectly observed, due to mistakes and to genuine ambiguity among classes. We propose a new semi-supervised deep generative model that explicitly models noisy labels, called the Mislabeled VAE (M-VAE). The M-VAE can perform better than existing deep generative models which do not account for label noise. Additionally, the derivation of M-VAE gives new theoretical insights into the popular M1+M2 semi-supervised model.
Adversarial noises are useful tools to probe the weakness of deep learning based computer vision algorithms. In this paper, we describe a robust adversarial perturbation (R-AP) method to attack deep proposal-based object detectors and instance segmentation algorithms. Our method focuses on attacking the common component in these algorithms, namely the Region Proposal Network (RPN), to universally degrade their performance in a black-box fashion. To do so, we design a loss function that combines a label loss and a novel shape loss, and optimize it with respect to the image using a gradient-based iterative algorithm. Evaluations are performed on the MS COCO 2014 dataset for the adversarial attacking of 6 state-of-the-art object detectors and 2 instance segmentation algorithms. Experimental results demonstrate the efficacy of the proposed method.
Variational Auto-Encoders enforce their learned intermediate latent-space data distribution to be a simple distribution, such as an isotropic Gaussian. However, this causes the posterior collapse problem and loses manifold structure which can be important for datasets such as facial images. A GAN can transform a simple distribution to a latent-space data distribution and thus preserve the manifold structure, but optimizing a GAN involves solving a Min-Max optimization problem, which is difficult and not well understood so far. Therefore, we propose a GAN-like method to transform a simple distribution to a data distribution in the latent space by solving only a minimization problem. This minimization problem comes from training a discriminator between a simple distribution and a latent-space data distribution. Then, we can explicitly formulate an Optimal Transport (OT) problem that computes the desired mapping between the two distributions. This means that we can transform a distribution without solving the difficult Min-Max optimization problem. Experimental results on an eight-Gaussian dataset show that the proposed OT can handle multi-cluster distributions. Results on the MNIST and the CelebA datasets validate the effectiveness of the proposed method.
The main purpose of this note is to show that in a realization $(x_{1}^{n}, y_{1}^{n})$ of the causal \textit{information rate-distortion function} (IRDF) for a $\kappa$-th order Markovian source $x_{1}^{n}$, under a single letter sum distortion constraint, the smallest integer $\ell$ for which $y_{k} \leftrightarrow y_{1}^{k-1},x_{k-\ell+1}^{k} \leftrightarrow x_{1}^{k-\ell}$ holds is $\ell=\kappa$. This result is derived under the assumption that the sequences $(x_{1}^{n},y_{1}^{n})$ have a joint probability density function.
The tremendous potential exhibited by deep learning is often offset by architectural and computational complexity, making widespread deployment a challenge for edge scenarios such as mobile and other consumer devices. To tackle this challenge, we explore the following idea: Can we learn generative machines to automatically generate deep neural networks with efficient network architectures? In this study, we introduce the idea of generative synthesis, which is premised on the intricate interplay between a generator-inquisitor pair that works in tandem to garner insights and learn to generate highly efficient deep neural networks that best satisfy operational requirements. What is most interesting is that, once a generator has been learned through generative synthesis, it can be used to generate not just one but a large variety of different, unique, highly efficient deep neural networks that satisfy operational requirements. Experimental results for image classification, semantic segmentation, and object detection tasks illustrate the efficacy of generative synthesis in producing generators that automatically generate highly efficient deep neural networks (which we nickname FermiNets) with higher model efficiency and lower computational costs (reaching >10x more efficient and fewer multiply-accumulate operations than several tested state-of-the-art networks), as well as higher energy efficiency (reaching >4x improvements in image inferences per joule consumed on an Nvidia Tegra X2 mobile processor). As such, generative synthesis can be a powerful, generalized approach for accelerating and improving the building of deep neural networks for on-device edge scenarios.
Clustering with incomplete views is a challenge in multi-view clustering. In this paper, we provide a novel and simple method to address this issue. Specifically, the proposed method simultaneously exploits the local information of each view and the complementary information among views to learn the common latent representation for all samples, which can greatly improve the compactness and discriminability of the obtained representation. Compared with the conventional graph embedding methods, the proposed method does not introduce any extra regularization term and corresponding penalty parameter to preserve the local structure of data, and thus does not increase the burden of extra parameter selection. By imposing the orthogonal constraint on the basis matrix of each view, the proposed method is able to handle out-of-sample data. Moreover, the proposed method can be viewed as a unified framework for multi-view learning since it can handle both incomplete and complete multi-view clustering and classification tasks. Extensive experiments conducted on several multi-view datasets show that the proposed method can significantly improve the clustering performance.
Classic supervised learning makes the closed-world assumption, meaning that classes seen in testing must have been seen in training. However, in the dynamic world, new or unseen class examples may appear constantly. A model working in such an environment must be able to reject unseen classes (not seen or used in training). If enough data is collected for the unseen classes, the system should incrementally learn to accept/classify them. This learning paradigm is called open-world learning (OWL). Existing OWL methods all need some form of re-training to accept or include the new classes in the overall model. In this paper, we propose a meta-learning approach to the problem. Its key novelty is that it only needs to train a meta-classifier, which can then continually accept new classes when they have enough labeled data for the meta-classifier to use, and also detect/reject future unseen classes. No re-training of the meta-classifier or a new overall classifier covering all old and new classes is needed. In testing, the method only uses the examples of the seen classes (including the newly added classes) on-the-fly for classification and rejection. Experimental results demonstrate the effectiveness of the new approach.
In this paper, we propose a random projection approach to estimate variance in kernel ridge regression. Our approach leads to a consistent estimator of the true variance, while being computationally more efficient. Our variance estimator is optimal for a large family of kernels, including cubic splines and Gaussian kernels. Simulation analysis is conducted to support our theory.
Most social network sites allow users to reshare a piece of information posted by a user. As time progresses, the cascade of reshares grows, eventually saturating after a certain time period. While previous studies have focused heavily on one aspect of the cascade phenomenon, specifically predicting when the cascade would go viral, in this paper, we take a more holistic approach by analyzing the occurrence of two events within the cascade lifecycle – the period of maximum growth in terms of surge in reshares and the period where the cascade starts declining in adoption. We address the challenges in identifying these periods and then proceed to make a comparative analysis of these periods from the perspective of network topology. We study the effect of several node-centric structural measures on the reshare responses using Granger causality which helps us quantify the significance of the network measures and understand the extent to which the network topology impacts the growth dynamics. This evaluation is performed on a dataset of 7407 cascades extracted from the Weibo social network. Using our causality framework, we found that an entropy measure based on nodal degree causally affects the occurrence of these events in 93.95% of cascades. Surprisingly, this outperformed clustering coefficient and PageRank which we hypothesized would be more indicative of the growth dynamics based on earlier studies. We also extend the Granger-causality Vector Autoregression (VAR) model to forecast the times at which the events occur in the cascade lifecycle.
In this Letter, we propose a quantum machine learning scheme for the classification of classical nonlinear data. The main ingredients of our method are the variational quantum perceptron (VQP) and a quantum generalization of classical ensemble learning. Our VQP employs parameterized quantum circuits to learn a Grover search (or amplitude amplification) operation with classical optimization, and can achieve quadratic speedup in query complexity compared to its classical counterparts. We show how the trained VQP can be used to predict future data with $O(1)$ query complexity. Ultimately, a stronger nonlinear classifier, the so-called quantum ensemble learning (QEL), can be established by combining a set of weak VQPs produced using a subsampling method. The subsampling method has two significant advantages. First, all $T$ weak VQPs employed in QEL can be trained in parallel; therefore, the query complexity of QEL is equal to that of each weak VQP multiplied by $T$. Second, it dramatically reduces the runtime complexity of encoding circuits that map classical data to a quantum state because this dataset can be significantly smaller than the original dataset given to QEL. This arguably provides a most satisfactory solution to one of the most criticized issues in quantum machine learning proposals. To conclude, we perform two numerical experiments for our VQP and QEL, implemented in Python with the pyQuil library. Our experiments show that excellent performance can be achieved using a very small quantum circuit size that is implementable under current quantum hardware development. Specifically, given a nonlinear synthetic dataset with $4$ features for each example, the trained QEL can classify the test examples that are sampled away from the decision boundaries using $146$ single- and two-qubit quantum gates with $92\%$ accuracy.
Autonomous AI systems will be entering human society in the near future to provide services and work alongside humans. For those systems to be accepted and trusted, the users should be able to understand the reasoning process of the system, i.e. the system should be transparent. System transparency enables humans to form coherent explanations of the system’s decisions and actions. Transparency is important not only for user trust, but also for software debugging and certification. In recent years, Deep Neural Networks have made great advances in multiple application areas. However, deep neural networks are opaque. In this paper, we report on work in transparency in Deep Reinforcement Learning Networks (DRLN). Such networks have been extremely successful in accurately learning action control in image input domains, such as Atari games. In this paper, we propose a novel and general method that (a) incorporates explicit object recognition processing into deep reinforcement learning models, (b) forms the basis for the development of ‘object saliency maps’, to provide visualization of internal states of DRLNs, thus enabling the formation of explanations and (c) can be incorporated in any existing deep reinforcement learning framework. We present computational results and human experiments to evaluate our approach.
In this paper we develop methodology for testing relevant hypotheses in a tuning-free way. Our main focus is on functional time series, but extensions to other settings are also discussed. Instead of testing for exact equality, for example for the equality of two mean functions from two independent time series, we propose to test a relevant deviation under the null hypothesis. In the two sample problem this means that an $L^2$-distance between the two mean functions is smaller than a pre-specified threshold. For such hypotheses self-normalization, which was introduced by Shao (2010) and Shao and Zhang (2010) and is commonly used to avoid the estimation of nuisance parameters, is not directly applicable. We develop new self-normalized procedures for testing relevant hypotheses in the one sample, two sample and change point problem and investigate their asymptotic properties. Finite sample properties of the proposed tests are illustrated by means of a simulation study and a data example.
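For concreteness, the two-sample relevant hypotheses described above can be written as follows. The symbols $\mu_1$, $\mu_2$ (the two mean functions), $\Delta$ (the pre-specified threshold), the use of the squared distance and the domain $[0,1]$ are notational assumptions made here for illustration.

```latex
\[
  H_0:\ \int_0^1 \bigl(\mu_1(t) - \mu_2(t)\bigr)^2 \, dt \le \Delta
  \qquad \text{versus} \qquad
  H_1:\ \int_0^1 \bigl(\mu_1(t) - \mu_2(t)\bigr)^2 \, dt > \Delta .
\]
```

Setting $\Delta = 0$ would recover the classical hypothesis of exact equality of the two mean functions.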
The serverless scheduling problem poses a new challenge to Cloud service platform providers because it is a job scheduling problem rather than a traditional resource allocation or request load balancing problem. Traditionally, elastic cloud applications use managed virtual resource allocation and employ request load balancers to orchestrate the deployment. With serverless, the provider needs to solve both the load balancing and the allocation. This work reviews the current Apache OpenWhisk serverless event load balancing and a noncooperative game-theoretic load balancing approach for response time minimization in distributed systems. It is shown by simulation that neither performs well under high system utilization, which inspired a noncooperative online allocation heuristic that allows tuning the trade-off between response time and resource cost of each serverless function.
We study in this paper how to initialize the parameters of multinomial logistic regression (a fully connected layer followed by softmax and cross-entropy loss), which is widely used in deep neural network (DNN) models for classification problems. As logistic regression is widely known not to have a closed-form solution, it is usually randomly initialized, leading to several deficiencies, especially in transfer learning where all the layers except for the last task-specific layer are initialized using a pre-trained model. The deficiencies include slow convergence speed, the possibility of getting stuck in a local minimum, and the risk of over-fitting. To address those deficiencies, we first study the properties of logistic regression and propose a closed-form approximate solution named regularized Gaussian classifier (RGC). Then we adopt this approximate solution to initialize the task-specific linear layer and demonstrate superior performance over random initialization in terms of both accuracy and convergence speed on various tasks and datasets. For example, for image classification, our approach can reduce the training time by 10 times and achieve 3.2% gain in accuracy for Flickr-style classification. For object detection, our approach can also be 10 times faster in training for the same accuracy, or 5% better in terms of mAP for VOC 2007 with slightly longer training.
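To make the idea of a closed-form initialization concrete, the sketch below derives linear-layer weights and biases from a Gaussian class model with a shared, diagonal, ridge-regularized covariance. This is a simplified stand-in chosen for illustration and is not claimed to be the authors' RGC; the function names and the `ridge` parameter are assumptions.

```python
import math
from collections import defaultdict


def gaussian_init(features, labels, ridge=1e-3):
    """Closed-form weights/biases from class means and a shared diagonal variance.

    Assumed simplification: per-dimension pooled within-class variance plus
    a ridge term, instead of a full covariance matrix.
    """
    dim = len(features[0])
    n = len(features)
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append(x)
    # Per-class mean vector.
    means = {k: [sum(col) / len(rows) for col in zip(*rows)]
             for k, rows in by_class.items()}
    # Pooled within-class variance per dimension, regularized by `ridge`.
    var = [0.0] * dim
    for x, y in zip(features, labels):
        for j in range(dim):
            var[j] += (x[j] - means[y][j]) ** 2
    var = [v / n + ridge for v in var]
    weights, biases = {}, {}
    for k, mu in means.items():
        weights[k] = [mu[j] / var[j] for j in range(dim)]
        prior = len(by_class[k]) / n
        biases[k] = (-0.5 * sum(mu[j] ** 2 / var[j] for j in range(dim))
                     + math.log(prior))
    return weights, biases


def predict(weights, biases, x):
    """Return the class with the largest linear score w_k . x + b_k."""
    return max(weights, key=lambda k: sum(w * xi for w, xi in zip(weights[k], x))
               + biases[k])


# Toy features standing in for the outputs of a pre-trained backbone.
feats = [[0.0, 0.1], [0.2, -0.1], [3.0, 3.1], [2.8, 2.9]]
labs = [0, 0, 1, 1]
W, b = gaussian_init(feats, labs)
```

The resulting `W` and `b` could be copied into the final linear layer before fine-tuning, in place of random values.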
This paper considers inference in heteroskedastic linear regression models with many control variables. The slope coefficients on these variables are nuisance parameters. Our setting allows their number to grow with the sample size, possibly at the same rate, in which case they are not consistently estimable. A prime example of this setting are models with many (possibly multi-way) fixed effects. The presence of many nuisance parameters introduces an incidental-parameter problem in the usual heteroskedasticity-robust estimators of the covariance matrix, rendering them biased and inconsistent. Hence, tests based on these estimators are size distorted even in large samples. An alternative covariance-matrix estimator that is conditionally unbiased and remains consistent is presented and supporting simulation results are provided.
Path analysis is a special class of models in structural equation modeling (SEM) that describes causal relations among measured variables in the form of linear regression. This paper presents two estimation formulations for confirmatory and exploratory SEM in path analysis problems where a zero pattern of the estimated path coefficient matrix explains a causality structure of the variables. In confirmatory SEM, the original nonlinear equality constraints of model parameters are relaxed to an inequality, allowing us to transform the original problem into a convex problem. A regularized estimation formulation is proposed for exploratory SEM, where an l1-type penalty on the path coefficient matrix is added to the objective function. Under a condition on problem parameters, we show that our optimal solution is low rank and provides an estimate of the path matrix of the original problem. To solve our estimation problems in a convex framework, we apply the alternating direction method of multipliers (ADMM), which is shown to be suitable for a large-scale implementation. In combination with applying model selection criteria, the penalty parameter in the regularized estimation, controlling the density of nonzero entries in the path matrix, can be chosen to provide a reasonable trade-off between the model fitting and the complexity of causality structure. The performance of our approach is demonstrated on both simulated and real data sets and compared with existing methods. Real application results include learning causality among climate variables in Thailand where our findings can explain known relations among air pollutants and weather variables. The other experiment is to explore connectivities among brain regions using fMRI time series from ABIDE data sets where our results are interpreted to explain brain network differences in autism patients.
When modeling real world domains we have to deal with information that is incomplete or that comes from sources with different trust levels. This motivates the need for managing uncertainty in the Semantic Web. To this purpose, we introduced a probabilistic semantics, named DISPONTE, in order to combine description logics with probability theory. The probability of a query can be then computed from the set of its explanations by building a Binary Decision Diagram (BDD). The set of explanations can be found using the tableau algorithm, which has to handle non-determinism. Prolog, with its efficient handling of non-determinism, is suitable for implementing the tableau algorithm. TRILL and TRILLP are systems offering a Prolog implementation of the tableau algorithm. TRILLP builds a pinpointing formula, that compactly represents the set of explanations and can be directly translated into a BDD. Both reasoners were shown to outperform state-of-the-art DL reasoners. In this paper, we present an improvement of TRILLP, named TORNADO, in which the BDD is directly built during the construction of the tableau, further speeding up the overall inference process. An experimental comparison shows the effectiveness of TORNADO. All systems can be tried online in the TRILL on SWISH web application at http://…/.
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm that performs well on datasets whose clusters have a constant density of data points. One of the significant attributes of this algorithm is noise cancellation. However, DBSCAN performs worse on clusters with different densities. Therefore, in this paper, an adaptive DBSCAN is proposed that works well for identifying clusters with varying densities.
Automated social agents, or bots, are increasingly becoming a problem on social media platforms. There is a growing body of literature and multiple tools to aid in the detection of such agents on online social networking platforms. We propose that the social network topology of a user would be sufficient to determine whether the user is an automated agent or a human. To test this, we use a publicly available dataset containing users on Twitter labelled as either automated social agent or human. Using an unsupervised machine learning approach, we obtain a detection accuracy rate of 70%.
Learning to follow human instructions is a challenging task because while interpreting instructions requires discovering arbitrary algorithms, humans typically provide very few examples to learn from. For learning from this data to be possible, strong inductive biases are necessary. Work in the past has relied on hand-coded components or manually engineered features to provide such biases. In contrast, here we seek to establish whether this knowledge can be acquired automatically by a neural network system through a two phase training procedure: A (slow) offline learning stage where the network learns about the general structure of the task and a (fast) online adaptation phase where the network learns the language of a new given speaker. Controlled experiments show that when the network is exposed to familiar instructions but containing novel words, the model adapts very efficiently to the new vocabulary. Moreover, even for human speakers whose language usage can depart significantly from our artificial training language, our network can still make use of its automatically acquired inductive bias to learn to follow instructions more effectively.
Ranking is a fundamental and widely studied problem in scenarios such as search, advertising, and recommendation. However, joint optimization for multi-scenario ranking, which aims to improve the overall performance of several ranking strategies in different scenarios, remains largely unexplored. Separately optimizing each individual strategy has two limitations. The first is the lack of collaboration between scenarios, meaning that each strategy maximizes its own objective but ignores the goals of other strategies, leading to a sub-optimal overall performance. The second limitation is the inability to model the correlation between scenarios, meaning that independent optimization in one scenario only uses its own user data but ignores the context in other scenarios. In this paper, we formulate multi-scenario ranking as a fully cooperative, partially observable, multi-agent sequential decision problem. We propose a novel model named Multi-Agent Recurrent Deterministic Policy Gradient (MA-RDPG) which has a communication component for passing messages, several private actors (agents) for making actions for ranking, and a centralized critic for evaluating the overall performance of the co-working actors. Each scenario is treated as an agent (actor). Agents collaborate with each other by sharing a global action-value function (the critic) and passing messages that encode historical information across scenarios. The model is evaluated with online settings on a large E-commerce platform. Results show that the proposed model exhibits significant improvements against baselines in terms of the overall performance.
To compare and benchmark mathematical software, performance profiles have been introduced [1]. However, it has been shown that the method is not flawless. The main issue is that a performance profile ranks the solvers with respect to the best solver: if the best solver is excluded and the procedure is rerun on the remaining set of solvers, the method may rank the solvers in a different way. We characterize such systems of problems and solvers and propose an efficient and reliable algorithm to overcome this negative side effect. The proposed method is unbiased in comparing the solvers and is successful in detecting the top ones.
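The ranking issue described above can be reproduced with a small sketch of the standard performance-profile computation: for each solver, the fraction of problems whose cost is within a factor `tau` of the best cost among the compared solvers. The solver names and timings below are made up for illustration, and the sketch shows only the original profile, not the algorithm proposed in the paper.

```python
import math


def performance_profile(costs, tau):
    """Fraction of problems each solver handles within `tau` times the best cost.

    costs maps a solver name to its per-problem costs (math.inf marks a failure).
    """
    solvers = list(costs)
    n_problems = len(costs[solvers[0]])
    # Best cost per problem among the solvers being compared.
    best = [min(costs[s][p] for s in solvers) for p in range(n_problems)]
    profile = {}
    for s in solvers:
        within = sum(
            1 for p in range(n_problems)
            if math.isfinite(costs[s][p]) and costs[s][p] <= tau * best[p]
        )
        profile[s] = within / n_problems
    return profile


# Hypothetical timings for three solvers on two problems.
costs = {"A": [1.0, 1.0], "B": [2.0, 5.0], "C": [3.0, 2.2]}
with_best = performance_profile(costs, tau=2.0)
# Drop the best solver "A" and recompute on the remaining solvers.
without_best = performance_profile({s: costs[s] for s in ("B", "C")}, tau=2.0)
```

In this toy data, solver B ranks above C when A is included, but C ranks above B once A is removed, because the reference "best" cost for each problem changes.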
Generative adversarial networks (GANs) have achieved significant success in generating real-valued data. However, the discrete nature of text hinders the application of GAN to text-generation tasks. Instead of using the standard GAN objective, we propose to improve text-generation GAN via a novel approach inspired by optimal transport. Specifically, we consider matching the latent feature distributions of real and synthetic sentences using a novel metric, termed the feature-mover’s distance (FMD). This formulation leads to a highly discriminative critic and easy-to-optimize objective, overcoming the mode-collapsing and brittle-training problems in existing methods. Extensive experiments are conducted on a variety of tasks to evaluate the proposed model empirically, including unconditional text generation, style transfer from non-parallel text, and unsupervised cipher cracking. The proposed model yields superior performance, demonstrating wide applicability and effectiveness.
Tasks with complex temporal structures and long horizons pose a challenge for reinforcement learning agents due to the difficulty in specifying the tasks in terms of reward functions as well as large variances in the learning signals. We propose to address these problems by combining temporal logic (TL) with reinforcement learning from demonstrations. Our method automatically generates intrinsic rewards that align with the overall task goal given a TL task specification. The policy resulting from our framework has an interpretable and hierarchical structure. We validate the proposed method experimentally on a set of robotic manipulation tasks.
Network adaptation is essential for the efficient operation of Cloud-RANs. Unfortunately, it leads to highly intractable mixed-integer nonlinear programming problems. Existing solutions typically rely on convex relaxation, which yields performance gaps that are difficult to quantify. Meanwhile, global optimization algorithms such as branch-and-bound can find optimal solutions but with prohibitive computational complexity. In this paper, to obtain near-optimal solutions at affordable complexity, we propose to approximate the branch-and-bound algorithm via machine learning. Specifically, the pruning procedure in branch-and-bound is formulated as a sequential decision problem, followed by learning the oracle’s action via imitation learning. A unique advantage of this framework is that the training process only requires a small dataset, and it is scalable to problem instances with larger dimensions than the training setting. This is achieved by identifying and leveraging problem-size-independent features. Numerical simulations demonstrate that the learning-based framework significantly outperforms competing methods, with computational complexity much lower than the traditional branch-and-bound algorithm.