The potential lack of fairness in the outputs of machine learning algorithms has recently gained attention both within the research community and in society more broadly. Surprisingly, there is no prior work developing tree-induction algorithms for building fair decision trees or fair random forests. These methods have widespread popularity as they are among the few that are simultaneously interpretable, non-linear, and easy to use. In this paper we develop, to our knowledge, the first technique for the induction of fair decision trees. We show that our ‘Fair Forest’ retains the benefits of the tree-based approach, while providing both greater accuracy and fairness than other alternatives, for both ‘group fairness’ and ‘individual fairness.’ We also introduce new measures for fairness which are able to handle multinomial and continuous attributes as well as regression problems, as opposed to binary attributes and labels only. Finally, we demonstrate a new, more robust evaluation procedure for algorithms that considers the dataset in its entirety rather than only a specific protected attribute.
In this paper we study the rate of convergence for learning densities under the Generative Adversarial Networks (GANs) framework, borrowing insights from nonparametric statistics. We introduce an improved GAN estimator that achieves a faster rate by leveraging the level of smoothness in the target density and the evaluation metric, which in theory remedies the mode collapse problem reported in the literature. A minimax lower bound is constructed to show that when the dimension is large, the exponent in the rate for the new GAN estimator is near optimal. One can view our results as answering in a quantitative way how well GANs learn a wide range of densities with different smoothness properties, under a hierarchy of evaluation metrics. As a byproduct, we also obtain improved bounds for GANs with deeper ReLU discriminator networks.
Though deep neural networks have achieved huge success in recent studies and applications, they remain vulnerable to adversarial perturbations that are imperceptible to humans. To address this problem, we propose a novel network called ReabsNet to achieve high classification accuracy in the face of various attacks. The approach is to augment an existing classification network with a guardian network to detect if a sample is natural or has been adversarially perturbed. Critically, instead of simply rejecting adversarial examples, we revise them to get their true labels. We exploit the observation that a sample containing adversarial perturbations has a possibility of returning to its true class after revision. We demonstrate that our ReabsNet outperforms the state-of-the-art defense method under various adversarial attacks.
A classification algorithm, called the Linear Centralization Classifier (LCC), is introduced. The algorithm seeks to find a transformation that best maps instances from the feature space to a space where they concentrate towards the center of their own classes, while maximizing the distance between class centers. We formulate the classifier as a quadratic program with quadratic constraints. We then simplify this formulation to a linear program that can be solved effectively using a linear programming solver (e.g., simplex-dual). We extend the formulation for LCC to enable the use of kernel functions for non-linear classification applications. We compare our method with two standard classification methods (support vector machine and linear discriminant analysis) and four state-of-the-art classification methods when they are applied to eight standard classification datasets. Our experimental results show that LCC is able to classify instances more accurately (based on the area under the receiver operating characteristic) than the other tested methods on the chosen datasets. We also report the results for LCC with a particular kernel on synthetic non-linear classification problems.
We present a framework combining hierarchical and multi-agent deep reinforcement learning approaches to solve coordination problems among a multitude of agents using a semi-decentralized model. The framework extends the multi-agent learning setup by introducing a meta-controller that guides the communication between agent pairs, enabling agents to focus on communicating with only one other agent at any step. This hierarchical decomposition of the task allows for efficient exploration to learn policies that identify globally optimal solutions even as the number of collaborating agents increases. We show promising initial experimental results on a simulated distributed scheduling problem.
Despite the tremendous achievements of deep convolutional neural networks (CNNs) in most computer vision tasks, understanding how they actually work remains a significant challenge. In this paper, we propose a novel two-step visualization method that aims to shed light on how deep CNNs recognize images and the objects therein. We start out with a layer-wise relevance propagation (LRP) step which estimates a pixel-wise relevance map over the input image. We then construct a context-aware saliency map from the LRP-generated map which predicts regions close to the foci of attention. We show that our algorithm clearly and concisely identifies the key pixels that contribute to the underlying neural network’s comprehension of images. Experimental results using the ILSVRC2012 validation dataset in conjunction with two well-established deep CNNs demonstrate that combining LRP with visual salience estimation can give great insight into how a CNN model perceives and understands a presented scene, in relation to what it has learned in the prior training phase.
The 2017 Grand Challenge focused on the problem of automatic detection of anomalies for manufacturing equipment. This paper reports the technical details of a solution focused on particular optimizations of the processing stages. These included customized input parsing, fine tuning of a k-means clustering algorithm and probability analysis using a lazy flavor of a Markov chain. We have observed in our custom implementation that carefully tweaking these processing stages at single node level by leveraging various data stream characteristics can yield good performance results. We start the paper with several observations concerning the input data stream, follow with a description of our solution and its particular optimizations, and conclude with an evaluation and a discussion of the obtained results.
Artificial neural networks are a particular class of learning systems modeled after biological neural functions, with an interesting penchant for Hebbian learning, that is, ‘neurons that fire together, wire together’. However, unlike their natural counterparts, artificial neural networks have a close and stringent coupling between the modules of neurons in the network. This coupling, or locking, imposes upon the network a strict and inflexible structure that prevents layers in the network from updating their weights until a full feed-forward and backward pass has occurred. Such a constraint, though it may have sufficed for a while, is no longer feasible in the era of very-large-scale machine learning, coupled with the increased desire for parallelization of the learning process across multiple computing infrastructures. To solve this problem, synthetic gradients (SG) with decoupled neural interfaces (DNI) are introduced as a viable alternative to the backpropagation algorithm. This paper performs a benchmark to compare the speed and accuracy of SG-DNI against a standard neural interface using a multilayer perceptron (MLP). SG-DNI shows good promise in that it not only captures the learning problem but is also over 3-fold faster due to its asynchronous learning capabilities.
An extremely simple description of Karmarkar’s algorithm, using very few technical terms, is given.
We present our winning solution for the WSDM Cup 2017 triple scoring task. We devise an ensemble of four base scorers, so as to leverage the power of both text and knowledge bases for that task. Then we further refine the outputs of the ensemble by trigger word detection, achieving even better predictive accuracy. The code is available at https://…/bokchoy.
In recent years, the graph partitioning problem has gained importance as a mandatory preprocessing step for distributed graph processing on very large graphs. Existing graph partitioning algorithms minimize partitioning latency by assigning individual graph edges to partitions in a streaming manner, at the cost of reduced partitioning quality. However, we argue that the mere minimization of partitioning latency is not the optimal design choice in terms of minimizing total graph analysis latency, i.e., the sum of partitioning and processing latency. Instead, for complex and long-running graph processing algorithms that run on very large graphs, it is beneficial to invest more time into graph partitioning to reach a higher partitioning quality, which drastically reduces graph processing latency. In this paper, we propose ADWISE, a novel window-based streaming partitioning algorithm that increases the partitioning quality by always choosing the best edge from a set of edges for assignment to a partition. In doing so, ADWISE controls the partitioning latency by adapting the window size dynamically at run-time. Our evaluations show that ADWISE can reach the sweet spot between graph partitioning latency and graph processing latency, reducing the total latency of partitioning plus processing by up to 23-47 percent compared to the state-of-the-art.
In this paper, we study static output feedback selection in linear time-invariant structured systems. We assume that the inputs and the outputs are dedicated, i.e., each input actuates a single state and each output senses a single state. Given a structured system with dedicated inputs and outputs and a cost matrix that denotes the cost of each feedback connection, our aim is to select an optimal set of feedback connections such that the closed-loop system satisfies arbitrary pole-placement. This problem is referred to as the optimal feedback selection problem. We first prove the NP-hardness of the problem using a reduction from a well-known NP-hard problem, the weighted set cover problem. In addition, we prove that the optimal feedback selection problem is inapproximable below a constant multiple of log(n), where n denotes the system dimension. We then propose an algorithm to find an approximate solution to the optimal feedback selection problem. The proposed algorithm consists of a potential function incorporated with a greedy scheme and attains a solution with a guaranteed approximation ratio.
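The weighted set cover problem used in the reduction above can be made concrete with the textbook greedy heuristic, which repeatedly picks the subset with the lowest cost per newly covered element. This is only a sketch of that classical heuristic, not the potential-function algorithm proposed in the paper; the function name and the example instance are illustrative assumptions.

```python
def greedy_weighted_set_cover(universe, subsets, costs):
    """Textbook greedy heuristic for weighted set cover (illustrative only).

    universe: iterable of elements to cover
    subsets:  list of sets
    costs:    list of positive costs, one per subset
    Returns the indices of the chosen subsets, in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_index = None
        best_ratio = float("inf")
        for index, subset in enumerate(subsets):
            gain = len(uncovered & subset)
            if gain == 0:
                continue  # this subset covers nothing new
            ratio = costs[index] / gain  # cost per newly covered element
            if ratio < best_ratio:
                best_index, best_ratio = index, ratio
        if best_index is None:
            raise ValueError("the given subsets cannot cover the universe")
        chosen.append(best_index)
        uncovered -= subsets[best_index]
    return chosen


# Hypothetical instance: the expensive full set is never worth picking.
example_subsets = [{1, 2, 3}, {3, 4}, {4, 5}, {1, 2, 3, 4, 5}]
example_costs = [1.0, 1.0, 1.0, 5.0]
picked = greedy_weighted_set_cover({1, 2, 3, 4, 5}, example_subsets, example_costs)
```

The greedy rule is known to achieve a logarithmic approximation ratio for set cover, which is consistent with the logarithmic inapproximability threshold stated in the abstract.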
In the context of post-hoc interpretability, this paper addresses the task of explaining the prediction of a classifier, considering the case where no information is available, neither on the classifier itself, nor on the processed data (neither the training nor the test data). It proposes an instance-based approach whose principle consists in determining the minimal changes needed to alter a prediction: given a data point whose classification must be explained, the proposed method consists in identifying a close neighbour classified differently, where the closeness definition integrates a sparsity constraint. This principle is implemented using observation generation in the Growing Spheres algorithm. Experimental results on two datasets illustrate the relevance of the proposed approach that can be used to gain knowledge about the classifier.
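The core idea of searching for a close, differently classified neighbour can be sketched as follows. This is a simplified illustration under stated assumptions, not the Growing Spheres algorithm itself: it samples candidates in successively larger shells around the point and omits the sparsity constraint mentioned above. The function name, the parameters (`step`, `samples_per_shell`, `max_radius`) and the toy classifier are all hypothetical.

```python
import math
import random


def find_close_counterexample(predict, x, step=0.1, samples_per_shell=200,
                              max_radius=10.0, seed=0):
    """Search expanding shells around x for a point that predict() labels differently.

    Only black-box access to predict() is needed, matching the setting where
    neither the classifier internals nor the data are available.
    Returns the closest differing candidate found, or None.
    """
    rng = random.Random(seed)
    original_label = predict(x)
    inner = 0.0
    while inner < max_radius:
        outer = inner + step
        best, best_radius = None, float("inf")
        for _ in range(samples_per_shell):
            # Random direction from a normalized Gaussian vector.
            direction = [rng.gauss(0.0, 1.0) for _ in x]
            norm = math.sqrt(sum(v * v for v in direction)) or 1.0
            # Radius drawn uniformly in [inner, outer]; not uniform in volume.
            radius = rng.uniform(inner, outer)
            candidate = [xi + radius * v / norm for xi, v in zip(x, direction)]
            if predict(candidate) != original_label and radius < best_radius:
                best, best_radius = candidate, radius
        if best is not None:
            return best
        inner = outer  # nothing found: grow the search region
    return None


# Toy black-box classifier (assumption): class 1 whenever the first feature exceeds 1.
toy_predict = lambda p: int(p[0] > 1.0)
point = [0.0, 0.0]
neighbour = find_close_counterexample(toy_predict, point)
```

The difference between `neighbour` and `point` then serves as the explanation: it indicates a small change that would flip the prediction.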
We introduce a simple algorithm, True Asymptotic Natural Gradient Optimization (TANGO), that converges to a true natural gradient descent in the limit of small learning rates, without explicit Fisher matrix estimation. For quadratic models the algorithm is also an instance of averaged stochastic gradient, where the parameter is a moving average of a ‘fast’, constant-rate gradient descent. TANGO appears as a particular de-linearization of averaged SGD, and is sometimes quite different on non-quadratic models. This further connects averaged SGD and natural gradient, both of which are arguably optimal asymptotically. In large dimension, small learning rates will be required to approximate the natural gradient well. Still, this shows it is possible to get arbitrarily close to exact natural gradient descent with a lightweight algorithm.
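The averaging structure mentioned above, where the parameter is a moving average of a ‘fast’ constant-rate gradient descent, can be sketched in a few lines. This is not TANGO itself, only the generic averaged-iterate scheme it is related to on quadratic models; the function name, the rates, and the use of a deterministic gradient (chosen so the run is reproducible) are assumptions made for illustration.

```python
def averaged_gradient_descent(grad, theta0, fast_rate, averaging_rate, steps):
    """Run constant-rate gradient descent and keep a moving average of the iterates.

    grad: function returning the gradient at a scalar parameter value
    Returns (fast_iterate, averaged_parameter).
    """
    fast = theta0
    slow = theta0
    for _ in range(steps):
        # 'Fast' iterate: plain gradient step with a constant learning rate.
        fast = fast - fast_rate * grad(fast)
        # 'Slow' parameter: exponential moving average of the fast iterate.
        slow = (1.0 - averaging_rate) * slow + averaging_rate * fast
    return fast, slow


# Quadratic model f(x) = 0.5 * (x - 3)^2, whose gradient is x - 3.
fast_value, averaged_value = averaged_gradient_descent(
    grad=lambda x: x - 3.0, theta0=0.0, fast_rate=0.1, averaging_rate=0.05, steps=500
)
```

With a stochastic gradient in place of the deterministic one, the fast iterate would keep fluctuating around the minimum while the averaged parameter smooths out that noise.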
We consider a nonlinear Fourier transform (NFT)-based transmission scheme, where data is embedded into the imaginary part of the nonlinear discrete spectrum. Inspired by probabilistic amplitude shaping, we propose a probabilistic eigenvalue shaping (PES) scheme as a means to increase the data rate of the system. We exploit the fact that for an NFT-based transmission scheme the pulses in the time domain are of unequal duration by transmitting them with a dynamic symbol interval and find a capacity-achieving distribution. The PES scheme shapes the information symbols according to the capacity-achieving distribution and transmits them together with the parity symbols at the output of a low-density parity-check encoder, suitably modulated, via time-sharing. We furthermore derive an achievable rate for the proposed PES scheme. We verify our results with simulations of the discrete-time model as well as with split-step Fourier simulations.
The diversification (generating slightly varying separating discriminators) of Support Vector Machines (SVMs) for boosting has proven to be a challenge due to the strong learning nature of SVMs. Based on the insight that perturbing the SVM kernel may help in diversifying SVMs, we propose two kernel perturbation based boosting schemes where the kernel is modified in each round so as to increase the resolution of the kernel-induced Riemannian metric in the vicinity of the datapoints misclassified in the previous round. We propose a method for identifying the disjuncts in a dataset, dispelling the dependence on rule-based learning methods for identifying the disjuncts. We also present a new performance measure called Geometric Small Disjunct Index (GSDI) to quantify the performance on small disjuncts for balanced as well as class imbalanced datasets. Experimental comparison with a variety of state-of-the-art algorithms is carried out using the best classifiers of each type selected by a new approach inspired by multi-criteria decision making. The proposed method is found to outperform the contending state-of-the-art methods on different datasets (ranging from mildly imbalanced to highly imbalanced and characterized by varying number of disjuncts) in terms of three different performance indices (including the proposed GSDI).
The problem of private data disclosure is studied from an information theoretic perspective. Considering a pair of correlated random variables $(X,Y)$, where $Y$ denotes the observed data while $X$ denotes the private latent variables, the following problem is addressed: What is the maximum information that can be revealed about $Y$, while disclosing no information about $X$? Assuming that a Markov kernel maps $Y$ to the revealed information $U$, it is shown that the maximum mutual information between $Y$ and $U$, i.e., $I(Y;U)$, can be obtained as the solution of a standard linear program, when $X$ and $U$ are required to be independent, called \textit{perfect privacy}. This solution is shown to be greater than or equal to the \textit{non-private information about $X$ carried by $Y$.} Maximal information disclosure under perfect privacy is shown to be the solution of a linear program also when the utility is measured by the reduction in the mean square error, $\mathbb{E}[(Y-U)^2]$, or the probability of error, $\mbox{Pr}\{Y\neq U\}$. For jointly Gaussian $(X,Y)$, it is shown that perfect privacy is not possible if the kernel is applied to only $Y$; whereas perfect privacy can be achieved if the mapping is from both $X$ and $Y$; that is, if the private latent variables can also be observed at the encoder. Next, measuring the utility and privacy by $I(Y;U)$ and $I(X;U)$, respectively, the slope of the optimal utility-privacy trade-off curve is studied when $I(X;U)=0$. Finally, through a similar but independent analysis, an alternative characterization of the maximal correlation between two random variables is provided.
Nowadays, events usually burst and are propagated online through multiple modern media such as social networks and search engines. Various studies discuss event dissemination trends on individual media, while few focus on event popularity analysis from a cross-platform perspective. Challenges come from the vast diversity of events and media, limited access to aligned datasets across different media, and a great deal of noise in the datasets. In this paper, we design DancingLines, an innovative scheme that captures and quantitatively analyzes event popularity between pairwise text media. It contains two models: TF-SW, a semantic-aware popularity quantification model, based on an integrated weight coefficient leveraging Word2Vec and TextRank; and wDTW-CD, a pairwise event popularity time series alignment model matching different event phases adapted from Dynamic Time Warping. We also propose three metrics to interpret event popularity trends between pairwise social platforms. Experimental results on eighteen real-world event datasets from an influential social network and a popular search engine validate the effectiveness and applicability of our scheme. DancingLines is demonstrated to have broad application potential for discovering knowledge about various aspects of events and different media.
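Since the alignment model above is adapted from Dynamic Time Warping, a minimal sketch of classical DTW may help fix ideas. This is the standard dynamic program with an absolute-difference local cost, not the wDTW-CD model of the paper; the function name and the example series are illustrative assumptions.

```python
def dtw_distance(series_a, series_b):
    """Classical Dynamic Time Warping distance between two numeric sequences.

    cost[i][j] holds the minimal cumulative cost of aligning the first i
    points of series_a with the first j points of series_b.
    """
    inf = float("inf")
    n, m = len(series_a), len(series_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(series_a[i - 1] - series_b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Two hypothetical popularity curves with the same shape but different timing.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted_level = dtw_distance([0, 0], [1, 1])
```

Here the first pair aligns perfectly despite the differing lengths, because DTW may match one point to several, while the second pair differs in level at every aligned position.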
We present an overview of scalable load balancing algorithms which provide favorable delay performance in large-scale systems, and yet only require minimal implementation overhead. Aimed at a broad audience, the paper starts with an introduction to the basic load balancing scenario, consisting of a single dispatcher where tasks arrive that must immediately be forwarded to one of $N$ single-server queues. A popular class of load balancing algorithms are so-called power-of-$d$ or JSQ($d$) policies, where an incoming task is assigned to a server with the shortest queue among $d$ servers selected uniformly at random. This class includes the Join-the-Shortest-Queue (JSQ) policy as a special case ($d = N$), which has strong stochastic optimality properties and yields a mean waiting time that vanishes as $N$ grows large for any fixed subcritical load. However, a nominal implementation of the JSQ policy involves a prohibitive communication burden in large-scale deployments. In contrast, a random assignment policy ($d = 1$) does not entail any communication overhead, but the mean waiting time remains constant as $N$ grows large for any fixed positive load. In order to examine the fundamental trade-off between performance and implementation overhead, we consider an asymptotic regime where $d(N)$ depends on $N$. We investigate what growth rate of $d(N)$ is required to match the performance of the JSQ policy on fluid and diffusion scale. The results demonstrate that the asymptotics for the JSQ($d(N)$) policy are insensitive to the exact growth rate of $d(N)$, as long as the latter is sufficiently fast, implying that the optimality of the JSQ policy can asymptotically be preserved while dramatically reducing the communication overhead. We additionally show how the communication overhead can be reduced yet further by the so-called Join-the-Idle-Queue scheme, leveraging memory at the dispatcher.
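The JSQ($d$) dispatch rule described above can be sketched as follows. This is a toy illustration, not a model from the paper: it assumes the $d$ servers are sampled without replacement, and the helper `assign_tasks` ignores departures entirely, so it only shows how tasks spread over queues. All function and parameter names are hypothetical.

```python
import random


def jsq_d(queue_lengths, d, rng):
    """Return the index of the shortest queue among d servers sampled uniformly at random.

    d = len(queue_lengths) gives Join-the-Shortest-Queue; d = 1 gives random assignment.
    """
    candidates = rng.sample(range(len(queue_lengths)), d)
    return min(candidates, key=lambda index: queue_lengths[index])


def assign_tasks(n_servers, n_tasks, d, seed=0):
    """Dispatch n_tasks one by one with JSQ(d), with no service completions (toy setting)."""
    rng = random.Random(seed)
    queues = [0] * n_servers
    for _ in range(n_tasks):
        queues[jsq_d(queues, d, rng)] += 1
    return queues


# Compare full JSQ (d = N) with the power-of-two rule (d = 2) on 10 servers.
full_jsq = assign_tasks(n_servers=10, n_tasks=95, d=10)
power_of_two = assign_tasks(n_servers=10, n_tasks=95, d=2)
```

Each dispatch inspects only `d` queue lengths, which is the communication overhead that the trade-off in the abstract refers to.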
In the longest common substring problem we are given two strings of length $n$ and must find a substring of maximal length that occurs in both strings. It is well known that the problem can be solved in linear time, but the solution is not robust and can vary greatly when the input strings are changed even by one letter. To circumvent this, Leimeister and Morgenstern introduced the problem of the longest common substring with $k$ mismatches. Lately, this problem has received a lot of attention in the literature. In this paper we first show a conditional lower bound based on the Strong Exponential Time Hypothesis (SETH), implying that there is little hope of improving existing solutions. We then introduce a new but closely related problem of the longest common substring with approximately $k$ mismatches and use computational geometry techniques to show that it admits a solution with strongly subquadratic running time. We also apply these results to obtain a strongly subquadratic approximation algorithm for the longest common substring with $k$ mismatches problem and show conditional hardness of improving its approximation ratio.
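To make the problem statement concrete, here is a simple quadratic-time baseline for the longest common substring with $k$ mismatches. It is not one of the algorithms from the paper: it slides a two-pointer window along every diagonal alignment of the two strings, and the function name is an illustrative assumption.

```python
def lcs_k_mismatches(s, t, k):
    """Length of the longest substrings s[i:i+L], t[j:j+L] differing in at most k positions.

    Runs in O(len(s) * len(t)) time by scanning each diagonal alignment once.
    """
    best = 0
    n, m = len(s), len(t)
    for shift in range(-(m - 1), n):
        # Starting positions of this diagonal in s and t.
        i, j = (shift, 0) if shift >= 0 else (0, -shift)
        length = min(n - i, m - j)
        left = 0
        mismatches = 0
        for right in range(length):
            if s[i + right] != t[j + right]:
                mismatches += 1
            # Shrink the window from the left until it has at most k mismatches.
            while mismatches > k:
                if s[i + left] != t[j + left]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best


exact = lcs_k_mismatches("abcde", "xbcdy", 0)      # plain longest common substring
one_error = lcs_k_mismatches("abcde", "xbcdy", 1)  # allow a single mismatch
```

Setting `k = 0` recovers the classical longest common substring problem.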
This work investigates training Conditional Random Fields (CRF) by Stochastic Dual Coordinate Ascent (SDCA). SDCA enjoys a linear convergence rate and a strong empirical performance for independent classification problems. However, it has never been used to train CRF. Yet it benefits from an exact line search with a single marginalization oracle call, unlike previous approaches. In this paper, we adapt SDCA to train CRF and we enhance it with an adaptive non-uniform sampling strategy. Our preliminary experiments suggest that this method matches state-of-the-art CRF optimization techniques.
We investigate a series of learning kernel problems with polynomial combinations of base kernels, which will help us solve regression and classification problems. We also perform some numerical experiments of polynomial kernels with regression and classification tasks on different datasets.
In a physical neural system, learning rules must be local both in space and time. In order for learning to occur, non-local information must be communicated to the deep synapses through a communication channel, the deep learning channel. We identify several possible architectures for this learning channel (Bidirectional, Conjoined, Twin, Distinct) and six symmetry challenges: 1) symmetry of architectures; 2) symmetry of weights; 3) symmetry of neurons; 4) symmetry of derivatives; 5) symmetry of processing; and 6) symmetry of learning rules. Random backpropagation (RBP) addresses the second and third symmetry, and some of its variations, such as skipped RBP (SRBP) address the first and the fourth symmetry. Here we address the last two desirable symmetries showing through simulations that they can be achieved and that the learning channel is particularly robust to symmetry variations. Specifically, random backpropagation and its variations can be performed with the same non-linear neurons used in the main input-output forward channel, and the connections in the learning channel can be adapted using the same algorithm used in the forward channel, removing the need for any specialized hardware in the learning channel. Finally, we provide mathematical results in simple cases showing that the learning equations in the forward and backward channels converge to fixed points, for almost any initial conditions. In symmetric architectures, if the weights in both channels are small at initialization, adaptation in both channels leads to weights that are essentially symmetric during and after learning. Biological connections are discussed.