We consider the problem of distributed statistical machine learning in adversarial settings, where some unknown and time-varying subset of working machines may be compromised and behave arbitrarily to prevent an accurate model from being learned. This setting captures the potential adversarial attacks faced by Federated Learning — a modern machine learning paradigm that is proposed by Google researchers and has been intensively studied for ensuring user privacy. Formally, we focus on a distributed system consisting of a parameter server and $m$ working machines. Each working machine keeps $N/m$ data samples, where $N$ is the total number of samples. The goal is to collectively learn the underlying true model parameter of dimension $d$. In classical batch gradient descent methods, the gradients reported to the server by the working machines are aggregated via simple averaging, which is vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine gradient descent method based on the geometric median of means of the gradients. We show that our method can tolerate $q \le (m-1)/2$ Byzantine failures, and the parameter estimate converges in $O(\log N)$ rounds with an estimation error of $\sqrt{d(2q+1)/N}$, hence approaching the optimal error rate $\sqrt{d/N}$ in the centralized and failure-free setting. The total computational complexity of our algorithm is of $O((Nd/m) \log N)$ at each working machine and $O(md + kd \log^3 N)$ at the central server, and the total communication cost is of $O(m d \log N)$. We further provide an application of our general results to the linear regression problem. A key challenge arises in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. We prove that the aggregated gradient converges uniformly to the true gradient function.
Learning paradigms involving varying levels of supervision have received a lot of interest within the computer vision and machine learning communities. The supervisory information is typically considered to come from a human supervisor — a ‘teacher’ figure. In this paper, we consider an alternate source of supervision — a ‘peer’ — i.e. a different machine. We introduce cooperative learning, where two agents trying to learn the same visual concepts, but in potentially different environments using different sources of data (sensors), communicate their current knowledge of these concepts to each other. Given the distinct sources of data in both agents, the mode of communication between the two agents is not obvious. We propose the use of visual attributes — semantic mid-level visual properties such as furry, wooden, etc.– as the mode of communication between the agents. Our experiments in three domains — objects, scenes, and animals — demonstrate that our proposed cooperative learning approach improves the performance of both agents as compared to their performance if they were to learn in isolation. Our approach is particularly applicable in scenarios where privacy, security and/or bandwidth constraints restrict the amount and type of information the two agents can exchange.
Expectation for the emergence of higher functions is getting larger in the framework of end-to-end reinforcement learning using a recurrent neural network. However, the emergence of ‘thinking’ that is a typical higher function is difficult to realize because ‘thinking’ needs non fixed-point, flow-type attractors with both convergence and transition dynamics. Furthermore, in order to introduce ‘inspiration’ or ‘discovery’ in ‘thinking’, not completely random but unexpected transition should be also required. By analogy to ‘chaotic itinerancy’, we have hypothesized that ‘exploration’ grows into ‘thinking’ through learning by forming flow-type attractors on chaotic random-like dynamics. It is expected that if rational dynamics are learned in a chaotic neural network (ChNN), coexistence of rational state transition, inspiration-like state transition and also random-like exploration for unknown situation can be realized. Based on the above idea, we have proposed new reinforcement learning using a ChNN as an actor. The positioning of exploration is completely different from the conventional one. The chaotic dynamics inside the ChNN produces exploration factors by itself. Since external random numbers for stochastic action selection are not used, exploration factors cannot be isolated from the output. Therefore, the learning method is also completely different from the conventional one. At each non-feedback connection, one variable named causality trace takes in and maintains the input through the connection according to the change in its output. Using the trace and TD error, the weight is updated. In this paper, as the result of a recent simple task to see whether the new learning works or not, it is shown that a robot with two wheels and two visual sensors reaches a target while avoiding an obstacle after learning though there are still many rooms for improvement.
It is common for real-world applications to analyze big graphs using distributed graph processing systems. Popular in-memory systems require an enormous amount of resources to handle big graphs. While several out-of-core systems have been proposed recently for processing big graphs using secondary storage, the high disk I/O overhead could significantly reduce performance. In this paper, we propose GraphH to enable high- performance big graph analytics in small clusters. Specifically, we design a two-stage graph partition scheme to evenly divide the input graph into partitions, and propose a GAB (Gather-Apply- Broadcast) computation model to make each worker process a partition in memory at a time. We use an edge cache mechanism to reduce the disk I/O overhead, and design a hybrid strategy to improve the communication performance. GraphH can efficiently process big graphs in small clusters or even a single commodity server. Extensive evaluations have shown that GraphH could be up to 7.8x faster compared to popular in-memory systems, such as Pregel+ and PowerGraph when processing generic graphs, and more than 100x faster than recently proposed out-of-core systems, such as GraphD and Chaos when processing big graphs.
Deep learning has significantly advanced the state of the art in machine learning. However, neural networks are often considered black boxes. There is significant effort to develop techniques that explain a classifier’s decisions. Although some of these approaches have resulted in compelling visualisations, there is a lack of theory of what is actually explained. Here we present an analysis of these methods and formulate a quality criterion for explanation methods. On this ground, we propose an improved method that may serve as an extension for existing back-projection and decomposition techniques.
Picasso is a free open-source (Eclipse Public License) web application written in Python for rendering standard visualizations useful for training convolutional neural networks. Picasso ships with occlusion maps and saliency maps, two visualizations which help reveal issues that evaluation metrics like loss and accuracy might hide: for example, learning a proxy classification task. Picasso works with the Keras and Tensorflow deep learning frameworks. Picasso can be used with minimal configuration by deep learning researchers and engineers alike across various neural network architectures. Adding new visualizations is simple: the user can specify their visualization code and HTML template separately from the application code.
An optimal warping path between two time series is generally not unique. The size and form of the set of pairs of time series with non-unique optimal warping path is unknown. This article shows that optimal warping paths are unique for almost every pair of time series in a measure-theoretic sense. All pairs of time series with non-unique optimal warping path form a negligible set and are geometrically the union of zero sets of quadratic forms. The result is useful for analyzing and understanding adaptive learning methods in dynamic time warping spaces.
In this article we study variable selection problem using LASSO with new improvisations. LASSO uses $\ell_{1}$ penalty, it shrinks most of the coefficients to zero when number of explanatory variables $(p)$ are much larger the number of observations $(N)$. Novelty of the approach developed in this article blends basic ideas behind resampling and LASSO together which provides a significant variable reduction and improved prediction accuracy in terms of mean squared error in the test sample. Different weighting schemes have been explored using \textit{Bootstrapped LASSO}, the basic methodology developed in here. Weighting schemes determine to what extent of data blending in case of grouped data. Data sharing (DSL) technique developed by \cite{gross} lies at the root of the present methodology. We apply the technique to analyze the IMDb dataset as discussed in \cite{gross} and compare our result with \cite{gross}.
Knowledge Graphs are important tools to model multi-relational data that serves as information pool for various applications. Traditionally, these graphs are considered to be static in nature. However, recent availability of large scale event-based interaction data has given rise to dynamically evolving knowledge graphs that contain temporal information for each edge. Reasoning over time in such graphs is not yet well understood. In this paper, we present a novel deep evolutionary knowledge network architecture to learn entity embeddings that can dynamically and non-linearly evolve over time. We further propose a multivariate point process framework to model the occurrence of a fact (edge) in continuous time. To facilitate temporal reasoning, the learned embeddings are used to compute relationship score that further parametrizes intensity function of the point process. We demonstrate improved performance over various existing relational learning models on two large scale real-world datasets. Further, our method effectively predicts occurrence or recurrence time of a fact which is novel compared to any prior reasoning approaches in multi-relational setting.
The distance standard deviation, which arises in distance correlation analysis of multivariate data, is studied as a measure of spread. New representations for the distance standard deviation are obtained in terms of Gini’s mean difference and in terms of the moments of spacings of order statistics. Inequalities for the distance variance are derived, proving that the distance standard deviation is bounded above by the classical standard deviation and by Gini’s mean difference. Further, it is shown that the distance standard deviation satisfies the axiomatic properties of a measure of spread. Explicit closed-form expressions for the distance variance are obtained for a broad class of parametric distributions. The asymptotic distribution of the sample distance variance is derived.
Multiresolution analysis and matrix factorization are foundational tools in computer vision. In this work, we study the interface between these two distinct topics and obtain techniques to uncover hierarchical block structure in symmetric matrices — an important aspect in the success of many vision problems. Our new algorithm, the incremental multiresolution matrix factorization, uncovers such structure one feature at a time, and hence scales well to large matrices. We describe how this multiscale analysis goes much farther than what a direct global factorization of the data can identify. We evaluate the efficacy of the resulting factorizations for relative leveraging within regression tasks using medical imaging data. We also use the factorization on representations learned by popular deep networks, providing evidence of their ability to infer semantic relationships even when they are not explicitly trained to do so. We show that this algorithm can be used as an exploratory tool to improve the network architecture, and within numerous other settings in vision.