Permutation entropy has become a standard tool for time series analysis that exploits the temporal properties of these data sets. Many current applications use an approach based on Shannon entropy, which implicitly assumes an underlying uniform distribution of patterns. In this paper, we analyze random walk null models for time series and determine the corresponding permutation distributions. These new techniques allow us to explicitly describe the behavior of real world data in terms of more complex generative processes. Additionally, building on recent results of Martinez, we define a validation measure that allows us to determine when a random walk is an appropriate model for a time series. We demonstrate the usefulness of our methods using empirical data drawn from a variety of fields.
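For readers unfamiliar with the baseline quantity, the following is a minimal sketch of Shannon-based permutation entropy, the standard approach the abstract contrasts against. The function names, the `order` parameter, and the example series are illustrative assumptions; the random walk null models are not implemented here.

```python
import math
from collections import Counter


def ordinal_patterns(series, order=3):
    """Yield the ordinal pattern of each sliding window of length `order`.

    A pattern is the tuple of window positions sorted by value
    (ties are broken by position because `sorted` is stable).
    """
    for i in range(len(series) - order + 1):
        window = series[i:i + order]
        yield tuple(sorted(range(order), key=lambda k: window[k]))


def permutation_entropy(series, order=3, normalize=True):
    """Shannon entropy of the empirical ordinal-pattern distribution."""
    counts = Counter(ordinal_patterns(series, order))
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    if normalize:
        # Divide by log(order!) so a uniform pattern distribution gives 1.
        h /= math.log(math.factorial(order))
    return h


# Hypothetical example series: four rising pairs and two falling pairs.
example_series = [4, 7, 9, 10, 6, 11, 3]
example_entropy = permutation_entropy(example_series, order=2)
```

Normalizing by log(order!) makes the uniform pattern distribution the reference point, which is exactly the implicit assumption the abstract questions.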
Neural networks have been used prominently in several machine learning and statistics applications. In general, the underlying optimization of neural networks is non-convex which makes their performance analysis challenging. In this paper, we take a novel approach to this problem by asking whether one can constrain neural network weights to make its optimization landscape have good theoretical properties while at the same time, be a good approximation for the unconstrained one. For two-layer neural networks, we provide affirmative answers to these questions by introducing Porcupine Neural Networks (PNNs) whose weight vectors are constrained to lie over a finite set of lines. We show that most local optima of PNN optimizations are global while we have a characterization of regions where bad local optimizers may exist. Moreover, our theoretical and empirical results suggest that an unconstrained neural network can be approximated using a polynomially-large PNN.
The infamous exploration-exploitation dilemma is one of the oldest and most important problems in reinforcement learning (RL). Deliberate and effective exploration is necessary for RL agents to succeed in most environments. However, until very recently even very sophisticated RL algorithms employed simple, undirected exploration strategies in large-scale RL tasks. We introduce a new optimistic count-based exploration algorithm for RL that is feasible in high-dimensional MDPs. The success of RL algorithms in these domains depends crucially on generalization from limited training experience. Function approximation techniques enable RL agents to generalize in order to estimate the value of unvisited states, but at present few methods have achieved generalization about the agent’s uncertainty regarding unvisited states. We present a new method for computing a generalized state visit-count, which allows the agent to estimate the uncertainty associated with any state. In contrast to existing exploration techniques, our $\phi$-$\textit{pseudocount}$ achieves generalization by exploiting the feature representation of the state space that is used for value function approximation. States that have less frequently observed features are deemed more uncertain. The resulting $\phi$-$\textit{Exploration-Bonus}$ algorithm rewards the agent for exploring in feature space rather than in the original state space. This method is simpler and less computationally expensive than some previous proposals, and achieves near state-of-the-art results on high-dimensional RL benchmarks. In particular, we report world-class results on several notoriously difficult Atari 2600 video games, including Montezuma’s Revenge.
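To illustrate how a visit count can generalize through features, the sketch below assigns a larger bonus to states whose feature values have rarely been observed. It is a simplified stand-in and not necessarily the paper's algorithm: the binary features, the factorized density with a +0.5 smoothing term, the pseudocount formula, and the names `FeaturePseudocount` and `beta` are all assumptions made for this example.

```python
import math


class FeaturePseudocount:
    """Toy feature-space visit count over binary feature vectors (assumed design)."""

    def __init__(self, num_features):
        self.t = 0  # number of feature vectors observed so far
        # counts[i][v]: how often feature i has taken the binary value v
        self.counts = [[0, 0] for _ in range(num_features)]

    def _density(self, phi, extra=0):
        # Product of independent per-feature estimates; extra=1 gives the
        # density the model would assign after seeing `phi` once more.
        p = 1.0
        for i, v in enumerate(phi):
            p *= (self.counts[i][v] + 0.5 + extra) / (self.t + 1 + extra)
        return p

    def pseudocount(self, phi):
        rho = self._density(phi)
        rho_next = self._density(phi, extra=1)
        # A larger density gain from one extra observation means a smaller count.
        return rho * (1.0 - rho_next) / (rho_next - rho)

    def update(self, phi):
        for i, v in enumerate(phi):
            self.counts[i][v] += 1
        self.t += 1

    def bonus(self, phi, beta=0.05):
        # Exploration bonus that shrinks as the pseudocount grows.
        return beta / math.sqrt(self.pseudocount(phi))


# Hypothetical usage: one feature vector seen often, another never seen.
model = FeaturePseudocount(num_features=3)
for _ in range(20):
    model.update((1, 0, 1))
familiar_bonus = model.bonus((1, 0, 1))
novel_bonus = model.bonus((0, 1, 0))
```

Because the count is built from per-feature statistics, a state never visited before still receives a small bonus if its features are individually common.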
Notoriously, learning with recurrent neural networks (RNNs) on long sequences is a difficult task. There are three major challenges: 1) extracting complex dependencies, 2) vanishing and exploding gradients, and 3) efficient parallelization. In this paper, we introduce a simple yet effective RNN connection structure, the DILATEDRNN, which simultaneously tackles all these challenges. The proposed architecture is characterized by multi-resolution dilated recurrent skip connections and can be combined flexibly with different RNN cells. Moreover, the DILATEDRNN reduces the number of parameters and enhances training efficiency significantly, while matching state-of-the-art performance (even with Vanilla RNN cells) in tasks involving very long-term dependencies. To provide a theory-based quantification of the architecture’s advantages, we introduce a memory capacity measure – the mean recurrent length, which is more suitable for RNNs with long skip connections than existing measures. We rigorously prove the advantages of the DILATEDRNN over other recurrent neural architectures.
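A dilated recurrent skip connection can be sketched as follows, assuming that each layer's state at step t is computed from the input at step t and the same layer's state `dilation` steps earlier, and that dilations grow across layers. The scalar tanh cell, the weights `w_in` and `w_rec`, and the dilation schedule (1, 2, 4) are illustrative assumptions rather than details taken from the abstract.

```python
import math


def dilated_layer(inputs, dilation, w_in=0.5, w_rec=0.9):
    """Scalar tanh recurrence whose state skips back `dilation` steps."""
    states = []
    for t, x in enumerate(inputs):
        # Use a zero state until a state `dilation` steps back exists.
        prev = states[t - dilation] if t >= dilation else 0.0
        states.append(math.tanh(w_in * x + w_rec * prev))
    return states


def dilated_rnn(inputs, dilations=(1, 2, 4)):
    """Stack layers with increasing dilation; each layer feeds the next."""
    out = list(inputs)
    for d in dilations:
        out = dilated_layer(out, d)
    return out


# An impulse at t=0 reaches t=4 in a single recurrent step when dilation=4.
impulse_response = dilated_layer([1.0, 0.0, 0.0, 0.0, 0.0], dilation=4)
```

With a dilation of d, information from d steps back travels through one recurrent edge instead of d, which is the intuition behind shorter gradient paths on long sequences.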
Networks often exhibit structure at disparate scales. We propose a method for identifying community structure at different scales based on multiresolution modularity and consensus clustering. Our contribution consists of two parts. First, we propose a strategy for sampling the entire range of possible resolutions for the multiresolution modularity quality function. Our approach is directly based on the properties of modularity and, in particular, provides a natural way of avoiding the need to increase the resolution parameter by several orders of magnitude to break a few remaining small communities, necessitating the introduction of ad-hoc limits to the resolution range with standard sampling approaches. Second, we propose a hierarchical consensus clustering procedure, based on a modified modularity, that allows one to construct a hierarchical consensus structure given a set of input partitions. While here we are interested in its application to partitions sampled using multiresolution modularity, this consensus clustering procedure can be applied to the output of any clustering algorithm. As such, we see many potential applications of the individual parts of our multiresolution consensus clustering procedure in addition to using the procedure itself to identify hierarchical structure in networks.
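The role of the resolution parameter can be sketched with one common form of multiresolution modularity, Q(γ) = Σ_c [L_c/m − γ (D_c/2m)²], where L_c is the number of edges inside community c, D_c is the sum of its node degrees, and m is the total number of edges. The choice of this form, the function name, and the toy graph are assumptions; the sampling strategy and the consensus procedure are not shown.

```python
def modularity(edges, communities, gamma=1.0):
    """Modularity of a partition of an undirected, unweighted graph.

    `edges` is a list of (u, v) pairs; `communities` maps node -> label;
    `gamma` is the resolution parameter.
    """
    m = len(edges)
    degree = {}
    internal = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if communities[u] == communities[v]:
            c = communities[u]
            internal[c] = internal.get(c, 0) + 1
    degree_sum = {}
    for node, k in degree.items():
        c = communities[node]
        degree_sum[c] = degree_sum.get(c, 0) + k
    return sum(internal.get(c, 0) / m - gamma * (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Toy graph (assumed): two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_block = {n: "a" for n in range(6)}
```

On this toy graph the two-triangle partition scores higher at γ = 1, while the single-block partition scores higher at γ = 0.1, showing how the preferred scale shifts with the resolution.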
Given sparse multi-dimensional data (e.g., (user, movie, time; rating) for movie recommendations), how can we discover latent concepts/relations and predict missing values? Tucker factorization has been widely used to solve such problems with multi-dimensional data, which are modeled as tensors. However, most Tucker factorization algorithms regard and estimate missing entries as zeros, which triggers a highly inaccurate decomposition. Moreover, the few methods that focus on accuracy exhibit limited scalability since they require huge memory and heavy computational costs while updating factor matrices. In this paper, we propose P-Tucker, a scalable Tucker factorization method for sparse tensors. P-Tucker performs alternating least squares with a gradient-based update rule in a fully parallel way, which significantly reduces memory requirements for updating factor matrices. Furthermore, we offer two variants of P-Tucker: a caching algorithm P-Tucker-CACHE and an approximation algorithm P-Tucker-APPROX, both of which accelerate the update process. Experimental results show that P-Tucker exhibits 1.7-14.1x speed-up and 1.4-4.8x less error compared to the state-of-the-art. In addition, P-Tucker scales near linearly with the number of non-zeros in a tensor and the number of threads. Thanks to P-Tucker, we successfully discover hidden concepts and relations in a large-scale real-world tensor, while existing methods cannot reveal latent features due to their limited scalability or low accuracy.
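The difference between treating missing entries as zeros and ignoring them can be made concrete with a small sketch of a Tucker reconstruction whose error is summed over observed entries only. This is not the P-Tucker update rule; the three-way setting, the dictionary storage of observed entries, and all names and values are assumptions for illustration.

```python
def tucker_entry(core, factors, index):
    """Reconstruct one entry of a 3-way tensor from a Tucker model.

    x_hat[i][j][k] = sum over p, q, r of core[p][q][r] * A[i][p] * B[j][q] * C[k][r]
    """
    A, B, C = factors
    i, j, k = index
    return sum(core[p][q][r] * A[i][p] * B[j][q] * C[k][r]
               for p in range(len(core))
               for q in range(len(core[0]))
               for r in range(len(core[0][0])))


def observed_squared_error(observed, core, factors):
    """Squared error over observed entries only; missing entries are skipped."""
    return sum((value - tucker_entry(core, factors, idx)) ** 2
               for idx, value in observed.items())


# Hypothetical rank-(1, 1, 1) model and two observed ratings.
toy_core = [[[2.0]]]
toy_factors = ([[1.0], [2.0]], [[1.0], [3.0]], [[1.0]])
toy_observed = {(0, 0, 0): 2.0, (1, 1, 0): 10.0}
toy_error = observed_squared_error(toy_observed, toy_core, toy_factors)
```

Storing only the observed entries keeps the cost of evaluating the loss proportional to the number of non-zeros rather than to the full tensor size.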
The emergence of mobile games has caused a paradigm shift in the video-game industry. Game developers now have at their disposal a plethora of information on their players, and thus can take advantage of reliable models that can accurately predict player behavior and scale to huge datasets. Churn prediction, a challenge common to a variety of sectors, is particularly relevant for the mobile game industry, as player retention is crucial for the successful monetization of a game. In this article, we present an approach to predicting game abandon based on survival ensembles. Our method provides accurate predictions on both the level at which each player will leave the game and their accumulated playtime until that moment. Further, it is robust to different data distributions and applicable to a wide range of response variables, while also allowing for efficient parallelization of the algorithm. This makes our model well suited to perform real-time analyses of churners, even for games with millions of daily active users.
The classification of time series data is a challenge common to all data-driven fields. However, there is no agreement about which techniques are most efficient for grouping unlabeled time-ordered data. This is because a successful classification of time series patterns depends on the goal and the domain of interest, i.e. it is application-dependent. In this article, we study free-to-play game data. In this domain, clustering similar time series information is increasingly important due to the large amount of data collected by current mobile and web applications. We evaluate which methods accurately cluster time series from mobile games, focusing on player behavior data. We identify and validate several aspects of the clustering: the similarity measures and the representation techniques to reduce the high dimensionality of time series. As a robustness test, we compare various temporal datasets of player activity from two free-to-play video-games. With these techniques we extract temporal patterns of player behavior relevant for the evaluation of game events and game-business diagnosis. Our experiments provide intuitive visualizations to validate the results of the clustering and to determine the optimal number of clusters. Additionally, we assess the common characteristics of the players belonging to the same group. This study allows us to improve the understanding of player dynamics and churn behavior.
Feature representations from pre-trained deep neural networks have been known to exhibit excellent generalization and utility across a variety of related tasks. Fine-tuning is by far the simplest and most widely used approach that seeks to exploit and adapt these feature representations to novel tasks with limited data. Despite the effectiveness of fine-tuning, it is often sub-optimal and requires very careful optimization to prevent severe over-fitting to small datasets. The problem of sub-optimality and over-fitting is due in part to the large number of parameters used in a typical deep convolutional neural network. To address these problems, we propose a simple yet effective regularization method for fine-tuning pre-trained deep networks for the task of k-shot learning. To prevent over-fitting, our key strategy is to cluster the model parameters while ensuring intra-cluster similarity and inter-cluster diversity of the parameters, effectively regularizing the dimensionality of the parameter search space. In particular, we identify groups of neurons within each layer of a deep network that share similar activation patterns. When the network is to be fine-tuned for a classification task using only k examples, we propagate a single gradient to all of the neuron parameters that belong to the same group. The grouping of neurons is non-trivial as neuron activations depend on the distribution of the input data. To efficiently search for optimal groupings conditioned on the input data, we propose a reinforcement learning search strategy using recurrent networks to learn the optimal group assignments for each network layer. Experimental results show that our method can be easily applied to several popular convolutional neural networks and improves upon other state-of-the-art fine-tuning based k-shot learning strategies by more than 10%.
The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance. We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.
This work addresses the instability in asynchronous data parallel optimization. It does so by introducing a novel distributed optimizer which is able to efficiently optimize a centralized model under communication constraints. The optimizer achieves this by pushing a normalized sequence of first-order gradients to a parameter server. This implies that the magnitude of a worker delta is smaller compared to an accumulated gradient, and provides a better direction towards a minimum compared to first-order gradients, which in turn also forces possible implicit momentum fluctuations to be more aligned since we make the assumption that all workers contribute towards a single minimum. As a result, our approach mitigates the parameter staleness problem more effectively since staleness in asynchrony induces (implicit) momentum, and achieves a better convergence rate compared to other optimizers such as asynchronous EASGD and DynSGD, which we show empirically.
With the proliferation of social media over the last decade, determining people’s attitude with respect to a specific topic, document, interaction or event has fueled research interest in natural language processing and introduced a new channel called sentiment and emotion analysis. For instance, businesses routinely look to develop systems to automatically understand their customer conversations by identifying the relevant content to enhance the marketing of their products and the management of their reputations. Previous efforts to assess people’s sentiment on Twitter have suggested that Twitter may be a valuable resource for studying political sentiment and that it reflects the offline political landscape. According to a Pew Research Center report, in January 2016, 44 percent of US adults stated having learned about the presidential election through social media. Furthermore, 24 percent reported use of social media posts of the two candidates as a source of news and information, which is more than the 15 percent who have used both candidates’ websites or emails combined. The first presidential debate between Trump and Hillary was the most tweeted debate ever with 17.1 million tweets.
We consider a ranking and selection problem in the context of personalized decision making, where the best alternative is not universal but varies as a function of observable covariates. The goal of ranking and selection with covariates (R&S-C) is to use sampling to compute a decision rule that can specify the best alternative with a certain statistical guarantee for each subsequent individual after observing his or her covariates. A linear model is proposed to capture the relationship between the mean performance of an alternative and the covariates. Under the indifference-zone formulation, we develop two-stage procedures for homoscedastic and heteroscedastic sampling errors, respectively, and prove their statistical validity, which is defined in terms of probability of correct selection. We also generalize the well-known slippage configuration, and prove that the generalized slippage configuration is the least favorable configuration of our procedures. Extensive numerical experiments are conducted to investigate the performance of the proposed procedures. Finally, we demonstrate the usefulness of R&S-C via a case study of selecting the best treatment regimen in the prevention of esophageal cancer. We find that by leveraging disease-related personal information, R&S-C can substantially improve the expected quality-adjusted life years for some groups of patients by providing patient-specific treatment regimens.
Current topic models often suffer from discovering topics that do not match human intuition, unnatural switching of topics within documents, and high computational demands. We address these concerns by proposing a topic model and an inference algorithm based on automatically identifying characteristic keywords for topics. Keywords influence topic-assignments of nearby words. Our algorithm learns (key)word-topic scores and it self-regulates the number of topics. Inference is simple and easily parallelizable. Qualitative analysis yields comparable results to state-of-the-art models (e.g., LDA), but with different strengths and weaknesses. Quantitative analysis using 9 datasets shows gains in terms of classification accuracy, PMI score, computational performance and consistency of topic assignments within documents, while most often using fewer topics.
We introduce the Fixed Cluster Repair System (FCRS) as a novel architecture for Distributed Storage Systems (DSS) that achieves a small repair bandwidth while guaranteeing a high availability. Specifically we partition the set of servers in a DSS into $s$ clusters and allow a failed server to choose any cluster other than its own as its repair group. Thereby, we guarantee an availability of $s-1$. We characterize the repair bandwidth vs. storage trade-off for the FCRS under functional repair and show that the minimum repair bandwidth can be improved by an asymptotic multiplicative factor of $2/3$ compared to the state of the art coding techniques that guarantee the same availability. We further introduce cubic codes designed to minimize the repair bandwidth of the FCRS under the exact repair model. We prove an asymptotic multiplicative improvement of $0.79$ in the minimum repair bandwidth compared to the existing exact repair coding techniques that achieve the same availability.
As the number of documents on the web is growing exponentially, multi-document summarization is becoming more and more important since it can provide the main ideas of a document set in a short time. In this paper, we present an unsupervised centroid-based document-level reconstruction framework using a distributed bag-of-words model. Specifically, our approach selects summary sentences in order to minimize the reconstruction error between the summary and the documents. We apply sentence selection and beam search to further improve the performance of our model. Experimental results show that the performance of our model is competitive with state-of-the-art unsupervised algorithms on standard benchmark datasets.
We present a new clustering algorithm that is based on searching for natural gaps in the components of the lowest energy eigenvectors of the Laplacian of a graph. In comparing the performance of the proposed method with a set of other popular methods (KMEANS, spectral-KMEANS, and an agglomerative method) in the context of the Lancichinetti-Fortunato-Radicchi (LFR) Benchmark for undirected weighted overlapping networks, we find that the new method outperforms the other spectral methods considered in certain parameter regimes. Finally, in an application to climate data involving one of the most important modes of interannual climate variability, the El Nino Southern Oscillation phenomenon, we demonstrate the ability of the new algorithm to readily identify different flavors of the phenomenon.
Over the last five years Deep Neural Nets have offered more accurate solutions to many problems in speech recognition and computer vision, and these solutions have surpassed a threshold of acceptability for many applications. As a result, Deep Neural Networks have supplanted other approaches to solving problems in these areas, and enabled many new applications. While the design of Deep Neural Nets is still something of an art form, in our work we have found basic principles of design space exploration used to develop embedded microprocessor architectures to be highly applicable to the design of Deep Neural Net architectures. In particular, we have used these design principles to create a novel Deep Neural Net called SqueezeNet that requires as little as 480KB of storage for its model parameters. We have further integrated all these experiences to develop something of a playbook for creating small Deep Neural Nets for embedded systems.
Continuous mixtures of distributions are widely employed in the statistical literature as models for phenomena with highly divergent outcomes; in particular, many familiar heavy-tailed distributions arise naturally as mixtures of light-tailed distributions (e.g., Gaussians), and play an important role in applications as diverse as modeling of extreme values and robust inference. In the case of social networks, continuous mixtures of graph distributions can likewise be employed to model social processes with heterogeneous outcomes, or as robust priors for network inference. Here, we introduce some simple families of network models based on continuous mixtures of baseline distributions. While analytically and computationally tractable, these models allow more flexible modeling of cross-graph heterogeneity than is possible with conventional baselines (e.g., Bernoulli or $U|man$ distributions). We illustrate the utility of these baseline mixture models with application to problems of multiple-network ERGMs, network evolution, and efficient network inference. Our results underscore the potential ubiquity of network processes with nontrivial mixture behavior in natural settings, and raise some potentially disturbing questions regarding the adequacy of current network data collection practices.
A new system model reflecting the clustered structure of distributed storage is suggested to investigate bandwidth requirements for repairing failed storage nodes. Large data centers with multiple racks/disks or local networks of storage devices (e.g. sensor network) are good applications of the suggested clustered model. In realistic scenarios involving clustered storage structures, repairing storage nodes using intact nodes residing in other clusters is more bandwidth-consuming than restoring nodes based on information from intra-cluster nodes. Therefore, it is important to differentiate between intra-cluster repair bandwidth and cross-cluster repair bandwidth in modeling distributed storage. Capacity of the suggested model is obtained as a function of fundamental resources of distributed storage systems, namely, node storage capacity, intra-cluster repair bandwidth and cross-cluster repair bandwidth. The capacity is shown to be asymptotically equivalent to a monotonic decreasing function of number of clusters, as the number of storage nodes increases without bound. Based on the capacity expression, feasible sets of required resources which enable reliable storage are obtained in a closed-form solution. Specifically, it is shown that the cross-cluster traffic can be minimized to zero (i.e., intra-cluster local repair becomes possible) by allowing extra resources on storage capacity and intra-cluster repair bandwidth, according to the law specified in the closed-form. The network coding schemes with zero cross-cluster traffic are defined as intra-cluster repairable codes, which are shown to be a class of the previously developed locally repairable codes.
Next-generation wireless networks must support ultra-reliable, low-latency communication and intelligently manage a massive number of Internet of Things (IoT) devices in real-time, within a highly dynamic environment. This need for stringent communication quality-of-service (QoS) requirements as well as mobile edge and core intelligence can only be realized by integrating fundamental notions of artificial intelligence (AI) and machine learning across the wireless infrastructure and end-user devices. In this context, this paper provides a comprehensive tutorial that introduces the main concepts of machine learning, in general, and artificial neural networks (ANNs), in particular, and their potential applications in wireless communications. For this purpose, we present a comprehensive overview of a number of key types of neural networks that include feed-forward, recurrent, spiking, and deep neural networks. For each type of neural network, we present the basic architecture and training procedure, as well as the associated challenges and opportunities. Then, we provide an in-depth overview of the variety of wireless communication problems that can be addressed using ANNs, ranging from communication using unmanned aerial vehicles to virtual reality and edge caching. For each individual application, we present the main motivation for using ANNs along with the associated challenges while also providing a detailed example for a use case scenario and outlining future works that can be addressed using ANNs. In a nutshell, this article constitutes one of the first holistic tutorials on the development of machine learning techniques tailored to the needs of future wireless networks.
In recent years, a number of methods have been developed for the dimension reduction and decomposition of multiple linked high-content data matrices. Typically these methods assume that just one dimension, rows or columns, is shared among the data sources. This shared dimension may represent common features that are measured for different sample sets (i.e., horizontal integration) or a common set of samples with measurements for different feature sets (i.e., vertical integration). In this article we introduce an approach for simultaneous horizontal and vertical integration, termed Linked Matrix Factorization (LMF), for the more general situation where some matrices share rows (e.g., features) and some share columns (e.g., samples). Our motivating application is a cytotoxicity study with accompanying genomic and molecular chemical attribute data. In this data set, the toxicity matrix (cell lines $\times$ chemicals) shares its sample set with a genotype matrix (cell lines $\times$ SNPs), and shares its feature set with a chemical molecular attribute matrix (chemicals $\times$ attributes). LMF gives a unified low-rank factorization of these three matrices, which allows for the decomposition of systematic variation that is shared among the three matrices and systematic variation that is specific to each matrix. This may be used for efficient dimension reduction, exploratory visualization, and the imputation of missing data even when entire rows or columns are missing from a constituent data matrix. We present theoretical results concerning the uniqueness, identifiability, and minimal parametrization of LMF, and evaluate it with extensive simulation studies.
Clustering samples according to an effective metric and/or vector space representation is a challenging unsupervised learning task with a wide spectrum of applications. Among several clustering algorithms, k-means and its kernelized version still have a wide audience because of their conceptual simplicity and efficacy. However, the systematic application of the kernelized version of k-means is hampered by its inherent quadratic scaling in memory with the number of samples. In this contribution, we devise an approximate strategy to minimize the kernel k-means cost function in which the trade-off between accuracy and speed is automatically governed by the available system memory. Moreover, we define an ad-hoc parallelization scheme well suited for hybrid CPU-GPU state-of-the-art parallel architectures. We demonstrate the effectiveness of both the approximation scheme and the parallelization method on standard UCI datasets and on molecular dynamics (MD) data in the realm of computational chemistry. In this application domain, clustering can play a key role both in quantitatively estimating kinetic rates via Markov State Models and in giving a qualitative, human-compatible summary of the underlying chemical phenomenon under study. For these reasons, we selected it as a valuable real-world application scenario.
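The memory scaling mentioned above can be made concrete with a short calculation: a dense kernel (Gram) matrix over n samples has n × n entries. The sketch below assumes 8-byte floating-point entries and a hypothetical rule that caps the number of samples per block by the memory budget; it illustrates the constraint, not the paper's approximation scheme.

```python
import math

BYTES_PER_ENTRY = 8  # assumption: double-precision kernel values


def gram_matrix_bytes(num_samples):
    """Memory needed to store a dense num_samples x num_samples kernel matrix."""
    return num_samples * num_samples * BYTES_PER_ENTRY


def max_samples_for_budget(memory_bytes):
    """Largest block size whose full kernel matrix fits in the given budget."""
    return math.isqrt(memory_bytes // BYTES_PER_ENTRY)


full_problem_bytes = gram_matrix_bytes(100_000)       # 80 GB for 100,000 samples
block_size_8gib = max_samples_for_budget(8 * 1024**3)  # samples per 8 GiB block
```

Doubling the number of samples quadruples the memory requirement, which is why the block size, and with it the accuracy of a block-wise approximation, ends up tied to the available memory.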
Data science and machine learning algorithms running on big data infrastructure are increasingly important in activities ranging from business intelligence and analytics to cybersecurity, smart city management, and many fields of science and engineering. As these algorithms are further integrated into daily operations, understanding how long they take to run on a big data infrastructure is paramount to controlling costs and delivery times. In this paper we discuss the issues involved in understanding the run time of iterative machine learning algorithms and provide a case study of such an algorithm – including a statistical characterization and model of the run time of an implementation of K-Means for the Spark big data engine using the Edward probabilistic programming language.
Trained recurrent networks are powerful tools for modeling dynamic neural computations. We present a target-based method for modifying the full connectivity matrix of a recurrent network to train it to perform tasks involving temporally complex input/output transformations. The method introduces a second network during training to provide suitable ‘target’ dynamics useful for performing the task. Because it exploits the full recurrent connectivity, the method produces networks that perform tasks with fewer neurons and greater noise robustness than traditional least-squares (FORCE) approaches. In addition, we show how introducing additional input signals into the target-generating network, which act as task hints, greatly extends the range of tasks that can be learned and provides control over the complexity and nature of the dynamics of the trained, task-performing network.
Random projection is a foundational research topic that connects a number of machine learning algorithms under a similar mathematical basis. It is used to reduce the dimensionality of a dataset by efficiently projecting the data points to a lower-dimensional space while approximately preserving the original relative distances between the data points. In this paper, we aim to explain the random projection method by presenting its mathematical background and foundation, the applications that currently adopt it, and an overview of its current research perspective.
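A minimal sketch of one common variant, Gaussian random projection, is given below. The matrix entries are drawn from a normal distribution with mean 0 and variance 1/k so that squared distances are preserved in expectation. The function names, the seed, and the dimensions are assumptions chosen for illustration.

```python
import math
import random


def random_projection_matrix(d, k, seed=0):
    """Return a k x d matrix with i.i.d. N(0, 1/k) entries."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional point to k dimensions via a matrix-vector product."""
    return [sum(r * xi for r, xi in zip(row, x)) for row in matrix]


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Hypothetical usage: project two 50-dimensional points down to 10 dimensions.
R = random_projection_matrix(d=50, k=10, seed=1)
point_a = [float(i) for i in range(50)]
point_b = [float(i % 7) for i in range(50)]
low_a, low_b = project(R, point_a), project(R, point_b)
distance_ratio = euclidean(low_a, low_b) / euclidean(point_a, point_b)
```

The projection matrix is generated without looking at the data, so the same matrix can be reused for new points; `distance_ratio` varies around 1 from seed to seed and concentrates more tightly as k grows.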