Magister Dixit

“Thanks to a perfect storm of recent advances in the tech industry, AI has risen from the ashes and regained its aura of cool.” Mike Barlow ( 2017 )

If you did not already know

Generalized Autoregressive Moving Average Models (GARMA)
A class of generalized autoregressive moving average (GARMA) models is developed that extends the univariate Gaussian ARMA time series model to a flexible observation-driven model for non-Gaussian time series data. The dependent variable is assumed to have a conditional exponential family distribution given the past history of the process. The model estimation is carried out using an iteratively reweighted least squares algorithm. Properties of the model, including stationarity and marginal moments, are either derived explicitly or investigated using Monte Carlo simulation. The relationship of the GARMA model to other models is shown, including the autoregressive models of Zeger and Qaqish, the moving average models of Li, and the reparameterized generalized autoregressive conditional heteroscedastic GARCH model (providing the formula for its fourth marginal moment not previously derived). The model is demonstrated by the application of the GARMA model with a negative binomial conditional distribution to a well-known time series dataset of poliomyelitis counts. …
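As a rough illustration of the observation-driven structure described above, the following Python sketch simulates a Poisson GARMA(1,0) series with a log link and an intercept-only linear predictor. The function name, the default parameter values, the starting mean, and the threshold c (used so that log(0) is never taken) are assumptions made for illustration. The negative binomial application and the iteratively reweighted least squares estimation mentioned in the entry are not shown.

```python
import numpy as np

def simulate_poisson_garma(n, beta0=1.0, phi=0.5, c=0.1, seed=0):
    """Simulate a Poisson GARMA(1,0) series with log link (illustrative sketch).

    log(mu_t) = beta0 + phi * (log(max(y_{t-1}, c)) - beta0)
    y_t | past ~ Poisson(mu_t)
    """
    rng = np.random.default_rng(seed)
    y = np.zeros(n, dtype=np.int64)
    mu = np.zeros(n)

    # Assumed starting value: the mean implied by the intercept alone.
    mu[0] = np.exp(beta0)
    y[0] = rng.poisson(mu[0])

    for t in range(1, n):
        # Threshold the previous count so the log link is defined at zero.
        y_star = max(y[t - 1], c)
        eta = beta0 + phi * (np.log(y_star) - beta0)
        mu[t] = np.exp(eta)
        y[t] = rng.poisson(mu[t])
    return y, mu

counts, means = simulate_poisson_garma(200)
```

The conditional mean at each step depends only on past observations, which is what makes the model observation-driven rather than parameter-driven.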

Machine Vision (MV)
Machine vision (MV) is the technology and methods used to provide imaging-based automatic inspection and analysis for such applications as automatic inspection, process control, and robot guidance in industry. The scope of MV is broad. MV is related to, though distinct from, computer vision. …

GraphH
It is common for real-world applications to analyze big graphs using distributed graph processing systems. Popular in-memory systems require an enormous amount of resources to handle big graphs. While several out-of-core systems have been proposed recently for processing big graphs using secondary storage, the high disk I/O overhead could significantly reduce performance. In this paper, we propose GraphH to enable high-performance big graph analytics in small clusters. Specifically, we design a two-stage graph partition scheme to evenly divide the input graph into partitions, and propose a GAB (Gather-Apply-Broadcast) computation model to make each worker process a partition in memory at a time. We use an edge cache mechanism to reduce the disk I/O overhead, and design a hybrid strategy to improve the communication performance. GraphH can efficiently process big graphs in small clusters or even a single commodity server. Extensive evaluations have shown that GraphH could be up to 7.8x faster compared to popular in-memory systems, such as Pregel+ and PowerGraph, when processing generic graphs, and more than 100x faster than recently proposed out-of-core systems, such as GraphD and Chaos, when processing big graphs. …

R Packages worth a look

Univariate Pseudo-Random Number Generation (UnivRNG)
Pseudo-random number generation of 17 univariate distributions.

Data Management Tools for Real-Time Monitoring/Ecological Momentary Assessment Data (EMAtools)
Performs data management functions common in real-time monitoring data (also called ecological momentary assessment, experience sampling, or micro-longitudinal data), including centering on participant means and merging event-level data into momentary data sets where you need the events to correspond to the nearest data point in the momentary data. This is VERY early release software, and more features will be added over time.
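The two operations named above can be sketched in a few lines of Python. This is not the EMAtools API (the package is written in R); the function names, the record layout, and the field names below are hypothetical and only illustrate the ideas of person-mean centering and nearest-point matching.

```python
from collections import defaultdict

def center_on_participant_means(records, id_key, value_key):
    """Return copies of the records with value minus that participant's mean added."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        sums[r[id_key]] += r[value_key]
        counts[r[id_key]] += 1
    means = {pid: sums[pid] / counts[pid] for pid in sums}
    return [
        {**r, value_key + "_centered": r[value_key] - means[r[id_key]]}
        for r in records
    ]

def nearest_momentary_index(event_time, momentary_times):
    """Index of the momentary observation closest in time to an event."""
    return min(
        range(len(momentary_times)),
        key=lambda i: abs(momentary_times[i] - event_time),
    )

# Hypothetical momentary data: two participants, one mood rating per prompt.
momentary = [
    {"id": "a", "mood": 1.0},
    {"id": "a", "mood": 3.0},
    {"id": "b", "mood": 5.0},
]
centered = center_on_participant_means(momentary, "id", "mood")
```

Centering within participant separates how a person deviates from their own average from differences between people, which is usually the point of collecting repeated momentary measurements.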

Visualization for Time Series Data (timelineR)
Helps to visualize multi-variate time-series having numeric and factor variables. You can use the package for visual analysis of data by plotting the data for each variable in the desired order and study interaction between a factor and a numeric variable by creating overlapping plots.

‘ROI’-Plugin ‘optimx’ (ROI.plugin.optimx)
Enhances the R Optimization Infrastructure (‘ROI’) package with the ‘optimx’ package.

Simulate Models Based on the Generalized Linear Model (simglm)
Easily simulates regression models, including both simple regression and generalized linear mixed models with up to three levels of nesting. Flexible power simulations, allowing the specification of missing data, unbalanced designs, and different random error distributions, are built into the package.

Document worth reading: “Interpreting Blackbox Models via Model Extraction”

Interpretability has become an important issue as machine learning is increasingly used to inform consequential decisions. We propose an approach for interpreting a blackbox model by extracting a decision tree that approximates the model. Our model extraction algorithm avoids overfitting by leveraging blackbox model access to actively sample new training points. We prove that as the number of samples goes to infinity, the decision tree learned using our algorithm converges to the exact greedy decision tree. In our evaluation, we use our algorithm to interpret random forests and neural nets trained on several datasets from the UCI Machine Learning Repository, as well as control policies learned for three classical reinforcement learning problems. We show that our algorithm improves over a baseline based on CART on every problem instance. Furthermore, we show how an interpretation generated by our approach can be used to understand and debug these models. Interpreting Blackbox Models via Model Extraction

Book Memo: “Statistical Shape Analysis”

With Applications in R

A thoroughly revised and updated edition of this introduction to modern statistical methods for shape analysis. Shape analysis is an important tool in the many disciplines where objects are compared using geometrical features. Examples include comparing brain shape in schizophrenia; investigating protein molecules in bioinformatics; and describing growth of organisms in biology. This book is a significant update of the highly-regarded ‘Statistical Shape Analysis’ by the same authors. The new edition lays the foundations of landmark shape analysis, including geometrical concepts and statistical techniques, and extends to include analysis of curves, surfaces, images and other types of object data. Key definitions and concepts are discussed throughout, and the relative merits of different approaches are presented. The authors have included substantial new material on recent statistical developments and offer numerous examples throughout the text. Concepts are introduced in an accessible manner, while retaining sufficient detail for more specialist statisticians to appreciate the challenges and opportunities of this new field. Computer code has been included for instructional use, along with exercises to enable readers to implement the applications themselves in R and to follow the key ideas by hands-on analysis. Statistical Shape Analysis: with Applications in R will offer a valuable introduction to this fast-moving research area for statisticians and other applied scientists working in diverse areas, including archaeology, bioinformatics, biology, chemistry, computer science, medicine, morphometrics and image analysis.

Magister Dixit

“Today, you are much less likely to face a scenario in which you cannot query data and get a response back in a brief period of time. Analytical processes that used to require months, days, or hours have been reduced to minutes, seconds, and fractions of seconds. But shorter processing times have led to higher expectations. Two years ago, many data analysts thought that generating a result from a query in less than 40 minutes was nothing short of miraculous. Today, they expect to see results in under a minute. That’s practically the speed of thought – you think of a query, you get a result, and you begin your experiment. “It’s about moving with greater speed toward previously unknown questions, defining new insights, and reducing the time between when an event happens somewhere in the world and someone responds or reacts to that event,” says Erickson. A rapidly emerging universe of newer technologies has dramatically reduced data processing cycle time, making it possible to explore and experiment with data in ways that would not have been practical or even possible a few years ago. Despite the availability of new tools and systems for handling massive amounts of data at incredible speeds, however, the real promise of advanced data analytics lies beyond the realm of pure technology. “Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse,” says Michael Minelli, co-author of Big Data, Big Analytics. “It’s about the ability to make better decisions and take meaningful actions at the right time. It’s about detecting fraud while someone is swiping a credit card, or triggering an offer while a shopper is standing on a checkout line, or placing an ad on a website while someone is reading a specific article. It’s about combining and analyzing data so you can take the right action, at the right time, and at the right place.” For some, real-time big data analytics (RTBDA) is a ticket to improved sales, higher profits and lower marketing costs. To others, it signals the dawn of a new era in which machines begin to think and respond more like humans.” Mike Barlow ( 2013 )

If you did not already know

Deep Rotation Equivariant Network (DREN)
Recently, learning equivariant representations has attracted considerable research attention. Dieleman et al. introduce four operations which can be inserted into a CNN to learn deep representations equivariant to rotation. However, in their approach feature maps must be copied and rotated four times in each layer, which incurs considerable running-time and memory overhead. To address this problem, we propose the Deep Rotation Equivariant Network (DREN), consisting of cycle layers, isotonic layers and decycle layers. Our proposed layers apply rotation transformations to filters rather than feature maps, achieving a speed-up of more than 2 times with even less memory overhead. We evaluate DRENs on the Rotated MNIST and CIFAR-10 datasets and demonstrate that they can improve the performance of state-of-the-art architectures. Our code is released on GitHub. …
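The idea of rotating filters instead of feature maps rests on a simple identity for 90-degree rotations, which the NumPy sketch below checks numerically. It is not the authors' cycle, isotonic, or decycle layers. The helper names are hypothetical, and a plain single-channel "valid" cross-correlation stands in for a convolutional layer.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation, as used in CNN layers."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def rotated_filter_responses(image, kernel):
    """Cheap route: rotate the small filter by 0, 90, 180, 270 degrees."""
    return [correlate2d_valid(image, np.rot90(kernel, r)) for r in range(4)]

def rotated_image_responses(image, kernel):
    """Costly route: rotate the image, filter it, then rotate the result back."""
    return [
        np.rot90(correlate2d_valid(np.rot90(image, -r), kernel), r)
        for r in range(4)
    ]

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 6))
kernel = rng.normal(size=(3, 3))

# Both routes produce the same four response maps.
for cheap, costly in zip(rotated_filter_responses(image, kernel),
                         rotated_image_responses(image, kernel)):
    assert np.allclose(cheap, costly)
```

Because the filter is far smaller than the feature map, rotating it touches much less memory, which is consistent with the overhead reduction the abstract reports.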

Semantic Matching
Semantic matching is a technique used in computer science to identify information which is semantically related. Given any two graph-like structures, e.g. classifications, taxonomies, database or XML schemas, and ontologies, matching is an operator which identifies those nodes in the two structures which semantically correspond to one another. For example, applied to file systems it can identify that a folder labeled “car” is semantically equivalent to another folder “automobile” because they are synonyms in English. This information can be taken from a linguistic resource like WordNet. In recent years many such operators have been proposed. S-Match is an example of a semantic matching operator. It works on lightweight ontologies, namely graph structures where each node is labeled by a natural language sentence, for example in English. These sentences are translated into a formal logical formula (according to an artificial unambiguous language) codifying the meaning of the node taking into account its position in the graph. For example, in case the folder “car” is under another folder “red” we can say that the meaning of the folder “car” is “red car” in this case. This is translated into the logical formula “red AND car”. The output of S-Match is a set of semantic correspondences called mappings attached with one of the following semantic relations: disjointness (⊥), equivalence (≡), more specific (⊑) and less specific (⊒). In our example the algorithm will return a mapping between “car” and “automobile” attached with an equivalence relation. Information semantically matched can also be used as a measure of relevance through a mapping of near-term relationships. Such use of S-Match technology is prevalent in the career space where it is used to gauge depth of skills through relational mapping of information found in applicant resumes.
Semantic matching represents a fundamental technique in many applications in areas such as resource discovery, data integration, data migration, query translation, peer-to-peer networks, agent communication, schema and ontology merging. Its use is also being investigated in other areas such as event processing. In fact, it has been proposed as a valid solution to the semantic heterogeneity problem, namely managing the diversity in knowledge. Interoperability among people of different cultures and languages, having different viewpoints and using different terminology, has always been a huge problem. Especially with the advent of the Web and the consequential information explosion, the problem seems to be emphasized. People face the concrete problem of retrieving, disambiguating and integrating information coming from a wide variety of sources. …
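The “car” / “automobile” and “red AND car” examples above can be mimicked with a toy Python matcher. This is not S-Match and does not use WordNet. The synonym table, the function names, and the representation of a node's meaning as a set of conjoined labels are assumptions made purely for illustration. Disjointness needs background knowledge that this sketch lacks, so unrelated nodes are reported as "unknown".

```python
# Toy stand-in for a linguistic resource such as WordNet (hypothetical entries).
SYNONYMS = {"automobile": "car", "auto": "car"}

def node_concept(path):
    """Meaning of a node as the conjunction of labels from the root down to it."""
    return frozenset(SYNONYMS.get(label.lower(), label.lower()) for label in path)

def semantic_relation(path_a, path_b):
    """Relation of node A to node B under the conjunction-of-labels reading."""
    a, b = node_concept(path_a), node_concept(path_b)
    if a == b:
        return "equivalence"
    if a > b:  # A carries extra conjuncts, so it denotes a narrower concept
        return "more specific"
    if a < b:
        return "less specific"
    return "unknown"

# The folder "car" under "red" means "red AND car".
print(semantic_relation(["car"], ["automobile"]))         # equivalence
print(semantic_relation(["red", "car"], ["automobile"]))  # more specific
```

A real matcher would translate labels into logical formulas and decide the relations with a reasoner. The set comparison here only covers the special case of pure conjunctions of atomic labels.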

Waterfall Plot
A waterfall plot is a three-dimensional plot in which multiple curves of data, typically spectra, are displayed simultaneously. Typically the curves are staggered both across the screen and vertically, with ‘nearer’ curves masking the ones behind. The result is a series of ‘mountain’ shapes that appear to be side by side. The waterfall plot is often used to show how two-dimensional information changes over time or some other variable such as rpm. The term ‘waterfall plot’ is sometimes used interchangeably with ‘spectrogram’ or ‘Cumulative Spectral Decay’ (CSD) plot. …

What’s new on arXiv

This paper presents two unsupervised learning layers (UL layers) for label-free video analysis: one for fully connected layers, and the other for convolutional ones. The proposed UL layers can play two roles: they can be the cost function layer for providing global training signal; meanwhile they can be added to any regular neural network layers for providing local training signals and combined with the training signals backpropagated from upper layers for extracting both slow and fast changing features at layers of different depths. Therefore, the UL layers can be used in either pure unsupervised or semi-supervised settings. Both a closed-form solution and an online learning algorithm for two UL layers are provided. Experiments with unlabeled synthetic and real-world videos demonstrated that the neural networks equipped with UL layers and trained with the proposed online learning algorithm can extract shape and motion information from video sequences of moving objects. The experiments demonstrated the potential applications of UL layers and online learning algorithm to head orientation estimation and moving object localization.
Variational inference is a powerful approach for approximate posterior inference. However, it is sensitive to initialization and can be subject to poor local optima. In this paper, we develop proximity variational inference (PVI). PVI is a new method for optimizing the variational objective that constrains subsequent iterates of the variational parameters to robustify the optimization path. Consequently, PVI is less sensitive to initialization and optimization quirks and finds better local optima. We demonstrate our method on three proximity statistics. We study PVI on a Bernoulli factor model and sigmoid belief network with both real and synthetic data and compare to deterministic annealing (Katahira et al., 2008). We highlight the flexibility of PVI by designing a proximity statistic for Bayesian deep learning models such as the variational autoencoder (Kingma and Welling, 2014; Rezende et al., 2014). Empirically, we show that PVI consistently finds better local optima and gives better predictive performance.
Generative adversarial networks (GAN) approximate a target data distribution by jointly optimizing an objective function through a ‘two-player game’ between a generator and a discriminator. Despite their empirical success, however, two very basic questions on how well they can approximate the target distribution remain unanswered. First, it is not known how restricting the discriminator family affects the approximation quality. Second, while a number of different objective functions have been proposed, we do not understand when convergence to the global minima of the objective function leads to convergence to the target distribution under various notions of distributional convergence. In this paper, we address these questions in a broad and unified setting by defining a notion of adversarial divergences that includes a number of recently proposed objective functions. We show that if the objective function is an adversarial divergence with some additional conditions, then using a restricted discriminator family has a moment-matching effect. Additionally, we show that for objective functions that are strict adversarial divergences, convergence in the objective function implies weak convergence, thus generalizing previous results.
We present a neural network technique for the analysis and extrapolation of time-series data called Neural Decomposition (ND). Units with a sinusoidal activation function are used to perform a Fourier-like decomposition of training samples into a sum of sinusoids, augmented by units with nonperiodic activation functions to capture linear trends and other nonperiodic components. We show how careful weight initialization can be combined with regularization to form a simple model that generalizes well. Our method generalizes effectively on the Mackey-Glass series, a dataset of unemployment rates as reported by the U.S. Department of Labor Statistics, a time-series of monthly international airline passengers, the monthly ozone concentration in downtown Los Angeles, and an unevenly sampled time-series of oxygen isotope measurements from a cave in north India. We find that ND outperforms popular time-series forecasting techniques including LSTM, echo state networks, ARIMA, SARIMA, SVR with a radial basis function, and Gashler and Ashmore’s model.
This work presents a rigorous statistical analysis of adversarial training for generative models, advancing recent work by Arjovsky and Bottou [2]. A key element is the distinction between the objective function with respect to the (unknown) data distribution, and its empirical counterpart. This yields a straight-forward explanation for common pathologies in practical adversarial training such as vanishing gradients. To overcome such issues, we pursue the idea of smoothing the Jensen-Shannon Divergence (JSD) by incorporating noise in the formulation of the discriminator. As we show, this effectively leads to an empirical version of the JSD in which the true and the generator densities are replaced by kernel density estimates. We analyze statistical consistency of this objective, and demonstrate its practical effectiveness.
Recurrent neural networks have achieved remarkable success at generating sequences with complex structures, thanks to advances that include richer embeddings of input and cures for vanishing gradients. Trained only on sequences from a known grammar, though, they can still struggle to learn rules and constraints of the grammar. Neural Attribute Machines (NAMs) are equipped with a logical machine that represents the underlying grammar, which is used to teach the constraints to the neural machine by (i) augmenting the input sequence, and (ii) optimizing a custom loss function. Unlike traditional RNNs, NAMs are exposed to the grammar, as well as samples from the language of the grammar. During generation, NAMs make significantly fewer violations of the constraints of the underlying grammar than RNNs trained only on samples from the language of the grammar.
Machine learning and data analysis now find both scientific and industrial application in biology, chemistry, geology, medicine, and physics. These applications rely on large quantities of data gathered from automated sensors and user input. Furthermore, the dimensionality of many datasets is extreme: more details are being gathered about single user interactions or sensor readings. All of these applications encounter problems with a common theme: use observed data to make inferences about the world. Our work obtains the first provably efficient algorithms for Independent Component Analysis (ICA) in the presence of heavy-tailed data. The main tool in this result is the centroid body (a well-known topic in convex geometry), along with optimization and random walks for sampling from a convex body. This is the first algorithmic use of the centroid body and it is of independent theoretical interest, since it effectively replaces the estimation of covariance from samples, and is more generally accessible. This reduction relies on a non-linear transformation of samples from such an intersection of halfspaces (i.e. a simplex) to samples which are approximately from a linearly transformed product distribution. Through this transformation of samples, which can be done efficiently, one can then use an ICA algorithm to recover the vertices of the intersection of halfspaces. Finally, we again use ICA as an algorithmic primitive to construct an efficient solution to the widely-studied problem of learning the parameters of a Gaussian mixture model. Our algorithm again transforms samples from a Gaussian mixture model into samples which fit into the ICA model and, when processed by an ICA algorithm, result in recovery of the mixture parameters. Our algorithm is effective even when the number of Gaussians in the mixture grows polynomially with the ambient dimension.
Content popularity prediction has been extensively studied due to its importance and interest for both users and hosts of social media sites like Facebook, Instagram, Twitter, and Pinterest. However, existing work mainly focuses on modeling popularity using a single metric such as the total number of likes or shares. In this work, we propose Diffusion-LSTM, a memory-based deep recurrent network that learns to recursively predict the entire diffusion path of an image through a social network. By combining user social features and image features, and encoding the diffusion path taken thus far with an explicit memory cell, our model predicts the diffusion path of an image more accurately compared to alternate baselines that either encode only image or social features, or lack memory. By mapping individual users to user prototypes, our model can generalize to new users not seen during training. Finally, we demonstrate our model’s capability of generating diffusion trees, and show that the generated trees closely resemble ground-truth trees.
We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of $X$. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.

Document worth reading: “Living Together: Mind and Machine Intelligence”

In this paper we consider the nature of the machine intelligences we have created in the context of our human intelligence. We suggest that the fundamental difference between human and machine intelligence comes down to embodiment factors. We define embodiment factors as the ratio between an entity’s ability to communicate information vs compute information. We speculate on the role of embodiment factors in driving our own intelligence and consciousness. We briefly review dual process models of cognition and cast machine intelligence within that framework, characterising it as a dominant System Zero, which can drive behaviour through interfacing with us subconsciously. Driven by concerns about the consequence of such a system we suggest prophylactic courses of action that could be considered. Our main conclusion is that it is not sentient intelligence we should fear but non-sentient intelligence. Living Together: Mind and Machine Intelligence

Book Memo: “Health 4.0”

How Virtualization and Big Data are Revolutionizing Healthcare

This book describes how the creation of new digital services—through vertical and horizontal integration of data coming from sensors on top of existing legacy systems—that has already had a major impact on industry is now extending to healthcare. The book describes the fourth industrial revolution (i.e. Health 4.0), which is based on virtualization and service aggregation. It shows how sensors, embedded systems, and cyber-physical systems are fundamentally changing the way industrial processes work, their business models, and how we consume, while also affecting the health and care domains. Chapters describe the technology behind the shift of point of care to point of need and away from hospitals and institutions; how care will be delivered virtually outside hospitals; that services will be tailored to individuals rather than being designed as statistical averages; that data analytics will be used to help patients to manage their chronic conditions with the help of smart devices; and that pharmaceuticals will be interactive to help prevent adverse reactions. The topics presented will have an impact on a variety of healthcare stakeholders in a continuously global and hyper-connected world.