# Distilled News

Neural networks designed for sequence predictions have recently gained renewed interest by achieving state-of-the-art performance across areas such as speech recognition, machine translation and language modeling. However, these models are quite computationally demanding, which in turn can limit their application. In the area of language modeling, recent advances have been made by leveraging massively large models that can only be trained on a large GPU cluster for weeks at a time. While impressive, these processing-intensive practices favor exploration on large computational infrastructures that are typically too expensive for academic environments and impractical in a production setting, limiting the speed of research, the reproducibility of results, and their usability.
In two of my previous blog posts, I explained how the black box of a random forest can be opened up by tracking decision paths along the trees and computing feature contributions. This way, any prediction can be decomposed into contributions from features, such that prediction = bias + feature 1 contribution + … + feature n contribution. However, this linear breakdown is inherently imperfect, since a linear combination of features cannot capture interactions between them.
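To make the decomposition concrete, here is a minimal pure-Python sketch on a tiny hand-built regression tree. The tree, the node means and the input are all hypothetical, not taken from the posts: each step down the tree changes the running mean, and that change is attributed to the feature that was split on.

```python
# Illustration of: prediction = bias + sum(feature contributions)
# Internal node: (feature_index, threshold, node_mean, left, right); leaf: feature_index is None.
leaf = lambda mean: (None, None, mean, None, None)
tree = (0, 5.0, 10.0,            # root splits on feature 0; node mean 10.0
        (1, 2.0, 6.0,            # left child splits on feature 1; mean 6.0
         leaf(4.0), leaf(8.0)),
        leaf(14.0))              # right child is a leaf with mean 14.0

def decompose(tree, x):
    """Walk the decision path; attribute each change in node mean
    to the feature that was split on at that node."""
    bias = tree[2]               # mean at the root
    contributions = {}
    node = tree
    while node[0] is not None:
        feat, thr, mean, left, right = node
        child = left if x[feat] <= thr else right
        contributions[feat] = contributions.get(feat, 0.0) + child[2] - mean
        node = child
    return node[2], bias, contributions

pred, bias, contrib = decompose(tree, [3.0, 1.0])   # path: root -> left -> left leaf
# the leaf prediction (4.0) equals bias (10.0) plus the per-feature contributions
assert abs(pred - (bias + sum(contrib.values()))) < 1e-12
```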
Finding clusters is a powerful tool for understanding and exploring data. While the task sounds easy, it can be surprisingly difficult to do well. Most standard clustering algorithms can, and do, produce very poor clustering results in many cases. Our intuitions for what a cluster is are not as clear as we would like, and can easily be led astray. We discuss how to do clustering correctly: we attempt to find a definition of clustering that makes sense for most cases, and introduce an algorithm for finding such clusters, along with a high-performance Python implementation of the algorithm, building up more intuition for what clustering really means as we go.
Simplified implementation of “Convolutional Neural Networks for Sentence Classification” paper
A curated list of awesome TensorFlow experiments, libraries, and projects. Inspired by awesome-machine-learning.
When assessing the quality of a model, being able to accurately measure its prediction error is of key importance. Often, however, techniques of measuring error are used that give grossly misleading results. This can lead to the phenomenon of over-fitting where a model may fit the training data very well, but will do a poor job of predicting results for new data not used in model training. Here is an overview of methods to accurately measure model prediction error.
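The over-fitting phenomenon described above can be shown in a few lines with a deliberately overfitting 1-nearest-neighbour "memorizer" on synthetic data: the training error is exactly zero, while a held-out set reveals the honest prediction error. Everything here (the data, the split) is illustrative.

```python
import random

# Synthetic 1-D regression data: y = x plus Gaussian noise (an assumption).
random.seed(0)
data = [(x, x + random.gauss(0, 1)) for x in [i / 10 for i in range(100)]]
train, test = data[::2], data[1::2]          # a simple holdout split

def predict(x, memory):
    # 1-nearest-neighbour: return the y of the closest stored x
    return min(memory, key=lambda p: abs(p[0] - x))[1]

def mse(points, memory):
    return sum((predict(x, memory) - y) ** 2 for x, y in points) / len(points)

train_err = mse(train, train)   # exactly 0: the model has memorized its data
test_err = mse(test, train)     # the honest estimate of prediction error
assert train_err == 0.0 and test_err > 0.0
```

Cross-validation generalizes this idea by rotating which slice of the data plays the role of `test`.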
Last post I described how I collected implicit feedback data from the website Sketchfab. I then claimed I would write about how to actually build a recommendation system with this data. Well, here we are! Let’s build. I think the best place to start when looking into implicit feedback recommenders is with the model outlined in the classic paper ‘Collaborative Filtering for Implicit Feedback Datasets’ by Koren et al. (warning: pdf link). I have seen many names in the literature and machine learning libraries for this model. I’ll call it Weighted Regularized Matrix Factorization (WRMF), which tends to be a name used fairly often. WRMF is like the classic rock of implicit matrix factorization. It may not be the trendiest, but it will never go out of style. And, every time I use it, I know that I’m guaranteed to like what I get out. Specifically, this model makes reasonable intuitive sense, it’s scalable, and, most importantly, I’ve found it easy to tune. There are far fewer hyperparameters than, say, in stochastic gradient descent models.
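A rank-1, pure-Python sketch of WRMF's alternating least squares may help fix ideas. The binary preference p, the confidence weighting c = 1 + αr and the L2 regularization follow the general recipe of the paper, but the counts below are hypothetical and this toy is not the post's implementation:

```python
# Rank-1 WRMF: minimize sum c[u][i] * (p[u][i] - x[u]*y[i])**2 + lam*(|x|^2 + |y|^2)
r = [[3, 0, 1],               # hypothetical raw implicit counts (users x items)
     [0, 5, 0],
     [2, 0, 4]]
alpha, lam = 40.0, 0.1
p = [[1.0 if v > 0 else 0.0 for v in row] for row in r]   # binary preference
c = [[1.0 + alpha * v for v in row] for row in r]         # confidence

x = [0.1] * len(r)            # user factors
y = [0.1] * len(r[0])         # item factors

def loss():
    return (sum(c[u][i] * (p[u][i] - x[u] * y[i]) ** 2
                for u in range(len(x)) for i in range(len(y)))
            + lam * (sum(v * v for v in x) + sum(v * v for v in y)))

before = loss()
for _ in range(10):
    # closed-form alternating updates (the 1-dimensional special case of ALS)
    for u in range(len(x)):
        x[u] = (sum(c[u][i] * y[i] * p[u][i] for i in range(len(y)))
                / (sum(c[u][i] * y[i] ** 2 for i in range(len(y))) + lam))
    for i in range(len(y)):
        y[i] = (sum(c[u][i] * x[u] * p[u][i] for u in range(len(x)))
                / (sum(c[u][i] * x[u] ** 2 for u in range(len(x))) + lam))
after = loss()
assert after < before          # each alternating step cannot increase the loss
```

In the general rank-k case each scalar division becomes a small k×k linear solve, which is what makes the model scalable.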
This came up recently on StackOverflow. One of the answers was particularly helpful and I thought it might be worth mentioning here. The idea presented there is to break the code into four files, all stored in your project directory. These four files are to be processed in the following order. …
One of my favorite things about Python is that users get the benefit of observing the R community and then emulating the best parts of it. I’m a big believer that a language is only as helpful as its libraries and tools. This post is about pandasql, a Python package we (Yhat) wrote that emulates the R package sqldf. It’s a small but mighty library comprised of just 358 lines of code. The idea of pandasql is to make Python speak SQL. For those of you who come from a SQL-first background or still ‘think in SQL’, pandasql is a nice way to take advantage of the strengths of both languages.
Google has recently launched a new beta product – the Google Cloud Natural Language API. As a rule, any new product by Google makes the company’s rivals slightly panic. What are the major players in the Natural Language area to expect? Does it mean the end of free competition in the linguistic API market and dictatorship by “The good corporation”?
The future is inevitably tied to uncertainty, and this uncertainty can be estimated.
I’ve spent the last week hand-rolling recurrent neural networks. I’m currently taking Udacity’s Deep Learning course, and arriving at the section on RNN’s and LSTM’s, I decided to build a few for myself.
Most questions I’ve gotten since I released bayesAB have been along the lines of:
• Why/how is Bayesian AB testing better than Frequentist hypothesis AB testing?
• Why do I need priors?
• Do I really really really need priors?
• How do I choose priors?
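For the first question, a minimal sketch of the Bayesian mechanics: with a Beta prior and binomial conversion data, the posterior is again a Beta, and "is B better than A?" becomes a direct posterior probability. The conversion counts and the flat prior below are hypothetical, and this is not bayesAB's code:

```python
import random

# Beta-Binomial conjugacy: prior Beta(a, b) + data -> posterior Beta(a + conv, b + n - conv)
random.seed(42)
prior_a, prior_b = 1, 1                      # a flat Beta(1, 1) prior (an assumption)
conv_A, n_A = 120, 1000                      # hypothetical conversions / trials
conv_B, n_B = 160, 1000

def posterior_sample(conv, n, size=20000):
    # Monte Carlo draws from the posterior over the true conversion rate
    return [random.betavariate(prior_a + conv, prior_b + n - conv)
            for _ in range(size)]

samples_A = posterior_sample(conv_A, n_A)
samples_B = posterior_sample(conv_B, n_B)
# probability that B's true rate beats A's, read directly off the posterior draws
p_b_beats_a = sum(b > a for a, b in zip(samples_A, samples_B)) / len(samples_A)
assert 0.0 <= p_b_beats_a <= 1.0
```

Unlike a frequentist p-value, `p_b_beats_a` answers the question practitioners actually ask, and the prior is just the two numbers `prior_a`, `prior_b`.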

# What’s new on arXiv

Multi-hop inference is necessary for machine learning systems to successfully solve tasks such as Recognising Textual Entailment and Machine Reading. In this work, we demonstrate both the effectiveness of adaptive computation for learning the number of inference steps required for examples of different complexity, and that learning the correct number of inference steps is difficult. We introduce the first model involving Adaptive Computation Time, which provides a small performance benefit over a similar model without an adaptive component, as well as enabling considerable insight into the reasoning process of the model.
In this paper, we develop new test statistics for private hypothesis testing. These statistics are designed specifically so that their asymptotic distributions, after accounting for noise added for privacy concerns, match the asymptotics of the classical (non-private) chi-square tests for testing if the multinomial data parameters lie in lower dimensional manifolds (examples include goodness of fit and independence testing). Empirically, these new test statistics outperform prior work, which focused on noisy versions of existing statistics.
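A minimal sketch of the "noisy statistic" idea: perturb the multinomial counts with Laplace noise for differential privacy, then form a chi-square-style goodness-of-fit statistic. The noise scale 2/ε is a common convention for count queries, not the paper's exact construction, and the counts are made up:

```python
import math, random

random.seed(7)

def laplace(scale):
    # inverse-CDF sampling of a Laplace(0, scale) variate
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

counts = [220, 180, 210, 190]                 # hypothetical multinomial counts
expected = [sum(counts) / len(counts)] * 4    # uniform null hypothesis
epsilon = 1.0
noisy = [v + laplace(2 / epsilon) for v in counts]

# chi-square-style statistics on the private and non-private counts
chi2_private = sum((n - e) ** 2 / e for n, e in zip(noisy, expected))
chi2_plain = sum((v - e) ** 2 / e for v, e in zip(counts, expected))
assert chi2_private >= 0 and chi2_plain >= 0
```

The paper's contribution is to redesign the statistic so that, after accounting for this added noise, its asymptotic distribution matches the classical chi-square.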
A/B testing, also known as a controlled experiment, bucket testing or split testing, has been widely used for evaluating a new feature, service or product in the data-driven decision processes of online websites. The goal of A/B testing is to estimate or test the difference between the treatment effects of the old and new variations. It is a well-studied two-sample comparison problem if each user’s response is influenced by her treatment only. However, in many applications of A/B testing, especially those in HIVE of Yahoo and other social networks of Microsoft, Facebook, LinkedIn, Twitter and Google, users in the social networks influence their friends via underlying social interactions, and the conventional A/B testing methods fail to work. This paper considers the network A/B testing problem and provides a general framework consisting of five steps: data sampling, probabilistic modeling, parameter inference, computing the average treatment effect, and hypothesis testing. The framework performs well for network A/B testing in simulation studies.
Methods for unsupervised anomaly detection suffer from the fact that the data is unlabeled, making it difficult to assess the optimality of detection algorithms. Ensemble learning has shown exceptional results in classification and clustering problems, but has not seen as much research in the context of outlier detection. Existing methods focus on combining output scores of individual detectors, but this leads to outputs that are not easily interpretable. In this paper, we introduce a theoretical foundation for combining individual detectors with Bayesian classifier combination. Not only are posterior distributions easily interpreted as the probability distribution of anomalies, but bias, variance, and individual error rates of detectors are all easily obtained. Performance on real-world datasets shows high accuracy across varied types of time series data.
We classify the rare events of structured, memoryful stochastic processes and use this to analyze sequential and parallel generators for these events. Given a stochastic process, we introduce a method to construct a new process whose typical realizations are a given process’ rare events. This leads to an expression for the minimum memory required to generate rare events. We then show that the recently discovered classical-quantum ambiguity of simplicity also occurs when comparing the structure of process fluctuations.
Topic modeling is an increasingly important component of Big Data analytics, enabling the sense-making of highly dynamic and diverse streams of text data. Traditional methods such as Dynamic Topic Modeling (DTM), while mathematically elegant, do not lend themselves well to direct parallelization because of dependencies from one time step to another. Data decomposition approaches that partition data across time segments and then combine results into a global view of the dynamic change of topics enable execution of topic models on much larger datasets than is possible without data decomposition. However, these methods are difficult to analyze mathematically and are relatively untested for quality of topics and performance on parallel systems. In this paper, we introduce and empirically analyze Clustered Latent Dirichlet Allocation (CLDA), a method for extracting dynamic latent topics from a collection of documents. CLDA uses a data decomposition strategy to partition data. CLDA takes advantage of parallelism, enabling fast execution for even very large datasets and a large number of topics. A large corpus is split into local segments to extract textual information from different time steps. Latent Dirichlet Allocation (LDA) is applied to infer topics at local segments. The results are merged, and clustering is used to combine topics from different segments into global topics. Results show that the perplexity is comparable and that topics generated by this algorithm are similar to those generated by DTM. In addition, CLDA is two orders of magnitude faster than existing approaches and allows for more freedom of experiment design. In this paper CLDA is applied successfully to seventeen years of NIPS conference papers, seventeen years of computer science journal abstracts, and to forty years of the PubMed corpus.
Machine Learning has been a big success story during the AI resurgence. One particular stand out success relates to unsupervised learning from a massive amount of data, albeit much of it relates to one modality/type of data at a time. In spite of early assertions of the unreasonable effectiveness of data, there is increasing recognition of utilizing knowledge whenever it is available or can be created purposefully. In this paper, we focus on discussing the indispensable role of knowledge for deeper understanding of complex text and multimodal data in situations where (i) large amounts of training data (labeled/unlabeled) are not available or labor intensive to create, (ii) the objects (particularly text) to be recognized are complex (i.e., beyond simple entity-person/location/organization names), such as implicit entities and highly subjective content, and (iii) applications need to use complementary or related data in multiple modalities/media. What brings us to the cusp of rapid progress is our ability to (a) create knowledge, varying from comprehensive or cross domain to domain or application specific, and (b) carefully exploit the knowledge to further empower or extend the applications of ML/NLP techniques. Using the early results in several diverse situations – both in data types and applications – we seek to foretell unprecedented progress in our ability for deeper understanding and exploitation of multimodal data.
In this paper, the problem of Active Authentication using trace histories is addressed. Specifically, the task is to perform user verification on mobile devices using historical location traces of the user as a function of time. Considering the movement of a human as a Markovian motion, a modified Hidden Markov Model (HMM)-based solution is proposed. The proposed method, namely the Marginally Smoothed HMM (MSHMM), utilizes the marginal probabilities of location and timing information of the observations to smooth out the emission probabilities during training. Hence, it can efficiently handle unforeseen observations during the test phase. The verification performance of this method is compared to a sequence matching (SM) method, a Markov Chain-based method (MC) and an HMM with basic Laplace Smoothing (HMM-lap). Experimental results using the location information of the UMD Active Authentication Dataset-02 (UMDAA02) and the GeoLife dataset are presented. The proposed MSHMM method outperforms the compared methods in terms of equal error rate (EER). Additionally, the effects of different parameters on the proposed method are discussed.
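A pure-Python sketch of the Markov-chain baseline with Laplace smoothing: estimate transition probabilities from a user's historical trace, then score a new trace by its log-likelihood, so that unseen transitions are heavily penalized but never impossible. The location labels are hypothetical:

```python
import math

# Historical location trace of one user (hypothetical symbols for places).
history = list("AABABBACAABAB")
states = sorted(set(history) | {"C"})

# Count observed transitions between consecutive locations.
counts = {s: {t: 0 for t in states} for s in states}
for a, b in zip(history, history[1:]):
    counts[a][b] += 1

def prob(a, b, k=1.0):
    # Laplace smoothing: unseen transitions get a small, non-zero probability
    total = sum(counts[a].values()) + k * len(states)
    return (counts[a][b] + k) / total

def score(trace):
    # log-likelihood of a candidate trace under the learned chain
    return sum(math.log(prob(a, b)) for a, b in zip(trace, trace[1:]))

genuine = score(list("ABABA"))     # resembles the history
imposter = score(list("CCCCC"))    # transitions never observed
assert genuine > imposter          # verification: threshold on this score
```

The MSHMM of the paper replaces this simple smoothing with marginal-probability smoothing of HMM emission probabilities.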
In this work, we present and analyze reported failures of artificially intelligent systems and extrapolate our analysis to future AIs. We suggest that both the frequency and the seriousness of future AI failures will steadily increase. AI Safety can be improved based on ideas developed by cybersecurity experts. For narrow AIs safety failures are at the same, moderate, level of criticality as in cybersecurity, however for general AI, failures have a fundamentally different impact. A single failure of a superintelligent system may cause a catastrophic event without a chance for recovery. The goal of cybersecurity is to reduce the number of successful attacks on the system; the goal of AI Safety is to make sure zero attacks succeed in bypassing the safety mechanisms. Unfortunately, such a level of performance is unachievable. Every security system will eventually fail; there is no such thing as a 100% secure system.

# What’s new on arXiv

We present a simple algorithm for explicitly computing all k shortest paths bounded by length L from a fixed source to a target in O(m + kL) and O(mlogm + kL) time for unweighted and weighted directed graphs with m edges respectively. For many graphs, this outperforms existing algorithms by exploiting the fact that real world networks have short average path length. Consequently, we would like to adapt our almost shortest paths algorithm to find an efficient solution to the almost shortest simple paths, where we exclude paths that visit any node more than once. To this end, we consider realizations from the Chung-Lu random graph model as the Chung-Lu random graph model is not only amenable to analysis, but also emulates many of the properties frequently observed in real world networks including the small world phenomenon and degree heterogeneity. We provide theoretical and numeric evidence regarding the efficiency of utilizing our almost shortest paths algorithm to find almost shortest simple paths for Chung-Lu random graphs for a wide range of parameters. Finally, we consider a special application of our almost shortest paths algorithm to study internet routing (withdrawals) in the Autonomous System graph.
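The flavour of the task can be sketched as a best-first search over partial paths, bounded by L. This toy allows node revisits (the non-simple variant the abstract starts from), and the graph is hypothetical; it is not the paper's algorithm or its complexity:

```python
from heapq import heappush, heappop

# Hypothetical unweighted digraph as an adjacency list.
graph = {"s": ["a", "b"], "a": ["t", "b"], "b": ["t"], "t": []}

def k_shortest(graph, source, target, k, L):
    """Enumerate up to k shortest source->target paths of length <= L,
    by popping the shortest partial path first."""
    paths, heap = [], [(0, [source])]
    while heap and len(paths) < k:
        length, path = heappop(heap)
        if path[-1] == target:
            paths.append((length, path))
            continue
        if length < L:                       # respect the length bound L
            for nxt in graph.get(path[-1], []):
                heappush(heap, (length + 1, path + [nxt]))
    return paths

result = k_shortest(graph, "s", "t", k=3, L=4)
assert [length for length, _ in result] == [2, 2, 3]
```

Restricting to simple paths would add a `nxt not in path` check, which is exactly the harder variant the paper studies on Chung-Lu graphs.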
Past work in computational sarcasm deals primarily with sarcasm detection. In this paper, we introduce a novel, related problem: sarcasm target identification (i.e., extracting the target of ridicule in a sarcastic sentence). We present an introductory approach for sarcasm target identification. Our approach employs two types of extractors: one based on rules, and another consisting of a statistical classifier. To compare our approach, we use two baselines: a naïve baseline and another baseline based on work in sentiment target identification. We perform our experiments on book snippets and tweets, and show that our hybrid approach performs better than the two baselines, and also better than either extractor used individually. Our introductory approach establishes the viability of sarcasm target identification, and will serve as a baseline for future work.
Independent component analysis (ICA) is the most popular method for blind source separation (BSS) with a diverse set of applications, such as biomedical signal processing, video and image analysis, and communications. Maximum likelihood (ML), an optimal theoretical framework for ICA, requires knowledge of the true underlying probability density function (PDF) of the latent sources, which, in many applications, is unknown. ICA algorithms cast in the ML framework often deviate from its theoretical optimality properties due to poor estimation of the source PDF. Therefore, accurate estimation of source PDFs is critical in order to avoid model mismatch and poor ICA performance. In this paper, we propose a new and efficient ICA algorithm based on entropy maximization with kernels, (ICA-EMK), which uses both global and local measuring functions as constraints to dynamically estimate the PDF of the sources with reasonable complexity. In addition, the new algorithm performs optimization with respect to each of the cost function gradient directions separately, enabling parallel implementations on multi-core computers. We demonstrate the superior performance of ICA-EMK over competing ICA algorithms using simulated as well as real-world data.
We present a framework and analysis of consistent binary classification for complex and non-decomposable performance metrics such as the F-measure and the Jaccard measure. The proposed framework is general, as it applies to both batch and online learning, and to both linear and non-linear models. Our work follows recent results showing that the Bayes optimal classifier for many complex metrics is given by a thresholding of the conditional probability of the positive class. This manuscript extends this thresholding characterization — showing that the utility is strictly locally quasi-concave with respect to the threshold for a wide range of models and performance metrics. This, in turn, motivates simple normalized gradient ascent updates for threshold estimation. We present a finite-sample regret analysis for the resulting procedure. In particular, the risk for the batch case converges to the Bayes risk at the same rate as that of the underlying conditional probability estimation, and the risk of proposed online algorithm converges at a rate that depends on the conditional probability estimation risk. For instance, in the special case where the conditional probability model is logistic regression, our procedure achieves $O(\frac{1}{\sqrt{n}})$ sample complexity, both for batch and online training. Empirical evaluation shows that the proposed algorithms out-perform alternatives in practice, with comparable or better prediction performance and reduced run time for various metrics and datasets.
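The core observation, that the Bayes optimal classifier for such metrics thresholds the conditional probability of the positive class, can be illustrated by tuning the threshold directly for the F-measure. A coarse sweep stands in for the paper's normalized gradient ascent, and the scores and labels below are made up:

```python
# Hypothetical estimated probabilities P(y=1|x) and true labels.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   1,   0,   1]

def f_measure(threshold):
    # F1 of the classifier "predict 1 iff score >= threshold"
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# coarse sweep over thresholds in place of normalized gradient ascent
best = max((t / 100 for t in range(1, 100)), key=f_measure)
assert abs(f_measure(best) - 0.8) < 1e-9
```

The paper's point is that `f_measure` is well behaved enough in the threshold (strictly locally quasi-concave) that simple gradient updates find this maximizer with finite-sample guarantees.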
Due to the recent cases of algorithmic bias in data-driven decision-making, machine learning methods are being put under the microscope in order to understand the root cause of these biases and how to correct them. Here, we consider a basic algorithmic task that is central in machine learning: subsampling from a large data set. Subsamples are used both as an end-goal in data summarization (where fairness could either be a legal, political or moral requirement) and to train algorithms (where biases in the samples are often a source of bias in the resulting model). Consequently, there is a growing effort to modify either the subsampling methods or the algorithms themselves in order to ensure fairness. However, in doing so, a question that seems to be overlooked is whether it is possible to produce fair subsamples that are also adequately representative of the feature space of the data set – an important and classic requirement in machine learning. Can diversity and fairness be simultaneously ensured? We start by noting that, in some applications, guaranteeing one does not necessarily guarantee the other, and a new approach is required. Subsequently, we present an algorithmic framework which allows us to produce both fair and diverse samples. Our experimental results on an image summarization task show marked improvements in fairness without compromising feature diversity by much, giving us the best of both worlds.
In this paper, we develop a new sequential regression modeling approach for data streams. Data streams are commonly found around us, e.g. in a retail enterprise, sales data is continuously collected every day. A demand forecasting model is an important outcome from the data that needs to be continuously updated with the new incoming data. The main challenge in such modeling arises when there is a) high dimensionality and sparsity, b) a need for an adaptive use of prior knowledge, and/or c) structural changes in the system. The proposed approach addresses these challenges by incorporating an adaptive L1-penalty and inertia terms in the loss function, and is thus called Inertial Regularization and Selection (IRS). The former term performs model selection to handle the first challenge, while the latter is shown to address the last two challenges. A recursive estimation algorithm is developed, and shown to outperform the commonly used state-space models, such as Kalman Filters, in experimental studies and on real data.
We propose a new model based on deconvolutional networks and SAX discretization to learn representations for multivariate time series. Deconvolutional networks fully exploit the powerful expressiveness of deep neural networks in the manner of unsupervised learning. We design a network structure specifically to capture the cross-channel correlation with deconvolution, forcing the pooling operation to perform the dimension reduction along each position in the individual channel. Discretization based on Symbolic Aggregate Approximation is applied on the feature vectors to further extract the bag of features. We show how this representation and bag of features help with classification. A full comparison with the sequence distance based approach is provided to demonstrate the effectiveness of our approach on standard datasets. We further build the Markov matrix from the discretized representation from the deconvolution to visualize the time series as complex networks, which show more class-specific statistical properties and clear structures with respect to different labels.
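For reference, a minimal sketch of the SAX step itself: z-normalize the series, reduce it with Piecewise Aggregate Approximation (PAA), then map segment means to symbols using Gaussian breakpoints. The series, segment count and 4-letter alphabet are illustrative choices:

```python
import statistics

def sax(series, segments=4, breakpoints=(-0.6745, 0.0, 0.6745)):
    """SAX word for a numeric series; breakpoints are the quartiles of N(0,1),
    giving a 4-symbol alphabet 'a'..'d' with roughly equal probability."""
    mu, sd = statistics.mean(series), statistics.pstdev(series)
    z = [(v - mu) / sd for v in series]                 # z-normalize
    step = len(z) // segments
    paa = [statistics.mean(z[i * step:(i + 1) * step])  # PAA segment means
           for i in range(segments)]
    symbols = "abcd"
    return "".join(symbols[sum(m > b for b in breakpoints)] for m in paa)

word = sax([1, 2, 3, 4, 5, 6, 7, 8])   # a steadily rising series
assert word == "abcd"                  # low -> high maps to a -> d
```

In the paper this discretization is applied to the learned deconvolutional feature vectors rather than to the raw series.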
Even though in recent years the scale of statistical analysis problems has increased tremendously, many statistical software tools are still limited to single-node computations. However, statistical analyses are largely based on dense linear algebra operations, which have been deeply studied, optimized and parallelized in the high-performance-computing community. To make high-performance distributed computations available for statistical analysis, and thus enable large scale statistical computations, we introduce RElem, an open source package that integrates the distributed dense linear algebra library Elemental into R. While on the one hand RElem provides direct wrappers of Elemental’s routines, on the other hand it overloads various operators and functions to provide an entirely native R experience for distributed computations. We showcase how simple it is to port existing R programs to RElem and demonstrate that RElem indeed makes it possible to scale beyond the single-node limitation of R with the full performance of Elemental and without any overhead.
Similarity search on time series is a frequent operation in large-scale data-driven applications. Sophisticated similarity measures are standard for time series matching, as the series are usually misaligned. Dynamic Time Warping (DTW) is the most widely used similarity measure for time series because it combines alignment and matching at the same time. However, the alignment makes DTW slow. To speed up the expensive similarity search with DTW, branch-and-bound pruning strategies are adopted. However, branch-and-bound pruning is only useful for very short queries (low-dimensional time series), and the bounds are quite weak for longer queries. Due to the loose bounds, the branch-and-bound pruning strategy boils down to a brute-force search. To circumvent this issue, we design SSH (Sketch, Shingle, & Hashing), an efficient and approximate hashing scheme which is much faster than the state-of-the-art branch-and-bound searching technique, the UCR suite. SSH uses a novel combination of sketching, shingling and hashing techniques to produce (probabilistic) indexes which align (near perfectly) with the DTW similarity measure. The generated indexes are then used to create hash buckets for sub-linear search. Our results show that SSH is very effective for longer time sequences and prunes around 95% of candidates, leading to a massive speedup in search with DTW. Empirical results on two large-scale benchmark time series datasets show that our proposed method can be around 20 times faster than the state-of-the-art package (the UCR suite) without any significant loss in accuracy.
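The shingle-and-hash intuition can be sketched in a few lines: discretize a series into up/down symbols, form n-gram shingles, and compare shingle sets, so that similar shapes share many shingles (and would land in the same hash buckets). This is a simplification for illustration, not SSH's actual sketching or hash functions:

```python
def shingles(series, n=3):
    # symbolize consecutive moves as up/down, then take n-gram shingles
    sym = "".join("u" if b > a else "d" for a, b in zip(series, series[1:]))
    return {sym[i:i + n] for i in range(len(sym) - n + 1)}

def jaccard(s, t):
    # set overlap; in SSH this is approximated via hashing for sub-linear search
    return len(s & t) / len(s | t)

a = [1, 2, 3, 2, 1, 2, 3, 2, 1]    # hypothetical zig-zag series
b = [1, 2, 3, 2, 1, 2, 3, 2, 2]    # nearly the same shape
c = [9, 1, 8, 2, 7, 3, 6, 4, 5]    # a very different shape
assert jaccard(shingles(a), shingles(b)) > jaccard(shingles(a), shingles(c))
```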
This special issue is dedicated to get a better picture of the relationships between computational linguistics and cognitive science. It specifically raises two questions: ‘what is the potential contribution of computational language modeling to cognitive science?’ and conversely: ‘what is the influence of cognitive science in contemporary computational linguistics?’
We present a new algorithm, truncated variance reduction (TruVaR), that treats Bayesian optimization (BO) and level-set estimation (LSE) with Gaussian processes in a unified fashion. The algorithm greedily shrinks a sum of truncated variances within a set of potential maximizers (BO) or unclassified points (LSE), which is updated based on confidence bounds. TruVaR is effective in several important settings that are typically non-trivial to incorporate into myopic algorithms, including pointwise costs and heteroscedastic noise. We provide a general theoretical guarantee for TruVaR covering these aspects, and use it to recover and strengthen existing results on BO and LSE. Moreover, we provide a new result for a setting where one can select from a number of noise levels having associated costs. We demonstrate the effectiveness of the algorithm on both synthetic and real-world data sets.
We show that every matrix all of whose entries are in a fixed subgroup of the group of units of a commutative ring with identity is equivalent to a standard form. As a consequence, we improve the proof of Theorem 5 in D. Best, H. Kharaghani, H. Ramp [Disc. Math. 313 (2013), 855–864].
Meaning has been called the ‘holy grail’ of a variety of scientific disciplines, ranging from linguistics to philosophy, psychology and the neurosciences. The field of Artificial Intelligence (AI) is very much a part of that list: the development of sophisticated natural language semantics is a sine qua non for achieving a level of intelligence comparable to humans. Embodiment theories in cognitive science hold that human semantic representation depends on sensori-motor experience; the abundant evidence that human meaning representation is grounded in the perception of physical reality leads to the conclusion that meaning must depend on a fusion of multiple (perceptual) modalities. Despite this, AI research in general, and its subdisciplines such as computational linguistics and computer vision in particular, have focused primarily on tasks that involve a single modality. Here, we propose virtual embodiment as an alternative, long-term strategy for AI research that is multi-modal in nature and that allows for the kind of scalability required to develop the field coherently and incrementally, in an ethically responsible fashion.
We consider a distributed learning approach in supervised learning for a large class of spectral regularization methods in an RKHS framework. The data set of size n is partitioned into $m=O(n^\alpha)$, $\alpha \leq \frac{1}{2}$, disjoint subsets. On each subset, some spectral regularization method (belonging to a large class, including in particular Kernel Ridge Regression, $L^2$-boosting and spectral cut-off) is applied. The regression function $f$ is then estimated via simple averaging, leading to a substantial reduction in computation time. We show that minimax optimal rates of convergence are preserved if m grows sufficiently slowly (corresponding to an upper bound for $\alpha$) as $n \to \infty$, depending on the smoothness assumptions on $f$ and the intrinsic dimensionality. In spirit, our approach is classical.
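The divide-and-average strategy can be sketched in the simplest possible setting: partition the data into m subsets, fit an estimator on each (here, 1-D ridge regression through the origin, a stand-in for the paper's kernel methods), and average the fits. The data, m and λ are arbitrary illustrative choices:

```python
import random

# Synthetic 1-D data: y = 2x plus Gaussian noise (an assumption).
random.seed(1)
slope_true = 2.0
data = [(x, slope_true * x + random.gauss(0, 0.5))
        for x in [random.uniform(-1, 1) for _ in range(600)]]

def ridge_slope(points, lam=0.01):
    # closed-form 1-D ridge regression through the origin
    return (sum(x * y for x, y in points)
            / (sum(x * x for x, y in points) + lam))

m = 6                                   # number of disjoint subsets
chunks = [data[i::m] for i in range(m)] # partition the data
averaged = sum(ridge_slope(ch) for ch in chunks) / m
assert abs(averaged - slope_true) < 0.2  # averaging recovers the signal
```

Each subset fit costs far less than a fit on the full data (for kernel methods, roughly a factor m³ less), which is the source of the computational savings; the paper's result is that the averaged estimator still attains minimax optimal rates when m grows slowly enough.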
Applications of functional data with large numbers of predictors have grown precipitously in recent years, driven, in part, by rapid advances in genotyping technologies. Given the large numbers of genetic mutations encountered in genetic association studies, statistical methods which more fully exploit the underlying structure of the data are imperative for maximizing statistical power. However, there is currently very limited work in functional data with large numbers of predictors. Tools are presented for simultaneous variable selection and parameter estimation in a functional linear model with a functional outcome and a large number of scalar predictors; the technique is called AFSL for $\textit{Adaptive Function-on-Scalar Lasso}.$ It is demonstrated how techniques from convex analysis over Hilbert spaces can be used to establish a functional version of the oracle property for AFSL over any real separable Hilbert space, even when the number of predictors, $I$, is exponentially large compared to the sample size, $N$. AFSL is illustrated via a simulation study and data from the Childhood Asthma Management Program, CAMP, selecting those genetic mutations which are important for lung growth.
There are several confounding factors that can reduce the accuracy of gait recognition systems. These factors can reduce the distinctiveness of, or alter, the features used to characterise gait; they include variations in clothing, lighting, pose and environment, such as the walking surface. Full invariance to all confounding factors is challenging in the absence of high-quality labelled training data. We introduce a simulation-based methodology and a subject-specific dataset which can be used for generating synthetic video frames and sequences for data augmentation. With this methodology, we generated a multi-modal dataset. In addition, we supply simulation files that provide the ability to simultaneously sample from several confounding variables. The basis of the data is real motion capture data of subjects walking and running on a treadmill at different speeds. Results from gait recognition experiments suggest that information about the identity of subjects is retained within synthetically generated examples. The dataset and methodology allow studies into fully-invariant identity recognition spanning a far greater number of observation conditions than would otherwise be possible.

# Book Memo: “Novel Applications of Intelligent Systems”

In this carefully edited book, some selected results of theoretical and applied research in the field of broadly perceived intelligent systems are presented. The problems vary from industrial to web and problem-independent applications. All of this is united under the slogan ‘Intelligent systems conquer the world’. The book brings together innovation projects with analytical research, invention, retrieval and processing of knowledge, and logical applications in technology. It is aimed at a wide circle of readers, and particularly at the young generation of IT/ICT experts who will build the next generations of intelligent systems.

# R Packages worth a look

Performs Bayesian Variable Selection on the Covariates in a Semi-Competing Risks Model (SCRSELECT)
Contains four functions used in the DIC-Tau_g procedure. SCRSELECT() and SCRSELECTRUN() use Stochastic Search Variable Selection to select important covariates in the three hazard functions of a semi-competing risks model. These functions perform the Gibbs sampler for variable selection and a Metropolis-Hastings-Green sampler for the number of split points and parameters of the three baseline hazard functions. The function SCRSELECT() writes the posterior sample of all quantities sampled in the Gibbs sampler after a burn-in period to a desired file location, while the function SCRSELECTRUN() returns posterior values of important quantities to the DIC-Tau_g procedure in a list. The function DICTAUG() returns a list containing the DIC values for the unique models visited by the DIC-Tau_g grid search. The function ReturnModel() uses SCRSELECTRUN() and DICTAUG() to return a summary of the posterior coefficient vectors for the optimal model, along with saving this posterior sample to a desired path location.

Fast Linear Models for Objects from the ‘bigmemory’ Package (bigFastlm)
A reimplementation of the fastLm() functionality of ‘RcppEigen’ for big.matrix objects for fast out-of-memory linear model fitting.

d3.js’ Utilities for R (d3r)
Helper functions for using ‘d3.js’ in R.

# If you did not already know: “Azure Machine Learning”

Azure Machine Learning is a fully-managed service in the cloud that enables you to easily build, deploy, and share advanced analytics solutions. There is no software to download and no servers to manage – all you need to start is a browser and internet connectivity. … Azure Machine Learning

# What’s new on arXiv

Existing machine translation decoding algorithms generate translations in a strictly monotonic fashion and never revisit previous decisions. As a result, earlier mistakes cannot be corrected at a later stage. In this paper, we present a translation scheme that starts from an initial guess and then makes iterative improvements that may revisit previous decisions. We parameterize our model as a convolutional neural network that predicts discrete substitutions to an existing translation based on an attention mechanism over both the source sentence as well as the current translation output. By making less than one modification per sentence, we improve the output of a phrase-based translation system by up to 0.4 BLEU on WMT15 German-English translation.
In this note we describe the theory of functional asynchronous networks and one of the main results, the Modularization of Dynamics Theorem, which for a large class of functional asynchronous networks gives a factorization of dynamics in terms of constituent subnetworks. For these networks we can give a complete description of the network function in terms of the function of the events comprising the network and thereby answer a question originally raised by Alon in the context of biological networks.
The vast majority of recommender systems model preferences as static or slowly changing due to observable user experience. However, spontaneous changes in user preferences are ubiquitous in many domains like media consumption, and the key factors that drive changes in preferences are not directly observable. These latent sources of preference change pose new challenges. When systems do not track and adapt to users’ tastes, users lose confidence and trust, increasing the risk of user churn. We meet these challenges by developing a model of novelty preferences that learns and tracks latent user tastes. We combine three innovations: a new measure of item similarity based on patterns of consumption co-occurrence; a model for *spontaneous* changes in preferences; and a learning agent that tracks each user’s dynamic preferences and learns individualized policies for variety. The resulting framework adaptively provides users with novelty tailored to their preferences for change per se.
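The abstract does not spell out its similarity measure, but the general idea of scoring item similarity from patterns of consumption co-occurrence can be sketched with a plain cosine over co-occurrence counts. This is a generic baseline under assumed data, not the paper's actual measure; the matrix `C` and the cosine choice are illustrative.

```python
import numpy as np

# Toy user-item consumption matrix: rows are users, columns are items,
# C[u, i] = 1 if user u consumed item i.
C = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

co = C.T @ C                        # co[i, j] = number of users who consumed both i and j
norms = np.sqrt(np.diag(co))        # sqrt of each item's total consumption count
sim = co / np.outer(norms, norms)   # cosine similarity of co-consumption patterns

print(np.round(sim, 2))
```

Here items 0 and 1 share two consumers and score high (about 0.82), while items 0 and 3 share none and score 0.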
In this paper we present a new algorithm for computing a low-rank approximation of the product $A^TB$ by taking only a single pass over the two matrices $A$ and $B$. The straightforward way to do this is to (a) first sketch $A$ and $B$ individually, and then (b) find the top components by applying PCA to the sketch. Our algorithm, in contrast, retains additional summary information about $A$ and $B$ (e.g., row and column norms) and uses this additional information to obtain an improved approximation from the sketches. Our main analytical result establishes a spectral norm guarantee comparable to existing two-pass methods; in addition, we provide results from an Apache Spark implementation that show better computational and statistical performance on real-world and synthetic evaluation datasets.
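As a concrete illustration of the "straightforward" sketch-then-PCA baseline that the paper improves upon, the following NumPy sketch projects both matrices with one shared random matrix in a single pass and truncates the sketched product. The dimensions, sketch size, and Gaussian projection are assumptions for illustration; the paper's algorithm additionally retains row/column-norm summaries, which are not modelled here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2, k = 10_000, 50, 40, 5           # rows, columns of A and B, target rank

# Build A and B sharing a k-dimensional latent structure so A^T B is near rank-k.
Z = rng.standard_normal((n, k))
A = Z @ rng.standard_normal((k, d1)) + 0.05 * rng.standard_normal((n, d1))
B = Z @ rng.standard_normal((k, d2)) + 0.05 * rng.standard_normal((n, d2))

# (a) Sketch A and B with one shared random projection (a single pass over the rows;
#     in a streaming setting the rows of S are generated on the fly).
m = 500
S = rng.standard_normal((m, n)) / np.sqrt(m)
SA, SB = S @ A, S @ B                       # E[(SA)^T (SB)] = A^T B

# (b) Recover the top-k part of the sketched product via a truncated SVD.
U, s, Vt = np.linalg.svd(SA.T @ SB, full_matrices=False)
approx_k = (U[:, :k] * s[:k]) @ Vt[:k]

exact = A.T @ B
err = np.linalg.norm(approx_k - exact) / np.linalg.norm(exact)
print(f"relative error of the one-pass rank-{k} estimate: {err:.3f}")
```

The estimator is unbiased but noisy; increasing the sketch size `m` trades memory for accuracy, which is exactly the regime where the paper's extra summary information helps.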
We present new methods for batch anomaly detection in multivariate time series. Our methods are based on maximizing the Kullback-Leibler divergence between the data distribution within and outside an interval of the time series. An empirical analysis shows the benefits of our algorithms compared to methods that treat each time step independently of the others, without optimizing over all possible intervals.
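The core idea can be sketched in a minimal univariate form: fit a Gaussian inside and outside each candidate interval and return the interval maximising the KL divergence between the two fits. The paper's methods are multivariate and more sophisticated; the function names, the Gaussian modelling choice, and the brute-force scan over all intervals below are my own illustrative assumptions.

```python
import numpy as np

def gauss_kl(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ) for univariate Gaussians."""
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def most_anomalous_interval(x, min_len=5):
    """Scan all intervals and return the one maximising the KL divergence
    between Gaussians fitted inside and outside the interval."""
    n = len(x)
    best, best_iv = -np.inf, None
    for i in range(n - min_len):
        for j in range(i + min_len, n + 1):
            inside = x[i:j]
            outside = np.concatenate([x[:i], x[j:]])
            if len(outside) < min_len:
                continue
            kl = gauss_kl(inside.mean(), inside.var() + 1e-9,
                          outside.mean(), outside.var() + 1e-9)
            if kl > best:
                best, best_iv = kl, (i, j)
    return best_iv, best

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 200)
x[80:100] += 4.0                      # injected anomalous interval
(i, j), score = most_anomalous_interval(x)
print(f"detected interval [{i}, {j}), KL score {score:.2f}")
```

On this toy series the detected interval overlaps the injected shift at [80, 100). The scan is quadratic in the series length, which is one reason interval-based methods need care to scale.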
In this paper, Kernel PCA is reinterpreted as the solution to a convex optimization problem. Actually, there is a constrained convex problem for each principal component, so that the constraints guarantee that the principal component is indeed a solution, and not a mere saddle point. Although these insights do not imply any algorithmic improvement, they can be used to further understand the method, formulate possible extensions and properly address them. As an example, a new convex optimization problem for semi-supervised classification is proposed, which seems particularly well-suited whenever the number of known labels is small. Our formulation resembles a Least Squares SVM problem with a regularization parameter multiplied by a negative sign, combined with a variational principle for Kernel PCA. Our primal optimization principle for semi-supervised learning is solved in terms of the Lagrange multipliers. Numerical experiments in several classification tasks illustrate the performance of the proposed model in problems with only a few labeled data.
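For reference, the classical eigendecomposition formulation of Kernel PCA that the paper reinterprets can be sketched as follows. The RBF kernel, data, and parameters are illustrative assumptions; the paper's contribution is the convex-optimization view of each component, not this standard computation.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix of the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_pca(K, n_components=2):
    """Classical Kernel PCA: eigendecomposition of the centred Gram matrix.
    Returns the projections of the training points onto the top components."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    Kc = H @ K @ H
    vals, vecs = np.linalg.eigh(Kc)          # eigh returns ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]
    # Projection of training point i onto component c is sqrt(lambda_c) * v_c[i].
    return vecs[:, :n_components] * np.sqrt(np.maximum(vals[:n_components], 0))

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
Z = kernel_pca(rbf_kernel(X, gamma=0.5), n_components=2)
print(Z.shape)        # (100, 2)
```

The eigendecomposition only certifies stationarity; the paper's constrained convex reformulation is what guarantees each component is a genuine solution rather than a saddle point.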
In recent years, traditional cybersecurity safeguards have proven ineffective against insider threats. Famous cases of sensitive information leaks caused by insiders, including the WikiLeaks release of diplomatic cables and the Edward Snowden incident, have greatly harmed the U.S. government’s relationship with other governments and with its own citizens. Data Leak Prevention (DLP) is a solution for detecting and preventing information leaks from within an organization’s network. However, state-of-the-art DLP detection models are only able to detect very limited types of sensitive information, and research in the field has been hindered due to the lack of available sensitive texts. Many researchers have focused on document-based detection with artificially labeled ‘confidential documents’ for which security labels are assigned to the entire document, when in reality only a portion of the document is sensitive. This type of whole-document based security labeling increases the chances of preventing authorized users from accessing non-sensitive information within sensitive documents. In this paper, we introduce Automated Classification Enabled by Security Similarity (ACESS), a new and innovative detection model that penetrates the complexity of big text security classification/detection. To analyze the ACESS system, we constructed a novel dataset, containing formerly classified paragraphs from diplomatic cables made public by the WikiLeaks organization. To our knowledge, this paper is the first to analyze a dataset that contains actual formerly sensitive information annotated at paragraph granularity.
In computer vision, action recognition refers to classifying an action present in a given video, while action detection involves locating actions of interest in space and/or time. Videos encode photometric information (e.g., RGB or intensity values) in a lattice structure, and this information can assist in identifying the action that has been imaged. The process of action recognition and detection often begins with extracting useful features and encoding them so that they are specific to the task of action recognition and detection. Encoded features are then processed through a classifier to identify the action class and its spatial and/or temporal location. In this report, a thorough review of various action recognition and detection algorithms in computer vision is provided by analyzing the two-step process of a typical action recognition and detection algorithm: (i) extraction and encoding of features, and (ii) classifying features into action classes. In an effort to ensure that computer vision-based algorithms reach the capabilities that humans have of identifying actions irrespective of the various nuisance variables that may be present within the field of view, the state-of-the-art methods are reviewed and some remaining problems are addressed in the final chapter.
Automatic construction of large knowledge graphs (KG) by mining web-scale text datasets has received considerable attention over the last few years, resulting in the construction of several KGs, such as NELL, Google Knowledge Vault, etc. These KGs consist of thousands of predicate-relations (e.g., isPerson, isMayorOf) and millions of their instances (e.g., (Bill de Blasio, isMayorOf, New York City)). Estimating accuracy of such automatically constructed KGs is a challenging problem due to their size and diversity. Even though crowdsourcing is an obvious choice for such evaluation, the standard single-task crowdsourcing, where each predicate in the KG is evaluated independently, is very expensive and especially problematic if the budget available is limited. We show that such approaches are sub-optimal as they ignore dependencies among various predicates and their instances. To overcome this challenge, we propose Relational Crowdsourcing (RelCrowd), where the tasks are created while taking dependencies among predicates and instances into account. We apply this framework in the context of evaluation of large-scale KGs and demonstrate its effectiveness through extensive experiments on real-world datasets.

# Book Memo: “Statistics for Lawyers”

 This classic text, first published in 1990, is designed to introduce law students, law teachers, practitioners, and judges to the basic ideas of mathematical probability and statistics as they have been applied in the law. The third edition includes over twenty new sections, including the addition of timely topics, like New York City police stops, exonerations in death-sentence cases, projecting airline costs, and new material on various statistical techniques such as the randomized response survey technique, rare-events meta-analysis, competing risks, and negative binomial regression. The book consists of sections of exposition followed by real-world cases and case studies in which statistical data have played a role. The reader is asked to apply the theory to the facts, to calculate results (a hand calculator is sufficient), and to explore legal issues raised by quantitative findings. The authors’ calculations and comments are given in the back of the book. As with previous editions, the cases and case studies reflect a broad variety of legal subjects, including antidiscrimination, mass torts, taxation, school finance, identification evidence, preventive detention, handwriting disputes, voting, environmental protection, antitrust, sampling for insurance audits, and the death penalty. A chapter on epidemiology was added in the second edition. In 1991, the first edition was selected by the University of Michigan Law Review as one of the important law books of the year.