
Documents

1 | 4 | 5 | 7 | 8 | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | R | S | T | U | V | W | Y | Z
|Documents| = 869

1

10 Tips to Create Useful and Beautiful Visualizations (Slide Deck)

4

4 Steps to Successfully Evaluating Business Analytics Software The goal of Business Analytics and Intelligence software is to help businesses access, analyze and visualize data, and then communicate those insights in meaningful dashboards and metrics. Unfortunately, the reality is that the majority of software options on the market today provide only a subset of that functionality, and those that provide a more comprehensive solution tend to lack the features that make it user-friendly. With a crowded marketplace, businesses need to go through a complex evaluation process and make some fundamental technology decisions before selecting a vendor. Finding business intelligence (BI) software that will scale with your organization’s needs may seem like an impossible task. Here are four questions to ask when beginning the BI evaluation process; they will save you a lot of time and help set you in the right direction.

5

5 Best Practices for Creating Effective Dashboards You’ve been there: no matter how many reports, formal meetings, casual conversations or emailed memos you circulate, someone important inevitably claims they didn’t know about some important fact or insight and says “we should have a dashboard to monitor the performance of X.” Or maybe you’ve been here: you’ve said “yes, let’s have a dashboard. It will help us improve return on investment (ROI) if everyone can see how X is performing and respond quickly. I’ll update it weekly.” Unfortunately, by week 3, you realize you’re spending several hours a week integrating data from multiple sources to update a dashboard you’re not sure anyone is actually using. Yet dashboards are all the rage, and with good reason. They can help you and your coworkers get a better grasp on the data – one of your most important, and often overlooked, assets. You’ve read how they help organizations get on the same page, speed decision-making and improve ROI. They create organizational alignment because everyone is looking at the same thing. So dashboards can be effective. They can work. The question becomes: how can you get one to work for you? Focus on these 5 best practices. Equally important, keep an eye on the 7 critical mistakes you don’t want to make.
50 years of Data Science More than 50 years ago, John Tukey called for a reformation of academic statistics. In ‘The Future of Data Analysis’, he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or ‘data analysis’. Ten to twenty years ago, John Chambers, Bill Cleveland and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland even suggested the catchy name ‘Data Science’ for his envisioned field. A recent and growing phenomenon is the emergence of ‘Data Science’ programs at major universities, including UC Berkeley, NYU, MIT, and most recently the Univ. of Michigan, which on September 8, 2015 announced a $100M ‘Data Science Initiative’ that will hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; in general, though, the new initiatives steer away from close involvement with academic statistics departments. This paper reviews some ingredients of the current ‘Data Science moment’, including recent commentary about data science in the popular media, and about how/whether Data Science is really different from Statistics. The now-contemplated field of Data Science amounts to a superset of the fields of statistics and machine learning which adds some technology for ‘scaling up’ to ‘big data’. This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next fifty years. Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere ‘scaling up’, but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers and Breiman, I present a vision of data science based on the activities of people who are ‘learning from data’, and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.

7

7 Signs You Need Advanced Analytics for Salesforce.com (or any CRM) and Why They Matter Sure, customer relationship management (CRM) applications provide reports and dashboards. But if you rely on the built-in analytic capabilities of CRM, you’re leaving money on the table. Because that’s what the information in your CRM system is; it’s money. But you can’t extract the true value of that information without an analytics application that does the heavy lifting without putting your sales team through hell. You also want your sales team to stay in your CRM application. That was the point. Remember, all CRM, all the time. Directing the team to another application for analytic insight just defeats the purpose. What you need are robust, easy-to-access analytics embedded right in your CRM solution. Following are seven signs that you are not operating efficiently and making reporting and analytics more difficult for your sales team and your business less productive. Don’t ignore these seven warning signs. They all carry one message: Yes, you need advanced analytics!
7 Tips to Succeed with Big Data in 2014 Just when you thought big data couldn’t get any bigger, it got bigger still. Regardless of its actual size, big data is showing its value. Organizations everywhere have big data of all shapes and sizes. They recognize the importance, the opportunity, and even the imperative to pay attention. It has become clear that big data will outlive those who ignore it. Organizations that have already tamed big data – the multi-structured mass they stored before they knew its worth – are improving their operational efficiency, growing their revenues, and empowering new business models. How do they do it? Their techniques for success can be summarized in seven tips.

8

8 Critical Metrics for Measuring App User Engagement In this guide, we outline for you the eight engagement metrics critical to app success, including suggestions for running marketing campaigns and boosting ROI.

A

A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques including text pre-processing, classification and clustering. Additionally, we briefly explain text mining in biomedical and health care domains.
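To make the surveyed pipeline concrete, here is a minimal sketch (not from the paper) of pre-processing, vectorization and clustering with scikit-learn; the toy documents and parameter choices are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["heart disease risk factors", "deep learning for images",
        "clinical trials in cardiology", "convolutional neural networks"]

# Pre-processing + vectorization: lowercasing, stop-word removal, TF-IDF weights.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Unsupervised structure: cluster the documents into two groups.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```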
A Closer Look at Memorization in Deep Networks We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While deep networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient based methods because training data itself plays an important role in determining the degree of memorization.
A comparative study of fuzzy c-means algorithm and entropy-based fuzzy clustering algorithms Fuzzy clustering is useful to mine complex and multi-dimensional data sets, where the members have partial or fuzzy relations. Among the various developed techniques, the fuzzy C-means (FCM) algorithm is the most popular one, where a piece of data has partial membership with each of the pre-defined cluster centers. Moreover, in FCM, the cluster centers are virtual, that is, they are chosen at random and thus might lie outside the data set. The cluster centers and the membership values of the data points are updated through a number of iterations. On the other hand, the entropy-based fuzzy clustering (EFC) algorithm works based on a similarity-threshold value. Contrary to FCM, in EFC, the cluster centers are real, that is, they are chosen from the data points. In the present paper, the performances of these algorithms are compared on four data sets – IRIS, WINES, OLITOS and psychosis (collected with the help of forty doctors) – in terms of the quality of the clusters obtained (that is, discrepancy factor, compactness, distinctness) and their computational time. Moreover, the best set of clusters has been mapped into 2-D for visualization using a self-organizing map (SOM).
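A minimal sketch of the FCM iteration described above – alternating center and membership updates with virtual centers – assuming NumPy and the common fuzzifier m = 2; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy C-means: alternate membership and center updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships; each row sums to 1.
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Virtual cluster centers: weighted means (may fall outside the data).
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distance of every point to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)
        # Standard FCM membership update: normalize d^(-2/(m-1)) over clusters.
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U
```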
A Comparative Study of Matrix Factorization and Random Walk with Restart in Recommender Systems Between matrix factorization and Random Walk with Restart (RWR), which method works better for recommender systems? Which method handles explicit or implicit feedback data better? Does additional side information help recommendation? Recommender systems play an important role in many e-commerce services such as Amazon and Netflix to recommend new items to a user. Among various recommendation strategies, collaborative filtering has shown good performance by using rating patterns of users. Matrix factorization and random walk with restart are the most representative collaborative filtering methods. However, it is still unclear which method provides better recommendation performance despite their extensive utility. In this paper, we provide a comparative study of matrix factorization and RWR in recommender systems. We exactly formulate each correspondence of the two methods according to various tasks in recommendation. Especially, we newly devise an RWR method using a global bias term which corresponds to a matrix factorization method using biases. We describe details of the two methods in various aspects of recommendation quality, such as how those methods handle the cold-start problem which typically happens in collaborative filtering. We extensively perform experiments over real-world datasets to evaluate the performance of each method in terms of various measures. We observe that matrix factorization performs better with explicit feedback ratings while RWR is better with implicit ones. We also observe that exploiting global popularities of items is advantageous in the performance and that side information produces positive synergy with explicit feedback but gives negative effects with implicit one.
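For reference, a sketch of the plain RWR iteration underlying the methods compared in the paper (the global-bias variant the authors devise is not reproduced here); it assumes a column-stochastic adjacency matrix of the user–item graph.

```python
import numpy as np

def random_walk_with_restart(A, seed_node, c=0.15, tol=1e-9, max_iter=1000):
    """Score all nodes by a random walk that restarts at `seed_node`.

    A must be column-stochastic (each column sums to 1); c is the restart
    probability.
    """
    n = A.shape[0]
    e = np.zeros(n)
    e[seed_node] = 1.0            # restart distribution
    r = e.copy()
    for _ in range(max_iter):
        r_next = (1 - c) * (A @ r) + c * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r
```

Ranking the item-node entries of the returned vector for a user’s seed node yields that user’s recommendations.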
A Comparative Study of Recommendation Algorithms in Ecommerce Applications We evaluate a wide range of recommendation algorithms on e-commerce-related datasets. These algorithms include the popular user-based and item-based correlation/similarity algorithms as well as methods designed to work with sparse transactional data. Data sparsity poses a significant challenge to recommendation approaches when applied in ecommerce applications. We experimented with approaches such as dimensionality reduction, generative models, and spreading activation, which are designed to meet this challenge. In addition, we report a new recommendation algorithm based on link analysis. Initial experimental results indicate that the link analysis-based algorithm achieves the best overall performance across several e-commerce datasets.
A comparison of algorithms for the multivariate L1-median The L1-median is a robust estimator of multivariate location with good statistical properties. Several algorithms for computing the L1-median are available. Problem-specific algorithms can be used, but so can general optimization routines. The aim is to compare different algorithms with respect to their precision and runtime. This is possible because all considered algorithms have been implemented in a standardized manner in the open source environment R. In most situations, the algorithm based on the optimization routine NLM (non-linear minimization) clearly outperforms other approaches. Its low computation time makes applications for large and high-dimensional data feasible.
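One classical approach to computing the L1-median is the Weiszfeld fixed-point iteration; a minimal NumPy sketch (illustrative, not the paper’s R code) follows.

```python
import numpy as np

def l1_median(X, tol=1e-8, max_iter=500):
    """Weiszfeld iteration for the spatial (L1-)median of the rows of X."""
    y = X.mean(axis=0)                       # start from the coordinate-wise mean
    for _ in range(max_iter):
        d = np.linalg.norm(X - y, axis=1)
        if np.any(d < 1e-12):                # iterate landed on a data point
            return y
        w = 1.0 / d                          # inverse-distance weights
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```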
A Composite Model for Computing Similarity Between Texts Computing text similarity is a foundational technique for a wide range of tasks in natural language processing such as duplicate detection, question answering, or automatic essay grading. Just recently, text similarity received wide-spread attention in the research community by the establishment of the Semantic Textual Similarity (STS) Task at the Semantic Evaluation (SemEval) workshop in 2012 – a fact that stresses the importance of text similarity research. The goal of the STS Task is to create automated measures which are able to compute the degree of similarity between two given texts in the same way that humans do. Measures are thereby expected to output continuous text similarity scores, which are then either compared with human judgments or used as a means for solving a particular problem. We start this thesis with the observation that while the concept of similarity is well grounded in psychology, text similarity is much less well-defined in the natural language processing community. No attempt has been made yet to formalize in what way text similarity between two texts can be computed. Still, text similarity is regarded as a fixed, axiomatic notion in the community. To alleviate this shortcoming, we describe existing formal models of similarity and discuss how we can adapt them to texts. We propose to judge text similarity along multiple text dimensions, i.e. characteristics inherent to texts, and provide empirical evidence based on a set of annotation studies that the proposed dimensions are perceived by humans. We continue with a comprehensive survey of state-of-the-art text similarity measures previously proposed in the literature. To the best of our knowledge, no such survey has been done yet. We propose a classification into compositional and noncompositional text similarity measures according to their inherent properties. Compositional measures compute text similarity based on pairwise word similarity scores between all words which are then aggregated to an overall similarity score, while noncompositional measures project the complete texts onto particular models and then compare the texts based on these models. Based on our theoretical insights, we then present the implementation of a text similarity system which composes a multitude of text similarity measures along multiple text dimensions using a machine learning classifier. Depending on the concrete task at hand, we argue that such a system may need to address more than a single text dimension in order to best resemble human judgments. Our efforts culminate in the open source framework DKPro Similarity, which streamlines the development of text similarity measures and experimental setups. We apply our system in two evaluations, for which it consistently outperforms prior work and competing systems: an intrinsic and an extrinsic evaluation. In the intrinsic evaluation, the performance of text similarity measures is evaluated in an isolated setting by comparing the algorithmically produced scores with human judgments. We conducted the intrinsic evaluation in the context of the STS Task as part of the SemEval workshop. In the extrinsic evaluation, the performance of text similarity measures is evaluated with respect to a particular task at hand, where text similarity is a means for solving a particular problem. We conducted the extrinsic evaluation in the text classification task of text reuse detection. 
The results of both evaluations support our hypothesis that a composition of text similarity measures highly benefits the similarity computation process. Finally, we stress the importance of text similarity measures for real-world applications. We therefore introduce the application scenario Self-Organizing Wikis, where users of wikis, i.e. web-based collaborative content authoring systems, are supported in their everyday tasks by means of natural language processing techniques in general, and text similarity in particular. We elaborate on two use cases where text similarity computation is particularly beneficial: the detection of duplicates, and the semi-automatic insertion of hyperlinks. Moreover, we discuss two further applications where text similarity is a valuable tool: In both question answering and textual entailment recognition, text similarity has been used successfully in experiments and appears to be a promising means for further research in these fields. We conclude this thesis with an analysis of shortcomings of current text similarity research and formulate challenges which should be tackled by future work. In particular, we believe that computing text similarity along multiple text dimensions – which depend on the specific task at hand – will benefit any other task where text similarity is fundamental, as a composition of text similarity measures has shown superior performance in both the intrinsic as well as the extrinsic evaluation.
A Concise Guide to Compositional Data Analysis Why a course in compositional data analysis? Compositional data consist of vectors whose components are the proportions or percentages of some whole. Their peculiarity is that their sum is constrained to be some constant: equal to 1 for proportions, 100 for percentages, or possibly some other constant c for other situations such as parts per million (ppm) in trace element compositions. Unfortunately a cursory look at such vectors gives the appearance of vectors of real numbers, with the consequence that over the last century all sorts of sophisticated statistical methods designed for unconstrained data have been applied to compositional data with inappropriate inferences. All this despite the fact that many workers have been, or should have been, aware that the sample space for compositional vectors is radically different from the real Euclidean space associated with unconstrained data. Several substantial warnings had been given, even as early as 1897 by Karl Pearson in his seminal paper on spurious correlations, and then repeatedly in the 1960s by the geologist Felix Chayes. Unfortunately little heed was paid to such warnings, and within the small circle who did pay attention the approach was essentially pathological, attempting to answer the question: what goes wrong when we apply multivariate statistical methodology designed for unconstrained data to our constrained data, and how can the unconstrained methodology be adjusted to give meaningful inferences?
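The guide’s remedy is to analyze log-ratios of components rather than the raw constrained values; as a sketch, the centered log-ratio (clr) transform maps a composition into unconstrained real space (NumPy, illustrative names).

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of a composition with positive parts."""
    x = np.asarray(x, dtype=float)
    x = x / x.sum()               # close the composition so parts sum to 1
    g = np.exp(np.log(x).mean())  # geometric mean of the parts
    return np.log(x / g)          # unconstrained coordinates, sum to 0

print(clr([0.2, 0.3, 0.5]))
```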
A Contemporary Overview of Probabilistic Latent Variable Models In this paper we provide a conceptual overview of latent variable models within a probabilistic modeling framework, an overview that emphasizes the compositional nature and the interconnectedness of the seemingly disparate models commonly encountered in statistical practice.
A correspondence between thermodynamics and inference A rough analogy between Bayesian statistics and statistical mechanics has long been discussed. We explore this analogy systematically and discover that it is more substantive than previously reported. We show that most canonical thermodynamic quantities have a natural correspondence with well-established statistical quantities. A novel correspondence is discovered between the heat capacity and the model complexity in information-based inference. This leads to a critical insight: We argue that the well-known mechanisms of failure of equipartition in statistical mechanics explain the nature of sloppy models in statistics. Finally, we exploit the correspondence to propose a solution to a long-standing ambiguity in Bayesian statistics: the definition of an objective or uninformative prior. In particular, we propose that the Gibbs entropy provides a natural generalization of the principle of indifference.
A Data Management System for Computational Experiments (3X) 3X, which stands for eXecuting eXploratory eXperiments, is a software tool to ease the burden of conducting computational experiments. 3X provides a standard yet configurable structure to execute a wide variety of experiments in a systematic way. 3X organizes the code, inputs, and outputs for an experiment, records results, and lets users visualize result data in a variety of ways. Its interface allows further runs of the experiment to be driven interactively. Our demonstration will illustrate how 3X eases the process of conducting computational experiments, using two complementary examples designed to quickly show the many features of 3X.
A data scientist’s guide to start-ups In August 2013, we held a panel discussion at the KDD 2013 conference in Chicago on the subject of data science, data scientists, and start-ups. KDD is the premier conference on data science research and practice. The panel discussed the pros and cons for top-notch data scientists of the hot data science start-up scene. In this article, we first present background on our panelists. Our four panelists have unquestionable pedigrees in data science and substantial experience with start-ups from multiple perspectives (founders, employees, chief scientists, venture capitalists). For the casual reader, we next present a brief summary of the experts’ opinions on eight of the issues the panel discussed. The rest of the article presents a lightly edited transcription of the entire panel discussion.
A fast learning algorithm for deep belief nets We show how to use “complementary priors” to eliminate the explaining away effects that make inference difficult in densely-connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modelled by long ravines in the free-energy landscape of the top-level associative memory and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
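As a rough sketch of the building block behind the greedy layer-wise procedure, here is one contrastive-divergence (CD-1) update for a single binary RBM in NumPy; the names, learning rate and batch convention are illustrative, and the wake-sleep fine-tuning stage is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, b_v, b_h, v0, lr=0.1, rng=None):
    """One CD-1 step for a binary RBM (one layer of a DBN).

    W: (n_visible, n_hidden) weights; b_v, b_h: biases; v0: (batch, n_visible).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Up pass: sample hidden units given the data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down-up pass: one step of reconstruction.
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Gradient approximation: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h
```

Stacking follows by treating the trained layer’s hidden activities as data for the next RBM.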
A Few Useful Things to Know about Machine Learning Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.
A Framework for Considering Comprehensibility Comprehensibility in modeling is the ability of stakeholders to understand relevant aspects of the modeling process. In this article, we provide a framework to help guide exploration of the space of comprehensibility challenges. We consider facets organized around key questions: Who is comprehending? Why are they trying to comprehend? Where in the process are they trying to comprehend? How can we help them comprehend? How do we measure their comprehension? With each facet we consider the broad range of options. We discuss why taking a broad view of comprehensibility in modeling is useful in identifying challenges and opportunities for solutions.
A Framework for Time-Consistent, Risk-Averse Model Predictive Control: Theory and Algorithms In this paper we present a framework for risk-averse model predictive control (MPC) of linear systems affected by multiplicative uncertainty. Our key innovation is to consider time-consistent, dynamic risk metrics as objective functions to be minimized. This framework is axiomatically justified in terms of time-consistency of risk assessments, is amenable to dynamic optimization, and is unifying in the sense that it captures a full range of risk preferences from risk-neutral to worst case. Within this framework, we propose and analyze an online risk-averse MPC algorithm that is provably stabilizing. Furthermore, by exploiting the dual representation of time-consistent, dynamic risk metrics, we cast the computation of the MPC control law as a convex optimization problem amenable to real-time implementation. Simulation results are presented and discussed.
A General Theory for Training Learning Machine Though deep learning is pushing machine learning to a new stage, basic theories of machine learning are still limited. The principle of learning, the role of a priori knowledge, the role of neuron bias, and the basis for choosing the neural transfer function and cost function, etc., are still far from clear. In this paper, we present a general theoretical framework for machine learning. We classify the prior knowledge into common and problem-dependent parts, and consider that the aim of learning is to maximally incorporate them. The principle we suggest for maximizing the former is the design risk minimization principle, while the neural transfer function, the cost function, as well as the pretreatment of samples, are endowed with the role of maximizing the latter. The role of the neuron bias is explained from a different angle. We develop a Monte Carlo algorithm to establish the input-output responses, and we control the input-output sensitivity of a learning machine by controlling that of individual neurons. Applications to function approximation and smoothing, pattern recognition and classification are provided to illustrate how to train general learning machines based on our theory and algorithm. Our method may in addition induce new applications, such as transductive inference.
A Gentle Introduction to Memetic Algorithms The generic denomination of ‘Memetic Algorithms’ (MAs) is used to encompass a broad class of metaheuristics (i.e. general-purpose methods aimed at guiding an underlying heuristic). The method is based on a population of agents and has proved to be of practical success in a variety of problem domains, in particular for the approximate solution of NP optimization problems. Unlike traditional Evolutionary Computation (EC) methods, MAs are intrinsically concerned with exploiting all available knowledge about the problem under study. The incorporation of problem-domain knowledge is not an optional mechanism, but a fundamental feature that characterizes MAs. This functioning philosophy is perfectly illustrated by the term ‘memetic’. Coined by R. Dawkins, the word ‘meme’ denotes the analog of the gene in the context of cultural evolution.
A Graph Summarization: A Survey While advances in computing resources have made processing enormous amounts of data possible, human ability to identify patterns in such data has not scaled accordingly. Thus, efficient computational methods for condensing and simplifying data are becoming vital for extracting actionable insights. In particular, while data summarization techniques have been studied extensively, only recently has summarizing interconnected data, or graphs, become popular. This survey is a structured, comprehensive overview of the state-of-the-art methods for summarizing graph data. We first broach the motivation behind and the challenges of graph summarization. We then categorize summarization approaches by the type of graphs taken as input and further organize each category by core methodology. Finally, we discuss applications of summarization on real-world graphs and conclude by describing some open problems in the field.
A History of Bayesian Neural Networks (Slide Deck)
A Joint Model for Question Answering and Question Generation We propose a generative machine comprehension model that learns jointly to ask and answer questions based on documents. The proposed model uses a sequence-to-sequence framework that encodes the document and generates a question (answer) given an answer (question). Significant improvement in model performance is observed empirically on the SQuAD corpus, confirming our hypothesis that the model benefits from jointly learning to perform both tasks. We believe the joint model’s novelty offers a new perspective on machine comprehension beyond architectural engineering, and serves as a first step towards autonomous information seeking.
A joint renewal process used to model event based data In many industrial situations, where systems must be monitored using data recorded throughout a historical period of observation, one cannot fully rely on sensor data, but often only has event data to work with. This, in particular, holds for legacy data, whose evaluation is of interest to systems analysts, reliability planners, maintenance engineers etc. Event data, herein defined as a collection of triples containing a time stamp, a failure code and possibly a descriptive text, can best be evaluated by using the paradigm of joint renewal processes. The present paper formulates a model of such a process, which proceeds by means of state dependent event rates. The system state is defined, at each point in time, as the vector of backward times, whereby the backward time of an event is the time passed since the last occurrence of this event. The present paper suggests a mathematical model relating event rates linearly to the backward times. The parameters can then be estimated by means of the method of moments. In a subsequent step, these event rates can be used in a Monte-Carlo simulation to forecast the numbers of occurrences of each failure in a future time interval, based on the current system state. The model is illustrated by means of an example. As forecasting system malfunctions receives increasingly more attention in light of modern condition-based maintenance policies, this approach enables decision makers to use existing event data to implement state dependent maintenance measures.
A Mathematical Theory for Clustering in Metric Spaces Clustering is one of the most fundamental problems in data analysis and it has been studied extensively in the literature. Though many clustering algorithms have been proposed, clustering theories that justify the use of these clustering algorithms are still unsatisfactory. In particular, one of the fundamental challenges is to address the following question: What is a cluster in a set of data points? In this paper, we make an attempt to address such a question by considering a set of data points associated with a distance measure (metric). We first propose a new cohesion measure in terms of the distance measure. Using the cohesion measure, we define a cluster as a set of points that are cohesive to themselves. For such a definition, we show there are various equivalent statements that have intuitive explanations. We then consider the second question: How do we find clusters and good partitions of clusters under such a definition? For such a question, we propose a hierarchical agglomerative algorithm and a partitional algorithm. Unlike standard hierarchical agglomerative algorithms, our hierarchical agglomerative algorithm has a specific stopping criterion and it stops with a partition of clusters. Our partitional algorithm, called the K-sets algorithm in the paper, appears to be a new iterative algorithm. Unlike the Lloyd iteration that needs two-step minimization, our K-sets algorithm only takes one-step minimization. One of the most interesting findings of our paper is the duality result between a distance measure and a cohesion measure. Such a duality result leads to a dual K-sets algorithm for clustering a set of data points with a cohesion measure. The dual K-sets algorithm converges in the same way as a sequential version of the classical kernel K-means algorithm. The key difference is that a cohesion measure does not need to be positive semi-definite.
A Measure of Similarity Between Graph Vertices: Applications to Synonym Extraction and Web Searching We introduce a concept of similarity between vertices of directed graphs. Let GA and GB be two directed graphs with respectively nA and nB vertices. We define a nB × nA similarity matrix S whose real entry sij expresses how similar vertex j (in GA) is to vertex i (in GB): we say that sij is their similarity score. The similarity matrix can be obtained as the limit of the normalized even iterates of S(k+1) = B S(k) A^T + B^T S(k) A, where A and B are the adjacency matrices of the graphs and S(0) is a matrix whose entries are all equal to one. In the special case where GA = GB = G, the matrix S is square and the score sij is the similarity score between the vertices i and j of G. We point out that Kleinberg’s “hub and authority” method to identify web-pages relevant to a given query can be viewed as a special case of our definition in the case where one of the graphs has two vertices and a unique directed edge between them. In analogy to Kleinberg, we show that our similarity scores are given by the components of a dominant eigenvector of a non-negative matrix. Potential applications of our similarity concept are numerous. We illustrate an application for the automatic extraction of synonyms in a monolingual dictionary.
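The iteration is straightforward to implement; a sketch with dense NumPy adjacency matrices, normalizing by the Frobenius norm and returning an even iterate as the abstract prescribes (names illustrative).

```python
import numpy as np

def vertex_similarity(A, B, n_iter=100):
    """Similarity matrix between the vertices of GA (adjacency A, nA x nA)
    and GB (adjacency B, nB x nB), via S <- B S A^T + B^T S A, normalized."""
    nA, nB = A.shape[0], B.shape[0]
    S = np.ones((nB, nA))            # S(0): all-ones matrix
    for _ in range(2 * n_iter):      # an even number of iterates
        S = B @ S @ A.T + B.T @ S @ A
        S /= np.linalg.norm(S)       # Frobenius normalization
    return S
```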
A Model Explanation System We propose a general model explanation system (MES) for “explaining” the output of black box classifiers. In this introduction we use the motivating example of a classifier trained to detect fraud in a credit card transaction history. The key aspect is that we provide explanations applicable to a single prediction, rather than provide an interpretable set of parameters. The labels in the provided examples are usually negative. Hence, we focus on explaining positive predictions (alerts). In many classification applications, but especially in fraud detection, there is an expectation of false positives. Alerts are given to a human analyst before any further action is taken. Analysts often insist on understanding “why” there was an alert, since an opaque alert makes it difficult for them to proceed. Analogous scenarios occur in computer vision , credit risk , spam detection , etc. Furthermore, the MES framework is useful for model criticism. In the world of generative models, practitioners often generate synthetic data from a trained model to get an idea of “what the model is doing”. Our MES framework augments such tools. As an added benefit, MES is applicable to completely non-probabilistic black boxes that only provide hard labels. In Section 3 we use MES to visualize the decisions of a face recognition system.
A Model of Modeling We propose a formal model of scientific modeling, geared to applications of decision theory and game theory. The model highlights the freedom that modelers have in conceptualizing social phenomena using general paradigms in these fields. It may shed some light on the distinctions between (i) refutation of a theory and a paradigm, (ii) notions of rationality, (iii) modes of application of decision models, and (iv) roles of economics as an academic discipline. Moreover, the model suggests that all four distinctions have some common features that are captured by the model.
A model of text for experimentation in the social sciences Statistical models of text have become increasingly popular in statistics and computer science as a method of exploring large document collections. Social scientists often want to move beyond exploration, to measurement and experimentation, and make inference about social and political processes that drive discourse and content. In this paper, we develop a model of text data that supports this type of substantive research. Our approach is to posit a hierarchical mixed membership model for analyzing topical content of documents, in which mixing weights are parameterized by observed covariates. In this model, topical prevalence and topical content are specified as a simple generalized linear model on an arbitrary number of document-level covariates, such as news source and time of release, enabling researchers to introduce elements of the experimental design that informed document collection into the model, within a generally applicable framework. We demonstrate the proposed methodology by analyzing a collection of news reports about China, where we allow the prevalence of topics to evolve over time and vary across newswire services. Our methods help quantify the effect of newswire source on both the frequency and nature of topic coverage. All the methods we describe are available as part of the open source R package stm.
A Natural Language Query Interface to Structured Information Accessing structured data such as that encoded in ontologies and knowledge bases can be done using either syntactically complex formal query languages like SPARQL or complicated form interfaces that require expensive customisation to each particular application domain. This paper presents the QuestIO system – a natural language interface for accessing structured information, that is domain independent and easy to use without training. It aims to bring the simplicity of Google’s search interface to conceptual retrieval by automatically converting short conceptual queries into formal ones, which can then be executed against any semantic repository. QuestIO was developed specifically to be robust with regard to language ambiguities, incomplete or syntactically ill-formed queries, by harnessing the structure of ontologies, fuzzy string matching, and ontology-motivated similarity metrics.
A Neural Bayesian Estimator for Conditional Probability Densities This article describes a robust algorithm to estimate a conditional probability density f(t|x) as a non-parametric smooth regression function. It is based on a neural network and the Bayesian interpretation of the network output as a posteriori probability. The network is trained using example events from history or simulation, which define the underlying probability density f(t, x). Once trained, the network is applied on new, unknown examples x, for which it can predict the probability distribution of the target variable t. Event-by-event knowledge of the smooth function f(t|x) can be very useful, e.g. in maximum likelihood fits or for forecasting tasks. No assumptions are necessary about the distribution, and non-Gaussian tails are accounted for automatically. Important quantities like median, mean value, left and right standard deviations, moments and expectation values of any function of t are readily derived from it. The algorithm can be considered as an event-by-event unfolding and leads to statistically optimal reconstruction. The largest benefit of the method lies in complicated problems, when the measurements x are only relatively weakly correlated to the output t. To assure optimal generalisation features and to avoid overfitting, the networks are regularised by extended versions of weight decay. The regularisation parameters are determined during the online-learning of the network by relations obtained from Bayesian statistics. Some toy Monte Carlo tests and first real application examples from high-energy physics and econometrics are discussed.
A new look at clustering through the lens of deep convolutional neural networks Classification and clustering have been studied separately in machine learning and computer vision. Inspired by the recent success of deep learning models in solving various vision problems (e.g., object recognition, semantic segmentation) and the fact that humans serve as the gold standard in assessing clustering algorithms, here, we advocate for a unified treatment of the two problems and suggest that hierarchical frameworks that progressively build complex patterns on top of simpler ones (e.g., convolutional neural networks) offer a promising solution. We do not dwell much on the learning mechanisms in these frameworks as they are still a matter of debate, with respect to biological constraints. Instead, we emphasize the compositionality of real world structures and objects. In particular, we show that CNNs, trained end-to-end using backpropagation with noisy labels, are able to cluster data points belonging to several overlapping shapes, and do so much better than the state of the art algorithms. The main takeaway lesson from our study is that mechanisms of human vision, particularly the hierarchical organization of the visual ventral stream, should be taken into account in clustering algorithms (e.g., for learning representations in an unsupervised manner or with minimum supervision) to reach human level clustering performance. This, by no means, suggests that other methods do not hold merits. For example, methods relying on pairwise affinities (e.g., spectral clustering) have been very successful in many cases but still fail in some cases (e.g., overlapping clusters).
A New View of Predictive State Methods for Dynamical System Learning Recently there has been substantial interest in predictive state methods for learning dynamical systems: these algorithms are popular since they often offer a good tradeoff between computational speed and statistical efficiency. Despite their desirable properties, though, predictive state methods can sometimes be difficult to use in practice. E.g., in contrast to the rich literature on supervised learning methods, which allows us to choose from an extensive menu of models and algorithms to suit the prior beliefs we have about properties of the function to be learned, predictive state dynamical system learning methods are comparatively inflexible: it is as if we were restricted to use only linear regression instead of being allowed to choose decision trees, nonparametric regression, or the lasso. To address this problem, we propose a new view of predictive state methods in terms of instrumental-variable regression. This view allows us to construct a wide variety of dynamical system learners simply by swapping in different supervised learning methods. We demonstrate the effectiveness of our proposed methods by experimenting with non-linear regression to learn a hidden Markov model, showing that the resulting algorithm outperforms its linear counterpart; the correctness of this algorithm follows directly from our general analysis.
A Non-Geek’s Big Data Playbook This Big Data Playbook demonstrates in six common “plays” how Apache Hadoop supports and extends the EDW ecosystem.
A novel algorithm for fast and scalable subspace clustering of high-dimensional data Rapid growth of high dimensional datasets in recent years has created an emergent need to extract the knowledge underlying them. Clustering is the process of automatically finding groups of similar data points in the space of the dimensions or attributes of a dataset. Finding clusters in the high dimensional datasets is an important and challenging data mining problem. Data group together differently under different subsets of dimensions, called subspaces. Quite often a dataset can be better understood by clustering it in its subspaces, a process called subspace clustering. But the exponential growth in the number of these subspaces with the dimensionality of data makes the whole process of subspace clustering computationally very expensive. There is a growing demand for efficient and scalable subspace clustering solutions in many Big data application domains like biology, computer vision, astronomy and social networking. Apriori based hierarchical clustering is a promising approach to find all possible higher dimensional subspace clusters from the lower dimensional clusters using a bottom-up process. However, the performance of the existing algorithms based on this approach deteriorates drastically with the increase in the number of dimensions. Most of these algorithms require multiple database scans and generate a large number of redundant subspace clusters, either implicitly or explicitly, during the clustering process. In this paper, we present SUBSCALE, a novel clustering algorithm to find non-trivial subspace clusters with minimal cost and it requires only k database scans for a k-dimensional data set. Our algorithm scales very well with the dimensionality of the dataset and is highly parallelizable. We present the details of the SUBSCALE algorithm and its evaluation in this paper.
A novel algorithmic approach to Bayesian Logic Regression Logic regression was developed more than a decade ago as a tool to construct predictors from Boolean combinations of binary covariates. It has been mainly used to model epistatic effects in genetic association studies, which is very appealing due to the intuitive interpretation of logic expressions to describe the interaction between genetic variations. Nevertheless, logic regression has remained less well known than other approaches to epistatic association mapping. Here we will adopt an advanced evolutionary algorithm called GMJMCMC (Genetically modified Mode Jumping Markov Chain Monte Carlo) to perform Bayesian model selection in the space of logic regression models. After describing the algorithmic details of GMJMCMC we perform a comprehensive simulation study that illustrates its performance given logic regression terms of various complexity. Specifically, GMJMCMC is shown to be able to identify three-way and even four-way interactions with relatively large power, a level of complexity which has not been achieved by previous implementations of logic regression. We apply GMJMCMC to reanalyze QTL mapping data for Recombinant Inbred Lines in Arabidopsis thaliana and from a backcross population in Drosophila, where we identify several interesting epistatic effects.
A novel framework to analyze road accident time series data Road accident data analysis plays an important role in identifying the key factors associated with road accidents, and these factors help in taking preventive measures. Various studies have analyzed road accident data using traditional statistical techniques and data mining techniques, all focused on identifying key factors associated with road accidents in different countries. Road accidents are uncertain and unpredictable events that can occur in any circumstances, and they do not have similar impacts in every district: the accident rate may be increasing in one district while having a lower impact in others. Hence, road-safety efforts should focus on those regions or districts where the accident trend is increasing. Time series analysis is an important area of study that can help identify increasing or decreasing trends in different districts. In this paper, we propose a framework to analyze road accident time series data that takes 39 time series from 39 districts of the Gujarat and Uttarakhand states of India. The framework segments the time series data into different clusters, and a time series merging algorithm is proposed to find a representative time series (RTS) for each cluster. The RTS is then used for trend analysis of the clusters. The results reveal that the road accident trend is increasing in certain clusters, and the corresponding districts should be the prime concern for preventive measures.
A Practical Guide to Support Vector Classification The support vector machine (SVM) is a popular classification technique. However, beginners who are not familiar with SVM often get unsatisfactory results since they miss some easy but significant steps. In this guide, we propose a simple procedure which usually gives reasonable results.
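The guide’s procedure – scale the features, start with the RBF kernel, and cross-validate C and gamma over exponential grids – maps directly onto scikit-learn; a sketch (the grid bounds follow the guide’s suggested ranges, while the dataset is an illustrative stand-in).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale features, use the RBF kernel, grid-search C and gamma by cross-validation.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": [2.0**k for k in range(-5, 16, 2)],
        "svc__gamma": [2.0**k for k in range(-15, 4, 2)]}
search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```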
A Primer on Neural Network Models for Natural Language Processing Over the past few years, neural networks have re-emerged as powerful machine-learning models, yielding state-of-the-art results in fields such as image recognition and speech processing. More recently, neural network models started to be applied also to textual natural language signals, again with very promising results. This tutorial surveys neural network models from the perspective of natural language processing research, in an attempt to bring natural-language researchers up to speed with the neural techniques. The tutorial covers input encoding for natural language tasks, feed-forward networks, convolutional networks, recurrent networks and recursive networks, as well as the computation graph abstraction for automatic gradient computation.
A Probabilistic Theory of Deep Learning A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks such as visual object and speech recognition. The key factor complicating such tasks is the presence of numerous nuisance variables, for instance, the unknown object position, orientation, and scale in object recognition or the unknown voice pronunciation, pitch, and speed in speech recognition. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks; they are constructed from many layers of alternating linear and nonlinear processing units and are trained using large-scale algorithms and massive amounts of training data. The recent success of deep learning systems is impressive – they now routinely yield pattern recognition systems with near- or super-human capabilities – but a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on a Bayesian generative probabilistic model that explicitly captures variation due to nuisance variables. The graphical structure of the model enables it to be learned from data using classical expectation-maximization techniques. Furthermore, by relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks (DCNs) and random decision forests (RDFs), providing insights into their successes and shortcomings as well as a principled route to their improvement.
A Revealing Introduction to Hidden Markov Models Suppose we want to determine the average annual temperature at a particular location on earth over a series of years. To make it interesting, suppose the years we are concerned with lie in the distant past, before thermometers were invented. Since we can’t go back in time, we instead look for indirect evidence of the temperature…
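A sketch of the forward algorithm on the kind of two-state model this introduction builds (hot/cold years observed only through small/medium/large tree rings); the probability values below are illustrative placeholders, not measurements.

```python
import numpy as np

# Hidden states: hot, cold. Observations: small, medium, large tree rings.
A = np.array([[0.7, 0.3],          # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],     # observation probabilities per state
              [0.7, 0.2, 0.1]])
pi = np.array([0.6, 0.4])          # initial state distribution

def forward(obs):
    """P(observation sequence | model) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
    return alpha.sum()

print(forward([0, 1, 0, 2]))       # small, medium, small, large
```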
A review and comparative study on functional time series techniques This paper reviews the main estimation and prediction results derived in the context of functional time series when Hilbert and Banach spaces are considered, especially in the context of autoregressive processes of order one (ARH(1) and ARB(1) processes, for H and B being a Hilbert and a Banach space, respectively). In particular, we pay attention to the estimation and prediction results, and statistical tests, derived in both parametric and non-parametric frameworks. A comparative study of different ARH(1) prediction approaches is carried out in the simulation study undertaken.
A Review of 40 Years of Cognitive Architecture Research: Focus on Perception, Attention, Learning and Applications In this paper we present a broad overview of the last 40 years of research on cognitive architectures. Although the number of existing architectures is nearing several hundred, most of the existing surveys do not reflect this growth and focus on a handful of well-established architectures. While their contributions are undeniable, they represent only a part of the research in the field. Thus, in this survey we wanted to shift the focus towards a more inclusive and high-level overview of the research in cognitive architectures. Our final set of 86 architectures includes 55 that are still actively developed, and borrow from a diverse set of disciplines, spanning areas from psychoanalysis to neuroscience. To keep the length of this paper within reasonable limits we discuss only the core cognitive abilities, such as perception, attention mechanisms, learning and memory structure. To assess the breadth of practical applications of cognitive architectures we gathered information on over 700 practical projects implemented using the cognitive architectures in our list. We use various visualization techniques to highlight overall trends in the development of the field. Our analysis of practical applications shows that most architectures are very narrowly focused on a particular application domain. Furthermore, there is an apparent gap between general research in robotics and computer vision and research in these areas within the cognitive architectures field. It is very clear that biologically inspired models do not have the same range and efficiency as systems based on engineering principles and heuristics. Another observation is related to a general lack of collaboration. Several factors hinder communication, such as the closed nature of the individual projects (only one-third of the architectures reviewed here are open-source) and terminological differences.
A Review of Data Fusion Techniques In general, all tasks that demand any type of parameter estimation from multiple sources can benefit from the use of data/information fusion methods. The terms information fusion and data fusion are typically employed as synonyms; but in some scenarios, the term data fusion is used for raw data (obtained directly from the sensors) and the term information fusion is employed to define already processed data. In this sense, the term information fusion implies a higher semantic level than data fusion. Other terms associated with data fusion that typically appear in the literature include decision fusion, data combination, data aggregation, multisensor data fusion, and sensor fusion. Researchers in this field agree that the most accepted definition of data fusion was provided by the Joint Directors of Laboratories (JDL) workshop: ‘A multi-level process dealing with the association, correlation, combination of data and information from single and multiple sources to achieve refined position, identify estimates and complete and timely assessments of situations, threats and their significance.’ Hall and Llinas provided the following well-known definition of data fusion: ‘data fusion techniques combine data from multiple sensors and related information from associated databases to achieve improved accuracy and more specific inferences than could be achieved by the use of a single sensor alone.’ Briefly, we can define data fusion as a combination of multiple sources to obtain improved information; in this context, improved information means less expensive, higher quality, or more relevant information. Data fusion techniques have been extensively employed in multisensor environments with the aim of fusing and aggregating data from different sensors; however, these techniques can also be applied to other domains, such as text processing. The goal of using data fusion in multisensor environments is to obtain a lower detection error probability and a higher reliability by using data from multiple distributed sources. The available data fusion techniques can be classified into three nonexclusive categories: (i) data association, (ii) state estimation, and (iii) decision fusion. Because of the large number of published papers on data fusion, this paper does not aim to provide an exhaustive review of all of the studies; instead, the objective is to highlight the main steps that are involved in the data fusion framework and to review the most common techniques for each step. The remainder of this paper continues as follows. The next section provides various classification categories for data fusion techniques. Then, Section 3 describes the most common methods for data association tasks. Section 4 provides a review of techniques under the state estimation category. Next, the most common techniques for decision fusion are enumerated in Section 5. Finally, the conclusions obtained from reviewing the different methods are highlighted in Section 6.
A review of data mining using big data in health informatics The amount of data produced within Health Informatics has grown to be quite vast, and analysis of this Big Data grants potentially limitless possibilities for knowledge to be gained. In addition, this information can improve the quality of healthcare offered to patients. However, there are a number of issues that arise when dealing with these vast quantities of data, especially how to analyze this data in a reliable manner. The basic goal of Health Informatics is to take in real world medical data from all levels of human existence to help advance our understanding of medicine and medical practice. This paper will present recent research using Big Data tools and approaches for the analysis of Health Informatics data gathered at multiple levels, including the molecular, tissue, patient, and population levels. In addition to gathering data at multiple levels, multiple levels of questions are addressed: human-scale biology, clinical-scale, and epidemic-scale. We will also analyze and examine possible future work for each of these areas, as well as how combining data from each level may provide the most promising approach to gain the most knowledge in Health Informatics.
A Review of Features for the Discrimination of Twitter Users: Application to the Prediction of Offline Influence Many works related to Twitter aim at characterizing its users in some way: role on the service (spammers, bots, organizations, etc.), nature of the user (socio-professional category, age, etc.), topics of interest, and others. However, for a given user classification problem, it is very difficult to select a set of appropriate features, because the many features described in the literature are very heterogeneous, with name overlaps and collisions, and numerous very close variants. In this article, we review a wide range of such features. In order to present a clear state-of-the-art description, we unify their names, definitions and relationships, and we propose a new, neutral typology. We then illustrate the interest of our review by applying a selection of these features to the offline influence detection problem. This task consists in identifying users who are influential in real life, based on their Twitter account and related data. We show that most features deemed efficient for predicting online influence, such as the numbers of retweets and followers, are not relevant to this problem. We therefore propose several content-based approaches to label Twitter users as influencers or not, and we also rank them according to a predicted influence level. Our proposals are evaluated over the CLEF RepLab 2014 dataset, and outperform state-of-the-art methods.
A review of instance selection methods In supervised learning, a training set providing previously known information is used to classify new instances. Commonly, many instances are stored in the training set, but some of them are not useful for classification; it is therefore possible to obtain acceptable classification rates while ignoring these non-useful cases. This process is known as instance selection. Through instance selection the training set is reduced, which reduces runtimes in the classification and/or training stages of classifiers. This work presents a survey of the main instance selection methods reported in the literature.
A Review of Relational Machine Learning for Knowledge Graphs Relational machine learning studies methods for the statistical analysis of relational, or graph-structured, data. In this paper, we provide a review of how such statistical models can be ‘trained’ on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph). In particular, we discuss two different kinds of statistical relational models, both of which can scale to massive datasets. The first is based on tensor factorization methods and related latent variable models. The second is based on mining observable patterns in the graph. We also show how to combine these latent and observable models to get improved modeling power at decreased computational cost. Finally, we discuss how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web. In particular, we discuss Google’s Knowledge Vault project.
A Review of Self-Exciting Spatio-Temporal Point Processes and Their Applications Self-exciting spatio-temporal point process models predict the rate of events as a function of space, time, and the previous history of events. These models naturally capture triggering and clustering behavior, and have been widely used in fields where spatio-temporal clustering of events is observed, such as earthquake modeling, infectious disease, and crime. In the past several decades, advances have been made in estimation, inference, simulation, and diagnostic tools for self-exciting point process models. In this review, I describe the basic theory, survey related estimation and inference techniques from each field, highlight several key applications, and suggest directions for future research.
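To make the central object of that review concrete: a self-exciting spatio-temporal point process is usually specified through its conditional intensity, and a common generic form (a sketch in textbook notation, not that of any particular surveyed paper) is

    \[
    \lambda(s, t \mid \mathcal{H}_t) \;=\; \mu(s) \;+\; \sum_{i \,:\, t_i < t} g\bigl(s - s_i,\; t - t_i\bigr),
    \]

where \mu(s) is the background rate, the sum runs over the history of past events (s_i, t_i), and the triggering kernel g captures how each past event temporarily raises the rate of new events nearby, producing the clustering behavior described above.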
A Review on Algorithms for Constraint-based Causal Discovery Causal discovery studies the problem of mining causal relationships between variables from data, which is of primary interest in science. During the past decades, significant progress has been made toward this fundamental data mining paradigm. In recent years, with the availability of abundant, large, and complex observational data, constraint-based approaches have gradually attracted a lot of interest and have been widely applied to many diverse real-world problems, owing to their fast running speed and the ease with which they generalize to the problem of causal insufficiency. In this paper, we aim to review the constraint-based causal discovery algorithms. Firstly, we discuss the learning paradigm of the constraint-based approaches. Secondly and primarily, the state-of-the-art constraint-based causal inference algorithms are surveyed and analyzed in detail. Thirdly, several related open-source software packages and benchmark data repositories are briefly summarized. Finally, some open problems in constraint-based causal discovery are outlined for future research.
A Review on Deep Learning Techniques Applied to Semantic Segmentation Semantic image segmentation is attracting increasing interest from computer vision and machine learning researchers. Many applications on the rise need accurate and efficient segmentation mechanisms: autonomous driving, indoor navigation, and even virtual or augmented reality systems, to name a few. This demand coincides with the rise of deep learning approaches in almost every field or application target related to computer vision, including semantic segmentation or scene understanding. This paper provides a review of deep learning methods for semantic segmentation applied to various application areas. Firstly, we describe the terminology of this field as well as mandatory background concepts. Next, the main datasets and challenges are described to help researchers decide which ones best suit their needs and targets. Then, existing methods are reviewed, highlighting their contributions and their significance in the field. We then give quantitative results for the described methods on the datasets in which they were evaluated, followed by a discussion of those results. Finally, we point out a set of promising future works and draw our own conclusions about the state of the art of semantic segmentation using deep learning techniques.
A review on statistical inference methods for discrete Markov random fields Developing satisfactory methodology for the analysis of Markov random fields is a very challenging task. Indeed, due to the Markovian dependence structure, the normalizing constant of the fields cannot be computed using standard analytical or numerical methods. This is a central issue for any statistical approach, as the likelihood is an integral part of the procedure. Furthermore, such unobserved fields cannot be integrated out, and likelihood evaluation becomes a doubly intractable problem. This report gives an overview of some of the methods used in the literature to analyse such observed or unobserved random fields.
A Security Framework for Wireless Sensor Networks: Theory and Practice Wireless sensor networks are often deployed in public or otherwise untrusted and even hostile environments, which prompts a number of security issues. Although security is a necessity in other types of networks, it is much more so in sensor networks due to their resource constraints, susceptibility to physical capture, and wireless nature. In this work we emphasize two security issues: (1) secure communication infrastructure and (2) secure node scheduling algorithms. Due to resource constraints, specific strategies are often necessary to preserve the network’s lifetime and its quality of service. For instance, to reduce communication costs, nodes can go to sleep mode periodically (node scheduling). These strategies must be proven secure, but the protocols used to guarantee this security must be compatible with the resource preservation requirement. To achieve this goal, we define secure communications in such networks, together with the notion of secure scheduling. Finally, some of these security properties are evaluated in concrete case studies.
A Short Course on Network Analysis These are lecture notes prepared for a short (6 hours) course given at the conference Methodological Advances in Statistics Related to Big Data, held in Castro Urdiales, Spain, June 8-12, 2015. The course focuses on the analysis of networks without labels at the nodes. It covers descriptive statistics for graphs, random graph models, and graph partitioning, including recent advances in spectral and semidefinite methods.
A Short Introduction to Boosting Boosting is a general method for improving the accuracy of any given learning algorithm. This short overview paper introduces the boosting algorithm AdaBoost, and explains the underlying theory of boosting, including an explanation of why boosting often does not suffer from overfitting as well as boosting’s relationship to support-vector machines. Some examples of recent applications of boosting are also described.
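Since the overview centers on AdaBoost, a minimal sketch of the discrete algorithm may help fix ideas; the axis-aligned threshold stumps and data shapes below are our own illustrative assumptions, not code from the paper.

    import numpy as np

    def adaboost(X, y, n_rounds=20):
        """Discrete AdaBoost with axis-aligned decision stumps.
        X: (n, d) features; y: (n,) labels in {-1, +1}."""
        n, d = X.shape
        w = np.full(n, 1.0 / n)           # example weights
        stumps = []                       # (feature, threshold, sign, alpha)
        for _ in range(n_rounds):
            best = None
            # Exhaustively pick the stump with the lowest weighted error.
            for j in range(d):
                for thr in np.unique(X[:, j]):
                    for sign in (1, -1):
                        pred = sign * np.where(X[:, j] >= thr, 1, -1)
                        err = w[pred != y].sum()
                        if best is None or err < best[0]:
                            best = (err, j, thr, sign)
            err, j, thr, sign = best
            err = max(err, 1e-12)                  # avoid division by zero
            alpha = 0.5 * np.log((1 - err) / err)  # weak learner's vote
            pred = sign * np.where(X[:, j] >= thr, 1, -1)
            w *= np.exp(-alpha * y * pred)         # upweight the mistakes
            w /= w.sum()
            stumps.append((j, thr, sign, alpha))
        return stumps

    def predict(stumps, X):
        score = sum(a * s * np.where(X[:, j] >= t, 1, -1)
                    for j, t, s, a in stumps)
        return np.sign(score)

Each round reweights the training examples so that the next stump concentrates on the current ensemble's mistakes, which is the mechanism behind the margin-based resistance to overfitting discussed in the paper.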
A Simple Introduction to Support Vector Machines Martin Law, Lecture for CSE 802, Department of Computer Science and Engineering, Michigan State University
A simple neural network module for relational reasoning Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn. In this paper we describe how to use Relation Networks (RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning. We tested RN-augmented networks on three tasks: visual question answering using a challenging dataset called CLEVR, on which we achieve state-of-the-art, super-human performance; text-based question answering using the bAbI suite of tasks; and complex reasoning about dynamic physical systems. Then, using a curated dataset called Sort-of-CLEVR we show that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with RNs. Our work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.
A Statistical Learning Model of Text Classification for Support Vector Machines
A summary on Maximum likelihood Estimator A general method of building a predictive model begins with least squares estimation. One then works on the residuals: finding confidence intervals for the parameters and testing how well the model fits the data, all of which rests on the assumption that the residuals (or noise) are normally distributed. Unfortunately, that assumption is not guaranteed; most of the time, the residual plot will look like some distribution other than the normal. At that point one could add another factor term to the model so as to filter out the non-normally distributed noise and calculate the LSE again, but the same problem may reappear. If instead the distribution of the residuals can be recognized (or the pdf of the noise is otherwise known), one can simply calculate the MLE of the model’s parameters, and the work is done.
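A small illustration of the summary’s closing point (our own example, not from the summary itself): when the noise is Laplace rather than Gaussian, least squares estimates a location parameter with the sample mean, while the MLE under the correct likelihood is the sample median.

    import numpy as np

    rng = np.random.default_rng(0)
    theta_true = 5.0
    # Heavy-tailed (Laplace) noise violates the normality assumption.
    y = theta_true + rng.laplace(scale=2.0, size=1000)

    lse = y.mean()      # least squares estimate (also the MLE under Gaussian noise)
    mle = np.median(y)  # MLE of the location parameter under Laplace noise

    print(f"LSE (mean):   {lse:.3f}")
    print(f"MLE (median): {mle:.3f}")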
A Survey and Evaluation of Data Center Network Topologies Data centers are becoming increasingly popular for their flexibility and processing capabilities in the modern computing environment. They are managed by a single entity (administrator) and allow dynamic resource provisioning, performance optimization as well as efficient utilization of available resources. Each data center consists of massive compute, network and storage resources connected with physical wires. The large scale nature of data centers requires careful planning of compute, storage, network nodes, interconnection as well as inter-communication for their effective and efficient operations. In this paper, we present a comprehensive survey and taxonomy of network topologies either used in commercial data centers, or proposed by researchers working in this space. We also compare and evaluate some of those topologies using mininet as well as gem5 simulator for different traffic patterns, based on various metrics including throughput, latency and bisection bandwidth.
A Survey of Algorithms for Keyword Search on Graph Data In this chapter, we survey methods that perform keyword search on graph data. Keyword search provides a simple but user-friendly interface to retrieve information from complicated data structures. Since many real-life datasets are represented by trees and graphs, keyword search has become an attractive mechanism for data of a variety of types. In this survey, we discuss methods of keyword search on schema graphs, which are abstract representations of XML data and relational data, and methods of keyword search on schema-free graphs. In our discussion, we focus on three major challenges of keyword search on graphs. First, what is the semantics of keyword search on graphs, or, what qualifies as an answer to a keyword search; second, what constitutes a good answer, or, how to rank the answers; third, how to perform keyword search efficiently. We also discuss some unresolved challenges and propose some new research directions on this topic.
A survey of Bayesian predictive methods for model assessment, selection and comparison To date, several methods exist in the statistical literature for model assessment, which purport themselves specifically as Bayesian predictive methods. The decision theoretic assumptions on which these methods are based are not always clearly stated in the original articles, however. The aim of this survey is to provide a unified review of Bayesian predictive model assessment and selection methods, and of methods closely related to them. We review the various assumptions that are made in this context and discuss the connections between different approaches, with an emphasis on how each method approximates the expected utility of using a Bayesian model for the purpose of predicting future data.
A Survey of Binary Similarity and Distance Measures The binary feature vector is one of the most common representations of patterns, and similarity and distance measures play a critical role in many problems such as clustering and classification. Ever since Jaccard proposed a similarity measure to classify ecological species in 1901, numerous binary similarity and distance measures have been proposed in various fields. Applying appropriate measures results in more accurate data analysis. Notwithstanding, few comprehensive surveys on binary measures have been conducted. Hence we collected 76 binary similarity and distance measures used over the last century and reveal their correlations through the hierarchical clustering technique.
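Almost all such measures can be written in terms of the four counts of the 2x2 contingency table between two binary vectors; a small sketch (our own illustration) with the classic Jaccard and simple-matching coefficients:

    import numpy as np

    def contingency(x, y):
        # a = both 1, b = only x, c = only y, d = both 0
        x, y = np.asarray(x, bool), np.asarray(y, bool)
        return (np.sum(x & y), np.sum(x & ~y),
                np.sum(~x & y), np.sum(~x & ~y))

    def jaccard(x, y):
        a, b, c, _ = contingency(x, y)
        return a / (a + b + c)            # ignores negative matches

    def simple_matching(x, y):
        a, b, c, d = contingency(x, y)
        return (a + d) / (a + b + c + d)  # counts negative matches too

    x, y = [1, 1, 0, 0, 1], [1, 0, 0, 1, 1]
    print(jaccard(x, y), simple_matching(x, y))  # 0.5 0.6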
A survey of Community Question Answering With the advent of numerous community forums, tasks associated with them have gained importance in the recent past. With the influx of new questions every day on these forums, the issues of identifying methods to find answers to said questions, or even trying to detect duplicate questions, are of practical importance and are challenging in their own right. This paper surveys some of the aforementioned issues and the methods proposed for tackling them.
A Survey of Cross-Lingual Embedding Models Cross-lingual embedding models allow us to project words from different languages into a shared embedding space. This allows us to apply models trained on languages with a lot of data, e.g. English, to low-resource languages. In the following, we will survey models that seek to learn cross-lingual embeddings. We will discuss them based on the type of approach and the nature of the parallel data that they employ. Finally, we will present challenges and summarize how to evaluate cross-lingual embedding models.
A Survey of Deep Learning Methods for Relation Extraction Relation Extraction is an important sub-task of Information Extraction which has the potential of employing deep learning (DL) models with the creation of large datasets using distant supervision. In this review, we compare the contributions and pitfalls of the various DL models that have been used for the task, to help guide the path ahead.
A survey of dimensionality reduction techniques based on random projection Dimensionality reduction techniques play important roles in the analysis of big data. Traditional dimensionality reduction approaches, such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), have been studied extensively in the past few decades. However, as data dimensionality increases, the computational cost of traditional dimensionality reduction approaches grows dramatically and becomes prohibitive. This has triggered the development of the Random Projection (RP) technique, which maps high-dimensional data onto a low-dimensional subspace in a short time. However, RP generates its transformation matrix without considering the intrinsic structure of the original data and usually leads to relatively high distortion. Therefore, in the past few years, several RP-based approaches have been proposed to address this problem. In this paper, we summarize these approaches across different applications to help practitioners employ the proper approach in their specific application. We also enumerate their benefits and limitations to provide further references for researchers to develop novel RP-based approaches.
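A minimal sketch of the plain RP technique these approaches build on (our own illustration): project points onto a random Gaussian subspace and check that pairwise distances are roughly preserved, as the Johnson-Lindenstrauss lemma predicts.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 100, 10000, 400

    X = rng.normal(size=(n, d))               # high-dimensional data
    R = rng.normal(size=(d, k)) / np.sqrt(k)  # Gaussian random projection matrix
    Y = X @ R                                 # low-dimensional embedding

    # Pairwise distances before and after projection stay close.
    for i, j in [(0, 1), (2, 3), (4, 5)]:
        ratio = np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
        print(f"pair ({i},{j}): distance ratio = {ratio:.3f}")  # near 1.0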
A Survey of Inductive Biases for Factorial Representation-Learning With the resurgence of interest in neural networks, representation learning has re-emerged as a central focus in artificial intelligence. Representation learning refers to the discovery of useful encodings of data that make domain-relevant information explicit. Factorial representations identify underlying independent causal factors of variation in data. A factorial representation is compact and faithful, makes the causal factors explicit, and facilitates human interpretation of data. Factorial representations support a variety of applications, including the generation of novel examples, indexing and search, novelty detection, and transfer learning. This article surveys various constraints that encourage a learning algorithm to discover factorial representations. I dichotomize the constraints in terms of unsupervised and supervised inductive bias. Unsupervised inductive biases exploit assumptions about the environment, such as the statistical distribution of factor coefficients, assumptions about the perturbations a factor should be invariant to (e.g. a representation of an object can be invariant to rotation, translation or scaling), and assumptions about how factors are combined to synthesize an observation. Supervised inductive biases are constraints on the representations based on additional information connected to observations. Supervisory labels come in a variety of types, which vary in how strongly they constrain the representation, how many factors are labeled, how many observations are labeled, and whether or not we know the associations between the constraints and the factors they are related to. This survey brings together a wide variety of models that all touch on the problem of learning factorial representations and lays out a framework for comparing these models based on the strengths of the underlying supervised and unsupervised inductive biases.
A Survey of Methods for Collective Communication Optimization and Tuning New developments in HPC technology, in terms of increasing computing power on multi/many-core processors, high-bandwidth memory/IO subsystems, and communication interconnects, have a direct impact on software and runtime system development. These advancements have become useful in producing high-performance collective communication interfaces that integrate efficiently on a wide variety of platforms and environments. However, the number of optimization options that appear with each new technology or software framework has resulted in a combinatorial explosion in the feature space for tuning collective parameters, such that finding the optimal set has become a nearly impossible task. The applicability of the algorithmic choices available for optimizing collective communication depends largely on the scalability requirements of a particular use case. The problem is further exacerbated by any requirement to run collective operations at very large scales, such as in exascale computing, at which impractical brute-force tuning may require many months of resources. Therefore, the application of statistical, data mining, and artificial intelligence methods, or more general hybrid learning models, seems essential to many collective parameter optimization problems. We explore the current and cutting-edge collective communication optimization and tuning methods and conclude with possible future directions for this problem.
A Survey of Monte Carlo Tree Search Methods Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm’s derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
A Survey of Neural Network Techniques for Feature Extraction from Text This paper aims to catalyze the discussions about text feature extraction techniques using neural network architectures. The research questions discussed in the paper focus on the state-of-the-art neural network techniques that have proven to be useful tools for language processing, language generation, text classification and other computational linguistics tasks.
A Survey of Online Failure Prediction Methods With the ever-growing complexity and dynamicity of computer systems, proactive fault management is an effective approach to enhancing availability. Online failure prediction is the key to such techniques. In contrast to classical reliability methods, online failure prediction is based on runtime monitoring and a variety of models and methods that use the current state of a system and, frequently, past experience as well. This survey describes these methods. To capture the wide spectrum of approaches in this area, a taxonomy has been developed; the different approaches are explained within it and the major concepts are described in detail.
A Survey of Point-of-interest Recommendation in Location-based Social Networks Point-of-interest (POI) recommendation, which suggests new places for users to visit, arises with the popularity of location-based social networks (LBSNs). Due to the importance of POI recommendation in LBSNs, it has attracted much academic and industrial interest. In this paper, we offer a systematic review of this field, summarizing the contributions of individual efforts and exploring their relations. We discuss the new properties and challenges of POI recommendation compared with traditional recommendation problems, e.g., movie recommendation. Then, we present a comprehensive review in three aspects: influential factors for POI recommendation, methodologies employed for POI recommendation, and the different tasks in POI recommendation. Specifically, we propose three taxonomies to classify POI recommendation systems. First, we categorize the systems by the influential factors of check-in characteristics, including geographical information, social relationships, temporal influence, and content indications. Second, we categorize the systems by methodology, including systems modeled by fused methods and by joint methods. Third, we categorize the systems as general POI recommendation or successive POI recommendation, according to a subtle difference in the recommendation task: whether it is biased toward recent check-ins. For each category, we summarize the contributions and system features, and highlight the representative work. Moreover, we discuss the available data sets and the popular metrics. Finally, we point out possible future directions in this area and conclude this survey.
A Survey of Shortest-Path Algorithms A shortest-path algorithm finds a path containing the minimal cost between two vertices in a graph. A plethora of shortest-path algorithms is studied in the literature that span across multiple disciplines. This paper presents a survey of shortest-path algorithms based on a taxonomy that is introduced in the paper. One dimension of this taxonomy is the various flavors of the shortest-path problem. There is no one general algorithm that is capable of solving all variants of the shortest-path problem due to the space and time complexities associated with each algorithm. Other important dimensions of the taxonomy include whether the shortest-path algorithm operates over a static or a dynamic graph, whether the shortest-path algorithm produces exact or approximate answers, and whether the objective of the shortest-path algorithm is to achieve time-dependence or is to only be goal directed. This survey studies and classifies shortest-path algorithms according to the proposed taxonomy. The survey also presents the challenges and proposed solutions associated with each category in the taxonomy.
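As a reference point for the exact, static, single-pair corner of that taxonomy, here is a minimal Dijkstra sketch (our own illustration) over a graph given as an adjacency dictionary with nonnegative edge costs:

    import heapq

    def dijkstra(graph, source, target):
        """graph: {u: [(v, cost), ...]}; returns minimal cost or None."""
        dist = {source: 0}
        pq = [(0, source)]
        while pq:
            d, u = heapq.heappop(pq)
            if u == target:
                return d
            if d > dist.get(u, float("inf")):
                continue                      # stale queue entry
            for v, cost in graph.get(u, []):
                nd = d + cost
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(pq, (nd, v))
        return None

    g = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
    print(dijkstra(g, "a", "c"))  # 3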
A Survey of Tensor Methods Matrix decompositions have always been at the heart of signal, circuit and system theory. In particular, the Singular Value Decomposition (SVD) has been an important tool. There is currently a shift of paradigm in the algebraic foundations of these fields. Quite recently, Nonnegative Matrix Factorization (NMF) has been shown to outperform SVD at a number of tasks. Increasing research efforts are spent on the study and application of decompositions of higher-order tensors or multi-way arrays. This paper is a partial survey on tensor generalizations of the SVD and their applications. We also touch on Nonnegative Tensor Factorizations.
A survey of transfer learning Machine learning and data mining techniques have been used in numerous real-world applications. An assumption of traditional machine learning methodologies is that the training data and testing data are taken from the same domain, such that the input feature space and data distribution characteristics are the same. However, in some real-world machine learning scenarios, this assumption does not hold. There are cases where training data is expensive or difficult to collect. Therefore, there is a need to create high-performance learners trained with more easily obtained data from different domains. This methodology is referred to as transfer learning. This survey paper formally defines transfer learning, presents information on current solutions, and reviews applications of transfer learning. Lastly, it lists software downloads for various transfer learning solutions and discusses possible future research work. The transfer learning solutions surveyed are independent of data size and can be applied to big data environments.
A Survey of Visual Analysis of Human Motion and Its Applications This paper summarizes the recent progress in human motion analysis and its applications. We first review motion capture systems and representation models for human motion data. Next, we sketch advanced human motion data processing technologies, including motion data filtering, temporal alignment, and segmentation. The following parts give an overview of state-of-the-art approaches to action recognition and dynamics measuring, since these two are the most active research areas in human motion analysis. The last part discusses some emerging applications of human motion analysis in healthcare, human-robot interaction, security surveillance, virtual reality, and animation, and also summarizes promising future research topics in human motion analysis.
A Survey on Artificial Intelligence and Data Mining for MOOCs Massive Open Online Courses (MOOCs) have gained tremendous popularity in the last few years. Thanks to MOOCs, millions of learners from all over the world have taken thousands of high-quality courses for free. Putting together an excellent MOOC ecosystem is a multidisciplinary endeavour that requires contributions from many different fields. Artificial intelligence (AI) and data mining (DM) are two such fields that have played a significant role in making MOOCs what they are today. By exploiting the vast amount of data generated by learners engaging in MOOCs, DM improves our understanding of the MOOC ecosystem and enables MOOC practitioners to deliver better courses. Similarly, AI, supported by DM, can greatly improve student experience and learning outcomes. In this survey paper, we first review the state-of-the-art artificial intelligence and data mining research applied to MOOCs, emphasising the use of AI and DM tools and techniques to improve student engagement, learning outcomes, and our understanding of the MOOC ecosystem. We then offer an overview of key trends and important research to carry out in the fields of AI and DM so that MOOCs can reach their full potential.
A Survey on Contextual Multi-armed Bandits The nature of contextual bandits makes them suitable for many machine learning applications, such as user modeling, Internet advertising, search engines, and experiment optimization. In this survey we cover three different types of contextual bandit algorithms, and for each type we introduce several representative algorithms. We also compare the regrets and assumptions of these algorithms.
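One representative algorithm from the linear-payoff family such surveys cover is LinUCB, which keeps per-arm ridge-regression state and pulls the arm with the highest upper-confidence score; the sketch below is our own, with made-up dimensions and a toy reward signal.

    import numpy as np

    class LinUCBArm:
        """Per-arm ridge-regression state for LinUCB."""
        def __init__(self, d, alpha=1.0):
            self.A = np.eye(d)       # accumulates x x^T (plus identity)
            self.b = np.zeros(d)     # accumulates reward-weighted contexts
            self.alpha = alpha

        def ucb(self, x):
            A_inv = np.linalg.inv(self.A)
            theta = A_inv @ self.b
            # Point estimate plus an exploration bonus.
            return x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)

        def update(self, x, reward):
            self.A += np.outer(x, x)
            self.b += reward * x

    d, n_arms = 5, 3
    arms = [LinUCBArm(d) for _ in range(n_arms)]
    rng = np.random.default_rng(0)
    for t in range(100):
        x = rng.normal(size=d)                    # observed context
        chosen = max(range(n_arms), key=lambda a: arms[a].ucb(x))
        reward = float(x[chosen] > 0)             # toy reward signal
        arms[chosen].update(x, reward)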
A Survey on Domain-Specific Languages for Machine Learning in Big Data The amount of data generated in the modern society is increasing rapidly. New problems and novel approaches of data capture, storage, analysis and visualization are responsible for the emergence of the Big Data research field. Machine Learning algorithms can be used in Big Data to make better and more accurate inferences. However, because of the challenges Big Data imposes, these algorithms need to be adapted and optimized to specific applications. One important decision made by software engineers is the choice of the language that is used in the implementation of these algorithms. Therefore, this literature survey identifies and describes domain-specific languages and frameworks used for Machine Learning in Big Data. By doing this, software engineers can then make more informed choices and beginners have an overview of the main languages used in this domain.
A Survey on Geographically Distributed Big-Data Processing using MapReduce Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems are extensively used by many industries, e.g., Google, Facebook, and Amazon, for solving a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and social network analysis. However, all these popular systems have a major drawback in terms of locally distributed computations, which prevents them from implementing geographically distributed data processing. The increasing amount of geographically distributed massive data is pushing industry and academia to rethink the current big-data processing systems. Novel frameworks, going beyond the state-of-the-art architectures and technologies involved in current systems, are expected to process geographically distributed data at their locations without moving entire raw datasets to a single location. In this paper, we investigate and discuss challenges and requirements in designing geographically distributed data processing frameworks and protocols. We classify and study batch processing (MapReduce-based systems), stream processing (Spark-based systems), and SQL-style processing geo-distributed frameworks, models, and algorithms with their overhead issues.
A Survey on Load Balancing Algorithms for VM Placement in Cloud Computing The emergence of cloud computing based on virtualization technologies brings huge opportunities to host virtual resources at low cost without the need of owning any infrastructure. Virtualization technologies enable users to acquire and configure resources and be charged on a pay-per-use basis. However, cloud data centers mostly comprise heterogeneous commodity servers hosting multiple virtual machines (VMs) with potentially varying specifications and fluctuating resource usage, which may cause imbalanced resource utilization within servers and lead to performance degradation and violations of service level agreements (SLAs). To achieve efficient scheduling, these challenges should be addressed by load balancing strategies, and load balancing has been proven to be an NP-hard problem. From multiple perspectives, this work identifies the challenges and analyzes existing algorithms for allocating VMs to PMs in infrastructure clouds, with a particular focus on load balancing. A detailed classification targeting load balancing algorithms for VM placement in cloud data centers is investigated, and the surveyed algorithms are classified accordingly. The goal of this paper is to provide a comprehensive and comparative understanding of the existing literature and to aid researchers by providing insight into potential future enhancements.
A Survey on Monochromatic Connections of Graphs The concept of monochromatic connection of graphs was introduced by Caro and Yuster in 2011. Recently, a lot of results have been published about it. In this survey, we attempt to bring together all the results that dealt with it. We begin with an introduction, and then classify the results into the following categories: monochromatic connection coloring of edge-version, monochromatic connection coloring of vertex-version, monochromatic index, monochromatic connection coloring of total-version.
A Survey on Social Media Anomaly Detection Social media anomaly detection is of critical importance for preventing malicious activities such as bullying, terrorist attack planning, and fraud information dissemination. With the recent popularity of social media, new types of anomalous behaviors arise, causing concerns from various parties. While a large amount of work has been dedicated to traditional anomaly detection problems, we observe a surge of research interest in the new realm of social media anomaly detection. In this paper, we present a survey of existing approaches to this problem. We focus on the new types of anomalous phenomena in social media and review the recently developed techniques for detecting these special types of anomalies. We provide a general overview of the problem domain, common formulations, existing methodologies, and potential directions. With this work, we hope to draw the attention of the research community to this challenging problem and to open up new directions to which we can contribute in the future.
A Survey: Time Travel in Deep Learning Space: An Introduction to Deep Learning Models and How Deep Learning Models Evolved from the Initial Ideas This report traces how deep learning has evolved. It goes back as far as the initial belief in connectionist modelling of the brain, and then looks at its early-stage realization: neural networks. With that background, we gradually introduce how convolutional neural networks, as a representative of deep discriminative models, developed from neural networks, together with many practical techniques that help in the optimization of neural networks. On the other hand, we also trace the evolution of deep generative models, to see how researchers balanced representational power and computational complexity to reach the Restricted Boltzmann Machine and eventually Deep Belief Nets. Further, we look into the history of modelling time series data with neural networks: we start with Time Delay Neural Networks and move on to the currently famous Recurrent Neural Network and its extension, Long Short-Term Memory. We also briefly cover how to construct deep recurrent neural networks. Finally, we conclude this report with some interesting open-ended questions about deep neural networks.
A Theory of Output-Side Unsupervised Domain Adaptation When learning a mapping from an input space to an output space, the assumption that the sample distribution of the training data is the same as that of the test data is often violated. Unsupervised domain shift methods adapt the learned function in order to correct for this shift. Previous work has focused on utilizing unlabeled samples from the target distribution. We consider the complementary problem in which the unlabeled samples are given post mapping, i.e., we are given the outputs of the mapping of unknown samples from the shifted domain. Two other variants are also studied: the two-sided version, in which unlabeled samples are given from both the input and the output spaces, and the Domain Transfer problem, which was recently formalized. In all cases, we derive generalization bounds that employ discrepancy terms.
A Tour through the Visualization Zoo A survey of powerful visualization techniques, from the obvious to the obscure
A Tutorial for Reinforcement Learning The tutorial is written for those who would like an introduction to reinforcement learning (RL). The aim is to provide an intuitive presentation of the ideas rather than concentrate on the deeper mathematics underlying the topic. RL is generally used to solve the so-called Markov decision problem (MDP). In other words, the problem that you are attempting to solve with RL should be an MDP or its variant. The theory of RL relies on dynamic programming (DP) and artificial intelligence (AI). We will begin with a quick description of MDPs. We will discuss what we mean by “complex” and “large-scale” MDPs. Then we will explain why RL is needed to solve complex and large-scale MDPs. The semi-Markov decision problem (SMDP) will also be covered.
A tutorial on active learning (Slide Deck)
A Tutorial on Bayesian Belief Networks This tutorial provides an overview of Bayesian belief networks. The subject is introduced through a discussion on probabilistic models that covers probability language, dependency models, graphical representations of models, and belief networks as a particular representation of probabilistic models. The general class of causal belief networks is presented, and the concept of d-separation and its relationship with independence in probabilistic models is introduced. This leads to a description of Bayesian belief networks as a specific class of causal belief networks, with detailed discussion on belief propagation and practical network design. The target recognition problem is presented as an example of the application of Bayesian belief networks to a real problem, and the tutorial concludes with a brief summary of Bayesian belief networks.
A Tutorial on Bridge Sampling The marginal likelihood plays an important role in many areas of Bayesian statistics such as parameter estimation, model comparison, and model averaging. In most applications, however, the marginal likelihood is not analytically tractable and must be approximated using numerical methods. Here we provide a tutorial on bridge sampling (Bennett, 1976; Meng & Wong, 1996), a reliable and relatively straightforward sampling method that allows researchers to obtain the marginal likelihood for models of varying complexity. First, we introduce bridge sampling and three related sampling methods using the beta-binomial model as a running example. We then apply bridge sampling to estimate the marginal likelihood for the Expectancy Valence (EV) model—a popular model for reinforcement learning. Our results indicate that bridge sampling provides accurate estimates for both a single participant and a hierarchical version of the EV model. We conclude that bridge sampling is an attractive method for mathematical psychologists who typically aim to approximate the marginal likelihood for a limited set of possibly high-dimensional models.
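For orientation, the beta-binomial running example mentioned above has an analytically tractable marginal likelihood, which makes it a convenient check for any sampling-based estimator; the sketch below (our own, not the tutorial's code) compares the closed form against the naive Monte Carlo estimator that bridge sampling improves upon.

    import numpy as np
    from scipy.special import betaln, gammaln

    def log_marginal_beta_binomial(y, n, a, b):
        """log p(y) = log C(n,y) + log B(y+a, n-y+b) - log B(a,b)."""
        log_comb = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
        return log_comb + betaln(y + a, n - y + b) - betaln(a, b)

    def naive_mc_log_marginal(y, n, a, b, draws=100_000, seed=0):
        """Average the binomial likelihood over draws from the prior."""
        rng = np.random.default_rng(seed)
        theta = rng.beta(a, b, size=draws)
        log_comb = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
        lik = np.exp(log_comb) * theta**y * (1 - theta)**(n - y)
        return np.log(lik.mean())

    y, n, a, b = 7, 10, 2.0, 2.0
    print(log_marginal_beta_binomial(y, n, a, b))  # exact
    print(naive_mc_log_marginal(y, n, a, b))       # close to the exact value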
A Tutorial on Deep Learning Part 1: Nonlinear Classifiers and The Backpropagation Algorithm In the past few years, Deep Learning has generated much excitement in Machine Learning and industry thanks to many breakthrough results in speech recognition, computer vision and text processing. So, what is Deep Learning? For many researchers, Deep Learning is another name for a set of algorithms that use a neural network as an architecture. Even though neural networks have a long history, they became more successful in recent years due to the availability of inexpensive, parallel hardware (GPUs, computer clusters) and massive amounts of data. In this tutorial, we will start with the concept of a linear classifier and use that to develop the concept of neural networks. I will present two key algorithms in learning with neural networks: the stochastic gradient descent algorithm and the backpropagation algorithm. Towards the end of the tutorial, I will explain some simple tricks and recent advances that improve neural networks and their training. For that, let’s start with a simple example.
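The tutorial's starting point, a linear classifier trained with stochastic gradient descent, can be sketched in a few lines; the toy data and learning rate below are our own assumptions, not the tutorial's example.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy linearly separable data: the label is the sign of a fixed direction.
    X = rng.normal(size=(200, 2))
    y = (X @ np.array([1.5, -2.0]) > 0).astype(float)

    w, b, lr = np.zeros(2), 0.0, 0.1
    for epoch in range(50):
        for i in rng.permutation(len(X)):   # stochastic: one example at a time
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))  # sigmoid output
            g = p - y[i]                    # gradient of the log loss wrt the logit
            w -= lr * g * X[i]
            b -= lr * g

    acc = np.mean(((X @ w + b) > 0) == (y == 1))
    print(f"training accuracy: {acc:.3f}")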
A Tutorial on Deep Learning Part 2: Autoencoders, Convolutional Neural Networks and Recurrent Neural Networks In the previous tutorial, I discussed the use of deep networks to classify nonlinear data. In addition to their ability to handle nonlinear data, deep networks also have a special strength in their flexibility, which sets them apart from other traditional machine learning models: we can modify them in many ways to suit our tasks. In the following, I will discuss the three most common modifications:
• Unsupervised learning and data compression via autoencoders, which require modifications in the loss function (a minimal autoencoder sketch follows after this list),
• Translational invariance via convolutional neural networks, which require modifications in the network architecture,
• Variable-sized sequence prediction via recurrent neural networks, which require modifications in the network architecture.
The flexibility of neural networks is a very powerful property. In many cases, these changes lead to great improvements in accuracy compared to basic models that we discussed in the previous tutorial. In the last part of the tutorial, I will also explain how to parallelize the training of neural networks. This is also an important topic because parallelizing neural networks has played an important role in the current deep learning movement.
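To make the first of these modifications concrete, here is a minimal autoencoder sketch (our own, not code from the tutorial): the loss compares the network's output to its own input, so no labels are needed.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))      # toy data: 8-dimensional inputs
    d, h, lr = 8, 3, 0.01              # compress 8 dimensions down to 3

    W1 = rng.normal(scale=0.1, size=(d, h))  # encoder weights
    W2 = rng.normal(scale=0.1, size=(h, d))  # decoder weights

    for epoch in range(200):
        Z = np.tanh(X @ W1)            # encode
        X_hat = Z @ W2                 # decode (linear output layer)
        err = X_hat - X                # reconstruction error drives the loss
        # Backpropagate the mean squared reconstruction error.
        gW2 = Z.T @ err / len(X)
        gZ = err @ W2.T * (1 - Z**2)   # tanh derivative
        gW1 = X.T @ gZ / len(X)
        W2 -= lr * gW2
        W1 -= lr * gW1

    print("reconstruction MSE:", np.mean((np.tanh(X @ W1) @ W2 - X) ** 2))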
A Tutorial on Fisher Information In many statistical applications that concern mathematical psychologists, the concept of Fisher information plays an important role. In this tutorial we clarify the concept of Fisher information as it manifests itself across three different statistical paradigms. First, in the frequentist paradigm, Fisher information is used to construct hypothesis tests and confidence intervals using maximum likelihood estimators; second, in the Bayesian paradigm, Fisher information is used to define a default prior; finally, in the minimum description length paradigm, Fisher information is used to measure model complexity.
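For reference, the quantity the tutorial clarifies is, for a model f(x; θ) with a scalar parameter (stated here in generic textbook notation),

    \[
    I(\theta) \;=\; \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\right]
    \;=\; -\,\mathbb{E}\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(X;\theta)\right],
    \]

where the second equality holds under the usual regularity conditions. For a Bernoulli(θ) observation, for instance, I(θ) = 1/(θ(1−θ)), which is why the Jeffreys default prior for θ is proportional to θ^{−1/2}(1−θ)^{−1/2}.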
A tutorial on geometric programming A geometric program (GP) is a type of mathematical optimization problem characterized by objective and constraint functions that have a special form. Recently developed solution methods can solve even large-scale GPs extremely efficiently and reliably; at the same time a number of practical problems, particularly in circuit design, have been found to be equivalent to (or well approximated by) GPs. Putting these two together, we get effective solutions for the practical problems. The basic approach in GP modeling is to attempt to express a practical problem, such as an engineering analysis or design problem, in GP format. In the best case, this formulation is exact; when this is not possible, we settle for an approximate formulation. This tutorial paper collects together in one place the basic background material needed to do GP modeling. We start with the basic definitions and facts, and some methods used to transform problems into GP format. We show how to recognize functions and problems compatible with GP, and how to approximate functions or data in a form compatible with GP (when this is possible). We give some simple and representative examples, and also describe some common extensions of GP, along with methods for solving (or approximately solving) them.
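For readers new to the special form the tutorial refers to, a GP in standard form (textbook notation) is

    \[
    \begin{array}{ll}
    \text{minimize}   & f_0(x) \\
    \text{subject to} & f_i(x) \le 1, \quad i = 1,\dots,m, \\
                      & g_j(x) = 1,   \quad j = 1,\dots,p,
    \end{array}
    \]

where the variables x are positive, each f_i is a posynomial, i.e. a sum of terms \(c_k x_1^{a_{1k}} \cdots x_n^{a_{nk}}\) with c_k > 0, and each g_j is a monomial (a single such term). The log-change of variables y = log x turns this into a convex optimization problem, which is what makes the large-scale solution methods mentioned above possible.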
A Tutorial on Spectral Clustering In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
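A minimal sketch (our own) of the unnormalized recipe the tutorial derives: build a similarity graph, form its Laplacian, embed the points using the eigenvectors for the smallest eigenvalues, then run k-means in that embedding.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def spectral_clustering(X, k, sigma=1.0):
        # Fully connected similarity graph with a Gaussian kernel.
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        W = np.exp(-sq / (2 * sigma**2))
        np.fill_diagonal(W, 0.0)
        L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
        _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
        U = vecs[:, :k]                       # spectral embedding of the points
        _, labels = kmeans2(U, k, seed=0)
        return labels

    # Two well-separated blobs should come out as two clusters.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
    print(spectral_clustering(X, 2))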
A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the ‘echo state network’ approach (Slide Deck)
A weakly informative default prior distribution for logistic and other regression models We propose a new prior distribution for classical (nonhierarchical) logistic regression models, constructed by first scaling all nonbinary variables to have mean 0 and standard deviation 0.5, and then placing independent Student-t prior distributions on the coefficients. As a default choice, we recommend the Cauchy distribution with center 0 and scale 2.5, which in the simplest setting is a longer-tailed version of the distribution attained by assuming one-half additional success and one-half additional failure in a logistic regression. Cross-validation on a corpus of datasets shows the Cauchy class of prior distributions to outperform existing implementations of Gaussian and Laplace priors. We recommend this prior distribution as a default choice for routine applied use. It has the advantage of always giving answers, even when there is complete separation in logistic regression (a common problem, even when the sample size is large and the number of predictors is small), and also automatically applying more shrinkage to higher-order interactions. This can be useful in routine data analysis as well as in automated procedures such as chained equations for missing-data imputation. We implement a procedure to fit generalized linear models in R with the Student-t prior distribution by incorporating an approximate EM algorithm into the usual iteratively weighted least squares. We illustrate with several applications, including a series of logistic regressions predicting voting preferences, a small bioassay experiment, and an imputation model for a public health data set.
Ad Click Prediction: a View from the Trenches Predicting ad click-through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry. We present a selection of case studies and topics drawn from recent experiments in the setting of a deployed CTR prediction system. These include improvements in the context of traditional supervised learning based on an FTRL-Proximal online learning algorithm (which has excellent sparsity and convergence properties) and the use of per-coordinate learning rates. We also explore some of the challenges that arise in a real-world system that may appear at first to be outside the domain of traditional machine learning research. These include useful tricks for memory savings, methods for assessing and visualizing performance, practical methods for providing confidence estimates for predicted probabilities, calibration methods, and methods for automated management of features. Finally, we also detail several directions that did not turn out to be beneficial for us, despite promising results elsewhere in the literature. The goal of this paper is to highlight the close relationship between theoretical advances and practical engineering in this industrial setting, and to show the depth of challenges that appear when applying traditional machine learning methods in a complex dynamic system.
ADADELTA: An Adaptive Learning Rate Method We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
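The per-dimension rule the abstract describes keeps decaying averages of squared gradients and squared updates; the sketch below follows the published update (ρ and ε are the usual decay and conditioning constants), applied to a toy quadratic of our own choosing.

    import numpy as np

    def adadelta_step(grad, state, rho=0.95, eps=1e-6):
        """One ADADELTA update; state = (E[g^2], E[dx^2]) per dimension."""
        Eg2, Edx2 = state
        Eg2 = rho * Eg2 + (1 - rho) * grad**2
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
        Edx2 = rho * Edx2 + (1 - rho) * dx**2
        return dx, (Eg2, Edx2)

    # Minimize f(x) = sum(x^2); the gradient is 2x. No learning rate needed.
    x = np.array([3.0, -2.0])
    state = (np.zeros_like(x), np.zeros_like(x))
    for _ in range(500):
        dx, state = adadelta_step(2 * x, state)
        x += dx
    print(x)  # approaches [0, 0]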
Addressing the “Big Data” Issue: What You Need to Know These days, you’re probably hearing a lot of hype about “big data.” Vendors are currently hawking a wealth of new tools, all of which promise to help your organization unlock previously inaccessible insights from your proprietary information. According to the authors, there is no doubt that big data, i.e., organization-wide data that’s being managed in a centralized repository, can yield valuable discoveries that will result in improved products and performance – if properly analyzed. Nonetheless, you must look before you leap. First, is your company culture ready for such a move? How will data managers be affected when scores of discrete data silos are gathered and reviewed as a whole? How will you involve leadership and others in ongoing decision-making processes? How will you choose your architecture and tools from the dizzying array of options that are currently available? How will you stay up-to-date in this rapidly evolving field? Finally, how will you train your company’s users so that they can actually leverage the new capabilities? This ExecBlueprint explores these and other key concerns.
Advanced Analytics with the SAP HANA Database MapReduce as a programming paradigm provides a simple-to-use yet very powerful abstraction encapsulated in two second-order functions: Map and Reduce. As such, they allow defining single sequentially processed tasks while at the same time hiding many of the framework details about how those tasks are parallelized and scaled out. In this paper we discuss four processing patterns in the context of the distributed SAP HANA database that go beyond the classic MapReduce paradigm. We illustrate them using some typical Machine Learning algorithms and present experimental results that demonstrate how the data flows scale out with the number of parallel tasks.
Advances in Artificial Intelligence Require Progress Across all of Computer Science Advances in Artificial Intelligence require progress across all of computer science.
Agent-based computing from multi-agent systems to agent-based Models: a visual survey Agent-Based Computing is a diverse research domain concerned with the building of intelligent software based on the concept of ‘agents’. In this paper, we use Scientometric analysis to analyze all sub-domains of agent-based computing. Our data consists of 1,064 journal articles indexed in the ISI web of knowledge published during a twenty year period: 1990-2010. These were retrieved using a topic search with various keywords commonly used in sub-domains of agent-based computing. In our proposed approach, we have employed a combination of two applications for analysis, namely Network Workbench and CiteSpace: Network Workbench allowed for the analysis of complex network aspects of the domain, while detailed visualization-based analysis of the bibliographic data was performed using CiteSpace. Our results include the identification of the largest cluster based on keywords, the timeline of publication of index terms, the core journals and key subject categories. We also identify the core authors, top countries of origin of the manuscripts along with core research institutes. Finally, our results have interestingly revealed the strong presence of agent-based computing in a number of non-computing related scientific domains including Life Sciences, Ecological Sciences and Social Sciences.
Agent-Based Modeling and Simulation Agent-based modeling and simulation (ABMS) is a new approach to modeling systems comprised of autonomous, interacting agents. Computational advances have made possible a growing number of agent-based models across a variety of application domains. Applications range from modeling agent behavior in the stock market, supply chains, and consumer markets, to predicting the spread of epidemics, mitigating the threat of bio-warfare, and understanding the factors that may be responsible for the fall of ancient civilizations. Such progress suggests the potential of ABMS to have far-reaching effects on the way that businesses use computers to support decision-making and researchers use agent-based models as electronic laboratories. Some contend that ABMS “is a third way of doing science” and could augment traditional deductive and inductive reasoning as discovery methods. This brief tutorial introduces agent-based modeling by describing the foundations of ABMS, discussing some illustrative applications, and addressing toolkits and methods for developing agent-based models.
Agile business intelligence: reshaping the landscape The last few years have brought a wave of changes for business intelligence (BI) solutions. A set of redefining technological trends is reshaping the landscape from a slow and cumbersome process practiced mainly by large enterprises to a much more flexible, agile process that mid-market companies as well as individuals can utilize. This report explores the key features that influence the evolution of agile BI and takes a look at the BI landscape under this light. At first glance, polarization seems to exist between traditional BI vendors, who are focused on extract, transform, and load (ETL) and reporting, and the newcomers, who are focused on data exploration and visualization, but a closer look reveals that, in fact, they converge as adoption of useful features is taking place across the spectrum.
Algorithm quasi-optimal (AQ) learning The algorithm quasi-optimal (AQ) is a powerful machine learning methodology aimed at learning symbolic decision rules from a set of examples and counterexamples. It was first proposed in the late 1960s to solve the Boolean function satisfiability problem and further refined over the following decade to solve the general covering problem. In its newest implementations, it is a powerful yet little-explored methodology for symbolic machine learning classification. It has been applied to solve several problems from different domains, including the generation of individuals within an evolutionary computation framework. The current article introduces the main concepts of the AQ methodology and describes AQ for source detection (AQ4SD), a tailored implementation of the AQ methodology to solve the problem of finding the sources of atmospheric releases using distributed sensor measurements. The AQ4SD program is tested to find the sources of all the releases of the prairie grass field experiment.
Algorithms and Methods in Recommender Systems Today there is a wide variety of approaches and algorithms for data filtering and recommendation. In this paper we describe the traditional approaches and explain which modern approaches have been developed more recently. Throughout the paper we explain the approaches and their problems using movie recommendations as a running example. In the end we show the main challenges that recommender systems face.
Algorithms for Active Learning This dissertation explores both the algorithmic and statistical aspects of active learning for binary classification. What are effective procedures for determining which data to label? How can these procedures take advantage of the interactive learning process, and in what circumstances do they yield improved learning performance compared to standard passive learners? To answer these questions, we develop and rigorously analyze a broad class of general active learning methods that address the essential algorithmic and statistical difficulties of the problem.
Algorithms for Reinforcement Learning Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner’s predictions. Further, the predictions may have long-term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms’ merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas together with a large number of state-of-the-art algorithms, followed by the discussion of their theoretical properties and limitations.
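As a concrete anchor for the dynamic-programming view described above, the Bellman optimality equation and the tabular Q-learning update, both standard results rather than this book's specific notation, are:

$$
Q^{*}(s,a) = \mathbb{E}\bigl[\, r(s,a) + \gamma \max_{a'} Q^{*}(s',a') \,\bigr],
\qquad
Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\bigr].
$$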
Algorithms in Data Mining using Matrix and Tensor Methods In many fields of science, engineering, and economics large amounts of data are stored and there is a need to analyze these data in order to extract information for various purposes. Data mining is a general concept involving different tools for performing this kind of analysis. The development of mathematical models and efficient algorithms is of key importance. In this thesis we discuss algorithms for the reduced rank regression problem and algorithms for the computation of the best multilinear rank approximation of tensors.
Amazon.com Recommendations: Item-to-Item Collaborative Filtering Recommendation algorithms are best known for their use on e-commerce Web sites, where they use input about a customer’s interests to generate a list of recommended items. Many applications use only the items that customers purchase and explicitly rate to represent their interests, but they can also use other attributes, including items viewed, demographic data, subject interests, and favorite artists. At Amazon.com, we use recommendation algorithms to personalize the online store for each customer. The store radically changes based on customer interests, showing programming titles to a software engineer and baby toys to a new mother. The click-through and conversion rates – two important measures of Web-based and email advertising effectiveness – vastly exceed those of untargeted content such as banner advertisements and top-seller lists….
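A minimal sketch of the item-to-item idea described above, on a toy purchase matrix (all data hypothetical; the production system adds offline similarity precomputation and other scaling machinery not shown here):

```python
import numpy as np

# Toy user-item matrix: rows are users, columns are items (hypothetical data).
R = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 1, 0, 1]], dtype=float)

# Item-to-item cosine similarity over the users who interacted with each item.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)  # an item should not recommend itself

def recommend(user, k=2):
    """Score unseen items by their similarity to the user's items."""
    scores = sim @ R[user]
    scores[R[user] > 0] = -np.inf  # mask items already purchased
    return np.argsort(scores)[::-1][:k]

print(recommend(user=0))
```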
An Analysis of Machine Learning Intelligence Deep neural networks (DNNs) have set state-of-the-art results in many machine learning and NLP tasks. However, we do not have a strong understanding of what DNN models learn. In this paper, we examine learning in DNNs through analysis of their outputs. We compare DNN performance directly to a human population, and use characteristics of individual data points such as difficulty to see how well models perform on easy and hard examples. We investigate how training size and the incorporation of noise affect a DNN’s ability to generalize and learn. Our experiments show that unlike traditional machine learning models (e.g., Naive Bayes, Decision Trees), DNNs exhibit human-like learning properties. As they are trained with more data, they are more able to distinguish between easy and difficult items, and performance on easy items improves at a higher rate than on difficult items. We find that different DNN models exhibit different strengths in learning and are robust to noise in training data.
An Analysis of Visual Question Answering Algorithms In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are evaluated on them. As a result, evaluation scores are inflated and predominantly determined by answering easier questions, making it difficult to compare different methods. In this paper, we analyze existing VQA algorithms using a new dataset. It contains over 1.6 million questions organized into 12 different categories. We also introduce questions that are meaningless for a given image to force a VQA system to reason about image content. We propose new evaluation schemes that compensate for over-represented question-types and make it easier to study the strengths and weaknesses of algorithms. We analyze the performance of both baseline and state-of-the-art VQA models, including multi-modal compact bilinear pooling (MCB), neural module networks, and recurrent answering units. Our experiments establish how attention helps certain categories more than others, determine which models work better than others, and explain how simple models (e.g. MLP) can surpass more complex models (MCB) by simply learning to answer large, easy question categories.
An Economist’s Guide to Visualizing Data Once upon a time, a picture was worth a thousand words. But with online news, blogs, and social media, a good picture can now be worth so much more. Economists who want to disseminate their research, both inside and outside the seminar room, should invest some time in thinking about how to construct compelling and effective graphics. An effective graph should tap into the brain’s “pre-attentive visual processing” (Few 2004; Healey and Enns 2012). Because our eyes detect a limited set of visual characteristics, such as shape or contrast, we easily combine those characteristics and unconsciously perceive them as an image. In contrast to “attentive processing” – the conscious part of perception that allows us to perceive things serially – pre-attentive processing is done in parallel and is much faster. Pre-attentive processing allows the reader to perceive multiple basic visual elements simultaneously….
An Example Inference Task: Clustering Human brains are good at finding regularities in data. One way of expressing regularity is to put a set of objects into groups that are similar to each other. For example, biologists have found that most objects in the natural world fall into one of two categories: things that are brown and run away, and things that are green and don’t run away. The first group they call animals, and the second, plants. We’ll call this operation of grouping things together clustering. If the biologist further sub-divides the cluster of plants into sub-clusters, we would call this `hierarchical clustering’; but we won’t be talking about hierarchical clustering yet. In this chapter we’ll just discuss ways to take a set of N objects and group them into K clusters.
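A minimal K-means sketch of the task described above, grouping N points into K clusters; the toy data and iteration count are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, K, iters=20):
    """Plain K-means: assign points to the nearest mean, then update the means."""
    means = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):  # keep the old mean if a cluster empties
                means[k] = X[labels == k].mean(axis=0)
    return labels, means

# Two well-separated toy blobs in the plane.
X = rng.normal(size=(100, 2)) + rng.choice([-3.0, 3.0], size=(100, 1))
labels, means = kmeans(X, K=2)
print(means)
```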
An Impossibility Theorem for Clustering Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and pro- foundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median.
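For reference, the three properties in the theorem, paraphrased here in their standard informal form:

```latex
\begin{itemize}
  \item \textbf{Scale-invariance:} $f(\alpha d) = f(d)$ for every distance function $d$
        and every $\alpha > 0$.
  \item \textbf{Richness:} every partition of the point set $S$ arises as $f(d)$ for
        some distance function $d$.
  \item \textbf{Consistency:} if $d'$ shrinks distances within the clusters of $f(d)$
        and stretches distances between them, then $f(d') = f(d)$.
\end{itemize}
% The theorem: for $|S| \ge 2$, no clustering function $f$ satisfies all three.
```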
An Introduction to Advanced Analytics Advanced Analytics is “the analysis of all kinds of data using sophisticated quantitative methods (for example, statistics, descriptive and predictive data mining, simulation and optimization) to produce insights that traditional approaches to business intelligence (BI) – such as query and reporting – are unlikely to discover.”
An Introduction to Bayesian Networks: Concepts and Learning from Data (Slide Deck)
An Introduction to Cluster Analysis for Data Mining Cluster analysis divides data into meaningful or useful groups (clusters). If meaningful clusters are the goal, then the resulting clusters should capture the “natural” structure of the data. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, and to provide a grouping of spatial locations prone to earthquakes. However, in other cases, cluster analysis is only a useful starting point for other purposes, e.g., data compression or efficiently finding the nearest neighbors of points. Whether for understanding or utility, cluster analysis has long been used in a wide variety of fields: psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning, and data mining. The scope of this paper is modest: to provide an introduction to cluster analysis in the field of data mining, where we define data mining to be the discovery of useful, but non-obvious, information or patterns in large collections of data. Much of this paper is necessarily consumed with providing a general background for cluster analysis, but we also discuss a number of clustering techniques that have recently been developed specifically for data mining. While the paper strives to be self-contained from a conceptual point of view, many details have been omitted. Consequently, many references to relevant books and papers are provided.
An Introduction To Compressive Sampling This article surveys the theory of compressive sampling, also known as compressed sensing or CS, a novel sensing/sampling paradigm that goes against the common wisdom in data acquisition. CS theory asserts that one can recover certain signals and images from far fewer samples or measurements than traditional methods use. To make this possible, CS relies on two principles: sparsity, which pertains to the signals of interest, and incoherence, which pertains to the sensing modality.
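The recovery problem the article surveys is typically posed as the convex program below; the measurement bound shown is the standard shape of the result, with constants varying by formulation:

$$
\min_{x \in \mathbb{R}^n} \|x\|_1 \quad \text{subject to} \quad y = \Phi x,
$$

which succeeds with high probability for an $s$-sparse signal once the number of measurements satisfies $m \gtrsim s \log(n/s)$ for a sufficiently incoherent $\Phi$.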
An Introduction to Factor Graphs A large variety of algorithms in coding, signal processing, and artificial intelligence may be viewed as instances of the summary-product algorithm (or belief/probability propagation algorithm), which operates by message passing in a graphical model. Specific instances of such algorithms include Kalman filtering and smoothing; the forward–backward algorithm for hidden Markov models; probability propagation in Bayesian networks; and decoding algorithms for error-correcting codes such as the Viterbi algorithm, the BCJR algorithm, and the iterative decoding of turbo codes, low-density parity-check (LDPC) codes, and similar codes. New algorithms for complex detection and estimation problems can also be derived as instances of the summary-product algorithm. In this article, we give an introduction to this unified perspective in terms of (Forney-style) factor graphs.
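The two summary-product message updates on a factor graph, in standard notation: $n(\cdot)$ denotes the neighbors of a node, and the sum $\sum_{\sim\{x\}}$ runs over all arguments of the factor $f$ except $x$:

$$
\mu_{x \to f}(x) = \prod_{g \in n(x)\setminus\{f\}} \mu_{g \to x}(x),
\qquad
\mu_{f \to x}(x) = \sum_{\sim\{x\}} f(X) \prod_{y \in n(f)\setminus\{x\}} \mu_{y \to f}(y).
$$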
An introduction to graphical models The following quotation from the Preface provides a very concise introduction to graphical models: Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering – uncertainty and complexity – and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity – a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph-theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms. Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism – examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages – in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems.
An introduction to graphical models (Slide Deck)
An introduction to high-dimensional statistics In this note, we aim to give a very brief introduction to high-dimensional statistics. Rather than attempting to give an overview of this vast area, we will explain what is meant by high-dimensional data and then focus on two methods which have been introduced to deal with this sort of data. Many of the state-of-the-art techniques used in high-dimensional statistics today are based on these two core methods. We begin with a quick recap of least squares regression.
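For concreteness, the least-squares recap mentioned above, together with the lasso as one example of the regularized estimators such notes build toward (which two methods the note actually covers is not stated in this abstract, so the lasso here is an assumption):

$$
\hat\beta^{\mathrm{OLS}} = (X^{\top}X)^{-1} X^{\top} y,
\qquad
\hat\beta^{\mathrm{lasso}} = \arg\min_{\beta}\; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1.
$$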
An Introduction to Latent Semantic Analysis The question of knowledge induction, i.e. how children are able to learn so much about, say, what words mean without any explicit instruction, is one that has vexed philosophers, linguists, and psychologists alike. Indeed, inferring the vast amount of knowledge that children learn almost effortlessly from an apparently ‘impoverished stimulus’ seems paradoxical. The Latent Semantic Analysis model (Landauer & Dumais, 1997) is a theory for how meaning representations might be learned from encountering large samples of language without explicit directions as to how it is structured. To do this, LSA makes two assumptions about how the meaning of linguistic expressions is present in the distributional patterns of simple expressions (e.g. words) within more complex expressions (e.g. sentences and paragraphs) viewed across many samples of language….
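A minimal sketch of the LSA mechanics, a truncated SVD of a term-document matrix; the toy counts are hypothetical, and real applications typically weight entries (e.g., tf-idf) first:

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                              # number of latent "semantic" dimensions
term_vecs = U[:, :k] * s[:k]       # term representations in the latent space
doc_vecs = Vt[:k, :].T * s[:k]     # document representations

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Term-term similarity is now measured in the reduced space.
print(cos(term_vecs[0], term_vecs[1]))
```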
An Introduction to Latent Variable Mixture Modeling (Part 1): Overview and Cross-Sectional Latent Class and Latent Profile Analyses Objective: Pediatric psychologists are often interested in finding patterns in heterogeneous cross-sectional data. Latent variable mixture modeling is an emerging person-centered statistical approach that models heterogeneity by classifying individuals into unobserved groupings (latent classes) with similar (more homogenous) patterns. The purpose of this article is to offer a nontechnical introduction to cross-sectional mixture modeling.
Method: An overview of latent variable mixture modeling is provided and 2 cross-sectional examples are reviewed and distinguished.
Results: Step-by-step pediatric psychology examples of latent class and latent profile analyses are provided using the Early Childhood Longitudinal Study–Kindergarten Class of 1998–1999 data file.
Conclusions: Latent variable mixture modeling is a technique that is useful to pediatric psychologists who wish to find groupings of individuals who share similar data patterns to determine the extent to which these patterns may relate to variables of interest.
An Introduction to Latent Variable Mixture Modeling (Part 2): Longitudinal Latent Class Growth Analysis and Growth Mixture Models Objective: Pediatric psychologists are often interested in finding patterns in heterogeneous longitudinal data. Latent Variable Mixture Modeling is an emerging statistical approach that models such heterogeneity by classifying individuals into unobserved groupings (latent classes) with similar (more homogenous) patterns. The purpose of this second article in a two-article set is to offer a nontechnical introduction to longitudinal latent variable mixture modeling.
Methods: Three latent variable approaches to modeling longitudinal data are reviewed and distinguished.
Results: Step-by-step pediatric psychology examples of latent growth curve modeling, latent class growth analysis, and growth mixture modeling are provided using the Early Childhood Longitudinal Study-Kindergarten Class of 1998–99 data file.
Conclusions: Latent variable mixture modeling is a technique that is useful to pediatric psychologists who wish to find groupings of individuals who share similar longitudinal data patterns to determine the extent to which these patterns may relate to variables of interest.
An introduction to modern missing data analyses A great deal of recent methodological research has focused on two modern missing data analysis methods: maximum likelihood and multiple imputation. These approaches improve on traditional techniques (e.g. deletion and mean imputation) because they require less stringent assumptions and mitigate the pitfalls of those techniques. This article explains the theoretical underpinnings of missing data analyses, gives an overview of traditional missing data techniques, and provides accessible descriptions of maximum likelihood and multiple imputation. In particular, this article focuses on maximum likelihood estimation and presents two analysis examples from the Longitudinal Study of American Youth data. One of these examples includes a description of the use of auxiliary variables. Finally, the paper illustrates ways that researchers can use intentional, or planned, missing data to enhance their research designs.
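On the multiple-imputation side, the m completed-data estimates $\hat{Q}_j$ are combined with Rubin's rules, a standard result rather than anything specific to this article's examples:

$$
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j,
\qquad
T = \bar{U} + \Bigl(1 + \frac{1}{m}\Bigr) B,
\qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j - \bar{Q}\bigr)^2,
$$

where $\bar{U}$ is the average within-imputation variance and $T$ the total variance attached to $\bar{Q}$.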
An Introduction to Multivariate Statistics The term “multivariate statistics” is appropriately used to include all statistics where there are more than two variables simultaneously analyzed. You are already familiar with bivariate statistics such as the Pearson product moment correlation coefficient and the independent groups t-test. A one-way ANOVA with 3 or more treatment groups might also be considered a bivariate design, since there are two variables: one independent variable and one dependent variable. Statistically, one could consider the one-way ANOVA as either a bivariate curvilinear regression or as a multiple regression with the K level categorical independent variable dummy coded into K-1 dichotomous variables.
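A small numerical illustration of the final point, on simulated data: a three-group one-way design fit as a multiple regression with K − 1 dummy variables recovers the group means exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical one-way design: K = 3 groups, 10 observations each.
y = np.concatenate([rng.normal(mu, 1.0, size=10) for mu in (0.0, 0.5, 1.5)])
group = np.repeat([0, 1, 2], 10)

# Dummy-code the 3-level factor into K - 1 = 2 indicators (group 0 = reference).
X = np.column_stack([np.ones_like(y),
                     (group == 1).astype(float),
                     (group == 2).astype(float)])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# Intercept = reference-group mean; slopes = differences from the reference.
print(beta)
print([y[group == g].mean() for g in (0, 1, 2)])
```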
An Introduction to Neural Networks An accurate forecast into the future can offer tremendous value in areas as diverse as financial market price movements, financial expense budget forecasts, website clickthrough likelihoods, insurance risk, and drug compound efficacy, to name just a few. Many algorithm techniques, ranging from regression analysis to ARIMA for time series, among others, are regularly used to generate forecasts. A neural network approach provides a forecasting technique that can operate in circumstances where classical techniques cannot perform or do not generate the desired accuracy in a forecast.
An Introduction to Ontology Learning Ever since the early days of Artificial Intelligence and the development of the first knowledge-based systems in the 1970s, people have dreamt of self-learning machines. When knowledge-based systems grew larger and the commercial interest in these technologies increased, people became aware of the knowledge acquisition bottleneck and the necessity to (partly) automate the creation and maintenance of knowledge bases. Today, many applications which exhibit ’intelligent’ behavior thanks to symbolic knowledge representation and logical inference rely on ontologies and the standards provided by the World Wide Web Consortium (W3C). Supporting the construction of ontologies and populating them with instantiations of both concepts and relations is commonly referred to as ontology learning. Early research in ontology learning has concentrated on the extraction of facts or schema-level knowledge from textual resources, building upon earlier work in the field of computational linguistics and lexical acquisition. However, as we will show in this book, ontology learning is a very diverse and interdisciplinary field of research. Ontology learning approaches are as heterogeneous as the sources of data on the web, and as different from one another as the types of knowledge representations called “ontologies”. In the remainder of this introduction, we briefly summarize the state of the art in ontology learning and elaborate on what we consider the key challenges for current and future ontology learning research.
An introduction to ROC analysis Receiver operating characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.
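A minimal sketch of how an ROC curve and its area (AUC) are computed from classifier scores, assuming no tied scores; the data here are toy values:

```python
import numpy as np

def roc_curve(scores, labels):
    """Sweep the decision threshold from high to low; return (FPR, TPR) points."""
    order = np.argsort(-scores)
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()                      # true positive rate
    fpr = np.cumsum(1 - labels) / (len(labels) - labels.sum())  # false positive rate
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3])
labels = np.array([1, 1, 0, 1, 0, 0, 0])
fpr, tpr = roc_curve(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal area
print(auc)
```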
An Introduction to Variable and Feature Selection Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
An Introduction to Visualizing Data The purpose of this document is to provide an introduction to the theory behind visualizing data. After studying the works of many talented people I decided to summarize the key points of information into this single paper. If you found this document interesting please take some time to look at the list of resources that I used (see Chapter 8) because I could never have created this without the excellent work done by others.
An Overview of Multi-Processor Approximate Message Passing Approximate message passing (AMP) is an algorithmic framework for solving linear inverse problems from noisy measurements, with exciting applications such as reconstructing images, audio, hyperspectral images, and various other signals, including those acquired in compressive signal acquisition systems. The growing prevalence of big data systems has increased interest in large-scale problems, which may involve huge measurement matrices that are unsuitable for conventional computing systems. To address the challenge of large-scale processing, multi-processor (MP) versions of AMP have been developed. We provide an overview of two such MP-AMP variants. In row-MP-AMP, each computing node stores a subset of the rows of the matrix and processes corresponding measurements. In column-MP-AMP, each node stores a subset of columns, and is solely responsible for reconstructing a portion of the signal. We will discuss pros and cons of both approaches, summarize recent research results for each, and explain when each one may be a viable approach. Aspects that are highlighted include some recent results on state evolution for both MP-AMP algorithms, and the use of data compression to reduce communication in the MP network.
An Overview of Multi-Task Learning in Deep Neural Networks Multi-task learning (MTL) has led to successes in many applications of machine learning, from natural language processing and speech recognition to computer vision and drug discovery. This article aims to give a general overview of MTL, particularly in deep neural networks. It introduces the two most common methods for MTL in Deep Learning, gives an overview of the literature, and discusses recent advances. In particular, it seeks to help ML practitioners apply MTL by shedding light on how MTL works and providing guidelines for choosing appropriate auxiliary tasks.
An Overview of Spatial Econometrics This paper offers an expository overview of the field of spatial econometrics. It first justifies the necessity of special statistical procedures for the analysis of spatial data and then proceeds to describe the fundamentals of these procedures. In particular, this paper covers three crucial techniques for building models with spatial data. First, we discuss how to create a spatial weights matrix based on the distances between each data point in a dataset. Next, we describe the conventional methods to formally detect spatial autocorrelation, both global and local. Finally, we outline the chief components of a spatial autoregressive model, noting the circumstances under which it would be appropriate to incorporate each component into a model. This paper seeks to offer a concise introduction to spatial econometrics that will be accessible to interested individuals with a background in statistics or econometrics.
An Overview of Statistical Learning Theory Statistical learning theory was introduced in the late 1960’s. Until the 1990’s it was a purely theoretical analysis of the problem of function estimation from a given collection of data. In the middle of the 1990’s new types of learning algorithms (called support vector machines) based on the developed theory were proposed. This made statistical learning theory not only a tool for the theoretical analysis but also a tool for creating practical algorithms for estimating multidimensional functions. This article presents a very general overview of statistical learning theory including both theoretical and algorithmic aspects of the theory. The goal of this overview is to demonstrate how the abstract learning theory established conditions for generalization which are more general than those discussed in classical statistical paradigms and how the understanding of these conditions inspired new algorithmic approaches to function estimation problems. A more detailed overview of the theory (without proofs) can be found in Vapnik (1995). In Vapnik (1998) one can find detailed description of the theory (including proofs).
Analysing spatial point patterns in R This is a detailed set of notes for a workshop on Analysing spatial point patterns in R, presented by the author in Australia and New Zealand since 2006. The goal of the workshop is to equip researchers with a range of practical techniques for the statistical analysis of spatial point patterns. Some of the techniques are well established in the applications literature, while some are very recent developments. The workshop is based on spatstat, a contributed library for the statistical package R, which is free open source software. Topics covered include: statistical formulation and methodological issues; data input and handling; R concepts such as classes and methods; exploratory data analysis; nonparametric intensity and risk estimates; goodness-of-fit testing for Complete Spatial Randomness; maximum likelihood inference for Poisson processes; spatial logistic regression; model validation for Poisson processes; exploratory analysis of dependence; distance methods and summary functions such as Ripley’s K function; simulation techniques; non-Poisson point process models; fitting models using summary statistics; LISA and local analysis; inhomogeneous K-functions; Gibbs point process models; fitting Gibbs models; simulating Gibbs models; validating Gibbs models; multitype and marked point patterns; exploratory analysis of multitype and marked point patterns; multitype Poisson process models and maximum likelihood inference; multitype Gibbs process models and maximum pseudolikelihood; line segment patterns, 3-dimensional point patterns, multidimensional space-time point patterns, replicated point patterns, and stochastic geometry methods.
Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining The proliferation of textual data in business is overwhelming. Unstructured textual data is being constantly generated via call center logs, emails, documents on the web, blogs, tweets, customer comments, customer reviews, and so on. While the amount of textual data is increasing rapidly, businesses’ ability to summarize, understand, and make sense of such data for making better business decisions remains challenging. This paper takes a quick look at how to organize and analyze textual data for extracting insightful customer intelligence from a large collection of documents and for using such information to improve business operations and performance. Multiple business case studies using real data demonstrate applications of text analytics and sentiment mining with SAS Text Miner and SAS Sentiment Analysis Studio. While SAS products are used as demonstration tools, the topics and theories covered are generic (not tool specific).
Analytical Skills, Tools & Attitudes 2013: Analytics capabilities needed now and in the future Organizations continue to invest more in analytics, but increasingly there is recognition that a shortage of analytic talent is holding back even greater investment. Lavastorm Analytics polled more than 425 people in the analytics community about whether their organization needs more analytic resources or skills and which skills are valued most and are most urgently needed. Survey respondents included business analysts, technologists, data analytics professionals, managers, and C-level executives across a broad variety of industries. The top findings were:
– According to the survey respondents, a lack of skills/training/education is the biggest factor holding back organizations from using analytics more.
– Skills most urgently needed in their organizations are Statistics, math or other quantitative skills; Analytic tool training; and Critical thinking.
– Lack of funding or resources, however, also has a significant impact on adoption of analytics to drive day-to-day decisions. Lesser factors also include inadequate support from executives and data that is not integrated.
Analytics 3.0 In the new era, big data will power consumer products and services.
Analytics: The real-world use of big data “Big data” – which admittedly means many things to many people – is no longer confined to the realm of technology. Today it is a business priority, given its ability to profoundly affect commerce in the globally integrated economy. In addition to providing solutions to long-standing business challenges, big data inspires new ways to transform processes, organizations, entire industries and even society itself. Yet extensive media coverage makes it hard to distinguish hype from reality – what is really happening? Our newest research finds that organizations are using big data to target customer-centric outcomes, tap into internal data and build a better information ecosystem.
Analyzing the Analyzers Binita, Chao, Dmitri, and Rebecca are data scientists. What does that statement tell you about them? Probably not as much as you’d like. You know they probably know something about statistics, programming, and data visualization. You’d hope that they had some experience finding insights from data, maybe even “big data.” But if you’re trying to find the best person for a job, you need to be more specific than just “doctor,” or “athlete,” or “data scientist.” And that’s a problem. Finding the right people for a task is all about efficient communication and, without the appropriate shared vocabulary, data science talent and data science problems are too often kept apart….
Anomaly Detection: A Survey Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques, since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not originally intended.
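As one concrete instance of the "basic technique per category" idea, a minimal statistical detector that fits a Gaussian to reference data and flags points far in the tails (the data and the 3-sigma threshold are illustrative choices, not recommendations from the survey):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "normal" behaviour, e.g., daily response times in milliseconds.
train = rng.normal(loc=10.0, scale=2.0, size=1000)
mu, sigma = train.mean(), train.std()

def is_anomaly(x, z=3.0):
    """Flag observations more than z standard deviations from the mean."""
    return np.abs(x - mu) / sigma > z

print(is_anomaly(np.array([9.5, 25.0])))  # -> [False  True]
```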
APACHE DRILL: Interactive Ad-Hoc Analysis at Scale Apache Drill is a distributed system for interactive ad-hoc analysis of large-scale datasets. Designed to handle up to petabytes of data spread across thousands of servers, the goal of Drill is to respond to ad-hoc queries in a low-latency manner. In this article, we introduce Drill’s architecture, discuss its extensibility points, and put it into the context of the emerging offerings in the interactive analytics realm.
Applied Data Science in Europe Google Trends and other IT fever charts rate Data Science among the most rapidly emerging and promising fields that expand around computer science. Although Data Science draws on content from established fields like artificial intelligence, statistics, databases, visualization and many more, industry is demanding trained data scientists faster than anyone seems able to deliver them. This is due to the pace at which the field has expanded and the corresponding lack of curricula; the unique skill set, which is inherently multi-disciplinary; and the translation work (from the US web economy to other ecosystems) necessary to realize the recognized world-wide potential of applying analytics to all sorts of data. In this contribution we draw from our experiences in establishing an inter-disciplinary Data Science lab in order to highlight the challenges and potential remedies for Data Science in Europe. We discuss our role as academia in the light of the potential societal/economic impact as well as the challenges in organizational leadership tied to such inter-disciplinary work.
Architecting a High Performance Storage System Designing a large-scale, high-performance data storage system presents significant challenges. This paper describes a step-by-step approach to designing such a system and presents an iterative methodology that applies at both the component level and the system level. A detailed case study using the methodology described to design a Lustre storage system is presented.
Are Saddles Good Enough for Deep Learning? Recent years have seen a growing interest in understanding deep neural networks from an optimization perspective. It is understood now that converging to low-cost local minima is sufficient for such models to become effective in practice. However, in this work, we propose a new hypothesis based on recent theoretical findings and empirical studies that deep neural network models actually converge to saddle points with high degeneracy. Our findings from this work are new, and can have a significant impact on the development of gradient descent based methods for training deep networks. We validated our hypotheses using an extensive experimental evaluation on standard datasets such as MNIST and CIFAR-10, and also showed that recent efforts that attempt to escape saddles finally converge to saddles with high degeneracy, which we define as `good saddles’. We also verified the famous Wigner’s Semicircle Law in our experimental results.
Are You a Bayesian or a Frequentist? (Slide Deck)
Artificial Intelligence and Economic Theories The advent of artificial intelligence has changed many disciplines such as engineering, social science and economics. Artificial intelligence is a computational technique inspired by natural intelligence, such as the swarming of birds, the working of the brain and the path-finding of ants. These techniques have an impact on economic theories. This book studies the impact of artificial intelligence on economic theories, a subject that has not been extensively studied. The theories that are considered are: demand and supply, asymmetrical information, pricing, rational choice, rational expectation, game theory, efficient market hypotheses, mechanism design, prospect, bounded rationality, portfolio theory, rational counterfactual and causality. The benefit of this book is that it evaluates existing theories of economics and updates them based on developments in the artificial intelligence field.
Artificial Intelligence Now The phrase “artificial intelligence” has a way of retreating into the future: as things that were once in the realm of imagination and fiction become reality, they lose their wonder and become “machine translation,” “real-time traffic updates,” “self-driving cars,” and more. But the past 12 months have seen a true explosion in the capacities as well as adoption of AI technologies. While the flavor of these developments has not pointed to the “general AI” of science fiction, it has come much closer to offering generalized AI tools—these tools are being deployed to solve specific problems. But now they solve them more powerfully than the complex, rule-based tools that preceded them. More importantly, they are flexible enough to be deployed in many contexts. This means that more applications and industries are ripe for transformation with AI technologies. This book, drawing from the best posts on the O’Reilly AI blog, brings you a summary of the current state of AI technologies and applications, as well as a selection of useful guides to getting started with deep learning and AI technologies. Part I covers the overall landscape of AI, focusing on the platforms, businesses, and business models that are shaping the growth of AI. We then turn to the technologies underlying AI, particularly deep learning, in Part II. Part III brings us some “hobbyist” applications: intelligent robots. Even if you don’t build them, they are an incredible illustration of the low cost of entry into computer vision and autonomous operation. Part IV focuses on another application: natural language. Part V takes us into commercial use cases: bots and autonomous vehicles. And finally, Part VI discusses a few of the interplays between human and machine intelligence, leaving you with some big issues to ponder in the coming year.
Assessing Your Business Analytics Initiatives – Eight Metrics That Matter It’s no secret that using analytics to uncover meaningful insights from data is crucial for making fact-based decisions. Now considered mainstream, the business analytics market worldwide is expected to exceed $50 billion by the year 2016. Yet when it comes to making analytics work, not all organizations are equal. In fact, despite the transformative power of big data and analytics, many organizations still struggle to wring value from their information. The complexities of dealing with big data, integrating technologies, finding analytical talent and challenging corporate culture are the main pitfalls to the successful use of analytics within organizations. The management of information – including the analytics used to transform it – is an evolutionary process, and organizations are at various levels of this evolution. Those wanting to advance analytics to a new level need to understand their analytics activities across the organization, from both an IT and business perspective. Toward that end, an assessment focusing on eight key analytics metrics can be used to identify strengths and areas for improvement in the analytics life cycle.
At what sample size do correlations stabilize? Sample correlations converge to the population value with increasing sample size, but the estimates are often inaccurate in small samples. In this report we use Monte-Carlo simulations to determine the critical sample size beyond which the magnitude of a correlation can be expected to be stable. The necessary sample size to achieve stable estimates for correlations depends on the effect size, the width of the corridor of stability (i.e., a corridor around the true value where deviations are tolerated), and the requested confidence that the trajectory does not leave this corridor again. Results indicate that in typical scenarios the sample size should approach 250 for stable estimates.
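A single-replication sketch of the simulation idea (hypothetical parameter choices; the report averages many Monte-Carlo replications across effect sizes, corridor widths, and confidence levels):

```python
import numpy as np

rng = np.random.default_rng(3)

def corr_trajectory(rho, n_max=1000, n_min=20):
    """Sample correlation computed on the first n points as n grows."""
    x = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n_max)
    return np.array([np.corrcoef(x[:n, 0], x[:n, 1])[0, 1]
                     for n in range(n_min, n_max + 1)])

rho, w = 0.3, 0.1               # true correlation and corridor half-width
traj = corr_trajectory(rho)
outside = np.where(np.abs(traj - rho) > w)[0]
# Point of stability: first n after the trajectory's last excursion.
print(20 + outside.max() + 1 if outside.size else 20)
```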
Automated Problem Identification: Regression vs Classification via Evolutionary Deep Networks Regression or classification? This is perhaps the most basic question faced when tackling a new supervised learning problem. We present an Evolutionary Deep Learning (EDL) algorithm that automatically solves this by identifying the question type with high accuracy, along with a proposed deep architecture. Typically, a significant amount of human insight and preparation is required prior to executing machine learning algorithms. For example, when creating deep neural networks, the number of parameters must be selected in advance and furthermore, a lot of these choices are made based upon pre-existing knowledge of the data such as the use of a categorical cross entropy loss function. Humans are able to study a dataset and decide whether it represents a classification or a regression problem, and consequently make decisions which will be applied to the execution of the neural network. We propose the Automated Problem Identification (API) algorithm, which uses an evolutionary algorithm interface to TensorFlow to manipulate a deep neural network to decide if a dataset represents a classification or a regression problem. We test API on 16 different classification, regression and sentiment analysis datasets with up to 10,000 features and up to 17,000 unique target values. API achieves an average accuracy of 96.3% in identifying the problem type without hardcoding any insights about the general characteristics of regression or classification problems. For example, API successfully identifies classification problems even with 1000 target values. Furthermore, the algorithm recommends which loss function to use and also recommends a neural network architecture. Our work is therefore a step towards fully automated machine learning.
Automatic Conversion of Tables to LongForm Dataframes TableToLongForm automatically converts hierarchical Tables intended for a human reader into a simple LongForm Dataframe that is machine readable, hence enabling much greater utilisation of the data. It does this by recognising positional cues present in the hierarchical Table (which would normally be interpreted visually by the human brain) to decompose, then reconstruct the data into a LongForm Dataframe. The article motivates the benefit of such a conversion with an example Table, followed by a short user manual, which includes a comparison between the simple one argument call to TableToLongForm, with code for an equivalent manual conversion. The article then explores the types of Tables the package can convert by providing a gallery of all recognised patterns. It finishes with a discussion of available diagnostic methods and future work.
Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures Automatic description generation from natural images is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities. In this survey, we classify the existing approaches based on how they conceptualize this problem, viz., models that cast description as either a generation problem or a retrieval problem over a visual or multimodal representational space. We provide a detailed review of existing models, highlighting their advantages and disadvantages. Moreover, we give an overview of the benchmark image datasets and the evaluation measures that have been developed to assess the quality of machine-generated image descriptions. Finally, we extrapolate future directions in the area of automatic image description generation.
Automatic Extraction of Causal Relations from Natural Language Texts: A Comprehensive Survey Automatic extraction of cause-effect relationships from natural language texts is a challenging open problem in Artificial Intelligence. Most of the early attempts at its solution used manually constructed linguistic and syntactic rules on small and domain-specific data sets. However, with the advent of big data, the availability of affordable computing power and the recent popularization of machine learning, the paradigm to tackle this problem has slowly shifted. Machines are now expected to learn generic causal extraction rules from labelled data with minimal supervision, in a domain-independent manner. In this paper, we provide a comprehensive survey of causal relation extraction techniques from both paradigms, and analyse their relative strengths and weaknesses, with recommendations for future work.
Automatic Keyphrase Extraction: A Survey of the State of the Art While automatic keyphrase extraction has been examined extensively, state-of-the-art performance on this task is still much lower than that on many core natural language processing tasks. We present a survey of the state of the art in automatic keyphrase extraction, examining the major sources of errors made by existing systems and discussing the challenges ahead.
Automatic Keyword Extraction for Text Summarization: A Survey In recent times, data has been growing rapidly in every domain, such as news, social media, banking, and education. Given this excess of data, there is a need for automatic summarizers capable of summarizing data, especially textual data in original documents, without losing any critical information. Text summarization has emerged as an important research area in the recent past, so a review of existing work on the text summarization process is useful for carrying out further research. In this paper, recent literature on automatic keyword extraction and text summarization is presented, since the text summarization process depends heavily on keyword extraction. This literature includes a discussion of the different methodologies used for keyword extraction and text summarization. It also discusses the different databases used for text summarization in several domains, along with evaluation metrics. Finally, it briefly discusses the issues and research challenges faced by researchers, along with future directions.
Automatic Sarcasm Detection: A Survey Automatic detection of sarcasm has witnessed interest from the sentiment analysis research community. With diverse approaches, datasets and analyses that have been reported, there is an essential need to have a collective understanding of the research in this area. In this survey of automatic sarcasm detection, we describe datasets, approaches (both supervised and rule-based), and trends in sarcasm detection research. We also present a research matrix that summarizes past work, and list pointers to future work.
Automatic Tag Recommendation Algorithms for Social Recommender Systems The emergence of Web 2.0 and the consequent success of social network websites such as del.icio.us and Flickr introduce us to a new concept called social bookmarking, or tagging in short. Tagging can be seen as the action of connecting a relevant user-defined keyword to a document, image or video, which helps users better organize and share their collections of interesting stuff. With the rapid growth of Web 2.0, tagged data is becoming more and more abundant on social network websites. An interesting problem is how to automate the process of making tag recommendations to users when a new resource becomes available. In this paper, we address the issue of tag recommendation from a machine learning perspective. From our empirical observation of two large-scale data sets, we first argue that the user-centered approach for tag recommendation is not very effective in practice. Consequently, we propose two novel document-centered approaches that are capable of making effective and efficient tag recommendations in real scenarios. The first, graph-based method represents the tagged data as two bipartite graphs of (document, tag) and (document, word), then finds document topics by leveraging graph partitioning algorithms. The second, prototype-based method aims at finding the most representative documents within the data collections and advocates a sparse multi-class Gaussian process classifier for efficient document classification. For both methods, tags are ranked within each topic cluster/class by a novel ranking method. Recommendations are performed by first classifying a new document into one or more topic clusters/classes, and then selecting the most relevant tags from those clusters/classes as machine-recommended tags. Experiments on real-world data from Del.icio.us, CiteULike and BibSonomy examine the quality of tag recommendation as well as the efficiency of our recommendation algorithms. The results suggest that our document-centered models can substantially improve the performance of tag recommendations when compared to the user-centered methods, as well as to the LDA topic model and SVM classifiers.
Average Predictive Comparisons for models with nonlinearity, interactions, and variance components In a predictive model, what is the expected difference in the outcome associated with a unit difference in one of the inputs? In a linear regression model without interactions, this average predictive comparison is simply a regression coefficient (with associated uncertainty). In a model with nonlinearity or interactions, however, the average predictive comparison in general depends on the values of the predictors. We consider various definitions based on averages over a population distribution of the predictors, and we compute standard errors based on uncertainty in model parameters. We illustrate with a study of criminal justice data for urban counties in the United States. The outcome of interest measures whether a convicted felon received a prison sentence rather than a jail or non-custodial sentence, with predictors available at both individual and county levels. We fit three models: (1) a hierarchical logistic regression with varying coefficients for the within-county intercepts as well as for each individual predictor; (2) a hierarchical model with varying intercepts only; and (3) a nonhierarchical model that ignores the multilevel nature of the data. The regression coefficients have different interpretations for the different models; in contrast, the models can be compared directly using predictive comparisons. Furthermore, predictive comparisons clarify the interplay between the individual and county predictors for the hierarchical models and also illustrate the relative size of varying county effects.
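In simplified form, and paraphrasing the paper's definition (the full version averages over pairs of input values as well as parameter uncertainty), the average predictive comparison for an input $u$, with the other predictors collected in $v$, is:

$$
\Delta_u = \mathrm{E}_{v,\theta}\bigl[\, \mathrm{E}(y \mid u + 1,\, v,\, \theta) - \mathrm{E}(y \mid u,\, v,\, \theta) \,\bigr],
$$

with the outer expectation taken over the population distribution of $v$ and the posterior uncertainty in the parameters $\theta$.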
Avoiding the Barriers of In-Memory Business Intelligence: Making Data Discovery Scalable When looking at the growth rates of the business intelligence platform space, it is apparent that acquisitions of new business intelligence tools have shifted dramatically from traditional data visualization and aggregation use cases to newer data discovery implementations. This shift toward data discovery use cases has been driven by two key factors: faster implementation times and the ability to visualize and manipulate data as quickly as an analyst can click a mouse. The improvements in implementation speeds stem from the use of architectures that access source data directly without having to first aggregate all the data in a central location such as an enterprise data warehouse or departmental data mart. The promise of fast manipulation of data has largely been accomplished by employing in-memory data management models to exploit the speed advantage of accessing data from server memory over traditional disk-based approaches. The “physics” of data access favors in-memory data management models. However, in-memory techniques are not without drawbacks. As companies attempt to evolve from small departmental projects to broader division-wide or enterprise-wide initiatives, increasing data volumes and the impact of increasing data consumer counts challenge the limits of early in-memory implementations. These challenges raise serious questions that should be considered by any organization considering in-memory techniques for business intelligence platforms.

B

Bayesian Computation Via Markov Chain Monte Carlo Markov chain Monte Carlo (MCMC) algorithms are an indispensable tool for performing Bayesian inference. This review discusses widely used sampling algorithms and illustrates their implementation on a probit regression model for lupus data. The examples considered highlight the importance of tuning the simulation parameters and underscore the important contributions of modern developments such as adaptive MCMC. We then use the theory underlying MCMC to explain the validity of the algorithms considered and to assess the variance of the resulting Monte Carlo estimators.
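A sketch of the classic data-augmentation Gibbs sampler for probit regression (Albert and Chib, 1993), the standard sampler for the model family mentioned above; the simulated data, the flat prior on beta, and the iteration counts are illustrative choices, not the review's lupus analysis:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(4)

# Simulated probit data: y_i = 1{x_i' beta + eps_i > 0}, eps_i ~ N(0, 1).
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

XtX_inv = np.linalg.inv(X.T @ X)
beta = np.zeros(p)
draws = []
for _ in range(2000):
    # 1) Latent utilities z_i ~ N(x_i' beta, 1), truncated to match y_i.
    m = X @ beta
    lo = np.where(y == 1, -m, -np.inf)   # z_i > 0 when y_i = 1
    hi = np.where(y == 1, np.inf, -m)    # z_i < 0 when y_i = 0
    z = m + truncnorm.rvs(lo, hi, size=n, random_state=rng)
    # 2) beta | z ~ N((X'X)^{-1} X'z, (X'X)^{-1}) under a flat prior.
    beta = rng.multivariate_normal(XtX_inv @ X.T @ z, XtX_inv)
    draws.append(beta)

print(np.mean(draws[500:], axis=0))      # posterior mean after burn-in
```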
Bayesian Computational Tools This article surveys advances in the field of Bayesian computation over the past 20 years from a purely personal viewpoint, hence containing some omissions given the spectrum of the field. Monte Carlo, MCMC, and ABC themes are covered here, whereas the rapidly expanding area of particle methods is only briefly mentioned and different approximative techniques such as variational Bayes and linear Bayes methods do not appear at all. This article also contains some novel computational entries on the double-exponential model that may be of interest.
Bayesian Computing with INLA: A Review The key operation in Bayesian inference is to compute high-dimensional integrals. An old approximate technique is the Laplace method or approximation, which dates back to Pierre-Simon Laplace (1774). This simple idea approximates the integrand with a second-order Taylor expansion around the mode and computes the integral analytically. By developing a nested version of this classical idea, combined with modern numerical techniques for sparse matrices, we obtain the approach of Integrated Nested Laplace Approximations (INLA) for doing approximate Bayesian inference for latent Gaussian models (LGMs). LGMs represent an important model abstraction for Bayesian inference and include a large proportion of the statistical models used today. In this review, we will discuss the reasons for the success of the INLA approach, the R-INLA package, why it is so accurate, why the approximations are very quick to compute, and why LGMs make such a useful concept for Bayesian computing.
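The classical one-dimensional version of the approximation, with $\hat{x}$ the mode of $f$:

$$
\int e^{n f(x)}\, dx \;\approx\; e^{n f(\hat{x})} \sqrt{\frac{2\pi}{n\, \lvert f''(\hat{x}) \rvert}}.
$$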
Bayesian estimation supersedes the t test Bayesian estimation for two groups provides complete distributions of credible values for the effect size, group means and their difference, standard deviations and their difference, and the normality of the data. The method handles outliers. The decision rule can accept the null value (unlike traditional t tests) when certainty in the estimate is high (unlike Bayesian model comparison using Bayes factors). The method also yields precise estimates of statistical power for various research goals. The software and programs are free, and run on Macintosh, Windows, and Linux platforms.
Bayesian Group Decisions: Algorithms and Complexity Many important real-world decision-making problems involve interactions of individuals with purely informational externalities, for example, in jury deliberations, expert committees, etc. We model such interactions of rational agents in a group, where they receive private information and act based on that information while also observing other people’s beliefs or actions. As a Bayesian agent attempts to infer the truth from her sequence of observations of actions of others and her own private signal, she recursively refines her belief on the signals that other players could have observed and actions that they could have taken given that other players are also rational. The existing literature addresses asymptotic properties of Bayesian group decisions (important questions such as convergence to consensus and learning). In this work, we address the computations that the Bayesian agent should undertake to realize the optimal actions at every decision epoch. We use iterated elimination of infeasible signals (IEIS) to model the thinking process as well as the calculations of a Bayesian agent in a group decision scenario. We show that the IEIS algorithm runs in exponential time; however, when the group structure is a partially ordered set, the Bayesian calculations simplify and polynomial-time computation of the Bayesian recommendations is possible. We next shift attention to the case where agents reveal their beliefs (instead of actions) at every decision epoch. We analyze the computational complexity of Bayesian belief formation in groups and show that it is NP-hard. We also investigate the factors underlying this computational complexity and show how belief calculations simplify in special network structures or cases with strong inherent symmetries. We finally give insights about the statistical efficiency (optimality) of the beliefs and their relation to computational efficiency.
Bayesian Methods of Parameter Estimation In order to motivate the idea of parameter estimation we need to first understand the notion of mathematical modeling. What is the idea behind modeling real world phenomena? Mathematically modeling an aspect of the real world enables us to better understand it and better explain it, and perhaps enables us to reproduce it, either on a large scale, or on a simplified scale that characterizes only the critical parts of that phenomenon. How do we model these real life phenomena? These real life phenomena are captured by means of distribution models, which are extracted or learned directly from data gathered about them. So, what do we mean by parameter estimation? Every distribution model has a set of parameters that need to be estimated. These parameters specify any constants appearing in the model and provide a mechanism for efficient and accurate use of data. …
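A minimal concrete instance of the setup described above: model coin flips $x_1,\dots,x_n \in \{0,1\}$ as Bernoulli($\theta$) draws with a Beta($\alpha,\beta$) prior on the parameter. Conjugacy makes the posterior available in closed form,

$$p(\theta \mid x_{1:n}) \;\propto\; \theta^{s}(1-\theta)^{\,n-s}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} \;\Longrightarrow\; \theta \mid x_{1:n} \sim \mathrm{Beta}(\alpha+s,\; \beta+n-s), \qquad s=\textstyle\sum_i x_i,$$

so the posterior mean $(\alpha+s)/(\alpha+\beta+n)$ serves as the parameter estimate, pulled between the prior mean and the observed sample frequency.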
Bayesian Model Averaging: A Tutorial Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.
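The coherent mechanism in question is a posterior-weighted mixture over the candidate models $M_1,\dots,M_K$: for any quantity of interest $\Delta$ (say, a future observation),

$$p(\Delta \mid D) \;=\; \sum_{k=1}^{K} p(\Delta \mid M_k, D)\, p(M_k \mid D), \qquad p(M_k \mid D) \;=\; \frac{p(D \mid M_k)\, p(M_k)}{\sum_{l=1}^{K} p(D \mid M_l)\, p(M_l)}.$$

Selecting a single model amounts to setting one of these weights to 1 and the rest to 0, which is exactly where the over-confidence comes from.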
Bayesian Statistics Students are to choose one paper from the following list, or possibly one outside the list upon my agreement. The papers are available online. Most of them are collected in this zip file. The presentation can focus on a particular section / result / example of the paper. Evaluation of the students is based on the understanding and presentation of the chosen paper.
Bayesian Statistics Papers (Paper Collection)
Behavior Trees in Robotics and AI, an Introduction A Behavior Tree (BT) is a way to structure the switching between different tasks in an autonomous agent, such as a robot or a virtual entity in a computer game. BTs are a very efficient way of creating complex systems that are both modular and reactive. These properties are crucial in many applications, which has led to the spread of BT from computer game programming to many branches of AI and Robotics. In this book, we will first give an introduction to BTs, then we describe how BTs relate to, and in many cases generalize, earlier switching structures. These ideas are then used as a foundation for a set of efficient and easy to use design principles. Properties such as safety, robustness, and efficiency are important for an autonomous system, and we describe a set of tools for formally analyzing these using a state space description of BTs. With the new analysis tools, we can formalize the descriptions of how BTs generalize earlier approaches. Finally, we describe an extended set of tools to capture the behavior of Stochastic BTs, where the outcomes of actions are described by probabilities. These tools enable the computation of both success probabilities and time to completion.
Best Practices for Scientific Computing Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.
Better Decisions through Science Math-based aids for making decisions in medicine and industry could improve many diagnoses – often saving lives in the process.
Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm Figures in scientific publications are critically important because they often show the data supporting key findings. Our systematic review of research articles published in top physiology journals (n = 703) suggests that, as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies. Papers rarely included scatterplots, box plots, and histograms that allow readers to critically evaluate continuous data. Most papers presented continuous data in bar and line graphs. This is problematic, as many different data distributions can lead to the same bar or line graph. The full data may suggest different conclusions from the summary statistics. We recommend training investigators in data presentation, encouraging a more complete presentation of data, and changing journal editorial policies. Investigators can quickly make univariate scatterplots for small sample size studies using our Excel templates.
BI forward: A full view of your business Imagine that your organization is effectively using a business intelligence (BI) solution that provides everything you need to make better decisions and improve operational efficiency. Imagine users with their fingers on the pulse of markets, customers, channels and operations at all times. And imagine that your programs, plans, services and products are being designed with full and timely insight into all the factors – past, present and future – critical to success. What would it take to make that happen? What businesses need from BI is a full picture. And that is why it is important to understand that, for now and in the future, BI should help you not only describe and diagnose your past and current performance, but also predict future performance. When your business can do all three, you have a better idea of what your business needs to do to stay competitive. You have reports that show you where you have been, scorecards and real-time monitoring that show what is happening now and predictive analytics to show where your business is headed. This paper explains the advantages of a BI solution that includes predictive analytics.
BI, Analytics and Big Data A Modern-Day Perspective (Slide Deck)
Big Data Analytics in Action How Your Organization Can Improve its Bottom Line through Better Measurement, Better Decisions and Faster Response to Dynamic Market Conditions.
Big Data Analytics: A Survey The age of big data has arrived, but traditional data analytics may not be able to handle such large quantities of data. The questions that arise are how to develop a high-performance platform to efficiently analyze big data and how to design appropriate mining algorithms to extract useful knowledge from it. To discuss this issue in depth, this paper begins with a brief introduction to data analytics, followed by a discussion of big data analytics. Some important open issues and further research directions are also presented as next steps for big data analytics.
Big Data and Machine Learning with an Actuarial Perspective (Slide Deck)
Big Data and the Creative Destruction of Today’s Business Models
Big data and the democratisation of decisions In August 2012 the Economist Intelligence Unit conducted a survey sponsored by Alteryx of 241 global executives to gauge their perceptions of big data adoption. Fifty-three percent of respondents are board members or C-suite executives, including 66 CEOs, presidents or managing directors. Those polled are based in North America (34%), the Asia-Pacific region (27%), Western Europe (25%), the Middle East and Africa (6%), Latin America (5%) and Eastern Europe (4%). Half of executives work for companies with revenue that exceeds US$500m. Executives hail from 18 sectors and represent 14 functional roles, including general management (30%), strategy and business development (18%), finance (17%) and marketing and sales (10%).
Big Data and the Internet of Things Advances in sensing and computing capabilities are making it possible to embed increasing computing power in small devices. This has enabled sensing devices not just to passively capture data at very high resolution but also to take sophisticated actions in response. Combined with advances in communication, this is resulting in an ecosystem of highly interconnected devices referred to as the Internet of Things – IoT. In conjunction, advances in machine learning have allowed building models on these ever-increasing amounts of data. Consequently, devices all the way from heavy assets such as aircraft engines to wearables such as health monitors can now not only generate massive amounts of data but also draw on aggregate analytics to ‘improve’ their performance over time. Big data analytics has been identified as a key enabler for the IoT. In this chapter, we discuss various avenues of the IoT where big data analytics either is already making a significant impact or is on the cusp of doing so. We also discuss social implications and areas of concern.
Big Data for Big Business? A Taxonomy of Data-driven Business Models used by Start-up Firms This paper reports a study which provides a series of implications that may be particularly helpful to companies already leveraging ‘big data’ for their businesses or planning to do so. The Data Driven Business Model (DDBM) framework represents a basis for the analysis and clustering of business models. For practitioners the dimensions and various features may provide guidance on possibilities to form a business model for their specific venture. The framework allows identification and assessment of available potential data sources that can be used in a new DDBM. It also provides comprehensive sets of potential key activities as well as revenue models. The identified business model types can serve as both inspiration and blueprint for companies considering creating new data-driven business models. Although the focus of this paper was on business models in the start-up world, the key findings presumably also apply to established organisations to a large extent. The DDBM can potentially be used and tested by established organisations across different sectors in future research.
Big Data for Finance According to the 2014 IDG Enterprise Big Data research report, companies are intensifying their efforts to derive value through big data initiatives, with nearly half (49%) of respondents already implementing big data projects or in the process of doing so. Further, organizations are seeing exponential growth in the amount of data managed, with an expected increase of 76% within the next 12-18 months. With growth there are opportunities as well as challenges. Among those facing the big data challenge are finance executives, as this extraordinary growth presents a unique opportunity to leverage data assets like never before. As the three V’s of big data (volume, velocity and variety) continue to grow, so too does the opportunity for finance sector firms to capitalize on this data for strategic advantage. Finance professionals are accomplished in collecting, analyzing and benchmarking data, so they are in a unique position to provide a new and critical service – making big data more manageable while condensing vast amounts of information into actionable business insights.
Big Data Gets Personal Big data and personal data are converging to shape the Internet’s most surprising consumer products. They’ll predict your needs and store your memories – if you let them.
Big Data in Big Companies Big data burst upon the scene in the first decade of the 21st century, and the first organizations to embrace it were online and startup firms. Arguably, firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning. They didn’t have to reconcile or integrate big data with more traditional sources of data and the analytics performed upon them, because they didn’t have those traditional forms. They didn’t have to merge big data technologies with their traditional IT infrastructures because those infrastructures didn’t exist. Big data could stand alone, big data analytics could be the only focus of analytics, and big data technology architectures could be the only architecture. Consider, however, the position of large, well-established businesses. Big data in those environments shouldn’t be separate, but must be integrated with everything else that’s going on in the company. Analytics on big data have to coexist with analytics on other types of data. Hadoop clusters have to do their work alongside IBM mainframes. Data scientists must somehow get along and work jointly with mere quantitative analysts. In order to understand this coexistence, we interviewed 20 large organizations in the early months of 2013 about how big data fits into their overall data and analytics environments. Overall, we found the expected coexistence; in not a single one of these large organizations was big data being managed separately from other types of data and analytics. The integration was in fact leading to a new management perspective on analytics, which we’ll call “Analytics 3.0.” In this paper we’ll describe the overall context for how organizations think about big data, the organizational structure and skills required for it, and more. We’ll conclude by describing the Analytics 3.0 era.
Big Data Machine Learning: Patterns for Predictive Analytics (RefCard)
Big data maturity: An action plan for policymakers and executives Big data have the potential to improve or transform existing business operations and reshape entire economic sectors. Big data can pave the way for disruptive, entrepreneurial companies and allow new industries to emerge. The technological aspect is important, but insufficient to allow big data to show their full potential and to stop companies from feeling swamped by this information. What matters is to reshape internal decision-making culture so that executives base their judgments on data rather than hunches. Research already indicates that companies that have managed this are more likely to be productive and profitable than the competition. Organizations need to understand where they are in terms of big data maturity, an approach that allows them to assess progress and identify necessary initiatives. Judging maturity requires looking at environment readiness, how far governments have provided the necessary legal and regulatory frameworks, and information and communications technology (ICT) infrastructure; an organization’s internal capabilities and how ready it is to implement big data initiatives; and the many and more complicated methods for using big data, which can mean simple efficiency gains or revamping a business model. The ultimate maturity level involves transforming the business model to be data-driven, which requires significant investment over many years. Policymakers should pay particular attention to environment readiness. They should present citizens with a compelling case for the benefits of big data. This means addressing privacy concerns and seeking to harmonize regulations around data privacy globally. Policymakers should establish an environment that facilitates the business viability of the big data sector (such as data, service, or IT system providers), and they should take educational measures to address the shortage of big data specialists. As big data become ubiquitous in public and private organizations, their use will become a source of national and corporate competitive advantage.
Big Data Visualization: Turning Big Data Into Big Insights This white paper provides valuable information about visualization-based data discovery tools and how they can help IT decision-makers derive more value from big data. Topics include:
• An overview of the IT landscape and the challenges that are leading more businesses to look for alternatives to traditional business intelligence tools
• A description of the features and benefits of visualization-based data discovery tools
• Guidance and suggestions on data governance, and ways to protect the quality of big data while facilitating self-service business intelligence
• Several usage examples of visualization-based data discovery tools from TIBCO Software, the world’s second-largest data discovery vendor
Big Data: Harnessing the Power of Big Data through Education and data-driven Decision Making Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data-science programs, and publications are touting data science as a hot – even “sexy” – career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz. In this article, we argue that there are good reasons why it has been hard to pin down exactly what data science is. One reason is that data science is intricately intertwined with other important concepts also of growing importance, such as big data and data-driven decision making. Another reason is the natural tendency to associate what a practitioner does with the definition of the practitioner’s field; this can result in overlooking the fundamentals of the field. We believe that trying to define the boundaries of data science precisely is not of the utmost importance. We can debate the boundaries of the field in an academic setting, but in order for data science to serve business effectively, it is important (i) to understand its relationships to other important related concepts, and (ii) to begin to identify the fundamental principles underlying data science. Once we embrace (ii), we can much better understand and explain exactly what data science has to offer. Furthermore, only once we embrace (ii) should we be comfortable calling it data science. In this article, we present a perspective that addresses all these concepts. We close by offering, as examples, a partial list of fundamental principles underlying data science.
Big Data: New Tricks for Econometrics Computers are now involved in many economic transactions and can capture data associated with these transactions, which can then be manipulated and analyzed. Conventional statistical and econometric techniques such as regression often work well, but there are issues unique to big datasets that may require different tools. First, the sheer size of the data involved may require more powerful data manipulation tools. Second, we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection. Third, large datasets may allow for more flexible relationships than simple linear models. Machine learning techniques such as decision trees, support vector machines, neural nets, deep learning, and so on may allow for more effective ways to model complex relationships. In this essay, I will describe a few of these tools for manipulating and analyzing big data. I believe that these methods have a lot to offer and should be more widely known and used by economists. In fact, my standard advice to graduate students these days is to go to the computer science department and take a class in machine learning. There have been very fruitful collaborations between computer scientists and statisticians in the last decade or so, and I expect collaborations between computer scientists and econometricians will also be productive in the future.
Big data: The next frontier for innovation, competition, and productivity This report contributes to MGI’s mission to help global leaders understand the forces transforming the global economy, improve company performance, and work for better national and international policies. As with all MGI research, we would like to emphasize that this work is independent and has not been commissioned or sponsored in any way by any business, government, or other institution.
Big Workflow: More than Just Intelligent Workload Management for Big Data Big data applications represent a fast-growing category of high-value applications that are increasingly employed by business and technical computing users. However, they have exposed an inconvenient dichotomy in the way resources are utilized in data centers. Conventional enterprise and web-based applications can be executed efficiently in virtualized server environments, where resource management and scheduling is generally confined to a single server. By contrast, data-intensive analytics and technical simulations demand large aggregated resources, necessitating intelligent scheduling and resource management that spans a computer cluster, cloud, or entire data center. Although these tools exist in isolation, they are not available in a general-purpose framework that allows them to interoperate easily and automatically within existing IT infrastructure. A new approach, known as “Big Workflow,” is being created by Adaptive Computing to address the needs of these applications. It is designed to unify public clouds, private clouds, Map Reduce-type clusters, and technical computing clusters. Specifically Big Workflow will:
• Schedule, optimize and enforce policies across the data center
• Enable data-aware workflow coordination across storage and compute silos
• Integrate with external workflow automation tools
Such a solution will provide a much-needed toolset for managing big data applications, shortening timelines, simplifying operations, maximizing resource utilization, and preserving existing investments.
Blending Transactions and Analytics in a Single In-Memory Platform: Key to the Real-Time Enterprise This white paper discusses the issues involved in the traditional practice of deploying transactional and analytic applications on separate platforms using separate databases. It analyzes the results from a user survey, conducted on SAP’s behalf by IDC, that explores these issues. The paper then considers how SAP HANA, with its combination of in-memory data management and its ability to handle both transactions and analytics in real time, can resolve these issues. It explores how businesses may find opportunities for innovation (such as the ability to engage in a richer dialog with a customer based on analysis of the latest transactional information), for speed (with the ability to provide faster access to information to make timely decisions), and for simplification of the IT landscape with a single in-memory platform.
Blind Source Separation: Fundamentals and Recent Advances (A Tutorial Overview Presented at SBrT-2001) Blind source separation (BSS), i.e., the decoupling of unknown signals that have been mixed in an unknown way, has been a topic of great interest in the signal processing community for the last decade, covering a wide range of applications in such diverse fields as digital communications, pattern recognition, biomedical engineering, and financial data analysis, among others. This course aims at an introduction to the BSS problem via an exposition of well-known and established as well as some more recent approaches to its solution. A unified way is followed in presenting the various results so as to more easily bring out their similarities/differences and emphasize their relative advantages/disadvantages. Only a representative sample of the existing knowledge on BSS will be included in this course. The interested readers are encouraged to consult the list of bibliographical references for more details on this exciting and always active research topic.
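As a toy illustration of the BSS problem statement (unknown sources, unknown mixing, only the mixtures observed), here is a sketch using independent component analysis via scikit-learn's FastICA; ICA is the best-known family of BSS methods, and the sources, mixing matrix, and sizes below are invented for the demo rather than taken from the tutorial.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
# Two independent sources: a sinusoid and a square wave.
S = np.column_stack([np.sin(3 * t), np.sign(np.sin(5 * t))])
A = np.array([[1.0, 0.5], [0.4, 1.0]])  # unknown mixing matrix
X = S @ A.T                              # we observe only the mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)             # recovered up to order and scale

# Match each true source to its best-correlated estimate.
for i in range(2):
    c = [abs(np.corrcoef(S[:, i], S_est[:, j])[0, 1]) for j in range(2)]
    print(f"source {i}: best |corr| = {max(c):.3f}")
```

The permutation and scale ambiguity visible here (sources come back in arbitrary order and amplitude) is intrinsic to the BSS problem, not an artifact of this particular algorithm.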
Breaking Data Science Open: Deliver Collaboration, Self-Service and Production Deployment with Open Data Science Data science has burst into public attention over the past few years as perhaps the hottest and most lucrative technology field. No longer just a buzzword for advanced analytics, data science is poised to change everything about an organization: its potential customers, expansion plans, engineering and manufacturing process, how it chooses and interacts with suppliers and more. The leading edge of this tsunami is a combination of innovative business and technology trends that promise a more intelligent future based on Open Data Science. Open Data Science is a movement that makes the open source tools of data science—data, analytics and computation—work together as a connected ecosystem. (The author, Christine Doig (@ch_doig), is a senior data scientist at Continuum Analytics, where she has worked on several projects, including MEMEX, a DARPA-funded open data science project to help stop human trafficking; she has 5+ years of experience in analytics, operations research, and machine learning across a variety of industries.)
Bridging the gap between hierarchical network representation and functional analysis RedeR is an R-based package combined with a Java application for dynamic network visualization and manipulation. It implements a callback engine by using a low-level R-to-Java interface to build and run common plugins. In this sense, RedeR takes advantage of R to run robust statistics, while the R-to-Java interface bridges the gap between network analysis and visualization. RedeR is designed to deal with three key challenges in network analysis. Firstly, biological networks are modular and hierarchical, so network visualization needs to take advantage of such structural features. Secondly, network analysis relies on statistical methods, many of which are already available in resources like CRAN or Bioconductor. However, the missing link between advanced visualization and statistical computing makes it hard to take full advantage of R packages for network analysis. Thirdly, in larger networks user input is needed to focus the view of the network on the biologically relevant parts, rather than relying on an automatic layout function. RedeR is designed to address these challenges (additional information is available at Castro et al.).
Brief: Real-Time Speech Analytics — Still More Sizzle Than Steak Most customer service organizations record phone interactions with their customers. If they get around to analyzing those recordings, whatever they find can’t change the outcome of those calls — they are long since over. Vendors of real-time speech analytics tools promise to allow companies to intervene at the moment of truth, while the customer and the contact center agent are still talking. This brief discusses the hurdles application development and delivery (AD&D) pros will need to overcome to justify the expenditure on this technology and the steps they will need to take to prepare for a world of alerts generated in real-time based on customer conversations.
Build a Powerful Business Case for Data Quality with Metrics Money and resources wasted; sales missed; extra costs incurred. Recent research by industry analyst firm Gartner shows that the shocking price that companies are paying because of poor quality data adds up to a staggering $8.2 million annually.
Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models We survey latent variable models for solving data-analysis problems. A latent variable model is a probabilistic model that encodes hidden patterns in the data. We uncover these patterns from their conditional distribution and use them to summarize data and form predictions. Latent variable models are important in many fields, including computational biology, natural language processing, and social network analysis. Our perspective is that models are developed iteratively: We build a model, use it to analyze data, assess how it succeeds and fails, revise it, and repeat. We describe how new research has transformed these essential activities. First, we describe probabilistic graphical models, a language for formulating latent variable models. Second, we describe mean field variational inference, a generic algorithm for approximating conditional distributions. Third, we describe how to use our analyses to solve problems: exploring the data, forming predictions, and pointing us in the direction of improved models.
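The generic algorithm mentioned second, mean field variational inference, can be summarized in a pair of displays: maximize a lower bound on the evidence over a fully factorized family,

$$\log p(x) \;\ge\; \mathbb{E}_{q}\big[\log p(x, z)\big] \;-\; \mathbb{E}_{q}\big[\log q(z)\big] \;=:\; \mathcal{L}(q), \qquad q(z) \;=\; \prod_{j} q_j(z_j),$$

with coordinate updates $q_j^{*}(z_j) \propto \exp\{\mathbb{E}_{q_{-j}}[\log p(x, z)]\}$ cycled until $\mathcal{L}$ converges; the optimized $q$ then stands in for the intractable conditional $p(z \mid x)$ when summarizing data and forming predictions.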
Building Data Science Teams Starting in 2008, Jeff Hammerbacher (@hackingdata) and I sat down to share our experiences building the data and analytics groups at Facebook and LinkedIn. In many ways, that meeting was the start of data science as a distinct professional specialization (see “What Makes a Data Scientist?” on page 11 for the story on how we came up with the title “Data Scientist”). Since then, data science has taken on a life of its own. The hugely positive response to “What Is Data Science?,” a great introduction to the meaning of data science in today’s world, showed that we were at the start of a movement. There are now regular meetups, well-established startups, and even college curricula focusing on data science. As McKinsey’s big data research report and LinkedIn’s data indicate (see Figure 1), data science talent is in high demand. This increase in the demand for data scientists has been driven by the success of the major Internet companies. Google, Facebook, LinkedIn, and Amazon have all made their marks by using data creatively: not just warehousing data, but turning it into something of value. Whether that value is a search result, a targeted advertisement, or a list of possible acquaintances, data science is producing products that people want and value. And it’s not just Internet companies: Walmart doesn’t produce “data products” as such, but they’re well known for using data to optimize every aspect of their retail operations. Given how important data science has grown, it’s important to think about what data scientists add to an organization, how they fit in, and how to hire and build effective data science teams.
Building High-level Features Using Large Scale Unsupervised Learning We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200×200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 22,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.
Building Production-Ready Predictive Analytics There’s a part of data science that you never hear about: the production. Everybody talks about how to build models, but not many people worry about how to actually use those models. Yet production issues are the reason many companies fail to see value come from their data science efforts. We wondered how companies handled their production processes and environments to build production-ready data products, and we figured the easiest way to find out was to ask them. We conducted a worldwide survey and asked thousands of companies. And we got our answers. After analyzing those answers, we isolated four different ways companies are dealing with production today, and we put together a series of recommendations on how to build production-ready data science projects.
Building Real-Time Data Pipelines Imagine you had a time machine that could go back one minute, or an hour. Think about what you could do with it. From the perspective of other people, it would seem like there was nothing you couldn’t do, no contest you couldn’t win. In the real world, there are three basic ways to win. One way is to have something, or to know something, that your competition does not. Nice work if you can get it. The second way to win is to simply be more intelligent. However, the number of people who think they are smarter is much larger than the number of people who actually are smarter. The third way is to process information faster so you can make and act on decisions faster. Being able to make more decisions in less time gives you an advantage in both information and intelligence. It allows you to try many ideas, correct the bad ones, and react to changes before your competition. If your opponent cannot react as fast as you can, it does not matter what they have, what they know, or how smart they are. Taken to extremes, it’s almost like having a time machine. An example of the third way can be found in high-frequency stock trading. Every trading desk has access to a large pool of highly intelligent people, and pays them well. All of the players have access to the same information at the same time, at least in theory. Being more or less equally smart and informed, the most active area of competition is the end-to-end speed of their decision loops. In recent years, traders have gone to the trouble of building their own wireless long-haul networks, to exploit the fact that microwaves move through the air 50% faster than light can pulse through fiber optics. This allows them to execute trades a crucial millisecond faster. Finding ways to shorten end-to-end information latency is also a constant theme at leading tech companies. They are forever working to reduce the delay between something happening out there in the world or in their huge clusters of computers, and when it shows up on a graph. At Facebook in the early 2010s, it was normal to wait hours after pushing new code to discover whether everything was working efficiently. The full report came in the next day. After building their own distributed in-memory database and event pipeline, their information loop is now on the order of 30 seconds, and they push at least two full builds per day. Instead of slowing down as they got bigger, Facebook doubled down on making more decisions faster. What is your system’s end-to-end latency? How long is your decision loop, compared to the competition? Imagine you had a system that was twice as fast. What could you do with it? This might be the most important question for your business. In this book we’ll explore new models of quickly processing information end to end that are enabled by long-term hardware trends, learnings from some of the largest and most successful tech companies, and surprisingly powerful ideas that have survived the test of time.
Business Analytics for Manufacturing: Four Ways to Increase Efficiency and Performance Whether the economy is strong or weak, the fundamental strategies for surviving and thriving still hold true. Manufacturers have to be highly efficient to meet demand and supply requirements. Costs and resources also have to be managed carefully and intelligently. At the same time, companies are considering new tactics: inventory optimization, maintenance operations, intelligent supply chains and leveraging technology as a focal point of business strategy. In order to be successful your company needs access to critical information and visibility into how well your business, your market and your competitors are responding to today’s challenging and changing times. …
Business Models for the Data Economy Whether you call it Big Data, data science, or simply analytics, modern businesses see data as a gold mine. Sometimes they already have this data in hand and understand that it is central to their activities. Other times, they uncover new data that fills a perceived gap, or seemingly “useless” data generated by other processes. Whatever the case, there is certainly value in using data to advance your business.
Business Process Deviance Mining: Review and Evaluation Business process deviance refers to the phenomenon whereby a subset of the executions of a business process deviate, in a negative or positive way, with respect to its expected or desirable outcomes. Deviant executions of a business process include those that violate compliance rules, or executions that undershoot or exceed performance targets. Deviance mining is concerned with uncovering the reasons for deviant executions by analyzing business process event logs. This article provides a systematic review and comparative evaluation of deviance mining approaches based on a family of data mining techniques known as sequence classification. Using real-life logs from multiple domains, we evaluate a range of feature types and classification methods in terms of their ability to accurately discriminate between normal and deviant executions of a process. We also analyze the interestingness of the rule sets extracted using different methods. We observe that feature sets extracted using pattern mining techniques only slightly outperform simpler feature sets based on counts of individual activity occurrences in a trace.
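The baseline feature set the article finds hard to beat, counts of individual activity occurrences per trace, is simple to reproduce. Below is a minimal sketch; the event log, activity names, and classifier choice are illustrative inventions, not the article's data or exact pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy event log: each trace is a sequence of activity labels,
# labeled 1 if the execution was deviant (all names are made up).
traces = [["register", "check", "approve"],
          ["register", "check", "check", "reject"],
          ["register", "approve"],
          ["register", "check", "check", "check", "reject"]]
deviant = [0, 1, 0, 1]

# Count each individual activity per trace: the simple baseline
# feature set that pattern-mined features only slightly outperform.
vec = CountVectorizer(analyzer=lambda trace: trace)
X = vec.fit_transform(traces)

clf = LogisticRegression().fit(X, deviant)
print(dict(zip(vec.get_feature_names_out(), clf.coef_[0].round(2))))
```

Pattern-mining approaches replace the single-activity columns with counts of discovered subsequences; the classification step stays the same, which is why the comparison in the article is a fair one.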
Business-Driven BI: Using New Technologies to Foster Self-Service Access to Insights Self-Service Business Intelligence (BI) has been the holy grail for BI professionals for a long time. Yet almost two-thirds of BI professionals (64%) rate the success of their self-service initiatives “average” or lower. Newcomers to BI struggle even more, with more than half (52%) rating their attempts at self-service BI “fair” or “poor.” One reason for these less-than-stellar numbers is this: Implementing self-service BI is more complex than it looks. It’s not a one-size-fits-all program. BI users come in many different shapes and sizes, each with unique information requirements. This report lays out several frameworks that explain how users interact with information and then maps elements of each to BI functionality and categories of BI tools. This mapping is critical to success with self-service BI….

C

Caching and Distributing Statistical Analyses in R We present the cacher package for R, which provides tools for caching statistical analyses and for distributing these analyses to others in an efficient manner. The cacher package takes objects created by evaluating R expressions and stores them in key-value databases. These databases of cached objects can subsequently be assembled into packages for distribution over the web. The cacher package also provides tools to help readers examine the data and code in a statistical analysis and reproduce, modify, or improve upon the results. In addition, readers can easily conduct alternate analyses of the data. We describe the design and implementation of the cacher package and provide two examples of how the package can be used for reproducible research. This vignette was originally published as Peng (2008).
Calling R from .NET: a case-study using Rapid NCA, the non-compartmental analysis workflow tool (Slide Deck)
Canonical example of Bayes’ theorem in detail The most common elementary illustration of Bayes’ theorem is medical testing for a rare disease. The example is almost a cliché in probability and statistics books. And yet in my opinion, it’s usually presented too quickly and too abstractly. Here I’m going to risk erring on the side of going too slowly and being too concrete. I’ll work out an example with numbers and no equations before presenting Bayes’ theorem. Then I’ll include a few graphs.
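In that concrete spirit, here is the arithmetic for one version of the example; the prevalence, sensitivity, and specificity figures below are made up for illustration.

```python
# Worked numbers for the rare-disease example (values are illustrative):
prevalence = 0.001   # P(disease)
sensitivity = 0.99   # P(positive | disease)
specificity = 0.95   # P(negative | no disease)

# Total probability of testing positive, then Bayes' theorem.
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_pos = sensitivity * prevalence / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ~0.019
```

Even with a 99%-sensitive test, a positive result leaves the probability of disease under 2% here, because the few true positives are swamped by false positives from the much larger healthy population, which is exactly the point the entry slows down to make.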
Capitalizing on the power of big data for retail The retail industry is changing dramatically as consumers shop in new ways. With the growing popularity of online shopping and mobile commerce, consumers are using more retail channels than ever before to research products, compare prices, search for promotions, make purchases and provide feedback. Social media has become one of the key channels. Consumers are using social media – and the leading e-commerce platforms that integrate with social media – to find product recommendations, lavish praise, voice complaints, capitalize on product offers and engage in ongoing dialogs with their favorite brands. The multiplication of retail channels and the increasing use of social media are empowering consumers. With a wealth of information readily available online, consumers are now better able to compare products, services and prices – even as they shop in physical stores. When consumers interact with companies publicly through social media, they have greater power to influence other customers or damage a brand. These and other changes in the retail industry are creating important opportunities for retailers. But to capitalize on those opportunities, retailers need ways to collect, manage and analyze a tremendous volume, variety and velocity of data. When point-of-sale (POS) systems were first commercialized, retailers were able to collect large amounts of potentially valuable information, but most of that information remained untapped. The emergence of social media and other consumer-oriented technologies is now introducing even more data to the retail ecosystem. Retailers must handle not only the growing volume of information but also an increasing variety – including both structured and unstructured data. They must also find ways to accommodate the changing nature of this data and the velocity at which it is being produced and collected. If retailers succeed in addressing the challenges of “big data,” they can use this data to generate valuable insights for personalizing marketing and improving the effectiveness of marketing campaigns, optimizing assortment and merchandising decisions, and removing inefficiencies in distribution and operations. Adopting solutions designed to capitalize on this big data allows companies to navigate the shifting retail landscape and drive a positive transformation for the business….
Causal inference in statistics: An overview This review presents empirical researchers with recent advances in causal inference, and stresses the paradigmatic shifts that must be undertaken in moving from traditional statistical analysis to causal analysis of multivariate data. Special emphasis is placed on the assumptions that underlie all causal inferences, the languages used in formulating those assumptions, the conditional nature of all causal and counterfactual claims, and the methods that have been developed for the assessment of such claims. These advances are illustrated using a general theory of causation based on the Structural Causal Model (SCM) described in Pearl (2000a), which subsumes and unifies other approaches to causation, and provides a coherent mathematical foundation for the analysis of causes and counterfactuals. In particular, the paper surveys the development of mathematical tools for inferring (from a combination of data and assumptions) answers to three types of causal queries:
(1) queries about the effects of potential interventions (also called ‘causal effects’ or ‘policy evaluation’; see the back-door adjustment sketched below)
(2) queries about probabilities of counterfactuals (including assessment of ‘regret,’ ‘attribution’ or ‘causes of effects’) and
(3) queries about direct and indirect effects (also known as ‘mediation’).
Finally, the paper defines the formal and conceptual relationships between the structural and potential-outcome frameworks and presents tools for a symbiotic analysis that uses the strong features of both.
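As flagged in item (1), the best-known tool for the first type of query is the back-door adjustment: if a set of covariates $Z$ satisfies the back-door criterion relative to $(X, Y)$ in the causal graph, the interventional distribution is identified from observational data by

$$P(y \mid \mathrm{do}(x)) \;=\; \sum_{z} P(y \mid x, z)\, P(z),$$

which differs from ordinary conditioning, $P(y \mid x) = \sum_z P(y \mid x, z)\, P(z \mid x)$, precisely in severing the dependence of $Z$ on $X$.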
Causality and Statistical Learning In social science we are sometimes in the position of studying descriptive questions (In what places do working-class whites vote for Republicans? In what eras has social mobility been higher in the United States than in Europe? In what social settings are different sorts of people more likely to act strategically?). Answering descriptive questions is not easy and involves issues of data collection, data analysis, and measurement (how one should define concepts such as “working-class whites,” “social mobility,” and “strategic”) but is uncontroversial from a statistical standpoint. All becomes more difficult when we shift our focus from what to what if and why. Consider two broad classes of inferential questions:
1. Forward causal inference. What might happen if we do X? What are the effects of smoking on health, the effects of schooling on knowledge, the effect of campaigns on election outcomes, and so forth?
2. Reverse causal inference. What causes Y? Why do more attractive people earn more money? Why do many poor people vote for Republicans and rich people vote for Democrats? Why did the economy collapse?
Challenges and Opportunities with Big Data The promise of data-driven decision-making is now being recognized broadly, and there is growing enthusiasm for the notion of “Big Data.” While the promise of Big Data is real — for example, it is estimated that Google alone contributed 54 billion dollars to the US economy in 2009 — there is currently a wide gap between its potential and its realization. Heterogeneity, scale, timeliness, complexity, and privacy problems with Big Data impede progress at all phases of the pipeline that can create value from data. The problems start right away during data acquisition, when the data tsunami requires us to make decisions, currently in an ad hoc manner, about what data to keep and what to discard, and how to store what we keep reliably with the right metadata. Much data today is not natively in structured format; for example, tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display, but not for semantic content and search: transforming such content into a structured format for later analysis is a major challenge. The value of data explodes when it can be linked with other data, thus data integration is a major creator of value. Since most data is directly generated in digital format today, we have the opportunity and the challenge both to influence the creation to facilitate later linkage and to automatically link previously created data. Data analysis, organization, retrieval, and modeling are other foundational challenges. Data analysis is a clear bottleneck in many applications, both due to lack of scalability of the underlying algorithms and due to the complexity of the data that needs to be analyzed. Finally, presentation of the results and their interpretation by non-technical domain experts is crucial to extracting actionable knowledge. During the last 35 years, data management principles such as physical and logical independence, declarative querying and cost-based optimization have led to a multi-billion dollar industry. More importantly, these technical advances have enabled the first round of business intelligence applications and laid the foundation for managing and analyzing Big Data today. The many novel challenges and opportunities associated with Big Data necessitate rethinking many aspects of these data management platforms, while retaining other desirable aspects. We believe that appropriate investment in Big Data will lead to a new wave of fundamental technological advances that will be embodied in the next generations of Big Data management and analysis platforms, products, and systems. We believe that these research problems are not only timely, but also have the potential to create huge economic value in the US economy for years to come. However, they are also hard, requiring us to rethink data analysis systems in fundamental ways. A major investment in Big Data, properly directed, can result not only in major scientific advances, but also lay the foundation for the next generation of advances in science, medicine, and business.
Challenges of Big Data Analysis Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promises for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require a new computational and statistical paradigm. This article gives an overview of the salient features of Big Data and how these features drive a paradigm change in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that exogenous assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity; they can lead to wrong statistical inferences and consequently wrong scientific conclusions.
Character-level Convolutional Networks for Text Classification This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
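For orientation, here is a minimal character-level ConvNet in PyTorch. The layer sizes, the embedding layer (the article encodes characters differently), and the alphabet size are illustrative assumptions, not the architecture from the article.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """A minimal character-level ConvNet sketch; sizes are illustrative."""
    def __init__(self, vocab_size=70, max_len=256, n_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 16)  # char ids -> vectors
        self.conv = nn.Sequential(
            nn.Conv1d(16, 64, kernel_size=7), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(64, 64, kernel_size=3), nn.ReLU(), nn.MaxPool1d(3),
        )
        # Infer the flattened feature size by passing a dummy input through.
        with torch.no_grad():
            flat = self.conv(torch.zeros(1, 16, max_len)).numel()
        self.fc = nn.Linear(flat, n_classes)

    def forward(self, x):                    # x: (batch, max_len) of char ids
        h = self.embed(x).transpose(1, 2)    # -> (batch, channels, length)
        return self.fc(self.conv(h).flatten(1))

model = CharCNN()
logits = model(torch.randint(0, 70, (8, 256)))
print(logits.shape)                          # torch.Size([8, 4])
```

The key property the article exploits is visible in the structure: the network never sees words or tokens, only a fixed alphabet of characters, so it needs no language-specific preprocessing.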
Chart Suggestions – A Thought-Starter (Cheat Sheet)
Choosing the right NoSQL database for the job: a quality attribute evaluation For over forty years, relational databases have been the leading model for data storage, retrieval and management. However, due to increasing needs for scalability and performance, alternative systems have emerged, namely NoSQL technology. The rising interest in NoSQL technology, as well as the growth in the number of use case scenarios over the last few years, has resulted in an increasing number of evaluations and comparisons among competing NoSQL technologies. While most research focuses on performance evaluation using standard benchmarks, it is important to note that the architecture of real-world systems is not driven by performance requirements alone, but has to comprehensively include many other quality attribute requirements. Software quality attributes form the basis from which software engineers and architects develop software and make design decisions. Yet there has been no quality-attribute-focused survey or classification of NoSQL databases in which databases are compared with regard to their suitability for the quality attributes common in the design of enterprise systems. To fill this gap, and to aid software engineers and architects, in this article we survey and create a concise and up-to-date comparison of NoSQL engines, identifying their most beneficial use case scenarios from the software engineer’s point of view and the quality attributes that each of them is most suited to.
Classification and Regression Tree Methods A classification or regression tree is a prediction model that can be represented as a decision tree. This article discusses the C4.5, CART, CRUISE, GUIDE, and QUEST methods in terms of their algorithms, features, properties, and performance.
Classification and Regression Trees Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.
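The recursive-partitioning idea is small enough to show end to end. Below is a deliberately minimal regression tree on a single feature (constant fit per cell, squared-error splits, fixed depth); it is a sketch of the common core, not any of the specific algorithms the article compares.

```python
import numpy as np

def best_split(x, y):
    """Find the threshold minimizing total squared error of the two cells."""
    best_sse, best_thr = np.inf, None
    for thr in np.unique(x)[:-1]:            # exclude max so both sides non-empty
        left, right = y[x <= thr], y[x > thr]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best_sse:
            best_sse, best_thr = sse, thr
    return best_thr

def grow(x, y, depth=2):
    """Recursively partition, fitting a constant (the mean) in each cell."""
    if depth == 0 or len(np.unique(x)) < 2:
        return {"leaf": y.mean()}
    thr = best_split(x, y)
    l, r = x <= thr, x > thr
    return {"thr": thr, "left": grow(x[l], y[l], depth - 1),
            "right": grow(x[r], y[r], depth - 1)}

def predict(tree, xi):
    while "leaf" not in tree:
        tree = tree["left"] if xi <= tree["thr"] else tree["right"]
    return tree["leaf"]

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, 1.0, 3.0) + rng.normal(0, 0.1, 200)  # step function + noise
tree = grow(x, y)
print(predict(tree, 2.0), predict(tree, 8.0))            # ~1.0, ~3.0
```

CART, C4.5, GUIDE and their relatives differ mainly in how they choose splits, when they stop, and how they prune; the recursion itself is what they all share.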
Classification And Regression Trees : A Practical Guide for Describing a Dataset (Slide Deck)
Classification revisited: a web of knowledge The vision of the Semantic Web (SW) is gradually unfolding and taking shape through a web of linked data, a part of which is built by capturing semantics stored in existing knowledge organization systems (KOS), subject metadata and resource metadata. The content of vast bibliographic collections is currently categorized by widely used bibliographic classification schemes, and we may soon see these collections being mined for information and linked in a meaningful way across the Web. Bibliographic classifications are designed for knowledge mediation, offering both a rich terminology and different ways in which concepts can be categorized and related to each other in the universe of knowledge. From 1990 to 2010 they were used in various resource discovery services on the Web, and they continue to be used to support information integration in a number of international digital library projects. In this chapter we revisit some of the ways in which universal classifications, as language-independent concept schemes, can assist humans and computers in structuring and presenting information and formulating queries. Most importantly, we highlight issues important to understanding bibliographic classifications, both in terms of their unused potential and their technical limitations.
Classification via Minimum Incremental Coding Length We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We demonstrate the asymptotic optimality of this criterion for Gaussian distributions and analyze its relationships to classical classifiers. The theoretical results clarify the connections between our approach and popular classifiers such as MAP, RDA, k-NN, and SVM, as well as unsupervised methods based on lossy coding. Our formulation induces several good effects on the resulting classifier. First, minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small sample setting. Second, compression provides a uniform means of handling classes of varying dimension. The new criterion and its kernel and local versions perform competitively on synthetic examples, as well as on real imagery data such as handwritten digits and face images. On these problems, the performance of our simple classifier approaches the best reported results, without using domain-specific information. All MATLAB code and classification results are publicly available for peer evaluation at http://…/home.htm.
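To make the criterion concrete, here is a hedged Python sketch of the minimum-incremental-coding-length idea: a Gaussian lossy coding length is computed per class, and a test point is assigned to the class whose coding length grows the least when the point is added, plus the bits needed to code the class label. The coding-length formula and the distortion parameter eps below follow the general lossy-coding form used in this line of work but are assumptions for illustration, not the paper's verbatim formulation (the authors' released code is in MATLAB):

    import numpy as np

    def coding_length(X, eps=0.5):
        # Bits to code the rows of X (n x d) up to distortion eps, Gaussian model.
        n, d = X.shape
        mu = X.mean(axis=0)
        Xc = X - mu
        Sigma = Xc.T @ Xc / n
        _, logdet = np.linalg.slogdet(np.eye(d) + (d / (eps**2 * n)) * Sigma)
        return ((n + d) / 2) * logdet / np.log(2) \
            + (d / 2) * np.log2(1 + mu @ mu / eps**2)

    def micl_predict(x, class_data, eps=0.5):
        # class_data: dict label -> (n_c x d) training matrix.
        n_total = sum(len(X) for X in class_data.values())
        best, best_cost = None, np.inf
        for label, X in class_data.items():
            delta = coding_length(np.vstack([X, x]), eps) - coding_length(X, eps)
            delta -= np.log2(len(X) / n_total)   # bits to code the class label
            if delta < best_cost:
                best, best_cost = label, delta
        return best

    rng = np.random.default_rng(0)
    data = {0: rng.normal(0, 1, (100, 2)), 1: rng.normal(3, 1, (100, 2))}
    print(micl_predict(np.array([2.8, 3.1]), data))   # -> 1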
Cloud-based Predictive Analytics poised for rapid growth
Rather than report survey results question by question, the results and their implications have been grouped into a number of sections. Each section highlights significant results from the survey and discusses their implications.
– Business solutions are what organizations need
– Predictive analytics are showing real strength
– Customers are the focus for predictive analytics and cloud
– Cloud-based predictive analytic scenarios are gaining momentum
– Early adopters are gaining a competitive advantage
– Decision Management matters to predictive analytic success
– There are still some barriers and concerns with cloud-based predictive analytics
– Industries vary in their adoption and concerns
– A mix of clouds is appropriate
– Traditional data sources dominate predictive analytic models
After the survey results and implications are discussed, we make some recommendations and identify the pros and cons of the various options. Demographics and vendor profiles complete the paper.
Cloud Service Matchmaking Approaches: A Systematic Literature Survey Service matching concerns finding suitable services according to the service requester's requirements, which is a complex task due to the increasing number and diversity of cloud services available. Service matching is discussed in the contexts of web service composition and user-oriented service marketplaces. The suggested approaches have different problem definitions and have to be examined more closely in order to identify comparable results and to find out which approaches have built on earlier ones. One of the most important use cases is service requesters with limited technical knowledge who need to compare services based on their QoS requirements in cloud service marketplaces. Our survey examines the service matching approaches in order to find out the relation between their context and their objectives. Moreover, it evaluates their applicability to the cloud service marketplace context.
Cluster Analysis: Tutorial with R In this tutorial we inspect classification. Classification and ordination are alternative strategies for simplifying data. Ordination tries to simplify data into a map showing similarities among points. Classification simplifies data by putting similar points into the same class. The task of describing a high number of points is thus simplified to the easier task of describing a low number of classes.
Cluster Validation (Slide Deck)
Clustering large Data Sets with mixed numeric and Categorical Values Efficient partitioning of large data sets into homogeneous clusters is a fundamental problem in data mining. The standard hierarchical clustering methods provide no solution for this problem due to their computational inefficiency. The k-means based methods are promising for their efficiency in processing large data sets. However, their use is often limited to numeric data. In this paper we present a k-prototypes algorithm which is based on the k-means paradigm but removes the numeric-data limitation whilst preserving its efficiency. In the algorithm, objects are clustered against k prototypes. A method is developed to dynamically update the k prototypes in order to maximise the intra-cluster similarity of objects. When applied to numeric data the algorithm is identical to k-means. To assist interpretation of clusters we use decision tree induction algorithms to create rules for clusters. These rules, together with other statistics about clusters, can assist data miners to understand and identify interesting clusters.
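A minimal NumPy sketch of the k-prototypes step described above: numeric columns contribute squared Euclidean distance, categorical columns a gamma-weighted mismatch count, and prototypes are updated with means and modes respectively. The weight gamma, the fixed iteration count and the data below are illustrative simplifications, not the paper's exact update schedule:

    import numpy as np
    from collections import Counter

    def k_prototypes(X_num, X_cat, k, gamma=1.0, iters=10, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X_num), k, replace=False)
        protos_num, protos_cat = X_num[idx].copy(), X_cat[idx].copy()
        for _ in range(iters):
            d_num = ((X_num[:, None, :] - protos_num[None]) ** 2).sum(-1)
            d_cat = (X_cat[:, None, :] != protos_cat[None]).sum(-1)
            labels = np.argmin(d_num + gamma * d_cat, axis=1)   # mixed distance
            for j in range(k):
                members = labels == j
                if members.any():
                    protos_num[j] = X_num[members].mean(axis=0)          # mean
                    protos_cat[j] = [Counter(col).most_common(1)[0][0]   # mode
                                     for col in X_cat[members].T]
        return labels

    rng = np.random.default_rng(1)
    X_num = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0.0, 4.0)])
    X_cat = np.array([["red"] * 50 + ["blue"] * 50]).T
    labels = k_prototypes(X_num, X_cat, k=2)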
Cognitive Dynamic Systems: A Technical Review of Cognitive Radar We start with the history of cognitive radar, covering the origins of the perception-action cycle (PAC), Fuster's research on cognition and the principles of cognition. Fuster describes five cognitive functions: perception, memory, attention, language, and intelligence. We describe the PAC as it applies to cognitive radar, and then discuss long-term memory, memory storage, memory retrieval and working memory, with a comparison between memory in human cognition and in cognitive radar. Attention is another function described by Fuster, and we compare attention in human cognition and in cognitive radar. We discuss the four functional blocks of the PAC: the Bayesian filter, feedback information, dynamic programming and the state-space model of the radar environment. To show that the PAC improves the tracking accuracy of cognitive radar over traditional active radar, we provide simulation results in which three nonlinear filters, the Cubature Kalman Filter (CKF), the Unscented Kalman Filter (UKF) and the Extended Kalman Filter (EKF), are compared. Based on the results, radars implemented with the CKF perform better than those implemented with the UKF or the EKF; the EKF has the worst accuracy and the largest computational load because of the derivation and evaluation of Jacobian matrices. We suggest using the concept of risk management to better control parameters and improve performance in cognitive radar. We view spectrum sensing as of potential interest for cognitive radar, and propose a new approach, probabilistic ICA, which should reduce noise based on estimation error. Finally, since parallel computing is based on a divide-and-conquer mechanism, we suggest using parallel computing in cognitive radar to reduce the processing time of complicated calculations and tasks.
Collaborative Filtering Recommender Systems Recommender systems are an important part of the information and e-commerce ecosystem. They represent a powerful method for enabling users to filter through large information and product spaces. Nearly two decades of research on collaborative filtering have led to a varied set of algorithms and a rich collection of tools for evaluating their performance. Research in the field is moving in the direction of a richer understanding of how recommender technology may be embedded in specific domains. The differing personalities exhibited by different recommender algorithms show that recommendation is not a one-size-fits-all problem. Specific tasks, information needs, and item domains represent unique problems for recommenders, and the design and evaluation of recommenders need to be done based on the user tasks to be supported. Effective deployments must begin with careful analysis of prospective users and their goals. Based on this analysis, system designers have a host of options for the choice of algorithm and for its embedding in the surrounding user experience. This paper discusses a wide variety of the choices available and their implications, aiming to provide both practitioners and researchers with an introduction to the important issues underlying recommenders and current best practices for addressing these issues.
Combining Predictions for Accurate Recommender Systems We analyze the application of ensemble learning to recommender systems on the Netflix Prize dataset. For our analysis we use a set of diverse state-of-the-art collaborative filtering (CF) algorithms, which include: SVD, Neighborhood Based Approaches, Restricted Boltzmann Machine, Asymmetric Factor Model and Global Effects. We show that linearly combining (blending) a set of CF algorithms increases the accuracy and outperforms any single CF algorithm. Furthermore, we show how to use ensemble methods for blending predictors in order to outperform a single blending algorithm. The dataset and the source code for the ensemble blending are available online.
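A minimal sketch of the linear blending step on synthetic predictions (not the Netflix data or the authors' released code): blending weights are fit by least squares on a held-out set, and the blend typically beats each individual predictor:

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.uniform(1, 5, 1000)                    # held-out "true ratings"
    # Hypothetical predictions from three CF models of varying accuracy.
    preds = np.column_stack([y + rng.normal(0, s, 1000) for s in (0.8, 0.9, 1.0)])

    w, *_ = np.linalg.lstsq(preds, y, rcond=None)  # blending weights
    blend = preds @ w

    rmse = lambda p: np.sqrt(np.mean((p - y) ** 2))
    print([round(rmse(preds[:, j]), 3) for j in range(3)], round(rmse(blend), 3))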
Community Detection in Networks: The Leader-Follower Algorithm Natural networks such as those between humans observed through their interactions or biological networks predicted based on various experimental measurements contain a wealth of information about the unobserved structure of the social or biological system. However, these networks are inherently noisy in the sense that they contain spurious connections, making them seemingly dense. Therefore, identifying important, refined structures such as communities or clusters becomes quite challenging. Specifically, we find that the popular, traditional method of spectral clustering does not manage to learn refined community structure. The primary reason for this is that it is based upon external community connectivity properties such as graph-cuts. Motivated to overcome this limitation, we propose a community detection algorithm, called the leader-follower algorithm, based upon identifying the natural internal structure of the expected communities. The algorithm uses the notion of network centrality in a novel manner to differentiate leaders (nodes which connect different communities) from loyal followers (nodes which only have neighbors within a single community). Using this approach, it is able to learn the communities from the network structure. A salient feature of our algorithm is that, unlike spectral clustering, it does not require knowledge of the number of communities in the network; it learns this naturally. We show that our algorithm is quite effective. We prove that it detects all of the communities exactly for any network possessing communities with the natural internal structure expected in social networks. More importantly, we demonstrate its effectiveness in the context of various real networks ranging from social networks such as Facebook to biological networks such as an fMRI-based human brain network.
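A simplified, hedged sketch of the leader/follower intuition (not the paper's exact algorithm or its centrality measure), assuming networkx is available: local maxima of closeness centrality act as leaders, and every other node joins the community of its most central neighbor:

    import networkx as nx

    def leader_follower(G):
        c = nx.closeness_centrality(G)
        # Leaders: nodes at least as central as every neighbor.
        leaders = [v for v in G if all(c[v] >= c[u] for u in G.neighbors(v))]
        community = {v: v for v in leaders}
        # Followers: processed in decreasing centrality, each joins the
        # community of its most central (already-assigned) neighbor.
        for v in sorted(set(G) - set(leaders), key=lambda v: -c[v]):
            best = max(G.neighbors(v), key=lambda u: c[u])
            community[v] = community[best]
        return community

    G = nx.connected_caveman_graph(3, 5)   # three dense groups of five nodes
    print(leader_follower(G))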
Comparative Analysis of K-Means and Fuzzy C-Means Algorithms Data mining technology is widely regarded as a useful means of identifying patterns and trends in large volumes of data. It is used to extract unknown patterns from large data sets for business as well as real-time applications, and has emerged as a valuable computational intelligence tool for data analysis, knowledge discovery and autonomous decision making. Raw, unlabeled data from a large data set can be classified initially in an unsupervised fashion using cluster analysis, i.e. clustering: the assignment of a set of observations into clusters so that observations in the same cluster may in some sense be treated as similar. The outcome of the clustering process and the efficiency of its domain application are largely determined by the algorithm used, and various algorithms are available. In this research work two important clustering algorithms, the centroid-based K-Means and the representative-object-based FCM (Fuzzy C-Means), are compared. Both algorithms are applied and their performance is evaluated on the basis of the quality of the clustering output, with the number of data points and the number of clusters as the factors upon which the behaviour of both algorithms is analyzed. FCM produces results close to those of K-Means clustering, but still requires more computation time than K-Means.
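A minimal NumPy sketch of the fuzzy c-means side of the comparison, using the standard membership and center updates with fuzzifier m; defuzzifying the memberships with argmax gives hard labels comparable to k-means output (the data and parameters below are illustrative):

    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        U = rng.dirichlet(np.ones(c), size=len(X))           # soft memberships
        for _ in range(iters):
            Um = U ** m
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # weighted means
            d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
            inv = d ** (-2.0 / (m - 1.0))                    # standard FCM update
            U = inv / inv.sum(axis=1, keepdims=True)
        return centers, U

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc, 0.5, (50, 2)) for loc in (0.0, 4.0)])
    centers, U = fuzzy_c_means(X, c=2)
    hard = U.argmax(axis=1)   # defuzzified labels, comparable to k-means output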
Comparison of Bayesian predictive methods for model selection The goal of this paper is to compare several widely used Bayesian model selection methods in practical model selection problems, highlight their differences and give recommendations about the preferred approaches. We focus on variable subset selection for regression and classification and perform several numerical experiments using both simulated and real-world data. The results show that the optimization of a utility estimate such as the cross-validation score is liable to find overfitted models due to the relatively high variance of the utility estimates when the data is scarce. Better and much less variable results are obtained by incorporating all the uncertainties into a full encompassing model and projecting this information onto the submodels. The reference model projection appears to also outperform the maximum a posteriori model and the selection of the most probable variables. The study also demonstrates that model selection can greatly benefit from using cross-validation outside the searching process, both for guiding the model size selection and for assessing the predictive performance of the finally selected model.
Comprehensive View on Cran Packages (Cheat Sheet)
Computation of the multivariate Oja median The multivariate Oja (1983) median is an affine equivariant multivariate location estimate with high efficiency. This estimate has a bounded influence function but zero breakdown. The computation of the estimate appears to be highly computationally intensive. We consider different exact and stochastic algorithms for calculating the value of the estimate. In the stochastic algorithms, the gradient of the objective function, the rank function, is estimated by sampling observation hyperplanes. The estimated rank function, together with its estimated accuracy, then yields a confidence region for the true Oja sample median, and the confidence region shrinks to the sample median as the number of sampled hyperplanes increases. Regular grids and the grid given by the data points are used in the construction. Computation times of the different algorithms are discussed and compared.
Computational Machines in a Coexistence with Concrete Universals and Data Streams We discuss how the majority of traditional modeling approaches follow the idealist point of view in scientific modeling, adopting set-theoretical notions of models based on abstract universals. We show that while successful in many classical modeling domains, there are fundamental limits to the application of set-theoretical models in dealing with complex systems that have many potential aspects or properties depending on the perspective taken. As an alternative to abstract universals, we propose a conceptual modeling framework based on concrete universals that can be interpreted as a category-theoretical approach to modeling. We call this modeling framework pre-specific modeling. We further discuss how a certain group of mathematical and computational methods, along with ever-growing data streams, are able to operationalize the concept of pre-specific modeling.
Condition-Based Maintenance Using Sensor Arrays and Telematics The emergence of uniquely addressable embeddable devices has raised the bar on Telematics capabilities. Though the technology itself is not new, its application has been quite limited until now. Sensor-based telematics technologies generate volumes of data that are orders of magnitude larger than what operators have dealt with previously. Real-time big data computation capabilities have opened the flood gates for adding new predictive analytics capabilities to otherwise simple data-logging systems, enabling real-time control and monitoring that can take preventive action in case of any anomalies. Condition-based maintenance, usage-based insurance, smart metering and demand-based load generation are some of the predictive analytics use cases for Telematics. This paper presents an approach to condition-based maintenance using real-time sensor monitoring, Telematics and predictive data analytics.
Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2013) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and working with a large B can be computationally expensive. Direct applications of jackknife and IJ estimators to bagging require B = Theta(n^1.5) bootstrap replicates to converge, where n is the size of the training set. We propose improved versions that only require B = Theta(n) replicates. Moreover, we show that the IJ estimator requires 1.7 times less bootstrap replicates than the jackknife to achieve a given accuracy. Finally, we study the sampling distributions of the jackknife and IJ variance estimates themselves. We illustrate our findings with multiple experiments and simulation studies.
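A hedged sketch of the infinitesimal jackknife estimate for a bagged regressor, following the covariance form V_IJ(x) = sum_i Cov_b(N_bi, t_b(x))^2, where N_bi counts how often training point i appears in bootstrap replicate b and t_b(x) is that replicate's prediction at x. The raw Monte Carlo estimator below ignores the finite-B bias correction the paper develops, and the data are synthetic:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    n, B = 200, 1000
    X = rng.uniform(-2, 2, (n, 1))
    y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.3, n)
    x0 = np.array([[0.5]])                   # point at which we want an error bar

    N = np.zeros((B, n))                     # bootstrap inclusion counts N_bi
    t = np.zeros(B)                          # per-replicate predictions t_b(x0)
    for b in range(B):
        idx = rng.integers(0, n, n)          # one bootstrap replicate
        np.add.at(N[b], idx, 1)
        t[b] = DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]).predict(x0)[0]

    cov = ((N - N.mean(axis=0)) * (t - t.mean())[:, None]).mean(axis=0)
    v_ij = np.sum(cov ** 2)                  # infinitesimal jackknife variance
    print(t.mean(), np.sqrt(v_ij))           # bagged prediction and std. error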
Considerations for maximising analytic performance When it comes to running business analytics, there are three key nonfunctional requirements that must be met: fast performance, usability and affordability. Bloor Research was asked by IBM to compare the performance capabilities of the leading business analytic platforms. Specifically, we were asked to evaluate how the combined capabilities of business analytic tools and the underlying database management system can affect the overall performance of your analytic applications, reports and dashboards.
Content Selection in Data-to-Text Systems: A Survey Data-to-text systems are powerful in generating reports from data automatically and thus they simplify the presentation of complex data. Rather than presenting data using visualisation techniques, data-to-text systems use natural (human) language, which is the most common way for human-human communication. In addition, data-to-text systems can adapt their output content to users’ preferences, background or interests and therefore they can be pleasant for users to interact with. Content selection is an important part of every data-to-text system, because it is the module that determines which from the available information should be conveyed to the user. This survey initially introduces the field of data-to-text generation, describes the general data-to-text system architecture and then it reviews the state-of-the-art content selection methods. Finally, it provides recommendations for choosing an approach and discusses opportunities for future research.
Context-Aware Recommender Systems The importance of contextual information has been recognized by researchers and practitioners in many disciplines, including e-commerce personalization, information retrieval, ubiquitous and mobile computing, data mining, marketing, and management. While a substantial amount of research has already been performed in the area of recommender systems, most existing approaches focus on recommending the most relevant items to users without taking into account any additional contextual information, such as time, location, or the company of other people (e.g., for watching movies or dining out). In this chapter we argue that relevant contextual information does matter in recommender systems and that it is important to take this information into account when providing recommendations. We discuss the general notion of context and how it can be modeled in recommender systems. Furthermore, we introduce three different algorithmic paradigms – contextual pre-filtering, post-filtering, and modeling – for incorporating contextual information into the recommendation process, discuss the possibilities of combining several context-aware recommendation techniques into a single unifying approach, and provide a case study of one such combined approach. Finally, we present additional capabilities for context-aware recommenders and discuss important and promising directions for future research.
Control And Protect Sensitive Information In The Era Of Big Data This report outlines the future look of Forrester’s solution for security and risk (S&R) executives seeking to develop a holistic strategy to protect and manage sensitive data. In the never-ending race to stay ahead of the competition, companies are developing advanced capabilities to store, process, and analyze vast amounts of data from social networks, sensors, IT systems, and other sources to improve business intelligence and decisioning capabilities. “Big data processing” refers to the tools and techniques that handle the extreme data volumes and velocities and wide variety of data formats resulting from implementing these capabilities. As organizations aggregate more and more data, they need to be aware that much of it could be financial, personal, and other types of sensitive data that are subject to global laws and regulations. S&R professionals need to be aware of the security issues surrounding big data so they can take an active role early in these initiatives. This report will help S&R pros understand how to control and properly protect sensitive information in the era of big data.
Converging High-Throughput and High-Performance Computing: A Case Study The computing systems used by LHC experiments have historically consisted of a federation of hundreds to thousands of distributed resources, ranging from small to mid-size. In spite of the impressive scale of the existing distributed computing solutions, the federation of small to mid-size resources will be insufficient to meet projected future demands. This paper is a case study of how the ATLAS experiment has embraced Titan, a DOE leadership computing facility, in conjunction with traditional distributed high-throughput computing to reach sustained production scales of approximately 51M core-hours a year. The three main contributions of this paper are: (i) a critical evaluation of design and operational considerations to support the sustained, scalable and production usage of Titan; (ii) a preliminary characterization of a next-generation executor for PanDA to support new workloads and advanced execution modes; and (iii) early lessons on how current and future experimental and observational systems can be integrated with production supercomputers and other platforms in a general and extensible manner.
Cooperating with Machines Since Alan Turing envisioned Artificial Intelligence (AI) [1], a major driving force behind technical progress has been competition with human cognition. Historical milestones have been frequently associated with computers matching or outperforming humans in difficult cognitive tasks (e.g. face recognition [2], personality classification [3], driving cars [4], or playing video games [5]), or defeating humans in strategic zero-sum encounters (e.g. Chess [6], Checkers [7], Jeopardy! [8], Poker [9], or Go [10]). In contrast, less attention has been given to developing autonomous machines that establish mutually cooperative relationships with people who may not share the machine’s preferences. A main challenge has been that human cooperation does not require sheer computational power, but rather relies on intuition [11], cultural norms [12], emotions and signals [13, 14, 15, 16], and pre-evolved dispositions toward cooperation [17], common-sense mechanisms that are difficult to encode in machines for arbitrary contexts. Here, we combine a state-of-the-art machine-learning algorithm with novel mechanisms for generating and acting on signals to produce a new learning algorithm that cooperates with people and other machines at levels that rival human cooperation in a variety of two-player repeated stochastic games. This is the first general-purpose algorithm that is capable, given a description of a previously unseen game environment, of learning to cooperate with people within short timescales in scenarios previously unanticipated by algorithm designers. This is achieved without complex opponent modeling or higher-order theories of mind, thus showing that flexible, fast, and general human-machine cooperation is computationally achievable using a non-trivial, but ultimately simple, set of algorithmic mechanisms.
Copulas: A Personal View Copula modeling has taken the world of finance and insurance, and well beyond, by storm. Why is this? In this paper I review the early start of this development, discuss some important current research, mainly from an applications point of view, and comment on potential future developments. An alternative title of the paper would be ‘Demystifying the copula craze’. The paper also contains what I would like to call the copula must-reads.
Copy the dynamics using a learning machine Is it possible to construct, in general, a dynamical system that simulates a black-box system without recovering the latter's equations of motion? Here we show that this goal can be approached by a learning machine. Trained on a set of input-output responses or a segment of time series from a black-box system, a learning machine can serve as a copy system that mimics the dynamics of various black-box systems. It can not only behave like the black-box system at the parameter set from which the training data were generated, but also reproduce the evolution history of the black-box system. As a result, the learning machine provides an effective way to make predictions, and enables one to probe the global dynamics of a black-box system. These findings are significant for practical systems whose equations of motion cannot be obtained accurately. Examples of copying the dynamics of an artificial neural network, the Lorenz system, and a variable star are given. Our idea paves a possible way towards copying a living brain.
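A minimal sketch of the copy-system idea on the Lorenz example, assuming SciPy and scikit-learn (the paper's own choice of learning machine may differ): fit a regressor to one-step transitions of a simulated trajectory, then free-run the learned map. Because the system is chaotic, the copy will not track the true trajectory for long, but it can reproduce the attractor's qualitative dynamics:

    import numpy as np
    from scipy.integrate import solve_ivp
    from sklearn.neural_network import MLPRegressor

    def lorenz(t, s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
        x, y, z = s
        return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

    sol = solve_ivp(lorenz, (0, 50), [1.0, 1.0, 1.0],
                    t_eval=np.arange(0, 50, 0.01))
    S = sol.y.T                                  # sampled trajectory, shape (T, 3)

    # Learn the one-step map s_t -> s_{t+1} from the observed series.
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                         random_state=0).fit(S[:-1], S[1:])

    s, copy = S[2500], [S[2500]]                 # free-run the learned copy system
    for _ in range(500):
        s = model.predict(s.reshape(1, -1))[0]
        copy.append(s)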
Correlated Topic Models Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets.
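A minimal generative sketch of the model's key move, assuming NumPy: topic proportions are drawn as a softmax of a multivariate Gaussian, so off-diagonal entries of Sigma induce topic correlation that a Dirichlet cannot express (the covariance values below are illustrative, and the paper's variational inference procedure is omitted):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, doc_len = 3, 1000, 200
    Sigma = np.array([[1.0, 0.8, -0.5],          # topics 0 and 1 tend to co-occur,
                      [0.8, 1.0, -0.5],          # both repel topic 2
                      [-0.5, -0.5, 1.0]])
    topics = rng.dirichlet(np.ones(V) * 0.1, size=K)   # per-topic word distributions

    eta = rng.multivariate_normal(np.zeros(K), Sigma)
    theta = np.exp(eta) / np.exp(eta).sum()      # logistic-normal topic proportions
    z = rng.choice(K, size=doc_len, p=theta)     # per-word topic assignments
    words = np.array([rng.choice(V, p=topics[k]) for k in z])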
Correspondence Analysis This working paper gives a comprehensive explanation of the multivariate technique called correspondence analysis, applied in the context of a large survey of a nation’s state of health, in this case the Spanish National Health Survey. It is first shown how correspondence analysis can be used to interpret a simple cross-tabulation by visualizing the table in the form of a map of points representing the rows and columns of the table. Combinations of variables can also be interpreted by coding the data in the appropriate way. The technique can also be used to deduce optimal scale values for the levels of a categorical variable, thus giving quantitative meaning to the categories. Multiple correspondence analysis can analyze several categorical variables simultaneously, and is analogous to factor analysis of continuous variables. Other uses of correspondence analysis are illustrated using different variables of the same Spanish database: for example, exploring patterns of missing data and visualizing trends across surveys from consecutive years.
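A minimal NumPy sketch of simple correspondence analysis on a hypothetical cross-tabulation (not the Spanish survey data): the SVD of the standardized residual matrix yields the row and column coordinates used for the map:

    import numpy as np

    N = np.array([[30, 15, 5],        # hypothetical two-way contingency table
                  [10, 40, 10],
                  [ 5, 20, 35]], dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)               # row / column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    # Principal coordinates (first two dimensions) for rows and columns.
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    print(row_coords[:, :2], col_coords[:, :2], sep="\n")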
Cross-media Similarity Metric Learning with Unified Deep Networks As a prominent research topic in the multimedia area, cross-media retrieval aims to capture the complex correlations among multiple media types. Learning better shared representations and distance metrics for multimedia data is important to boost cross-media retrieval. Motivated by the strong ability of deep neural networks in feature representation and comparison-function learning, we propose the Unified Network for Cross-media Similarity Metric (UNCSM) to associate cross-media shared representation learning with distance metric learning in a unified framework. First, we design a two-pathway deep network pretrained with contrastive loss, and employ a double triplet similarity loss for fine-tuning to learn the shared representation for each media type by modeling relative semantic similarity. Second, a metric network is designed for effectively calculating the cross-media similarity of the shared representation, by modeling the pairwise similar and dissimilar constraints. Compared to the existing methods, which mostly ignore the dissimilar constraints and use only a simple distance metric such as the Euclidean distance separately, our UNCSM approach unifies the representation learning and distance metric to preserve relative similarity as well as embrace more complex similarity functions for further improving the cross-media retrieval accuracy. The experimental results show that our UNCSM approach outperforms 8 state-of-the-art methods on 4 widely used cross-media datasets.
Cross-validation This text is a survey on cross-validation. We define all classical cross-validation procedures, and we study their properties for two different goals: estimating the risk of a given estimator, and selecting the best estimator among a given family. For the risk estimation problem, we compute the bias (which can also be corrected) and the variance of cross-validation methods. For estimator selection, we first provide a first-order analysis (based on expectations). Then, we explain how to take second-order terms into account (from variance computations, and by considering the usefulness of overpenalization). This allows us, in the end, to provide some guidelines for choosing the best cross-validation method for a given learning problem.
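A minimal sketch of K-fold cross-validation serving both goals the survey distinguishes, risk estimation and estimator selection, assuming scikit-learn (the dataset and the candidate ridge penalties are illustrative):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)

    def cv_risk(make_model, X, y, k=5):
        # Average held-out squared loss over k folds: the CV risk estimate.
        losses = []
        for train, test in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
            model = make_model().fit(X[train], y[train])
            losses.append(np.mean((model.predict(X[test]) - y[test]) ** 2))
        return np.mean(losses)

    # Estimator selection: pick the penalty with the lowest estimated risk.
    for alpha in (0.01, 1.0, 100.0):
        print(alpha, cv_risk(lambda: Ridge(alpha=alpha), X, y))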
Cumulative Gains Model Quality Metric This paper proposes a more comprehensive look at the ideas of the KS statistic and the area under the curve (AUC) of a cumulative gains chart in order to develop a model quality statistic which can be used agnostically to evaluate the quality of a wide range of models in a standardized fashion. It can be used either holistically on the entire range of the model or at a given decision threshold of the model. Further, it can be extended into the model learning process.
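A minimal NumPy sketch of the raw ingredients on synthetic scores: the cumulative gains curve, the KS statistic, and the area under the gains curve (the paper's combined quality statistic itself is not reproduced here):

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 2000)                       # 1 = event
    score = y * rng.uniform(0.3, 1, 2000) + (1 - y) * rng.uniform(0, 0.7, 2000)

    order = np.argsort(-score)                         # best scores first
    cum_events = np.cumsum(y[order]) / y.sum()         # gains: % of events captured
    cum_pop = np.arange(1, len(y) + 1) / len(y)        # % of population contacted

    cum_non = np.cumsum(1 - y[order]) / (1 - y).sum()
    ks = np.max(cum_events - cum_non)                  # max class separation
    auc_gains = np.trapz(cum_events, cum_pop)          # area under the gains curve
    print(f"KS = {ks:.3f}, gains AUC = {auc_gains:.3f}")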
Customer Analytics in the age of Social Media Becoming “customer centric” is a top priority today, and for good reason: as if it weren’t important enough that customers buy products and contract for services, they now do much more than simply buy. Customers participate in social media networks and chat rooms; they write blogs and contribute to comment sites; and they share information through sites such as YouTube and Flickr. Their activities and expressions not only reveal personal buying behavior and interests, but they also bring into focus their influence on purchasing by others in their social networks.

D

Data Acceleration: Architecture for the Modern Data Supply Chain Data technologies are evolving rapidly, but organizations have adopted most of these in piecemeal fashion. As a result, enterprise data—whether related to customer interactions, business performance, computer notifications, or external events in the business environment—is vastly underutilized. Moreover, companies’ data ecosystems have become complex and littered with data silos. This makes the data more difficult to access, which in turn limits the value that organizations can get out of it. Indeed, according to a recent Gartner, Inc. report, 85 percent of Fortune 500 organizations will be unable to exploit Big Data for competitive advantage through 2015. Furthermore, a recent Accenture study found that half of all companies have concerns about the accuracy of their data, and the majority of executives are unclear about the business outcomes they are getting from their data analytics programs. To unlock the value hidden in their data, companies must start treating data as a supply chain, enabling it to flow easily and usefully through the entire organization—and eventually throughout each company’s ecosystem of partners, including suppliers and customers. The time is right for this approach. For one thing, new external data sources are becoming available, providing fresh opportunities for data insights. In addition, the tools and technology required to build a better data platform are available and in use. These provide a foundation on which companies can construct an integrated, end-to-end data supply chain.
Data Analysis the Data.Table way (Cheat Sheet)
Data Center Infrastructure Management (DCIM) For Dummies Data Center Infrastructure Management (DCIM) is the discipline of managing the physical infrastructure of a data center and optimizing its ongoing operation. DCIM is a software suite that bridges the traditional gap between IT and the facilities groups and coordinates between the two. DCIM reduces computing costs while making it easier to quickly support new applications and other business requirements. About This Book This book explains the importance of DCIM, describes the key components of a modern DCIM system, guides you in the selection of the right DCIM solution for your particular needs, and gives you a step-by-step formula for a successful DCIM implementation. Because this is a For Dummies book, you can be sure that it’s easy to read and has touches of humor.
Data Clustering With Leaders and Subleaders Algorithm In this paper, an efficient hierarchical clustering algorithm suitable for large data sets is proposed for effective clustering and prototype selection in pattern classification. It is a simple and efficient technique that uses incremental clustering principles to generate a hierarchical structure for finding the subgroups/subclusters within each cluster. As an example, a two-level clustering algorithm, Leaders–Subleaders, an extension of the leader algorithm, is presented. Classification accuracy (CA) obtained using the representatives generated by the Leaders–Subleaders method is found to be better than that obtained using leaders as representatives. Even if a larger number of prototypes is generated, classification time is lower because only a part of the hierarchical structure is searched.
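A minimal NumPy sketch of the incremental leader step that Leaders–Subleaders builds on: in one pass over the data, each point joins the nearest existing leader within a distance threshold or founds a new cluster. Running the same routine inside each cluster with a smaller threshold would yield the subleaders (the threshold and data below are illustrative):

    import numpy as np

    def leaders(X, threshold):
        leader_idx, assign = [], np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            if leader_idx:
                d = np.linalg.norm(X[leader_idx] - x, axis=1)
                j = int(np.argmin(d))
                if d[j] <= threshold:       # join the nearest existing leader
                    assign[i] = j
                    continue
            leader_idx.append(i)            # otherwise found a new cluster
            assign[i] = len(leader_idx) - 1
        return leader_idx, assign

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.3, (100, 2)) for c in (0.0, 3.0, 6.0)])
    leader_idx, assign = leaders(X, threshold=1.0)
    print(len(leader_idx), "clusters found")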
Data Clustering: A Review Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.
Data Driven: Creating a Data Culture The data movement is in full swing. There are conferences (Strata + Hadoop World), bestselling books (Big Data, The Signal and the Noise, Lean Analytics), business articles (“Data Scientist: The Sexiest Job of the 21st Century”), and training courses (An Introduction to Machine Learning with Web Data, the Insight Data Science Fellows Program) on the value of data and how to be a data scientist. Unfortunately, there is little that discusses how companies that successfully use data actually do that work. Using data effectively is not just about which database you use or how many data scientists you have on staff, but rather it’s a complex interplay between the data you have, where it is stored and how people work with it, and what problems are considered worth solving. While most people focus on the technology, the best organizations recognize that people are at the center of this complexity. In any organization, the answers to questions such as who controls the data, who they report to, and how they choose what to work on are always more important than whether to use a database like PostgreSQL or Amazon Redshift or HDFS. We want to see more organizations succeed with data. We believe data will change the way that businesses interact with the world, and we want more people to have access. To succeed with data, businesses must develop a data culture.
Data Management: A Unified Approach Unified data management is becoming a strategic advantage in today’s business world. With the advent of big data, the volume and type of information that companies must use in near-real time to gain a competitive edge is growing at an unprecedented rate. Meanwhile, industry consolidation is leading to mergers and acquisitions that require disparate IT systems to be harmonized in order to move forward. These forces, combined with ongoing pressure to use all available data to improve employee productivity, customer satisfaction and innovation, are spurring enterprises to make data management planning a top priority. To support these plans and help achieve important business goals, enterprises are turning to data management solutions with significant urgency. According to a recent IDG Research Services study of 118 IT professionals, 87 percent of respondents said data integration tools have been deployed or are on their company’s road maps; 84 percent answered the same for data quality tools; 82 percent for master data management solutions; and 81 percent for data governance/data stewardship initiatives. Nearly three-fifths of respondents at organizations that have data management solutions in place are planning to continue making near-term investments in these types of tools.
Data Mining and Statistics: What is the Connection? Data Mining is used to discover patterns and relationships in data, with an emphasis on large observational data bases. It sits at the common frontiers of several fields including Data Base Management, Artificial Intelligence, Machine Learning, Pattern Recognition, and Data Visualization. From a statistical perspective it can be viewed as computer-automated exploratory data analysis of (usually) large complex data sets. In spite of (or perhaps because of) the somewhat exaggerated hype, this field is having a major impact in business, industry, and science. It also affords enormous research opportunities for new methodological developments. Despite the obvious connections between data mining and statistical data analysis, most of the methodologies used in Data Mining have so far originated in fields other than Statistics. This paper explores some of the reasons for this, and why statisticians should have an interest in Data Mining. It is argued that Statistics can potentially have a major influence on Data Mining, but in order to do so some of our basic paradigms and operating principles may have to be modified.
Data Mining Cluster Analysis: Basic Concepts and Algorithms (Slide Deck)
Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control Manufacturing enterprises have been collecting and storing ever more current, detailed and accurate production-relevant data. The data stores offer enormous potential as a source of new knowledge, but the huge amount of data and its complexity far exceed the ability to reduce and analyze it without the use of automated analysis techniques. This paper provides a brief introduction to knowledge discovery from databases and presents a methodology for data mining in time series. The relevance of data mining for manufacturing is also depicted.
Data Mining Standards In this survey paper we consolidate the current data mining standards. We categorize them into process standards, XML standards, standard APIs, web standards and grid standards, and discuss them in considerable detail. We also describe an application designed using these standards. We then analyze the standards and their influence on data mining application development, and point out areas of data mining application development that still need to be standardized. We also discuss the trend in the focus areas addressed by these standards.
Data Mining: A Conceptual Overview This tutorial provides an overview of the data mining process. The tutorial also provides a basic understanding of how to plan, evaluate and successfully refine a data mining project, particularly in terms of model building and model evaluation. Methodological considerations are discussed and illustrated. After explaining the nature of data mining and its importance in business, the tutorial describes the underlying machine learning and statistical techniques involved. It describes the CRISP-DM standard now being used in industry as the standard for a technology-neutral data mining process model. The paper concludes with a major illustration of the data mining process methodology and the unsolved problems that offer opportunities for research. The approach is both practical and conceptually sound in order to be useful to both academics and practitioners.
Data Mining: Discovering and Visualizing Patterns with Python (RefCard)
Data profit vs. Data waste: Boosting business performance every day in the real world with information optimization Companies do many things to grow profits. They discover new market opportunities. They sell more effectively. They innovate. They delight their customers. They improve productivity. They find ways to cut costs and mitigate risks. It can be difficult to do these things in today’s economic environment, because revenue opportunities are not always abundant and executives are largely disinclined to make substantial investments in new business capabilities. Despite current conditions, businesses are still finding ways to significantly improve their performance on a daily basis. One of these ways is the aggressive pursuit of data profit. Data profit is what results when companies make economically optimized use of all the structured and unstructured data already residing in existing systems across the enterprise to get better at everything the business needs to do: discovering opportunities, selling, innovating, delighting customers, improving productivity, cutting costs, and mitigating risk. Data profit has become an especially compelling business strategy today, because companies now suffer as never before from a specific problem that is the very opposite of data profit. That problem is data waste. Data waste occurs when companies do not fully utilize the wealth of data that they already have. This problem has become highly prevalent because companies have implemented so many systems over the past decade or more – from high-end databases and applications to email and basic desktop productivity tools – but have not developed effective strategies for fully leveraging their collective information output….
Data Science (Poster)
Data Science and its Relationship to Big Data and data-driven Decision Making Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data-science programs, and publications are touting data science as a hot – even “sexy” – career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz. In this article, we argue that there are good reasons why it has been hard to pin down exactly what data science is. One reason is that data science is intricately intertwined with other important concepts also of growing importance, such as big data and data-driven decision making. Another reason is the natural tendency to associate what a practitioner does with the definition of the practitioner’s field; this can result in overlooking the fundamentals of the field. We believe that trying to define the boundaries of data science precisely is not of the utmost importance. We can debate the boundaries of the field in an academic setting, but in order for data science to serve business effectively, it is important (i) to understand its relationships to other important related concepts, and (ii) to begin to identify the fundamental principles underlying data science. Once we embrace (ii), we can much better understand and explain exactly what data science has to offer. Furthermore, only once we embrace (ii) should we be comfortable calling it data science. In this article, we present a perspective that addresses all these concepts. We close by offering, as examples, a partial list of fundamental principles underlying data science.
Data Science Code of Professional Conduct We look at the proposed Data Science Code of Professional Conduct and nominate a “Golden Rule” which summarizes the data scientist ethic.
Data Science in the Cloud with Microsoft Azure Machine Learning and R Recently, Microsoft launched the Azure Machine Learning cloud platform – Azure ML. Azure ML provides an easy-to-use and powerful set of cloud-based data transformation and machine learning tools. This report covers the basics of manipulating data, as well as constructing and evaluating models in Azure ML, illustrated with a data science example. Before we get started, here are a few of the benefits Azure ML provides for machine learning solutions:
• Solutions can be quickly deployed as web services.
• Models run in a highly scalable cloud environment.
• Code and data are maintained in a secure cloud environment.
• Available algorithms and data transformations are extendable using the R language for solution-specific functionality.
Throughout this report, we’ll perform the required data manipulation, then construct and evaluate a regression model for a bicycle sharing demand dataset. You can follow along by downloading the code and data provided below. Afterwards, we’ll review how to publish your trained models as web services in the Azure cloud.
Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update This report covers the basics of manipulating data, constructing models, and evaluating models in the Microsoft Azure Machine Learning platform (Azure ML). The Azure ML platform has greatly simplified the development and deployment of machine learning models, with easy-to-use and powerful cloud-based data transformation and machine learning tools. In this report, we’ll explore extending Azure ML with the R language. (A companion report explores extending Azure ML using the Python language.) All of the concepts we will cover are illustrated with a data science example, using a bicycle rental demand dataset. We’ll perform the required data manipulation, or data munging. Then, we will construct and evaluate regression models for the dataset. You can follow along by downloading the code and data provided in the next section. Later in the report, we’ll discuss publishing your trained models as web services in the Azure cloud.
Data Science Revealed: A Data-Driven Glimpse into the Burgeoning new Field As the cost of computing power, data storage, and high-bandwidth Internet access have plunged exponentially over the past two decades, companies around the globe have recognized the power of harnessing data as a source of competitive advantage. But it was only recently, as social web applications and massively parallel processing became more widely available, that the nascent field of data science revealed what many are coming to understand: that data is the new oil, the source of corporate energy and differentiation in the 21st century. Companies like Facebook, LinkedIn, Yahoo, and Google are generating data not only as their primary product, but are analyzing it to continuously improve their products. Pharmaceutical and biomedical companies are using big data to find new cures and analyze genetic information, while marketers leverage the same technology to generate new customer insights. In order to tap this newfound wealth, organizations of all sizes are turning to practitioners in the new field of data science who are capable of translating massive data into predictive insights that lead to results. Data science is an emerging field, with rapid changes, great uncertainty, and exciting opportunities. Our study attempts the first-ever benchmark of the data science community, looking at how they interact with their data, the tools they use, their education, and how their organizations approach data-driven problem solving. We also looked at a smaller group of business intelligence professionals to identify areas of contrast between the emerging role of data scientists and the more mature field of BI. Our findings, summarized here, show an emerging talent gap between organizational needs and current industry capabilities, exemplified by the unique contributions data scientists can make to an organization and the broad expectations of data science professionals generally.
Data Science Salary Survey 2013 O’Reilly Media conducted an anonymous salary and tools survey in 2012 and 2013 with attendees of the Strata Conference: Making Data Work in Santa Clara, California and Strata + Hadoop World in New York. Respondents from 37 US states and 33 countries, representing a variety of industries in the public and private sector, completed the survey. We ran the survey to better understand which tools data analysts and data scientists use and how those tools correlate with salary. Not all respondents describe their primary role as data scientist/data analyst, but almost all respondents are exposed to data analytics. Similarly, while just over half the respondents described themselves as technical leads, almost all reported that some part of their role included technical duties (i.e., 10–20% of their responsibilities included data analysis or software development). We looked at which tools correlate with others (if respondents use one, are they more likely to use another?) and created a network graph of the positive correlations. Tools could then be compared with salary, either individually or collectively, based on where they clustered on the graph.
Data Science, an Overview of Classification Techniques (Slide Deck)
Data Science, Banking, and Fintech The financial industry today is under siege, but not from economic pressures in Europe and China. Rather, this once-impenetrable fortress is currently riding a giant entrepreneurial wave of disruption, disintermediation, and digital innovation. Behind the siege is fintech, a spunky and growing group of financial technology companies. These venture-backed new arrivals are challenging the old champions in lending, payments, money transfer, trading, wealth management, and cryptocurrencies. In this O’Reilly report, author Cornelia Lévy-Bencheton examines the disruptive megatrends taking hold at every level and juncture of the financial ecosystem. You’ll find out how fintech is reshaping the financial industry, reimagining the ways consumers manage, save, and spend money through a data-driven culture of big data analytics, mobile payment services, and robo-advising. Can traditional financial institutions evolve in time to catch up and avoid being replaced? Pick up this report to learn about the current banking and financial services industry, key participants in fintech, and some adaptive strategies being used by traditional financial organizations.
Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics An action plan to enlarge the technical areas of statistics focuses on the data analyst. The plan sets out six technical areas of work for a university department and advocates a specific allocation of resources devoted to research in each area and to courses in each area. The value of technical work is judged by the extent to which it benefits the data analyst, either directly or indirectly. The plan is also applicable to government research labs and corporate research organizations.
Data Scientist Enablement Roadmap (Slide Deck)
Data Scientist: The Sexiest Job of the 21st Century When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a startup. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren’t seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, “It was like arriving at a conference reception and realizing you don’t know anyone. So you just stand in the corner sipping your drink—and you probably leave early.”
Data Storytelling: Using visualization to share the human impact of numbers Storytelling is a cornerstone of the human experience. The universe may be full of atoms, but it’s through stories that we truly construct our world. From Greek mythology to the Bible to television series like Cosmos, stories have been shaping our experience on Earth for as long as we’ve lived on it. A key purpose of storytelling is not just understanding the world but changing it. After all, why would we study the world if we didn’t want to know how we can—and should— influence it? Though many elements of stories have remained the same throughout history, we have developed better tools and mediums for telling them, such as printed books, movies, and comics. This has changed storytelling styles—and perhaps most importantly, the impact of those stories—over the millennia. But can stories be told with data, as well as with images and words? That’s what this paper’s about.
Data Stream Mining – A Practical Approach Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naïve Bayes classifiers at the leaves. MOA is related to WEKA, the Waikato Environment for Knowledge Analysis, which is an award-winning open-source workbench containing implementations of a wide range of batch machine learning methods. WEKA is also written in Java. The main benefits of Java are portability, where applications can be run on any platform with an appropriate Java virtual machine, and the strong and well-developed support libraries. Use of the language is widespread, and features such as the automatic garbage collection help to reduce programmer burden and error. This text explains the theoretical and practical foundations of the methods and streams available in MOA.
Data Visualization Techniques – From Basics to Big Data with SAS Visual Analytics A picture is worth a thousand words – especially when you are trying to understand and gain insights from data. It is particularly relevant when you are trying to find relationships among thousands or even millions of variables and determine their relative importance. Organizations of all types and sizes generate data each minute, hour and day. Everyone – including executives, departmental decision makers, call center workers and employees on production lines – hopes to learn things from collected data that can help them make better decisions, take smarter actions and operate more efficiently. Regardless of how much data you have, one of the best ways to discern important relationships is through advanced analysis and high-performance data visualization. If sophisticated analyses can be performed quickly, even immediately, and results presented in ways that showcase patterns and allow querying and exploration, people across all levels in your organization can make faster, more effective decisions. To create meaningful visuals of your data, there are some basics you should consider. Data size and column composition play an important role when selecting graphs to represent your data. This paper discusses some of the basic issues concerning data visualization and provides suggestions for addressing those issues. In addition, big data brings a unique set of challenges for creating visualizations. This paper covers some of those challenges and potential solutions as well. If you are working with massive amounts of data, one challenge is how to display results of data exploration and analysis in a way that is not overwhelming. You may need a new way to look at the data – one that collapses and condenses the results in an intuitive fashion but still displays graphs and charts that decision makers are accustomed to seeing. And, in today’s on-the-go society, you may also need to make the results available quickly via mobile devices, and provide users with the ability to easily explore data on their own in real time. SAS Visual Analytics is a new business intelligence solution that uses intelligent autocharting to help business analysts and nontechnical users visualize data. It creates the best possible visual based on the data that is selected. The visualizations make it easy to see patterns and trends and identify opportunities for further analysis. The heart and soul of SAS Visual Analytics is the SAS LASR Analytic Server, which can execute and accelerate analytic computations in-memory with unprecedented performance. The combination of high-performance analytics and an easy-to-use data exploration interface enables different types of users to create and interact with graphs so they can understand and derive value from their data faster than ever. This creates an unprecedented ability to solve difficult problems, improve business performance and mitigate risk – rapidly and confidently.
Data Visualization with ggplot2 (Cheat Sheet)
Data Visualization: A New Language for Storytelling An Emerging Universal Medium: When was the last time you saw a business presentation that did not include at least one slide with a bar graph or a pie chart? Data visualizations have become so ubiquitous that we no longer find them remarkable.
Data Visualization: Making Big Data Approachable and Valuable Enterprises today are beginning to realize the important role Big Data plays in achieving business goals. Concepts that used to be difficult for companies to comprehend—factors that influence a customer to make a purchase, behavior patterns that point to fraud or misuse, inefficiencies slowing down business processes—now can be understood and addressed by collecting and analyzing Big Data. The insight gained from such analysis helps organizations improve operations and identify new product and service opportunities that they may have otherwise missed. In essence, Big Data promises to deliver the advantages that companies need to drive revenue growth and gain a competitive edge. However, getting to that Big Data payoff is proving a difficult challenge for many organizations. Big Data is often voluminous and tends to rapidly change and morph, making it challenging to get a handle on and difficult to access. The majority of tools available to work with Big Data are complex and hard to use, and most enterprises don’t have the in-house expertise to perform the required data analysis and manipulation to draw out the answers that the business is seeking. In fact, in a recent survey conducted by IDG Research, when asked about analyzing Big Data, respondents cite lack of skills and difficulty in making Big Data available to users as two significant challenges. “A lot of existing Big Data techniques require you to really get your hands dirty; I don’t think that most Big Data software is as mature as it needs to be in order to be accessible to business users at most enterprises,” says Paul Kent, vice president of Big Data with SAS. “So if you’re not Google or LinkedIn or Facebook, and you don’t have thousands of engineers to work with Big Data, it can be difficult to find business answers in the information.” What enterprises need are tools to help them easily and effectively understand and analyze Big Data. Employees who aren’t data scientists or analysts should be able to ask questions of the data based on their own business expertise and quickly and easily find patterns, spot inconsistencies, even get answers to questions they haven’t yet thought to ask. Otherwise, the effort and expense that companies invest in collecting and mining Big Data may fail to yield significant, actionable results. And companies run the risk of missing important business opportunities if they can’t find the answers that are likely stored in their own data.
Data Visualization: When Data Speaks Business This TEC Product Analysis Report aims to provide an extensive review of the set of data visualization features that form part of the essential core of IBM Cognos Business Intelligence (BI) capabilities. The report contains the following elements:
1. An introduction to IBM Cognos Business Intelligence and data visualization for providing extensive analytics and data discovery services
2. An analyst perspective covering data visualization, its role, importance, and value in the BI lifecycle chain and examining its relationship to other elements in a reliable and best practice scenario for performing BI within an organization
3. A review of IBM Cognos data visualization capabilities
4. A general conclusion and final analyst summary
Data Warehousing: Best Practices for Collecting, Storing, and Delivering Decision-Support Data Data Warehousing is a process for collecting, storing, and delivering decision-support data for some or all of an enterprise. Data warehousing is a broad subject that is described point by point in this Refcard. A data warehouse is one of the artifacts created in the data warehousing process.
Data Wrangling with dplyr and tidyr Cheat Sheet (Cheat Sheet)
Data: Emerging Trends and Technologies What are the emerging trends and technologies that will transform the data landscape in coming months? In this report from Strata + Hadoop World co-chair Alistair Croll, you’ll learn how the ubiquity of cheap sensors, fast networks, and distributed computing have given rise to several developments that will soon have a profound effect on individuals and society as a whole. Machine learning, for example, has quickly moved from lab tool to hosted, pay-as-you-go services in the cloud. Those services, in turn, are leading to predictive apps that will provide individuals with the right functionality and content at the right time by continuously learning about them and predicting what they’ll need. Computational power can produce cognitive augmentation.
Data-Driven Nested Stochastic Robust Optimization: A General Computational Framework and Algorithm for Optimization under Uncertainty in the Big Data Era A novel data-driven nested stochastic robust optimization (DDNSRO) framework is proposed to systematically and automatically handle labeled multi-class uncertainty data in optimization problems. Uncertainty realizations in large datasets are often collected from various conditions, which are encoded by class labels. A group of Dirichlet process mixture models is employed for uncertainty modeling from the multi-class uncertainty data. The proposed data-driven nonparametric uncertainty model could automatically adjust its complexity based on the data structure and complexity, thus accurately capturing the uncertainty information. A DDNSRO framework is further proposed based on the data-driven uncertainty model through a bi-level optimization structure. The outer optimization problem follows a two-stage stochastic programming approach to optimize the expected objective across different classes of data; robust optimization is nested as the inner problem to ensure the robustness of the solution while maintaining computational tractability. A tailored column-and-constraint generation algorithm is further developed to solve the resulting multi-level optimization problem efficiently. Case studies on strategic planning of process networks are presented to demonstrate the applicability of the proposed framework.
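The uncertainty-modeling step can be pictured with an off-the-shelf truncated Dirichlet process mixture, which likewise adjusts its effective complexity to the data. A minimal sketch using scikit-learn's BayesianGaussianMixture, chosen here purely for illustration (the paper builds its own Dirichlet process mixture models):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic uncertainty data from two operating conditions.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal([0, 0], 0.5, size=(200, 2)),
                  rng.normal([3, 3], 0.8, size=(200, 2))])

# Truncated Dirichlet process mixture: up to 10 components are allowed,
# but the stick-breaking prior drives unneeded components' weights to ~0.
dpm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=500,
    random_state=0,
).fit(data)

# Effective model complexity is inferred from the data, not fixed upfront.
print("component weights:", np.round(dpm.weights_, 3))
```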
Data-intensive applications, challenges, techniques and technologies: A survey on Big Data It is already true that Big Data has drawn huge attention from researchers in information sciences, and from policy and decision makers in governments and enterprises. As the speed of information growth exceeds Moore’s Law at the beginning of this new century, excessive data is causing great trouble for human beings. However, there is great potential and highly useful value hidden in this huge volume of data. A new scientific paradigm has been born: data-intensive scientific discovery (DISD), also known as Big Data problems. A large number of fields and sectors, ranging from economic and business activities to public administration, and from national security to scientific research in many areas, involve Big Data problems. On the one hand, Big Data is extremely valuable for producing productivity gains in businesses and evolutionary breakthroughs in scientific disciplines, giving us many opportunities to make great progress in many fields. There is no doubt that future competition in business productivity and technology will converge on Big Data exploration. On the other hand, Big Data also brings many challenges, such as difficulties in data capture, data storage, data analysis and data visualization. This paper aims to present a close-up view of Big Data, including Big Data applications, Big Data opportunities and challenges, as well as the state-of-the-art techniques and technologies currently adopted to deal with Big Data problems. We also discuss several underlying methodologies for handling the data deluge, for example, granular computing, cloud computing, bio-inspired computing, and quantum computing.
Deciphering Big Data Stacks: An Overview of Big Data Tools With its ability to ingest, process, and decipher an abundance of incoming data, Big Data is considered by many to be a cornerstone of future research and development. However, the large number of available tools and the overlap between them are impeding their technological potential. In this paper, we present a systematic grouping of the available tools and present a network of dependencies among them, with the aim of composing individual tools into functional software stacks required to perform Big Data analyses.
Decision Management and Cloud as a Platform for Predictive Analytics (Slide Deck)
Decision Modeling with DMN: How to Build a Decision Requirements Model using the new Decision Model and Notation (DMN) standard The goal of this paper is to describe the four iterative steps to complete a Decision Requirements Model using the forthcoming DMN standard. Before beginning, it is important to understand the value of defining decision requirements as part of your overall requirements process. Experience shows that there are three main reasons for doing so:
1. Current requirements approaches don’t tackle the decision-making that is increasingly important in information systems.
2. While important for all software development projects, decision requirements are especially important for projects adopting business rules and advanced analytic technologies.
3. Decisions are a common language across business, IT and analytic organizations improving collaboration, increasing reuse, and easing implementation.
Decision Requirements Modeling for Analytic Projects Established analytic approaches like CRISP-DM stress the importance of understanding the project objectives and requirements from a business perspective, but to date there are no formal approaches to capturing this understanding in a repeatable, understandable format. Decision Requirements Modeling closes this gap. Decision Requirements Modeling is a successful technique that develops a richer, more complete business understanding earlier. Decision Requirements Modeling results in a clear business target, an understanding of how the results will be used and deployed, and by whom. Using Decision Requirements Modeling to guide and shape analytics projects reduces reliance on constrained specialist resources by improving requirements gathering, helps teams ask the key questions and enables teams to collaborate effectively across the organization, bringing analytics, IT and business professionals together. Using Decision Requirements Modeling to document analytic project requirements enables organizations to:
– Compare multiple projects for prioritization, including allowing new analytic development to be compared with updating or refining existing analytics.
– Act on a specific plan to guide analytic development that is accessible to business, IT and analytic teams alike.
– Reuse knowledge from project to project by creating an increasingly detailed and accurate view of decision-making and the role of analytics.
– Value information sources and analytics in terms of business impact.
There is an emerging consensus that Decision Requirements Modeling is the best way to specify decision-making. It is also central to a forthcoming standard, the Object Management Group’s Decision Model and Notation, which will give adopters access to a broad community and a vehicle for sharing expertise more widely.
Decision Theory – A Brief Introduction Decision theory is theory about decisions. The subject is not a very unified one. To the contrary, there are many different ways to theorize about decisions, and therefore also many different research traditions. This text attempts to reflect some of the diversity of the subject. Its emphasis lies on the less (mathematically) technical aspects of decision theory.
Decision Tree Classification with Differential Privacy: A Survey Data mining information about people is becoming increasingly important in the data-driven society of the 21st century. Unfortunately, sometimes there are real-world considerations that conflict with the goals of data mining; sometimes the privacy of the people being data mined needs to be considered. This necessitates that the output of data mining algorithms be modified to preserve privacy while simultaneously not ruining the predictive power of the outputted model. Differential privacy is a strong, enforceable definition of privacy that can be used in data mining algorithms, guaranteeing that nothing will be learned about the people in the data that could not already be discovered without their participation. In this survey, we focus on one particular data mining algorithm — decision trees — and how differential privacy interacts with each of the components that constitute decision tree algorithms. We analyze both greedy and random decision trees, and the conflicts that arise when trying to balance privacy requirements with the accuracy of the model.
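A common ingredient of the surveyed algorithms is perturbing the class counts that drive split selection with calibrated Laplace noise. Here is a minimal sketch of the standard epsilon-differentially-private Laplace mechanism; the surveyed algorithms differ in how they allocate the privacy budget across tree levels, which this sketch does not capture:

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float,
                  sensitivity: float = 1.0,
                  rng=np.random.default_rng(0)) -> float:
    """epsilon-differentially-private count query: one person joining or
    leaving the data changes a count by at most `sensitivity`, so adding
    Laplace(sensitivity / epsilon) noise satisfies epsilon-DP."""
    return true_count + rng.laplace(scale=sensitivity / epsilon)

# Class counts at a candidate tree node, privatized before computing a
# split criterion such as information gain.
counts = {"yes": 42, "no": 17}
private = {c: laplace_count(n, epsilon=0.5) for c, n in counts.items()}
print(private)
```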
Decorrelation of Neutral Vector Variables: Theory and Applications In this paper, we propose novel strategies for neutral vector variable decorrelation. Two fundamental invertible transformations, namely serial nonlinear transformation and parallel nonlinear transformation, are proposed to carry out the decorrelation. For a neutral vector variable, which is not multivariate Gaussian distributed, the conventional principal component analysis (PCA) cannot yield mutually independent scalar variables. With the two proposed transformations, a highly negatively correlated neutral vector can be transformed to a set of mutually independent scalar variables with the same degrees of freedom. We also evaluate the decorrelation performances for the vectors generated from a single Dirichlet distribution and a mixture of Dirichlet distributions. The mutual independence is verified with the distance correlation measurement. The advantages of the proposed decorrelation strategies are intensively studied and demonstrated with synthesized data and practical application evaluations.
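For the canonical neutral vector, a Dirichlet-distributed vector, the serial transformation corresponds to stick-breaking ratios, which are mutually independent Beta variables. A numpy sketch of that special case, as an illustration of the idea rather than the paper's full method:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 4.0, 5.0])
x = rng.dirichlet(alpha, size=100_000)   # coordinates are negatively correlated

# Serial (stick-breaking) transformation: v_k = x_k / (1 - sum_{i<k} x_i).
# For a Dirichlet vector, the v_k are independent Beta variables.
remaining = 1.0 - np.hstack([np.zeros((x.shape[0], 1)),
                             np.cumsum(x[:, :-1], axis=1)])
v = x / remaining                        # last column is identically 1

print("corr of raw coords:\n", np.round(np.corrcoef(x[:, :-1].T), 2))
print("corr of transformed:\n", np.round(np.corrcoef(v[:, :-1].T), 2))
```

The off-diagonal correlations of the transformed variables should be near zero, while the raw coordinates show the strong negative correlation the paper describes.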
Deep Architectures for Modulation Recognition We survey the latest advances in machine learning with deep neural networks by applying them to the task of radio modulation recognition. Results show that radio modulation recognition is not limited by network depth and further work should focus on improving learned synchronization and equalization. Advances in these areas will likely come from novel architectures designed for these tasks or through novel training methods.
Deep Belief Nets (Slide Deck)
Deep EHR: A Survey of Recent Advances on Deep Learning Techniques for Electronic Health Record (EHR) Analysis The past decade has seen an explosion in the amount of digital information stored in electronic health records (EHR). While primarily designed for archiving patient clinical information and administrative healthcare tasks, many researchers have found secondary use of these records for various clinical informatics tasks. Over the same period, the machine learning community has seen widespread advances in deep learning techniques, which also have been successfully applied to the vast amount of EHR data. In this paper, we review these deep EHR systems, examining architectures, technical aspects, and clinical applications. We also identify shortcomings of current techniques and discuss avenues of future research for EHR-based deep learning.
Deep Learning (Slide Deck)
Deep learning applications and challenges in big data analytics Big Data Analytics and Deep Learning are two high-focus areas of data science. Big Data has become important as many organizations, both public and private, have been collecting massive amounts of domain-specific information, which can contain useful information about problems such as national intelligence, cyber security, fraud detection, marketing, and medical informatics. Companies such as Google and Microsoft are analyzing large volumes of data for business analysis and decisions, impacting existing and future technology. Deep Learning algorithms extract high-level, complex abstractions as data representations through a hierarchical learning process. Complex abstractions are learnt at a given level based on relatively simpler abstractions formulated in the preceding level in the hierarchy. A key benefit of Deep Learning is the analysis and learning of massive amounts of unsupervised data, making it a valuable tool for Big Data Analytics where raw data is largely unlabeled and uncategorized. In the present study, we explore how Deep Learning can be utilized for addressing some important problems in Big Data Analytics, including extracting complex patterns from massive volumes of data, semantic indexing, data tagging, fast information retrieval, and simplifying discriminative tasks. We also investigate some aspects of Deep Learning research that need further exploration to incorporate specific challenges introduced by Big Data Analytics, including streaming data, high-dimensional data, scalability of models, and distributed computing. We conclude by presenting insights into relevant future works by posing some questions, including defining data sampling criteria, domain adaptation modeling, defining criteria for obtaining useful data abstractions, improving semantic indexing, semi-supervised learning, and active learning.
Deep Learning applied to NLP Convolutional Neural Networks (CNNs) are typically associated with Computer Vision. CNNs are responsible for major breakthroughs in Image Classification and are the core of most Computer Vision systems today. More recently, CNNs have been applied to problems in Natural Language Processing, with some interesting results. In this paper, we explain the basics of CNNs, their different variations, and how they have been applied to NLP.
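The core operation is a filter sliding over windows of consecutive word embeddings, followed by max-pooling over time. A minimal numpy sketch with toy dimensions (all values hypothetical; real systems learn the filters and embeddings jointly):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, n_filters, window = 10, 8, 4, 3

sentence = rng.normal(size=(seq_len, emb_dim))         # one embedded sentence
filters = rng.normal(size=(n_filters, window, emb_dim))

# Convolve each filter over every window of `window` consecutive words,
# then max-pool over time: one feature per filter, regardless of length.
conv = np.array([
    [np.sum(sentence[t:t + window] * f) for t in range(seq_len - window + 1)]
    for f in filters
])                                                     # (n_filters, positions)
features = np.maximum(conv, 0).max(axis=1)             # ReLU + max-over-time
print(features.shape)                                  # (4,) -> feeds a classifier
```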
Deep Learning based Recommender System: A Survey and New Perspectives With the ever-growing volume, complexity and dynamicity of online information, recommender systems are a key solution for overcoming information overload. In recent years, deep learning’s revolutionary advances in speech recognition, image analysis and natural language processing have drawn significant attention. Meanwhile, recent studies also demonstrate its effectiveness in coping with information retrieval and recommendation tasks. Applying deep learning techniques to recommender systems has been gaining momentum due to its state-of-the-art performance and high-quality recommendations. In contrast to traditional recommendation models, deep learning provides a better understanding of users’ demands, items’ characteristics and the historical interactions between them. This article provides a comprehensive review of recent research efforts on deep learning based recommender systems, with the aim of fostering innovation in recommender system research. A taxonomy of deep learning based recommendation models is presented and used to categorise the surveyed articles. Open problems are identified based on an insightful analysis of the reviewed works, and potential solutions are discussed.
Deep Learning is Robust to Massive Label Noise Deep neural networks trained on large supervised datasets have led to impressive results in recent years. However, since well-annotated datasets can be prohibitively expensive and time-consuming to collect, recent work has explored the use of larger but noisy datasets that can be more easily obtained. In this paper, we investigate the behavior of deep neural networks on training sets with massively noisy labels. We show that successful learning is possible even with an essentially arbitrary amount of noise. For example, on MNIST we find that accuracy of above 90 percent is still attainable even when the dataset has been diluted with 100 noisy examples for each clean example. Such behavior holds across multiple patterns of label noise, even when noisy labels are biased towards confusing classes. Further, we show how the required dataset size for successful training increases with higher label noise. Finally, we present simple actionable techniques for improving learning in the regime of high label noise.
Deep Learning: A Bayesian Perspective Deep learning is a form of machine learning for nonlinear high dimensional data reduction and prediction. A Bayesian probabilistic perspective provides a number of advantages: specifically, statistical interpretation and properties, more efficient algorithms for optimisation and hyper-parameter tuning, and an explanation of predictive performance. Traditional high-dimensional statistical techniques such as principal component analysis (PCA), partial least squares (PLS), reduced rank regression (RRR) and projection pursuit regression (PPR) are shown to be shallow learners. Their deep learning counterparts exploit multiple layers of data reduction, which leads to performance gains. Stochastic gradient descent (SGD) training and optimisation and Dropout (DO) provide model and variable selection. Bayesian regularization is central to finding networks and provides a framework for an optimal bias-variance trade-off to achieve good out-of-sample performance. Constructing good Bayesian predictors in high dimensions is discussed. To illustrate our methodology, we provide an analysis of first time international bookings on Airbnb. Finally, we conclude with directions for future research.
Deep Learning: Past, Present and Future (Slide Deck)
Deep Neural Decision Forests We present Deep Neural Decision Forests – a novel approach that unifies classification trees with the representation learning functionality known from deep convolutional networks, by training them in an end-to-end manner. To combine these two worlds, we introduce a stochastic and differentiable decision tree model, which steers the representation learning usually conducted in the initial layers of a (deep) convolutional network. Our model differs from conventional deep networks because a decision forest provides the final predictions, and it differs from conventional decision forests since we propose a principled, joint and global optimization of split and leaf node parameters. We show experimental results on benchmark machine learning datasets like MNIST and ImageNet and find on-par or superior results when compared to state-of-the-art deep models. Most remarkably, we obtain top-5 errors of only 7.84%/6.38% on ImageNet validation data when integrating our forests in a single-crop, single/seven model GoogLeNet architecture, respectively. Thus, even without any form of training data set augmentation we are improving on the 6.67% error obtained by the best GoogLeNet architecture (7 models, 144 crops).
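The key device is soft routing: each internal node sends a sample left with probability given by a sigmoid of a learned function of the input, so a leaf's probability is the product of decisions along its path and the whole tree is differentiable. A toy numpy sketch for a depth-2 tree (illustrative only; in the paper the routing functions are outputs of a deep network and all parameters are trained end-to-end):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=5)              # feature vector (e.g., a deep-net output)
w = rng.normal(size=(3, 5))         # one linear split per internal node

d = sigmoid(w @ x)                  # d[i] = P(go left at internal node i)
# Depth-2 tree: node 0 is the root; nodes 1 and 2 are its children.
# Each leaf's probability is the product of routing decisions on its path.
leaf_probs = np.array([
    d[0] * d[1],                    # left, left
    d[0] * (1 - d[1]),              # left, right
    (1 - d[0]) * d[2],              # right, left
    (1 - d[0]) * (1 - d[2]),        # right, right
])
print(leaf_probs, leaf_probs.sum())  # sums to 1; fully differentiable
```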
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images Deep neural networks (DNNs) have recently been achieving state-of-the-art performance on a variety of pattern-recognition tasks, most notably visual classification problems. Given that DNNs are now able to classify objects in images with near-human-level performance, questions naturally arise as to what differences remain between computer and human vision. A recent study revealed that changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion a library). Here we show a related result: it is easy to produce images that are completely unrecognizable to humans, but that state-of-the-art DNNs believe to be recognizable objects with 99.99% confidence (e.g. labeling with certainty that white noise static is a lion). Specifically, we take convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and then find images with evolutionary algorithms or gradient ascent that DNNs label with high confidence as belonging to each dataset class. It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects. Our results shed light on interesting differences between human vision and current DNNs, and raise questions about the generality of DNN computer vision.
Deep Reinforcement Learning: An Overview We give an overview of recent exciting achievements of deep reinforcement learning (RL). We start with the background of deep learning and reinforcement learning, as well as an introduction of testbeds. Next we discuss Deep Q-Network (DQN) and its extensions, asynchronous methods, policy optimization, reward, and planning. After that, we talk about attention and memory, unsupervised learning, and learning to learn. Then we discuss various applications of RL, including games, in particular AlphaGo, robotics, spoken dialogue systems (a.k.a. chatbots), machine translation, text sequence prediction, neural architecture design, personalized web services, healthcare, finance, and music generation. We also mention topics and papers not yet reviewed. After listing a collection of RL resources, we close with discussions.
Deep-learning in Mobile Robotics – from Perception to Control Systems: A Survey on Why and Why not Deep-learning has dramatically changed the world overnight. It has greatly boosted the development of visual perception, object detection, speech recognition, and more. This is attributable to the multiple convolutional processing layers that abstract learning representations from massive data. The advantages of deep convolutional structures in data processing motivated the application of artificial intelligence methods to robotic problems, especially perception and control systems, two typical and challenging problems in robotics. This paper presents a survey of the deep-learning research landscape in mobile robotics. We start by introducing the definition and development of deep-learning in related fields, especially the essential distinctions between image processing and robotic tasks. We describe and discuss several typical applications and related works in this domain, followed by the benefits of deep-learning and related existing frameworks. Besides, operation in complex dynamic environments is regarded as a critical bottleneck for mobile robots, such as in autonomous driving. We thus further emphasize recent achievements on how deep-learning contributes to navigation and control systems for mobile robots. Finally, we discuss the open challenges and research frontiers.
DeepWalk: Online Learning of Social Representations We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk’s latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk’s representations can provide F1 scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk’s representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real world applications such as network classification, and anomaly detection.
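The walk-generation step is easy to picture: DeepWalk performs short truncated random walks from every vertex and hands the resulting vertex sequences to a skip-gram language model as if they were sentences. A minimal sketch of that first step in plain Python, on a toy graph (the skip-gram training would follow, e.g., via a word2vec implementation):

```python
import random

def random_walks(graph, walks_per_node=10, walk_length=6, seed=0):
    """graph: dict mapping each vertex to a list of neighbours.
    Returns truncated random walks, treated by DeepWalk as 'sentences'."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in graph:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = graph[walk[-1]]
                if not neighbours:
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Toy graph: two triangles joined by a single edge.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
corpus = random_walks(g)
print(corpus[0])  # one 'sentence' of vertex ids, ready for skip-gram
```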
Delivering Information Faster: In-Memory Technology Reboots the Big Data Analytics World In-memory technology – in which entire datasets are pre-loaded into a computer’s random access memory, alleviating the need for shuttling data between memory and disk storage every time a query is initiated – has actually been around for a number of years. However, with the onset of big data, as well as an insatiable thirst for analytics, the industry is taking a second look at this promising approach to speeding up data processing.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%) on this visual recognition challenge.
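Both contributions fit in a few lines. A numpy sketch of PReLU and of the rectifier-aware initialization (often called He initialization), whose standard deviation sqrt(2 / fan_in) is derived in the paper to keep activation variance stable in deep rectified nets:

```python
import numpy as np

def prelu(x, a):
    """PReLU: identity for x > 0, learned slope `a` for x <= 0.
    a = 0 recovers ReLU; a fixed small constant recovers leaky ReLU."""
    return np.where(x > 0, x, a * x)

def he_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Initialization derived for rectifier nonlinearities: zero-mean
    Gaussian with std sqrt(2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(x, a=0.25))   # [-0.5, -0.125, 0., 1.5]
W = he_init(256, 128)
print(W.std())            # close to sqrt(2/256) ~ 0.088
```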
Demystifying Fog Computing: Characterizing Architectures, Applications and Abstractions Internet of Things (IoT) has accelerated the deployment of millions of sensors at the edge of the network, through Smart City infrastructure and lifestyle devices. Cloud computing platforms are often tasked with handling these large volumes and fast streams of data from the edge. Recently, Fog computing has emerged as a concept for low-latency and resource-rich processing of these observation streams, to complement Edge and Cloud computing. In this paper, we review various dimensions of system architecture, application characteristics and platform abstractions that are manifest in this Edge, Fog and Cloud eco-system. We highlight novel capabilities of the Edge and Fog layers, such as physical and application mobility, privacy sensitivity, and a nascent runtime environment. IoT application case studies based on first-hand experiences across diverse domains drive this categorization. We also highlight the gap between the potential and the reality of Fog computing, and identify challenges that need to be overcome for the solution to be sustainable. Together, our article can help platform and application developers bridge the gap that remains in making Fog computing viable.
Density-based Clustering Clustering methods like K-means or Expectation-Maximization are suitable for finding ellipsoid-shaped clusters, or at best convex clusters. However, for non-convex clusters, such as those shown in Figure 15.1, these methods have trouble finding the true clusters, since two points from different clusters may be closer than two points in the same cluster. The density-based methods we consider in this chapter are able to mine such non-convex or shape-based clusters.
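A quick way to see the contrast is scikit-learn's two-moons data, where K-means splits each moon but a density-based method recovers both. A minimal sketch with DBSCAN (parameters chosen for illustration):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-convex clusters where K-means fails,
# because points across the two moons can be closer than points within one.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# Density-based clustering: a point is 'core' if >= min_samples points lie
# within radius eps; clusters grow by connecting dense neighbourhoods.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # {0, 1} (plus -1 for any noise points)
```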
Design Principles of Massive, Robust Prediction Systems Most data mining research is concerned with building high-quality classification models in isolation. In massive production systems, however, the ability to monitor and maintain performance over time while growing in size and scope is equally important. Many external factors may degrade classification performance including changes in data distribution, noise or bias in the source data, and the evolution of the system itself. A well-functioning system must gracefully handle all of these. This paper lays out a set of design principles for large-scale autonomous data mining systems and then demonstrates our application of these principles within the m6d automated ad targeting system. We demonstrate a comprehensive set of quality control processes that allow us to monitor and maintain thousands of distinct classification models automatically, and to add new models, take on new data, and correct poorly-performing models without manual intervention or system disruption.
Designing Great Visualizations This paper traces the history of visual representation, from early cave drawings through the computer revolution and the launch of Tableau. We will discuss some of the pioneers in data research and show how their work helped to revolutionize visualization techniques. We will also examine the different styles of data visuals, discuss some of the barriers to making effective visuals and the methods we use to overcome those barriers. In the end, we will show the power (and limits) of human perception, and how we can use data to tell stories – much like those of the earliest cave drawings.
Deterministic Distributed Matching: Simpler, Faster, Better We present improved deterministic distributed algorithms for a number of well-studied matching problems, which are simpler, faster, more accurate, and/or more general than their known counterparts. The common denominator of these results is a deterministic distributed rounding method for certain linear programs, which is the first such rounding method, to our knowledge. A sampling of our end results is as follows: — An $O(\log^2 \Delta \log n)$-round deterministic distributed algorithm for computing a maximal matching, in $n$-node graphs with maximum degree $\Delta$. This is the first improvement in about 20 years over the celebrated $O(\log^4 n)$-round algorithm of Hanckowiak, Karonski, and Panconesi [SODA’98, PODC’99]. — An $O(\log^2 \Delta \log \frac{1}{\varepsilon} + \log^* n)$-round deterministic distributed algorithm for a $(2+\varepsilon)$-approximation of maximum matching. This is exponentially faster than the classic $O(\Delta + \log^* n)$-round $2$-approximation of Panconesi and Rizzi [DIST’01]. With some modifications, the algorithm can also find an almost maximal matching which leaves only an $\varepsilon$-fraction of the edges on unmatched nodes. — An $O(\log^2 \Delta \log \frac{1}{\varepsilon} \log_{1+\varepsilon} W + \log^* n)$-round deterministic distributed algorithm for a $(2+\varepsilon)$-approximation of a maximum weighted matching, and also for the more general problem of maximum weighted $b$-matching. Here, $W$ denotes the maximum normalized weight. These improve over the $O(\log^4 n \log_{1+\varepsilon} W)$-round $(6+\varepsilon)$-approximation algorithm of Panconesi and Sozio [DIST’10].
Different Approach to the Problem of Missing Data There is a long history of development of methodology for dealing with missing data in statistical analysis. Today, the most popular methods fall into two classes, Complete Cases (CC) and Multiple Imputation (MI). Another approach, Available Cases (AC), has occasionally been mentioned in the research literature, in the context of linear regression analysis, but has generally been ignored. In this paper, we revisit the AC method, showing that it can perform better than CC and MI, and we extend its breadth of application.
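For linear regression, the AC idea is to estimate every mean and covariance from whichever cases happen to be observed for that variable or pair, then solve the usual normal equations. A hedged sketch of that pairwise-deletion estimate on illustrative synthetic data (pandas' cov already computes pairwise-complete covariances):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)
df = pd.DataFrame({"x1": X[:, 0], "x2": X[:, 1], "y": y})

# Make ~10% of each column missing at random; complete-case analysis
# would discard roughly 27% of the rows here.
for col in df.columns:
    df.loc[rng.random(n) < 0.1, col] = np.nan

# Available cases: each covariance uses all rows observed for that pair.
S = df.cov()   # pandas computes pairwise-complete covariances
beta = np.linalg.solve(S.loc[["x1", "x2"], ["x1", "x2"]],
                       S.loc[["x1", "x2"], "y"])
print(beta)    # slope estimates, close to the true values [2, -1]
```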
Directional Statistics in Machine Learning: a Brief Review The modern data analyst must cope with data encoded in various forms: vectors, matrices, strings, graphs, and more. Consequently, statistical and machine learning models tailored to different data encodings are important. We focus on data encoded as normalized vectors, so that their ‘direction’ is more important than their magnitude. Specifically, we consider high-dimensional vectors that lie either on the surface of the unit hypersphere or on the real projective plane. For such data, we briefly review common mathematical models prevalent in machine learning, while also outlining some technical aspects, software, applications, and open mathematical challenges.
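The workhorse model for such spherical data is the von Mises-Fisher distribution, and its approximate maximum-likelihood fit takes only a few lines: the mean direction is the normalized sample mean, and a widely used closed-form approximation (due to Banerjee et al., 2005) estimates the concentration. A sketch under those assumptions:

```python
import numpy as np

def fit_vmf(X):
    """Approximate MLE for a von Mises-Fisher distribution on unit vectors.
    Mean direction = normalized sample mean; concentration kappa uses the
    closed-form approximation of Banerjee et al. (2005)."""
    n, p = X.shape
    s = X.sum(axis=0)
    r_bar = np.linalg.norm(s) / n        # mean resultant length in (0, 1)
    mu = s / np.linalg.norm(s)
    kappa = r_bar * (p - r_bar ** 2) / (1 - r_bar ** 2)
    return mu, kappa

# Directions clustered near the x-axis on the unit sphere in R^3.
rng = np.random.default_rng(0)
X = rng.normal([5, 0, 0], 1.0, size=(1000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # project onto the sphere

mu, kappa = fit_vmf(X)
print(mu, kappa)   # mu ~ [1, 0, 0]; larger kappa = tighter concentration
```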
Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society The Big Data Research and Development Initiative is now in its third year and making great strides to address the challenges of Big Data. To further advance this initiative, we describe how statistical thinking can help tackle the many Big Data challenges, emphasizing that often the most productive approach will involve multidisciplinary teams with statistical, computational, mathematical, and scientific domain expertise.
discrete examples: genetics and spell checking
Distance and Similarity Measures Effect on the Performance of K-Nearest Neighbor Classifier – A Review The K-nearest neighbor (KNN) classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested example and the training examples. This raises a major question: which of the many available distance and similarity measures should be used for the KNN classifier? This review attempts to answer that question by evaluating the performance (measured by accuracy, precision and recall) of KNN using a large number of distance measures, tested on a number of real-world datasets, with and without adding different levels of noise. The experimental results show that the performance of the KNN classifier depends significantly on the distance used, with large gaps between the performances of different distances. We found that a recently proposed non-convex distance performed best on most datasets compared to the other tested distances. In addition, the performance of KNN degraded by only about $20\%$ even when the noise level reached $90\%$; this is true for all the distances used. This means that the KNN classifier using any of the top $10$ distances tolerates noise to a certain degree. Moreover, the results show that some distances are less affected by the added noise than others.
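The paper's central experiment, holding K and the data fixed while swapping the distance measure, is easy to reproduce on a small scale with scikit-learn. A sketch, with the dataset and metric list chosen here purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same K, same data: only the distance measure changes.
for metric in ["euclidean", "manhattan", "chebyshev", "cosine"]:
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=5, metric=metric))
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"{metric:>10}: {score:.3f}")
```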
Distance Metric Learning – A Comprehensive Survey Many machine learning algorithms, such as K Nearest Neighbor (KNN), rely heavily on the distance metric for the input data patterns. Distance metric learning aims to learn, from a given collection of pairs of similar/dissimilar points, a distance metric for the input space that preserves the distance relations among the training data. In recent years, many studies have demonstrated, both empirically and theoretically, that a learned metric can significantly improve performance in classification, clustering and retrieval tasks. This paper surveys the field of distance metric learning from a principled perspective and includes a broad selection of recent work. In particular, distance metric learning is reviewed under different learning conditions: supervised learning versus unsupervised learning, learning in a global sense versus a local sense, and distance matrices based on linear kernels versus nonlinear kernels. In addition, this paper discusses a number of techniques that are central to distance metric learning, including convex programming, positive semi-definite programming, kernel learning, dimension reduction, K Nearest Neighbor, large margin classification, and graph-based approaches.
Distinguishing cause from effect using observational data: methods and benchmarks The discovery of causal relationships from purely observational data is a fundamental problem in science. The most elementary form of such a causal discovery problem is to decide whether X causes Y or, alternatively, Y causes X, given joint observations of two variables X, Y. This was often considered to be impossible. Nevertheless, several approaches for addressing this bivariate causal discovery problem were proposed recently. In this paper, we present the benchmark data set CauseEffectPairs that consists of 88 different ‘cause-effect pairs’ selected from 31 datasets from various domains. We evaluated the performance of several bivariate causal discovery methods on these real-world benchmark data and on artificially simulated data. Our empirical results provide evidence that additive-noise methods are indeed able to distinguish cause from effect using only purely observational data. In addition, we prove consistency of the additive-noise method proposed by Hoyer et al. (2009).
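The additive-noise idea can be sketched in a few lines: fit y = f(x) + n in both directions and prefer the direction whose residuals look independent of the input. Below is a toy version using polynomial regression and a biased HSIC statistic as the independence measure; this is a simplification of the paper's methods, and bandwidth selection (e.g., the median heuristic) is omitted:

```python
import numpy as np

def rbf_gram(v, sigma=1.0):
    """RBF kernel Gram matrix for a 1-D sample."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(a, b):
    """Biased HSIC estimate: near 0 iff a and b are (nearly) independent."""
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_gram(a), rbf_gram(b)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def anm_score(x, y, degree=4):
    """Fit y = f(x) + n with a polynomial; return HSIC(x, residuals).
    Lower = residuals more independent of the input = more plausible
    causal direction under the additive-noise assumption."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return hsic(x, residuals)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 300)
y = x ** 3 + rng.normal(0, 1, 300)       # ground truth: x causes y

print("x -> y score:", anm_score(x, y))  # small: residuals ~ independent of x
print("y -> x score:", anm_score(y, x))  # larger: no additive-noise fit
```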
Distributed Constraint Optimization Problems and Applications: A Survey The field of Multi-Agent System (MAS) is an active area of research within Artificial Intelligence, with an increasingly important impact in industrial and other real-world applications. Within a MAS, autonomous agents interact to pursue personal interests and/or to achieve common objectives. Distributed Constraint Optimization Problems (DCOPs) have emerged as one of the prominent agent architectures to govern the agents’ autonomous behavior, where both algorithms and communication models are driven by the structure of the specific problem. During the last decade, several extensions to the DCOP model have enabled them to support MAS in complex, real-time, and uncertain environments. This survey aims at providing an overview of the DCOP model, giving a classification of its multiple extensions and addressing both resolution methods and applications that find a natural mapping within each class of DCOPs. The proposed classification suggests several future perspectives for DCOP extensions, and identifies challenges in the design of efficient resolution algorithms, possibly through the adaptation of strategies from different areas.
Distributed Decision Tree Learning for Mining Big Data Streams Web companies need to effectively analyse big data in order to enhance the experiences of their users. They need systems that are capable of handling big data in terms of three dimensions: volume, as data keeps growing; variety, as the types of data are diverse; and velocity, as data continuously arrives very fast into the systems. However, most of the existing systems have addressed at most two of the three dimensions, such as Mahout, a distributed machine learning framework that addresses the volume and variety dimensions, and Massive Online Analysis (MOA), a streaming machine learning framework that handles the variety and velocity dimensions. In this thesis, we propose and develop Scalable Advanced Massive Online Analysis (SAMOA), a distributed streaming machine learning framework to address the aforementioned challenge. SAMOA provides flexible application programming interfaces (APIs) to allow rapid development of new ML algorithms for dealing with variety. Moreover, we integrate SAMOA with Storm, a state-of-the-art stream processing engine (SPE), which allows SAMOA to inherit Storm’s scalability to address velocity and volume. The main benefits of SAMOA are that it provides flexibility in developing new ML algorithms and extensibility in integrating new SPEs. We develop a distributed online classification algorithm on top of SAMOA to verify the aforementioned features of SAMOA. The evaluation results show that the distributed algorithm is suitable for settings with a high number of attributes.
Distributed Latent Dirichlet Allocation via Tensor Factorization Latent Dirichlet Allocation (LDA) has proven extremely popular and versatile since its introduction over a decade ago. LDA is successful in part because it assigns a mixture of latent states (‘topics’) to each set of exchangeable observations (‘document’), in contrast to a hard clustering. This property complicates the estimation of latent parameters, and has led to extensive research into disparate learning techniques. Broadly speaking, there are three basic strategies: variational inference; Markov chain Monte Carlo; and the method of moments, the latter having been discovered recently. With high-dimensional data, large vocabularies, numerous documents, and many topics, computational constraints are the limiting factor in developing large-scale topic models. This has motivated research into scalable computational strategies for LDA. In the single-node context, stochastic variational inference is fast and accurate, but has high communication costs in the distributed setting. Batch variational inference has a more favorable ratio of communication to computation, as the E-step (but not the M-step) is embarrassingly parallel. Markov chain Monte Carlo (MCMC) techniques have also been implemented in the distributed setting, in both synchronous and asynchronous variants. Due to their recent introduction, there are no distributed implementations of method-of-moments-based approaches to LDA. We leverage the fact that the method of moments for LDA reduces to canonical polyadic (CP) decomposition of a tensor, a problem which has received extensive study in the literature, including distributed variants. We combine ALS with whitening preprocessing (data orthogonalization and dimensionality reduction), motivated by better convergence rates and perturbation guarantees compared to previous methods. Additionally, the preprocessing has the benefit that the subsequent tensor decomposition is independent of the vocabulary size and the number of documents. Although ALS requires many iterations to converge (more than would be tolerable using map-reduce without custom support for low-overhead iteration), we utilize REEF, a distributed processing framework which runs on YARN-managed clusters, e.g., a Hadoop 2 installation.
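The computational core being distributed here is CP decomposition by alternating least squares. A single-machine sketch of ALS for a 3-way tensor follows; the whitening step and the distributed REEF implementation are omitted, so this is an illustration of the kernel computation only:

```python
import numpy as np
from scipy.linalg import khatri_rao

def als_cp(T, rank, n_iter=100, seed=0):
    """Alternating least squares for a rank-`rank` CP decomposition of a
    3-way tensor T: T[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r]."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A, B, C = (rng.normal(size=(d, rank)) for d in (I, J, K))
    T1 = T.reshape(I, J * K)                  # mode-1 unfolding (k fastest)
    T2 = np.moveaxis(T, 1, 0).reshape(J, I * K)
    T3 = np.moveaxis(T, 2, 0).reshape(K, I * J)
    for _ in range(n_iter):
        # Each factor solve uses the Khatri-Rao structure of the normal
        # equations: (KR'KR) = (B'B) * (C'C) elementwise, etc.
        A = T1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = T2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = T3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Recover an exactly rank-3 random tensor.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.normal(size=(d, 3)) for d in (5, 6, 7))
T = np.einsum("ir,jr,kr->ijk", A0, B0, C0)
A, B, C = als_cp(T, rank=3)
approx = np.einsum("ir,jr,kr->ijk", A, B, C)
print(np.linalg.norm(T - approx) / np.linalg.norm(T))  # should be close to 0
```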
Distributed Machine Learning with Apache Mahout (RefCard) Apache Mahout is a library for scalable machine learning. Originally a subproject of Apache Lucene (a high-performance text search engine library), Mahout has progressed to be a top-level Apache project. While Mahout has only been around for a few years, it has established itself as a frontrunner in the field of machine learning technologies. This Refcard will present the basics of Mahout by studying two possible applications:
• Training and testing a Random Forest for handwriting recognition using Amazon Web Services EMR.
• Running a recommendation engine on a standalone Spark cluster.
Do Neural Nets Learn Statistical Laws behind Natural Language? The performance of deep learning in natural language processing has been spectacular, but the reason for this success remains unclear because of the inherent complexity of deep learning. This paper provides empirical evidence of its effectiveness and of a limitation of neural networks for language engineering. More precisely, we demonstrate that a Long Short-Term Memory (LSTM)-based neural language model effectively reproduces Zipf’s law and Heaps’ law, two representative statistical properties underlying natural language. We discuss the quality of the reproducibility and the emergence of Zipf’s law and Heaps’ law as training progresses. We also point out that the neural language model has a limitation in reproducing long-range correlation, another statistical law of natural language. This understanding could suggest a direction for improving neural network architectures.
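Both statistical laws are straightforward to measure on any token sequence, whether natural text or text sampled from a trained language model. A rough sketch (corpus.txt is a placeholder path; in the paper's setting the tokens would come from the LSTM's generated text):

```python
from collections import Counter

import numpy as np

# Placeholder corpus of whitespace-separated tokens.
text = open("corpus.txt", encoding="utf-8").read().lower().split()

# Zipf's law: log-frequency falls roughly linearly in log-rank (slope ~ -1).
freqs = np.array(sorted(Counter(text).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)
slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"Zipf slope: {slope:.2f} (close to -1 for natural language)")

# Heaps' law: vocabulary size V(n) grows roughly like n^beta.
ns = np.linspace(1000, len(text), 20, dtype=int)
vocab = [len(set(text[:n])) for n in ns]
beta, _ = np.polyfit(np.log(ns), np.log(vocab), 1)
print(f"Heaps exponent: {beta:.2f} (typically ~0.5-0.8)")
```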
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbors, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today. We use 121 data sets, which represent the whole UCI data base (excluding the large-scale problems) and other real problems of our own, in order to reach significant conclusions about classifier behavior that do not depend on the data set collection. The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% in 84.3% of the data sets. However, the difference is not statistically significant with respect to the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). The random forest is clearly the best family of classifiers (3 out of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top 10), neural networks and boosting ensembles (5 and 3 members in the top 20, respectively).
Do you know Big Data? (Cheat Sheet)
Domain Adaptation for Visual Applications: A Comprehensive Survey The aim of this paper is to give an overview of domain adaptation and transfer learning with a specific view on visual applications. After a general motivation, we first position domain adaptation in the larger transfer learning problem. Second, we briefly address and analyze the state-of-the-art methods for different types of scenarios, first describing the historical shallow methods, addressing both the homogeneous and the heterogeneous domain adaptation methods. Third, we discuss the effect of the success of deep convolutional architectures, which has led to a new type of domain adaptation method that integrates the adaptation within the deep architecture. Fourth, we overview the methods that go beyond image categorization, such as object detection, image segmentation, video analysis or learning visual attributes. Finally, we conclude the paper with a section relating domain adaptation to other machine learning solutions.
Don’t Be Overwhelmed by Big Data Big. Data. The CPG industry is abuzz with those two words. And for good reason: as both brick-and-mortar and online retailers attempt to create the ideal omni-channel consumer experience that will drive increased sales, they look to Big Data for actionable insights that can be measured against key performance indicators (KPIs). And while it’s understandable that the CPG industry is excited by the prospect of more data it can use to better understand the who, what, why and when of consumer purchasing behavior, it’s critical that CPG organizations pause and ask themselves: “Are we providing retail team and executive team members with ‘quality’ data? POS, Big Data, order data or shipment summary data. It doesn’t matter. Is the right (i.e., ‘quality’) data getting to the right people at the right time?” In essence, it’s the same question retail and executive teams face when considering how to best merchandise their SKUs. If CPG organizations understand the importance of getting the right product to the right people at the right time, then surely they understand the importance of applying the same forethought to their demand data? …
Dynamic Bayesian Networks: A State of the Art
Dynamic Decision Networks for Decision-Making in Self-Adaptive Systems: A Case Study Bayesian decision theory is increasingly applied to support decision-making processes under environmental variability and uncertainty. Researchers from application areas like psychology and biomedicine have applied these techniques successfully. However, in the area of software engineering and specifically in the area of self-adaptive systems (SASs), little progress has been made in the application of Bayesian decision theory. We believe that techniques based on Bayesian Networks (BNs) are useful for systems that dynamically adapt themselves at runtime to a changing environment, which is usually uncertain. In this paper, we discuss the case for the use of BNs, specifically Dynamic Decision Networks (DDNs), to support the decision-making of self-adaptive systems. We present how such a probabilistic model can be used to support the decision-making in SASs and justify its applicability. We have applied our DDN-based approach to the case of an adaptive remote data mirroring system. We discuss results, implications and potential benefits of the DDN to enhance the development and operation of self-adaptive systems, by providing mechanisms to cope with uncertainty and automatically make the best decision.
Dynamo: Amazon’s Highly Available Key-value Store Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
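Dynamo's partitioning scheme assigns each key a position on a consistent-hash ring (with virtual nodes for load balancing) and stores it on the next N distinct nodes clockwise, the key's "preference list". A toy sketch of that idea, greatly simplified (real Dynamo adds failure handling, object versioning, and hinted handoff):

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring with virtual nodes, in the spirit of
    Dynamo's partitioning scheme."""

    def __init__(self, nodes, vnodes=64, replicas=3):
        self.replicas = replicas
        self.ring = sorted(
            (ring_hash(f"{node}#{v}"), node)
            for node in nodes for v in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def preference_list(self, key: str):
        """First `replicas` distinct nodes clockwise from the key's hash."""
        i = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
        nodes = []
        while len(nodes) < self.replicas:
            node = self.ring[i % len(self.ring)][1]
            if node not in nodes:
                nodes.append(node)
            i += 1
        return nodes

ring = HashRing(["nodeA", "nodeB", "nodeC", "nodeD"])
print(ring.preference_list("shopping-cart:12345"))  # 3 replica holders
```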

E

Easy over Hard: A Case Study on Deep Learning While deep learning is an exciting new technique, the benefits of this method need to be assessed with respect to its computational cost. This is particularly important for deep learning since these learners need hours (to weeks) to train a model. Such long CPU times limit the ability of (a) a researcher to test the stability of their conclusions via repeated runs with different random seeds; and (b) other researchers to repeat, improve, or even refute that original work. For example, recently, deep learning was used to find which questions in the Stack Overflow programmer discussion forum can be linked together. That system took 14 hours to execute. We show here that a very simple optimizer called DE (differential evolution) can achieve similar (and sometimes better) results. The DE approach terminated in 10 minutes, i.e., 84 times faster than deep learning. We offer these results as a cautionary tale to the software analytics community and suggest that not every new innovation should be applied without critical analysis. If researchers deploy some new and expensive process, that work should be baselined against some simpler and faster alternatives.
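Differential evolution itself is only mutation, crossover and selection over a small population, and SciPy ships an implementation. A sketch with a placeholder objective; in the paper's setting the objective would wrap training and scoring the software-analytics learner with the candidate parameters:

```python
from scipy.optimize import differential_evolution

# Placeholder objective: the 2-D Rosenbrock function, minimized at (1, 1).
def objective(params):
    x, y = params
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

result = differential_evolution(objective,
                                bounds=[(-2, 2), (-2, 2)],
                                seed=0)
print(result.x, result.fun)   # near (1, 1) with value ~0
```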
Econometrics in R A more advanced tutorial on econometrics with R
Economic Forecasting Forecasts guide decisions in all areas of economics and finance, and their value can only be understood in relation to, and in the context of, such decisions. We discuss the central role of the loss function in helping determine the forecaster’s objectives. Decision theory provides a framework for both the construction and evaluation of forecasts. This framework allows an understanding of the challenges that arise from the explosion in the sheer volume of predictor variables under consideration and the forecaster’s ability to entertain an endless array of forecasting models and time-varying specifications, none of which may coincide with the ‘true’ model. We illustrate this while reviewing methods for comparing the forecasting performance of pairs of models and for evaluating the ability of the best of many models to beat a benchmark specification.
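As a toy illustration of the loss function’s central role (my example, not the paper’s): under squared-error loss the optimal point forecast is the mean of the predictive distribution, while under absolute-error loss it is the median, so a skewed outcome yields two different ‘best’ forecasts.

    import numpy as np

    rng = np.random.default_rng(0)
    outcomes = rng.lognormal(0.0, 1.0, 10000)           # a skewed predictive distribution
    grid = np.linspace(0.1, 5.0, 500)                   # candidate point forecasts
    sq_loss = [np.mean((outcomes - f) ** 2) for f in grid]
    abs_loss = [np.mean(np.abs(outcomes - f)) for f in grid]
    print(grid[np.argmin(sq_loss)], outcomes.mean())        # squared loss picks ~the mean (~1.65)
    print(grid[np.argmin(abs_loss)], np.median(outcomes))   # absolute loss picks ~the median (~1.0)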
EDISON Data Science Framework The EDISON Data Science Framework is a collection of documents that define the Data Science profession. Freely available, these documents have been developed to guide educators and trainers, employers and managers, and Data Scientists themselves. Collectively, they break down the complexity of the skills and competences needed to define Data Science as a professional practice.
Effective optimization using sample persistence: A case study on quantum annealers and various Monte Carlo optimization methods We present and apply a general-purpose, multi-start algorithm for improving the performance of low-energy samplers used for solving optimization problems. The algorithm iteratively fixes the value of a large portion of the variables to values that have a high probability of being optimal. The resulting problems are smaller and less connected, and samplers tend to give better low-energy samples for these problems. The algorithm is trivially parallelizable, since each start in the multi-start algorithm is independent, and could be applied to any heuristic solver that can be run multiple times to give a sample. We present results for several classes of hard problems solved using simulated annealing, path-integral quantum Monte Carlo, parallel tempering with isoenergetic cluster moves, and a quantum annealer, and show that the success metrics as well as the scaling are improved substantially. When combined with this algorithm, the quantum annealer’s scaling was substantially improved for native Chimera graph problems. In addition, with this algorithm the scaling of the time to solution of the quantum annealer is comparable to the Hamze–de Freitas–Selby algorithm on the weak-strong cluster problems introduced by Boixo et al. Parallel tempering with isoenergetic cluster moves was able to consistently solve 3D spin glass problems with 8000 variables when combined with our method, whereas without our method it could not solve any.
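A hedged sketch of the variable-fixing step at the core of the method follows; sample_low_energy is a hypothetical stand-in for any heuristic solver (simulated annealing, a quantum annealer, etc.) that returns a low-energy +/-1 configuration.

    import numpy as np

    def persistent_fix(sample_low_energy, n_samples=20, agreement=0.9):
        # draw several low-energy +/-1 configurations from the heuristic solver
        samples = np.array([sample_low_energy() for _ in range(n_samples)])
        mean = samples.mean(axis=0)
        fixed = np.abs(mean) >= 2 * agreement - 1   # variable agrees in >= 90% of samples
        values = np.sign(mean)
        # the solver is then re-run on the smaller, less connected problem
        # in which the ~fixed variables are clamped to `values`
        return fixed, values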
Efficient Dimensionality Reduction for High-Dimensional Network Estimation We propose module graphical lasso (MGL), an aggressive dimensionality reduction and network estimation technique for a high-dimensional Gaussian graphical model (GGM). MGL achieves scalability, interpretability and robustness by exploiting the modularity property of many real-world networks. Variables are organized into tightly coupled modules and a graph structure is estimated to determine the conditional independencies among modules. MGL iteratively learns the module assignment of variables, the latent variables, each corresponding to a module, and the parameters of the GGM of the latent variables. In synthetic data experiments, MGL outperforms the standard graphical lasso and three other methods that incorporate latent variables into GGMs. When applied to gene expression data from ovarian cancer, MGL outperforms standard clustering algorithms in identifying functionally coherent gene sets and predicting survival time of patients. The learned modules and their dependencies provide novel insights into cancer biology as well as identifying possible novel drug targets.
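MGL itself is not in standard libraries, but the graphical lasso it builds on is; below is a hedged baseline sketch using scikit-learn’s GraphicalLasso (MGL additionally learns module assignments and a latent-variable GGM, which is not shown).

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))     # toy data: 200 samples, 10 variables
    gl = GraphicalLasso(alpha=0.1).fit(X)
    precision = gl.precision_              # zeros encode conditional independencies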
Efficient Estimation of Word Representations in Vector Space We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
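The architectures proposed here are widely implemented; a hedged usage sketch with gensim’s Word2Vec follows (gensim 4.x API and a toy two-sentence corpus assumed).

    from gensim.models import Word2Vec

    sentences = [["data", "science", "is", "fun"],
                 ["machine", "learning", "learns", "from", "data"]]
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1: skip-gram
    vector = model.wv["data"]                       # the 50-dimensional vector for "data"
    print(model.wv.most_similar("data", topn=2))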
Efficient Forecasting for Hierarchical Time Series Forecasting is used as the basis for business planning in many application areas such as energy, sales and traffic management. Time series data used in these areas is often hierarchically organized and thus aggregated along the hierarchy levels based on their dimensional features. Calculating forecasts in these environments is very time consuming, because forecasting consistency between hierarchy levels must be ensured. To increase the forecasting efficiency for hierarchically organized time series, we introduce a novel forecasting approach that takes advantage of the hierarchical organization. To this end, we reuse the forecast models maintained at the lowest level of the hierarchy to almost instantly derive forecast models at higher hierarchy levels. In addition, we define a hierarchical communication framework, increasing the communication flexibility and efficiency. Our experiments show significant runtime improvements for creating a forecast model at higher hierarchical levels, while still providing very high accuracy.
Efficient Processing of Deep Neural Networks: A Tutorial and Survey Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of deep neural networks to improve energy efficiency and throughput without sacrificing accuracy or increasing hardware cost are critical to enabling the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various platforms and architectures that support DNNs, and highlight key trends in recent efficient processing techniques that reduce the computation cost of DNNs either solely via hardware design changes or via joint hardware design and network algorithm changes. It will also summarize various development resources that can enable researchers and practitioners to quickly get started on DNN design, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-design, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand trade-offs between various architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
Elements of nonlinear analysis of information streams This review considers methods of nonlinear dynamics applied to the analysis of time series corresponding to information streams on the Internet. In the main, these methods are based on correlation, fractal, multifractal, wavelet, and Fourier analysis. The article is dedicated to a detailed description of these approaches and the interconnections among them. The methods and corresponding algorithms presented can be used for detecting key points in the dynamics of information processes; identifying periodicity, anomalies, self-similarity, and correlations; and forecasting various information processes. The methods discussed can form the basis for detecting information attacks, campaigns, operations, and wars.
Elite Bases Regression: A Real-time Algorithm for Symbolic Regression Symbolic regression is an important but challenging research topic in data mining. It seeks to detect the underlying mathematical model of the data. Genetic programming (GP) is one of the most popular methods for symbolic regression. However, its convergence speed might be too slow for large scale problems with a large number of variables. This drawback has become a bottleneck in practical applications. In this paper, a new non-evolutionary real-time algorithm for symbolic regression, Elite Bases Regression (EBR), is proposed. EBR generates a set of candidate basis functions coded with parse-matrix in specific mapping rules. Meanwhile, a certain number of elite bases are preserved and updated iteratively according to their correlation coefficients with respect to the target model. The regression model is then spanned by the elite bases. A comparative study between EBR and a recently proposed machine learning method for symbolic regression, Fast Function eXtraction (FFX), is conducted. Numerical results indicate that EBR can solve symbolic regression problems more effectively.
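A hedged sketch of the elite-bases idea as described follows, with an illustrative hand-picked candidate pool rather than EBR’s parse-matrix encoding: generate candidate basis functions, keep the ones most correlated with the target, and span the model by least squares.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, 300)
    y = 1.5 * np.sin(x) + 0.5 * x ** 2              # hidden target model

    candidates = {"x": x, "x^2": x ** 2, "x^3": x ** 3, "sin(x)": np.sin(x),
                  "cos(x)": np.cos(x), "exp(x)": np.exp(x), "|x|": np.abs(x)}

    # elite selection: rank candidate bases by |correlation| with the target
    corr = {name: abs(np.corrcoef(f, y)[0, 1]) for name, f in candidates.items()}
    elite = sorted(corr, key=corr.get, reverse=True)[:3]

    B = np.column_stack([candidates[n] for n in elite])
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)    # model spanned by the elite bases
    print(dict(zip(elite, coef.round(3))))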
Embedded Analytics – Empower the Citizen Data Scientist: A guide for analytics teams and line-of-business managers trying to do more with embedded analytics With the advent of technologies that connect more people, machines and processes to one another, the importance of extending advanced analytics and machine learning to your users is growing fast. But the effort to derive maximum benefit from advanced analytics is still limited in most organizations by the human element. Data scientists, seen as the only people sufficiently trained to navigate big data successfully, become the bottleneck. This paper examines how the citizen data scientist can use embedded analytics in your software applications, and how the line of business (LOB) can benefit. Powerful software tools like TIBCO Statistica™ allow trained data scientists to develop models around advanced analytics, machine learning and algorithmic business, then make them available to the LOB managers and staff who use your applications to make better decisions.
Embodied Artificial Intelligence through Distributed Adaptive Control: An Integrated Framework In this paper, we argue that the future of Artificial Intelligence research resides in two keywords: integration and embodiment. We support this claim by analyzing the recent advances of the field. Regarding integration, we note that the most impactful recent contributions have been made possible through the integration of recent Machine Learning methods (based in particular on Deep Learning and Recurrent Neural Networks) with more traditional ones (e.g. Monte-Carlo tree search, goal babbling exploration or addressable memory systems). Regarding embodiment, we note that the traditional benchmark tasks (e.g. visual classification or board games) are becoming obsolete as state-of-the-art learning algorithms approach or even surpass human performance in most of them, having recently encouraged the development of first-person 3D game platforms embedding realistic physics. Building upon this analysis, we first propose an embodied cognitive architecture integrating heterogeneous sub-fields of Artificial Intelligence into a unified framework. We demonstrate the utility of our approach by showing how major contributions of the field can be expressed within the proposed framework. We then claim that benchmarking environments need to reproduce ecologically-valid conditions for bootstrapping the acquisition of increasingly complex cognitive skills through the concept of a cognitive arms race between embodied agents.
Emotion in Reinforcement Learning Agents and Robots: A Survey This article provides the first survey of computational models of emotion in reinforcement learning (RL) agents. The survey focuses on agent/robot emotions, and mostly ignores human user emotions. Emotions are recognized as functional in decision-making by influencing motivation and action selection. Therefore, computational emotion models are usually grounded in the agent’s decision making architecture, of which RL is an important subclass. Studying emotions in RL-based agents is useful for three research fields. For machine learning (ML) researchers, emotion models may improve learning efficiency. For the interactive ML and human-robot interaction (HRI) community, emotions can communicate state and enhance user investment. Lastly, it allows affective modelling (AM) researchers to investigate their emotion theories in a successful AI agent class. This survey provides background on emotion theory and RL. It systematically addresses 1) from what underlying dimensions (e.g., homeostasis, appraisal) emotions can be derived and how these can be modelled in RL-agents, 2) what types of emotions have been derived from these dimensions, and 3) how these emotions may either influence the learning efficiency of the agent or be useful as social signals. We also systematically compare evaluation criteria, and draw connections to important RL sub-domains like (intrinsic) motivation and model-based RL. In short, this survey provides both a practical overview for engineers wanting to implement emotions in their RL agents, and identifies challenges and directions for future emotion-RL research.
Empirically Grounded Agent-Based Models of Innovation Diffusion: A Critical Review Innovation diffusion has been studied extensively in a variety of disciplines, including sociology, economics, marketing, ecology, and computer science. Traditional literature on innovation diffusion has been dominated by models of aggregate behavior and trends. However, the agent-based modeling (ABM) paradigm is gaining popularity as it captures agent heterogeneity and enables fine-grained modeling of interactions mediated by social and geographic networks. While most ABM work on innovation diffusion is theoretical, empirically grounded models are increasingly important, particularly in guiding policy decisions. We present a critical review of empirically grounded agent-based models of innovation diffusion, developing a categorization of this research based on types of agent models as well as applications. By connecting the modeling methodologies in the fields of information and innovation diffusion, we suggest that the maximum likelihood estimation framework widely used in the former is a promising paradigm for calibration of agent-based models for innovation diffusion. Although many advances have been made to standardize ABM methodology, we identify four major issues in model calibration and validation, and suggest potential solutions. Finally, we discuss open problems that are critical for the future development of empirically grounded agent-based models of innovation diffusion that enable reliable decision support for stakeholders.
Enhancing Teaching and Learning Through Educational Data Mining and Learning Analytics: An Issue Brief In data mining and data analytics, tools and techniques once confined to research laboratories are being adopted by forward-looking industries to generate business intelligence for improving decision making. Higher education institutions are beginning to use analytics for improving the services they provide and for increasing student grades and retention. The U.S. Department of Education’s National Education Technology Plan, as one part of its model for 21st-century learning powered by technology, envisions ways of using data from online learning systems to improve instruction.
Entering the Era of Data Science: Targeted Learning and the Integration of Statistics and Computational Data Analysis This outlook paper reviews the research of van der Laan’s group on Targeted Learning, a subfield of statistics that is concerned with the construction of data adaptive estimators of user-supplied target parameters of the probability distribution of the data and corresponding confidence intervals, aiming at only relying on realistic statistical assumptions. Targeted Learning fully utilizes the state of the art in machine learning tools, while still preserving the important identity of statistics as a field that is concerned with both accurate estimation of the true target parameter value and assessment of uncertainty in order to make sound statistical conclusions. We also provide a philosophical historical perspective on Targeted Learning, also relating it to the new developments in Big Data. We conclude with some remarks explaining the immediate relevance of Targeted Learning to the current Big Data movement.
Error Statistics Error statistics, as we are using that term, has a dual dimension involving philosophy and methodology. It refers to a standpoint regarding both:
1. a cluster of statistical tools, their interpretation and justification,
2. a general philosophy of science, and the roles probability plays in inductive inference.
Evaluating a classification model – What does precision and recall tell me? Once you have created a predictive model, you always need to find out how good it is. RDS helps you with this by reporting the performance of the model. All measures reported by RDS are estimated from data not used for creating the model.
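For concreteness, a small worked example of the two measures, independent of RDS: precision = TP / (TP + FP) and recall = TP / (TP + FN).

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

    precision = tp / (tp + fp)   # 3 / 4 = 0.75: of everything flagged positive, how much is real
    recall = tp / (tp + fn)      # 3 / 4 = 0.75: of everything really positive, how much was found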
Evaluating Machine Learning Models This report on evaluating machine learning models arose out of a sense of need. The content was first published as a series of six technical posts on the Dato Machine Learning Blog. I was the editor of the blog, and I needed something to publish for the next day. Dato builds machine learning tools that help users build intelligent data products. In our conversations with the community, we sometimes ran into confusion over terminology. For example, people would ask for cross-validation as a feature, when what they really meant was hyperparameter tuning, a feature we already had. So I thought, “Aha! I’ll just quickly explain what these concepts mean and point folks to the relevant sections in the user guide.”
Evaluating the Design of the R Language R is a dynamic language for statistical computing that combines lazy functional features and object-oriented programming. This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular. With millions of lines of R code available in repositories, we have an opportunity to evaluate the fundamental choices underlying the R language design. Using a combination of static and dynamic program analysis we assess the success of different language features.
Evaluation-as-a-Service: Overview and Outlook Evaluation in empirical computer science is essential to show progress and assess technologies developed. Several research domains such as information retrieval have long relied on systematic evaluation to measure progress: here, the Cranfield paradigm of creating shared test collections, defining search tasks, and collecting ground truth for these tasks has persisted up until now. In recent years, however, several new challenges have emerged that do not fit this paradigm very well: extremely large data sets, confidential data sets as found in the medical domain, and rapidly changing data sets as often encountered in industry. Also, crowdsourcing has changed the way that industry approaches problem-solving, with companies now organizing challenges and handing out monetary awards to incentivize people to work on their challenges, particularly in the field of machine learning. This white paper is based on discussions at a workshop on Evaluation-as-a-Service (EaaS). EaaS is the paradigm of not providing data sets to participants and having them work on the data locally, but rather keeping the data central and allowing access via Application Programming Interfaces (API), Virtual Machines (VM) or other possibilities to ship executables. The objectives of this white paper are to summarize and compare the current approaches and to consolidate the experiences of these approaches in order to outline the next steps of EaaS, particularly towards sustainable research infrastructures. The white paper summarizes several existing approaches to EaaS, analyzes their usage scenarios, and discusses their advantages and disadvantages. The many factors influencing EaaS are overviewed, and the environment is described in terms of the motivations of the various stakeholders, from funding agencies to challenge organizers, researchers and participants, to industry interested in supplying real-world problems for which they require solutions.
Event History Analysis Event history analysis deals with data obtained by observing individuals over time, focusing on events occurring for the individuals under observation. Important applications are to life events of humans in demography, life insurance mathematics, epidemiology, and sociology. The basic data are the times of occurrence of the events and the types of events that occur. The standard approach to the analysis of such data is to use multistate models; a basic example is finite-state Markov processes in continuous time. Censoring and truncation are defining features of the area. This review comments specifically on three areas that are current subjects of active development, all motivated by demands from applications: sampling patterns, the possibility of causal interpretation of the analyses, and the levels and interpretation of variability.
Evolutionary Multimodal Optimization: A Short Survey Real-world problems often have multiple distinct solutions. For instance, optical engineers need to tune the recording parameters to get as many optimal solutions as possible for multiple trials in the varied-line-spacing holographic grating design problem. Unfortunately, most traditional optimization techniques focus on finding a single optimal solution. They need to be applied several times; yet all solutions are not guaranteed to be found. Thus the multimodal optimization problem was proposed. In that problem, we are interested not only in a single optimal point, but also in the others. With strong parallel search capability, evolutionary algorithms are shown to be particularly effective in solving this type of problem. In particular, evolutionary algorithms for multimodal optimization usually not only locate multiple optima in a single run, but also preserve their population diversity throughout a run, resulting in their global optimization ability on multimodal functions. In addition, techniques for multimodal optimization are borrowed as diversity-maintenance techniques for other problems. In this chapter, we describe and review the state-of-the-art evolutionary algorithms for multimodal optimization in terms of methodology, benchmarking, and application.
Experiencing SAX: a Novel Symbolic Representation of Time Series Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on various data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.
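A minimal sketch of the representation described here: z-normalize, reduce dimensionality with piecewise aggregate approximation (PAA), then map segment means to symbols using breakpoints that cut the standard normal into equiprobable regions (the breakpoints shown assume a 4-letter alphabet).

    import numpy as np

    def sax(series, n_segments=8, breakpoints=(-0.674, 0.0, 0.674), alphabet="abcd"):
        x = (series - series.mean()) / series.std()       # z-normalize
        paa = x.reshape(n_segments, -1).mean(axis=1)      # PAA; length must divide evenly
        return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

    ts = np.sin(np.linspace(0, 2 * np.pi, 64))
    word = sax(ts)    # a short symbolic word: 8 letters over a 4-letter alphabet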
Experimental Analysis of Design Elements of Scalarizing Functions-based Multiobjective Evolutionary Algorithms In this paper we systematically study the importance, i.e., the influence on performance, of the main design elements that differentiate scalarizing functions-based multiobjective evolutionary algorithms (MOEAs). This class of MOEAs includes Multiobjective Genetic Local Search (MOGLS) and Multiobjective Evolutionary Algorithm Based on Decomposition (MOEA/D) and proved to be very successful in multiple computational experiments and practical applications. The two algorithms share the same common structure and differ only in two main aspects. Using three different multiobjective combinatorial optimization problems, i.e., the multiobjective symmetric traveling salesperson problem, the traveling salesperson problem with profits, and the multiobjective set covering problem, we show that the main differentiating design element is the mechanism for parent selection, while the selection of weight vectors, either random or uniformly distributed, is practically negligible if the number of uniform weight vectors is sufficiently large.
Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models With the availability of large databases and recent improvements in deep learning methodology, the performance of AI systems is reaching or even exceeding the human level on an increasing number of complex tasks. Impressive examples of this development can be found in domains such as image classification, sentiment analysis, speech understanding or strategic game playing. However, because of their nested non-linear structure, these highly successful machine learning and artificial intelligence models are usually applied in a black box manner, i.e., no information is provided about what exactly makes them arrive at their predictions. Since this lack of transparency can be a major drawback, e.g., in medical applications, the development of methods for visualizing, explaining and interpreting deep learning models has recently attracted increasing attention. This paper summarizes recent developments in this field and makes a plea for more interpretability in artificial intelligence. Furthermore, it presents two approaches to explaining predictions of deep learning models, one method which computes the sensitivity of the prediction with respect to changes in the input and one approach which meaningfully decomposes the decision in terms of the input variables. These methods are evaluated on three classification tasks.
Explanation in Artificial Intelligence: Insights from the Social Sciences There has been a recent resurgence in the area of explainable artificial intelligence as researchers and practitioners seek to provide more transparency to their algorithms. Much of this research is focused on explicitly explaining decisions or actions to a human observer, and it should not be controversial to say that, if these techniques are to succeed, the explanations they generate should have a structure that humans accept. However, it is fair to say that most work in explainable artificial intelligence uses only the researchers’ intuition of what constitutes a ‘good’ explanation. There exist vast and valuable bodies of research in philosophy, psychology, and cognitive science on how people define, generate, select, evaluate, and present explanations. This paper argues that the field of explainable artificial intelligence should build on this existing research, and reviews relevant papers from philosophy, cognitive psychology/science, and social psychology, which study these topics. It draws out some important findings, and discusses ways that these can be infused with work on explainable artificial intelligence.
Exploiting Innovative Technologies in BI and Big Data Analytics Data is worth nothing without the right technologies to facilitate its transformation into meaningful information – delivered to the right people in a timely manner – for improved decision making. Forrester recently conducted a survey with 330 global business intelligence (BI) decision makers and found strong correlations between overall company success and adoption of innovative BI, analytics, and Big Data tools. Is your company harnessing the latest innovations in BI technology to achieve your business goals?
Exponential scaling of neural algorithms – a future beyond Moore’s Law? Although the brain has long been considered a potential inspiration for future computing, Moore’s Law – the scaling property that has seen revolutions in technologies ranging from supercomputers to smart phones – has largely been driven by advances in materials science. As the ability to miniaturize transistors is coming to an end, there is increasing attention on new approaches to computation, including renewed enthusiasm around the potential of neural computation. Recent advances in neurotechnologies, many of which have been aided by computing’s rapid progression over recent decades, are now reigniting this opportunity to bring neural computation insights into broader computing applications. As we understand more about the brain, our ability to motivate new computing paradigms will continue to progress. These new approaches to computing, which we are already seeing in techniques such as deep learning, will themselves improve our ability to learn about the brain and accordingly can be projected to give rise to even further insights. Such positive feedback has the potential to change the complexion of how computing sciences and neurosciences interact, and suggests that the next form of exponential scaling in computing may emerge from our progressive understanding of the brain.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes.
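A compact sketch of the k-modes procedure as described, on toy integer-coded categories; the paper’s incremental frequency-based mode update is approximated here by recomputing modes on each pass.

    import numpy as np

    def column_mode(col):
        vals, counts = np.unique(col, return_counts=True)
        return vals[counts.argmax()]                 # most frequent category

    def k_modes(X, k, iters=10, seed=0):
        rng = np.random.default_rng(seed)
        modes = X[rng.choice(len(X), k, replace=False)].copy()
        for _ in range(iters):
            # simple matching dissimilarity: count of mismatched attributes
            d = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
            labels = d.argmin(axis=1)
            for j in range(k):                       # recompute each cluster's mode
                members = X[labels == j]
                if len(members):
                    modes[j] = [column_mode(c) for c in members.T]
        return labels, modes

    X = np.array([[0, 1, 2], [0, 1, 1], [2, 0, 0], [2, 0, 1]])  # integer-coded categories
    labels, modes = k_modes(X, k=2)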
Extraction of food consumption systems by non-negative matrix factorization (NMF) for the assessment of food choices. In Western countries where food supply is satisfactory, consumers organize their diets around a large combination of foods. It is the purpose of this paper to examine how recent nonnegative matrix factorization (NMF) techniques can be applied to food consumption data in order to understand these combinations. Such data are nonnegative by nature and of high dimension. The NMF model provides a representation of consumption data through a small number of latent vectors with nonnegative coefficients, which we call consumption systems. As the NMF approach may encourage sparsity of the data representation produced, the resulting consumption systems are easily interpretable. Beyond an illustration of its properties through a simple simulation, the NMF method is applied to data from a French consumption survey. The numerical results thus obtained are displayed and thoroughly discussed. A clustering based on the k-means method is also performed in the resulting latent consumption space, in order to recover food consumption patterns easily usable by nutritionists.
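A hedged sketch of the modeling step with scikit-learn’s NMF on a made-up nonnegative matrix (individuals by foods); the actual study uses French survey data and favors sparse representations.

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    V = rng.random((100, 12))                  # 100 consumers x 12 food groups, nonnegative
    model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
    W = model.fit_transform(V)                 # consumer loadings on 3 "consumption systems"
    H = model.components_                      # each row: a nonnegative food profile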
Extremal Quantile Regression: An Overview This chapter provides an overview of extremal quantile regression. It is forthcoming in the Handbook of Quantile Regression.
Eyes Wide Open: Open Source Analytics 1. The total cost of owning and managing analytics technology consists of hardware (price per CPU, price per unit of storage), software (price per unit/license) and human capital (price per output) costs. Human capital costs are divided between line of business (LOB) users and IT support costs.
2. Transformational advances in data storage and compute power over the past 20 years have driven hardware costs so low that adoption is nearly universal. At the same time, managing these systems has become easier, resulting in lower human capital expense in the form of training time (LOB users) and maintenance and management costs (IT). Resilient and reliable storage and compute power is now a commodity.
3. Open source storage (Apache Hadoop) and operating system (Linux) options have proliferated over the past 3+ years leading many firms to reliably experiment with low/no cost open source options to supplement or replace licensed commercial solutions.
4. In contrast, firms venturing down the open source analytics software path are not always seeing the expected cost reductions, due to higher human capital expenses and the increased risk introduced into the enterprise through open source software.
5. IIA recommends firms take a blended approach to software selection, matching the correct tool to analytics user type/role, and that firms recalculate total costs, specifically incorporating potential risks associated with open source tools, particularly in mission critical applications.

F

Factorization tricks for LSTM networks We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is ‘matrix factorization by design’ of the LSTM matrix into the product of two smaller matrices, and the second one is partitioning of the LSTM matrix, its inputs, and its states into independent groups. Both approaches allow us to train large LSTM networks significantly faster to state-of-the-art perplexity. On the One Billion Word Benchmark we improve single model perplexity down to 24.29.
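A hedged PyTorch sketch of the first trick, ‘matrix factorization by design’: replace a large weight matrix W (n_out x n_in) by the product of two smaller matrices, cutting parameters from n_out*n_in to roughly r*(n_in + n_out). It is shown as a standalone factorized linear map, not the paper’s full LSTM cell.

    import torch
    import torch.nn as nn

    class FactorizedLinear(nn.Module):
        """W (n_out x n_in) replaced by W2 @ W1 with inner rank r."""
        def __init__(self, n_in, n_out, rank):
            super().__init__()
            self.w1 = nn.Linear(n_in, rank, bias=False)   # r x n_in
            self.w2 = nn.Linear(rank, n_out)              # n_out x r

        def forward(self, x):
            return self.w2(self.w1(x))

    layer = FactorizedLinear(n_in=1024, n_out=4096, rank=128)
    y = layer(torch.randn(8, 1024))   # ~0.66M parameters vs ~4.2M for a full 4096x1024 map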
Failures of Deep Learning In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four families of problems for which some of the commonly used existing algorithms fail or suffer significant difficulty. We illustrate the failures through practical experiments, and provide theoretical insights explaining their source, and how they might be remedied.
Fairness-aware machine learning: a perspective Algorithms learned from data are increasingly used for deciding many aspects of our lives: from the movies we see, to the prices we pay, to the medicine we get. Yet there is growing evidence that decision making by inappropriately trained algorithms may unintentionally discriminate against people. For example, in automated matching of candidate CVs with job descriptions, algorithms may capture and propagate ethnicity-related biases. Several repairs for selected algorithms have already been proposed, but the underlying mechanisms by which such discrimination happens are not yet scientifically understood from a computational perspective. We need to develop a theoretical understanding of how algorithms may become discriminatory, and establish fundamental machine learning principles for prevention. We need to analyze the machine learning process as a whole to systematically explain the roots of discrimination, which will allow us to devise global machine learning optimization criteria for guaranteed prevention, as opposed to pushing empirical constraints into existing algorithms case by case. As a result, the state of the art will advance from heuristic repairing to proactive and theoretically supported prevention. This is needed not only because the law requires the protection of vulnerable people. Penetration of big data initiatives will only increase, and computer science needs to provide solid explanations and accountability to the public before public concerns lead to unnecessarily restrictive regulations against machine learning.
Fake News Detection on Social Media: A Data Mining Perspective Social media for news consumption is a double-edged sword. On the one hand, its low cost, easy access, and rapid dissemination of information lead people to seek out and consume news from social media. On the other hand, it enables the wide spread of ‘fake news’, i.e., low quality news with intentionally false information. The extensive spread of fake news has the potential for extremely negative impacts on individuals and society. Therefore, fake news detection on social media has recently become an emerging research topic that is attracting tremendous attention. Fake news detection on social media presents unique characteristics and challenges that make existing detection algorithms from traditional news media ineffective or not applicable. First, fake news is intentionally written to mislead readers to believe false information, which makes it difficult and nontrivial to detect based on news content; therefore, we need to include auxiliary information, such as user social engagements on social media, to help make a determination. Second, exploiting this auxiliary information is challenging in and of itself as users’ social engagements with fake news produce data that is big, incomplete, unstructured, and noisy. Because the issue of fake news detection on social media is both challenging and relevant, we conducted this survey to further facilitate research on the problem. In this survey, we present a comprehensive review of detecting fake news on social media, including fake news characterizations on psychology and social theories, existing algorithms from a data mining perspective, evaluation metrics and representative datasets. We also discuss related research areas, open problems, and future research directions for fake news detection on social media.
Fast unfolding of communities in large networks We propose a simple method to extract the community structure of large networks. Our method is a heuristic based on modularity optimization. It is shown to outperform all other known community detection methods in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity. This is shown first by identifying language communities in a Belgian mobile phone network of 2.6 million customers and then by analyzing a web graph of 118 million nodes and more than one billion links. The accuracy of our algorithm is also verified on ad hoc modular networks.
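The method described here became widely known as the Louvain algorithm; a hedged usage sketch via networkx follows (louvain_communities assumes networkx >= 2.8).

    import networkx as nx
    from networkx.algorithms.community import louvain_communities, modularity

    G = nx.karate_club_graph()
    parts = louvain_communities(G, seed=0)     # a list of sets of nodes
    print(len(parts), modularity(G, parts))    # number of communities and their quality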
Feature Engineering Tips for Data Scientists and Business Analysts Most data scientists and statisticians agree that predictive modeling is both art and science, yet relatively little air time is given to describing the art. This post describes one piece of the art of modeling called feature engineering, which expands the number of variables available for building a model. I offer six ways to implement feature engineering and provide examples of each. Using methods like these is important because additional relevant variables increase model accuracy, which makes feature engineering an essential part of the modeling process.
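By way of illustration (a toy example of mine, not necessarily one of the post’s six methods), three common feature-engineering moves: a ratio, a date decomposition, and a binned variable.

    import pandas as pd

    df = pd.DataFrame({"revenue": [120.0, 90.0, 400.0],
                       "visits": [60, 30, 100],
                       "signup": pd.to_datetime(["2015-01-03", "2015-06-20", "2015-11-30"])})

    df["revenue_per_visit"] = df["revenue"] / df["visits"]   # ratio feature
    df["signup_month"] = df["signup"].dt.month               # date decomposition
    df["signup_quarter"] = df["signup"].dt.quarter
    df["visit_band"] = pd.cut(df["visits"], bins=[0, 50, 150], labels=["low", "high"])  # binning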
Financial Series Prediction: Comparison Between Precision of Time Series Models and Machine Learning Methods Precise prediction of financial series has long been a difficult problem because of the instability and noise within the series. Although traditional time series models like ARIMA and GARCH have been researched and proved to be effective for prediction, their performance is still far from satisfactory. Machine learning, as an emerging research field in recent years, has brought about many incredible improvements in tasks such as regression and classification, and it is also promising to apply this methodology to financial time series prediction. In this paper, the predictive precision of traditional time series models and mainstream machine learning models, including some state-of-the-art deep learning models, is compared through experiments using real historical stock index data. The results show that machine learning, as a modern method, far surpasses traditional models in precision.
Finite Mixture Models and Model-Based Clustering Finite mixture models have a long history in statistics, having been used to model population heterogeneity, generalize distributional assumptions, and lately, for providing a convenient yet formal framework for clustering and classification. This paper provides a detailed review of mixture models and model-based clustering. Recent trends in the area, as well as open problems, are also discussed.
Firefly Algorithm for optimization problems with non-continuous variables: A Review and Analysis The firefly algorithm is a swarm-based metaheuristic algorithm inspired by the flashing behavior of fireflies. It is an effective and easy-to-implement algorithm that has been tested on problems from different disciplines and found to be effective. Even though the algorithm was proposed for optimization problems with continuous variables, it has been modified and used for problems with non-continuous variables, including binary and integer-valued problems. In this paper, a detailed review of these modifications of the firefly algorithm for problems with non-continuous variables is given. The strengths and weaknesses of the modifications, along with possible future work, are also presented.
Five big data challenges And how to overcome them with visual analytics Big data is set to offer companies tremendous insight. But with terabytes and petabytes of data pouring in to organizations today, traditional architectures and infrastructures are not up to the challenge. IT teams are burdened with ever-growing requests for data, ad hoc analyses and one-off reports. Decision makers become frustrated because it takes hours or days to get answers to questions, if at all. More users are expecting self-service access to information in a form they can easily understand and share with others. This begs the question: How do you present big data in a way that business leaders can quickly understand and use? This is not a minor consideration. Mining millions of rows of data creates a big headache for analysts tasked with sorting and presenting data. Organizations often approach the problem in one of two ways: Build “samples” so that it is easier to both analyze and present the data, or create template charts and graphs that can accept certain types of information. Both approaches miss the potential for big data. Instead, consider pairing big data with visual analytics so that you use all the data and receive automated help in selecting the best ways to present the data. This frees staff to deploy insights from data. Think of your data as a great, but messy, story. Visual analytics is the master filmmaker and the gifted editor who bring the story to life.
Five pillars of prescriptive analytics success As the Big Data Analytics space continues to evolve, one of the breakthrough technologies that businesses will be talking about in the coming years is prescriptive analytics. The promise of prescriptive analytics is certainly alluring: it enables decision-makers to not only look into the future of their mission critical processes and see the opportunities (and issues) that are potentially out there, but it also presents the best course of action to take advantage of that foresight in a timely manner. What should we look for in a prescriptive analytics solution to ensure it will deliver business value today and tomorrow?
Five Ways to Empower Business Analysts and Succeed in Your Self-Service BI Program The term self-service is ubiquitous in today’s business intelligence (BI) market. BI vendors and organizations alike constantly work to expand BI’s use and value proposition within the organization by making it more accessible to a wider variety of people. This push has created a series of BI offerings that are easy to interact with and design with, without the help of IT departments. However, there are many types of BI users that are still underserved. Primary among them are business analysts, who could make self-service BI more successful if they were empowered with higher levels of interactivity and the capabilities to design their own BI applications. Traditionally, business analysts have a broader understanding of data relationships and know how to develop their own analytics. They interact with spreadsheets, develop their own SQL scripts, and create their own databases. In many cases, business analysts are power users because they are tasked with taking ownership of data due to their level of expertise. The business analyst is the person who understands the intricacies of data and how it interrelates within the organization’s ecosystem, represents the link between their department and IT, and develops analytical insights based on their subject matter expertise. Because of this skill set, these users are tasked with developing their own complex set of analytics. They also create BI models and reports that will be consumed by employees across the organization. Luckily, some self-service BI offerings support this two-tiered approach, providing business analysts with access to the components required to do so successfully. The importance of the power user/business analyst role cannot be overlooked, as creating a successful self-service BI strategy requires consumption of analytics, design that supports this consumption, and business control to ensure that the right information is delivered to the right people and that the right business rules are applied. This represents the intersection of business and IT roles and the value of the power user. Now, more than ever, it is important to take advantage of these skill sets to drive self-service BI. The reality is that many BI projects fail when not controlled by the business. Lack of proper requirements gathering and the inability to meet the needs of users create a lack of adoption. Also, information is becoming more varied and complex. Simple tools such as Excel and Access no longer handle the complexities and increasing volumes inherent in big data, or maintain the validity of analytic models. Given this, and the increasing demand for in-house programmers and software developers, organizations need business analysts who understand the needs of the business, can perform robust analytics, and can provide consumable applications for a wide variety of business users, supporting the more technical roles required. This paper looks at five key enablers of self-service BI for business analysts. These are:
1. Building a collaborative relationship between the business analyst and IT
2. Design flexibility
3. Cohesion between technology, people, and business processes
4. Data diversity and preparation
5. Data quality
Fog Computing: A Taxonomy, Survey and Future Directions In recent years, the number of Internet of Things (IoT) devices/sensors has increased to a great extent. To support the computational demand of real-time latency-sensitive applications of largely geo-distributed IoT devices/sensors, a new computing paradigm named ‘Fog computing’ has been introduced. Generally, Fog computing resides closer to the IoT devices/sensors and extends the Cloud-based computing, storage and networking facilities. In this chapter, we comprehensively analyse the challenges in Fogs acting as an intermediate layer between IoT devices/sensors and Cloud datacentres and review the current developments in this field. We present a taxonomy of Fog computing according to the identified challenges and its key features. We also map the existing works to the taxonomy in order to identify current research gaps in the area of Fog computing. Moreover, based on the observations, we propose future directions for research.
Forecasting Time Series With Complex Seasonal Patterns Using Exponential Smoothing An innovations state space modeling framework is introduced for forecasting complex seasonal time series such as those with multiple seasonal periods, high-frequency seasonality, non-integer seasonality, and dual-calendar effects. The new framework incorporates Box–Cox transformations, Fourier representations with time varying coefficients, and ARMA error correction. Likelihood evaluation and analytical expressions for point forecasts and interval predictions under the assumption of Gaussian errors are derived, leading to a simple, comprehensive approach to forecasting complex seasonal time series. A key feature of the framework is that it relies on a new method that greatly reduces the computational burden in the maximum likelihood estimation. The modeling framework is useful for a broad range of applications, its versatility being illustrated in three empirical studies. In addition, the proposed trigonometric formulation is presented as a means of decomposing complex seasonal time series, and it is shown that this decomposition leads to the identification and extraction of seasonal components which are otherwise not apparent in the time series plot itself.
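One ingredient named here, trigonometric seasonal terms, is easy to sketch: k harmonic sine/cosine pairs for a possibly non-integer period, fitted by least squares; the time-varying coefficients and the full innovations state space model are omitted.

    import numpy as np

    def fourier_terms(t, period, k):
        # k harmonic sine/cosine pairs for one (possibly non-integer) seasonal period
        cols = []
        for j in range(1, k + 1):
            cols.append(np.sin(2 * np.pi * j * t / period))
            cols.append(np.cos(2 * np.pi * j * t / period))
        return np.column_stack(cols)

    t = np.arange(200)
    y = 2 * np.sin(2 * np.pi * t / 7.35) + np.random.default_rng(0).normal(0, 0.1, 200)
    X = fourier_terms(t, period=7.35, k=3)           # non-integer seasonality
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares seasonal fit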
Fostering a data-driven culture Fostering a data-driven culture is an Economist Intelligence Unit report, sponsored by Tableau Software. It explores the challenges in nurturing a data-driven culture, and what companies can do to meet them. The Economist Intelligence Unit bears sole responsibility for the content of this report. The findings do not necessarily reflect the views of the sponsor. The paper draws on two main sources for its research and findings:
* A survey, conducted in October 2012, of 530 senior executives from around the world. More than 40% of respondents are C-Level executives, including 23% from the CEO, president or managing director ranks and 9% CIOs. Responses come from a wide range of regions: 50% North America, 15% Asia-Pacific, 26% Western Europe and 9% Latin America. The range of company sizes is also diverse, from those with revenue of less than US$500m (53%) through to those with revenue of US$10bn or more (20%). The survey covers nearly all industries, including IT and technology (18%), financial services (17%), professional services (11%) and manufacturing (7%).
* A series of in-depth interviews with the following senior executives: Sidney Minassian, CEO, Contexti; Jerry O’Dwyer, principal, Deloitte Consulting; William Schmarzo, CTO, EMC; Colin Hill, CEO, GNS Healthcare.
We would like to thank all interviewees and survey respondents for their time and insight. The report was written by Jim Giles and edited by Gilda Stahl.
From Data Mining to Knowledge Discovery in Databases Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.
From Data Scientist to Data Artist: Data Sculpting to Shape Business Insights Discover the new role of data artist and understand the need, value and availability of powerful, flexible and affordable analytics tools that do not require an advanced degree in mathematics nor a team of information technology experts to use. Learn about the professional requirements of a data artist and how to change corporate culture with the right analytics tools in the right hands. Read about the exploits of a news organization that used those tools to change its culture and become profitable.
From Data to City Indicators: A Knowledge Graph for Supporting Automatic Generation of Dashboards In the context of Smart Cities, indicator definitions have been used to calculate values that enable the comparison among different cities. Calculating indicator values presents challenges, as the calculation may need to combine some aspects of quality while addressing different levels of abstraction. Knowledge graphs (KGs) have been used successfully to support flexible representation, which can support improved understanding and data analysis in similar settings. This paper presents an operational description for a city KG, an indicator ontology that supports indicator discovery and data visualization, and an application capable of performing metadata analysis to automatically build and display dashboards according to discovered indicators. We describe our implementation in an urban mobility setting.
From Linear Models to Machine Learning Regression analysis is both one of the oldest branches of statistics, with least-squares analysis having been first proposed way back in 1805, and also one of the newest areas, in the form of the machine learning techniques being vigorously researched today. Not surprisingly, then, there is a vast literature on the subject. Well, then, why write yet another regression book? Many books are out there already, with titles using words like regression, classification, predictive analytics, machine learning and so on. They are written by authors whom I greatly admire, and whose work I myself have found useful. Yet, I did not feel that any existing books covered the material in a manner that sufficiently provided insight for the practicing data analyst.
From Yawn to YARN: Why You Should be Excited About Hadoop 2 By now almost everyone has heard the story of the yellow elephant who never forgets data, consumes whatever data you have from any source, and magically produces a big data treasure trove of business insights for you, including tweets, telemetry, customer sentiment, sensor readings, mobile app activity, and more! In fact, the story has been told and re-told so many times now that most people’s natural reaction is… yawn. Hadoop. Big Data. Yeah, yeah. I have heard this story too many times. I google Hadoop and get almost one billion results, but I can’t yell “yahoo!” about getting paid big bucks to code big data applications in MapReduce, which those cool kids in Silicon Valley used to “money-expand” into billionaires. So why should I be excited about Hadoop 2? After all, no sequel is as good as the original. Well, in this case, the sequel is better. The story has changed. The script has flipped. Even though the new protagonist’s name sounds like yawn, the yarn about YARN is much more than yet another chapter in the same old story. More reimagining than sequel, it will take you from yawn to YARN and get you excited about Hadoop 2. This paper explains why.
Functional data clustering: a survey The main contributions to functional data clustering are reviewed. Most approaches used for clustering functional data are based on the following three methodologies: dimension reduction before clustering, nonparametric methods using specific distances or dissimilarities between curves and model-based clustering methods. These latter assume a probabilistic distribution on either the principal components or coefficients of functional data expansion into a finite dimensional basis of functions. Numerical illustrations as well as a software review are presented.
Functional Regression: A New Model for Predicting Market Penetration of New Products The Bass model has been a standard for analyzing and predicting the market penetration of new products. We demonstrate the insights to be gained and predictive performance of functional data analysis (FDA), a new class of nonparametric techniques that has shown impressive results within the statistics community, on the market penetration of 760 categories drawn from 21 products and 70 countries. We propose a new model called Functional Regression and compare its performance to several models, including the Classic Bass model, Estimated Means, Last Observation Projection, a Meta-Bass model, and an Augmented Meta-Bass model for predicting eight aspects of market penetration. Results (a) validate the logic of FDA in integrating information across categories, (b) show that Augmented Functional Regression is superior to the above models, and (c) show that product-specific effects are more important than country-specific effects when predicting penetration of an evolving new product.
Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation Many machine learning approaches are characterized by information constraints on how they interact with the training data. These include memory and sequential access constraints (e.g. fast first-order methods to solve stochastic optimization problems); communication constraints (e.g. distributed learning); partial access to the underlying data (e.g. missing features and multi-armed bandits) and more. However, we currently have little understanding of how such information constraints fundamentally affect our performance, independent of the learning problem semantics. For example, are there learning problems where any algorithm which has small memory footprint (or can use any bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this paper, we describe how a single set of results implies positive answers to the above, for several different settings.

G

Game theory models for communication between agents: a review In the real world, agents or entities are in a continuous state of interactions. These interactions lead to various types of complexity dynamics. One key difficulty in the study of complex agent interactions is modeling agent communication on the basis of rewards. Game theory offers a perspective for analyzing and modeling these interactions. While a large amount of literature on game theory is available, most of it comes from specific domains and does not cover the concepts from an agent-based perspective. In this paper, we present a comprehensive multidisciplinary state-of-the-art review and taxonomy of game theory models of complex interactions between agents.
Gaussian Processes for Regression A Quick Introduction
Gaussian Processes in Machine Learning We give a basic introduction to Gaussian Process regression models. We focus on understanding the role of the stochastic process and how it is used to define a distribution over functions. We present the simple equations for incorporating training data and examine how to learn the hyperparameters using the marginal likelihood. We explain the practical advantages of Gaussian Process and end with conclusions and a look at the current trends in GP work.
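The predictive equations this introduction refers to amount to a few lines of linear algebra. Below is a minimal sketch of GP regression in Python, assuming a squared-exponential kernel and a fixed noise variance; the kernel choice, hyperparameter values and all names are illustrative assumptions, not anything prescribed by the paper.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.1):
    """Posterior mean and pointwise variance of a zero-mean GP at the test inputs."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)                 # numerically stable solve
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                      # predictive mean
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                      # predictive covariance
    return mean, np.diag(cov)

x = np.linspace(0, 5, 8)
y = np.sin(x) + 0.1 * np.random.randn(8)
mu, var = gp_posterior(x, y, np.linspace(0, 5, 50))
```

Learning the hyperparameters from the marginal likelihood, as the paper describes, would simply wrap an optimizer around the same Cholesky computation.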
Generalized Gradient Descent (Slide Deck)
Generalized Power Method for Sparse Principal Component Analysis In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has the best convergence properties when either the objective function or the feasible set is strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed.
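For intuition only, the following sketch extracts a single sparse loading vector with a hard-thresholded power iteration. This is a simpler relative of the single-unit formulations discussed above, not the authors' GPower algorithm; the cardinality parameter, the random data and every name are assumptions of the sketch.

```python
import numpy as np

def truncated_power_pc(X, card=5, iters=200):
    """Illustrative sparse PCA: power iteration with hard cardinality truncation.
    A simpler relative of the paper's single-unit formulations, not GPower itself."""
    S = X.T @ X                                   # covariance-like matrix
    rng = np.random.default_rng(0)
    v = rng.standard_normal(S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = S @ v                                 # power step
        keep = np.argsort(np.abs(v))[-card:]      # retain the `card` largest loadings
        mask = np.zeros_like(v)
        mask[keep] = 1.0
        v *= mask                                 # zero out the rest
        v /= np.linalg.norm(v)                    # renormalize
    return v                                      # sparse loading vector

X = np.random.default_rng(1).standard_normal((100, 20))
loadings = truncated_power_pc(X, card=5)
```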
Generative Adversarial Active Learning We propose a new active learning approach using Generative Adversarial Networks (GAN). Different from regular active learning, we adaptively synthesize training instances for querying to increase learning speed. Our approach outperforms random generation using GAN alone in active learning experiments. We demonstrate the effectiveness of the proposed algorithm in various datasets when compared to other algorithms. To the best of our knowledge, this is the first active learning work using GAN.
Generative and Discriminative Text Classification with Recurrent Neural Networks We empirically characterize the performance of discriminative and generative LSTM models for text classification. We find that although RNN-based generative models are more powerful than their bag-of-words ancestors (e.g., they account for conditional dependencies across words in a document), they have higher asymptotic error rates than discriminatively trained RNN models. However we also find that generative models approach their asymptotic error rate more rapidly than their discriminative counterparts—the same pattern that Ng & Jordan (2001) proved holds for linear classification models that make more naive conditional independence assumptions. Building on this finding, we hypothesize that RNN-based generative classification models will be more robust to shifts in the data distribution. This hypothesis is confirmed in a series of experiments in zero-shot and continual learning settings that show that generative models substantially outperform discriminative models.
Generative Deep Neural Networks for Dialogue: A Short Review Researchers have recently started investigating deep neural networks for dialogue applications. In particular, generative sequence-to-sequence (Seq2Seq) models have shown promising results for unstructured tasks, such as word-level dialogue response generation. The hope is that such models will be able to leverage massive amounts of data to learn meaningful natural language representations and response generation strategies, while requiring a minimum amount of domain knowledge and hand-crafting. An important challenge is to develop models that can effectively incorporate dialogue context and generate meaningful and diverse responses. In support of this goal, we review recently proposed models based on generative encoder-decoder neural network architectures, and show that these models have better ability to incorporate long-term dialogue history, to model uncertainty and ambiguity in dialogue, and to generate responses with high-level compositional structure.
Getting Started with Apache Hadoop This Refcard presents Apache Hadoop, a software framework that enables distributed storage and processing of large datasets using simple high-level programming models. We cover the most important concepts of Hadoop, describe its architecture, and explain how to start using it as well as how to write and execute various applications on Hadoop. In a nutshell, Hadoop is an open-source project of the Apache Software Foundation that can be installed on a set of standard machines, so that these machines can communicate and work together to store and process large datasets. …
Getting Started with Spark (Slide Deck)
GIS with R (Slide Deck)
Glossary (Cheat Sheet)
Grades of Evidence (Cheat Sheet)
Gradient boosting machines, a tutorial Gradient boosting machines are a family of powerful machine-learning techniques that have shown considerable success in a wide range of practical applications. They are highly customizable to the particular needs of the application, like being learned with respect to different loss functions. This article gives a tutorial introduction to the methodology of gradient boosting methods with a strong focus on machine-learning aspects of modeling. The theoretical background is complemented with descriptive examples and illustrations which cover all the stages of the gradient boosting model design. Considerations on handling the model complexity are discussed. Three practical examples of gradient boosting applications are presented and comprehensively analyzed.
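The stagewise idea at the heart of the tutorial, repeatedly fitting a base learner to the negative gradient of the loss, fits in a few lines. Here is a minimal sketch for squared loss with shallow regression trees; the learning rate, tree depth and number of rounds are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=100, lr=0.1, depth=2):
    """Gradient boosting for squared loss: each tree fits the current residuals."""
    base = y.mean()                           # initial constant model
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                   # negative gradient of 0.5*(y - f)^2
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        pred += lr * tree.predict(X)          # shrunken additive update
        trees.append(tree)
    return trees, base

def boost_predict(trees, base, X, lr=0.1):
    return base + lr * sum(t.predict(X) for t in trees)

X = np.random.rand(200, 3)
y = np.sin(3 * X[:, 0]) + X[:, 1]
trees, base = boost(X, y)
```

Swapping in a different loss only changes the residual line, which is exactly the customizability the tutorial emphasizes.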
Graphical Models Statistical applications in fields such as bioinformatics, information retrieval, speech processing, image processing and communications often involve large-scale models in which thousands or millions of random variables are linked in complex ways. Graphical models provide a general methodology for approaching these problems, and indeed many of the models developed by researchers in these applied fields are instances of the general graphical model formalism. We review some of the basic ideas underlying graphical models, including the algorithmic ideas that allow graphical models to be deployed in large-scale data analysis problems. We also present examples of graphical models in bioinformatics, error-control coding and language processing.
Graphical Models in a Nutshell Probabilistic graphical models are an elegant framework which combines uncertainty (probabilities) and logical structure (independence constraints) to compactly represent complex, real-world phenomena. The framework is quite general in that many of the commonly proposed statistical models (Kalman filters, hidden Markov models, Ising models) can be described as graphical models. Graphical models have enjoyed a surge of interest in the last two decades, due both to the flexibility and power of the representation and to the increased ability to effectively learn and perform inference in large networks.
Graphical Models: An Extension to Random Graphs, Trees, and Other Objects In this work, we consider an extension of graphical models to random graphs, trees, and other objects. To do this, many fundamental concepts for multivariate random variables (e.g., marginal variables, Gibbs distribution, Markov properties) must be extended to other mathematical objects; it turns out that this extension is possible, as we will discuss, if we have a consistent, complete system of projections on a given object. Each projection defines a marginal random variable, allowing one to specify independence assumptions between them. Furthermore, these independencies can be specified in terms of a small subset of these marginal variables (which we call the atomic variables), allowing the compact representation of independencies by a directed graph. Projections also define factors, functions on the projected object space, and hence a projection family defines a set of possible factorizations for a distribution; these can be compactly represented by an undirected graph. The invariances used in graphical models are essential for learning distributions, not just on multivariate random variables, but also on other objects. When they are applied to random graphs and random trees, the result is a general class of models that is applicable to a broad range of problems, including those in which the graphs and trees have complicated edge structures. These models need not be conditioned on a fixed number of vertices, as is often the case in the literature for random graphs, and can be used for problems in which attributes are associated with vertices and edges. For graphs, applications include the modeling of molecules, neural networks, and relational real-world scenes; for trees, applications include the modeling of infectious diseases, cell fusion, the structure of language, and the structure of objects in visual scenes. Many classic models are particular instances of this framework.
Group theoretical methods in machine learning Ever since its discovery in 1807, the Fourier transform has been one of the mainstays of pure mathematics, theoretical physics, and engineering. The ease with which it connects the analytical and algebraic properties of function spaces; the particle and wave descriptions of matter; and the time and frequency domain descriptions of waves and vibrations make the Fourier transform one of the great unifying concepts of mathematics. Deeper examination reveals that the logic of the Fourier transform is dictated by the structure of the underlying space itself. Hence, the classical cases of functions on the real line, the unit circle, and the integers modulo n are only the beginning: harmonic analysis can be generalized to functions on any space on which a group of transformations acts. Here the emphasis is on the word group in the mathematical sense of an algebraic system obeying specific axioms. The group might even be non-commutative: the fundamental principles behind harmonic analysis are so general that they apply equally to commutative and non-commutative structures. Thus, the humble Fourier transform leads us into the depths of group theory and abstract algebra, arguably the most extensive formal system ever explored by humans. Should this be of any interest to the practitioner who has his eyes set on concrete applications of machine learning and statistical inference? Hopefully, the present thesis will convince the reader that the answer is an emphatic “yes”. One of the reasons why this is so is that groups are the mathematician’s way of capturing symmetries, and symmetries are all around us. Twentieth century physics has taught us just how powerful a tool symmetry principles are for prying open the secrets of nature. One could hardly ask for a better example of the power of mathematics than particle physics, which translates the abstract machinery of group theory into predictions about the behavior of the elementary building blocks of our universe. I believe that algebra will prove to be just as crucial to the science of data as it has proved to be to the sciences of the physical world. In probability theory and statistics it was Persi Diaconis who did much of the pioneering work in this realm, brilliantly expounded in his little book [Diaconis, 1988]. Since then, several other authors have also contributed to the field. In comparison, the algebraic side of machine learning has until now remained largely unexplored. The present thesis is a first step towards filling this gap. The two main themes of the thesis are (a) learning on domains which have non-trivial algebraic structure; and (b) learning in the presence of invariances. Learning rankings/matchings are the classic examples of the first situation, whilst rotation/translation/scale invariance in machine vision is probably the most immediate example of the latter. The thesis presents examples addressing real world problems in these two domains. However, the beauty of the algebraic approach is that it allows us to discuss these matters on a more general, abstract, level, so most of our results apply equally well to a large range of learning scenarios. The generality of our approach also means that we do not have to commit to just one learning paradigm (frequentist/Bayesian) or one group of algorithms (SVMs/graphical models/boosting/etc.).
We do find that some of our ideas regarding symmetrization and learning on groups mesh best with the Hilbert space learning framework, so in Chapters 4 and 5 we focus on this methodology, but even there we take a comparative stance, contrasting the SVM with Gaussian Processes and a modified version of the Perceptron. One of the reasons why up until now abstract algebra has not had a larger impact on the applied side of computer science is that it is often perceived as a very theoretical field, where computations are difficult if not impossible due to the sheer size of the objects at hand. For example, while permutations obviously enter many applied problems, calculations on the full symmetric group (permutation group) are seldom viable, since it has n! elements. However, recently abstract algebra has developed a strong computational side [Bürgisser et al., 1997]. The core algorithms of this new computational algebra, such as the non-commutative FFTs discussed in detail in Chapter 3, are the backbone of the bridge between applied computations and abstract theory. In addition to our machine learning work, the present thesis offers some modest additions to this field by deriving some useful generalizations of Clausen’s FFT for the symmetric group, and presenting an efficient, expandable software library implementing the transform. To the best of our knowledge, this is the first time that such a library has been made publicly available. Clearly, a thesis like this one is only a first step towards building a bridge between the theory of groups/representations and machine learning. My hope is that it will offer ideas and inspiration to both sides, as well as a few practical algorithms that I believe are directly applicable to real world problems.
Guide to Big Data 2014 Big Data, NoSQL, and NewSQL – these are the high-level concepts relating to the new, unprecedented data management and analysis challenges that enterprises and startups are now facing. Some estimates expect the amount of digital data in the world to double every two years, while other estimates suggest that 90% of the world’s current data was created in the last two years. The predictions for data growth are staggering no matter where you look, but what does that mean practically for you, the developer, the sysadmin, the product manager, or C-level leader? DZone’s 2014 Guide to Big Data is the definitive resource for learning how industry experts are handling the massive growth and diversity of data. It contains resources that will help you navigate and excel in the world of Big Data management.
Guide to Big Data Business Solutions in the Cloud The recent release of a commercial version of the Lustre parallel file system running on Amazon Web Services (AWS) was big news for business data centers facing ever expanding data analysis and storage demands. Now, Lustre, the predominant high-performing file system installed in most of the supercomputer installations around the world, could be deployed to business customers in a hardened, tested, easy to manage and fully supported distribution in the cloud. Proven to scale up to extreme levels of storage performance and capacity as measured in tens or even hundreds of petabytes, shared with and accessible to tens of thousands of clients, Lustre combines high throughput with high availability using vendor-neutral server, storage and interconnect hardware coupled with various distributions of Linux. In this Guide, we take a look at what Lustre on AWS infrastructure delivers for a broad community of business and commercial organizations struggling with the challenge of big data and demanding storage growth.
Guide to Machine Learning As the primary facilitator of data science and big data, machine learning has garnered much interest by a broad range of industries as a way to increase value of enterprise data assets. Through techniques of supervised and unsupervised statistical learning, organizations can make important predictions and discover previously unknown knowledge to provide actionable business intelligence. In this guide, we’ll examine the principles underlying machine learning based on the R statistical environment. We’ll explore machine learning with R from the open source R perspective as well as the more robust commercial perspective using Revolution Analytics’ Revolution R Enterprise (RRE) for big data deployments….

H

Hadoop Buyer’s Guide Everything you need to know about choosing the right Hadoop distribution for production
Hadoop’s Limitations for Big Data Analytics The era of ‘big data’ represents new challenges to businesses. Incoming data is exploding in complexity, variety, speed and volume, while legacy tools have not kept pace. In recent years, a new tool – Apache Hadoop – has appeared on the scene. And while it solves some big data problems, it is not magic. In order to act effectively on big data, businesses must be able to assimilate data quickly, but also must be able to explore this data for value, allowing analysts to ask and iterate their business questions quickly. Hadoop – purpose built to facilitate certain forms of batch-oriented distributed data processing – lends itself readily to the assimilation process. But it was built on fundamentals which severely limit its ability to act as an analytic database. With the rise of big data has come the rise of the analytic database platform. Even five years ago, a company could leverage a DBMS such as Oracle for a data warehouse. However, Oracle was built in a time when databases rarely exceeded a few gigabytes in size. Along with other legacy DBMSs, it cannot perform at the scale now required. Enter the analytic platform. The analytic platform allows analysts to use their existing tools and skillsets to ask new questions of big data quickly, easily, and at scales unseen previously. The de facto best practice infrastructure for big data today often consists of a processing infrastructure of systems such as Hadoop to acquire and archive the data, and an analytic platform to enable the highly iterative analysis process. But because Hadoop is still relatively new, there is a great deal of confusion about its strengths and weaknesses. This paper will discuss those topics, and concludes with guidance on how to build the complete ecosystem for big data analytics.
Handling Missing Data in Within-Trial Cost-Effectiveness Analysis: a Review with Future Guidelines Cost-Effectiveness Analyses (CEAs) alongside randomised controlled trials (RCTs) are increasingly often designed to collect resource use and preference-based health status data for the purpose of healthcare technology assessment. However, because of the way these measures are collected, they are prone to missing data, which can ultimately affect the decision of whether an intervention is good value for money. We examine how missing cost and effect outcome data are handled in RCT-based CEAs, complementing a previous review (covering 2003-2009, 88 articles) with a new systematic review (2009-2015, 81 articles) focussing on two different perspectives. First, we review the description of the missing data, the statistical methods used to deal with them, and the quality of the judgement underpinning the choice of these methods. Second, we provide guidelines on how the information about missingness and related methods should be presented to improve the reporting and handling of missing data. Our review shows that missing data in within-RCT CEAs are still often inadequately handled and the overall level of information provided to support the chosen methods is rarely satisfactory.
Hands-On Data Science with R – Text Mining Text Mining or Text Analytics applies analytic tools to learn from collections of text documents like books, newspapers, emails, etc. The goal is similar to humans learning by reading books. Using automated algorithms we can learn from massive amounts of text, much more than a human can. The material could consist of millions of newspaper articles, from which we might summarise the main themes and identify those that are of most interest to particular people.
Harness the Power of Data Visualization to Transform Your Business Data underpins the operations and strategic decisions of every business. Yet these days, data is generated faster than it can be consumed and digested, making it challenging for small to mid-size organizations to extract maximum value from this vital asset. Many decision makers – whether data analysts or senior-level executives – struggle to draw meaningful conclusions in a timely manner from the array of data available to them. Reliance on spreadsheets and specialized reporting and analysis tools only limits their flexibility and output. After all, spreadsheets were not designed for data analysis. And specialized reporting and analysis tools often lack integration with other critical business applications and processes. Moreover, dependence on the IT group for ad hoc reports slows insights and decisions, putting the company at a disadvantage. Business decision makers feel a loss of control waiting for the already overburdened IT department to generate critical reports. Savvy companies are moving beyond static graphs, spreadsheets, and reports by harnessing the power of business visualization to transform how they see, discover, and share insights hidden in their data. Because business visualization spans a broad range of options, from static to dynamic and interactive, it serves a variety of needs within organizations. As a result, those companies adopting business visualization are able to extract maximum value from the information captured throughout their environments.
Hidden Technical Debt in Machine Learning Systems Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems. We explore several ML-specific risk factors to account for in system design. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.
Hierarchical Bayesian Survival Analysis and Projective Covariate Selection in Cardiovascular Event Risk Prediction Identifying biomarkers with predictive value for disease risk stratification is an important task in epidemiology. This paper describes an application of Bayesian linear survival regression to model cardiovascular event risk in diabetic individuals with measurements available on 55 candidate biomarkers. We extend the survival model to include data from a larger set of non-diabetic individuals in an effort to increase the predictive performance for the diabetic subpopulation. We compare the Gaussian, Laplace and horseshoe shrinkage priors, and find that the last has the best predictive performance and shrinks strong predictors less than the others. We implement the projection predictive covariate selection approach of Dupuis and Robert (2003) to further search for small sets of predictive biomarkers that could provide cost-efficient prediction without significant loss in performance. In passing, we present a derivation of the projective covariate selection in a Bayesian decision theoretic framework.
Hierarchical Clustering: Objective Functions and Algorithms Hierarchical clustering is a recursive partitioning of a dataset into clusters at an increasingly finer granularity. Motivated by the fact that most work on hierarchical clustering was based on providing algorithms, rather than optimizing a specific objective, Dasgupta framed similarity-based hierarchical clustering as a combinatorial optimization problem, where a `good' hierarchical clustering is one that minimizes some cost function. He showed that this cost function has certain desirable properties. We take an axiomatic approach to defining `good' objective functions for both similarity and dissimilarity-based hierarchical clustering. We characterize a set of ‘admissible’ objective functions (that includes Dasgupta’s one) that have the property that when the input admits a `natural' hierarchical clustering, it has an optimal value. Equipped with a suitable objective function, we analyze the performance of practical algorithms, as well as develop better algorithms. For similarity-based hierarchical clustering, Dasgupta showed that the divisive sparsest-cut approach achieves an $O(\log^{3/2} n)$-approximation. We give a refined analysis of the algorithm and show that it in fact achieves an $O(\sqrt{\log n})$-approximation (Charikar and Chatziafratis independently proved that it is an $O(\sqrt{\log n})$-approximation). This improves upon the LP-based $O(\log n)$-approximation of Roy and Pokutta. For dissimilarity-based hierarchical clustering, we show that the classic average-linkage algorithm gives a factor 2 approximation, and provide a simple and better algorithm that gives a factor 3/2 approximation. Finally, we consider a `beyond-worst-case' scenario through a generalisation of the stochastic block model for hierarchical clustering. We show that Dasgupta’s cost function has desirable properties for these inputs and we provide a simple $(1+o(1))$-approximation in this setting.
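Readers who want to try the average-linkage algorithm analyzed here can do so with SciPy's standard implementation; the toy data and the cut into three clusters below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).random((20, 2))      # toy data points
D = pdist(X)                                      # pairwise dissimilarities
Z = linkage(D, method='average')                  # classic average-linkage
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters
```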
Hierarchical Temporal Memory including HTM Cortical Learning Algorithms There are many things humans find easy to do that computers are currently unable to do. Tasks such as visual pattern recognition, understanding spoken language, recognizing and manipulating objects by touch, and navigating in a complex world are easy for humans. Yet despite decades of research, we have few viable algorithms for achieving human-like performance on a computer. In humans, these capabilities are largely performed by the neocortex. Hierarchical Temporal Memory (HTM) is a technology modeled on how the neocortex performs these functions. HTM offers the promise of building machines that approach or exceed human level performance for many cognitive tasks. This document describes HTM technology. Chapter 1 provides a broad overview of HTM, outlining the importance of hierarchical organization, sparse distributed representations, and learning time-based transitions. Chapter 2 describes the HTM cortical learning algorithms in detail. Chapters 3 and 4 provide pseudocode for the HTM learning algorithms divided in two parts called the spatial pooler and temporal pooler. After reading chapters 2 through 4, experienced software engineers should be able to reproduce and experiment with the algorithms. Hopefully, some readers will go further and extend our work.
Hierarchically Supervised Latent Dirichlet Allocation We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bag-of-word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not.
High Dimensional Data Clustering Clustering in high-dimensional spaces is a recurrent problem in many domains, for example in object recognition. High-dimensional data usually live in different low-dimensional subspaces hidden in the original space. This paper presents a clustering approach which estimates the specific subspace and the intrinsic dimension of each class. Our approach adapts the Gaussian mixture model framework to high-dimensional data and estimates the parameters which best fit the data. We obtain a robust clustering method called High-Dimensional Data Clustering (HDDC). We apply HDDC to locate objects in natural images in a probabilistic framework. Experiments on a recently proposed database demonstrate the effectiveness of our clustering method for category localization.
How do we choose our default methods? The field of statistics continues to be divided into competing schools of thought. In theory one might imagine choosing the uniquely best method for each problem as it arises, but in practice we choose for ourselves (and recommend to others) default principles, models, and methods to be used in a wide variety of settings. This article briefly considers the informal criteria we use to decide what methods to use and what principles to apply in statistics problems.
How Good Are Machine Learning Clouds for Binary Classification with Good Features? We conduct an empirical study of machine learning functionalities provided by major cloud service providers, which we call machine learning clouds. Machine learning clouds hold the promise of hiding all the sophistication of running large-scale machine learning: Instead of specifying how to run a machine learning task, users only specify what machine learning task to run and the cloud figures out the rest. Raising the level of abstraction, however, rarely comes free — a performance penalty is possible. How good, then, are current machine learning clouds on real-world machine learning workloads? We study this question by presenting mlbench, a novel benchmark dataset constructed with the top winning code for all available competitions on Kaggle, as well as the results we obtained by running mlbench on machine learning clouds from both Azure and Amazon. We analyze the strengths and weaknesses of existing machine learning clouds and discuss potential future directions.
How Important is Syntactic Parsing Accuracy? An Empirical Evaluation on Sentiment Analysis Syntactic parsing, the process of obtaining the internal structure of sentences in natural languages, is a crucial task for artificial intelligence applications that need to extract meaning from natural language text or speech. Sentiment analysis is one example of application for which parsing has recently proven useful. In recent years, there have been significant advances in the accuracy of parsing algorithms. In this article, we perform an empirical, task-oriented evaluation to determine how parsing accuracy influences the performance of a state-of-the-art sentiment analysis system that determines the polarity of sentences from their parse trees. In particular, we evaluate the system using four well-known dependency parsers, including both current models with state-of-the-art accuracy and less accurate models which, however, require fewer computational resources. The experiments show that all of the parsers produce similarly good results in the sentiment analysis task, without their accuracy having any relevant influence on the results. Since parsing is currently a task with a relatively high computational cost that varies strongly between algorithms, this suggests that sentiment analysis researchers and users should prioritize speed over accuracy when choosing a parser; and parsing researchers should investigate models that improve speed further, even at some cost to accuracy.
How to Build Dashboards That Persuade, Inform and Engage Flow is powerful. Think about a great conversation you’ve had, with no awkwardness or self-consciousness: just effortless communication. In data visualization, flow is crucial. Your audience should smoothly absorb and use the information in a dashboard without distractions or turbulence. Lack of flow means lack of communication, which means failure. Psychologist Mihaly Csikszentmihalyi has studied flow extensively. Csikszentmihalyi and other researchers have found that flow is correlated with happiness, creativity, and productivity. People experience flow when their skills are engaged and they’re being challenged just the right amount. The experience is not too challenging or too easy: flow is a just-right, Goldilocks state of being. So how do you create flow for an audience? By tailoring the presentation of data to that audience. If you focus on the skills, motivations, and needs of an audience, you’ll have a better chance of creating a positive experience of flow with your dashboards. And by creating that flow, you’ll be able to persuade, inform, and engage.
How to Grow a Mind: Statistics, Structure, and Abstraction In coming to understand the world—in learning concepts, acquiring language, and grasping causal relations—our minds make inferences that appear to go far beyond the data available. How do we do it? This review describes recent approaches to reverse-engineering human learning and cognitive development and, in parallel, engineering more humanlike machine learning systems. Computational models that perform probabilistic inference over hierarchies of flexibly structured representations can address some of the deepest questions about the nature and origins of human thought: How does abstract knowledge guide learning and reasoning from sparse data? What forms does our knowledge take, across different domains and tasks? And how is that abstract knowledge itself acquired?
How to Implement an Effective Decision Management System – Embedding Analytics into Real-Time Business Decisions, Operations and Processes Once used mostly in traditional batch-type environments, analytic techniques are now being embedded into real-time business decisions, operations and processes. In fact, decision support insight should be embedded very consistently in operational systems. For example:
• When a credit card organization is processing a card swipe, fraud detection analytics should be embedded in that process.
• When analysis of sensor data over time indicates an impending problem with a mechanical process, utility grid or manufacturing system, the system should trigger some proactive intervention.
• When call center agents have a customer on the phone, or tellers have a customer at the counter, analytics behind the scenes should be giving them the information they need to customize the interaction – right now.
This ideal has historically been a challenge to implement because the niche applications used for different business functions have not been on great speaking terms. Unlike niche tools, an enterprise decision management framework extracts information from multiple sources, runs it through analytical processes, and delivers the results directly into business applications or operational systems. “Enterprise decision management is one of the hottest topics in business analytics today,” said David Duling, Director of Enterprise Decision Management R&D at SAS. “You see it on the front page of many journals, and a lot of conferences are being organized around the topic. Enterprise decision management marries the analytics that we’ve been doing at SAS with the product environments within your organization to automate routine business decisions.” This means better decisions, delivered right to the point of decision.
How to Speed up R Code: An Introduction Most calculations performed by the average R user are unremarkable in the sense that nowadays, any computer can crunch through the related code in a matter of seconds. But more and more often, heavy calculations are also performed using R, something especially true in some fields such as statistics. The user then faces total execution times that are hard to work with: hours, days, even weeks. In this paper, we show how to reduce the total execution time of various codes and discuss typical bottlenecks. As a last resort, we also show, through two examples, how to run your code on a cluster of computers (most workplaces have one) in order to make use of a larger processing power than the one available on an average computer.
How to Use an Uncommon-Sense Approach to Big Data Quality Organizations are inundated in data – terabytes, petabytes and exabytes of it. Data pours in from every conceivable direction: from operational and transactional systems, from scanning and facilities management systems, from inbound and outbound customer contact points, from mobile media and the Web. The hopeful vision of big data is that organizations will be able to harvest every byte of relevant data and use it to make supremely informed decisions. We now have the technologies to collect and store big data, but more importantly, to understand and take advantage of its full value. “The financial services industry has led the way in using analytics and big data to manage risk and curb fraud, waste and abuse – especially important in that regulatory environment,” said Scott Chastain, Director of Information Management and Delivery at SAS. “We’re also seeing a transference of big data analytics into other areas, such as health care and government. The ability to find that needle in the haystack becomes very important when you’re examining things like costs, outcomes, utilization and fraud for large populations.”
How to Use Hadoop as a Piece of the Big Data Puzzle Imagine you have a jar of multicolored candies, and you need to learn something from them, perhaps the count of blue candies relative to red and yellow ones. You could empty the jar onto a plate, sift through them and tally up your answer. If the jar held only a few hundred candies, this process would take only a few minutes. Now imagine you have four plates and four helpers. You pour out about one-fourth of the candies onto each plate. Everybody sifts through their set and arrives at an answer that they share with the others to arrive at a total. Much faster, no? That is what Hadoop does for data. Hadoop is an open-source software framework for running applications on large clusters of commodity hardware. Hadoop delivers enormous processing power – the ability to handle virtually limitless concurrent tasks and jobs – making it a remarkably low-cost complement to a traditional enterprise data infrastructure. Organizations are embracing Hadoop for several notable merits:
• Hadoop is distributed. In a high-tech twist on the adage “Many hands make light work,” data is stored on local disks of a distributed cluster of servers.
• Hadoop runs on commodity hardware. Based on the average cost per terabyte of compute capacity of a prepackaged system, Hadoop is easily 10 times cheaper for comparable computing capacity compared to higher-cost specialized hardware.
• Hadoop is fault-tolerant. Hardware failure is expected and is mitigated by data replication and speculative processing. If capacity is available, Hadoop runs multiple copies of the same task, accepting the results from the task that finishes first.
• Hadoop does not require a predefined data schema. A key benefit of Hadoop is the ability to just upload any unstructured files without having to “schematize” them first. You can dump any type of data into Hadoop and allow the consuming programs to determine and apply structure when necessary.
• Hadoop scales to handle big data. Hadoop clusters can scale to between 6,000 and 10,000 nodes and handle more than 100,000 concurrent tasks and 10,000 concurrent jobs. Yahoo! runs thousands of clusters and more than 42,000 Hadoop nodes storing more than 200 petabytes of data.
• Hadoop is fast. In a performance test, a 1,400-node cluster sorted a terabyte of data in 62 seconds; a 3,400-node cluster sorted 100 terabytes in 173 minutes. To put it in context, one terabyte contains 2,000 hours of CD-quality music; 10 terabytes could store the entire US Library of Congress print collection.
You get the idea. Hadoop handles big data. It does it fast. It redefines the possible when it comes to analyzing large volumes of data, particularly semi-structured and unstructured data (text).
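The candy-counting analogy is exactly the map/reduce pattern Hadoop implements. As a rough single-machine sketch of the same split-tally-merge idea (four worker processes stand in for cluster nodes; every name here is illustrative, and real Hadoop distributes the work across machines and disks):

```python
from collections import Counter
from multiprocessing import Pool

def count_colors(plate):
    """'Map' step: each helper tallies their own plate of candies."""
    return Counter(plate)

if __name__ == "__main__":
    jar = ["blue", "red", "yellow", "red", "blue"] * 200
    plates = [jar[i::4] for i in range(4)]        # pour the jar onto four plates
    with Pool(4) as pool:
        partial_counts = pool.map(count_colors, plates)
    total = sum(partial_counts, Counter())        # 'Reduce' step: merge the tallies
    print(total)
```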
How YARN Opens Doors to Easier Programming Tools for Hadoop 2.0 Users The emergence of YARN for the Hadoop 2.0 platform has opened the door to new tools and applications that promise to allow more companies to reap the benefits of big data in ways never before possible, with outcomes possibly never imagined. By separating the problem of cluster resource management from the data processing function, YARN offers a world beyond MapReduce: less encumbered by complex programming protocols, faster, and at a lower cost….
http://jmlr.org/papers/volume16/garcia15a/garcia15a.pdf Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classification to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning.
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, “short bytes”), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/sqrt(m). This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
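To make the estimator concrete, here is a minimal HyperLogLog sketch in Python. It implements only the core algorithm (register index from the low bits of a hash, leading-zero rank from the rest, harmonic-mean combination) and omits the paper's small- and large-range corrections; the choice of hash function and of b is an assumption of the sketch.

```python
import hashlib

def hll_estimate(items, b=10):
    """Minimal HyperLogLog: m = 2**b registers, standard error about 1.04/sqrt(m)."""
    m = 1 << b
    registers = [0] * m
    for x in items:
        h = int(hashlib.sha1(str(x).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        j = h & (m - 1)                        # register index: low b bits
        w = h >> b                             # remaining 64 - b bits
        rho = (64 - b) - w.bit_length() + 1    # rank: position of the leftmost 1-bit
        registers[j] = max(registers[j], rho)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias-correction constant (for m >= 128)
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(hll_estimate(range(100000)))             # close to 100000, within a few percent
```

With b = 10 the sketch uses 1024 registers, giving a typical relative error around 3%, consistent with the 1.04/sqrt(m) accuracy quoted above.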
Hypervariate Data Visualization Both scientists and everyday users face enormous amounts of data, which might be useless if no insight is gained from it. To gain such insight, visualization techniques can be used. Many datasets have a dimensionality higher than three. Such data is called “hypervariate” and cannot be visualized directly in the three-dimensional space that we inhabit. Therefore, a wide variety of specialized techniques have been created for rendering hypervariate data. These techniques are based on very different principles and are designed for very different areas of application. This paper gives an overview of six representative techniques. For most techniques a rendering of a common dataset is provided to allow an easier comparison. Furthermore, an evaluation of the strengths and weaknesses of each technique is given. As an outlook, two papers dealing with quantitative analysis of visualization methods are presented.
Hypervariate Information Visualization In the last 20 years improvements in the computer sciences made it possible to store large data sets containing a plethora of different data attributes and data values, which could be applied in different application domains, for example, in the natural sciences, in law enforcement or in social studies. Due to this increasing data complexity in modern times, it is crucial to support the exploration of hypervariate data with different visualization techniques. These facts form the foundation of this paper, which shows how information visualization can support the understanding of data with high dimensionality. Furthermore, it gives an overview and a comparison of the different categories of hypervariate information visualization, in order to analyse the advantages and the disadvantages of each category. We also address the different interaction methods that help to create an understandable visualization and thus facilitate the user’s visual exploration. Interactive techniques are useful to create an understandable visualization of the relationships in a large data set. At the end, we also discuss the possibility of merging different interactions and visualization techniques.

I

ILNumerics: Numeric Computing for Industry Most enterprise software nowadays gets created by means of managed software frameworks. In the past they have often failed to deliver the speed required for professional data analysis and scientific computing. The ILNumerics Computing Engine offers a new approach for the integration of numerical algorithms into technical applications.
Image Segmentation Algorithms Overview The technology of image segmentation is widely used in medical image processing, face recognition, pedestrian detection, etc. The current image segmentation techniques include region-based segmentation, edge detection segmentation, segmentation based on clustering, segmentation based on weakly-supervised learning in CNN, etc. This paper analyzes and summarizes these algorithms of image segmentation, and compares the advantages and disadvantages of different algorithms. Finally, we make a prediction of the development trend of image segmentation with the combination of these algorithms.
Implementation of a Practical Distributed Calculation System with Browsers Deep learning can achieve outstanding results in various fields. However, it requires such significant computational power that graphics processing units (GPUs) and/or numerous computers are often required for practical application. We have developed a new distributed calculation framework called “Sashimi” that allows any computer to be used as a distribution node only by accessing a website. We have also developed a new JavaScript neural network framework called “Sukiyaki” that uses general purpose GPUs with web browsers. Sukiyaki performs 30 times faster than a conventional JavaScript library for deep convolutional neural networks (deep CNNs) learning. The combination of Sashimi and Sukiyaki, as well as new distribution algorithms, demonstrates the distributed deep learning of deep CNNs only with web browsers on various devices. The libraries that comprise the proposed methods are available under MIT license at http://…/.
IMSL C Math Library Version 8.5.0 The IMSL C Math Library, a component of the IMSL C Numerical Library, is a library of C functions useful in scientific programming. Each function is designed and documented for use in research activities as well as by technical specialists. A number of the example programs also show graphs of resulting output.
IMSL C Stat Library – Version 8.5.0 The IMSL C Stat Library, a component of the IMSL C Numerical Library, is a library of C functions useful in scientific programming. Each function is designed and documented to be used in research activities as well as by technical specialists. A number of the example programs also show graphs of resulting output.
Independent Component Analysis: Algorithms and Applications A fundamental problem in neural network research, as well as in many other disciplines, is finding a suitable representation of multivariate data, i.e. random vectors. For reasons of computational and conceptual simplicity, the representation is often sought as a linear transformation of the original data. In other words, each component of the representation is a linear combination of the original variables. Well-known linear transformation methods include principal component analysis, factor analysis, and projection pursuit. Independent component analysis (ICA) is a recently developed method in which the goal is to find a linear representation of nongaussian data so that the components are statistically independent, or as independent as possible. Such a representation seems to capture the essential structure of the data in many applications, including feature extraction and signal separation. In this paper, we present the basic theory and applications of ICA, and our recent work on the subject.
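As a quick illustration of the blind source separation problem ICA addresses, the sketch below mixes two synthetic signals and recovers them with scikit-learn's FastICA. The sources, the mixing matrix and the use of this particular implementation are all assumptions of the sketch rather than anything from the paper.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # two independent signals
A = np.array([[1.0, 0.5], [0.5, 1.0]])                  # unknown mixing matrix
mixed = sources @ A.T                                   # observed linear mixtures

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mixed)   # source estimates, up to scale and order
```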
Inferential Methods to Assess the Difference The area under the curve (AUC) is the most common statistical approach to evaluate the discriminatory power of a set of factors in a binary regression model. A nested model framework is used to ascertain whether the AUC increases when new factors enter the model. Two statistical tests are proposed for the difference in the AUC parameters from these nested models. The asymptotic null distributions for the two test statistics are derived from the scenarios: (A) the difference in the AUC parameters is zero and the new factors are not associated with the binary outcome, (B) the difference in the AUC parameters is less than a strictly positive value. A confidence interval for the difference in AUC parameters is developed. Simulations are generated to determine the finite sample operating characteristics of the tests and a pancreatic cancer data example is used to illustrate this approach.
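A sketch of the nested-model setup the paper formalizes: fit a binary regression model with and without the new factors and compare the two AUCs. This yields only the point estimate of the AUC difference on held-out data; the paper's contribution is the asymptotic tests and confidence interval around such a difference. The synthetic data and all names are assumptions of the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

reduced = LogisticRegression(max_iter=1000).fit(X_tr[:, :3], y_tr)  # nested model
full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)            # adds new factors
auc_reduced = roc_auc_score(y_te, reduced.predict_proba(X_te[:, :3])[:, 1])
auc_full = roc_auc_score(y_te, full.predict_proba(X_te)[:, 1])
print(auc_full - auc_reduced)   # point estimate of the difference in AUC
```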
Information Limits of Aggregate Data This paper uses a small model in the Cowles Commission (CC) tradition to examine the limits of aggregate data. It argues that more can be learned about the macroeconomy following the CC approach than the reduced form and VAR approaches allow, but less than the DSGE approach tries to do.
Information Visualization with Self-Organizing Maps The Self-Organizing Map (SOM) is an unsupervised neural network algorithm that projects high- dimensional data onto a two-dimensional map. The projection preserves the topology of the data so that similar data items will be mapped to nearby locations on the map. Despite the popular use of the algorithm for clustering and information visualisation, a system has been lacking that combines the fast execution of the algorithm with powerful visualisation of the maps and effective tools for their interactive analysis. Powerful methods for interactive exploration and search from collections of free-form textual documents are needed to manage the ever-increasing flood of digital information. In this article we present a method, SOM, for automatic organization of full-text document collections using the self-organizing map (SOM) algorithm. The document collection is ordered onto a map in an unsupervised manner utilizing statistical information of short word contexts. The resulting ordered map where similar documents lie near each other thus presents a general view of the document space. With the aid of a suitable (SVG) interface, documents in interesting areas of the map can be browsed.
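For readers curious about the algorithm's mechanics, a minimal SOM training loop follows: each sample pulls its best-matching unit and that unit's grid neighbours toward it, with a shrinking learning rate and neighbourhood. The grid size and decay schedules are illustrative choices, not those of the article.

```python
import numpy as np

def train_som(data, grid=(10, 10), iters=2000, lr0=0.5, sigma0=3.0):
    """Minimal self-organizing map trained by stochastic updates."""
    rng = np.random.default_rng(0)
    h, w = grid
    weights = rng.random((h, w, data.shape[1]))
    coords = np.dstack(np.meshgrid(np.arange(h), np.arange(w), indexing='ij'))
    for t in range(iters):
        x = data[rng.integers(len(data))]
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(dists.argmin(), dists.shape)   # best-matching unit
        frac = t / iters
        lr = lr0 * (1 - frac)                                 # decaying learning rate
        sigma = sigma0 * (1 - frac) + 0.5                     # decaying neighbourhood
        grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
        theta = np.exp(-grid_d2 / (2 * sigma ** 2))           # neighbourhood kernel
        weights += lr * theta[:, :, None] * (x - weights)     # pull units toward x
    return weights

som = train_som(np.random.default_rng(1).random((500, 3)))
```

After training, similar inputs land on nearby units, which is the topology preservation the article relies on for document map visualization.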
Infovis and Statistical Graphics: Different Goals, Different Looks The importance of graphical displays in statistical practice has been recognized sporadically in the statistical literature over the past century, with wider awareness following Tukey’s Exploratory Data Analysis (1977) and Tufte’s books in the succeeding decades. But statistical graphics still occupies an awkward in-between position: Within statistics, exploratory and graphical methods represent a minor subfield and are not well integrated with larger themes of modeling and inference. Outside of statistics, infographics (also called information visualization or Infovis) is huge, but their purveyors and enthusiasts appear largely to be uninterested in statistical principles. We present here a set of goals for graphical displays discussed primarily from the statistical point of view and discuss some inherent contradictions in these goals that may be impeding communication between the fields of statistics and Infovis. One of our constructive suggestions, to Infovis practitioners and statisticians alike, is to try not to cram into a single graph what can be better displayed in two or more. We recognize that we offer only one perspective and intend this article to be a starting point for a wide-ranging discussion among graphics designers, statisticians, and users of statistical methods. The purpose of this article is not to criticize but to explore the different goals that lead researchers in different fields to value different aspects of data visualization.
InsideBIGDATA Guide to In-Memory Computing In-memory computing (IMC) is an emerging field of importance in the big data industry. It is a quickly evolving technology, seen by many as an effective way to address the proverbial 3 V’s of big data – volume, velocity, and variety. Big data requires ever more powerful means to process and analyze growing stores of data, being collected at more rapid rates, and with increasing diversity in the types of data being sought – both structured and unstructured. In-memory computing’s rapid rise in the marketplace has the big data community on alert. In fact, Gartner picked in-memory computing as one of the Top Ten Strategic Initiatives.
InsideBIGDATA Guide to Predictive Analytics Predictive analytics, sometimes called advanced analytics, is a term used to describe a range of analytical and statistical techniques to predict future actions or behaviors. In business, predictive analytics are used to make proactive decisions and determine actions, by using statistical models to discover patterns in historical and transactional data to uncover likely risks and opportunities. Predictive analytics incorporates a range of activities which we will explore in this paper, including data access, exploratory data analysis and visualization, developing assumptions and data models, applying predictive models, then estimating and/or predicting future outcomes.
Installing R and Optional RStudio R is quickly becoming the statistical software of choice for researchers and analysts in a variety of disciplines. In recent years, it has surpassed many commonly used statistical programs in both number of users and availability of statistical methods. A fundamental difference between R and other statistical software packages is that R is open source, meaning it is both free to download and its source code is available under the GNU General Public License. Anyone can contribute new techniques or analytical methods, which has been a primary factor enabling the growth of R. These contributions are called ‘packages’. Currently, almost 6000 packages are available for R.
Integrated Analytics – Platforms and Principles for Centralizing Your Data Companies are collecting more data than ever. But, given how difficult it is to unify the many internal and external data streams they’ve built, more data doesn’t necessarily translate into better analytics. The real challenge is to provide deep and broad access to “a single source of truth” in their data that the typically slow ETL process for data warehousing cannot achieve. More than just fast access, analysts need the ability to explore data at a granular level. In this O’Reilly report, author Courtney Webster presents a roadmap to data centralization that will help your organization make data accessible, flexible, and actionable. Building a genuine data-driven culture depends on your company’s ability to quickly act upon new findings. This report explains how.
Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads The issue of determining ‘the right number of clusters’ in K-Means has attracted considerable interest, especially in recent years. Cluster intermix appears to be the factor most affecting the clustering results. This paper proposes an experimental setting for comparison of different approaches on data generated from Gaussian clusters with controlled parameters of between- and within-cluster spread to model cluster intermix. The setting allows for evaluating centroid recovery on par with the conventional evaluation of cluster recovery. The subjects of our interest are two versions of the ‘intelligent’ K-Means method, iK-Means, that find the ‘right’ number of clusters by extracting ‘anomalous patterns’ from the data one by one. We compare them with seven other methods, including Hartigan’s rule, averaged Silhouette width and the Gap statistic, under different between- and within-cluster spread-shape conditions. There are several consistent patterns in the results of our experiments, such as that the right K is reproduced best by Hartigan’s rule – but not the clusters or their centroids. This leads us to propose an adjusted version of iK-Means, which performs well in the current experimental setting.
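For readers who want a quick baseline for this problem, here is a minimal scikit-learn sketch of one of the competitor methods named above, averaged silhouette width (not the iK-Means procedure itself); the data and parameter ranges are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic Gaussian clusters, in the spirit of the experimental setting above
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # averaged silhouette width

best_k = max(scores, key=scores.get)
print(scores, "-> chosen K:", best_k)
```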
Intelligent Data Analysis (Slide Deck)
Interactive Web Apps with shiny (Cheat Sheet)
Intercomparison of Machine Learning Methods for Statistical Downscaling: The Case of Daily and Extreme Precipitation Statistical downscaling of global climate models (GCMs) allows researchers to study local climate change effects decades into the future. A wide range of statistical models have been applied to downscaling GCMs, but recent advances in machine learning have not been explored. In this paper, we compare four fundamental statistical methods, Bias Correction Spatial Disaggregation (BCSD), Ordinary Least Squares, Elastic-Net, and Support Vector Machine, with three more advanced machine learning methods, Multi-task Sparse Structure Learning (MSSL), BCSD coupled with MSSL, and Convolutional Neural Networks, to downscale daily precipitation in the Northeast United States. Metrics evaluating each method’s ability to capture daily anomalies, large-scale climate shifts, and extremes are analyzed. We find that linear methods, led by BCSD, consistently outperform non-linear approaches. The direct application of state-of-the-art machine learning methods to statistical downscaling does not provide improvements over simpler, longstanding approaches.
Interpreting Blackbox Models via Model Extraction Interpretability has become an important issue as machine learning is increasingly used to inform consequential decisions. We propose an approach for interpreting a blackbox model by extracting a decision tree that approximates the model. Our model extraction algorithm avoids overfitting by leveraging blackbox model access to actively sample new training points. We prove that as the number of samples goes to infinity, the decision tree learned using our algorithm converges to the exact greedy decision tree. In our evaluation, we use our algorithm to interpret random forests and neural nets trained on several datasets from the UCI Machine Learning Repository, as well as control policies learned for three classical reinforcement learning problems. We show that our algorithm improves over a baseline based on CART on every problem instance. Furthermore, we show how an interpretation generated by our approach can be used to understand and debug these models.
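A minimal sketch of the extraction idea with scikit-learn: a shallow decision tree is trained on labels produced by the blackbox itself. The paper's active-sampling scheme is replaced here by naive perturbation of existing points, so this shows the general approach, not the authors' algorithm.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
blackbox = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# use blackbox access to label extra synthetic points near the data
# (a crude stand-in for the paper's active sampling)
rng = np.random.default_rng(0)
X_extra = X[rng.integers(len(X), size=2000)] + rng.normal(scale=0.05, size=(2000, X.shape[1]))
X_aug = np.vstack([X, X_extra])
y_aug = blackbox.predict(X_aug)              # labels come from the blackbox, not the truth

surrogate = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_aug, y_aug)
fidelity = (surrogate.predict(X) == blackbox.predict(X)).mean()
print("surrogate fidelity to blackbox:", round(fidelity, 3))
```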
Introducing Connection Analytics Connection Analytics provides a new way of looking at people, products, physical phenomena, or events. It provides insights by dissecting the types of relationships between entities to determine causation and can be used for generating predictive intelligence based on the patterns of interactions. Connection Analytics can address queries such as identifying influencers, the groups that they influence, and where promotions or other forms of marketing are best directed. It can be utilized for product affinity analysis by taking a bottom-up look at how the decisions to buy different items are linked. Likewise, this approach can help analyze networks by patterns of activity, and fraud and money laundering through the actions (rather than identities) of involved actors. It can help segment customers based on behavior patterns like past purchase behavior or reviews, rather than traditional segmentation criteria like income and demographics. Graph analytics is one of the most promising approaches to performing Connection Analytics. Teradata is the first analytics data platform provider to make graph computing accessible to the existing base of data scientists, database developers and business analysts by introducing a SQL-friendly approach. Under the hood, Teradata Aster uses a compute approach that leverages the power and performance of massively parallel analytic processing engines and pre-built algorithms.
Introducing R R is a powerful environment for statistical computing which runs on several platforms. These notes are written especially for users running the Windows version, but most of the material applies to the Mac and Linux versions as well.
Introduction to Boosted Trees (Slide Deck)
Introduction to Machine Learning and Soft Computing Contents: Introduction; Single-layer Neural Networks; Linear Classification; Linear Regression; Kernel; Multi-layer Neural Networks; Nonlinear Classification; Nonlinear Regression; Model Selection; GA-based Frameworks; PSO-based Frameworks; Conclusion; Epilogue.
Introduction to Markov Random Fields This book sets out to demonstrate the power of the Markov random field (MRF) in vision. It treats the MRF both as a tool for modeling image data and, coupled with a set of recently developed algorithms, as a means of making inferences about images. The inferences concern underlying image and scene structure to solve problems such as image reconstruction, image segmentation, 3D vision, and object labeling. This chapter is designed to present some of the main concepts used in MRFs, both as a taster and as a gateway to the more detailed chapters that follow, as well as a stand-alone introduction to MRFs.
Introduction to Network Theory (& Graph Theory) (Slide Deck)
Introduction to Neural Networks In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese samples to a taste response.
Introduction to Probabilistic Topic Models Probabilistic topic models are a suite of algorithms whose aim is to discover the hidden thematic structure in large archives of documents. In this article, we review the main ideas of this field, survey the current state-of-the-art, and describe some promising future directions. We first describe latent Dirichlet allocation (LDA) [8], which is the simplest kind of topic model. We discuss its connections to probabilistic modeling, and describe two kinds of algorithms for topic discovery. We then survey the growing body of research that extends and applies topic models in interesting ways. These extensions have been developed by relaxing some of the statistical assumptions of LDA, incorporating meta-data into the analysis of the documents, and using similar kinds of models on a diversity of data types such as social networks, images and genetics. Finally, we give our thoughts as to some of the important unexplored directions for topic modeling. These include rigorous methods for checking models built for data exploration, new approaches to visualizing text and other high dimensional data, and moving beyond traditional information engineering applications towards using topic models for more scientific ends.
Introduction to stream: An Extensible Framework for Data Stream Clustering Research with R In recent years, data streams have become an increasingly important area of research for the computer science, database and statistics communities. Data streams are ordered and potentially unbounded sequences of data points created by a typically non-stationary data generating process. Common data mining tasks associated with data streams include clustering, classification and frequent pattern mining. New algorithms for these types of data are proposed regularly and it is important to evaluate them thoroughly under standardized conditions. In this paper we introduce stream, a research tool that includes modeling and simulating data streams as well as an extensible framework for implementing, interfacing and experimenting with algorithms for various data stream mining tasks. The main advantage of stream is that it seamlessly integrates with the large existing infrastructure provided by R. In addition to data handling, plotting and easy scripting capabilities, R also provides many existing algorithms and enables users to interface code written in many programming languages popular among data mining researchers (e.g., C/C++, Java and Python). In this paper we describe the architecture of stream and focus on its use for data stream clustering research. stream was implemented with extensibility in mind and will be extended in the future to cover additional data stream mining tasks like classification and frequent pattern mining.
Introduction to YARN Apache Hadoop 2.0 includes YARN, which separates the resource management and processing components. The YARN-based architecture is not constrained to MapReduce. This article describes YARN and its advantages over the previous distributed processing layer in Hadoop. Learn how to enhance your clusters with YARN’s scalability, efficiency, and flexibility.
Introduction: Credibility, Models, and Parameters The goal of this chapter is to introduce the conceptual framework of Bayesian data analysis. Bayesian data analysis has two foundational ideas. The first idea is that Bayesian inference is reallocation of credibility across possibilities. The second foundational idea is that the possibilities, over which we allocate credibility, are parameter values in meaningful mathematical models. These two fundamental ideas form the conceptual foundation for every analysis in this book. Simple examples of these ideas are presented in this chapter. The rest of the book merely fills in the mathematical and computational details for specific applications of these two ideas. This chapter also explains the basic procedural steps shared by every Bayesian analysis.
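The first foundational idea can be shown numerically with a grid approximation (a toy coin-bias example of our own, not taken from the book): Bayes' rule reallocates credibility across candidate parameter values as data arrive.

```python
import numpy as np

theta = np.linspace(0, 1, 101)        # candidate parameter values (coin bias)
prior = np.ones_like(theta)           # uniform credibility to start
prior /= prior.sum()

heads, flips = 6, 9                   # illustrative data, chosen arbitrarily
likelihood = theta**heads * (1 - theta)**(flips - heads)

posterior = prior * likelihood        # Bayes' rule, up to a constant
posterior /= posterior.sum()          # renormalize: credibility reallocated

print("posterior mode:", theta[np.argmax(posterior)])   # near 6/9
```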

J

JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling JAGS is a program for Bayesian Graphical modelling which aims for compatibility with classic BUGS. The program could eventually be developed as an R package. This article explains the motivations for this program, briefly describes the architecture and then discusses some ideas for a vectorized form of the BUGS language.
Julia for R Programmers (Slide Deck)

K

Kernel clustering: Breiman’s bias and solutions Clustering is widely used in data analysis where kernel methods are particularly popular due to their generality and discriminating power. However, kernel clustering has a practically significant bias to small dense clusters, e.g. empirically observed in (Shi & Malik, TPAMI’00). Its causes have never been analyzed and understood theoretically, even though many attempts were made to improve the results. We provide conditions and formally prove this bias in kernel clustering. Moreover, we show a general class of locally adaptive kernels directly addressing these conditions. Previously, (Breiman, ML’96) proved a bias to histogram mode isolation in discrete Gini criterion for decision tree learning. We found that kernel clustering reduces to a continuous generalization of Gini criterion for a common class of kernels where we prove a bias to density mode isolation and call it Breiman’s bias. These theoretical findings suggest that a principal solution for the bias should directly address data density inhomogeneity. In particular, our density law shows how density equalization can be done implicitly using certain locally adaptive geodesic kernels. Interestingly, a popular heuristic kernel in (Zelnik-Manor and Perona, NIPS’04) approximates a special case of our Riemannian kernel framework. Our general ideas are relevant to any algorithms for kernel clustering. We show many synthetic and real data experiments illustrating Breiman’s bias and its solution. We anticipate that theoretical understanding of kernel clustering limitations and their principled solutions will be important for a broad spectrum of data analysis applications in diverse disciplines.
Kernel Density Estimation with Ripley’s Circumferential Correction In this paper, we investigate (and extend) Ripley’s circumference method to correct the bias of density estimation at edges (or frontiers) of regions. The idea of the method was theoretical and difficult to implement. We provide a simple technique, based on properties of Gaussian kernels, to efficiently compute weights that correct border bias on frontiers of the region of interest, with an automatic selection of an optimal radius for the method. We illustrate the use of this technique to visualize hot spots of car accidents and campsite locations, as well as locations of bike thefts.
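The circumferential correction itself is the paper's contribution; as a simpler stand-in that exhibits the same border-bias problem and a classic fix, this sketch compares a naive kernel density estimate with a reflection-corrected one at an edge.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=2000)   # true density has an edge at 0 (f(0) = 1)

naive = gaussian_kde(x)
# reflection correction: mirror the data across the edge, then double the
# estimate on the valid side so the leaked mass is restored
reflected = gaussian_kde(np.concatenate([x, -x]))

print("naive estimate at 0:    ", naive(0)[0])        # biased low near the edge
print("corrected estimate at 0:", 2 * reflected(0)[0])  # close to the true 1.0
```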
Kernel Mean Embedding of Distributions: A Review and Beyond A Hilbert space embedding of distributions—in short, kernel mean embedding—has recently emerged as a powerful machinery for probabilistic modeling, statistical inference, machine learning, and causal discovery. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It gave rise to a great deal of research and novel applications of positive definite kernels. The goal of this survey is to give a comprehensive review of existing works and recent advances in this research area, and to discuss some of the most challenging issues and open problems that could potentially lead to new research directions. The survey begins with a brief introduction to the RKHS and positive definite kernels which forms the backbone of this survey, followed by a thorough discussion of the Hilbert space embedding of marginal distributions, theoretical guarantees, and a review of its applications. The embedding of distributions enables us to apply RKHS methods to probability measures, which prompts a wide range of applications such as kernel two-sample testing, independence testing, group anomaly detection, and learning on distributional data. Next, we discuss the Hilbert space embedding for conditional distributions, give theoretical insights, and review some applications. The conditional mean embedding enables us to perform sum, product, and Bayes’ rules—which are ubiquitous in graphical models, probabilistic inference, and reinforcement learning—in a non-parametric way using the new representation of distributions in RKHS. We then discuss relationships between this framework and other related areas. Lastly, we give some suggestions on future research directions.
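One application listed above, kernel two-sample testing, reduces to the distance between two empirical mean embeddings. A minimal NumPy sketch of the (biased) MMD statistic with an RBF kernel; real tests add unbiased estimators and a permutation-based rejection threshold.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    # pairwise RBF kernel matrix between rows of A and B
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def mmd2(X, Y, gamma=0.5):
    # squared distance between the two empirical kernel mean embeddings
    return rbf(X, X, gamma).mean() + rbf(Y, Y, gamma).mean() - 2 * rbf(X, Y, gamma).mean()

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(300, 2))
Y = rng.normal(0.5, 1, size=(300, 2))   # shifted distribution
print("MMD^2, same distribution:     ", mmd2(X, rng.normal(0, 1, size=(300, 2))))
print("MMD^2, different distribution:", mmd2(X, Y))
```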
Know-Evolve: Deep Reasoning in Temporal Knowledge Graphs Knowledge graphs are important tools to model multi-relational data that serves as an information pool for various applications. Traditionally, these graphs are considered to be static in nature. However, the recent availability of large-scale event-based interaction data has given rise to dynamically evolving knowledge graphs that contain temporal information for each edge. Reasoning over time in such graphs is not yet well understood. In this paper, we present a novel deep evolutionary knowledge network architecture to learn entity embeddings that can dynamically and non-linearly evolve over time. We further propose a multivariate point process framework to model the occurrence of a fact (edge) in continuous time. To facilitate temporal reasoning, the learned embeddings are used to compute a relationship score that further parametrizes the intensity function of the point process. We demonstrate improved performance over various existing relational learning models on two large-scale real-world datasets. Further, our method effectively predicts the occurrence or recurrence time of a fact, which is novel compared to prior reasoning approaches in the multi-relational setting.
Knowledge Transfer for Out-of-Knowledge-Base Entities: A Graph Neural Network Approach Knowledge base completion (KBC) aims to predict missing information in a knowledge base. In this paper, we address the out-of-knowledge-base (OOKB) entity problem in KBC: how to answer queries concerning test entities not observed at training time. Existing embedding-based KBC models assume that all test entities are available at training time, making it unclear how to obtain embeddings for new entities without costly retraining. To solve the OOKB entity problem without retraining, we use graph neural networks (Graph-NNs) to compute the embeddings of OOKB entities, exploiting the limited auxiliary knowledge provided at test time. The experimental results show the effectiveness of our proposed model in the OOKB setting. Additionally, in the standard KBC setting in which OOKB entities are not involved, our model achieves state-of-the-art performance on the WordNet dataset. The code and dataset are available at https://…/GNN-for-OOKB. This paper has been accepted by IJCAI17.

L

L2 Regularization versus Batch and Weight Normalization Batch Normalization is a commonly used trick to improve the training of deep neural networks. These neural networks use L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence, both in theory, and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the influence of normalization on the learning rate. This leads to a discussion on other ways to mitigate this issue.
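The scale-invariance argument at the heart of this claim is easy to check numerically: with batch normalization applied after a linear layer, shrinking the weights, as weight decay would, leaves the layer output unchanged. A NumPy sketch:

```python
import numpy as np

def batchnorm(z, eps=1e-5):
    # normalize each feature over the batch (no learned scale/shift, for brevity)
    return (z - z.mean(0)) / np.sqrt(z.var(0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 10))
W = rng.normal(size=(10, 5))

out1 = batchnorm(x @ W)
out2 = batchnorm(x @ (0.1 * W))   # weights shrunk, as L2 decay would do over time

# True: normalization undoes the scale, so the penalty only changes the
# effective learning rate, not the function being computed
print(np.allclose(out1, out2, atol=1e-4))
```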
Large Linear Classification When Data Cannot Fit In Memory Recent advances in linear classification have shown that for applications such as document classification, the training can be extremely efficient. However, most of the existing training methods are designed by assuming that data can be stored in the computer memory. These methods cannot be easily applied to data larger than the memory capacity due to the random access to the disk. We propose and analyze a block minimization framework for data larger than the memory size. At each step a block of data is loaded from the disk and handled by certain learning methods. We investigate two implementations of the proposed framework for primal and dual SVMs, respectively. As data cannot fit in memory, many design considerations are very different from those for traditional algorithms. Experiments using data sets 20 times larger than the memory demonstrate the effectiveness of the proposed method.
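The block minimization framework is specific to the paper's SVM solvers; as a generic stand-in for the load-a-block-then-update pattern it describes, this sketch streams synthetic blocks through scikit-learn's SGDClassifier.partial_fit (hinge loss, i.e., a linear SVM objective).

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="hinge", alpha=1e-4)   # linear SVM objective
classes = np.array([0, 1])

for block in range(20):                         # pretend each block was read off disk
    Xb = rng.normal(size=(1000, 50))
    yb = (Xb[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)
    clf.partial_fit(Xb, yb, classes=classes)    # update on this block only

Xt = rng.normal(size=(1000, 50))
yt = (Xt[:, 0] > 0).astype(int)
print("held-out accuracy:", clf.score(Xt, yt))
```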
Large-Scale Graph Visualization and Analytics Novel approaches to network visualization and analytics use sophisticated metrics that enable rich interactive network views and node grouping and filtering. A survey of graph layout and simplification methods reveals considerable progress in these new directions.
Latent Dirichlet Allocation We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
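A minimal fit of LDA on a toy four-document corpus of our own, using scikit-learn's variational implementation (one of the approximate-inference styles the abstract mentions):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stock market trading prices fell",
    "team wins game season playoffs",
    "market investors stock shares rally",
    "coach players game score win",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)                   # document-term counts
vocab = vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-4:][::-1]          # highest-weight words per topic
    print(f"topic {k}:", [vocab[i] for i in top])
```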
Latent Semantic Analysis and Topic Modeling: Roads to Text Meaning (Slide Deck)
Latent Variable Mixture Modeling The aim of this study was to provide an overview of mixture modeling techniques, specifically as applied to nursing research, and to present examples from two studies to illustrate how these techniques may be used cross-sectionally and longitudinally.
Latent Variable Models A powerful approach to probabilistic modelling involves supplementing a set of observed variables with additional latent, or hidden, variables. By defining a joint distribution over visible and latent variables, the corresponding distribution of the observed variables is then obtained by marginalization. This allows relatively complex distributions to be expressed in terms of more tractable joint distributions over the expanded variable space. One well-known example of a hidden variable model is the mixture distribution in which the hidden variable is the discrete component label. In the case of continuous latent variables we obtain models such as factor analysis. The structure of such probabilistic models can be made particularly transparent by giving them a graphical representation, usually in terms of a directed acyclic graph, or Bayesian network. In this chapter we provide an overview of latent variable models for representing continuous variables. We show how a particular form of linear latent variable model can be used to provide a probabilistic formulation of the well-known technique of principal components analysis (PCA). By extending this technique to mixtures, and hierarchical mixtures, of probabilistic PCA models we are led to a powerful interactive algorithm for data visualization. We also show how the probabilistic PCA approach can be generalized to non-linear latent variable models leading to the Generative Topographic Mapping algorithm (GTM). Finally, we show how GTM can itself be extended to model temporal data.
Layerwise Systematic Scan: Deep Boltzmann Machines and Beyond For Markov chain Monte Carlo methods, one of the greatest discrepancies between theory and system is the scan order – while most theoretical development on the mixing time analysis deals with random updates, real-world systems are implemented with systematic scans. We bridge this gap for models that exhibit a bipartite structure, including, most notably, the Restricted/Deep Boltzmann Machine. The de facto implementation for these models scans variables in a layerwise fashion. We show that the Gibbs sampler with a layerwise alternating scan order has its relaxation time (in terms of epochs) no larger than that of a random-update Gibbs sampler (in terms of variable updates). We also construct examples to show that this bound is asymptotically tight. Through standard inequalities, our result also implies a comparison on the mixing times.
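The layerwise alternating scan in question, sketched for a tiny binary RBM in NumPy: one "epoch" resamples the whole hidden layer given the visibles, then the whole visible layer given the hiddens; each half-step is exact and parallel thanks to the bipartite structure. Sizes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = 0.1 * rng.normal(size=(n_v, n_h))   # visible-hidden couplings
b, c = np.zeros(n_v), np.zeros(n_h)     # biases

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

v = rng.integers(0, 2, size=n_v).astype(float)
for _ in range(1000):
    # half-step 1: all hidden units given the visibles (conditionally independent)
    h = (rng.random(n_h) < sigmoid(v @ W + c)).astype(float)
    # half-step 2: all visible units given the hiddens
    v = (rng.random(n_v) < sigmoid(W @ h + b)).astype(float)

print("final visible state:", v, "| hidden state:", h)
```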
LDAvis: A method for visualizing and interpreting topics We present LDAvis, a web-based interactive visualization of topics estimated using Latent Dirichlet Allocation that is built using a combination of R and D3. Our visualization provides a global view of the topics (and how they differ from each other), while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic. First, we propose a novel method for choosing which terms to present to a user to aid in the task of topic interpretation, in which we define the relevance of a term to a topic. Second, we present results from a user study that suggest that ranking terms purely by their probability under a topic is suboptimal for topic interpretation. Last, we describe LDAvis, our visualization system that allows users to flexibly explore topic-term relationships using relevance to better understand a fitted LDA model.
Leakage in Data Mining: Formulation, Detection, and Avoidance Deemed “one of the top ten data mining mistakes”, leakage is essentially the introduction of information about the data mining target which should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently, such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge, are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical i.i.d. assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive a general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected.
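A toy illustration of the learn-predict separation idea with pandas (our own example, not the paper's): features may use only data strictly earlier than the prediction target, and the train/test split must respect time rather than shuffling rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "t": pd.date_range("2020-01-01", periods=300, freq="D"),
    "y": rng.normal(size=300).cumsum(),
})

# LEAKY feature: a centered rolling mean peeks at future values of y
df["leaky"] = df["y"].rolling(7, center=True).mean()
# SAFE feature: shift(1) guarantees only strictly past observations are used
df["safe"] = df["y"].shift(1).rolling(7).mean()

# temporal split instead of a random shuffle: train strictly precedes test
cutoff = df["t"].iloc[200]
train, test = df[df["t"] <= cutoff], df[df["t"] > cutoff]
print(train.shape, test.shape)
```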
Learn to use R – Your Hands-on Guide R is hot. Whether measured by more than 6,100 add-on packages, the 41,000+ members of LinkedIn’s R group or the 170+ R Meetup groups currently in existence, there can be little doubt that interest in the R statistics language, especially for data analysis, is soaring. Why R? It’s free, open source, powerful and highly extensible. “You have a lot of prepackaged stuff that’s already available, so you’re standing on the shoulders of giants,” Google’s chief economist told The New York Times back in 2009. Because it’s a programmable environment that uses command-line scripting, you can store a series of complex data-analysis steps in R. That lets you re-use your analysis work on similar data more easily than if you were using a point-and-click interface, notes Hadley Wickham, author of several popular R packages and chief scientist with RStudio. That also makes it easier for others to validate research results and check your work for errors — an issue that cropped up in the news recently after an Excel coding error was among several flaws found in an influential economics analysis report known as Reinhart/Rogoff. The error itself wasn’t a surprise, blogs Christopher Gandrud, who earned a doctorate in quantitative research methodology from the London School of Economics. “Despite our best efforts we always will” make errors, he notes. “The problem is that we often use tools and practices that make it difficult to find and correct our mistakes.” Sure, you can easily examine complex formulas on a spreadsheet. But it’s not nearly as easy to run multiple data sets through spreadsheet formulas to check results as it is to put several data sets through a script, he explains. Indeed, the mantra of “Make sure your work is reproducible!” is a common theme among R enthusiasts.
Learning Deep Architectures for AI Theoretical results strongly suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one needs deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult optimization task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas. This paper discusses the motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.
Learning from Dyadic Data Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This type of data arises naturally in many applications ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domain-independent framework for learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures. We propose an annealed version of the standard EM algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains.
Learning Sparse Structural Changes in High-dimensional Markov Networks: A Review on Methodologies and Theories Recent years have seen an increasing popularity of learning the sparse changes in Markov Networks. Changes in the structure of Markov Networks reflect alternations of interactions between random variables under different regimes and provide insights into the underlying system. While each individual network structure can be complicated and difficult to learn, the overall change from one network to another can be simple. This intuition gave birth to an approach that directly learns the sparse changes without modelling and learning the individual (possibly dense) networks. In this paper, we review such a direct learning method with some latest developments along this line of research.
Learning the k in k-means When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize the model’s complexity strongly enough.
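The statistical test at the core of G-means, in simplified form: split a cluster with 2-means, project the points onto the axis joining the two child centers, and apply an Anderson-Darling normality test to that one-dimensional projection. A scikit-learn/SciPy sketch on synthetic data (the data-dependent projection makes this a rough approximation of the paper's procedure):

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def looks_gaussian(X, level_idx=2):          # index 2 ~ the 5% significance level
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    v = km.cluster_centers_[1] - km.cluster_centers_[0]
    proj = X @ v / np.linalg.norm(v)         # 1-D projection along the split axis
    res = anderson(proj, dist="norm")
    return res.statistic < res.critical_values[level_idx]

rng = np.random.default_rng(0)
one_blob = rng.normal(size=(400, 2))
two_blobs = np.vstack([one_blob, rng.normal(6, 1, size=(400, 2))])
print(looks_gaussian(one_blob))    # typically True: keep the single center
print(looks_gaussian(two_blobs))   # False: G-means would accept the split
```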
Learning the parts of objects by non-negative matrix factorization Is perception of the whole based on perception of its parts? There is psychological and physiological evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations. But little is known about how brains or computers might learn the parts of objects. Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text. This is in contrast to other methods, such as principal components analysis and vector quantization, that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations. When non-negative matrix factorization is implemented as a neural network, parts-based representations emerge by virtue of two properties: the firing rates of neurons are never negative and synaptic strengths do not change sign.
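The contrast drawn here is easy to reproduce with scikit-learn on nonnegative stand-in data: the NMF basis contains no negative entries (additive parts only), while the PCA basis does.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

rng = np.random.default_rng(0)
V = np.abs(rng.normal(size=(100, 64)))      # stand-in for image/term count data

model = NMF(n_components=8, init="nndsvd", max_iter=500)
W = model.fit_transform(V)                  # activations, all >= 0
H = model.components_                       # parts/basis vectors, all >= 0

print("reconstruction error:", round(np.linalg.norm(V - W @ H), 3))
print("negative entries in NMF basis:", (H < 0).sum())   # 0: additive combinations only

pca_basis = PCA(n_components=8).fit(V).components_
print("negative entries in PCA basis:", (pca_basis < 0).sum())  # many: holistic, signed
```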
Learning to Extract International Relations from Political Context We describe a new probabilistic model for extracting events between major political actors from news corpora. Our unsupervised model brings together familiar components in natural language processing (like parsers and topic models) with contextual political information—temporal and dyad dependence—to infer latent event classes. We quantitatively evaluate the model’s performance on political science benchmarks: recovering expert-assigned event class valences, and detecting real-world conflict. We also conduct a small case study based on our model’s inferences.
Learning to Hash for Indexing Big Data – A Survey The explosive growth in big data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, performance of randomized hashing has been shown insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with systematic understanding of insights, pros and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area.
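The data-independent baseline such surveys start from, random-projection LSH, fits in a few lines of NumPy: each hash bit is the sign of a random projection, so nearby vectors tend to collide in the same bucket. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 128, 16
planes = rng.normal(size=(n_bits, dim))          # one random hyperplane per bit

def lsh_code(x):
    # each bit is the sign of the projection onto one hyperplane
    return tuple((planes @ x > 0).astype(int))

X = rng.normal(size=(10000, dim))
buckets = {}
for i, x in enumerate(X):
    buckets.setdefault(lsh_code(x), []).append(i)

q = X[0] + 0.01 * rng.normal(size=dim)           # near-duplicate query
candidates = buckets.get(lsh_code(q), [])        # only this bucket is searched
print("bucket size:", len(candidates), "| contains true neighbor:", 0 in candidates)
```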
Learning Whenever Learning is Possible: Universal Learning under General Stochastic Processes This work initiates a general study of learning and generalization without the i.i.d. assumption, starting from first principles. While the standard approach to statistical learning theory is based on assumptions chosen largely for their convenience (e.g., i.i.d. or stationary ergodic), in this work we are interested in developing a theory of learning based only on the most fundamental and natural assumptions implicit in the requirements of the learning problem itself. We specifically study universally consistent function learning, where the objective is to obtain low long-run average loss for any target function, when the data follow a given stochastic process. We are then interested in the question of whether there exist learning rules guaranteed to be universally consistent given only the assumption that universally consistent learning is possible for the given data process. The reasoning that motivates this criterion emanates from a kind of optimist’s decision theory, and so we refer to such learning rules as being optimistically universal. We study this question in three natural learning settings: inductive, self-adaptive, and online. Remarkably, as our strongest positive result, we find that optimistically universal learning rules do indeed exist in the self-adaptive learning setting. Establishing this fact requires us to develop new approaches to the design of learning algorithms. Along the way, we also identify concise characterizations of the family of processes under which universally consistent learning is possible in the inductive and self-adaptive settings. We additionally pose a number of enticing open problems, particularly for the online learning setting.
Learning Word Representation Considering Proximity and Ambiguity Distributed representations of words (aka word embedding) have proven helpful in solving natural language processing (NLP) tasks. Training distributed representations of words with neural networks has lately been a major focus of researchers in the field. Recent work on word embedding, the Continuous Bag-of-Words (CBOW) model and the Continuous Skip-gram (Skip-gram) model, has produced particularly impressive results, significantly speeding up the training process to enable word representation learning from large-scale data. However, both CBOW and Skip-gram do not pay enough attention to word proximity in terms of model or word ambiguity in terms of linguistics. In this paper, we propose Proximity-Ambiguity Sensitive (PAS) models (i.e. PAS CBOW and PAS Skip-gram) to produce high quality distributed representations of words considering both word proximity and ambiguity. From the model perspective, we introduce proximity weights as parameters to be learned in PAS CBOW and used in PAS Skip-gram. By better modeling word proximity, we reveal the strength of pooling-structured neural networks in word representation learning. The proximity sensitive pooling layer can also be applied to other neural network applications that employ pooling layers. From the linguistics perspective, we train multiple representation vectors per word. Each representation vector corresponds to a particular group of POS tags of the word. By using PAS models, we achieved a 16.9% increase in accuracy over state-of-the-art models.
Least Square Projection: a fast high precision multidimensional projection technique and its application to document mapping The problem of projecting multidimensional data into lower dimensions has been pursued by many researchers due to its potential application to data analysis of various kinds. This paper presents a novel multidimensional projection technique based on least square approximations. The approximations compute the coordinates of a set of projected points based on the coordinates of a reduced number of control points with defined geometry. We name the technique Least Square Projections (LSP). From an initial projection of the control points, LSP defines the positioning of their neighboring points through a numerical solution that aims at preserving a similarity relationship between the points given by a metric in mD. In order to perform the projection, a small number of distance calculations is necessary and no repositioning of the points is required to obtain a final solution with satisfactory precision. The results show the capability of the technique to form groups of points by degree of similarity in 2D. We illustrate that capability through its application to mapping collections of textual documents from varied sources, a strategic yet difficult application. LSP is faster and more accurate than other existing high quality methods, particularly where it was mostly tested, that is, for mapping text sets.
Let’s Debunk the Myths about Data Mining Data mining is about knowledge and information, but only occasionally about predicting the future. For as long as the field has existed, data miners have worked to explain the difference between data mining and other forms of data analysis. The terms ‘predictive analysis’ and ‘predictive modelling’ have been adopted widely to distinguish data mining and its modelling from other kinds. Unfortunately, this has led to the erroneous belief among non-practitioners that data mining is all about prediction, which it is not. Rather, data mining is about information and knowledge. Take a look at the diagram: On the left, we have the myth which has grown up around data mining: the idea that starting from data we create models which make predictions to guide action. This places a false emphasis on models; a more accurate picture of what really happens is shown on the right. Knowledge is applied to data, producing new knowledge which can again be applied to the data: an iterative process. At any point in this cycle, knowledge and data can be used together to produce new information. This creation of new information is sometimes called ‘prediction’, but it is often not information about the future. It may have some implications for the future, as many pieces of information do, but it is not a prediction in the usual sense of the word. In summary, the left hand diagram is erroneous because it leaves out knowledge, which is both an essential prerequisite and a product of data mining, and is used at every step. Data mining often produces models but these are only one kind of knowledge that it can produce, the other being human knowledge (knowledge in the head).
Leveraging Flexible Data Management with Graph Databases Integrating up-to-date information into databases from different heterogeneous data sources is still a time-consuming and mostly manual job that can only be accomplished by skilled experts. For this reason, enterprises often lack information regarding the current market situation, preventing a holistic view that is needed to conduct sound data analysis and market predictions. Ironically, the Web consists of a huge and growing number of valuable information sources from diverse organizations and data providers, such as the Linked Open Data cloud, common knowledge sources like Freebase, and social networks. One desirable usage scenario for this kind of data is its integration into a single database in order to apply data analytics. However, in today’s business intelligence tools there is an evident lack of support for so-called situational or ad-hoc data integration. What we need is a system which 1) provides flexible storage of heterogeneous information of different degrees of structure in an ad-hoc manner, and 2) supports mass data operations suited for data analytics. In this paper, we will provide our vision of such a system and describe an extension of the well-studied property graph model that allows to “integrate and analyze as you go” external data exposed in the RDF format in a seamless manner. The proposed integration approach extends the internal graph model with external data from the Linked Open Data cloud, which stores over 31 billion RDF triples (September 2011) from a variety of domains.
lfe: Linear Group Fixed Effects Linear models with fixed effects and many dummy variables are common in some fields. Such models are straightforward to estimate unless the factors have too many levels. The R package lfe solves this problem by implementing a generalization of the within transformation to multiple factors, tailored for large problems.
Lifelong Metric Learning State-of-the-art online metric learning approaches are only capable of learning the metric for predefined tasks. In this paper, we consider the lifelong learning problem to mimic ‘human learning’, i.e., endowing the learned metric with a new capability for a new task from new online samples while incorporating previous experiences and knowledge. Therefore, we propose a new framework: lifelong metric learning (LML), which only utilizes the data of the new task to train the metric model while preserving the original capabilities. More specifically, the proposed LML maintains a common subspace for all learned metrics, named the lifelong dictionary, transfers knowledge from the common subspace to each new metric task with task-specific idiosyncrasy, and redefines the common subspace over time to maximize performance across all metric tasks. We apply online Passive-Aggressive optimization to solve the proposed LML framework. Finally, we evaluate our approach on several multi-task metric learning datasets. Extensive experimental results demonstrate the effectiveness and efficiency of the proposed framework.
Linear Dimensionality Reduction (Slide Deck)
Linear models and linear mixed effects models in R with linguistic applications Part 1: Linear modeling; Part 2: A very basic tutorial for performing linear mixed effects analyses.
Linked Open Data: The Essentials A Quick Start Guide for Decision Makers
Listen First! Turning Social Media Conversations into Business Advantage Nearly anywhere you turn online, people are talking about your products and categories, what they like and dislike, what they want, what pleases them or ticks them off, and what they would like you to do, or stop doing. Twitter, Facebook, YouTube, millions of blogs, forums, Web sites, and review sites make most of these conversations public, accessible, and researchable to every company. You also hear individuals talk about the richness and texture of their lives and your role in them. You learn about their aspirations, families, relationships, and homes; music and movies; vacations, hobbies, and sports; finances, jobs, and careers; education and technology; what they had for lunch, what they crave; and much more. By listening in on those conversations, you position yourself to develop powerful insights into people that, coupled with strategy, drive your business forward and create an enduring advantage. Listen First! Turning Social Media Conversations into Business Advantage will show you how.
Listen, Interact and Talk: Learning to Speak via Interaction One of the long-term goals of artificial intelligence is to build an agent that can communicate intelligently with humans in natural language. Most existing work on natural language learning relies heavily on training over a pre-collected dataset with annotated labels, leading to an agent that essentially captures the statistics of the fixed external training data. As the training data is essentially a static snapshot representation of the knowledge from the annotator, the agent trained this way is limited in adaptiveness and generalization of its behavior. Moreover, this is very different from the language learning process of humans, where language is acquired during communication by taking speaking actions and learning from their consequences in an interactive manner. This paper presents an interactive setting for grounded natural language learning, where an agent learns natural language by interacting with a teacher and learning from feedback, thus learning and improving language skills while taking part in the conversation. To achieve this goal, we propose a model which incorporates both imitation and reinforcement by jointly leveraging sentence and reward feedback from the teacher. Experiments are conducted to validate the effectiveness of the proposed approach.
Living Together: Mind and Machine Intelligence In this paper we consider the nature of the machine intelligences we have created in the context of our human intelligence. We suggest that the fundamental difference between human and machine intelligence comes down to embodiment factors. We define embodiment factors as the ratio between an entity’s ability to communicate information vs compute information. We speculate on the role of embodiment factors in driving our own intelligence and consciousness. We briefly review dual process models of cognition and cast machine intelligence within that framework, characterising it as a dominant System Zero, which can drive behaviour through interfacing with us subconsciously. Driven by concerns about the consequence of such a system we suggest prophylactic courses of action that could be considered. Our main conclusion is that it is not sentient intelligence we should fear but non-sentient intelligence.
Luck is Hard to Beat: The Difficulty of Sports Prediction Predicting the outcome of sports events is a hard task. We quantify this difficulty with a coefficient that measures the distance between the observed final results of sports leagues and idealized perfectly balanced competitions in terms of skill. This indicates the relative presence of luck and skill. We collected and analyzed all games from 198 sports leagues comprising 1503 seasons from 84 countries of 4 different sports: basketball, soccer, volleyball and handball. We measured competitiveness by country and sport. We also identify in each season which teams, if removed from their league, result in a completely random tournament. Surprisingly, not many of them are needed. As another contribution of this paper, we propose a probabilistic graphical model to learn the teams’ skills and to decompose the relative weights of luck and skill in each game. We break down the skill component into factors associated with the teams’ characteristics. The model also allows us to estimate the probability that an underdog team wins in the NBA league as 0.36, with home advantage adding 0.09 to this probability. As shown in the first part of the paper, luck is substantially present even in the most competitive championships, which partially explains why sophisticated and complex feature-based models hardly beat simple models in the task of forecasting sports’ outcomes.

M

Machine Learned Learning Machines There are two common approaches for optimizing the performance of a machine: genetic algorithms and machine learning. A genetic algorithm is applied over many generations whereas machine learning works by applying feedback until the system meets a performance threshold. Though these are methods that typically operate separately, we combine evolutionary adaptation and machine learning into one approach. Our focus is on machines that can learn during their lifetime, but instead of equipping them with a machine learning algorithm we aim to let them evolve their ability to learn by themselves. We use evolvable networks of probabilistic and deterministic logic gates, known as Markov Brains, as our computational model organism. The ability of Markov Brains to learn is augmented by a novel adaptive component that can change its computational behavior based on feedback. We show that Markov Brains can indeed evolve to incorporate these feedback gates to improve their adaptability to variable environments. By combining these two methods, we now also implemented a computational model that can be used to study the evolution of learning.
Machine Learning This book is based partly on content from the 2013 session of the on-line Machine Learning course run by Andrew Ng (Stanford University). The on-line course is provided for free via the Coursera platform (www.coursera.org). The author is in no way affiliated with Coursera, Stanford University or Andrew Ng.
Machine Learning – The Complete Guide This is a Wikipedia book, a collection of Wikipedia articles that can be easily saved, rendered electronically, and ordered as a printed book.
Machine Learning and Cloud Computing: Survey of Distributed and SaaS Solutions Applying popular machine learning algorithms to large amounts of data has raised new challenges for ML practitioners. Traditional ML libraries do not support processing of huge datasets well, so new approaches were needed. Parallelization using modern parallel computing frameworks, such as MapReduce, CUDA, or Dryad, gained in popularity and acceptance, resulting in new ML libraries developed on top of these frameworks. We will briefly introduce the most prominent industrial and academic outcomes, such as Apache Mahout, GraphLab or Jubatus. We will investigate how the cloud computing paradigm has impacted the field of ML. The first direction is popular statistics tools and libraries (the R system, Python) deployed in the cloud. A second line of products augments existing tools with plugins that allow users to create a Hadoop cluster in the cloud and run jobs on it. Next on the list are libraries of distributed implementations of ML algorithms, and on-premise deployments of complex systems for data analytics and data mining. The last approach on the radar of this survey is ML as Software-as-a-Service, with several Big Data start-ups (and large companies as well) already opening their solutions to the market.
Machine Learning and the Future of Realism The preceding three decades have seen the emergence, rise, and proliferation of machine learning (ML). From half-recognised beginnings in perceptrons, neural nets, and decision trees, algorithms that extract correlations (that is, patterns) from a set of data points have broken free from their origin in computational cognition to embrace all forms of problem solving, from voice recognition to medical diagnosis to automated scientific research and driverless cars, and it is now widely opined that the real industrial revolution lies less in mobile phones and the like than in the maturation and universal application of ML. Among the consequences just might be the triumph of anti-realism over realism.
Machine Learning for Business: Eight Best Practices to Get Started As organizations look to advance with analytics, predictive analytics is frequently on their road map. Businesses are interested in better understanding their customers, predicting behavior, and improving operational processes. They want more accurate insights and the ability to respond faster to change. Machine learning—building systems that can learn from data to identify patterns and predict future outcomes with minimal human intervention—is often on their radar. Data scientists who engage in analysis are an important piece of the equation. Data scientists can build new models, develop algorithms and applications, and help the organization innovate. However, these data scientists are not always easy to find. TDWI research indicates that organizations are often looking to supplement the data science team by growing the skills of business analysts to use tools such as machine learning. For example, in a recent TDWI survey, 51 percent of respondents said that enhancing business analysts’ skills was one of their top two strategies for growing their data science competencies in the organization.1 That means that organizations need productivity tools for data scientists as well as a way to equip power users and business analysts to perform advanced analytics. These business analysts can work together with data scientists and other team members to bring machine learning into the organization. How do businesses get started with machine learning? How do organizations equip business analysts to use machine learning techniques and work in conjunction with data scientists? What do these organizations need to know? This Checklist defines machine learning and discusses best practices for the business as it takes the next step on its analytics journey toward using machine learning.
Machine Learning for E-mail Spam Filtering: Review, Techniques and Trends We present a comprehensive review of the most effective content-based e-mail spam filtering techniques. We focus primarily on Machine Learning-based spam filters and their variants, and report on a broad review ranging from surveying the relevant ideas, efforts, effectiveness, and the current progress. The initial exposition of the background examines the basics of e-mail spam filtering, the evolving nature of spam, spammers playing cat-and-mouse with e-mail service providers (ESPs), and the Machine Learning front in fighting spam. We conclude by measuring the impact of Machine Learning-based filters and exploring the promising offshoots of the latest developments.
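To make the Machine Learning front concrete, here is a minimal sketch of a content-based filter of the kind the review surveys: bag-of-words features feeding a multinomial naive Bayes classifier. It uses scikit-learn and an invented four-message corpus purely for illustration; it is not a technique the review itself prescribes.

    # A minimal content-based spam filter: bag-of-words features plus a
    # multinomial naive Bayes classifier. The tiny corpus is invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    messages = ["win a free prize now", "meeting agenda for tomorrow",
                "free offer click now", "lunch at noon with the team"]
    labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(messages)  # word-count feature matrix
    clf = MultinomialNB().fit(X, labels)

    # classify a new message
    print(clf.predict(vectorizer.transform(["free prize meeting"])))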
Machine Learning Methods for Computer Security The study of learning in adversarial environments is an emerging discipline at the juncture between machine learning and computer security that raises new questions within both fields. The interest in learning-based methods for security and system design applications comes from the high degree of complexity of phenomena underlying the security and reliability of computer systems. As it becomes increasingly difficult to reach the desired properties by design alone, learning methods are being used to obtain a better understanding of various data collected from these complex systems. However, learning approaches can be co-opted or evaded by adversaries, who adapt in order to counter them. To date, there has been limited research into learning techniques that are resilient to attacks with provable robustness guarantees, making the task of designing secure learning-based systems a lucrative open research area with many challenges. The Perspectives Workshop, “Machine Learning Methods for Computer Security,” was convened to bring together interested researchers from both the computer security and machine learning communities to discuss techniques, challenges, and future research directions for secure learning and learning-based security applications. This workshop featured twenty-two invited talks from leading researchers within the secure learning community, covering topics in adversarial learning, game-theoretic learning, collective classification, privacy-preserving learning, security evaluation metrics, digital forensics, authorship identification, adversarial advertisement detection, learning for offensive security, and data sanitization. The workshop also featured workgroup sessions organized into three topics: machine learning for computer security, secure learning, and future applications of secure learning.
Machine Learning with World Knowledge: The Position and Survey Machine learning has become pervasive in multiple domains, impacting a wide variety of applications, such as knowledge discovery and data mining, natural language processing, information retrieval, computer vision, social and health informatics, ubiquitous computing, etc. Two essential problems of machine learning are how to generate features and how to acquire labels for machines to learn. In particular, labeling large amounts of data for each domain-specific problem can be very time-consuming and costly. This has become a key obstacle in making learning protocols realistic in applications. In this paper, we discuss how to use existing general-purpose world knowledge to enhance machine learning processes, by enriching the features or reducing the labeling work. We start from the comparison of world knowledge with domain-specific knowledge, and then introduce three key problems in using world knowledge in learning processes, i.e., explicit and implicit feature representation, inference for knowledge linking and disambiguation, and learning with direct or indirect supervision. Finally we discuss the future directions of this research topic.
Machine Learning: A Probabilistic Perspective This book adopts the view that the best way to solve such problems is to use the tools of probability theory. Probability theory can be applied to any problem involving uncertainty. In machine learning, uncertainty comes in many forms: what is the best prediction about the future given some past data? what is the best model to explain some data? what measurement should I perform next? etc. The probabilistic approach to machine learning is closely related to the field of statistics, but differs slightly in terms of its emphasis and terminology. We will describe a wide variety of probabilistic models, suitable for a wide variety of data and tasks. We will also describe a wide variety of algorithms for learning and using such models. The goal is not to develop a cookbook of ad hoc techniques, but instead to present a unified view of the field through the lens of probabilistic modeling and inference. Although we will pay attention to computational efficiency, details on how to scale these methods to truly massive datasets are better described in other books, such as (Rajaraman and Ullman 2011; Bekkerman et al. 2011).
Machine Learning: The High-Interest Credit Card of Technical Debt Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is to highlight several machine-learning-specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.
Machine Teaching: A New Paradigm for Building Machine Learning Systems The current processes for building machine learning systems require practitioners with deep knowledge of machine learning. This significantly limits the number of machine learning systems that can be created and has led to a mismatch between the demand for machine learning systems and the ability for organizations to build them. We believe that in order to meet this growing demand for machine learning systems we must significantly increase the number of individuals that can teach machines. We postulate that we can achieve this goal by making the process of teaching machines easy, fast and above all, universally accessible. While machine learning focuses on creating new algorithms and improving the accuracy of learners, the machine teaching discipline focuses on the efficacy of the teachers. Machine teaching as a discipline is a paradigm shift that follows and extends principles of software engineering and programming languages. We put a strong emphasis on the teacher and the teacher’s interaction with data, as well as crucial components such as techniques and design principles of interaction and visualization. In this paper, we present our position regarding the discipline of machine teaching and articulate fundamental machine teaching principles. We also describe how, by decoupling knowledge about machine learning algorithms from the process of teaching, we can accelerate innovation and empower millions of new uses for machine learning models.
Machine Translation Evaluation: A Survey This paper introduces a state-of-the-art MT evaluation survey that covers both manual and automatic evaluation methods. The traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. We classify the automatic evaluation methods into two categories: lexical similarity and the application of linguistic features. The lexical similarity methods include edit distance, precision, recall, and word order. The linguistic features can be divided into syntactic features and semantic features. Subsequently, we also introduce methods for evaluating the MT evaluation metrics themselves (meta-evaluation), as well as the recent quality estimation tasks for MT.
MAGIX: Model Agnostic Globally Interpretable Explanations Explaining the behavior of a black box machine learning model at the instance level is useful for building trust. However, what is also important is understanding how the model behaves globally. Such an understanding provides insight into both the data on which the model was trained and the generalization power of the rules it learned. We present here an approach that learns rules to explain globally the behavior of black box machine learning models. Collectively these rules represent the logic learned by the model and are hence useful for gaining insight into its behavior. We demonstrate the power of the approach on three publicly available data sets.
Making Predictive Analytics More Practical With Alteryx Businesses today are in a conundrum. While the current economic climate has made organizations hesitant to take a risk on a strategic investment that could be a ‘bad bet’, the pace of business and the competitive marketplace dictate that organizations move quickly to take advantage of opportunities that could create a huge revenue windfall—and create a significant competitive advantage. Unfortunately, although the vast majority of organizations have a deep visibility into the past, thanks to traditional Business Intelligence (BI) tools that analyze the historical performance of the business, many still depend on intuition or simple optimism to better understand the present and future, forcing them to ‘fly blind’ when mapping their company’s future strategy. Why? There are two primary reasons: First, traditional BI tools and platforms do not provide forward-looking insight, rendering them useless in anticipating future performance. While they are able to deliver a wide range of detailed reports, sophisticated dashboards, and complex visualizations, all are based on historical information, leaving organizations stuck using guesswork, ‘gut feel’, or simple spreadsheets to anticipate the future, thereby ignoring the tremendous potential of their information assets that could give them predictive insight for competitive advantage. Second, most of the predictive analytical tools on the market today are complex, time-consuming, and expensive to use. Requiring multiple different technology systems and highly trained, specialized personnel to get from business question to predictive answer, companies that use the predictive analytical tools available today often find that the business opportunity is already in their rear-view mirror—by several weeks or months—before they have an answer. Clearly, this is unacceptable in today’s competitive reality. What today’s cutting-edge businesses need is a powerful yet easy to use predictive analytics platform that enables them to gain significant business value from predicting future business performance—quickly and inexpensively—using forward-looking insight rather than historical data. This paper will examine today’s business reality of information overload, the hurdles organizations must overcome using today’s complex, expensive, and time-consuming predictive analytical tools, and the new approach to agile predictive analytics offered by Alteryx.
Making the business case for text analytics This report hopes to establish some of the key barriers that prevent successful commercial deployments while providing real-world assistance so obstacles can be overcome. It will focus on the different needs of an initial text analytics adoption, including what our contributors all cited as the top company need: strong high-level executive support to help ensure necessary long-term funding. Text analytics can be applied in almost every business case and multiple units within the same organization can benefit from a centralized analytics division. The market’s future is still a concern because of the shortage in text analytics professionals, and this reality is a guiding force for today’s successful pilots and programs.
Malicious URL Detection using Machine Learning: A Survey Malicious URL, a.k.a. malicious website, is a common and serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users to become victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. It is imperative to detect and act on such threats in a timely manner. Traditionally, this detection is done mostly through the usage of blacklists. However, blacklists cannot be exhaustive, and lack the ability to detect newly generated malicious URLs. To improve the generality of malicious URL detectors, machine learning techniques have been explored with increasing attention in recent years. This article aims to provide a comprehensive survey and a structural understanding of Malicious URL Detection techniques using machine learning. We present the formal formulation of Malicious URL Detection as a machine learning task, and categorize and review the contributions of literature studies that address different dimensions of this problem (feature representation, algorithm design, etc.). Further, this article provides a timely and comprehensive survey for a range of different audiences, not only for machine learning researchers and engineers in academia, but also for professionals and practitioners in the cybersecurity industry, to help them understand the state of the art and facilitate their own research and practical applications. We also discuss practical issues in system design, open research challenges, and point out some important directions for future research.
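As a toy instance of the machine learning formulation the survey describes, the sketch below represents each URL by a few hand-picked lexical features and trains a binary classifier with scikit-learn. The feature set, example URLs, and labels are all illustrative assumptions, not the survey's recommendations.

    # Represent each URL by simple lexical features, then fit a classifier.
    import re
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def url_features(url):
        return [len(url),                       # overall length
                url.count('.'),                 # number of dots
                url.count('-'),                 # number of hyphens
                sum(c.isdigit() for c in url),  # digit count
                int('@' in url),                # embedded '@' symbol
                int(bool(re.search(r'\d+\.\d+\.\d+\.\d+', url)))]  # raw IP host

    urls = ["http://paypal.com.secure-login.example-phish.ru/verify",
            "https://www.wikipedia.org/wiki/Statistics",
            "http://192.168.13.7/free-money@claim",
            "https://github.com/scikit-learn/scikit-learn"]
    y = [1, 0, 1, 0]  # 1 = malicious, 0 = benign (invented labels)

    X = np.array([url_features(u) for u in urls])
    clf = LogisticRegression().fit(X, y)
    print(clf.predict(X))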
Managing Big Data: A TDWI Best Practices Report The emerging phenomenon called big data is forcing numerous changes in businesses and other organizations. Many struggle just to manage the massive data sets and non-traditional data structures that are typical of big data. Others are managing big data by extending their data management skills and their portfolios of data management software. This empowers them to automate more business processes, operate closer to real time, and through analytics, learn valuable new facts about business operations, customers, partners, and so on. The result is big data management (BDM), an amalgam of old and new best practices, skills, teams, data types, and home-grown or vendor-built functionality. All of these are expanding and realigning so that businesses can fully leverage big data, not merely manage it. At the same time, big data must eventually find a permanent place in enterprise data management. BDM is well worth doing because managing big data leads to a number of benefits. According to this report’s survey, the business and technology tasks that improve most are analytic insights, the completeness of analytic data sets, business value drawn from big data, and all sales and marketing activities. BDM also has challenges, and common barriers include low organizational maturity relative to big data, weak business support, and the need to learn new technology approaches. Despite the newness of big data, half of organizations surveyed are actively managing big data today. For a quarter of organizations, big data mostly takes the form of the relational and structured data that comes from traditional applications, whereas another quarter manages traditional data along with big data from new sources such as Web servers, machines, sensors, customer interactions, and social media. A quarter of surveyed organizations have managed to scale up preexisting applications and databases to handle burgeoning volumes of relational big data. Another quarter has gone out on the leading edge by acquiring new data management platforms that are purpose-built for managing and analyzing multi-structured big data. Many more are evaluating such big data platforms now, creating a brisk market of vendor products and services for managing big data. According to the survey, the Hadoop Distributed File System (HDFS), MapReduce, and various Hadoop tools will be the software products most aggressively adopted for BDM in the next three years. Others include complex event processing (for streaming big data), NoSQL databases (for schema-free big data), in-memory databases (for real-time analytic processing of big data), private clouds, in-database analytics, and grid computing. Organizations are adjusting their technical best practices to accommodate BDM. Most are schooled in extract, transform, and load (ETL) in support of data warehousing (DW) and reporting. Preparing big data for analytics is similar, but different. Organizations are retraining existing personnel, augmenting their teams with consultants, and hiring new personnel. The focus is on data analysts, data scientists, and data architects who can develop the applications for data exploration and discovery analytics that organizations need for getting value from big data. This report accelerates users’ understanding of the many options that are available for big data management (BDM), including old, new, and upcoming options. The report brings readers up to date so they can make intelligent decisions about which tools, techniques, and team structures to apply to their next-generation solutions for BDM.
Map-Reduce for Machine Learning on Multicore We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model can be written in a certain ‘summation form,’ which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors.
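A minimal sketch of the 'summation form' idea, assuming linear regression as the learner: the sufficient statistics X'X and X'y decompose into per-chunk sums, so each chunk can be handled by a separate core (map) and the partial sums combined (reduce). Plain sequential chunking stands in here for a real multicore framework.

    # Summation-form linear regression: per-chunk partial sums (map),
    # combined and solved once (reduce).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

    def map_chunk(Xc, yc):
        # per-chunk partial sums of the sufficient statistics
        return Xc.T @ Xc, Xc.T @ yc

    chunks = zip(np.array_split(X, 4), np.array_split(y, 4))
    partials = [map_chunk(Xc, yc) for Xc, yc in chunks]

    # reduce: add the partial statistics, then solve the normal equations
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    print(np.linalg.solve(XtX, Xty))  # recovers approximately [2, -1, 0.5]

Because the per-chunk statistics are simply added, the final estimate is identical to the single-core one; only the wall-clock time changes with the number of cores.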
MapReduce: Simplified Data Processing on Large Clusters MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
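The canonical illustration is word count; the sketch below mimics the map, shuffle, and reduce phases in ordinary Python, with none of the partitioning, scheduling, or fault tolerance the real runtime provides.

    # Word count in the map/shuffle/reduce pattern.
    from collections import defaultdict

    def map_fn(document):
        # map: emit an intermediate (word, 1) pair per occurrence
        for word in document.split():
            yield word, 1

    def reduce_fn(word, counts):
        # reduce: merge all values sharing the same intermediate key
        return word, sum(counts)

    documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # shuffle phase: group intermediate values by key
    groups = defaultdict(list)
    for doc in documents:
        for word, count in map_fn(doc):
            groups[word].append(count)

    print(dict(reduce_fn(w, c) for w, c in groups.items()))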
Markov Chains Most of our study of probability has dealt with independent trials processes. These processes are the basis of classical probability theory and much of statistics. We have discussed two of the principal theorems for these processes: the Law of Large Numbers and the Central Limit Theorem. We have seen that when a sequence of chance experiments forms an independent trials process, the possible outcomes for each experiment are the same and occur with the same probability. Further, knowledge of the outcomes of the previous experiments does not influence our predictions for the outcomes of the next experiment. The distribution for the outcomes of a single experiment is sufficient to construct a tree and a tree measure for a sequence of n experiments, and we can answer any probability question about these experiments by using this tree measure. Modern probability theory studies chance processes for which the knowledge of previous outcomes influences predictions for future experiments. In principle, when we observe a sequence of chance experiments, all of the past outcomes could influence our predictions for the next experiment. For example, this should be the case in predicting a student’s grades on a sequence of exams in a course. But to allow this much generality would make it very difficult to prove general results. In 1907, A. A. Markov began the study of an important new type of chance process. In this process, the outcome of a given experiment can affect the outcome of the next experiment. This type of process is called a Markov chain.
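A small simulation makes the defining property concrete: the next state is drawn using only the current state's row of a transition matrix, never the longer history. The two-state 'weather' chain below is an invented example.

    # Simulate a two-state Markov chain with an invented transition matrix.
    import numpy as np

    states = ["sunny", "rainy"]
    P = np.array([[0.9, 0.1],    # P(next state | today sunny)
                  [0.5, 0.5]])   # P(next state | today rainy)

    rng = np.random.default_rng(1)
    state = 0
    path = [states[state]]
    for _ in range(10):
        state = rng.choice(2, p=P[state])  # depends only on current state
        path.append(states[state])
    print(" -> ".join(path))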
Matchbox: Large Scale Online Bayesian Recommendations We present a probabilistic model for generating personalised recommendations of items to users of a web service. The Matchbox system makes use of content information in the form of user and item meta data in combination with collaborative filtering information from previous user behavior in order to predict the value of an item for a user. Users and items are represented by feature vectors which are mapped into a low-dimensional ‘trait space’ in which similarity is measured in terms of inner products. The model can be trained from different types of feedback in order to learn user-item preferences. Here we present three alternatives: direct observation of an absolute rating each user gives to some items, observation of a binary preference (like/don't like) and observation of a set of ordinal ratings on a user-specific scale. Efficient inference is achieved by approximate message passing involving a combination of Expectation Propagation (EP) and Variational Message Passing. We also include a dynamics model which allows an item’s popularity, a user’s taste or a user’s personal rating scale to drift over time. By using Assumed-Density Filtering (ADF) for training, the model requires only a single pass through the training data. This is an on-line learning algorithm capable of incrementally taking account of new data so the system can immediately reflect the latest user preferences. We evaluate the performance of the algorithm on the MovieLens and Netflix data sets consisting of approximately 1,000,000 and 100,000,000 ratings respectively. This demonstrates that training the model using the on-line ADF approach yields state-of-the-art performance with the option of improving performance further if computational resources are available by performing multiple EP passes over the training data.
Math for Machine Learning The goal of this document is to provide a 'refresher' on continuous mathematics for computer science students. It is by no means a rigorous course on these topics. The presentation, motivation, etc., are all from a machine learning perspective. The hope, however, is that it's useful in other contexts. The two major topics covered are linear algebra and calculus (probability is currently left off).
Matrix decompositions for regression analysis
Matrix Differentiation Throughout this presentation I have chosen to use a symbolic matrix notation. This choice was not made lightly. I am a strong advocate of index notation, when appropriate. For example, index notation greatly simplifies the presentation and manipulation of differential geometry. As a rule-of-thumb, if your work is going to primarily involve differentiation with respect to the spatial coordinates, then index notation is almost surely the appropriate choice. In the present case, however, I will be manipulating large systems of equations in which the matrix calculus is relatively simple, while the matrix algebra and matrix arithmetic are messy and more involved. Thus, I have chosen to use symbolic notation.
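As a small taste of the symbolic style advocated here, two standard matrix-derivative identities, written in the common denominator-layout convention (the note's own conventions may differ in transposition):

$$\frac{\partial}{\partial \mathbf{x}}\left(\mathbf{a}^{\mathsf{T}}\mathbf{x}\right) = \mathbf{a}, \qquad \frac{\partial}{\partial \mathbf{x}}\left(\mathbf{x}^{\mathsf{T}}\mathbf{A}\,\mathbf{x}\right) = \left(\mathbf{A} + \mathbf{A}^{\mathsf{T}}\right)\mathbf{x}.$$

Derived symbolically, neither identity requires writing out a single index, which is precisely the economy the presentation aims for.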
Matrix Factorization Techniques for Recommender Systems Modern consumers are inundated with choices. Electronic retailers and content providers offer a huge selection of products, with unprecedented opportunities to meet a variety of special needs and tastes. Matching consumers with the most appropriate products is key to enhancing user satisfaction and loyalty. Therefore, more retailers have become interested in recommender systems, which analyze patterns of user interest in products to provide personalized recommendations that suit a user’s taste. Because good personalized recommendations can add another dimension to the user experience, e-commerce leaders like Amazon.com and Netflix have made recommender systems a salient part of their websites. Such systems are particularly useful for entertainment products such as movies, music, and TV shows. Many customers will view the same movie, and each customer is likely to view numerous different movies. Customers have proven willing to indicate their level of satisfaction with particular movies, so a huge volume of data is available about which movies appeal to which customers. Companies can analyze this data to recommend movies to particular customers.
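A minimal latent-factor sketch of the idea, assuming the classic formulation in which the ratings matrix is approximated by a product of user and item factor matrices fit by stochastic gradient descent on the observed entries only; the dimensions, learning rate, and toy ratings below are illustrative choices, not the production systems the article describes.

    # Approximate the ratings matrix R by U @ V.T using SGD on observed cells.
    import numpy as np

    R = np.array([[5, 3, 0, 1],
                  [4, 0, 0, 1],
                  [1, 1, 0, 5],
                  [0, 1, 5, 4]], dtype=float)  # 0 = unobserved
    k, lr, reg = 2, 0.01, 0.02
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
    V = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors

    observed = [(i, j) for i in range(R.shape[0])
                for j in range(R.shape[1]) if R[i, j] > 0]
    for _ in range(2000):
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * U[i] - reg * V[j])

    print(np.round(U @ V.T, 1))  # observed cells should be close to R

The filled-in zero cells of U @ V.T are the model's predicted ratings, which is exactly the quantity a recommender ranks items by.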
Maximize the Effectiveness of your Text Analytics Initiatives (Slide Deck)
Maximizing the value provided by a Big Data Platform
Max-Margin Markov Networks In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches.
Measurement Drives Behavior Measurement impacts our personal lives every single day. If we want to lose some weight, we start by standing on the scale. Based on the outcome, we decide how much weight we need to lose, and every other day we check our progress. If there is enough progress, we become encouraged to lose more, and if we are disappointed, we’re driven to add even more effort in order to achieve our goal. In short, measurement drives our behavior. It can be witnessed in countless ways in our private lives. In fact, it is an important principle in the social sciences, often called the Hawthorne effect. In the business world this is no different; measurement also drives our professional behavior. Once your business starts measuring the results of a certain process, your employees will start focusing on it. There are numerous examples: If the CFO starts tracking the days-sales-outstanding (DSO—i.e., the average number of days it takes customers to pay their bills) on a daily basis, instead of assuming that customers will pay within 14 days or so, the people in the accounts receivable departments are more likely to pay attention and exert greater effort to make collections. If hotel managers and their front desk staff are held accountable for the percentage of guests that fill out the customer satisfaction survey, they will be more likely to remind guests of the survey. Measurement helps us not only to focus on our goals and objectives, but also to balance our actions. If you measure production speed alone in a manufacturing process, it is likely that quality issues will arise. For balance, you also need to measure how many produced units need rework. If a procurement department is only measured on how much additional discount it can squeeze out of contract manufacturers, it becomes hard to avoid unethical practices, such as the use of child labor in low-wage countries and the use of cheaper and environmentally unfriendly materials and production processes. Procurement departments need to identify a balanced set of metrics that includes ethical issues as well as price. In each of the functional disciplines within an organization—finance, sales, marketing, logistics, manufacturing, procurement, human resources (HR) or information technology (IT)—measurement is a key element of management, and ultimately of bottom-line performance.
Measures of dissimilarity Patterns or objects analysed using the techniques described in this book are usually represented by a vector of measurements. Many of the techniques require some measure of dissimilarity or distance between two pattern vectors, although sometimes data can arise directly in the form of a dissimilarity matrix….
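For instance, a pairwise Euclidean dissimilarity matrix for a handful of pattern vectors can be computed as below (using SciPy; the vectors are invented for illustration):

    # Pairwise Euclidean dissimilarities between pattern vectors.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    patterns = np.array([[0.0, 0.0],
                         [3.0, 4.0],
                         [6.0, 8.0]])
    D = squareform(pdist(patterns, metric="euclidean"))
    print(D)  # D[0, 1] == 5.0, D[0, 2] == 10.0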
Measuring Distances Applied multivariate statistics
Measuring Predictability: Theory and Macroeconomic Applications We propose a measure of predictability based on the ratio of the expected loss of a short-run forecast to the expected loss of a long-run forecast. This predictability measure can be tailored to the forecast horizons of interest, and it allows for general loss functions, univariate or multivariate information sets, and covariance stationary or difference stationary processes. We propose a simple estimator, and we suggest resampling methods for inference. We then provide several macroeconomic applications. First, we illustrate the implementation of predictability measures based on fitted parametric models for several U.S. macroeconomic time series. Second, we analyze the internal propagation mechanism of a standard dynamic macroeconomic model by comparing the predictability of model inputs and model outputs. Third, we use predictability as a metric for assessing the similarity of data simulated from the model and actual data. Finally, we outline several nonparametric extensions of our approach.
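One natural way to write the measure just described, for a short horizon j, a long horizon k > j, a loss function L, and h-step-ahead forecast error e_{t+h,t}, is the following sketch (the paper's exact notation and conditioning set may differ):

$$P(j, k) \;=\; 1 - \frac{\mathbb{E}\left[L\left(e_{t+j,\,t}\right)\right]}{\mathbb{E}\left[L\left(e_{t+k,\,t}\right)\right]}, \qquad j < k.$$

Under this form, P is near 1 when short-run forecasts are far more accurate than long-run ones, and near 0 when the series is essentially as hard to forecast at horizon j as at horizon k.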
mediation: R Package for Causal Mediation Analysis In this paper, we describe the R package mediation for conducting causal mediation analysis in applied empirical research. In many scientific disciplines, the goal of researchers is not only estimating causal effects of a treatment but also understanding the process in which the treatment causally affects the outcome. Causal mediation analysis is frequently used to assess potential causal mechanisms. The mediation package implements a comprehensive suite of statistical tools for conducting such an analysis. The package is organized into two distinct approaches. Using the model-based approach, researchers can estimate causal mediation effects and conduct sensitivity analysis under the standard research design. Furthermore, the design-based approach provides several analysis tools that are applicable under different experimental designs. This approach requires weaker assumptions than the model-based approach. We also implement a statistical method for dealing with multiple (causally dependent) mediators, which are often encountered in practice. Finally, the package also offers a methodology for assessing causal mediation in the presence of treatment noncompliance, a common problem in randomized trials.
Metalearning for Feature Selection A general formulation of optimization problems in which various candidate solutions may use different feature-sets is presented, encompassing supervised classification, automated program learning and other cases. A novel characterization of the concept of a ‘good quality feature’ for such an optimization problem is provided; and a proposal regarding the integration of quality based feature selection into metalearning is suggested, wherein the quality of a feature for a problem is estimated using knowledge about related features in the context of related problems. Results are presented regarding extensive testing of this ‘feature metalearning’ approach on supervised text classification problems; it is demonstrated that, in this context, feature metalearning can provide significant and sometimes dramatic speedup over standard feature selection heuristics.
Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data Market analysis is a representative data analysis process with many applications. In such an analysis, critical numerical measures, such as profit and sales, fluctuate over time and form time-series data. Moreover, the time series data correspond to market segments, which are described by a set of attributes, such as age, gender, education, income level, and product-category, that form a multi-dimensional structure. To better understand market dynamics and predict future trends, it is crucial to study the dynamics of time-series in multi-dimensional market segments. This is a topic that has been largely ignored in time series and data cube research. In this study, we examine the issues of anomaly detection in multi-dimensional time-series data. We propose a time-series data cube to capture the multi-dimensional space formed by the attribute structure. This facilitates the detection of anomalies based on expected values derived from higher level, "more general" time-series. Anomaly detection in a time-series data cube poses computational challenges, especially for high-dimensional, large data sets. To this end, we also propose an efficient search algorithm to iteratively select subspaces in the original high-dimensional space and detect anomalies within each one. Our experiments with both synthetic and real-world data demonstrate the effectiveness and efficiency of the proposed solution.
Mining Software Quality from Software Reviews: Research Trends and Open Issues Software review text fragments contain considerable valuable information about users' experience. It includes a huge set of properties including the software quality. Opinion mining or sentiment analysis is concerned with analyzing textual user judgments. The application of sentiment analysis on software reviews can find a quantitative value that represents software quality. Although many software quality methods have been proposed, they are considered difficult to customize, and many of them are limited. This article investigates the application of opinion mining as an approach to extract software quality properties. We found that the major issues of software reviews mining using sentiment analysis are due to the software lifecycle and the diverse users and teams.
Mobile Edge Computing: Survey and Research Outlook Driven by the visions of Internet of Things and 5G communications, recent years have seen a paradigm shift in mobile computing, from the centralized Mobile Cloud Computing towards Mobile Edge Computing (MEC). The main feature of MEC is to push mobile computing, network control and storage to the network edges (e.g., base stations and access points) so as to enable computation-intensive and latency-critical applications at the resource-limited mobile devices. MEC promises dramatic reduction in latency and mobile energy consumption, tackling the key challenges for materializing 5G vision. The promised gains of MEC have motivated extensive efforts in both academia and industry on developing the technology. A main thrust of MEC research is to seamlessly merge the two disciplines of wireless communications and mobile computing, resulting in a wide-range of new designs ranging from techniques for computation offloading to network architectures. This paper provides a comprehensive survey of the state-of-the-art MEC research with a focus on joint radio-and-computational resource management. We also present a research outlook consisting of a set of promising directions for MEC research, including MEC system deployment, cache-enabled MEC, mobility management for MEC, green MEC, as well as privacy-aware MEC. Advancements in these directions will facilitate the transformation of MEC from theory to practice. Finally, we introduce recent standardization efforts on MEC as well as some typical MEC application scenarios.
Model-based Machine Learning Several decades of research in the field of machine learning have resulted in a multitude of different algorithms for solving a broad range of problems. To tackle a new application, a researcher typically tries to map their problem onto one of these existing methods, often influenced by their familiarity with specific algorithms and by the availability of corresponding software implementations. In this study, we describe an alternative methodology for applying machine learning, in which a bespoke solution is formulated for each new application. The solution is expressed through a compact modelling language, and the corresponding custom machine learning code is then generated automatically. This model-based approach offers several major advantages, including the opportunity to create highly tailored models for specific scenarios, as well as rapid prototyping and comparison of a range of alternative models. Furthermore, newcomers to the field of machine learning do not have to learn about the huge range of traditional methods, but instead can focus their attention on understanding a single modelling environment. In this study, we show how probabilistic graphical models, coupled with efficient inference algorithms, provide a very flexible foundation for model-based machine learning, and we outline a large-scale commercial application of this framework involving tens of millions of users. We also describe the concept of probabilistic programming as a powerful software environment for model-based machine learning, and we discuss a specific probabilistic programming language called Infer.NET, which has been widely used in practical applications.
Modeling and Optimization for Big Data Analytics With pervasive sensors continuously collecting and storing massive amounts of information, there is no doubt this is an era of data deluge. Learning from these large volumes of data is expected to bring significant science and engineering advances along with improvements in quality of life. However, with such a big blessing come big challenges. Running analytics on voluminous data sets by central processors and storage units seems infeasible, and with the advent of streaming data sources, learning must often be performed in real time, typically without a chance to revisit past entries. “Workhorse” signal processing (SP) and statistical learning tools have to be re-examined in today’s high-dimensional data regimes. This article contributes to the ongoing cross-disciplinary efforts in data science by putting forth encompassing models capturing a wide range of SP-relevant data analytic tasks, such as principal component analysis (PCA), dictionary learning (DL), compressive sampling (CS), and subspace clustering. It offers scalable architectures and optimization algorithms for decentralized and online learning problems, while revealing fundamental insights into the various analytic and implementation tradeoffs involved. Extensions of the encompassing models to timely data-sketching, tensor- and kernel-based learning tasks are also provided. Finally, the close connections of the presented framework with several big data tasks, such as network visualization, decentralized and dynamic estimation, prediction, and imputation of network link load traffic, as well as imputation in tensor-based medical imaging are highlighted.
mtk: A General-Purpose and Extensible R Environment for Uncertainty and Sensitivity Analyses of Numerical Experiments Along with increased complexity of the models used for scientific activities and engineering, come diverse and greater uncertainties. Today, effectively quantifying the uncertainties contained in a model appears to be more important than ever. Scientific fellows know how serious it is to calibrate their model in a robust way, and decision-makers describe how critical it is to keep the best effort to reduce the uncertainties about the model. Effectively assessing the uncertainties about the model requires mastering all the tasks involved in the numerical experiments, from optimizing the experimental design to managing the very time consuming aspect of model simulation and choosing the adequate indicators and analysis methods. In this paper, we present an open framework for organizing the complexity associated with numerical model simulation and analyses. Named mtk (Mexico Toolkit), the developed system aims at providing practitioners from different disciplines with a systematic and easy way to compare and to find the best method to effectively uncover and quantify the uncertainties contained in the model and further to evaluate their impact on the performance of the model. Such requirements imply that the system must be generic, universal, homogeneous, and extensible. This paper discusses such an implementation using the R scientific computing platform and demonstrates its functionalities with examples from agricultural modeling. The package mtk is of general purpose and easy to extend. Numerous methods are already available in the actual release version, including Fast, Sobol, Morris, Basic Monte-Carlo, Regression, LHS (Latin Hypercube Sampling), PLMM (Polynomial Linear metamodel). Most of them are compiled from available R packages with extension tools delivered by package mtk.
Multidimensional Scaling by Majorization: A Review A major breakthrough in the visualization of dissimilarities between pairs of objects was the formulation of the least-squares multidimensional scaling (MDS) model as defined by the Stress function. This function is quite flexible in that it allows possibly nonlinear transformations of the dissimilarities to be represented by distances between points in a low dimensional space. To obtain the visualization, the Stress function should be minimized over the coordinates of the points and over the transformation. In a series of papers, Jan de Leeuw has made a significant contribution to majorization methods for the minimization of Stress in least-squares MDS. In this paper, we present a review of the majorization algorithm for MDS as implemented in the smacof package and related approaches. We present several illustrative examples and special cases.
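For reference, the raw Stress function being minimized can be written as

$$\sigma(\mathbf{X}) \;=\; \sum_{i<j} w_{ij}\,\bigl(\hat{d}_{ij} - d_{ij}(\mathbf{X})\bigr)^{2},$$

where $\hat{d}_{ij}$ are the (possibly transformed) dissimilarities, $d_{ij}(\mathbf{X})$ the Euclidean distances between points i and j in the low-dimensional configuration $\mathbf{X}$, and $w_{ij}$ nonnegative weights. The notation follows common smacof usage rather than any single paper in the series.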
Multi-Label Learning with Millions of Labels: Recommending Advertiser Bid Phrases for Web Pages Recommending phrases from web pages for advertisers to bid on against search engine queries is an important research problem with direct commercial impact. Most approaches have found it infeasible to determine the relevance of all possible queries to a given ad landing page and have focussed on making recommendations from a small set of phrases extracted (and expanded) from the page using NLP and ranking based techniques. In this paper, we eschew this paradigm, and demonstrate that it is possible to efficiently predict the relevant subset of queries from a large set of monetizable ones by posing the problem as a multi-label learning task with each query being represented by a separate label. We develop Multi-label Random Forests to tackle problems with millions of labels. Our proposed classifier has prediction costs that are logarithmic in the number of labels and can make predictions in a few milliseconds using 10 Gb of RAM. We demonstrate that it is possible to generate training data for our classifier automatically from click logs without any human annotation or intervention. We train our classifier on tens of millions of labels, features and training points in less than two days on a thousand node cluster. We develop a sparse semi-supervised multi-label learning formulation to deal with training set biases and noisy labels harvested automatically from the click logs. This formulation is used to infer a belief in the state of each label for each training ad and the random forest classifier is extended to train on these beliefs rather than the given labels. Experiments reveal significant gains over ranking and NLP based techniques on a large test set of 5 million ads using multiple metrics.
Multimodal Machine Learning: A Survey and Taxonomy Our experience of the world is multimodal – we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
Multiple Change-point Detection: a Selective Overview Very long and noisy sequence data arise in fields ranging from the biological sciences to the social sciences, including high-throughput data in genomics and stock prices in econometrics. Often such data are collected in order to identify and understand shifts in trend, e.g., from a bull market to a bear market in finance or from a normal number of chromosome copies to an excessive number of chromosome copies in genetics. Thus, identifying multiple change points in a long, possibly very long, sequence is an important problem. In this article, we review both classical and new multiple change-point detection strategies. Considering the long history and the extensive literature on change-point detection, we provide an in-depth discussion of a normal mean change-point model from the aspects of regression analysis, hypothesis testing, consistency and inference. In particular, we present a strategy to gather and aggregate local information for change-point detection that has become the cornerstone of several emerging methods because of its attractiveness in both computational and theoretical properties.
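As a baseline illustration of the single change-in-mean problem on which the reviewed strategies build, the sketch below scans candidate split points and maximizes a standardized two-sample statistic; it is a generic textbook approach, not one of the specific methods the overview covers.

    # Toy single change-point detector for a shift in mean.
    import numpy as np

    rng = np.random.default_rng(2)
    x = np.concatenate([rng.normal(0.0, 1.0, 100),   # mean 0 before the change
                        rng.normal(1.5, 1.0, 100)])  # mean 1.5 after

    n = len(x)
    stats = []
    for t in range(10, n - 10):  # candidate change locations
        left, right = x[:t], x[t:]
        pooled = np.sqrt(x.var() * (1 / t + 1 / (n - t)))
        stats.append(abs(left.mean() - right.mean()) / pooled)

    print("estimated change point:", 10 + int(np.argmax(stats)))  # near 100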
Multiple Factor Analysis Multiple factor analysis (MFA, see Escofier and Pagès, 1990, 1994) analyzes observations described by several “blocks” or sets of variables. MFA seeks the common structures present in all or some of these sets. MFA is performed in two steps. First a principal component analysis (PCA) is performed on each data set which is then “normalized” by dividing all its elements by the square root of the first eigenvalue obtained from its PCA. Second, the normalized data sets are merged to form a unique matrix and a global PCA is performed on this matrix. The individual data sets are then projected onto the global analysis to analyze communalities and discrepancies. MFA is used in very different domains such as sensory evaluation, economy, ecology, and chemistry.
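A compact NumPy sketch of the two steps just described, assuming the usual sample-covariance scaling for the first eigenvalue (conventions differ slightly across MFA implementations):

    # MFA in two steps: per-block normalization, then a global PCA.
    import numpy as np

    rng = np.random.default_rng(3)
    blocks = [rng.normal(size=(20, 4)), rng.normal(size=(20, 6))]  # two variable sets

    normalized = []
    for B in blocks:
        Bc = B - B.mean(axis=0)
        # first eigenvalue of the block's PCA (sample covariance scaling)
        lam1 = np.linalg.svd(Bc, compute_uv=False)[0] ** 2 / (len(Bc) - 1)
        normalized.append(Bc / np.sqrt(lam1))  # step 1: per-block weighting

    Z = np.hstack(normalized)                  # step 2: merge the blocks ...
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    global_scores = U[:, :2] * s[:2]           # ... and take global PCA scores
    print(global_scores[:5])

The per-block division equalizes the first-axis inertia of each block, so no single set of variables dominates the global PCA.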
Multiple Instance Learning: A Survey of Problem Characteristics and Applications Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and allows weakly labeled data to be leveraged. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas are described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight on how the problem characteristics affect MIL algorithms, recommendations for future benchmarking and promising avenues for research.
Multisensor data fusion: A review of the state-of-the-art There has been an ever-increasing interest in multi-disciplinary research on multisensor data fusion technology, driven by its versatility and diverse areas of application. Therefore, there seems to be a real need for an analytical review of recent developments in the data fusion domain. This paper proposes a comprehensive review of the data fusion state of the art, exploring its conceptualizations, benefits, and challenging aspects, as well as existing methodologies. In addition, several future directions of research in the data fusion community are highlighted and described.
Multi-Stakeholder Recommendation: Applications and Challenges Recommender systems have been successfully applied to assist decision making by producing a list of item recommendations tailored to user preferences. Traditional recommender systems only focus on optimizing the utility of the end users who are the receiver of the recommendations. By contrast, multi-stakeholder recommendation attempts to generate recommendations that satisfy the needs of both the end users and other parties or stakeholders. This paper provides an overview and discussion about the multi-stakeholder recommendations from the perspective of practical applications, available data sets, corresponding research challenges and potential solutions.
Multivariate Archimax Copulas A multivariate extension of the bivariate class of Archimax copulas was recently proposed by Mesiar and Jágr (2013), who asked under which conditions it holds. This paper answers their question and provides a stochastic representation of multivariate Archimax copulas. A few basic properties of these copulas are explored, including their minimum and maximum domains of attraction. Several non-trivial examples of multivariate Archimax copulas are also provided.
Multivariate Linear Models in R The multivariate linear model is $\mathbf{Y}_{(n \times m)} = \mathbf{X}_{(n \times (k+1))}\,\mathbf{B}_{((k+1) \times m)} + \mathbf{E}_{(n \times m)}$, where Y is a matrix of n observations on m response variables; X is a model matrix with columns for k+1 regressors, typically including an initial column of 1s for the regression constant; B is a matrix of regression coefficients, one column for each response variable; and E is a matrix of errors. This model can be fit with the lm function in R, where the left-hand side of the model comprises a matrix of response variables, and the right-hand side is specified exactly as for a univariate linear model (i.e., with a single response variable). This appendix to Fox and Weisberg (2011) explains how to use the Anova and linearHypothesis functions in the car package to test hypotheses for parameters in multivariate linear models, including models for repeated-measures data.
Multivariate Pricing Price strategy is the key marketing tool for companies to increase their competitive edge, but too often, prices are based on costs, not on customers’ perceptions of value. Value-based pricing is a business strategy which sets selling prices based on the perceived value to the customer, rather than the actual cost of the product, the market price, competitors’ prices, or the historical price. Practically speaking, the goal is to align the money spent with the value perceived. For example, the number of users, lifetime spending, number of transactions, value of transaction, return-on-investment, cost saving, revenue; the list can continue. The most common techniques employ straightforward methods such as: ‘Would you pay for this item at this price?’. While the van Westendorp method and conjoint analysis are useful, this article focuses on multivariate pricing techniques that build flexibility and agility into the pricing models, and therefore can be more widely employed by clients and product managers.

N

Narrative Science Systems: A Review Automatic narration of events and entities is the need of the hour, especially when live reporting is critical and the volume of information to be narrated is huge. This paper discusses the challenges in this context, along with the algorithms used to build such systems. From a systematic study, we can infer that most of the work done in this area is related to statistical data. It was also found that subjective evaluation and the contribution of experts remain limited in the narration context.
Negative Results in Computer Vision: A Perspective A negative result is when the outcome of an experiment or a model is not what is expected or when a hypothesis does not hold. Despite being often overlooked in the scientific community, negative results are results and they carry value. While this topic has been extensively discussed in other fields such as social sciences and biosciences, less attention has been paid to it in the computer vision community. The unique characteristics of computer vision, in particular its experimental aspect, calls for a special treatment of this matter. In this paper, I will address questions such as what makes negative results important, how they should be disseminated, and how they should be incentivized. Further, I will discuss issues such as computer and human vision interaction, experimental design and statistical hypothesis testing, performance evaluation and model comparison, as well as computer vision research culture.
Network Structure Inference, A Survey: Motivations, Methods, and Applications Networks are used to represent relationships between entities in many complex systems, spanning from online social networks to biological cell development and brain activity. These networks model relationships that present various challenges. In many cases, relationships between entities are unambiguously known: are two users friends in a social network? Do two researchers collaborate on a published paper? Do two road segments in a transportation system intersect? These are unambiguous and directly observable in the system in question. In most cases, relationships between nodes are not directly observable and must be inferred: does one gene regulate the expression of another? Do two animals who physically co-locate have a social bond? Who infected whom in a disease outbreak? Existing approaches use specialized knowledge in their home domains to infer networks and measure the goodness of the inferred network for a specific task. However, current research lacks a rigorous validation framework which employs standard statistical validation. In this survey, we examine how network representations are learned from non-network data, the variety of questions and tasks on these data over several domains, and validation strategies for measuring the inferred network’s capability of answering questions on the original system of interest.
Neural Graph Machines: Learning Neural Networks Using Graphs Label propagation is a powerful and flexible semi-supervised learning technique on graphs. Neural networks, on the other hand, have proven track records in many supervised learning tasks. In this work, we propose a training framework with a graph-regularised objective, namely ‘Neural Graph Machines’, that can combine the power of neural networks and label propagation. This work generalises previous literature on graph-augmented training of neural networks, enabling it to be applied to multiple neural architectures (Feed-forward NNs, CNNs and LSTM RNNs) and a wide range of graphs. The new objective allows the neural networks to harness both labeled and unlabeled data by: (a) allowing the network to train using labeled data as in the supervised setting, (b) biasing the network to learn similar hidden representations for neighboring nodes on a graph, in the same vein as label propagation. Such architectures with the proposed objective can be trained efficiently using stochastic gradient descent and scaled to large graphs, with a runtime that is linear in the number of edges. The proposed joint training approach convincingly outperforms many existing methods on a wide range of tasks (multi-label classification on social graphs, news categorization, document classification and semantic intent classification), with multiple forms of graph inputs (including graphs with and without node-level features) and using different types of neural networks.
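A sketch of the shape of the graph-regularised objective just described: a supervised term on labeled nodes plus a weighted penalty pulling the hidden representations of neighboring nodes together. The arrays below are placeholders standing in for a network's outputs, not a trained model, and the specific loss form is an illustrative assumption.

    # Graph-regularised objective: supervised loss + neighbor-similarity penalty.
    import numpy as np

    H = np.array([[0.2, 0.8],    # hidden representations, one row per node
                  [0.3, 0.7],
                  [0.9, 0.1]])
    edges = [(0, 1, 1.0), (1, 2, 0.5)]   # (node u, node v, edge weight)
    labeled_prob = {0: 0.8, 2: 0.7}      # labeled node -> prob. of its true class

    # supervised term: cross-entropy on the labeled nodes only
    supervised = -sum(np.log(p) for p in labeled_prob.values())
    # graph term: penalize distance between neighbors' hidden representations
    graph_penalty = sum(w * np.sum((H[u] - H[v]) ** 2) for u, v, w in edges)

    alpha = 0.1                          # trade-off hyperparameter
    print("objective:", supervised + alpha * graph_penalty)

Because the graph term decomposes over edges, stochastic gradient descent over minibatches of edges keeps the training cost linear in the number of edges, as the abstract notes.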
Neural Machine Translation and Sequence-to-sequence Models: A Tutorial This tutorial introduces a new and powerful set of techniques variously called ‘neural machine translation’ or ‘neural sequence-to-sequence models’. These techniques have been used in a number of tasks regarding the handling of human language, and can be a powerful tool in the toolbox of anyone who wants to model sequential data of some sort. The tutorial assumes that the reader knows the basics of math and programming, but does not assume any particular experience with neural networks or natural language processing. It attempts to explain the intuition behind the various methods covered, then delves into them with enough mathematical detail to understand them concretely, and culminates with a suggestion for an implementation exercise, where readers can test that they understood the content in practice.
Neural networks and rational functions Neural networks and rational functions efficiently approximate each other. In more detail, it is shown here that for any ReLU network, there exists a rational function of degree $O(\text{polylog}(1/\epsilon))$ which is $\epsilon$-close, and similarly for any rational function there exists a ReLU network of size $O(\text{polylog}(1/\epsilon))$ which is $\epsilon$-close. By contrast, polynomials need degree $\Omega(\text{poly}(1/\epsilon))$ to approximate even a single ReLU. When converting a ReLU network to a rational function as above, the hidden constants depend exponentially on the number of layers, which is shown to be tight; in other words, a compositional representation can be beneficial even for rational functions.
Next Generation Business Intelligence and Analytics: A Survey Business Intelligence and Analytics (BI&A) is the process of extracting and predicting business-critical insights from data. Traditional BI focused on data collection, extraction, and organization to enable efficient query processing for deriving insights from historical data. With the rise of big data and cloud computing, there are many challenges and opportunities for BI&A. Especially with the growing number of data sources, traditional BI&A is evolving to provide intelligence at different scales and perspectives – operational BI, situational BI, self-service BI. In this survey, we review the evolution of business intelligence systems in full scale, from back-end architecture to front-end applications. We focus on the changes in the back-end architecture that deal with the collection and organization of the data. We also review the changes in the front-end applications, where analytic services and visualization are the core components. Using a use case from BI in healthcare, which is one of the most complex enterprises, we show how BI&A will play an important role beyond its traditional usage. The survey provides a holistic view of Business Intelligence and Analytics for anyone interested in getting a complete picture of the different pieces in the emerging next generation of BI&A solutions.
Nonlinear functional regression: a functional RKHS approach This paper deals with functional regression, in which the input attributes as well as the response are functions. To deal with this problem, we develop a functional reproducing kernel Hilbert space approach; here, a kernel is an operator acting on a function and yielding a function. We demonstrate basic properties of these functional RKHSs, as well as a representer theorem for this setting; we investigate the construction of kernels; and we provide some experimental insight.
Nonlinear probability. A theory with incompatible stochastic variables In 1991 J.F. Aarnes introduced the concept of quasi-measures in a compact topological space $\Omega$ and established the connection between quasi-states on $C(\Omega)$ and quasi-measures in $\Omega$. This work solved the linearity problem of quasi-states on $C^*$-algebras formulated by R.V. Kadison in 1965. The answer is that a quasi-state need not be linear, so a quasi-state need not be a state. We introduce nonlinear measures in a space $\Omega$ which is a generalization of a measurable space. In this more general setting we are still able to define integration and establish a representation theorem for the corresponding functionals. A probabilistic language is chosen since we feel that the subject should be of some interest to probabilists. In particular we point out that the theory allows for incompatible stochastic variables. The need for incompatible variables is well known in quantum mechanics, but the need seems natural also in other contexts, as we try to explain by way of a questionnaire example. Keywords and phrases: Epistemic probability, Integration with respect to measures and other set functions, Banach algebras of continuous functions, Set functions and measures on topological spaces, States, Logical foundations of quantum mechanics.
Notes: A Continuous Model of Neural Networks. Part I: Residual Networks In this series of notes, we try to model neural networks as discretizations of continuous flows on the space of data, which we call the flow model. The idea comes from an observation of their similarity in mathematical structure. This conceptual analogy has not been proven useful yet, but it seems interesting to explore. In this part, we start with a linear transport equation (with a nonlinear transport velocity field) and obtain a class of residual-type neural networks. If the transport velocity field has a special form, the obtained network is similar to the original ResNet. This neural network can be regarded as a discretization of the continuous flow defined by the transport equation. In the end, a summary of the correspondence between neural networks and transport equations is presented, followed by some general discussions.
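To make the analogy concrete, here is a toy numpy sketch, under an assumed two-layer velocity field of our own choosing: one explicit-Euler step of the flow dx/dt = v(x) has exactly the algebraic form of a residual block, x plus a small update of x.

```python
# Euler discretization of dx/dt = v(x) looks exactly like a residual block.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

def v(x):
    """A nonlinear transport velocity field (here: a small two-layer map)."""
    return W2 @ np.tanh(W1 @ x)

def euler_step(x, h):
    """One Euler step: x + h*v(x), i.e. a residual update."""
    return x + h * v(x)

x = rng.normal(size=8)
for _ in range(10):        # ten 'layers' = ten time steps of the flow
    x = euler_step(x, h=0.1)
```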
Novelty Detection in Learning Systems Novelty detection is concerned with recognising inputs that differ in some way from those that are usually seen. It is a useful technique in cases where an important class of data is under-represented in the training set, which means that the performance of the network will be poor for that class. In some circumstances, such as medical data and fault detection, it is often precisely the class that is under-represented in the data, the disease or potential fault, that the network should detect. In novelty detection systems the network is trained only on the negative examples where that class is not present, and then detects inputs that do not fit into the model that it has acquired, that is, members of the novel class. This paper reviews the literature on novelty detection in neural networks and other machine learning techniques, as well as providing brief overviews of the related topics of statistical outlier detection and novelty detection in biological organisms.
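A minimal sketch of this recipe, using scikit-learn's OneClassSVM on synthetic data of our own invention (the paper itself is a review and prescribes no single method): train only on "normal" examples, then flag inputs that do not fit the acquired model.

```python
# Train on negatives only; inputs that don't fit the model are flagged novel.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # only the 'usual' class

model = OneClassSVM(kernel="rbf", nu=0.05).fit(X_normal)

X_test = np.vstack([rng.normal(0, 1, (5, 2)),             # typical inputs
                    rng.normal(6, 1, (5, 2))])            # 'novel' inputs
print(model.predict(X_test))   # +1 = fits the model, -1 = flagged as novel
```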

O

Object Oriented Analysis using Natural Language Processing concepts: A Review The Software Development Life Cycle (SDLC) starts with eliciting the customers’ requirements in the form of a Software Requirement Specification (SRS). The SRS document needed for software development is mostly written in Natural Language (NL), which is convenient for the client. From the SRS document alone, the class names, their attributes, and the functions incorporated in the body of each class are traced, based on the analyst’s prior knowledge. This paper presents a review of Object Oriented (OO) analysis using Natural Language Processing (NLP) techniques. The analysis can be manual, where a domain expert helps generate the required diagram, or automated, where the system generates the required diagram from input in the form of an SRS.
Observational Learning by Reinforcement Learning Observational learning is a type of learning that occurs as a function of observing, retaining, and possibly replicating or imitating the behaviour of another agent. It is a core mechanism appearing in various instances of social learning and has been found to be employed in several intelligent species, including humans. In this paper, we investigate to what extent the explicit modelling of other agents is necessary to achieve observational learning through machine learning. In particular, we argue that observational learning can emerge from pure Reinforcement Learning (RL), potentially coupled with memory. Through simple scenarios, we demonstrate that an RL agent can leverage the information provided by the observations of another agent performing a task in a shared environment. The other agent is only observed through the effect of its actions on the environment and is never explicitly modelled. Two key aspects are borrowed from observational learning: i) the observer’s behaviour needs to change as a result of viewing a ‘teacher’ (another agent), and ii) the observer needs to be motivated somehow to engage in making use of the other agent’s behaviour. The latter is naturally modelled by RL, by correlating the learning agent’s reward with the teacher agent’s behaviour.
On Being a Data Skeptic I’d like to set something straight right out of the gate. I’m not a data cynic, nor am I urging other people to be. Data is here, it’s growing, and it’s powerful. I’m not hiding behind the word “skeptic” the way climate change “skeptics” do, when they should call themselves deniers. Instead, I urge the reader to cultivate their inner skeptic, which I define by the following characteristic behavior. A skeptic is someone who maintains a consistently inquisitive attitude toward facts, opinions, or (especially) beliefs stated as facts. A skeptic asks questions when confronted with a claim that has been taken for granted. That’s not to say a skeptic brow-beats someone for their beliefs, but rather that they set up reasonable experiments to test those beliefs. A really excellent skeptic puts the “science” into the term “data science.” In this paper, I’ll make the case that the community of data practitioners needs more skepticism, or at least would benefit greatly from it, for the following reason: there’s a two-fold problem in this community. On the one hand, many of the people in it are overly enamored with data or data science tools. On the other hand, other people are overly pessimistic about those same tools. I’m charging myself with making a case for data practitioners to engage in active, intelligent, and strategic data skepticism. I’m proposing a middle-of-the-road approach: don’t be blindly optimistic, don’t be blindly pessimistic. Most of all, don’t be awed. Realize there are nuanced considerations and plenty of context and that you don’t necessarily have to be a mathematician to understand the issues. …
On Calibration of Modern Neural Networks Confidence calibration — the problem of predicting probability estimates representative of the true correctness likelihood — is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling — a single-parameter variant of Platt Scaling — is surprisingly effective at calibrating predictions.
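The recipe the paper arrives at is simple enough to sketch in a few lines of numpy: fit a single temperature T on held-out logits by minimising negative log-likelihood, then divide logits by T at prediction time. The validation logits and labels below are placeholders; in practice they come from a real held-out set.

```python
# Temperature scaling: one scalar T, fit by NLL on a validation set.
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)             # numerically stable softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
val_logits = rng.normal(size=(1000, 10)) * 5          # placeholder logits
val_labels = rng.integers(0, 10, size=1000)           # placeholder labels

res = minimize_scalar(nll, bounds=(0.05, 10.0),
                      args=(val_logits, val_labels), method="bounded")
T = res.x                 # rescale future logits by 1/T before softmax
```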
On Clustering Validation Techniques Cluster analysis aims at identifying groups of similar objects and therefore helps to discover the distribution of patterns and interesting correlations in large data sets. It has been the subject of wide research since it arises in many application domains in engineering, business, and the social sciences. Especially in recent years, the availability of huge transactional and experimental data sets and the arising requirements of data mining have created the need for clustering algorithms that scale and can be applied in diverse domains. This paper introduces the fundamental concepts of clustering while it surveys the widely known clustering algorithms in a comparative way. Moreover, it addresses an important issue of the clustering process regarding the quality assessment of the clustering results, which is also related to the inherent features of the data set under consideration. A review of clustering validity measures and approaches available in the literature is presented. Furthermore, the paper illustrates the issues that are under-addressed by recent algorithms and gives the trends in the clustering process.
On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes We compare discriminative and generative learning as typified by logistic regression and naive Bayes. We show, contrary to a widely held belief that discriminative classifiers are almost always to be preferred, that there can often be two distinct regimes of performance as the training set size is increased, one in which each algorithm does better. This stems from the observation – which is borne out in repeated experiments – that while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster.
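The two-regime claim is easy to probe with a learning-curve experiment; the sketch below, on synthetic data of our own choosing, is an illustration in the paper's spirit rather than its actual protocol.

```python
# Learning curves: naive Bayes vs logistic regression as training size grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_test, y_test = X[4000:], y[4000:]          # held out, disjoint from training

for n in [20, 100, 500, 4000]:
    nb = GaussianNB().fit(X[:n], y[:n])
    lr = LogisticRegression(max_iter=1000).fit(X[:n], y[:n])
    print(n, nb.score(X_test, y_test), lr.score(X_test, y_test))
```

On data like this, naive Bayes often leads at small n while logistic regression catches up and overtakes as n grows, which is the qualitative pattern the paper describes.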
On Ensuring that Intelligent Machines Are Well-Behaved Machine learning algorithms are everywhere, ranging from simple data analysis and pattern recognition tools used across the sciences to complex systems that achieve super-human performance on various tasks. Ensuring that they are well-behaved—that they do not, for example, cause harm to humans or act in a racist or sexist way—is therefore not a hypothetical problem to be dealt with in the future, but a pressing one that we address here. We propose a new framework for designing machine learning algorithms that simplifies the problem of specifying and regulating undesirable behaviors. To show the viability of this new framework, we use it to create new machine learning algorithms that preclude the sexist and harmful behaviors exhibited by standard machine learning algorithms in our experiments. Our framework for designing machine learning algorithms simplifies the safe and responsible application of machine learning.
On k-Anonymity and the Curse of Dimensionality In recent years, the wide availability of personal data has made the problem of privacy preserving data mining an important one. A number of methods have recently been proposed for privacy preserving data mining of multidimensional data records. One of the methods for privacy preserving data mining is that of anonymization, in which a record is released only if it is indistinguishable from k other entities in the data. We note that methods such as k-anonymity are highly dependent upon spatial locality in order to effectively implement the technique in a statistically robust way. In high dimensional space the data becomes sparse, and the concept of spatial locality is no longer easy to define from an application point of view. In this paper, we view the k-anonymization problem from the perspective of inference attacks over all possible combinations of attributes. We show that when the data contains a large number of attributes which may be considered quasi-identifiers, it becomes difficult to anonymize the data without an unacceptably high amount of information loss. This is because an exponential number of combinations of dimensions can be used to make precise inference attacks, even when individual attributes are partially specified within a range. We provide an analysis of the effect of dimensionality on k-anonymity methods. We conclude that when a data set contains a large number of attributes which are open to inference attacks, we are faced with a choice of either completely suppressing most of the data or losing the desired level of anonymity. Thus, this paper shows that the curse of high dimensionality also applies to the problem of privacy preserving data mining.
On the Origin of Deep Learning This paper is a review of the evolutionary history of deep learning models. It covers the genesis of neural networks, when associationist modeling of the brain was studied, through the models that dominated the last decade of research in deep learning, like convolutional neural networks, deep belief networks, and recurrent neural networks, and extends to popular recent models like the variational autoencoder and generative adversarial nets. In addition to a review of these models, this paper primarily focuses on the precedents of the models above, examining how the initial ideas are assembled to construct the early models and how these preliminary models are developed into their current forms. Many of these evolutionary paths last more than half a century and have a diversity of directions. For example, CNN is built on prior knowledge of the biological vision system; DBN evolved from a trade-off between the modeling power and computational complexity of graphical models; and many of today’s models are neural counterparts of ancient linear models. This paper reviews these evolutionary paths and offers a concise thought flow of how these models are developed, and aims to provide a thorough background for deep learning. More importantly, along the way, this paper summarizes the gist behind these milestones and proposes many directions to guide the future research of deep learning.
Online Algorithms This book chapter reviews fundamental concepts and results in the area of online algorithms. We first address classical online problems and then study various applications of current interest. Online algorithms represent a theoretical framework for studying problems in interactive computing. They model, in particular, that the input in an interactive system does not arrive as a batch but as a sequence of input portions and that the system must react in response to each incoming portion. Moreover, they take into account that at any point in time future input is unknown. As the name suggests, online algorithms consider the algorithmic aspects of interactive systems: We wish to design strategies that always compute good output and keep a given system in a good state. No assumptions are made about the input stream. The input can even be generated by an adversary that creates new input portions based on the system’s reactions to previous ones. We seek algorithms that have a provably good performance.
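For a flavour of the framework, here is the classical ski-rental problem with the break-even strategy, which is 2-competitive against any adversarially chosen season length; the rent and purchase prices below are arbitrary assumptions.

```python
# Ski rental: rent until the rent paid equals the purchase price, then buy.
def break_even_cost(season_days, rent=1, buy=10):
    """Online cost: rent for the first `buy` days, then purchase."""
    if season_days <= buy:
        return season_days * rent       # season ended before we bought
    return buy * rent + buy             # rented `buy` days, then bought

def offline_cost(season_days, rent=1, buy=10):
    return min(season_days * rent, buy) # clairvoyant optimum

for d in [3, 10, 11, 100]:
    print(d, break_even_cost(d) / offline_cost(d))  # ratio never exceeds 2
```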
Online Learning and Online Convex Optimization Online learning is a well established learning paradigm which has both theoretical and practical appeals. The goal of online learning is to make a sequence of accurate predictions given knowledge of the correct answer to previous prediction tasks and possibly additional available information. Online learning has been studied in several research fields including game theory, information theory, and machine learning. It also became of great interest to practitioners due to the recent emergence of large-scale applications such as online advertisement placement and online web ranking. In this survey we provide a modern overview of online learning. Our goal is to give the reader a sense of some of the interesting ideas and in particular to underscore the centrality of convexity in deriving efficient online learning algorithms. We do not mean to be comprehensive but rather to give a high-level, rigorous, yet easy-to-follow survey.
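As an illustration of that centrality of convexity, a sketch of online gradient descent on a stream of squared losses follows; the data stream and the 1/sqrt(t) step size are our own assumptions, not specifics of this survey.

```python
# Online gradient descent: predict, see the answer, take a gradient step.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)
for t in range(1, 1001):
    x = rng.normal(size=5)                 # t-th input arrives
    pred = w @ x                           # commit to a prediction
    y = x @ np.ones(5) + rng.normal()      # correct answer revealed afterwards
    grad = 2 * (pred - y) * x              # gradient of (pred - y)^2 in w
    w -= (1.0 / np.sqrt(t)) * grad         # OGD update
```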
Online Portfolio Selection: A Survey Online portfolio selection is a fundamental problem in computational finance, which has been extensively studied across several research communities, including finance, statistics, artificial intelligence, machine learning, and data mining. This article aims to provide a comprehensive survey and a structural understanding of online portfolio selection techniques published in the literature. From an online machine learning perspective, we first formulate online portfolio selection as a sequential decision problem, and then we survey a variety of state-of-the-art approaches, which are grouped into several major categories, including benchmarks, Follow-the-Winner approaches, Follow-the-Loser approaches, Pattern-Matching–based approaches, and Meta-Learning Algorithms. In addition to the problem formulation and related algorithms, we also discuss the relationship of these algorithms with the capital growth theory so as to better understand the similarities and differences of their underlying trading ideas. This article aims to provide a timely and comprehensive survey for both machine learning and data mining researchers in academia and quantitative portfolio managers in the financial industry to help them understand the state of the art and facilitate their research and practical applications. We also discuss some open issues and evaluate some emerging new trends for future research.
Online Principal Component Analysis Principal Component Analysis (PCA) is one of the most well known and widely used procedures in scientific computing. It is used for dimension reduction, signal denoising, regression, correlation analysis, visualization, etc. It can be described in many ways, but one is particularly appealing in the context of online algorithms. In the online setting, the algorithm receives the input vectors $x_t$ one after the other and must always output $y_t$ before receiving $x_{t+1}$.
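A classical algorithm that meets this protocol is Oja's rule, which tracks the leading principal component from a stream; the sketch below is a generic illustration, not necessarily the method analysed in this particular paper, and the data stream is synthetic.

```python
# Oja's rule: stream in x_t, emit y_t = w.x_t, then update w.
import numpy as np

rng = np.random.default_rng(0)
true_dir = np.array([3.0, 1.0]) / np.sqrt(10)
w = rng.normal(size=2); w /= np.linalg.norm(w)
eta = 0.01

for _ in range(5000):
    x = true_dir * rng.normal(scale=3.0) + rng.normal(scale=0.3, size=2)
    y = w @ x                      # output y_t before the next input arrives
    w += eta * y * (x - y * w)     # Oja update
    w /= np.linalg.norm(w)         # keep unit length

print(abs(w @ true_dir))           # close to 1: leading component recovered
```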
Ontology Learning from Text: A Survey of Methods After the vision of the Semantic Web was broadcast at the turn of the millennium, ontology became a synonym for the solution to many problems concerning the fact that computers do not understand human language: if there were an ontology and every document were marked up with it and we had agents that would understand the markup, then computers would finally be able to process our queries in a really sophisticated way. Some years later, the success of Google shows us that the vision has not come true, being hampered by the incredible amount of extra work required for the intellectual encoding of semantic mark-up – as compared to simply uploading an HTML page. To alleviate this acquisition bottleneck, the field of ontology learning has since emerged as an important sub-field of ontology engineering. …
OPEB: Open Physical Environment Benchmark for Artificial Intelligence Artificial Intelligence methods to solve continuous-control tasks have made significant progress in recent years. However, these algorithms have important limitations and still need significant improvement to be used in industry and real-world applications. This means that this area is still in an active research phase. To involve a large number of research groups, standard benchmarks are needed to evaluate and compare proposed algorithms. In this paper, we propose a physical environment benchmark framework to facilitate collaborative research in this area by enabling different research groups to integrate their designed benchmarks in a unified cloud-based repository and also share their actual implemented benchmarks via the cloud. We demonstrate the proposed framework using an actual implementation of the classical mountain-car example and present the results obtained using a Reinforcement Learning algorithm.
Operational Analytics from A to Z
Optimization of Tree Ensembles Tree ensemble models such as random forests and boosted trees are among the most widely used and practically successful predictive models in applied machine learning and business analytics. Although such models have been used to make predictions based on exogenous, uncontrollable independent variables, they are increasingly being used to make predictions where the independent variables are controllable and are also decision variables. In this paper, we study the problem of tree ensemble optimization: given a tree ensemble that predicts some dependent variable using controllable independent variables, how should we set these variables so as to maximize the predicted value? We formulate the problem as a mixed-integer optimization problem. We theoretically examine the strength of our formulation, provide a hierarchy of approximate formulations with bounds on approximation quality and exploit the structure of the problem to develop two large-scale solution methods, one based on Benders decomposition and one based on iteratively generating tree split constraints. We test our methodology on real data sets, including two case studies in drug design and customized pricing, and show that our methodology can efficiently solve large-scale instances to near or full optimality, and outperforms solutions obtained by heuristic approaches. In our drug design case, we show how our approach can identify compounds that efficiently trade-off predicted performance and novelty with respect to existing, known compounds. In our customized pricing case, we show how our approach can efficiently determine optimal store-level prices under a random forest model that delivers excellent predictive accuracy.
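For contrast with the paper's mixed-integer formulation, the naive baseline it improves on looks like this: fit a forest, then grid-search the controllable coordinate. The synthetic data, the "price" column, and the grid are illustrative assumptions; the paper's own method scales far beyond what enumeration can handle.

```python
# Naive tree-ensemble optimization: grid-search one controllable input.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(2000, 3))                       # col 0 = price (controllable)
revenue = X[:, 0] * np.maximum(0, 8 - X[:, 0]) + X[:, 1]     # synthetic response
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, revenue)

context = np.array([0.0, 5.0, 2.0])          # fixed exogenous features
candidates = np.linspace(0, 10, 101)
trials = np.tile(context, (len(candidates), 1))
trials[:, 0] = candidates
best_price = candidates[np.argmax(forest.predict(trials))]   # argmax over the grid
```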
Optimization theory in Statistics This paper addresses the issues of optimization theory and related numerical issues within the context of statistics. Focusing on the problem of concave regression, the paper classifies, analyzes, and compares several estimation techniques for nonparametric shape-constrained regression, qualitatively and quantitatively, through numerical simulations. In particular, their main features, strengths, and limitations for solving large instances of the problem are examined. Several improvements to enhance numerical stability and bound the computational cost are proposed. For each analyzed algorithm, the pseudo-code and its corresponding code in Scilab are provided. The results from this study demonstrate that the choice of the optimization approach strongly impacts algorithmic performance. Interestingly, it is also shown that no currently available method can efficiently solve large instances of the concave regression problem (more than many thousands of points). We suggest that further research to fill this gap in the literature should focus on finding a way to exploit and adapt classical multi-scale strategies to compute an approximate solution.
Overcoming the Barriers to Production-Ready Machine Learning Workflows (Slide Deck)
Overview of Annotation Creation: Processes & Tools Creating linguistic annotations requires more than just a reliable annotation scheme. Annotation can be a complex endeavour potentially involving many people, stages, and tools. This chapter outlines the process of creating end-to-end linguistic annotations, identifying specific tasks that researchers often perform. Because tool support is so central to achieving high quality, reusable annotations with low cost, the focus is on identifying capabilities that are necessary or useful for annotation tools, as well as common problems these tools present that reduce their utility. Although examples of specific tools are provided in many cases, this chapter concentrates more on abstract capabilities and problems because new tools appear continuously, while old tools disappear into disuse or disrepair. The two core capabilities tools must have are support for the chosen annotation scheme and the ability to work on the language under study. Additional capabilities are organized into three categories: those that are widely provided; those that are often useful but found in only a few tools; and those that have as yet little or no available tool support.

P

Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. However, LDA does not capture correlations between topics. In this paper, we introduce the pachinko allocation model (PAM), which captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). The leaves of the DAG represent individual words in the vocabulary, while each interior node represents a correlation among its children, which may be words or other interior nodes (topics). PAM provides a flexible alternative to recent work by Blei and Lafferty (2006), which captures correlations only between pairs of topics. Using text data from newsgroups, historic NIPS proceedings and other research paper corpora, we show improved performance of PAM in document classification, likelihood of held-out data, the ability to support finer-grained topics, and topical keyword coherence.
Parameter estimation for text analysis Presents parameter estimation methods for discrete probability distributions, which are of particular interest in text modeling. Starting with maximum likelihood, maximum a posteriori, and Bayesian estimation, central concepts like conjugate distributions and Bayesian networks are reviewed. As an application, the model of latent Dirichlet allocation (LDA) is explained in detail, with a full derivation of an approximate inference algorithm based on Gibbs sampling, including a discussion of Dirichlet hyperparameter estimation.
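The sampler that text derives compresses into a few lines; below is a generic collapsed Gibbs sampler for LDA on a toy corpus, with the number of topics and the hyperparameters chosen arbitrarily. It follows the standard update p(z = k | rest) proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta), not necessarily that text's exact notation.

```python
# Compact collapsed Gibbs sampling for LDA on a toy corpus.
import numpy as np

docs = [[0, 1, 2, 2], [2, 3, 3, 4], [0, 1, 4, 4]]    # word ids per document
V, K, alpha, beta = 5, 2, 0.1, 0.01
rng = np.random.default_rng(0)

n_dk = np.zeros((len(docs), K)); n_kw = np.zeros((K, V)); n_k = np.zeros(K)
z = [[int(rng.integers(K)) for _ in d] for d in docs]  # random initial topics
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]; n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

for _ in range(200):                                   # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                                # remove current assignment
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))      # resample the topic
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)  # topics
```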
PARC and SAP Co-innovation: High-performance Graph Analytics for Big Data Powered by SAP HANA Graph analytics is a crucial element in extracting insights from Big Data because it helps discover hidden relationships by connecting the dots. A graph, meaning the network of nodes and relationships, treats the linkage between objects as equally important as the objects themselves. Social networks or supply chains are obvious examples, but graphs include any network of objects such as customers, products, purchase orders, customer support calls, product inventory, etc. HiperGraph, PARC’s breakthrough Big Data technology, is a high-performance graph analytics engine. Through a four-month research project with SAP, we added HiperGraph’s analytics to SAP HANA to demonstrate a live, real-time marketing insights use case. Graph reasoning technologies provide the ability to contextualize relational data with the tapestry of information and can go beyond simplistic reporting and dashboards. This creates opportunities to rapidly experiment, gain new insights, and identify root causes. The demonstrated technology match between HANA and HiperGraph has great disruptive potential, especially in the identification of key patterns within datasets (e.g., via clustering). With HANA and HiperGraph we can finally deliver on the promise of a closed feedback loop in the enterprise where transactions are analyzed and reacted to in real-time. The intelligence that is implicit in large volumes of structured and unstructured data from varieties of sources from inside or outside of the enterprise can be delivered to the users in the form of smart business applications. We concluded that the existing commercial or open source algorithms either did not provide real-time responses or were unable to scale to large volumes of data. Our customer (an online retailer) required real-time responses from their Big Data system. PARC’s graph reasoning, versatile goal-directed clustering, egocentric recommendations, and real-time recommendation algorithms combined with the power of HANA in-memory technologies far exceeded expectations. Brand managers can use this solution to automatically find clusters of customers with similar purchases, clusters of products that are frequently bought together, clusters of products that tend to be purchased on sale vs. those that are purchased at full price, and so on, and act on these insights during the customer’s shopping experience. There is a great opportunity for businesses to gain value by combining the HANA in-memory technology with HiperGraph reasoning, recommendation, matrix factorization, egocentric collaborative filtering, and versatile goal-directed clustering. With SAP and PARC co-innovation in Big Data analytics we can now reduce and/or eliminate the need for complex extract, transform, and load (ETL) processes; increase speed in clustering; and introduce new accessibility for business users to directly explore data clusters. We are democratizing data science for all business users in the enterprise.
Patent Retrieval: A Literature Review With the ever-increasing number of patent applications filed every year, the need for effective and efficient systems for managing such tremendous amounts of data becomes inevitably important. Patent Retrieval (PR) is considered the pillar of almost all patent analysis tasks. PR is a subfield of Information Retrieval (IR) which is concerned with developing techniques and methods that effectively and efficiently retrieve relevant patent documents in response to a given search request. In this paper we present a comprehensive review of PR methods and approaches. It is clear that recent successes and maturity in IR applications such as Web search cannot be transferred directly to PR without deliberate domain adaptation and customization. Furthermore, state-of-the-art performance in automatic PR is still around average. These observations motivate the need for interactive search tools which provide cognitive assistance to patent professionals with minimal effort. These tools must also be developed hand in hand with patent professionals, considering their practices and expectations. We additionally touch on tasks related to PR, such as patent valuation, litigation, and licensing, and highlight potential opportunities and open directions for computational scientists in these domains.
Perspectives of Predictive Modeling (Slide Deck)
Practical Approaches to Principal Component Analysis in the Presence of Missing Values Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of data that retain the maximal amount of variance. We study a case where some of the data values are missing, and show that this problem has many features which are usually associated with nonlinear models, such as overfitting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we provide formulas for doing that. In case of high dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to variational Bayesian learning. Different versions of PCA are compared in artificial experiments, demonstrating the effects of regularization and modeling of posterior variance. The scalability of the proposed algorithm is demonstrated by applying it to the Netflix problem.
Practical Bayesian Optimization of Machine Learning Algorithms Machine learning algorithms frequently require careful tuning of model hyperparameters, regularization terms, and optimization parameters. Unfortunately, this tuning is often a ‘black art’ that requires expert experience, unwritten rules of thumb, or sometimes brute-force search. Much more appealing is the idea of developing automatic approaches which can optimize the performance of a given learning algorithm to the task at hand. In this work, we consider the automatic tuning problem within the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). The tractable posterior distribution induced by the GP leads to efficient use of the information gathered by previous experiments, enabling optimal choices about what parameters to try next. Here we show how the effects of the Gaussian process prior and the associated inference procedure can have a large impact on the success or failure of Bayesian optimization. We show that thoughtful choices can lead to results that exceed expert-level performance in tuning machine learning algorithms. We also describe new algorithms that take into account the variable cost (duration) of learning experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization on a diverse set of contemporary algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks.
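The basic loop can be sketched compactly; the version below uses scikit-learn's GP and an expected-improvement acquisition on an assumed one-dimensional objective (a stand-in for "validation error as a function of a hyperparameter"), and omits the paper's cost-aware and parallel extensions.

```python
# Bayesian optimization loop: GP surrogate + expected improvement (minimize).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

f = lambda x: np.sin(3 * x) + 0.1 * x ** 2            # assumed objective
X = np.array([[0.5], [2.0]]); y = f(X).ravel()        # initial evaluations
grid = np.linspace(0, 5, 500).reshape(-1, 1)          # candidate points

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sd, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)  # expected improvement
    x_next = grid[np.argmax(ei)]                       # most promising point
    X = np.vstack([X, [x_next]]); y = np.append(y, f(x_next))
```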
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), split tests, Control/Treatment tests, and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.
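One ingredient the authors stress, sizing an experiment for adequate statistical power, reduces to a standard formula; a sketch follows, with illustrative conversion rates. This is the textbook two-proportion normal approximation, not code from the paper.

```python
# Samples per arm to detect a lift from p1 to p2 (two-sided test).
from scipy.stats import norm

def samples_per_arm(p1, p2, alpha=0.05, power=0.8):
    z_a = norm.ppf(1 - alpha / 2)          # significance threshold
    z_b = norm.ppf(power)                  # power requirement
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * var / (p1 - p2) ** 2

print(round(samples_per_arm(0.050, 0.055)))   # ~31k per arm for a 10% relative lift
```

The quadratic dependence on the effect size is why small lifts need large samples, and why the variance-reduction techniques the paper discusses matter in practice.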
Practical Machine Learning: A New Look at Anomaly Detection Everyone loves a mystery, and at the heart of it, that’s what anomaly detection is—spotting the unusual, catching the fraud, discovering the strange activity. Anomaly detection has a wide range of useful applications, from banking security to natural sciences to medicine to marketing. Anomaly detection carried out by a machine-learning program is actually a form of artificial intelligence. With the ever-increasing volume of data and the new types of data, such as sensor data from an increasingly large variety of objects that needs to be considered, it’s no surprise that there also is a growing interest in being able to handle more decisions automatically via machine-learning applications. But in the case of anomaly detection, at least some of the appeal is the excitement of the chase itself. …
Practical Machine Learning: Innovations in Recommendation A key to one of the most sophisticated and effective approaches in machine learning and recommendation is contained in the observation: “I want a pony.” As it turns out, building a simple but powerful recommender is much easier than most people think, and wanting a pony is part of the key. Machine learning, especially at the scale of huge datasets, can be a daunting task. There is a dizzying array of algorithms from which to choose, and just making the choice between them presupposes that you have a sufficiently advanced mathematical background to understand the alternatives and make a rational choice. The options are also changing, evolving constantly as a result of the work of some very bright, very dedicated researchers who are continually refining existing algorithms and coming up with new ones.
Predicting Good Probabilities With Supervised Learning We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1 yielding a characteristic sigmoid shaped distortion in the predicted probabilities. Models such as Naive Bayes, which make unrealistic independence assumptions, push probabilities toward 0 and 1. Other models such as neural nets and bagged trees do not have these biases and predict well calibrated probabilities. We experiment with two ways of correcting the biased probabilities predicted by some learning methods: Platt Scaling and Isotonic Regression. We qualitatively examine what kinds of distortions these calibration methods are suitable for and quantitatively examine how much data they need to be effective. The empirical results show that after calibration boosted trees, random forests, and SVMs predict the best probabilities.
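Both corrections studied here are available off the shelf; a sketch using scikit-learn follows, with the dataset and base model chosen arbitrarily for illustration.

```python
# Platt scaling (method='sigmoid') and isotonic regression via scikit-learn.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5000, random_state=0)

platt = CalibratedClassifierCV(GradientBoostingClassifier(),
                               method="sigmoid", cv=3)    # Platt scaling
iso = CalibratedClassifierCV(GradientBoostingClassifier(),
                             method="isotonic", cv=3)     # isotonic regression
platt.fit(X, y); iso.fit(X, y)
probs = iso.predict_proba(X[:5])    # calibrated probability estimates
```

As the paper notes, the sigmoid correction suits the characteristic boosted-tree distortion, while isotonic regression is more flexible but needs more calibration data.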
Predicting the future of predictive analytics The proliferation of data and the increasing awareness of the potential to gain valuable insight and a competitive advantage from that information are driving organizations to place data at the heart of their corporate strategy. Consumers regularly benefit from predictive analytics, in the form of anything from weather forecasts to insurance premiums. Organizations are now exploring the possibilities of using historical data to exploit growth opportunities and minimize business risks, a field known as predictive analytics. SAP commissioned Loudhouse to conduct primary research among business decision-makers in UK and US organizations to understand their attitudes to and experiences of predictive analytics, as well as a future view of usage, value and investment. The research reveals that businesses are struggling to take full advantage of the burgeoning and already overwhelming amount of data being collected. Challenges abound as firms seek to make effective use of data. While many businesses are investing in predictive analytics and already seeing benefits in a number of areas, even more see this as a future investment priority for their business. The research points to a data-driven future where advanced predictive analytics sits at the core of the business function rather than being siloed, is embraced by a greater proportion of the workforce and is used to drive decision-making across the whole business. To achieve this future vision, however, it is clear that businesses need to up-skill their workforce and invest in more intuitive technology. While firms in the UK and US recognize the potential of predictive analytics and the need for investment in skills, the US is further along the adoption curve than the UK. US organizations show greater promise for future investment in – and roll-out of – predictive analytics software across the workforce. Furthermore, US organizations perceive fewer challenges in using data to inform corporate strategy, and sense a greater need for training to embed the benefits of the technology into day-to-day business.
Predictive Analytics – The rise and value of predictive analytics in enterprise decision making In the past few years, predictive analytics has gone from an exotic technique practiced in just a few niches, to a competitive weapon with a rapidly expanding range of uses. The increasing adoption of predictive analytics is fueled by converging trends: the Big Data phenomenon, ever-improving tools for data analysis, and a steady stream of demonstrated successes in new applications. The modern analyst would say, “Give me enough data, and I can predict anything.”
Predictive Analytics enters the Mainstream
Predictive Analytics for Business Advantage To compete effectively in an era in which advantages are ephemeral, companies need to move beyond historical, rear-view understandings of business performance and customer behavior and become more proactive. Organizations today want to be predictive; they want to gain information and insight from data that enables them to detect patterns and trends, anticipate events, spot anomalies, forecast using what-if simulations, and learn of changes in customer behavior so that staff can take actions that lead to desired business outcomes. Success in being predictive and proactive can be a game changer for many business functions and operations, including marketing and sales, operations management, finance, and risk management. Although it has been around for decades, predictive analytics is a technology whose time has finally come. A variety of market forces have joined to make this possible, including an increase in computing power, a better understanding of the value of the technology, the rise of certain economic forces, and the advent of big data. Companies are looking to use the technology to predict trends and understand behavior for better business performance. Forward-looking companies are using predictive analytics across a range of disparate data types to achieve greater value. Companies are looking to also deploy predictive analytics against their big data. Predictive analytics is also being operationalized more frequently as part of a business process. Predictive analytics complements business intelligence and data discovery, and can enable organizations to go beyond the analytic complexity limits of many online analytical processing (OLAP) implementations. It is evolving from a specialized activity once utilized only among elite firms and users to one that could become mainstream across industries and market sectors. This TDWI Best Practices Report focuses on how organizations can and are using predictive analytics to derive business value. It provides in-depth survey analysis of current strategies and future trends for predictive analytics across both organizational and technical dimensions including organizational culture, infrastructure, data, and processes. It looks at the features and functionalities companies are using for predictive analytics and the infrastructure trends in this space. The report offers recommendations and best practices for successfully implementing predictive analytics in the organization. TDWI Research finds a shift occurring in the predictive analytics user base. No longer is predictive analytics the realm of statisticians and mathematicians. There is a definite trend toward business analysts and other business users making use of this technology. Marketing and sales are big current users of predictive analytics and market analysts are making use of the technology. Therefore, the report also looks at the skills necessary to perform predictive analytics and how the technology can be utilized and operationalized across the organization. It explores cultural and business issues involved with making predictive analytics possible. A unique feature of this report is its examination of the characteristics of companies that have actually measured either top-line or bottom-line impact with predictive analytics. In other words, it explores how those companies compare against those that haven’t measured value.
Predictive Analytics in Cloud CRM Cloud CRM solutions have long since become mainstream and expanded beyond their initial foothold in small and mid-sized enterprises. Today B2B and B2C companies in many industries are eyeing cloud CRM solutions for their call center, their sales force and more. These CRM solutions offer the classic benefits of a cloud offering—multi-tenancy, usage pricing, location transparency, network access and high availability. What these solutions often do not offer, however, is advanced analytics. Typically limited to reporting and dashboards, many cloud CRM solutions do not allow companies to maximize the value of their data. The analytics that are available in a typical cloud CRM solution assume that users have the necessary decision-making expertise as well as the time required to make these decisions. In a typical high-volume call center environment, neither of these assumptions is reasonable. What companies using these CRM solutions need is predictive analytics, specifically predictive analytic solutions designed to drive better decisions in real-time. Delivering predictive analytic solutions in a cloud CRM environment, however, has its own challenges. Those adopting cloud CRM solutions don’t want (nor have the budget) to hire analytics teams to build predictive analytic models using traditional techniques or have to move their cloud CRM data to an on-premise analytic environment. They also don’t want predictive analytic models “in the lab”; they want business-friendly decision-making solutions powered by sophisticated predictive analytics. To be successful with cloud CRM, these companies need predictive applications for the cloud, in the cloud.
Predictive Analytics in the Cloud Predictive analytics and cloud are hot topics in business today. Predictive analytics are increasingly the focus of many companies’ efforts to improve business performance with analytics while cloud is fast becoming the default option for purchasing and deploying software. Public, private and hybrid clouds are all evolving rapidly and are here to stay. But what’s happening at the intersection of these two technologies? How can predictive analytics in the cloud add value and what are the critical risks and issues involved? This paper explores the five key opportunities for organizations to use predictive analytics in the cloud:
• Using the cloud to deliver predictive analytics-enabled “Decisions as a Service” solutions
• Embedding predictive analytics in Software as a Service (SaaS) and other cloud-deployed applications
• Using the cloud to deliver predictive analytics to non-cloud applications across the extended enterprise
• Building predictive analytics against data in the cloud
• Using cloud computing to deliver elastic compute power for building predictive analytic models
Before discussing the various options for predictive analytics in the cloud it is worth clarifying exactly what we mean by the various terms.
Predictive Analytics Whitepaper
Principal Components: Mathematics, Example, Interpretation This paper will explain Principal Components Analysis, where “respecting structure” means “preserving variance”. It explains how to do PCA, shows an example, and describes some of the issues that come up in interpreting the results. PCA has been rediscovered many times in many fields, so it is also known as the Karhunen-Loeve transformation, the Hotelling transformation, the method of empirical orthogonal functions, and singular value decomposition. We will call it PCA.
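The mechanics fit in a few lines of numpy: centre the data, take the SVD, and read off directions, scores, and the variance each component preserves. The data below is a random placeholder.

```python
# Bare-bones PCA via the singular value decomposition.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated data

Xc = X - X.mean(axis=0)                   # centre each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                       # top-2 principal directions
scores = Xc @ components.T                # data in PC coordinates
explained = s ** 2 / (s ** 2).sum()      # variance preserved per component
```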
Probabilistic Forecasting A probabilistic forecast takes the form of a predictive probability distribution over future quantities or events of interest. Probabilistic forecasting aims to maximize the sharpness of the predictive distributions, subject to calibration, on the basis of the available information set. We formalize and study notions of calibration in a prediction space setting. In practice, probabilistic calibration can be checked by examining probability integral transform (PIT) histograms. Proper scoring rules such as the logarithmic score and the continuous ranked probability score serve to assess calibration and sharpness simultaneously. As a special case, consistent scoring functions provide decision-theoretically coherent tools for evaluating point forecasts. We emphasize methodological links to parametric and nonparametric distributional regression techniques, which attempt to model and to estimate conditional distribution functions; we use the context of statistically postprocessed ensemble forecasts in numerical weather prediction as an example. Throughout, we illustrate concepts and methodologies in data examples.
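The PIT check described here is easy to sketch: for each outcome y, record the predictive CDF value F(y); a calibrated forecaster gives a flat histogram. In the toy example below, the predictive distributions are deliberately overconfident Gaussians (our assumption), so the PIT histogram piles up at the ends instead.

```python
# PIT histogram check for probabilistic calibration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
forecast_scale = 0.5                          # overconfident predictive spread

y = rng.normal(scale=1.0, size=5000)          # observed outcomes, N(0, 1)
pit = norm.cdf(y, loc=0.0, scale=forecast_scale)   # F(y) under each forecast

hist, _ = np.histogram(pit, bins=10, range=(0, 1))
print(hist)        # mass piles up near 0 and 1: a U-shape signals overconfidence
```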
Probabilistic Program Abstractions Abstraction is a fundamental tool for reasoning about complex systems. Program abstraction has been utilized to great effect for analyzing deterministic programs. At the heart of program abstraction is the relationship between a concrete program, which is difficult to analyze, and an abstraction, which is more tractable. We generalize non-deterministic program abstractions to probabilistic program abstractions by explicitly quantifying the non-deterministic choices made by traditional program abstractions. We upgrade key theoretical program abstraction insights to the probabilistic context. Probabilistic program abstractions provide avenues for utilizing abstraction techniques from the programming languages community to improve the analysis of probabilistic programs.
Probabilistic Programming Probabilistic programs are usual functional or imperative programs with two added constructs: (1) the ability to draw values at random from distributions, and (2) the ability to condition values of variables in a program via observations. Models from diverse application areas such as computer vision, coding theory, cryptographic protocols, biology and reliability analysis can be written as probabilistic programs. Probabilistic inference is the problem of computing an explicit representation of the probability distribution implicitly specified by a probabilistic program. Depending on the application, the desired output from inference may vary – we may want to estimate the expected value of some function f with respect to the distribution, or the mode of the distribution, or simply a set of samples drawn from the distribution. In this paper, we describe connections this research area called “Probabilistic Programming” has with programming languages and software engineering, and this includes language design, and the static and dynamic analysis of programs. We survey current state of the art and speculate on promising directions for future research.
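The two constructs can be illustrated with a few lines of ordinary Python, using rejection sampling as the inefficient but transparent inference method; the coin-bias model is our own toy example, not one from the paper.

```python
# A probabilistic 'program': (1) random draws, (2) conditioning on observations.
# Inference: run the program many times, keep runs consistent with the data.
import numpy as np

rng = np.random.default_rng(0)

def program():
    bias = rng.uniform(0, 1)           # (1) draw the coin's bias from a prior
    flips = rng.random(10) < bias      # flip the coin ten times
    if flips.sum() != 9:               # (2) condition: we observed 9 heads
        return None                    # reject runs violating the observation
    return bias

samples = [b for b in (program() for _ in range(200000)) if b is not None]
print(np.mean(samples))                # close to the posterior mean 10/12
```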
Probabilistic Syntax 1. The Tradition of Categoricity and Prospects for Stochasticity
2. The joys and perils of corpus linguistics
3. Probabilistic syntactic models
4. Continuous categories
5. Explaining more: probabilistic models of syntactic usage
6. Conclusion:
There are many phenomena in syntax that cry out for non-categorical and probabilistic modeling and explanation. The opportunity to leave behind ill-fitting categorical assumptions, and to better model probabilities of use in syntax is exciting. The existence of ‘soft’ constraints within the variable output of an individual speaker, of exactly the same kind as the typological syntactic constraints found across languages, makes exploration of probabilistic grammar models compelling. We saw that one is not limited to simple surface representations: I have tried to outline how probabilistic models can be applied on top of one’s favorite sophisticated linguistic representations. The frequency evidence needed for parameter estimation in probabilistic models requires a lot more data collection, and a lot more careful evaluation and model building than traditional syntax, where one example can be the basis of a new theory, but the results can enrich linguistic theory by revealing the soft constraints at work in language use. This is an area ripe for exploration by the next generation of syntacticians.
Probabilistic Topic Models As our collective knowledge continues to be digitized and stored – in the form of news, blogs, Web pages, scientific articles, books, images, sound, video, and social networks – it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search, and understand these vast amounts of information. Right now, we work with online information using two main tools – search and links. We type keywords into a search engine and find a set of documents related to them. We look at the documents in that set, possibly navigating to other linked documents. This is a powerful way of interacting with our online archive, but something is missing. Imagine searching and exploring documents based on the themes that run through them. We might “zoom in” and “zoom out” to find specific or broader themes; we might look at how those themes changed through time or how they are connected to each other. Rather than finding documents through keyword search alone, we might first find the theme that we are interested in, and then examine the documents related to that theme.
Probabilistic Topic Models Many chapters in this book illustrate that applying a statistical method such as Latent Semantic Analysis (LSA; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998) to large databases can yield insight into human cognition. The LSA approach makes three claims: that semantic information can be derived from a word-document co-occurrence matrix; that dimensionality reduction is an essential part of this derivation; and that words and documents can be represented as points in Euclidean space. In this chapter, we pursue an approach that is consistent with the first two of these claims, but differs in the third, describing a class of statistical models in which the semantic properties of words and documents are expressed in terms of probabilistic topics.
Probability Cheatsheet (Cheat Sheet)
Probability Reversal and the Disjunction Effect in Reasoning Systems Data-based judgments go into artificial intelligence applications, but they undergo paradoxical reversals when seemingly unnecessary additional data is provided. Examples of this are Simpson’s reversal and the disjunction effect, where beliefs about the data change once it is presented or aggregated differently. Sometimes the significance of the difference can be evaluated using statistical tests such as Pearson’s chi-squared or Fisher’s exact test, but this may not be helpful in threshold-based decision systems that operate with incomplete information. To mitigate risks in the use of algorithms in decision-making, we consider the question of modeling beliefs. We argue that the evidence supports the view that beliefs are not classical statistical variables and that they should, in the general case, be considered superposition states of disjoint or polar outcomes. We analyze the disjunction effect from the perspective of the belief as a quantum vector.
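Simpson's reversal, one of the paradoxes named above, is easy to reproduce numerically. In the hedged base R sketch below the counts are invented: treatment A has the higher success rate within both subgroups, yet treatment B wins once the subgroups are aggregated.

    # Successes and totals by treatment (rows) and subgroup (columns).
    success <- matrix(c(81, 234, 192, 55), nrow = 2,
                      dimnames = list(treatment = c("A", "B"),
                                      subgroup  = c("small", "large")))
    total <- matrix(c(87, 270, 263, 80), nrow = 2,
                    dimnames = dimnames(success))
    success / total                     # A wins within each subgroup...
    rowSums(success) / rowSums(total)   # ...but B wins after aggregation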
Process Mining – seeing the real process (Poster)
ProM 6: The Process Mining Toolkit Process mining has been around for a decade, and it has proven to be a very fertile and successful research field. Part of this success can be attributed to the ProM tool, which combines most of the existing process mining techniques as plug-ins in a single tool. ProM 6 removes many limitations that existed in the previous versions, in particular with respect to the tight integration between the tool and the GUI. ProM 6 has been developed from scratch and uses a completely redesigned architecture. The changes were driven by many real-life applications and new insights into the design of process analysis software. Furthermore, the introduction of XESame in this toolkit allows for the conversion of logs to the ProM native format without programming.
Provable benefits of representation learning There is general consensus that learning representations is useful for a variety of reasons, e.g. efficient use of labeled data (semi-supervised learning), transfer learning and understanding hidden structure of data. Popular techniques for representation learning include clustering, manifold learning, kernel-learning, autoencoders, Boltzmann machines, etc. To study the relative merits of these techniques, it is essential to formalize the definition and goals of representation learning, so that they all become instances of the same definition. This paper introduces such a formal framework that also formalizes the utility of learning the representation. It is related to previous Bayesian notions, but with some new twists. We show the usefulness of our framework by exhibiting simple and natural settings – linear mixture models and loglinear models – where the power of representation learning can be formally shown. In these examples, representation learning can be performed provably and efficiently under plausible assumptions (despite being NP-hard in general), and furthermore: (i) it greatly reduces the need for labeled data (semi-supervised learning); (ii) it allows solving classification tasks when simpler approaches like nearest neighbors require too much data; and (iii) it is more powerful than manifold learning methods.
Putting Hadoop To Work The Right Way Big data has rapidly progressed from an ambitious vision realized by a handful of innovators to a competitive advantage for businesses across dozens of industries. More data is available now – about customers, employees, competitors – than ever before. That data is intelligence that can have an impact on daily business decisions. Industry leaders rely on big data as a foundation to beat their rivals. This big data revolution is also behind the massive adoption of Hadoop. Hadoop has become the platform of choice for companies looking to harness big data’s power. Simply put, most traditional enterprise systems are too limited to keep up with the influx of big data; they are not designed to ingest large quantities of data first and analyze it later. The need to store and analyze big data cost-effectively is the main reason why Hadoop usage has grown exponentially in the last five years.
Putting Predictive Analytics to Work in Operations In a recent study, companies that tightly integrate predictive analytics into operational systems are more than twice as likely to report a transformative impact from predictive analytics as any others. Leaders are creating advantage across multiple core functions such as marketing, customer management, collections, customer service, distribution and more by applying predictive analytics to operational decisions. Small operational decisions, especially those about customers, are made over and over. The value of these decisions rapidly adds up. Making these decisions well is critical to business performance. Traditional approaches to analytics are hard to scale and hard to use in the real-time environment common for operational decisions. Predictive analytics work better, using data to make these decisions more precise, targeting and personalizing them to maximize customer value. Because these decisions about customers are made at the front line of an organization they must be made quickly and embedded in operational systems. This creates a challenge – the insight to action gap – that prevents many companies from taking advantage of predictive analytics in these decisions. To close the insight to action gap and put predictive analytics to work in operations, companies need to adopt Decision Management, a proven approach that leverages predictive analytics to make operational systems analytical.

R

R Essentials R is a highly extensible, open-source programming language used mainly for statistical analysis and graphics. It is a GNU project very similar to the S language. R’s strengths include its varied data structures, which can be more intuitive than data storage in other languages; its built-in statistical and graphical functions; and its large collection of add-on packages that can enhance the language’s abilities in many different ways. R can be run either as a series of console commands or as full scripts, depending on the use case. It is heavily object-oriented and allows you to create your own functions. It also has a common API for interacting with most file structures to access data stored outside of R.
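A brief, hedged taste of the features just listed, in base R; the CSV file name at the end is a placeholder, not a real dataset.

    x <- c(2.5, 3.1, 4.8)                      # numeric vector
    record <- list(name = "sample", vals = x)  # heterogeneous list
    mean(x); sd(x)                  # built-in statistical functions
    hist(x)                         # built-in graphics
    zscore <- function(v) (v - mean(v)) / sd(v)  # a user-defined function
    zscore(x)
    # Common API for external data (placeholder path):
    # df <- read.csv("data.csv")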
R for Machine Learning It is common for today’s scientific and business industries to collect large amounts of data, and the ability to analyze the data and learn from it is critical to making informed decisions. Familiarity with software such as R allows users to visualize data, run statistical tests, and apply machine learning algorithms. Even if you already know other software, there are still good reasons to learn R:
1. R is free. If your future employer does not already have R installed, you can always download it for free, unlike other proprietary software packages that require expensive licenses. No matter where you travel, you can have access to R on your computer.
2. R gives you access to cutting-edge technology. Top researchers develop statistical learning methods in R, and new algorithms are constantly added to the list of packages you can download.
3. R is a useful skill. Employers that value analytics recognize R as useful and important. If for no other reason, learning R is worthwhile to help boost your resume.
R Is Still Hot–and Getting Hotter When the white paper “R Is Hot” was written about four years ago, the goal was to introduce the R programming language to a larger audience of statistical analysts and data scientists. As it turned out, the timing couldn’t have been better: R has now blossomed into the language of choice for data scientists worldwide. Today, R is widely used by scientists, researchers, and statisticians for modeling data and solving problems quickly and effectively. When people ask me which factors are driving the broader adoption of R among data analysts, I usually offer two key points:
1. R was designed specifically for statistical analysis, which means that analytics written in R typically require fewer lines of code (and hence less work) than analytics written in Java, Python, or C++.
2. R is an open source project, which means it is continually improved, upgraded, enhanced, and expanded by a global community of incredibly passionate developers and users.
R Markdown Cheat Sheet (Cheat Sheet)
R Quo Vadis? (Slide Deck)
Random Forests Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Freund and Schapire [1996]), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
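A hedged sketch of the method on a built-in dataset, assuming the randomForest CRAN package (which derives from Breiman and Cutler's code); the dataset and the ntree value are illustrative choices.

    library(randomForest)
    set.seed(42)
    # Each tree is grown on a bootstrap sample, with a random subset of
    # features considered at every split.
    rf <- randomForest(Species ~ ., data = iris, ntree = 500,
                       importance = TRUE)
    rf              # prints the internal (out-of-bag) error estimate
    importance(rf)  # internal estimates of variable importance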
Random Forests, Decision Trees, and Categorical Predictors: The ‘Absent Levels’ Problem One of the advantages that decision trees have over many other models is their ability to natively handle categorical predictors without having to first transform them (e.g., by using one-hot encoding). However, in this paper, we show how this capability can also lead to an inherent ‘absent levels’ problem for decision tree based algorithms that, to the best of our knowledge, has never been thoroughly discussed, and whose consequences have never been carefully explored. This predicament occurs whenever there is indeterminacy in how to handle an observation that has reached a categorical split which was determined when the observation’s level was absent during training. Although these incidents may appear to be innocuous, by using Leo Breiman and Adele Cutler’s random forests FORTRAN code and the randomForest R package as motivating case studies, we show how overlooking the absent levels problem can systematically bias a model. Afterwards, we discuss some heuristics that can possibly be used to help mitigate the absent levels problem and, using three real data examples taken from public repositories, we demonstrate the superior performance and reliability of these heuristics over some of the existing approaches that are currently being employed in practice due to oversights in the software implementations of decision tree based algorithms. Given how extensively these algorithms have been used, it is conceivable that a sizable number of these models have been unknowingly and seriously affected by this issue—further emphasizing the need for the development of both theory and software that accounts for the absent levels problem.
Rankcluster: An R Package for Clustering Multivariate Partial Rankings The Rankcluster package is the first R package to provide both modeling and clustering tools for ranking data, potentially multivariate and partial. Ranking data are modeled by the Insertion Sorting Rank (ISR) model, a meaningful model parametrized by a central ranking and a dispersion parameter. A conditional independence assumption allows multivariate rankings to be taken into account, and clustering is performed by means of mixtures of multivariate ISR models. The parameters of the clusters (central rankings and dispersion parameters) help practitioners interpret the clustering. Moreover, the Rankcluster package provides an estimate of the missing ranking positions when rankings are partial. After an overview of the mixture of multivariate ISR models, the Rankcluster package is described and its use is illustrated through the analysis of two real datasets.
Rapidly Mixing Markov Chains: A Comparison of Techniques (A Survey) We survey existing techniques to bound the mixing time of Markov chains. The mixing time is related to a geometric parameter called conductance which is a measure of edge-expansion. Bounds on conductance are typically obtained by a technique called ‘canonical paths’ where the idea is to find a set of paths, one between every source-destination pair, such that no edge is heavily congested. However, the canonical paths approach cannot always show rapid mixing of a rapidly mixing chain. This drawback disappears if we allow the flow between a pair of states to be spread along multiple paths. We prove that for a large class of Markov chains canonical paths does capture rapid mixing. Allowing multiple paths to route the flow still does help a great deal in proofs, as illustrated by a result of Morris & Sinclair (FOCS’99) on the rapid mixing of a Markov chain for sampling 0/1 knapsack solutions. A different approach to prove rapid mixing is ‘Coupling’. Path Coupling is a variant discovered by Bubley & Dyer (FOCS’97) that often tremendously reduces the complexity of designing good Couplings. We present several applications of Path Coupling in proofs of rapid mixing. These invariably lead to much better bounds on mixing time than known using conductance, and moreover Coupling based proofs are typically simpler. This motivates the question of whether Coupling can be made to work whenever the chain is rapidly mixing. This question was answered in the negative by Kumar & Ramesh (FOCS’99), who showed that no Coupling strategy can prove the rapid mixing of the Jerrum-Sinclair chain for sampling perfect and near-perfect matchings.
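To make the notion of mixing time concrete, here is a hedged base R sketch that tracks the total-variation distance to stationarity for a lazy random walk on a cycle; the chain, its size, and the horizon are invented for illustration and unrelated to the specific chains analyzed in the survey.

    n <- 12                          # lazy random walk on a cycle of n states
    P <- matrix(0, n, n)
    for (i in 1:n) {
      P[i, i] <- 1/2                               # lazy: stay with prob 1/2
      right <- (i %% n) + 1; left <- ((i - 2) %% n) + 1
      P[i, right] <- P[i, right] + 1/4
      P[i, left]  <- P[i, left]  + 1/4
    }
    pi_stat <- rep(1/n, n)           # uniform stationary distribution
    mu <- c(1, rep(0, n - 1))        # start concentrated in one state
    for (t in 1:60) {
      mu <- mu %*% P                 # one step of the chain
      if (t %% 10 == 0)
        cat("t =", t, " TV distance =", 0.5 * sum(abs(mu - pi_stat)), "\n")
    }

The printed total-variation distances decay geometrically; the mixing time is, informally, the first step at which this distance drops below a fixed threshold such as 1/4.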
Rationality, Optimism and Guarantees in General Reinforcement Learning In this article, we present a top-down theoretical study of general reinforcement learning agents. We begin with rational agents with unlimited resources and then move to a setting where an agent can only maintain a limited number of hypotheses and optimizes plans over a horizon much shorter than what the agent designer actually wants. We axiomatize what is rational in such a setting in a manner that enables optimism, which is important to achieve systematic explorative behavior. Then, within the class of agents deemed rational, we achieve convergence and finite-error bounds. Such results are desirable since they imply that the agent learns well from its experiences, but the bounds do not directly guarantee good performance and can be achieved by agents doing things one should obviously not. Good performance cannot in fact be guaranteed for any agent in fully general settings. Our approach is to design agents that learn well from experience and act rationally. We introduce a framework for general reinforcement learning agents based on rationality axioms for a decision function and a hypothesis-generating function designed so as to achieve guarantees on the number of errors. We will consistently use an optimistic decision function, but the hypothesis-generating function needs to change depending on what is known/assumed. We investigate a number of natural situations having either a frequentist or Bayesian flavor, deterministic or stochastic environments, and either finite or countable hypothesis classes. Further, to achieve sufficiently good bounds as to hold promise for practical success, we introduce a notion of a class of environments being generated by a set of laws. None of the above has previously been done for fully general reinforcement learning environments.
Realization of Ontology Web Search Engine This paper describes the realization of the Ontology Web Search Engine. The Ontology Web Search Engine is realizable as an independent project and as a part of other projects. The main purpose of this paper is to present the realization details of the Ontology Web Search Engine as part of the Semantic Web Expert System and to present the results of its functioning. It is expected that the Semantic Web Expert System will be able to process ontologies from the Web, generate rules from these ontologies and develop its knowledge base.
Reallocating and Resampling: A Comparison for Inference Simulation-based inference plays a major role in modern statistics, and often employs either reallocating (as in a randomization test) or resampling (as in bootstrapping). Reallocating mimics random allocation to treatment groups, while resampling mimics random sampling from a larger population; does it matter whether the simulation method matches the data collection method? Moreover, do the results differ for testing versus estimation? Here we answer these questions in a simple setting by exploring the distribution of a sample difference in means under a basic two group design and four different scenarios: true random allocation, true random sampling, reallocating, and resampling. For testing a sharp null hypothesis, reallocating is superior in small samples, but reallocating and resampling are asymptotically equivalent. For estimation, resampling is generally superior, unless the effect is truly additive. Moreover, these results hold regardless of whether the data were collected by random sampling or random allocation.
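A hedged base R sketch of the two simulation methods on invented two-group data: reallocating shuffles group labels as in a randomization test, while resampling draws with replacement within groups as in the bootstrap.

    set.seed(7)
    a <- c(12, 15, 11, 18, 14); b <- c(9, 13, 10, 8, 12)   # invented groups
    obs <- mean(a) - mean(b)
    pooled <- c(a, b); n <- length(a)
    # Reallocating: re-randomize group labels under the sharp null.
    realloc <- replicate(10000, {
      idx <- sample(length(pooled), n)
      mean(pooled[idx]) - mean(pooled[-idx])
    })
    mean(abs(realloc) >= abs(obs))       # randomization-test p-value
    # Resampling: bootstrap within groups to mimic sampling from populations.
    boot <- replicate(10000, mean(sample(a, replace = TRUE)) -
                             mean(sample(b, replace = TRUE)))
    quantile(boot, c(0.025, 0.975))      # percentile bootstrap interval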
Real-Time Big Data Analytics: Emerging Architecture Imagine that it’s 2007. You’re a top executive at a major search engine company, and Steve Jobs has just unveiled the iPhone. You immediately ask yourself, “Should we shift resources away from some of our current projects so we can create an experience expressly for iPhone users?” Then you begin wondering, “What if it’s all hype? Steve is a great showman … how can we predict if the iPhone is a fad or the next big thing?” The good news is that you’ve got plenty of data at your disposal. The bad news is that you have no way of querying that data and discovering the answer to a critical question: How many people are accessing my sites from their iPhones? Back in 2007, you couldn’t even ask the question without upgrading the schema in your data warehouse, an expensive process that might have taken two months. Your only choice was to wait and hope that a competitor didn’t eat your lunch in the meantime. Justin Erickson, a senior product manager at Cloudera, told me a version of that story and I wanted to share it with you because it neatly illustrates the difference between traditional analytics and real-time big data analytics. Back then, you had to know the kinds of questions you planned to ask before you stored your data. “Fast forward to the present and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it,” says Erickson. “Technologies such as MapReduce, Hive and Impala enable you to run queries without changing the data structures underneath.” Today, you are much less likely to face a scenario in which you cannot query data and get a response back in a brief period of time. Analytical processes that used to require months, days, or hours have been reduced to minutes, seconds, and fractions of seconds. But shorter processing times have led to higher expectations. Two years ago, many data analysts thought that generating a result from a query in less than 40 minutes was nothing short of miraculous. Today, they expect to see results in under a minute. That’s practically the speed of thought – you think of a query, you get a result, and you begin your experiment. “It’s about moving with greater speed toward previously unknown questions, defining new insights, and reducing the time between when an event happens somewhere in the world and someone responds or reacts to that event,” says Erickson. A rapidly emerging universe of newer technologies has dramatically reduced data processing cycle time, making it possible to explore and experiment with data in ways that would not have been practical or even possible a few years ago. Despite the availability of new tools and systems for handling massive amounts of data at incredible speeds, however, the real promise of advanced data analytics lies beyond the realm of pure technology. “Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse,” says Michael Minelli, co-author of Big Data, Big Analytics. “It’s about the ability to make better decisions and take meaningful actions at the right time. It’s about detecting fraud while someone is swiping a credit card, or triggering an offer while a shopper is standing on a checkout line, or placing an ad on a website while someone is reading a specific article. It’s about combining and analyzing data so you can take the right action, at the right time, and at the right place.” For some, real-time big data analytics (RTBDA) is a ticket to improved sales, higher profits and lower marketing costs. To others, it signals the dawn of a new era in which machines begin to think and respond more like humans.
Real-Time Enterprise Stories More than 20 detailed case studies from Bloomberg Businessweek Research Services and Forbes Insights featuring leading-edge enterprises across industries to explore the real value of the in-memory platform: SAP HANA
Real-Time Machine Learning: The Missing Pieces Machine learning applications are increasingly deployed not only to serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making. These applications pose a new set of requirements, none of which are difficult to achieve in isolation, but the combination of which creates a challenge for existing distributed execution frameworks: computation with millisecond latency at high throughput, adaptive construction of arbitrary task graphs, and execution of heterogeneous kernels over diverse sets of resources. We assert that a new distributed execution framework is needed for such ML applications and propose a candidate approach with a proof-of-concept architecture that achieves a 63x performance improvement over a state-of-the-art execution framework for a representative application.
Reducing the Dimensionality of Data with Neural Networks High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such ‘‘autoencoder’’ networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.
Reducing the Sampling Complexity of Topic Models Inference in topic models typically involves a sampling step to associate latent variables with observations. Unfortunately the generative model loses sparsity as the amount of data increases, requiring O(k) operations per word for k topics. In this paper we propose an algorithm which scales linearly with the number of actually instantiated topics kd in the document. For large document collections and in structured hierarchical models kd ≪ k. This yields an order of magnitude speedup. Our method applies to a wide variety of statistical models such as PDP and HDP. At its core is the idea that dense, slowly changing distributions can be approximated efficiently by the combination of a Metropolis-Hastings step, use of sparsity, and amortized constant time sampling via Walker’s alias method.
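Walker's alias method, named at the core of the approach, draws from a fixed discrete distribution in O(1) per sample after O(k) setup. Below is a hedged base R implementation of the textbook construction (not the paper's code); the example distribution is invented.

    # Build alias tables for a discrete distribution p (Walker's method).
    alias_setup <- function(p) {
      k <- length(p); q <- p * k
      alias <- integer(k); prob <- numeric(k)
      small <- which(q < 1); large <- which(q >= 1)
      while (length(small) > 0 && length(large) > 0) {
        s <- small[1]; l <- large[1]
        prob[s] <- q[s]; alias[s] <- l       # column s is topped up by l
        q[l] <- q[l] - (1 - q[s])            # donor l loses the donated mass
        small <- small[-1]
        if (q[l] < 1) { large <- large[-1]; small <- c(small, l) }
      }
      prob[c(small, large)] <- 1             # leftovers are exactly full
      list(prob = prob, alias = alias)
    }
    # O(1) per draw: pick a column, then keep it or jump to its alias.
    alias_sample <- function(tbl, n) {
      k <- length(tbl$prob)
      col <- sample.int(k, n, replace = TRUE)
      use_alias <- runif(n) >= tbl$prob[col]
      ifelse(use_alias, tbl$alias[col], col)
    }
    tbl <- alias_setup(c(0.5, 0.3, 0.1, 0.1))
    table(alias_sample(tbl, 1e5)) / 1e5      # close to the target distribution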
Regularized Discriminant Analysis Linear and quadratic discriminant analysis are considered in the small sample high-dimensional setting. Alternatives to the usual maximum likelihood (plug-in) estimates for the covariance matrices are proposed. These alternatives are characterized by two parameters, the values of which are customized to individual situations by jointly minimizing a sample based estimate of future misclassification risk. Computationally fast implementations are presented, and the efficacy of the approach is examined through simulation studies and application to data. These studies indicate that in many circumstances dramatic gains in classification accuracy can be achieved.
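A hedged base R sketch of the regularization idea: each class covariance is shrunk toward the pooled covariance by a parameter lambda and then toward a multiple of the identity by a second parameter gamma. The dataset, the equal-prior discriminant score, and the fixed (lambda, gamma) pair are illustrative assumptions, not the paper's risk-based tuning procedure.

    X <- as.matrix(iris[, 1:4]); y <- iris$Species
    classes <- levels(y); n <- nrow(X); p <- ncol(X)
    # Pooled within-class covariance.
    S_pool <- Reduce(`+`, lapply(classes, function(k) {
      Xi <- X[y == k, , drop = FALSE]
      (nrow(Xi) - 1) * cov(Xi)
    })) / (n - length(classes))
    rda_scores <- function(lambda, gamma) {
      sapply(classes, function(k) {
        Xi <- X[y == k, , drop = FALSE]
        S <- (1 - lambda) * cov(Xi) + lambda * S_pool           # toward pooled
        S <- (1 - gamma) * S + gamma * mean(diag(S)) * diag(p)  # toward I
        D <- sweep(X, 2, colMeans(Xi))
        # Quadratic discriminant score with equal priors.
        -0.5 * log(det(S)) - 0.5 * rowSums((D %*% solve(S)) * D)
      })
    }
    pred <- classes[max.col(rda_scores(lambda = 0.5, gamma = 0.1))]
    mean(pred == y)   # training accuracy for one (lambda, gamma) pair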
Reinforcement Learning: A Tutorial The purpose of this tutorial is to provide an introduction to reinforcement learning (RL) at a level easily understood by students and researchers in a wide range of disciplines. The intent is not to present a rigorous mathematical discussion that requires a great deal of effort on the part of the reader, but rather to present a conceptual framework that might serve as an introduction to a more rigorous study of RL. The fundamental principles and techniques used to solve RL problems are presented. The most popular RL algorithms are presented. Section 1 presents an overview of RL and provides a simple example to develop intuition of the underlying dynamic programming mechanism. In Section 2 the parts of a reinforcement learning problem are discussed. These include the environment, reinforcement function, and value function. Section 3 gives a description of the most widely used reinforcement learning algorithms. These include TD(λ) and both the residual and direct forms of value iteration, Q-learning, and advantage learning. In Section 4 some of the ancillary issues in RL are briefly discussed, such as choosing an exploration strategy and an appropriate discount factor. The conclusion is given in Section 5. Finally, Section 6 is a glossary of commonly used terms followed by references in Section 7 and a bibliography of RL applications in Section 8. The tutorial structure is such that each section builds on the information provided in previous sections. It is assumed that the reader has some knowledge of learning algorithms that rely on gradient descent (such as the backpropagation of errors algorithm).
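As a hedged companion to the algorithm list, here is tabular Q-learning in base R on an invented five-state chain whose rightmost state pays reward 1; the learning rate, discount factor, and epsilon-greedy exploration settings are illustrative.

    set.seed(0)
    n_states <- 5; actions <- c(-1, 1)      # move left or right on a chain
    Q <- matrix(0, n_states, 2)
    alpha <- 0.1; gamma <- 0.9; eps <- 0.1  # step size, discount, exploration
    for (episode in 1:500) {
      s <- 1
      for (step in 1:50) {
        a <- if (runif(1) < eps) sample(2, 1) else which.max(Q[s, ])
        s2 <- min(max(s + actions[a], 1), n_states)
        r <- if (s2 == n_states) 1 else 0
        # Q-learning: bootstrap from the greedy value of the next state.
        Q[s, a] <- Q[s, a] + alpha * (r + gamma * max(Q[s2, ]) - Q[s, a])
        s <- s2
        if (s == n_states) break
      }
    }
    round(Q, 2)   # learned values favor "right" (column 2) in every state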
Representation Learning on Graphs: Methods and Applications Machine learning on graphs is an important and ubiquitous task with applications ranging from drug design to friendship recommendation in social networks. The primary challenge in this domain is finding a way to represent, or encode, graph structure so that it can be easily exploited by machine learning models. Traditionally, machine learning approaches relied on user-defined heuristics to extract features encoding structural information about a graph (e.g., degree statistics or kernel functions). However, recent years have seen a surge in approaches that automatically learn to encode graph structure into low-dimensional embeddings, using techniques based on deep learning and nonlinear dimensionality reduction. Here we provide a conceptual review of key advancements in this area of representation learning on graphs, including matrix factorization-based methods, random-walk based algorithms, and graph convolutional networks. We review methods to embed individual nodes as well as approaches to embed entire (sub)graphs. In doing so, we develop a unified framework to describe these recent approaches, and we highlight a number of important applications and directions for future work.
Representation Learning: A Review and New Perspectives The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning. Index Terms – Deep learning, representation learning, feature learning, unsupervised learning, Boltzmann Machine, autoencoder, neural nets
Resource Elasticity for Distributed Data Stream Processing: A Survey and Future Directions Under several emerging application scenarios, such as in smart cities, operational monitoring of large infrastructures, and the Internet of Things, continuous data streams must be processed under very short delays. Several solutions, including multiple software engines, have been developed for processing unbounded data streams in a scalable and efficient manner. This paper surveys the state of the art on stream processing engines and on mechanisms for exploiting resource elasticity features of cloud computing in stream processing. Resource elasticity allows an application or service to scale out/in according to fluctuating demands. Although such features have been extensively investigated for enterprise applications, stream processing poses challenges for achieving elastic systems that can make efficient resource management decisions based on current load. This work examines some of these challenges and discusses solutions proposed in the literature to address them.
Robust Principal Component Analysis? This paper is about a curious phenomenon. Suppose we have a data matrix, which is the superposition of a low-rank component and a sparse component. Can we recover each component individually? We prove that under some suitable assumptions, it is possible to recover both the low-rank and the sparse components exactly by solving a very convenient convex program called Principal Component Pursuit; among all feasible decompositions, simply minimize a weighted combination of the nuclear norm and of the l1 norm. This suggests the possibility of a principled approach to robust principal component analysis since our methodology and results assert that one can recover the principal components of a data matrix even though a positive fraction of its entries are arbitrarily corrupted. This extends to the situation where a fraction of the entries are missing as well. We discuss an algorithm for solving this optimization problem, and present applications in the area of video surveillance, where our methodology allows for the detection of objects in a cluttered background, and in the area of face recognition, where it offers a principled way of removing shadows and specularities in images of faces.
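A hedged sketch of Principal Component Pursuit via the standard augmented-Lagrangian (ADMM-style) iteration: singular-value thresholding updates the low-rank part and entrywise soft-thresholding updates the sparse part. The synthetic matrix, the step-size heuristic for mu, and the fixed iteration count are common illustrative choices, not details taken from the paper; the weight lambda = 1/sqrt(n) follows the paper's recommendation.

    set.seed(1)
    n <- 40
    L0 <- tcrossprod(matrix(rnorm(n * 2), n, 2))        # rank-2 component
    S0 <- matrix(0, n, n); S0[sample(n * n, 80)] <- 5   # sparse corruptions
    M <- L0 + S0
    shrink <- function(X, tau) sign(X) * pmax(abs(X) - tau, 0)
    lambda <- 1 / sqrt(n)              # weight suggested in the paper
    mu <- n * n / (4 * sum(abs(M)))    # a common step-size heuristic
    L <- S <- Y <- matrix(0, n, n)
    for (it in 1:300) {
      sv <- svd(M - S + Y / mu)                 # singular value thresholding
      L <- sv$u %*% (shrink(sv$d, 1 / mu) * t(sv$v))
      S <- shrink(M - L + Y / mu, lambda / mu)  # entrywise soft threshold
      Y <- Y + mu * (M - L - S)                 # dual update
    }
    norm(L - L0, "F") / norm(L0, "F")  # relative error of the low-rank part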
ROC Curve, Lift Chart and Calibration Plot This paper presents ROC curve, lift chart and calibration plot, three well known graphical techniques that are useful for evaluating the quality of classification models used in data mining and machine learning. Each technique, normally used and studied separately, defines its own measure of classification quality and its visualization. Here, we give a brief survey of the methods and establish a common mathematical framework which adds some new aspects, explanations and interrelations between these techniques. We conclude with an empirical evaluation and a few examples of how to use the presented techniques to boost classification accuracy.
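A hedged base R sketch that computes ROC points and the AUC directly from scores; the labels and scores are simulated for illustration.

    set.seed(3)
    labels <- rep(c(1, 0), each = 50)              # invented ground truth
    scores <- c(rnorm(50, 1), rnorm(50, 0))        # higher = more positive
    ord <- order(scores, decreasing = TRUE)
    tpr <- cumsum(labels[ord] == 1) / sum(labels == 1)
    fpr <- cumsum(labels[ord] == 0) / sum(labels == 0)
    plot(c(0, fpr), c(0, tpr), type = "l",
         xlab = "False positive rate", ylab = "True positive rate")
    abline(0, 1, lty = 2)                          # chance diagonal
    # AUC via the rank (Mann-Whitney) formulation, equivalent to the
    # area under the step curve when the scores have no ties.
    n1 <- sum(labels == 1); n0 <- sum(labels == 0)
    (sum(rank(scores)[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)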
RStorm: Developing and Testing Streaming Algorithms in R Streaming data, consisting of indefinitely evolving sequences, are becoming ubiquitous in many branches of science and in various applications. Computer scientists have developed streaming applications such as Storm and the S4 distributed stream computing platform to deal with data streams. However, in current production packages, testing and evaluating streaming algorithms is cumbersome. This paper presents RStorm for the development and evaluation of streaming algorithms, analogous to these production packages but implemented fully in R. RStorm allows developers of streaming algorithms to quickly test, iterate, and evaluate various implementations of streaming algorithms. The paper provides both a canonical computer science example, the streaming word count, and examples of several statistical applications of RStorm.
Rules of Machine Learning: Best Practices for ML Engineering This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. It presents a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machine-learned model, then you have the necessary background to read this document.

S

SAP HANA for Next-Generation Business: Applications and Real-Time Analytics Explore and Analyze Vast Quantities of Data from Virtually Any Source at the Speed of Thought SAP has introduced a new class of solutions that powers the next generation of business applications. The SAP HANA database is an in-memory database that combines transactional data processing, analytical data processing, and application logic processing functionality in memory. SAP HANA removes the limits of traditional database architecture that have severely constrained how business applications can be developed to support real-time business.
SAP Predictive Analysis – Real Life Use Case Predicting Who Will Buy Additional Insurance This paper provides a step-by-step description, including screenshots, to evaluate how SAP Predictive Analysis and SAP InfiniteInsight, which as of this writing are sold as a bundle, can be used to predict the potential customers who will buy additional products based on their behavior of interest. Using the combined strength of both SAP InfiniteInsight and SAP Predictive Analysis, this article will also demonstrate how these two products can fulfill the challenge and supplement each other to provide even better prediction models. This article is for educational purposes only and uses actual data from an insurance company. The data comes from the CoIL (Computational Intelligence and Learning) challenge from the year 2000, which had the following goal: “Can you predict who would be interested in buying a caravan insurance policy and give an explanation why?” After reading this article you will be able to understand the differences between classification algorithms. You will learn how to simplify a dataset by determining which variables are important and which are not, how to score a model, and how to use SAP Predictive Analysis and SAP InfiniteInsight to build models on existing data and run the custom models on new data. The authors of this article have closely observed the online Predictive Analysis community, where members are constantly looking for two important things: which statistical algorithm to choose for which business case, and real-life business cases with actual data. The latter is certainly difficult to find but immensely important for understanding this subject. Of course, data is vital to each company and, in today’s competitive market, confers a competitive advantage. Finding a real-life case for educational purposes with actual data is a big challenge.
Scalable Strategies for Computing with Massive Data This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware.
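A hedged sketch of the foreach idiom just described, assuming the foreach and doParallel packages; the bootstrap task, dataset, and worker count are invented for illustration. The same loop body runs unchanged whether the registered backend is sequential, SMP, or a cluster.

    library(foreach)
    library(doParallel)
    cl <- makeCluster(2)      # small SMP backend; clusters work similarly
    registerDoParallel(cl)
    # The loop body is backend-agnostic: swap %dopar% for %do% to run it
    # sequentially without any other change.
    boot_medians <- foreach(i = 1:200, .combine = c) %dopar% {
      median(sample(mtcars$mpg, replace = TRUE))
    }
    stopCluster(cl)
    quantile(boot_medians, c(0.025, 0.975))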
Score Aggregation Techniques in Retrieval Experimentation Comparative evaluations of information retrieval systems are based on a number of key premises, including that representative topic sets can be created, that suitable relevance judgements can be generated, and that systems can be sensibly compared based on their aggregate performance over the selected topic set. This paper considers the role of the third of these assumptions – that the performance of a system on a set of topics can be represented by a single overall performance score such as the average, or some other central statistic. In particular, we experiment with score aggregation techniques including the arithmetic mean, the geometric mean, the harmonic mean, and the median. Using past TREC runs we show that an adjusted geometric mean provides more consistent system rankings than the arithmetic mean when a significant fraction of the individual topic scores are close to zero, and that score standardization (Webber et al., SIGIR 2008) achieves the same outcome in a more consistent manner.
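A hedged base R sketch of the aggregates under comparison; the per-topic scores and the epsilon adjustment for zero scores are invented for illustration.

    scores <- c(0.42, 0.31, 0.005, 0.0, 0.27, 0.55)  # invented topic scores
    eps <- 0.001                # adjustment so zero scores do not dominate
    mean(scores)                            # arithmetic mean
    exp(mean(log(scores + eps))) - eps      # adjusted geometric mean
    1 / mean(1 / (scores + eps))            # harmonic mean (same adjustment)
    median(scores)

Without the adjustment, a single zero score forces the geometric mean to zero, which is why the adjusted variant matters when many topic scores sit near zero.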
Security and Privacy Aspects in MapReduce on Clouds: A Survey MapReduce is a programming system for distributed processing of large-scale data in an efficient and fault-tolerant manner on a private, public, or hybrid cloud. MapReduce is extensively used daily around the world as an efficient distributed computation tool for a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and analysis of social networks. Security and privacy of data and MapReduce computations are essential concerns when a MapReduce computation is executed in public or hybrid clouds. In order to execute a MapReduce job in public and hybrid clouds, authentication of mappers-reducers, confidentiality of data-computations, integrity of data-computations, and correctness-freshness of the outputs are required. Satisfying these requirements shields the operation from several types of attacks on data and MapReduce computations. In this paper, we investigate and discuss security and privacy challenges and requirements, considering a variety of adversarial capabilities and characteristics in the scope of MapReduce. We also provide a review of existing security and privacy protocols for MapReduce and discuss their overhead issues.
Security and Privacy of Sensitive Data in Cloud Computing: A Survey of Recent Developments Cloud computing is revolutionizing many ecosystems by providing organizations with computing resources featuring easy deployment, connectivity, configuration, automation and scalability. This paradigm shift raises a broad range of security and privacy issues that must be taken into consideration. Multi-tenancy, loss of control, and trust are key challenges in cloud computing environments. This paper reviews the existing technologies and a wide array of both earlier and state-of-the-art projects on cloud security and privacy. We categorize the existing research according to the orchestration, resource control, physical resource, and cloud service management layers of the cloud reference architecture, in addition to reviewing the existing developments in privacy-preserving sensitive data approaches in cloud computing such as privacy threat modeling and privacy enhancing protocols and solutions.
Security-related Research in Ubiquitous Computing — Results of a Systematic Literature Review In an endeavor to reach the vision of ubiquitous computing where users are able to use pervasive services without spatial and temporal constraints, we are witnessing a fast growing number of mobile and sensor-enhanced devices becoming available. However, in order to take full advantage of the numerous benefits offered by novel mobile devices and services, we must address the related security issues. In this paper, we present results of a systematic literature review (SLR) on security-related topics in ubiquitous computing environments. In our study, we found 5165 scientific contributions published between 2003 and 2015. We applied a systematic procedure to identify the threats, vulnerabilities, attacks, as well as corresponding defense mechanisms that are discussed in those publications. While this paper mainly discusses the results of our study, the corresponding SLR protocol which provides all details of the SLR is also publicly available for download.
See, Hear, and Read: Deep Aligned Representations We capitalize on large amounts of readily-available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound and language. By leveraging over a year of sound from video and millions of sentences paired with images, we jointly train a deep convolutional network for aligned representation learning. Our experiments suggest that this representation is useful for several tasks, such as cross-modal retrieval or transferring classifiers between modalities. Moreover, although our network is only trained with image+text and image+sound pairs, it can transfer between text and sound as well, a transfer the network never observed during training. Visualizations of our representation reveal many hidden units which automatically emerge to detect concepts, independent of the modality.
Selecting a Visual Analytics Application Not surprisingly, everywhere you look, software companies are adopting the terms “visual analytics” and “interactive data visualization.” Tools that do little more than produce charts and dashboards are now laying claim to the label. How can you tell the cleverly named from the genuine? What should you look for? It’s important to know the defining characteristics of visual analytics before you shop. This paper introduces you to the seven essential elements of true visual analytics applications.
Sentiment Analysis of Twitter Data: A Survey of Techniques With the advancement and growth of web technology, there is a huge volume of data present on the web for internet users, and a lot of data is generated too. The internet has become a platform for online learning, exchanging ideas and sharing opinions. Social networking sites like Twitter, Facebook and Google+ are rapidly gaining popularity as they allow people to share and express their views about topics, have discussions with different communities, or post messages across the world. There has been a lot of work in the field of sentiment analysis of Twitter data. This survey focuses mainly on sentiment analysis of Twitter data, which is helpful for analyzing the information in tweets, where opinions are highly unstructured, heterogeneous, and either positive, negative, or neutral in some cases. In this paper, we provide a survey and a comparative analysis of existing techniques for opinion mining, including machine learning and lexicon-based approaches, together with evaluation metrics. We also review work on Twitter data streams that uses various machine learning algorithms such as Naive Bayes, Maximum Entropy, and Support Vector Machines, and we discuss general challenges and applications of sentiment analysis on Twitter.
Sentiment/Subjectivity Analysis Survey for Languages other than English Subjectivity and sentiment analysis has gained considerable attention recently. Most of the resources and systems built so far have been for English, and the need for designing systems for other languages is increasing. This paper surveys the different approaches used to build systems for subjectivity and sentiment analysis for languages other than English. Three types of systems are used. The first (and the best) is language-specific systems. The second involves reusing or transferring sentiment resources from English to the target language. The third is based on language-independent methods. The paper presents a separate section devoted to Arabic sentiment analysis.
Sequential Combining of Expert Information Using Mathematica In every real-world domain where reasoning under uncertainty is required, combining information from different sources (‘experts’) can be a powerful tool to enhance the accuracy and precision of the ‘final’ estimate of the unknown quantity. The Bayesian paradigm offers a coherent perspective that can be used to address the problem, but an issue closely related to information combining is how to perform an efficient process of sequential consulting: at each stage, the investigator can select the ‘best’ expert to consult and choose whether to stop or continue the consulting. The aim of this paper is to rephrase the Bayesian combining algorithm in a sequential context and to use Mathematica to implement suitable selection and stopping rules.
Sequential Pattern Mining – Approaches and Algorithms Sequences of events, items or tokens occurring in an ordered metric space appear often in data, and the requirement to detect and analyse frequent subsequences is a common problem. Sequential Pattern Mining arose as a sub-field of data mining focused on this problem. This paper surveys the approaches and algorithms proposed to date.
Serverless Computing: Current Trends and Open Problems Serverless computing has emerged as a new compelling paradigm for the deployment of applications and services. It represents an evolution of cloud programming models, abstractions, and platforms, and is a testament to the maturity and wide adoption of cloud technologies. In this chapter, we survey existing serverless platforms from industry, academia, and open source projects, identify key characteristics and use cases, and describe technical challenges and open problems.
Set optimization – a rather short introduction Recent developments in set optimization are surveyed and extended including various set relations as well as fundamental constructions of a convex analysis for set- and vector-valued functions, and duality for set optimization problems. Extensive sections with bibliographical comments summarize the state of the art. Applications to vector optimization and financial risk measures are discussed along with algorithmic approaches to set optimization problems.
Shallow and Deep Networks Intrusion Detection System: A Taxonomy and Survey Intrusion detection has attracted a considerable interest from researchers and industries. The community, after many years of research, still faces the problem of building reliable and efficient IDS that are capable of handling large quantities of data, with changing patterns in real time situations. The work presented in this manuscript classifies intrusion detection systems (IDS). Moreover, a taxonomy and survey of shallow and deep networks intrusion detection systems is presented based on previous and current works. This taxonomy and survey reviews machine learning techniques and their performance in detecting anomalies. Feature selection which influences the effectiveness of machine learning (ML) IDS is discussed to explain the role of feature selection in the classification and training phase of ML IDS. Finally, a discussion of the false and true positive alarm rates is presented to help researchers model reliable and efficient machine learning based intrusion detection systems.
Shiny Cheat Sheet (Cheat Sheet)
Simple Ain’t Easy: Real-World Problems with Basic Summary Statistics In applied statistical work, the use of even the most basic summary statistics, like means, medians and modes, can be seriously problematic. When forced to choose a single summary statistic, many considerations come into practice. This repo attempts to describe some of the non-obvious properties possessed by standard statistical methods so that users can make informed choices about methods.
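A hedged illustration of one such pitfall on invented right-skewed data, where the classic summaries disagree substantially:

    set.seed(9)
    income <- rlnorm(1000, meanlog = 10, sdlog = 1)  # right-skewed, invented
    mean(income)    # pulled far upward by the long right tail
    median(income)  # robust to the tail; a very different "typical" value
    # Base R has no mode for continuous data; the peak of a kernel density
    # estimate is one common stand-in, and it differs from both again.
    d <- density(income)
    d$x[which.max(d$y)]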
Smart Data – Innovationen aus Daten With the technology competition “Smart Data – Innovationen aus Daten” (Smart Data – Innovations from Data), the German Federal Ministry of Economics and Technology (Bundesministerium für Wirtschaft und Technologie) will fund research and development (R&D) activities that sustainably open up the emerging big data market for businesses located in Germany. Studies forecast a rapid rise in worldwide big data revenues to more than 15 billion euros in 2016 (Germany: 1.6 billion euros). Germany has a good chance of taking an internationally leading role in scalable data management and analysis systems. Established companies in the German IT industry, numerous research institutions, and various start-ups are already active in the big data field. “Smart Data” is intended to put an emphasis on the development of innovative services, in order to drive early and broad adoption. The commercial exploitation of big data technologies is still largely in its infancy and is concentrated in a few specific areas, such as online advertising and e-commerce in larger companies and organizations. The resulting solutions are expected to find ready acceptance in industry because of their manageability, above all with respect to data security and data quality. In particular, the R&D activities are intended to produce innovative system solutions for small and medium-sized enterprises (SMEs). “Smart Data” stands for an application-oriented perspective that goes beyond technology development and enables SMEs, too, to use and exploit large volumes of data in an attractive and legally secure manner. This includes addressing the fundamental framework conditions, e.g. the legal framework for the use of big data. The competition seeks flagship projects that remove technical, structural, organizational, and legal obstacles to the deployment of big data technologies. The projects should be situated in the application areas of industry, mobility, energy, and health. The “Smart Data” technology program follows the objectives of the federal government’s ICT strategy “Deutschland Digital 2015” and of the future project “Internetbasierte Dienste für die Wirtschaft” (internet-based services for industry) within the Hightech-Strategie 2020, and is therefore of substantial federal interest. The program builds on important basic technologies and standards underlying big data that have been, or are being, developed in other BMWi technology programs such as THESEUS, Trusted Cloud, Autonomik für Industrie 4.0, Elektromobilität, and E-Energy. Synergies with the program “Management und Analyse großer Datenmengen (Big Data)” funded by the Federal Ministry of Education and Research (BMBF) and with corresponding programs of the European Commission are welcome.
Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones The growth in the use of computationally intensive statistical procedures, especially with big data, has necessitated the use of parallel computation on diverse platforms such as multicore, GPUs, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to “embarrassingly parallel” (EP) algorithms, especially on non-shared memory platforms. This paper develops a broadly applicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations.
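A hedged base R sketch of the chunk-and-average idea behind such conversions: fit the same estimator independently on disjoint chunks of the data and average the chunk estimates, which is embarrassingly parallel. The regression, chunk count, and data are invented; the paper's method and its theory are more general.

    set.seed(5)
    n <- 10000
    df <- data.frame(x = rnorm(n)); df$y <- 2 * df$x + rnorm(n)
    chunks <- split(df, rep(1:4, length.out = n))    # 4 disjoint chunks
    # Each fit is independent of the others, so the loop is embarrassingly
    # parallel (e.g., one chunk per worker).
    coefs <- sapply(chunks, function(d) coef(lm(y ~ x, data = d)))
    rowMeans(coefs)                # chunk-averaged estimate...
    coef(lm(y ~ x, data = df))     # ...close to the full-data (non-EP) fit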
Software Escalation Prediction with Data Mining One of the most severe manifestations of poor quality of software products occurs when a customer “escalates” a defect: an escalation is triggered when a defect significantly impacts a customer’s operations. Escalated defects are then quickly resolved, at a high cost, outside of the general product release engineering cycle. While the software vendor and its customers often detect and report defects before they are escalated it is not always possible to quickly and accurately prioritize reported defects for resolution. As a result, even previously known defects, in addition to newly discovered defects, are often escalated by customers. Labor cost of escalations from known defects to a software vendor can amount to millions of dollars per year. The total costs to the vendor are even greater, including loss of reputation, satisfaction, loyalty, and repeat revenue. The objective of Escalation Prediction (EP) is to avoid escalations from known product defects by predicting and proactively resolving those known defects that have the highest escalation risk. This short paper outlines the business case for EP, an analysis of the business problem, the solution architecture, and some preliminary validation results on the effectiveness of EP.
SoK: Applying Machine Learning in Security – A Survey The idea of applying machine learning (ML) to solve problems in security domains is almost three decades old. As information and communications grow more ubiquitous and more data become available, many security risks arise, as well as the appetite to manage and mitigate such risks. Consequently, research on applying and designing ML algorithms and systems for security has grown fast, ranging from intrusion detection systems (IDS) and malware classification to security policy management (SPM) and information leak checking. In this paper, we systematically study the methods, algorithms, and system designs in academic publications from 2008-2015 that applied ML in security domains. 98 percent of the surveyed papers appeared in the 6 highest-ranked academic security conferences and 1 conference known for pioneering ML applications in security. We examine the generalized system designs, underlying assumptions, measurements, and use cases in active research. Our examinations lead to 1) a taxonomy on ML paradigms and security domains for future exploration and exploitation, and 2) an agenda detailing open and upcoming challenges. Based on our survey, we also suggest a point of view that treats security as a game theory problem instead of a batch-trained ML problem.
Solutions Big Data IBM (Slide Deck)
Solving Differential Equations in R Although R is still predominantly applied for statistical analysis and graphical representation, it is rapidly becoming more suitable for mathematical computing. One of the fields where considerable progress has been made recently is the solution of differential equations. Here we give a brief overview of differential equations that can now be solved by R.
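A hedged sketch using the deSolve package, one of the main R solvers in this area, on the logistic growth equation dN/dt = rN(1 − N/K); the parameter values and time grid are invented.

    library(deSolve)
    logistic <- function(t, state, parms) {
      with(as.list(c(state, parms)), {
        list(r * N * (1 - N / K))      # dN/dt = r N (1 - N/K)
      })
    }
    out <- ode(y = c(N = 1), times = seq(0, 20, by = 0.1),
               func = logistic, parms = c(r = 0.5, K = 100))
    head(out)    # columns: time and N
    plot(out)    # deSolve supplies a plot method for the result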
Some Class-Participation Demonstrations for Decision Theory and Bayesian Statistics
Some models and methods for the analysis of observational data This article provides a short, concise and essentially self-contained exposition of some of the most important models and methods for the analysis of observational data, and a substantial number of illustrations of their application. Although for the most part our presentation follows P. Rosenbaum’s book, “Observational Studies”, and naturally draws on related literature, it contains original elements and simplifies and generalizes some basic results. The illustrations, based on simulated data, show the methods at work in some detail, highlighting pitfalls and emphasizing certain subjective aspects of the statistical analyses.
Sparse Principal Component Analysis Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables, so it is often difficult to interpret the results. We introduce a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings. We first show that PCA can be formulated as a regression-type optimization problem; sparse loadings are then obtained by imposing the lasso (elastic net) constraint on the regression coefficients. Efficient algorithms are proposed to fit our SPCA models for both regular multivariate data and gene expression arrays. We also give a new formula to compute the total variance of modified principal components. As illustrations, SPCA is applied to real and simulated data with encouraging results.
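A hedged sketch using the elasticnet package, which implements the SPCA algorithm of this paper; the pitprops correlation data and the sparsity settings follow what I recall of the package's canonical example, so treat the exact arguments as an assumption.

    library(elasticnet)
    data(pitprops)   # 13 x 13 correlation matrix shipped with the package
    # Three sparse components; "varnum" caps the number of nonzero
    # loadings per component instead of using an L1 penalty directly.
    fit <- spca(pitprops, K = 3, type = "Gram", sparse = "varnum",
                para = c(7, 4, 4))
    fit$loadings   # many exact zeros, unlike ordinary PCA loadings
    fit$pev        # proportion of explained variance per component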
Spatial interpolation: Techniques for spatial data analysis (Slide Deck)
Spatio-Temporal Clustering: A Survey Spatio-temporal clustering is a process of grouping objects based on their spatial and temporal similarity. It is a relatively new subfield of data mining which gained high popularity, especially in the geographic information sciences, due to the pervasiveness of all kinds of location-based or environmental devices that record the position, time and/or environmental properties of an object or set of objects in real time. As a consequence, different types and large amounts of spatio-temporal data became available, introducing new challenges to data analysis and requiring novel approaches to knowledge discovery. In this chapter we concentrate on spatio-temporal clustering in geographic space. First, we provide a classification of different types of spatio-temporal data. Then, we focus on one type of spatio-temporal clustering – trajectory clustering – provide an overview of the state-of-the-art approaches and methods of spatio-temporal clustering, and finally present several scenarios in different application domains such as movement, cellular networks and environmental studies.
Spectral Theory of Unsigned and Signed Graphs. Applications to Graph Clustering: a Survey This is a survey of the method of graph cuts and its applications to graph clustering of weighted unsigned and signed graphs. I provide a fairly thorough treatment of the method of normalized graph cuts, a deeply original method due to Shi and Malik, including complete proofs. The main thrust of this paper is the method of normalized cuts. I give a detailed account for K = 2 clusters, and also for K > 2 clusters, based on the work of Yu and Shi. I also show how both graph drawing and normalized cut K-clustering can be easily generalized to handle signed graphs, which are weighted graphs in which the weight matrix W may have negative coefficients. Intuitively, negative coefficients indicate distance or dissimilarity. The solution is to replace the degree matrix by the matrix in which absolute values of the weights are used, and to replace the Laplacian by the Laplacian with the new degree matrix of absolute values. As far as I know, the generalization of K-way normalized clustering to signed graphs is new. Finally, I show how the method of ratio cuts, in which a cut is normalized by the size of the cluster rather than its volume, is just a special case of normalized cuts.
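The absolute-value normalization for signed graphs described in this entry is easy to write down; a small numpy sketch on a toy signed weight matrix (the matrix and the K = 2 embedding are illustrative assumptions, not an example from the survey):

```python
# Sketch of the signed-graph normalization described above: use the
# degree matrix of absolute weights, then cluster the bottom eigenvectors.
import numpy as np

W = np.array([[ 0.,  2., -1.],
              [ 2.,  0.,  0.5],
              [-1.,  0.5, 0.]])        # toy signed weight matrix (symmetric)

D_abs = np.diag(np.abs(W).sum(axis=1)) # "absolute" degree matrix
L_signed = D_abs - W                   # signed Laplacian
# Symmetric normalization, as in normalized cuts:
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D_abs)))
L_sym = D_inv_sqrt @ L_signed @ D_inv_sqrt

eigvals, eigvecs = np.linalg.eigh(L_sym)
embedding = eigvecs[:, :2]             # K=2: embed nodes in the bottom eigenvectors
# In a full pipeline one would now run k-means on the rows of `embedding`.
print(np.round(embedding, 3))
```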
SQL-on-Hadoop Engines Explained Big Data And Hadoop – Hadoop is regarded as one of the best platforms for storing and managing big data. It owes its success to its high data storage and processing scalability, low price/performance ratio, high performance, high availability, high schema flexibility, and its capability to handle all types of data. Unfortunately, Hadoop APIs, such as HDFS, MapReduce, and HBase, are quite complex. They require expertise in Java programming (or similar languages) and in-depth knowledge of how to parallelize query processing efficiently. The downsides of these interfaces are a small target audience, low productivity, and limited tool support. The Need For SQL-on-Hadoop Engines – What is needed is a programming interface that retains HDFS's performance and scalability, offers high productivity and maintainability, is known to non-technical users, and can be used by many reporting and analytical tools. The obvious choice is SQL. SQL is a high-level, declarative, and standardized database language; it's familiar to countless BI specialists, it's supported by almost all reporting and analytical tools, and it has proven its worth over and over again. To offer SQL on Hadoop, SQL query engines are needed that can query and manipulate data stored in HDFS or HBase. Such products are called SQL-on-Hadoop engines. Lately, the popularity of SQL-on-Hadoop engines has been growing rapidly. Here are just a few of the many SQL-on-Hadoop engines available: Apache Drill, Apache Hive, CitusDB, Cloudera Impala, Concurrent Lingual, Hadapt, HP Vertica, InfiniDB, JethroData, MemSQL, Pivotal HAWQ, Progress DataDirect, ScleraDB, Shark, and SpliceMachine. On the outside, most of the SQL-on-Hadoop engines look alike. They all support some SQL dialect that can be invoked through ODBC or JDBC. Internally, they can be very different. The differences stem from the purpose for which they have been designed. Here are some potential use cases for which they may have been designed:
• batch‐oriented query environment (data mining)
• interactive query environment (OLAP, self‐service BI, data visualization)
• point‐queries (retrieving and manipulating individual objects)
• investigative analytics (data science)
• operational intelligence (real‐time analytics)
• transactional (production systems) Undesired Big Data Silos – Most Hadoop-based systems have been designed and developed by organizations for one or two use cases. The workload characteristics of these use cases are usually massive data load and execution of non-interactive, complex forms of analytics. However, Hadoop implementations can support other use cases, including interactive reporting, data stream processing, transactional processing, and text search. The growing availability of SQL-on-Hadoop engines has widened the range of use cases for Hadoop even more. Unfortunately, when deployed for a different use case, a specific Hadoop implementation may be unsuitable with regard to functionality or performance. Development of another use case may force an organization to develop a second solution in which data is stored again. In the long run, this results in many data management platforms, each one designed and optimized to support a limited number of use cases. Eventually, this leads to undesirable big data silos. The disadvantages of big data silos are high costs because of data duplication, high data latency, complex data replication solutions, and data quality problems. Silos may work well temporarily, but history has shown that eventually the users of these silos will want to combine data from multiple data sources. When this happens, each application is extended to access multiple data sources, which leads to a dedicated integration solution for each one of them. The result is another undesired solution: an integration labyrinth. For an organization it's almost impossible to guarantee that all these integration solutions are correct, efficient, and lead to consistent results. The Need For One Data Management Platform – The ROI on all big data stored in Hadoop is increased when it's made available for as wide a range of use cases as possible, including all the new use cases offered by the SQL-on-Hadoop engines. What is needed is one Hadoop data management platform designed to support all current and future use cases, so that duplication of all that big data is minimized and the development of big data silos and an integration labyrinth is avoided. The Whitepaper – This whitepaper explains what SQL-on-Hadoop engines are, what the technological challenges are, and what the potential use cases of SQL-on-Hadoop are. Besides a high-level comparison of several of these engines, it also contains a detailed description of Apache Drill, which brings to light some of the pertinent issues in providing SQL capabilities on big data. In addition, the MapR Technologies data management platform M7 is described as an example of a big data platform that can support many different use cases.
SQLScript: Efficiently Analyzing Big Enterprise Data in SAP HANA Today, not only Internet companies such as Google, Facebook, and Twitter have Big Data; Enterprise Information Systems also store an ever-growing amount of data (called Big Enterprise Data in this paper). In a classical SAP system landscape, a central data warehouse (SAP BW) is used to integrate and analyze all enterprise data. In SAP BW, most of the business logic required for complex analytical tasks (e.g., a complex currency conversion) is implemented in the application layer on top of a standard relational database. While such an architecture keeps the application independent of the underlying database, it has two major drawbacks when analyzing Big Enterprise Data: (1) algorithms in ABAP do not scale with the amount of data, and (2) data shipping is required. To this end, we present a novel programming language called SQLScript to efficiently support complex and scalable analytical tasks inside SAP's new main-memory database HANA. SQLScript provides two major extensions to the SQL dialect of SAP HANA: a functional and a procedural extension. While the functional extension allows the definition of scalable analytical tasks on Big Enterprise Data, the procedural extension provides imperative constructs to orchestrate the analytical tasks. The major contributions of this paper are two novel functional extensions: first, an extended version of the MapReduce programming model for supporting parallelizable user-defined functions (UDFs); second, compared to recursion in the SQL standard, a generalized version of recursion to support graph analytics as well as machine learning tasks.
Stacked Graphs – Geometry & Aesthetics In February 2008, the New York Times published an unusual chart of box office revenues for 7500 movies over 21 years. The chart was based on a similar visualization, developed by the first author, that displayed trends in music listening. This paper describes the design decisions and algorithms behind these graphics, and discusses the reaction on the Web. We suggest that this type of complex layered graph is effective for displaying large data sets to a mass audience. We provide a mathematical analysis of how this layered graph relates to traditional stacked graphs and to techniques such as ThemeRiver, showing how each method is optimizing a different “energy function”. Finally, we discuss techniques for coloring and ordering the layers of such graphs. Throughout the paper, we emphasize the interplay between considerations of aesthetics and legibility.
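For flavor, matplotlib's stackplot exposes the layout choices this paper analyzes as baseline options: 'zero' for an ordinary stacked graph, 'sym' for a ThemeRiver-like centered layout, and 'weighted_wiggle' for the streamgraph layout associated with this line of work. A minimal sketch on synthetic series:

```python
# Quick look at layered-graph baselines: ordinary stacking, ThemeRiver-style
# centering, and the streamgraph ('weighted_wiggle') layout, on toy data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.arange(100)
layers = np.abs(rng.normal(size=(7, 100)).cumsum(axis=1))  # toy "revenue" series

fig, axes = plt.subplots(3, 1, sharex=True)
for ax, baseline in zip(axes, ["zero", "sym", "weighted_wiggle"]):
    ax.stackplot(x, layers, baseline=baseline)
    ax.set_title(baseline)
plt.tight_layout()
plt.show()
```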
Stan: A Probabilistic Programming Language Stan is a probabilistic programming language for specifying statistical models. A Stan program imperatively defines a log probability function over parameters conditioned on specified data and constants. As of version 2.2.0, Stan provides full Bayesian inference for continuous-variable models through Markov chain Monte Carlo methods such as the No-U-Turn sampler, an adaptive form of Hamiltonian Monte Carlo sampling. Penalized maximum likelihood estimates are calculated using optimization methods such as the Broyden-Fletcher-Goldfarb-Shanno algorithm. Stan is also a platform for computing log densities and their gradients and Hessians, which can be used in alternative algorithms such as variational Bayes, expectation propagation, and marginal inference using approximate integration. To this end, Stan is set up so that the densities, gradients, and Hessians, along with intermediate quantities of the algorithm such as acceptance probabilities, are easily accessible. Stan can be called from the command line, through R using the RStan package, or through Python using the PyStan package. All three interfaces support sampling and optimization-based inference and analysis, and RStan and PyStan also provide access to log probabilities, gradients, Hessians, and I/O transforms.
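A minimal sketch of the PyStan route mentioned above, using the PyStan 2-era API (newer PyStan 3 builds models with stan.build instead); the toy model and data are assumptions for illustration:

```python
# Minimal sketch of calling Stan from Python via the PyStan 2-era API.
import pystan

model_code = """
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);   // log density is accumulated imperatively
}
"""

sm = pystan.StanModel(model_code=model_code)           # compile the program
fit = sm.sampling(data={"N": 5, "y": [1.2, 0.7, 2.1, 1.5, 0.9]},
                  iter=1000, chains=2)                 # NUTS sampling
print(fit)                                             # posterior summary for mu, sigma
```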
Standards in Predictive Analytics: The role of R, Hadoop and PMML in the mainstreaming of predictive analytics. Just a few years ago it was common to develop a predictive analytic model using a single proprietary tool against a sample of structured data. This would then be applied in batch, storing scores for future use in a database or data warehouse. Recently this model has been disrupted. There is a move to real-time scoring, calculating the value of predictive analytic models when they are needed rather than looking for them in a database. At the same time the variety of model execution platforms has expanded, with in-database execution, columnar and in-memory databases as well as MapReduce-based execution becoming increasingly common. Modeling too has changed: the open source analytic modeling language R has become extremely popular, with up to 70% of analytic professionals using it at least occasionally. The range of data types being used in models has expanded along with the approaches used for storage. Modelers increasingly want to analyze all their data, not just a sample, to build a model. This increasingly complex and multi-vendor environment has increased the value of standards, both published standards and open source standards. In this paper we will explore the growing role of standards for predictive analytics in expanding the analytic ecosystem, handling Big Data and supporting the move to real-time scoring.
Statistical Inference: The Big Picture Statistics has moved beyond the frequentist-Bayesian controversies of the past. Where does this leave our ability to interpret results? I suggest that a philosophy compatible with statistical practice, labeled here statistical pragmatism, serves as a foundation for inference. Statistical pragmatism is inclusive and emphasizes the assumptions that connect statistical models with observed data. I argue that introductory courses often mischaracterize the process of statistical inference and I propose an alternative “big picture” depiction.
Statistical Learning and Kernel Methods (Slide Deck)
Statistical Learning Theory: Models, Concepts, and Results Statistical learning theory provides the theoretical basis for many of today's machine learning algorithms and is arguably one of the most beautifully developed branches of artificial intelligence in general. It originated in Russia in the 1960s and gained wide popularity in the 1990s following the development of the so-called Support Vector Machine (SVM), which has become a standard tool for pattern recognition in a variety of domains ranging from computer vision to computational biology. Providing the basis of new learning algorithms, however, was not the only motivation for developing statistical learning theory. It was just as much a philosophical one, attempting to answer the question of what it is that allows us to draw valid conclusions from empirical data. In this article we attempt to give a gentle, non-technical overview of the key ideas and insights of statistical learning theory. We do not assume that the reader has a deep background in mathematics, statistics, or computer science. Given the nature of the subject matter, however, some familiarity with mathematical concepts and notations and some intuitive understanding of basic probability is required. There exist many excellent references to more technical surveys of the mathematics of statistical learning theory: the monographs by one of the founders of statistical learning theory (Vapnik, 1995; Vapnik, 1998), a brief overview of statistical learning theory in Section 5 of Schölkopf and Smola (2002), more technical overview papers such as Bousquet et al. (2003), Mendelson (2003), Boucheron et al. (2005), Herbrich and Williamson (2002), and the monograph Devroye et al. (1996).
Statistical Model Selection with ‘Big Data’ Big Data offer potential benefits for statistical modelling, but confront problems such as an excess of false positives, mistaking correlations for causes, ignoring sampling biases, and selecting by inappropriate methods. We consider the many important requirements when searching for a data-based relationship using Big Data, and the possible role of Autometrics in that context. Paramount considerations include embedding relationships in general initial models, possibly restricting the number of variables to be selected over by non-statistical criteria (the formulation problem); using good-quality data on all variables, analyzed with tight significance levels by a powerful selection procedure, while retaining available theory insights (the selection problem); testing for relationships being well specified and invariant to shifts in explanatory variables (the evaluation problem); and using a viable approach that resolves the computational problem of immense numbers of possible models.
Statistical Modeling: The Two Cultures There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
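A toy contrast between the two cultures, sketched with scikit-learn on synthetic data (the data-generating function below is an assumption chosen to make the point):

```python
# The "two cultures" in miniature: a stochastic data model (linear regression)
# versus an algorithmic model (random forest), judged purely on prediction.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 3))
y = np.sin(X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=500)  # nonlinear truth

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(type(model).__name__, round(score, 3))
# The data model is interpretable but mis-specified here; the algorithmic
# model gives up the stochastic story in exchange for predictive accuracy.
```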
Statistical Software (R, SAS, SPSS, and Minitab) for Blind Students and Practitioners Access to information is crucial for the blind person's success in education, but transferring knowledge about the existence of techniques into actually being able to complete those tasks is what will ultimately improve the blind person's employment prospects. This paper is based on the experiences of the two authors; as blind academics in statistics, we are dependent on the usefulness of statistical software for blind users more than most blind people. The use of "we" throughout this article is intentionally meant to be personal in terms of our own experiences but, more importantly, also reflects the needs of the blind community as a whole. Blind students often benefit from one-to-one teaching resources which can aid in their uptake of statistical thinking and practice, but this additional service is only a temporary solution. Once the student has completed their first course in statistics, they may embark on research at a university, or head out into industry to apply their knowledge. Irrespective of the direction they choose, they will need certainty in being able to independently create graphs for the sighted readers of their work. At the 2009 Workshop on E-Inclusion in Mathematics and Sciences, the first author was able to meet other researchers who are concerned about the low rate of blind people entering the sciences in a broad sense and the mathematical sciences in particular. Godfrey (2009) presents what we believe is the first formalized presentation (written by a blind person) of the current state of affairs for blind people taking statistics courses. Much of the material covered in that work still holds true today, although there have been some technological changes that have altered the landscape a little. The four main considerations of Godfrey (2009) were graphics, software, statistical tables, and mathematical formulae. Although software was just one element discussed, graphics and mathematical formulae are playing an increasing role in the usefulness of statistical software, especially with respect to the accessibility of support documentation. We have reviewed four statistical software packages that blind people might want to use in their university education. Our review is restricted to the Windows operating system because this is the predominant environment in which blind people are working. Before we review R, SAS, SPSS, and Minitab, we outline our expectations of statistical software, describe a simple task used to evaluate some practical experiences, and describe some issues with certain file formats and graphics. Following the software-specific sections there is a general discussion of pertinent issues for software developers, including the relevant details of the legislative environment in the United States of America. The article closes with a simplified set of criteria and our overall assessment of the current state of the usefulness of statistical software for blind users.
Statistical Thinking: An Approach to Management (Slide Deck)
Stein's Paradox in Statistics
STL: A Seasonal-Trend Decomposition Procedure Based on Loess
Stochastic Gradient Descent Tricks Chapter 1 strongly advocates the stochastic back-propagation method to train neural networks. This is in fact an instance of a more general technique called stochastic gradient descent (SGD). This chapter provides background material, explains why SGD is a good learning algorithm when the training set is large, and provides useful recommendations.
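A bare-bones SGD loop for L2-regularized least squares, using a decreasing step size of the form recommended in this line of work for regularized objectives (a sketch under assumed hyperparameters, not the chapter's exact recipe):

```python
# Bare-bones SGD for L2-regularized least squares, with the decreasing
# step size eta_t = eta0 / (1 + eta0 * lam * t).
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

lam, eta0 = 1e-3, 0.1
w = np.zeros(d)
for t, i in enumerate(rng.integers(0, n, size=10_000)):
    eta = eta0 / (1 + eta0 * lam * t)
    grad = (X[i] @ w - y[i]) * X[i] + lam * w   # gradient on one example
    w -= eta * grad

print(np.round(w - w_true, 3))   # should be close to zero
```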
Stopping stepwise: Why stepwise and similar selection methods are bad, and what you should use A common problem in regression analysis is that of variable selection. Often, you have a large number of potential independent variables, and wish to select among them, perhaps to create a ‘best’ model. One common method of dealing with this problem is some form of automated procedure, such as forward, backward, or stepwise selection. We show that these methods are not to be recommended, and present better alternatives using PROC GLMSELECT and other methods.
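The paper's demonstrations are in SAS (PROC GLMSELECT); in Python, a comparable penalized alternative to stepwise selection looks like this (hypothetical toy data):

```python
# Lasso with cross-validated penalty: selection and shrinkage happen
# jointly, avoiding the repeated-testing pathologies of stepwise methods.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                     # 30 candidate predictors
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)   # only two matter

model = LassoCV(cv=5).fit(X, y)           # penalty chosen by cross-validation
selected = np.flatnonzero(model.coef_)    # variables with nonzero coefficients
print(selected)
```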
Strengths and Weaknesses of Weak-Strong Cluster Problems: A Detailed Overview of State-of-the-art Classical Heuristics vs Quantum Approaches To date, a conclusive detection of quantum speedup remains elusive. Recently, a team at Google Inc. [arXiv:1512.02206] proposed a weak-strong cluster model tailored to have tall and narrow energy barriers separating local minima, with the aim of highlighting the value of finite-range tunneling. More precisely, results from quantum Monte Carlo simulations, as well as from the D-Wave 2X quantum annealer, scale considerably better than state-of-the-art simulated annealing simulations. Moreover, the D-Wave 2X quantum annealer is $\sim 10^8$ times faster than simulated annealing on conventional computer hardware for problems with approximately $10^3$ variables. Here, an overview of different sequential, non-tailored, as well as specialized tailored algorithms on the Google instances is given. We show that the quantum speedup is limited to sequential approaches and study the typical complexity of the benchmark problems using insights from the study of spin glasses.
Structural Equation Models Structural equation models (SEMs), also called simultaneous equation models, are multivariate (i.e., multiequation) regression models. Unlike the more traditional multivariate linear model, however, the response variable in one regression equation in an SEM may appear as a predictor in another equation; indeed, variables in an SEM may influence one-another reciprocally, either directly or through other variables as intermediaries. These structural equations are meant to represent causal relationships among the variables in the model. …
Structural Intervention Distance (SID) for Evaluating Causal Graphs Causal inference relies on the structure of a graph, often a directed acyclic graph (DAG). Different graphs may result in different causal inference statements and different intervention distributions. To quantify such differences, we propose a (pre-)distance between DAGs, the structural intervention distance (SID). The SID is based on a graphical criterion only and quantifies the closeness between two DAGs in terms of their corresponding causal inference statements. It is therefore well-suited for evaluating graphs that are used for computing interventions. Instead of DAGs it is also possible to compare CPDAGs, completed partially directed acyclic graphs that represent Markov equivalence classes. Since it differs significantly from the popular Structural Hamming Distance (SHD), the SID constitutes a valuable additional measure.
Structure Learning of Probabilistic Graphical Models: A Comprehensive Survey Probabilistic graphical models combine graph theory and probability theory to give a framework for multivariate statistical modeling. They provide a unified description of uncertainty using probability and of complexity using the graphical model. In particular, graphical models provide several useful properties:
• Graphical models provide a simple and intuitive interpretation of the structures of probabilistic models. They can also be used to design and motivate new models.
• Graphical models provide additional insights into the properties of the model, including the conditional independence properties.
• Complex computations which are required to perform inference and learning in sophisticated models can be expressed in terms of graphical manipulations, in which the underlying mathematical expressions are carried along implicitly.
Graphical models have been applied in a large number of fields, including bioinformatics, social science, control theory, image processing, and marketing analysis, among others. However, structure learning for graphical models remains an open challenge, since one must cope with a combinatorial search over the space of all possible structures. In this paper, we present a comprehensive survey of the existing structure learning algorithms.
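To make one branch of this literature concrete: for Gaussian undirected models, the graphical lasso estimates a sparse precision matrix whose zero pattern encodes the learned structure. A small sketch with scikit-learn (the chain-structured toy data is an assumption for illustration, and this is one family among the many such surveys cover):

```python
# Structure learning for a Gaussian undirected model via the graphical
# lasso: zeros in the precision matrix mean conditional independence.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# Chain-structured truth: X0 - X1 - X2 (X0 and X2 independent given X1)
x0 = rng.normal(size=2000)
x1 = x0 + rng.normal(size=2000)
x2 = x1 + rng.normal(size=2000)
X = np.column_stack([x0, x1, x2])

prec = GraphicalLasso(alpha=0.05).fit(X).precision_
edges = (np.abs(prec) > 1e-2) & ~np.eye(3, dtype=bool)
print(edges.astype(int))   # entry (0,2) should be 0: no direct edge
```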
Stupid Data Miner Tricks: Overfitting the S&P 500 It wasn’t too long ago that calling someone a data miner was a very bad thing. You could start a fistfight at a convention of statisticians with this kind of talk. It meant that you were finding the analytical equivalent of the bunnies in the clouds, poring over data until you found something. Everyone knew that if you did enough poring, you were bound to find that bunny sooner or later, but it was no more real than the one that blows over the horizon. Now, data mining is a small industry, with entire companies devoted to it. There are academic conferences devoted solely to data mining. The phrase no longer elicits as many invitations to step into the parking lot as it used to. What’s going on? These new data mining people are not fools. Sometimes data mining makes sense, and sometimes it doesn’t. …
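The bunny-finding phenomenon is easy to reproduce: mine enough random series and one of them will appear to track the target in-sample. A few lines of numpy (all data here are random by construction):

```python
# Spurious fit by exhaustive search: the best of 10,000 unrelated random
# series "predicts" the target surprisingly well, in-sample.
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=120)                 # say, ten years of monthly returns
candidates = rng.normal(size=(10_000, 120))   # ten thousand unrelated random series

c = candidates - candidates.mean(axis=1, keepdims=True)
t = target - target.mean()
corr = (c @ t) / (np.linalg.norm(c, axis=1) * np.linalg.norm(t))
print(round(float(np.abs(corr).max()), 2))    # typically around 0.35:
                                              # impressive-looking, entirely spurious
```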
Summarizing large text collection using topic modeling and clustering based on MapReduce framework Document summarization provides an instrument for faster understanding of a collection of text documents and has a number of real-life applications. Semantic similarity and clustering can be utilized efficiently for generating an effective summary of large text collections. Summarizing a large volume of text is a challenging and time-consuming problem, particularly when considering the semantic similarity computation in the summarization process. Summarization of a text collection involves intensive text processing and computation to generate the summary. MapReduce is a proven state-of-the-art technology for handling Big Data. In this paper, a novel framework based on MapReduce technology is proposed for summarizing large text collections. The proposed technique is designed using semantic-similarity-based clustering and topic modeling with Latent Dirichlet Allocation (LDA) for summarizing the large text collection over the MapReduce framework. The summarization task is performed in four stages and provides a modular implementation of multiple-document summarization. The presented technique is evaluated in terms of scalability, and various text summarization parameters, namely compression ratio, retention ratio, ROUGE, and Pyramid score, are also measured. The advantages of the MapReduce framework are clearly visible from the experiments, and it is also demonstrated that MapReduce provides a faster implementation of summarizing large text collections and is a powerful tool in Big Text Data analysis.
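The topic-modeling ingredient can be sketched at single-machine scale (the paper's contribution is distributing stages like this over MapReduce; the toy corpus and the scikit-learn LDA below are illustrative assumptions):

```python
# LDA topics as a basis for summarization: sentences closest to a topic
# become summary candidates.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs and cats make good pets",
        "stock prices rose sharply today", "markets fell on trade news"]
counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:]]
    print(f"topic {k}:", top)
```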
Support vs Confidence in Association Rule Algorithms The discovery of interesting association relationships among large amounts of business transactions is currently vital for making appropriate business decisions. There are currently a variety of algorithms to discover association rules. Some of these algorithms depend on the use of minimum support to weed out the uninteresting rules. Other algorithms look for highly correlated items, that is, rules with high confidence. In this paper we present a description of these types of association rule algorithms and a comparison of two algorithms representative of these approaches, with the aim of understanding the pros and cons of the support- and confidence-based approaches.
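The two measures compared in this paper are computed directly from transaction data; a self-contained sketch for a rule A -> B on made-up transactions:

```python
# Support and confidence for an association rule A -> B.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"diapers"}, {"beer"}
sup = support(A | B)                 # how often A and B occur together
conf = support(A | B) / support(A)   # of transactions with A, how many have B
print(sup, conf)                     # 0.6 and 0.75 here
```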
Survey of Clustering Data Mining Techniques Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and were successfully applied to real-life data mining problems. They are the subject of this survey.
Survey of Consistent Network Updates Computer networks have become a critical infrastructure. Designing dependable computer networks is challenging, however, as such networks should not only meet strict requirements in terms of correctness, availability, and performance, but should also be flexible enough to support fast updates, e.g., due to a change in the security policy, an increasing traffic demand, or a failure. The advent of Software-Defined Networks (SDNs) promises to provide such flexibilities, allowing networks to be updated in a fine-grained manner and enabling more online traffic engineering. In this paper, we present a structured survey of mechanisms and protocols to update computer networks in a fast and consistent manner. In particular, we identify and discuss the different desirable update consistency properties a network should provide, the algorithmic techniques needed to meet these consistency properties, and their implications for the speed and costs at which updates can be performed. We also discuss the relationship of consistent network update problems to classic algorithmic optimization problems. While our survey is mainly motivated by the advent of Software-Defined Networks (SDNs), the fundamental underlying problems are not new, and we also provide a historical perspective of the subject.
Survey of Distributed Decision We survey the recent distributed computing literature on checking whether a given distributed system configuration satisfies a given boolean predicate, i.e., whether the configuration is legal or illegal w.r.t. that predicate. We consider classical distributed computing environments, including mostly synchronous fault-free network computing (LOCAL and CONGEST models), but also asynchronous crash-prone shared-memory computing (WAIT-FREE model), and mobile computing (FSYNC model).
Survey of Expressivity in Deep Neural Networks We survey results on neural network expressivity described in ‘On the Expressive Power of Deep Neural Networks’. The paper motivates and develops three natural measures of expressiveness, which all display an exponential dependence on the depth of the network. In fact, all of these measures are related to a fourth quantity, trajectory length. This quantity grows exponentially in the depth of the network, and is responsible for the depth sensitivity observed. These results translate to consequences for networks during and after training. They suggest that parameters earlier in a network have greater influence on its expressive power — in particular, given a layer, its influence on expressivity is determined by the remaining depth of the network after that layer. This is verified with experiments on MNIST and CIFAR-10. We also explore the effect of training on the input-output map, and find that it trades off stability against expressivity.
Survey of Keyword Extraction Techniques Keywords are commonly used by search engines and document databases to locate information and to determine whether two pieces of text are related to each other. Reading and summarizing the contents of large bodies of text into a small set of topics is difficult and time-consuming for a human, so much so that it becomes nearly impossible to accomplish with limited manpower as the size of the information grows. As a result, automated systems are more commonly being used for this task. The problem is challenging due to the intricate complexities of natural language, as well as the inherent difficulty in determining whether a word or set of words accurately represents the topics present within the text. With the advent of the internet, there is now both a massive amount of information available and a demand to be able to search through all of this information. Keyword extraction from text data is a common tool used by search engines and indexes alike to quickly categorize and locate specific data based on explicitly or implicitly supplied keywords.
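One of the simplest extraction baselines, ranking a document's terms by TF-IDF against a background corpus, takes only a few lines (the corpus here is a made-up stand-in):

```python
# TF-IDF keyword extraction: keep the top-k highest-scoring terms.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["machine learning from large text corpora",
          "search engines index documents by keywords",
          "deep neural networks learn feature representations"]
vec = TfidfVectorizer(stop_words="english").fit(corpus)

doc = "search engines extract keywords to index text documents"
scores = vec.transform([doc]).toarray()[0]
terms = vec.get_feature_names_out()
top_k = sorted(zip(scores, terms), reverse=True)[:3]
print([t for s, t in top_k if s > 0])
```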
Survey of Visual Question Answering: Datasets and Techniques Visual question answering (or VQA) is a new and exciting problem that combines natural language processing and computer vision techniques. We present a survey of the various datasets and models that have been used to tackle this task. The first part of the survey details the various datasets for VQA and compares them along some common factors. The second part of this survey details the different approaches for VQA, classified into four types: non-deep learning models, deep learning models without attention, deep learning models with attention, and other models which do not fit into the first three. Finally, we compare the performances of these approaches and provide some directions for future work.
Survey on Feature Selection Feature selection plays an important role in the data mining process. It is needed to deal with the excessive number of features, which can become a computational burden on the learning algorithms. It is also necessary, even when computational resources are not scarce, since it improves the accuracy of the machine learning tasks, as we will see in the upcoming sections. In this review, we discuss the different feature selection approaches, and the relation between them and the various machine learning algorithms.
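As a concrete instance of the filter family of approaches that such reviews contrast, a minimal scikit-learn sketch that scores each feature against the target and keeps the best k (synthetic data assumed):

```python
# Filter-style feature selection: univariate scoring, keep the top k.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=4, random_state=0)
selector = SelectKBest(score_func=f_classif, k=4).fit(X, y)
print(selector.get_support(indices=True))   # indices of the 4 retained features
```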
Survey on Models and Techniques for Root-Cause Analysis Automation and computer intelligence to support complex human decisions become essential to manage large and distributed systems in the Cloud and IoT era. Understanding the root cause of an observed symptom in a complex system has been a major problem for decades. As industry dives into the IoT world and the amount of data generated per year grows at an amazing speed, an important question is how to find appropriate mechanisms to determine root causes that can handle huge amounts of data or provide valuable feedback in real time. While many survey papers aim at summarizing the landscape of techniques for modelling system behavior and inferring the root cause of a problem based on the resulting models, none of them focuses on analyzing how the different techniques in the literature fit growing requirements in terms of performance and scalability. In this survey, we provide a review of root-cause analysis, focusing on these particular aspects. We also provide guidance on choosing the best root-cause analysis strategy depending on the requirements of a particular system and application.
Survival Analysis in R This document is intended to assist an individual who has familiarity with R and who is taking a survival analysis course. Specifically, this was constructed for a biostatistics course at UCLA. Many theoretical details have been intentionally omitted for brevity; it is assumed the reader is familiar with the theory of the topics presented. Likewise, it is assumed the reader has a basic understanding of R, including working with data frames, vectors, matrices, plotting, and linear model fitting and interpretation. Functions that are introduced will only have the key arguments mentioned and discussed. Most functions have several other (optional) arguments; however, many of these will not be useful for an introductory course. The functions, with the exception of those I wrote, have well-written descriptions that specify each of the potential arguments and their use. The functions I have written include documentation on the following web site: http://…/ Ideally, this survival analysis document would be printed front-to-back and bound like a book. No topics run over two pages, and those that are two pages would then be on opposing pages, making a topic’s introduction available without the need to flip back and forth between pages (unless the reader forgets a previous topic, then some flipping may be necessary). A more thorough look at Cox PH models beyond what is discussed here is available in a guide constructed by John Fox, which is listed in the References.
Sustainability in the Age of Big Data Big data and climate change share one important characteristic: Both are changing the course of history. Carbon dioxide levels have not been this high in 800,000 years, and the amount of data being generated today is unprecedented. The question at the recent Wharton conference on “Sustainability in the Age of Big Data” was how rapidly advancing information technologies can be brought together to forestall the worst ravages of global climate change. As Gary Survis, CMO of Big Data company Syncsort, IGEL senior fellow and conference moderator, noted, “It is rare that there is a confluence of two seismic events as transformative as climate change and big data. It presents amazing opportunities, as well as responsibilities.” Coming to terms with the scope of big data is a challenge, but the promise is enormous. Big data has the potential to revolutionize the two industries that generate the most carbon dioxide – energy and agriculture. Machine-to-machine communication can help reduce energy demands and increase the viability of renewable power sources. On farms, data from the molecular level may help give rise to a new green revolution, and sensors in satellites, farmland, trucks and grocery stores promise to reduce waste industry-wide. Important questions remain. Can big data be used to influence people’s behavior without manipulating them? Can private enterprise capitalize on big data’s possibilities without riding roughshod over the rights of those who generate the data? And can the high-tech innovations already underway in the developed world help solve the problems of those most in need? How well we answer these questions will determine whether we can realize the historic potential of “Sustainability in the Age of Big Data.”
Swarm Intelligence in Semi-supervised Classification This paper presents a literature review of swarm intelligence (SI) algorithms in the area of semi-supervised classification. Many research papers apply swarm intelligence algorithms in the area of machine learning (ML). Some SI algorithms are applied in the area of ML either on their own or hybridized with other ML algorithms. SI algorithms are also used for tuning the parameters of ML algorithms, or as a backbone for ML algorithms. This paper introduces a brief literature review of the application of swarm intelligence algorithms in the field of semi-supervised learning.
Symbolic Calculus in Mathematical Statistics: A Review In the last ten years, the employment of symbolic methods has substantially extended both the theory and the applications of statistics and probability. This survey reviews the development of a symbolic technique arising from classical umbral calculus, as introduced by Rota and Taylor in 1994. The usefulness of this symbolic technique is twofold. The first is to show how new algebraic identities drive the discovery of insights among topics apparently very far from each other that relate to probability and statistics. One of the main tools is a formal generalization of the convolution of identical probability distributions, which allows us to employ compound Poisson random variables in various topics that are only somewhat interrelated. Having gained a different and deeper viewpoint, the second goal is to show how to set up algorithmic processes for efficiently performing algebraic calculations. In particular, the challenge of finding these symbolic procedures should lead to a new method, and it poses new problems involving both computational and conceptual issues. Evidence of efficiency in applying this symbolic method will be shown within statistical inference, parameter estimation, Lévy processes, and, more generally, problems involving multivariate functions. The symbolic representation of Sheffer polynomial sequences allows us to carry out a unifying theory of classical, Boolean and free cumulants. Recent connections within random matrices have extended the applications of the symbolic method.
Symbolic Data Analysis: Definitions and Examples With the advent of computers, large, very large datasets have become routine. What is not so routine is how to analyse these data and/or how to glean useful information from within their massive confines. One approach is to summarize large data sets in such a way that the resulting summary dataset is of a manageable size. One consequence of this is that the data may no longer be formatted as single values such as is the case for classical data, but may be represented by lists, intervals, distributions and the like. These summarized data are examples of symbolic data. This paper looks at the concept of symbolic data in general, and then attempts to review the methods currently available to analyse such data. It quickly becomes clear that the range of methodologies available draws analogies with developments prior to 1900 which formed a foundation for the inferential statistics of the 1900’s, methods that are largely limited to small (by comparison) data sets and limited to classical data formats. The scarcity of available methodologies for symbolic data also becomes clear and so draws attention to an enormous need for the development of a vast catalogue (so to speak) of new symbolic methodologies along with rigorous mathematical foundational work for these methods.
Symbolic, Distributed and Distributional Representations for Natural Language Processing in the Era of Deep Learning: a Survey Natural language and symbols are intimately correlated. Recent advances in machine learning (ML) and in natural language processing (NLP) seem to contradict the above intuition: symbols are fading away, erased by vectors or tensors called distributed and distributional representations. However, there is a strict link between distributed/distributional representations and symbols, the former being an approximation of the latter. A clearer understanding of the strict link between distributed/distributional representations and symbols will certainly lead to radically new deep learning networks. In this paper we present a survey that aims to trace the link between symbolic representations and distributed/distributional representations. This is the right time to revitalize the area of interpreting how symbols are represented inside neural networks.
SynopSys: Large Graph Analytics in the SAP HANA Database Through Summarization Graph-structured data is ubiquitous and with the advent of social networking platforms has recently seen a significant increase in popularity amongst researchers. However, also many business applications deal with this kind of data and can therefore benefit greatly from graph processing functionality offered directly by the underlying database. This paper summarizes the current state of graph data processing capabilities in the SAP HANA database and describes our efforts to enable large graph analytics in the context of our research project SynopSys. With powerful graph pattern matching support at the core, we envision OLAP-like evaluation functionality exposed to the user in the form of easy-to-apply graph summarization templates. By combining them, the user is able to produce concise summaries of large graph-structured datasets. We also point out open questions and challenges that we plan to tackle in the future developments on our way towards large graph analytics.

T

Teaching Machines to Read and Comprehend Teaching machines to read natural language documents remains an elusive challenge. Machine reading systems can be tested on their ability to answer questions posed on the contents of documents that they have seen, but until now large scale training and test datasets have been missing for this type of evaluation. In this work we define a new methodology that resolves this bottleneck and provides large scale supervised reading comprehension data. This allows us to develop a class of attention based deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure.
Temporal anomaly detection: calibrating the surprise We propose a hybrid approach to temporal anomaly detection in user-database access data — or more generally, any kind of subject-object co-occurrence data. Our methodology allows identifying anomalies based on a single stationary model, instead of requiring a full temporal one, which would be prohibitive in our setting. We learn our low-rank stationary model from the high-dimensional training data, and then fit a regression model for predicting the expected likelihood score of normal access patterns in the future. The disparity between the predicted and the observed likelihood scores is used to assess the ‘surprise’. This approach enables calibration of the anomaly score so that time-varying normal behavior patterns are not considered anomalous. We provide a detailed description of the algorithm, including a convergence analysis, and report encouraging empirical results. One of the datasets we tested is new for the public domain. It consists of two months’ worth of database access records from a live system. This dataset will be made publicly available, and is provided in the supplementary material.
Temporal Data Mining: An Overview To classify data mining problems and algorithms, we use two dimensions: data type and type of mining operation. One of the main issues that arises during the data mining process is treating data that contains temporal information. The area of temporal data mining has received much attention in the last decade because, from the time-related features of the data, one can extract much significant information that cannot be extracted by the general methods of data mining. Many interesting techniques of temporal data mining have been proposed and shown to be useful in many applications. Since temporal data mining brings together techniques from different fields such as databases, statistics, and machine learning, the literature is scattered among many different sources. In this paper, we present a survey of techniques for temporal data mining.
Ten Simple Rules for Reproducible Computational Research Replication is the cornerstone of a cumulative science. However, new tools and technologies, massive amounts of data, interdisciplinary approaches, and the complexity of the questions being asked are complicating replication efforts, as are increased pressures on scientists to advance their research. As full replication of studies on independently collected data is often not feasible, there has recently been a call for reproducible research as an attainable minimum standard for assessing the value of scientific claims. This requires that papers in experimental science describe the results and provide a sufficiently clear protocol to allow successful repetition and extension of analyses based on original data. The importance of replication and reproducibility has recently been exemplified through studies showing that scientific papers commonly leave out experimental details essential for reproduction, studies showing difficulties with replicating published experimental results, an increase in retracted papers, and through a high number of failing clinical trials. This has led to discussions on how individual researchers, institutions, funding bodies, and journals can establish routines that increase transparency and reproducibility. In order to foster such aspects, it has been suggested that the scientific community needs to develop a "culture of reproducibility" for computational science, and to require it for published claims. We want to emphasize that reproducibility is not only a moral responsibility with respect to the scientific field, but that a lack of reproducibility can also be a burden for you as an individual researcher. As an example, a good practice of reproducibility is necessary in order to allow previously developed methodology to be effectively applied on new data, or to allow reuse of code and results for new projects. In other words, good habits of reproducibility may actually turn out to be a time-saver in the longer run. We further note that reproducibility is just as much about the habits that ensure reproducible research as the technologies that can make these processes efficient and realistic. Each of the following ten rules captures a specific aspect of reproducibility, and discusses what is needed in terms of information handling and tracking of procedures. If you are taking a bare-bones approach to bioinformatics analysis, i.e., running various custom scripts from the command line, you will probably need to handle each rule explicitly. If you are instead performing your analyses through an integrated framework (such as GenePattern, Galaxy, LONI Pipeline, or Taverna), the system may already provide full or partial support for most of the rules. What is needed on your part is then merely the knowledge of how to exploit these existing possibilities. In a pragmatic setting, with publication pressure and deadlines, one may face the need to make a trade-off between the ideals of reproducibility and the need to get the research out while it is still relevant. This trade-off becomes more important when considering that a large part of the analyses being tried out never end up yielding any results. However, frequently one will, with the wisdom of hindsight, contemplate the missed opportunity to ensure reproducibility, as it may already be too late to take the necessary notes from memory (or at least much more difficult than to do it while underway).
We believe that the rewards of reproducibility will compensate for the risk of having spent valuable time developing an annotated catalog of analyses that turned out to be blind alleys. As a minimal requirement, you should at least be able to reproduce the results yourself. This would satisfy the most basic requirements of sound research, allowing any substantial future questioning of the research to be met with a precise explanation. Although it may sound like a very weak requirement, even this level of reproducibility will often require a certain level of care in order to be met. For a given analysis, there will be an exponential number of possible combinations of software versions, parameter values, preprocessing steps, and so on, meaning that a failure to take notes may make exact reproduction essentially impossible. With this basic level of reproducibility in place, there is much more that can be wished for. An obvious extension is to go from a level where you can reproduce results in case of a critical situation to a level where you can practically and routinely reuse your previous work and increase your productivity. A second extension is to ensure that peers have a practical possibility of reproducing your results, which can lead to increased trust in, interest for, and citations of your work. We here present ten simple rules for reproducibility of computational research. These rules are at your disposal whenever you want to make your research more accessible – be it for peers or for your future self.
Tensor Networks in a Nutshell Tensor network methods are taking a central role in modern quantum physics and beyond. They can provide an efficient approximation to certain classes of quantum states, and the associated graphical language makes it easy to describe and pictorially reason about quantum circuits, channels, protocols, open systems and more. Our goal is to explain tensor networks and some associated methods as quickly and as painlessly as possible. Beginning with the key definitions, the graphical tensor network language is presented through examples. We then provide an introduction to matrix product states. We conclude the tutorial with tensor contractions evaluating combinatorial counting problems. The first one counts the number of solutions for Boolean formulae, whereas the second is Penrose’s tensor contraction algorithm, returning the number of $3$-edge-colorings of $3$-regular planar graphs.
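The graphical contractions the tutorial describes reduce, in code, to index bookkeeping; a numpy sketch that contracts a small matrix product state with itself (the dimensions below are illustrative assumptions):

```python
# Tensor contraction via einsum: the inner product <A|A> of a 3-site
# matrix product state, contracted site by site from left to right.
import numpy as np

rng = np.random.default_rng(0)
d, chi = 2, 4   # physical dimension, bond dimension
A = [rng.normal(size=(1, d, chi)),
     rng.normal(size=(chi, d, chi)),
     rng.normal(size=(chi, d, 1))]   # a 3-site MPS

E = np.ones((1, 1))   # environment carried across the sweep
for T in A:
    # contract the physical leg i between bra and ket at each site
    E = np.einsum("ab,aic,bid->cd", E, T, T)
print(E.item())       # the squared norm of the MPS
```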
Tests for Comparing Weighted Histograms. Review and Improvements Histograms with weighted entries are used to estimate probability density functions. Computer simulation is the main application of this type of histogram. A review of chi-square tests for comparing weighted histograms is presented in this paper. Improvements to these tests that give a size closer to the nominal value are proposed. Numerical examples are presented for the evaluation and demonstration of various applications of the tests.
texreg: Conversion of Statistical Model Output in R to LaTeX and HTML Tables A recurrent task in applied statistics is the (mostly manual) preparation of model output for inclusion in LaTeX, Microsoft Word, or HTML documents – usually with more than one model presented in a single table along with several goodness-of-fit statistics. However, statistical models in R have diverse object structures and summary methods, which makes this process cumbersome. This article first develops a set of guidelines for converting statistical model output to LaTeX and HTML tables, then assesses to what extent existing packages meet these requirements, and finally presents the texreg package as a solution that meets all of the criteria set out in the beginning. After providing various usage examples, a blueprint for writing custom model extensions is proposed.
Text Understanding from Scratch This article demonstrates that we can apply deep learning to text understanding from character-level inputs all the way up to abstract text concepts, using temporal convolutional networks (ConvNets) (LeCun et al., 1998). We apply ConvNets to various large-scale datasets, including ontology classification, sentiment analysis, and text categorization. We show that temporal ConvNets can achieve astonishing performance without knowledge of words, phrases, sentences, or any other syntactic or semantic structure of a human language. Evidence shows that our models can work for both English and Chinese.
The ALAMO approach to machine learning ALAMO is a computational methodology for learning algebraic functions from data. Given a data set, the approach begins by building a low-complexity linear model composed of explicit non-linear transformations of the independent variables. Linear combinations of these non-linear transformations allow a linear model to better approximate complex behavior observed in real processes. The model is refined as additional data are obtained in an adaptive fashion through error-maximization sampling using derivative-free optimization. Models built using ALAMO can enforce constraints on the response variables to incorporate first-principles knowledge. The ability of ALAMO to generate simple and accurate models for a number of reaction problems is demonstrated. The error-maximization sampling is compared with Latin hypercube designs to demonstrate its sampling efficiency. ALAMO’s constrained regression methodology is used to further refine concentration models, resulting in models that perform better on validation data and satisfy upper and lower bounds placed on model outputs.
The Analytics Big Bang (infographic)
The Anatomy of Big Data Computing Advances in information technology and its widespread growth in several areas of business, engineering, medical and scientific studies are resulting in an information/data explosion. Knowledge discovery and decision making from such rapidly growing voluminous data is a challenging task in terms of data organization and processing, an emerging trend known as Big Data Computing: a new paradigm that combines large-scale compute, new data-intensive techniques, and mathematical models to build data analytics. Big Data computing demands huge storage and compute resources for data curation and processing that could be delivered from on-premises or cloud infrastructures. This paper discusses the evolution of Big Data computing, differences between traditional data warehousing and Big Data, a taxonomy of Big Data computing and its underpinning technologies, the integrated platform of Big Data and Clouds known as Big Data Clouds, the layered architecture and components of a Big Data Cloud, and finally open technical challenges and future directions.
The Art of Data Augmentation The term data augmentation refers to methods for constructing iterative optimization or sampling algorithms via the introduction of unobserved data or latent variables. For deterministic algorithms, the method was popularized in the general statistical community by the seminal article by Dempster, Laird, and Rubin on the EM algorithm for maximizing a likelihood function or, more generally, a posterior density. For stochastic algorithms, the method was popularized in the statistical literature by Tanner and Wong’s Data Augmentation algorithm for posterior sampling and in the physics literature by Swendsen and Wang’s algorithm for sampling from the Ising and Potts models and their generalizations; in the physics literature, the method of data augmentation is referred to as the method of auxiliary variables. Data augmentation schemes were used by Tanner and Wong to make simulation feasible and simple, while auxiliary variables were adopted by Swendsen and Wang to improve the speed of iterative simulation. In general, however, constructing data augmentation schemes that result in both simple and fast algorithms is a matter of art in that successful strategies vary greatly with the (observed-data) models being considered. After an overview of data augmentation/auxiliary variables and some recent developments in methods for constructing such efficient data augmentation schemes, we introduce an effective search strategy that combines the ideas of marginal augmentation and conditional augmentation, together with a deterministic approximation method for selecting good augmentation schemes. We then apply this strategy to three common classes of models (specifically, multivariate t, probit regression, and mixed-effects models) to obtain efficient Markov chain Monte Carlo algorithms for posterior sampling. We provide theoretical and empirical evidence that the resulting algorithms, while requiring similar programming effort, can show dramatic improvement over the Gibbs samplers commonly used for these models in practice. A key feature of all these new algorithms is that they are positive recurrent subchains of nonpositive recurrent Markov chains constructed in larger spaces.
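As a concrete instance of the scheme for one of the three model classes mentioned (probit regression), here is a minimal Albert-and-Chib-style data-augmentation Gibbs sampler; the flat prior on the coefficients and the simulated data are simplifying assumptions.

```r
# Data-augmentation Gibbs sampler for probit regression (sketch).
set.seed(1)
n <- 200; X <- cbind(1, rnorm(n)); beta_true <- c(-0.5, 1)
y <- rbinom(n, 1, pnorm(X %*% beta_true))
XtXinv <- solve(crossprod(X))
beta <- c(0, 0); draws <- matrix(NA, 1000, 2)
for (s in 1:1000) {
  mu <- X %*% beta
  # Augmentation step: latent z is truncated normal given y (z > 0 iff y = 1)
  u <- runif(n); lo <- pnorm(-mu)
  z <- mu + qnorm(ifelse(y == 1, lo + u * (1 - lo), u * lo))
  # Posterior step: beta | z is normal with mean (X'X)^{-1} X'z, cov (X'X)^{-1}
  beta <- XtXinv %*% crossprod(X, z) + t(chol(XtXinv)) %*% rnorm(2)
  draws[s, ] <- beta
}
colMeans(draws[501:1000, ])  # posterior means near beta_true
```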
The Art of Turning Data Into Product Having worked in academia, government and industry, I’ve had a unique opportunity to build products in each sector. Much of this product development has been around building data products. Just as methods for general product development have steadily improved, so have the ideas for developing data products. Thanks to large investments in the general area of data science, many major innovations (e.g., Hadoop, Voldemort, Cassandra, HBase, Pig, Hive, etc.) have made data products easier to build. Nonetheless, data products are unique in that they are often extremely difficult, and seemingly intractable for small teams with limited funds. Yet, they get solved every day. How? Are the people who solve them superhuman data scientists who can come up with better ideas in five minutes than most people can in a lifetime? Are they magicians of applied math who can cobble together millions of lines of code for high-performance machine learning in a few hours? No. Many of them are incredibly smart, but meeting big problems head-on usually isn’t the winning approach. There’s a method to solving data problems that avoids the big, heavyweight solution, and instead concentrates on building something quickly and iterating. Smart data scientists don’t just solve big, hard problems; they also have an instinct for making big problems small. We call this Data Jujitsu: the art of using multiple data elements in clever ways to solve iterative problems that, when combined, solve a data problem that might otherwise be intractable. It’s related to Wikipedia’s definition of the ancient martial art of jujitsu: “the art or technique of manipulating the opponent’s force against himself rather than confronting it with one’s own force.”
The Basic AI Drives One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways. We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted. We start by showing that goal-seeking systems will have drives to model their own operation and to improve themselves. We then show that self-improving systems will be driven to clarify their goals and represent them as economic utility functions. They will also strive for their actions to approximate rational economic behavior. This will lead almost all systems to protect their utility functions from modification and their utility measurement systems from corruption. We also discuss some exceptional systems which will want to modify their utility functions. We next discuss the drive toward self-protection, which causes systems to try to prevent themselves from being harmed. Finally, we examine drives toward the acquisition of resources and toward their efficient utilization. We end with a discussion of how to incorporate these insights in designing intelligent technology which will lead to a positive future for humanity.
The Bayesian New Statistics: Two historical trends converge There have been two historical shifts in the practice of data analysis. One shift is from hypothesis testing to estimation with uncertainty and meta-analysis, which among frequentists in psychology has recently been dubbed ‘the New Statistics’ (Cumming, 2014). A second shift is from frequentist methods to Bayesian methods. We explain and applaud both of these shifts. Our main goal in this article is to explain how Bayesian methods achieve the goals of the New Statistics better than frequentist methods. The two historical trends converge in Bayesian methods for estimation with uncertainty and meta-analysis.
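A one-screen illustration of the estimation-with-uncertainty theme (our own toy example, not from the article): report a posterior interval for a proportion instead of a significance test; the Beta(1,1) prior and the counts are assumptions.

```r
# Bayesian estimation with uncertainty: credible interval for a proportion.
successes <- 27; trials <- 40
a <- 1 + successes; b <- 1 + trials - successes  # Beta(1,1) prior -> posterior
c(mean  = a / (a + b),
  lower = qbeta(0.025, a, b),
  upper = qbeta(0.975, a, b))  # 95% credible interval
```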
The beginner’s guide to app analytics The web analytics industry has grown in size and sophistication over the past decade, enabling marketers to create targeted and revenue-driven online presences. This represents a huge shift in how brands operate and communicate with their consumers. And it’s awesome. But the web isn’t where the majority of your audience is going to be anymore, and it isn’t where the learning curve is. Today, there are over 1.2 billion mobile web users worldwide, and mobile traffic is slated to reach over 25% of total Internet traffic by year-end. And now, US daily smartphone screen time has even exceeded television time. For brands designing, launching, promoting and investing in mobile apps, tracking the right engagement metrics is critical to long-term success in terms of ROI and growth. Using real-time app insights allows you to get to know and adapt to your users right from the start, so that they keep coming back, even if you’re just getting started. In this eBook, we outline how to define your brand’s mobile goals, switch from a web-only mindset, and get started identifying, measuring, and learning from your key app analytics.
The Big Data Economy: Why and how our future with data is cleaner, leaner, and smarter (Slide Deck)
The Big Potential of Big Data Big data works. Adopters have reaped benefits in ROI, customer interactions and insights into customer behavior. Of the organizations that used big data at least 50% of the time, three in five (60%) said that they had exceeded their goals. At the same time, of the companies that used big data less than 50% of the time, just 33% said that they had exceeded their goals. The more frequently companies felt that they were making sufficient use of data, the more likely they were to exceed their goals. More than nine in 10 companies (92%) who had always or frequently made sufficient use of data said that they had met or exceeded their goals, while just 5% who said that they were making sufficient use of data said that they were falling short of their goals. At the same time, marketers seem to be suffering from a split personality. The overwhelming majority of executives say they are satisfied with their marketing. When pressed for more detail, however, the participants’ rosy view contradicts other, more detailed findings. Executives believe that they are using big data enough when they aren’t. A majority of agencies and non-agencies said that they were frequently or always making sufficient use of data in marketing decisions. However, only about one in 10 non-agencies managed more than half their advertising/marketing with big data, and a third of agencies used big data in more than half their initiatives. Many executives may be struggling to define big data and its potential benefits. Just over half of senior executives (both at agencies and other companies) said that they agreed or strongly agreed that they had a good understanding of big data and its benefits. Systems that generate data quickly and can account for changing consumer behavior – those that utilize machine learning – will be increasingly important. Roughly a quarter of respondents called them critical to the success of their marketing, while another 43% of agency executives and 44% of senior executives at non-agency organizations said they would be increasingly important for most initiatives.
The Challenge of Non-Technical Loss Detection using Artificial Intelligence: A Survey Detection of non-technical losses (NTL) which include electricity theft, faulty meters or billing errors has attracted increasing attention from researchers in electrical engineering and computer science. NTLs cause significant harm to the economy, as in some countries they may range up to 40% of the total electricity distributed. The predominant research direction is employing artificial intelligence (AI) to solve this problem. Promising approaches have been reported falling into two categories: expert systems incorporating hand-crafted expert knowledge or machine learning, also called pattern recognition or data mining, which learns fraudulent consumption patterns from examples without being explicitly programmed. This paper first provides an overview about how NTLs are defined and their impact on economies. Next, it covers the fundamental pillars of AI relevant to this domain. It then surveys these research efforts in a comprehensive review of algorithms, features and data sets used. It finally identifies the key scientific and engineering challenges in NTL detection and suggests how they could be solved. We believe that those challenges have not sufficiently been addressed in past contributions and that covering those is necessary in order to advance NTL detection.
The Convergence of Markov chain Monte Carlo Methods: From the Metropolis method to Hamiltonian Monte Carlo From its inception in the 1950s to the modern frontiers of applied statistics, Markov chain Monte Carlo has been one of the most ubiquitous and successful methods in statistical computing. In that time its development has been fueled by increasingly difficult problems and novel techniques from physics. In this article I will review the history of Markov chain Monte Carlo from its inception with the Metropolis method to today’s state-of-the-art in Hamiltonian Monte Carlo. Along the way I will focus on the evolving interplay between the statistical and physical perspectives of the method.
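The 1950s starting point of this history fits in a few lines; a minimal random-walk Metropolis sampler for a standard normal target (the step size is an arbitrary assumption) looks like this:

```r
# Random-walk Metropolis for a standard normal target (sketch).
set.seed(1)
log_target <- function(x) -0.5 * x^2
n_iter <- 5000; x <- numeric(n_iter)
for (t in 2:n_iter) {
  prop <- x[t - 1] + rnorm(1, sd = 1.5)  # symmetric proposal
  if (log(runif(1)) < log_target(prop) - log_target(x[t - 1])) {
    x[t] <- prop        # accept
  } else {
    x[t] <- x[t - 1]    # reject: stay put
  }
}
c(mean = mean(x), sd = sd(x))  # approximately 0 and 1
```

Hamiltonian Monte Carlo, the endpoint of the review, replaces this blind random-walk proposal with trajectories guided by the gradient of the log target.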
The Dimensionality of Customer Satisfaction Survey Responses and Implications for Driver Analysis The canonical design of customer satisfaction surveys asks for global satisfaction with a product or service and for evaluations of its distinct attributes. Users of these surveys are often interested in the relationship between global satisfaction and the attributes, with regression analysis used to measure the conditional associations. Regression analysis is only appropriate when the global satisfaction measure results from the attribute evaluations, and is not appropriate when the covariance of the items lies in a low dimensional subspace, such as in a factor model. Potential reasons for low dimensional responses are responses that are haloed from overall satisfaction and an unintended lack of specificity of items. In this paper we develop a Bayesian mixture model that facilitates the empirical distinction between regression models and much lower-dimensional factor models. The model uses the dimensionality of the covariance among items in a survey as the primary classification criterion while accounting for heterogeneous usage of rating scales. We apply the model to four different customer satisfaction surveys evaluating hospitals, an academic program, smartphones, and theme parks, respectively. We show that correctly assessing the heterogeneous dimensionality of responses is critical for meaningful inferences by comparing our results to those from regression models.
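The dimensionality issue at stake can be previewed with a crude diagnostic (our own sketch, not the authors' Bayesian mixture model): when responses are haloed from overall satisfaction, the item covariance is dominated by a single eigenvalue.

```r
# Sketch: simulated haloed ratings produce a one-dimensional covariance.
set.seed(1)
f <- rnorm(300)                                    # latent overall satisfaction
items <- sapply(1:5, function(k) 0.9 * f + rnorm(300, sd = 0.4))
round(eigen(cov(items))$values, 2)  # first eigenvalue dwarfs the rest
```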
The Enron Corpus: A New Dataset for Email Classification Research Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.
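A toy version of such a baseline might look as follows (a sketch using the e1071 SVM interface and made-up messages, not the Enron data or the authors' pipeline):

```r
# Sketch: linear SVM folder classifier on a tiny bag-of-words matrix.
library(e1071)
msgs <- c("meeting agenda budget", "gas trading desk", "budget meeting notes",
          "trading risk report", "agenda for budget meeting", "desk risk gas")
folders <- factor(c("admin", "trading", "admin", "trading", "admin", "trading"))
vocab <- sort(unique(unlist(strsplit(msgs, " "))))
dtm <- t(sapply(strsplit(msgs, " "),
                function(w) as.integer(vocab %in% w)))  # 0/1 term matrix
colnames(dtm) <- vocab
fit <- svm(dtm, folders, kernel = "linear")
predict(fit, dtm)  # resubstitution predictions on the toy corpus
```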
The Evolution of Sentiment Analysis – A Review of Research Topics, Venues, and Top Cited Papers Research in sentiment analysis is increasing at a fast pace, making it challenging to keep track of all the activities in the area. We present a computer-assisted literature review and analyze 5,163 papers from Scopus. We find that the roots of sentiment analysis are in studies on public opinion analysis at the start of the 20th century, but the outbreak of computer-based sentiment analysis only occurred with the availability of subjective texts on the Web. Consequently, 99% of the papers have been published after 2005. Sentiment analysis papers are scattered across multiple publication venues, and the combined number of papers in the top-15 venues represents only 29% of the papers in total. In recent years, sentiment analysis has shifted from analyzing online product reviews to social media texts from Twitter and Facebook. We created a taxonomy of research topics with text mining and qualitative coding. A meaningful future for sentiment analysis could be in ensuring the authenticity of public opinions, and detecting fake news.
The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R This paper describes an R package named flare, which implements a family of new high dimensional regression methods (LAD Lasso, SQRT Lasso, ℓq Lasso, and Dantzig selector) and their extensions to sparse precision matrix estimation (TIGER and CLIME). These methods exploit different nonsmooth loss functions to gain modeling flexibility, estimation robustness, and tuning insensitivity. The developed solver is based on the alternating direction method of multipliers (ADMM). The package flare is coded in double precision C, and called from R by a user-friendly interface. The memory usage is optimized by using the sparse matrix output. The experiments show that flare is efficient and can scale up to large problems.
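A usage sketch, assuming the slim()/sugm() interface described in the flare documentation (argument names should be checked against the installed version; the simulated data are our own):

```r
# Sketch: sparse regression and precision-matrix estimation with flare.
library(flare)
set.seed(1)
n <- 100; d <- 20
X <- matrix(rnorm(n * d), n, d)
y <- X[, 1:3] %*% c(3, 2, 1.5) + rnorm(n)
fit  <- slim(X, y, method = "lq", q = 2, nlambda = 5)  # SQRT Lasso path
prec <- sugm(X, method = "tiger")                      # TIGER precision matrix
```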
The Forrester Wave: Big Data Streaming Analytics Platforms, Q3 2014 Streaming analytics is anything but a sleepy, rearview mirror analysis of data. No, it is about knowing and acting on what’s happening in your business at this very moment – now. Forrester calls these perishable insights because they occur at a moment’s notice and you must act on them fast, within a narrow window of opportunity, before they lose their value. The high velocity, white-water flow of data from innumerable real-time data sources such as market data, Internet of Things, mobile, sensors, clickstream, and even transactions remains largely unnavigated by most firms. The opportunity to leverage streaming analytics has never been greater. In Forrester’s 50-criteria evaluation of big data streaming analytics platforms, we evaluated seven platforms from IBM, Informatica, SAP, Software AG, SQLstream, Tibco Software, and Vitria.
The Forward Search and Data Visualisation The forward search is a powerful robust statistical method for exploring the relationship between data and fitted models, which produces an appreciable number of graphs that illuminate the structure of the data. Atkinson and Riani (2000) describe its use in linear and nonlinear regression, response transformation and in generalized linear models, where the emphasis is on the detection of unidentified subsets of the data and of multiple masked outliers and of their effect on inferences. In this talk we extend the method to the analysis of multivariate data, where the emphasis is rather more on the data and less on the multivariate normal model. The forward search orders the observations by closeness to the assumed model, starting from a small subset of the data and increasing the number of observations m used for fitting the model. Outliers and small unidentified subsets of observations enter at the end of the search. Even if there are a number of groups, as in cluster analysis, we start by fitting one multivariate normal distribution to the data. An important graphical tool is a variety of plots of the Mahalanobis distances of the individual observations during the search. Each unit, originally a point in v-dimensional space, is then represented by a curve in two dimensions connecting the nearly n values of the distance calculated for that unit during the search. Our task is now to classify these curves. Forward plots of Mahalanobis distances give a good initial indication of clusters, if any, which can be refined by, for example, plots of distances for unclassified units compared to those for each established group. We can also start the forward search at different points, for example inside each cluster in turn, in which case we obtain very different curves for each unit. If our aim is cluster analysis, we finish by fitting as many multivariate normal distributions as there are clusters, visually monitoring the behaviour of our forward clustering algorithm. Because we use Mahalanobis distances, it is important that the data are approximately normal. We therefore combine cluster analysis with a multivariate form of the Box-Cox family of transformations. We again use graphical methods, particularly a series of “fan plots”, to establish appropriate transformations.
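The basic forward loop described here is easy to sketch in a few lines of R (a toy illustration with simulated data; the subset size and starting rule are assumptions):

```r
# Sketch: multivariate forward search with a trace of Mahalanobis distances.
set.seed(1)
X <- rbind(matrix(rnorm(80), ncol = 2),            # main cluster
           matrix(rnorm(10, mean = 4), ncol = 2))  # small outlying group
n <- nrow(X); m0 <- 10
subset <- order(mahalanobis(X, colMeans(X), cov(X)))[1:m0]  # crude start
dist_trace <- matrix(NA, n, n - m0 + 1)
for (m in seq(m0, n)) {
  mu <- colMeans(X[subset, ]); S <- cov(X[subset, ])
  d <- mahalanobis(X, mu, S)
  dist_trace[, m - m0 + 1] <- sqrt(d)        # one curve per unit
  if (m < n) subset <- order(d)[1:(m + 1)]   # grow by the closest units
}
matplot(seq(m0, n), t(dist_trace), type = "l",
        xlab = "subset size m", ylab = "Mahalanobis distance")
# Units from the outlying group enter last and their curves jump.
```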
The forward search: Theory and data analysis The Forward Search is a powerful general method, incorporating flexible data-driven trimming, for the detection of outliers and unsuspected structure in data and so for building robust models. Starting from small subsets of data, observations that are close to the fitted model are added to the observations used in parameter estimation. As this subset grows we monitor parameter estimates, test statistics and measures of fit such as residuals. The paper surveys theoretical development in work on the Forward Search over the last decade. The main illustration is a regression example with 330 observations and 9 potential explanatory variables. Mention is also made of procedures for multivariate data, including clustering, time series analysis and fraud detection.
The Future of Data Analysis
The Future of Retail Analytics Retail has always been a data-intensive industry. As the tools available to store, manage and analyze this data evolved, so did the role the analysis of data played in retail decision-making. From visibility and control, to transparency, to efficiency, to customer engagement. It is cheaper, faster and easier today to store and process more data than ever before. Retailers have gotten better at data management. The question is: how well are they able to leverage insights from this analysis to drive strategic decisions? EKN conducted an industry survey to benchmark the state of the retail industry in terms of analytics maturity. Findings from the primary research covering 65+ respondents, interview-based qualitative inputs from retail executives, and EKN’s secondary research from public and proprietary sources are presented in this report. In a retail environment where consumer spending is stunted and competition from newer, digital channels is eroding store sales, the route for brick-and-mortar retailers to earn a larger share of the customer’s wallet is through deeper, Omni-channel customer engagement. Customer engagement is only as effective as how well you know the customer and how well you are equipped to act on that insight across your channels. Perhaps this is why customer insight emerges as retailers’ highest-priority goal from analytics initiatives in 2013. In addition, findings from EKN’s survey include:
• Retailers’ analytics maturity is low: 2 in 5 retailers state they lag behind their competitors in terms of their analytics maturity and a further 2 in 5 suggest they are at par. The “analytical retailer” is thus the exception rather than the rule.
• Data management and integration will be a key area of investment in an effort to increase analytical maturity. Retailers are looking to integrate a variety of data sources over the next 2 years; however, public and open data remains a relatively under-explored opportunity.
• Retailers find their current analytics organizational setup sub-optimal. Only 18% currently have a shared services model for analytics in place whereas approximately 60% would like to move towards such a model.
• Retailers will invest in contextual, visual and mobile-friendly delivery of insights to combat the biggest challenge that prevents them from leveraging analytics strategically – delivery of insights to the right resource at the right time.
• Retailers’ eCommerce or Omni-channel function emerges as the business function with the highest potential opportunity for analytics impact, the highest rate of data growth and the highest planned technology investment. However, it is also currently the function with the lowest analytics maturity.
• Usability is the most important feature retailers will look for when choosing analytics solutions in 2013. Even with the delivery of insights being their biggest challenge, mobile or tablet access ranks relatively low.
The traditional view of data management and analysis in retail has been tool-driven – be it relational databases of decades past or Business Intelligence tools more recently. In EKN’s view, “business analytics” is a concept that focuses on decisions and outcomes, and is a far better indicator of the future of retail analytics.
The Global Impact of Open Data Open data has spurred economic innovation, social transformation, and fresh forms of political and government accountability in recent years, but few people understand how open data works. This comprehensive report, developed with support from Omidyar Network, presents detailed case studies of open data projects throughout the world, along with in-depth analysis of what works and what doesn’t. Authors Andrew Young and Stefaan Verhulst, both with The GovLab at New York University, explain how these projects have made governments more accountable and efficient, helped policymakers find solutions to previously intractable public problems, created new economic opportunities, and empowered citizens through new forms of social mobilization.
The Google File System We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
The Graph Story of the SAP HANA Database Many traditional and new business applications work with inherently graph-structured data and therefore benefit from graph abstractions and operations provided in the data management layer. The property graph data model not only offers schema flexibility but also permits managing and processing data and metadata jointly. By having typical graph operations implemented directly in the database engine and exposing them both in the form of an intuitive programming interface and a declarative language, complex business application logic can be expressed more easily and executed very efficiently. In this paper we describe our ongoing work to extend the SAP HANA database with built-in graph data support. We see this as a next step on the way to provide an efficient and intuitive data management platform for modern business applications with SAP HANA.
The hidden costs of open source Clusters based on open-source software and the Linux operating system have come to dominate high performance computing (HPC). This is due in part to their superior performance, cost-effectiveness and flexibility. The same factors that make open-source software the choice of HPC professionals have also made it less accessible to smaller centers. The complexity and associated cost of deploying and managing open-source clusters threatens to erode the very cost benefits that have made them compelling in the first place. As customers choose between open-source and commercial alternatives, there are many different costs related to administration and productivity that should be considered. These are explored in this paper in order to give a true cost perspective. We also examine how a commercial management product, such as IBM Platform HPC, enables HPC customers to side-step many overhead cost and support issues that often plague open-source environments and enable them to deploy powerful, easy to use clusters.
The impact of social segregation on human mobility in developing and industrialized regions This study leverages mobile phone data to analyze human mobility patterns in a developing nation, especially in comparison to those of a more industrialized nation. Developing regions, such as the Ivory Coast, are marked by a number of factors that may influence mobility, such as less infrastructural coverage and maturity, less economic resources and stability, and in some cases, more cultural and language-based diversity. By comparing mobile phone data collected from the Ivory Coast to similar data collected in Portugal, we are able to highlight both qualitative and quantitative differences in mobility patterns – such as differences in likelihood to travel, as well as in the time required to travel – that are relevant to considerations of policy, infrastructure, and economic development. Our study illustrates how cultural and linguistic diversity in developing regions (such as Ivory Coast) can present challenges to mobility models that were conceptualized, and perform well, in less culturally diverse regions. Finally, we address these challenges by proposing novel techniques to assess the strength of borders in a regional partitioning scheme and to quantify the impact of border strength on mobility model accuracy.