1 | 4 | 5 | 7 | 8 | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | R | S | T | U | V | W | Y | Z
|Documents| = 673


10 Tips to Create Useful and Beautiful Visualizations (Slide Deck)


4 Steps to Successfully Evaluating Business Analytics Software The goal of Business Analytics and Intelligence software is to help businesses access, analyze and visualize data, and then communicate those insights in meaningful dashboards and metrics. Unfortunately, the reality is that the majority of software options on the market today provide only a subset of that functionality. And those that provide a more comprehensive solution, tend to then lack the features that make it user-friendly. With a crowded marketplace, businesses need to go through a complex evaluation process and make some fundamental technology decisions before selecting a vendor. Finding a business intelligence (BI) software that will scale with your organization’s needs may seem like an impossible task. Here are the four questions you can ask when beginning the BI evaluation process that will save you a lot of time and help set you in the right direction.


5 Best Practices for Creating Effective Dashboards You’ve been there: no matter how many reports, formal meetings, casual conversations or emailed memos, someone important inevitably claims they didn’t know about some important fact or insight and says “we should have a dashboard to monitor the performance of X.” Or maybe you’ve been here: you’ve said “yes, let’s have a dashboard. It will help us improve return on investment (ROI) if everyone can see how X is performing and be able to quickly respond. I’ll update it weekly.” Unfortunately, by week 3, you realize you’re killing several hours a week integrating data from multiple sources to update a dashboard you’re not sure anyone is actually using. Yet, dashboards have been all the rage and with good reason. They can help you and your coworkers achieve a better grasp on the data – one of your most important, and often overlooked assets. You’ve read how they help organizations get on the same page, speed decision-making and improve ROI. They help create organizational alignment because everyone is looking at the same thing. So dashboards can be effective. They can work. The question becomes: How can you get one to work for you? Focus on these 5 best practices. Equally important, keep an eye on the 7 critical mistakes you don’t want to make.
50 years of Data Science More than 50 years ago, John Tukey called for a reformation of academic statistics. In `The Future of Data Analysis’, he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or `data analysis’. Ten to twenty years ago, John Chambers, Bill Cleveland and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland even suggested the catchy name \Data Science’ for his envisioned eld. A recent and growing phenomenon is the emergence of \Data Science’ programs at major universities, including UC Berkeley, NYU, MIT, and most recently the Univ. of Michigan, which on September 8, 2015 announced a $100M \Data Science Initiative’ that will hire 35 new faculty. Teaching in these new programs has signi cant overlap in curricular subject matter with tradi- tional statistics courses; in general, though, the new initiatives steer away from close involvement with academic statistics departments. This paper reviews some ingredients of the current \Data Science moment’, including recent commentary about data science in the popular media, and about how/whether Data Science is really di erent from Statistics. The now-contemplated eld of Data Science amounts to a superset of the elds of statistics and machine learning which adds some technology for `scaling up’ to `big data’. This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next fty years. Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere `scaling up’, but instead the emergence of scienti c studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis work ows would impact the validity of data analysis across all of science, even predicting the impacts eld-by- eld. Drawing on work by Tukey, Cleveland, Chambers and Breiman, I present a vision of data science based on the activities of people who are `learning from data’, and I describe an academic eld dedicated to improving that activity in an evidence-based manner. This new eld is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.


7 Signs You Need Advanced Analytics for (or any CRM) and Why They Matter Sure, customer relationship management (CRM) applications provide reports and dashboards. But if you rely on the built-in analytic capabilities of CRM, you’re leaving money on the table. Because that’s what the information in your CRM system is; it’s money. But you can’t extract the true value of that information without an analytics application that does the heavy lifting without putting your sales team through hell. You also want your sales team to stay in your CRM application. That was the point. Remember, all CRM, all the time. Directing the team to another application for analytic insight just defeats the purpose. What you need are robust, easy-to-access analytics embedded right in your CRM solution. Following are seven signs that you are not operating efficiently and making reporting and analytics more difficult for your sales team and your business less productive. Don’t ignore these seven warning signs. They all carry one message: Yes, you need advanced analytics!
7 Tips to Succeed with Big Data in 2014 Just when you thought big data couldn’t get any bigger, it got bigger still. Regardless of its actual size, big data is showing its value. Organizations everywhere have big data of all shapes and sizes. They recognize the importance, the opportunity, and even the imperative to pay attention. It has become clear that big data will outlive those who ignore it. Organizations that have already tamed big data – the multi-structured mass they stored before they knew its worth – are improving their operational efficiency, growing their revenues, and empowering new business models. How do they do it? Their techniques for success can be summarized in seven tips.


8 Critical Metrics for Measuring App User Engagement In this guide, we outline for you the eight engagement metrics critical to app success, including suggestions for running marketing campaigns and boosting ROI.


A comparative study of fuzzy c-means algorithm and entropy-based fuzzy clustering algorithms Fuzzy clustering is useful to mine complex and multi-dimensional data sets, where the members have partial or fuzzy relations. Among the various developed techniques, fuzzy-C-means (FCM) algorithm is the most popular one, where a piece of data has partial membership with each of the pre-defined cluster centers. Moreover, in FCM, the cluster centers are virtual, that is, they are chosen at random and thus might be out of the data set. The cluster centers and membership values of the data points with them are updated through some iterations. On the other hand, entropy-based fuzzy clustering (EFC) algorithm works based on a similarity-threshold value. Contrary to FCM, in EFC, the cluster centers are real, that is, they are chosen from the data points. In the present paper, the performances of these algorithms have been compared on four data sets, such as IRIS, WINES, OLITOS and psychosis (collected with the help of forty doctors), in terms of the quality of the clusters (that is, discrepancy factor, compactness, distinctness) obtained and their computational time. Moreover, the best set of clusters has been mapped into 2-D for visualization using a self-organizing map (SOM).
A Comparative Study of Recommendation Algorithms in Ecommerce Applications We evaluate a wide range of recommendation algorithms on e-commerce-related datasets. These algorithms include the popular user-based and item-based correlation/similarity algorithms as well as methods designed to work with sparse transactional data. Data sparsity poses a significant challenge to recommendation approaches when applied in ecommerce applications. We experimented with approaches such as dimensionality reduction, generative models, and spreading activation, which are designed to meet this challenge. In addition, we report a new recommendation algorithm based on link analysis. Initial experimental results indicate that the link analysis-based algorithm achieves the best overall performance across several e-commerce datasets.
A comparison of algorithms for the multivariate L1-median The L1-median is a robust estimator of multivariate location with good statistical properties. Several algorithms for computing the L1-median are available. Problem specific algorithms can be used, but also general optimization routines. The aim is to compare different algorithms with respect to their precision and runtime. This is possible because all considered algorithms have been implemented in a standardized manner in the open source environment R. In most situations, the algorithm based on the optimization routine NLM (non-linear minimization) clearly outperforms other approaches. Its low computation time makes applications for large and high-dimensional data feasible.
A Composite Model for Computing Similarity Between Texts Computing text similarity is a foundational technique for a wide range of tasks in natural language processing such as duplicate detection, question answering, or automatic essay grading. Just recently, text similarity received wide-spread attention in the research community by the establishment of the Semantic Textual Similarity (STS) Task at the Semantic Evaluation (SemEval) workshop in 2012 – a fact that stresses the importance of text similarity research. The goal of the STS Task is to create automated measures which are able to compute the degree of similarity between two given texts in the same way that humans do. Measures are thereby expected to output continuous text similarity scores, which are then either compared with human judgments or used as a means for solving a particular problem. We start this thesis with the observation that while the concept of similarity is well grounded in psychology, text similarity is much less well-defined in the natural language processing community. No attempt has been made yet to formalize in what way text similarity between two texts can be computed. Still, text similarity is regarded as a fixed, axiomatic notion in the community. To alleviate this shortcoming, we describe existing formal models of similarity and discuss how we can adapt them to texts. We propose to judge text similarity along multiple text dimensions, i.e. characteristics inherent to texts, and provide empirical evidence based on a set of annotation studies that the proposed dimensions are perceived by humans. We continue with a comprehensive survey of state-of-the-art text similarity measures previously proposed in the literature. To the best of our knowledge, no such survey has been done yet. We propose a classification into compositional and noncompositional text similarity measures according to their inherent properties. Compositional measures compute text similarity based on pairwise word similarity scores between all words which are then aggregated to an overall similarity score, while noncompositional measures project the complete texts onto particular models and then compare the texts based on these models. Based on our theoretical insights, we then present the implementation of a text similarity system which composes a multitude of text similarity measures along multiple text dimensions using a machine learning classifier. Depending on the concrete task at hand, we argue that such a system may need to address more than a single text dimension in order to best resemble human judgments. Our efforts culminate in the open source framework DKPro Similarity, which streamlines the development of text similarity measures and experimental setups. We apply our system in two evaluations, for which it consistently outperforms prior work and competing systems: an intrinsic and an extrinsic evaluation. In the intrinsic evaluation, the performance of text similarity measures is evaluated in an isolated setting by comparing the algorithmically produced scores with human judgments. We conducted the intrinsic evaluation in the context of the STS Task as part of the SemEval workshop. In the extrinsic evaluation, the performance of text similarity measures is evaluated with respect to a particular task at hand, where text similarity is a means for solving a particular problem. We conducted the extrinsic evaluation in the text classification task of text reuse detection. The results of both evaluations support our hypothesis that a composition of text similarity measures highly benefits the similarity computation process. Finally, we stress the importance of text similarity measures for real-world applications. We therefore introduce the application scenario Self-Organizing Wikis, where users of wikis, i.e. web-based collaborative content authoring systems, are supported in their everyday tasks by means of natural language processing techniques in general, and text similarity in particular. We elaborate on two use cases where text similarity computation is particularly beneficial: the detection of duplicates, and the semi-automatic insertion of hyperlinks. Moreover, we discuss two further applications where text similarity is a valuable tool: In both question answering and textual entailment recognition, text similarity has been used successfully in experiments and appears to be a promising means for further research in these fields. We conclude this thesis with an analysis of shortcomings of current text similarity research and formulate challenges which should be tackled by future work. In particular, we believe that computing text similarity along multiple text dimensions – which depend on the specific task at hand – will benefit any other task where text similarity is fundamental, as a composition of text similarity measures has shown superior performance in both the intrinsic as well as the extrinsic evaluation.
A Concise Guide to Compositional Data Analysis Why a course in compositional data analysis? Compositional data consist of vectors whose components are the proportion or percentages of some whole. Their peculiarity is that their sum is constrained to the be some constant, equal to 1 for proportions, 100 for percentages or possibly some other constant c for other situations such as parts per million (ppm) in trace element compositions. Unfortunately a cursory look at such vectors gives the appearance of vectors of real numbers with the consequence that over the last century all sorts of sophisticated statistical methods designed for unconstrained data have been applied to compositional data with inappropriate inferences. All this despite the fact that many workers have been, or should have been, aware that the sample space for compositional vectors is radically different from the real Euclidean space associated with unconstrained data. Several substantial warnings had been given, even as early as 1897 by Karl Pearson in his seminal paper on spurious correlations and then repeatedly in the 1960’s by geologist Felix Chayes. Unfortunately little heed was paid to such warnings and within the small circle who did pay attention the approach was essentially pathological, attempting to answer the question: what goes wrong when we apply multivariate statistical methodology designed for unconstrained data to our constrained data and how can the unconstrained methodology be adjusted to give meaningful inferences.
A Data Management System for Computational Experiments (3X) 3X, which stands for eXecuting eXploratory eXperiments, is a software tool to ease the burden of conducting computational experiments. 3X provides a standard yet con gurable structure to execute a wide variety of experiments in a systematic way. 3X organizes the code, inputs, and outputs for an experiment, records results, and lets users visualize result data in a variety of ways. Its interface allows further runs of the experiment to be driven interactively. Our demonstration will illustrate how 3X eases the process of conducting computational experiments, using two complementary examples designed to quickly show the many features of 3X.
A data scientist’s guide to start-ups In August 2013, we held a panel discussion at the KDD 2013 conference in Chicago on the subject of data science, data scientists, and start-ups. KDD is the premier conference on data science research and practice. The panel discussed the pros and cons for top-notch data scientists of the hot data science start-up scene. In this article, we first present background on our panelists. Our four panelists have unquestionable pedigrees in data science and substantial experience with start-ups from multiple perspectives (founders, employees, chief scientists, venture capitalists). For the casual reader, we next present a brief summary of the experts’ opinions on eight of the issues the panel discussed. The rest of the article presents a lightly edited transcription of the entire panel discussion.
A fast learning algorithm for deep belief nets We show how to use “complementary priors” to eliminate the explaining away effects that make inference difficult in densely-connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modelled by long ravines in the free-energy landscape of the top-level associative memory and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
A Few Useful Things to Know about Machine Learning Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.
A Framework for Considering Comprehensibility Comprehensibility in modeling is the ability of stakeholders to understand relevant aspects of the modeling process. In this article, we provide a framework to help guide exploration of the space of comprehensibility challenges. We consider facets organized around key questions: Who is comprehending? Why are they trying to comprehend? Where in the process are they trying to comprehend? How can we help them comprehend? How do we measure their comprehension?With each facet we consider the broad range of options.We discuss why taking a broad view of comprehensibility in modeling is useful in identifying challenges and opportunities for solutions.
A Gentle Introduction to Memetic Algorithms The generic denomination of `Memetic Algorithms’ (MAs) is used to encompass a broad class of metaheuristics (i.e. general purpose methods aimed to guide an underlying heuristic). The method is based on a population of agents and proved to be of practical success in a variety of problem domains and in particular for the approximate solution of NP Optimization problems. Unlike traditional Evolutionary Computation (EC) methods, MAs are intrinsically concerned with exploiting all available knowledge about the problem under study. The incorporation of prob- lem domain knowledge is not an optional mechanism, but a fundamental feature that characterizes MAs. This functioning philosophy is perfectly illustrated by the term \memetic’. Coined by R. Dawkins , the word `meme’ denotes an analogous to the gene in the context of cultural evolution .
A joint renewal process used to model event based data In many industrial situations, where systems must be monitored using data recorded throughout a historical period of observation, one cannot fully rely on sensor data, but often only has event data to work with. This, in particular, holds for legacy data, whose evaluation is of interest to systems analysts, reliability planners, maintenance engineers etc. Event data, herein defined as a collection of triples containing a time stamp, a failure code and eventually a descriptive text, can best be evaluated by using the paradigm of joint renewal processes. The present paper formulates a model of such a process, which proceeds by means of state dependent event rates. The system state is defined, at each point in time, as the vector of backward times, whereby the backward time of an event is the time passed since the last occurrence of this event. The present paper suggests a mathematical model relating event rates linearly to the backward times. The parameters can then be estimated by means of the method of moments. In a subsequent step, these event rates can be used in a Monte-Carlo simulation to forecast the numbers of occurrences of each failure in a future time interval, based on the current system state. The model is illustrated by means of an example. As forecasting system malfunctions receives increasingly more attention in light of modern condition-based maintenance policies, this approach enables decision makers to use existing event data to implement state dependent maintenance measures.
A Mathematical Theory for Clustering in Metric Spaces Clustering is one of the most fundamental problems in data analysis and it has been studied extensively in the literature. Though many clustering algorithms have been proposed, clustering theories that justify the use of these clustering algorithms are still unsatisfactory. In particular, one of the fundamental challenges is to address the following question: What is a cluster in a set of data points? In this paper, we make an attempt to address such a question by considering a set of data points associated with a distance measure (metric). We first propose a new cohesion measure in terms of the distance measure. Using the cohesion measure, we define a cluster as a set of points that are cohesive to themselves. For such a definition, we show there are various equivalent statements that have intuitive explanations. We then consider the second question: How do we find clusters and good partitions of clusters under such a definition? For such a question, we propose a hierarchical agglomerative algorithm and a partitional algorithm. Unlike standard hierarchical agglomerative algorithms, our hierarchical agglomerative algorithm has a specific stopping criterion and it stops with a partition of clusters. Our partitional algorithm, called the K-sets algorithm in the paper, appears to be a new iterative algorithm. Unlike the Lloyd iteration that needs two-step minimization, our K-sets algorithm only takes one-step minimization. One of the most interesting findings of our paper is the duality result between a distance measure and a cohesion measure. Such a duality result leads to a dual K-sets algorithm for clustering a set of data points with a cohesion measure. The dual K-sets algorithm converges in the same way as a sequential version of the classical kernel K-means algorithm. The key difference is that a cohesion measure does not need to be positive semi-definite.
A Measure of Similarity Between Graph Vertices: Applications to Synonym Extraction and Web Searching We introduce a concept of similarity between vertices of directed graphs. Let GA and GB be two directed graphs with respectively nA and nB vertices. We define a nB × nA similarity matrix S whose real entry sij expresses how similar vertex j (in GA) is to vertex i (in GB) : we say that sij is their similarity score. The similarity matrix can be obtained as the limit of the normalized even iterates of S(k+1) = BS(k)AT +BT S(k)A where A and B are adjacency matrices of the graphs and S(0) is a matrix whose entries are all equal to one. In the special case where GA = GB = G, the matrix S is square and the score sij is the similarity score between the vertices i and j of G. We point out that Kleinberg’s “hub and authority” method to identify web-pages relevant to a given query can be viewed as a special case of our definition in the case where one of the graphs has two vertices and a unique directed edge between them. In analogy to Kleinberg, we show that our similarity scores are given by the components of a dominant eigenvector of a non-negative matrix. Potential applications of our similarity concept are numerous. We illustrate an application for the automatic extraction of synonyms in a monolingual dictionary.
A Model Explanation System We propose a general model explanation system (MES) for “explaining” the output of black box classifiers. In this introduction we use the motivating example of a classifier trained to detect fraud in a credit card transaction history. The key aspect is that we provide explanations applicable to a single prediction, rather than provide an interpretable set of parameters. The labels in the provided examples are usually negative. Hence, we focus on explaining positive predictions (alerts). In many classification applications, but especially in fraud detection, there is an expectation of false positives. Alerts are given to a human analyst before any further action is taken. Analysts often insist on understanding “why” there was an alert, since an opaque alert makes it difficult for them to proceed. Analogous scenarios occur in computer vision , credit risk , spam detection , etc. Furthermore, the MES framework is useful for model criticism. In the world of generative models, practitioners often generate synthetic data from a trained model to get an idea of “what the model is doing”. Our MES framework augments such tools. As an added benefit, MES is applicable to completely non-probabilistic black boxes that only provide hard labels. In Section 3 we use MES to visualize the decisions of a face recognition system.
A Model of Modeling We propose a formal model of scientific modeling, geared to applications of decision theory and game theory. The model highlights the freedom that modelers have in conceptualizing social phenomena using general paradigms in these elds. It may shed some light on the distinctions between (i) refutation of a theory and a paradigm, (ii) notions of rationality, (iii) modes of application of decision models, and (iv) roles of economics as an academic discipline. Moreover, the model suggests that all four distinctions have some common features that are captured by the model.
A model of text for experimentation in the social sciences Statistical models of text have become increasingly popular in statistics and com- puter science as a method of exploring large document collections. Social scientists often want to move beyond exploration, to measurement and experimentation, and make inference about social and political processes that drive discourse and content. In this paper, we develop a model of text data that supports this type of substantive re- search. Our approach is to posit a hierarchical mixed membership model for analyzing topical content of documents, in which mixing weights are parameterized by observed covariates. In this model, topical prevalence and topical content are speci ed as a sim- ple generalized linear model on an arbitrary number of document-level covariates, such as news source and time of release, enabling researchers to introduce elements of the experimental design that informed document collection into the model, within a gen- erally applicable framework. We demonstrate the proposed methodology by analyzing a collection of news reports about China, where we allow the prevalence of topics to evolve over time and vary across newswire services. Our methods help quantify the e ect of news wire source on both the frequency and nature of topic coverage. All the methods we describe are available as part of the open source R package stm.
A Natural Language Query Interface to Structured Information Accessing structured data such as that encoded in ontologies and knowledge bases can be done using either syntactically complex formal query languages like SPARQL or complicated form interfaces that require expensive customisation to each particular application domain. This paper presents the QuestIO system – a natural language interface for accessing structured information, that is domain independent and easy to use without training. It aims to bring the simplicity of Google’s search interface to conceptual retrieval by automatically converting short conceptual queries into formal ones, which can then be executed against any semantic repository. QuestIO was developed specifically to be robust with regard to language ambiguities, incomplete or syntactically ill-formed queries, by harnessing the structure of ontologies, fuzzy string matching, and ontologymotivated similarity metrics.
A Neural Bayesian Estimator for Conditional Probability Densities This article describes a robust algorithm to estimate a conditional probability density f(t|x) as a non-parametric smooth regression function. It is based on a neural network and the Bayesian interpretation of the network output as a posteriori probabability. The network is trained using example events from history or simulation, which define the underlying probability density f(t, x). Once trained, the network is applied on new, unknown examples x, for which it can predict the probability distribution of the target variable t. Event-by-event knowledge of the smooth function f(t|x) can be very useful, e.g. in maximum likelihood fits or for forecasting tasks. No assumptions are necessary about the distribution, and non-Gaussian tails are accounted for automatically. Important quantities like median, mean value, left and right standard deviations, moments and expectation values of any function of t are readily derived from it. The algorithm can be considered as an event-by-event unfolding and leads to statistically optimal reconstruction. The largest benefit of the method lies in complicated problems, when the measurements x are only relatively weakly correlated to the output t. As to assure optimal generalisation features and to avoid overfitting, the networks are regularised by extended versions of weight decay. The regularisation parameters are determined during the online-learning of the network by relations obtained from Bayesian statistics. Some toy Monte Carlo tests and first real application examples from high-energy physics and econometry are discussed.
A New View of Predictive State Methods for Dynamical System Learning Recently there has been substantial interest in predictive state methods for learning dynamical systems: these algorithms are popular since they often offer a good tradeoff between computational speed and statistical efficiency. Despite their desirable properties, though, predictive state methods can sometimes be difficult to use in practice. E.g., in contrast to the rich literature on supervised learning methods, which allows us to choose from an extensive menu of models and algorithms to suit the prior beliefs we have about properties of the function to be learned, predictive state dynamical system learning methods are comparatively inflexible: it is as if we were restricted to use only linear regression instead of being allowed to choose decision trees, nonparametric regression, or the lasso. To address this problem, we propose a new view of predictive state methods in terms of instrumentalvariable regression. This view allows us to construct a wide variety of dynamical system learners simply by swapping in different supervised learning methods. We demonstrate the effectiveness of our proposed methods by experimenting with non-linear regression to learn a hidden Markov model, showing that the resulting algorithm outperforms its linear counterpart; the correctness of this algorithm follows directly from our general analysis.
A Non-Geek’s Big Data Playbook This Big Data Playbook demonstrates in six common “plays” how Apache Hadoop supports and extends the EDW ecosystem.
A novel algorithm for fast and scalable subspace clustering of high-dimensional data Rapid growth of high dimensional datasets in recent years has created an emergent need to extract the knowledge underlying them. Clustering is the process of automatically finding groups of similar data points in the space of the dimensions or attributes of a dataset. Finding clusters in the high dimensional datasets is an important and challenging data mining problem. Data group together differently under different subsets of dimensions, called subspaces. Quite often a dataset can be better understood by clustering it in its subspaces, a process called subspace clustering. But the exponential growth in the number of these subspaces with the dimensionality of data makes the whole process of subspace clustering computationally very expensive. There is a growing demand for efficient and scalable subspace clustering solutions in many Big data application domains like biology, computer vision, astronomy and social networking. Apriori based hierarchical clustering is a promising approach to find all possible higher dimensional subspace clusters from the lower dimensional clusters using a bottom-up process. However, the performance of the existing algorithms based on this approach deteriorates drastically with the increase in the number of dimensions. Most of these algorithms require multiple database scans and generate a large number of redundant subspace clusters, either implicitly or explicitly, during the clustering process. In this paper, we present SUBSCALE, a novel clustering algorithm to find non-trivial subspace clusters with minimal cost and it requires only k database scans for a k-dimensional data set. Our algorithm scales very well with the dimensionality of the dataset and is highly parallelizable. We present the details of the SUBSCALE algorithm and its evaluation in this paper.
A novel framework to analyze road accident time series data Road accident data analysis plays an important role in identifying key factors associated with road accidents. These associated factors help in taking preventive measures to overcome the road accidents. Various studies have been done on road accident data analysis using traditional statistical techniques and data mining techniques. All these studies focused on identifying key factors associated with road accidents in different countries. Road accident is uncertain and unpredictable events which can occur in any circumstances. Also, road accidents do not have similar impacts in every region of the districts. There are chances that road accident rate is increasing in a certain district but it has some lower impact in other districts. Hence, the more focus on road safety should be on those regions or districts where road accident trend is increasing. Time series analysis is an important area of study which can be helpful in identifying the increasing or decreasing trends in different districts. In this paper, we have proposed a framework to analyze road accident time series data that takes 39 time series data of 39 districts of Gujrat and Uttarakhand state of India. This framework segments the time series data into different clusters. A time series merging algorithm is proposed to find the representative time series (RTS) for each cluster. This RTS is further used for trend analysis of different clusters. The results reveals that road accident trend is going to increase in certain clusters and those districts should be the prime concern to take preventive measure to overcome the road accidents.
A Practical Guide to Support Vector Classification The support vector machine (SVM) is a popular classi cation technique. However, beginners who are not familiar with SVM often get unsatisfactory results since they miss some easy but signi cant steps. In this guide, we propose a simple procedure which usually gives reasonable results.
A Primer on Neural Network Models for Natural Language Processing Over the past few years, neural networks have re-emerged as powerful machine-learning models, yielding state-of-the-art results in elds such as image recognition and speech processing. More recently, neural network models started to be applied also to textual natural language signals, again with very promising results. This tutorial surveys neural network models from the perspective of natural language processing research, in an attempt to bring natural-language researchers up to speed with the neural techniques. The tutorial covers input encoding for natural language tasks, feed-forward networks, convolutional networks, recurrent networks and recursive networks, as well as the computation graph abstraction for automatic gradient computation.
A Probabilistic Theory of Deep Learning A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks such as visual object and speech recognition. The key factor complicating such tasks is the presence of numerous nuisance variables, for instance, the unknown object position, orientation, and scale in object recognition or the unknown voice pronunciation, pitch, and speed in speech recognition. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks; they are constructed from many layers of alternating linear and nonlinear processing units and are trained using large-scale algorithms and massive amounts of training data. The recent success of deep learning systems is impressive – they now routinely yield pattern recognition systems with nearor super-human capabilities – but a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on a Bayesian generative probabilistic model that explicitly captures variation due to nuisance variables. The graphical structure of the model enables it to be learned from data using classical expectation-maximization techniques. Furthermore, by relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks (DCNs) and random decision forests (RDFs), providing insights into their successes and shortcomings as well as a principled route to their improvement.
A Revealing Introduction to Hidden Markov Models Suppose we want to determine the average annual temperature at a particular location on earth over a series of years. To make it interesting, suppose the years we are concerned with lie in the distant past, before thermometers were invented. Since we can’t go back in time, we instead look for indirect evidence of the temperature…
A Review of 40 Years of Cognitive Architecture Research: Focus on Perception, Attention, Learning and Applications In this paper we present a broad overview of the last 40 years of research on cognitive architectures. Although the number of existing architectures is nearing several hundred, most of the existing surveys do not reflect this growth and focus on a handful of well-established architectures. While their contributions are undeniable, they represent only a part of the research in the field. Thus, in this survey we wanted to shift the focus towards a more inclusive and high-level overview of the research in cognitive architectures. Our final set of 86 architectures includes 55 that are still actively developed, and borrow from a diverse set of disciplines, spanning areas from psychoanalysis to neuroscience. To keep the length of this paper within reasonable limits we discuss only the core cognitive abilities, such as perception, attention mechanisms, learning and memory structure. To assess the breadth of practical applications of cognitive architectures we gathered information on over 700 practical projects implemented using the cognitive architectures in our list. We use various visualization techniques to highlight overall trends in the development of the field. Our analysis of practical applications shows that most architectures are very narrowly focused on a particular application domain. Furthermore, there is an apparent gap between general research in robotics and computer vision and research in these areas within the cognitive architectures field. It is very clear that biologically inspired models do not have the same range and efficiency compared to the systems based on engineering principles and heuristics. Another observation is related to a general lack of collaboration. Several factors hinder communication, such as the closed nature of the individual projects (only one-third of the reviewed here architectures are open-source) and terminological differences.
A Review of Data Fusion Techniques In general, all tasks that demand any type of parameter estimation from multiple sources can benefit from the use of data/information fusion methods. The terms information fusion and data fusion are typically employed as synonyms; but in some scenarios, the term data fusion is used for raw data (obtained directly from the sensors) and the term information fusion is employed to define already processed data. In this sense, the term information fusion implies a higher semantic level than data fusion. Other terms associated with data fusion that typically appear in the literature include decision fusion, data combination, data aggregation, multisensor data fusion, and sensor fusion. Researchers in this field agree that the most accepted definition of data fusion was provided by the Joint Directors of Laboratories (JDL) workshop : ‘A multi-level process dealing with the association, correlation, combination of data and information from single and multiple sources to achieve refined position, identify estimates and complete and timely assessments of situations, threats and their significance.’ Hall and Llinas provided the following well-known definition of data fusion: ‘data fusion techniques combine data from multiple sensors and related information from associated databases to achieve improved accuracy and more specific inferences than could be achieved by the use of a single sensor alone.’ Briefly, we can define data fusion as a combination of multiple sources to obtain improved information; in this context, improved information means less expensive, higher quality, or more relevant information. Data fusion techniques have been extensively employed on multisensor environments with the aim of fusing and aggregating data from different sensors; however, these techniques can also be applied to other domains, such as text processing.The goal of using data fusion inmultisensor environments is to obtain a lower detection error probability and a higher reliability by using data from multiple distributed sources. The available data fusion techniques can be classified into three nonexclusive categories: (i) data association, (ii) state estimation, and (iii) decision fusion. Because of the large number of published papers on data fusion, this paper does not aim to provide an exhaustive review of all of the studies; instead, the objective is to highlight the main steps that are involved in the data fusion framework and to review the most common techniques for each step. The remainder of this paper continues as follows. The next section provides various classification categories for data fusion techniques. Then, Section 3 describes the most common methods for data association tasks. Section 4 provides a review of techniques under the state estimation category. Next, the most common techniques for decision fusion are enumerated in Section 5. Finally, the conclusions obtained from reviewing the different methods are highlighted in Section 6.
A review of data mining using big data in health informatics The amount of data produced within Health Informatics has grown to be quite vast, and analysis of this Big Data grants potentially limitless possibilities for knowledge to be gained. In addition, this information can improve the quality of healthcare offered to patients. However, there are a number of issues that arise when dealing with these vast quantities of data, especially how to analyze this data in a reliable manner. The basic goal of Health Informatics is to take in real world medical data from all levels of human existence to help advance our understanding of medicine and medical practice. This paper will present recent research using Big Data tools and approaches for the analysis of Health Informatics data gathered at multiple levels, including the molecular, tissue, patient, and population levels. In addition to gathering data at multiple levels, multiple levels of questions are addressed: human-scale biology, clinical-scale, and epidemic-scale. We will also analyze and examine possible future work for each of these areas, as well as how combining data from each level may provide the most promising approach to gain the most knowledge in Health Informatics.
A Review of Features for the Discrimination of Twitter Users: Application to the Prediction of Offline Influence Many works related to Twitter aim at characterizing its users in some way: role on the service (spammers, bots, organizations, etc.), nature of the user (socio-professional category, age, etc.), topics of interest, and others. However, for a given user classification problem, it is very difficult to select a set of appropriate features, because the many features described in the literature are very heterogeneous, with name overlaps and collisions, and numerous very close variants. In this article, we review a wide range of such features. In order to present a clear state-of-the-art description, we unify their names, definitions and relationships, and we propose a new, neutral, typology. We then illustrate the interest of our review by applying a selection of these features to the offline influence detection problem. This task consists in identifying users which are influential in real-life, based on their Twitter account and related data. We show that most features deemed efficient to predict online influence, such as the numbers of retweets and followers, are not relevant to this problem. However, We propose several content-based approaches to label Twitter users as Influencers or not. We also rank them according to a predicted influence level. Our proposals are evaluated over the CLEF RepLab 2014 dataset, and outmatch state-of-the-art methods.
A review of instance selection methods In supervised learning, a training set providing previously known information is used to classify new instances. Commonly, several instances are stored in the training set but some of them are not useful for classifying therefore it is possible to get acceptable classification rates ignoring non useful cases; this process is known as instance selection. Through instance selection the training set is reduced which allows reducing runtimes in the classification and/or training stages of classifiers. This work is focused on presenting a survey of the main instance selection methods reported in the literature.
A Review of Relational Machine Learning for Knowledge Graphs Relational machine learning studies methods for the statistical analysis of relational, or graph-structured, data. In this paper, we provide a review of how such statistical models can be ‘trained’ on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph). In particular, we discuss two different kinds of statistical relational models, both of which can scale to massive datasets. The first is based on tensor factorization methods and related latent variable models. The second is based on mining observable patterns in the graph. We also show how to combine these latent and observable models to get improved modeling power at decreased computational cost. Finally, we discuss how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web. In particular, we discuss Google’s Knowledge Vault project.
A Short Course on Network Analysis These are lecture notes prepared for a short (6 hours) course given at the conference Method- ological Advances in Statistics Related to Big Data, held in Castro Urdiales, Spain, June 8-12, 2015. The course focuses on the analysis of networks without labels at the nodes. It covers de- scriptives statistics for graphs, random graphs models, and graph partitioning, including recent advances in spectral and semide nite methods.
A Short Introduction to Boosting Boosting is a general method for improving the accuracy of any given learning algorithm. This short overview paper introduces the boosting algorithm AdaBoost, and explains the underlying theory of boosting, including an explanation of why boosting often does not suffer from overfitting as well as boosting’s relationship to support-vector machines. Some examples of recent applications of boosting are also described.
A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University
A Statistical Learning Model of Text Classification for Support Vector Machines
A summary on Maximum likelihood Estimator A general method of building a predictive model requires least square estimation at first. Then we need work on the residuals, find the confidence interval of parameters and test how well the model fits the data which are based on the normally distributed assumption of the residuals (or noises). But unfortunately the assumption is not guaranteed. Most of the time, you will have a graph of residuals that looks like another distribution rather than the normal. At this moment you could add one more factor term to your model so as to filter out the non-normal distributed noise, and then calculate the LSE again. But you may still have the same problem again. Or if you can recognize the distribution of the graph (or somehow you know the pdf of the noise), you can just calculate the MLE of the parameters of your model. This time, your work is really finished.
A Survey of Algorithms for Keyword Search on Graph Data In this chapter, we survey methods that perform keyword search on graph data. Keyword search provides a simple but user-friendly interface to retrieve information from complicated data structures. Since many real life datasets are represented by trees and graphs, keyword search has become an attractive mechanism for data of a variety of types. In this survey, we discuss methods of keyword search on schema graphs, which are abstract representation for XML data and relational data, and methods of keyword search on schema-free graphs. In our discussion, we focus on three major challenges of keyword search on graphs. First, what is the semantics of keyword search on graphs, or, what qualifies as an answer to a keyword search; second, what constitutes a good answer, or, how to rank the answers; third, how to perform keyword search efficiently. We also discuss some unresolved challenges and propose some new research directions on this topic.
A survey of Bayesian predictive methods for model assessment, selection and comparison To date, several methods exist in the statistical literature for model assessment, which purport themselves specifically as Bayesian predic- tive methods. The decision theoretic assumptions on which these methods are based are not always clearly stated in the original articles, however. The aim of this survey is to provide a unified review of Bayesian predictive model assessment and selection methods, and of methods closely related to them. We review the various assumptions that are made in this context and discuss the connections between different approaches, with an emphasis on how each method approximates the expected utility of using a Bayesian model for the purpose of predicting future data.
A Survey of Binary Similarity and Distance Measures The binary feature vector is one of the most common representations of patterns and measuring similarity and distance measures play a critical role in many problems such as clustering, classification, etc. Ever since Jaccard proposed a similarity measure to classify ecological species in 1901, numerous binary similarity and distance measures have been proposed in various fields. Applying appropriate measures results in more accurate data analysis. Notwithstanding, few comprehensive surveys on binary measures have been conducted. Hence we collected 76 binary similarity and distance measures used over the last century and reveal their correlations through the hierarchical clustering technique.
A Survey of Methods for Collective Communication Optimization and Tuning New developments in HPC technology in terms of increasing computing power on multi/many core processors, high-bandwidth memory/IO subsystems and communication interconnects, pose a direct impact on software and runtime system development. These advancements have become useful in producing high-performance collective communication interfaces that integrate efficiently on a wide variety of platforms and environments. However, number of optimization options that shows up with each new technology or software framework has resulted in a \emph{combinatorial explosion} in feature space for tuning collective parameters such that finding the optimal set has become a nearly impossible task. Applicability of algorithmic choices available for optimizing collective communication depends largely on the scalability requirement for a particular usecase. This problem can be further exasperated by any requirement to run collective problems at very large scales such as in the case of exascale computing, at which impractical tuning by brute force may require many months of resources. Therefore application of statistical, data mining and artificial Intelligence or more general hybrid learning models seems essential in many collectives parameter optimization problems. We hope to explore current and the cutting edge of collective communication optimization and tuning methods and culminate with possible future directions towards this problem.
A Survey of Monte Carlo Tree Search Methods Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm’s derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
A Survey of Online Failure Prediction Methods With ever-growing complexity and dynamicity of computer systems, proactive fault management is an effective approach to enhancing availability. Online failure prediction is the key to such techniques. In contrast to classical reliability methods, online failure prediction is based on runtime monitoring and a variety of models and methods that use the current state of a system and, frequently, the past experience as well. This survey describes these methods. To capture the wide spectrum of approaches concerning this area, a taxonomy has been developed, whose different approaches are explained and major concepts are described in detail.
A Survey of Point-of-interest Recommendation in Location-based Social Networks Point-of-interest (POI) recommendation that suggests new places for users to visit arises with the popularity of location-based social networks (LBSNs). Due to the importance of POI recommendation in LBSNs, it has attracted much academic and industrial interest. In this paper, we offer a systematic review of this field, summarizing the contributions of individual efforts and exploring their relations. We discuss the new properties and challenges in POI recommendation, compared with traditional recommendation problems, e.g., movie recommendation. Then, we present a comprehensive review in three aspects: influential factors for POI recommendation, methodologies employed for POI recommendation, and different tasks in POI recommendation. Specifically, we propose three taxonomies to classify POI recommendation systems. First, we categorize the systems by the influential factors check-in characteristics, including the geographical information, social relationship, temporal influence, and content indications. Second, we categorize the systems by the methodology, including systems modeled by fused methods and joint methods. Third, we categorize the systems as general POI recommendation and successive POI recommendation by subtle differences in the recommendation task whether to be bias to the recent check-in. For each category, we summarize the contributions and system features, and highlight the representative work. Moreover, we discuss the available data sets and the popular metrics. Finally, we point out the possible future directions in this area and conclude this survey.
A Survey of Tensor Methods Matrix decompositions have always been at the heart of signal, circuit and system theory. In particular, the Singular Value Decomposition (SVD) has been an important tool. There is currently a shift of paradigm in the algebraic foundations of these fields. Quite recently, Nonnegative Matrix Factorization (NMF) has been shown to outperform SVD at a number of tasks. Increasing research efforts are spent on the study and application of decompositions of higher-order tensors or multi-way arrays. This paper is a partial survey on tensor generalizations of the SVD and their applications. We also touch on Nonnegative Tensor Factorizations.
A survey of transfer learning Machine learning and data mining techniques have been used in numerous real-world applications. An assumption of traditional machine learning methodologies is the training data and testing data are taken from the same domain, such that the input feature space and data distribution characteristics are the same. However, in some real-world machine learning scenarios, this assumption does not hold. There are cases where training data is expensive or difficult to collect. Therefore, there is a need to create high-performance learners trained with more easily obtained data from different domains. This methodology is referred to as transfer learning. This survey paper formally defines transfer learning, presents information on current solutions, and reviews applications applied to transfer learning. Lastly, there is information listed on software downloads for various transfer learning solutions and a discussion of possible future research work. The transfer learning solutions surveyed are independent of data size and can be applied to big data environments.
A Survey on Load Balancing Algorithms for VM Placement in Cloud Computing The emergence of cloud computing based on virtualization technologies brings huge opportunities to host virtual resource at low cost without the need of owning any infrastructure. Virtualization technologies enable users to acquire, configure and be charged on pay-per-use basis. However, Cloud data centers mostly comprise heterogeneous commodity servers hosting multiple virtual machines (VMs) with potential various specifications and fluctuating resource usages, which may cause imbalanced resource utilization within servers that may lead to performance degradation and service level agreements (SLAs) violations. To achieve efficient scheduling, these challenges should be addressed and solved by using load balancing strategies, which have been proved to be NP-hard problem. From multiple perspectives, this work identifies the challenges and analyzes existing algorithms for allocating VMs to PMs in infrastructure Clouds, especially focuses on load balancing. A detailed classification targeting load balancing algorithms for VM placement in cloud data centers is investigated and the surveyed algorithms are classified according to the classification. The goal of this paper is to provide a comprehensive and comparative understanding of existing literature and aid researchers by providing an insight for potential future enhancements.
A Tour through the Visualization Zoo A survey of powerful visualization techniques, from the obvious to the obscure
A Tutorial for Reinforcement Learning The tutorial is written for those who would like an introduction to reinforcement learning (RL). The aim is to provide an intuitive presentation of the ideas rather than concentrate on the deeper mathematics underlying the topic. RL is generally used to solve the so-called Markov decision problem (MDP). In other words, the problem that you are attempting to solve with RL should be an MDP or its variant. The theory of RL relies on dynamic programming (DP) and artificial intelligence (AI). We will begin with a quick description of MDPs. We will discuss what we mean by “complex” and “large-scale” MDPs. Then we will explain why RL is needed to solve complex and large-scale MDPs. The semi-Markov decision problem (SMDP) will also be covered.
A tutorial on active learning (Slide Deck)
A Tutorial on Bayesian Belief Networks This tutorial provides an overview of Bayesian belief networks. The subject is introduced through a discussion on probabilistic models that covers probability language, dependency models, graphical representations of models, and belief networks as a particular representation of probabilistic models. The general class of causal belief networks is presented, and the concept of d-separation and its relationship with independence in probabilistic models is introduced. This leads to a description of Bayesian belief networks as a specific class of causal belief networks, with detailed discussion on belief propagation and practical network design. The target recognition problem is presented as an example of the application of Bayesian belief networks to a real problem, and the tutorial concludes with a brief summary of Bayesian belief networks.
A Tutorial on Deep Learning Part 1: Nonlinear Classi ers and The Backpropagation Algorithm In the past few years, Deep Learning has generated much excitement in Machine Learning and industry thanks to many breakthrough results in speech recognition, computer vision and text processing. So, what is Deep Learning? For many researchers, Deep Learning is another name for a set of algorithms that use a neural network as an architecture. Even though neural networks have a long history, they became more successful in recent years due to the availability of inexpensive, parallel hardware (GPUs, computer clusters) and massive amounts of data. In this tutorial, we will start with the concept of a linear classi er and use that to develop the concept of neural networks. I will present two key algorithms in learning with neural networks: the stochastic gradient descent algorithm and the backpropagation algorithm. Towards the end of the tutorial, I will explain some simple tricks and recent advances that improve neural networks and their training. For that, let’s start with a simple example.
A Tutorial on Deep Learning Part 2: Autoencoders, Convolutional Neural Networks and Recurrent Neural Networks In the previous tutorial, I discussed the use of deep networks to classify nonlinear data. In addition to their ability to handle nonlinear data, deep networks also have a special strength in their exibility which sets them apart from other tranditional machine learning models: we can modify them in many ways to suit our tasks. In the following, I will discuss three most common modi cations:
• Unsupervised learning and data compression via autoencoders which require modi cations in the loss function,
• Translational invariance via convolutional neural networks which require modi cations in the network architecture,
• Variable-sized sequence prediction via recurrent neural networks which require modi cations in the network architecture.
The exibility of neural networks is a very powerful property. In many cases, these changes lead to great improvements in accuracy compared to basic models that we discussed in the previous tutorial. In the last part of the tutorial, I will also explain how to parallelize the training of neural networks. This is also an important topic because parallelizing neural networks has played an important role in the current deep learning movement.
A tutorial on geometric programming A geometric program (GP) is a type of mathematical optimization problem characterized by objective and constraint functions that have a special form. Recently developed solution methods can solve even large-scale GPs extremely efficiently and reliably; at the same time a number of practical problems, particularly in circuit design, have been found to be equivalent to (or well approximated by) GPs. Putting these two together, we get effective solutions for the practical problems. The basic approach in GP modeling is to attempt to express a practical problem, such as an engineering analysis or design problem, in GP format. In the best case, this formulation is exact; when this is not possible, we settle for an approximate formulation. This tutorial paper collects together in one place the basic background material needed to do GP modeling. We start with the basic definitions and facts, and some methods used to transform problems into GP format.We show how to recognize functions and problems compatible with GP, and how to approximate functions or data in a form compatible with GP (when this is possible). We give some simple and representative examples, and also describe some common extensions of GP, along with methods for solving (or approximately solving) them.
A Tutorial on Spectral Clustering In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. On the first glance spectral clustering appears slightly mysterious, and it is not obvious to see why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the ‘echo state network’ approach (Slide Deck)
A weakly informative default prior distribution for logistic and other regression models We propose a new prior distribution for classical (nonhierarchical) logistic regression models, constructed by first scaling all nonbinary variables to have mean 0 and standard deviation 0.5, and then placing independent Student-t prior distributions on the coefficients. As a default choice, we recommend the Cauchy distribution with center 0 and scale 2.5, which in the simplest setting is a longer-tailed version of the distribution attained by assuming one-half additional success and one-half additional failure in a logistic regression. Cross-validation on a corpus of datasets shows the Cauchy class of prior distributions to outperform existing implementations of Gaussian and Laplace priors. We recommend this prior distribution as a default choice for routine applied use. It has the advantage of always giving answers, even when there is complete separation in logistic regression (a common problem, even when the sample size is large and the number of predictors is small), and also automatically applying more shrinkage to higher-order interactions. This can be useful in routine data analysis as well as in automated procedures such as chained equations for missing-data imputation. We implement a procedure to fit generalized linear models in R with the Student-t prior distribution by incorporating an approximate EM algorithm into the usual iteratively weighted least squares. We illustrate with several applications, including a series of logistic regressions predicting voting preferences, a small bioassay experiment, and an imputation model for a public health data set.
Ad Click Prediction: a View from the Trenches Predicting ad click-through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry. We present a selection of case studies and topics drawn from recent experiments in the setting of a deployed CTR prediction system. These include improvements in the context of traditional supervised learning based on an FTRL-Proximal online learning algorithm (which has excellent sparsity and convergence properties) and the use of per-coordinate learning rates. We also explore some of the challenges that arise in a real-world system that may appear at first to be outside the domain of traditional machine learning research. These include useful tricks for memory savings, methods for as- sessing and visualizing performance, practical methods for providing con dence estimates for predicted probabilities, calibration methods, and methods for automated management of features. Finally, we also detail several directions that did not turn out to be bene cial for us, despite promis- ing results elsewhere in the literature. The goal of this paper is to highlight the close relationship between theoretical ad- vances and practical engineering in this industrial setting, and to show the depth of challenges that appear when applying traditional machine learning methods in a complex dynamic system.
ADADELTA: An Adaptive Learning Rate Method We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
Addressing the “Big Data” Issue: What You Need to Know These days, you’re probably hearing a lot of hype about “big data.” Vendors are currently hawking a wealth of new tools, all of which promise to help your organization unlock previously inaccessible insights from your proprietary information. According to the authors, there is no doubt that big data, i.e., organization-wide data that’s being managed in a centralized repository, can yield valuable discoveries that will result in improved products and performance – if properly analyzed. Nonetheless, you must look before you leap. First, is your company culture ready for such a move? How will data managers be affected when scores of discrete data silos are gathered and reviewed as a whole? How will you involve leadership and others in ongoing decision-making processes? How will you choose your architecture and tools from the dizzying array of options that are currently available? How will you stay up-to-date in this rapidly evolving field? Finally, how will you train your company’s users so that they can actually leverage the new capabilities? This ExecBlueprint explores these and other key concerns.
Advanced Analytics with the SAP HANA Database MapReduce as a programming paradigm provides a simple-to-use yet very powerful abstraction encapsulated in two second-order functions: Map and Reduce. As such, they allow defining single sequentially processed tasks while at the same time hiding many of the framework details about how those tasks are parallelized and scaled out. In this paper we discuss four processing patterns in the context of the distributed SAP HANA database that go beyond the classic MapReduce paradigm. We illustrate them using some typical Machine Learning algorithms and present experimental results that demonstrate how the data flows scale out with the number of parallel tasks.
Agent-Based Modeling and Simulation Agent-based modeling and simulation (ABMS) is a new approach to modeling systems comprised of autonomous, interacting agents. Computational advances have made possible a growing number of agent-based models across a variety of application domains. Applications range from modeling agent behavior in the stock market, supply chains, and consumer markets, to predicting the spread of epidemics, mitigating the threat of bio-warfare, and understanding the factors that may be responsible for the fall of ancient civilizations. Such progress suggests the potential of ABMS to have far-reaching effects on the way that businesses use computers to support decision-making and researchers use agent-based models as electronic laboratories. Some contend that ABMS “is a third way of doing science” and could augment traditional deductive and inductive reasoning as discovery methods. This brief tutorial introduces agent-based modeling by describing the foundations of ABMS, discuss-ing some illustrative applications, and addressing toolkits and methods for developing agent-based models.
Agile business intelligence: reshaping the landscape The last few years have brought a wave of changes for business intelligence (BI) solutions. A set of redefining technological trends is reshaping the landscape from a slow and cumbersome process practiced mainly by large enterprises to a much more flexible, agile process that mid-market companies as well as individuals can utilize. This report explores the key features that influence the evolution of agile BI and takes a look at the BI landscape under this light. At first glance, polarization seems to exist between traditional BI vendors, who are focused on extract, transform, and load (ETL) and reporting, and the newcomers, who are focused on data exploration and visualization, but a closer look reveals that, in fact, they converge as adoption of useful features is taking place across the spectrum.
Algorithm quasi-optimal (AQ) learning The algorithm quasi-optimal (AQ) is a powerful machine learning methodology aimed at learning symbolic decision rules from a set of examples and counterexamples. It was first proposed in the late 1960s to solve the Boolean function satisfiability problem and further refined over the following decade to solve the general covering problem. In its newest implementations, it is a powerful but yet little explored methodology for symbolic machine learning classification. It has been applied to solve several problems from different domains, including the generation of individuals within an evolutionary computation framework. The current article introduces the main concepts of the AQ methodology and describes AQ for source detection(AQ4SD), a tailored implementation of the AQ methodology to solve the problem of finding the sources of atmospheric releases using distributed sensor measurements. The AQ4SD program is tested to find the sources of all the releases of the prairie grass field experiment.
Algorithms and Methods in Recommender Systems Today, there is a big variety of different approaches and algorithms of data filtering and recommendations giving. In this paper we describe traditional approaches and explain what kind of modern approaches have been developed lately. All the paper long we will try to explain approaches and their problems based on movie recommendations. In the end we will show the main challenges recommender systems come across.
Algorithms for Active Learning This dissertation explores both the algorithmic and statistical aspects of active learning for binary classification. What are effective procedures for determining which data to label? How can these procedures take advantage of the interactive learning process, and in what circumstances do they yield improved learning performance compared to standard passive learners? To answer these questions, we develop and rigorously analyze a broad class of general active learning methods that address the essential algorithmic and statistical difficulties of the problem.
Algorithms for Reinforcement Learning Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner’s predictions. Further, the predictions may have long term e ects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop e cient learning algorithms, as well as to understand the algorithms’ merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in arti cial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas together with a large number of state of the art algorithms, followed by the discussion of their theoretical properties and limitations.
Algorithms in Data Mining using Matrix and Tensor Methods In many elds of science, engineering, and economics large amounts of data are stored and there is a need to analyze these data in order to extract information for various purposes. Data mining is a general concept involving di erent tools for performing this kind of analysis. The development of mathematical models and e cient algorithms is of key importance. In this thesis we discuss algorithms for the reduced rank regression problem and algorithms for the computation of the best multilinear rank approximation of tensors. Recommendations: Item-to-Item Collaborative Filtering Recommendation algorithms are best known for their use on e-commerce Web sites, where they use input about a customer’s interests to generate a list of recommended items. Many applications use only the items that customers purchase and explicitly rate to represent their interests, but they can also use other attributes, including items viewed, demographic data, subject interests, and favorite artists. At, we use recommendation algorithms to personalize the online store for each customer. The store radically changes based on customer interests, showing programming titles to a software engineer and baby toys to a new mother. The click-through and conversion rates – two important measures of Web-based and email advertising effectiveness – vastly exceed those of untargeted content such as banner advertisements and top-seller lists….
An Analysis of Machine Learning Intelligence Deep neural networks (DNNs) have set state of the art results in many machine learning and NLP tasks. However, we do not have a strong understanding of what DNN models learn. In this paper, we examine learning in DNNs through analysis of their outputs. We compare DNN performance directly to a human population, and use characteristics of individual data points such as difficulty to see how well models perform on easy and hard examples. We investigate how training size and the incorporation of noise affect a DNN’s ability to generalize and learn. Our experiments show that unlike traditional machine learning models (e.g., Naive Bayes, Decision Trees), DNNs exhibit human-like learning properties. As they are trained with more data, they are more able to distinguish between easy and difficult items, and performance on easy items improves at a higher rate than difficult items. We find that different DNN models exhibit different strengths in learning and are robust to noise in training data.
An Economist’s Guide to Visualizing Data Once upon a time, a picture was worth a thousand words. But with online news, blogs, and social media, a good picture can now be worth so much more. Economists who want to disseminate their research, both inside and outside the seminar room, should invest some time in thinking about how to construct compelling and effective graphics. An effective graph should tap into the brain’s “pre-attentive visual processing” (Few 2004; Healey and Enns 2012). Because our eyes detect a limited set of visual characteristics, such as shape or contrast, we easily combine those characteristics and unconsciously perceive them as an image. In contrast to “attentive processing” – the conscious part of perception that allows us to perceive things serially – pre-attentive processing is done in parallel and is much faster. Pre-attentive processing allows the reader to perceive multiple basic visual elements simultaneously….
An Example Inference Task: Clustering Human brains are good at nding regularities in data. One way of expressing regularity is to put a set of objects into groups that are similar to each other. For example, biologists have found that most objects in the natural world fall into one of two categories: things that are brown and run away, and things that are green and don’t run away. The rst group they call animals, and the second, plants. We’ll call this operation of grouping things together clustering. If the biologist further sub-divides the cluster of plants into sub- clusters, we would call this `hierarchical clustering’; but we won’t be talking about hierarchical clustering yet. In this chapter we’ll just discuss ways to take a set of N objects and group them into K clusters.
An Impossibility Theorem for Clustering Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and pro- foundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median.
An Introduction to Advanced Analytics Advanced Analytics is “the analysis of all kinds of data using sophisticated quantitative methods (for example, statistics, descriptive and predictive data mining, simulation and optimization) to produce insights that traditional approaches to business intelligence (BI) – such as query and reporting – are unlikely to discover.”
An Introduction to Bayesian Networks: Concepts and Learning from Data (Slide Deck)
An Introduction to Cluster Analysis for Data Mining Cluster analysis divides data into meaningful or useful groups (clusters). If meaningful clusters are the goal, then the resulting clusters should capture the “natural” structure of the data. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, and to provide a grouping of spatial locations prone to earthquakes. However, in other cases, cluster analysis is only a useful starting point for other purposes, e.g., data compression or efficiently finding the nearest neighbors of points. Whether for understanding or utility, cluster analysis has long been used in a wide variety of fields: psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning, and data mining. The scope of this paper is modest: to provide an introduction to cluster analysis in the field of data mining, where we define data mining to be the discovery of useful, but non-obvious, information or patterns in large collections of data. Much of this paper is necessarily consumed with providing a general background for cluster analysis, but we also discuss a number of clustering techniques that have recently been developed specifically for data mining. While the paper strives to be self-contained from a conceptual point of view, many details have been omitted. Consequently, many references to relevant books and papers are provided.
An Introduction To Compressive Sampling This article surveys the theory of compressive sampling, also known as compressed sensing or CS, a novel sensing/sampling paradigm that goes against the common wisdom in data acquisition. CS theory asserts that one can recover certain signals and images from far fewer samples or measurements than traditional methods use. To make this possible, CS relies on two principles: sparsity, which pertains to the signals of interest, and incoherence, which pertains to the sensing modality.
An Introduction to Factor Graphs A large variety of algorithms in coding, signal processing, and artificial intelligence may be viewed as instances of the summary-product algorithm (or belief/probability propagation algorithm), which operates by message passing in a graphical model. Specific instances of such algorithms include Kalman filtering and smoothing; the forward–backward algorithm for hidden Markov models; probability propagation in Bayesian networks; and decoding algorithms for error-correcting codes such as the Viterbi algorithm, the BCJR algorithm, and the iterative decoding of turbo codes, low-density parity-check (LDPC) codes, and similar codes. New algorithms for complex detection and estimation problems can also be derived as instances of the summary-product algorithm. In this article, we give an introduction to this unified perspective in terms of (Forney-style) factor graphs.
An introduction to graphical models The following quotation, from the Preface provides a very concise introduction to graphical models: Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering { uncertainty and complexity { and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity { a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of e cient general-purpose algorithms. Many of the classical multivariate probabalistic systems studied in elds such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism { examples include mixture models, factor analysis, hidden Markov models, Kalman lters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages { in particular, specialized techniques that have been developed in one eld can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems.
An introduction to graphical models (Slide Deck)
An introduction to high-dimensional statistics In this note, we aim to give a very brief introduction to high-dimensional statistics. Rather than attempting to give an overview of this vast area, we will explain what is meant by highdimensional data and then focus on two methods which have been introduced to deal with this sort of data. Many of the state of the art techniques used in high-dimensional statistics today are based on these two core methods. We begin with a quick recap of least squares regression.
An Introduction to Latent Semantic Analysis The question of knowledge induction, i.e. how children are able to learn so much about, say, what words mean without any explicit instruction, is one that has vexed philosophers, linguistics, and psychologists alike. Indeed, inferring the vast amount of knowledge that children learn almost effortlessly from an apparently ‘impoverished stimulus’ seems paradoxical. The Latent Semantic Analysis model (Landauer & Dumais, 1997) is a theory for how meaning representations might be learned from encountering large samples of language without explicit directions as to how it is structured. To do this, LSA makes two assumptions about how the meaning of linguistic expressions is present in the distributional patterns of simple expressions (e.g words) within more complex expressions (e.g. sentences and paragraphs) viewed across many samples of language….
An Introduction to Latent Variable Mixture Modeling (Part 1): Overview and Cross-Sectional Latent Class and Latent Profile Analyses Objective: Pediatric psychologists are often interested in finding patterns in heterogeneous cross-sectional data. Latent variable mixture modeling is an emerging person-centered statistical approach that models heterogeneity by classifying individuals into unobserved groupings (latent classes) with similar (more homogenous) patterns. The purpose of this article is to offer a nontechnical introduction to cross-sectional mixture modeling.
Method: An overview of latent variable mixture modeling is provided and 2 cross-sectional examples are reviewed and distinguished.
Results: Step-by-step pediatric psychology examples of latent class and latent profile analyses are provided using the Early Childhood Longitudinal Study–Kindergarten Class of 1998–1999 data file.
Conclusions: Latent variable mixture modeling is a technique that is useful to pediatric psychologists who wish to find groupings of individuals who share similar data patterns to determine the extent to which these patterns may relate to variables of interest.
An Introduction to Latent Variable Mixture Modeling (Part 2): Longitudinal Latent Class Growth Analysis and Growth Mixture Models Objective: Pediatric psychologists are often interested in finding patterns in heterogeneous longitudinal data. Latent Variable Mixture Modeling is an emerging statistical approach that models such heterogeneity by classifying individuals into unobserved groupings (latent classes) with similar (more homogenous) patterns. The purpose of the second of a two article set is to offer a nontechnical introduction to longitudinal latent variable mixture modeling.
Methods: 3 latent variable approaches to modeling longitudinal data are reviewed and distinguished.
Results: Step-by-step pediatric psychology examples of latent growth curve modeling, latent class growth analysis, and growth mixture modeling are provided using the Early Childhood Longitudinal Study-Kindergarten Class of 1998–99 data file.
Conclusions: Latent variable mixture modeling is a technique that is useful to pediatric psychologists who wish to find groupings of individuals who share similar longitudinal data patterns to determine the extent to which these patterns may relate to variables of interest.
An introduction to modern missing data analyses A great deal of recent methodological research has focused on two modern missing data analysis methods: maximum likelihood and multiple imputation. These approaches are advantageous to traditional techniques (e.g. deletion and mean imputation techniques) because they require less stringent assumptions and mitigate the pitfalls of traditional techniques. This article explains the theoretical underpinnings of missing data analyses, gives an overview of traditional missing data techniques, and provides accessible descriptions of maximum likelihood and multiple imputation. In particular, this article focuses on maximum likelihood estimation and presents two analysis examples from the Longitudinal Study of American Youth data. One of these examples includes a description of the use of auxiliary variables. Finally, the paper illustrates ways that researchers can use intentional, or planned, missing data to enhance their research designs.
An Introduction to Multivariate Statistics The term “multivariate statistics” is appropriately used to include all statistics where there are more than two variables simultaneously analyzed. You are already familiar with bivariate statistics such as the Pearson product moment correlation coefficient and the independent groups t-test. A one-way ANOVA with 3 or more treatment groups might also be considered a bivariate design, since there are two variables: one independent variable and one dependent variable. Statistically, one could consider the one-way ANOVA as either a bivariate curvilinear regression or as a multiple regression with the K level categorical independent variable dummy coded into K-1 dichotomous variables.
An Introduction to Neural Networks An accurate forecast into the future can offer tremendous value in areas as diverse as financial market price movements, financial expense budget forecasts, website clickthrough likelihoods, insurance risk, and drug compound efficacy, to name just a few. Many algorithm techniques, ranging from regression analysis to ARIMA for time series, among others, are regularly used to generate forecasts. A neural network approach provides a forecasting technique that can operate in circumstances where classical techniques cannot perform or do not generate the desired accuracy in a forecast.
An Introduction to Ontology Learning Ever since the early days of Artificial Intelligence and the development of the first knowledge-based systems in the 70s people have dreamt of self-learning machines. When knowledge-based systems grew larger and the commercial interest in these technologies increased, people became aware of the knowledge acquisition bottleneck and the necessity to (partly) automatize the creation and maintenance of knowledge bases. Today, many applications which exhibit ’intelligent’ behavior thanks to symbolic knowledge representation and logical inference rely on ontologies and the standards provided by the World Wide Web Committee (W3C). Supporting the construction of ontologies and populating them with instantiations of both concepts and relations, commonly referred to as ontology learning. Early research in ontology learning has concentrated on the extraction of facts or schema-level knowledge from textual resources building upon earlier work in the field of computational linguistics and lexical acquisition. However, as we will show in this book, ontology learning is a very diverse and interdisciplinary field of research. Ontology learning approaches are as heterogeneous as the sources of data on the web, and as different from one another as the types of knowledge representations called “ontologies”. In the remainder of this introduction, we briefly summarize the state-of-the-art in ontology learning and elaborate on what we consider as the key challenges for current and future ontology learning research.
An introduction to ROC analysis Receiver operating characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.
An Introduction to Variable and Feature Selection Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
An Introduction to Visualizing Data The purpose of this document is to provide an introduction to the theory behind visualizing data. After studying the works of many talented people I decided to summarize the key points of information into this single paper. If you found this document interesting please take some time to look at the list of resources that I used (see Chapter 8) because I could never have created this without the excellent work done by others.
An Overview of Multi-Processor Approximate Message Passing Approximate message passing (AMP) is an algorithmic framework for solving linear inverse problems from noisy measurements, with exciting applications such as reconstructing images, audio, hyper spectral images, and various other signals, including those acquired in compressive signal acquisiton systems. The growing prevalence of big data systems has increased interest in large-scale problems, which may involve huge measurement matrices that are unsuitable for conventional computing systems. To address the challenge of large-scale processing, multiprocessor (MP) versions of AMP have been developed. We provide an overview of two such MP-AMP variants. In row-MP-AMP, each computing node stores a subset of the rows of the matrix and processes corresponding measurements. In column- MP-AMP, each node stores a subset of columns, and is solely responsible for reconstructing a portion of the signal. We will discuss pros and cons of both approaches, summarize recent research results for each, and explain when each one may be a viable approach. Aspects that are highlighted include some recent results on state evolution for both MP-AMP algorithms, and the use of data compression to reduce communication in the MP network.
An Overview of Statistical Learning Theory Statistical learning theory was introduced in the late 1960’s. Until the 1990’s it was a purely theoretical analysis of the problem of function estimation from a given collection of data. In the middle of the 1990’s new types of learning algorithms (called support vector machines) based on the developed theory were proposed. This made statistical learning theory not only a tool for the theoretical analysis but also a tool for creating practical algorithms for estimating multidimensional functions. This article presents a very general overview of statistical learning theory including both theoretical and algorithmic aspects of the theory. The goal of this overview is to demonstrate how the abstract learning theory established conditions for generalization which are more general than those discussed in classical statistical paradigms and how the understanding of these conditions inspired new algorithmic approaches to function estimation problems. A more detailed overview of the theory (without proofs) can be found in Vapnik (1995). In Vapnik (1998) one can find detailed description of the theory (including proofs).
Analysing spatial point patterns in R This is a detailed set of notes for a workshop on Analysing spatial point patterns in R, presented by the author in Australia and New Zealand since 2006. The goal of the workshop is to equip researchers with a range of practical techniques for the statistical analysis of spatial point patterns. Some of the techniques are well established in the applications literature, while some are very recent developments. The workshop is based on spatstat, a contributed library for the statistical package R, which is free open source software. Topics covered include: statistical formulation and methodological issues; data input and handling; R concepts such as classes and methods; exploratory data analysis; nonparametric intensity and risk estimates; goodness-of-fit testing for Complete Spatial Randomness; maximum likelihood inference for Poisson processes; spatial logistic regression; model validation for Poisson processes; exploratory analysis of dependence; distance methods and summary functions such as Ripley’s K function; simulation techniques; non-Poisson point process models; fitting models using summary statistics; LISA and local analysis; inhomogeneous K-functions; Gibbs point process models; fitting Gibbs models; simulating Gibbs models; validating Gibbs models; multitype and marked point patterns; exploratory analysis of multitype and marked point patterns; multitype Poisson process models and maximum likelihood inference; multitype Gibbs process models and maximum pseudolikelihood; line segment patterns, 3-dimensional point patterns, multidimensional space-time point patterns, replicated point patterns, and stochastic geometry methods.
Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining The proliferation of textual data in business is overwhelming. Unstructured textual data is being constantly generated via call center logs, emails, documents on the web, blogs, tweets, customer comments, customer reviews, and so on. While the amount of textual data is increasing rapidly, businesses’ ability to summarize, understand, and make sense of such data for making better business decisions remain challenging. This paper takes a quick look at how to organize and analyze textual data for extracting insightful customer intelligence from a large collection of documents and for using such information to improve business operations and performance. Multiple business applications of case studies using real data that demonstrate applications of text analytics and sentiment mining using SAS Text Miner and SAS Sentiment Analysis Studio are presented. While SAS products are used as tools for demonstration only, the topics and theories covered are generic (not tool specific).
Analytical Skills, Tools & Attitudes 2013: Analytics capabilities needed now and in the future Organizations continue to invest more in analytics, but increasingly there is recognition that a shortage of analytic talent is holding back even greater investment. Lavastorm Analytics polled more than 425 people in the analytics community about whether their organization needs more analytic resources or skills and which skills are valued most and are most urgently needed. Survey respondents included business analysts, technologists, data analytics professionals, managers, and C-level executives across a broad variety of industries. The top findings were:
– According to the survey respondents, a lack of skills/training/education is the biggest factor holding back organizations from using analytics more.
– Skills most urgently needed in their organizations are Statistics, math or other quantitative skills; Analytic tool training; and Critical thinking.
– Lack of funding or resources, however, also has a significant impact on adoption of analytics to drive day-to-day decisions. Lesser factors also include inadequate support from executives and data that is not integrated.
Analytics 3.0 In the new era, big data will power consumer products and services
Analytics: The real-world use of big data “Big data” – which admittedly means many things to many people – is no longer confined to the realm of technology. Today it is a business priority, given its ability to profoundly affect commerce in the globally integrated economy. In addition to providing solutions to long-standing business challenges, big data inspires new ways to transform processes, organizations, entire industries and even society itself. Yet extensive media coverage makes it hard to distinguish hype from reality – what is really happening? Our newest research finds that organizations are using big data to target customer-centric outcomes, tap into internal data and build a better information ecosystem.
Analyzing the Analyzers Binita, Chao, Dmitri, and Rebecca are data scientists. What does that statement tell you about them? Probably not as much as you’d like. You know they probably know something about statistics, programming, and data visualization. You’d hope that they had some experience finding insights from data, maybe even “big data.” But if you’re trying to find the best person for a job, you need to be more specific than just “doctor,” or “athlete,” or “data scientist.” And that’s a problem. Finding the right people for a task is all about efficient communication and, without the appropriate shared vocabulary, data science talent and data science problems are too often kept apart….
Anomaly Detection: A Survey Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
APACHE DRILL: Interactive Ad-Hoc Analysis at Scale Apache Drill is a distributed system for interactive ad-hoc analysis of large-scale datasets. Designed to handle up to petabytes of data spread across thousands of servers, the goal of Drill is to respond to ad-hoc queries in a lowlatency manner. In this article, we introduce Drill’s architecture, discuss its extensibility points, and put it into the context of the emerging offerings in the interactive analytics realm.
Applied Data Science in Europe Google Trends and other IT fever charts rate Data Science among the most rapidly emerging and promising fields that expand around computer science. Although Data Science draws on content from established fields like artificial intelligence, statistics, databases, visualization and many more, industry is demanding for trained data scientists that no one seems able to deliver. This is due to the pace at which the field has expanded and the corresponding lack of curricula; the unique skill set, which is inherently multi-disciplinary; and the translation work (from the US web economy to other ecosystems) necessary to realize the recognized world-wide potential of applying analytics to all sorts of data. In this contribution we draw from our experiences in establishing an inter-disciplinary Data Science lab in order to highlight the challenges and potential remedies for Data Science in Europe. We discuss our role as academia in the light of the potential societal/economic impact as well as the challenges in organizational leadership tied to such inter-disciplinary work.
Architecting a High Performance Storage System Designing a large-scale, high-performance data storage system presents significant challenges. This paper describes a step-by-step approach to designing such a system and presents an iterative methodology that applies at both the component level and the system level. A detailed case study using the methodology described to design a Lustre storage system is presented.
Are You a Bayesian or a Frequentist? (Slide Deck)
Assessing Your Business Analytics Initiatives – Eight Metrics That Matter It’s no secret that using analytics to uncover meaningful insights from data is crucial for making fact-based decisions. Now considered mainstream, the business analytics market worldwide is expected to exceed $50 billion by the year 2016.1 Yet when it comes to making analytics work, not all organizations are equal. In fact, despite the transformative power of big data and analytics, many organizations still struggle to wring value from their information. The complexities of dealing with big data, integrating technologies, finding analytical talent and challenging corporate culture are the main pitfalls to the successful use of analytics within organizations. The management of information – including the analytics used to transform it – is an evolutionary process, and organizations are at various levels of this evolution. Those wanting to advance analytics to a new level need to understand their analytics activities across the organization, from both an IT and business perspective. Toward that end, an assessment focusing on eight key analytics metrics can be used to identify strengths and areas for improvement in the analytics life cycle.
At what sample size do correlations stabilize? Sample correlations converge to the population value with increasing sample size, but the estimates are often inaccurate in small samples. In this report we use Monte-Carlo simulations to determine the critical sample size from which on the magnitude of a correlation can be expected to be stable. The necessary sample size to achieve stable estimates for correlations depends on the effect size, the width of the corridor of stability (i.e., a corridor around the true value where deviations are tolerated), and the requested confidence that the trajectory does not leave this corridor any more. Results indicate that in typical scenarios the sample size should approach 250 for stable estimates.
Automatic Conversion of Tables to LongForm Dataframes TableToLongForm automatically converts hierarchical Tables intended for a human reader into a simple LongForm Dataframe that is machine readable, hence enabling much greater utilisation of the data. It does this by recognising positional cues present in the hierarchical Table (which would normally be interpreted visually by the human brain) to decompose, then reconstruct the data into a LongForm Dataframe. The article motivates the benefit of such a conversion with an example Table, followed by a short user manual, which includes a comparison between the simple one argument call to TableToLongForm, with code for an equivalent manual conversion. The article then explores the types of Tables the package can convert by providing a gallery of all recognised patterns. It finishes with a discussion of available diagnostic methods and future work.
Automatic Keyphrase Extraction: A Survey of the State of the Art While automatic keyphrase extraction has been examined extensively, state-of-theart performance on this task is still much lower than that on many core natural language processing tasks. We present a survey of the state of the art in automatic keyphrase extraction, examining the major sources of errors made by existing systems and discussing the challenges ahead.
Automatic Sarcasm Detection: A Survey Automatic detection of sarcasm has witnessed interest from the sentiment analysis research community. With diverse approaches, datasets and analyses that have been reported, there is an essential need to have a collective understanding of the research in this area. In this survey of automatic sarcasm detection, we describe datasets, approaches (both supervised and rule-based), and trends in sarcasm detection research. We also present a research matrix that summarizes past work, and list pointers to future work.
Automatic Tag Recommendation Algorithms for Social Recommender Systems The emergence of Web 2.0 and the consequent success of social network websites such as and Flickr introduce us to a new concept called social bookmarking, or tagging in short. Tagging can be seen as the action of connecting a relevant user-defined keyword to a document, image or video, which helps user to better organize and share their collections of interesting stuff. With the rapid growth of Web 2.0, tagged data is becoming more and more abundant on the social network websites. An interesting problem is how to automate the process of making tag recommendations to users when a new resource becomes available. In this paper, we address the issue of tag recommendation from a machine learning perspective of view. From our empirical observation of two large-scale data sets, we first argue that the user-centered approach for tag recommendation is not very effective in practice. Consequently, we propose two novel document-centered approaches that are capable of making effective and efficient tag recommendations in real scenarios. The first graph-based method represents the tagged data into two bipartite graphs of (document, tag) and (document, word), then finds document topics by leveraging graph partitioning algorithms. The second prototype-based method aims at finding the most representative documents within the data collections and advocates a sparse multi-class Gaussian process classifier for efficient document classification. For both methods, tags are ranked within each topic cluster/class by a novel ranking method. Recommendations are performed by first classifying a new document into one or more topic clusters/classes, and then selecting the most relevant tags from those clusters/classes as machine-recommended tags. Experiments on real-world data from, CiteULike and BibSonomy examine the quality of tag recommendation as well as the efficiency of our recommendation algorithms. The results suggest that our document-centered models can substantially improve the performance of tag recommendations when compared to the user-centered methods, as well as topic models LDA and SVM classifiers.
Average Predictive Comparisons for models with nonlinearity, interactions, and variance components In a predictive model, what is the expected difference in the outcome associated with a unit difference in one of the inputs? In a linear regression model without interactions, this average predictive comparison is simply a regression coefficient (with associated uncertainty). In a model with nonlinearity or interactions, however, the average predictive comparison in general depends on the values of the predictors. We consider various definitions based on averages over a population distribution of the predictors, and we compute standard errors based on uncertainty in model parameters. We illustrate with a study of criminal justice data for urban counties in the United States. The outcome of interest measures whether a convicted felon received a prison sentence rather than a jail or non-custodial sentence, with predictors available at both individual and county levels.We fit three models: (1)a hierarchical logistic regression with varying coefficients for the within-county intercepts as well as for each individual predictor; (2)a hierarchical model with varying intercepts only; and (3)a nonhierarchical model that ignores themultilevel nature of the data. The regression coefficients have different interpretations for the different models; in contrast, the models can be compared directly using predictive comparisons. Furthermore, predictive comparisons clarify the interplay between the individual and county predictors for the hierarchical models and also illustrate the relative size of varying county effects.
Avoiding the Barriers of In-Memory Business Intelligence: Making Data Discovery Scalable When looking at the growth rates of the business intelligence platform space, it is apparent that acquisitions of new business intelligence tools have shifted dramatically from traditional data visualization and aggregation use cases to newer data discovery implementations. This shift toward data discovery use cases has been driven by two key factors: faster implementation times and the ability to visualize and manipulate data as quickly as an analyst can click a mouse. The improvements in implementation speeds stem from the use of architectures that access source data directly without having to first aggregate all the data in a central location such as an enterprise data warehouse or departmental data mart. The promise of fast manipulation of data has largely been accomplished by employing in-memory data management models to exploit the speed advantage of accessing data from server memory over traditional disk-based approaches. The “physics” of data access favors in-memory data management models. However, in-memory techniques are not without drawbacks. As companies attempt to evolve from small departmental projects to broader division-wide or enterprise-wide initiatives, increasing data volumes and the impact of increasing data consumer counts challenge the limits of early in-memory implementations. These challenges raise serious questions that should be considered by any organization considering in-memory techniques for business intelligence platforms.


Bayesian Computation Via Markov Chain Monte Carlo Markov chain Monte Carlo (MCMC) algorithms are an indispensable tool for performing Bayesian inference. This review discusses widely used sampling algorithms and illustrates their implementation on a probit regression model for lupus data. The examples considered highlight the importance of tuning the simulation parameters and underscore the important contributions of modern developments such as adaptive MCMC. We then use the theory underlying MCMC to explain the validity of the algorithms considered and to assess the variance of the resulting Monte Carlo estimators.
Bayesian Computational Tools This article surveys advances in the field of Bayesian computation over the past 20 years from a purely personal viewpoint, hence containing some ommissions given the spectrum of the field. Monte Carlo, MCMC, and ABC themes are covered here, whereas the rapidly expanding area of particle methods is only briefly mentioned and different approximative techniques such as variational Bayes and linear Bayes methods do not appear at all. This article also contains some novel computational entries on the doubleexponential model that may be of interest.
Bayesian estimation supersedes the t test Bayesian estimation for two groups provides complete distributions of credible values for the effect size, group means and their difference, standard deviations and their difference, and the normality of the data. The method handles outliers. The decision rule can accept the null value (unlike traditional t tests) when certainty in the estimate is high (unlike Bayesian model com- parison using Bayes factors). The method also yields precise estimates of statistical power for various research goals. The software and programs are free, and run on Macintosh, Windows, and Linux platforms.
Bayesian Methods of Parameter Estimation In order to motivate the idea of parameter estimation we need to first understand the notion of mathematical modeling. What is the idea behind modeling real world phenomena? Mathematically modeling an aspect of the real world enables us to better understand it and better explain it, and perhaps enables us to reproduce it, either on a large scale, or on a simplified scale that characterizes only the critical parts of that phenomenon. How do we model these real life phenomena? These real life phenomena are captured by means of distribution models, which are extracted or learned directly from data gathered about them. So, what do we mean by parameter estimation? Every distribution model has a set of parameters that need to be estimated. These parameters specify any constants appearing in the model and provide a mechanism for efficient and accurate use of data. …
Bayesian Model Averaging: A Tutorial Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-ofsample predictive performance. We also provide a catalogue of currently available BMA software.
Bayesian Statistics Students are to choose one paper in the following list, or possibly outside of the list upon my agreement. The papers are available online. Most of them are collected in this zip file . The presentation can focus on a particular section / result / example of the paper. Evaluation of the students is based on the understanding and presentation of the chosen paper.
Bayesian Statistics Papers (Paper Collection)
Best Practices for Scientific Computing Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.
Better Decisions through Science Math-based aids for making decisions in medicine and industry could improve many diagnoses – often saving lives in the process.
Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm Figures in scientific publications are critically important because they often show the data supporting key findings. Our systematic review of research articles published in top physiology journals (n = 703) suggests that, as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies. Papers rarely included scatterplots, box plots, and histograms that allow readers to critically evaluate continuous data. Most papers presented continuous data in bar and line graphs. This is problematic, as many different data distributions can lead to the same bar or line graph. The full data may suggest different conclusions from the summary statistics. We recommend training investigators in data presentation, encouraging a more complete presentation of data, and changing journal editorial policies. Investigators can quickly make univariate scatterplots for small sample size studies using our Excel templates.
BI forward: A full view of your business Imagine that your organization is effectively using a business intelligence (BI) solution that provides everything you need to make better decisions and improve operational efficiency. Imagine users with their fingers on the pulse of markets, customers, channels and operations at all times. And imagine that your programs, plans, services and products are being designed with full and timely insight into all the factors – past, present and future – critical to success. What would it take to make that happen? What businesses need from BI is a full picture. And that is why it is important to understand that, for now and in the future, BI should help you not only describe and diagnose your past and current performance, but also predict future performance. When your business can do all three, you have a better idea of what your business needs to do to stay competitive. You have reports that show you where you have been, scorecards and real-time monitoring that show what is happening now and predictive analytics to show where your business is headed. This paper explains the advantages of a BI solution that includes predictive analytics.
BI, Analytics and Big Data A Modern-Day Perspective (Slide Deck)
Big Data Analytics in Action How Your Organization Can Improve its Bottom Line through Better Measurement, Better Decisions and Faster Response to Dynamic Market Conditions.
Big Data Analytics: A Survey The age of big data is now coming. But the traditional data analytics may not be able to handle such large quantities of data. The question that arises now is, how to develop a high performance platform to efficiently analyze big data and how to design an appropriate mining algorithm to find the useful things from big data. To deeply discuss this issue, this paper begins with a brief introduction to data analytics, followed by the discussions of big data analytics. Some important open issues and further research directions will also be presented for the next step of big data analytics.
Big Data and Machine Learning with an Actuarial Perspective (Slide Deck)
Big Data and the Creative Destruction of Today’s Business Models
Big data and the democratisation of decisions In August 2012 the Economist Intelligence Unit conducted a survey sponsored by Alteryx of 241 global executives to gauge their perceptions of big data adoption. Fifty-three percent of respondents are board members or C-suite executives, including 66 CEOs, presidents or managing directors. Those polled are based in North America (34%), the Asia-Pacific region (27%), Western Europe (25%), the Middle East and Africa (6%), Latin America (5%) and Eastern Europe (4%). Half of executives work for companies with revenue that exceeds US$500m. Executives hail from 18 sectors and represent 14 functional roles, including general management (30%), strategy and business development (18%), finance (17%) and marketing and sales (10%).
Big Data and the Internet of Things Advances in sensing and computing capabilities are making it possible to embed increasing computing power in small devices. This has enabled the sensing devices not just to passively capture data at very high resolution but also to take sophisticated actions in response. Combined with advances in communication, this is resulting in an ecosystem of highly interconnected devices referred to as the Internet of Things – IoT. In conjunction, the advances in machine learning have allowed building models on this ever increasing amounts of data. Consequently, devices all the way from heavy assets such as aircraft engines to wearables such as health monitors can all now not only generate massive amounts of data but can draw back on aggregate analytics to ‘improve’ their performance over time. Big data analytics has been identified as a key enabler for the IoT. In this chapter, we discuss various avenues of the IoT where big data analytics either is already making a significant impact or is on the cusp of doing so. We also discuss social implications and areas of concern.
Big Data for Big Business? A Taxonomy of Data-driven Business Models used by Start-up Firms This paper reports a study which provides a series of implications that may be particularly helpful to companies already leveraging ‘big data’ for their businesses or planning to do so. The Data Driven Business Model (DDBM) framework represents a basis for the analysis and clustering of business models. For practitioners the dimensions and various features may provide guidance on possibilities to form a business model for their specific venture. The framework allows identification and assessment of available potential data sources that can be used in a new DDBM. It also provides comprehensive sets of potential key activities as well as revenue models. The identified business model types can serve as both inspiration and blueprint for companies considering creating new data-driven business models. Although the focus of this paper was on business models in the start-up world, the key findings presumably also apply to established organisations to a large extent. The DDBM can potentially be used and tested by established organisations across different sectors in future research.
Big Data for Finance According to the 2014 IDG Enterprise Big Data research report, companies are intensifying their efforts to derive value through big data initiatives with nearly half (49%) of respondents already implementing big data projects or in the process of doing so in the future. Further, organizations are seeing exponential growth in the amount of data managed with an expected increase of 76% within the next 12-18 months. With growth there are opportunities as well as challenges. Among those facing the big data challenge are finance executives, as this extraordinary growth presents a unique opportunity to leverage data assets like never before. As the 3 V’s of big data: volume, velocity and variety continue to grow, so too does the opportunity for finance sector firms to capitalize on this data for strategic advantage. Finance professionals are accomplished in collecting, analyzing and benchmarking data, so they are in a unique position to provide a new and critical service – making big data more manageable while condensing vast amounts of information into actionable business insights.
Big Data Gets Personal Big data and personal data are converging to shape the Internet’s most surprising consumer products. They’ll predict your needs and store your memories – if you let them.
Big Data in Big Companies Big data burst upon the scene in the first decade of the 21st century, and the first organizations to embrace it were online and startup firms. Arguably, firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning. They didn’t have to reconcile or integrate big data with more traditional sources of data and the analytics performed upon them, because they didn’t have those traditional forms. They didn’t have to merge big data technologies with their traditional IT infrastructures because those infrastructures didn’t exist. Big data could stand alone, big data analytics could be the only focus of analytics, and big data technology architectures could be the only architecture. Consider, however, the position of large, well-established businesses. Big data in those environments shouldn’t be separate, but must be integrated with everything else that’s going on in the company. Analytics on big data have to coexist with analytics on other types of data. Hadoop clusters have to do their work alongside IBM mainframes. Data scientists must somehow get along and work jointly with mere quantitative analysts. In order to understand this coexistence, we interviewed 20 large organizations in the early months of 2013 about how big data fit in to their overall data and analytics environments. Overall, we found the expected co-existence; in not a single one of these large organizations was big data being managed separately from other types of data and analytics. The integration was in fact leading to a new management perspective on analytics, which we’ll call “Analytics 3.0.” In this paper we’ll describe the overall context for how organizations think about big data, the organizational structure and skills required for it…etc. We’ll conclude by describing the Analytics 3.0 era.
Big Data Machine Learning: Patterns for Predictive Analytics (RefCard)
Big data maturity: An action plan for policymakers and executives Big data have the potential to improve or transform existing business operations and reshape entire economic sectors. Big data can pave the way for disruptive, entrepreneurial companies and allow new industries to emerge. The technological aspect is important, but insufficient to allow big data to show their full potential and to stop companies from feeling swamped by this information. What matters is to reshape internal decision-making culture so that executives base their judgments on data rather than hunches. Research already indicates that companies that have managed this are more likely to be productive and profitable than the competition. Organizations need to understand where they are in terms of big data maturity, an approach that allows them to assess progress and identify necessary initiatives. Judging maturity requires looking at environment readiness, how far governments have provided the necessary legal and regulatory frameworks, and information and communications technology (ICT) infrastructure; an organization’s internal capabilities and how ready it is to implement big data initiatives; and the many and more complicated methods for using big data, which can mean simple efficiency gains or revamping a business model. The ultimate maturity level involves transforming the business model to be data-driven, which requires significant investment over many years. Policymakers should pay particular attention to environment readiness. They should present citizens with a compelling case for the benefits of big data. This means addressing privacy concerns and seeking to harmonize regulations around data privacy globally. Policymakers should establish an environment that facilitates the business viability of the big data sector (such as data, service, or IT system providers), and they should take educational measures to address the shortage of big data specialists. As big data become ubiquitous in public and private organizations, their use will become a source of national and corporate competitive advantage.
Big Data Visualization: Turning Big Data Into Big Insights This white paper provides valuable information about visualization-based data discovery tools and how they can help IT decision-makers derive more value from big data. Topics include:
• An overview of the IT landscape and the challenges that are leading more businesses to look for alternatives to traditional business intelligence tools
• A description of the features and benefits of visualization-based data discovery tools
• Guidance and suggestions on data governance, and ways to protect the quality of big data while facilitating self-service business intelligence
• Several usage examples of visualization-based data discovery tools from TIBCO* Software, the world’s second-largest data discovery vendor
Big Data: Harnessing the Power of Big Data through Education and data-driven Decision Making Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data-science programs, and publications are touting data science as a hot – even ‘‘sexy’’ – career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz. In this article, we argue that there are good reasons why it has been hard to pin down exactly what is data science. One reason is that data science is intricately intertwined with other important concepts also of growing importance, such as big data and data-driven decision making. Another reason is the natural tendency to associate what a practitioner does with the definition of the practitioner’s field; this can result in overlooking the fundamentals of the field. We believe that trying to define the boundaries of data science precisely is not of the utmost importance. We can debate the boundaries of the field in an academic setting, but in order for data science to serve business effectively, it is important (i) to understand its relationships to other important related concepts, and (ii) to begin to identify the fundamental principles underlying data science. Once we embrace (ii), we can much better understand and explain exactly what data science has to offer. Furthermore, only once we embrace (ii) should we be comfortable calling it data science. In this article, we present a perspective that addresses all these concepts.We close by offering, as examples, a partial list of fundamental principles underlying data science.
Big Data: New Tricks for Econometrics Computers are now involved in many economic transactions and can capture data associated with these transactions, which can then be manipulated and analyzed. Conventional statistical and econometric techniques such as regression often work well, but there are issues unique to big datasets that may require different tools. First, the sheer size of the data involved may require more powerful data manipulation tools. Second, we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection. Third, large datasets may allow for more flexible relationships than simple linear models. Machine learning techniques such as decision trees, support vector machines, neural nets, deep learning, and so on may allow for more effective ways to model complex relationships. In this essay, I will describe a few of these tools for manipulating and analyzing big data. I believe that these methods have a lot to offer and should be more widely known and used by economists. In fact, my standard advice to graduate students these days is go to the computer science department and take a class in machine learning. There have been very fruitful collaborations between computer scientists and statisticians in the last decade or so, and I expect collaborations between computer scientists and econometricians will also be productive in the future.
Big data: The next frontier for innovation, competition, and productivity This report contributes to MGI’s mission to help global leaders understand the forces transforming the global economy, improve company performance, and work for better national and international policies. As with all MGI research, we would like to emphasize that this work is independent and has not been commissioned or sponsored in any way by any business, government, or other institution.
Big data: The next frontier for innovation, competition, and productivity This report contributes to MGI’s mission to help global leaders understand the forces transforming the global economy, improve company performance, and work for better national and international policies. As with all MGI research, we would like to emphasize that this work is independent and has not been commissioned or sponsored in any way by any business, government, or other institution.
Big Workflow: More than Just Intelligent Workload Management for Big Data Big data applications represent a fast-growing category of high-value applications that are increasingly employed by business and technical computing users. However, they have exposed an inconvenient dichotomy in the way resources are utilized in data centers. Conventional enterprise and web-based applications can be executed efficiently in virtualized server environments, where resource management and scheduling is generally confined to a single server. By contrast, data-intensive analytics and technical simulations demand large aggregated resources, necessitating intelligent scheduling and resource management that spans a computer cluster, cloud, or entire data center. Although these tools exist in isolation, they are not available in a general-purpose framework that allows them to interoperate easily and automatically within existing IT infrastructure. A new approach, known as “Big Workflow,” is being created by Adaptive Computing to address the needs of these applications. It is designed to unify public clouds, private clouds, Map Reduce-type clusters, and technical computing clusters. Specifically Big Workflow will:
• Schedule, optimize and enforce policies across the data center
• Enable data-aware workflow coordination across storage and compute silos
• Integrate with external workflow automation tools Such a solution will provide a much-needed toolset for managing big data applications, shortening timelines, simplifying operations, and maximizing resource utilization, and preserving existing investments.
Blending Transactions and Analytics in a Single In-Memory Platform: Key to the Real-Time Enterprise This white paper discusses the issues involved in the traditional practice of deploying transactional and analytic applications on separate platforms using separate databases. It analyzes the results from a user survey, conducted on SAP’s behalf by IDC, that explores these issues. The paper then considers how SAP HANA, with its combination of in-memory data management and its ability to handle both transactions and analytics in real time, can resolve these issues. It explores how businesses may find opportunities for innovation (such as the ability to engage in a richer dialog with a customer based on analysis of the latest transactional information), for speed (with the ability to provide faster access to information to make timely decisions), and for simplification of the IT landscape with a single in-memory platform.
Bridging the gap between hierarchical network representation and functional analysis RedeR is an R-based package combined with a Java application for dynamic network visualization and manipulation. It implements a callback engine by using a low-level R-to-Java interface to build and run common plugins. In this sense, RedeR takes advantage of R to run robust statistics, while the R-to-Java interface bridge the gap between network analysis and visualization. RedeR is designed to deal with three key challenges in network analysis. Firstly, biological networks are modular and hierarchical, so network visualization needs to take advantage of such structural features. Secondly, network analysis relies on statistical methods, many of which are already available in resources like CRAN or Bioconductor. However, the missing link between ad- vanced visualization and statistical computing makes it hard to take full advantage of R packages for network analysis. Thirdly, in larger networks user input is needed to focus the view of the network on the biologically relevant parts, rather than relying on an automatic layout function. RedeR is designed to address these challenges (additional information is available at Castro et al.).
Brief: Real-Time Speech Analytics — Still More Sizzle Than Steak Most customer service organizations record phone interactions with their customers. If they get around to analyzing those recordings, whatever they find can’t change the outcome of those calls — they are long since over. Vendors of real-time speech analytics tools promise to allow companies to intervene at the moment of truth, while the customer and the contact center agent are still talking. This brief discusses the hurdles application development and delivery (AD&D) pros will need to overcome to justify the expenditure on this technology and the steps they will need to take to prepare for a world of alerts generated in real-time based on customer conversations.
Build a Powerful Business Case for Data Quality with Metrics Money and resources wasted; sales missed; extra costs incurred. Recent research by industry analyst firm Gartner shows that the shocking price that companies are paying because of poor quality data adds up to a staggering $8.2 million annually.
Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models We survey latent variable models for solving data-analysis problems. A latent variable model is a probabilistic model that encodes hidden patterns in the data.We uncover these patterns from their conditional distribution and use them to summarize data and form predictions. Latent variable models are important in many fields, including computational biology, natural language processing, and social network analysis. Our perspective is that models are developed iteratively: We build a model, use it to analyze data, assess how it succeeds and fails, revise it, and repeat. We describe how new research has transformed these essential activities. First, we describe probabilistic graphical models, a language for formulating latent variable models. Second, we describe mean field variational inference, a generic algorithm for approximating conditional distributions. Third, we describe how to use our analyses to solve problems: exploring the data, forming predictions, and pointing us in the direction of improved models.
Building Data Science Teams Starting in 2008, Jeff Hammerbacher (@hackingdata) and I sat down to share our experiences building the data and analytics groups at Facebook and LinkedIn. In many ways, that meeting was the start of data science as a distinct professional specialization (see “What Makes a Data Scientist?” on page 11 for the story on how we came up with the title “Data Scientist”). Since then, data science has taken on a life of its own. The hugely positive response to “What Is Data Science?,” a great introduction to the meaning of data science in today’s world, showed that we were at the start of a movement. There are now regular meetups, well-established startups, and even college curricula focusing on data science. As McKinsey’s big data research report and LinkedIn’s data indicates indicates (see Figure 1), data science talent is in high demand. This increase in the demand for data scientists has been driven by the success of the major Internet companies. Google, Facebook, LinkedIn, and Amazon have all made their marks by using data creatively: not just warehousing data, but turning it into something of value. Whether that value is a search result, a targeted advertisement, or a list of possible acquaintances, data science is producing products that people want and value. And it’s not just Internet companies: Walmart doesn’t produce “data products” as such, but they’re well known for using data to optimize every aspect of their retail operations. Given how important data science has grown, it’s important to think about what data scientists add to an organization, how they fit in, and how to hire and build effective data science teams.
Building High-level Features Using Large Scale Unsupervised Learning We consider the problem of building highlevel, class-speci c feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200×200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also nd that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 22,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.
Building Real-Time Data Pipelines Imagine you had a time machine that could go back one minute, or an hour. Think about what you could do with it. From the perspective of other people, it would seem like there was nothing you couldn’t do, no contest you couldn’t win. In the real world, there are three basic ways to win. One way is to have something, or to know something, that your competition does not. Nice work if you can get it. The second way to win is to simply be more intelligent. However, the number of people who think they are smarter is much larger than the number of people who actually are smarter. The third way is to process information faster so you can make and act on decisions faster. Being able to make more decisions in less time gives you an advantage in both information and intelligence. It allows you to try many ideas, correct the bad ones, and react to changes before your competition. If your opponent cannot react as fast as you can, it does not matter what they have, what they know, or how smart they are. Taken to extremes, it’s almost like having a time machine. An example of the third way can be found in high-frequency stock trading. Every trading desk has access to a large pool of highly intelligent people, and pays them well. All of the players have access to the same information at the same time, at least in theory. Being more or less equally smart and informed, the most active area of competition is the end-to-end speed of their decision loops. In recent years, traders have gone to the trouble of building their own wireless long-haul networks, to exploit the fact that microwaves move through the air 50% faster than light can pulse through fiber optics. This allows them to execute trades a crucial millisecond faster. Finding ways to shorten end-to-end information latency is also a constant theme at leading tech companies. They are forever working to reduce the delay between something happening out there in the world or in their huge clusters of computers, and when it shows up on a graph. At Facebook in the early 2010s, it was normal to wait hours after pushing new code to discover whether everything was working efficiently. The full report came in the next day. After building their own distributed in-memory database and event pipeline, their information loop is now on the order of 30 seconds, and they push at least two full builds per day. Instead of slowing down as they got bigger, Facebook doubled down on making more decisions faster. What is your system’s end-to-end latency? How long is your decision loop, compared to the competition? Imagine you had a system that was twice as fast. What could you do with it? This might be the most important question for your business. In this book we’ll explore new models of quickly processing information end to end that are enabled by long-term hardware trends, learnings from some of the largest and most successful tech companies, and surprisingly powerful ideas that have survived the test of time.
Business Analytics for Manufacturing: Four Ways to Increase Efficiency and Performance Whether the economy is strong or weak, the fundamental strategies for surviving and thriving still hold true. Manufacturers have to be highly efficient to meet demand and supply requirements. Costs and resources also have to be managed carefully and intelligently. At the same time, companies are considering new tactics: inventory optimization, maintenance operations, intelligent supply chains and leveraging technology as a focal point of business strategy. In order to be successful your company needs access to critical information and visibility into how well your business, your market and your competitors are responding to today’s challenging and changing times. …
Business Models for the Data Economy Whether you call it Big Data, data science, or simply analytics, modern businesses see data as a gold mine. Sometimes they already have this data in hand and understand that it is central to their activities. Other times, they uncover new data that fills a perceived gap, or seemingly “useless” data generated by other processes. Whatever the case, there is certainly value in using data to advance your business.
Business Process Deviance Mining: Review and Evaluation Business process deviance refers to the phenomenon whereby a subset of the executions of a business process deviate, in a negative or positive way, with respect to its expected or desirable outcomes. Deviant executions of a business process include those that violate compliance rules, or executions that undershoot or exceed performance targets. Deviance mining is concerned with uncovering the reasons for deviant executions by analyzing business process event logs. This article provides a systematic review and comparative evaluation of deviance mining approaches based on a family of data mining techniques known as sequence classification. Using real-life logs from multiple domains, we evaluate a range of feature types and classification methods in terms of their ability to accurately discriminate between normal and deviant executions of a process. We also analyze the interestingness of the rule sets extracted using different methods. We observe that feature sets extracted using pattern mining techniques only slightly outperform simpler feature sets based on counts of individual activity occurrences in a trace.
Business-Driven BI: Using New Technologies to Foster Self-Service Access to Insights Self-Service Business Intelligence (BI) has been the holy grail for BI professionals for a long time. Yet almost two-thirds of BI professionals (64%) rate the success of their self-service initiatives “average” or lower. Newcomers to BI struggle even more, with more than half (52%) rating their attempts at selfservice BI “fair” or “poor.” One reason for these less-than-stellar numbers is this: Implementing selfservice BI is more complex than it looks. It’s not a one-size-fits-all program. BI users come in many different shapes and sizes, each with unique information requirements. This report lays out several frameworks that explain how users interact with information and then maps elements of each to BI functionality and categories of BI tools. This mapping is critical to success with self-service BI….


Caching and Distributing Statistical Analyses in R We present the cacher package for R, which provides tools for caching statistical analyses and for distributing these analyses to others in an e cient manner. The cacher package takes objects created by evaluating R expressions and stores them in key-value databases. These databases of cached objects can subsequently be assembled into packages for distribution over the web. The cacher package also provides tools to help readers examine the data and code in a statistical analysis and reproduce, modify, or improve upon the results. In addition, readers can easily conduct alternate analyses of the data. We describe the design and implementation of the cacher package and provide two examples of how the package can be used for reproducible research. This vignette was originally published as Peng (2008).
Calling R from .NET: a case-study using Rapid NCA, the non-compartmental analysis workflow tool (Slide Deck)
Canonical example of Bayes’ theorem in detail The most common elementary illustration of Bayes’ theorem is medical testing for a rare disease. The example is almost a clich´e in probability and statistics books. And yet in my opinion, it’s usually presented too quickly and too abstractly. Here I’m going to risk erring on the side of going too slowly and being too concrete. I’ll work out an example with numbers and no equations before presenting Bayes theorem. Then I’ll include a few graphs.
Capitalizing on the power of big data for retail The retail industry is changing dramatically as consumers shop in new ways. With the growing popularity of online shopping and mobile commerce, consumers are using more retail channels than ever before to research products, compare prices, search for promotions, make purchases and provide feedback. Social media has become one of the key channels. Consumers are using social media – and the leading e-commerce platforms that integrate with social media – to find product recommendations, lavish praise, voice complaints, capitalize on product offers and engage in ongoing dialogs with their favorite brands. The multiplication of retail channels and the increasing use of social media are empowering consumers. With a wealth of information readily available online, consumers are now better able to compare products, services and prices – even as they shop in physical stores. When consumers interact with companies publically through social media, they have greater power to influence other customers or damage a brand. These and other changes in the retail industry are creating important opportunities for retailers. But to capitalize on those opportunities, retailers need ways to collect, manage and analyze a tremendous volume, variety and velocity of data. When point-of-sale (POS) systems were first commercialized, retailers were able to collect large amounts of potentially valuable information, but most of that information remained untapped. The emergence of social media and other consumer-oriented technologies is now introducing even more data to the retail ecosystem. Retailers must handle not only the growing volume of information but also an increasing variety – including both structured and unstructured data. They must also find ways to accommodate the changing nature of this data and the velocity at which is being produced and collected. If retailers succeed in addressing the challenges of “big data,” they can use this data to generate valuable insights for personalizing marketing and improving the effectiveness of marketing campaigns, optimizing assortment and merchandising decisions, and removing inefficiencies in distribution and operations. Adopting solutions designed to capitalize on this big data allows companies to navigate the shifting retail landscape and drive a positive transformation for the business….
Causal inference in statistics: An overview This review presents empirical researchers with recent advances in causal inference, and stresses the paradigmatic shifts that must be undertaken in moving from traditional statistical analysis to causal analysis of multivariate data. Special emphasis is placed on the assumptions that underly all causal inferences, the languages used in formulating those assumptions, the conditional nature of all causal and counterfactual claims, and the methods that have been developed for the assessment of such claims. These advances are illustrated using a general theory of causation based on the Structural Causal Model (SCM) described in Pearl (2000a), which subsumes and unifies other approaches to causation, and provides a coherent mathematical foundation for the analysis of causes and counterfactuals. In particular, the paper surveys the development of mathematical tools for inferring (from a combination of data and assumptions) answers to three types of causal queries:
(1) queries about the effects of potential interventions, (also called ‘causal effects’ or ‘policy evaluation’)
(2) queries about probabilities of counterfactuals, (including assessment of ‘regret,’ ‘attribution’ or ’causes of effects’) and
(3) queries about direct and indirect effects (also known as ‘mediation’).
Finally, the paper defines the formal and conceptual relationships between the structural and potential-outcome frameworks and presents tools for a symbiotic analysis that uses the strong features of both.
Causality and Statistical Learning In social science we are sometimes in the position of studying descriptive questions (In what places do working-class whites vote for Republicans? In what eras has social mobility been higher in the United States than in Europe? In what social settings are different sorts of people more likely to act strategically?). Answering descriptive questions is not easy and involves issues of data collection, data analysis, and measurement (how one should define concepts such as “working-class whites,” “social mobility,” and “strategic”) but is uncontroversial from a statistical standpoint. All becomes more difficult when we shift our focus from what to what if and why. Consider two broad classes of inferential questions:
1. Forward causal inference. What might happen if we do X? What are the effects of smoking on health, the effects of schooling on knowledge, the effect of campaigns on election outcomes, and so forth?
2. Reverse causal inference. What causes Y? Why do more attractive people earn more money? Why do many poor people vote for Republicans and rich people vote for Democrats? Why did the economy collapse?
Challenges and Opportunities with Big Data The promise of data-driven decision-making is now being recognized broadly, and there is growing enthusiasm for the notion of “Big Data.’’ While the promise of Big Data is real — for example, it is estimated that Google alone contributed 54 billion dollars to the US economy in 2009 — there is currently a wide gap between its potential and its realization. Heterogeneity, scale, timeliness, complexity, and privacy problems with Big Data impede progress at all phases of the pipeline that can create value from data. The problems start right away during data acquisition, when the data tsunami requires us to make decisions, currently in an ad hoc manner, about what data to keep and what to discard, and how to store what we keep reliably with the right metadata. Much data today is not natively in structured format; for example, tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display, but not for semantic content and search: transforming such content into a structured format for later analysis is a major challenge. The value of data explodes when it can be linked with other data, thus data integration is a major creator of value. Since most data is directly generated in digital format today, we have the opportunity and the challenge both to influence the creation to facilitate later linkage and to automatically link previously created data. Data analysis, organization, retrieval, and modeling are other foundational challenges. Data analysis is a clear bottleneck in many applications, both due to lack of scalability of the underlying algorithms and due to the complexity of the data that needs to be analyzed. Finally, presentation of the results and its interpretation by non-technical domain experts is crucial to extracting actionable knowledge. During the last 35 years, data management principles such as physical and logical independence, declarative querying and cost-based optimization have led, during the last 35 years, to a multi-billion dollar industry. More importantly, these technical advances have enabled the first round of business intelligence applications and laid the foundation for managing and analyzing Big Data today. The many novel challenges and opportunities associated with Big Data necessitate rethinking many aspects of these data management platforms, while retaining other desirable aspects. We believe that appropriate investment in Big Data will lead to a new wave of fundamental technological advances that will be embodied in the next generations of Big Data management and analysis platforms, products, and systems. We believe that these research problems are not only timely, but also have the potential to create huge economic value in the US economy for years to come. However, they are also hard, requiring us to rethink data analysis systems in fundamental ways. A major investment in Big Data, properly directed, can result not only in major scientific advances, but also lay the foundation for the next generation of advances in science, medicine, and business.
Challenges of Big Data Analysis Big Data bring new opportunities to modern society and challenges to data scientists. On one hand, Big Data hold great promises for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottleneck, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinguished and require new computational and statistical paradigm. This article give overviews on the salient features of Big Data and how these features impact on paradigm change on statistical and computational methods as well as computing architectures. We also provide various new perspectives on the Big Data analysis and computation. In particular, we emphasis on the viability of the sparsest solution in high-confidence set and point out that exogeneous assumptions in most statistical methods for Big Data can not be validated due to incidental endogeneity. They can lead to wrong statistical inferences and consequently wrong scientific conclusions.
Challenges of Big Data Analysis Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promises for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottleneck, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. These challenges are distinguished and require new computational and statistical paradigm. This paper gives overviews on the salient features of Big Data and how these features impact on paradigm change on statistical and computational methods as well as computing architectures. We also provide various new perspectives on the Big Data analysis and computation. In particular, we emphasize on the viability of the sparsest solution in high-confidence set and point out that exogenous assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity.They can lead to wrong statistical inferences and consequently wrong scientific conclusions.
Challenges of Big Data analysis Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promises for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottleneck, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. These challenges are distinguished and require new computational and statistical paradigm. This paper gives overviews on the salient features of Big Data and how these features impact on paradigm change on statistical and computational methods as well as computing architectures. We also provide various new perspectives on the Big Data analysis and computation. In particular, we emphasize on the viability of the sparsest solution in high-confidence set and point out that exogenous assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity.They can lead to wrong statistical inferences and consequently wrong scientific conclusions.
Character-level Convolutional Networks for Text Classification This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several largescale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
Chart Suggestions – A Thought-Starter (Cheat Sheet)
Choosing the right NoSQL database for the job: a quality attribute evaluation For over forty years, relational databases have been the leading model for data storage, retrieval and management. However, due to increasing needs for scalability and performance, alternative systems have emerged, namely NoSQL technology. The rising interest in NoSQL technology, as well as the growth in the number of use case scenarios, over the last few years resulted in an increasing number of evaluations and comparisons among competing NoSQL technologies. While most research work mostly focuses on performance evaluation using standard benchmarks, it is important to notice that the architecture of real world systems is not only driven by performance requirements, but has to comprehensively include many other quality attribute requirements. Software quality attributes form the basis from which software engineers and architects develop software and make design decisions. Yet, there has been no quality attribute focused survey or classification of NoSQL databases where databases are compared with regards to their suitability for quality attributes common on the design of enterprise systems. To fill this gap, and aid software engineers and architects, in this article, we survey and create a concise and up-to-date comparison of NoSQL engines, identifying their most beneficial use case scenarios from the software engineer point of view and the quality attributes that each of them is most suited to.
Classification and Regression Tree Methods A classification or regression tree is a prediction model that can be represented as a decision tree. This article discusses the C4.5, CART, CRUISE, GUIDE, and QUEST methods in terms of their algorithms, features, properties, and performance.
Classification and Regression Trees Classification and regression trees are machine-learningmethods for constructing predictionmodels from data. Themodels are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weakness in two examples.
Classification And Regression Trees : A Practical Guide for Describing a Dataset (Slide Deck)
Classification via Minimum Incremental Coding Length We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We demonstrate the asymptotic optimality of this criterion for Gaussian distributions and analyze its relationships to classical classifiers. The theoretical results clarify the connections between our approach and popular classifiers such as MAP, RDA, k-NN, and SVM, as well as unsupervised methods based on lossy coding. Our formulation induces several good effects on the resulting classifier. First, minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small sample setting. Second, compression provides a uniform means of handling classes of varying dimension. The new criterion and its kernel and local versions perform competitively on synthetic examples, as well as on real imagery data such as handwritten digits and face images. On these problems, the performance of our simple classifier approaches the best reported results, without using domain-specific information. All MATLAB code and classification results are publicly available for peer evaluation at http://…/home.htm.
Cloud based
Predictive Analytics poised for rapid growth
Rather than report survey results question by question the results, and their implications, have been grouped into a number of sections. Each section highlights significant results from the survey and discusses its implication.
– Business solutions are what organizations need
– Predictive analytics are showing real strength
– Customers are the focus for predictive analytics and cloud
– Cloud-based predictive analytic scenarios are gaining momentum
– Early adopters are gaining a competitive advantage
– Decision Management matters to predictive analytic success
– There are still some barriers and concerns with cloud-based predictive analytics
– Industries vary in their adoption and concerns
– A mix of clouds is appropriate
– Traditional data sources dominate predictive analytic models
After the survey results and implications are discussed we will make some recommendations and identify pros and cons of the various options. Demographics and vendor profiles complete the paper.
Cluster Analysis: Tutorial with R In this tutorial we inspect classification. classification and ordination are al- ternative strategies of simplifying data. Ordination tries to simplify data into a map showing similarities among points. classification simpli es data by putting similar points into same class. The task of describing a high number of points is simpli ed to an easier task of describing a low number of classes.
Cluster Validation (Slide Deck)
Clustering large Data Sets with mixed numeric and Categorical Values Efficient partitioning of large data sets into homogenous clusters is a fundamental problem in data mining. The standard hierarchical clustering methods provide no solution for this problem due to their computational inefficiency. The k-means based methods are promising for their efficiency in processing large data sets. However, their use is often limited to numeric data. In this paper we present a k-prototypes algorithm which is based on the k-means paradigm but removes the numeric data limitation whilst preserving its efficiency. In the algorithm, objects are clustered against k prototypes. A method is developed to dynamically update the k prototypes in order to maximise the intra cluster similarity of objects. When applied to numeric data the algorithm is identical to the kmeans. To assist interpretation of clusters we use decision tree induction algorithms to create rules for clusters. These rules, together with other statistics about clusters, can assist data miners to understand and identify interesting clusters.
Collaborative Filtering Recommender Systems Recommender systems are an important part of the information and e-commerce ecosystem. They represent a powerful method for enabling users to filter through large information and product spaces. Nearly two decades of research on collaborative filtering have led to a varied set of algorithms and a rich collection of tools for evaluating their performance. Research in the field is moving in the direction of a richer understanding of how recommender technology may be embedded in specific domains. The differing personalities exhibited by different recommender algorithms show that recommendation is not a one-sizefits- all problem. Specific tasks, information needs, and item domains represent unique problems for recommenders, and design and evaluation of recommenders needs to be done based on the user tasks to be supported. Effective deployments must begin with careful analysis of prospective users and their goals. Based on this analysis, system designers have a host of options for the choice of algorithm and for its embedding in the surrounding user experience. This paper discusses a wide variety of the choices available and their implications, aiming to provide both practicioners and researchers with an introduction to the important issues underlying recommenders and current best practices for addressing these issues.
Combining Predictions for Accurate Recommender Systems We analyze the application of ensemble learning to recommender systems on the Net ix Prize dataset. For our analysis we use a set of diverse state-of-the-art collaborative ltering (CF) algorithms, which include: SVD, Neighborhood Based Approaches, Restricted Boltzmann Machine, Asymmetric Factor Model and Global E ects. We show that linearly combining (blending) a set of CF algorithms increases the accuracy and outperforms any single CF algorithm. Furthermore, we show how to use ensemble methods for blending predictors in order to outperform a single blending algorithm. The dataset and the source code for the ensemble blending are available online.
Community Detection in Networks: The Leader-Follower Algorithm Natural networks such as those between humans observed through their interactions or biological networks predicted based on various experimental measurements contain a wealth of information about the unobserved structure of the social or biological system. However, these networks are inherently noisy in the sense that they contain spurious connections making them seemingly dense. Therefore, identifying important, refined structures such as communities or clusters becomes quite challenging. Specifically, we find that the popular, traditional method of spectral clustering does not manage to learn refined community structure. The primary reason for this is that it is based upon external community connectivity properties such as graph-cuts. Motivated to overcome this limitation, we propose a community detection algorithm, called the leader-follower algorithm, based upon identifying the natural internal structure of the expected communities. The algorithm uses the notion of network centrality in a novel manner to differentiate leaders (nodes which connect different communities) from loyal followers (nodes which only have neighbors within a single community). Using this approach, it is able to learn the communities from the network structure. A salient feature of our algorithm is that, unlike the spectral clustering, it does not require knowledge of number of communities in the network; it learns it naturally. We show that our algorithm is quite effective. We prove that it detects all of the communities exactly for any network possessing communities with the natural internal structure expected in social networks. More importantly, we demonstrate its effectiveness in the context of various real networks ranging from social networks such as Facebook to biological networks such as an fMRI based human brain network.
Comparative Analysis of K-Means and Fuzzy C-Means Algorithms In the arena of software, data mining technology has been considered as useful means for identifying patterns and trends of large volume of data. This approach is basically used to extract the unknown pattern from the large set of data for business as well as real time applications. It is a computational intelligence discipline which has emerged as a valuable tool for data analysis, new knowledge discovery and autonomous decision making. The raw, unlabeled data from the large volume of dataset can be classified initially in an unsupervised fashion by using cluster analysis i.e. clustering the assignment of a set of observations into clusters so that observations in the same cluster may be in some sense be treated as similar. The outcome of the clustering process and efficiency of its domain application are generally determined through algorithms. There are various algorithms which are used to solve this problem. In this research work two important clustering algorithms namely centroid based K-Means and representative object based FCM (Fuzzy C-Means) clustering algorithms are compared. These algorithms are applied and performance is evaluated on the basis of the efficiency of clustering output. The numbers of data points as well as the number of clusters are the factors upon which the behaviour patterns of both the algorithms are analyzed. FCM produces close results to K-Means clustering but it still requires more computation time than K-Means clustering.
Comparison of Bayesian predictive methods for model selection The goal of this paper is to compare several widely used Bayesian model selection methods in practical model selection problems, highlight their differences and give recommendations about the preferred approaches. We focus on the variable subset selection for regression and classification and perform several numerical experiments using both simulated and real world data. The results show that the optimization of a utility estimate such as the cross-validation score is liable to finding overfitted models due to relatively high variance in the utility estimates when the data is scarce. Better and much less varying results are obtained by incorporating all the uncertainties into a full encompassing model and projecting this information onto the submodels. The reference model projection appears to outperform also the maximum a posteriori model and the selection of the most probable variables. The study also demonstrates that the model selection can greatly bene t from using cross-validation outside the searching process both for guiding the model size selection and assessing the predictive performance of the finally selected model.
Comprehensive View on Cran Packages (Cheat Sheet)
Computation of the multivariate Oja median The multivariate Oja (1983) median is an affine equivariant multivariate location estimate with high efficiency. This estimate has a bounded influence function but zero breakdown. The computation of the estimate appears to be highly intensive. We consider different, exact and stochastic, algorithms for the calculation of the value of the estimate. In the stochastic algorithms, the gradient of the objective function, the rank function, is estimated by sampling observation hyperplanes. The estimated rank function with its estimated accuracy then yields a confidence region for the true Oja samplemedian, and the confidence region shrinks to the sample median with the increasing number of the sampled hyperplanes. Regular grids and and the grid given by the data points are used in the construction. Computation times of different algorithms are discussed and compared.
Condition-Based Maintenance Using Sensor Arrays and Telematics Emergence of uniquely addressable embeddable devices has raised the bar on Telematics capabilities. Though the technology itself is not new, its application has been quite limited until now. Sensor based telematics technologies generate volumes of data that are orders of magnitude larger than what operators have dealt with previously. Real-time big data computation capabilities have opened the flood gates for creating new predictive analytics capabilities into an otherwise simple data log systems, enabling real-time control and monitoring to take preventive action in case of any anomalies. Condition-based-maintenance, usage-based-insurance, smart metering and demand-based load generation etc. are some of the predictive analytics use cases for Telematics. This paper presents the approach of condition-based maintenance using real-time sensor monitoring, Telematics and predictive data analytics.
Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2013) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and working with a large B can be computationally expensive. Direct applications of jackknife and IJ estimators to bagging require B = Theta(n^1.5) bootstrap replicates to converge, where n is the size of the training set. We propose improved versions that only require B = Theta(n) replicates. Moreover, we show that the IJ estimator requires 1.7 times less bootstrap replicates than the jackknife to achieve a given accuracy. Finally, we study the sampling distributions of the jackknife and IJ variance estimates themselves. We illustrate our findings with multiple experiments and simulation studies.
Considerations for maximising analytic performance When it comes to running business analytics, there are three key nonfunctional requirements that must be met: fast performance, usability and affordability. Bloor Research was asked by IBM to compare the performance capabilities of the leading business analytic platforms. Specifically, we were asked to evaluate how the combined capabilities of business analytic tools and the underlying database management system can affect the overall performance of your analytic applications, reports and dashboards.
Context-Aware Recommender Systems The importance of contextual information has been recognized by researchers and practitioners in many disciplines, including e-commerce personalization, information retrieval, ubiquitous and mobile computing, data mining, marketing, and management. While a substantial amount of research has already been performed in the area of recommender systems, most existing approaches focus on recommending the most relevant items to users without taking into account any additional contextual information, such as time, location, or the company of other people (e.g., for watching movies or dining out). In this chapter we argue that relevant contextual information does matter in recommender systems and that it is important to take this information into account when providing recommendations. We discuss the general notion of context and how it can be modeled in recommender systems. Furthermore, we introduce three different algorithmic paradigms – contextual prefiltering, post-filtering, and modeling – for incorporating contextual information into the recommendation process, discuss the possibilities of combining several context-aware recommendation techniques into a single unifying approach, and provide a case study of one such combined approach. Finally, we present additional capabil- ities for context-aware recommenders and discuss important and promising directions for future research.
Control And Protect Sensitive Information In The Era Of Big Data This report outlines the future look of Forrester’s solution for security and risk (S&R) executives seeking to develop a holistic strategy to protect and manage sensitive data. In the never-ending race to stay ahead of the competition, companies are developing advanced capabilities to store, process, and analyze vast amounts of data from social networks, sensors, IT systems, and other sources to improve business intelligence and decisioning capabilities. “Big data processing” refers to the tools and techniques that handle the extreme data volumes and velocities and wide variety of data formats resulting from implementing these capabilities. As organizations aggregate more and more data, they need to be aware that much of it could be financial, personal, and other types of sensitive data that are subject to global laws and regulations. S&R professionals need to be aware of the security issues surrounding big data so they can take an active role early in these initiatives. This report will help S&R pros understand how to control and properly protect sensitive information in the era of big data.
Copulas: A Personal View Copula modeling has taken the world of finance and insurance, and well beyond, by storm. Why is this? In this paper I review the early start of this development, discuss some important current research, mainly from an applications point of view, and comment on potential future developments. An alternative title of the paper would be ‘Demystifying the copula craze’. The paper also contains what I would like to call the copula must-reads.
Correlated Topic Models Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets.
Correspondence Analysis This working paper gives a comprehensive explanation of the multivariate technique called correspondence analysis, applied in the context of a large survey of a nation’s state of health, in this case the Spanish National Health Survey. It is first shown how correspondence analysis can be used to interpret a simple cross-tabulation by visualizing the table in the form of a map of points representing the rows and columns of the table. Combinations of variables can also be interpreted by coding the data in the appropriate way. The technique can also be used to deduce optimal scale values for the levels of a categorical variable, thus giving quantitative meaning to the categories. Multiple correspondence analysis can analyze several categorical variables simultaneously, and is analogous to factor analysis of continuous variables. Other uses of correspondence analysis are illustrated using different variables of the same Spanish database: for example, exploring patterns of missing data and visualizing trends across surveys from consecutive years.
Cross-validation This text is a survey on cross-validation. We define all classical cross-validation procedures, and we study their properties for two different goals: estimating the risk of a given estimator, and selecting the best estimator among a given family. For the risk estimation problem, we compute the bias (which can also be corrected) and the variance of cross-validation methods. For estimator selection, we first provide a first-order analysis (based on expectations). Then, we explain how to take into account second-order terms (from variance computations, and by taking into account the usefulness of overpenalization). This allows, in the end, to provide some guidelines for choosing the best cross-validation method for a given learning problem.
Cumulative Gains Model Quality Metric This paper proposes a more comprehensive look at the ideas of KS and Area Under the Curve AUC of a cumulative gains chart to develop a model quality statistic which can be used agnostically to evaluate the quality of a wide range of models in a standardized fashion. It can be either used holistically on the entire range of the model or at a given decision threshold of the model. Further it can be extended into the model learning process.
Customer Analytics in the age of Social Media Becoming “customer centric” is a top priority today, and for good reason: as if it weren’t important enough that customers buy products and contract for services, they now do much more than simply buy. Customers participate in social media networks and chat rooms; they write blogs and contribute to comment sites; and they share information through sites such as YouTube and Flickr. Their activities and expressions not only reveal personal buying behavior and interests, but they also bring into focus their influence on purchasing by others in their social networks.


Data Acceleration: Architecture for the Modern Data Supply Chain Data technologies are evolving rapidly, but organizations have adopted most of these in piecemeal fashion. As a result, enterprise data—whether related to customer interactions, business performance, computer notifications, or external events in the business environment —is vastly underutilized. Moreover, companies’ data ecosystems have become complex and littered with data silos. This makes the data more difficult to access, which in turn limits the value that organizations can get out of it. Indeed, according to a recent Gartner, Inc. report, 85 percent of Fortune 500 organizations will be unable to exploit Big Data for competitive advantage through 2015.1 Furthermore, a recent Accenture study found that half of all companies have concerns about the accuracy of their data, and the majority of executives are unclear about the business outcomes they are getting from their data analytics programs. To unlock the value hidden in their data, companies must start treating data as a supply chain, enabling it to flow easily and usefully through the entire organization—and eventually throughout each company’s ecosystem of partners, including suppliers and customers. The time is right for this approach. For one thing, new external data sources are becoming available, providing fresh opportunities for data insights. In addition, the tools and technology required to build a better data platform are available and in use. These provide a foundation on which companies can construct an integrated, end-to-end data supply chain.
Data Analysis the Data.Table way (Cheat Sheet)
Data Center Infrastructure Management (DCIM) For Dummies Data Center Infrastructure Management (DCIM) is the discipline of managing the physical infrastructure of a data center and optimizing its ongoing operation. DCIM is a software suite that bridges the traditional gap between IT and the facilities groups and coordinates between the two. DCIM reduces computing costs while making it easier to quickly support new applications and other business requirements. About This Book This book explains the importance of DCIM, describes the key components of a modern DCIM system, guides you in the selection of the right DCIM solution for your particular needs, and gives you a step-by-step formula for a successful DCIM implementation. Because this is a For Dummies book, you can be sure that it’s easy to read and has touches of humor.
Data Clustering With Leaders and Subleaders Algorithm In this paper, an efficient hierarchical clustering algorithm, suitable for large data sets is proposed for effective clustering and prototype selection for pattern classification. It is another simple and efficient technique which uses incremental clustering principles to generate a hierarchical structure for finding the subgroups/subclusters within each cluster. As an example, a two level clustering algorithm – Leaders–Subleaders, an extension of the leader algorithm is presented. Classification accuracy (CA) obtained using the representatives generated by the Leaders–Subleaders method is found to be better than that of using leaders as representatives. Even if more number of prototypes are generated, classification time is less as only a part of the hierarchical structure is searched.
Data Clustering: A Review Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities has made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.
Data Driven: Creating a Data Culture The data movement is in full swing. There are conferences (Strata +Hadoop World), bestselling books (Big Data, The Signal and the Noise, Lean Analytics), business articles (“Data Scientist: The Sexiest Job of the 21st Century”), and training courses (An Introduction to Machine Learning with Web Data, the Insight Data Science Fellows Program) on the value of data and how to be a data scientist. Unfortunately, there is little that discusses how companies that successfully use data actually do that work. Using data effectively is not just about which database you use or how many data scientists you have on staff, but rather it’s a complex interplay between the data you have, where it is stored and how people work with it, and what problems are considered worth solving. While most people focus on the technology, the best organizations recognize that people are at the center of this complexity. In any organization, the answers to questions such as who controls the data, who they report to, and how they choose what to work on are always more important than whether to use a database like PostgreSQL or Amazon Redshift or HDFS. We want to see more organizations succeed with data. We believe data will change the way that businesses interact with the world, and we want more people to have access. To succeed with data, businesses must develop a data culture.
Data Management: A Unified Approach Unified data management is becoming a strategic advantage in today’s business world. With the advent of big data, the volume and type of information that companies must use in near-real time to gain a competitive edge is growing at an unprecedented rate. Meanwhile, industry consolidation is leading to mergers and acquisitions that require disparate IT systems to be harmonized in order to move forward. These forces, combined with ongoing pressure to use all available data to improve employee productivity, customer satisfaction and innovation, are spurring enterprises to make data management planning a top priority. To support these plans and help achieve important business goals, enterprises are turning to data management solutions with significant urgency. According to a recent IDG Research Services study of 118 IT professionals, 87 percent of respondents said data integration tools have been deployed or are on their company’s road maps; 84 percent answered the same for data quality tools; 82 percent for master data management solutions; and 81 percent for data governance/data stewardship initiatives. Nearly three-fifths of respondents at organizations that have data management solutions in place are planning to continue making near-term investments in these types of tools.
Data Mining and Statistics: What is the Connection? Data Mining is used to discover patterns and relationships in data, with an emphasis on large observational data bases. It sits at the common frontiers of several fields including Data Base Management, Arti cial Intelligence, Machine Learning, Pattern Recognition, and Data Visualization. From a statistical perspective it can be viewed as computer automated exploratory data analysis of (usually) large complex data sets. In spite of (or perhaps because of) the somewhat exaggerated hype, this eld is having a major impact in business, industry, and science. It also a ords enormous research opportunities for new methodological developments. Despite the obvious connections between data mining and statistical data analysis, most of the methodologies used in Data Mining have so far originated in fields other than Statistics. This paper explores some of the reasons for this, and why statisticians should have an interest in Data Mining. It is argued that Statistics can potentially have a major in uence on Data Mining, but in order to do so some of our basic paradigms and operating principles may have to be modified.
Data Mining Cluster Analysis: Basic Concepts and Algorithms (Slide Deck)
Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control Manufacturing enterprises have been collecting and storing more and more current, detailed and accurate production relevant data. The data stores offer enormous potential as source of new knowledge, but the huge amount of data and its complexity far exceeds the ability to reduce and analyze data without the use of automated analysis techniques. This paper provides a brief introduction into knowledge discovery from databases and presents the methology for data mining in time series. The relevancy of data mining for manufacturing shall be depicted.
Data Mining Standards In this survey paper we have consolidated all the current data mining standards. We have categorized them in to process standards, XML standards, standard APIs, web standards and grid standards and discussed them in considerable detail. We have also designed an application using these standards. We later also analyze the standards their influence on data mining application development and later point out areas in the data mining application development that need to be standardized. We also talk about the trend in the focus areas addressed by these standards.
Data Mining: A Conceptual Overview This tutorial provides an overview of the data mining process. The tutorial also provides a basic understanding of how to plan, evaluate and successfully refine a data mining project, particularly in terms of model building and model evaluation. Methodological considerations are discussed and illustrated. After explaining the nature of data mining and its importance in business, the tutorial describes the underlying machine learning and statistical techniques involved. It describes the CRISP-DM standard now being used in industry as the standard for a technology-neutral data mining process model. The paper concludes with a major illustration of the data mining process methodology and the unsolved problems that offer opportunities for research. The approach is both practical and conceptually sound in order to be useful to both academics and practitioners.
Data Mining: Discovering and Visualizing Patterns with Python (RefCard)
Data profit vs. Data waste: Boosting business performance every day in the real world with information optimization Companies do many things to grow profits. They discover new market opportunities. They sell more effectively. They innovate. They delight their customers. They improve productivity. They find ways to cut costs and mitigate risks. It can be difficult to do these things in today’s economic environment, because revenue opportunities are not always abundant and executives are largely disinclined to make substantial investments in new business capabilities. Despite current conditions, businesses are still finding ways to significantly improve their performance on a daily basis. One of these ways is the aggressive pursuit of data profit. Data profit is what results when companies make economically optimized use of all the structured and unstructured data already residing in existing systems across the enterprise to get better at everything the business needs to do: discovering opportunities, selling, innovating, delighting customers, improving productivity, cutting costs, and mitigating risk. Data profit has become an especially compelling business strategy today, because companies now suffer as never before from a specific problem that is the very opposite of data profit. That problem is data waste. Data waste occurs when companies do not fully utilize the wealth of data that they already have. This problem has become highly prevalent because companies have implemented so many systems over the past decade or more – from high-end databases and applications to email and basic desktop productivity tools – but have not developed effective strategies for fully leveraging their collective information output….
Data Science (Poster)
Data Science and its Relationship to Big Data and data-driven Decision Making Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data-science programs, and publications are touting data science as a hot – even ‘‘sexy’’ – career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz. In this article, we argue that there are good reasons why it has been hard to pin down exactly what is data science. One reason is that data science is intricately intertwined with other important concepts also of growing importance, such as big data and data-driven decision making. Another reason is the natural tendency to associate what a practitioner does with the definition of the practitioner’s field; this can result in overlooking the fundamentals of the field. We believe that trying to define the boundaries of data science precisely is not of the utmost importance. We can debate the boundaries of the field in an academic setting, but in order for data science to serve business effectively, it is important (i) to understand its relationships to other important related concepts, and (ii) to begin to identify the fundamental principles underlying data science. Once we embrace (ii), we can much better understand and explain exactly what data science has to offer. Furthermore, only once we embrace (ii) should we be comfortable calling it data science. In this article, we present a perspective that addresses all these concepts. We close by offering, as examples, a partial list of fundamental principles underlying data science.
Data Science Code of Professional Conduct We look at the proposed Data Science Code of Professional Conduct and nominate a “Golden Rule” which summarizes the data scientist ethic.
Data Science in the Cloud with Microsoft Azure Machine Learning and R Recently, Microsoft launched the Azure Machine Learning cloud platform – Azure ML. Azure ML provides an easy-to-use and powerful set of cloud-based data transformation and machine learning tools. This report covers the basics of manipulating data, as well as constructing and evaluating models in Azure ML, illustrated with a data science example. Before we get started, here are a few of the benefits Azure ML provides for machine learning solutions:
• Solutions can be quickly deployed as web services.
• Models run in a highly scalable cloud environment.
• Code and data are maintained in a secure cloud environment.
• Available algorithms and data transformations are extendable using the R language for solution-specific functionality.
Throughout this report, we’ll perform the required data manipulation then construct and evaluate a regression model for a bicycle sharing demand dataset. You can follow along by downloading the code and data provided below. Afterwards, we’ll review how to publish your trained models as web services in the Azure cloud.
Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update This report covers the basics of manipulating data, constructing models, and evaluating models in the Microsoft Azure Machine Learning platform (Azure ML). The Azure ML platform has greatly simplified the development and deployment of machine learning models, with easy-to-use and powerful cloud-based data transformation and machine learning tools. In this report, we’ll explore extending Azure ML with the R language. (A companion report explores extending Azure ML using the Python language.) All of the concepts we will cover are illustrated with a data science example, using a bicycle rental demand dataset. We’ll perform the required data manipulation, or data munging. Then, we will construct and evaluate regression models for the dataset. You can follow along by downloading the code and data provided in the next section. Later in the report, we’ll discuss publishing your trained models as web services in the Azure cloud.
Data Science Revealed: A Data-Driven Glimpse into the Burgeoning new Field As the cost of computing power, data storage, and high-bandwidth Internet access and have plunged exponentially over the past two decades, companies around the globe recognized the power of harnessing data as a source of competitive advantage. But it was only recently, as social web applications and massive, parallel processing have become more widely available that the nescient field of data science revealed what many are becoming to understand: that data is the new oil,i the source for corporate energy and differentiation in the 21st century. Companies like Facebook, LinkedIn, Yahoo, and Google are generating data not only as their primary product, but are analyzing it to continuously improve their products. Pharmaceutical and biomedical companies are using big data to find new cures and analyze genetic information, while marketers leverage the same technology to generate new customer insights. In order to tap this newfound wealth, organizations of all sizes are turning to practitioners in the new field of data science who are capable of translating massive data into predictive insights that lead to results. Data science is an emerging field, with rapid changes, great uncertainty, and exciting opportunities. Our study attempts the first ever benchmark of the data science community, looking at how they interact with their data, the tools they use, their education, and how their organizations approach data-driven problem solving. We also looked at a smaller group of business intelligence professionals to identify areas of contrast between the emerging role of data scientists and the more mature field of BI. Our findings, summarized here, show an emerging talent gap between organizational needs and current industry capabilities exemplified by the unique contributions data scientists can make to an organization and the broad expectations of data science professionals generally.
Data Science Salary Survey 2013 O’Reilly Media conducted an anonymous salary and tools survey in 2012 and 2013 with attendees of the Strata Conference: Making Data Work in Santa Clara, California and Strata + Hadoop World in New York. Respondents from 37 US states and 33 countries, representing a variety of industries in the public and private sector, completed the survey. We ran the survey to better understand which tools data analysts and data scientists use and how those tools correlate with salary. Not all respondents describe their primary role as data scientist/data analyst, but almost all respondents are exposed to data analytics. Similarly, while just over half the respondents described themselves as technical leads, almost all reported that some part of their role included technical duties (i.e., 10–20% of their responsibilities included data analysis or software development). We looked at which tools correlate with others (if respondents use one, are they more likely to use another?) and created a network graph of the positive correlations. Tools could then be compared with salary, either individually or collectively, based on where they clustered on the graph.
Data Science, an Overview of Classification Techniques (Slide Deck)
Data Science, Banking, and Fintech The financial industry today is under siege, but not from economic pressures in Europe and China. Rather, this once-impenetrable fortress is currently riding a giant entrepreneurial wave of disruption, disintermediation, and digital innovation. Behind the siege is fintech, a spunky and growing group of financial technology companies. These venture-backed new arrivals are challenging the old champions in lending, payments, money transfer, trading, wealth management, and cryptocurrencies. In this O’Reilly report, author Cornelia Lévy-Bencheton examines the disruptive megatrends taking hold at every level and juncture of the financial ecosystem. You’ll find out how fintech is reshaping the financial industry, reimagining the ways consumers manage, save, and spend money through a data-driven culture of big data analytics, mobile payment services, and robo-advising. Can traditional financial institutions evolve in time to catch up and avoid being replaced? Pick up this report to learn about the current banking and financial services industry, key participants in fintech, and some adaptive strategies being used by traditional financial organizations.
Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics An action plan to enlarge the technical areas of statistics focuses on the data analyst. The plan sets out six technical areas of work for a university department and advocates a specific allocation of resources devoted to research in each area and to courses in each area. The value of technical work is judged by the extent to which it benefits the data analyst, either directly or indirectly. The plan is also applicable to government research labs and corporate research organizations.
Data Scientist Enablement Roadmap (Slide Deck)
Data Scientist: The Sexiest Job of the 21st Century When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a startup. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren’t seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, “It was like arriving at a conference reception and realizing you don’t know anyone. So you just stand in the corner sipping your drink—and you probably leave early.”
Data Storytelling: Using visualization to share the human impact of numbers Storytelling is a cornerstone of the human experience. The universe may be full of atoms, but it’s through stories that we truly construct our world. From Greek mythology to the Bible to television series like Cosmos, stories have been shaping our experience on Earth for as long as we’ve lived on it. A key purpose of storytelling is not just understanding the world but changing it. After all, why would we study the world if we didn’t want to know how we can—and should— influence it? Though many elements of stories have remained the same throughout history, we have developed better tools and mediums for telling them, such as printed books, movies, and comics. This has changed storytelling styles—and perhaps most importantly, the impact of those stories—over the millennia. But can stories be told with data, as well as with images and words? That’s what this paper’s about.
Data Stream Mining – A Practical Approach Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naïve Bayes classifiers at the leaves. MOA is related to WEKA, the Waikato Environment for Knowledge Analysis, which is an award-winning open-source workbench containing implementations of a wide range of batch machine learning methods. WEKA is also written in Java. The main benefits of Java are portability, where applications can be run on any platform with an appropriate Java virtual machine, and the strong and welldeveloped support libraries. Use of the language is widespread, and features such as the automatic garbage collection help to reduce programmer burden and error. This text explains the theoretical and practical foundations of the methods and streams available in MOA.
Data Visualization Techniques – From Basics to Big Data with SAS Visual Analytics A picture is worth a thousand words – especially when you are trying to understand and gain insights from data. It is particularly relevant when you are trying to find relationships among thousands or even millions of variables and determine their relative importance. Organizations of all types and sizes generate data each minute, hour and day. Everyone – including executives, departmental decision makers, call center workers and employees on production lines – hopes to learn things from collected data that can help them make better decisions, take smarter actions and operate more efficiently. Regardless of how much data you have, one of the best ways to discern important relationships is through advanced analysis and high-performance data visualization. If sophisticated analyses can be performed quickly, even immediately, and results presented in ways that showcase patterns and allow querying and exploration, people across all levels in your organization can make faster, more effective decisions. To create meaningful visuals of your data, there are some basics you should consider. Data size and column composition play an important role when selecting graphs to represent your data. This paper discusses some of the basic issues concerning data visualization and provides suggestions for addressing those issues. In addition, big data brings a unique set of challenges for creating visualizations. This paper covers some of those challenges and potential solutions as well. If you are working with massive amounts of data, one challenge is how to display results of data exploration and analysis in a way that is not overwhelming. You may need a new way to look at the data – one that collapses and condenses the results in an intuitive fashion but still displays graphs and charts that decision makers are accustomed to seeing. And, in today’s on-the-go society, you may also need to make the results available quickly via mobile devices, and provide users with the ability to easily explore data on their own in real time. SAS Visual Analytics is a new business intelligence solution that uses intelligent autocharting to help business analysts and nontechnical users visualize data. It creates the best possible visual based on the data that is selected. The visualizations make it easy to see patterns and trends and identify opportunities for further analysis. The heart and soul of SAS Visual Analytics is the SAS LASR Analytic Server, which can execute and accelerate analytic computations in-memory with unprecedented performance. The combination of high-performance analytics and an easy-to-use data exploration interface enables different types of users to create and interact with graphs so they can understand and derive value from their data faster than ever. This creates an unprecedented ability to solve difficult problems, improve business performance and mitigate risk – rapidly and confidently.
Data Visualization with ggplot2 (Cheat Sheet)
Data Visualization: A New Language for Storytelling An Emerging Universal Medium: When was the last time you saw a business presentation that did not include at least one slide with a bar graph or a pie chart? Data visualizations have become so ubiquitous that we no longer find them remarkable.
Data Visualization: Making Big Data Approachable and Valuable Enterprises today are beginning to realize the important role Big Data plays in achieving business goals. Concepts that used to be difficult for companies to comprehend— factors that influence a customer to make a purchase, behavior patterns that point to fraud or misuse, inefficiencies slowing down business processes—now can be understood and addressed by collecting and analyzing Big Data. The insight gained from such analysis helps organizations improve operations and identify new product and service opportunities that they may have otherwise missed. In essence, Big Data promises to deliver the advantages that companies need to drive revenue growth and gain a competitive edge. However, getting to that Big Data payoff is proving a difficult challenge for many organizations. Big Data is often voluminous and tends to rapidly change and morph, making it challenging to get a handle on and difficult to access. The majority of tools available to work with Big Data are complex and hard to use, and most enterprises don’t have the in-house expertise to perform the required data analysis and manipulation to draw out the answers that the business is seeking. In fact, in a recent survey conducted by IDG Research, when asked about analyzing Big Data, respondents cite lack of skills and difficulty in making Big Data available to users as two significant challenges. “A lot of existing Big Data techniques require you to really get your hands dirty; I don’t think that most Big Data software is as mature as it needs to be in order to be accessible to business users at most enterprises,” says Paul Kent, vice president of Big Data with SAS. “So if you’re not Google or LinkedIn or Facebook, and you don’t have thousands of engineers to work with Big Data, it can be difficult to find business answers in the information.” What enterprises need are tools to help them easily and effectively understand and analyze Big Data. Employees who aren’t data scientists or analysts should be able to ask questions of the data based on their own business expertise and quickly and easily find patterns, spot inconsistencies, even get answers to questions they haven’t yet thought to ask. Otherwise, the effort and expense that companies invest in collecting and mining Big Data may be challenged to yield significant actionable results. And companies run the risk of missing important business opportunities if they can’t find the answers that are likely stored in their own data.
Data Visualization: When Data Speaks Business This TEC Product Analysis Report aims to provide an extensive review of the set of data visualization features that form part of the essential core of IBM Cognos Business Intelligence (BI) capabilities. The report contains the following elements:
1. An introduction to IBM Cognos Business Intelligence and data visualization for providing extensive analytics and data discovery services
2. An analyst perspective covering data visualization, its role, importance, and value in the BI lifecycle chain and examining its relationship to other elements in a reliable and best practice scenario for performing BI within an organization
3. A review of IBM Cognos data visualization capabilities
4. A general conclusion and final analyst summary
Data Warehousing: Best Practices for Collecting, Storing, and Delivering Decision-Support Data Data Warehousing is a process for collecting, storing, and delivering decision-support data for some or all of an enterprise. Data warehousing is a broad subject that is described point by point in this Refcard. A data warehouse is one of the artifacts created in the data warehousing process.
Data Wrangling with dplyr and tidyr Cheat Sheet (Cheat Sheet)
Data: Emerging Trends and Technologies What are the emerging trends and technologies that will transform the data landscape in coming months? In this report from Strata + Hadoop World co-chair Alistair Croll, you’ll learn how the ubiquity of cheap sensors, fast networks, and distributed computing have given rise to several developments that will soon have a profound effect on individuals and society as a whole. Machine learning, for example, has quickly moved from lab tool to hosted, pay-as-you-go services in the cloud. Those services, in turn, are leading to predictive apps that will provide individuals with the right functionality and content at the right time by continuously learning about them and predicting what they’ll need. Computational power can produce cognitive augmentation.
Data-intensive applications, challenges, techniques and technologies: A survey on Big Data It is already true that Big Data has drawn huge attention from researchers in information sciences, policy and decision makers in governments and enterprises. As the speed of information growth exceeds Moore’s Law at the beginning of this new century, excessive data is making great troubles to human beings. However, there are so much potential and highly useful values hidden in the huge volume of data. A new scientific paradigm is born as dataintensive scientific discovery (DISD), also known as Big Data problems. A large number of fields and sectors, ranging from economic and business activities to public administration, from national security to scientific researches in many areas, involve with Big Data problems. On the one hand, Big Data is extremely valuable to produce productivity in businesses and evolutionary breakthroughs in scientific disciplines, which give us a lot of opportunities to make great progresses in many fields. There is no doubt that the future competitions in business productivity and technologies will surely converge into the Big Data explorations. On the other hand, Big Data also arises with many challenges, such as difficulties in data capture, data storage, data analysis and data visualization. This paper is aimed to demonstrate a close-up view about Big Data, including Big Data applications, Big Data opportunities and challenges, as well as the state-of-the-art techniques and technologies we currently adopt to deal with the Big Data problems. We also discuss several underlying methodologies to handle the data deluge, for example, granular computing, cloud computing, bio-inspired computing, and quantum computing.
Deciphering Big Data Stacks: An Overview of Big Data Tools With its ability to ingest, process, and decipher an abundance of incoming data, the Big Data is considered by many a cornerstone of future research and development. However, the large number of available tools and the overlap between those are impeding their technological potential. In this paper, we present a systematic grouping of the available tools and present a network of dependencies among those with the aim of composing individual tools into functional software stacks required to perform Big Data analyses.
Decision Management and Cloud as a Platform for Predictive Analytics (Slide Deck)
Decision Modeling with DMN: How to Build a Decision Requirements Model using the new Decision Model and Notation (DMN) standard The goal of this paper is to describe the four iterative steps to complete a Decision Requirements Model using the forthcoming DMN standard. Before beginning, it is important to understand the value of defining decision requirements as part of your overall requirements process. Experience shows that there are three main reasons for doing so:
1. Current requirements approaches don’t tackle the decision-making that is increasingly important in information systems.
2. While important for all software development projects, decision requirements are especially important for projects adopting business rules and advanced analytic technologies.
3. Decisions are a common language across business, IT and analytic organizations improving collaboration, increasing reuse, and easing implementation.
Decision Requirements Modeling for Analytic Projects Established analytic approaches like CRISP-DM stress the importance of understanding the project objectives and requirements from a business perspective, but to date there are no formal approaches to capturing this understanding in a repeatable, understandable format. Decision Requirements Modeling closes this gap. Decision Requirements Modeling is a successful technique that develops a richer, more complete business understanding earlier. Decision Requirements Modeling results in a clear business target, an understanding of how the results will be used and deployed, and by whom. Using Decision Requirements Modeling to guide and shape analytics projects reduces reliance on constrained specialist resources by improving requirements gathering, helps teams ask the key questions and enables teams to collaborate effectively across the organization, bringing analytics, IT and business professionals together. Using Decision Requirements Modeling to document analytic project requirements enables organizations to:
– Compare multiple projects for prioritization, including allowing new analytic development to be compared with updating or refining existing analytics.
– Act on a specific plan to guide analytic development that is accessible to business, IT and analytic teams alike.
– Reuse knowledge from project to project by creating an increasingly detailed and accurate view of decision-making and the role of analytics.
– Value information sources and analytics in terms of business impact.
There is an emerging consensus that Decision Requirements Modeling is the best way to specify decision-making. It is also central to a forthcoming standard, the Object Management Group’s Decision Model and Notation, which will give adopters access to a broad community and a vehicle for sharing expertise more widely.
Decision Theory – A Brief Introduction Decision theory is theory about decisions. The subject is not a very unified one. To the contrary, there are many different ways to theorize about decisions, and therefore also many different research traditions. This text attempts to reflect some of the diversity of the subject. Its emphasis lies on the less (mathematically) technical aspects of decision theory.
Deep Belief Nets (Slide Deck)
Deep Learning (Slide Deck)
Deep learning applications and challenges in big data analytics Big Data Analytics and Deep Learning are two high-focus of data science. Big Data has become important as many organizations both public and private have been collecting massive amounts of domain-specific information, which can contain useful information about problems such as national intelligence, cyber security, fraud detection, marketing, and medical informatics. Companies such as Google and Microsoft are analyzing large volumes of data for business analysis and decisions, impacting existing and future technology. Deep Learning algorithms extract high-level, complex abstractions as data representations through a hierarchical learning process. Complex abstractions are learnt at a given level based on relatively simpler abstractions formulated in the preceding level in the hierarchy. A key benefit of Deep Learning is the analysis and learning of massive amounts of unsupervised data, making it a valuable tool for Big Data Analytics where raw data is largely unlabeled and un-categorized. In the present study, we explore how Deep Learning can be utilized for addressing some important problems in Big Data Analytics, including extracting complex patterns from massive volumes of data, semantic indexing, data tagging, fast information retrieval, and simplifying discriminative tasks.We also investigate some aspects of Deep Learning research that need further exploration to incorporate specific challenges introduced by Big Data Analytics, including streaming data, high-dimensional data, scalability of models, and distributed computing. We conclude by presenting insights into relevant future works by posing some questions, including defining data sampling criteria, domain adaptation modeling, defining criteria for obtaining useful data abstractions, improving semantic indexing, semi-supervised learning, and active learning.
Deep Learning applied to NLP Convolutional Neural Network (CNNs) are typically associated with Computer Vision. CNNs are responsible for major breakthroughs in Image Classification and are the core of most Computer Vision systems today. More recently CNNs have been applied to problems in Natural Language Processing and gotten some interesting results. In this paper, we will try to explain the basics of CNNs, its different variations and how they have been applied to NLP.
Deep Learning: Past, Present and Future (Slide Deck)
Deep Neural Decision Forests We present Deep Neural Decision Forests – a novel approach that unifies classification trees with the representation learning functionality known from deep convolutional networks, by training them in an end-to-end manner. To combine these two worlds, we introduce a stochastic and differentiable decision tree model, which steers the representation learning usually conducted in the initial layers of a (deep) convolutional network. Our model differs from conventional deep networks because a decision forest provides the final predictions and it differs from conventional decision forests since we propose a principled, joint and global optimization of split and leaf node parameters. We show experimental results on benchmark machine learning datasets like MNIST and ImageNet and find onpar or superior results when compared to state-of-the-art deep models. Most remarkably, we obtain Top5-Errors of only 7:84%=6:38% on ImageNet validation data when integrating our forests in a single-crop, single/seven model GoogLeNet architecture, respectively. Thus, even without any form of training data set augmentation we are improving on the 6.67% error obtained by the best GoogLeNet architecture (7 models, 144 crops).
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images Deep neural networks (DNNs) have recently been achieving state-of-the-art performance on a variety of pattern-recognition tasks, most notably visual classification problems. Given that DNNs are now able to classify objects in images with near-human-level performance, questions naturally arise as to what differences remain between computer and human vision. A recent study revealed that changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion a library). Here we show a related result: it is easy to produce images that are completely unrecognizable to humans, but that state-of-theart DNNs believe to be recognizable objects with 99.99% confidence (e.g. labeling with certainty that white noise static is a lion). Specifically, we take convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and then find images with evolutionary algorithms or gradient ascent that DNNs label with high confidence as belonging to each dataset class. It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects. Our results shed light on interesting differences between human vision and current DNNs, and raise questions about the generality of DNN computer vision.
Deep Reinforcement Learning: An Overview We give an overview of recent exciting achievements of deep reinforcement learning (RL). We start with background of deep learning and reinforcement learning, as well as introduction of testbeds. Next we discuss Deep Q-Network (DQN) and its extensions, asynchronous methods, policy optimization, reward, and planning. After that, we talk about attention and memory, unsupervised learning, and learning to learn. Then we discuss various applications of RL, including games, in particular, AlphaGo, robotics, spoken dialogue systems (a.k.a. chatbot), machine translation, text sequence prediction, neural architecture design, personalized web services, healthcare, finance, and music generation. We mention topics/papers not reviewed yet. After listing a collection of RL resources, we close with discussions.
Deep-learning in Mobile Robotics – from Perception to Control Systems: A Survey on Why and Why not Deep-learning has dramatically changed the world overnight. It greatly boosted the development of visual perception, object detection, and speech recognition, etc. That was attributed to the multiple convolutional processing layers for abstraction of learning representations from massive data. The advantages of deep convolutional structures in data processing motivated the applications of artificial intelligence methods in robotic problems, especially perception and control system, the two typical and challenging problems in robotics. This paper presents a survey of the deep-learning research landscape in mobile robotics. We start with introducing the definition and development of deep-learning in related fields, especially the essential distinctions between image processing and robotic tasks. We described and discussed several typical applications and related works in this domain, followed by the benefits from deep-learning, and related existing frameworks. Besides, operation in the complex dynamic environment is regarded as a critical bottleneck for mobile robots, such as that for autonomous driving. We thus further emphasize the recent achievement on how deep-learning contributes to navigation and control systems for mobile robots. At the end, we discuss the open challenges and research frontiers.
DeepWalk: Online Learning of Social Representations We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk’s latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk’s representations can provide F1 scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk’s representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real world applications such as network classification, and anomaly detection.
Delivering Information Faster: In-Memory Technology Reboots the Big Data Analytics World In-memory technology – in which entire datasets are pre-loaded into a computer’s random access memory, alleviating the need for shuttling data between memory and disk storage every time a query is initiated – has actually been around for a number of years. However, with the onset of big data, as well as an insatiable thirst for analytics, the industry is taking a second look at this promising approach to speeding up data processing.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%,) on this visual recognition challenge.
Density-based Clustering The clustering methods like K-means or Expectation-Maximization are suitable for finding ellipsoid-shaped clusters, or at best convex clusters. However, for non-convex clusters, such as those shown in Figure 15.1, these methods have trouble finding the true clusters, since two points from different clusters may be closer than two points in the same cluster. The density-based methods we consider in this chapter are able to mine such non-convex or shape-based clusters. Figure
Design Principles of Massive, Robust Prediction Systems Most data mining research is concerned with building high-quality classification models in isolation. In massive production systems, however, the ability to monitor and maintain performance over time while growing in size and scope is equally important. Many external factors may degrade classification performance including changes in data distribution, noise or bias in the source data, and the evolution of the system itself. A well-functioning system must gracefully handle all of these. This paper lays out a set of design principles for large-scale autonomous data mining systems and then demonstrates our application of these principles within the m6d automated ad targeting system. We demonstrate a comprehensive set of quality control processes that allow us monitor and maintain thousands of distinct classification models automatically, and to add new models, take on new data, and correct poorly-performing models without manual intervention or system disruption.
Designing Great Visualizations This paper traces the history of visual representation, from early cave drawings through the computer revolution and the launch of Tableau. We will discuss some of the pioneers in data research and show how their work helped to revolutionize visualization techniques. We will also examine the different styles of data visuals, discuss some of the barriers to making effective visuals and the methods we use to overcome those barriers. In the end, we will show the power (and limits) of human perception, and how we can use data to tell stories – much like those of the earliest cave drawings.
Deterministic Distributed Matching: Simpler, Faster, Better We present improved deterministic distributed algorithms for a number of well-studied matching problems, which are simpler, faster, more accurate, and/or more general than their known counterparts. The common denominator of these results is a deterministic distributed rounding method for certain linear programs, which is the first such rounding method, to our knowledge. A sampling of our end results is as follows: — An $O(\log^2 \Delta \log n)$-round deterministic distributed algorithm for computing a maximal matching, in $n$-node graphs with maximum degree $\Delta$. This is the first improvement in about 20 years over the celebrated $O(\log^4 n)$-round algorithm of Hanckowiak, Karonski, and Panconesi [SODA’98, PODC’99]. — An $O(\log^2 \Delta \log \frac{1}{\varepsilon} + \log^ * n)$-round deterministic distributed algorithm for a $(2+\varepsilon)$-approximation of maximum matching. This is exponentially faster than the classic $O(\Delta +\log^* n)$-round $2$-approximation of Panconesi and Rizzi [DIST’01]. With some modifications, the algorithm can also find an almost maximal matching which leaves only an $\varepsilon$-fraction of the edges on unmatched nodes. — An $O(\log^2 \Delta \log \frac{1}{\varepsilon} \log_{1+\varepsilon} W + \log^ * n)$-round deterministic distributed algorithm for a $(2+\varepsilon)$-approximation of a maximum weighted matching, and also for the more general problem of maximum weighted $b$-matching. Here, $W$ denotes the maximum normalized weight. These improve over the $O(\log^4 n \log_{1+\varepsilon} W)$-round $(6+\varepsilon)$-approximation algorithm of Panconesi and Sozio [DIST’10].
Different Approach to the Problem of Missing Data There is a long history of devleopment of methodology dealing with missing data in statistical analysis. Today, the most popular methods fall into two classes, Complete Cases (CC) and Multiple Imputation (MI). Another approach, Available Cases (AC), has occasionally been mentioned in the research literature, in the context of linear regression analysis, but has generally been ignored. In this paper, we revisit the AC method, showing that it can perform better than CC and MI, and we extend its breadth of application.
Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society The Big Data Research and Development Initiative is now in its third year and making great strides to address the challenges of Big Data. To further advance this initiative, we describe how statistical thinking can help tackle the many Big Data challenges, emphasizing that often the most productive approach will involve multidisciplinary teams with statistical, computational, mathematical, and scientific domain expertise.
discrete examples: genetics and spell checking
Distance Metric Learning – A Comprehensive Survey Many machine learning algorithms, such as K Nearest Neighbor (KNN), heavily rely on the distance metric for the input data patterns. Distance Metric learning is to learn a distance metric for the input space of data from a given collection of pair of similar/dissimilar points that preserves the distance relation among the training data. In recent years, many studies have demonstrated, both empirically and theoretically, that a learned metric can significantly improve the performance in classification, clustering and retrieval tasks. This paper surveys the field of distance metric learning from a principle perspective, and includes a broad selection of recent work. In particular, distance metric learning is reviewed under different learning conditions: supervised learning versus unsupervised learning, learning in a global sense versus in a local sense; and the distance matrix based on linear kernel versus nonlinear kernel. In addition, this paper discusses a number of techniques that is central to distance metric learning, including convex programming, positive semi-definite programming, kernel learning, dimension reduction, K Nearest Neighbor, large margin classification, and graph-based approaches.
Distinguishing cause from effect using observational data: methods and benchmarks The discovery of causal relationships from purely observational data is a fundamental problem in science. The most elementary form of such a causal discovery problem is to decide whether X causes Y or, alternatively, Y causes X, given joint observations of two variables X; Y . This was often considered to be impossible. Nevertheless, several approaches for addressing this bivariate causal discovery problem were proposed recently. In this paper, we present the benchmark data set CauseEffectPairs that consists of 88 di erent \causee ect pairs” selected from 31 datasets from various domains. We evaluated the performance of several bivariate causal discovery methods on these real-world benchmark data and on arti cially simulated data. Our empirical results provide evidence that additive-noise methods are indeed able to distinguish cause from e ect using only purely observational data. In addition, we prove consistency of the additive-noise method proposed by Hoyer et al. (2009).
Distributed Constraint Optimization Problems and Applications: A Survey The field of Multi-Agent System (MAS) is an active area of research within Artificial Intelligence, with an increasingly important impact in industrial and other real-world applications. Within a MAS, autonomous agents interact to pursue personal interests and/or to achieve common objectives. Distributed Constraint Optimization Problems (DCOPs) have emerged as one of the prominent agent architectures to govern the agents’ autonomous behavior, where both algorithms and communication models are driven by the structure of the specific problem. During the last decade, several extensions to the DCOP model have enabled them to support MAS in complex, real-time, and uncertain environments. This survey aims at providing an overview of the DCOP model, giving a classification of its multiple extensions and addressing both resolution methods and applications that find a natural mapping within each class of DCOPs. The proposed classification suggests several future perspectives for DCOP extensions, and identifies challenges in the design of efficient resolution algorithms, possibly through the adaptation of strategies from different areas.
Distributed Decision Tree Learning for Mining Big Data Streams Web companies need to e ectively analyse big data in order to enhance the experiences of their users. They need to have systems that are capable of handling big data in term of three dimensions: volume as data keeps growing, variety as the type of data is diverse, and velocity as the is continuously arriving very fast into the systems. However, most of the existing systems have addressed at most only two out of the three dimensions such as Mahout, a distributed machine learning framework that addresses the volume and variety dimensions, and Massive Online Analysis (MOA), a streaming machine learning framework that handles the variety and velocity dimensions. In this thesis, we propose and develop Scalable Advanced Massive Online Analysis (SAMOA), a distributed streaming machine learning framework to address the aforementioned challenge. SAMOA provides exible application programming interfaces (APIs) to allow rapid development of new ML algorithms for dealing with variety. Moreover, we integrate SAMOA with Storm, a state-of-the-art stream processing engine (SPE), which allows SAMOA to inherit Storm’s scalability to address velocity and volume. The main benefits of SAMOA are: it provides exibility in developing new ML algorithms and extensibility in integrating new SPEs. We develop a distributed online classification algorithm on top of SAMOA to verify the aforementioned features of SAMOA. The evaluation results show that the distributed algorithm is suitable for high number of attributes settings.
Distributed Latent Dirichlet Allocation via Tensor Factorization Latent Dirichlet Allocation (LDA) has proven extremely popular and versatile since its introduction over a decade ago. LDA is successful in part because it assigns a mixture of latent states (‘topics’) to each set of exchangeable observations (‘document’), in contrast to a hard clustering. This property complicates the estimation of latent parameters, and has led to extensive research in disparate learning techniques. Broadly speaking there are 3 basic strategies: variational inference ; Markov chain Monte Carlo ; and the method of moments , the latter having been recently discovered. Due to high dimensional data with large vocabulary size; numerous documents; and number of topics, computational constraints are the limiting factor to developing large scale topic models. This has motivated research into scalable computational strategies for LDA. In the single node context, stochastic variational inference is fast and accurate, but has high communication costs in the distributed setting. Batch variational inference has a more favorable ratio of communication to computation as the E-step (but not the M-step) is embarrisingly parallel. Markov chain Monte Carlo (MCMC) techniques have also been implemented in the distributed setting, both synchronous and asynchronous variants. Due to their recent introduction, there are no distributed implementations of method of moments based approaches to LDA. We leverage that the method of moments for LDA reduces to canonical polyadic (CP) decomposition of a tensor, a problem which has received extensive study in the literature , including distributed variants. We combine ALS with whitening preprocessing (data orthogonalization and dimensionality reduction) motivated by better convergence rate and perturbation guarantees compared to previous methods. Additionally, the preprocessing has the benefit that the subsequent tensor decomposition is independent of the vocabulary size and the number of documents. Although ALS requires many iterations to converge (more than would be tolerable using map-reduce without custom support for low-overhead iteration), we utilize REEF , a distributed processing framework which runs on YARN managed clusters, e.g., a Hadoop 2 installation.
Distributed Machine Learning with Apache Mahout (RefCard) Apache Mahout is a library for scalable machine learning. Originally a subproject of Apache Lucene (a high-performance text search engine library), Mahout has progressed to be a top-level Apache project. While Mahout has only been around for a few years, it has established itself as a frontrunner in the field of machine learning technologies. This Refcard will present the basics of Mahout by studying two possible applications:
• Training and testing a Random Forest for handwriting recognition using Amazon Web Services EMR.
• Running a recommendation engine on a standalone Spark cluster.
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbors, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today. We use 121 data sets, which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behavior, not dependent on the data set collection. The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).
Do you know Big Data? (Cheat Sheet)
Domain Adaptation for Visual Applications: A Comprehensive Survey The aim of this paper is to give an overview of domain adaptation and transfer learning with a specific view on visual applications. After a general motivation, we first position domain adaptation in the larger transfer learning problem. Second, we try to address and analyze briefly the state-of-the-art methods for different types of scenarios, first describing the historical shallow methods, addressing both the homogeneous and the heterogeneous domain adaptation methods. Third, we discuss the effect of the success of deep convolutional architectures which led to new type of domain adaptation methods that integrate the adaptation within the deep architecture. Fourth, we overview the methods that go beyond image categorization, such as object detection or image segmentation, video analyses or learning visual attributes. Finally, we conclude the paper with a section where we relate domain adaptation to other machine learning solutions.
Don’t Be Overwhelmed by Big Data Big. Data. The CPG industry is abuzz with those two words. And for good reason, as both brick-and-mortar retailers and online retailers each attempt to create the ideal omni-channel consumer experience that will drive increased sales, they look to Big Data for actionable insights that can be measured against key KPIs. And while it’s understandable that the CPG industry is excited by the prospect of more data they can use to better understand the who, what, why and when of consumer purchasing behavior, it’s critical CPG organizations pause and ask themselves, “Are we providing retail team and executive team members with “quality” data? POS, Big Data, order data or shipment summary data. It doesn’t matter. Is the right (i.e., “quality”) data getting to the right people at the right time? In essence, it’s the same question retail and executive teams face when considering how to best merchandise their SKUs. If CPG organizations understand the importance of getting the right product to the right people at the right time, then surely they understand the importance of applying the same forethought to their demand data? …
Dynamic Bayesian Networks: A State of the Art
Dynamic Decision Networks for Decision-Making in Self-Adaptive Systems: A Case Study Bayesian decision theory is increasingly applied to support decision-making processes under environmental variability and uncertainty. Researchers from application areas like psychology and biomedicine have applied these techniques successfully. However, in the area of software engineering and specifically in the area of self-adaptive systems (SASs), little progress has been made in the application of Bayesian decision theory. We believe that techniques based on Bayesian Networks (BNs) are useful for systems that dynamically adapt themselves at runtime to a changing environment, which is usually uncertain. In this paper, we discuss the case for the use of BNs, specifically Dynamic Decision Networks (DDNs), to support the decision-making of self-adaptive systems. We present how such a probabilistic model can be used to support the decisionmaking in SASs and justify its applicability. We have applied our DDN-based approach to the case of an adaptive remote data mirroring system. We discuss results, implications and potential benefits of the DDN to enhance the development and operation of self-adaptive systems, by providing mechanisms to cope with uncertainty and automatically make the best decision.
Dynamo: Amazon’s Highly Available Key-value Store Reliability at massive scale is one of the biggest challenges we face at, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.


Econometrics in R A more advanced tutorial on econometrics with R
Economic Forecasting Forecasts guide decisions in all areas of economics and finance and their value can only be understood in relation to, and in the context of, such decisions. We discuss the central role of the loss function in helping determine the forecaster’s objectives. Decision theory provides a framework for both the construction and evaluation of forecasts. This framework allows an understanding of the challenges that arise from the explosion in the sheer volume of predictor variables under consideration and the forecaster’s ability to entertain an endless array of forecasting models and time-varying specifications, none of which may coincide with the ‘true’ model. We show this along with reviewing methods for comparing the forecasting performance of pairs of models or evaluating the ability of the best of many models to beat a benchmark specification.
Efficient Dimensionality Reduction for High-Dimensional Network Estimation We propose module graphical lasso (MGL), an aggressive dimensionality reduction and network estimation technique for a highdimensional Gaussian graphical model (GGM). MGL achieves scalability, interpretability and robustness by exploiting the modularity property of many real-world networks. Variables are organized into tightly coupled modules and a graph structure is estimated to determine the conditional independencies among modules. MGL iteratively learns the module assignment of variables, the latent variables, each corresponding to a module, and the parameters of the GGM of the latent variables. In synthetic data experiments, MGL outperforms the standard graphical lasso and three other methods that incorporate latent variables into GGMs. When applied to gene expression data from ovarian cancer, MGL outperforms standard clustering algorithms in identifying functionally coherent gene sets and predicting survival time of patients. The learned modules and their dependencies provide novel insights into cancer biology as well as identifying possible novel drug targets.
Efficient Estimation of Word Representations in Vector Space We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Efficient Forecasting for Hierarchical Time Series Forecasting is used as the basis for business planning in many application areas such as energy, sales and traffic management. Time series data used in these areas is often hierarchically organized and thus, aggregated along the hierarchy levels based on their dimensional features. Calculating forecasts in these environments is very time consuming, due to ensuring forecasting consistency between hierarchy levels. To increase the forecasting efficiency for hierarchically organized time series, we introduce a novel forecasting approach that takes advantage of the hierarchical organization. There, we reuse the forecast models maintained on the lowest level of the hierarchy to almost instantly create already estimated forecast models on higher hierarchical levels. In addition, we define a hierarchical communication framework, increasing the communication flexibility and efficiency. Our experiments show significant runtime improvements for creating a forecast model at higher hierarchical levels, while still providing a very high accuracy.
Enhancing Teaching and Learning Through Educational Data Mining and Learning Analytics: An Issue Brief In data mining and data analytics, tools and techniques once confined to research laboratories are being adopted by forward-looking industries to generate business intelligence for improving decision making. Higher education institutions are beginning to use analytics for improving the services they provide and for increasing student grades and retention. The U.S. Department of Education’s National Education Technology Plan, as one part of its model for 21st-century learning powered by technology, envisions ways of using data from online learning systems to improve instruction.
Entering the Era of Data Science: Targeted Learning and the Integration of Statistics and Computational Data Analysis This outlook paper reviews the research of van der Laan’s group on Targeted Learning, a subfield of statistics that is concerned with the construction of data adaptive estimators of user-supplied target parameters of the probability distribution of the data and corresponding confidence intervals, aiming at only relying on realistic statistical assumptions. Targeted Learning fully utilizes the state of the art in machine learning tools, while still preserving the important identity of statistics as a field that is concerned with both accurate estimation of the true target parameter value and assessment of uncertainty in order to make sound statistical conclusions.We also provide a philosophical historical perspective on Targeted Learning, also relating it to the new developments in Big Data. We conclude with some remarks explaining the immediate relevance of Targeted Learning to the current Big Data movement.
Error Statistics Error statistics, as we are using that term, has a dual dimension involving philosophy and methodology. It refers to a standpoint regarding both:
1. a cluster of statistical tools, their interpretation and justification,
2. a general philosophy of science, and the roles probability plays in inductive inference.
Evaluating a classification model – What does precision and recall tell me? Once you have created a predictive model, you always need to find out is how good it is. RDS helps you with this by reporting the performance of the model. All measures reported by RDS are estimated from data not used for creating the model.
Evaluating the Design of the R Language R is a dynamic language for statistical computing that combines lazy functional features and object-oriented programming. This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular. With millions of lines of R code available in repositories, we have an opportunity to evaluate the fundamental choices underlying the R language design. Using a combination of static and dynamic program analysis we assess the success of different language features.
Evaluation-as-a-Service: Overview and Outlook Evaluation in empirical computer science is essential to show progress and assess technologies developed. Several research domains such as information retrieval have long relied on systematic evaluation to measure progress: here, the Cranfield paradigm of creating shared test collections, defining search tasks, and collecting ground truth for these tasks has persisted up until now. In recent years, however, several new challenges have emerged that do not fit this paradigm very well: extremely large data sets, confidential data sets as found in the medical domain, and rapidly changing data sets as often encountered in industry. Also, crowdsourcing has changed the way that industry approaches problem-solving with companies now organizing challenges and handing out monetary awards to incentivize people to work on their challenges, particularly in the field of machine learning. This white paper is based on discussions at a workshop on Evaluation-as-a-Service (EaaS). EaaS is the paradigm of not providing data sets to participants and have them work on the data locally, but keeping the data central and allowing access via Application Programming Interfaces (API), Virtual Machines (VM) or other possibilities to ship executables. The objective of this white paper are to summarize and compare the current approaches and consolidate the experiences of these approaches to outline the next steps of EaaS, particularly towards sustainable research infrastructures. This white paper summarizes several existing approaches to EaaS and analyzes their usage scenarios and also the advantages and disadvantages. The many factors influencing EaaS are overviewed, and the environment in terms of motivations for the various stakeholders, from funding agencies to challenge organizers, researchers and participants, to industry interested in supplying real-world problems for which they require solutions.
Event History Analysis Event history analysis deals with data obtained by observing individuals over time, focusing on events occurring for the individuals under observation. Important applications are to life events of humans in demography, life insurance mathematics, epidemiology, and sociology. The basic data are the times of occurrence of the events and the types of events that occur. The standard approach to the analysis of such data is to use multistate models; a basic example is finite-state Markov processes in continuous time. Censoring and truncation are defining features of the area. This review comments specifically on three areas that are current subjects of active development, all motivated by demands from applications: sampling patterns, the possibility of causal interpretation of the analyses, and the levels and interpretation of variability.
Evolutionary Multimodal Optimization: A Short Survey Real world problems always have different multiple solutions. For instance, optical engineers need to tune the recording parameters to get as many optimal solutions as possible for multiple trials in the varied-line-spacing holographic grating design problem. Unfortunately, most traditional optimization techniques focus on solving for a single optimal solution. They need to be applied several times; yet all solutions are not guaranteed to be found. Thus the multimodal optimization problem was proposed. In that problem, we are interested in not only a single optimal point, but also the others. With strong parallel search capability, evolutionary algorithms are shown to be particularly effective in solving this type of problem. In particular, the evolutionary algorithms for multimodal optimization usually not only locate multiple optima in a single run, but also preserve their population diversity throughout a run, resulting in their global optimization ability on multimodal functions. In addition, the techniques for multimodal optimization are borrowed as diversity maintenance techniques to other problems. In this chapter, we describe and review the state-of-the-arts evolutionary algorithms for multimodal optimization in terms of methodology, benchmarking, and application.
Experiencing SAX: a Novel Symbolic Representation of Time Series Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentiality allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on various data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.
Exploiting Innovative Technologies in BI and Big Data Analytics Data is worth nothing without the right technologies to facilitate its transformation into meaningful information – delivered to the right people in a timely manner – for improved decision making. Forrester recently conducted a survey with 330 global business intelligence (BI) decision makers and found strong correlations between overall company success and adoption of innovative BI, analytics, and Big Data tools. Is your company harnessing the latest innovations in BI technology to achieve your business goals?
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes.
Extraction of food consumption systems by non-negative matrix factorization (NMF) for the assessment of food choices. In Western countries where food supply is satisfactory, consumers organize their diets around a large combination of foods. It is the purpose of this paper to examine how recent nonnegative matrix fac- torization (NMF) techniques can be applied to food consumption data in order to understand these combinations. Such data are nonnegative by nature and of high dimension. The NMF model provides a representation of consumption data through latent vectors with nonnegative coe cients, we call consumption systems, in a small number. As the NMF approach may encourage sparsity of the data representation produced, the resulting consumption systems are easily interpretable. Beyond the illustration of its properties we provide through a simple simulation result, the NMF method is applied to data issued from a french consumption survey. The numerical results thus obtained are displayed and thoroughly discussed. A clustering based on the k-means method is also achieved in the resulting latent consumption space, in order to recover food consumption patterns easily usable for nutritionists.
Eyes Wide Open: Open Source Analytics 1. The total cost of owning and managing analytics technology consists of hardware (price per CPU, price per unit of storage), software (price per unit/license) and human capital (price per output) costs. Human capital costs are divided between line of business (LOB) users and IT support costs.
2. Transformational advances in data storage and compute power over the past 20 years have driven hardware costs so low that adoption is nearly universal. At the same time, managing these systems has become easier, resulting in lower human capital expense in the form training time (LOB users) and maintenance and management costs (IT). Resilient and reliable storage and compute power is now a commodity.
3. Open source storage (Apache Hadoop) and operating system (Linux) options have proliferated over the past 3+ years leading many firms to reliably experiment with low/no cost open source options to supplement or replace licensed commercial solutions.
4. In contrast, firms venturing down the open source analytics software path are not always seeing the expected cost reductions due to higher human capital expenses and increased risk that introduced into the enterprise through open source software.
5. IIA recommends firms take a blended approach to software selection, matching the correct tool to analytics user type/role, and that firms recalculate total costs, specifically incorporating potential risks associated with open source tools, particularly in mission critical applications.


Fast unfolding of communities in large networks We propose a simple method to extract the community structure of large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection method in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity. This is shown first by identifying language communities in a Belgian mobile phone network of 2.6 million customers and by analyzing a web graph of 118 million nodes and more than one billion links. The accuracy of our algorithm is also verified on ad-hoc modular networks.
Feature Engineering Tips for Data Scientists and Business Analysts Most data scientists and statisticians agree that predictive modeling is both art and science yet, relatively little to no air time is given to describing the art. This post describes one piece of the art of modeling called feature engineering which expands the number of variables you have to build a model. I offer six ways to implement feature engineering and provide examples of each. Using methods like these is important because additional relevant variables increase model accuracy, which makes feature engineering an essential part of the modeling process.
Finite Mixture Models and Model-Based Clustering Finite mixture models have a long history in statistics, having been used to model pupulation heterogeneity, generalize distributional assumptions, and lately, for providing a convenient yet formal framework for clustering and classification. This paper provides a detailed review into mixture models and model-based clustering. Recent trends in the area, as well as open problems are also discussed.
Firefly Algorithm for optimization problems with non-continuous variables: A Review and Analysis Firefly algorithm is a swarm based metaheuristic algorithm inspired by the flashing behavior of fireflies. It is an effective and an easy to implement algorithm. It has been tested on different problems from different disciplines and found to be effective. Even though the algorithm is proposed for optimization problems with continuous variables, it has been modified and used for problems with non-continuous variables, including binary and integer valued problems. In this paper a detailed review of this modifications of firefly algorithm for problems with non-continuous variables will be discussed. The strength and weakness of the modifications along with possible future works will be presented.
Five big data challenges And how to overcome them with visual analytics Big data is set to offer companies tremendous insight. But with terabytes and petabytes of data pouring in to organizations today, traditional architectures and infrastructures are not up to the challenge. IT teams are burdened with ever-growing requests for data, ad hoc analyses and one-off reports. Decision makers become frustrated because it takes hours or days to get answers to questions, if at all. More users are expecting self-service access to information in a form they can easily understand and share with others. This begs the question: How do you present big data in a way that business leaders can quickly understand and use? This is not a minor consideration. Mining millions of rows of data creates a big headache for analysts tasked with sorting and presenting data. Organizations often approach the problem in one of two ways: Build “samples” so that it is easier to both analyze and present the data, or create template charts and graphs that can accept certain types of information. Both approaches miss the potential for big data. Instead, consider pairing big data with visual analytics so that you use all the data and receive automated help in selecting the best ways to present the data. This frees staff to deploy insights from data. Think of your data as a great, but messy, story. Visual analytics is the master filmmaker and the gifted editor who bring the story to life.
Five pillars of prescriptive analytics success As the Big Data Analytics space continues to evolve, one of the breakthrough technologies that businesses will be talking about in the coming years is prescriptive analytics. The promise of prescriptive analytics is certainly alluring: it enables decision-makers to not only look into the future of their mission critical processes and see the opportunities (and issues) that are potentially out there, but it also presents the best course of action to take advantage of that foresight in a timely manner. What should we look for in a prescriptive analytics solution to ensure it will deliver business value today and tomorrow?
Five Ways to Empower Business Analysts and Succeed in Your Self-Service BI Program The term self-service is ubiquitous in today’s business intelligence (BI) market. BI vendors and organizations alike constantly work to expand BI’s use and value proposition within the organization by making it more accessible to a wider variety of people. This push has created a series of BI offerings that are easy to interact and design with without the help of IT departments. However, there are many types of BI users that are still underserved. Primary among them are business analysts that could make self-service BI more successful if they are empowered with higher levels of interactivity and the capabilities to design their own BI applications. Traditionally, business analysts have a broader understanding of data relationships and know how to develop their own analytics. They interact with spreadsheets, develop their own SQL scripts, and create their own databases. In many cases, business analysts are power users because they are tasked with taking ownership over data due to their level of expertise. The business analyst is the person who understands the intricacies of data and how it interrelates within the organization’s ecosystem, represents the link between their departments and IT, and develops analytical insights based on their subject matter expertise. Because of this skill set, these users are tasked with developing their own complex set of analytics. They also create BI models and reports that will be consumed by employees across the organization. Luckily some self-service BI offerings support this two-tiered approach, providing business analysts with access to the components required to do so successfully. The importance of the power user/business analyst role cannot be overlooked as creating a successful self-service BI strategy requires consumption of analytics, design that supports this consumption, and business control to ensure that the right information is delivered to the right people and that the right business rules are applied. This represents the intersection of business and IT roles and represents the value of the power user. Now, more than ever, it is important to take advantage of these skill sets to drive self-service BI. The reality is that many BI projects fail when not controlled by the business. Lack of proper requirements gathering and the inability to meet the needs of users creates a lack of adoption. Also, information is becoming more varied and complex. Simple tools such as Excel and Access no longer handle the complexities and increasing volumes inherent in big data or maintain the validity of analytic models. Between this and the increasing demand for in-house programmers and software developers, organizations need to have business analysts that understand the needs of the business, can perform robust analytics, and provide consumable applications for a wide variety of business users to support more technical roles required. This paper looks at five key enablers of self-service BI for business analysts. These are:
1. Building a collaborative relationship between the business analyst and IT
2. Design flexibility
3. Cohesion between technology, people, and business processes
4. Data diversity and preparation 5. Data quality
Fog Computing: A Taxonomy, Survey and Future Directions In recent years, the number of Internet of Things (IoT) devices/sensors has increased to a great extent. To support the computational demand of real-time latency-sensitive applications of largely geo-distributed IoT devices/sensors, a new computing paradigm named ‘Fog computing’ has been introduced. Generally, Fog computing resides closer to the IoT devices/sensors and extends the Cloud-based computing, storage and networking facilities. In this chapter, we comprehensively analyse the challenges in Fogs acting as an intermediate layer between IoT devices/ sensors and Cloud datacentres and review the current developments in this field. We present a taxonomy of Fog computing according to the identified challenges and its key features.We also map the existing works to the taxonomy in order to identify current research gaps in the area of Fog computing. Moreover, based on the observations, we propose future directions for research.
Forecasting Time Series With Complex Seasonal Patterns Using Exponential Smoothing An innovations state space modeling framework is introduced for forecasting complex seasonal time series such as those with multiple seasonal periods, high-frequency seasonality, non-integer seasonality, and dual-calendar effects. The new framework incorporates Box–Cox transformations, Fourier representations with time varying coefficients, and ARMA error correction. Likelihood evaluation and analytical expressions for point forecasts and interval predictions under the assumption of Gaussian errors are derived, leading to a simple, comprehensive approach to forecasting complex seasonal time series. A key feature of the framework is that it relies on a new method that greatly reduces the computational burden in the maximum likelihood estimation. The modeling framework is useful for a broad range of applications, its versatility being illustrated in three empirical studies. In addition, the proposed trigonometric formulation is presented as a means of decomposing complex seasonal time series, and it is shown that this decomposition leads to the identification and extraction of seasonal components which are otherwise not apparent in the time series plot itself.
Fostering a data-driven culture Fostering a data-driven culture is an Economist Intelligence Unit report, sponsored by Tableau Software. It explores the challenges in nurturing a data-driven culture, and what companies can do to meet them. The Economist Intelligence Unit bears sole responsibility for the content of this report. The fi ndings do not necessarily refl ect the views of the sponsor. The paper draws on two main sources for its research and fi ndings:
* A survey, conducted in October 2012, of 530 senior executives from around the world. More than 40% of respondents are C-Level executives, including 23% from the CEO, president or managing director ranks and 9%, CIOs. Responses come from a wide range of regions: 50% North America, 15% Asia-Pacifi c, 26% Western Europe and 9% Latin America. The range of company sizes is also diverse, from those with revenue of less than US$500m (53%) through to those with revenue of US$10bn or more (20%). The survey covers nearly all industries, including IT and technology (18%), fi nancial services (17%), professional services (11%) and manufacturing (7%).
* A series of in-depth interviews with the following senior executives: Sidney Minassian, CEO, Contexti Jerry O’Dwyer, principal, Deloitte Consulting William Schmarzo, CTO, EMC Colin Hill, CEO, GNS Healthcare
We would like to thank all interviewees and survey respondents for their time and insight. The report was written by Jim Giles and edited by Gilda Stahl.
From Data Mining to Knowledge Discovery in Databases Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.
From Data Scientist to Data Artist: Data Sculpting to Shape Business Insights Discover the new role of data artist and understand the need, value and availability of powerful, flexible and affordable analytics tools that do not require an advanced degree in mathematics nor a team of information technology experts to use. Learn about the professional requirements of a data artist and how to change corporate culture with the right analytics tools in the right hands. Read about the exploits of a news organization that used those tools to change its culture and become profitable.
From Linear Models to Machine Learning Regression analysis is both one of the oldest branches of statistics, with least-squares analysis having been rst proposed way back in 1805, and also one of the newest areas, in the form of the machine learning techniques being vigorously researched today. Not surprisingly, then, there is a vast literature on the subject. Well, then, why write yet another regression book? Many books are out there already, with titles using words like regression, classi cation, predic- tive analytics, machine learning and so on. They are written by authors whom I greatly admire, and whose work I myself have found useful. Yet, I did not feel that any existing books covered the material in a manner that su ciently provided insight for the practicing data analyst.
From Yawn to YARN: Why You Should be Excited About Hadoop 2 By now almost everyone has heard the story of the yellow elephant who never forgets data, consumes whatever data you have from any source, and magically produces a big data treasure trove of business insights for you, including tweets, telemetry, customer sentiment, sensor readings, mobile app activity, and more! In fact, the story has been told and re-told so many times now that most people’s natural reaction is… yawn. Hadoop. Big Data. Yeah, yeah. I have heard this story too many times. I google Hadoop and get almost one billion results, but I can’t yell “yahoo!” about getting paid big bucks to code big data applications in MapReduce, which those cool kids in Silicon Valley used to “money-expand” into billionaires. So why should I be excited about Hadoop 2? After all, no sequel is as good as the original. Well, in this case, the sequel is better. The story has changed. The script has flipped. Even though the new protagonist’s name sounds like yawn, the yarn about YARN is much more than yet another chapter in the same old story. More reimagining than sequel, it will take you from yawn to YARN and get you excited about Hadoop 2. This paper explains why.
Functional data clustering: a survey The main contributions to functional data clustering are reviewed. Most approaches used for clustering functional data are based on the following three methodologies: dimension reduction before clustering, nonparametric methods using specific distances or dissimilarities between curves and model-based clustering methods. These latter assume a probabilistic distribution on either the principal components or coefficients of functional data expansion into a finite dimensional basis of functions. Numerical illustrations as well as a software review are presented.
Functional Regression: A New Model for Predicting Market Penetration of New Products The Bass model has been a standard for analyzing and predicting the market penetration of new products. We demonstrate the insights to be gained and predictive performance of functional data analysis (FDA), a new class of nonparametric techniques that has shown impressive results within the statistics community, on the market penetration of 760 categories drawn from 21 products and 70 countries. We propose a new model called Functional Regression and compare its performance to several models, including the Classic Bass model, Estimated Means, Last Observation Projection, a Meta-Bass model, and an Augmented Meta-Bass model for predicting eight aspects of market penetration. Results (a) validate the logic of FDA in integrating information across categories, (b) show that Augmented Functional Regression is superior to the above models, and (c) product-specific effects are more important than country-specific effects when predicting penetration of an evolving new product.
Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation Many machine learning approaches are characterized by information constraints on how they interact with the training data. These include memory and sequential access constraints (e.g. fast first-order methods to solve stochastic optimization problems); communication constraints (e.g. distributed learning); partial access to the underlying data (e.g. missing features and multi-armed bandits) and more. However, currently we have little understanding how such information constraints fundamentally affect our performance, independent of the learning problem semantics. For example, are there learning problems where any algorithm which has small memory footprint (or can use any bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this paper, we describe how a single set of results implies positive answers to the above, for several different settings.


Gaussian Processes for Regression A Quick Introduction
Gaussian Processes in Machine Learning We give a basic introduction to Gaussian Process regression models. We focus on understanding the role of the stochastic process and how it is used to define a distribution over functions. We present the simple equations for incorporating training data and examine how to learn the hyperparameters using the marginal likelihood. We explain the practical advantages of Gaussian Process and end with conclusions and a look at the current trends in GP work.
Generalized Gradient Descent (Slide Deck)
Generalized Power Method for Sparse Principal Component Analysis In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed.
Getting Started with Apache Hadoop This Refcard presents Apache Hadoop, a software framework that enables distributed storage and processing of large datasets using simple high-level programming models. We cover the most important concepts of Hadoop, describe its architecture, guide how to start using it as well as write and execute various applications on Hadoop. In the nutshell, Hadoop is an open-source project of the Apache Software Foundation that can be installed on a set of standard machines, so that these machines can communicate and work together to store and process large datasets. …
Getting Started with Spark (Slide Deck)
GIS with R (Slide Deck)
Glossary (Cheat Sheet)
Grades of Evidence (Cheat Sheet)
Gradient boosting machines, a tutorial Gradient boosting machines are a family of powerful machine-learning techniques that have shown considerable success in a wide range of practical applications. They are highly customizable to the particular needs of the application, like being learned with respect to different loss functions. This article gives a tutorial introduction into the methodology of gradient boosting methods with a strong focus on machine-learning aspects of modeling. A theoretical information is complemented with descriptive examples and illustrations which cover all the stages of the gradient boosting model design. Considerations on handling the model complexity are discussed. Three practical examples of gradient boosting applications are presented and comprehensively analyzed.
Graphical Models Statistical applications in fields such as bioinformatics, information retrieval, speech processing, image processing and communications often involve large-scale models in which thousands or millions of random variables are linked in complex ways. Graphical models provide a general methodology for approaching these problems, and indeed many of the models developed by researchers in these applied fields are instances of the general graphical model formalism.We review some of the basic ideas underlying graphical models, including the algorithmic ideas that allow graphical models to be deployed in large-scale data analysis problems.We also present examples of graphical models in bioinformatics, error-control coding and language processing.
Graphical Models in a Nutshell Probabilistic graphical models are an elegant framework which combines uncertainty (probabilities) and logical structure (independence constraints) to compactly represent complex, real-world phenomena. The framework is quite general in that many of the commonly proposed statistical models (Kalman filters, hidden Markov models, Ising models) can be described as graphical models. Graphical models have enjoyed a surge of interest in the last two decades, due both to the flexibility and power of the representation and to the increased ability to effectively learn and perform inference in large networks.
Group theoretical methods in machine learning Ever since its discovery in 1807, the Fourier transform has been one of the mainstays of pure mathematics, theoretical physics, and engineering. The ease with which it connects the analytical and algebraic properties of function spaces; the particle and wave descriptions of matter; and the time and frequency domain descriptions of waves and vibrations make the Fourier transform one of the great unifying concepts of mathematics. Deeper examination reveals that the logic of the Fourier transform is dictated by the structure of the underlying space itself. Hence, the classical cases of functions on the real line, the unit circle, and the integers modulo n are only the beginning: harmonic analysis can be generalized to functions on any space on which a group of transformations acts. Here the emphasis is on the word group in the mathematical sense of an algebraic system obeying specific axioms. The group might even be non-commutative: the fundamental principles behind harmonic analysis are so general that they apply equally to commutative and non-commutative structures. Thus, the humble Fourier transform leads us into the depths of group theory and abstract algebra, arguably the most extensive formal system ever explored by humans. Should this be of any interest to the practitioner who has his eyes set on concrete applications of machine learning and statistical inference? Hopefully, the present thesis will convince the reader that the answer is an emphatic “yes”. One of the reasons why this is so is that groups are the mathematician’s way of capturing symmetries, and symmetries are all around us. Twentieth century physics has taught us just how powerful a tool symmetry principles are for prying open the secrets of nature. One could hardly ask for a better example of the power of mathematics than particle physics, which translates the abstract machinery of group theory into predictions about the behavior of the elementary building blocks of our universe. I believe that algebra will prove to be just as crucial to the science of data as it has proved to be to the sciences of the physical world. In probability theory and statistics it was Persi Diaconis who did much of the pioneering work in this realm, brilliantly expounded in his little book [Diaconis, 1988]. Since then, several other authors have also contributed to the field. In comparison, the algebraic side of machine learning has until now remained largely unexplored. The present thesis is a first step towards filling this gap. The two main themes of the thesis are (a) learning on domains which have non-trivial algebraic structure; and (b) learning in the presence of invariances. Learning rankings/matchings are the classic example of the first situation, whilst rotation/translation/scale invariance in machine vision is probably the most immediate example of the latter. The thesis presents examples addressing real world problems in these two domains. However, the beauty of the algebraic approach is that it allows us to discuss these matters on a more general, abstract, level, so most of our results apply equally well to a large range of learning scenarios. The generality of our approach also means that we do not have to commit to just one learning paradigm (frequentist/Bayesian) or one group of algorithms (SVMs/graphical models/boosting/etc.). We do find that some of our ideas regarding symmetrization and learning on groups meshes best with the Hilbert space learning framework, so in Chapters 4 and 5 we focus on this methodology, but even there we take a comparative stance, contrasting the SVM with Gaussian Processes and a modified version of the Perceptron. One of the reasons why up until now abstract algebra has not had a larger impact on the applied side of computer science is that it is often perceived as a very theoretical field, where computations are difficult if not impossible due to the sheer size of the objects at hand. For example, while permutations obviously enter many applied problems, calulations on the full symmetric group (permutation group) are seldom viable, since it has n! elements. However, recently abstract algebra has developed a strong computational side [B¨urgisser et al., 1997]. The core algorithms of this new computational algebra, such as the non-commutative FFTs discussed in detail in Chapter 3, are the backbone of the bridge between applied computations and abstract theory. In addition to our machine learning work, the present thesis offers some modest additions to this field by deriving some useful generalizations of Clausen’s FFT for the symmetric group, and presenting an efficient, expandable software library implementing the transform. To the best of our knowledge, this is the first time that such a library has been made publicly available. Clearly, a thesis like this one is only a first step towards building a bridge between the theory of groups/representations and machine learning. My hope is that it will offer ideas and inspiration to both sides, as well as a few practical algorithms that I believe are directly applicable to real world problems.
Guide to Big Data 2014 Big Data, NoSQL, and NewSQL – these are the high-level concepts relating to the new, unprecedented data management and analysis challenges that enterprises and startups are now facing. Some estimates expect the amount of digital data in the world to double every two years, while other estimates suggest that 90% of the world’s current data was created in the last two years. The predictions for data growth are staggering no matter where you look, but what does that mean practically for you, the developer, the sysadmin, the product manager, or C-level leader? DZone’s 2014 Guide to Big Data is the definitive resource for learning how industry experts are handling the massive growth and diversity of data. It contains resources that will help you navigate and excel in the world of Big Data management.
Guide to Big Data Business Solutions in the Cloud The recent release of a commercial version of the Lustre* parallel file system running on Amazon Web Services (AWS) was big news for business data centers facing ever expanding data analysis and storage demands. Now, Lustre, the predominant high-performing file system installed in most of the supercomputer installations around the world, could be deployed to business customers in a hardened, tested, easy to manage and fully supported distribution in the cloud. Proven to scale up to extreme levels of storage performance and capacity as measured in tens or even hundreds of petabytes, with shared and accessible to tens of thousands of clients, Lustre combines high throughput with high availability using vendor-neutral server, storage and interconnect hardware coupled with various distributions of Linux. In this Guide, we take a look at what Lustre on infrastructure AWS delivers for a broad community of business and commercial organizations struggling with the challenge of big data and demanding storage growth.
Guide to Machine Learning As the primary facilitator of data science and big data, machine learning has garnered much interest by a broad range of industries as a way to increase value of enterprise data assets. Through techniques of supervised and unsupervised statistical learning, organizations can make important predictions and discover previously unknown knowledge to provide actionable business intelligence. In this guide, we’ll examine the principles underlying machine learning based on the R statistical environment. We’ll explore machine learning with R from the open source R perspective as well as the more robust commercial perspective using Revolution Analytics’ Revolution R Enterprise (RRE) for big data deployments….


Hadoop Buyer’s Guide Everything you need to know about choosing the right Hadoop distribution for production
Hadoop’s Limitations for Big Data Analytics The era of ‘big data’ represents new challenges to businesses. Incoming data volumes are exploding in complexity, variety, speed and volume, while legacy tools have not kept pace. In recent years, a new tool – Apache Hadoop – has appeared on the scene. And while it solves some big data problems, it is not magic. In order to act effectively on big data, businesses must be able to assimilate data quickly, but also must be able to explore this data for value, allowing analysts to ask and iterate their business questions quickly. Hadoop – purpose built to facilitate certain forms of batch-oriented distributed data processing – lends itself readily to the assimilation process. But it was built on fundamentals which severely limit its ability to act as an analytic database. With the rise of big data has come the rise of the analytic database platform. Even five years ago, a company could leverage a DBMS such as Oracle for a data warehouse. However, Oracle was built in a time when databases rarely exceeded a few gigabytes in size. Along with other legacy DBMSs, it cannot perform at the scale now required. Enter the analytic platform. The analytic platform allows analysts to use their existing tools and skillsets to ask new questions of big data quickly, easily, and at scales unseen previously. The de facto best practice infrastructure for big data today often consists of a processing infrastructure of systems such as Hadoop to acquire and archive the data, and an analytic platform to enable the highly iterative analysis process. But because Hadoop is still relatively new, there is a great deal of confusion about its strengths and weaknesses. This paper will discuss those topics, and concludes with guidance on how to build the complete ecosystem for big data analytics.
Hands-On Data Science with R – Text Mining Text Mining or Text Analytics applies analytic tools to learn from collections of text documents like books, newspapers, emails, etc. The goal is similar to humans learning by reading books. Using automated algorithms we can learn from massive amounts of text, much more than a human can. The material could be consist of millions of newspaper articles to perhaps summarise the main themes and to identify those that are of most interest to particular people.
Harness the Power of Data Visualization to Transform Your Business Data underpins the operations and strategic decisions of every business. Yet these days, data is generated faster than it can be consumed and digested, making it challenging for small to mid-size organizations to extract maximum value from this vital asset. Many decision makers – whether data analysts or senior-level executives – struggle to draw meaningful conclusions in a timely manner from the array of data available to them. Reliance on spreadsheets and specialized reporting and analysis tools only limits their flexibility and output. After all, spreadsheets were not designed for data analysis. And specialized reporting and analysis tools often lack integration with other critical business applications and processes. Moreover, dependence on the IT group for ad hoc reports slows insights and decisions, putting the company at a disadvantage. Business decision makers feel a loss of control waiting for the already overburdened IT department to generate critical reports. Savvy companies are moving beyond static graphs, spreadsheets, and reports by harnessing the power of business visualization to transform how they see, discover, and share insights hidden in their data. Because business visualization spans a broad range of options, from static to dynamic and interactive, it serves a variety of needs within organizations. As a result, those companies adopting business visualization are able to extract maximum value from the information captured throughout their environments.
Hidden Technical Debt in Machine Learning Systems Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems. We explore several ML-specific risk factors to account for in system design. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.
Hierarchical Bayesian Survival Analysis and Projective Covariate Selection in Cardiovascular Event Risk Prediction Identifying biomarkers with predictive value for disease risk stratification is an important task in epidemiology. This paper describes an application of Bayesian linear survival regression to model cardiovascular event risk in diabetic individuals with measurements available on 55 candidate biomarkers. We extend the survival model to include data from a larger set of non-diabetic individuals in an e↵ort to increase the predictive performance for the diabetic subpopulation. We compare the Gaussian, Laplace and horseshoe shrinkage priors, and find that the last has the best predictive performance and shrinks strong predictors less than the others. We implement the projection predictive covariate selection approach of Dupuis and Robert (2003) to further search for small sets of predictive biomarkers that could provide costefficient prediction without significant loss in performance. In passing, we present a derivation of the projective covariate selection in Bayesian decision theoretic framework.
Hierarchical Temporal Memory including HTM Cortical Learning Algorithms There are many things humans find easy to do that computers are currently unable to do. Tasks such as visual pattern recognition, understanding spoken language, recognizing and manipulating objects by touch, and navigating in a complex world are easy for humans. Yet despite decades of research, we have few viable algorithms for achieving human-like performance on a computer. In humans, these capabilities are largely performed by the neocortex. Hierarchical Temporal Memory (HTM) is a technology modeled on how the neocortex performs these functions. HTM offers the promise of building machines that approach or exceed human level performance for many cognitive tasks. This document describes HTM technology. Chapter 1 provides a broad overview of HTM, outlining the importance of hierarchical organization, sparse distributed representations, and learning time-based transitions. Chapter 2 describes the HTM cortical learning algorithms in detail. Chapters 3 and 4 provide pseudocode for the HTM learning algorithms divided in two parts called the spatial pooler and temporal pooler. After reading chapters 2 through 4, experienced software engineers should be able to reproduce and experiment with the algorithms. Hopefully, some readers will go further and extend our work.
Hierarchically Supervised Latent Dirichlet Allocation We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bag-of- word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not.
High Dimensional Data Clustering Clustering in high-dimensional spaces is a recurrent problem in many domains, for example in object recognition. High-dimensional data usually live in different lowdimensional subspaces hidden in the original space. This paper presents a clustering approach which estimates the specific subspace and the intrinsic dimension of each class. Our approach adapts the Gaussian mixture model framework to high-dimensional data and estimates the parameters which best fit the data. We obtain a robust clustering method called High- Dimensional Data Clustering (HDDC). We apply HDDC to locate objects in natural images in a probabilistic framework. Experiments on a recently proposed database demonstrate the effectiveness of our clustering method for category localization.
How do we choose our default methods? The field of statistics continues to be divided into competing schools of thought. In theory one might imagine choosing the uniquely best method for each problem as it arises, but in practice we choose for ourselves (and recommend to others) default principles, models, and methods to be used in a wide variety of settings. This article briefly considers the informal criteria we use to decide what methods to use and what principles to apply in statistics problems.
How to Build Dashboards That Persuade, Inform and Engage Flow is powerful. Think about a great conversation you’ve had, with no awkwardness or selfconsciousness: just effortless communication. In data visualization, flow is crucial. Your audience should smoothly absorb and use the information in a dashboard without distractions or turbulence. Lack of flow means lack of communication, which means failure. Psychologist Mihaly Czikszentmihalyi has studied flow extensively. Czikszentmihalyi and other researchers have found that flow is correlated with happiness, creativity, and productivity. People experience flow when their skills are engaged and they’re being challenged just the right amount. The experience is not too challenging or too easy: flow is a just-right, Goldilocks state of being. So how do you create flow for an audience? By tailoring the presentation of data to that audience. If you focus on the skills, motivations, and needs of an audience, you’ll have a better chance of creating a positive experience of flow with your dashboards. And by creating that flow, you’ll be able to persuade, inform, and engage.
How to Grow a Mind: Statistics, Structure, and Abstraction In coming to understand the world—in learning concepts, acquiring language, and grasping causal relations—our minds make inferences that appear to go far beyond the data available. How do we do it? This review describes recent approaches to reverse-engineering human learning and cognitive development and, in parallel, engineering more humanlike machine learning systems. Computational models that perform probabilistic inference over hierarchies of flexibly structured representations can address some of the deepest questions about the nature and origins of human thought: How does abstract knowledge guide learning and reasoning from sparse data? What forms does our knowledge take, across different domains and tasks? And how is that abstract knowledge itself acquired?
How to Implement an Effective Decision Management System – Embedding Analytics into Real-Time Business Decisions, Operations and Processes Once used mostly in traditional batch-type environments, analytic techniques are now being embedded into real-time business decisions, operations and processes. In fact, decision support insight should be embedded very consistently in operational systems. For example:
• When a credit card organization is processing a card swipe, fraud detection analytics should be embedded in that process.
• When analysis of sensor data over time indicates an impending problem with a mechanical process, utility grid or manufacturing system, the system should trigger some proactive intervention.
• When call center agents have a customer on the phone, or tellers have a customer at the counter, analytics behind the scenes should be giving them the information they need to customize the interaction – right now.
This ideal has historically been a challenge to implement because the niche applications used for different business functions have not been on great speaking terms. Unlike niche tools, an enterprise decision management framework extracts information from multiple sources, runs it through analytical processes, and delivers the results directly into business applications or operational systems. “Enterprise decision management is one of the hottest topics in business analytics today,” said David Duling, Director of Enterprise Decision Management R&D at SAS. “You see it on the front page of many journals, and a lot of conferences are being organized around the topic. Enterprise data management marries the analytics that we’ve been doing at SAS with the product environments within your organization to automate routine business decisions.” This means better decisions, delivered right to the point of decision.
How to Speed up R Code: An Introduction Most calculations performed by the average R user are unremarkable in the sense that nowadays, any computer can crush the related code in a matter of seconds. But more and more often, heavy calculations are also performed using R, something especially true in some fields such as statistics. The user then faces total execution times of his codes that are hard to work with: hours, days, even weeks. In this paper, how to reduce the total execution time of various codes will be shown and typical bottlenecks will be discussed. As a last resort, how to run your code on a cluster of computers (most workplaces have one) in order to make use of a larger processing power than the one available on an average computer will also be discussed through two examples.
How to Use an Uncommon-Sense Approach to Big Data Quality Organizations are inundated in data – terabytes, petabytes and exabytes of it. Data pours in from every conceivable direction: from operational and transactional systems, from scanning and facilities management systems, from inbound and outbound customer contact points, from mobile media and the Web. The hopeful vision of big data is that organizations will be able to harvest every byte of relevant data and use it to make supremely informed decisions. We now have the technologies to collect and store big data, but more importantly, to understand and take advantage of its full value. “The financial services industry has led the way in using analytics and big data to manage risk and curb fraud, waste and abuse – especially important in that regulatory environment,” said Scott Chastain, Director of Information Management and Delivery at SAS. “We’re also seeing a transference of big data analytics into other areas, such as health care and government. The ability to find that needle in the haystack becomes very important when you’re examining things like costs, outcomes, utilization and fraud for large populations.
How to Use Hadoop as a Piece of the Big Data Puzzle Imagine you have a jar of multicolored candies, and you need to learn something from them, perhaps the count of blue candies relative to red and yellow ones. You could empty the jar onto a plate, sift through them and tally up your answer. If the jar held only a few hundred candies, this process would take only a few minutes. Now imagine you have four plates and four helpers. You pour out about one-fourth of the candies onto each plate. Everybody sifts through their set and arrives at an answer that they share with the others to arrive at a total. Much faster, no? That is what Hadoop does for data. Hadoop is an open-source software framework for running applications on large clusters of commodity hardware. Hadoop delivers enormous processing power – the ability to handle virtually limitless concurrent tasks and jobs – making it a remarkably low-cost complement to a traditional enterprise data infrastructure. Organizations are embracing Hadoop for several notable merits:
• Hadoop is distributed. Bringing a high-tech twist to the adage, “Many hands make light work,” data is stored on local disks of a distributed cluster of servers.
• Hadoop runs on commodity hardware. Based on the average cost per terabyte of compute capacity of a prepackaged system, Hadoop is easily 10 times cheaper for comparable computing capacity compared to higher-cost specialized hardware.
• Hadoop is fault-tolerant. Hardware failure is expected and is mitigated by data replication and speculative processing. If capacity is available, Hadoop runs multiple copies of the same task, accepting the results from the task that finishes first.
• Hadoop does not require a predefined data schema. A key benefit of Hadoop is the ability to just upload any unstructured files without having to “schematize” them first. You can dump any type of data into Hadoop and allow the consuming programs to determine and apply structure when necessary.
• Hadoop scales to handle big data. Hadoop clusters can scale to between 6,000 and 10,000 nodes and handle more than 100,000 concurrent tasks and 10,000 concurrent jobs. Yahoo! runs thousands of clusters and more than 42,000 Hadoop nodes storing more than 200 petabytes of data.
• Hadoop is fast. In a performance test, a 1,400-node cluster sorted a terabyte of data in 62 seconds; a 3,400-node cluster sorted 100 terabytes in 173 minutes. To put it in context, one terabyte contains 2,000 hours of CD-quality music; 10 terabytes could store the entire US Library of Congress print collection.
You get the idea. Hadoop handles big data. It does it fast. It redefines the possible when it comes to analyzing large volumes of data, particularly semi-structured and unstructured data (text).
How YARN Opens Doors to Easier Programming Tools for Hadoop 2.0 Users The emergence of YARN for the Hadoop 2.0 platform has opened the door to new tools and applications that promise to allow more companies to reap the benefits of big data in ways never before possible with outcomes possibly never imagined. By separating the problem of cluster resource management from the data processing function, YARN offers a world beyond MapReduce: lessencumbered by complex programming protocols, faster, and at a lower cost…. Safe Reinforcement Learning can be de ned as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The rst is based on the modi cation of the optimality criterion, the classic discounted – nite/in nite horizon, with a safety factor. The second is based on the modi cation of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classi cation to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning.
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, “short bytes”), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/sqrt(m). This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
Hypervariate Data Visualization Both scientists and normal users face enormous amounts of data, which might be useless if no insight is gained from it. To achieve this, visualization techniques can be used. Many datasets have a dimensionality higher than three. Such data is called “hypervariate” and cannot be visualized directly in the three-dimensional space that we inhabit. Therefore, a wide variety of specialized techniques have been created for rendering hypervariate data. These techniques are based on very different principles and are designed for very different areas of application. This paper gives an overview of six representative techniques. For most techniques a rendering of a common dataset is provided to allow an easier comparison. Furthermore, an evaluation of the strengths and weaknesses of each technique is given. As an outlook, two papers dealing with quantitative analysis of visualization methods are presented.
Hypervariate Information Visualization In the last 20 years improvements in the computer sciences made it possible to store large data sets containing a plethora of different data attributes and data values, which could be applied in different application domains, for example, in the natural sciences, in law enforcements or in social studies. Due to this increasing data complexity in modern times, it is crucial to support the exploration of the hypervariate data with different visualization techniques. These facts are the fundament for this paper, which reveals how the information visualization can support the understanding of data with high dimensionality. Furthermore, it gives an overview and a comparison of the different categories of hypervariate information visualization, in order to analyse the advantages and the disadvantages of each category. We also addressed in the different interaction methods which help to create an understandable visualization and thus facilitate the user’s visual exploration. Interactive techniques are useful to create an understandable visualization of the relationships in a large data set. At the end, we also discussed the possibility of merging different interactions and visualization techniques.


ILNumerics: Numeric Computing for Industry Most enterprise software nowadays gets created by means of managed software frameworks. In the past they have often failed to deliver the speed required for professional data analysis and scientific computing. The ILNumerics Computing Engine offers a new approach for the integration of numerical algorithms into technical applications.
Implementation of a Practical Distributed Calculation System with Browsers Deep learning can achieve outstanding results in various fields. However, it requires so significant computational power that graphics processing units (GPUs) and/or numerous computers are often required for the practical application. We have developed a new distributed calculation framework called ”Sashimi” that allows any computer to be used as a distribution node only by accessing a website. We have also developed a new JavaScript neural network framework called ”Sukiyaki” that uses general purpose GPUs with web browsers. Sukiyaki performs 30 times faster than a conventional JavaScript library for deep convolutional neural networks (deep CNNs) learning. The combination of Sashimi and Sukiyaki, as well as new distribution algorithms, demonstrates the distributed deep learning of deep CNNs only with web browsers on various devices. The libraries that comprise the proposed methods are available under MIT license at http://…/.
IMSL C Math Library Version 8.5.0 The IMSL C Math Library, a component of the IMSL C Numerical Library, is a library of C functions useful in scientific programming. Each function is designed and documented for use in research activities as well as by technical specialists. A number of the example programs also show graphs of resulting output.
IMSL C Stat Library – Version 8.5.0 The IMSL C Stat Library, a component of the IMSL C Numerical Library, is a library of C functions useful in scientific programming. Each function is designed and documented to be used in research activities as well as by technical specialists. A number of the example programs also show graphs of resulting output.
Independent Component Analysis: Algorithms and Applications A fundamental problem in neural network research, as well as in many other disciplines, is finding a suitable representation of multivariate data, i.e. random vectors. For reasons of computational and conceptual simplicity, the representation is often sought as a linear transformation of the original data. In other words, each component of the representation is a linear combination of the original variables. Well-known linear transformation methods include principal component analysis, factor analysis, and projection pursuit. Independent component analysis (ICA) is a recently developed method in which the goal is to find a linear representation of nongaussian data so that the components are statistically independent, or as independent as possible. Such a representation seems to capture the essential structure of the data in many applications, including feature extraction and signal separation. In this paper, we present the basic theory and applications of ICA, and our recent work on the subject.
Inferential Methods to Assess the Difference The area under the curve (AUC) is the most common statistical approach to evaluate the discriminatory power of a set of factors in a binary regression model. A nested model framework is used to ascertain whether the AUC increases when new factors enter the model. Two statistical tests are proposed for the difference in the AUC parameters from these nested models. The asymptotic null distributions for the two test statistics are derived from the scenarios: (A) the difference in the AUC parameters is zero and the new factors are not associated with the binary outcome, (B) the difference in the AUC parameters is less than a strictly positive value. A confidence interval for the difference in AUC parameters is developed. Simulations are generated to determine the finite sample operating characteristics of the tests and a pancreatic cancer data example is used to illustrate this approach.
Information Limits of Aggregate Data This paper uses a small model in the Cowles Commission (CC) tradition to examine the limits of aggregate data. It argues that more can be learned about the macroeconomy following the CC approach than the reduced form and VAR approaches allow, but less than the DSGE approach tries to do.
Information Visualization with Self-Organizing Maps The Self-Organizing Map (SOM) is an unsupervised neural network algorithm that projects high- dimensional data onto a two-dimensional map. The projection preserves the topology of the data so that similar data items will be mapped to nearby locations on the map. Despite the popular use of the algorithm for clustering and information visualisation, a system has been lacking that combines the fast execution of the algorithm with powerful visualisation of the maps and effective tools for their interactive analysis. Powerful methods for interactive exploration and search from collections of free-form textual documents are needed to manage the ever-increasing flood of digital information. In this article we present a method, SOM, for automatic organization of full-text document collections using the self-organizing map (SOM) algorithm. The document collection is ordered onto a map in an unsupervised manner utilizing statistical information of short word contexts. The resulting ordered map where similar documents lie near each other thus presents a general view of the document space. With the aid of a suitable (SVG) interface, documents in interesting areas of the map can be browsed.
Infovis and Statistical Graphics: Different Goals, Different Looks The importance of graphical displays in statistical practice has been recognized sporadically in the statistical literature over the past century, with wider awareness following Tukey’s Exploratory Data Analysis (1977) and Tufte’s books in the succeeding decades. But statistical graphics still occupies an awkward in-between position: Within statistics, exploratory and graphical methods represent a minor subfield and are not wellintegrated with larger themes of modeling and inference. Outside of statistics, infographics (also called information visualization or Infovis) is huge, but their purveyors and enthusiasts appear largely to be uninterested in statistical principles. We present here a set of goals for graphical displays discussed primarily from the statistical point of view and discuss some inherent contradictions in these goals that may be impeding communication between the fields of statistics and Infovis. One of our constructive suggestions, to Infovis practitioners and statisticians alike, is to try not to cram into a single graph what can be better displayed in two or more. We recognize that we offer only one perspective and intend this article to be a starting point for a wide-ranging discussion among graphics designers, statisticians, and users of statistical methods. The purpose of this article is not to criticize but to explore the different goals that lead researchers in different fields to value different aspects of data visualization.
InsideBIGDATA Guide to In-Memory Computing In-memory computing (IMC) is an emerging field of importance in the big data industry. It is a quickly evolving technology, seen by many as an effective way to address the proverbial 3 V’s of big data – volume, velocity, and variety. Big data requires ever more powerful means to process and analyze growing stores of data, being collected at more rapid rates, and with increasing diversity in the types of data being sought – both structured and unstructured. In-memory computing’s rapid rise in the marketplace has the big data community on alert. In fact, Gartner picked in-memory computing as one of the Top Ten Strategic Initiatives.
InsideBIGDATA Guide to Predictive Analytics Predictive analytics, sometimes called advanced analytics, is a term used to describe a range of analytical and statistical techniques to predict future actions or behaviors. In business, predictive analytics are used to make proactive decisions and determine actions, by using statistical models to discover patterns in historical and transactional data to uncover likely risks and opportunities. Predictive analytics incorporates a range of activities which we will explore in this paper, including data access, exploratory data analysis and visualization, developing assumptions and data models, applying predictive models, then estimating and/or predicting future outcomes.
Installing R and Optional RStudio R is quickly becoming the statistical software of choice for researchers and analysts in a variety of disciplines. In recent years, it has surpassed many commonly used statistical programs in both number of users and availability of statistical methods. A fundamental difference between R and other statistical software packages is that R is open-source, meaning it is both free for download and the source code is available under the GNU General Project License. Anyone can contribute new techniques or analytical methods, which has been a primary factor enabling the growth of R. These contributions are called `packages’. Currently, almost 6000 packages are available for R.
Integrated Analytics – Platforms and Principles for Centralizing Your Data Companies are collecting more data than ever. But, given how difficult it is to unify the many internal and external data streams they’ve built, more data doesn’t necessarily translate into better analytics. The real challenge is to provide deep and broad access to “a single source of truth” in their data that the typically slow ETL process for data warehousing cannot achieve. More than just fast access, analysts need the ability to explore data at a granular level.
In this O’Reilly report, author Courtney Webster presents a roadmap to data centralization that will help your organization make data accessible, flexible, and actionable. Building a genuine data-driven culture depends on your company’s ability to quickly act upon new findings. This report explains how.
Intelligent Choice of the Number of Clusters in K -Means Clustering: An Experimental Study with Different Cluster Spreads The issue of determining ‘the right number of clusters’ in K-Means has attracted considerable interest, especially in the recent years. Cluster intermix appears to be a factor most affecting the clustering results. This paper proposes an experimental setting for comparison of different approaches at data generated from Gaussian clusters with the controlled parameters of between- and within-cluster spread to model cluster intermix. The setting allows for evaluating the centroid recovery on par with conventional evaluation of the cluster recovery. The subjects of our interest are two versions of the ‘intelligent’ K-Means method, ik-Means, that find the ‘right’ number of clusters by extracting ‘anomalous patterns’ from the data one-by-one. We compare them with seven other methods, including Hartigan’s rule, averaged Silhouette width and Gap statistic, under different between- and within-cluster spread-shape conditions. There are several consistent patterns in the results of our experiments, such as that the right K is reproduced best by Hartigan’s rule – but not clusters or their centroids. This leads us to propose an adjusted version of iK-Means, which performs well in the current experiment setting.
Intelligent Data Analysis (Slide Deck)
Interactive Web Apps with shiny (Cheat Sheet)
Introducing Connection Analytics Connection Analytics provides a new way of looking at people, products, physical phenomena, or events. It provides insights by dissecting the types of relationships between entities to determine causation and can be used for generating predictive intelligence based on the patterns of interactions. Connection Analytics can address queries such as identifying influencers, the groups that they influence, and where promotions or other forms of marketing are best directed. It can be utilized for product affinity analysis by taking a bottom up look at how the decisions to buy different items are linked. Likewise, this approach can help analyze networks by patterns of activity, and fraud and money laundering through the actions (rather than identities) of involved actors. It can help segment customers based on behavior patterns like past purchase behavior or reviews vs. traditional segmentation techniques like income & demographics. Graph analytics is one of the most promising approaches to performing Connection Analytics. Teradata is the first analytics data platform provider to make graph computing accessible to the existing base of data scientists, database developers and business analysts by introducing a SQL-friendly approach. Underneath the hood, Teradata Aster is using a compute approach that allows the data to leverage the power and performance of massively parallel analytic processing engines and pre-built algorithms.
Introducing R R is a powerful environment for statistical computing which runs on several platforms. These notes are written especially for users running the Windows version, but most of the material applies to the Mac and Linux versions as well.
Introduction to Boosted Trees (Slide Deck)
Introduction to Machine Learning and Soft Computing • Introduction
• Single-layer Neural Networks
• Linear Classification
• Linear Regression
• Kernel
• Multi-layer Neural Networks
• Nonlinear Classification
• Nonlinear Regression
• Model Selection
• GA-based Frameworks
• PSO-based Frameworks
• Conclusion
• Epilogue
Introduction to Markov Random Fields This book sets out to demonstrate the power of the Markov random field (MRF) in vision. It treats the MRF both as a tool for modeling image data and, coupled with a set of recently developed algorithms, as a means of making inferences about images. The inferences concern underlying image and scene structure to solve problems such as image reconstruction, image segmentation, 3D vision, and object labeling. This chapter is designed to present some of the main concepts used in MRFs, both as a taster and as a gateway to the more detailed chapters that follow, as well as a stand-alone introduction to MRFs.
Introduction to Network Theory (& Graph Theory) (Slide Deck)
Introduction to Neural Networks In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese samples to a taste response.
Introduction to Probabilistic Topic Models Probabilistic topic models are a suite of algorithms whose aim is to discover the hidden thematic structure in large archives of documents. In this article, we review the main ideas of this eld, survey the current state-of-the-art, and describe some promising future directions. We rst describe latent Dirichlet allocation (LDA) [8], which is the simplest kind of topic model. We discuss its connections to probabilistic modeling, and describe two kinds of algorithms for topic discovery. We then survey the growing body of research that extends and applies topic models in interesting ways. These extensions have been developed by relaxing some of the statistical assumptions of LDA, incorporating meta-data into the analysis of the documents, and using similar kinds of models on a diversity of data types such as social networks, images and genetics. Finally, we give our thoughts as to some of the important unexplored directions for topic modeling. These include rigorous methods for checking models built for data exploration, new approaches to visualizing text and other high dimensional data, and moving beyond traditional information engineering applications towards using topic models for more scienti c ends.
Introduction to YARN Apache Hadoop 2.0 includes YARN, which separates the resource management and processing components. The YARN-based architecture is not constrained to MapReduce. This article describes YARN and its advantages over the previous distributed processing layer in Hadoop. Learn how to enhance your clusters with YARN’s scalability, efficiency, and flexibility.
Introduction: Credibility, Models, and Parameters The goal of this chapter is to introduce the conceptual framework of Bayesian data analysis. Bayesian data analysis has two foundational ideas. The first idea is that Bayesian inference is reallocation of credibility across possibilities. The second foundational idea is that the possibilities, over which we allocate credibility, are parameter values in meaningful mathematical models. These two fundamental ideas form the conceptual foundation for every analysis in this book. Simple examples of these ideas are presented in this chapter. The rest of the book merely fills in the mathematical and computational details for specific applications of these two ideas. This chapter also explains the basic procedural steps shared by every Bayesian analysis.


JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling JAGS is a program for Bayesian Graphical modelling which aims for compatibility with classic BUGS. The program could eventually be developed as an R package. This article explains the motivations for this program, briefly describes the architecture and then discusses some ideas for a vectorized form of the BUGS language.
Julia for R Programmers (Slide Deck)


Kernel Density Estimation with Ripley’s Circumferential Correction In this paper, we investigate (and extend) Ripley’s circumference method to correct bias of density estimation of edges (or frontiers) of regions. The idea of the method was theoretical and diffcult to implement. We provide a simple technique based of properties of Gaussian kernels to effciently compute weights to correct border bias on frontiers of the region of interest, with an automatic selection of an optimal radius for the method. We illustrate the use of that technique to visualize hot spots of car accidents and campsite locations, as well as location of bike thefts.


Large Linear Classification When Data Cannot Fit In Memory Recent advances in linear classification have shown that for applications such as document classification, the training can be extremely efficient. However, most of the existing training methods are designed by assuming that data can be stored in the computer memory. These methods cannot be easily applied to data larger than the memory capacity due to the random access to the disk. We propose and analyze a block minimization framework for data larger than the memory size. At each step a block of data is loaded from the disk and handled by certain learning methods. We investigate two implementations of the proposed framework for primal and dual SVMs, respectively. As data cannot fit in memory, many design considerations are very different from those for traditional algorithms. Experiments using data sets 20 times larger than the memory demonstrate the effectiveness of the proposed method.
Large-Scale Graph Visualization and Analytics Novel approaches to network visualization and analytics use sophisticated metrics that enable rich interactive network views and node grouping and filtering. A survey of graph layout and simplification methods reveals considerable progress in these new directions.
Latent Dirichlet Allocation We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
Latent Semantic Analysis and Topic Modeling: Roads to Text Meaning (Slide Deck)
Latent Variable Mixture Modeling The aim of this study was to provide an overview of mixture modeling techniques, specifically as applied to nursing research, and to present examples from two studies to illustrate how these techniques may be used crosssectionally and longitudinally.
Latent Variable Models A powerful approach to probabilistic modelling involves sup- plementing a set of observed variables with additional latent, or hidden, variables. By de ning a joint distribution over visible and latent variables, the corresponding distribution of the observed variables is then obtained by marginalization. This allows relatively complex distributions to be ex- pressed in terms of more tractable joint distributions over the expanded variable space. One well-known example of a hidden variable model is the mixture distribution in which the hidden variable is the discrete component label. In the case of continuous latent variables we obtain models such as factor analysis. The structure of such probabilistic models can be made particularly transparent by giving them a graphical representation, usually in terms of a directed acyclic graph, or Bayesian network. In this chapter we provide an overview of latent variable models for representing continuous variables. We show how a particular form of linear latent variable model can be used to provide a probabilistic formulation of the well-known tech- nique of principal components analysis (PCA). By extending this technique to mixtures, and hierarchical mixtures, of probabilistic PCA models we are led to a powerful interactive algorithm for data visualization. We also show how the probabilistic PCA approach can be generalized to non-linear latent variable models leading to the Generative Topographic Mapping algorithm (GTM). Finally, we show how GTM can itself be extended to model tem- poral data.
LDAvis: A method for visualizing and interpreting topics We present LDAvis, a web-based interactive visualization of topics estimated using Latent Dirichlet Allocation that is built using a combination of R and D3. Our visualization provides a global view of the topics (and how they differ from each other), while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic. First, we propose a novel method for choosing which terms to present to a user to aid in the task of topic interpretation, in which we define the relevance of a term to a topic. Second, we present results from a user study that suggest that ranking terms purely by their probability under a topic is suboptimal for topic interpretation. Last, we describe LDAvis, our visualization system that allows users to flexibly explore topic-term relationships using relevance to better understand a fitted LDA model.
Leakage in Data Mining: Formulation, Detection, and Avoidance Deemed “one of the top ten data mining mistakes”, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competi-tions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical i.i.d. assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by expli-citly defining modeling goals and analyzing the broader frame-work of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected.
Learn to use R – Your Hands-on Guide R is hot. Whether measured by more than 6,100 add-on packages, the 41,000+ members of LinkedIn’s R group or the 170+ R Meetup groups currently in existence, there can be little doubt that interest in the R statistics language, especially for data analysis, is soaring. Why R? It’s free, open source, powerful and highly extensible. “You have a lot of prepackaged stuff that’s already available, so you’re standing on the shoulders of giants,” Google’s chief economist told The New York Times back in 2009. Because it’s a programmable environment that uses command-line scripting, you can store a series of complex data-analysis steps in R. That lets you re-use your analysis work on similar data more easily than if you were using a point-and-click interface, notes Hadley Wickham, author of several popular R packages and chief scientist with RStudio. That also makes it easier for others to validate research results and check your work for errors — an issue that cropped up in the news recently after an Excel coding error was among several flaws found in an influential economics analysis report known as Reinhart/Rogoff. The error itself wasn’t a surprise, blogs Christopher Gandrud, who earned a doctorate in quantitative research methodology from the London School of Economics. “Despite our best efforts we always will” make errors, he notes. “The problem is that we often use tools and practices that make it difficult to find and correct our mistakes.” Sure, you can easily examine complex formulas on a spreadsheet. But it’s not nearly as easy to run multiple data sets through spreadsheet formulas to check results as it is to put several data sets through a script, he explains. Indeed, the mantra of “Make sure your work is reproducible!” is a common theme among R enthusiasts.
Learning Deep Architectures for AI Theoretical results strongly suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one needs deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult optimization task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas. This paper discusses the motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.
Learning from Dyadic Data Dyadic data refers to a domain with two nite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This type of data arises naturally in many application ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domain-independent framework of learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures. We propose an annealed version of the standard EM algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains
Learning the k in k-means When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level . We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity.
Learning the parts of objects by non-negative matrix factorization Is perception of the whole based on perception of its parts? There is psychological and physiological evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations. But little is known about how brains or computers might learn the parts of objects. Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text. This is in contrast to other methods, such as principal components analysis and vector quantization, that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations. When non-negative matrix factorization is implemented as a neural network, parts-based representations emerge by virtue of two properties: the firing rates of neurons are never negative and synaptic strengths do not change sign.
Learning the parts of objects by non-negative matrix factorization Is perception of the whole based on perception of its parts? There is psychological1 and physiological2,3 evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations4,5. But little is known about how brains or computers might learn the parts of objects. Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text. This is in contrast to other methods, such as principal components analysis and vector quantization, that learn holistic, not parts-based, representations.Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations. When non-negative matrix factorization is implemented as a neural network, parts-based representations emerge by virtue of two properties: the firing rates of neurons are never negative and synaptic strengths do not change sign.
Learning to Extract International Relations from Political Context We describe a new probabilistic model for extracting events between major political actors from news corpora. Our unsupervised model brings together familiar components in natural language processing (like parsers and topic models) with contextual political information— temporal and dyad dependence—to infer latent event classes. We quantitatively evaluate the model’s performance on political science benchmarks: recovering expert-assigned event class valences, and detecting real-world conflict. We also conduct a small case study based on our model’s inferences.
Learning to Hash for Indexing Big Data – A Survey The explosive growth in big data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, performance of randomized hashing has been shown insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with systematic understanding of insights, pros and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area.
Learning Word Representation Considering Proximity and Ambiguity Distributed representations of words (aka word embedding) have proven helpful in solving natural language processing (NLP) tasks. Training distributed representations of words with neural networks has lately been a major focus of researchers in the field. Recent work on word embedding, the Continuous Bag-of-Words (CBOW) model and the Continuous Skip-gram (Skip-gram) model, have produced particularly impressive results, significantly speeding up the training process to enable word representation learning from largescale data. However, both CBOW and Skip-gram do not pay enough attention to word proximity in terms of model or word ambiguity in terms of linguistics. In this paper, we propose Proximity-Ambiguity Sensitive (PAS) models (i.e. PAS CBOW and PAS Skip-gram) to produce high quality distributed representations of words considering both word proximity and ambiguity. From the model perspective, we introduce proximity weights as parameters to be learned in PAS CBOWand used in PAS Skip-gram. By better modeling word proximity, we reveal the strength of pooling-structured neural networks in word representation learning. The proximity sensitive pooling layer can also be applied to other neural network applications that employ pooling layers. From the linguistics perspective, we train multiple representation vectors per word. Each representation vector corresponds to a particular group of POS tags of the word. By using PAS models, we achieved a 16.9% increase in accuracy over state-of-theart models.
Least Square Projection: a fast high precision multidimensional projection technique and its application to document mapping The problem of projecting multidimensional data into lower dimensions has been pursued by many researchers due to its potential application to data analysis of various kinds. This paper presents a novel multidimensional projection technique based on least square approximations. The approximations compute the coordinates of a set of projected points based on the coordinates of a reduced number of control points with defined geometry. We name the technique Least Square Projections (LSP). From an initial projection of the control points, LSP defines the positioning of their neighboring points through a numerical solution that aims at preserving a similarity relationship between the points given by a metric in mD. In order to perform the projection, a small number of distance calculations is necessary and no repositioning of the points is required to obtain a final solution with satisfactory precision. The results show the capability of the technique to form groups of points by degree of similarity in 2D. We illustrate that capability through its application to mapping collections of textual documents from varied sources, a strategic yet difficult application. LSP is faster and more accurate than other existing high quality methods, particularly where it was mostly tested, that is, for mapping text sets.
Let’s Debunk the Myths about Data Mining Data mining is about knowledge and information, but only occasionally about predicting the future. For as long as the field has existed, data miners have worked to explain the difference between data mining and other forms of data analysis. The terms ‘predictive analysis’ and ‘predictive modelling’ have been adopted widely to distinguish data mining and its modelling from other kinds. Unfortunately, this has led to the erroneous belief among non-practitioners that data mining is all about prediction, which it is not. Rather, data mining is about information and knowledge. Take a look at the diagram: On the left, we have the myth which has grown up around data mining: the idea that starting from data we create models which make predictions to guide action. This places a false emphasis on models; a more accurate picture of what really happens is shown on the right. Knowledge is applied to data, producing new knowledge which can again be applied to the data: an iterative process. At any point in this cycle, knowledge and data can be used together to produce new information. This creation of new information is sometimes called ‘prediction’, but it is often not information about the future. It may have some implications for the future, as many pieces of information do, but it is not a prediction in the usual sense of the word. In summary, the left hand diagram is erroneous because it leaves out knowledge, which is both an essential prerequisite and a product of data mining, and is used at every step. Data mining often produces models but these are only one kind of knowledge that it can produce, the other being human knowledge (knowledge in the head).
Leveraging Flexible Data Management with Graph Databases Integrating up-to-date information into databases from different heterogeneous data sources is still a time-consuming and mostly manual job that can only be accomplished by skilled experts. For this reason, enterprises often lack information regarding the current market situation, preventing a holistic view that is needed to conduct sound data analysis and market predictions. Ironically, the Web consists of a huge and growing number of valuable information from diverse organizations and data providers, such as the Linked Open Data cloud, common knowledge sources like Freebase, and social networks. One desirable usage scenario for this kind of data is its integration into a single database in order to apply data analytics. However, in today’s business intelligence tools there is an evident lack of support for so-called situational or ad-hoc data integration. What we need is a system which 1) provides a exible storage of heterogeneous information of di erent degrees of structure in an ad-hoc manner, and 2) supports mass data operations suited for data analytics. In this paper, we will provide our vision of such a system and describe an extension of the well-studied property graph model that allows to \integrate and analyze as you go” external data exposed in the RDF format in a seamless manner. The proposed integration approach extends the internal graph model with external data from the Linked Open Data cloud, which stores over 31 billion RDF triples (September 2011) from a variety of domains.
lfe: Linear Group Fixed Effects Linear models with fixed effects and many dummy variables are common in some fields. Such models are straightforward to estimate unless the factors have too many levels. The R package lfe solves this problem by implementing a generalization of the within transformation to multiple factors, tailored for large problems.
Linear Dimensionality Reduction (Slide Deck)
Linear models and linear mixed effects models in R with linguistic applications Part 1: Linear modeling
Part 2: A very basic tutorial for performing linear mixed effects analyses
Linked Open Data: The Essentials A Quick Start Guide for Decision Makers
Listen First! Turning Social Media Conversations into Business Advantage Nearly anywhere you turn online, people are talking about your products and categories, what they like and dislike, what they want, what pleases them or ticks them off, and what they would like you to do, or stop doing. Twitter, Facebook, YouTube, millions of blogs, forums, Web sites, and review sites make most of these conversations public, accessible, and researchable to every company. You also hear individuals talk about the richness and texture of their lives and your role in them. You learn about their aspirations, families, relationships, and homes; music and movies; vacations, hobbies, and sports; finances, jobs, and careers; education and technology; what they had for lunch, what they crave; and much more. By listening in on those conversations, you position yourself to develop powerful insights into people that, coupled with strategy, drive your business forward and create an enduring advantage. Listen First! Turning Social Media Conversations into Business Advantage will show you how.


Machine Learning This book is based partly on content from the 2013 session of the on-line Machine Learning course run by Andrew Ng (Stanford University). The on-line course is provided for free via the Coursera platform ( The author is no way affiliated with Coursera, Stanford University or Andrew Ng.
Machine Learning – The Complete Guide This is a Wikipedia book, a collection of Wikipedia articles that can be easily saved, rendered electronically, and ordered as a printed book.
Machine Learning and Cloud Computing: Survey of Distributed and SaaS Solutions Applying popular machine learning algorithms to large amounts of data raised new challenges for the ML practitioners. Traditional ML libraries does not support well processing of huge datasets, so that new approaches were needed. Parallelization using modern parallel computing frameworks, such as MapReduce, CUDA, or Dryad gained in popularity and acceptance, resulting in new ML libraries developed on top of these frameworks. We will briefly introduce the most prominent industrial and academic outcomes, such as Apache Mahout, GraphLab or Jubatus. We will investigate how cloud computing paradigm impacted the field of ML. First direction is of popular statistics tools and libraries (R system, Python) deployed in the cloud. A second line of products is augmenting existing tools with plugins that allow users to create a Hadoop cluster in the cloud and run jobs on it. Next on the list are libraries of distributed implementations for ML algorithms, and on-premise deployments of complex systems for data analytics and data mining. Last approach on the radar of this survey is ML as Software-as-a-Service, several BigData start-ups (and large companies as well) already opening their solutions to the market.
Machine Learning for E-mail Spam Filtering: Review,Techniques and Trends We present a comprehensive review of the most effective content-based e-mail spam filtering techniques. We focus primarily on Machine Learning-based spam filters and their variants, and report on a broad review ranging from surveying the relevant ideas, efforts, effectiveness, and the current progress. The initial exposition of the background examines the basics of e-mail spam filtering, the evolving nature of spam, spammers playing cat-and-mouse with e-mail service providers (ESPs), and the Machine Learning front in fighting spam. We conclude by measuring the impact of Machine Learning-based filters and explore the promising offshoots of latest developments.
Machine Learning Methods for Computer Security The study of learning in adversarial environments is an emerging discipline at the juncture between machine learning and computer security that raises new questions within both fields. The interest in learning-based methods for security and system design applications comes from the high degree of complexity of phenomena underlying the security and reliability of computer systems. As it becomes increasingly difficult to reach the desired properties by design alone, learning methods are being used to obtain a better understanding of various data collected from these complex systems. However, learning approaches can be co-opted or evaded by adversaries, who change to counter them. To-date, there has been limited research into learning techniques that are resilient to attacks with provable robustness guarantees making the task of designing secure learning-based systems a lucrative open research area with many challenges. The Perspectives Workshop, “Machine Learning Methods for Computer Security” was convened to bring together interested researchers from both the computer security and machine learning communities to discuss techniques, challenges, and future research directions for secure learning and learning-based security applications. This workshop featured twenty-two invited talks from leading researchers within the secure learning community covering topics in adversarial learning, game-theoretic learning, collective classification, privacy-preserving learning, security evaluation metrics, digital forensics, authorship identification, adversarial advertisement detection, learning for offensive security, and data sanitization. The workshop also featured workgroup sessions organized into three topic: machine learning for computer security, secure learning, and future applications of secure learning.
Machine Learning: A Probabilistic Perspective This books adopts the view that the best way to solve such problems is to use the tools of probability theory. Probability theory can be applied to any problem involving uncertainty. In machine learning, uncertainty comes in many forms: what is the best prediction about the future given some past data? what is the best model to explain some data? what measurement should I perform next? etc. The probabilistic approach to machine learning is closely related to the field of statistics, but differs slightly in terms of its emphasis and terminology. We will describe a wide variety of probabilistic models, suitable for a wide variety of data and tasks. We will also describe a wide variety of algorithms for learning and using such models. The goal is not to develop a cook book of ad hoc techiques, but instead to present a unified view of the field through the lens of probabilistic modeling and inference. Although we will pay attention to computational effciency, details on how to scale these methods to truly massive datasets are better described in other books, such as (Rajaraman and Ullman 2011; Bekkerman et al. 2011).
Machine Learning: The High-Interest Credit Card of Technical Debt Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.
Making Predictive Analytics More Practical With Alteryx Businesses today are in a conundrum. While the current economic climate has made organizations hesitant to take a risk on a strategic investment that could be a ‘bad bet’, the pace of business and the competitive marketplace dictate that organizations move quickly to take advantage of opportunities that could create a huge revenue windfall—and create a significant competitive advantage. Unfortunately, although the vast majority of organizations have a deep visibility into the past, thanks to traditional Business Intelligence (BI) tools that analyze the historical performance of the business, many still depend on intuition or simple optimism to better understand the present and future, forcing them to ‘fly blind’ when mapping their company’s future strategy. Why? There are two primary reasons: First, traditional BI tools and platforms do not provide forward-looking insight, rendering them useless in anticipating future performance. While they are able to deliver a wide range of detailed reports, sophisticated dashboards, and complex visualizations, all are based on historical information, leaving organizations stuck using guesswork, ‘gut feel’, or simple spreadsheets to anticipate the future, thereby ignoring the tremendous potential of their information assets that could give them predictive insight for competitive advantage. Second, most of the predictive analytical tools on the market today are complex, time-consuming, and expensive to use. Requiring multiple different technology systems and highly trained, specialized personnel to get from business question to predictive answer, companies that use the predictive analytical tools available today often find that the business opportunity is already in their rear-view mirror—by several weeks or months—before they have an answer. Clearly, this is unacceptable in today’s competitive reality. What today’s cutting-edge businesses need is a powerful yet easy to use predictive analytics platform that enables them to gain significant business value from predicting future business performance—quickly and inexpensively—using forward-looking insight rather than historical data. This paper will examine today’s business reality of information overload, the hurdles organizations must overcome using today’s complex, expensive, and time-consuming predictive analytical tools, and the new approach to agile predictive analytics offered by Alteryx.
Making the business case for text analytics This report hopes to establish some of the key barriers that prevent successful commercial deployments while providing real-world assistance so obstacles can be overcome. It will focus on the di erent needs of an initial text analytics adoption, including what our contributors all cited as the top company need: strong high-level executive support to help ensure necessary long-term funding. Text analytics can be applied in almost every business case and multiple units within the same organization can benefit from a centralized analytics division. The market’s future is still a concern because of the shortage in text analytics professionals, and this reality is a guiding force for today’s successful pilots and programs.
Malicious URL Detection using Machine Learning: A Survey Malicious URL, a.k.a. malicious website, is a common and serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users to become victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. It is imperative to detect and act on such threats in a timely manner. Traditionally, this detection is done mostly through the usage of blacklists. However, blacklists cannot be exhaustive, and lack the ability to detect newly generated malicious URLs. To improve the generality of malicious URL detectors, machine learning techniques have been explored with increasing attention in recent years. This article aims to provide a comprehensive survey and a structural understanding of Malicious URL Detection techniques using machine learning. We present the formal formulation of Malicious URL Detection as a machine learning task, and categorize and review the contributions of literature studies that addresses different dimensions of this problem (feature representation, algorithm design, etc.). Further, this article provides a timely and comprehensive survey for a range of different audiences, not only for machine learning researchers and engineers in academia, but also for professionals and practitioners in cybersecurity industry, to help them understand the state of the art and facilitate their own research and practical applications. We also discuss practical issues in system design, open research challenges, and point out some important directions for future research.
Managing Big Data: A TDWI Best Practices Report The emerging phenomenon called big data is forcing numerous changes in businesses and other organizations. Many struggle just to manage the massive data sets and non-traditional data structures that are typical of big data. Others are managing big data by extending their data management skills and their portfolios of data management software. This empowers them to automate more business processes, operate closer to real time, and through analytics, learn valuable new facts about business operations, customers, partners, and so on. The result is big data management (BDM), an amalgam of old and new best practices, skills, teams, data types, and home-grown or vendor-built functionality. All of these are expanding and realigning so that businesses can fully leverage big data, not merely manage it. At the same time, big data must eventually find a permanent place in enterprise data management. BDM is well worth doing because managing big data leads to a number of benefits. According to this report’s survey, the business and technology tasks that improve most are analytic insights, the completeness of analytic data sets, business value drawn from big data, and all sales and marketing activities. BDM also has challenges, and common barriers include low organizational maturity relative to big data, weak business support, and the need to learn new technology approaches. Despite the newness of big data, half of organizations surveyed are actively managing big data today. For a quarter of organizations, big data mostly takes the form of the relational and structured data that comes from traditional applications, whereas another quarter manages traditional data along with big data from new sources such as Web servers, machines, sensors, customer interactions, and social media. A quarter of surveyed organizations have managed to scale up preexisting applications and databases to handle burgeoning volumes of relational big data. Another quarter has gone out on the leading edge by acquiring new data management platforms that are purpose-built for managing and analyzing multi-structured big data. Many more are evaluating such big data platforms now, creating a brisk market of vendor products and services for managing big data. According to the survey, the Hadoop Distributed File System (HDFS), MapReduce, and various Hadoop tools will be the software products most aggressively adopted for BDM in the next three years. Others include complex event processing (for streaming big data), NoSQL databases (for schema-free big data), in-memory databases (for real-time analytic processing of big data), private clouds, in-database analytics, and grid computing. Organizations are adjusting their technical best practices to accommodate BDM. Most are schooled in extract, transform, and load (ETL) in support of data warehousing (DW) and reporting. Preparing big data for analytics is similar, but different. Organizations are retraining existing personnel, augmenting their teams with consultants, and hiring new personnel. The focus is on data analysts, data scientists, and data architects who can develop the applications for data exploration and discovery analytics that organizations need for getting value from big data. This report accelerates users’ understanding of the many options that are available for big data management (BDM), including old, new, and upcoming options. The report brings readers up to date so they can make intelligent decisions about which tools, techniques, and team structures to apply to their next-generation solutions for BDM.
Map-Reduce for Machine Learning on Multicore We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model can be written in a certain ‘summation form,’ which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors.
MapReduce: Simplified Data Processing on Large Clusters MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers nd the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
Markov Chains Most of our study of probability has dealt with independent trials processes. These processes are the basis of classical probability theory and much of statistics. We have discussed two of the principal theorems for these processes: the Law of Large Numbers and the Central Limit Theorem. We have seen that when a sequence of chance experiments forms an independent trials process, the possible outcomes for each experiment are the same and occur with the same probability. Further, knowledge of the outcomes of the previous experiments does not influence our predictions for the outcomes of the next experiment. The distribution for the outcomes of a single experiment is sufficient to construct a tree and a tree measure for a sequence of n experiments, and we can answer any probability question about these experiments by using this tree measure. Modern probability theory studies chance processes for which the knowledge of previous outcomes influences predictions for future experiments. In principle, when we observe a sequence of chance experiments, all of the past outcomes could influence our predictions for the next experiment. For example, this should be the case in predicting a student’s grades on a sequence of exams in a course. But to allow this much generality would make it very difficult to prove general results. In 1907, A. A. Markov began the study of an important new type of chance process. In this process, the outcome of a given experiment can affect the outcome of the next experiment. This type of process is called a Markov chain.
Matchbox: Large Scale Online Bayesian Recommendations We present a probabilistic model for generating personalised recommendations of items to users of a web service. The Matchbox system makes use of content information in the form of user and item meta data in combination with collaborative filtering information from previous user behavior in order to predict the value of an item for a user. Users and items are represented by feature vectors which are mapped into a low-dimensional ‘trait space’ in which similarity is measured in terms of inner products. The model can be trained from different types of feedback in order to learn user-item preferences. Here we present three alternatives: direct observation of an absolute rating each user gives to some items, observation of a binary preference (like/ don’t like) and observation of a set of ordinal ratings on a userspecific scale. Efficient inference is achieved by approximate message passing involving a combination of Expectation Propagation (EP) and Variational Message Passing. We also include a dynamics model which allows an item’s popularity, a user’s taste or a user’s personal rating scale to drift over time. By using Assumed-Density Filtering (ADF) for training, the model requires only a single pass through the training data. This is an on-line learning algorithm capable of incrementally taking account of new data so the system can immediately reflect the latest user preferences. We evaluate the performance of the algorithm on the MovieLens and Netflix data sets consisting of approximately 1,000,000 and 100,000,000 ratings respectively. This demonstrates that training the model using the on-line ADF approach yields state-of-the-art performance with the option of improving performance further if computational resources are available by performing multiple EP passes over the training data.
Math for Machine Learning The goal of this document is to provide a \refresher’ on continuous mathematics for computer science students. It is by no means a rigorous course on these topics. The presentation, motivation, etc., are all from a machine learning perspective. The hope, however, is that it’s useful in other contexts. The two major topics covered are linear algebra and calculus (probability is currently left o )).
Matrix decompositions for regression analysis
Matrix Differentiation Throughout this presentation I have chosen to use a symbolic matrix notation. This choice was not made lightly. I am a strong advocate of index notation, when appropriate. For example, index notation greatly simplifies the presentation and manipulation of differential geometry. As a rule-of-thumb, if your work is going to primarily involve differentiation with respect to the spatial coordinates, then index notation is almost surely the appropriate choice. In the present case, however, I will be manipulating large systems of equations in which the matrix calculus is relatively simply while the matrix algebra and matrix arithmetic is messy and more involved. Thus, I have chosen to use symbolic notation.
Matrix Factorization Techniques for Recommender Systems Modern consumers are inundated with choices. Electronic retailers and content providers offer a huge selection of products, with unprecedented opportunities to meet a variety of special needs and tastes. Matching consumers with the most appropriate products is key to enhancing user satisfaction and loyalty. Therefore, more retailers have become interested in recommender systems, which analyze patterns of user interest in products to provide personalized recommendations that suit a user’s taste. Because good personalized recommendations can add another dimension to the user experience, e-commerce leaders like and Netflix have made recommender systems a salient part of their websites. Such systems are particularly useful for entertainment products such as movies, music, and TV shows. Many customers will view the same movie, and each customer is likely to view numerous different movies. Customers have proven willing to indicate their level of satisfaction with particular movies, so a huge volume of data is available about which movies appeal to which customers. Companies can analyze this data to recommend movies to particular customers.
Maximize the Effectiveness of your Text Analytics Initiatives (Slide Deck)
Maximizing the value provided by a Big Data Platform
Max-Margin Markov Networks In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches.
Measurement Drives Behavior Measurement impacts our personal lives every single day. If we want to lose some weight, we start by standing on the scale. Based on the outcome, we decide how much weight we need to lose, and every other day we check our progress. If there is enough progress, we become encouraged to lose more, and if we are disappointed, we’re driven to add even more effort in order to achieve our goal. In short, measurement drives our behavior. It can be witnessed in countless ways in our private lives. In fact, it is an important principle in the social sciences, often called the Hawthorne effect. In the business world this is no different; measurement also drives our professional behavior. Once your business starts measuring the results of a certain process, your employees will start focusing on it. There are numerous examples: If the CFO starts tracking the days-sales-outstanding (DSO—i.e., the average number of days it takes customers to pay their bills) on a daily basis, instead of assuming that customers will pay within 14 days or so, the people in the accounts receivable departments are more likely to pay attention and exert greater effort to make collections. If hotel managers and their front desk staff are held accountable for the percentage of guests that fill out the customer satisfaction survey, they will be more likely to remind guests of the survey. Measurement helps us not only to focus on our goals and objectives, but also to balance our actions. If you measure production speed alone in a manufacturing process, it is likely that quality issues will arise. For balance, you also need to measure how many produced units need rework. If a procurement department is only measured on how much additional discount it can squeeze out of contract manufacturers, it becomes hard to avoid unethical practices, such as the use of child labor in low-wage countries and the use of cheaper and environmentally unfriendly materials and production processes. Procurement departments need to identify a balanced set of metrics that includes ethical issues as well as price. In each of the functional disciplines within an organization—finance, sales, marketing, logistics, manufacturing, procurement, human resources (HR) or information technology (IT)—measurement is a key element of management, and ultimately of bottom-line performance.
Measures of dissimilarity Patterns or objects analysed using the techniques described in this book are usually represented by a vector of measurements. Many of the techniques require some measure of dissimilarity or distance between two pattern vectors, although sometimes data can arise directly in the form of a dissimilarity matrix….
Measuring Distances Applied multivariate statistics
Measuring Predictability: Theory and Macroeconomic Applications We propose a measure of predictability based on the ratio of the expected loss of a short-run forecast to the expected loss of a long-run forecast. This predictability measure can be tailored to the forecast horizons of interest, and it allows for general loss functions, univariate or multivariate information sets, and covariance stationary or difference stationary processes. We propose a simple estimator, and we suggest resampling methods for inference. We then provide several macroeconomic applications. First, we illustrate the implementation of predictability measures based on fitted parametric models for several U.S. macroeconomic time series. Second, we analyze the internal propagation mechanism of a standard dynamic macroeconomic model by comparing the predictability of model inputs and model outputs. Third, we use predictability as a metric for assessing the similarity of data simulated from the model and actual data. Finally, we outline several nonparametric extensions of our approach.
mediation: R Package for Causal Mediation Analysis In this paper, we describe the R package mediation for conducting causal mediation analysis in applied empirical research. In many scientific disciplines, the goal of researchers is not only estimating causal effects of a treatment but also understanding the process in which the treatment causally affects the outcome. Causal mediation analysis is frequently used to assess potential causal mechanisms. The mediation package implements a comprehensive suite of statistical tools for conducting such an analysis. The package is organized into two distinct approaches. Using the model-based approach, researchers can estimate causal mediation effects and conduct sensitivity analysis under the standard research design. Furthermore, the design-based approach provides several analysis tools that are applicable under different experimental designs. This approach requires weaker assumptions than the model-based approach. We also implement a statistical method for dealing with multiple (causally dependent) mediators, which are often encountered in practice. Finally, the package also offers a methodology for assessing causal mediation in the presence of treatment noncompliance, a common problem in randomized trials.
Metalearning for Feature Selection A general formulation of optimization problems in which various candidate solutions may use different feature-sets is presented, encompassing supervised classification, automated program learning and other cases. A novel characterization of the concept of a ‘good quality feature’ for such an optimization problem is provided; and a proposal regarding the integration of quality based feature selection into metalearning is suggested, wherein the quality of a feature for a problem is estimated using knowledge about related features in the context of related problems. Results are presented regarding extensive testing of this ‘feature metalearning’ approach on supervised text classification problems; it is demonstrated that, in this context, feature metalearning can provide significant and sometimes dramatic speedup over standard feature selection heuristics.
Mining Approximate Top K Subspace Anomalies in MultiDimensional TimeSeries Data Market analysis is a representative data analysis process with many applications. In such an analysis, critical numerical measures, such as profit and sales, fluctuate over time and form timeseries data. Moreover, the time series data correspond to market segments, which are described by a set of attributes, such as age, gender, education, income level, and product-category, that form a multi-dimensional structure. To better understand market dynamics and predict future trends, it is crucial to study the dynamics of time-series in multi-dimensional market segments. This is a topic that has been largely ignored in time series and data cube research. In this study, we examine the issues of anomaly detection in multi-dimensional time-series data. We propose timeseries data cube to capture the multidimensional space formed by the attribute structure. This facilitates the detection of anomalies based on expected values derived from higher level, \more general” time-series. Anomaly detection in a time-series data cube poses computational challenges, especially for high-dimensional, large data sets. To this end, we also propose an efficient search algorithm to iteratively select subspaces in the original high-dimensional space and detect anomalies within each one. Our experiments with both synthetic and real-world data demonstrate the effectiveness and efficiency of the proposed solution.
Mining Software Quality from Software Reviews: Research Trends and Open Issues Software review text fragments have considerably valuable information about users experience. It includes a huge set of properties including the software quality. Opinion mining or sentiment analysis is concerned with analyzing textual user judgments. The application of sentiment analysis on software reviews can find a quantitative value that represents software quality. Although many software quality methods are proposed they are considered difficult to customize and many of them are limited. This article investigates the application of opinion mining as an approach to extract software quality properties. We found that the major issues of software reviews mining using sentiment analysis are due to software lifecycle and the diverse users and teams.
Model-based Machine Learning Several decades of research in the field of machine learning have resulted in a multitude of different algorithms for solving a broad range of problems. To tackle a new application, a researcher typically tries to map their problem onto one of these existing methods, often influenced by their familiarity with specific algorithms and by the availability of corresponding software implementations. In this study, we describe an alternative methodology for applying machine learning, in which a bespoke solution is formulated for each new application. The solution is expressed through a compact modelling language, and the corresponding custom machine learning code is then generated automatically. This model-based approach offers several major advantages, including the opportunity to create highly tailored models for specific scenarios, as well as rapid prototyping and comparison of a range of alternative models. Furthermore, newcomers to the field of machine learning do not have to learn about the huge range of traditional methods, but instead can focus their attention on understanding a single modelling environment. In this study, we show how probabilistic graphical models, coupled with efficient inference algorithms, provide a very flexible foundation formodel-based machine learning, and we outline a large-scale commercial application of this framework involving tens of millions of users. We also describe the concept of probabilistic programming as a powerful software environment for modelbased machine learning, and we discuss a specific probabilistic programming language called Infer.NET, which has been widely used in practical applications.
Modeling and Optimization for Big Data Analytics With pervasive sensors continuously collecting and storing massive amounts of information, there is no doubt this is an era of data deluge. Learning from these large volumes of data is expected to bring significant science and engineering advances along with improvements in quality of life. However, with such a big blessing come big challenges. Running analytics on voluminous data sets by central processors and storage units seems infeasible, and with the advent of streaming data sources, learning must often be performed in real time, typically without a chance to revisit past entries. “Workhorse” signal processing (SP) and statistical learning tools have to be re-examined in today’s high-dimensional data regimes. This article contributes to the ongoing cross-disciplinary efforts in data science by putting forth encompassing models capturing a wide range of SP-relevant data analytic tasks, such as principal component analysis (PCA), dictionary learning (DL), compressive sampling (CS), and subspace clustering. It offers scalable architectures and optimization algorithms for decentralized and online learning problems, while revealing fundamental insights into the various analytic and implementation tradeoffs involved. Extensions of the encompassing models to timely data-sketching, tensor- and kernel-based learning tasks are also provided. Finally, the close connections of the presented framework with several big data tasks, such as network visualization, decentralized and dynamic estimation, prediction, and imputation of network link load traffic, as well as imputation in tensor-based medical imaging are highlighted.
mtk: A General-Purpose and Extensible R Environment for Uncertainty and Sensitivity Analyses of Numerical Experiments Along with increased complexity of the models used for scientific activities and engineering, come diverse and greater uncertainties. Today, effectively quantifying the uncertainties contained in a model appears to be more important than ever. Scientific fellows know how serious it is to calibrate their model in a robust way, and decision-makers describe how critical it is to keep the best effort to reduce the uncertainties about the model. Effectively accessing the uncertainties about the model requires mastering all the tasks involved in the numerical experiments, from optimizing the experimental design to managing the very time consuming aspect of model simulation and choosing the adequate indicators and analysis methods. In this paper, we present an open framework for organizing the complexity associated with numerical model simulation and analyses. Named mtk (Mexico Toolkit), the developed system aims at providing practitioners from different disciplines with a systematic and easy way to compare and to find the best method to effectively uncover and quantify the uncertainties contained in the model and further to evaluate their impact on the performance of the model. Such requirements imply that the system must be generic, universal, homogeneous, and extensible. This paper discusses such an implementation using the R scientific computing platform and demonstrates its functionalities with examples from agricultural modeling. The package mtk is of general purpose and easy to extend. Numerous methods are already available in the actual release version, including Fast, Sobol, Morris, Basic Monte-Carlo, Regression, LHS (Latin Hypercube Sampling), PLMM (Polynomial Linear metamodel). Most of them are compiled from available R packages with extension tools delivered by package mtk.
Multidimensional Scaling by Majorization: A Review A major breakthrough in the visualization of dissimilarities between pairs of objects was the formulation of the least-squares multidimensional scaling (MDS) model as defined by the Stress function. This function is quite flexible in that it allows possibly nonlinear transformations of the dissimilarities to be represented by distances between points in a low dimensional space. To obtain the visualization, the Stress function should be minimized over the coordinates of the points and the over the transformation. In a series of papers, Jan de Leeuw has made a significant contribution to majorization methods for the minimization of Stress in least-squares MDS. In this paper, we present a review of the majorization algorithm for MDS as implemented in the smacof package and related approaches. We present several illustrative examples and special cases.
Multi-Label Learning with Millions of Labels: Recommending Advertiser Bid Phrases for Web Pages Recommending phrases from web pages for advertisers to bid on against search engine queries is an important research problem with direct commercial impact. Most approaches have found it infeasible to determine the relevance of all possible queries to a given ad landing page and have focussed on making recommendations from a small set of phrases extracted (and expanded) from the page using NLP and ranking based techniques. In this paper, we eschew this paradigm, and demonstrate that it is possible to efficiently predict the relevant subset of queries from a large set of monetizable ones by posing the problem as a multi-label learning task with each query being represented by a separate label. We develop Multi-label Random Forests to tackle problems with millions of labels. Our proposed classifier has prediction costs that are logarithmic in the number of labels and can make predictions in a few milliseconds using 10 Gb of RAM. We demonstrate that it is possible to generate training data for our classifier automatically from click logs without any human annotation or intervention. We train our classifier on tens of millions of labels, features and training points in less than two days on a thousand node cluster. We develop a sparse semi-supervised multi-label learning formulation to deal with training set biases and noisy labels harvested automatically from the click logs. This formulation is used to infer a belief in the state of each label for each training ad and the random forest classifier is extended to train on these beliefs rather than the given labels. Experiments reveal significant gains over ranking and NLP based techniques on a large test set of 5 million ads using multiple metrics.
Multiple Change-point Detection: a Selective Overview Very long and noisy sequence data arise from biological sciences to social science including high throughput data in genomics and stock prices in econometrics. Often such data are collected in order to identify and understand shifts in trend, e.g., from a bull market to a bear market in finance or from a normal number of chromosome copies to an excessive number of chromosome copies in genetics. Thus, identifying multiple change points in a long, possibly very long, sequence is an important problem. In this article, we review both classical and new multiple change-point detection strategies. Considering the long history and the extensive literature on the change-point detection, we provide an in-depth discussion on a normal mean change-point model from aspects of regression analysis, hypothesis testing, consistency and inference. In particular, we present a strategy to gather and aggregate local information for change-point detection that has become the cornerstone of several emerging methods because of its attractiveness in both computational and theoretical properties.
Multiple Factor Analysis Multiple factor analysis (MFA, see Escofier and Pagès, 1990, 1994) analyzes observations described by several “blocks” or sets of variables. MFA seeks the common structures present in all or some of these sets. MFA is performed in two steps. First a principal component analysis (PCA) is performed on each data set which is then “normalized” by dividing all its elements by the square root of the first eigenvalue obtained from of its PCA. Second, the normalized data sets are merged to form a unique matrix and a global PCA is performed on this matrix. The individual data sets are then projected onto the global analysis to analyze communalities and discrepancies. MFA is used in very different domains such as sensory evaluation, economy, ecology, and chemistry.
Multiple Instance Learning: A Survey of Problem Characteristics and Applications Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and allows to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas are described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight on how the problem characteristics affect MIL algorithms, recommendations for future benchmarking and promising avenues for research.
Multisensor data fusion: A review of the state-of-the-art There has been an ever-increasing interest in multi-disciplinary research on multisensor data fusion technology, driven by its versatility and diverse areas of application. Therefore, there seems to be a real need for an analytical review of recent developments in the data fusion domain. This paper proposes a comprehensive review of the data fusion state of the art, exploring its conceptualizations, benefits, and challenging aspects, as well as existing methodologies. In addition, several future directions of research in the data fusion community are highlighted and described.
Multivariate Archimax Copulas A multivariate extension of the bivariate class of Archimax copulas was recently proposed by Mesiar and J agr (2013), who asked under which conditions it holds. This paper answers their question and provides a stochastic representation of multivariate Archimax copulas. A few basic properties of these copulas are explored, including their minimum and maximum domains of attraction. Several non-trivial examples of multivariate Archimax copulas are also provided.
Multivariate Linear Models in R The multivariate linear model is Y (n m) = X (n k+1) B (k+1 m) + E (n m) where Y is a matrix of n observations on m response variables; X is a model matrix with columns for k + 1 regressors, typically including an initial column of 1s for the regression constant; B is a matrix of regression coe cients, one column for each response variable; and E is a matrix of errors. This model can be t with the lm function in R, where the left-hand side of the model comprises a matrix of response variables, and the right-hand side is speci ed exactly as for a univariate linear model (i.e., with a single response variable). This appendix to Fox and Weisberg (2011) explains how to use the Anova and linearHypothesis functions in the car package to test hypotheses for parameters in multivariate linear models, including models for repeated-measures data.
Multivariate Pricing Price strategy is the key marketing tool for companies to increase their competitive edge but too often, prices are based on costs, not on customers’ perceptions of value. Value-based pricing is a business strategy which sets selling prices based on the perceived value to the customer, rather than the actual cost of the product, the market price, competitors’ prices, or the historical price. Practically speaking, the goal is to align the money spent with the value perceived. For example, the number of users, lifetime spending, number of transactions, value of transaction, return-on-investment, cost saving, revenue; the list can continue. The most common techniques employ straightforward methods such as: ‘Would you pay for this item at this price?’. While the van Westendorf method and conjoint analysis are useful, this article focuses on multivariate pricing techniques that allow flexibility and agility into the pricing models, and therefore can be more widely employed by clients and product managers.


Narrative Science Systems: A Review Automatic narration of events and entities is the need of the hour, especially when live reporting is critical and volume of information to be narrated is huge. This paper discusses the challenges in this context, along with the algorithms used to build such systems. From a systematic study, we can infer that most of the work done in this area is related to statistical data. It was also found that subjective evaluation or contribution of experts is also limited for narration context.
Neural Graph Machines: Learning Neural Networks Using Graphs Label propagation is a powerful and flexible semi-supervised learning technique on graphs. Neural networks, on the other hand, have proven track records in many supervised learning tasks. In this work, we propose a training framework with a graph-regularised objective, namely ‘Neural Graph Machines’, that can combine the power of neural networks and label propagation. This work generalises previous literature on graph-augmented training of neural networks, enabling it to be applied to multiple neural architectures (Feed-forward NNs, CNNs and LSTM RNNs) and a wide range of graphs. The new objective allows the neural networks to harness both labeled and unlabeled data by: (a) allowing the network to train using labeled data as in the supervised setting, (b) biasing the network to learn similar hidden representations for neighboring nodes on a graph, in the same vein as label propagation. Such architectures with the proposed objective can be trained efficiently using stochastic gradient descent and scaled to large graphs, with a runtime that is linear in the number of edges. The proposed joint training approach convincingly outperforms many existing methods on a wide range of tasks (multi-label classification on social graphs, news categorization, document classification and semantic intent classification), with multiple forms of graph inputs (including graphs with and without node-level features) and using different types of neural networks.
Neural Machine Translation and Sequence-to-sequence Models: A Tutorial This tutorial introduces a new and powerful set of techniques variously called ‘neural machine translation’ or ‘neural sequence-to-sequence models’. These techniques have been used in a number of tasks regarding the handling of human language, and can be a powerful tool in the toolbox of anyone who wants to model sequential data of some sort. The tutorial assumes that the reader knows the basics of math and programming, but does not assume any particular experience with neural networks or natural language processing. It attempts to explain the intuition behind the various methods covered, then delves into them with enough mathematical detail to understand them concretely, and culiminates with a suggestion for an implementation exercise, where readers can test that they understood the content in practice.
Nonlinear functional regression: a functional RKHS approach This paper deals with functional regression, in which the input attributes as well as the response are functions. To deal with this problem, we develop a functional reproducing kernel Hilbert space approach; here, a kernel is an operator acting on a function and yielding a function. We demonstrate basic properties of these functional RKHS, as well as a representer theorem for this setting; we investigate the construction of kernels; we provide some experimental insight.
Novelty Detection in Learning Systems Novelty detection is concerned with recognising inputs that differ in some way from those that are usually seen. It is a useful technique in cases where an important class of data is under-represented in the training set. This means that the performance of the network will be poor for those classes. In some circumstances, such as medical data and fault detection, it is often precisely the class that is under-represented in the data, the disease or potential fault, that the network should detect. In novelty detection systems the network is trained only on the negative examples where that class is not present, and then detects inputs that do not fits into the model that it has acquired, that it, members of the novel class. This paper reviews the literature on novelty detection in neural networks and other machine learning techniques, as well as providing brief overviews of the related topics of statistical outlier detection and novelty detection in biological organisms.


Object Oriented Analysis using Natural Language Processing concepts: A Review The Software Development Life Cycle (SDLC) starts with eliciting requirements of the customers in the form of Software Requirement Specification (SRS). SRS document needed for software development is mostly written in Natural Language(NL) convenient for the client. From the SRS document only, the class name, its attributes and the functions incorporated in the body of the class are traced based on pre-knowledge of analyst. The paper intends to present a review on Object Oriented (OO) analysis using Natural Language Processing (NLP) techniques. This analysis can be manual where domain expert helps to generate the required diagram or automated system, where the system generates the required diagram, from the input in the form of SRS.
On Being a Data Skeptic I’d like to set something straight right out of the gate. I’m not a data cynic, nor am I urging other people to be. Data is here, it’s growing, and it’s powerful. I’m not hiding behind the word “skeptic” the way climate change “skeptics” do, when they should call themselves deniers. Instead, I urge the reader to cultivate their inner skeptic, which I define by the following characteristic behavior. A skeptic is someone who maintains a consistently inquisitive attitude toward facts, opinions, or (especially) beliefs stated as facts. A skeptic asks questions when confronted with a claim that has been taken for granted. That’s not to say a skeptic brow-beats someone for their beliefs, but rather that they set up reasonable experiments to test those beliefs. A really excellent skeptic puts the “science” into the term “data science.” In this paper, I’ll make the case that the community of data practitioners needs more skepticism, or at least would benefit greatly from it, for the following reason: there’s a two-fold problem in this community. On the one hand, many of the people in it are overly enamored with data or data science tools. On the other hand, other people are overly pessimistic about those same tools. I’m charging myself with making a case for data practitioners to engage in active, intelligent, and strategic data skepticism. I’m proposing a middle-of-the-road approach: don’t be blindly optimistic, don’t be blindly pessimistic. Most of all, don’t be awed. Realize there are nuanced considerations and plenty of context and that you don’t necessarily have to be a mathematician to understand the issues. …
On Clustering Validation Techniques Cluster analysis aims at identifying groups of similar objects and, therefore helps to discover distribution of patterns and interesting correlations in large data sets. It has been subject of wide research since it arises in many application domains in engineering, business and social sciences. Especially, in the last years the availability of huge transactional and experimental data sets and the arising requirements for data mining created needs for clustering algorithms that scale and can be applied in diverse domains. This paper introduces the fundamental concepts of clustering while it surveys the widely known clustering algorithms in a comparative way. Moreover, it addresses an important issue of clustering process regarding the quality assessment of the clustering results. This is also related to the inherent features of the data set under concern. A review of clustering validity measures and approaches available in the literature is presented. Furthermore, the paper illustrates the issues that are under-addressed by the recent algorithms and gives the trends in clustering process.
On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes We compare discriminative and generative learning as typified by logistic regression and naive Bayes. We show, contrary to a widelyheld belief that discriminative classifiers are almost always to be preferred, that there can often be two distinct regimes of performance as the training set size is increased, one in which each algorithm does better. This stems from the observation – which is borne out in repeated experiments – that while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster.
On k-Anonymity and the Curse of Dimensionality In recent years, the wide availability of personal data has made the problem of privacy preserving data mining an important one. A number of methods have recently been proposed for privacy preserving data mining of multidimensional data records. One of the methods for privacy preserving data mining is that of anonymization, in which a record is released only if it is indistinguishable from k other entities in the data. We note that methods such as k-anonymity are highly dependent upon spatial locality in order to effectively implement the technique in a statistically robust way. In high dimensional space the data be- comes sparse, and the concept of spatial locality is no longer easy to define from an application point of view. In this paper, we view the k-anonymization problem from the perspec- tive of inference attacks over all possible combinations of attributes. We show that when the data contains a large number of attributes which may be considered quasi-identifiers, it becomes difficult to anonymize the data without an unacceptably high amount of information loss. This is because an exponential number of combinations of dimensions can be used to make precise inference attacks, even when individual attributes are partially specified within a range. We provide an analysis of the effect of dimensionality on k-anonymity methods. We conclude that when a data set contains a large number of attributes which are open to inference attacks, we are faced with a choice of either completely suppressing most of the data or losing the desired level of anonymity. Thus, this paper shows that the curse of high dimensionality also applies to the problem of privacy preserving data mining.
Online Algorithms This book chapter reviews fundamental concepts and results in the area of online algorithms. We first address classical online problems and then study various applications of current interest. Online algorithms represent a theoretical framework for studying problems in interactive computing. They model, in particular, that the input in an interactive system does not arrive as a batch but as a sequence of input portions and that the system must react in response to each incoming portion. Moreover, they take into account that at any point in time future input is unknown. As the name suggests, online algorithms consider the algorithmic aspects of interactive systems: We wish to design strategies that always compute good output and keep a given system in good state. No assumptions are made about the input stream. The input can even be generated by an adversary that creates new input portions based on the system’s reactions to previous ones. We seek algorithms that have a provably good performance.
Online Learning and Online Convex Optimization Online learning is a well established learning paradigm which has both theoretical and practical appeals. The goal of online learning is to make a sequence of accurate predictions given knowledge of the correct answer to previous prediction tasks and possibly additional available information. Online learning has been studied in several research fields including game theory, information theory, and machine learning. It also became of great interest to practitioners due the recent emergence of large scale applications such as online advertisement placement and online web ranking. In this survey we provide a modern overview of online learning. Our goal is to give the reader a sense of some of the interesting ideas and in particular to underscore the centrality of convexity in deriving efficient online learning algorithms. We do not mean to be comprehensive but rather to give a high-level, rigorous yet easy to follow, survey.
Online Portfolio Selection: A Survey Online portfolio selection is a fundamental problem in computational finance, which has been extensively studied across several research communities, including finance, statistics, artificial intelligence, machine learning, and data mining. This article aims to provide a comprehensive survey and a structural understanding of online portfolio selection techniques published in the literature. From an online machine learning perspective, we first formulate online portfolio selection as a sequential decision problem, and then we survey a variety of state-of-the-art approaches, which are grouped into several major categories, including benchmarks, Follow-the-Winner approaches, Follow-the-Loser approaches, Pattern-Matching–based approaches, and Meta-Learning Algorithms. In addition to the problem formulation and related algorithms, we also discuss the relationship of these algorithms with the capital growth theory so as to better understand the similarities and differences of their underlying trading ideas. This article aims to provide a timely and comprehensive survey for both machine learning and data mining researchers in academia and quantitative portfolio managers in the financial industry to help them understand the state of the art and facilitate their research and practical applications. We also discuss some open issues and evaluate some emerging new trends for future research.
Online Principal Component Analysis Principal Component Analysis (PCA) is one of the most well known and widely used procedures in scienti c computing. It is used for dimension reduction, signal denoising, regression, correlation analysis, visualization etc. It can be described in many ways but one is particularly appealing in the context of online algorithms. In the online setting, the algorithm receives the input vectors xt one ofter the other and must always output yt before receiving xt+1.
Ontology Learning from Text: A Survey of Methods After the vision of the Semantic Web was broadcasted at the turn of the millennium, ontology became a synonym for the solution to many problems concerning the fact that computers do not understand human language: if there were an ontology and every document were marked up with it and we had agents that would understand the markup, then computers would finally be able to process our queries in a really sophisticated way. Some years later, the success of Google shows us that the vision has not come true, being hampered by the incredible amount of extra work required for the intellectual encoding of semantic mark-up – as compared to simply uploading an HTML page. To alleviate this acquisition bottleneck, the field of ontology learning has since emerged as an important sub-field of ontology engineering. …
Operational Analytics from A to Z
Optimization theory in Statistics This paper addresses the issues of optimization theory and related numerical issues within the context of Statistics. Focusing on the problem of concave regression, several estimation techniques for nonparametric shape-constrained regression are classified, analyzed and compared qualitatively and quantitatively through numerical simulations. In particular, their main features, strengths and limitations for solving large instances of the problem are examined through this paper. Several improvements to enhance numerical stability and bound the computational cost are proposed. For each analyzed algorithm, the pseudo-code and its corresponding code in Scilab are provided. The results from this study demonstrate that the choice of the optimization approach strongly impact algorithmic performances. Interestingly, it is also shown that, currently, are not available methods able to solve efficiently large instance of the concave regression problems (more than many thousands of points). We suggest that further research to fill this gap in the literature should focus on finding a way to exploit and adapt classical multi-scale strategy to compute an approximate solution.
Overcoming the Barriers to Production-Ready Machine Learning Workflows (Slide Deck)


Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. However, LDA does not capture correlations between topics. In this paper, we introduce the pachinko allocation model (PAM), which captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). The leaves of the DAG represent individual words in the vocabulary, while each interior node represents a correlation among its children, which may be words or other interior nodes (topics). PAM provides a flexible alternative to recent work by Blei and Lafferty (2006), which captures correlations only between pairs of topics. Using text data from newsgroups, historic NIPS proceedings and other research paper corpora, we show improved performance of PAM in document classification, likelihood of held-out data, the ability to support finer-grained topics, and topical keyword coherence.
Parameter estimation for text analysis Presents parameter estimation methods common with discrete probability distributions, which is of particular interest in text modeling. Starting with maximum likelihood, a posteriori and Bayesian estimation, central concepts like conjugate distributions and Bayesian networks are reviewed. As an application, the model of latent Dirichlet allocation (LDA) is explained in detail with a full derivation of an approximate inference algorithm based on Gibbs sampling, including a discussion of Dirichlet hyperparameter estimation.
PARC and SAP Co-innovation: High-performance Graph Analytics for Big Data Powered by SAP HANA Graph analytics is a crucial element in extracting insights from Big Data because it helps discover hidden relationships by connecting the dots. A graph, meaning the network of nodes and relationships, treats the linkage between objects as equally important as the objects themselves. Social networks or supply chains are obvious examples, but graphs include any network of objects such as customers, products, purchase orders, customer support calls, product inventory, etc. HiperGraph, PARC’s breakthrough Big Data technology, is a high-performance graph analytics engine. Through a four-month research project with SAP, we added HiperGraph’s analytics to SAP HANA to demonstrate a live, real-time marketing insights use case. Graph reasoning technologies provide the ability to contextualize relational data with the tapestry of information and can go beyond simplistic reporting and dashboards. This creates opportunities to rapidly experiment, gain new insights, and identify root causes. The demonstrated technology match between HANA and HiperGraph has great disruptive potential, especially in the identification of key patterns within datasets (e.g., via clustering). With HANA and HiperGraph we can finally deliver on the promise of a closed feedback loop in the enterprise where transactions are analyzed and reacted to in real-time. The intelligence that is implicit in large volumes of structured and unstructured data from varieties of sources from inside or outside of the enterprise can be delivered to the users in the form of smart business applications. We concluded that the existing commercial or open source algorithms either did not provide the real-time response or were unable to scale to the large volumes of data. The requirements from our customer (an online retailer) required real-time response from their Big Data system. PARC’s graph reasoning, versatile goal-directed clustering, egocentric recommendations, and real-time recommendation algorithms combined with the power of HANA in-memory technologies far exceeded the expectations. Brand managers can use this solution to automatically find clusters of customers with similar purchases, clusters of products that are frequently bought together, clusters of products that tend to be purchased on sale vs. those that are purchased at full price, and so on, and act on these insights during the customer’s shopping experience. There is a great opportunity for businesses to gain value by combining the HANA in-memory technology with HiperGraph reasoning, recommendation, matrix factorization, egocentric collaborative filtering, and versatile goal-directed clustering. With SAP and PARC co-innovation in Big Data analytics we can now reduce and/or eliminate the need for complex extract, transform, and load (ETL) processes; increase speed in clustering; and introduce new accessibility for business users to directly explore data clusters. We are democratizing data science for all business users in the enterprise.
Patent Retrieval: A Literature Review With the ever increasing number of filed patent applications every year, the need for effective and efficient systems for managing such tremendous amounts of data becomes inevitably important. Patent Retrieval (PR) is considered is the pillar of almost all patent analysis tasks. PR is a subfield of Information Retrieval (IR) which is concerned with developing techniques and methods that effectively and efficiently retrieve relevant patent documents in response to a given search request. In this paper we present a comprehensive review on PR methods and approaches. It is clear that, recent successes and maturity in IR applications such as Web search can not be transferred directly to PR without deliberate domain adaptation and customization. Furthermore, state-of-the-art performance in automatic PR is still around average. These observations motivates the need for interactive search tools which provide cognitive assistance to patent professionals with minimal effort. These tools must also be developed in hand with patent professionals considering their practices and expectations. We additionally touch on related tasks to PR such as patent valuation, litigation, licensing, and highlight potential opportunities and open directions for computational scientists in these domains.
Perspectives of Predictive Modeling (Slide Deck)
Practical Approaches to Principal Component Analysis in the Presence of Missing Values Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of data that retain the maximal amount of variance. We study a case where some of the data values are missing, and show that this problem has many features which are usually associated with nonlinear models, such as overfitting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we provide formulas for doing that. In case of high dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to variational Bayesian learning. Different versions of PCA are compared in artificial experiments, demonstrating the effects of regularization and modeling of posterior variance. The scalability of the proposed algorithm is demonstrated by applying it to the Netflix problem.
Practical Bayesian Optimization of Machine Learning Algorithms Machine learning algorithms frequently require careful tuning of model hyperparameters, regularization terms, and optimization parameters. Unfortunately, this tuning is often a ‘black art’ that requires expert experience, unwritten rules of thumb, or sometimes brute-force search. Much more appealing is the idea of developing automatic approaches which can optimize the performance of a given learning algorithm to the task at hand. In this work, we consider the automatic tuning problem within the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). The tractable posterior distribution induced by the GP leads to efficient use of the information gathered by previous experiments, enabling optimal choices about what parameters to try next. Here we show how the effects of the Gaussian process prior and the associated inference procedure can have a large impact on the success or failure of Bayesian optimization. We show that thoughtful choices can lead to results that exceed expert-level performance in tuning machine learning algorithms. We also describe new algorithms that take into account the variable cost (duration) of learning experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization on a diverse set of contemporary algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks.
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), split tests, Control/Treatment tests, and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.
Practical Machine Learning: A New Look at Anomaly Detection Everyone loves a mystery, and at the heart of it, that’s what anomaly detection is—spotting the unusual, catching the fraud, discovering the strange activity. Anomaly detection has a wide range of useful applications, from banking security to natural sciences to medicine to marketing. Anomaly detection carried out by a machine-learning program is actually a form of artificial intelligence. With the ever-increasing volume of data and the new types of data, such as sensor data from an increasingly large variety of objects that needs to be considered, it’s no surprise that there also is a growing interest in being able to handle more decisions automatically via machine-learning applications. But in the case of anomaly detection, at least some of the appeal is the excitement of the chase itself. …
Practical Machine Learning: Innovations in Recommendation A key to one of most sophisticated and effective approaches in machine learning and recommendation is contained in the observation: “I want a pony.” As it turns out, building a simple but powerful recommender is much easier than most people think, and wanting a pony is part of the key. Machine learning, especially at the scale of huge datasets, can be a daunting task. There is a dizzying array of algorithms from which to choose, and just making the choice between them presupposes that you have sufficiently advanced mathematical background to understand the alternatives and make a rational choice. The options are also changing, evolving constantly as a result of the work of some very bright, very dedicated researchers who are continually refining existing algorithms and coming up with new ones.
Predicting Good Probabilities With Supervised Learning We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1 yielding a characteristic sigmoid shaped distortion in the predicted probabilities. Models such as Naive Bayes, which make unrealistic independence assumptions, push probabilities toward 0 and 1. Other models such as neural nets and bagged trees do not have these biases and predict well calibrated probabilities. We experiment with two ways of correcting the biased probabilities predicted by some learning methods: Platt Scaling and Isotonic Regression. We qualitatively examine what kinds of distortions these calibration methods are suitable for and quantitatively examine how much data they need to be effective. The empirical results show that after calibration boosted trees, random forests, and SVMs predict the best probabilities.
Predicting the future of predictive analytics The proliferation of data and the increasing awareness of the potential to gain valuable insight and a competitive advantage from that information are driving organizations to place data at the heart of their corporate strategy. Consumers regularly benefit from predictive analytics, in the form of anything from weather forecasts to insurance premiums. Organizations are now exploring the possibilities of using historical data to exploit growth opportunities and minimize business risks, a field known as predictive analytics. SAP commissioned Loudhouse to conduct primary research among business decision-makers in UK and US organizations to understand their attitudes to and experiences of predictive analytics, as well as a future view of usage, value and investment. The research reveals that businesses are struggling to take full advantage of the burgeoning and already overwhelming amount of data being collected. Challenges abound as firms seek to make effective use of data. While many businesses are investing in predictive analytics and already seeing benefits in a number of areas, even more see this as a future investment priority for their business. The research points to a data-driven future where advanced predictive analytics sits at the core of the business function rather than being siloed, is embraced by a greater proportion of the workforce and is used to drive decision-making across the whole business. To achieve this future vision, however, it is clear that businesses need to up-skill their workforce and invest in more intuitive technology. While firms in the UK and US recognize the potential of predictive analytics and the need for investment in skills, the US is further along the adoption curve than the UK. US organizations show greater promise for future investment in – and roll-out of – predictive analytics software across the workforce. Furthermore, US organizations perceive fewer challenges in using data to inform corporate strategy, and sense a greater need for training to embed the benefits of the technology into day-to-day business.
Predictive Analytics – The rise and value of predictive analytics in enterprise decision making In the past few years, predictive analytics has gone from an exotic technique practiced in just a few niches, to a competitive weapon with a rapidly expanding range of uses. The increasing adoption of predictive analytics is fueled by converging trends: the Big Data phenomenon, ever-improving tools for data analysis, and a steady stream of demonstrated successes in new applications. The modern analyst would say, “Give me enough data, and I can predict anything.”
Predictive Analytics enters the Mainstream
Predictive Analytics for Business Advantage To compete effectively in an era in which advantages are ephemeral, companies need to move beyond historical, rear-view understandings of business performance and customer behavior and become more proactive. Organizations today want to be predictive; they want to gain information and insight from data that enables them to detect patterns and trends, anticipate events, spot anomalies, forecast using what-if simulations, and learn of changes in customer behavior so that staff can take actions that lead to desired business outcomes. Success in being predictive and proactive can be a game changer for many business functions and operations, including marketing and sales, operations management, finance, and risk management. Although it has been around for decades, predictive analytics is a technology whose time has finally come. A variety of market forces have joined to make this possible, including an increase in computing power, a better understanding of the value of the technology, the rise of certain economic forces, and the advent of big data. Companies are looking to use the technology to predict trends and understand behavior for better business performance. Forward-looking companies are using predictive analytics across a range of disparate data types to achieve greater value. Companies are looking to also deploy predictive analytics against their big data. Predictive analytics is also being operationalized more frequently as part of a business process. Predictive analytics complements business intelligence and data discovery, and can enable organizations to go beyond the analytic complexity limits of many online analytical processing (OLAP) implementations. It is evolving from a specialized activity once utilized only among elite firms and users to one that could become mainstream across industries and market sectors. This TDWI Best Practices Report focuses on how organizations can and are using predictive analytics to derive business value. It provides in-depth survey analysis of current strategies and future trends for predictive analytics across both organizational and technical dimensions including organizational culture, infrastructure, data, and processes. It looks at the features and functionalities companies are using for predictive analytics and the infrastructure trends in this space. The report offers recommendations and best practices for successfully implementing predictive analytics in the organization. TDWI Research finds a shift occurring in the predictive analytics user base. No longer is predictive analytics the realm of statisticians and mathematicians. There is a definite trend toward business analysts and other business users making use of this technology. Marketing and sales are big current users of predictive analytics and market analysts are making use of the technology. Therefore, the report also looks at the skills necessary to perform predictive analytics and how the technology can be utilized and operationalized across the organization. It explores cultural and business issues involved with making predictive analytics possible. A unique feature of this report is its examination of the characteristics of companies that have actually measured either top-line or bottom-line impact with predictive analytics. In other words, it explores how those companies compare against those that haven’t measured value.
Predictive Analytics in Cloud CRM Cloud CRM solutions have long since become mainstream and expanded beyond their initial foothold in small and mid-sized enterprises. Today B2B and B2C companies in many industries are eyeing cloud CRM solutions for their call center, their sales force and more. These CRM solutions offer the classic benefits of a cloud offering—multi-tenancy, usage pricing, location transparency, network access and high availability. What these solutions often do not offer, however, is advanced analytics. Typically limited to reporting and dashboards, many cloud CRM solutions do not allow companies to maximize the value of their data. The analytics that are available in a typical cloud CRM solution assume that users have the necessary decision-making expertise as well as the time required to make these decisions. In a typical high-volume call center environment, neither of these assumptions is reasonable. What companies using these CRM solutions need is predictive analytics, specifically predictive analytic solutions designed to drive better decisions in real-time. Delivering predictive analytic solutions in a cloud CRM environment, however, has its own challenges. Those adopting cloud CRM solutions don’t want (nor have the budget) to hire analytics teams to build predictive analytic models using traditional techniques or have to move their cloud CRM data to an on-premise analytic environment. They also don’t want predictive analytic models “in the lab,” they want business-friendly decision-making solutions powered by sophisticated predictive analytics. To be successful with cloud CRM, these companies need predictive applications for the cloud, in the cloud.
Predictive Analytics in the Cloud Predictive analytics and cloud are hot topics in business today. Predictive analytics are increasingly the focus of many companies’ efforts to improve business performance with analytics while cloud is fast becoming the default option for purchasing and deploying software. Public, private and hybrid clouds are all evolving rapidly and are here to stay. But what’s happening at the intersection of these two technologies? How can predictive analytics in the cloud add value and what are the critical risks and issues involved? This paper explores the five key opportunities for organizations to use predictive analytics in the cloud:
• Using the cloud to deliver predictive analytics-enabled “Decisions as a Service” solutions
• Embedding predictive analytics in Software as a Service (SaaS) and other cloud-deployed applications
• Using the cloud to deliver predictive analytics to non-cloud applications across the extended enterprise
• Building predictive analytics against data in the cloud
• Using cloud computing to deliver elastic compute power for building predictive analytic models
Before discussing the various options for predictive analytics in the cloud it is worth clarifying exactly what we mean by the various terms.
Predictive Analytics Whitepaper
Principal Components: Mathematics, Example, Interpretation This paper will explain Principal Components Analysis, where “respecting structure” means “preserving variance”. Explain how to do PCA, show an example, and describe some of the issues that come up in interpreting the results. PCA has been rediscovered many times in many fields, so it is also known as the Karhunen-Loeve transformation, the Hotelling transformation, the method of empirical orthogonal functions, and singular value decomposition. We will call it PCA.
Probabilistic Forecasting A probabilistic forecast takes the form of a predictive probability distribution over future quantities or events of interest. Probabilistic forecasting aims to maximize the sharpness of the predictive distributions, subject to calibration, on the basis of the available information set. We formalize and study notions of calibration in a prediction space setting. In practice, probabilistic calibration can be checked by examining probability integral transform (PIT) histograms. Proper scoring rules such as the logarithmic score and the continuous ranked probability score serve to assess calibration and sharpness simultaneously. As a special case, consistent scoring functions provide decision-theoretically coherent tools for evaluating point forecasts.We emphasizemethodological links to parametric and nonparametric distributional regression techniques, which attempt to model and to estimate conditional distribution functions; we use the context of statistically postprocessed ensemble forecasts in numerical weather prediction as an example. Throughout, we illustrate concepts and methodologies in data examples.
Probabilistic Programming Probabilistic programs are usual functional or imperative programs with two added constructs: (1) the ability to draw values at random from distributions, and (2) the ability to condition values of variables in a program via observations. Models from diverse application areas such as computer vision, coding theory, cryptographic protocols, biology and reliability analysis can be written as probabilistic programs. Probabilistic inference is the problem of computing an explicit representation of the probability distribution implicitly specified by a probabilistic program. Depending on the application, the desired output from inference may vary – we may want to estimate the expected value of some function f with respect to the distribution, or the mode of the distribution, or simply a set of samples drawn from the distribution. In this paper, we describe connections this research area called “Probabilistic Programming” has with programming languages and software engineering, and this includes language design, and the static and dynamic analysis of programs. We survey current state of the art and speculate on promising directions for future research.
Probabilistic Syntax 1. The Tradition of Categoricity and Prospects for Stochasticity
2. The joys and perils of corpus linguistics
3. Probabilistic syntactic models
4. Continuous categories
5. Explaining more: probabilistic models of syntactic usage
6. Conclusion:
There are many phenomena in syntax that cry out for non-categorical and probabilistic modeling and explanation. The opportunity to leave behind ill-fitting categorical assumptions, and to better model probabilities of use in syntax is exciting. The existence of ‘soft’ constraints within the variable output of an individual speaker, of exactly the same kind as the typological syntactic constraints found across languages, makes exploration of probabilistic grammar models compelling. We saw that one is not limited to simple surface representations: I have tried to outline how probabilistic models can be applied on top of one’s favorite sophisticated linguistic representations. The frequency evidence needed for parameter estimation in probabilistic models requires a lot more data collection, and a lot more careful evaluation and model building than traditional syntax, where one example can be the basis of a new theory, but the results can enrich linguistic theory by revealing the soft constraints at work in language use. This is an area ripe for exploration by the next generation of syntacticians.
Probabilistic Topic Models As our collective knowledge continues to be digitized and stored – in the form of news, blogs, Web pages, scientific articles, books, images, sound, video, and social networks – it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search, and understand these vast amounts of information. Right now, we work with online information using two main tools – search and links. We type keywords into a search engine and find a set of documents related to them. We look at the documents in that set, possibly navigating to other linked documents. This is a powerful way of interacting with our online archive, but something is missing. Imagine searching and exploring documents based on the themes that run through them. We might “zoom in” and “zoom out” to find specific or broader themes; we might look at how those themes changed through time or how they are connected to each other. Rather than finding documents through keyword search alone, we might first find the theme that we are interested in, and then examine the documents related to that theme.
Probabilistic Topic Models Many chapters in this book illustrate that applying a statistical method such as Latent Semantic Analysis (LSA; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998) to large databases can yield insight into human cognition. The LSA approach makes three claims: that semantic information can be derived from a word-document co-occurrence matrix; that dimensionality reduction is an essential part of this derivation; and that words and documents can be represented as points in Euclidean space. In this chapter, we pursue an approach that is consistent with the first two of these claims, but differs in the third, describing a class of statistical models in which the semantic properties of words and documents are expressed in terms of probabilistic topics.
Probability Cheatsheet (Cheat Sheet)
Process Mining – seeing the real process (Poster)
ProM 6: The Process Mining Toolkit Process mining has been around for a decade, and it has proven to be a very fertile and successful researchfield. Part of this success can be contributed to the ProM tool, which combines most of the existing process mining techniques as plug-ins in a single tool. ProM 6 removes many limitations that existed in the previous versions, in par- ticular with respect to the tight integration between the tool and the GUI. ProM 6 has been developed from scratch and uses a completely redesigned architecture. The changes were driven by many real-life ap- plications and new insights into the design of process analysis software. Furthermore, the introduction of XESame in this toolkit allows for the conversion of logs to the ProM native format without programming.
Putting Hadoop To Work The Right Way Big data has rapidly progressed from an ambitious vision realized by a handful of innovators to a competitive advantage for businesses across dozens of industries. More data is available now – about customers, employees, competitors – than ever before. That data is intelligence that can have an impact on daily business decisions. Industry leaders rely on big data as a foundation to beat their rivals. This big data revolution is also behind the massive adoption of Hadoop. Hadoop has become the platform of choice for companies looking to harness big data’s power. Simply put, most traditional enterprise systems are too limited to keep up with the influx of big data; they are not designed to ingest large quantities of data first and analyze it later. The need to store and analyze big data cost-effectively is the main reason why Hadoop usage has grown exponentially in the last five years.
Putting Predictive Analytics to Work in Operations In a recent study, companies that tightly integrate predictive analytics into operational systems are more than twice as likely to report a transformative impact from predictive analytics as any others. Leaders are creating advantage across multiple core functions such as marketing, customer management, collections, customer service, distribution and more by applying predictive analytics to operational decisions. Small operational decisions, especially those about customers, are made over and over. The value of these decisions rapidly adds up. Making these decisions well is critical to business performance. Traditional approaches to analytics are hard to scale and hard to use in the real-time environment common for operational decisions. Predictive analytics work better, using data to make these decisions more precise, targeting and personalizing them to maximize customer value. Because these decisions about customers are made at the front line of an organization they must be made quickly and embedded in operational systems. This creates a challenge – the insight to action gap – that prevents many companies from taking advantage of predictive analytics in these decisions. To close the insight to action gap and put predictive analytics to work in operations, companies need to adopt Decision Management, a proven approach that leverages predictive analytics to make operational systems analytical.


R Essentials R is a highly extensible, open-source programming language used mainly for statistical analysis and graphics. It is a GNU project very similar to the S language. R’s strengths include its varying data structures, which can be more intuitive than data storage in other languages; its built-in statistical and graphical functions; and its large collection of useful plugins that can enhance the language’s abilities in many different ways. R can be run either as a series of console commands, or as full scripts, depending on the use case. It is heavily object-oriented, and allows you to create your own functions. It also has a common API for interacting with most file structures to access data stored outside of R.
R for Machine Learning It is common for today’s scientific and business industries to collect large amounts of data, and the ability to analyze the data and learn from it is critical to making informed decisions. Familiarity with software such as R allows users to visualize data, run statistical tests, and apply machine learning algorithms. Even if you already know other software, there are still good reasons to learn R:
1. R is free. If your future employer does not already have R installed, you can always download it for free, unlike other proprietary software packages that require expensive licenses. No matter where you travel, you can have access to R on your computer.
2. R gives you access to cutting-edge technology. Top researchers develop statistical learning methods in R, and new algorithms are constantly added to the list of packages you can download.
3. R is a useful skill. Employers that value analytics recognize R as useful and important. If for no other reason, learning R is worthwhile to help boost your resume.
R Is Still Hot–and Getting Hotter For the white paper titled “R Is Hot” about four years ago, the goal was to introduce the R programming language to a larger audience of statistical analysts and data scientists. As it turned out, the timing couldn’t have been better: R has now blossomed into the language of choice for data scientists worldwide. Today, R is widely used by scientists, researchers, and statisticians for modeling data and solving problems quickly and effectively. When people ask me which factors are driving the broader adoption of R among data analysts, I usually offer two key points:
1. R was designed specifically for statistical analysis, which means that analytics written in R typically require fewer lines of code (and hence less work) than analytics written in Java, Python, or C++.
2. R is an open source project, which means it is continually improved, upgraded, enhanced, and expanded by a global community of incredibly passionate developers and users.
R Markdown Cheat Sheet (Cheat Sheet)
R Markdown Cheat Sheet (Cheat Sheet)
R Quo Vadis? (Slide Deck)
Random Forests Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Freund and Schapire[1996]), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Rankcluster: An R Package for Clustering Multivariate Partial Rankings The Rankcluster package is the first R package proposing both modeling and clustering tools for ranking data, potentially multivariate and partial. Ranking data are modeled by the Insertion Sorting Rank (ISR) model, which is a meaningful model parametrized by a central ranking and a dispersion parameter. A conditional independence assumption allows multivariate rankings to be taken into account, and clustering is performed by means of mixtures of multivariate ISR models. The parameters of the cluster (central rankings and dispersion parameters) help the practitioners to interpret the clustering. Moreover, the Rankcluster package provides an estimate of the missing ranking positions when rankings are partial. After an overview of the mixture of multivariate ISR models, the Rankcluster package is described and its use is illustrated through the analysis of two real datasets.
Rationality, Optimism and Guarantees in General Reinforcement Learning In this article,1 we present a top-down theoretical study of general reinforcement learning agents. We begin with rational agents with unlimited resources and then move to a setting where an agent can only maintain a limited number of hypotheses and optimizes plans over a horizon much shorter than what the agent designer actually wants. We axiomatize what is rational in such a setting in a manner that enables optimism, which is important to achieve systematic explorative behavior. Then, within the class of agents deemed rational, we achieve convergence and nite-error bounds. Such results are desirable since they imply that the agent learns well from its experiences, but the bounds do not directly guarantee good performance and can be achieved by agents doing things one should obviously not. Good performance cannot in fact be guaranteed for any agent in fully general settings. Our approach is to design agents that learn well from experience and act rationally. We introduce a framework for general reinforcement learning agents based on rationality axioms for a decision function and an hypothesis-generating function designed so as to achieve guarantees on the number errors. We will consistently use an optimistic decision function but the hypothesis-generating function needs to change depending on what is known/assumed. We investigate a number of natural situations having either a frequentist or Bayesian avor, deterministic or stochastic environments and either nite or countable hypothesis class. Further, to achieve su ciently good bounds as to hold promise for practical success we introduce a notion of a class of environments being generated by a set of laws. None of the above has previously been done for fully general reinforcement learning environments.
Realization of Ontology Web Search Engine This paper describes the realization of the Ontology Web Search Engine. The Ontology Web Search Engine is realizable as independent project and as a part of other projects. The main purpose of this paper is to present the Ontology Web Search Engine realization details as the part of the Semantic Web Expert System and to present the results of the Ontology Web Search Engine functioning. It is expected that the Semantic Web Expert System will be able to process ontologies from the Web, generate rules from these ontologies and develop its knowledge base.
Real-Time Big Data Analytics: Emerging Architecture Imagine that it’s 2007. You’re a top executive at major search engine company, and Steve Jobs has just unveiled the iPhone. You immediately ask yourself, “Should we shift resources away from some of our current projects so we can create an experience expressly for iPhone users?” Then you begin wondering, “What if it’s all hype? Steve is a great showman … how can we predict if the iPhone is a fad or the next big thing?” The good news is that you’ve got plenty of data at your disposal. The bad news is that you have no way of querying that data and discovering the answer to a critical question: How many people are accessing my sites from their iPhones? Back in 2007, you couldn’t even ask the question without upgrading the schema in your data warehouse, an expensive process that might have taken two months. Your only choice was to wait and hope that a competitor didn’t eat your lunch in the meantime. Justin Erickson, a senior product manager at Cloudera, told me a version of that story and I wanted to share it with you because it neatly illustrates the difference between traditional analytics and real-time big data analytics. Back then, you had to know the kinds of questions you planned to ask before you stored your data. “Fast forward to the present and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it,” says Erickson. “Technologies such as MapReduce, Hive and Impala enable you to run queries without changing the data structures underneath.” Today, you are much less likely to face a scenario in which you cannot query data and get a response back in a brief period of time. Analytical processes that used to require month, days, or hours have been reduced to minutes, seconds, and fractions of seconds. But shorter processing times have led to higher expectations. Two years ago, many data analysts thought that generating a result from a query in less than 40 minutes was nothing short of miraculous. Today, they expect to see results in under a minute. That’s practically the speed of thought – you think of a query, you get a result, and you begin your experiment. “It’s about moving with greater speed toward previously unknown questions, defining new insights, and reducing the time between when an event happens somewhere in the world and someone responds or reacts to that event,” says Erickson. A rapidly emerging universe of newer technologies has dramatically reduced data processing cycle time, making it possible to explore and experiment with data in ways that would not have been practical or even possible a few years ago. Despite the availability of new tools and systems for handling massive amounts of data at incredible speeds, however, the real promise of advanced data analytics lies beyond the realm of pure technology. “Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse,” says Michael Minelli, co-author of Big Data, Big Analytics. “It’s about the ability to make better decisions and take meaningful actions at the right time. It’s about detecting fraud while someone is swiping a credit card, or triggering an offer while a shopper is standing on a checkout line, or placing an ad on a website while someone is reading a specific article. It’s about combining and analyzing data so you can take the right action, at the right time, and at the right place.” For some, real-time big data analytics (RTBDA) is a ticket to improved sales, higher profits and lower marketing costs. To others, it signals the dawn of a new era in which machines begin to think and respond more like humans.
Real-Time Enterprise Stories More than 20 detailed case studies from Bloomberg Businessweek Research Services and Forbes Insights featuring leading-edge enterprises across industries to explore the real value of the in-memory platform: SAP HANA
Reducing the Dimensionality of Data with Neural Networks High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such ‘‘autoencoder’’ networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.
Reducing the Sampling Complexity of Topic Models Inference in topic models typically involves a sampling step to associate latent variables with observations. Unfortunately the generative model loses sparsity as the amount of data increases, requiring O(k) operations per word for k topics. In this paper we propose an algorithm which scales linearly with the number of actually instantiated topics kd in the document. For large document collections and in structured hierarchical models kd k. This yields an order of magnitude speedup. Our method applies to a wide variety of statistical models such as PDP and HDP. At its core is the idea that dense, slowly changing distributions can be approximated e ciently by the combination of a Metropolis-Hastings step, use of sparsity, and amortized constant time sampling via Walker’s alias method.
Regularized Discriminant Analysis Linear and quadratic discriminant analysis are considered in the small sample high-dimensional setting. Alternatives to the usual maximum likelihood (plug-in) estimates for the covariance matrices are proposed. These alternatives are characterized by two parameters, the values of which are customized to individual situations by jointly minimizing a sample based estimate of future misclassification risk. Computationally fast implementations are presented, and the efficacy of the approach is examined through simulation studies and application to data. These studies indicate that in many circumstances dramatic gains in classification accuracy can be achieved.
Reinforcement Learning: A Tutorial The purpose of this tutorial is to provide an introduction to reinforcement learning (RL) at a level easily understood by students and researchers in a wide range of disciplines. The intent is not to present a rigorous mathematical discussion that requires a great deal of effort on the part of the reader, but rather to present a conceptual framework that might serve as an introduction to a more rigorous study of RL. The fundamental principles and techniques used to solve RL problems are presented. The most popular RL algorithms are presented. Section 1 presents an overview of RL and provides a simple example to develop intuition of the underlying dynamic programming mechanism. In Section 2 the parts of a reinforcement learning problem are discussed. These include the environment, reinforcement function, and value function. Section 3 gives a description of the most widely used reinforcement learning algorithms. These include TD(l) and both the residual and direct forms of value iteration, Q-learning, and advantage learning. In Section 4 some of the ancillary issues in RL are briefly discussed, such as choosing an exploration strategy and an appropriate discount factor. The conclusion is given in Section 5. Finally, Section 6 is a glossary of commonly used terms followed by references in Section 7 and a bibliography of RL applications in Section 8. The tutorial structure is such that each section builds on the information provided in previous sections. It is assumed that the reader has some knowledge of learning algorithms that rely on gradient descent (such as the backpropagation of errors algorithm).
Representation Learning: A Review and New Perspectives The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning. Index Terms – Deep learning, representation learning, feature learning, unsupervised learning, Boltzmann Machine, autoencoder, neural nets
Robust Principal Component Analysis? This paper is about a curious phenomenon. Suppose we have a data matrix, which is the superposition of a low-rank component and a sparse component. Can we recover each component individually? We prove that under some suitable assumptions, it is possible to recover both the low-rank and the sparse components exactly by solving a very convenient convex program called Principal Component Pursuit; among all feasible decompositions, simply minimize a weighted combination of the nuclear norm and of the l1 norm. This suggests the possibility of a principled approach to robust principal component analysis since our methodology and results assert that one can recover the principal components of a data matrix even though a positive fraction of its entries are arbitrarily corrupted. This extends to the situation where a fraction of the entries are missing as well. We discuss an algorithm for solving this optimization problem, and present applications in the area of video surveillance, where our methodology allows for the detection of objects in a cluttered background, and in the area of face recognition, where it offers a principled way of removing shadows and specularities in images of faces.
ROC Curve, Lift Chart and Calibration Plot This paper presents ROC curve, lift chart and calibration plot, three well known graphical techniques that are useful for evaluating the quality of classification models used in data mining and machine learning. Each technique, normally used and studied separately, defines its own measure of classification quality and its visualization. Here, we give a brief survey of the methods and establish a common mathematical framework which adds some new aspects, explanations and interrelations between these techniques. We conclude with an empirical evaluation and a few examples on how to use the presented techniques to boost classification accuracy
RStorm: Developing and Testing Streaming Algorithms in R Streaming data, consisting of indefinitely evolving sequences, are becoming ubiquitous in many branches of science and in various applications. Computer Scientists have developed streaming applications such as Storm and the S4 distributed stream computing platform1 to deal with data streams. However, in current production packages testing and evaluating streaming algorithms is cumbersome. This paper presents RStorm for the development and evaluation of streaming algorithms analogous to these production packages, but implemented fully in R. RStorm allows developers of streaming algorithms to quickly test, iterate, and evaluate various implementations of streaming algorithms. The paper provides both a canonical computer science example, the streaming word count, and examples of several statistical applications of RStorm.
Rules of Machine Learning: Best Practices for ML Engineering This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. It presents a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machinelearned model, then you have the necessary background to read this document.


SAP HANA for Next-Generation Business: Applications and Real-Time Analytics Explore and Analyze Vast Quantities of Data from Virtually Any Source at the Speed of Thought SAP has introduced a new class of solutions that powers the next generation of business applications. The SAP HANA database is an in-memory database that combines transactional data processing, analytical data processing, and application logic processing functionality in memory. SAP HANA removes the limits of traditional database architecture that have severely constrained how business applications can be developed to support real-time business.
SAP Predictive Analysis – Real Life Use Case Predicting Who Will BuyAdditional Insurance This paper provides a step-by-step description, including screenshots, to evaluate how SAP Predictive Analysis and SAP InfiniteInsight, which as of this writing are now sold as a bundle, can be used to predict the potential customers who will buy additional products based on their behavior of interest. Using the combined strength of both SAP InfiniteInsight and SAP Predictive Analysis this article will also demonstrate how these two products can fulfill the challenge and also supplement each other to provide even better prediction models. This article is for educational purposes only and uses actual data from an insurance company. The data comes from the CoIL (Computational Intelligence and Learning) challenge from year 2000, which had the following goal: “Can you predict who would be interested in buying a caravan insurance policy and give an explanation why?” After reading this article you will be able to understand the differences between classification algorithms. You will learn how to simplify a dataset by determining which variables are important and vice versa, score a model and also how to use SAP Predictive Analysis and SAP InfiniteInsight to build models on existing data and run the custom models on new data. The authors of this article have closely observed the online Predictive Analysis community where members are constantly looking for two important things; which statistical algorithm to choose for which business case and real life business cases with actual data. The latter is certainly difficult to find but immensely crucial in understanding of this subject. Of course data is vital to each company and in today’s competitive market, gives a competitive advantage. Finding a real life case for educational purposes with actual data is a big challenge.
Scalable Strategies for Computing with Massive Data This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to de ne parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-speci c code. Second, the bigmemory package implements memory- and le-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support e cient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have e ectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware.
Score Aggregation Techniques in Retrieval Experimentation Comparative evaluations of information retrieval systems are based on a number of key premises, including that representative topic sets can be created, that suitable relevance judgements can be generated, and that systems can be sensibly compared based on their aggregate performance over the selected topic set. This paper considers the role of the third of these assumptions – that the performance of a system on a set of topics can be represented by a single overall performance score such as the average, or some other central statistic. In particular, we experiment with score aggregation techniques including the arithmetic mean, the geometric mean, the harmonic mean, and the median. Using past TREC runs we show that an adjusted geometricmean providesmore consistent system rankings than the arithmetic mean when a significant fraction of the individual topic scores are close to zero, and that score standardization (Webber et al., SIGIR 2008) achieves the same outcome in a more consistent manner.
Security and Privacy Aspects in MapReduce on Clouds: A Survey MapReduce is a programming system for distributed processing large-scale data in an efficient and fault tolerant manner on a private, public, or hybrid cloud. MapReduce is extensively used daily around the world as an efficient distributed computation tool for a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and analysis of social networks. Security and privacy of data and MapReduce computations are essential concerns when a MapReduce computation is executed in public or hybrid clouds. In order to execute a MapReduce job in public and hybrid clouds, authentication of mappers-reducers, confidentiality of data-computations, integrity of data-computations, and correctness-freshness of the outputs are required. Satisfying these requirements shield the operation from several types of attacks on data and MapReduce computations. In this paper, we investigate and discuss security and privacy challenges and requirements, considering a variety of adversarial capabilities, and characteristics in the scope of MapReduce. We also provide a review of existing security and privacy protocols for MapReduce and discuss their overhead issues.
Security and Privacy of Sensitive Data in Cloud Computing: A Survey of Recent Developments Cloud computing is revolutionizing many ecosystems by providing organizations with computing resources featuring easy deployment, connectivity, configuration, automation and scalability. This paradigm shift raises a broad range of security and privacy issues that must be taken into consideration. Multi-tenancy, loss of control, and trust are key challenges in cloud computing environments. This paper reviews the existing technologies and a wide array of both earlier and state-of-the-art projects on cloud security and privacy. We categorize the existing research according to the cloud reference architecture orchestration, resource control, physical resource, and cloud service management layers, in addition to reviewing the existing developments in privacy-preserving sensitive data approaches in cloud computing such as privacy threat modeling and privacy enhancing protocols and solutions.
Security-related Research in Ubiquitous Computing — Results of a Systematic Literature Review In an endeavor to reach the vision of ubiquitous computing where users are able to use pervasive services without spatial and temporal constraints, we are witnessing a fast growing number of mobile and sensor-enhanced devices becoming available. However, in order to take full advantage of the numerous benefits offered by novel mobile devices and services, we must address the related security issues. In this paper, we present results of a systematic literature review (SLR) on security-related topics in ubiquitous computing environments. In our study, we found 5165 scientific contributions published between 2003 and 2015. We applied a systematic procedure to identify the threats, vulnerabilities, attacks, as well as corresponding defense mechanisms that are discussed in those publications. While this paper mainly discusses the results of our study, the corresponding SLR protocol which provides all details of the SLR is also publicly available for download.
Selecting a Visual Analytics Application Not surprisingly, everywhere you look, software companies are adopting the terms “visual analytics” and “interactive data visualization.” Tools that do little more than produce charts and dashboards are now laying claim to the label. How can you tell the cleverly named from the genuine? What should you look for? It’s important to know the defining characteristics of visual analytics before you shop. This paper introduces you to the seven essential elements of true visual analytics applications.
Sentiment Analysis of Twitter Data :A Survey of Techniques With the advancement of web technology and its growth, there is a huge volume of data present in the web for internet users and a lot of data is generated too. Internet has become a platform for online learning, exchanging ideas and sharing opinions. Social networking sites like Twitter, Facebook, Google+ are rapidly gaining popularity as they allow people to share and express their views about topics,have discussion with different communities, or post messages across the world. There has been lot of work in the field of sentiment analysis of twitter data. This survey focuses mainly on sentiment analysis of twitter data which is helpful to analyze the information in the tweets where opinions are highly unstructured, heterogeneous and are either positive or negative, or neutral in some cases. In this paper, we provide a survey and a comparative analyses of existing techniques for opinion mining like machine learning and lexicon-based approaches, together with evaluation metrics. Using various machine learning algorithms like Naive Bayes, Max Entropy, and Support Vector Machine, we provide a research on twitter data streams. We have also discussed general challenges and applications of Sentiment Analysis on Twitter
Sequential Combining of Expert Information Using Mathematica In every real-world domain where reasoning under uncertainty is required, combining information from different sources (‘experts’) can be really a powerful tool to enhance accuracy and precision of the ‘final’ estimate of the unknown quantity. Bayesian paradigm offers a coherent perspective which can be used to address the problem, but an issue strictly related to information combining is how to perform an efficient process of sequential consulting: at each stage, the investigator can select the ‘best’ expert to be consulted and choose whether to stop or continue the consulting. The aim of this paper is to rephrase the Bayesian combining algorithm in a sequential context and use Mathematica to implement suitable selecting and stopping rules.
Sequential Pattern Mining – Approaches and Algorithms Sequences of events, items or tokens occurring in an ordered metric space appear often in data and the requirement to detect and analyse frequent subsequences is a common problem. Sequential Pattern Mining arose as a sub-field of data mining to focus on this field. This paper surveys the approaches and algorithms proposed to date.
Set optimization – a rather short introduction Recent developments in set optimization are surveyed and extended including various set relations as well as fundamental constructions of a convex analysis for set- and vector-valued functions, and duality for set optimization problems. Extensive sections with bibliographical comments summarize the state of the art. Applica- tions to vector optimization and financial risk measures are discussed along with algorithmic approaches to set optimization problems.
Shallow and Deep Networks Intrusion Detection System: A Taxonomy and Survey Intrusion detection has attracted a considerable interest from researchers and industries. The community, after many years of research, still faces the problem of building reliable and efficient IDS that are capable of handling large quantities of data, with changing patterns in real time situations. The work presented in this manuscript classifies intrusion detection systems (IDS). Moreover, a taxonomy and survey of shallow and deep networks intrusion detection systems is presented based on previous and current works. This taxonomy and survey reviews machine learning techniques and their performance in detecting anomalies. Feature selection which influences the effectiveness of machine learning (ML) IDS is discussed to explain the role of feature selection in the classification and training phase of ML IDS. Finally, a discussion of the false and true positive alarm rates is presented to help researchers model reliable and efficient machine learning based intrusion detection systems.
Shiny Cheat Sheet (Cheat Sheet)
Simple Ain’t Easy: Real-World Problems with Basic Summary Statistics In applied statistical work, the use of even the most basic summary statistics, like means, medians and modes, can be seriously problematic. When forced to choose a single summary statistic, many considerations come into practice. This repo attempts to describe some of the non-obvious properties possessed by standard statistical methods so that users can make informed choices about methods.
Smart Data – Innovationen aus Daten Das Bundesministerium für Wirtschaft und Technologie wird mit dem Technologiewettbewerb „Smart Data – Innovationen aus Daten“ Forschungs- und Entwicklungsaktivitäten (FuE-Aktivitäten) fördern, die den zukünftigen Markt um Big Data für die Wirtschaft am Standort Deutschland nachhaltig erschließen. Studien prognostizieren einen rasanten Anstieg des weltweiten Umsatzvolumens mit Big Data auf über 15 Mrd. € im Jahr 2016 (Deutschland: 1,6 Mrd. €). Deutschland hat gute Chancen, im Bereich der skalierbaren Datenmanagement und -analysesysteme international eine führende Rolle einzunehmen. Sowohl etablierte Unternehmen der deutschen IT-Wirtschaft, zahlreiche Forschungseinrichtungen wie auch diverse Start-Ups sind im Umfeld von Big Data bereits aktiv. Mit „Smart Data“ soll ein Schwerpunkt auf die Entwicklung von innovativen Diensten und Dienstleistungen gelegt werden, um eine frühzeitige breitenwirksame Nutzung voranzutreiben. Die Verwertung der Big Data-Technologien steht noch weitgehend am Beginn und konzentriert sich dabei auf einige spezifische Bereiche wie Online-Werbung und E-Commerce in größeren Unternehmen und Organisationen. Von den entstehenden Lösungen wird erwartet, dass sie aufgrund ihrer Handhabbarkeit vor allem in Bezug auf Datensicherheit und Datenqualität in der Wirtschaft leicht Anklang finden. Insbesondere sollen durch die FuE-Aktivitäten innovative Systemlösungen für kleine und mittelständische Unternehmen (KMU) entstehen. „Smart Data“ steht für eine über die Technologieentwicklung hinaus gehende anwendungsnahe Perspektive, die auch KMU eine attraktive und rechtssichere Nutzung und Verwertung von Massendaten ermöglich. Dazu gehört auch, die grundlegenden Rahmenbedingungen, z.B. den Rechtsrahmen für die Nutzung von Big Data, zu adressieren. Gesucht werden Projekte mit Leuchtturmcharakter zur Beseitigung technischer, struktureller, organisatorischer und rechtlicher Hemmnisse für den Einsatz von Big Data-Technologien. Die Projekte sollen in den Anwendungs-bereichen Industrie, Mobilität, Energie und Gesundheit angesiedelt sein. Das Technologieprogramm „Smart Data“ folgt den Zielstellungen der IKT-Strategie „Deutschland Digital 2015“ der Bundesregierung sowie denen des Zukunftsprojekts „Internetbasierte Dienste für die Wirtschaft“ im Rahmen der Hightech-Strategie 2020 und liegt somit im erheblichen Bundesinteresse. Das Programm knüpft an wichtige Basistechnologien und Standards als Grundlage von Big Data an, die zum Beispiel in anderen BMWi-Technologieprogrammen wie THESEUS, Trusted Cloud, Autonomik für Industrie 4.0, Elektromobilität und E-Energy entwickelt wurden oder noch werden. Synergieeffekte mit dem vom Bundesministerium für Bildung und Forschung (BMBF) geförderten Programm „Management und Analyse großer Datenmengen (Big Data)“ oder mit korrespondierenden Programmen der Europäischen Kommission sind erwünscht.
Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones The growth in the use of computationally intensive statistical procedures, especially with big data, has necessitated the usage of parallel computation on diverse platforms such as multicore, GPUs, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to “embarrassingly parallel” (EP) algorithms, especially on non-shared memory platforms. This paper develops a broadlyapplicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations.
Software Escalation Prediction with Data Mining One of the most severe manifestations of poor quality of software products occurs when a customer “escalates” a defect: an escalation is triggered when a defect significantly impacts a customer’s operations. Escalated defects are then quickly resolved, at a high cost, outside of the general product release engineering cycle. While the software vendor and its customers often detect and report defects before they are escalated it is not always possible to quickly and accurately prioritize reported defects for resolution. As a result, even previously known defects, in addition to newly discovered defects, are often escalated by customers. Labor cost of escalations from known defects to a software vendor can amount to millions of dollars per year. The total costs to the vendor are even greater, including loss of reputation, satisfaction, loyalty, and repeat revenue. The objective of Escalation Prediction (EP) is to avoid escalations from known product defects by predicting and proactively resolving those known defects that have the highest escalation risk. This short paper outlines the business case for EP, an analysis of the business problem, the solution architecture, and some preliminary validation results on the effectiveness of EP.
Solutions Big Data IBM (Slide Deck)
Solving Differential Equations in R Although R is still predominantly applied for statistical analysis and graphical representation, it is rapidly becoming more suitable for mathematical computing. One of the fields where considerable progress has been made recently is the solution of differential equations. Here we give a brief overview of differential equations that can now be solved by R.
Some Class-Participation Demonstations for Decision Theory and Bayesian Statistics
Some models and methods for the analysis of observational data This article provides a short, concise and essentially self-contained exposition of some of the most important models and methods for the analysis of observational data, and a substantial number of illustrations of their application. Although for the most part our presentation follows P. Rosenbaum’s book, “Observational Studies”, and naturally draws on related literature, it contains original elements and simplifies and generalizes some basic results. The illustrations, based on simulated data, show the methods at work in some detail, highlighting pitfalls and emphasizing certain subjective aspects of the statistical analyses.
Sparse Principal Component Analysis Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables, thus it is often difficult to interpret the results.We introduce a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings.We first show that PCA can be formulated as a regression-type optimization problem; sparse loadings are then obtained by imposing the lasso (elastic net) constraint on the regression coefficients. Efficient algorithms are proposed to fit our SPCA models for both regular multivariate data and gene expression arrays. We also give a new formula to compute the total variance of modified principal components. As illustrations, SPCA is applied to real and simulated data with encouraging results.
Spatial interpolation: Techniques for spatial data analysis (Slide Deck)
Spatio-Temporal Clustering: A Survey Spatio-temporal clustering is a process of grouping objects based on their spatial and temporal similarity. It is relatively new subfield of data mining which gained high popularity especially in geographic information sciences due to the pervasiveness of all kinds of location-based or environmental devices that record position, time or/and environmental properties of an object or set of objects in realtime. As a consequence, different types and large amounts of spatio-temporal data became available that introduce new challenges to data analysis and require novel approaches to knowledge discovery. In this chapter we concentrate on the spatiotemporal clustering in geographic space. First, we provide a classification of different types of spatio-temporal data. Then, we focus on one type of spatio-temporal clustering – trajectory clustering, pprovide an overview of the state-of-the-art approaches and methods of spatio-temporal clustering and finally present several scenarios in different application domains such as movement, cellular networks and environmental studies.
SQL-on-Hadoop Engines Explained Big Data And Hadoop – Hadoop is being regarded as one of the best platforms for storing and managing big data. It owes its success to its high data storage and processing scalability, low price/performance ratio, high performance, high availability, high schema flexibility, and its capability to handle all types of data. Unfortunately, Hadoop APIs, such as HDFS, MapReduce, and HBase, are quite complex. They require expertise in Java programming (or similar languages) and require in‐depth knowledge of how to parallelize query processing efficiently. The downsides of these interfaces are a small target audience, low productivity, and limited tool support. The Need For SQL-on-Hadoop Engines – What is needed is a programming interface that retains HDFS’s performance and scalability, offers high productivity and maintainability, is known to non‐technical users, and can be used by many reporting and analytical tools. The obvious choice is evidently SQL. SQL is a highlevel, declarative, and standardized database language, it’s familiar to countless BI specialists, it’s supported by almost all reporting and analytical tools, and has proven its worth over and over again. To offer SQL on Hadoop, SQL query engines are needed that can query and manipulate data stored in HDFS or HBase. Such products are called SQL‐on‐Hadoop engines. Lately, the popularity of SQL‐on‐Hadoop engine is growing rapidly. Here are just a few of the many SQLon‐ Hadoop engines available: Apache Drill, Apache Hive, CitusDB, Cloudera Impala, Concurrent Lingual, Hadapt, HP Vertica, InfiniDB, JethroData, MemSQL, Pivotal HAWQ, Progress DataDirect, ScleraDB, Shark, and SpliceMachine. On the outside most of the SQL‐on‐Hadoop engines look alike. They all support some SQL‐dialect that can be invoked through ODBC or JDBC. Internally, they can be very different. The differences stem from the purpose for which they have been designed. Here are some potential use cases for which they may have been designed:
• batch‐oriented query environment (data mining)
• interactive query environment (OLAP, self‐service BI, data visualization)
• point‐queries (retrieving and manipulating individual objects)
• investigative analytics (data science)
• operational intelligence (real‐time analytics)
• transactional (production systems) Undesired Big Data Silos – Most Hadoop‐based systems have been designed and developed by organizations for one or two use cases. The workload characteristics of these use cases are usually massive data load and execution of non‐interactive, complex forms of analytics. However, Hadoop implementations can support other use cases, including interactive reporting, data stream processing, transactional processing, and text search. The growing availability of SQL‐on‐Hadoop engines has just widen the range of use cases of Hadoop even more. Unfortunately, when deployed for a different use case, a specific Hadoop implementation may be unsuitable with regard to functionality or performance. Development of another use case may force an organization to develop a second solution in which data is stored again. In the long run, this results in many data management platforms: each one designed and optimized to support a limited number of use cases. Finally, this leads to undesirable big data silos. The disadvantages of having big data silos are: high costs because of data duplication, high data latency, complex data replication solutions, and data quality problems. Silos may work well temporarily, but history has shown that eventually the users of these silos will want to combine data from multiple data sources. When this happens, each application is extended to access multiple data sources. This leads to a dedicated integration solution for each one of them. The result is another undesired solution: an integration labyrinth. For an organization it’s almost impossible to guarantee that all these integration solutions are correct, efficient, and lead to consistent results. The Need For One Data Management Platform – The ROI on all big data stored in Hadoop is increased when it’s made available for as wide a range of use cases as possible, including all the new use cases offered by the SQL‐on‐Hadoop engines. What is needed is one Hadoop data management platform that has been designed to support all the current and future use cases, so that the need for duplication of all that big data is minimized and that the development of big data silos and an integration labyrinth is avoided. The Whitepaper – This whitepaper explains what SQL‐on‐Hadoop engines are, what the technological challenges are, and what potential use cases of SQL‐on‐Hadoop are. Besides a high‐level comparison of several of these engines, it also contains a detailed description of Apache Drill that brings to light some of the pertinent issues in providing SQL capabilities on big data. In addition, the MapR Technologies data management platform M7 is also described as an example of a big data platform that can support many different use cases.
SQLScript: Efficiently Analyzing Big Enterprise Data in SAP HANA Today, not only Internet companies such as Google, Facebook or Twitter do have Big Data but also Enterprise Information Systems store an ever growing amount of data (called Big Enterprise Data in this paper). In a classical SAP system landscape a central data warehouse (SAP BW) is used to integrate and analyze all enterprise data. In SAP BW most of the business logic required for complex analytical tasks (e.g., a complex currency conversion) is implemented in the application layer on top of a standard relational database. While being independent from the underlying database when using such an architecture, this architecture has two major drawbacks when analyzing Big Enterprise Data: (1) algorithms in ABAP do not scale with the amount of data and (2) data shipping is required. To this end, we present a novel programming language called SQLScript to efficiently support complex and scalable analytical tasks inside SAP’s new main-memory database HANA. SQLScript provides two major extensions to the SQL dialect of SAP HANA: A functional and a procedural extension. While the functional extension allows the definition of scalable analytical tasks on Big Enterprise Data, the procedural extension provides imperative constructs to orchestrate the analytical tasks. The major contributions of this paper are two novel functional extensions: First, an extended version of the MapReduce programming model for supporting parallelizable user-defined functions (UDFs). Second, compared to recursion in the SQL standard, a generalized version of recursion to support graph analytics as well as machine learning tasks.
Stacked Graphs – Geometry & Aesthetics In February 2008, the New York Times published an unusual chart of box office revenues for 7500 movies over 21 years. The chart was based on a similar visualization, developed by the first author, that displayed trends in music listening. This paper describes the design decisions and algorithms behind these graphics, and discusses the reaction on the Web. We suggest that this type of complex layered graph is effective for displaying large data sets to a mass audience. We provide a mathematical analysis of how this layered graph relates to traditional stacked graphs and to techniques such as ThemeRiver, showing how each method is optimizing a different “energy function”. Finally, we discuss techniques for coloring and ordering the layers of such graphs. Throughout the paper, we emphasize the interplay between considerations of aesthetics and legibility.
Stan: A Probabilistic Programming Language Stan is a probabilistic programming language for specifying statistical models. A Stan program imperatively de nes a log probability function over parameters conditioned on speci ed data and constants. As of version 2.2.0, Stan provides full Bayesian inference for continuous-variable models through Markov chain Monte Carlo methods such as the No-U-Turn sampler, an adaptive form of Hamiltonian Monte Carlo sampling. Penalized maximum likelihood estimates are calculated using optimization methods such as the Broyden-Fletcher-Goldfarb-Shanno algorithm. Stan is also a platform for computing log densities and their gradients and Hessians, which can be used in alternative algorithms such as variational Bayes, expectation propagation, and marginal inference using approximate integration. To this end, Stan is set up so that the densities, gradients, and Hessians, along with intermediate quantities of the algorithm such as acceptance probabilities, are easily accessible. Stan can be called from the command line, through R using the RStan package, or through < Python using the PyStan package. All three interfaces support sampling or optimization-based inference and analysis, and RStan and PyStan also provide access to log probabilities, gradients, Hessians, and I/O transforms.
Standards in Predictive Analytics: The role of R, Hadoop and PMML in the mainstreaming of predictive analytics. Just a few years ago it was common to develop a predictive analytic model using a single proprietary tool against a sample of structured data. This would then be applied in batch, storing scores for future use in a database or data warehouse. Recently this model has been disrupted. There is a move to real-time scoring, calculating the value of predictive analytic models when they are needed rather than looking for them in a database. At the same time the variety of model execution platforms has expanded with in-database execution, columnar and inmemory databases as well as MapReduce-based execution becoming increasingly common. Modeling too has changed: the open source analytic modeling language R has become extremely popular, with up to 70% of analytic professionals using it at least occasionally. The range of data types being used in models has expanded along with the approaches used for storage. Modelers increasingly want to analyze all their data, not just a sample, to build a model. This increasingly complex and multi-vendor environment has increased the value of standards, both published standards and open source standards. In this paper we will explore the growing role of standards for predictive analytics in expanding the analytic ecosystem, handling Big Data and supporting the move to real-time scoring.
Statistical Inference: The Big Picture Statistics has moved beyond the frequentist-Bayesian controversies of the past. Where does this leave our ability to interpret results? I suggest that a philosophy compatible with statistical practice, labeled here statistical pragmatism, serves as a foundation for inference. Statistical pragmatism is inclusive and emphasizes the assumptions that connect statistical models with observed data. I argue that introductory courses often mischaracterize the process of statistical inference and I propose an alternative “big picture” depiction.
Statistical Learning and Kernel Methods (Slide Deck)
Statistical Learning Theory: Models, Concepts, and Results Statistical learning theory provides the theoretical basis for many of today’s machine learning algorithms and is arguably one of the most beautifully developed branches of arti cial intelligence in general. It originated in Russia in the 1960s and gained wide popularity in the 1990s following the development of the so-called Support Vector Machine (SVM), which has become a standard tool for pattern recognition in a variety of domains ranging from computer vision to computational biology. Providing the basis of new learning algorithms, however, was not the only motivation for developing statistical learning theory. It was just as much a philosophical one, attempting to answer the question of what it is that allows us to draw valid conclusions from empirical data. In this article we attempt to give a gentle, non-technical overview over the key ideas and insights of statistical learning theory. We do not assume that the reader has a deep background in mathematics, statistics, or computer science. Given the nature of the subject matter, however, some familiarity with mathematical concepts and notations and some intuitive understanding of basic probability is required. There exist many excellent references to more technical surveys of the mathematics of statistical learning theory: the monographs by one of the founders of statistical learning theory (Vapnik, 1995, Vapnik, 1998), a brief overview over statistical learning theory in Section 5 of Scholkopf and Smola (2002), more technical overview papers such as Bousquet et al. (2003), Mendelson (2003), Boucheron et al. (2005), Herbrich and Williamson (2002), and the monograph Devroye et al. (1996).
Statistical Model Selection with ‘Big Data’ Big Data offer potential benefits for statistical modelling, but confront problems like an excess of false positives, mistaking correlations for causes, ignoring sampling biases, and selecting by inappropriate methods. We consider the many important requirements when searching for a data-based relationship using Big Data, and the possible role of Autometrics in that context. Paramount considerations include embedding relationships in general initial models, possibly restricting the number of variables to be selected over by non-statistical criteria (the formulation problem), using good quality data on all variables, analyzed with tight significance levels by a powerful selection procedure, retaining available theory insights (the selection problem) while testing for relationships being well specified and invariant to shifts in explanatory variables (the evaluation problem), using a viable approach that resolves the computational problem of immense numbers of possible models.
Statistical Modeling: The Two Cultures There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
Statistical Software (R, SAS, SPSS, and Minitab) for Blind Students and Practitioners Access to information is crucial for the blind person’s success in education, but transferring knowledge about the existence of techniques into actually being able to complete those tasks is what will ultimately improve the blind person’s employment prospects. This paper is based on the experiences of the two authors; as blind academics in statistics, we are dependent on the usefulness of statistical software for blind users more than most blind people. The use of the \we” throughout this article is intentionally meant to be personal in terms of our own experiences but more importantly, also re ects the needs of the blind community as a whole. Blind students often bene t from one-to-one teaching resources which can aid in their uptake of statistical thinking and practice, but this additional service is only a temporary solution. Once the student has completed their rst course in statistics, they may embark on research at a university, or head out into industry to apply their knowledge. Irrespective of the direction they choose, they will need certainty in being able to independently create graphs for the sighted readers of their work. At the 2009 Workshop on E-Inclusion in Mathematics and Sciences, the rst author was able to meet other researchers who are concerned about the low rate of blind people entering the sciences in a broad sense and the mathematical sciences in particular. Godfrey (2009) presents what we believe is the rst formalized presentation (written by a blind person) of the current state of a airs for blind people taking statistics courses. Much of the material covered in that work still holds true today, although there have been some technological changes that have altered the landscape a little. The four main considerations of Godfrey (2009) were graphics, software, statistical tables, and mathematical formulae. Although software was just one element discussed, graphics and mathematical formulae are playing an increasing role in the usefulness of statistical software, especially with respect to the accessibility of support documentation. We have reviewed four statistical software packages that blind people might want to use in their university education. Our review is restricted to the Windows operating system because this is the predominant environment in which blind people are working. Before we review R, SAS, SPSS, and Minitab, we outline our expectations of statistical software, describe a simple task used to evaluate some practical experiences, and describe some issues with certain le formats and graphics. Following the software-speci c sections there is a general discussion of pertinent issues for software developers, including the relevant details of the legislative environment in the United States of America. The article closes with a simpli ed set of criteria and our overall assessment of the current state of the usefulness of statistical software for blind users.
Statistical Thinking: An Approach to Management (Slide Deck)
Stein´s Paradox in Statistics
STL: A seasonal Trend decomposition Procedure based on LOESS
Stochastic Gradient Descent Tricks Chapter 1 strongly advocates the stochastic back-propagation method to train neural networks. This is in fact an instance of a more general technique called stochastic gradient descent (SGD). This chapter provides background material, explains why SGD is a good learning algorithm when the training set is large, and provides useful recommendations.
Stopping stepwise: Why stepwise and similar selection methods are bad, and what you should use A common problem in regression analysis is that of variable selection. Often, you have a large number of potential independent variables, and wish to select among them, perhaps to create a ‘best’ model. One common method of dealing with this problem is some form of automated procedure, such as forward, backward, or stepwise selection. We show that these methods are not to be recommended, and present better alternatives using PROC GLMSELECT and other methods.
Strengths and Weaknesses of Weak-Strong Cluster Problems: A Detailed Overview of State-of-the-art Classical Heuristics vs Quantum Approaches To date, a conclusive detection of quantum speedup remains elusive. Recently, a team by Google Inc. [arXiv:1512.02206] proposed a weak-strong cluster model tailored to have tall and narrow energy barriers separating local minima, with the aim to highlight the value of finite-range tunneling. More precisely, results from quantum Monte Carlo simulations, as well as the D-Wave 2X quantum annealer scale considerably better than state-of-the-art simulated annealing simulations. Moreover, the D-Wave 2X quantum annealer is $\sim 10^8$ times faster than simulated annealing on conventional computer hardware for problems with approximately $10^3$ variables. Here, an overview of different sequential, nontailored, as well as specialized tailored algorithms on the Google instances is given. We show that the quantum speedup is limited to sequential approaches and study the typical complexity of the benchmark problems using insights from the study of spin glasses.
Structural Equation Models Structural equation models (SEMs), also called simultaneous equation models, are multivariate (i.e., multiequation) regression models. Unlike the more traditional multivariate linear model, however, the response variable in one regression equation in an SEM may appear as a predictor in another equation; indeed, variables in an SEM may influence one-another reciprocally, either directly or through other variables as intermediaries. These structural equations are meant to represent causal relationships among the variables in the model. …
Structural Intervention Distance (SID) for Evaluating Causal Graphs Causal inference relies on the structure of a graph, often a directed acyclic graph (DAG). Di erent graphs may result in di erent causal inference statements and di erent intervention distributions. To quantify such differences, we propose a (pre-) distance between DAGs, the structural intervention distance (SID). The SID is based on a graphical criterion only and quanti es the closeness between two DAGs in terms of their corresponding causal inference statements. It is therefore well-suited for evaluating graphs that are used for computing interventions. Instead of DAGs it is also possible to compare CPDAGs, completed partially directed acyclic graphs that represent Markov equivalence classes. Since it di ers significantly from the popular Structural Hamming Distance (SHD), the SID constitutes a valuable additional measure.
Structure Learning of Probabilistic Graphical Models: A Comprehensive Survey Probabilistic graphical models combine the graph theory and probability theory to give a multivariate statistical modeling. They provide a unified description of uncertainty using probability and complexity using the graphical model. Especially, graphical models provide the following several useful properties:
• Graphical models provide a simple and intuitive interpretation of the structures of probabilistic models. On the other hand, they can be used to design and motivate new models.
• Graphical models provide additional insights into the properties of the model, including the conditional independence properties.
• Complex computations which are required to perform inference and learning in sophisticated models can be expressed in terms of graphical manipulations, in which the underlying mathematical expressions are carried along implicitly.
The graphical models have been applied to a large number of fields, including bioinformatics, social science, control theory, image processing, marketing analysis, among others. However, structure learning for graphical models remains an open challenge, since one must cope with a combinatorial search over the space of all possible structures. In this paper, we present a comprehensive survey of the existing structure learning algorithms.
Stupid Data Miner Tricks: Overfitting the S and P 500 It wasn’t too long ago that calling someone a data miner was a very bad thing. You could start a fistfight at a convention of statisticians with this kind of talk. It meant that you were finding the analytical equivalent of the bunnies in the clouds, poring over data until you found something. Everyone knew that if you did enough poring, you were bound to find that bunny sooner or later, but it was no more real than the one that blows over the horizon. Now, data mining is a small industry, with entire companies devoted to it. There are academic conferences devoted solely to data mining. The phrase no longer elicits as many invitations to step into the parking lot as it used to. What’s going on? These new data mining people are not fools. Sometimes data mining makes sense, and sometimes it doesn’t. …
Summarizing large text collection using topic modeling and clustering based on MapReduce framework. Document summarization provides an instrument for faster understanding the collection of text documents and has a number of real life applications. Semantic similarity and clustering can be utilized efficiently for generating effective summary of large text collections. Summarizing large volume of text is a challenging and time consuming problem particularly while considering the semantic similarity computation in summarization process. Summarization of text collection involves intensive text processing and computations to generate the summary. MapReduce is proven state of art technology for handling Big Data. In this paper, a novel framework based on MapReduce technology is proposed for summarizing large text collection. The proposed technique is designed using semantic similarity based clustering and topic modeling using Latent Dirichlet Allocation (LDA) for summarizing the large text collection over MapReduce framework. The summarization task is performed in four stages and provides a modular implementation of multiple documents summarization. The presented technique is evaluated in terms of scalability and various text summarization parameters namely, compression ratio, retention ratio, ROUGE and Pyramid score are also measured. The advantages of MapReduce framework are clearly visible from the experiments and it is also demonstrated that MapReduce provides a faster implementation of summarizing large text collections and is a powerful tool in Big Text Data analysis.
Support vs Confidence in Association Rule Algorithms The discovery of interesting association relationships among large amounts of business transactions is currently vital for making appropriate business decisions. There are currently a variety of algorithms to discover association rules. Some of these algorithms depend on the use of minimum support to weed out the uninteresting rules. Other algorithms look for highly correlated items, that is, rules with high confidence. In this paper we present a description of these types of association rule algorithms and a comparison of two algorithms representative of these approaches, with the aim of understanding the pros and cons of the support- and confidence-based approaches.
Survey of Clustering Data Mining Techniques Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and were successfully applied to real-life data mining problems. They are subject of the survey.
Survey of Distributed Decision We survey the recent distributed computing literature on checking whether a given distributed system configuration satisfies a given boolean predicate, i.e., whether the configuration is legal or illegal w.r.t. that predicate. We consider classical distributed computing environments, including mostly synchronous fault-free network computing (LOCAL and CONGEST models), but also asynchronous crash-prone shared-memory computing (WAIT-FREE model), and mobile computing (FSYNC model).
Survey of Expressivity in Deep Neural Networks We survey results on neural network expressivity described in ‘On the Expressive Power of Deep Neural Networks’. The paper motivates and develops three natural measures of expressiveness, which all display an exponential dependence on the depth of the network. In fact, all of these measures are related to a fourth quantity, trajectory length. This quantity grows exponentially in the depth of the network, and is responsible for the depth sensitivity observed. These results translate to consequences for networks during and after training. They suggest that parameters earlier in a network have greater influence on its expressive power — in particular, given a layer, its influence on expressivity is determined by the remaining depth of the network after that layer. This is verified with experiments on MNIST and CIFAR-10. We also explore the effect of training on the input-output map, and find that it trades off between the stability and expressivity.
Survey of Keyword Extraction Techniques Keywords are commonly used for search engines and document databases to locate information and determine if two pieces of test are related to each other. Reading and summarizing the contents of large entries of text into a small set of topics is difficult and time consuming for a human, so much so that it becomes nearly impossible to accomplish with limited manpower as the size of the information grows. As a result, automated systems are being more commonly used to do this task. This problem is challenging due to the intricate complexities of natural lan- guage, as well as the inherent difficulty in determining if a word or set of words accurately represent topics present within the text. With the advent of the internet, there is now both a massive amount of information available, as well as a demand to be able to search through all of this information. Keyword extraction from text data is a common tool used by search engines and indexes alike to quickly categorize and locate specific data based on explicitly or implicitly supplied keywords.
Survey on Feature Selection Feature selection plays an important role in the data mining process. It is needed to deal with the excessive number of features, which can become a computational burden on the learning algorithms. It is also necessary, even when computational resources are not scarce, since it improves the accuracy of the machine learning tasks, as we will see in the upcoming sections. In this review, we discuss the different feature selection approaches, and the relation between them and the various machine learning algorithms.
Survival Analysis in R This document is intended to assist an individual who has familiarity with R and who is taking a survival analysis course. Specifically, this was constructed for a biostatistics course at UCLA. Many theoretical details have been intentionally omitted for brevity; it is assumed the reader is familiar with the theory of the topics presented. Likewise, it is assumed the reader has basic understanding of R including working with data frames, vectors, matrices, plotting, and linear model fitting and interpretation. Functions that are introduced will only have the key arguments mentioned and discussed. Most functions have several other (optional) arguments, however, many of these will not be useful for an introductory course. The functions, with the exception of those I wrote, have well-written descriptions that specify each of the potential arguments and their use. The functions I have written include documentation on the following web site: <” target=”top”>http://…/> Ideally, this survival analysis document would be printed front-to-back and bound like a book. No topics run over two pages and those that are two pages would then be on opposing pages, making a topic’s introduction available without the need to flip back and forth between pages (unless the reader forgets a previous topic, then some flipping may be necessary). A more thorough look at Cox PH models beyond what is discussed here is available in a guide constructed by John Fox, which is listed in the References.
Sustainability in the Age of Big Data Big data and climate change share one important characteristic: Both are changing the course of history. Carbon dioxide levels have not been this high in 800,000 years, and the amount of data being generated today is unprecedented. The question at the recent Wharton conference on “Sustainability in the Age of Big Data” was how rapidly advancing information technologies can be brought together to forestall the worst ravages of global climate change. As Gary Survis, CMO of Big Data company Syncsort, IGEL senior fellow and conference moderator, noted, “It is rare that there is a confluence of two seismic events as transformative as climate change and big data. It presents amazing opportunities, as well as responsibilities.” Coming to terms with the scope of big data is a challenge, but the promise is enormous. Big data has the potential to revolutionize the two industries that generate the most carbon dioxide – energy and agriculture. Machine-to-machine communication can help reduce energy demands and increase the viability of renewable power sources. On farms, data from the molecular level may help give rise to a new green revolution, and sensors in satellites, farmland, trucks and grocery stores promise to reduce waste industry-wide. Important questions remain. Can big data be used to influence people’s behavior without manipulating them? Can private enterprise capitalize on big data’s possibilities without riding roughshod over the rights of those who generate the data? And can the high-tech innovations already underway in the developed world help solve the problems of those most in need? How well we answer these questions will determine whether we can realize the historic potential of “Sustainability in the Age of Big Data.”
Symbolic Calculus in Mathematical Statistics: A Review In the last ten years, the employment of symbolic methods has substantially extended both the theory and the applications of statistics and probability. This survey reviews the development of a symbolic technique arising from classical umbral calculus, as introduced by Rota and Taylor in $1994.$ The usefulness of this symbolic technique is twofold. The first is to show how new algebraic identities drive in discovering insights among topics apparently very far from each other and related to probability and statistics. One of the main tools is a formal generalization of the convolution of identical probability distributions, which allows us to employ compound Poisson random variables in various topics that are only somewhat interrelated. Having got a different and deeper viewpoint, the second goal is to show how to set up algorithmic processes performing efficiently algebraic calculations. In particular, the challenge of finding these symbolic procedures should lead to a new method, and it poses new problems involving both computational and conceptual issues. Evidence of efficiency in applying this symbolic method will be shown within statistical inference, parameter estimation, L\’evy processes, and, more generally, problems involving multivariate functions. The symbolic representation of Sheffer polynomial sequences allows us to carry out a unifying theory of classical, Boolean and free cumulants. Recent connections within random matrices have extended the applications of the symbolic method.
Symbolic Data Analysis: Definitions and Examples With the advent of computers, large, very large datasets have become routine. What is not so routine is how to analyse these data and/or how to glean useful information from within their massive confines. One approach is to summarize large data sets in such a way that the resulting summary dataset is of a manageable size. One consequence of this is that the data may no longer be formatted as single values such as is the case for classical data, but may be represented by lists, intervals, distributions and the like. These summarized data are examples of symbolic data. This paper looks at the concept of symbolic data in general, and then attempts to review the methods currently available to analyse such data. It quickly becomes clear that the range of methodologies available draws analogies with developments prior to 1900 which formed a foundation for the inferential statistics of the 1900’s, methods that are largely limited to small (by comparison) data sets and limited to classical data formats. The scarcity of available methodologies for symbolic data also becomes clear and so draws attention to an enormous need for the development of a vast catalogue (so to speak) of new symbolic methodologies along with rigorous mathematical foundational work for these methods.
Symbolic, Distributed and Distributional Representations for Natural Language Processing in the Era of Deep Learning: a Survey Natural language and symbols are intimately correlated. Recent advances in machine learning (ML) and in natural language processing (NLP) seem to contradict the above intuition: symbols are fading away, erased by vectors or tensors called distributed and distributional representations. However, there is a strict link between distributed/distributional representations and symbols, being the first an approximation of the second. A clearer understanding of the strict link between distributed/distributional representations and symbols will certainly lead to radically new deep learning networks. In this paper we make a survey that aims to draw the link between symbolic representations and distributed/distributional representations. This is the right time to revitalize the area of interpreting how symbols are represented inside neural networks.
SynopSys: Large Graph Analytics in the SAP HANA Database Through Summarization Graph-structured data is ubiquitous and with the advent of social networking platforms has recently seen a significant increase in popularity amongst researchers. However, also many business applications deal with this kind of data and can therefore benefit greatly from graph processing functionality offered directly by the underlying database. This paper summarizes the current state of graph data processing capabilities in the SAP HANA database and describes our efforts to enable large graph analytics in the context of our research project SynopSys. With powerful graph pattern matching support at the core, we envision OLAP-like evaluation functionality exposed to the user in the form of easy-to-apply graph summarization templates. By combining them, the user is able to produce concise summaries of large graph-structured datasets. We also point out open questions and challenges that we plan to tackle in the future developments on our way towards large graph analytics.


Teaching Machines to Read and Comprehend Teaching machines to read natural language documents remains an elusive challenge. Machine reading systems can be tested on their ability to answer questions posed on the contents of documents that they have seen, but until now large scale training and test datasets have been missing for this type of evaluation. In this work we define a new methodology that resolves this bottleneck and provides large scale supervised reading comprehension data. This allows us to develop a class of attention based deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure.
Temporal Data Mining: An Overview To classify data mining problems and algorithms we used two dimensions: data type and type of mining operations. One of the main issue that arise during the data mining process is treating data that contains temporal information. The area of temporal data mining has very much attention in the last decade because from the time related feature of the data, one can extract much significant information which can not be extracted by the general methods of data mining. Many interesting techniques of temporal data mining were proposed and shown to be useful in many applications. Since temporal data mining brings together techniques from different fields such as databases, statistics and machine learning the literature is scattered among many different sources. In this paper, we present a survey on techniques of temporal data mining.
Ten Simple Rules for Reproducible Computational Research Replication is the cornerstone of a cumulative science. However, new tools and technologies, massive amounts of data, interdisciplinary approaches, and the complexity of the questions being asked are complicating replication efforts, as are increased pressures on scientists to advance their research. As full replication of studies on independently collected data is often not feasible, there has recently been a call for reproducible research as an attainable minimum standard for assessing the value of scientific claims. This requires that papers in experimental science describe the results and provide a sufficiently clear protocol to allow successful repetition and extension of analyses based on original data. The importance of replication and reproducibility has recently been exemplified through studies showing that scientific papers commonly leave out experimental details essential for reproduction, studies showing difficulties with replicating published experimental results, an increase in retracted papers, and through a high number of failing clinical trials. This has led to discussions on how individual researchers, institutions, funding bodies, and journals can establish routines that increase transparency and reproducibility. In order to foster such aspects, it has been suggested that the scientific community needs to develop a ‘‘culture of reproducibility’’ for computational science, and to require it for published claims. We want to emphasize that reproducibility is not only a moral responsibility with respect to the scientific field, but that a lack of reproducibility can also be a burden for you as an individual researcher. As an example, a good practice of reproducibility is necessary in order to allow previously developed methodology to be effectively applied on new data, or to allow reuse of code and results for new projects. In other words, good habits of reproducibility may actually turn out to be a time-saver in the longer run. We further note that reproducibility is just as much about the habits that ensure reproducible research as the technologies that can make these processes efficient and realistic. Each of the following ten rules captures a specific aspect of reproducibility, and discusses what is needed in terms of information handling and tracking of procedures. If you are taking a bare-bones approach to bioinformatics analysis, i.e., running various custom scripts from the command line, you will probably need to handle each rule explicitly. If you are instead performing your analyses through an integrated framework (such as Gene- Pattern, Galaxy, LONI pipeline, or Taverna), the system may already provide full or partial support for most of the rules. What is needed on your part is then merely the knowledge of how to exploit these existing possibilities. In a pragmatic setting, with publication pressure and deadlines, one may face the need to make a trade-off between the ideals of reproducibility and the need to get the research out while it is still relevant. This trade-off becomes more important when considering that a large part of the analyses being tried out never end up yielding any results. However, frequently one will, with the wisdom of hindsight, contemplate the missed opportunity to ensure reproducibility, as it may already be too late to take the necessary notes from memory (or at least much more difficult than to do it while underway). We believe that the rewards of reproducibility will compensate for the risk of having spent valuable time developing an annotated catalog of analyses that turned out as blind alleys. As a minimal requirement, you should at least be able to reproduce the results yourself. This would satisfy the most basic requirements of sound research, allowing any substantial future questioning of the research to be met with a precise explanation. Although it may sound like a very weak requirement, even this level of reproducibility will often require a certain level of care in order to be met. There will for a given analysis be an exponential number of possible combinations of software versions, parameter values, preprocessing steps, and so on, meaning that a failure to take notes may make exact reproduction essentially impossible. With this basic level of reproducibility in place, there is much more that can be wished for. An obvious extension is to go from a level where you can reproduce results in case of a critical situation to a level where you can practically and routinely reuse your previous work and increase your productivity. A second extension is to ensure that peers have a practical possibility of reproducing your results, which can lead to increased trust in, interest for, and citations of your work. We here present ten simple rules for reproducibility of computational research. These rules can be at your disposal for whenever you want to make your research more accessible – be it for peers or for your future self.
texreg: Conversion of Statistical Model Output in R to LATEX and HTML Tables A recurrent task in applied statistics is the (mostly manual) preparation of model output for inclusion in LATEX, Microsoft Word, or HTML documents – usually with more than one model presented in a single table along with several goodness-of- t statistics. However, statistical models in R have diverse object structures and summary methods, which makes this process cumbersome. This article first develops a set of guidelines for converting statistical model output to LATEX and HTML tables, then assesses to what extent existing packages meet these requirements, and finally presents the texreg package as a solution that meets all of the criteria set out in the beginning. After providing various usage examples, a blueprint for writing custom model extensions is proposed.
Text Understanding from Scratch This article demontrates that we can apply deep learning to text understanding from character level inputs all the way up to abstract text concepts, using temporal convolutional networks (LeCun et al., 1998) (ConvNets). We apply ConvNets to various large-scale datasets, including ontology classification, sentiment analysis, and text categorization. We show that temporal ConvNets can achieve astonishing performance without the knowledge of words, phrases, sentences and any other syntactic or semantic structures with regards to a human language. Evidence shows that our models can work for both English and Chinese.
The Analytics Big Bang (infographic)
The Anatomy of Big Data Computing Advances in information technology and its widespread growth in several areas of business, engineering, medical and scientific studies are resulting in information/data explosion. Knowledge discovery and decision making from such rapidly growing voluminous data is a challenging task in terms of data organization and processing, which is an emerging trend known as Big Data Computing; a new paradigm which combines large scale compute, new data intensive techniques and mathematical models to build data analytics. Big Data computing demands a huge storage and computing for data curation and processing that could be delivered from on-premise or clouds infrastructures. This paper discusses the evolution of Big Data computing, differences between traditional data warehousing and Big Data, taxonomy of Big Data computing and underpinning technologies, integrated platform of Big Data and Clouds known as Big Data Clouds, layered architecture and components of Big Data Cloud and finally discusses open technical challenges and future directions.
The Art of Data Augmentation The term data augmentation refers to methods for constructing iterative optimization or sampling algorithms via the introduction of unobserved data or latent variables. For deterministic algorithms, the method was popularized in the general statistical community by the seminal article by Dempster, Laird, and Rubin on the EM algorithm for maximizing a likelihood function or, more generally, a posterior density. For stochastic algorithms, the method was popularized in the statistical literature by Tanner and Wong’s Data Augmentation algorithm for posterior sampling and in the physics literature by Swendsen and Wang’s algorithm for sampling from the Ising and Potts models and their generalizations; in the physics literature, the method of data augmentation is referred to as the method of auxiliary variables. Data augmentation schemes were used by Tanner and Wong to make simulation feasible and simple, while auxiliary variables were adopted by Swendsen and Wang to improve the speed of iterative simulation. In general, however, constructing data augmentation schemes that result in both simple and fast algorithms is a matter of art in that successful strategies vary greatly with the (observed-data) models being considered. After an overview of data augmentation/auxiliary variables and some recent developments in methods for constructing such efficient data augmentation schemes, we introduce an effective search strategy that combines the ideas of marginal augmentation and conditional augmentation, together with a deterministic approximation method for selecting good augmentation schemes. We then apply this strategy to three common classes of models (specifically, multivariate t, probit regression, and mixed-effects models) to obtain efŽficient Markov chain Monte Carlo algorithms for posterior sampling. We provide theoretical and empirical evidence that the resulting algorithms, while requiring similar programming effort, can show dramatic improvement over the Gibbs samplers commonly used for these models in practice. A key feature of all these new algorithms is that they are positive recurrent subchains of nonpositive recurrent Markov chains constructed in larger spaces.
The Art of Turning Data Into Product Having worked in academia, government and industry, I’ve had a unique opportunity to build products in each sector. Much of this product development has been around building data products. Just as methods for general product development have steadily improved, so have the ideas for developing data products. Thanks to large investments in the general area of data science, many major innovations (e.g., Hadoop, Voldemort, Cassandra, HBase, Pig, Hive, etc.) have made data products easier to build. Nonetheless, data products are unique in that they are often extremely difficult, and seemingly intractable for small teams with limited funds. Yet, they get solved every day. How? Are the people who solve them superhuman data scientists who can come up with better ideas in five minutes than most people can in a lifetime? Are they magicians of applied math who can cobble together millions of lines of code for high-performance machine learning in a few hours? No. Many of them are incredibly smart, but meeting big problems head-on usually isn’t the winning approach. There’s a method to solving data problems that avoids the big, heavyweight solution, and instead, concentrates building something quickly and iterating. Smart data scientists don’t just solve big, hard problems; they also have an instinct for making big problems small. We call this Data Jujitsu: the art of using multiple data elements in clever ways to solve iterative problems that, when combined, solve a data problem that might otherwise be intractable. It’s related to Wikipedia’s definition of the ancient martial art of jujitsu: “the art or technique of manipulating the opponent’s force against himself rather than confronting it with one’s own force.”
The Basic AI Drives One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways. We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted. We start by showing that goal-seeking systems will have drives to model their own operation and to improve themselves. We then show that self-improving systems will be driven to clarify their goals and represent them as economic utility functions. They will also strive for their actions to approximate rational economic behavior. This will lead almost all systems to protect their utility functions from modification and their utility measurement systems from corruption. We also discuss some exceptional systems which will want to modify their utility functions. We next discuss the drive toward self-protection which causes systems try to prevent themselves from being harmed. Finally we examine drives toward the acquisition of resources and toward their efficient utilization. We end with a discussion of how to incorporate these insights in designing intelligent technology which will lead to a positive future for humanity.
The Bayesian New Statistics: Two historical trends converge There have been two historical shifts in the practice of data analysis. One shift is from hypothesis testing to estimation with uncertainty and meta-analysis, which among frequentists in psychology has recently been dubbed ‘the New Statistics’ (Cumming, 2014). A second shift is from frequentist methods to Bayesian methods. We explain and applaud both of these shifts. Our main goal in this article is to explain how Bayesian methods achieve the goals of the New Statistics better than frequentist methods. The two historical trends converge in Bayesian methods for estimation with uncertainty and meta-analysis.
The beginner’s guide to app analytics The web analytics industry has grown in size and sophistication over the past decade, enabling marketers to create targeted and revenue-driven online presences. This represents a huge shift in how brands operate and communicate with their consumers. And it’s awesome. But the web isn’t where the majority of your audience is going to be anymore, and it isn’t where the learning curve is. Today, there are over 1.2 billion mobile web users worldwide, and mobile traffic is slated to reach over 25% of total Internet traffic by the year-end. And now, US daily smartphone screen time has even exceeded television time. For brands designing, launching, promoting and investing in mobile apps, tracking the right engagement metrics is critical to long-term success in terms of ROI and growth. Using realtime app insights allows you to get to know and adapt to your users right from the start, so that they keep coming back, even if you’re just getting started. In this eBook, we outline how to define your brand’s mobile goals, switch from a web-only mindset, and get started identifying, measuring, and learning from your key app analytics.
The Big Data Economy: Why and how our future with data is cleaner, leaner, and smarter (Slide Deck)
The Big Potential of Big Data Big data works. Adopters have reaped benefits in ROI, customer interactions and insights into customer behavior. Of the organizations that used big data at least 50% of the time, three in five (60%) said that they had exceeded their goals. At the same time, of the companies that used big data less than 50% of the time, just 33% said that they had exceeded their goals. The more frequently that companies felt that they were making sufficient use of data, the more likely they exceeded their goals. More than nine in 10 companies (92%) who had always or frequently made sufficient use of data said that they had met or exceeded their goals, while just 5% who said that they were making sufficient use of data said that they were falling short of their goals. At the same time, marketers seem to be suffering from a personality split. The overwhelming majority of executives say they are satisfied with their marketing. When pressed for more detail, however, the participants’ rosy view contradicts other, more detailed findings. Executives believe that they are using big data enough when they aren’t. A majority of agencies and non-agencies said that they were frequently or always making sufficient use of data in marketing decisions. However, only about one in 10 non-agencies managed more than half their advertising/marketing with big data, and a third of agencies used big data in more than half their initiatives. Many executives may be struggling to define big data and its potential benefits. Just over half of senior executives (both at agencies and other companies) said that they agreed or strongly agreed that they had a good understanding of big data and its benefits. Systems that generate data quickly and can account for changing consumer behavior – those that utilize machine learning – will be increasingly important. Roughly a quarter of respondents called them critical to the success of their marketing, while another 43% of agency executives and 44% of senior executives at non-agency organizations said they would be increasingly important for most initiatives.
The Dimensionality of Customer Satisfaction Survey Responses and Implications for Driver Analysis The canonical design of customer satisfaction surveys asks for global satisfaction with a product or service and for evaluations of its distinct attributes. Users of these surveys are often interested in the relationship between global satisfaction and the attributes, with regression analysis used to measure the conditional associations. Regression analysis is only appropriate when the global satisfaction measure results from the attribute evaluations, and is not appropriate when the covariance of the items lie in a low dimensional subspace, such as in a factor model. Potential reasons for low dimensional responses are responses that are haloed from overall satisfaction and an unintended lack of specificity of items. In this paper we develop a Bayesian mixture model that facilitates the empirical distinction between regression models and relatively much lower dimensional factor models. The model uses the dimensionality of the covariance among items in a survey as the primary classification criterion while accounting for heterogeneous usage of rating scales. We apply the model to four different customer satisfaction surveys evaluating hospitals, an academic program, smart-phones, and theme parks respectively. We show that correctly assessing the heterogeneous dimensionality of responses is critical for meaningful inferences by comparing our results to those from regression models.
The Enron Corpus: A New Dataset for Email classification Research Automated classification of email messages into user-speci c folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a stateof- the-art classi er (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classi er, and using all the sections in combination with regression weights.
The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R This paper describes an R package named flare, which implements a family of new high dimensional regression methods (LAD Lasso, SQRT Lasso, lq Lasso, and Dantzig selector) and their extensions to sparse precision matrix estimation (TIGER and CLIME). These methods exploit different nonsmooth loss functions to gain modeling exibility, estimation robustness, and tuning insensitiveness. The developed solver is based on the alternating direction method of multipliers (ADMM). The package flare is coded in double precision C, and called from R by a user-friendly interface. The memory usage is optimized by using the sparse matrix output. The experiments show that flare is efficient and can scale up to large problems.
The Forrester Wave: Big Data Streaming Analytics Platforms, Q3 2014 Streaming analytics is anything but a sleepy, rearview mirror analysis of data. No, it is about knowing and acting on what’s happening in your business at this very moment – now. Forrester calls these perishable insights because they occur at a moment’s notice and you must act on them fast within a narrow window of opportunity before they quickly lose their value. The high velocity, white-water flow of data from innumerable real-time data sources such as market data, Internet of Things, mobile, sensors, clickstream, and even transactions remain largely unnavigated by most firms. The opportunity to leverage streaming analytics has never been greater. In Forrester’s 50-criteria evaluation of big data streaming analytics platforms, we evaluated seven platforms from IBM, Informatica, SAP, Software AG, SQLstream, Tibco Software, and Vitria.
The Forward Search and Data Visualisation The forward search is a powerful robust statistical method for exploring the relationship between data and fitted models, which produces an appreciable number of graphs that illuminate the structure of the data. Atkinson and Riani (2000) describe its use in linear and nonlinear regression, response transformation and in generalized linear models, where the emphasis is on the detection of unidentified subsets of the data and of multiple masked outliers and of their effect on inferences. In this talk we extend the method to the analysis of multivariate data, where the emphasis is rather more on the data and less on the multivariate normal model. The forward search orders the observations by closeness to the assumed model, starting from a small subset of the data and increasing the number of observations m used for fitting the model. Outliers and small unidentified subsets of observations enter at the end of the search. Even if there are a number of groups, as in cluster analysis, we start by fitting one multivariate normal distribution to the data. An important graphical tool is a variety of plots of the Mahalanobis distances of the individual observations during the search. Each unit, originally a point in v-dimensional space, is then represented by a curve in two dimensions connecting the almost n values of the distance for each unit calculated during the search. Our task is now to classify these curves. Forward plots of Mahalanobis distances give a good initial indication of clusters, if any, which can be refined by, for example, plots of distances for unclassified units compared to those for each established group. We can also start the forward search at a different points, for example inside each cluster in turn, when we obtain very different curves for each unit. If our aim is cluster analysis, we finish by fitting as many multivariate normal distributions as there are clusters, visually monitoring the behaviour of our forward clustering algorithm. Because we use Mahalanobis distances, it is important that the data are approximately normal. We therefore combine cluster analysis with a multivariate form of the Box-Cox family of transformations. We again use graphical methods, particularly a series of “fan plots”, to establish appropriate transformations.
The forward search: Theory and data analysis The Forward Search is a powerful general method, incorporating flexible data-driven trimming, for the detection of outliers and unsuspected structure in data and so for building robust models. Starting from small subsets of data, observations that are close to the fitted model are added to the observations used in parameter estimation. As this subset grows we monitor parameter estimates, test statistics and measures of fit such as residuals. The paper surveys theoretical development in work on the Forward Search over the last decade. The main illustration is a regression example with 330 observations and 9 potential explanatory variables. Mention is also made of procedures for multivariate data, including clustering, time series analysis and fraud detection.
The Future of Data Analysis
The Future of Retail Analytics Retail has always been a data-intensive industry. As the tools available to store, manage and analyze this data evolved, so did the role the analysis of data played in retail decision-making. From visibility and control, to transparency, to efficiency, to customer engagement. It is cheaper, faster and easier today to store and process more data than ever before. Retailers have gotten better at data management. Question is, how well are they able to leverage insights from this analysis to drive strategic decisions? EKN conducted an industry survey to benchmark the state of the retail industry in terms of analytics maturity. Findings from the primary research covering 65+ respondents, interview based qualitative inputs from retail executives, and EKN’s secondary research from public and proprietary sources are presented in this report. In a retail environment where consumer spending is stunted and competition from newer, digital channels is eroding store sales, the route for brick and mortar retailers to earn a larger share of wallet of the customer is through deeper, Omni-channel customer engagement. Customer engagement is only as effective as how well you know the customer and how well you are equipped to act on that insight across your channels. Perhaps this is why customer insight emerges as retailers’ highest-priority goal from analytics initiatives in 2013. In addition, findings from EKN’s survey include:
• Retailers’ analytics maturity is low: 2 in 5 retailers state they lag behind their competitors in terms of their analytics maturity and a further 2 in 5 suggest they are at par. The “analytical retailer” is thus the exception rather than the rule.
• Data management and integration will be a key area of investment in an effort to increase analytical maturity. Retailers are looking to integrate a variety of data sources over the next 2 years, however public and open data remains a relatively under-explored opportunity.
• Retailers find their current analytics organizational setup sub-optimal. Only 18% currently have a shared services model for analytics in place whereas approximately 60% would like to move towards such a model.
• Retailers will invest in contextual, visual and mobile-friendly delivery of insights to combat the biggest challenge that prevents them from leveraging analytics strategically – delivery of insights to the right resource at the right time.
• Retailers’ eCommerce or Omni-channel function emerges as the business function with the highest potential opportunity for analytics impact, the highest rate of data growth and the highest planned technology investment. However, it is also currently the function with the lowest analytics maturity.
• Usability is the most important feature retailers will look for when choosing analytics solutions in 2013. Even with the delivery of insights being their biggest challenge, mobile or tablet access ranks relatively low.
The traditional view of data management and analysis in retail has been tool-driven – be it relational databases of decades past or Business Intelligence tools more recently. In EKN’s view, “business analytics” is a concept that focuses on decisions and outcomes, and is a far better indicator of the future of retail analytics.
The Global Impact of Open Data Open data has spurred economic innovation, social transformation, and fresh forms of political and government accountability in recent years, but few people understand how open data works. This comprehensive report, developed with support from Omidyar Network, presents detailed case studies of open data projects throughout the world, along with in-depth analysis of what works and what doesn’t. Authors Andrew Young and Stefaan Verhulst, both with The GovLab at New York University, explain how these projects have made governments more accountable and efficient, helped policymakers find solutions to previously intractable public problems, created new economic opportunities, and empowered citizens through new forms of social mobilization.
The Google File System We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
The Graph Story of the SAP HANA Database Many traditional and new business applications work with inherently graphstructured data and therefore benefit from graph abstractions and operations provided in the data management layer. The property graph data model not only offers schema flexibility but also permits managing and processing data and metadata jointly. By having typical graph operations implemented directly in the database engine and exposing them both in the form of an intuitive programming interface and a declarative language, complex business application logic can be expressed more easily and executed very efficiently. In this paper we describe our ongoing work to extend the SAP HANA database with built-in graph data support. We see this as a next step on the way to provide an efficient and intuitive data management platform for modern business applications with SAP HANA.
The hidden costs of open source Clusters based on open-source software and the Linux operating system have come to dominate high performance computing (HPC). This is due in part to their superior performance, cost-effectiveness and flexibility. The same factors that make open-source software the choice of HPC professionals have also made it less accessible to smaller centers. The complexity and associated cost of deploying and managing open-source clusters threatens to erode the very cost benefits that have made them compelling in the first place. As customers choose between open-source and commercial alternatives, there are many different costs related to administration and productivity that should be considered. These are explored in this paper in order to give a true cost perspective. We also examine how a commercial management product, such as IBM Platform HPC, enables HPC customers to side-step many overhead cost and support issues that often plague open-source environments and enable them to deploy powerful, easy to use clusters.
The impact of social segregation on human mobility in developing and industrialized regions This study leverages mobile phone data to analyze human mobility patterns in a developing nation, especially in comparison to those of a more industrialized nation. Developing regions, such as the Ivory Coast, are marked by a number of factors that may influence mobility, such as less infrastructural coverage and maturity, less economic resources and stability, and in some cases, more cultural and language-based diversity. By comparing mobile phone data collected from the Ivory Coast to similar data collected in Portugal, we are able to highlight both qualitative and quantitative differences in mobility patterns – such as differences in likelihood to travel, as well as in the time required to travel – that are relevant to consideration on policy, infrastructure, and economic development. Our study illustrates how cultural and linguistic diversity in developing regions (such as Ivory Coast) can present challenges to mobility models that perform well and were conceptualized in less culturally diverse regions. Finally, we address these challenges by proposing novel techniques to assess the strength of borders in a regional partitioning scheme and to quantify the impact of border strength on mobility model accuracy.
The Internet of Things: Making sense of the next mega-trend The third wave of the Internet may be the biggest one yet
The Matrix Cookbook These pages are a collection of facts (identities, approximations, inequalities, relations, …) about matrices and matters relating to them. It is collected in this form for the convenience of anyone who wants a quick desktop reference .
The Matrix Cookbook These pages are a collection of facts (identities, approximations, inequalities, relations, …) about matrices and matters relating to them. It is collected in this form for the convenience of anyone who wants a quick desktop reference .
The Metropolis-Hastings Algorithm This article is a self-contained introduction to the Metropolis- Hastings algorithm, this ubiquitous tool for producing dependent simula- tions from an arbitrary distribution. The document illustrates the principles of the methodology on simple examples with R codes and provides entries to the recent extensions of the method.
The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox To enable complex data-intensive applications such as personalized recommendations, targeted advertising, and intelligent services, the data management community has focused heavily on the design of systems to train complex models on large datasets. Unfortunately, the design of these systems largely ignores a critical component of the overall analytics process: the serving and management of models at scale. In this work, we present Velox, a new component of the Berkeley Data Analytics Stack. Velox is a data management system for facilitating the next steps in real-world, large-scale analytics pipelines: online model management, maintenance, and serving. Velox provides end-user applications and services with a low-latency, intuitive interface to models, transforming the raw statistical models currently trained using existing offline large-scale compute frameworks into full-blown, end-to-end data products capable of targeting advertisements, recommending products, and personalizing web content. To provide up-to-date results for these complex models, Velox also facilitates lightweight online model maintenance and selection (i.e., dynamic weighting). In this paper, we describe the challenges and architectural considerations required to achieve this functionality, including the abilities to span online and offline systems, to adaptively adjust model materialization strategies, and to exploit inherent statistical properties such as model error tolerance, all while operating at “Big Data” scale.
The ProM framework: A new era in process mining tool support Under the umbrella of buzzwords such as “Business Activity Monitoring” (BAM) and “Business Process Intelligence” (BPI) both academic (e.g., EMiT, Little Thumb, InWoLvE, Process Miner, and MinSoN) and commercial tools (e.g., ARIS PPM, HP BPI, and ILOG JViews) have been developed. The goal of these tools is to extract knowledge from event logs (e.g., transaction logs in an ERP system or audit trails in a WFM system), i.e., to do process mining. Unfortunately, tools use different formats for reading/storing log files and present their results in different ways. This makes it difficult to use different tools on the same data sets and to compare the mining results. Furthermore, some of these tools implement concepts that can be very useful in the other tools but it is often difficult to combine tools. As a result, researchers working on new process mining techniques are forced to build a mining infrastructure from scratch or test their techniques in an isolated way, disconnected from any practical applications. To overcome these kind of problems, we have developed the ProM framework, i.e., an “plugable” environment for process mining. The framework is flexible with respect to the input and output format, and is also open enough to allow for the easy reuse of code during the implementation of new process mining ideas. This paper introduces the ProM framework and gives an overview of the plug-ins that have been developed.
The ProM Framework: A New Era in Process Mining Tool Support Under the umbrella of buzzwords such as “Business Activity Monitoring” (BAM) and “Business Process Intelligence” (BPI) both academic (e.g., EMiT, Little Thumb, InWoLvE, Process Miner, and MinSoN) and commercial tools (e.g., ARIS PPM, HP BPI, and ILOG JViews) have been developed. The goal of these tools is to extract knowledge from event logs (e.g., transaction logs in an ERP system or audit trails in a WFM system), i.e., to do process mining. Unfortunately, tools use different formats for reading/storing log files and present their results in different ways. This makes it difficult to use different tools on the same data set and to compare the mining results. Furthermore, some of these tools implement concepts that can be very useful in the other tools but it is often difficult to combine tools. As a result, researchers working on new process mining techniques are forced to build a mining infrastructure from scratch or test their techniques in an isolated way, disconnected from any practical applications. To overcome these kind of problems, we have developed the ProM framework, i.e., an “pluggable” environment for process mining. The framework is flexible with respect to the input and output format, and is also open enough to allow for the easy reuse of code during the implementation of new process mining ideas. This paper introduces the ProM framework and gives an overview of the plug-ins that have been developed.
The Promise and Peril of Big Data
The retail market as a complex system Aim of this paper is to introduce the complex system perspective into retail market analysis. Currently, to understand the retail market means to search for local patterns at the micro level, involving the segmentation, separation and profiling of diverse groups of consumers. In other contexts, however, markets are modelled as complex systems. Such strategy is able to uncover emerging regularities and patterns that make markets more predictable, e.g. enabling to predict how much a country’s GDP will grow. Rather than isolate actors in homogeneous groups, this strategy requires to consider the system as a whole, as the emerging pattern can be detected only as a result of the interaction between its self-organizing parts. This assumption holds also in the retail market: each customer can be seen as an independent unit maximizing its own utility function. As a consequence, the global behaviour of the retail market naturally emerges, enabling a novel description of its properties, complementary to the local pattern approach. Such task demands for a data-driven empirical framework. In this paper, we analyse a unique transaction database, recording the micro-purchases of a million customers observed for several years in the stores of a national supermarket chain. We show the emergence of the fundamental pattern of this complex system, connecting the products’ volumes of sales with the customers’ volumes of purchases. This pattern has a number of applications. We provide three of them. By enabling us to evaluate the sophistication of needs that a customer has and a product satisfies, this pattern has been applied to the task of uncovering the hierarchy of needs of the customers, providing a hint about what is the next product a customer could be interested in buying and predicting in which shop she is likely to go to buy it.
The Rise of Big Data Analytics for Marketing (Slide Deck)
The SAP HANA Database – An Architecture Overview Requirements of enterprise applications have become much more demanding because they execute complex reports on transactional data while thousands of users may read or update records of the same data. The goal of the SAP HANA database is the integration of transactional and analytical workloads within the same database management system. To achieve this, a columnar engine exploits modern hardware (multiple CPU cores, large main memory, and caches), compression of database content, maximum parallelization in the database kernel, and database extensions required by enterprise applications, e.g., specialized data structures for hierarchies or support for domain specific languages. In this paper we highlight the architectural concepts employed in the SAP HANA database. We also report on insights gathered with the SAP HANA database in real-world enterprise application scenarios.
The Science of Data Science Although there is considerable talk of the need for data scientists in the United States and world economies, and although a number of universities throughout the United States are now offering degrees in the data science area, there is surprisingly little consensus as to what comprises the key scientific and engineering challenges of data science, which has recently been raised as a matter of national concern. Many current programs emphasize statistical sampling techniques, approaches to visualization, and the programming of analysis packages. However, we contend that these are not the only areas that are required for the big data needs of emergent ‘fourth paradigm’ data-driven science, where the scientific method is enhanced by the integration of significant data sources into the practice of scientific research.
The Singular Value Decomposition, Applications and Beyond The singular value decomposition (SVD) is not only a classical theory in matrix computation and analysis, but also is a powerful tool in machine learning and modern data analysis. In this tutorial we first study the basic notion of SVD and then show the central role of SVD in matrices. Using majorization theory, we consider variational principles of singular values and eigenvalues. Built on SVD and a theory of symmetric gauge functions, we discuss unitarily invariant norms, which are then used to formulate general results for matrix low rank approximation. We study the subdifferentials of unitarily invariant norms. These results would be potentially useful in many machine learning problems such as matrix completion and matrix data classification. Finally, we discuss matrix low rank approximation and its recent developments such as randomized SVD, approximate matrix multiplication, CUR decomposition, and Nyström approximation. Randomized algorithms are important approaches to large scale SVD as well as fast matrix computations.
The Split-Apply-Combine Strategy for Data Analysis Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece inde- pendently and then put all the pieces back together. This insight gives rise to a new R package that allows you to smoothly apply this strategy, without having to worry about the type of structure in which your data is stored. The paper includes two case studies showing how these insights make it easier to work with batting records for veteran baseball players and a large 3d array of spatio-temporal ozone measurements.
The stringdist Package for Approximate String Matching Comparing text strings in terms of distance functions is a common and fundamental task in many statistical text-processing applications. Thus far, string distance functionality has been somewhat scattered around R and its extension packages, leaving users with inconistent interfaces and encoding handling. The stringdist package was designed to offer a low-level interface to several popular string distance algorithms which have been re-implemented in C for this purpose. The package offers distances based on counting q-grams, edit-based distances, and some lesser known heuristic distance functions. Based on this functionality, the package also offers inexact matching equivalents of R’s native exact matching functions match and %in%.
The Top Challenges in Big Data and Analytics Over the past few years, there has been a tremendous amount of hype around Big Data – data that doesn’t work well in traditional BI systems and warehouses because of its volume, its variety, and the velocity at which it is acquired and changed. Is the hype justified? Lavastorm Analytics believes it is. Often Big Data has been talked about as a “problem” because it couldn’t be easily processed with traditional systems based on relational databases, but it really is a tremendous opportunity to enhance and even transform how you run your business. The value of Big Data can be significant. It can lead to innovations, such as new pricing models, new ways to engage with your customers and partners, new product ideas, or new market opportunities. For example, at a recent conference, one large financial institution estimated that Big Data could help them reduce by 30- 65% the time to market and cost of their strategic innovation projects. …
The top five ways to get started with big data Remember what life was like before big data? The term has become so prevalent in the business lexicon that sometimes it’s hard to remember that big data is a relatively recent phenomenon. Some may have viewed it as a fad, but data generated by people, processes and machines is only continuing to grow. Big data is here to stay. Make no mistake, data is an asset—but not when you’re drowning in it. In the information age, one of your greatest resources can also be your biggest downfall if your organization doesn’t know how to leverage it properly. So what can you do with your data?
The Truth about Principal Components and Factor Analysis Principal components tries to re-express the data as a sum of uncorrelated components. There are lots of other techniques which try to do similar things, like Fourier analysis, or wavelet decomposition. Things like Fourier analysis decompose the data into a sum of a fixed set of basis functions or basis vectors. This has the advantage of making results comparable across data sets, and of making the meaning of the components clear. So why ever do PCA rather than a Fourier transform? First, in some situations the idea of doing a Fourier transform is just embarrassingly weird. For the states or cars data ets, we could number the features and take cosines of the feature numbers, etc., but it just seems crazy. No such embarrassment attends PCA. Second, when using a fixed set of components, there is no guarantee that a small number of components will give a good reconstruction of the original data. PCA guarantees that the first q components will do a better (mean-square) job of reconstructing the original data than any other linear method using only q components. Third, it is good at preserving distances between the points – the component scores give the optimal linear multidimensional scaling. PCA gives us uncorrelated components, which are generally not independent components; for that you need independent component analysis. PCA looks for linear combinations of the original features; one could well do better by finding nonlinear combinations. Rather than directions in feature space, these would be curves or surfaces. PCA is purely a descriptive technique; in itself it makes no prediction about what future data will look like.
The Unreasonable Effectiveness of Data Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc^2. Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages. Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data. One of us, as an undergraduate at Brown University, remembers the excitement of having access to the Brown Corpus, containing one million English words. Since then, our fi eld has seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to fi ve words long. In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfi ltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions – captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks – if only we knew how to extract the model from the data.
The Value of Big Data: Using big data to examine and discover the value in data for accurate analytics Data warehousing is a success, judging by its 25 year history of use across all industries. Business intelligence met the needs it was designed for: to give non-technical people within the organization access to important, shared data. The resulting improvements in all aspects of business operations are hard to dispute when compared to the prior era of static batch reporting. During the same period that data warehousing and BI matured, the automation and instrumenting of almost all processes and activities changed the data landscape in most companies. Where there were only a few applications and minimal monitoring 25 years ago, there is ubiquitous computing and data available about every activity today. Data warehouses have not been able to keep up with business demands for new sources of information, new types of data, more complex analysis and greater speed. Companies can put this data to use in countless ways, but for most it remains uncollected or unused, locked away in silos within IT. There has been a gradual maturing of data use in organizations. In the early days of BI it was enough to provide access to core financial and customer transactions. Better access enabled process changes, and these led to the need for more data and more varied uses of information. These changes put increasing strain on information processing and delivery capabilities that were designed under assumptions of stability and common use. Most companies now have a backlog of new data and analysis requests that BI groups are struggling to meet. Enter big data. Big data is not simply about growing data volumes – it’s also about the fact that the data being collected today is different in ways that make it unwieldy for conventional databases and BI tools. Big data is also about new technologies that were developed to support the storage, retrieval and processing of this new data. The technologies originated in the world of web applications and internet-based companies, but they are now spreading into enterprise applications of all sorts. New technology coupled with new data enables new practices like real-time monitoring of operations across retail channels, supply chain practices at finer grain and faster speed, and analysis of customers at the level of individual activities and behaviors. Until recently, large scale data collection and analysis capabilities like these would have required a Wal-Mart sized investment, limiting them to large organizations. These capabilities are now available to all, regardless of company size or budget. This is creating a rush to adopt big data technologies. As the use of big data grows, the need for data management will grow. Many organizations already struggle to manage existing data. Big data adds complexity, which will only increase the challenge. The combination of new data and new technology requires new data management capabilities and processes to capture the promised long-term value.
The Value of Signal (and the Cost of Noise): The New Economics of Meaning-Making Nearly every aspect of our daily lives generates a digital footprint. From mobile phones and social media to inventory look-ups and online purchases, we collect more data about processes, people and things than ever before. Winning companies are able to create business value by building a richer understanding of customers, products, employees and partners – extracting business meaning from this torrent of data. The business stakes of “meaning-making” simply could not be higher. To level-set this concept, we conducted primary research, including direct interviews with executives, to understand how business analytics is being applied to uncover new revenue sources and to reduce costs. The 300 companies we surveyed told us they achieved a total economic benefit of roughly $766 billion over the past year based on their use of business analytics. Among those that participated in our research, investment in business analytics yielded an average 8.4% increase in revenues and an average 8.1% improvement in cost reduction in the previous fiscal year. Companies that generated the most value from business analytics expect to grow revenue faster and reduce costs more aggressively. Leading companies also clearly recognize that success means winning the battle for talent with business analytics skills. Many companies – perhaps most – are missing the opportunity for significant economic benefit. If the companies we surveyed were to begin deploying best practices in analytics, we estimate they could create $853 billion of value within the next 12 months. It’s a new era in business, one in which growth will be driven as much by insight and foresight as by physical products and assets. Importantly, as our research demonstrates, a roadmap for success is beginning to emerge.
The Visual Display of Quantitative Information (Slide Deck)
Theano and LSTM for Sentiment Analysis (Slide Deck)
Time Series Databases Time series databases enable a fundamental step in the central storage and analysis of many types of machine data. As such, they lie at the heart of the Internet of Things (IoT). There’s a revolution in sensor– to–insight data flow that is rapidly changing the way we perceive and understand the world around us. Much of the data generated by sensors, as well as a variety of other sources, benefits from being collected as time series. Although the idea of collecting and analyzing time series data is not new, the astounding scale of modern datasets, the velocity of data accumulation in many cases, and the variety of new data sources together contribute to making the current task of building scalable time series databases a huge challenge. A new world of time series data calls for new approaches and new tools.
Time Series Prediction with the Self-Organizing Map: A Review We provide a comprehensive and updated survey on applications of Kohonen’s self-organizing map (SOM) to time series prediction (TSP). The main goal of the paper is to show that, despite being originally designed as an unsupervised learning algorithm, the SOM is flexible enough to give rise to a number of efficient supervised neural architectures devoted to TSP tasks. For each SOM-based architecture to be presented, we report its algorithm implementation in detail. Similarities and differences of such SOM-based TSP models with respect to standard linear and nonlinear TSP techniques are also highlighted. We conclude the paper with indications of possible directions for further research on this field.
Times Series: Cointegration An overview of results for the cointegrated VAR model for nonstationary I(1) variables is given. The emphasis is on the analysis of the model and the tools for asymptotic inference. These include: formulation of criteria on the parameters, for the process to be nonstationary and I(1), formulation of hypotheses of interest on the rank, the cointegrating relations and the adjustment coefficients. A discussion of the asymptotic distribution results that are used for inference. The results are illustrated by a few examples. A number of extensions of the theory are pointed out.
To Explain or To Predict? Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this paper is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process.
Too Big to Fail: Large Samples and the p-Value Problem The Internet has provided IS researchers with the opportunity to conduct studies with extremely large samples, frequently well over 10,000 observations. There are many advantages to large samples, but researchers using statistical inference must be aware of the p-value problem associated with them. In very large samples, p-values go quickly to zero, and solely relying on p-values can lead the researcher to claim support for results of no practical significance. In a survey of large sample IS research, we found that a significant number of papers rely on a low p-value and the sign of a regression coefficient alone to support their hypotheses. This research commentary recommends a series of actions the researcher can take to mitigate the p-value problem in large samples and illustrates them with an example of over 300,000 camera sales on eBay. We believe that addressing the p-value problem will increase the credibility of large sample IS research as well as provide more insights for readers.
Top 10 Algorithms in Data Mining This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community.With each algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and reviewcurrent and further research on the algorithm. These 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development.
Top 10 Trends in Business Intelligence for 2014 (Slide Deck)
Top 7 Trends in Big Data for 2015 • Big Data gets cloudy.
• ETL gets personal.
• SQL or NoSQL, that is the question.
• Hadoop: Part of the new normal in data storage.
• You will start trying to fish in the data lake.
• The big data ecosystem will start to change form.
• IOT (Internet of Things) will continue to grow, driving new data solutions.
Top Five High-Impact Use Cases for Big Data Analytics Today’s data-driven companies have a competitive edge over their peers. How? They are generating breakthrough insights by bringing together all of their structured and unstructured data and analyzing it together – all at once.
Toward Scalable Systems for Big Data Analytics: A Technology Tutorial Recent technological advancements have led to a deluge of data from distinctive domains (e.g., health care and scienti c sensors, user-generated data, Internet and nancial companies, and supply chain systems) over the past two decades. The term big data was coined to capture the meaning of this emerging trend. In addition to its sheer volume, big data also exhibits other unique characteristics as compared with traditional data. For instance, big data is commonly unstructured and require more real-time analysis. This development calls for new system architectures for data acquisition, transmission, storage, and large-scale data processing mechanisms. In this paper, we present a literature survey and system tutorial for big data analytics platforms, aiming to provide an overall picture for nonexpert readers and instill a do-it-yourself spirit for advanced audiences to customize their own big-data solutions. First, we present the de nition of big data and discuss big data challenges. Next, we present a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics. These four modules form a big data value chain. Following that, we present a detailed survey of numerous approaches and mechanisms from research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data challenges. Finally, we outline several evaluation benchmarks and potential research directions for big data systems.
Towards Better Analysis of Machine Learning Models: A Visual Analytics Perspective Interactive model analysis, the process of understanding, diagnosing, and refining a machine learning model with the help of interactive visualization, is very important for users to efficiently solve real-world artificial intelligence and data mining problems. Dramatic advances in big data analytics has led to a wide variety of interactive model analysis tasks. In this paper, we present a comprehensive analysis and interpretation of this rapidly developing area. Specifically, we classify the relevant work into three categories: understanding, diagnosis, and refinement. Each category is exemplified by recent influential work. Possible future research opportunities are also explored and discussed.
Training Very Deep Networks Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and training of very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. Our so-called highway networks allow unimpeded information flow across many layers on information highways. They are inspired by Long Short-Term Memory recurrent networks and use adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly through simple gradient descent. This enables the study of extremely deep and efficient architectures.
Transform Your Organization With Strong Data Management – Executive Overview: The Data Management Playbook Business stakeholders are putting more pressure on IT to keep up with and meet market demands. Yet existing data management platforms and strategies were not designed for an agile business or the digital age. Data management is the backbone needed to define and orchestrate the digital experience and to ensure confidence in the data used in your key processes. Forrester’s data management playbook shows enterprise architecture (EA) professionals how to build a more elastic and flexible data management practice to meet the new data demands brought on by digital disruption.
TSclust: An R Package for Time Series Clustering Time series clustering is an active research area with applications in a wide range of elds. One key component in cluster analysis is determining a proper dissimilarity measure between two data objects, and many criteria have been proposed in the literature to assess dissimilarity between two time series. The R package TSclust is aimed to implement a large set of well-established peer-reviewed time series dissimilarity measures, including measures based on raw data, extracted features, underlying parametric models, complexity levels, and forecast behaviors. Computation of these measures allows the user to perform clustering by using conventional clustering algorithms. TSclust also includes a clustering procedure based on p values from checking the equality of generating models, and some utilities to evaluate cluster solutions. The implemented dissimilarity functions are accessible individually for an easier extension and possible use out of the clustering context. The main features of TSclust are described and examples of its use are presented.
Turning Big Data Into Useful Information Many Fortune 500 companies are recognizing enterprise data as a strategic business asset. Leading companies are using troves of operational data to optimize their processes, create intelligent products and delight their customers. In addition, increased demands for regulatory transparency are forcing companies to capture and maintain an audit trail of the information they use in their business decisions. Despite this, large companies struggle to access, manage and leverage the information that they create in their day-to-day processes. The rapid growth in the number of IT systems has resulted in a complex and fragmented landscape, where potentially valuable data lays trapped in fragmented inconsistent silos of applications, databases and organizations.
Twitter client for R Twitter is a popular service that allows users to broadcast short messages (‘tweets’) for others to read. Over the years this has become a valuable tool not just for standard social media purposes but also for data mining experiments such as sentiment analysis. The twitteR package is intended to provide access to the Twitter API within R, allowing users to grab interesting subsets of Twitter data for their analyses. This document is not intended to be exhaustive nor comprehensive but rather a brief introduction to some of the more common bits of functionality and some basic examples of how they can be used. In the last section I’ve included a variety of links to people using twitteR to solve real world problems.


Uncovering Social Network Structures through Penetration Data We propose a method for uncovering the structure of the adopters’ network underlying the diffusion process, based on penetration data alone. By uncovering the traces that this network leaves on the dissemination process, the degree distribution of the network can be estimated. We show that the network’s degree distribution has a significant effect on the contagion properties. Ignoring the network structure introduces significant errors to estimated diffusion parameters and may lead to flawed assessments of the magnitude of the contagion process. In three studies we validate the proposed method using data for known mapped networks and the adoption process propagating on them.
Understanding Big Data Quality for Maximum Information Usability The barriers to entry for high-performance scalable data management and computing continue to fall, and “big data” is rapidly moving into the mainstream. So it’s easy to become so focused on the anticipated business benefits of large-scale data analytics that we lose sight of the intricacy associated with data acquisition, preparation and quality assurance. In some ways, the clamoring demand for large-scale analysis only heightens the need for data governance and data quality assurance. And while there are some emerging challenges associated with managing big data quality, reviewing good data management practices will help to maximize data usability. In this paper we examine some of the challenges presented by managing the quality and governance of big data, and how those can be balanced with the need to deliver usable analytical results. We explore the dimensions of data quality for big data, and examine the reasons for practical approaches to proactive monitoring, managing reference data and metadata, and sharing knowledge about interpreting and using data sets. By examining some examples, we can identify ways to balance governance with usability and come up with a strategic plan for data quality, including tactical steps for taking advantage of the power of the cluster to drive more meaning and value out of the data. Finally, we consider a checklist of characteristics to look for when evaluating information management tools for big data.
Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
Understanding predictive information criteria for Bayesian models We review the Akaike, deviance, and Watanabe-Akaike information criteria from a Bayesian perspective, where the goal is to estimate expected out-of-sample-prediction error using a biascorrected adjustment of within-sample error. We focus on the choices involved in setting up these measures, and we compare them in three simple examples, one theoretical and two applied. The contribution of this review is to put all these information criteria into a Bayesian predictive context and to better understand, through small examples, how these methods can apply in practice.
Understanding Probabilistic Classifiers Probabilistic classifiers are developed by assuming generative models which are product distributions over the original attribute space (as in naive Bayes) or more involved spaces (as in general Bayesian networks). While this paradigm has been shown experimentally successful on real world applications, despite vastly simplified probabilistic assumptions, the question of why these approaches work is still open. This paper resolves this question.We show that almost all joint distributions with a given set of marginals (i.e., all distributions that could have given rise to the classifier learned) or, equivalently, almost all data sets that yield this set of marginals, are very close (in terms of distributional distance) to the product distribution on the marginals; the number of these distributions goes down exponentially with their distance from the product distribution. Consequently, as we show, for almost all joint distributions with this set of marginals, the penalty incurred in using the marginal distribution rather than the true one is small. In addition to resolving the puzzle surrounding the success of probabilistic classifiers our results contribute to understanding the tradeoffs in developing probabilistic classifiers and will help in developing better classifiers.
Understanding Random Forests. From Theory to Practice. Data analysis and machine learning have become an integrative part of the modern scientific methodology, offering automated procedures for the prediction of a phenomenon based on past observations, unraveling underlying patterns in data and providing insights about the problem. Yet, caution should avoid using machine learning as a black-box tool, but rather consider it as a methodology, with a rational thought process that is entirely dependent on the problem under study. In particular, the use of algorithms should ideally require a reasonable understanding of their mechanisms, properties and limitations, in order to better apprehend and interpret their results. Accordingly, the goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings and interpretability. The first part of this work studies the induction of decision trees and the construction of ensembles of randomized trees, motivating their design and purpose whenever possible. Our contributions follow with an original complexity analysis of random forests, showing their good computational performance and scalability, along with an in-depth discussion of their implementation details, as contributed within Scikit-Learn. In the second part of this work, we analyze and discuss the interpretability of random forests in the eyes of variable importance measures. The core of our contributions rests in the theoretical characterization of the Mean Decrease of Impurity variable importance measure, from which we prove and derive some of its properties in the case of multiway totally randomized trees and in asymptotic conditions. In consequence of this work, our analysis demonstrates that variable importances as computed from non-totally randomized trees (e.g., standard Random Forest) suffer from a combination of defects, due to masking effects, misestimations of node impurity or due to the binary structure of decision trees. Finally, the last part of this dissertation addresses limitations of random forests in the context of large datasets. Through extensive experiments, we show that subsampling both samples and features simultaneously provides on par performance while lowering at the same time the memory requirements. Overall this paradigm highlights an intriguing practical fact: there is often no need to build single models over immensely large datasets. Good performance can often be achieved by building models on (very) small random parts of the data and then combining them all in an ensemble, thereby avoiding all practical burdens of making large data fit into memory.
Understanding the Basis of the Kalman Filter via a Simple and Intuitive Derivation This article provides a simple and intuitive derivation of the Kalman filter, with the aim of teaching this useful tool to students from disciplines that do not require a strong mathematical background. The most complicated level of mathematics required to understand this derivation is the ability to multiply two Gaussian functions together and reduce the result to a compact form. The Kalman filter is over 50 years old but is still one of the most important and common data fusion algorithms in use today. Named after Rudolf E. Kálmán, the great success of the Kalman filter is due to its small computational requirement, elegant recursive properties, and its status as the optimal estimator for one-dimensional linear systems with Gaussian error statistics [1] . Typical uses of the Kalman filter include smoothing noisy data and providing estimates of parameters of interest. Applications include global positioning system receivers, phaselocked loops in radio equipment, smoothing the output from laptop trackpads, and many more. From a theoretical standpoint, the Kalman filter is an algorithm permitting exact inference in a linear dynamical system, which is a Bayesian model similar to a hidden Markov model but where the state space of the latent variables is continuous and where all latent and observed variables have a Gaussian distribution (often a multivariate Gaussian distribution). The aim of this lecture note is to permit people who find this description confusing or terrifying to understand the basis of the Kalman filter via a simple and intuitive derivation.
Upping Your Analytic IQ – Driving Real Time Insights through Increased Employee Engagement (Slide Deck)
Using DeployR to Solve the R Integration Problem Organizations use analytics to empower decision making, often in real time. That means the ability to easily embed and share the results of R analytics within existing web, desktop, and mobile applications, plus backend systems, is vital. As important as real-time delivery of analytics is to this process, users also need to consider security in terms of identity and data integrity, and scale in terms of workload and throughput. DeployR from Revolution Analytics delivers the power and flexibility of R securely and at scale to all of your decision-making systems. DeployR is an integration technology for deploying R analytics inside web, desktop, mobile, and dashboard applications, as well as backend systems. DeployR turns your R scripts into analytics web services, so R code can be easily executed by applications running on a secure server. Using analytics web services, DeployR also solves key integration problems faced by those adopting R-based analytics alongside existing IT infrastructure. These services make it easy for application developers to collaborate with R programmers to integrate R analytics into their applications without any R programming knowledge. DeployR is available in two editions: DeployR Open and DeployR Enterprise. DeployR Open is a free, open source solution that is ideal for prototyping, building, and deploying non-critical business applications. DeployR Enterprise scales for business-critical applications and offers support for production-grade workloads, as well as seamless integration with popular enterprise security solutions such as single sign-on (SSO), Lightweight Directory Access Protocol (LDAP), Active Directory, or Pluggable Authentication Modules (PAM).
Using R and Tableau R functions and models can now be used in Tableau by creating new calculated fields that dynamically invoke the R engine and pass values to R. The results are then returned back to Tableau for use by the Tableau visualization engine.
Using Storm for Real-Time First Story Detection Twitter has been an excellent tool for extracting information in real-time, such as news and events. This dissertation uses Twitter to address the problem of real-time new events detection on the Storm distributed platform such that the system benefits from the scalability, efficiency and robustness this framework can offer. Towards this direction, three different implementations have been deployed, each of which having a different configuration. The first and simplest distributed implementation was the baseline approach. Two implementations followed, in an attempt to achieve faster data processing without loss in accuracy. The rest two implementations demonstrated significant improvements in both performance and scalability. Specifically, they achieved a 1357.16 % and 1213.15 % speed-up over the single-threaded baseline version, correspondingly. Moreover, the accuracy and robustness of the scalable approaches comparing to the baseline version were retained.
Using visualization to understand big data Studies have shown that the human short-term memory is capable of holding 3 – 7 items in place simultaneously, which means that people can only juggle a few items in their heads before they start to lose track of them. Visualization creates encodings of data into visual channels that people can view and understand. This process externalizes the data and enables people to think about and manipulate the data at a higher level. This externalization enables humans to think more complex thoughts about larger amounts of information than would otherwise be possible.1 Visualization exploits the human visual system to provide an intuitive, immediate and language-independent way to view and show your data. It is an essential tool for understanding information. The human visual system is by far the richest, most immediate, highest bandwidth pipeline into the human mind. The amount of brain capacity that is devoted to processing visual input far exceeds that of the other human senses. Some scientific estimates suggest that the human visual system is capable of processing about 9 megabits of information per second, which corresponds to close to 1 million letters of text per second. Visualization research over the past decades has discovered a wide range of effective visualization techniques that go far beyond the basic pie, bar and line charts used so pervasively in spreadsheets and dashboards. These techniques are especially useful now that most organizations are being confronted with big data. The majority of organizations are struggling to make sense of output from data sources that include RFID communications, social media text, customer surveys, streaming video and more, along with data captured over very long periods of time. For the IBM Institute for Business Value report on big data, IBM surveyed more than 1100 business and IT professionals and found that less than 26 percent of respondents who had active big data efforts could analyze extremely unstructured data such as voice and video and just 35 percent could analyze streaming data.2 Visualization plays a key role in enabling the understanding of these complex data analytics, and it can convey the key analytical nuggets of information to other people in the organization who have less expertise in analytics. When companies can analyze big data, they benefit. In that same IBM survey, 63 percent of respondents reported that they believe that understanding and exploiting big data effectively can create a competitive advantage for their organizations.3 Big data analysis can help them improve decision making, create a 360-degree view of their customers, improve security and surveillance, analyze operations and augment data warehousing. Visualization can play a vital role in using big data to get a complete view of your customer. This paper covers how.


Variable Selection and Estimation in High-Dimensional Models Models with high-dimensional covariates arise frequently in economics and other fields. Often, only a few covariates have important effects on the dependent variable. When this happens, the model is said to be sparse. In applications, however, it is not known which covariates are important and which are not. This paper reviews methods for discriminating between important and unimportant covariates with particular attention given to methods that discriminate correctly with probability approaching 1 as the sample size increases. Methods are available for a wide variety of linear, nonlinear, semiparametric, and nonparametric models. The performance of some of these methods in finite samples is illustrated through Monte Carlo simulations and an empirical example.
Variational Inference: A Review for Statisticians One of the core problems of modern statistics is to approximate difficult-to-compute probability distributions. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation about the posterior. In this paper, we review variational inference (VI), a method from machine learning that approximates probability distributions through optimization. VI has been used in myriad applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of distributions and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this widely-used class of algorithms.
Vector Autoregressive Models
vim Graphical Cheat Sheet (Cheat Sheet)
Visual Analysis Best Practices: Simple Techniques for Making Every Data Visualization Useful and Beautiful You made a visualization! Congratulations: you are part of a small but growing group that’s taking advantage of the power of visualization. However, bringing your visualizations from “good” to “great” takes time, patience and attention to detail. Luckily, we have compiled a short but important list of techniques to get you started. Happy visualizing!
Visual Analysis of Large Graphs: State-of-the-Art and Future Research Challenges The analysis of large graphs plays a prominent role in various fields of research and is relevant in many important application areas. Effective visual analysis of graphs requires appropriate visual presentations in combination with respective user interaction facilities and algorithmic graph analysis methods. How to design appropriate graph analysis systems depends on many factors, including the type of graph describing the data, the analytical task at hand, and the applicability of graph analysis methods. The most recent surveys of graph visualization and navigation techniques cover techniques that had been introduced until 2000 or concentrate only on graph layouts published until 2002. Recently, new techniques have been developed covering a broader range of graph types, such as time-varying graphs. Also, in accordance with ever growing amounts of graph-structured data becoming available, the inclusion of algorithmic graph analysis and interaction techniques becomes increasingly important. In this State-of-the-Art Report, we survey available techniques for the visual analysis of large graphs. Our review firstly considers graph visualization techniques according to the type of graphs supported. The visualization techniques form the basis for the presentation of interaction approaches suitable for visual graph exploration. As an important component of visual graph analysis, we discuss various graph algorithmic aspects useful for the different stages of the visual graph analysis process. We also present main open research challenges in this field.
Visual Data Mining Techniques Never before in history has data been generated at such high volumes as it is today. Exploring and analyzing the vast volumes of data has become increasingly difficult. Information visualization and visual data mining can help to deal with the flood of information. The advantage of visual data exploration is that the user is directly involved in the data mining process. There are a large number of information visualization techniques that have been developed over the last two decades to support the exploration of large data sets. In this paper, we propose a classification of information visualization and visual data mining techniques based on the data type to be visualized, the visualization technique, and the interaction technique. We illustrate the classification using a few examples, and indicate some directions for future work.
Visualize This In TechTarget’s 2013 Analytics & Data Warehousing Reader Survey, 36% of 664 respondents said their organizations were using data visualization and discovery tools, while another 41% said deployments were planned in the next 12 months. In response to another question, 44% said they expected their organizations to increase spending on data visualization initiatives over the next 12 months. That was the third-highest percentage among 10 technology categories, narrowly topped only by the figures for data warehousing and predictive analytics. It isn’t surprising that data visualization would be gaining in popularity. Business intelligence and analytics are becoming more central to business strategies – and business success. As a result, many organizations are looking to broaden the use of BI data in decision making. Visualizing that data can make it easier for users to grasp. But it’s easy to go awry on data visualization. In an August 2013 blog post, Forrester Research analyst Ryan Morrill cited “a cascade of bad examples” of infographics and other types of visualizations with data errors and overly complicated designs. He recommended focusing on two things in creating visualizations: engaging graphics, yes, but also a “data-driven design” in which the visual elements help to accurately depict the information being presented. This special edition of Business Information examines the dos and don’ts of managing successful data visualization processes. First we offer advice from experienced users on finding and deploying the right BI and data visualization tools. Next we provide tips on building effective visualizations for use in BI dashboards. We finish with a look at the use of geographic information systems to help improve the quality of health care.
Visualizing and Understanding Convolutional Networks Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classi er. Used in a diagnostic role, these visualizations allow us to nd model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from di erent model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classi er is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
Visualizing Data The HBR Insight Center highlights emerging thinking around today’s most important business ideas. In this Insight Center, we’ll explore the power of using data visualization to drive business strategy. We’ll talk about when (and when not) to use visualization, how to get started, how to know if you’re getting a good return on your data visualization investment, and more.


Ward’s Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm TheWard error sum of squares hierarchical clustering method has been very widely used since its rst description by Ward in a 1963 publication. It has also been generalized in various ways. However there are di erent interpretations in the literature and there are di erent implementations of the Ward agglomerative algorithm in commonly used software systems, including di ering expressions of the agglomerative criterion. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward’s hierarchical clustering method.
What does “Big Data” mean for official statistics? In our modern world more and more data are generated on the web and produced by sensors in the ever growing number of electronic devices surrounding us. The amount of data and the frequency at which they are produced have led to the concept of ‘Big data’. Big data is characterized as data sets of increasing volume, velocity and variety; the 3 V’s. Big data is often largely unstructured, meaning that it has no pre-defined data model and/or does not fit well into conventional relational databases. Apart from generating new commercial opportunities in the private sector, big data is also potentially very interesting as an input for official statistics; either for use on its own, or in combination with more traditional data sources such as sample surveys and administrative registers. However, harvesting the information from big data and incorporating it into a statistical production process is not easy. As such, this paper will seek to address two fundamental questions, i.e. the What and the How.
What Is Data Science? We’ve all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O’Reilly said that “data is the next Intel Inside.” But what does that statement mean? Why do we suddenly care about statistics and about data? In this post, I examine the many sides of data science — the technologies, the companies and the unique skill sets.
What is Decision Support? This paper attempts to describe and clarify the meaning of the term Decision Support (DS). Based on a survey of DS related WWW documents, and taking a broad view of DS, a classification of DS and related disciplines is presented. DS is put in the context of Decision Making, and some most important disciplines of DS are overviewed: Operations Research, Decision Analysis, Decision Support Systems, Data Warehousing and OLAP, and Group Decision Support.
What Is Statistics? One might think that there is a simple answer to the question posed in the title of the form “Statistics is. . . . ” Sadly, there is not, although many contemporary statistical authors have attempted to answer the question.This article captures the essence of some of these efforts, setting them in their historical contexts. In the process, we focus on the cross-disciplinary nature of much modern statistical research. This discussion serves as a backdrop to the the aims of the Annual Review of Statistics and its Application, which begins publication with the present volume.
What is the expectation maximization algorithm? Probabilistic models, such as hidden Markov models or Bayesian networks, are commonly used to model biological data. Much of their popularity can be attributed to the existence of efficient and robust procedures for learning parameters from observations. Often, however, the only data available for training a probabilistic model are incomplete. Missing values can occur, for example, in medical diagnosis, where patient histories generally include results from a limited battery of tests. Alternatively, in gene expression clustering, incomplete data arise from the intentional omission of gene-to-cluster assignments in the probabilistic model. The expectation maximization algorithm enables parameter estimation in probabilistic models with incomplete data.
What Makes a Visualization Memorable? An ongoing debate in the Visualization community concerns the role that visualization types play in data understanding. In human cognition, understanding and memorability are intertwined. As a first step towards being able to ask questions about impact and effectiveness, here we ask: “What makes a visualization memorable?” We ran the largest scale visualization study to date using 2,070 single-panel visualizations, categorized with visualization type (e.g., bar chart, line graph, etc.), collected from news media sites, government reports, scientific journals, and infographic sources. Each visualization was annotated with additional attributes, including ratings for data-ink ratios and visual densities. Using Amazon’s Mechanical Turk, we collected memorability scores for hundreds of these visualizations, and discovered that observers are consistent in which visualizations they find memorable and forgettable. We find intuitive results (e.g., attributes like color and the inclusion of a human recognizable object enhance memorability) and less intuitive results (e.g., common graphs are less memorable than unique visualization types). Altogether our findings suggest that quantifying memorability is a general metric of the utility of information, an essential step towards determining how to design effective visualizations.
What’s Wrong With Deep Learning (Slide Deck)
Which chart or graph is right for you? You’ve got data and you’ve got questions. Creating a chart or graph links the two, but sometimes you’re not sure which type of chart will get the answer you seek. This paper answers questions about how to select the best charts for the type of data you’re analyzing and the questions you want to answer. But it won’t stop there. Stranding your data in isolated, static graphs limits the number of questions you can answer. Let your data become the centerpiece of decision making by using it to tell a story. Combine related charts. Add a map. Provide filters to dig deeper. The impact? Business insight and answers to questions at the speed of thought.
Which chart or graph is right for you? You’ve got data and you’ve got questions. Creating a chart or graph links the two, but sometimes you’re not sure which types of charts and graphs will help you find the answers you need. This paper answers questions about how to select the best charts for the type of data you’re analyzing and the questions you want to answer. But it won’t stop there. Stranding your data in isolated, static graphs limits the number of questions you can answer. Let your data become the centerpiece of decision making by using it to tell a story. Combine related charts. Add a map. Provide filters to dig deeper. The impact? Business insight and answers to questions at the speed of thought.
Why Big Data Analytics? The primary focus right now is on using big data analytics to gain insights across all areas of operations, customers, and product innovation. By applying the right analytical solutions to your data, you can:
• Better understand customer behavior.
• Gain better insights into operations.
• Identify risks such as fraud.
• Comply with government regulations.
• Answer complex questions.
• See important patterns in data that you couldn’t see before.
Will Big Data Make IT Infrastructure Sexy Again? Love it or hate it, big data seems to be driving a renaissance in IT infrastructure spending. IDC, for example, estimates that worldwide spending for infrastructure hardware alone (servers, storage, PCs, tablets, and peripherals) will rise from $461 billion in 2013 to $468 billion in 2014. Gartner predicts that total IT spending will grow 3.1% in 2014, reaching $3.8 trillion, and forecasts “consistent four to five percent annual growth through 2017.” For a lot of people, the mere thought of all that additional cash makes IT infrastructure seem sexy again. Big data impacts IT spending directly and indirectly. The direct effects are less dramatic, largely because adding terabytes to a Hadoop cluster is much less costly than adding terabytes to an enterprise data warehouse (EDW). That said, IDG Enterprise’s 2014 Big Data survey indicates that more than half of the IT leaders polled believe “they will have to re-architect the data center network to some extent” to accommodate big data services. The indirect effects are more dramatic, thanks in part to Rubin’s Law (derived by Dr. Howard Rubin, the unofficial dean of IT economics), which holds that demand for technology rises as the cost of technology drops (which it invariably will, according to Moore’s Law). Since big data essentially “liberates” data that had been “trapped” in mainframes and EDWs, the demand for big data services will increase as organizations perceive the untapped value at their fingertips. In other words, as utilization of big data goes up, spending on big data services and related infrastructure will also rise. “Big data has the same sort of disruptive potential as the client-server revolution of 30 years ago, which changed the whole way that IT infrastructure evolved,” says Marshall Presser, field chief technology officer at Pivotal, a provider of application and data infrastructure software. “For some people, the disruption will be exciting and for others, it will be threatening.” For IT vendors offering products and services related to big data, the future looks particularly rosy. IDC predicts stagnant (0.7%) growth of legacy IT products and high-volume (15%) growth of cloud, mobile, social, and big data products—the so-called “3rd Platform” of IT. According to IDC, “3rd Platform technologies and solutions will drive 29 percent of 2014 IT spending and 89 percent of all IT spending growth.” Much of that growth will come from the “cannibalization” of traditional IT markets. Viewed from that harrowing perspective, maybe “scary” is a better word than “sexy” to describe the looming transformation of IT infrastructure.
word2vec Parameter Learning Explained The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent two years. The vector representations of words learned by word2vec models have been proven to be able to carry semantic meanings and are useful in various NLP tasks. As an increasing number of researchers would like to experiment with word2vec, I notice that there lacks a material that comprehensively explains the parameter learning process of word2vec in details, thus preventing many people with less neural network experience from understanding how exactly word2vec works. This note provides detailed derivations and explanations of the parameter update equations for the word2vec models, including the original continuous bag-of-word (CBOW) and skip-gram models, as well as advanced tricks, hierarchical soft-max and negative sampling. In the appendix a review is given on the basics of neuron network models and backpropagation.


Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup This paper presents Yinyang K-means, a new algorithm for K-means clustering. By clustering the centers in the initial stage, and leveraging efficiently maintained lower and upper bounds between a point and centers, it more effectively avoids unnecessary distance calculations than prior algorithms. It significantly outperforms prior K-means algorithms consistently across all experimented data sets, cluster numbers, and machine configurations. The consistent, superior performance—plus its simplicity, user-control of overheads, and guarantee in producing the same clustering results as the standard K-means—makes Yinyang K-means a dropin replacement of the classic K-means with an order of magnitude higher performance.


ZAP BI Chart Type Cheat Sheet (Cheat Sheet)

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s