What's new on arXiv

Impacts of Dirty Data: and Experimental Evaluation

Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could be applied to the selection of an appropriate algorithm given the data quality, and to determining the share of data to clean. However, little research has explored this relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent and conflicting data on classification, clustering, and regression algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.

Applying the Delta method in metric analytics: A practical guide with novel ideas

During the last decade, the information technology industry has adopted a data-driven culture, relying on online metrics to measure and monitor business performance. Under the setting of big data, the majority of such metrics approximately follow normal distributions, opening up potential opportunities to model them directly and solve big data problems using distributed algorithms. However, certain attributes of the metrics, such as their corresponding data generating processes and aggregation levels, pose numerous challenges for constructing trustworthy estimation and inference procedures. Motivated by four real-life examples in metric development and analytics for large-scale A/B testing, we provide a practical guide to applying the Delta method, one of the most important tools from the classic statistics literature, to address the aforementioned challenges. We emphasize the central role of the Delta method in metric analytics, by highlighting both its classic and novel applications.
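As a minimal sketch of the kind of calculation this abstract refers to (not code from the paper), consider a ratio metric such as clicks per page view. Each user i contributes a pair (x_i, y_i), and users are assumed to be i.i.d. The first-order Delta method then approximates the variance of mean(y)/mean(x) as (var_y - 2*r*cov_xy + r^2*var_x) / (n * mean_x^2), with r = mean_y/mean_x. The function name and the toy data are illustrative assumptions.

```python
from statistics import fmean


def ratio_delta_variance(x, y):
    """Return (ratio, variance estimate) for mean(y) / mean(x).

    Assumes the pairs (x_i, y_i) are i.i.d. and mean(x) is not close to 0.
    The variance is the first-order Delta-method approximation.
    """
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    # Unbiased sample variances and covariance of the paired observations.
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    ratio = mean_y / mean_x
    var_ratio = (var_y - 2 * ratio * cov_xy + ratio ** 2 * var_x) / (n * mean_x ** 2)
    return ratio, var_ratio


if __name__ == "__main__":
    page_views = [3, 5, 2, 8, 4, 6]   # hypothetical per-user denominators
    clicks = [1, 2, 0, 3, 1, 2]       # hypothetical per-user numerators
    print(ratio_delta_variance(page_views, clicks))
```

Two sanity checks follow from the formula: if y is exactly proportional to x the estimated variance is zero, and if x is constant at 1 the result reduces to the usual variance of a sample mean, var_y / n.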

Vulnerability of Deep Learning

The Renormalisation Group (RG) provides a framework in which it is possible to assess whether a deep-learning network is sensitive to small changes in the input data and hence prone to error, or susceptible to adversarial attack. Distinct classification outputs are associated with different RG fixed points and sensitivity to small changes in the input data is due to the presence of relevant operators at a fixed point. A numerical scheme, based on Monte Carlo RG ideas, is proposed for identifying the existence of relevant operators and the corresponding directions of greatest sensitivity in the input data. Thus, a trained deep-learning network may be tested for its robustness and, if it is vulnerable to attack, dangerous perturbations of the input data identified.

Some HCI Priorities for GDPR-Compliant Machine Learning

In this short paper, we consider the roles of HCI in enabling the better governance of consequential machine learning systems using the rights and obligations laid out in the recent 2016 EU General Data Protection Regulation (GDPR)—a law which involves heavy interaction with people and systems. Focussing on those areas that relate to algorithmic systems in society, we propose roles for HCI in legal contexts in relation to fairness, bias and discrimination; data protection by design; data protection impact assessments; transparency and explanations; the mitigation and understanding of automation bias; and the communication of envisaged consequences of processing.

Big Data and Reliability Applications: The Complexity Dimension

Big data features not only large volumes of data but also data with complicated structures. Complexity imposes unique challenges in big data analytics. Meeker and Hong (2014, Quality Engineering, pp. 102-116) provided an extensive discussion of the opportunities and challenges in big data and reliability, and described engineering systems that can generate big data that can be used in reliability analysis. Meeker and Hong (2014) focused on large scale system operating and environment data (i.e., high-frequency multivariate time series data), and provided examples on how to link such data as covariates to traditional reliability responses such as time to failure, time to recurrence of events, and degradation measurements. This paper intends to extend that discussion by focusing on how to use data with complicated structures to do reliability analysis. Such data types include high-dimensional sensor data, functional curve data, and image streams. We first provide a review of recent development in those directions, and then we provide a discussion on how analytical methods can be developed to tackle the challenging aspects that arise from the complexity feature of big data in reliability applications. The use of modern statistical methods such as variable selection, functional data analysis, scalar-on-image regression, spatio-temporal data models, and machine learning techniques will also be discussed.

ORGaNICs: A Theory of Working Memory in Brains and Machines

Working memory is a cognitive process that is responsible for temporarily holding and manipulating information. Most of the empirical neuroscience research on working memory has focused on measuring sustained activity in prefrontal cortex (PFC) and/or parietal cortex during simple delayed-response tasks, and most of the models of working memory have been based on neural integrators. But working memory means much more than just holding a piece of information online. We describe a new theory of working memory, based on a recurrent neural circuit that we call ORGaNICs (Oscillatory Recurrent GAted Neural Integrator Circuits). ORGaNICs are a variety of Long Short Term Memory units (LSTMs), imported from machine learning and artificial intelligence. ORGaNICs can be used to explain the complex dynamics of delay-period activity in prefrontal cortex (PFC) during a working memory task. The theory is analytically tractable so that we can characterize the dynamics, and the theory provides a means for reading out information from the dynamically varying responses at any point in time, in spite of the complex dynamics. ORGaNICs can be implemented with a biophysical (electrical circuit) model of pyramidal cells, combined with shunting inhibition via a thalamocortical loop. Although introduced as a computational theory of working memory, ORGaNICs are also applicable to models of sensory processing, motor preparation and motor control. ORGaNICs offer computational advantages compared to other varieties of LSTMs that are commonly used in AI applications. Consequently, ORGaNICs are a framework for canonical computation in brains and machines.

Assessment meets Learning: On the relation between Item Response Theory and Bayesian Knowledge Tracing

Few models have been more ubiquitous in their respective fields than Bayesian knowledge tracing and item response theory. Both of these models were developed to analyze data on learners. However, the study designs that these models are designed for differ; Bayesian knowledge tracing is designed to analyze longitudinal data while item response theory is built for cross-sectional data. This paper illustrates a fundamental connection between these two models. Specifically, the stationary distribution of the latent variable and the observed response variable in Bayesian knowledge tracing are related to an item response theory model. Furthermore, recent advances in network psychometrics demonstrate how this relationship can be exploited and generalized to a network model.
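To make "stationary distribution" concrete, here is a small sketch under stated assumptions: a two-state knowledge-tracing chain (not mastered / mastered) with a learn probability, a forget probability, and guess and slip probabilities for the observed response. The forgetting parameter is an assumption added here so that the chain has a non-degenerate stationary distribution. The mapping to an item response theory model is the paper's contribution and is not reproduced.

```python
def bkt_stationary(p_learn, p_forget, p_guess, p_slip):
    """Stationary mastery probability and marginal P(correct) for a 2-state chain.

    Latent transitions: not mastered -> mastered with p_learn,
                        mastered -> not mastered with p_forget.
    """
    pi_mastered = p_learn / (p_learn + p_forget)
    p_correct = pi_mastered * (1 - p_slip) + (1 - pi_mastered) * p_guess
    return pi_mastered, p_correct


def iterate_mastery(p0, p_learn, p_forget, steps):
    """Propagate P(mastered) forward in time to show convergence to the stationary value."""
    p = p0
    for _ in range(steps):
        p = p * (1 - p_forget) + (1 - p) * p_learn
    return p


if __name__ == "__main__":
    print(bkt_stationary(0.2, 0.05, 0.25, 0.1))
    print(iterate_mastery(0.0, 0.2, 0.05, 200))
```

With these hypothetical parameters the stationary mastery probability is 0.2 / (0.2 + 0.05) = 0.8, and iterating the chain from any starting value approaches the same number.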

Studying Invariances of Trained Convolutional Neural Networks

Convolutional Neural Networks (CNNs) define an exceptionally powerful class of models for image classification, but the theoretical background and the understanding of how invariances to certain transformations are learned is limited. In a large scale screening with images modified by different affine and nonaffine transformations of varying magnitude, we analyzed the behavior of the CNN architectures AlexNet and ResNet. If the magnitude of different transformations does not exceed a class- and transformation-dependent threshold, both architectures show invariant behavior. In this work we furthermore introduce a new learnable module, the Invariant Transformer Net, which enables us to learn differentiable parameters for a set of affine transformations. This allows us to extract the space of transformations to which the CNN is invariant and its class prediction robust.

Zero-Shot Object Detection: Learning to Simultaneously Recognize and Localize Novel Concepts

Current Zero-Shot Learning (ZSL) approaches are restricted to recognition of a single dominant unseen object category in a test image. We hypothesize that this setting is ill-suited for real-world applications where unseen objects appear only as a part of a complex scene, warranting both the ‘recognition’ and ‘localization’ of an unseen category. To address this limitation, we introduce a new ‘Zero-Shot Detection’ (ZSD) problem setting, which aims at simultaneously recognizing and locating object instances belonging to novel categories without any training examples. We also propose a new experimental protocol for ZSD based on the highly challenging ILSVRC dataset, adhering to practical issues, e.g., the rarity of unseen objects. To the best of our knowledge, this is the first end-to-end deep network for ZSD that jointly models the interplay between visual and semantic domain information. To overcome the noise in the automatically derived semantic descriptions, we utilize the concept of meta-classes to design an original loss function that achieves synergy between max-margin class separation and semantic space clustering. Furthermore, we present a baseline approach extended from recognition to detection setting. Our extensive experiments show significant performance boost over the baseline on the imperative yet difficult ZSD problem.

Constant-Time Predictive Distributions for Gaussian Processes

One of the most compelling features of Gaussian process (GP) regression is its ability to provide well calibrated posterior distributions. Recent advances in inducing point methods have drastically sped up marginal likelihood and posterior mean computations, leaving posterior covariance estimation and sampling as the remaining computational bottlenecks. In this paper we address this shortcoming by using the Lanczos decomposition algorithm to rapidly approximate the predictive covariance matrix. Our approach, which we refer to as LOVE (LanczOs Variance Estimates), substantially reduces the time and space complexity over any previous method. In practice, it can compute predictive covariances up to 2,000 times faster and draw samples 18,000 times faster than existing methods, all without sacrificing accuracy.

A Kernel Theory of Modern Data Augmentation

Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding modern data augmentation techniques. We start by showing that for kernel classifiers, data augmentation can be approximated by first-order feature averaging and second-order variance regularization components. We connect this general approximation framework to prior work in invariant kernels, tangent propagation, and robust optimization. Next, we explicitly tackle the compositional aspect of modern data augmentation techniques, proposing a novel model of data augmentation as a Markov process. Under this model, we show that performing k-nearest neighbors with data augmentation is asymptotically equivalent to a kernel classifier. Finally, we illustrate ways in which our theoretical framework can be leveraged to accelerate machine learning workflows in practice, including reducing the amount of computation needed to train on augmented data, and predicting the utility of a transformation prior to training.

How robust are Structural Equation Models to model miss-specification? A simulation study

Structural Equation Models (SEMs) are routinely used in the analysis of empirical data by researchers spanning different scientific fields, such as psychologists or econometricians. In some fields, such as ecology, SEMs have only recently started to attract attention, and thanks to dedicated software packages the use of SEMs has steadily increased. Yet common analysis practices in such fields that might be transposed from other statistical techniques, such as model acceptance or rejection based on p-value screening, might be poorly suited to SEMs. In this simulation study, SEMs were fitted via two commonly used R packages: lavaan and piecewiseSEM. Datasets were simulated under different modelling scenarios to test the impact of sample size and model complexity on various global and local model fitness indices. The results showed that no single model index should be used to decide on model fitness; rather, a combination of different model fitness indices is needed. The global chi-square test for lavaan or the Fisher C statistic are, in isolation, poor indicators of model fitness. Combining the different metrics explored here provided little safeguard against model overfitting, which emphasizes the need to cautiously interpret the inferred (causal) relations from fitted SEMs. Researchers in scientific fields with little experience in SEMs, such as ecology, should consider and accept these limitations.

Gradients on Sets

For a locally Lipschitz continuous function $f:X\to\mathbb{R}$ the generalized gradient $\partial f(x)$ of Clarke is used to develop some (set-valued) gradient on a set $A\subset X$. Existence, uniqueness and some approximation are considered for optimal descent directions on the set $A$. The results serve as basis for nonsmooth numerical descent algorithms that can be found in subsequent papers.

Graph Partition Neural Networks for Semi-Supervised Classification

We present graph partition neural networks (GPNN), an extension of graph neural networks (GNNs) able to handle extremely large graphs. GPNNs alternate between locally propagating information between nodes in small subgraphs and globally propagating information between the subgraphs. To efficiently partition graphs, we experiment with several partitioning algorithms and also propose a novel variant for fast processing of large scale graphs. We extensively test our model on a variety of semi-supervised node classification tasks. Experimental results indicate that GPNNs are either superior or comparable to state-of-the-art methods on a wide variety of datasets for graph-based semi-supervised classification. We also show that GPNNs can achieve similar performance as standard GNNs with fewer propagation steps.

High-dimensional Stochastic Inversion via Adjoint Models and Machine Learning

Performing stochastic inversion on a computationally expensive forward simulation model with a high-dimensional uncertain parameter space (e.g. a spatial random field) is computationally prohibitive even with gradient information provided. Moreover, the ‘nonlinear’ mapping from parameters to observables generally gives rise to non-Gaussian posteriors even with Gaussian priors, thus hampering the use of efficient inversion algorithms designed for models with Gaussian assumptions. In this paper, we propose a novel Bayesian stochastic inversion methodology, characterized by a tight coupling between a gradient-based Langevin Markov Chain Monte Carlo (LMCMC) method and a kernel principal component analysis (KPCA). This approach addresses the ‘curse-of-dimensionality’ via KPCA to identify a low-dimensional feature space within the high-dimensional and nonlinearly correlated spatial random field. Moreover, non-Gaussian full posterior probability distribution functions are estimated via an efficient LMCMC method on both the projected low-dimensional feature space and the recovered high-dimensional parameter space. We demonstrate this computational framework by integrating and adapting recent developments such as data-driven statistics-on-manifolds constructions and reduction-through-projection techniques to solve inverse problems in linear elasticity.

Nesting Probabilistic Programs

We formalize the notion of nesting probabilistic programming queries and investigate the resulting statistical implications. We demonstrate that query nesting allows the definition of models which could not otherwise be expressed, such as those involving agents reasoning about other agents, but that existing systems take approaches that lead to inconsistent estimates. We show how to correct this by delineating possible ways one might want to nest queries and asserting the respective conditions required for convergence. We further introduce, and prove the correctness of, a new online nested Monte Carlo estimation method that makes it substantially easier to ensure these conditions are met, thereby providing a simple framework for designing statistically correct inference engines.
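The convergence issue mentioned in this abstract can be illustrated with a plain nested Monte Carlo estimator. This is a sketch under simple assumptions, not the paper's online method. The target is E_y[f(E_z[g(y, z)])] with y and z standard normal, g(y, z) = y + z and f(v) = v^2, so the true value is E[y^2] = 1. Because f is nonlinear, a fixed number of inner samples gives a biased estimate with expectation 1 + 1/n_inner, however many outer samples are drawn. All names here are hypothetical.

```python
import random


def nested_mc(f, g, n_outer, n_inner, seed=0):
    """Plain nested Monte Carlo estimate of E_y[ f( E_z[ g(y, z) ] ) ].

    y and z are both drawn from N(0, 1); that choice is an assumption
    made only to keep the example small.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_outer):
        y = rng.gauss(0.0, 1.0)
        # Inner estimate of E_z[g(y, z)] from n_inner fresh samples of z.
        inner = sum(g(y, rng.gauss(0.0, 1.0)) for _ in range(n_inner)) / n_inner
        total += f(inner)
    return total / n_outer


def square(v):
    return v * v


def shifted(y, z):
    return y + z


if __name__ == "__main__":
    for n_inner in (1, 10, 50):
        print(n_inner, nested_mc(square, shifted, n_outer=4000, n_inner=n_inner))
```

With a single inner sample the estimate settles near 2 rather than 1, and the gap shrinks only as n_inner grows. This is why the number of inner samples has to increase along with the number of outer samples for the estimator to converge.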

Snap Machine Learning

We describe an efficient, scalable machine learning library that enables very fast training of generalized linear models. We demonstrate that our library can remove the training time as a bottleneck for machine learning workloads, opening the door to a range of new applications. For instance, it allows more agile development, faster and more fine-grained exploration of the hyper-parameter space, enables scaling to massive datasets and makes frequent re-training of models possible in order to adapt to events as they occur. Our library, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern distributed systems. This allows us to effectively leverage available network, memory and heterogeneous compute resources. On a terabyte-scale publicly available dataset for click-through-rate prediction in computational advertising, we demonstrate the training of a logistic regression classifier in 1.53 minutes, a 46x improvement over the fastest reported performance.

Capturing near-equilibrium solutions: a comparison between high-order Discontinuous Galerkin methods and well-balanced schemes
RankME: Reliable Human Ratings for Natural Language Generation
Multilevel Monte Carlo Method for Ergodic SDEs without Contractivity
Some Closure Results for Polynomial Factorization and Applications
CIM/E Oriented Graph Database Model Architecture and Parallel Network Topology Processing
Noise-induced rectification in out-of-equilibrium structures
A picture is worth a thousand words but how to organize thousands of pictures?
Analog simulator of integro-differential equations with classical memristors
Pivot Sampling in QuickXSort: Precise Analysis of QuickMergesort and QuickHeapsort
Generalized Stirling Numbers I
Interplay of Probabilistic Shaping and the Blind Phase Search Algorithm
Mo2Cap2: Real-time Mobile 3D Motion Capture with a Cap-mounted Fisheye Camera
Deep Choice Model Using Pointer Networks for Airline Itinerary Prediction
Contraction and Robustness of Continuous Time Primal-Dual Dynamics
Real-time Deep Registration With Geodesic Loss
Unraveling Go gaming nature by Ising Hamiltonian and common fate graphs: tactics and statistics
Deep Co-Training for Semi-Supervised Image Recognition
EEG machine learning with Higuchi fractal dimension and Sample Entropy as features for successful detection of depression
Scalable analysis of linear networked systems via chordal decomposition
Escaping Saddles with Stochastic Gradients
Symplectic frieze patterns
Efficient Hardware Realization of Convolutional Neural Networks using Intra-Kernel Regular Pruning
Covert Communication over a K-User Multiple Access Channel
Ridge Regression and Provable Deterministic Ridge Leverage Score Sampling
A Unified Theory of Regression Adjustment for Design-based Inference
Database Perspectives on Blockchains
Deep Learning Reconstruction of Ultra-Short Pulses
Crackling to periodic dynamics in sheared granular media
Estimation of lactate threshold with machine learning techniques in recreational runners
Optimal Bipartite Network Clustering
$W^{1,p}$ regularity of solutions to Kolmogorov equation and associated Feller semigroup
Multistage stochastic programs with a random number of stages: dynamic programming equations, solution methods, and application to portfolio selection
Optimality of multi-refraction dividend strategies in the dual model
False discovery rate control for multiple testing based on p-values with càdlàg distribution functions
Matroids and Codes with the Rank Metric
Optimal Boundary Kernels and Weightings for Local Polynomial Regression
Identifying and Estimating Principal Causal Effects in Multi-site Trials
Sufficient Conditions for a Linear Estimator to be a Local Polynomial Regression
Deep Multiple Instance Learning for Zero-shot Image Tagging
The world of research has gone berserk: modeling the consequences of requiring ‘greater statistical stringency’ for scientific publication
Heuristics for vehicle routing problems: Sequence or set optimization?
A Meaning-based Statistical English Math Word Problem Solver
Dynamic-structured Semantic Propagation Network
Modelling sparsity, heterogeneity, reciprocity and community structure in temporal interaction data
Lyapunov Functions for First-Order Methods: Tight Automated Convergence Guarantees
Reconfiguring spanning and induced subgraphs
Real-time Detection, Tracking, and Classification of Moving and Stationary Objects using Multiple Fisheye Images
Load Balancing for 5G Ultra-Dense Networks using Device-to-Device Communications
A Globally Asymptotically Stable Polynomial Vector Field with Rational Coefficients and no Local Polynomial Lyapunov Function
Distributed Caching for Complex Querying of Raw Arrays
Salient Objects in Clutter: Bringing Salient Object Detection to the Foreground
A dataset and architecture for visual reasoning with a working memory
Expected Time to Extinction of SIS Epidemic Model Using Quasy Stationary Distribution
Parameterized Low-Rank Binary Matrix Approximation
Efficient Decoding Schemes for Noisy Non-Adaptive Group Testing when Noise Depends on Number of Items in Test
Varying k-Lipschitz Constraint for Generative Adversarial Networks
Modeling the effects of telephone nursing on healthcare utilization
A constant-ratio approximation algorithm for a class of hub-and-spoke network design problems and metric labeling problems: star metric case
Runlength-Limited Sequences and Shift-Correcting Codes
Gaussian Processes indexed on the symmetric group: prediction and learning
Surjections and double posets
Learning Sparse Deep Feedforward Networks via Tree Skeleton Expansion
On Combination Networks with Cache-aided Relays and Users
Towards Advanced Phenotypic Mutations in Cartesian Genetic Programming
Analysis of an asymptotic preserving scheme for stochastic linear kinetic equations in the diffusion limit
Towards Image Understanding from Deep Compression without Decoding
Signless Laplacian determinations of some graphs with independent edges
Patchwise object tracking via structural local sparse appearance model
Q-process and asymptotic properties of Markov processes conditioned not to hit moving boundaries
Local weak convergence for PageRank
Object Captioning and Retrieval with Natural Language
Semantic Segmentation of Pathological Lung Tissue with Dilated Fully Convolutional Networks
Existence and smoothness of the density for the stochastic continuity equation
Downlink coverage probability in cellular networks with Poisson-Poisson cluster deployed base stations
Fair non-monetary scheduling in federated clouds
The ApolloScape Dataset for Autonomous Driving
Triplet-Center Loss for Multi-View 3D Object Retrieval
Monocular Fisheye Camera Depth Estimation Using Semi-supervised Sparse Velodyne Data
Complex-YOLO: Real-time 3D Object Detection on Point Clouds
Quantile correlation coefficient: a new tail dependence measure
Shellability of face posets of electrical networks and the CW poset property
Further Consequences of the Colorful Helly Hypothesis
Chemi-net: a graph convolutional network for accurate drug property prediction
Trianguloids and Triangulations of Root Polytopes
Induced Saturation of Graphs
Coordination via predictive assistants from a game-theoretic view
Link prediction for interdisciplinary collaboration via co-authorship network
Tropical integrable systems and Young tableaux: Shape equivalence and Littlewood-Richardson correspondence
Joint Recognition of Handwritten Text and Named Entities with a Neural End-to-end Model
Land cover mapping at very high resolution with rotation equivariant CNNs: towards small yet accurate models
Online Controlled Experiments for Personalised e-Commerce Strategies: Design, Challenges, and Pitfalls
Heterogeneous Doppler Spread-based CSI Estimation Planning for TDD Massive MIMO
Consistent sets of lines with no colorful incidence
On the existence of a scalar pressure field in the Brödinger problem
Improved Part Segmentation Performance by Optimising Realism of Synthetic Images using Cycle Generative Adversarial Networks
$EVA^2$ : Exploiting Temporal Redundancy in Live Computer Vision
Activity Detection with Latent Sub-event Hierarchy Learning
A local characterization of crystals for the quantum queer superalgebra
Synchronisation of Partial Multi-Matchings via Non-negative Factorisations
A particle-based variational approach to Bayesian Non-negative Matrix Factorization
Fast approximation and exact computation of negative curvature parameters of graphs
Learning deep structured active contours end-to-end
Faces as Lighting Probes via Unsupervised Deep Highlight Extraction
Distributed Transactions: Dissecting the Nightmare


Document worth reading: “Run Time Prediction for Big Data Iterative ML Algorithms: a KMeans case study”

Data science and machine learning algorithms running on big data infrastructure are increasingly important in activities ranging from business intelligence and analytics to cybersecurity, smart city management, and many fields of science and engineering. As these algorithms are further integrated into daily operations, understanding how long they take to run on a big data infrastructure is paramount to controlling costs and delivery times. In this paper we discuss the issues involved in understanding the run time of iterative machine learning algorithms and provide a case study of such an algorithm – including a statistical characterization and model of the run time of an implementation of K-Means for the Spark big data engine using the Edward probabilistic programming language.

Magister Dixit

“Managing big data for analytics is not the same as managing DW data for reporting. In fact, the two are almost opposites … . For example, reporting is about seeing the latest values of the numbers that you track over time via a report. Obviously, you know the report, the business entities it represents, and the data warehouse that feeds the report. An analysis is more about discovering variables you don’t know, based on data that you probably don’t know very well. Also, a report requires a solid audit trail, so its data must be managed with well-documented metadata and possibly master data, too. Since most analyses have no expectation of an audit trail, there’s no need to manage one. That’s just a sampling of the differences. The point is to embrace Big Data Management for analytics as a unique practice that doesn’t follow all the strict rules we’re taught for reporting and data warehousing.” Philip Russom (2013)

Book Memo: “Introduction to Deep Learning”

From Logical Calculus to Artificial Intelligence
This textbook presents a concise, accessible and engaging first introduction to deep learning, offering a wide range of connectionist models which represent the current state-of-the-art. The text explores the most popular algorithms and architectures in a simple and intuitive style, explaining the mathematical derivations in a step-by-step manner. The content coverage includes convolutional networks, LSTMs, Word2vec, RBMs, DBNs, neural Turing machines, memory networks and autoencoders. Numerous examples in working Python code are provided throughout the book, and the code is also supplied separately at an accompanying website. Topics and features: introduces the fundamentals of machine learning, and the mathematical and computational prerequisites for deep learning; discusses feed-forward neural networks, and explores the modifications to these which can be applied to any neural network; examines convolutional neural networks, and the recurrent connections to a feed-forward neural network; describes the notion of distributed representations, the concept of the autoencoder, and the ideas behind language processing with deep learning; presents a brief history of artificial intelligence and neural networks, and reviews interesting open research problems in deep learning and connectionism. This clearly written and lively primer on deep learning is essential reading for graduate and advanced undergraduate students of computer science, cognitive science and mathematics, as well as fields such as linguistics, logic, philosophy, and psychology.

Book Memo: “Machine Learning and Cognition in Enterprises”

Business Intelligence Transformed
Learn about the emergence and evolution of IT in the enterprise, see how machine learning is transforming business intelligence, and discover various cognitive artificial intelligence solutions that complement and extend machine learning. In this book, author Rohit Kumar explores the challenges when these concepts intersect in IT systems by presenting detailed descriptions and business scenarios. He starts with the basics of how artificial intelligence started and how cognitive computing developed out of it. He’ll explain every aspect of machine learning in detail, the reasons for changing business models to adopt it, and why your business needs it. Along the way you’ll become comfortable with the intricacies of natural language processing, predictive analytics, and cognitive computing. Each technique is covered in detail so you can confidently integrate it into your enterprise as it is needed. This practical guide gives you a roadmap for transforming your business with cognitive computing, giving you the ability to work confidently in an ever-changing enterprise environment.

Distilled News

A Collection of Definitions of Intelligence

This paper is a survey of a large number of informal definitions of “intelligence” that the authors have collected over the years. Naturally, compiling a complete list would be impossible as many definitions of intelligence are buried deep inside articles and books. Nevertheless, the 70-odd definitions presented here are, to the authors’ knowledge, the largest and most well referenced collection there is.

Multi-Class Text Classification with PySpark

Apache Spark is quickly gaining steam both in the headlines and real-world adoption, mainly because of its ability to process streaming data. With so much data being processed on a daily basis, it has become essential for us to be able to stream and analyze it in real time. In addition, Apache Spark is fast enough to perform exploratory queries without sampling. Many industry experts have laid out the reasons why you should use Spark for machine learning.

Twitter Analysis with Python

Twitter is a good resource for collecting data. A few libraries (in R or Python) allow you to build your own dataset from the data generated by Twitter. This tutorial focuses on preparing the data, not on collecting it. Throughout this analysis we are going to see how to work with Twitter’s data. If you want to play with the same data you can download it here.

Introduction to Data Analysis in Python with IPL Dataset

Data Science / Analytics is all about finding valuable insights from the given dataset; in short, finding answers that could help the business. In this tutorial, we will see how to get started with data analysis in Python. The Python packages that we use in this notebook are numpy, pandas, matplotlib, and seaborn.

Using Deep Learning to Facilitate Scientific Image Analysis

Many scientific imaging applications, especially microscopy, can produce terabytes of data per day. These applications can benefit from recent advances in computer vision and deep learning. In our work with biologists on robotic microscopy applications (e.g., to distinguish cellular phenotypes) we’ve learned that assembling high quality image datasets that separate signal from noise is a difficult but important task. We’ve also learned that there are many scientists who may not write code, but who are still excited to utilize deep learning in their image analysis work. A particular challenge we can help address involves dealing with out-of-focus images. Even with the autofocus systems on state-of-the-art microscopes, poor configuration or hardware incompatibility may result in image quality issues. Having an automated way to rate focus quality can enable the detection, troubleshooting and removal of such images.

Winding Paths to Data Science: Jesse Mostipak

This infographic series features the speakers from Kaggle’s CareerCon 2018 session, ‘Real Stories from a Panel of Successful Career Switchers’.

Multiscale Methods and Machine Learning

We highlight recent developments in machine learning and deep learning related to multiscale methods, which analyze data at a variety of scales to capture a wider range of relevant features. We give a general overview of multiscale methods, examine recent successes, and compare with similar approaches.

Speeding up Metropolis-Hastings with Rcpp

In the most recent post, I profiled a Metropolis-in-Gibbs sampler for estimating the parameters of a Bayesian logistic regression model. The conclusion was that evaluation of the log-posterior was a significant run time bottleneck. In each iteration, the log-posterior is evaluated twice: once at the current draw and once at the proposed draw. This post homes in on this issue to show how Rcpp can help get past this bottleneck. For this particular post, my code and results are in this sub-repo. If you’re short on time, TLDR: just by coding the log-posterior in C++ instead of a vectorized R function, we can significantly reduce run time. The R implementation runs about 4-7 times slower. If you’re coding your own samplers, profiling your code and rewriting bottlenecks in Rcpp can be hugely beneficial.
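
To make the bottleneck concrete, here is a minimal Python sketch of a Metropolis-in-Gibbs sampler for logistic regression. It is not the post's R/Rcpp code: the independent normal prior, the `prior_sd` and `step` values, and the function names are assumptions chosen for illustration. Note the two `log_posterior` calls per coordinate update, which is the cost the post targets.

```python
import math
import random

def log_posterior(beta, X, y, prior_sd=10.0):
    """Log-posterior (up to a constant), assuming independent N(0, prior_sd^2) priors."""
    lp = -sum(b * b for b in beta) / (2.0 * prior_sd ** 2)
    for row, label in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        # Bernoulli log-likelihood: y*eta - log(1 + exp(eta)), written stably
        lp += label * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return lp

def metropolis_in_gibbs(X, y, n_iter=1500, step=0.5, seed=0):
    """Random-walk Metropolis updates, one coefficient at a time."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    draws = []
    for _ in range(n_iter):
        for j in range(len(beta)):
            proposal = list(beta)
            proposal[j] += rng.gauss(0.0, step)
            # Two evaluations per update: proposed draw and current draw
            log_ratio = log_posterior(proposal, X, y) - log_posterior(beta, X, y)
            if rng.random() < math.exp(min(0.0, log_ratio)):
                beta = proposal
        draws.append(list(beta))
    return draws

# Toy data (made up): intercept column plus one predictor
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
X = [[1.0, x] for x in xs]
y = [0, 0, 0, 1, 0, 1, 1, 1]
draws = metropolis_in_gibbs(X, y)
```

Caching the log-posterior of the current draw would remove one of the two evaluations; the sketch keeps both so that it mirrors the description above.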

Basics of Bayesian Decision Theory

The use of formal statistical methods to analyse quantitative data in data science has increased considerably over the last few years. One such approach, Bayesian Decision Theory (BDT), also known as Bayesian Hypothesis Testing and Bayesian inference, is a fundamental statistical approach that quantifies the tradeoffs between various decisions using distributions and costs that accompany such decisions. In pattern recognition it is used for designing classifiers making the assumption that the problem is posed in probabilistic terms, and that all of the relevant probability values are known. Generally, we don’t have such perfect information but it is a good place to start when studying machine learning, statistical inference, and detection theory in signal processing. BDT also has many applications in science, engineering, and medicine.
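
As a minimal sketch of the idea, assuming all probabilities and costs are known, the decision rule picks the action with the lowest expected loss under the posterior. The states, actions, and numbers below are invented purely for illustration.

```python
def posteriors(priors, likelihoods):
    """Bayes' rule: P(state | x) is proportional to P(x | state) * P(state)."""
    joint = {s: priors[s] * likelihoods[s] for s in priors}
    evidence = sum(joint.values())
    return {s: v / evidence for s, v in joint.items()}

def bayes_decision(post, loss):
    """Return the action minimizing expected loss; loss[action][state] is a cost."""
    risk = {a: sum(loss[a][s] * post[s] for s in post) for a in loss}
    return min(risk, key=risk.get), risk

# Hypothetical example: a positive test result was observed
priors = {"healthy": 0.9, "sick": 0.1}
likelihoods = {"healthy": 0.2, "sick": 0.9}   # P(positive test | state)
post = posteriors(priors, likelihoods)

# Missing a sick patient is assumed to cost five times an unnecessary treatment
asymmetric = {"treat": {"healthy": 1.0, "sick": 0.0},
              "wait":  {"healthy": 0.0, "sick": 5.0}}
zero_one = {"treat": {"healthy": 1.0, "sick": 0.0},
            "wait":  {"healthy": 0.0, "sick": 1.0}}

action_asym, _ = bayes_decision(post, asymmetric)
action_01, _ = bayes_decision(post, zero_one)
print(action_asym, action_01)
```

With the same posterior, the two loss tables lead to different actions, which is the tradeoff between distributions and costs that the paragraph describes.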

Probabilistic Forecasting: Learning Uncertainty

The majority of industry and academic numeric predictive projects deal with deterministic or point forecasts of expected values of a random variable given some conditional information. In some cases, these predictions are enough for decision making. However, these predictions don’t say much about the uncertainty of your underlying stochastic process. A common desire of all data scientists is to make predictions for an uncertain future. Clearly then, forecasts should be probabilistic, i.e., they should take the form of probability distributions over future quantities or events. This form of prediction is known as probabilistic forecasting and in the last decade has seen a surge in popularity. Recent evidence of this includes the 2014 and 2017 Global Energy Forecasting Competitions (GEFCom). GEFCom2014 focused on producing multiple quantile forecasts for wind, solar, load, and electricity prices, and GEFCom2017 focused on hierarchical rolling probabilistic forecasts of load. More recently, the M4 Competition aims to produce point forecasts of 100,000 time series and, for the first time, optionally accepts prediction interval forecasts as well.
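
One common way to score quantile forecasts like those mentioned above is the pinball (quantile) loss. The sketch below is a generic illustration with made-up numbers and hypothetical function names, not code from the article.

```python
def pinball_loss(y, q_pred, tau):
    """Pinball loss of predicting quantile level tau as q_pred when y is observed."""
    diff = y - q_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau)
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

def mean_pinball_loss(actuals, quantile_forecasts):
    """Average loss over observations; each forecast maps tau -> predicted quantile."""
    total, count = 0.0, 0
    for y, forecast in zip(actuals, quantile_forecasts):
        for tau, q in forecast.items():
            total += pinball_loss(y, q, tau)
            count += 1
    return total / count

# A forecast expressed as three quantiles instead of a single point value
forecast = {0.1: 8.0, 0.5: 10.0, 0.9: 12.0}
score = mean_pinball_loss([10.0], [forecast])
print(round(score, 4))
```

Lower is better: a forecast whose 90% quantile sits far below the observed value is penalized more heavily than one that sits the same distance above it.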

Swarm Optimization: Goodbye Gradients

Fish schools, bird flocks, and bee swarms. These combinations of real-time biological systems can blend knowledge, exploration, and exploitation to unify intelligence and solve problems more efficiently. There’s no centralized control. These simple agents interact locally, within their environment, and new behaviors emerge from the group as a whole. In the world of evolutionary algorithms, one such inspired method is particle swarm optimization (PSO). It is a swarm-intelligence-based computational technique that can be used to find an approximate solution to a problem by iteratively improving candidate solutions (called particles) with regard to a given measure of quality. The movements of the particles are guided by their own best known position in the search space as well as the entire swarm’s best known position. PSO makes few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. As a global optimization method, PSO does not use the gradient of the problem being optimized, which means PSO does not require that the optimization problem be differentiable, as is required by classic optimization methods such as gradient descent. This makes it a widely popular optimizer for many nonconvex or nondifferentiable problems.
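
The update rule described above can be sketched in a few lines of Python. This is a basic textbook-style variant; the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the clipping to bounds, and the test objective are illustrative assumptions rather than values from the article.

```python
import random

def pso(objective, bounds, n_particles=30, n_iter=200,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective over a box; no gradients are used anywhere."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # each particle's best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # swarm's best position
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward own best + pull toward swarm best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# A nondifferentiable objective: sum of absolute deviations from 1
def abs_dev(x):
    return sum(abs(xi - 1.0) for xi in x)

best, best_val = pso(abs_dev, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
print(best, best_val)
```

The objective has a kink at its minimum, so gradient descent would need special handling there, while the swarm only ever compares function values.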

R Packages worth a look

Computing F-Statistics from Pool-Seq Data (poolfstat)
Functions for the computation of F-statistics from Pool-Seq data in population genomics studies. The package also includes several utilities to manipulate Pool-Seq data stored in standard formats (‘vcf’ and ‘rsync’ files as obtained from the popular software ‘VarScan’ and ‘PoPoolation’ respectively) and to convert them to alternative formats (as used in the ‘BayPass’ and ‘SelEstim’ software).

Voting Systems, Instant-Runoff Voting, Borda Method, Various Condorcet Methods (votesys)
Various methods to count ballots in voting systems are provided: Instant-runoff voting described in Reynolds, Reilly and Ellis (2005, ISBN:9789185391189), Borda method in Emerson (2013) <doi:10.1007/s00355-011-0603-9>, original Condorcet method in Stahl and Johnson (2017, ISBN:9780486807386), Dodgson method in McCabe-Dansted and Slinko (2008) <doi:10.1007/s00355-007-0282-8>, Simpson-Kramer method in Levin and Nalebuff (1995) <doi:10.1257/jep.9.1.3>, Schulze method in Schulze (2011) <doi:10.1007/s00355-010-0475-4>, Ranked pairs method in Tideman (1987) <doi:10.1007/BF00433944>. Functions to check validity of ballots are also provided to ensure flexibility.
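
For readers unfamiliar with these counting rules, here is a small Python sketch of two of them, the Borda method and instant-runoff voting. It does not use or mirror the ‘votesys’ API; the function names, the ballot format (a list of candidates in order of preference), and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter

def borda(ballots):
    """Borda count: with n candidates, rank k (0-based) earns n - 1 - k points."""
    scores = Counter()
    for ballot in ballots:
        n = len(ballot)
        for rank, cand in enumerate(ballot):
            scores[cand] += n - 1 - rank
    return dict(scores)

def instant_runoff(ballots):
    """Eliminate the weakest candidate until someone holds a majority."""
    remaining = {c for b in ballots for c in b}
    while True:
        tally = Counter({c: 0 for c in remaining})
        for b in ballots:
            for c in b:
                if c in remaining:        # highest-ranked surviving candidate
                    tally[c] += 1
                    break
        total = sum(tally.values())
        leader, votes = max(tally.items(), key=lambda kv: (kv[1], kv[0]))
        if votes * 2 > total or len(remaining) == 1:
            return leader
        loser = min(tally.items(), key=lambda kv: (kv[1], kv[0]))[0]
        remaining.discard(loser)

# Nine made-up ballots over candidates A, B, C
ballots = [["A", "B", "C"]] * 4 + [["B", "C", "A"]] * 3 + [["C", "B", "A"]] * 2
print(borda(ballots))
print(instant_runoff(ballots))
```

In this example A leads on first preferences but has no majority; once C is eliminated, C's ballots transfer to B.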

A Graph Based Particle Simulator Based on D3-Force (particles)
Simulating particle movement in 2D space has many applications. The ‘particles’ package implements a particle simulator based on the ideas behind the ‘d3-force’ ‘JavaScript’ library. ‘particles’ implements all forces defined in ‘d3-force’ as well as others such as vector fields, traps, and attractors.

3D Forest Simulation Visualization Tool (DGVM3D)
This is a visualization tool for vegetation structure/succession in space and/or time mainly for forest gap models. However, it could also be used to visualize observed forest stands. If used for models, they should contain either individual trees or cohorts (e.g. LPJ-GUESS by Smith et al. (2014) <doi:10.5194/bg-11-2027-2014>). For a list of required and additional data fields see the vignette.

Dependence Measures via Energy Statistics (EDMeasure)
Implementations of (1) mutual dependence measures and mutual independence tests in Jin, Z., and Matteson, D. S. (2017) <arXiv:1709.0253>; (2) independent component analysis methods based on mutual dependence measures in Jin, Z., and Matteson, D. S. (2017) <arXiv:1709.0253> and Pfister, N., et al. (2018) <doi:10.1111/rssb.12235>; (3) conditional mean dependence measures and conditional mean independence tests in Shao, X., and Zhang, J. (2014) <doi:10.1080/01621459.2014.887012> and Park, T., et al. (2015) <doi:10.1214/15-EJS1047>.

If you did not already know

Approximate Bayesian Computation (ABC)
This Chapter, ‘ABC Samplers’, is to appear in the forthcoming Handbook of Approximate Bayesian Computation (2018). It details the main ideas and algorithms used to sample from the ABC approximation to the posterior distribution, including methods based on rejection/importance sampling, MCMC and sequential Monte Carlo. …

PredictionIO
PredictionIO is an open source machine learning server for software developers to create predictive features, such as personalization, recommendation and content discovery. …

Data Mining (DM)
Data mining (the analysis step of the “Knowledge Discovery in Databases” process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. …

Distilled News

Introducing conu – Scripting Containers Made Easier

There has been a need for a simple, easy-to-use handler for writing tests and other code around containers that would implement helpful methods and utilities. For this we introduce conu, a low-level Python library. This project has been driven from the start by the requirements of container maintainers and testers. In addition to basic image and container management methods, it provides other often used functions, such as container mount, shortcut methods for getting an IP address, exposed ports, logs, name, image extending using source-to-image, and many others. conu aims for stable engine-agnostic APIs that would be implemented by several container runtime back-ends. Switching between two different container engines should require only minimum effort. When used for testing, one set of tests could be executed for multiple back-ends.

Understanding Experimentation Platforms

Thanks to approaches such as continuous integration and continuous delivery, companies that once introduced new products every six months are now shipping software several times daily. Reaching the market quickly is vital today, but rapid updates are impractical unless they provide genuine customer value. With this eBook, you’ll learn how online controlled experiments can help you gain customer feedback quickly so you can maintain a speedy release cycle. Using examples from Google, LinkedIn, and other organizations, Adil Aijaz, Trevor Stuart, and Henry Jewkes from Split Software explain basic concepts and show you how to build a scalable experimentation platform for conducting full-stack, comprehensive, and continuous tests. Along the way, you’ll learn practical tips on best practices and the common pitfalls you’re likely to face. This eBook is ideal for engineers, data scientists, and product managers.

Document worth reading: “Rule-Mining based classification: a benchmark study”

This study proposes an exhaustive, stable/reproducible rule-mining algorithm combined with a classifier to generate models that are both accurate and interpretable. Our method first extracts rules (i.e., conjunctions of conditions on the values of a small number of input features) with our exhaustive rule-mining algorithm, then constructs a new feature space based on the most relevant rules, called ‘local features’, and finally builds a local predictive model by training a standard classifier on the new local feature space. This local feature space is easily interpretable because it provides a human-understandable explanation in the explicit form of rules. Furthermore, our local predictive approach is as powerful as global classical ones like logistic regression (LR), support vector machine (SVM) and rule-based methods like random forest (RF) and gradient boosted tree (GBT). Rule-Mining based classification: a benchmark study