
R Packages worth a look

Fast Algorithms for Best Subset Selection (L0Learn)
Highly optimized toolkit for (approximately) solving L0-regularized learning problems. The algorithms are based on coordinate descent and local combina …
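A minimal usage sketch, assuming the package's L0Learn.fit() interface as documented on CRAN (argument names may differ across versions):

library(L0Learn)
set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))        # sparse truth: 5 active features
y <- drop(X %*% beta + rnorm(n))
fit <- L0Learn.fit(X, y, penalty = "L0", maxSuppSize = 10)
print(fit)                                  # path of solutions by support size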

Simulation of Graphically Constrained Matrices (gmat)
Implementation of the simulation method for Gaussian graphical models described in Córdoba et al. (2018) <arXiv:1807.03090>. The package also pro …

Graph-Based Landscape De-Fragmentation (gDefrag)
Provides a set of tools to help the de-fragmentation process. It works by prioritizing the different sections of linear infrastructures (e.g. roads, po …


Book Memo: “Mathematics of Big Data”

Spreadsheets, Databases, Matrices, and Graphs
The first book to present the common mathematical foundations of big data analysis across a range of applications and technologies.

Today, the volume, velocity, and variety of data are increasing rapidly across a range of fields, including Internet search, healthcare, finance, social media, wireless devices, and cybersecurity. Indeed, these data are growing at a rate beyond our capacity to analyze them. The tools developed to address this challenge, including spreadsheets, databases, matrices, and graphs, all reflect the need to store and operate on data as whole sets rather than as individual elements. This book presents the common mathematical foundations of these data sets that apply across many applications and technologies. Associative arrays unify and simplify data, allowing readers to look past the differences among the various tools and leverage their mathematical similarities in order to solve the hardest big data challenges.

The book first introduces the concept of the associative array in practical terms, presents the associative array manipulation system D4M (Dynamic Distributed Dimensional Data Model), and describes the application of associative arrays to graph analysis and machine learning. It provides a mathematically rigorous definition of associative arrays and describes the properties of associative arrays that arise from this definition. Finally, the book shows how concepts of linearity can be extended to encompass associative arrays. Mathematics of Big Data can be used as a textbook or reference by engineers, scientists, mathematicians, computer scientists, and software engineers who analyze big data.

R Packages worth a look

Adaptive Sparsity Models (AdaptiveSparsity)
Implements Figueiredo EM algorithm for adaptive sparsity (Jeffreys prior) (see Figueiredo, M.A.T.; , ‘Adaptive sparseness for supervised learning,’ Pat …

Greedy Experimental Design Construction (GreedyExperimentalDesign)
Computes experimental designs for a two-arm experiment with covariates by greedily optimizing a balance objective function. This optimization provides …

FIS ‘MarketMap C-Toolkit’ (rhli)
Complete access from ‘R’ to the FIS ‘MarketMap C-Toolkit’ (‘FAME C-HLI’). ‘FAME’ is a fully integrated software and database management system from FIS …

What's new on arXiv

Latent Dirichlet Allocation (LDA) for Topic Modeling of the CFPB Consumer Complaints

A text mining approach is proposed based on latent Dirichlet allocation (LDA) to analyze the Consumer Financial Protection Bureau (CFPB) consumer complaints. The proposed approach aims to extract latent topics in the CFPB complaint narratives, and explores their associated trends over time. The time trends will then be used to evaluate the effectiveness of the CFPB regulations and expectations on financial institutions in creating a consumer oriented culture that treats consumers fairly and prioritizes consumer protection in their decision making processes. The proposed approach can be easily operationalized as a decision support system to automate detection of emerging topics in consumer complaints. Hence, the technology-human partnership between the proposed approach and the CFPB team could certainly improve consumer protections from unfair, deceptive or abusive practices in the financial markets by providing more efficient and effective investigations of consumer complaint narratives.
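A minimal sketch of LDA topic extraction in R, using the topicmodels and tm packages as stand-ins for the paper's pipeline; complaints is a hypothetical character vector of CFPB narratives:

library(tm)
library(topicmodels)

corpus <- VCorpus(VectorSource(complaints))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
dtm <- DocumentTermMatrix(corpus)

lda <- LDA(dtm, k = 10, control = list(seed = 1))  # 10 latent topics
terms(lda, 5)                                      # top 5 terms per topic
topics(lda)[1:10]                                  # dominant topic per document

Tracking the per-document dominant topics against complaint dates would then give the time trends the abstract describes.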


Anomaly Detection for Water Treatment System based on Neural Network with Automatic Architecture Optimization

We continue to develop our neural network (NN) based forecasting approach to anomaly detection (AD) using the Secure Water Treatment (SWaT) industrial control system (ICS) testbed dataset. We propose genetic algorithms (GA) to find the best NN architecture for a given dataset, using the NAB metric to assess the quality of different architectures. The drawbacks of the F1-metric are analyzed. Several techniques are proposed to improve the quality of AD: exponentially weighted smoothing, mean p-powered error measure, individual error weight for each variable, disjoint prediction windows. Based on the techniques used, an approach to anomaly interpretation is introduced.


Preventing Poisoning Attacks on AI based Threat Intelligence Systems

As AI systems become more ubiquitous, securing them becomes an emerging challenge. Over the years, with the surge in online social media use and the data available for analysis, AI systems have been built to extract, represent and use this information. The credibility of this information extracted from open sources, however, can often be questionable. Malicious or incorrect information can cause a loss of money, reputation, and resources; and in certain situations, pose a threat to human life. In this paper, we use an ensembled semi-supervised approach to determine the credibility of Reddit posts by estimating their reputation score to ensure the validity of information ingested by AI systems. We demonstrate our approach in the cybersecurity domain, where security analysts utilize these systems to determine possible threats by analyzing the data scattered on social media websites, forums, blogs, etc.


Statistical Model Compression for Small-Footprint Natural Language Understanding

In this paper we investigate statistical model compression applied to natural language understanding (NLU) models. Small-footprint NLU models are important for enabling offline systems on hardware restricted devices, and for decreasing on-demand model loading latency in cloud-based systems. To compress NLU models, we present two main techniques, parameter quantization and perfect feature hashing. These techniques are complementary to existing model pruning strategies such as L1 regularization. We performed experiments on a large scale NLU system. The results show that our approach achieves 14-fold reduction in memory usage compared to the original models with minimal predictive performance impact.
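A minimal sketch of linear 8-bit parameter quantization, one of the two techniques named above; this is illustrative only, not the paper's implementation:

quantize8 <- function(w) {
  scale <- max(abs(w)) / 127
  q <- as.integer(round(w / scale))       # store weights as 8-bit integers
  list(q = q, scale = scale)
}
dequantize <- function(qw) qw$q * qw$scale

w <- rnorm(1000)                           # stand-in for model weights
qw <- quantize8(w)
max(abs(w - dequantize(qw)))               # worst-case reconstruction error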


Analyzing Hypersensitive AI: Instability in Corporate-Scale Machine Learning

Predictive geometric models deliver excellent results for many Machine Learning use cases. Despite their undoubted performance, neural predictive algorithms can show unexpected degrees of instability and variance, particularly when applied to large datasets. We present an approach to measure changes in geometric models with respect to both output consistency and topological stability. Considering the example of a recommender system using word2vec, we analyze the influence of single data points, approximation methods and parameter settings. Our findings can help to stabilize models where needed and to detect differences in informational value of data points on a large scale.


Semantic Parsing: Syntactic assurance to target sentence using LSTM Encoder CFG-Decoder

Semantic parsing can be defined as the process of mapping natural language sentences into a machine-interpretable, formal representation of their meaning. Semantic parsing using LSTM encoder-decoder neural networks has become a promising approach. However, such neural translation of natural language provides no grammaticality guarantee for the generated sentences; such a guarantee is particularly important in practical cases, where an ungrammatical sentence used as a database query can cause critical errors. In this work, we propose a neural architecture called Encoder CFG-Decoder, whose output conforms to a given context-free grammar. Results show that any implementation of this architecture is correct by construction and provides benchmark accuracy levels better than the literature.


Linear Programming Approximations for Index Coding

Index coding, a source coding problem over broadcast channels, has been a subject of both theoretical and practical interest since its introduction (by Birk and Kol, 1998). In short, the problem can be defined as follows: there is an input $\mathbf{x} \triangleq (\mathbf{x}_1, \dots, \mathbf{x}_n)$, a set of $n$ clients who each desire a single symbol $\mathbf{x}_i$ of the input, and a broadcaster whose goal is to send as few messages as possible to all clients so that each one can recover its desired symbol. Additionally, each client has some predetermined ‘side information,’ corresponding to certain symbols of the input $\mathbf{x}$, which we represent as the ‘side information graph’ $\mathcal{G}$. The graph $\mathcal{G}$ has a vertex $v_i$ for each client and a directed edge $(v_i, v_j)$ indicating that client $i$ knows the $j$th symbol of the input. Given a fixed side information graph $\mathcal{G}$, we are interested in determining or approximating the ‘broadcast rate’ of index coding on the graph, i.e. the fewest number of messages the broadcaster can transmit so that every client gets their desired information. Using index coding schemes based on linear programs (LPs), we take a two-pronged approach to approximating the broadcast rate. First, extending earlier work on planar graphs, we focus on approximating the broadcast rate for special graph families such as graphs with small chromatic number and disk graphs. In certain cases, we are able to show that simple LP-based schemes give constant-factor approximations of the broadcast rate, which seem extremely difficult to obtain in the general case. Second, we provide several LP-based schemes for the general case which are not constant-factor approximations, but which strictly improve on the prior best-known schemes.
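One classical baseline among LP-based schemes is the fractional clique cover of an undirected side-information graph: each clique can be served with a single message, and fractional weights are allowed. A minimal sketch for the 5-cycle using lpSolve (this illustrates the standard bound, not the paper's improved schemes):

library(lpSolve)
# Cliques of the 5-cycle: 5 singletons {v} and 5 edges {v, v+1 mod 5}
A <- matrix(0, nrow = 5, ncol = 10)
for (v in 1:5) A[v, v] <- 1                  # singleton cliques
edges <- cbind(1:5, c(2:5, 1))
for (e in 1:5) A[edges[e, ], 5 + e] <- 1     # edge cliques
# Minimize total clique weight s.t. every vertex is (fractionally) covered
sol <- lp("min", rep(1, 10), A, rep(">=", 5), rep(1, 5))
sol$objval   # 2.5: the fractional clique-cover bound for the 5-cycle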


A Projection Pursuit Forest Algorithm for Supervised Classification

This paper presents a new ensemble learning method for classification problems called projection pursuit random forest (PPF). PPF uses the PPtree algorithm introduced in Lee et al. (2013). In PPF, trees are constructed by splitting on linear combinations of randomly chosen variables. Projection pursuit is used to choose a projection of the variables that best separates the classes. Utilizing linear combinations of variables to separate classes takes the correlation between variables into account, which allows PPF to outperform a traditional random forest when separations between groups occur in combinations of variables. The method presented here can be used in multi-class problems and is implemented in an R (R Core Team, 2018) package, PPforest, which is available on CRAN, with development versions at https://…/PPforest.
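A minimal sketch of a single PPtree-style split in base R, with MASS supplying the separating projection; PPforest's actual interface differs, so treat this as the idea only:

library(MASS)
d <- subset(iris, Species != "setosa")             # two-class toy problem
d$Species <- droplevels(d$Species)
vars <- sample(names(d)[1:4], 2)                   # randomly chosen variables
proj <- lda(d[, vars], d$Species)$scaling[, 1]     # best separating direction
score <- drop(as.matrix(d[, vars]) %*% proj)       # linear combination of vars
split <- mean(tapply(score, d$Species, mean))      # midpoint of class means
table(score > split, d$Species)                    # quality of this one split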


Imparting Interpretability to Word Embeddings

As a ubiquitous method in natural language processing, word embeddings are extensively employed to map semantic properties of words into a dense vector representation. They capture semantic and syntactic relations among words, but the vectors corresponding to the words are only meaningful relative to each other. Neither the vector nor its dimensions have any absolute, interpretable meaning. We introduce an additive modification to the objective function of the embedding learning algorithm that encourages the embedding vectors of words that are semantically related to a predefined concept to take larger values along a specified dimension, while leaving the original semantic learning mechanism mostly unaffected. In other words, we align words that are already determined to be related, along predefined concepts. Therefore, we impart interpretability to the word embedding by assigning meaning to its vector dimensions. The predefined concepts are derived from an external lexical resource, which in this paper is chosen as Roget’s Thesaurus. We observe that alignment along the chosen concepts is not limited to words in the Thesaurus and extends to other related words as well. We quantify the extent of interpretability and assignment of meaning from our experimental results. We also demonstrate the preservation of semantic coherence of the resulting vector space by using word-analogy and word-similarity tests. These tests show that the interpretability-imparted word embeddings that are obtained by the proposed framework do not sacrifice performances in common benchmark tests.


ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

In this work, we propose an alternative solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (Oord et al., 2018), we distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a novel regularized KL divergence between their highly-peaked output distributions. Our method computes the KL divergence in closed-form, which simplifies the training algorithm and provides very efficient distillation. In addition, we propose the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. It significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet (Ping et al., 2017). We also successfully distill a parallel waveform synthesizer conditioned on the hidden representation in this end-to-end model.
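For reference, the closed-form KL divergence between two univariate Gaussians is what makes this distillation efficient when teacher and student both output Gaussian parameters (the paper's specific regularization of this quantity is not reproduced here):

$\mathrm{KL}\big(\mathcal{N}(\mu_q,\sigma_q^2)\,\|\,\mathcal{N}(\mu_p,\sigma_p^2)\big) = \log\frac{\sigma_p}{\sigma_q} + \frac{\sigma_q^2 + (\mu_q-\mu_p)^2}{2\sigma_p^2} - \frac{1}{2}$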


Bounded Information Rate Variational Autoencoders

This paper introduces a new member of the family of Variational Autoencoders (VAE) that constrains the rate of information transferred by the latent layer. The latent layer is interpreted as a communication channel, the information rate of which is bounded by imposing a pre-set signal-to-noise ratio. The new constraint subsumes the mutual information between the input and latent variables, combining naturally with the likelihood objective of the observed data as used in a conventional VAE. The resulting Bounded-Information-Rate Variational Autoencoder (BIR-VAE) provides a meaningful latent representation with an information resolution that can be specified directly in bits by the system designer. The rate constraint can be used to prevent overtraining, and the method naturally facilitates quantisation of the latent variables at the set rate. Our experiments confirm that the BIR-VAE has a meaningful latent representation and that its performance is at least as good as state-of-the-art competing algorithms, but with lower computational complexity.
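The connection between the pre-set signal-to-noise ratio and a rate in bits is presumably the Gaussian channel capacity formula:

$R = \tfrac{1}{2}\log_2(1 + \mathrm{SNR})$ bits per latent dimension per sample.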


Attend and Rectify: a Gated Attention Mechanism for Fine-Grained Recovery

We propose a novel attention mechanism to enhance Convolutional Neural Networks for fine-grained recognition. It learns to attend to lower-level feature activations without requiring part annotations and uses these activations to update and rectify the output likelihood distribution. In contrast to other approaches, the proposed mechanism is modular, architecture-independent and efficient both in terms of parameters and computation required. Experiments show that networks augmented with our approach systematically improve their classification accuracy and become more robust to clutter. As a result, Wide Residual Networks augmented with our proposal surpasses the state of the art classification accuracies in CIFAR-10, the Adience gender recognition task, Stanford dogs, and UEC Food-100.


FuzzerGym: A Competitive Framework for Fuzzing and Learning

Fuzzing is a commonly used technique designed to test software by automatically crafting program inputs. Currently, the most successful fuzzing algorithms emphasize simple, low-overhead strategies with the ability to efficiently monitor program state during execution. Through compile-time instrumentation, these approaches have access to numerous aspects of program state including coverage, data flow, and heterogeneous fault detection and classification. However, existing approaches utilize blind random mutation strategies when generating test inputs. We present a different approach that uses this state information to optimize mutation operators using reinforcement learning (RL). By integrating OpenAI Gym with libFuzzer we are able to simultaneously leverage advancements in reinforcement learning as well as fuzzing to achieve deeper coverage across several varied benchmarks. Our technique connects the rich, efficient program monitors provided by LLVM Sanitizers with a deep neural net to learn mutation selection strategies directly from the input data. The cross-language, asynchronous architecture we developed enables us to apply any OpenAI Gym compatible deep reinforcement learning algorithm to any fuzzing problem with minimal slowdown.
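Mutation-operator selection has the flavor of a multi-armed bandit. As a toy stand-in for the paper's deep-RL setup, here is a minimal epsilon-greedy sketch in R; the four operators and their simulated new-coverage probabilities are made up for illustration:

select_op <- function(Q, eps = 0.1) {
  if (runif(1) < eps) sample(length(Q), 1) else which.max(Q)
}

Q <- rep(0, 4); Ncnt <- rep(0, 4)          # 4 hypothetical mutation operators
p_cov <- c(0.05, 0.10, 0.02, 0.20)         # simulated new-coverage rates
for (step in 1:1000) {
  a <- select_op(Q)
  r <- rbinom(1, 1, p_cov[a])              # reward: a new-coverage event
  Ncnt[a] <- Ncnt[a] + 1
  Q[a] <- Q[a] + (r - Q[a]) / Ncnt[a]      # incremental mean update
}
round(Q, 3); Ncnt                          # estimated yield and usage per operator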


Understanding and Improving Interpolation in Autoencoders via an Adversarial Regularizer

Autoencoders provide a powerful framework for learning compressed representations by encoding all of the information needed to reconstruct a data point in a latent code. In some cases, autoencoders can ‘interpolate’: By decoding the convex combination of the latent codes for two datapoints, the autoencoder can produce an output which semantically mixes characteristics from the datapoints. In this paper, we propose a regularization procedure which encourages interpolated outputs to appear more realistic by fooling a critic network which has been trained to recover the mixing coefficient from interpolated data. We then develop a simple benchmark task where we can quantitatively measure the extent to which various autoencoders can interpolate and show that our regularizer dramatically improves interpolation in this setting. We also demonstrate empirically that our regularizer produces latent codes which are more effective on downstream tasks, suggesting a possible link between interpolation abilities and learning useful representations.
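Roughly, per the description above (notation assumed: encoder $f$, decoder $g$, critic $d$, mixing coefficient $\alpha$), the interpolant is

$\hat{x}_\alpha = g\big(\alpha f(x_1) + (1-\alpha) f(x_2)\big)$,

the critic is trained to recover $\alpha$ from $\hat{x}_\alpha$, and the autoencoder receives an extra term of the form $\lambda\,\|d(\hat{x}_\alpha)\|^2$ pushing the critic's estimate toward zero, i.e. toward interpolants indistinguishable from real reconstructions.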


Rearranging the Familiar: Testing Compositional Generalization in Recurrent Networks

Systematic compositionality is the ability to recombine meaningful units with regular and predictable outcomes, and it’s seen as key to humans’ capacity for generalization in language. Recent work has studied systematic compositionality in modern seq2seq models using generalization to novel navigation instructions in a grounded environment as a probing tool, requiring models to quickly bootstrap the meaning of new words. We extend this framework here to settings where the model needs only to recombine well-trained functional words (such as ‘around’ and ‘right’) in novel contexts. Our findings confirm and strengthen the earlier ones: seq2seq models can be impressively good at generalizing to novel combinations of previously-seen input, but only when they receive extensive training on the specific pattern to be generalized (e.g., generalizing from many examples of ‘X around right’ to ‘jump around right’), while failing when generalization requires novel application of compositional rules (e.g., inferring the meaning of ‘around right’ from those of ‘right’ and ‘around’).


A Hand-Held Multimedia Translation and Interpretation System with Application to Diet Management
Minimizing convex quadratic with variable precision Krylov methods
Guess who? Multilingual approach for the automated generation of author-stylized poetry
Clinical Text Classification with Rule-based Features and Knowledge-guided Convolutional Neural Networks
Signal Alignment for Humanoid Skeletons via the Globally Optimal Reparameterization Algorithm
Real-Time Stereo Vision for Road Surface 3-D Reconstruction
Eigenspace-Based Minimum Variance Combined with Delay Multiply and Sum Beamformer: Application to Linear-Array Photoacoustic Imaging
High-Mobility Wideband Massive MIMO Communications: Doppler Compensation, Analysis and Scaling Law
A Fixed-Parameter Linear-Time Algorithm to Compute Principal Typings of Planar Flow Networks
Entanglement Transitions from Holographic Random Tensor Networks
Universal Scaling Theory of the Boundary Geometric Tensor in Disordered Metals
Ricci curvature for parametric statistics via optimal transport
Comparative study of Discrete Wavelet Transforms and Wavelet Tensor Train decomposition to feature extraction of FTIR data of medicinal plants
Weakly Monotone Fock Space and Monotone Convolution of the Wigner Law
NIP omega-categorical structures: the rank 1 case
Hierarchical Multi Task Learning With CTC
Real-time digital signal recovery for a low-pass transfer function system with multiple complex poles
The trinomial transform triangle
The classification of homogeneous finite-dimensional permutation structures
Continuous approximation of $(M_t,M_t, 1)$ distributions with application to production
A Holistic Approach to Forecasting Wholesale Energy Market Prices
Reconstructing Latent Orderings by Spectral Clustering
Datamining a medieval medical text reveals patterns in ingredient choice that reflect biological activity against the causative agents of specified infections
Distributed Second-order Convex Optimization
A Scalable MCEM Estimator for Spatio-Temporal Autoregressive Models
Representational efficiency outweighs action efficiency in human program induction
Fast and Deterministic Approximations for $k$-Cut
CT Image Enhancement Using Stacked Generative Adversarial Networks and Transfer Learning for Lesion Segmentation Improvement
Minimum distance computation of linear codes via genetic algorithms with permutation encoding
Take a Look Around: Using Street View and Satellite Images to Estimate House Prices
Approximation Schemes for Low-Rank Binary Matrix Approximation Problems
What kind of content are you prone to tweet? Multi-topic Preference Model for Tweeters
Once reinforced random walk on $\mathbb{Z}\times Γ$
How Consumer Empathy Assist Power Grid in Demand Response
Automatic Identification of Ineffective Online Student Questions in Computing Education
A $φ$-Competitive Algorithm for Scheduling Packets with Deadlines
Efficient Power Flow Management and Peak Shaving in a Microgrid-PV System
A Novel Scheme for Support Identification and Iterative Sampling of Bandlimited Graph Signals
Tomlinson-Harashima Precoded Rate-Splitting for Multiuser MIMO Systems
Evaluating Word Embeddings in Multi-label Classification Using Fine-grained Name Typing
Efficient Training on Very Large Corpora via Gramian Estimation
A Tale of Santa Claus, Hypergraphs and Matroids
Tracking Sparse mmWave Channel: Performance Analysis under Intra-Cluster Angular Spread
Is the SIC Outcome There When Nobody Looks?
Achievable Rate maximization by Passive Intelligent Mirrors
Asymptotically Optimal Estimation Algorithm for the Sparse Signal with Arbitrary Distributions
Performance, Power, and Area Design Trade-offs in Millimeter-Wave Transmitter Beamforming Architectures
Few-Shot Adaptation for Multimedia Semantic Indexing
Negative Imaginary State Feedback Control with a Prescribed Degree of Stability
Coexistence of scale invariant and rhythmic behavior in self-organized criticality
A Machine Learning Approach for Detecting Students at Risk of Low Academic Achievement
Isolating effects of age with fair representation learning when assessing dementia
Disorder-robust entanglement transport
Efficient Sampling of Bandlimited Graph Signals
Exponential Stabilization for Ito Stochastic Systems with Multiple Input Delays
Monocular Object Orientation Estimation using Riemannian Regression and Classification Networks
Convex Relaxations in Power System Optimization: A Brief Introduction
Stability of generalized Petersen graphs
UAV-Based in-band Integrated Access and Backhaul for 5G Communications
Cooperative Adaptive Cruise Control for Connected Autonomous Vehicles by Factoring Communication-Related Constraints
Optimal estimation of Gaussian mixtures via denoised method of moments
Limiting spectral distribution of the product of truncated Haar unitary matrices
ArticulatedFusion: Real-time Reconstruction of Motion, Geometry and Segmentation Using a Single Depth Camera
Chest X-rays Classification: A Multi-Label and Fine-Grained Problem
Normalization of ternary generalized pseudostandard words
Ricci-flat graphs with girth four
Towards Explainable and Controllable Open Domain Dialogue Generation with Dialogue Acts
Visual Domain Adaptation with Manifold Embedded Distribution Alignment
Machine Learning Based Featureless Signalling
A hybrid algorithm for the two-trust-region subproblem
On the modular Erdös-Burgess constant
Simple robust genomic prediction and outlier detection for a multi-environmental field trial
Searching for network modules
In pixels we trust: From Pixel Labeling to Object Localization and Scene Categorization
Label Aggregation via Finding Consensus Between Models
Deep Sequential Multi-camera Feature Fusion for Person Re-identification
Mr. DLib’s Living Lab for Scholarly Recommendations
SPDEs with Space-Mean Dynamics
Deep Adaptive Proposal Network for Object Detection in Optical Remote Sensing Images
Quantifying Volatility Reduction in German Day-ahead Spot Market in the Period 2006 through 2016
Sequence to Logic with Copy and Cache
On the Phase Tracking Reference Signal (PT-RS) Design for 5G New Radio (NR)
QoS and Coverage Aware Dynamic High Density Vehicle Platooning (HDVP)
Birkhoff-von Neumann Graphs that are PM-compact
Automated Phenotyping of Epicuticular Waxes of Grapevine Berries Using Light Separation and Convolutional Neural Networks
Indexing Execution Patterns in Workflow Provenance Graphs through Generalized Trie Structures
Generative Adversarial Networks for MR-CT Deformable Image Registration
Can We Assess Mental Health through Social Media and Smart Devices? Addressing Bias in Methodology and Evaluation
MITK-ModelFit: generic open-source framework for model fits and their exploration in medical imaging – design, implementation and application on the example of DCE-MRI
Test-time augmentation with uncertainty estimation for deep learning-based medical image segmentation
Stochastic Quantization for the Edwards Measure of Fractional Brownian Motion with $Hd=1$
Green function of a random walk in a cone
Speeding up the Hyperparameter Optimization of Deep Convolutional Neural Networks
Revisiting Cross Modal Retrieval
On some special classes of contact $B_0$-VPG graphs
An entropy generation formula on $RCD(K,\infty)$ spaces
Fuzzy quantification for linguistic data analysis and data mining
ISIC 2018-A Method for Lesion Segmentation
Localization of disordered harmonic chain with long-range correlation
Image Reconstruction via Variational Network for Real-Time Hand-Held Sound-Speed Imaging
Delay and Communication Tradeoffs for Blockchain Systems with Lightweight IoT Clients
Modeling Visual Context is Key to Augmenting Object Detection Datasets
Semi-Dense 3D Reconstruction with a Stereo Event Camera
Selective Zero-Shot Classification with Augmented Attributes
On the almost-principal minors of a symmetric matrix
Can Artificial Intelligence Reliably Report Chest X-Rays?: Radiologist Validation of an Algorithm trained on 1.2 Million X-Rays
On the Sweep Map for Fuss Rational Dyck Paths
Two algorithms for a fully coupled and consistently macroscopic PDE-ODE system modeling a moving bottleneck on a road
Conditional Random Fields as Recurrent Neural Networks for 3D Medical Imaging Segmentation
Stochastic Model Predictive Control with Discounted Probabilistic Constraints
Guided Upsampling Network for Real-Time Semantic Segmentation
Three for one and one for three: Flow, Segmentation, and Surface Normals
Prophet Secretary Through Blind Strategies
Robust Oil-spill Forensics and Petroleum Source Differentiation using Quantized Peak Topography Maps
A Microservice-enabled Architecture for Smart Surveillance using Blockchain Technology
Edge colourings and topological graph polynomials
Improving Simple Models with Confidence Profiles
Finding Minimum Volume Circumscribing Ellipsoids Using Copositive Programming
A Strategy of MR Brain Tissue Images’ Suggestive Annotation Based on Modified U-Net
Harmonic functions on mated-CRT maps
Hybrid scene Compression for Visual Localization
An invariance principle for ergodic scale-free random environments
Exact Algorithms for Finding Well-Connected 2-Clubs in Real-World Graphs: Theory and Experiments
Using Deep Neural Networks to Translate Multi-lingual Threat Intelligence
Exact asymptotics for Duarte and supercritical rooted kinetically constrained models
Bio-Measurements Estimation and Support in Knee Recovery through Machine Learning
Emulating malware authors for proactive protection using GANs over a distributed image visualization of the dynamic file behavior
Optimal Las Vegas Approximate Near Neighbors in $\ell_p$
Self-Organizing Maps as a Storage and Transfer Mechanism in Reinforcement Learning
Limited Memory Kelley’s Method Converges for Composite Convex and Submodular Objectives
Attention-Guided Curriculum Learning for Weakly Supervised Classification and Localization of Thoracic Diseases on Chest Radiographs
Positional Value in Soccer: Expected League Points Added above Replacement
An expansion formula for type A and Kronecker quantum cluster algebras
A unified theory of adaptive stochastic gradient descent as Bayesian filtering
Partial recovery bounds for clustering with the relaxed $K$means
Realization Spaces of Uniform Phased Matroids
A geometric integration approach to nonsmooth, nonconvex optimisation
Transfer Learning for Action Unit Recognition
Capsule Networks against Medical Imaging Data Challenges
Compositional GAN: Learning Conditional Image Composition
Nested Covariance Determinants and Restricted Trek Separation in Gaussian Graphical Models
A linear-time algorithm for generalized trust region problems

Magister Dixit

“It’s not enough to tell someone, ‘This is done by boosted decision trees, and that’s the best classification algorithm, so just trust me, it works.’ As a builder of these applications, you need to understand what the algorithm is doing in order to make it better. As a user who ultimately consumes the results, it can be really frustrating to not understand how they were produced. When we worked with analysts in Windows or in Bing, we were analyzing computer system logs. That’s very difficult for a human being to understand. We definitely had to work with the experts who understood the semantics of the logs in order to make progress. They had to understand what the machine learning algorithms were doing in order to provide useful feedback. … It really comes back to this big divide, this bottleneck, between the domain expert and the machine learning expert. I saw that as the most challenging problem facing us when we try to really make machine learning widely applied in the world. I saw both machine learning experts and domain experts as being difficult to scale up. There’s only a few of each kind of expert produced every year. I thought, how can I scale up machine learning expertise? I thought the best thing that I could do is to build software that doesn’t take a machine learning expert to use, so that the domain experts can use them to build their own applications. That’s what prompted me to do research in automating machine learning while at MSR [Microsoft Research].” Alice Zheng (2015)

If you did not already know

Model Features google
A key question in Reinforcement Learning is which representation an agent can learn to efficiently reuse knowledge between different tasks. Recently the Successor Representation was shown to have empirical benefits for transferring knowledge between tasks with shared transition dynamics. This paper presents Model Features: a feature representation that clusters behaviourally equivalent states and that is equivalent to a Model-Reduction. Further, we present a Successor Feature model which shows that learning Successor Features is equivalent to learning a Model-Reduction. A novel optimization objective is developed and we provide bounds showing that minimizing this objective results in an increasingly improved approximation of a Model-Reduction. Further, we provide transfer experiments on randomly generated MDPs which vary in their transition and reward functions but approximately preserve behavioural equivalence between states. These results demonstrate that Model Features are suitable for transfer between tasks with varying transition and reward functions. …
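For context, the Successor Representation referenced above predicts discounted expected future state features; in the successor-feature form standard in this literature,

$\psi^\pi(s) = \mathbb{E}_\pi\!\big[\textstyle\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid s_0 = s\big]$,

so two states with equal $\psi$ are behaviourally interchangeable for prediction, which is the clustering idea behind Model Features.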

Penalized Splines of Propensity Prediction (PSPP) google
Little and An (2004, Statistica Sinica 14, 949-968) proposed a penalized spline of propensity prediction (PSPP) method of imputation of missing values that yields robust model-based inference under the missing at random assumption. The propensity score for a missing variable is estimated and a regression model is fitted that includes the spline of the estimated logit propensity score as a covariate. The predicted unconditional mean of the missing variable has a double robustness (DR) property under misspecification of the imputation model. We show that a simplified version of PSPP, which does not center other regressors prior to including them in the prediction model, also has the DR property. We also propose two extensions of PSPP, namely, stratified PSPP and bivariate PSPP, that extend the DR property to inferences about conditional means. These extended PSPP methods are compared with the PSPP method and simple alternatives in a simulation study and applied to an online weight loss study conducted by Kaiser Permanente.
‘Robust-squared’ Imputation Models Using BART
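A minimal sketch of the PSPP recipe in R using mgcv's penalized splines; the data frame d, response y, response indicator ry, and covariates x1, x2 are hypothetical, and this illustrates the idea rather than Little and An's implementation:

library(mgcv)
# Step 1: estimate the propensity of y being observed
ps <- glm(ry ~ x1 + x2, family = binomial, data = d)
d$logitp <- qlogis(fitted(ps))                     # logit of estimated propensity
# Step 2: regress y on a penalized spline of the logit propensity,
# plus other predictors, fitted on the observed cases
imp <- gam(y ~ s(logitp) + x2, data = subset(d, ry == 1))
# Step 3: impute the missing values from the fitted prediction model
d$y[d$ry == 0] <- predict(imp, newdata = subset(d, ry == 0))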


RuleMatrix google
With the growing adoption of machine learning techniques, there is a surge of research interest towards making machine learning systems more transparent and interpretable. Various visualizations have been developed to help model developers understand, diagnose, and refine machine learning models. However, a large number of potential but neglected users are the domain experts with little knowledge of machine learning but are expected to work with machine learning systems. In this paper, we present an interactive visualization technique to help users with little expertise in machine learning to understand, explore and validate predictive models. By viewing the model as a black box, we extract a standardized rule-based knowledge representation from its input-output behavior. We design RuleMatrix, a matrix-based visualization of rules to help users navigate and verify the rules and the black-box model. We evaluate the effectiveness of RuleMatrix via two use cases and a usability study. …

Distilled News

Echoes of the Future

This report discusses ways to combine graphics output from the ‘graphics’ package and the ‘grid’ package in R and introduces a new function echoGrob in the ‘gridGraphics’ package.
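A minimal sketch of the package's basic workflow; grid.echo() is the established gridGraphics entry point, while echoGrob is the new function the report introduces, so consult the report for its exact usage:

library(gridGraphics)
plot(1:10, main = "drawn with base graphics")
grid.echo()        # replay the base-graphics plot as grid output
grid::grid.ls()    # the echoed output is now addressable grid grobs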


6 data analytics trends that will dominate 2018

As businesses transform into data-driven enterprises, data technologies and strategies need to start delivering value. Here are six data analytics trends to watch in the months ahead.
• Data lakes will need to demonstrate business value or die
• The CDO will come of age
• Rise of the data curator
• Data governance strategies will be key themes for all C-level executives
• The proliferation of metadata management continues
• Predictive analytics helps improve data quality


The end of errors in ANOVA reporting

Psychology is still (unfortunately) massively using analysis of variance (ANOVA). Despite its relative simplicity, I am very often confronted with errors in its reporting, for instance in students' theses or manuscripts. Beyond the incomplete, incomprehensible or just wrong reporting, one can find a tremendous amount of genuine errors (that could influence the results and their interpretation), even in published papers! (See the excellent statcheck to quickly check the stats of a paper.) This error proneness can be at least partially explained by the fact that copy/pasting the (appropriate) values from any statistical software and formatting them textually is a very annoying process. How can we end it?
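One way out is to generate the report string programmatically; a minimal sketch in R (APA-like formatting for the one-way case only, illustrative rather than a full reporting package):

report_anova <- function(fit) {
  tab <- summary(fit)[[1]]                 # the ANOVA table
  sprintf("%s: F(%d, %d) = %.2f, p = %.3f",
          trimws(rownames(tab)[1]), tab$Df[1], tab$Df[nrow(tab)],
          tab$`F value`[1], tab$`Pr(>F)`[1])
}

fit <- aov(len ~ supp, data = ToothGrowth)
report_anova(fit)
# "supp: F(1, 58) = 3.67, p = 0.060"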


Top 8 Sites To Make Money By Uploading Files

Pay-to-upload sites have been really popular these days and are indeed an easy way to make a few bucks online. It's one of those ways to make money online that doesn't need any effort on your part. Each time you upload your files to their servers and someone downloads them, you get paid a certain amount.


Amazon Alexa and Accented English

Earlier this spring, one of my data science friends here in SLC got in contact with me about some fun analysis. My friend Dylan Zwick is a founder at Pulse Labs, a voice-testing startup, and they were chatting with the Washington Post about a piece on how devices like Amazon Alexa deal with accented English. The piece is published today in the Washington Post and turned out really interesting! Let's walk through the analysis I did for Dylan and Pulse Labs.


Explaining Black-Box Machine Learning Models – Code Part 1: tabular data + caret + iml

This is code that will accompany an article that will appear in a special edition of a German IT magazine. The article is about explaining black-box machine learning models. In that article I'm showcasing three practical examples (a minimal sketch of the first follows the list):
1. Explaining supervised classification models built on tabular data using caret and the iml package
2. Explaining image classification models with keras and lime
3. Explaining text classification models with lime
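A minimal sketch of the first example under stated assumptions (random forest via caret, which needs the randomForest package; permutation feature importance via iml; iris used only for brevity):

library(caret)
library(iml)

idx   <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train <- iris[idx, ]
test  <- iris[-idx, ]

model <- train(Species ~ ., data = train, method = "rf")

pred <- Predictor$new(model, data = test[, -5], y = test$Species)
imp  <- FeatureImp$new(pred, loss = "ce")   # permutation importance, cross-entropy
plot(imp)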


Benchmarking Feature Selection Algorithms with Xy()

Feature Selection is one of the most interesting fields in machine learning in my opinion. It is a boundary point of two different perspectives on machine learning – performance and inference. From a performance point of view, feature selection is typically used to increase the model performance or to reduce the complexity of the problem in order to optimize computational efficiency. From an inference perspective, it is important to extract variable importance to identify key drivers of a problem. Many people argue that in the era of deep learning feature selection is not important anymore. As a method of representation learning, deep learning models can find important features of the input data on their own. Those features are basically nonlinear transformations of the input data space. However, not every problem is suited to be approached with neural nets (actually, many problems). In many practical ML applications feature selection plays a key role on the road to success.
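A minimal sketch of the benchmarking idea without Xy() itself (whose exact interface is not shown here): simulate data with a known informative set, then check what a selector such as the lasso recovers.

library(glmnet)
set.seed(42)
n <- 500; p <- 30
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
truth <- paste0("x", 1:5)                        # only x1..x5 carry signal
y <- drop(X[, truth] %*% runif(5, 1, 2) + rnorm(n))

cvfit <- cv.glmnet(X, y)                         # cross-validated lasso path
cf <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected <- rownames(cf)[cf[, 1] != 0 & rownames(cf) != "(Intercept)"]
selected                                         # compare against `truth`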


Causation in a Nutshell

Knowing the who, what, when, where, etc., is vital in marketing. Predictive analytics can also be useful for many organizations. However, also knowing the why helps us better understand the who, what, when, where, and so on, and the ways they are tied together. It also helps us predict them more accurately. Knowing the why increases their value to marketers and increases the value of marketing. Analysis of causation can be challenging, though, and there are differences of opinion among authorities. The statistical orthodoxy is that randomized experiments are the best approach. Experiments in many cases are infeasible or unethical, however. They also can be botched or be so artificial that they do not generalize to real world conditions. They may also fail to replicate. They are not magic.


Autoencoder as a Classifier using Fashion-MNIST Dataset

In this tutorial, you will learn & understand how to use autoencoder as a classifier in Python with Keras. You’ll be using Fashion-MNIST dataset as an example.


Receiver Operating Characteristic Curves Demystified (in Python)

In Data Science, evaluating model performance is very important and the most commonly used performance metric is the classification score. However, when dealing with fraud datasets with heavy class imbalance, a classification score does not make much sense. Instead, Receiver Operating Characteristic or ROC curves offer a better alternative. ROC is a plot of signal (True Positive Rate) against noise (False Positive Rate). The model performance is determined by looking at the area under the ROC curve (or AUC). The best possible AUC is 1 while the worst is 0.5 (the 45-degree random line). Any value less than 0.5 means we can simply do the exact opposite of what the model recommends to get the value back above 0.5. While ROC curves are common, there aren't that many pedagogical resources out there explaining how it is calculated or derived. In this blog, I will reveal, step by step, how to plot an ROC curve using Python. After that, I will explain the characteristics of a basic ROC curve.
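The blog's walkthrough is in Python, but the arithmetic is identical in R; a minimal sketch computing the curve and AUC by hand on simulated scores:

roc_points <- function(score, label) {
  thr <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thr, function(t) mean(score[label == 1] >= t))  # signal
  fpr <- sapply(thr, function(t) mean(score[label == 0] >= t))  # noise
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

set.seed(1)
lab <- rbinom(200, 1, 0.5)
sco <- rnorm(200, mean = lab)              # positives score higher on average
r <- roc_points(sco, lab)
auc <- sum(diff(r$fpr) * (head(r$tpr, -1) + tail(r$tpr, -1)) / 2)  # trapezoid rule
plot(r$fpr, r$tpr, type = "l"); abline(0, 1, lty = 2)
auc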


Opportunity: data lakes offer a ‘360-degree view’ to an organisation

Data lakes provide a solution for businesses looking to harness the power of data. Stuart Wells, executive vice president, chief product and technology officer at FICO, discusses with Information Age how approaching data in this way can lead to better business decisions.


Using the AWS Glue Data Catalog as the Metastore for Hive

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. The AWS Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore. AWS Glue crawlers can automatically infer schema from source data in Amazon S3 and store the associated metadata in the Data Catalog. For more information about the Data Catalog, see Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.
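On Amazon EMR this is driven by configuration; a minimal sketch of the hive-site classification that points Hive at the Glue Data Catalog (cluster details omitted; verify the exact form against the AWS Glue Developer Guide):

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]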

Document worth reading: “Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning”

The goal of this tutorial is to introduce key models, algorithms, and open questions related to the use of optimization methods for solving problems arising in machine learning. It is written with an INFORMS audience in mind, specifically those readers who are familiar with the basics of optimization algorithms, but less familiar with machine learning. We begin by deriving a formulation of a supervised learning problem and show how it leads to various optimization problems, depending on the context and underlying assumptions. We then discuss some of the distinctive features of these optimization problems, focusing on the examples of logistic regression and the training of deep neural networks. The latter half of the tutorial focuses on optimization algorithms, first for convex logistic regression, for which we discuss the use of first-order methods, the stochastic gradient method, variance reducing stochastic methods, and second-order methods. Finally, we discuss how these approaches can be employed to the training of deep neural networks, emphasizing the difficulties that arise from the complex, nonconvex structure of these models. Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning
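For readers skimming, the stochastic gradient method at the center of the tutorial is the update

$w_{k+1} = w_k - \alpha_k \nabla f_{i_k}(w_k)$,

where $i_k$ indexes a training example (or mini-batch) sampled at iteration $k$ and $\alpha_k$ is the stepsize.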

R Packages worth a look

Taxicab Correspondence Analysis (TaxicabCA)
Computation and visualization of Taxicab Correspondence Analysis, Choulakian (2006) <doi:10.1007/s11336-004-1231-4>. Classical correspondence ana …

Two-Group Ta-Test (tatest)
The ta-test is a modified two-sample or two-group t-test of Gosset (1908). In small samples with less than 15 replicates, the ta-test significantly redu …

Unified Interface to Distance, Dissimilarity, Similarity Matrices (disto)
Provides a high level API to interface over sources storing distance, dissimilarity, similarity matrices with matrix style extraction, replacement and …

Cooperative Aspects of Linear Production Programming Problems (coopProductGame)
Computes cooperative game and allocation rules associated with linear production programming problems.

GreedyExperimentalDesign JARs (GreedyExperimentalDesignJARs)
These are GreedyExperimentalDesign Java dependency libraries. Note: this package has no functionality of its own and should not be installed as a stand …

Document worth reading: “Does modelling need a Reformation? Ideas for a new grammar of modelling”

The quality of mathematical modelling is looked at from the perspective of science’s own quality control arrangement and recent crises. It is argued that the crisis in the quality of modelling is at least as serious as that which has come to light in fields such as medicine, economics, psychology, and nutrition. In the context of the nascent sociology of quantification, the linkages between big data, algorithms, mathematical and statistical modelling (use and misuse of p-values) are evident. Looking at existing proposals for best practices, the suggestion is put forward that the field needs a thorough Reformation, leading to a new grammar for modelling. Quantitative methodologies such as uncertainty and sensitivity analysis can form the bedrock on which the new grammar is built, while incorporating important normative and ethical elements. To this effect we introduce sensitivity auditing, quantitative storytelling, and ethics of quantification. Does modelling need a Reformation? Ideas for a new grammar of modelling