R Packages worth a look

Access Domains and Search Popular Websites (websearchr)
Functions that allow for accessing domains and a number of search engines.

Hyphenation and Syllable Counting for Text Analysis (sylly)
Provides the hyphenation algorithm used by ‘TeX’/‘LaTeX’ and similar software, as proposed by Liang (1983, <https://…/> ). Mainly provides the function hyphen() for hyphenation/syllable counting of text objects. It was originally developed for, and was part of, the ‘koRpus’ package, but was later released as a separate package so that this functionality is available to other packages with a lighter footprint. Support for various languages needs to be added on the fly or by plugin packages; this package does not include any language-specific data. Due to some restrictions on CRAN, the full package sources are only available from the project homepage. To ask for help, report bugs, request features, or discuss the development of the package, please subscribe to the koRpus-dev mailing list.
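Liang's method matches digit-weighted substring patterns against the word and allows a break wherever the accumulated inter-letter weight is odd. A minimal sketch with a made-up toy pattern (real pattern sets are language-specific and, as noted, not bundled with sylly):

```python
def hyphenate(word, patterns, margin=1):
    """Liang-style pattern hyphenation (toy patterns, not TeX's real ones).
    Odd inter-letter weights permit a break; even weights inhibit one."""
    w = "." + word.lower() + "."          # dots mark word boundaries
    points = [0] * (len(w) + 1)           # points[p] = weight of the gap before w[p]
    for pat in patterns:
        letters = "".join(c for c in pat if not c.isdigit())
        digits = [0] * (len(letters) + 1)
        i = 0
        for c in pat:
            if c.isdigit():
                digits[i] = int(c)
            else:
                i += 1
        for start in range(len(w) - len(letters) + 1):
            if w[start:start + len(letters)] == letters:
                for j, d in enumerate(digits):
                    points[start + j] = max(points[start + j], d)
    # the gap between word[k-1] and word[k] sits just before w[k+1]
    breaks = [k for k in range(margin, len(word) - margin + 1)
              if points[k + 1] % 2 == 1]
    pieces, last = [], 0
    for k in breaks:
        pieces.append(word[last:k])
        last = k
    pieces.append(word[last:])
    return "-".join(pieces)
```

Syllable counting, the main use of hyphen() in sylly, then falls out as one plus the number of break points found.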

Optimal Design and Statistical Power of Cost-Efficient Multilevel Randomized Trials (odr)
Calculates the optimal sample allocation that minimizes the variance of the treatment effect in a multilevel randomized trial under a fixed budget and cost structure, and performs power analyses with and without accommodating costs and budget. The reference for the proposed methods is: Shen, Z., & Kelcey, B. (under review). Optimal design of cluster randomized trials under condition- and unit-specific cost structures. 2018 American Educational Research Association (AERA) annual conference.

Fit and Predict a Gaussian Process Model with (Time-Series) Binary Response (binaryGP)
Allows estimation and prediction for a binary Gaussian process model. The mean function can be assumed to have a time-series structure. The estimation methods for the unknown parameters are based on penalized quasi-likelihood/penalized quasi-partial likelihood and restricted maximum likelihood. The predicted probability and its confidence interval are computed by the Metropolis-Hastings algorithm. More details can be found in Sung et al. (2017) <arXiv:1705.02511>.
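The Metropolis-Hastings step mentioned above is generic MCMC machinery; as an illustrative sketch (not the package's implementation), a random-walk sampler for a one-dimensional log-density looks like:

```python
import math
import random

def metropolis(logpdf, x0, n, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: propose x + N(0, step^2) and accept
    with probability min(1, pi(x')/pi(x)), computed on the log scale."""
    rng = random.Random(seed)
    x, lp = x0, logpdf(x0)
    samples = []
    for _ in range(n):
        prop = x + rng.gauss(0.0, step)
        lp_prop = logpdf(prop)
        # the + 1e-300 guards against log(0) on the rare draw of exactly 0.0
        if math.log(rng.random() + 1e-300) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples.append(x)
    return samples

# e.g. draw from a standard normal target, log-density -x^2/2 up to a constant
draws = metropolis(lambda x: -x * x / 2, 0.0, 20000)
```

In binaryGP the chain targets the latent process, from which predicted probabilities and interval endpoints are summarized.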

Distributions and Gradients (dng)
Provides density, distribution function, quantile function and random generation for the split-t distribution, and computes the mean, variance, skewness and kurtosis for the split-t distribution (Li, F, Villani, M. and Kohn, R. (2010) <doi:10.1016/j.jspi.2010.04.031>).
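As an illustration of the two-piece idea behind the split-t (this is a generic two-piece construction with separate scales on each side of the mode, not necessarily the package's exact parameterization, which follows Li, Villani and Kohn 2010):

```python
import math

def t_pdf(x, nu):
    """Standard Student-t density with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def split_t_pdf(x, mu, sigma_l, sigma_r, nu):
    """Two-piece t: scale sigma_l left of the mode mu, sigma_r to the right.
    The common 2/(sigma_l + sigma_r) factor keeps the density integrating to 1
    and continuous at the mode."""
    scale = sigma_l if x < mu else sigma_r
    return 2.0 / (sigma_l + sigma_r) * t_pdf((x - mu) / scale, nu)
```

With sigma_r > sigma_l the density is right-skewed, which is how the skewness parameter of the split-t acts.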

Simulation Based Inference of Lasso Estimator (EAinference)
Estimator augmentation methods for statistical inference on high-dimensional data, as described in Zhou, Q. (2014) <doi:10.1080/01621459.2014.946035> and Zhou, Q. and Min, S. (2017) <doi:10.1214/17-EJS1309>. It provides several simulation-based inference methods: (a) Gaussian and wild multiplier bootstrap for lasso, group lasso, scaled lasso, scaled group lasso and their de-biased estimators, (b) importance sampler for approximating p-values in these methods, (c) Markov chain Monte Carlo lasso sampler with applications in post-selection inference.


Document worth reading: “Resource Elasticity for Distributed Data Stream Processing: A Survey and Future Directions”

Under several emerging application scenarios, such as in smart cities, operational monitoring of large infrastructures, and the Internet of Things, continuous data streams must be processed under very short delays. Several solutions, including multiple software engines, have been developed for processing unbounded data streams in a scalable and efficient manner. This paper surveys the state of the art in stream processing engines and mechanisms for exploiting resource elasticity features of cloud computing in stream processing. Resource elasticity allows an application or service to scale out/in according to fluctuating demands. Although such features have been extensively investigated for enterprise applications, stream processing poses challenges for achieving elastic systems that can make efficient resource management decisions based on current load. This work examines some of these challenges and discusses solutions proposed in the literature to address them. Resource Elasticity for Distributed Data Stream Processing: A Survey and Future Directions

If you did not already know

Kanri Distance (KDC) google
Kanri’s proprietary combination of patented statistical and process methods provides a uniquely powerful and insightful ability to evaluate large data sets with multiple variables. While many tools evaluate patterns and dynamics for large data, only the Kanri Distance Calculator allows users to understand where they stand with respect to a desired target state and the specific contribution of each variable toward the overall distance from the target state. The Kanri model not only calculates the relationship of variables within the overall data set, but more importantly mathematically teases out the interaction between each of them. This combination of relational insights fuels Kanri’s breakthrough distance calculator. It answers the question “In a world of exponentially expanding data, how do I find the variables that will solve my problem?” and helps to reach that conclusion quickly. But the Kanri model does not stop there. Kanri tells you exactly, formulaically, how much each variable contributes. The Kanri Distance Calculator opens a new world of solution development possibilities that can apply the power of massive data sets to an individual…or to an individualized objective. …

Probably Approximately Correct Learning (PAC Learning) google
In computational learning theory, probably approximately correct learning (PAC learning) is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the “probably” part), the selected function will have low generalization error (the “approximately correct” part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples. The model was later extended to treat noise (misclassified samples). An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning. In particular, the learner is expected to find efficient functions (time and space requirements bounded to a polynomial of the example size), and the learner itself must implement an efficient procedure (requiring an example count bounded to a polynomial of the concept size, modified by the approximation and likelihood bounds). …
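The trade-off between the approximation ratio ε, the failure probability δ, and the number of samples is concrete in the standard bound for a finite hypothesis class in the realizable case, m ≥ (1/ε)(ln|H| + ln(1/δ)); a minimal sketch:

```python
import math

def pac_sample_bound(h_size, eps, delta):
    """Samples sufficient so that, with probability >= 1 - delta, any
    hypothesis consistent with the data from a finite class of size h_size
    has generalization error <= eps (realizable case)."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

print(pac_sample_bound(1000, 0.1, 0.05))  # 100 examples suffice here
```

Note the polynomial dependence on 1/ε and ln(1/δ), which is exactly the "efficient procedure" requirement the abstract describes.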

Local Projections google
In this paper, we propose a novel approach for outlier detection, called local projections, which is based on concepts of the Local Outlier Factor (LOF) (Breunig et al., 2000) and RobPCA (Hubert et al., 2005). By using aspects of both methods, our algorithm is robust towards noise variables and is capable of performing outlier detection in multi-group situations. Further, we do not rely on a specific underlying data distribution. For each observation of a dataset, we identify a local group of dense nearby observations, which we call a core, based on a modification of the k-nearest neighbours algorithm. By projecting the dataset onto the space spanned by those observations, two aspects are revealed. First, we can analyze the distance from an observation to the center of the core within the projection space in order to provide a measure of quality of description of the observation by the projection. Second, we consider the distance of the observation to the projection space in order to assess the suitability of the core for describing the outlyingness of the observation. These novel interpretations lead to a univariate measure of outlyingness based on aggregations over all local projections, which outperforms LOF and RobPCA as well as other popular methods like PCOut (Filzmoser et al., 2008) and subspace-based outlier detection (Kriegel et al., 2009) in our simulation setups. Experiments in the context of real-world applications employing datasets of various dimensionality demonstrate the advantages of local projections. …

Book Memo: “Beautiful Data”

The Stories Behind Elegant Data Solutions
In this insightful book, you’ll learn from the best data practitioners in the field just how wide-ranging – and beautiful – working with data can be. Join 39 contributors as they explain how they developed simple and elegant solutions on projects ranging from the Mars lander to a Radiohead video. With Beautiful Data, you will:
• Explore the opportunities and challenges involved in working with the vast number of datasets made available by the Web
• Learn how to visualize trends in urban crime, using maps and data mashups
• Discover the challenges of designing a data processing system that works within the constraints of space travel
• Learn how crowdsourcing and transparency have combined to advance the state of drug research
• Understand how new data can automatically trigger alerts when it matches or overlaps pre-existing data
• Learn about the massive infrastructure required to create, capture, and process DNA data
That’s only a small sample of what you’ll find in Beautiful Data. For anyone who handles data, this is a truly fascinating book.

What’s new on arXiv

Practical Machine Learning for Cloud Intrusion Detection: Challenges and the Way Forward

Operationalizing machine learning based security detections is extremely challenging, especially in a continuously evolving cloud environment. Conventional anomaly detection does not produce satisfactory results for analysts that are investigating security incidents in the cloud. Model evaluation alone presents its own set of problems due to a lack of benchmark datasets. When deploying these detections, we must deal with model compliance, localization, and data silo issues, among many others. We pose the problem of ‘attack disruption’ as a way forward in the security data science space. In this paper, we describe the framework, challenges, and open questions surrounding the successful operationalization of machine learning based security detections in a cloud environment and provide some insights on how we have addressed them.

Deconvolutional Latent-Variable Model for Text Sequence Matching

A latent-variable model is introduced for text matching, inferring sentence representations by jointly optimizing generative and discriminative objectives. To alleviate typical optimization challenges in latent-variable models for text, we employ deconvolutional networks as the sequence decoder (generator), providing learned latent codes with more semantic information and better generalization. Our model, trained in an unsupervised manner, yields stronger empirical predictive performance than a decoder based on Long Short-Term Memory (LSTM), with fewer parameters and considerably faster training. Further, we apply it to text sequence-matching problems. The proposed model significantly outperforms several strong sentence-encoding baselines, especially in the semi-supervised setting.

Feature Engineering for Predictive Modeling using Reinforcement Learning

Feature engineering is a crucial step in the process of predictive modeling. It involves the transformation of a given feature space, typically using mathematical functions, with the objective of reducing the modeling error for a given target. However, there is no well-defined basis for performing effective feature engineering. It involves domain knowledge, intuition, and most of all, a lengthy process of trial and error. The human attention involved in overseeing this process significantly influences the cost of model generation. We present a new framework to automate feature engineering. It is based on performance-driven exploration of a transformation graph, which systematically and compactly enumerates the space of given options. A highly efficient exploration strategy is derived through reinforcement learning on past examples.

Lazy stochastic principal component analysis

Stochastic principal component analysis (SPCA) has become a popular dimensionality reduction strategy for large, high-dimensional datasets. We derive a simplified algorithm, called Lazy SPCA, which has reduced computational complexity and is better suited for large-scale distributed computation. We prove that SPCA and Lazy SPCA find the same approximations to the principal subspace, and that the pairwise distances between samples in the lower-dimensional space are invariant to whether SPCA is executed lazily or not. Empirical studies find downstream predictive performance to be identical for both methods, and superior to random projections, across a range of predictive models (linear regression, logistic lasso, and random forests). In our largest experiment with 4.6 million samples, Lazy SPCA reduced 43.7 hours of computation to 9.9 hours. Overall, Lazy SPCA relies exclusively on matrix multiplications, besides an operation on a small square matrix whose size depends only on the target dimensionality.

Handling Factors in Variable Selection Problems

Factors are categorical variables, and the values which these variables assume are called levels. In this paper, we consider the variable selection problem where the set of potential predictors contains both factors and numerical variables. Formally, this problem is a particular case of the standard variable selection problem where factors are coded using dummy variables. As such, the Bayesian solution would be straightforward and, possibly because of this, the problem, despite its importance, has not received much attention in the literature. Nevertheless, we show that this perception is illusory and that in fact several inputs like the assignment of prior probabilities over the model space or the parameterization adopted for factors may have a large (and difficult to anticipate) impact on the results. We provide a solution to these issues that extends the proposals in the standard variable selection problem and does not depend on how the factors are coded using dummy variables. Our approach is illustrated with a real example concerning a childhood obesity study in Spain.
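The dummy-variable coding referenced above can be sketched as follows; dropping the first level is one common convention for avoiding collinearity with an intercept (the paper's point is precisely that results should not depend on such coding choices):

```python
def dummy_code(values, drop_first=True):
    """One-hot (dummy) code a factor: one indicator column per retained level.
    With drop_first=True the first (sorted) level becomes the baseline."""
    levels = sorted(set(values))
    used = levels[1:] if drop_first else levels
    return [[1 if v == lev else 0 for lev in used] for v in values]
```

For a three-level factor this yields two indicator columns, so the factor consumes two model degrees of freedom rather than one.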

Class-Splitting Generative Adversarial Networks

Generative Adversarial Networks (GANs) produce systematically better quality samples when class label information is provided, i.e., in the conditional GAN setup. This is still observed for the recently proposed Wasserstein GAN formulation, which stabilized adversarial training and allows considering high capacity network architectures such as ResNet. In this work we show how to boost a conditional GAN by augmenting the available class labels. The new classes come from clustering in the representation space learned by the same GAN model. The proposed strategy is also feasible when no class information is available, i.e., in the unsupervised setup. Our generated samples reach state-of-the-art Inception scores for the CIFAR-10 and STL-10 datasets in both the supervised and unsupervised setup.

Neural Optimizer Search with Reinforcement Learning

We present an approach to automate the process of discovering optimization methods, with a focus on deep learning architectures. We train a Recurrent Neural Network controller to generate a string in a domain specific language that describes a mathematical update equation based on a list of primitive functions, such as the gradient, running average of the gradient, etc. The controller is trained with Reinforcement Learning to maximize the performance of a model after a few epochs. On CIFAR-10, our method discovers several update rules that are better than many commonly used optimizers, such as Adam, RMSProp, or SGD with and without Momentum on a ConvNet model. We introduce two new optimizers, named PowerSign and AddSign, which we show transfer well and improve training on a variety of different tasks and architectures, including ImageNet classification and Google’s neural machine translation system.
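As a hedged sketch of the reported AddSign rule (the exact form here is an assumption based on common descriptions of the paper: the step is amplified when the gradient's sign agrees with the sign of its running average, and damped when they disagree), applied to a 1-D quadratic:

```python
def sign(x):
    return (x > 0) - (x < 0)

def addsign_minimize(grad, x0, lr=0.05, beta=0.9, steps=100):
    """AddSign-style update (assumed form): scale the gradient by
    (1 + sign(g) * sign(m)), where m is a running average of gradients."""
    x, m = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        m = beta * m + (1 - beta) * g          # running average of gradients
        x -= lr * (1 + sign(g) * sign(m)) * g  # amplified when signs agree
    return x

x_min = addsign_minimize(lambda x: 2 * x, 3.0)  # minimize f(x) = x^2
```

On this toy problem the sign agreement doubles the effective step, so the iterate contracts geometrically toward the minimum at 0.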

Analyzing users’ sentiment towards popular consumer industries and brands on Twitter

Social media serves as a unified platform for users to express their thoughts on subjects ranging from their daily lives to their opinion on consumer brands and products. These users wield an enormous influence in shaping the opinions of other consumers and influence brand perception, brand loyalty and brand advocacy. In this paper, we analyze the opinion of 19M Twitter users towards 62 popular industries, encompassing 12,898 enterprise and consumer brands, as well as associated subject matter topics, via sentiment analysis of 330M tweets over a period spanning a month. We find that users tend to be most positive towards manufacturing and most negative towards service industries. In addition, they tend to be more positive or negative when interacting with brands than generally on Twitter. We also find that sentiment towards brands within an industry varies greatly and we demonstrate this using two industries as use cases. In addition, we discover that there is no strong correlation between topic sentiments of different industries, demonstrating that topic sentiments are highly dependent on the context of the industry that they are mentioned in. We demonstrate the value of such an analysis in order to assess the impact of brands on social media. We hope that this initial study will prove valuable for both researchers and companies in understanding users’ perception of industries, brands and associated topics and encourage more research in this field.

Uniquely labelled geodesics of Coxeter groups
Anisotropic Functional Fourier Deconvolution from indirect long-memory observations
Numerical reconstruction of the first band(s) in an inverse Hill’s problem
Extreme Value Estimation for Discretely Sampled Continuous Processes
Data-Driven Model Predictive Control of Autonomous Mobility-on-Demand Systems
Inter-Subject Analysis: Inferring Sparse Interactions with Dense Intra-Graphs
Minimum Covariance Determinant and Extensions
Multi-Resolution Functional ANOVA for Large-Scale, Many-Input Computer Experiments
Multi-camera Multi-Object Tracking
A Unified Approach to the Global Exactness of Penalty and Augmented Lagrangian Functions I: Parametric Exactness
Estimated Depth Map Helps Image Classification
A Deep-Reinforcement Learning Approach for Software-Defined Networking Routing Optimization
A Flocking-based Approach for Distributed Stochastic Optimization
On the Design of LQR Kernels for Efficient Controller Learning
On Compiling DNNFs without Determinism
Near Optimal Sketching of Low-Rank Tensor Regression
Covert Wireless Communication with Artificial Noise Generation
Persistence Flamelets: multiscale Persistent Homology for kernel density exploration
Talagrand Concentration Inequalities for Stochastic Partial Differential Equations
Supervised Learning with Indefinite Topological Kernels
On the Use of Machine Translation-Based Approaches for Vietnamese Diacritic Restoration
Statistical Methods for Ecological Breakpoints and Prediction Intervals
Cost Adaptation for Robust Decentralized Swarm Behaviour
Variational Memory Addressing in Generative Models
Irreversibility of mechanical and hydrodynamic instabilities
Discrete-Time Polar Opinion Dynamics with Susceptibility
Accelerating PageRank using Partition-Centric Processing
Deep Recurrent NMF for Speech Separation by Unfolding Iterative Thresholding
Hypergraph Theory: Applications in 5G Heterogeneous Ultra-Dense Networks
Maximal Moments and Uniform Modulus of Continuity for Stable Random Fields
The k-tacnode process
Fractional iterated Ornstein-Uhlenbeck Processes
Learning RBM with a DC programming Approach
Large Vocabulary Automatic Chord Estimation Using Deep Neural Nets: Design Framework, System Variations and Limitations
Local Private Hypothesis Testing: Chi-Square Tests
SceneCut: Joint Geometric and Object Segmentation for Indoor Scenes
Chromatic number, Clique number, and Lovász’s bound: In a comparison
Semi-Automated Nasal PAP Mask Sizing using Facial Photographs
SpectralFPL: Online Spectral Learning for Single Topic Models
Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization
Analysis of Wireless-Powered Device-to-Device Communications with Ambient Backscattering
Convergence characteristics of the generalized residual cutting method
Visual Question Generation as Dual Task of Visual Question Answering
Temporal Multimodal Fusion for Video Emotion Classification in the Wild
The size of $3$-uniform hypergraphs with given matching number and codegree
A First Derivative Potts Model for Segmentation and Denoising Using MILP
3D Deformable Object Manipulation using Fast Online Gaussian Process Regression
Human Pose Estimation using Global and Local Normalization
Self-Dual Codes better than the Gilbert–Varshamov bound
Convolutional neural networks that teach microscopes how to image
Learning Complex Swarm Behaviors by Exploiting Local Communication Protocols with Deep Reinforcement Learning
Bayesian nonparametric inference for the M/G/1 queueing systems based on the marked departure process
Neural network identification of people hidden from view with a single-pixel, single-photon detector
Sorting with Recurrent Comparison Errors
Real-time predictive maintenance for wind turbines using Big Data frameworks
Assumption-Based Approaches to Reasoning with Priorities
Hysteretic percolation from locally optimal decisions
A Communication-Efficient Distributed Data Structure for Top-k and k-Select Queries
The power of big data sparse signal detection tests on nonparametric detection boundaries
Yet Another ADNI Machine Learning Paper? Paving The Way Towards Fully-reproducible Research on Classification of Alzheimer’s Disease
On Composite Quantum Hypothesis Testing
A New Framework for $\mathcal{H}_2$-Optimal Model Reduction
Hybrid Beamforming Based on Implicit Channel State Information for Millimeter Wave Links
Speech Recognition Challenge in the Wild: Arabic MGB-3
Secure Energy Efficiency Optimization for MISO Cognitive Radio Network with Energy Harvesting
Blood-based metabolic signatures in Alzheimer’s disease
Alternating least squares as moving subspace correction
Spectral Asymptotics for Krein-Feller-Operators with respect to Random Recursive Cantor Measures
Connectedness of random set attractors
On the distribution of monochromatic complete subgraphs and arithmetic progressions
Influence of Clustering on Cascading Failures in Interdependent Systems
Down the Large Rabbit Hole
Playing for Benchmarks
AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection
Stochastic parameterization identification using ensemble Kalman filtering combined with expectation-maximization and Newton-Raphson maximum likelihood methods
Density of the set of probability measures with the martingale representation property
H-DenseUNet: Hybrid Densely Connected UNet for Liver and Liver Tumor Segmentation from CT Volumes
Symbolic Optimal Control
Efficient Column Generation for Cell Detection and Segmentation
Beyond the Sharp Null: Randomization Inference, Bounded Null Hypotheses, and Confidence Intervals for Maximum Effects
On the multi-dimensional elephant random walk
Extended-Alphabet Finite-Context Models
Retrofitting Concept Vector Representations of Medical Concepts to Improve Estimates of Semantic Similarity and Relatedness
Non-Depth-First Search against Independent Distributions on an AND-OR Tree
Stable-like fluctuations of Biggins’ martingales
Multi-label Pixelwise Classification for Reconstruction of Large-scale Urban Areas
On the precise determination of the Tsallis parameters in proton – proton collisions at LHC energies
Geometric SMOTE: Effective oversampling for imbalanced learning through a geometric extension of SMOTE
Distributed Submodular Minimization And Motion Coordination Over Discrete State Space
Berezinskii-Kosteriltz-Thouless transition in disordered multi-channel Luttinger liquids
If and When a Driver or Passenger is Returning to Vehicle: Framework to Infer Intent and Arrival Time
Urban Land Cover Classification with Missing Data Using Deep Convolutional Neural Networks
On Andrews–Warnaar’s identities of partial theta functions
A new ‘3D Calorimetry’ of hot nuclei
Inducing Distant Supervision in Suggestion Mining through Part-of-Speech Embeddings
Quantum Autoencoders via Quantum Adders with Genetic Algorithms
Bidirected Graphs I: Signed General Kotzig-Lovász Decomposition
On the $l^p$-norm of the Discrete Hilbert transform
Learned Features are better for Ethnicity Classification
Dynamic Evaluation of Neural Sequence Models
Perturbative Black Box Variational Inference

R Packages worth a look

Gui for Simulating Time Series (tsgui)
This GUI shows realisations of time series, currently ARMA and GARCH processes. It might be helpful for teaching and studying.

Sample Size Calculation for Mean and Proportion Comparisons in Phase 3 Clinical Trials (SampleSize4ClinicalTrials)
The design of phase 3 clinical trials can be classified into four types according to the goals: (1) testing for equality; (2) superiority trials; (3) non-inferiority trials; and (4) equivalence trials. Given that none of the available packages combines these designs in a single package, this package makes it possible for researchers to calculate sample size when comparing means or proportions in phase 3 clinical trials with different designs. The ssc function can calculate the sample size with a pre-specified type 1 error rate, statistical power and effect size according to the hypothesis testing framework. Furthermore, the effect size comprises the true treatment difference and the non-inferiority or equivalence margin, which can be set in the ssc function. (Reference: Yin, G. (2012). Clinical Trial Design: Bayesian and Frequentist Adaptive Methods. John Wiley & Sons.)
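For the equality design, the classic per-arm sample size for comparing two means is n = 2σ²(z₁₋α/₂ + z₁₋β)²/δ²; a minimal sketch (this does not reproduce the ssc function's interface, only the underlying formula):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm_equality(sigma, delta, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided test of equality of two means,
    assuming a common SD sigma and a true difference delta."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided type 1 error
    z_beta = z.inv_cdf(power)           # power = 1 - type 2 error
    return ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

print(n_per_arm_equality(sigma=1.0, delta=0.5))  # 63 per arm
```

The other designs modify the effect size by the non-inferiority or equivalence margin, as the description above notes.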

‘Rcmdr’ Plugin for Alpha-Sutte Indicator ‘sutteForecastR’ (RcmdrPlugin.sutteForecastR)
The ‘sutteForecastR’ package implements the Alpha-Sutte indicator. To make ‘sutteForecastR’ user friendly, we developed an ‘Rcmdr’ plug-in based on the Alpha-Sutte indicator function.

Simulation Studies with Stan (rstansim)
Provides a set of functions to facilitate and ease the running of simulation studies of Bayesian models using ‘stan’. Provides functionality to simulate data, fit models, and manage simulation results.

Credit Risk Scorecard (scorecard)
Makes the development of credit risk scorecards easy and efficient by providing functions for information value, variable filtering, optimal WOE binning, scorecard scaling, performance evaluation, etc. References include: 1. Refaat, M. (2011, ISBN: 9781447511199). Credit Risk Scorecard: Development and Implementation Using SAS. 2. Siddiqi, N. (2006, ISBN: 9780471754510). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring.
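The weight of evidence (WOE) and information value (IV) quantities mentioned above follow standard definitions; a minimal sketch (not the package's API):

```python
import math

def woe_iv(bins):
    """bins: list of (n_good, n_bad) counts per bin.
    WOE_i = ln(dist_good_i / dist_bad_i);  IV = sum_i (dg_i - db_i) * WOE_i."""
    total_good = sum(g for g, b in bins)
    total_bad = sum(b for g, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        dg, db = g / total_good, b / total_bad
        w = math.log(dg / db)
        woes.append(w)
        iv += (dg - db) * w
    return woes, iv

# toy example: two bins with (goods, bads) counts
woes, iv = woe_iv([(80, 20), (20, 30)])
```

A rule of thumb in scorecard practice treats IV above roughly 0.3 as a strong predictor, which is the basis for IV-driven variable filtering.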

An Orthogonality Constrained Optimization Approach for Semi-Parametric Dimension Reduction Problems (orthoDr)
Utilizes the orthogonality-constrained optimization algorithm of Wen & Yin (2013) <DOI:10.1007/s10107-012-0584-1> to solve a variety of dimension reduction problems in the semiparametric framework, such as Ma & Zhu (2013) <DOI:10.1214/12-AOS1072> and Sun, Zhu, Wang & Zeng (2017) <arXiv:1704.05046>. It also serves as a general-purpose optimization solver for problems with orthogonality constraints.

Book Memo: “Probabilistic Metric Spaces”

This distinctly nonclassical treatment focuses on developing aspects that differ from the theory of ordinary metric spaces, working directly with probability distribution functions rather than random variables. The two-part treatment begins with an overview that discusses the theory’s historical evolution, followed by a development of related mathematical machinery. The presentation defines all needed concepts, states all necessary results, and provides relevant proofs. The second part opens with definitions of probabilistic metric spaces and proceeds to examinations of special classes of probabilistic metric spaces, topologies, and several related structures, such as probabilistic normed and inner-product spaces. Throughout, the authors focus on developing aspects that differ from the theory of ordinary metric spaces, rather than simply transferring known metric space results to a more general setting.

If you did not already know

Apache Lucy google
The Apache Lucy search engine library provides full-text search for dynamic programming languages. …

TrueSkill Ranking System google
TrueSkill is a Bayesian ranking algorithm developed by Microsoft Research and used in the Xbox matchmaking system; it was built to address some perceived flaws in the Elo rating system. It is an extension of the Glicko rating system to multiplayer games. The purpose of a ranking system is to both identify and track the skills of gamers in a game (mode) in order to be able to match them into competitive matches. The TrueSkill ranking system only uses the final standings of all teams in a game in order to update the skill estimates (ranks) of all gamers playing in this game. Ranking systems have been proposed for many sports, but possibly the most prominent ranking system in use today is Elo. …

Intrablocks Correspondence Analysis (IBCA) google
We propose a new method to describe contingency tables with double partition structures in columns and rows. Furthermore, we propose new superimposed representations, based on the introduction of variable dilations for the partial clouds associated with the partitions of the columns and the rows. …