Distilled News

RDBL – manipulate data in-database with R code only

In this post I introduce our own package RDBL, the R DataBase Layer. With this package you can manipulate data in-database without writing SQL code. The package interprets the R code and sends the corresponding SQL statements to the database, fully transparently. To minimize overhead, data is fetched only when absolutely necessary, allowing the user to build the relevant joins (merge), filters (logical indexing) and groupings (aggregation) in R code before the SQL is run on the database. The core idea behind RDBL is to let R users with little or no SQL knowledge utilize the power of SQL database engines for data manipulation.

Layman’s Intro to #AI and Neural Networks

Simply put, any algorithm that can learn on its own, given a set of data, without the rules of the domain having to be programmed explicitly, falls under the ambit of Machine Learning. This is different from Data Analytics or Expert Systems, where rules, logic, propositions or activities have to be coded manually by an expert programmer.

Segmenting and refining images with SharpMask

Can a computer distinguish between the many objects in a photograph as effortlessly as the human eye? When humans look at an image, they can identify objects down to the last pixel. At Facebook AI Research (FAIR) we’re pushing machine vision to the next stage — our goal is to similarly understand images and objects at the pixel level.

Building a recommendation engine with AWS Data Pipeline, Elastic MapReduce and Spark

From Google’s advertisements to Amazon’s product suggestions, recommendation engines are everywhere. As users of smart internet services, we’ve become accustomed to seeing suggestions for things we like. This blog post is an overview of how we built a product recommendation engine for Hubba. I’ll start with an explanation of different types of recommenders and how we went about the selection process. Then I’ll cover our AWS solution before diving into some implementation details.

TensorLayer: Deep learning and Reinforcement learning library for Researchers and Engineers

TensorLayer is designed for use by both researchers and engineers; it is a transparent library built on top of Google TensorFlow. It provides a higher-level API to TensorFlow in order to speed up experimentation and development, and it is easy to extend and modify. In addition, we provide many examples and tutorials to help you work through deep learning and reinforcement learning. The documentation not only describes the TensorLayer API but also serves as a tutorial walking through different types of neural networks, deep reinforcement learning and natural language processing. TensorLayer’s tutorials also include modularized implementations of the Google TensorFlow deep learning tutorials, so you can read the TensorFlow tutorials at the same time. Unlike other, less flexible TensorFlow wrappers, however, TensorLayer assumes that you are somewhat familiar with neural networks and TensorFlow: a basic understanding of how TensorFlow works is required to use TensorLayer skillfully.

What’s new on arXiv

Prediction and Optimal Scheduling of Advertisements in Linear Television

Advertising is a crucial component of marketing and an important way for companies to raise awareness of goods and services in the marketplace. Advertising campaigns are designed to convey a marketing image or message to an audience of potential consumers and television commercials can be an effective way of transmitting these messages to a large audience. In order to meet the requirements for a typical advertising order, television content providers must provide advertisers with a predetermined number of ‘impressions’ in the target demographic. However, because the number of impressions for a given program is not known a priori and because there are a limited number of time slots available for commercials, scheduling advertisements efficiently can be a challenging computational problem. In this case study, we compare a variety of methods for estimating future viewership patterns in a target demographic from past data. We also present a method for using those predictions to generate an optimal advertising schedule that satisfies campaign requirements while maximizing advertising revenue.

Collaborative Filtering with Recurrent Neural Networks

We show that collaborative filtering can be viewed as a sequence prediction problem and that, given this interpretation, recurrent neural networks offer a very competitive approach. In particular, we study how long short-term memory (LSTM) can be applied to collaborative filtering, and how it compares to standard nearest-neighbor and matrix factorization methods on movie recommendation. We show that the LSTM is competitive in all aspects, and largely outperforms the other methods in terms of item coverage and short-term predictions.

Fluctuations, large deviations and rigidity in hyperuniform systems: a brief survey

We present a brief survey of fluctuations and large deviations of particle systems with subextensive growth of the variance. These are called hyperuniform (or superhomogeneous) systems. We then discuss the relation between hyperuniformity and rigidity. In particular we give sufficient conditions for rigidity of such systems in d=1,2.

Many-body delocalization with random vector potentials

Characterization of intersecting families of maximum size in $PSL(2,q)$

Learning in concave games with imperfect information

Fundamental Limits of Budget-Fidelity Trade-off in Label Crowdsourcing

Bayesian Projection of Life Expectancy Accounting for the HIV/AIDS Epidemic

Playing Anonymous Games using Simple Strategies

Lower bounds for the smallest singular value of structured random matrices

Algorithms for Colourful Simplicial Depth and Medians in the Plane

Phase Transition in Conditional Curie-Weiss Model

Johnson-Schechtman and Khinchine inequalities in noncommutative probability theory

$\Phi$-moment inequalities for independent and freely independent random variables

Applying Topological Persistence in Convolutional Neural Network for Music Audio Signals

Restricted completion of sparse partial Latin squares

Toll number of the Cartesian and the lexicographic product of graphs

Proceedings First Workshop on Causal Reasoning for Embedded and safety-critical Systems Technologies

$\chi$-bounds, operations and chords

Skew-t Filter and Smoother with Improved Covariance Matrix Approximation

Activity Networks with Delays: An application to toxicity analysis

Hard Negative Mining for Metric Learning Based Zero-Shot Classification

Using an epidemiological approach to maximize data survival in the internet of things

Test for Temporal Homogeneity of Means in High-dimensional Longitudinal Data

An invariant for minimum triangle-free graphs

Frobenius and Cartier algebras of Stanley-Reisner rings (II)

Estimating the Number of Clusters via Normalized Cluster Instability

Entity Embedding-based Anomaly Detection for Heterogeneous Categorical Events

A Note on the Practicality of Maximal Planar Subgraph Algorithms

Maximum Correntropy Unscented Filter

Well-Posedness and Stability for a Class of Stochastic Delay Differential Equations with Singular Drift

Leveraging over intact priors for boosting control and dexterity of prosthetic hands by amputees

Graphic TSP in 2-connected cubic graphs

Book Memo: “A Course in Mathematical Statistics and Large Sample Theory”

This graduate-level textbook is primarily aimed at graduate students of statistics, mathematics, science, and engineering who have had an undergraduate course in statistics, an upper division course in analysis, and some acquaintance with measure theoretic probability. It provides a rigorous presentation of the core of mathematical statistics.
Part I of this book constitutes a one-semester course on basic parametric mathematical statistics. Part II deals with the large sample theory of statistics – parametric and nonparametric, and its contents may be covered in one semester as well. Part III provides brief accounts of a number of topics of current interest for practitioners and other disciplines whose work involves statistical methods.

R Packages worth a look

Format Outputs of Statistical Tests According to APA Guidelines (apa)
Formatter functions in the ‘apa’ package take the return value of a statistical test function, e.g. a call to chisq.test(), and return a string formatted according to the guidelines of the APA (American Psychological Association).
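To make the idea concrete, here is a minimal, language-agnostic sketch in Python of what such a formatter produces; the function name `apa_chisq` and its signature are hypothetical illustrations, not the ‘apa’ package’s actual API:

```python
def apa_chisq(statistic, df, n, p):
    """Format a chi-square result roughly per APA style: drop the leading
    zero from p-values and report 'p < .001' for very small ones.
    (Hypothetical helper for illustration only.)"""
    p_str = "p < .001" if p < 0.001 else f"p = {p:.3f}".replace("0.", ".")
    return f"chi^2({df}, n = {n}) = {statistic:.2f}, {p_str}"

# In R, the analogous input would come from chisq.test(); here the numbers
# are made up for illustration.
line = apa_chisq(statistic=7.92, df=2, n=100, p=0.019)
# "chi^2(2, n = 100) = 7.92, p = .019"
```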

Substance Flow Computation (sfc)
Provides a function sfc() to compute substance flows from the input files — ‘data’ and ‘model’. If sample.size is set to more than 1, an uncertainty analysis is performed, with the distributions and parameters supplied in the file ‘data’.

Parallel Utilities for Lambda Selection along a Regularization Path (pulsar)
Model selection for penalized graphical models using the Stability Approach to Regularization Selection (‘StARS’), with options for speed-ups including Bounded StARS (B-StARS), batch computing, and other stability metrics (e.g., graphlet stability G-StARS).

Survival and Competing Risk Analyses with Time-to-Event Data as Covariates (time2event)
Cox proportional hazard and competing risk regression analyses can be performed with time-to-event data as covariates.

Markov-Switching GARCH Models (MSGARCH)
The MSGARCH package offers methods to fit (by Maximum Likelihood or Bayesian), simulate, and forecast various Markov-Switching GARCH processes.

Local Association Measures (zebu)
Implements the estimation of local (and global) association measures: Ducher’s Z, pointwise mutual information and normalized pointwise mutual information. The significance of local (and global) association is assessed using p-values estimated by permutations. Finally, using local association subgroup analysis, it identifies whether the association between variables depends on the value of another variable.
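The two mutual-information measures mentioned above are simple to compute. A standalone Python sketch of the definitions (not zebu’s API; the probabilities at the bottom are made up for illustration):

```python
from math import log2

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log2 of how much more often x and y
    co-occur than independence would predict."""
    return log2(p_xy / (p_x * p_y))

def npmi(p_xy, p_x, p_y):
    """Normalized PMI, rescaled into [-1, 1]; +1 means perfect co-occurrence."""
    return pmi(p_xy, p_x, p_y) / -log2(p_xy)

# Example: two events that co-occur more often than independence predicts.
p_x, p_y, p_xy = 0.5, 0.4, 0.3   # independence would give 0.5 * 0.4 = 0.2
local_pmi = pmi(p_xy, p_x, p_y)    # log2(1.5), about 0.585 bits
local_npmi = npmi(p_xy, p_x, p_y)  # positive, less than 1
```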

Document worth reading: “How to Build Dashboards That Persuade, Inform and Engage”

Flow is powerful. Think about a great conversation you’ve had, with no awkwardness or self-consciousness: just effortless communication. In data visualization, flow is crucial. Your audience should smoothly absorb and use the information in a dashboard without distractions or turbulence. Lack of flow means lack of communication, which means failure. Psychologist Mihaly Csikszentmihalyi has studied flow extensively. Csikszentmihalyi and other researchers have found that flow is correlated with happiness, creativity, and productivity. People experience flow when their skills are engaged and they’re being challenged just the right amount. The experience is neither too challenging nor too easy: flow is a just-right, Goldilocks state of being. So how do you create flow for an audience? By tailoring the presentation of data to that audience. If you focus on the skills, motivations, and needs of an audience, you’ll have a better chance of creating a positive experience of flow with your dashboards. And by creating that flow, you’ll be able to persuade, inform, and engage.

If you did not already know: “Jackknife Resampling”

In statistics, the jackknife is a resampling technique especially useful for variance and bias estimation. The jackknife predates other common resampling methods such as the bootstrap. The jackknife estimate of a parameter is found by systematically leaving out each observation from the dataset, recalculating the estimate on the remaining observations, and then averaging these calculations. Given a sample of size N, the jackknife estimate is thus the aggregate of the N leave-one-out estimates, each computed on a subsample of size N − 1.
The jackknife technique was developed by Quenouille (1949, 1956). Tukey (1958) expanded on the technique and proposed the name “jackknife” since, like a Boy Scout’s jackknife, it is a “rough and ready” tool that can solve a variety of problems, even though specific problems may be solved more efficiently with a purpose-designed tool.
The jackknife represents a linear approximation of the bootstrap.
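The leave-one-out procedure described above fits in a few lines of code. A minimal Python sketch, using the sample mean as the estimator (the four-point sample is made up for illustration):

```python
import statistics

def jackknife(data, estimator):
    """Leave-one-out jackknife: returns (full-sample estimate, bias estimate,
    standard error estimate)."""
    n = len(data)
    theta_full = estimator(data)
    # Recompute the estimator on each leave-one-out subsample of size n - 1.
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    theta_jack = sum(loo) / n
    bias = (n - 1) * (theta_jack - theta_full)
    variance = (n - 1) / n * sum((t - theta_jack) ** 2 for t in loo)
    return theta_full, bias, variance ** 0.5

sample = [2.0, 4.0, 6.0, 8.0]
est, bias, se = jackknife(sample, statistics.mean)
# For the mean, the jackknife bias is exactly 0 and the jackknife standard
# error coincides with the usual s / sqrt(n).
```
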

Magister Dixit

“Distinguishing between feature selection and dimensionality reduction might seem counter-intuitive at first, since feature selection will eventually lead (reduce dimensionality) to a smaller feature space. In practice, the key difference between the terms “feature selection” and “dimensionality reduction” is that in feature selection, we keep the “original feature axis”, whereas dimensionality reduction usually involves a transformation technique.” Sebastian Raschka (August 24, 2014)
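The distinction in the quote can be made concrete with a toy sketch in Python; the data are made up, and the hand-picked projection direction `w` merely stands in for a learned component (e.g., from PCA):

```python
# Toy data: rows are samples, columns are features [height_cm, weight_kg, shoe_size].
X = [[170.0, 65.0, 42.0],
     [180.0, 80.0, 44.0],
     [160.0, 55.0, 38.0]]

# Feature selection keeps a subset of the ORIGINAL axes (here columns 0 and 2),
# so the result is still expressed in the original units.
selected = [[row[0], row[2]] for row in X]

# Dimensionality reduction transforms the data onto NEW axes; here a single
# projection direction stands in for, e.g., the first PCA component.
w = [0.8, 0.6, 0.0]  # unit-length direction in the original feature space
reduced = [[sum(x * wi for x, wi in zip(row, w))] for row in X]
```

Both results have fewer columns than X, but only `selected` keeps the original feature axes; `reduced` lives on a transformed axis.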

Book Memo: “Decision Making and Modelling in Cognitive Science”

This book discusses the paradigm of quantum ontology as an appropriate model for measuring cognitive processes. It clearly shows the inadequacy of applying classical probability theory to modelling the human cognitive domain. The chapters investigate the context dependence and neuronal basis of cognition in a coherent manner. According to this framework, epistemological issues related to decision making and state of mind are seen to be similar to issues related to equanimity and neutral mind, as discussed from the Buddhist perspective. The author states that quantum ontology as a modelling tool will help scientists create new methodologies of modelling in other streams of science as well.

R Packages worth a look

Censored and Truncated Quantile Regression (ctqr)
Estimation of quantile regression models for survival data.

Multidimensional Gauss-Hermite Quadrature (MultiGHQuad)
Uses a transformed, rotated and optionally adapted n-dimensional grid of quadrature points to calculate the numerical integral over n multivariate normally distributed parameters.

Miscellaneous Utilities and Functions (JWileymisc)
A collection of miscellaneous tools and functions, such as tools to generate descriptive statistics tables, format output, visualize relations among variables or check distributions.

Evolutionary Computing in R (ecr)
Provides a powerful framework for evolutionary computing in R. The user can easily construct powerful evolutionary algorithms for tackling both single- and multi-objective problems by plugging in different predefined evolutionary building blocks, e.g., operators for mutation, recombination and selection, with just a few lines of code. Your problem cannot be easily solved with a standard EA that works on real-valued vectors, permutations or binary strings? No problem: ‘ecr’ has been developed with that in mind, and extending the framework with your own operators is also possible. Additionally, there are various convenience functions for monitoring, logging and more.

Phase II Clinical Trial Design for Multinomial Endpoints (ph2mult)
Provides multinomial design methods under the intersection-union test (IUT) and union-intersection test (UIT) schemes for Phase II trials. The design types include: Minimax (minimize the maximum sample size), Optimal (minimize the expected sample size), Admissible (minimize the Bayesian risk) and Maxpower (maximize the exact power level).

Stack Data Type as an ‘R6’ Class (rstack)
An extremely simple stack data type, implemented with ‘R6’ classes. The size of the stack increases as needed, and the amortized time complexity is O(1). The stack may contain arbitrary objects.

If you did not already know: “Min-Wise Independent Permutations Locality Sensitive Hashing Scheme (MinHash)”

In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997) and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.
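A minimal, library-free sketch of the idea in Python: the seeded-md5 construction below merely simulates a family of independent hash functions (real implementations use faster, purpose-built hashes), and the two word sets are made up for illustration:

```python
import hashlib

def minhash_signature(items, num_hashes=128):
    """One minimum hash value per seeded hash function; each seed plays the
    role of an independent random permutation of the universe."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{x}".encode()).digest()[:8], "big")
            for x in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching signature slots is an unbiased estimate of
    # the Jaccard similarity |A ∩ B| / |A ∪ B|.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox sleeps under the lazy dog".split())
true_j = len(a & b) / len(a | b)  # 6 shared words of 10 total -> 0.6
est_j = estimated_jaccard(minhash_signature(a), minhash_signature(b))
```

With 128 hash functions the estimate typically lands within a few percentage points of the true similarity; more hashes tighten the estimate at the cost of a longer signature.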

