Infinite Variational Autoencoder (VAE)
This paper presents an infinite variational autoencoder (VAE) whose capacity adapts to suit the input data. This is achieved using a mixture model where the mixing coefficients are modeled by a Dirichlet process, allowing us to integrate over the coefficients when performing inference. Critically, this then allows us to automatically vary the number of autoencoders in the mixture based on the data. Experiments show the flexibility of our method, particularly for semi-supervised learning, where only a small number of training samples are available. …
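One standard way to picture Dirichlet-process mixing coefficients is the stick-breaking construction. The sketch below is an illustration only: it draws truncated mixing weights and does not reproduce the paper's inference, which integrates the coefficients out. The function name, the truncation level, and the concentration value `alpha` are assumptions made for this example.

```python
import random


def stick_breaking_weights(alpha, num_components, seed=0):
    """Draw truncated mixture weights via stick-breaking (illustrative helper)."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any component
    for _ in range(num_components):
        fraction = rng.betavariate(1.0, alpha)  # v_k ~ Beta(1, alpha)
        weights.append(remaining * fraction)    # w_k = v_k * prod_{j<k} (1 - v_j)
        remaining *= 1.0 - fraction
    return weights, remaining


# alpha and the truncation level are arbitrary choices for this sketch.
weights, leftover = stick_breaking_weights(alpha=2.0, num_components=10)
```

A smaller `alpha` concentrates mass on the first few components, while a larger `alpha` spreads it over more of them. This gives one intuition for how the effective number of mixture components can adapt to the data.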

Luigi
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, and more. It also comes with built-in Hadoop support.
Luigi is an open source Python-based data framework for building data pipelines. Instead of using an XML/YAML configuration of some sort, all the jobs and their dependencies are written as Python programs. Because it’s Python, developers can backtrack to figure out exactly how data is processed.
The framework makes it easier to build large data pipelines, with built-in checkpointing, failure recovery, parallel execution, command-line integration, and so on. Since a pipeline is an ordinary Python program, any Python library can be reused. The Luigi framework itself is only a couple of thousand lines long, so the entire mechanism is easy to understand.
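The idea of jobs and dependencies written as plain Python, with checkpointing through output files, can be sketched without the library. The code below is a toy illustration, not Luigi's implementation. The method names `requires`, `output`, `run`, and `complete` mirror Luigi's task interface, but the `Task` base class, the `build` function, and the two example tasks are hypothetical. Outputs are simplified to file paths.

```python
import os
import tempfile


class Task:
    """Hypothetical base class for this sketch; not Luigi's real Task."""

    def requires(self):
        return []  # tasks that must finish before this one

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError

    def complete(self):
        # Checkpointing: a task counts as done once its output file exists.
        return os.path.exists(self.output())


def build(task, log):
    """Resolve dependencies depth-first and skip tasks that are already complete."""
    if task.complete():
        log.append(f"skip {type(task).__name__}")
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(f"run {type(task).__name__}")


WORKDIR = tempfile.mkdtemp()  # scratch directory for the example outputs


class Extract(Task):
    def output(self):
        return os.path.join(WORKDIR, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n1\n2\n")


class Sort(Task):
    def requires(self):
        return [Extract()]

    def output(self):
        return os.path.join(WORKDIR, "sorted.txt")

    def run(self):
        with open(Extract().output()) as f:
            numbers = sorted(int(line) for line in f)
        with open(self.output(), "w") as f:
            f.write("\n".join(map(str, numbers)))


first_log, second_log = [], []
build(Sort(), first_log)   # runs Extract, then Sort
build(Sort(), second_log)  # output already exists, so nothing is re-run
```

Because completion is judged by the existence of the output file, re-running a pipeline after a failure only redoes the tasks whose outputs are missing.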
Facebook built a similar internal system called Dataswarm, which allows developers to manage the entire data pipeline with Git and Python.
While Luigi was originally invented for Spotify’s internal needs, companies such as Foursquare, Stripe, and Asana are using it in production. …

Simple Probabilistic Inverse (SPI)
Spectral topic modeling algorithms operate on matrices/tensors of word co-occurrence statistics to learn topic-specific word distributions. This approach removes the dependence on the original documents and yields substantial gains in efficiency as well as provable topic inference, but at a cost: the model can no longer provide information about the topic composition of individual documents. Recently, Thresholded Linear Inverse (TLI) was proposed to map the observed words of each document back to its topic composition. However, because it is linear and ignores the important prior information over topics, its inference quality is limited. In this paper, we evaluate the Simple Probabilistic Inverse (SPI) method and the novel Prior-aware Dual Decomposition (PADD), which is capable of learning document-specific topic compositions in parallel. Experiments show that PADD successfully leverages topic correlations as a prior, notably outperforming TLI and learning quality topic compositions comparable to Gibbs sampling on various data. …
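The abstract does not say how SPI is computed. As an illustration only of what mapping a document's words back to a topic composition can look like, the sketch below inverts topic-word distributions with Bayes' rule and averages the result over the document's tokens. Whether this matches SPI as defined in the paper is an assumption. The function names and the toy numbers are invented for the example.

```python
def bayes_inverse(topic_word, topic_prior):
    """Compute p(topic | word) from p(word | topic) and p(topic) by Bayes' rule."""
    num_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    inverse = []
    for w in range(vocab_size):
        joint = [topic_word[k][w] * topic_prior[k] for k in range(num_topics)]
        total = sum(joint)  # p(word); assumed nonzero in this toy setting
        inverse.append([j / total for j in joint])
    return inverse


def document_topics(word_counts, inverse):
    """Average p(topic | word) over all tokens of one document."""
    num_topics = len(inverse[0])
    num_tokens = sum(word_counts)
    composition = [0.0] * num_topics
    for w, count in enumerate(word_counts):
        for k in range(num_topics):
            composition[k] += count * inverse[w][k] / num_tokens
    return composition


# Toy data (hypothetical): 2 topics over a 3-word vocabulary.
topic_word = [[0.7, 0.2, 0.1],   # p(word | topic 0)
              [0.1, 0.2, 0.7]]   # p(word | topic 1)
topic_prior = [0.5, 0.5]         # p(topic)
inverse = bayes_inverse(topic_word, topic_prior)

doc = [4, 1, 0]                  # word counts for one document
composition = document_topics(doc, inverse)
```

In this toy setup the document is dominated by the word that topic 0 favors, so most of its estimated composition falls on topic 0.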