Project Reports and Posters, Spring 2018

Final Project Prize Winners, Outstanding Posters and Submissions (CS230)

Generalized data structure synthesis

Many systems have a few key data structures at their heart. Finding correct and efficient implementations for these data structures is not always easy. Today’s paper introduces Cozy, which can handle this task for you given a high-level specification of the state, queries, and update operations that need to be supported: ‘Cozy has three goals: to reduce programmer effort, to produce bug-free code, and to match the performance of handwritten code. We found that using Cozy requires an order of magnitude fewer lines of code than manual implementation, makes no mistakes even when human programmers do, and often matches the performance of handwritten code.’

Teaching AI to Know When it is Stumped

Researchers from Stanford University have released the second edition of the Stanford Question Answering Dataset (SQuAD) to help AI, such as Siri, learn when it lacks enough information to answer a question accurately. The dataset consists of over 50,000 unanswerable questions designed to appear answerable based on information in the accompanying paragraphs. Humans can easily identify when a question is unanswerable, but AI systems have a harder time, which limits the ability of automated assistants and other services to accurately respond to users’ queries. For example, when asked who the King of England is, a human would recognize that England does not have a king, while AI systems are more likely to say the deceased King George VI, who was the last King of the United Kingdom.

Creating Slopegraphs with R

Presenting data results in the most informative and compelling manner is part of the role of the data scientist. It’s all well and good to master the arcana of some algorithm, to manipulate and master the numbers and bend them to your will to produce a ‘solution’ that is both accurate and useful. But those activities are typically in pursuit of informing some decision or at least providing information that serves a purpose. So taking those results and making them compelling and understandable to your audience is part of your job!

Decentralized code distribution for the future of open source

The code we write is one of society’s most valuable outputs. The software we develop is almost never a monolithic, independent structure that can be constructed, admired, and replaced. Instead (largely through open source) we have built an interwoven network of dependencies that links together first releases, new versions, forks, and both public and private libraries. Like the Internet itself, this network can at times be fragile, and in its current form may be susceptible to censorship, manipulation, and destruction.

Methods for Efficient Resource Utilization in Statistical Machine Learning Algorithms

In recent years, statistical machine learning has emerged as a key technique for tackling problems that elude a classic algorithmic approach. One such problem, with a major impact on human life, is the analysis of complex biomedical data. Solving this problem in a fast and efficient manner is of major importance, as it enables, e.g., the prediction of the efficacy of different drugs for therapy selection. While achieving the highest possible prediction quality appears desirable, doing so is often simply infeasible due to resource constraints. Statistical learning algorithms for predicting the health status of a patient or for finding the best algorithm configuration for the prediction require an excessively high amount of resources. Furthermore, these algorithms are often implemented with no awareness of the underlying system architecture, which leads to sub-optimal resource utilization. This thesis presents methods for efficient resource utilization of statistical learning applications. The goal is to reduce the resource demands of these algorithms to meet a given time budget while simultaneously preserving the prediction quality. As a first step, the resource consumption characteristics of learning algorithms are analyzed, as well as their scheduling on underlying parallel architectures, in order to develop optimizations that enable these algorithms to scale to larger problem sizes. For this purpose, new profiling mechanisms are incorporated into a holistic profiling framework. The results show that one major contributor to the resource issues is memory consumption. To overcome this obstacle, a new optimization based on dynamic sharing of memory is developed that speeds up computation by several orders of magnitude in situations where available main memory is the bottleneck and memory has to be swapped out.
One important method that can be applied to the automated parameter tuning of learning algorithms is model-based optimization. Within a huge search space, algorithm configurations are evaluated to find the configuration with the best prediction quality. An important step towards better managing this search space is to parallelize the search process itself. However, a high runtime variance within the configuration space can cause inefficient resource utilization. For this purpose, new resource-aware scheduling strategies are developed that efficiently map evaluations of configurations to the parallel architecture, depending on their resource demands. In contrast to classical scheduling problems, the new scheduling interacts with the configuration proposal mechanism to select configurations with suitable resource demands. With these strategies, it becomes possible to make use of the full potential of parallel architectures. Compared to established parallel execution models, the results show that the new approach enables model-based optimization to converge faster to the optimum within a given time budget.

Parallelizing Linear Regression or Using Multiple Sources

My previous post explained how, mathematically, it is possible to parallelize the computation used to estimate the parameters of a linear regression. More specifically, we have a matrix $\mathbf{X}$, which is an $n\times k$ matrix, and $\mathbf{y}$, an $n$-dimensional vector, and we want to compute $\widehat{\mathbf{\beta}}=[\mathbf{X}^T\mathbf{X}]^{-1}\mathbf{X}^T\mathbf{y}$ by splitting the job. Instead of using the $n$ observations, we’ve seen that it was possible to compute ‘something’ using the first $n_1$ rows, then the next $n_2$ rows, etc. Then, finally, we ‘aggregate’ the $m$ objects created to get our overall estimate.
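
As a rough illustration of the split-then-aggregate idea, here is a minimal Python sketch. It assumes that the ‘something’ computed on each block of rows is the pair of cross-products $\mathbf{X}_j^T\mathbf{X}_j$ and $\mathbf{X}_j^T\mathbf{y}_j$, which can simply be summed over the $m$ blocks. That is one possible choice, not necessarily the one used in the post. The helper names (`chunk_summaries`, `aggregate`, `solve`, `fit_in_chunks`) and the toy data are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def chunk_summaries(X: Matrix, y: Vector) -> Tuple[Matrix, Vector]:
    """Return (X^T X, X^T y) for one block of rows."""
    k = len(X[0])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for row, target in zip(X, y):
        for a in range(k):
            xty[a] += row[a] * target
            for b in range(k):
                xtx[a][b] += row[a] * row[b]
    return xtx, xty


def aggregate(summaries: List[Tuple[Matrix, Vector]]) -> Tuple[Matrix, Vector]:
    """Sum the per-block cross-products into the full X^T X and X^T y."""
    k = len(summaries[0][1])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for part_xtx, part_xty in summaries:
        for a in range(k):
            xty[a] += part_xty[a]
            for b in range(k):
                xtx[a][b] += part_xtx[a][b]
    return xtx, xty


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_in_chunks(X: Matrix, y: Vector, sizes: List[int]) -> Vector:
    """Estimate beta from consecutive row blocks; sizes should sum to n."""
    summaries = []
    start = 0
    for size in sizes:
        stop = start + size
        # Each call only touches its own rows, so it could run on a separate worker.
        summaries.append(chunk_summaries(X[start:stop], y[start:stop]))
        start = stop
    return solve(*aggregate(summaries))


# Toy data (made up for illustration): y = 1 + 2x exactly, with an intercept column.
X = [[1.0, float(i)] for i in range(6)]
y = [1.0 + 2.0 * i for i in range(6)]

beta_chunked = fit_in_chunks(X, y, sizes=[2, 3, 1])  # m = 3 blocks
beta_full = fit_in_chunks(X, y, sizes=[6])           # all n rows at once
```

Because the cross-products are plain sums over rows, the blocked and the full computation agree up to floating-point rounding, whatever the block sizes are.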

7 Simple Data Visualizations You Should Know in R

1. Bar Chart
2. Histogram
3. Heat Map
4. Scatter Plot
5. Box Plot
6. Correlogram
7. Area Chart

Melt and cast the shape of your data.frame – Exercises

Datasets often arrive in a form that is different from what we need for our modelling or visualisation functions, which in turn don’t necessarily require the same format. Reshaping data.frames is a step that all analysts need but many struggle with. Practicing this meta-skill will in the long run result in more time to focus on the actual analysis.
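
To make the two shapes concrete, here is a small Python sketch of the idea behind melting (wide to long) and casting (long to wide). The `melt` and `cast` functions below are hypothetical plain-Python stand-ins written for this illustration, not the R functions the exercises use, and the table is made-up toy data.

```python
from typing import Dict, List

Row = Dict[str, object]


def melt(rows: List[Row], id_key: str, value_keys: List[str]) -> List[Row]:
    """Wide -> long: emit one row per (id, variable) pair."""
    return [
        {id_key: row[id_key], "variable": key, "value": row[key]}
        for row in rows
        for key in value_keys
    ]


def cast(long_rows: List[Row], id_key: str) -> List[Row]:
    """Long -> wide: gather each id's variables back into a single row."""
    wide: Dict[object, Row] = {}
    for row in long_rows:
        record = wide.setdefault(row[id_key], {id_key: row[id_key]})
        record[str(row["variable"])] = row["value"]
    return list(wide.values())


# Toy wide table: one row per city, one column per year.
wide_table = [
    {"city": "Oslo", "2016": 10, "2017": 12},
    {"city": "Lima", "2016": 7, "2017": 9},
]

long_table = melt(wide_table, "city", ["2016", "2017"])
round_trip = cast(long_table, "city")
```

The long form has one observation per row, which is what many plotting and modelling functions expect, while the wide form is often easier to read; casting the melted table returns the original rows.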

Idle thoughts lead to R internals: how to count function arguments

‘Some R functions have an awful lot of arguments’, you think to yourself. ‘I wonder which has the most?’ It’s not an original thought: the same question as applied to the R base package is an exercise in the Functions chapter of the excellent Advanced R. Much of the information in this post came from there. There are lots of R packages. We’ll limit ourselves to those packages which ship with R, and which load on startup. Which ones are they?

A Comparative Review of the BlueSky Statistics GUI for R

BlueSky Statistics’ desktop version is a free and open source graphical user interface for the R software that focuses on beginners looking to point-and-click their way through analyses. A commercial version is also available which includes technical support and a version for Windows Terminal Servers such as Remote Desktop or Citrix. Mac, Linux, or tablet users could run it via a terminal server. This post is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. Additionally, these reviews include a cursory description of the programming support that each GUI offers.

Explaining Keras image classification models with lime

Last week I published a blog post about how easy it is to train image classification models with Keras. What I did not show in that post was how to use the model for making predictions. This, I will do here. But predictions alone are boring, so I’m adding explanations for the predictions using the lime package.

Awesome Twitter Word Clouds in R

For my goals, I decided to work through the book Tidy Text Mining with R by Julia Silge and David Robinson. I chose to tap into Twitter data for my text analysis using the rtweet package. Inspired by some of the word clouds in the Tidy Text book, I decided to plot the data in fancy word clouds using wordcloud2.