Distilled News

An Introduction to Stock Market Data Analysis with Python (Part 1)

This post is the first in a two-part series on stock data analysis using Python, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah. In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance using pandas, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking. The final post will include practice problems. This first post discusses topics up to introducing moving averages.

Interpreting and Visualizing Neural Networks for Text Processing

Neural networks have become the go-to approach for text processing tasks, but they are also notorious for their obscurity. Recently, we applied a neural network to the task of predicting numerical ratings for text-based consumer reviews, training the model to learn the ratings directly from the words in each review. It turns out that our model had decent prediction accuracy, but this wasn’t our objective, especially since we already knew the ratings for these reviews. Instead, we wanted to understand why those particular ratings were assigned. Interpreting what a neural network has learned from data is a separate endeavor from applying it to make predictions. In this post, we’ll explore some strategies for bringing the inside of a neural network to light, using our ratings prediction model to demonstrate.

LSTM for Human Activity Recognition

I will be using an LSTM on the data to learn (as a cellphone attached on the waist) to recognise the type of activity that the user is doing.

The Argument for Accelerated and Integrated Analytics

The rise of modern business intelligence (BI) has seen the emergence of a number of component parts designed to support the different analytical functions necessary to deliver what enterprises require. Perhaps the most fundamental component of the BI movement is the traditional frontend or visualization application. Companies like Tableau, Qlik, Birst, Domo and Periscope provide these. There are dozens more – all with essentially equivalent capabilities: the ability to make spreadsheets look beautiful. Some of these companies have been tremendously successful, primarily differentiating themselves on the axis of usability. Another, equally critical component of the BI equation is the database. Here, too, there are the usual suspects: Redshift, Impala, Vertica, Netezza and others. Some of these databases are fully featured, system-of-record worthy solutions, while others focus on a particular performance axis, streaming, for instance, and do it well. Finally, there is the emergence, across BI and database players, more advanced analytics tools, driven by the explosion of interest in and development of machine learning, deep learning and artificial intelligence. This market has its stars, starting with the big boys – Google, Facebook, Amazon, Microsoft, IBM, Baidu, Tesla – as well as a host of highly credentialed startups, like Sentient, Ayasdi, Bonsai and H2O.ai.

Whats new on arXiv

Input Convex Neural Networks

This paper presents the input convex neural network architecture. These are scalar-valued (potentially deep) neural networks with constraints on the network parameters such that the output of the network is a convex function of (some of) the inputs. The networks allow for efficient inference via optimization over some inputs to the network given others, and can be applied to settings including structured prediction, data imputation, reinforcement learning, and others. In this paper we lay the basic groundwork for these models, proposing methods for inference, optimization and learning, and analyze their representational power. We show that many existing neural network architectures can be made input-convex with only minor modification, and develop specialized optimization algorithms tailored to this setting. Finally, we highlight the performance of the methods on multi-label prediction, image completion, and reinforcement learning problems, where we show improvement over the existing state of the art in many cases.

Efficient Feature Selection With Large and High-dimensional Data

Driven by the advances in technology, large and high-dimensional data have become the rule rather than the exception. Approaches that allow for feature selection with such data are thus highly sought after, in particular, since standard methods, like cross-validated Lasso, can be computationally intractable and, in any case, lack theoretical guarantees. In this paper, we propose a novel approach to feature selection in regression. Consisting of simple optimization steps and tests, it is computationally more efficient than existing methods and, therefore, suited even for very large data sets. Moreover, in contrast to standard methods, it is equipped with sharp statistical guarantees. We thus expect that our algorithm can help to leverage the increasing volume of data in Biology, Public Health, Astronomy, Economics, and other fields.

Annotating Derivations: A New Evaluation Strategy and Dataset for Algebra Word Problems

We propose a new evaluation for automatic solvers for algebra word problems, which can identify reasoning mistakes that existing evaluations overlook. Our proposal is to use derivations for evaluations, which reflect the reasoning process of the solver by explaining how the equation system was constructed. We accomplish this by developing an algorithm for checking the equivalence between two derivations, and showing how derivation annotations can be semi-automatically added to existing datasets. To make our experiments more comprehensive, we also annotated DRAW-1K , a new dataset of 1000 general algebra word problems. In total, our experiments span over 2300 algebra word problems. We found that the annotated derivation enable a superior evaluation of automatic solvers than previously used metrics.

A Topological Algorithm for Determining How Road Networks Evolve Over Time

We provide an efficient algorithm for determining how a road network has evolved over time, given two snapshot instances from different dates. To allow for such determinations across different databases and even against hand drawn maps, we take a strictly topological approach in this paper, so that we compare road networks based strictly on graph-theoretic properties. Given two road networks of same region from two different dates, our approach allows one to match road network portions that remain intact and also point out added or removed portions. We analyze our algorithm both theoretically, showing that it runs in polynomial time for non-degenerate road networks even though a related problem is NP-complete, and experimentally, using dated road networks from the TIGER/Line archive of the U.S. Census Bureau.

Constraint-Based Clustering Selection

Semi-supervised clustering methods incorporate a limited amount of supervision into the clustering process. Typically, this supervision is provided by the user in the form of pairwise constraints. Existing methods use such constraints in one of the following ways: they adapt their clustering procedure, their similarity metric, or both. All of these approaches operate within the scope of individual clustering algorithms. In contrast, we propose to use constraints to choose between clusterings generated by very different unsupervised clustering algorithms, run with different parameter settings. We empirically show that this simple approach often outperforms existing semi-supervised clustering methods.

Language as a Latent Variable: Discrete Generative Models for Sentence Compression

In this work we explore deep generative models of text in which the latent representation of a document is itself drawn from a discrete language model distribution. We formulate a variational auto-encoder for inference in this model and apply it to the task of compressing sentences. In this application the generative model first draws a latent summary sentence from a background language model, and then subsequently draws the observed sentence conditioned on this latent summary. In our empirical evaluation we show that generative formulations of both abstractive and extractive compression yield state-of-the-art results when trained on a large amount of supervised data. Further, we explore semi-supervised compression scenarios where we show that it is possible to achieve performance competitive with previously proposed supervised models while training on a fraction of the supervised data.

Changepoint Detection in the Presence of Outliers

Many traditional methods for identifying changepoints can struggle in the presence of outliers, or when the noise is heavy-tailed. Often they will infer additional changepoints in order to fit the outliers. To overcome this problem, data often needs to be pre-processed to remove outliers, though this is not feasible in applications where the data needs to be analysed online. We present an approach to changepoint detection that is robust to the presence of outliers. The idea is to adapt existing penalised cost approaches for detecting changes so that they use cost functions that are less sensitive to outliers. We argue that cost functions that are bounded, such as the classical biweight cost, are particularly suitable — as we show that only bounded cost functions are robust to arbitrarily extreme outliers. We present a novel and efficient dynamic programming algorithm that can then find the optimal segmentation under our penalised cost criteria. Importantly, this algorithm can be used in settings where the data needs to be analysed online. We present theoretical bounds on the worst-case complexity of this algorithm, and show empirically that its average computational cost is linear in the amount of the data. We show the usefulness of our approach for applications such as analysing well-log data, detecting copy number variation, and detecting tampering of wireless devices.

GIRAF: A Fast Algorithm for Structured Low-Rank Matrix Recovery

Structured low-rank matrix priors are emerging as powerful alternatives to traditional image recovery methods such as total variation (TV) and wavelet regularization. The main challenge in applying these schemes to large-scale problems is the computational complexity and memory demand resulting from a lifting of the image to a large structured matrix. We introduce a fast and memory efficient algorithm that exploits the convolution structure of the structured matrix to work in the original non-lifted domain, thus considerably reducing the complexity. Our experiments on the recovery of images from undersampled Fourier measurements show that the resulting algorithm provides improved reconstructions over TV regularization with comparable computation time.

Deep Learning in Multi-Layer Architectures of Dense Nuclei

Robust Confidence Intervals in High-Dimensional Left-Censored Regression

Chains conditions in algebraic lattices

Computationally Efficient Markov Chain Monte Carlo Methods for Hierarchical Bayesian Inverse Problems

Hydra: Leveraging Functional Slicing for Efficient Distributed SDN Controllers

Review of multi-fidelity models

Multilayer Spectral Graph Clustering via Convex Layer Aggregation

Neighborhood growth dynamics on the Hamming plane

Functional Laws for Trimmed Levy Processes

Matching preclusion for $n$-grid graphs

A Novel Progressive Multi-label Classifier for Classincremental Data

Statistical Modeling for Spatio-Temporal Degradation Data

Empty-car routing in ridesharing systems

Deep Multi-Task Learning with Shared Memory

Fully Bayesian Estimation and Variable Selection in Partially Linear Wavelet Models

Hereditarily Non Uniformly Perfect Sets

On the (im)possibility of fairness

Novel stochastic properties of the short-time spectrum for unvoiced pronunciation modeling and synthesis

Using Neural Network Formalism to Solve Multiple-Instance Problems

Tight fluctuations of weight-distances in random graphs with infinite-variance degrees

Percolation on networks with weak and heterogeneous dependency

Random Popular Matchings with Incomplete Preference Lists

A Computer Algebra Package for Polynomial Sequence Recognition

Transport-entropy inequalities on locally acting groups of permutations

Semiparametric clustered overdispersed multinomial goodness-of-fit of log-linear models

Estimating Probability Distributions using ‘Dirac’ Kernels (via Rademacher-Walsh Polynomial Basis Functions)

On rotated Schur-positive sets

Implicit renewal theory in the arithmetic case

Stabilizing on the distinguishing number of a graph

Some results on the Signature and Cubature of the Fractional Brownian motion for $H>\frac{1}{2}$

Renyi-Ulam Games and Forbidden Substrings

Block-proximal methods with spatially adapted acceleration

Multi-Output Artificial Neural Network for Storm Surge Prediction in North Carolina

Discovering Sound Concepts and Acoustic Relations In Text

A penalized likelihood method for classification with matrix-valued predictors

A Few Photons Among Many: Unmixing Signal and Noise for Photon-Efficient Active Imaging

One-vs-Each Approximation to Softmax for Scalable Estimation of Probabilities

On the Non-Existence of Unbiased Estimators in Constrained Estimation Problems

Watermark Options

Delocalising the parabolic Anderson model through partial duplication of the potential

An unbiased Monte Carlo estimator for derivatives. Application to CIR

Regulating Reward Training by Means of Certainty Prediction in a Neural Network-Implemented Pong Game

Gershgorin disks for multiple eigenvalues of non-negative matrices

MPI Parallelization of the Resistive Wall Code STARWALL: Report of the EUROfusion High Level Support Team Project JORSTAR

Quasi-Monte Carlo for an Integrand with a Singularity along a Diagonal in the Square

Finding long simple paths in a weighted digraph using pseudo-topological orderings

AMR-to-text generation as a Traveling Salesman Problem

A Wald-type test statistic for testing linear hypothesis in logistic regression models based on minimum density power divergence estimator

Predicting human-driving behavior to help driverless vehicles drive: random intercept Bayesian Additive Regression Trees

Discreet Coin Weighings and the Sorting Strategy

Conditionally Bi-Free Independence for Pairs of Algebras

Asymptotic tensor rank of graph tensors: beyond matrix multiplication

Screening Rules for Convex Problems

Incorporating Relation Paths in Neural Relation Extraction

Book Memo: “Encyclopedia of Algorithms”

This dynamic reference work provides solutions to vital algorithmic problems for scholars, researchers, practitioners, teachers and students in fields such as computer science, mathematics, statistics, biology, economics, financial software, and medical informatics.
This second edition is broadly expanded, building upon the success of its former edition with more than 450 new and updated entries. These entries are designed to ensure algorithms are presented from growing areas of research such as bioinformatics, combinatorial group testing, differential privacy, enumeration algorithms, game theory, massive data algorithms, modern learning theory, social networks, and VLSI CAD algorithms.
Over 630 entries are organized alphabetically by problem, with subentries allowing for distinct solutions. Each entry includes a description of the basic algorithmic problem; the input and output specifications; key results; examples of applications; citations to key literature, open problems, experimental results, links to data sets and downloadable code.
All entries are peer-reviewed, written by leading experts in the field―and each entry contains links to a summary of the author’s research work.
This defining reference is available in both print and online―a dynamic living work with hyperlinks to related entries, cross references citations, and a myriad other valuable URLs.
New and Updated entries include:
Algorithmic Aspects of Distributed Sensor Networks,
Algorithms for Modern Computers
Certified Reconstruction and Mesh Generation
Combinatorial Group Testing
Compression of Text and Data Structures
Computational Counting
Computational Economics
Computational Geometry
Differential Privacy
Enumeration Algorithms
Exact Exponential Algorithms
Game Theory
Graph Drawing
Group Testing
Internet Algorithms
Kernels and Compressions
Massive Data Algorithms
Mathematical Optimization
Modern Learning Theory
Social Networks
Stable Marriage Problems, k-SAT Algorithms
Sublinear Algorithms
Tile Self-Assembly
VLSI CAD Algorithms

Book Memo: “Introduction to Nonparametric Statistics for the Biological Sciences Using R”

Eight self-contained lessons instructing how to use R to distinguish between data that could be classified as nonparametric as opposed to data that could be classified as parametric, with both approaches to data classification covered extensively From data to final interpretation of outcomes – starts with a simple real-world data set from the biological sciences and outlines step-by-step guidance on how R can be used to address nonparametric data analysis and the generation of graphical images to promote effective communication of outcomes Focuses on data review and accompanying data quality review processes – so that outcomes can be trusted and hold up to peer review

R Packages worth a look

Simple Memory Profiling for R (profmem)
A simple and light-weight API for memory profiling of R expressions. The profiling is built on top of R’s built-in memory profiler (‘utils::Rprofmem()’), which records every memory allocation done by R (also native code).

Convert Nested Functions to Pipes (pipefittr)
To take nested function calls and convert them to a more readable form using pipes from package ‘magrittr’.

Multi-Objective Kriging Optimization (moko)
Multi-Objective optimization based on the Kriging metamodel. Important functions: mkm, VMPF, MEGO and HEGO.

Network Fingerprint Framework in R (NFP)
An implementation of the network fingerprint framework. This method worked by making systematic comparisons to a set of well-studied ‘basic networks’, measuring both the functional and topological similarity. A biological could be characterized as a spectrum-like vector consisting of similarities to basic networks. It shows great potential in biological network study.

Document worth reading: “Fast unfolding of communities in large networks”

We propose a simple method to extract the community structure of large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection method in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity. This is shown first by identifying language communities in a Belgian mobile phone network of 2.6 million customers and by analyzing a web graph of 118 million nodes and more than one billion links. The accuracy of our algorithm is also verified on ad-hoc modular networks. Fast unfolding of communities in large networks

If you did not already know: “Feedforward Sequential Memory Networks (FSMN)”

We introduce a new structure for memory neural networks, called feedforward sequential memory networks (FSMN), which can learn long-term dependency without using recurrent feedback. The proposed FSMN is a standard feedforward neural networks equipped with learnable sequential memory blocks in the hidden layers. In this work, we have applied FSMN to several language modeling (LM) tasks. Experimental results have shown that the memory blocks in FSMN can learn effective representations of long history. Experiments have shown that FSMN based language models can significantly outperform not only feedforward neural network (FNN) based LMs but also the popular recurrent neural network (RNN) LMs. … Feedforward Sequential Memory Networks (FSMN) google

Book Memo: “An Introduction to Fuzzy Linear Programming Problems”

Theory, Methods and Applications
The book presents a snapshot of the state of the art in the field of fully fuzzy linear programming. The main focus is on showing current methods for finding the fuzzy optimal solution of fully fuzzy linear programming problems in which all the parameters and decision variables are represented by non-negative fuzzy numbers. It presents new methods developed by the authors, as well as existing methods developed by others, and their application to real-world problems, including fuzzy transportation problems. Moreover, it compares the outcomes of the different methods and discusses their advantages/disadvantages. As the first work to collect at one place the most important methods for solving fuzzy linear programming problems, the book represents a useful reference guide for students and researchers, providing them with the necessary theoretical and practical knowledge to deal with linear programming problems under uncertainty.

Book Memo: “Post, Mine, Repeat”

Social Media Data Mining Becomes Ordinary
In this book, Helen Kennedy argues that as social media data mining becomes more and more ordinary, as we post, mine and repeat, new data relations emerge. These new data relations are characterised by a widespread desire for numbers and the troubling consequences of this desire, and also by the possibility of doing good with data and resisting data power, by new and old concerns, and by instability and contradiction. Drawing on action research with public sector organisations, interviews with commercial social insights companies and their clients, focus groups with social media users and other research, Kennedy provides a fascinating and detailed account of living with social media data mining inside the organisations that make up the fabric of everyday life.