Distilled News

Free Book: Process Improvement Using Data

Rethinking statistical learning theory: learning using statistical invariants

This paper introduces a new learning paradigm, called Learning Using Statistical Invariants (LUSI), which is different from the classical one. In the classical paradigm, the learning machine constructs a classification rule that minimizes the probability of expected error; it is a data-driven model of learning. In the LUSI paradigm, in order to construct the desired classification function, a learning machine computes statistical invariants that are specific to the problem, and then minimizes the expected error in a way that preserves these invariants; it is thus both data- and invariant-driven learning. From a mathematical point of view, methods of the classical paradigm employ mechanisms of strong convergence of approximations to the desired function, whereas methods of the new paradigm employ both strong and weak convergence mechanisms. This can significantly increase the rate of convergence.
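The strong/weak distinction the abstract relies on can be stated in standard functional-analytic notation (a generic formulation, not the paper's own symbols):

```latex
% Strong convergence: the approximations f_n approach f in norm
\lim_{n \to \infty} \| f_n - f \| = 0
% Weak convergence: only inner products against test functions \phi converge
\lim_{n \to \infty} \langle f_n - f, \, \phi \rangle = 0 \quad \text{for all } \phi
```

Weak convergence is a less demanding requirement, which is why enforcing it via invariants can constrain the solution without slowing the search.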

Bad Bots Are Stealing Data AND Ruining Customer Experience

Every online customer touchpoint – including websites, mobile apps, and APIs – is being attacked by bots. What are these bad bots doing? Interrupting good customer traffic, committing fraud, and stealing information – ad fraud alone is set to exceed $3.3 billion in 2018! If all that wasn't bad enough, bots are also trying to skew the data your company uses to make decisions. Your marketing and customer experience colleagues track user behavior to improve customer journeys or buy advertising. Unless you're actively defending against bad bots, these decisions could be way off base, and extremely costly.

Union Multiple Data.Frames with Different Column Names

On Friday, while working on a project in which I needed to union multiple data.frames with different column names, I realized that the base::rbind() function doesn't take data.frames with different column names, and therefore just quickly drafted a rbind2() function on the fly to get the job done, based on the idea of MapReduce that I discussed before (https://…/playing-map-and-reduce-in-r-subsetting ).
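The post's rbind2() is the author's own R code; as an illustrative analogue only, pandas' concat handles the same situation out of the box, unioning the column sets and filling the gaps with NaN:

```python
import pandas as pd

# Two frames with only partially overlapping columns (toy data)
a = pd.DataFrame({"id": [1, 2], "x": [10, 20]})
b = pd.DataFrame({"id": [3], "y": [30]})

# concat unions the column sets and fills missing cells with NaN --
# the behaviour that base::rbind() lacks in R
combined = pd.concat([a, b], ignore_index=True, sort=False)
print(sorted(combined.columns))  # ['id', 'x', 'y']
```

The same union-of-columns idea is what the post's MapReduce-style rbind2() implements in base R.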

Manage your Data Science project structure in early stage.

Jupyter Notebook (or Colab, Databricks notebooks, etc.) provides a very efficient way to build a project in a short time. We can create any Python class or function in the notebook without re-launching the kernel, which shortens the waiting time. This is good for small-scale projects and experiments; however, it may not be good for long-term growth.

How to rapidly test dozens of deep learning models in Python

Optimizing machine learning (ML) models is not an exact science. The best model architecture, optimization algorithm and hyperparameter settings depend on the data you’re working with. Thus, being able to quickly test several model configurations is imperative in maximizing productivity & driving progress in your ML project. In this article, we’ll create an easy-to-use interface which allows you to do this. We’re essentially going to build an assembly line for ML models.
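The article builds its own interface; as a minimal sketch of the underlying idea (names and search space are illustrative, not the article's actual API), one can enumerate every model configuration from a hyperparameter grid and feed them to a training loop one by one:

```python
from itertools import product

# Hypothetical search space -- keys and values are illustrative
search_space = {
    "layers": [1, 2, 3],
    "units": [32, 64],
    "optimizer": ["adam", "sgd"],
}

def model_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(model_configs(search_space))
print(len(configs))  # 3 * 2 * 2 = 12 configurations to test
```

Each emitted dict can then be passed to whatever build-and-train function the project uses, which is the "assembly line" pattern the article develops in full.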

Evolution of Spark Analytics

Apache Spark is an open-source, scalable, massively parallel, in-memory execution environment for running analytics applications. The data scientist is primarily responsible for building predictive analytic models and generating insights. He or she analyzes data that has been cataloged and prepared by the data engineer, using machine learning tools like Watson Machine Learning, and builds applications using Jupyter Notebooks or RStudio. After the data scientist shares the analytical outputs, an application developer can build apps such as a cognitive chatbot. As the chatbot engages with customers, it continuously improves its knowledge and helps uncover new insights.
Let's step into the shoes of a data scientist and see what I want to do in that role:
• I want to run my analytics jobs, such as social media analytics or text analytics
• I want to run queries on demand
• I want to run R and Python scripts on Spark
• I want to submit Spark jobs
• I want to view the History Server logs of my application, so that I can compare my jobs' performance and improve it further
• I want to see daemon logs for debugging
• I want to write notebooks

Portfolio Optimization with Deep Reinforcement Learning

Portfolio optimization, the process of assigning optimal weights to the assets in a financial portfolio, is a fundamental problem in financial engineering. There are many approaches one can follow: for passive investments the most common is liquidity-based weighting or market-capitalization weighting. If one has no view on investment performance, one follows equal weighting. Following the Capital Asset Pricing Model, the most elegant solution is the Markowitz optimal portfolio, where risk-averse investors try to maximize return for their chosen level of risk.
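Before the deep-RL approach, it helps to see the classical baseline. A minimal sketch of the Markowitz-style global minimum-variance portfolio (made-up covariance numbers, unconstrained weights, not the article's RL method):

```python
import numpy as np

# Illustrative covariance matrix for three assets (hypothetical numbers)
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.16]])

# Global minimum-variance weights: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)
ones = np.ones(len(cov))
w = np.linalg.solve(cov, ones)
w /= w.sum()  # normalize so the weights sum to 1

print(w.round(3))
```

This closed-form answer minimizes variance alone; the article's point is that a reinforcement-learning agent can instead learn weights from data without assuming the covariance structure is known and stable.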

R Packages worth a look

L1-Penalized Censored Gaussian Graphical Models (cglasso)
The l1-penalized censored Gaussian graphical model (cglasso) is an extension of the graphical lasso estimator developed to handle datasets with censore …

Targeted Stable Balancing Weights Using Optimization (optweight)
Use optimization to estimate weights that balance covariates for binary, multinomial, continuous, and longitudinal treatments in the spirit of Zubizarr …

Securely Wrangle Dataset According to Data Usage Agreement (duawranglr)
Create shareable data sets from raw data files that contain protected elements. Relying on master crosswalk files that list restricted variables, packa …

If you did not already know

Risk-Averse Imitation Learning (RAIL) google
Imitation learning algorithms learn viable policies by imitating an expert’s behavior when reward signals are not available. Generative Adversarial Imitation Learning (GAIL) is a state-of-the-art algorithm for learning policies when the expert’s behavior is available as a fixed set of trajectories. We evaluate in terms of the expert’s cost function and observe that the distribution of trajectory-costs is often more heavy-tailed for GAIL-agents than the expert at a number of benchmark continuous-control tasks. Thus, high-cost trajectories, corresponding to tail-end events of catastrophic failure, are more likely to be encountered by the GAIL-agents than the expert. This makes the reliability of GAIL-agents questionable when it comes to deployment in safety-critical applications like robotic surgery and autonomous driving. In this work, we aim to minimize the occurrence of tail-end events by minimizing tail-risk within the GAIL framework. We quantify tail-risk by the Conditional-Value-at-Risk (CVaR) of trajectories and develop the Risk-Averse Imitation Learning (RAIL) algorithm. We observe that the policies learned with RAIL show lower tail-end risk than those of vanilla GAIL. Thus the proposed RAIL algorithm appears as a potent alternative to GAIL for improved reliability in safety-critical applications. …
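The CVaR tail-risk measure the abstract optimizes is easy to state concretely. A sketch (hypothetical trajectory costs, not the paper's benchmarks):

```python
import numpy as np

# Hypothetical per-trajectory costs from an imitation-learning agent
costs = np.array([1.0, 1.2, 0.9, 5.0, 1.1, 8.0, 1.0, 1.3])

def cvar(costs, alpha=0.25):
    """Conditional Value-at-Risk: the mean cost of the worst alpha
    fraction of trajectories -- the tail-risk quantity RAIL minimizes."""
    k = max(1, int(np.ceil(alpha * len(costs))))
    worst = np.sort(costs)[-k:]  # the k highest-cost trajectories
    return worst.mean()

print(cvar(costs))  # mean of the two worst costs: (5.0 + 8.0) / 2 = 6.5
```

Minimizing this quantity, rather than the average cost, is what pushes the learned policy away from the heavy-tailed failure modes observed with vanilla GAIL.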

Pairwise Issue Expansion google
A public decision-making problem consists of a set of issues, each with multiple possible alternatives, and a set of competing agents, each with a preferred alternative for each issue. We study adaptations of market economies to this setting, focusing on binary issues. Issues have prices, and each agent is endowed with artificial currency that she can use to purchase probability for her preferred alternatives (we allow randomized outcomes). We first show that when each issue has a single price that is common to all agents, market equilibria can be arbitrarily bad. This negative result motivates a different approach. We present a novel technique called ‘pairwise issue expansion’, which transforms any public decision-making instance into an equivalent Fisher market, the simplest type of private goods market. This is done by expanding each issue into many goods: one for each pair of agents who disagree on that issue. We show that the equilibrium prices in the constructed Fisher market yield a ‘pairwise pricing equilibrium’ in the original public decision-making problem which maximizes Nash welfare. More broadly, pairwise issue expansion uncovers a powerful connection between the public decision-making and private goods settings; this immediately yields several interesting results about public decisions markets, and furthers the hope that we will be able to find a simple iterative voting protocol that leads to near-optimum decisions. …
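The core construction is mechanical enough to sketch: for each binary issue, create one artificial good per pair of agents who disagree on it (a toy illustration with made-up preferences, not the paper's full Fisher-market construction with prices):

```python
from itertools import combinations

# Hypothetical preferences: agent -> preferred alternative (0/1) per issue
prefs = {
    "alice": [1, 0, 1],
    "bob":   [0, 0, 1],
    "carol": [1, 1, 0],
}

def expand_issues(prefs):
    """One good per (issue, pair of disagreeing agents) -- the
    'pairwise issue expansion' step."""
    goods = []
    n_issues = len(next(iter(prefs.values())))
    for issue in range(n_issues):
        for a, b in combinations(sorted(prefs), 2):
            if prefs[a][issue] != prefs[b][issue]:
                goods.append((issue, a, b))
    return goods

goods = expand_issues(prefs)
print(goods)
```

The resulting private-goods instance is then solved as an ordinary Fisher market, and its equilibrium prices are mapped back to a pairwise pricing equilibrium on the original issues.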

PyData google
PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. We aim to be an accessible, community-driven conference, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases. …

Book Memo: “(Almost) Impossible Integrals, Sums, and Series”

This book contains a multitude of challenging problems and solutions that are not commonly found in classical textbooks. One goal of the book is to present these fascinating mathematical problems in a new and engaging way and illustrate the connections between integrals, sums, and series, many of which involve zeta functions, harmonic series, polylogarithms, and various other special functions and constants. Throughout the book, the reader will find both classical and new problems, with numerous original problems and solutions coming from the personal research of the author. Where classical problems are concerned, such as those given in Olympiads or proposed by famous mathematicians like Ramanujan, the author has come up with new, surprising or unconventional ways of obtaining the desired results. The book begins with a lively foreword by renowned author Paul Nahin and is accessible to those with a good knowledge of calculus, from undergraduate students to researchers; it will appeal to all mathematical puzzlers who love a good integral or series.

Document worth reading: “Graph-based Ontology Summarization: A Survey”

Ontologies have been widely used in numerous and varied applications, e.g., to support data modeling, information integration, and knowledge management. With the increasing size of ontologies, ontology understanding, which is playing an important role in different tasks, is becoming more difficult. Consequently, ontology summarization, as a way to distill key information from an ontology and generate an abridged version to facilitate a better understanding, is getting growing attention. In this survey paper, we review existing ontology summarization techniques and focus mainly on graph-based methods, which represent an ontology as a graph and apply centrality-based and other measures to identify the most important elements of an ontology as its summary. After analyzing their strengths and weaknesses, we highlight a few potential directions for future research. Graph-based Ontology Summarization: A Survey
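The centrality-based approach the survey focuses on can be sketched in a few lines: score every node of the ontology graph by degree and keep the top-k as the summary (a toy graph and plain degree centrality, standing in for the richer measures the survey reviews):

```python
# Toy ontology graph: (subclass, superclass) edges -- illustrative only
edges = [
    ("Person", "Agent"), ("Organization", "Agent"),
    ("Student", "Person"), ("Professor", "Person"),
    ("Course", "Thing"), ("Agent", "Thing"),
]

def top_k_by_degree(edges, k=2):
    """Rank nodes by degree centrality; the top k form the summary."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Break ties alphabetically for a deterministic summary
    return sorted(degree, key=lambda n: (-degree[n], n))[:k]

print(top_k_by_degree(edges))
```

Real systems substitute more discriminating measures (eigenvector centrality, betweenness, coverage-aware scores), but the select-the-most-central-elements skeleton is the same.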


The 2018 State of Data Management

Profisee, a leading global data management technology company, released the results of its first annual data management report, 'The 2018 State of Data Management.' The survey, conducted between January and April of 2018, aims to provide insights into data management strategy, challenges, trends, benchmarks and 'how others are doing it'. It is Profisee's hope that data management professionals across the global data management community can use this information to help drive new initiatives, improve existing efforts and compare themselves against how others across the globe are engaged with data management. The report found two interrelated themes that highlight the biggest opportunities to better implement common best practices in data management: a greater need for business engagement versus IT-led initiatives, and weighing the benefits of a data management program between cost savings and driving revenue.

Reproducible development with Rmarkdown and Github

I'm pretty sure most readers of this blog are already familiar with Rmarkdown and GitHub. In this post I don't pretend to reinvent the wheel, but rather give a quick run-down of how I set up and use these tools to produce high-quality and scalable (in human time) reproducible data science code.

Data Capture — the Deep Learning Way

Information extraction from text is one of the fairly popular machine learning research areas, often embodied in Named Entity Recognition, Knowledge Base Completion or similar tasks. But in business, many information extraction problems do not fit well into the academic taxonomy – take the problem of capturing data from layout-heavy business documents like invoices. NER-like approaches are a poor fit because there isn't rich text context and layout plays an important role in encoding the information. At the same time, the layouts can be so variable that simple template matching doesn't cut it at all. Yet capturing data is a huge business problem – invoices alone are exchanged at a rate of one billion per day worldwide. Only a tiny fraction of the problem is solved by traditional data capture software, leaving armies of people to do the mindless paper-pushing drudgery – meanwhile we are making strides in self-driving technology and steadily approaching human-level machine translation. It was genuinely surprising for us at Rossum when we realized this absurd gap. Perhaps our main ambition in this post is to inspire the reader in how it's possible to make a difference in an entrenched field using deep learning. Everyone is solving (and applying) the standard academic tasks – the key for us was to ignore the shackles of their standard formulations and rephrase the problem from the first principles of deep learning.

Rank Collapse in Deep Learning

We can learn a lot about Why Deep Learning Works by studying the properties of the layer weight matrices of pre-trained neural networks. And, hopefully, by doing this, we can get some insight into what a well-trained DNN looks like, even without peeking at the training data. One broad question we can ask is: how is information concentrated in Deep Neural Networks (DNNs)? To get a handle on this, we can run 'experiments' on the pre-trained DNNs available in PyTorch.
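One such 'experiment' is to look at the singular value spectrum of a layer's weight matrix: if most of the spectral mass sits in a few directions, information is concentrated there. A sketch on a synthetic low-rank-plus-noise matrix (a stand-in for a real checkpoint, not the article's actual measurements):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained layer weight matrix: a rank-5 signal
# plus small noise (the article inspects real PyTorch checkpoints)
signal = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 80))
W = signal + 0.01 * rng.normal(size=(100, 80))

# Singular values, sorted in decreasing order
s = np.linalg.svd(W, compute_uv=False)

# Fraction of spectral mass in the top 5 singular values; a value
# near 1.0 indicates the information is concentrated in few directions
top_mass = s[:5].sum() / s.sum()
print(round(top_mass, 3))
```

On trained networks, how far this concentration goes, and whether the spectrum follows heavy-tailed laws, is exactly what the article investigates.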

Machine Learning: A Gentle Introduction.

In the past decade or so, Machine Learning and, more broadly, Data Science have taken the technology world by storm. Almost every tech enthusiast has, or wants to have, a piece of it. In 2012 Harvard Business Review called the job of Data Scientist 'The Sexiest Job of the 21st Century', and six years hence it still holds that tag tight and high. But what makes it so appealing? Let's have a closer look.

How Do Artificial Neural Networks Learn?

This is the fifth post in the series that I am writing based on the book First contact with DEEP LEARNING, Practical introduction with Keras. In it I will present an intuitive vision of the main components of the learning process of a neural network and put into practice some of the concepts presented here with an interactive tool called TensorFlow Playground.
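The learning loop the post walks through (forward pass, loss, gradients, parameter update) fits in a few lines. A minimal sketch with one linear neuron and squared loss (illustrative, not code from the book):

```python
# Plain gradient descent on a single linear neuron: y_hat = w*x + b
def train(xs, ys, lr=0.1, steps=200):
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = (w * x + b) - y          # forward pass + error
            grad_w += 2 * err * x / len(xs)  # gradient of MSE w.r.t. w
            grad_b += 2 * err / len(xs)      # gradient of MSE w.r.t. b
        w -= lr * grad_w                   # update step: move against
        b -= lr * grad_b                   # the gradient
    return w, b

# Learn y = 2x + 1 from four points
w, b = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w, 2), round(b, 2))
```

TensorFlow Playground animates exactly this cycle, with more neurons and layers, which is why it pairs well with the post's intuition-first treatment.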

Ethics & Algorithms Toolkit

Government leaders and staff who leverage algorithms are facing increasing pressure from the public, the media, and academic institutions to be more transparent and accountable about their use. Every day, stories come out describing the unintended or undesirable consequences of algorithms. Governments have not had the tools they need to understand and manage this new class of risk. GovEx, the City and County of San Francisco, Harvard DataSmart, and Data Community DC have collaborated on a practical toolkit for cities to use to help them understand the implications of using an algorithm, clearly articulate the potential risks, and identify ways to mitigate them.

R Packages worth a look

Finding Patterns of Monotonicity and Convexity in Data (DIconvex)
Given an initial set of points, this package minimizes the number of elements to discard from this set such that there exists at least one monotonic an …

binb’ is not ‘Beamer’ (binb)
A collection of ‘LaTeX’ styles using ‘Beamer’ customization for pdf-based presentation slides in ‘RMarkdown’. At present it contains ‘RMarkdown’ adapta …

Estimating Speaker Style Distinctiveness (stylest)
Estimates distinctiveness in speakers’ (authors’) style. Fits models that can be used for predicting speakers of new texts. Methods developed in Spirli …

Document worth reading: “On the Learning Dynamics of Deep Neural Networks”

While a lot of progress has been made in recent years, the dynamics of learning in deep nonlinear neural networks remain to this day largely misunderstood. In this work, we study the case of binary classification and prove various properties of learning in such networks under strong assumptions such as linear separability of the data. Extending existing results from the linear case, we confirm empirical observations by proving that the classification error also follows a sigmoidal shape in nonlinear architectures. We show that given proper initialization, learning expounds parallel independent modes and that certain regions of parameter space might lead to failed training. We also demonstrate that input norm and features’ frequency in the dataset lead to distinct convergence speeds which might shed some light on the generalization capabilities of deep neural networks. We provide a comparison between the dynamics of learning with cross-entropy and hinge losses, which could prove useful to understand recent progress in the training of generative adversarial networks. Finally, we identify a phenomenon that we baptize gradient starvation where the most frequent features in a dataset prevent the learning of other less frequent but equally informative features. On the Learning Dynamics of Deep Neural Networks