The 2018 State of Data Management

Profisee, a leading global data management technology company, has released the results of its first annual data management report, ‘The 2018 State of Data Management.’ The survey, conducted between January and April of 2018, aims to provide insights into data management strategy, challenges, trends, benchmarks and ‘how others are doing it’. Profisee hopes that data management professionals across the global community can use this information to drive new initiatives, improve existing efforts and compare their own practice against how others around the world approach data management. The report found two interrelated themes that highlight the biggest opportunities to better implement common best practices in data management: a greater need for business engagement versus IT-led initiatives, and the need to weigh the benefits of a data management program in terms of both cost savings and its ability to help drive revenue.

Reproducible development with Rmarkdown and Github

I’m pretty sure most readers of this blog are already familiar with Rmarkdown and Github. In this post I don’t pretend to reinvent the wheel but rather give a quick run-down of how I set up and use these tools to produce high-quality, scalable (in human time) and reproducible data science development code.

Data Capture — the Deep Learning Way

Information extraction from text is a fairly popular machine learning research area, often embodied in Named Entity Recognition, Knowledge Base Completion or similar tasks. But in business, many information extraction problems do not fit well into the academic taxonomy – take the problem of capturing data from layout-heavy business documents like invoices. NER-like approaches are a poor fit because there isn’t rich text context and layout plays an important role in encoding the information. At the same time, the layouts can be so variable that simple template matching doesn’t cut it at all. Yet capturing data is a huge business problem – for invoices alone, one billion are exchanged worldwide per day. Only a tiny fraction of the problem is solved by traditional data capture software, leaving armies of people to do the mindless paper-pushing drudgery – while we are making strides in self-driving technology and steadily approaching human-level machine translation. It was genuinely surprising for us at Rossum when we realized this absurd gap. Perhaps our main ambition in this post is to inspire the reader by showing how it’s possible to make a difference in an entrenched field using deep learning. Everyone is solving (and applying) the standard academic tasks – the key for us was to ignore the shackles of their standard formulations and rephrase the problem from the first principles of deep learning.

Rank Collapse in Deep Learning

We can learn a lot about Why Deep Learning Works by studying the properties of the layer weight matrices of pre-trained neural networks. And, hopefully, by doing this, we can get some insight into what a well-trained DNN looks like, even without peeking at the training data. One broad question we can ask is: How is information concentrated in Deep Neural Networks (DNNs)? To get a handle on this, we can run ‘experiments’ on the pre-trained DNNs available in PyTorch.
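As a rough illustration of the kind of ‘experiment’ described here, the sketch below inspects the singular values of a weight matrix to see whether its rank has collapsed. It is not the linked post's code: the matrices are random stand-ins for real pre-trained layer weights, and the helper names numerical_rank and stable_rank are hypothetical, chosen only for this example.

```python
import numpy as np


def numerical_rank(W, tol=None):
    """Count singular values above a tolerance.

    The default tolerance follows the rule used by numpy.linalg.matrix_rank:
    largest singular value * largest dimension * machine epsilon.
    """
    s = np.linalg.svd(W, compute_uv=False)
    if tol is None:
        tol = s.max() * max(W.shape) * np.finfo(W.dtype).eps
    return int((s > tol).sum())


def stable_rank(W):
    """Squared Frobenius norm divided by squared spectral norm.

    A smooth proxy for rank that never exceeds the true rank.
    """
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / s.max() ** 2)


rng = np.random.default_rng(0)

# Assumption: a dense Gaussian matrix stands in for a healthy layer.
W_full = rng.standard_normal((64, 32))

# Assumption: a product of two thin factors stands in for a layer whose
# rank has collapsed (rank is at most 4 by construction).
W_low = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 32))

for name, W in [("full", W_full), ("low", W_low)]:
    print(name, numerical_rank(W), round(stable_rank(W), 2))
```

To try this on an actual network, one would replace the random matrices with a layer's weights converted to a 2-D NumPy array; how that layer is loaded is left out here on purpose.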

Machine Learning: A Gentle Introduction.

In the past decade or so, Machine Learning, and Data Science more broadly, have taken the technology world by storm. Almost every tech enthusiast has, or wants to have, a piece of it. In 2012 Harvard Business Review called the job of a Data Scientist ‘The Sexiest Job of the 21st Century’, and six years on, it still holds that title firmly. But what makes it so appealing? Let’s have a closer look.

How Do Artificial Neural Networks Learn?

This is the fifth post in the series that I am writing based on the book First contact with DEEP LEARNING, Practical introduction with Keras. In it I will present an intuitive vision of the main components of the learning process of a neural network and put into practice some of the concepts presented here with an interactive tool called TensorFlow Playground.

Ethics & Algorithms Toolkit

Government leaders and staff who leverage algorithms are facing increasing pressure from the public, the media, and academic institutions to be more transparent and accountable about their use. Every day, stories come out describing the unintended or undesirable consequences of algorithms. Governments have not had the tools they need to understand and manage this new class of risk. GovEx, the City and County of San Francisco, Harvard DataSmart, and Data Community DC have collaborated on a practical toolkit for cities to use to help them understand the implications of using an algorithm, clearly articulate the potential risks, and identify ways to mitigate them.