Designing powerful outlier and anomaly detection algorithms requires the right tools, and robust statistical distances can help. Statistical distances are distances between distributions or samples, used in a variety of machine learning applications such as anomaly and outlier detection, ordinal regression, and generative adversarial networks (GANs). This post explores how to compare distributions using both visual tools and robust statistical distances. When comparing data sets, statistical tests can tell you whether two data samples are likely to have been generated by the same process, or whether they are related in some way. In some cases, however, rather than deciding whether to reject a statistical hypothesis, you want to measure how similar or far apart the data sets are without making any assumptions. Statistical distances, as distances between samples, are an appealing answer to that problem.
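As a concrete illustration (not from the linked post), here is a minimal sketch of one widely used statistical distance, the 1-D Wasserstein-1 (earth mover's) distance between two equal-size samples; the function name is ours, and SciPy's `scipy.stats.wasserstein_distance` covers the general case:

```python
import numpy as np

def wasserstein_1d(sample_a, sample_b):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples.

    In one dimension the optimal transport plan simply pairs order
    statistics, so the distance reduces to the mean absolute difference
    of the sorted values.
    """
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    assert a.shape == b.shape, "this sketch assumes equal-size samples"
    return float(np.mean(np.abs(a - b)))
```

For identical samples the distance is 0, and shifting one sample by a constant shifts the distance by that constant, which is the kind of "how far apart are these data sets" measurement the post is about.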
In 2018, Fast Company declared Data Scientist the best job for the third year in a row, and I wholeheartedly agree (Director of Fun at the York National Railway Museum aside). However, the role of data scientist as we know it will soon share the fate of bowling pinsetters, chariot racers, and human alarm clocks. From 2000 to 2010, data science was dominated by masters of herculean subjects: PhDs in linear algebra and statistics, combined with expertise in the then-uncelebrated field of coding. Data science truly emphasized the science of manipulating data, focusing on how to mathematically validate significance and trends. This was a great first step in helping society gain insights from the massive influx of big data, but it now has its drawbacks. Tipping the balance too far toward degrees of freedom and vectors works in the ivory towers of academia, but it is not ideal when businesses need practical, timely results. I recently heard a story about a team of PhD data scientists at a Fortune 500 company who were struggling to improve the accuracy of their built-from-scratch multi-layered neural network. They spent hours meticulously tuning cryptic hyperparameters and adding layers to their model, with no success. The data then fell into the hands of an employee fresh out of his undergraduate degree. After a quick look at the data, his first step was to build a simple regression model and remove all zero values, immediately skyrocketing accuracy and creating a cluster of self-conscious PhDs. Despite his lack of experience with scalar multiplication or multi-threaded programming, his domain and practical knowledge made all the difference.
… The answer to these challenges is an ingenious approach to data analytics that transforms, pre-sorts, pre-codes, and pre-fabricates raw data into signals. Signals turn massive data sets into manageable, relevant pieces that expedite analysis. Signals are useful information about events, customers, systems, and interactions; they describe behaviors, events, and attributes, and they can predict future outcomes. A typical business can be profiled with 3,000 baseline signals, which are essentially its family jewels. Signals are combinations of raw data, and they often mirror patterns and dynamics in a business or an industry. Once created, signals can be reused or recombined like LEGO building blocks: they can be constructed, connected, and reconfigured easily and quickly, eliminating the need to go back to or reprocess raw data every time a new use case comes up. …
Python Fire is a library for automatically generating command line interfaces (CLIs) from absolutely any Python object.
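To give a feel for it, here is a hypothetical example (the `Calculator` class and file name `calc.py` are ours) of how Fire turns an ordinary Python class into a CLI; the actual Fire invocation is shown in comments since it requires `pip install fire`:

```python
class Calculator:
    """A toy class whose public methods become CLI commands."""

    def double(self, number):
        return 2 * number

    def add(self, a, b):
        return a + b

# Wiring it up as a CLI (in calc.py):
#
#     import fire
#
#     if __name__ == "__main__":
#         fire.Fire(Calculator)
#
# Then from the shell, method names and arguments map directly to
# the command line:
#
#     $ python calc.py double 10
#     $ python calc.py add 2 3
```

The point of Fire is that the class itself needs no CLI-specific code at all: argument parsing, help text, and command dispatch are generated from the object's methods and signatures.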
Machine learning (ML) is being touted as the solution to problems in every phase of the software development product lifecycle, from automating the cleansing of data as it is ingested to replacing textual user interfaces with chatbots. But developing and deploying software based on machine learning is a very different animal in terms of process and workflow: as software engineers gain more experience developing and deploying production-quality ML solutions, it is becoming clear that ML development is unique compared to other types of software.
Uber uses convolutional neural networks in many domains that could potentially involve coordinate transforms, from designing self-driving vehicles to automating street sign detection for map building to maximizing the efficiency of spatial movements in the Uber Marketplace. In deep learning, few ideas have had as much impact as convolution. Almost all state-of-the-art results in machine vision use stacks of convolutional layers as basic building blocks. Since such architectures are widespread, we should expect that they excel at simple tasks like painting a single pixel in a tiny image, right? Surprisingly, it turns out that convolution often has difficulty completing seemingly trivial tasks. In our paper, An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution, we expose and analyze a generic inability of convolutional neural networks (CNNs) to transform spatial representations between two different types: coordinates in (i, j) Cartesian space and coordinates in one-hot pixel space. This is surprising because the task appears so simple, and it may be important because such coordinate transforms seem to be required for many common tasks, like detecting objects in images, training generative models of images, and training reinforcement learning (RL) agents from pixels. These tasks may have subtly suffered from this failing of convolution all along, as suggested by the performance improvements we demonstrate across several domains when using our proposed solution, a layer called CoordConv.
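The core of CoordConv is simple: before applying a standard convolution, concatenate two extra channels holding each pixel's own (i, j) coordinates, so the filters can see where they are. A minimal NumPy sketch of that coordinate-channel augmentation (the function name is ours; the paper's layer follows this with an ordinary convolution over the augmented input):

```python
import numpy as np

def add_coord_channels(x):
    """Append normalized (i, j) coordinate channels to an NCHW batch.

    Input  shape: (n, c, h, w)
    Output shape: (n, c + 2, h, w), where the two extra channels hold
    the row index i and column index j of each pixel, scaled to [-1, 1].
    """
    n, c, h, w = x.shape
    # Row (i) coordinates: vary along the height axis.
    i = np.linspace(-1.0, 1.0, h).reshape(1, 1, h, 1)
    # Column (j) coordinates: vary along the width axis.
    j = np.linspace(-1.0, 1.0, w).reshape(1, 1, 1, w)
    i = np.broadcast_to(i, (n, 1, h, w))
    j = np.broadcast_to(j, (n, 1, h, w))
    return np.concatenate([x, i, j], axis=1)
```

With these channels present, a task like "paint pixel (i, j)" becomes a pointwise function of the input rather than something the convolution must infer from translation-invariant filters alone.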