Auditing Black-box Models for Indirect Influence

Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score. It is therefore hard to acquire a deeper understanding of model behavior, and in particular how different features influence the model prediction. This is important when interpreting the behavior of complex models, or asserting that certain problematic attributes (like race or gender) are not unduly influencing decisions. In this paper, we present a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the dataset, without knowing how the models work. Our work focuses on the problem of indirect influence: how some features might indirectly influence outcomes via other, related features. As a result, we can find attribute influences even in cases where, upon further direct examination of the model, the attribute is not referred to by the model at all. Our approach does not require the black-box model to be retrained. This is important if (for example) the model is only accessible via an API, and contrasts our work with other methods that investigate feature influence like feature selection. We present experimental evidence for the effectiveness of our procedure using a variety of publicly available datasets and models. We also validate our procedure using techniques from interpretable learning and feature selection, as well as against other black-box auditing procedures.
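For intuition, here is a minimal sketch of the simpler, direct-influence version of this idea: permutation auditing, where we shuffle one feature at a time and measure the black-box model's accuracy drop, without retraining. Note this is classic permutation importance, not the paper's indirect-influence method (which additionally removes the information other features carry about the audited attribute); the dataset and model below are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in "black box": we only call predict(), never inspect internals.
X, y = make_classification(n_samples=2_000, n_features=6, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
baseline = (model.predict(X) == y).mean()

rng = np.random.default_rng(0)
for j in range(X.shape[1]):
    X_perturbed = X.copy()
    rng.shuffle(X_perturbed[:, j])  # destroy feature j's information
    drop = baseline - (model.predict(X_perturbed) == y).mean()
    print(f"feature {j}: accuracy drop {drop:+.3f}")
```

A feature can score near zero here and still matter indirectly, which is exactly the gap the paper's method addresses.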


Data Science Ethics Course

In this course, we will explore the moral, social, and ethical ramifications of the choices we make at the different stages of the data analysis pipeline, from data collection and storage to understanding the feedback loops that arise in analysis. Through class discussions, case studies, and exercises, students will learn the basics of ethical thinking in science, understand the history of ethical dilemmas in scientific work, and study the distinct challenges associated with ethics in modern data science.


Data analytics for internet of things: A review

The internet of things (IoT), which provides a way to connect every “thing” via the internet and thereby build a more convenient environment, has been around for more than a decade. The current trend in IoT development is to focus not only on devices and systems but also on data analysis. The main reason is that data from sensors or systems typically contain valuable information that can improve system performance or provide a better service to the user, given a good “data analysis” solution. This paper begins with a brief review of data mining technologies for IoT. Then, a reference data analytics architecture is given to show how data analysis technologies can be applied to an IoT system. Finally, applications, open issues, and possible research directions are addressed.


Saving, resuming, and restarting experiments with Polyaxon

In this post we will introduce a new set of features on Polyaxon: checkpointing, resuming, and restarting experiments. Often data scientists can’t afford to let their models run for days before making adjustments, and infrastructure crashes can interrupt training and force them to run their models all over again. It’s crucial to be able to stop training at any point, for any reason, and resume it later on. It’s also crucial to be able to resume an experiment with different parameters multiple times without losing the original progress. Experiments should ideally be immutable and reproducible, and in order to add this structure to your experiments, Polyaxon creates and exposes a couple of paths for every experiment. These paths are created on the volumes (logs and outputs) provided during the deployment. You don’t need to figure out these paths or hardcode them manually: Polyaxon provides an environment variable for the outputs, POLYAXON_OUTPUTS_PATH, that you can use to export your outputs, artifacts, and checkpoints, and you can also use our helper get_outputs_path to get the paths. In this post, we will go over some strategies to save your work and resume or restart it.
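A minimal sketch of the checkpoint/resume pattern (the helper functions here are hypothetical, not Polyaxon APIs; PyTorch is used purely for illustration). It reads POLYAXON_OUTPUTS_PATH from the environment, with a made-up local fallback for running the script outside Polyaxon:

```python
import os

import torch  # checkpoints shown with PyTorch purely for illustration

# Polyaxon injects POLYAXON_OUTPUTS_PATH into the experiment's environment;
# "./outputs" is a made-up fallback for running the script outside Polyaxon.
outputs_path = os.environ.get("POLYAXON_OUTPUTS_PATH", "./outputs")
os.makedirs(outputs_path, exist_ok=True)
checkpoint_file = os.path.join(outputs_path, "checkpoint.pt")

def save_checkpoint(model, optimizer, epoch):
    """Persist enough state to resume or restart the experiment later."""
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, checkpoint_file)

def maybe_resume(model, optimizer):
    """Pick up from the last checkpoint if one exists; otherwise start fresh."""
    if os.path.exists(checkpoint_file):
        state = torch.load(checkpoint_file)
        model.load_state_dict(state["model_state"])
        optimizer.load_state_dict(state["optimizer_state"])
        return state["epoch"] + 1  # resume at the next epoch
    return 0
```

Because the checkpoint lives on the experiment's outputs volume rather than a hardcoded path, the same script can be stopped, resumed, or restarted with different parameters without losing the original progress.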


Hadoop 3: Comparison with Hadoop 2 and Spark

The major difference between Hadoop 3 and Hadoop 2 is that the new version provides better optimization and usability, as well as certain architectural improvements. Spark and Hadoop differ mainly in their level of abstraction. Hadoop was created as an engine for processing large amounts of existing data; its low level of abstraction permits complex manipulations but can make the system hard to learn and manage. Spark is easier and faster, with many convenient high-level tools and functions that simplify your work. Spark operates on top of Hadoop and ships with useful libraries such as Spark SQL and the machine learning library MLlib. To summarize: if your work does not require Hadoop-specific features, Spark can be the more reasonable choice.
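To make the abstraction gap concrete, here is a minimal, hypothetical PySpark sketch (the input path is made up): a word count that would need a full mapper, reducer, and driver class in classic Hadoop MapReduce fits in a few lines of high-level DataFrame code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read lines, split into words, count, and rank: a few declarative steps
# instead of hand-written MapReduce classes. Input path is hypothetical.
lines = spark.read.text("hdfs:///data/books.txt")
counts = (lines
          .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
          .groupBy("word")
          .count()
          .orderBy(F.desc("count")))
counts.show(10)
```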


How to do Semantic Segmentation using Deep learning

This article is a comprehensive overview of semantic segmentation, including a step-by-step guide to implementing a deep learning image segmentation model.


Announcing PyTorch 1.0 for both research and production

The path for taking AI development from research to production has historically involved multiple steps and tools, making it time-intensive and complicated to test new approaches, deploy them, and iterate to improve accuracy and performance. To help accelerate and optimize this process, we’re introducing PyTorch 1.0, the next version of our open source AI framework. PyTorch 1.0 takes the modular, production-oriented capabilities from Caffe2 and ONNX and combines them with PyTorch’s existing flexible, research-focused design to provide a fast, seamless path from research prototyping to production deployment for a broad range of AI projects. With PyTorch 1.0, AI developers can both experiment rapidly and optimize performance through a hybrid front end that seamlessly transitions between imperative and declarative execution modes. The technology in PyTorch 1.0 has already powered many Facebook products and services at scale, including performing 6 billion text translations per day.
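As a rough sketch of what the hybrid front end looks like in practice (the model, shapes, and file name are made up for illustration), an eager PyTorch model can be traced into a static Torch Script graph and serialized for deployment outside Python:

```python
import torch

# Eager (imperative) mode: an ordinary Python model, made up for this example.
class TwoLayerNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(4, 8)
        self.fc2 = torch.nn.Linear(8, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TwoLayerNet()
example_input = torch.randn(1, 4)

# Hybrid front end: trace the eager model into a static Torch Script graph
# that can be serialized and later loaded without Python (e.g. from C++).
traced = torch.jit.trace(model, example_input)
traced.save("model.pt")
```

The same code that was iterated on in eager mode becomes a deployable artifact, which is the research-to-production path the release emphasizes.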


Apache Spark: Python vs. Scala

When it comes to using the Apache Spark framework, the data science community is divided into two camps: one prefers Scala, while the other prefers Python. This article compares the two, listing the pros and cons of each.


Skewness vs Kurtosis – The Robust Duo

Kurtosis and Skewness are very close relatives in the family of standardized statistical moments: Skewness is the third moment and Kurtosis the fourth. Yet they are often used to detect very different phenomena in data. At the same time, it is typically advisable to analyse the two together, to gain more insight and better understand the nature of the data.
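As a quick illustration on synthetic data: scipy exposes both statistics directly, and by default its kurtosis() reports excess kurtosis, so a normal sample lands near 0 on it. A symmetric normal sample and a right-skewed exponential sample behave very differently on both measures (asymptotically, the exponential has skewness 2 and excess kurtosis 6):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(size=10_000),            # symmetric, light tails
    "exponential": rng.exponential(size=10_000),  # right-skewed, heavy tail
}

for name, x in samples.items():
    # skew(): third standardized moment; kurtosis(): fourth, reported as
    # excess kurtosis by default, so the normal sample comes out near 0.
    print(f"{name:>12}: skew={skew(x):+.2f}, excess kurtosis={kurtosis(x):+.2f}")
```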


PrivacyGuide: towards an implementation of the EU GDPR on Internet privacy policy evaluation

Studies have shown that 1% or fewer of users click on privacy policies, and those who do rarely actually read them. The GDPR requires clear, succinct explanations and explicit consent (i.e., no burying your secrets on page 37 of a 70-page document), but that’s not the situation on the ground right now, and it’s hard to see it changing overnight on May 25th.


MLE with General Optimization Functions in R

In my previous post (https://…/mle-in-r ), it is shown how to estimate the MLE based on the log likelihood function with a general-purpose optimization algorithm such as optim(), and that such an optimizer is more flexible and efficient than the wrappers in statistical packages. A benchmark comparison is given below, showing the use of other general-purpose optimizers commonly used in R, including optim(), nlm(), nlminb(), and ucminf(). Since these optimizers are designed to minimize the objective function, we need to add a minus (-) sign to the log likelihood function that we want to maximize, as shown in the minLL() function below. In addition, to speed up the optimization, we can suppress the hessian in the function call. If the hessian is required to calculate standard errors of the estimated parameters, it can be computed afterwards by calling the hessian() function in the numDeriv package. As the benchmark result shows, although ucminf() is the most efficient optimization function, its hessian option can increase the computing time by 70%. Moreover, the second-fastest function, nlminb(), has no built-in option to output the hessian. Therefore, it can sometimes be preferable to estimate the model parameters first and then calculate the hessian afterwards for analysis purposes, as demonstrated below.
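As a hedged Python analogue of this workflow (scipy's minimize() in place of R's optim(); the Gaussian model and data are made up for illustration): negate the log likelihood so the optimizer can minimize it, skip second-order information during the fit, and recover standard errors from the hessian afterwards.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data for a Gaussian MLE (made-up example).
rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=1.5, size=1_000)

def min_ll(theta):
    """Negated Gaussian log likelihood: the minus sign turns the
    maximization problem into the minimization the optimizer expects."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)  # log-scale parameter keeps sigma > 0
    return -np.sum(-0.5 * np.log(2 * np.pi) - log_sigma
                   - 0.5 * ((x - mu) / sigma) ** 2)

# Fit without requesting second-order information during optimization.
fit = minimize(min_ll, x0=np.array([0.0, 0.0]), method="BFGS")

# Standard errors afterwards, from the inverse hessian of the negated
# log likelihood. BFGS only carries an approximation (hess_inv); a
# finite-difference hessian would be the closer analogue of R's
# numDeriv::hessian(). The second SE is on the log-sigma scale.
se = np.sqrt(np.diag(fit.hess_inv))
print("estimates:", fit.x, "std. errors:", se)
```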