Debugging Machine Learning - Reflections from DAWN Retreat

“What do you spend time on while debugging machine learning pipelines?”

Responses to this question at the Fall 2018 DAWN Retreat ranged from “finding the best way to use transfer learning” to “systematically sampling from raw data”. We identify three broad themes from our discussions and explore them in this post: (1) shaping training data, (2) exploiting log data, and (3) model introspection.

Check out our other blog posts on debugging machine learning: using provenance to debug training sets and a checklist of common errors (and solutions) in machine learning pipelines!

At the Fall 2018 DAWN Retreat, we wanted to learn about what people spend time on while debugging machine learning pipelines. We were fortunate to have several interesting conversations with our industry affiliates about specific problems they face while training models, and about the broader problems they think are holding machine learning and Software 2.0 back from achieving their full potential.

We group issues that members discussed into three broad categories:

  • Shaping Training Data: The most popular responses to questions about debugging models were about shaping training data! Within this general area, we heard about the need for systematically converting raw data to a training set, preserving data structure and metadata during this process, and visualizing data distributions using a dashboard for machine learning data.

  • Exploiting Log Data: There was a lot of interest in learning from log data effectively, or exploiting it to aid classical learning algorithms. This touched on our previous work using developer exhaust to simplify complex problems like model search, but hinted at a broader need for the ability to learn from large amounts of structured, machine-generated data.

  • Model Introspection: Almost everyone mentioned the need to better understand the interaction between the model and the data. This included systematically determining whether the model was performing well on the datapoints that mattered, discovering which subsets of the data were hurting model performance, and so on.

Shaping Training Data

Labeling and preprocessing data to ensure it captures the distribution of real data is a tedious and time-consuming process.

We heard about how shaping training data is one of the most time-consuming parts of deploying machine learning in the real world. Our discussions with industry researchers reflected the same themes, touching on the need for systematically converting raw data to a training set, preserving data structure and metadata during this process, and visualizing data distributions using a dashboard for machine learning data (the retreat took place a few days after Google released the What-If Tool).

Converting Raw Data to an ML Dataset

We heard about how a key component of debugging machine learning models is managing the process of converting raw data to a dataset. Problems in this category include:

  • Combining and organizing multimodal data, and data from different sources (e.g., CSVs and image data); a minimal sketch of this follows the list
  • Defining the problem schema and designing a set of target labels for a particular task, including deciding how best to split data categories into different hierarchies to take advantage of multi-task learning
  • Determining the best embedding or feature space for a specific data and downstream model pair
  • Deciding how to sample raw data to create training data by selecting subsets of the data that matter, a problem especially relevant when there is a high density of data available (e.g., streaming settings)
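To make the first item concrete, here is a minimal sketch (not something prescribed in the retreat discussions) of joining tabular metadata in a CSV with image files into a single PyTorch dataset. The file paths, column names ("filename", "label"), and image size are hypothetical placeholders, and the non-label columns are assumed to be numeric features.

```python
# A minimal sketch of combining CSV metadata with image files into one training set.
# The paths, column names, and image size are hypothetical assumptions.
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class MultimodalDataset(Dataset):
    """Pairs each image with its row of tabular features and a target label."""

    def __init__(self, csv_path, image_dir):
        self.metadata = pd.read_csv(csv_path)  # one row per example, referencing an image file
        self.image_dir = image_dir
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.metadata)

    def __getitem__(self, idx):
        row = self.metadata.iloc[idx]
        image = Image.open(f"{self.image_dir}/{row['filename']}").convert("RGB")
        # Assumes all remaining columns are numeric features.
        features = row.drop(["filename", "label"]).to_numpy(dtype="float32")
        return self.transform(image), features, row["label"]
```

Wrapping the join in a Dataset keeps the raw CSV and image directory as the single source of truth, which makes it easier to re-derive the training set when the raw data or schema changes.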

Preserving Data Structure

Related to the problem above, many data processing pipelines can indirectly remove information critical to the task at hand. For example, cropping and centering can remove location information from an image, and metadata associated with the data source (user ID, location, etc.) is often stripped to ensure privacy. Related issues include:

  • Preserving data structure while preprocessing data by incorporating knowledge about data sources and other metadata into the end model to help improve performance; a sketch of carrying metadata through preprocessing follows this list
  • Utilizing “bad” datapoints or “outliers” that seem to hurt the model but still contain some information about the task at hand. This also relates to deciding how to shape and sample data when there is a high density of data available.
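As one illustration of keeping structure around rather than discarding it, here is a minimal sketch that returns preprocessed pixels together with the crop coordinates and source metadata. The metadata fields and crop logic are hypothetical, purely for illustration.

```python
# A minimal sketch of preserving provenance during preprocessing instead of discarding it.
# The metadata fields (sensor_id, timestamp, ...) and crop logic are hypothetical.
from dataclasses import dataclass
from PIL import Image

@dataclass
class Example:
    pixels: Image.Image   # the preprocessed image
    crop_box: tuple       # where the crop sits in the original frame
    source: dict          # e.g., {"sensor_id": ..., "timestamp": ...}

def center_crop_with_provenance(image: Image.Image, source: dict, size: int = 224) -> Example:
    """Center-crop an image but keep the crop coordinates and source metadata,
    so a downstream model (or a debugging tool) can still use them."""
    w, h = image.size
    left, top = (w - size) // 2, (h - size) // 2
    box = (left, top, left + size, top + size)
    return Example(pixels=image.crop(box), crop_box=box, source=source)
```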

Visual Dashboard for ML Data

Several people mentioned the need for an interactive tool to explore and understand their data. This would help answer the question: “Is there signal in the data I’m training on for the task I’m interested in?” Often, people train large, complex models to improve accuracy without making sure the data can actually answer the question they’re interested in. Google’s What-If Tool is a great fit for this need, making it easy to visualize the data distribution.
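Beyond interactive visualization, one lightweight way to approach the “is there signal?” question is to compare a simple baseline against a majority-class predictor before investing in a large model. The sketch below is only an illustration and assumes you already have featurized inputs X and labels y on hand.

```python
# A minimal "is there signal?" sanity check: if a linear model barely beats
# predicting the majority class, the features may not yet support the task.
# X (features) and y (labels) are assumed to be provided by the caller.
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def signal_check(X, y, cv=5):
    majority = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=cv).mean()
    linear = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv).mean()
    return {"majority_class_accuracy": majority, "logistic_regression_accuracy": linear}
```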

Exploiting Log Data

Machine-generated log data consists of rich, structured information that a model can learn from or exploit to help improve performance on related tasks.

The problems around utilizing log data effectively touched on our ideas about developer exhaust: byproducts of the data analytics pipeline that can be used to simplify otherwise complex problems. Issues mentioned included:

  • Learning about system behavior by converting log data into a structured training set (a minimal parsing sketch follows this list). This is a subset of shaping training data, but for a specific kind of raw data format, and it becomes especially challenging when multiple forms and levels of log data are available.
  • Converting domain expertise to a data format and model objective function given log data and a question of interest. While domain experts can manually extract signal by skimming through various levels and kinds of log information, automating this process is not straightforward due to the diversity of log structures.
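As a concrete, hypothetical example of the first item, here is a minimal sketch that parses semi-structured log lines into a tabular training set with a simple labeling heuristic. The log format, regular expression, and “ERROR = positive class” rule are all assumptions for illustration, not a format anyone described at the retreat.

```python
# A minimal sketch of converting raw log lines into a structured training set.
# The log format, regex, and labeling heuristic are hypothetical.
import re
import pandas as pd

LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+\s\S+)\s(?P<level>INFO|WARN|ERROR)\s(?P<component>\S+):\s(?P<message>.*)"
)

def logs_to_dataframe(lines):
    """Parse raw log lines into rows; lines that do not match are dropped here,
    though in practice they are often worth inspecting separately."""
    rows = [m.groupdict() for m in (LOG_PATTERN.match(line) for line in lines) if m]
    df = pd.DataFrame(rows)
    # Example labeling heuristic: treat ERROR-level lines as the positive class.
    df["label"] = (df["level"] == "ERROR").astype(int)
    return df
```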

Model Introspection

Understanding the interaction between the data and model is key to systematically discovering errors in both the data and the model and guiding decisions about shaping training data.

There were several discussions around debugging the model or understanding its behavior in conjunction with the data it’s trained on. This included:

  • Quickly checking whether a model is learning given a set of training data and model hyperparameters by identifying overfitting, predicting accuracy, etc.
  • Taking advantage of transfer learning systematically to balance the trade-off between fine-tuning on the small, labeled dataset and keeping weights from the pre-trained network. This includes deciding how much (and which subset of) training data to use for fine-tuning.
  • Adding “breakpoints” between layers of a model to study its intermediate behavior as data flows through the network (sketched below)
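For the “breakpoints” idea, one concrete mechanism (in PyTorch, as an illustrative assumption rather than what any affiliate described) is a forward hook that records each layer’s output so it can be inspected after a forward pass. The toy Sequential model below stands in for whatever network is being debugged.

```python
# A minimal sketch of layer "breakpoints" via PyTorch forward hooks: record each
# layer's output during a forward pass so it can be inspected afterwards.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level container itself
        module.register_forward_hook(save_activation(name))

_ = model(torch.randn(4, 16))
for name, act in activations.items():
    print(name, tuple(act.shape), f"mean={act.mean().item():.3f}")
```

Inspecting per-layer statistics this way can surface issues like dead activations or exploding values without modifying the model’s forward code.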

Looking Ahead

We are grateful for the opportunity to learn about the problems users face while debugging machine learning models, and we found shaping training data, exploiting log data, and model introspection to be among the areas people mentioned most often. This is not a comprehensive list by any means, and we would appreciate any feedback and additions!