Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference
Machine learning inference, which involves making predictions from machine learning models, is an increasingly important problem today, crucial for many systems like spam detection and content recommendation. Serving ML predictions is fundamentally different from serving other workloads, such as web pages or database queries, because, unlike those workloads, ML applications have unique statistical properties, like an amenability to approximation. However, existing systems for ML inference serving, like Clipper or AWS Sagemaker, neglect these statistical properties and approach ML inference workloads as extensions of conventional data serving workloads. In response, we developed Willump, a system for automatically performing statistical optimization of ML inference applications. Willump’s new optimizations improve the performance of real-world ML inference applications curated from major data science competitions by up to 10x. We recently published the Willump paper at MLSys 2020. Willump’s source code is available on GitHub.
Willump targets a common class of ML inference application: those whose performance bottleneck is feature computation. ML models usually cannot make predictions directly from raw data. Instead, ML inference applications use pipelines of transformations to compute numerical representations of data, called features, from which models can make predictions. For example, TF-IDF statistics are features commonly computed from string data. Often, especially when using less-expensive ML models such as linear classifiers and boosted trees, computing features is the costliest part of ML inference. For example, a recent study at Microsoft found feature computation accounted for over 99% of the runtime of some production ML inference applications. Willump’s optimizations each improve the performance of a different type of ML inference query which is commonly bottlenecked by feature computation.
Willump Optimization 1: Automatic End-to-end Cascades
Willump’s first optimization, automatic end-to-end cascades, improves the performance of classification queries, which classify data inputs into categories. Classification queries are amenable to approximation: ML models usually classify data inputs using many features, but can often accurately classify many of those inputs using only a few features. For example, in an application that detects toxic online comments, we can use the presence of curse words to quickly classify some comments as toxic, but we may need to compute more expensive TF-IDF and word embedding features to classify others.
Willump takes advantage of this property of ML inference applications by automatically constructing end-to-end cascades, which classify some data inputs using an approximate model dependent on only some of the original application’s features. When using cascades, Willump attempts to classify each data input with the approximate model. For each data input, Willump returns the approximate model’s prediction if the model’s confidence exceeds a threshold, but otherwise computes all remaining features and classifies the data input with the original model. We diagram this below:
Cascades are not a new idea, but unlike previous systems, Willump can construct cascades automatically for almost any ML inference application. Willump automatically constructs cascades using an application’s training data and an accuracy target. First, Willump analyzes the flow of data through the application to identify its features and the operators that compute them. Then, Willump selects several sets of inexpensive but predictively powerful features. Specifically, Willump chooses several potential maximum feature costs, then for each chosen cost selects the set of features with the largest sum of permutation importance scores where the cost of the features is less than the chosen cost. Next, for each selected set of features, Willump trains an approximate model, empirically chooses a confidence threshold based on the accuracy target, and uses these to estimate the cost of accurately making predictions using cascades constructed from those features. Finally, Willump constructs cascades from the selected set of features that minimizes this prediction cost.
Willump Optimization 2: Top-K Query Approximation
Willump’s second optimization, top-K query approximation, improves the performance of top-K queries. Top-K queries request a ranking of the K top-scoring items in an input dataset. For example, a music recommendation system might request the 10 songs, out of a large catalog, that a user would like the most. Top-K queries are fundamentally asymmetric: predictions for high-scoring data inputs (which are ranked and returned) must be much more precise than predictions for low-scoring data inputs (which are discarded).
Willump leverages this asymmetry by automatically approximating top-K queries. Willump filters out low-scoring data inputs with an approximate model, then ranks the remainder with the original model. Previous systems have optimized top-K queries in a similar way, but unlike prior work, Willump can construct an approximate model automatically for almost any ML inference application. Like in end-to-end cascades, Willump automatically constructs an approximate model that depends on a small set of inexpensive but predictively powerful features, with the goal of maximizing performance subject to an accuracy target.
We evaluated Willump on several real-world benchmarks curated from entries to major data science competitions, especially from Kaggle. In this blog post, we look at three of these benchmarks:
Toxic: Evaluates whether online comments are toxic. Computes string representations, then applies a linear model.
Music: Performs music recommendation. Queries pre-computed features from a remote data store, then applies a boosted trees model.
All benchmark source code is in the Willump repository.
We first evaluate the performance of Willump’s end-to-end cascades optimization on classification queries conducted on all three benchmarks. First, we measure the original benchmark’s performance. Then, we apply Willump’s end-to-end cascades optimization, verifying that cascades do not cause a statistically significant loss in accuracy. As we can see below, cascades improve performance by 1.6-4.7x.
We next evaluate the performance of Willump’s top-K approximation optimization on top-K queries conducted on all three benchmarks. As before, we first measure the original benchmark’s performance. We then apply our approximation optimization, requiring that the precision of the approximation always be above 0.95. As we can see below, top-K query approximation improves performance by 2.7-9.6x.
In this blog post, we introduced Willump, a statistically-aware end-to-end optimizer for ML inference. If you are interested in learning more about Willump, please see our paper, published at MLSys 2020, or look at Willump’s source code on GitHub.