DAWNBench v1 Deep Learning Benchmark Results
April 20th, 2018 marked the end of our first iteration of DAWNBench, the first deep learning benchmark and competition that measures end-to-end performance: the time/cost required to achieve a state-of-the-art accuracy level for common deep learning tasks, as well as the latency/cost of inference at this state-of-the-art accuracy level. Focusing on end-to-end performance provided an objective means of normalizing across differences in computation frameworks, hardware, optimization algorithms, hyperparameter settings, and other factors that affect real-world performance.
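To make the metric concrete, here is a minimal sketch (not DAWNBench's official scoring code) of how time-to-accuracy and training cost can be computed from a training log. The log format, the 93% top-5 ImageNet threshold, and the hourly rate in the usage example are illustrative assumptions.

```python
# Illustrative sketch of DAWNBench-style metrics: time to reach a target
# accuracy, and the corresponding cost on hourly-billed cloud hardware.

def time_to_accuracy(log, threshold=0.93):
    """Return the first elapsed time (seconds) at which validation accuracy
    reaches the threshold, or None if the run never gets there."""
    for elapsed_seconds, accuracy in log:
        if accuracy >= threshold:
            return elapsed_seconds
    return None

def training_cost(elapsed_seconds, hourly_rate):
    """Cost of a run on a cloud instance billed at hourly_rate dollars/hour."""
    return elapsed_seconds / 3600.0 * hourly_rate

# Hypothetical training log of (elapsed seconds, top-5 accuracy) checkpoints.
log = [(3600, 0.81), (7200, 0.90), (10800, 0.931), (14400, 0.935)]
t = time_to_accuracy(log)       # first checkpoint at or above 93%: 10800 s
cost = training_cost(t, 24.48)  # at a hypothetical $24.48/hour rate
```

Note that a run is scored by the first checkpoint meeting the threshold, not its final accuracy, which is what rewards fast convergence rather than long fine-tuning.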
Thanks to innovative submissions and community involvement, we have seen more than an order-of-magnitude improvement in end-to-end training time and cost. Some of these improvements involve simple changes like version bumps in software frameworks (e.g., TensorFlow 1.7 to 1.8), while others use novel architectures and new data processing techniques. This post highlights the best results and some of the more interesting techniques, tools, and ideas employed.
Training: Faster, cheaper, better
In the months since we introduced DAWNBench, training time and cost have improved steadily. Our original ImageNet seed entries used optimized single-node, multi-GPU implementations from the open source frameworks TensorFlow and MXNet, but still took 10 days, 10 hours, and 42 minutes and $1112.64 on public cloud instances to train. Thanks to a range of hardware, framework, and algorithmic changes, ResNet50 can now be trained on ImageNet in as little as 30 minutes with checkpointing, and 24 minutes without, using half of a Google TPUv2 Pod, representing a 477x speed-up! While previous work has reported slightly faster times of 15 or 20 minutes, this submission used a much smaller cluster of machines and maintained a higher top-1 accuracy of 76.01%.
The cheapest submission for ResNet50 on ImageNet ran in 8 hours 53 minutes for a total of $58.53 on a Google TPUv2 machine using TensorFlow 1.8.0-rc1, which is a 19x cost improvement over our best seed entry that used 8 Nvidia K80 GPUs on AWS. With the same hardware and model architecture but a slightly older version of TensorFlow (1.7-rc1), Google trained ResNet50 in 12 hours and 26 minutes, meaning there was a 1.4x speed-up between versions and demonstrating the benefits of keeping up with frameworks as they make continuous improvements. Google was able to reduce the cost further from their ResNet submissions to $49.30 by using AmoebaNet-D N6F256, a new model found through architecture search, a 23x improvement over our seed entries and a 1.19x improvement over ResNet50.
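The cost ratios quoted above follow directly from the submitted dollar figures; a quick arithmetic check:

```python
# Verifying the cost ratios quoted above from the submission figures.
seed_cost = 1112.64      # seed entry: 8x Nvidia K80 GPUs on AWS
tpu_resnet_cost = 58.53  # ResNet50 on a TPUv2, TensorFlow 1.8.0-rc1
amoebanet_cost = 49.30   # AmoebaNet-D N6F256 on the same hardware

print(round(seed_cost / tpu_resnet_cost, 1))       # ~19x over the seed
print(round(seed_cost / amoebanet_cost, 1))        # ~23x over the seed
print(round(tpu_resnet_cost / amoebanet_cost, 2))  # ~1.19x over ResNet50
```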
Other hardware and cloud providers weren’t far behind! Using PyTorch with 8 Nvidia V100 GPUs on AWS, fast.ai trained ResNet50 in 2 hours and 58 minutes for a total of $72.50 with a progressive resizing technique, drawn from “Progressive Growing of GANs for Improved Quality, Stability, and Variation” and “Enhanced Deep Residual Networks for Single Image Super-Resolution”, that increases image resolution over the course of training to get higher throughput (images per second) early on without loss in final accuracy. Using only CPUs, Intel trained ResNet50 on ImageNet in 3 hours and 26 minutes with 128 AWS instances of 36 cores each.
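The progressive resizing idea can be sketched as a resolution schedule over epochs. The schedule below is an illustrative assumption about what such a curriculum might look like, not fast.ai's actual code; the start/end sizes are hypothetical.

```python
# Sketch of a progressive-resizing schedule: train on small images early
# (higher images/second) and finish at full resolution for final accuracy.

def resize_schedule(epoch, total_epochs, start_size=128, end_size=288):
    """Linearly interpolate image resolution over training, rounded to a
    multiple of 32 to stay aligned with ResNet's stride-32 downsampling."""
    frac = epoch / max(total_epochs - 1, 1)
    size = start_size + frac * (end_size - start_size)
    return int(round(size / 32)) * 32

# In a PyTorch pipeline, the data loader would be rebuilt whenever the size
# changes, e.g. with transforms.RandomResizedCrop(resize_schedule(e, E)).
```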
The smaller CIFAR10 dataset also saw impressive speed-ups. Starting from ResNet164 from “Identity Mappings in Deep Residual Networks”, which trained in 2 hours and 31 minutes on an Nvidia P100, training time fell to 2 minutes and 54 seconds thanks to fast.ai and their student team. Using a custom Wide ResNet architecture and 8 Nvidia V100s, they achieved a 52x speed-up.
The fast.ai team also dropped training cost from $8.35 to $0.26. Going even further, they showed that you can train a model on CIFAR10 in a reasonable amount of time for free using Google Colaboratory.
Inference: delivered in 10 milliseconds or less!
For ImageNet inference, Intel submitted the best result in both cost and latency. Using an Intel optimized version of Caffe on high performance AWS instances, they reduced per image latency to 9.96 milliseconds and processed 10,000 images for $0.02.
Looking across the wide variety of submissions DAWNBench received, there are three high-level takeaways we want to emphasize:
- There is a huge range in cost and performance of different deep learning systems and hardware. Over the course of the competition, training time and cost saw a 10-100x drop even though the initial seed entries had reasonably optimized implementations from popular open source frameworks and ran on state-of-the-art hardware. Moreover, the leading entries were split across TPUs, GPUs and CPUs: no single type of hardware dominated in all cases. Deep learning systems are both diverse and rapidly improving.
- Performance in deep learning systems depends heavily on every level of the stack: hardware, software framework, model architecture, and training procedure. Google’s ImageNet submissions showed that AmoebaNet-D N6F256, a learned architecture, was faster and as accurate as ResNet50 on a single TPUv2 machine, but couldn’t scale to half of a TPUv2 Pod like ResNet50. Fast.ai’s use of progressive resizing achieved higher performance on 8xV100s than advertised by Nvidia. To get the best performance from deep learning systems requires end-to-end research and development because each element represents a large and only partially explored search space.
- The community cares about openness and reproducibility. We were pleasantly surprised at how many of the entries were open source. Even though we did not force entrants to open source their work, we appreciate that so many different parties in the community want to improve reproducibility in this space.
We believe these takeaways above demonstrate the need for benchmarks and competitions like DAWNBench that focus on end-to-end performance in terms of wall-clock time and quality, encourage community involvement, and evolve to keep up with the pace of the field.
The first iteration of DAWNBench has been a success, demonstrating that open, end-to-end benchmarks for deep learning can lead to substantial progress. We are working on exciting next steps to benchmark more tasks and grow the community. Stay tuned for more at O’Reilly AI in NYC on May 2!
Disclosure: The Stanford DAWN research project is a five-year industrial affiliates program at Stanford University and is financially supported in part by founding members including Intel, Microsoft, NEC, Teradata, VMWare, and Google. For more information, including information regarding Stanford’s policies on openness in research and policies affecting industrial affiliates program membership, please see DAWN’s membership page.