A retrospective on NSDI 2017

A group of us at DAWN went to NSDI last month. The program was quite diverse, spanning a wide variety of sub-areas in the networking and distributed systems space.

We were excited to see some trends in the research presented that meshed well with the DAWN vision.

Greater emphasis on systems for machine learning

The machine learning community has spent a lot of time optimizing machine learning algorithms to achieve better accuracy across a range of settings. Despite these advances, deploying models in practice remains extremely hard. In particular, tasks like hyperparameter tuning, efficient model serving, and updating models in an online setting remain challenging, especially for users who are not machine learning experts.

Clipper is a system from UC Berkeley that tries to make serving machine learning predictions faster and easier. By batching multiple concurrent queries, Clipper can better utilize physical compute resources (batching does little for small compute problems, but pays off substantially for medium to large-sized ones), thus improving the net throughput of the system. Picking an arbitrarily large batch size, however, can lead to unacceptably high latency, particularly for real-time applications like the recommendation services at Netflix and Amazon; Clipper therefore accepts a latency SLO and adheres to it while still trying to maximize throughput.
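The batching mechanism pairs naturally with a simple feedback controller. As a rough illustration (not Clipper's actual code; the query queue and model object are hypothetical), an additive-increase/multiplicative-decrease loop can grow the batch size while the measured latency stays under the SLO and back off when it overshoots:

```python
import time

# Hypothetical sketch of SLO-aware adaptive batching in the spirit of Clipper.
# `queue.take_up_to` and `model.predict_batch` are placeholders, not Clipper's API.

SLO_MS = 20.0      # latency objective for one batched prediction call
batch_size = 1     # grow additively while under the SLO, back off when over

def serve_loop(queue, model):
    global batch_size
    while True:
        batch = queue.take_up_to(batch_size)   # block until at least one query arrives
        start = time.perf_counter()
        results = model.predict_batch([q.input for q in batch])
        elapsed_ms = (time.perf_counter() - start) * 1000
        for query, result in zip(batch, results):
            query.reply(result)
        # AIMD control: larger batches while we meet the SLO, a sharp cut when we miss it.
        if elapsed_ms <= SLO_MS:
            batch_size += 1
        else:
            batch_size = max(1, int(batch_size * 0.75))
```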

There were a couple of other papers in this space, like Gaia, which shows that when training models across geo-distributed data centers, parameter updates can be sent over the wide-area network infrequently; and Tux2, which improves support for machine learning algorithms on graphs.
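Gaia's key mechanism, as we understand it, is to propagate a parameter update over the wide-area network only when the accumulated local change is significant relative to the parameter's current value. A minimal sketch of that idea (the names are ours and the 1% threshold is an arbitrary example):

```python
# Sketch of significance-filtered parameter updates in the spirit of Gaia.
# Class and function names are invented; the 1% threshold is just an example.

SIGNIFICANCE = 0.01  # propagate once the pending update exceeds 1% of the value

class SignificanceFilter:
    def __init__(self, params):
        self.params = dict(params)                 # local copy of model parameters
        self.pending = {k: 0.0 for k in params}    # updates not yet sent remotely

    def apply_local_update(self, key, delta, send_fn):
        self.params[key] += delta
        self.pending[key] += delta
        magnitude = abs(self.params[key]) or 1e-8  # avoid division by zero
        if abs(self.pending[key]) / magnitude >= SIGNIFICANCE:
            send_fn(key, self.pending[key])        # ship to the other data centers
            self.pending[key] = 0.0
```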

Video as a source of new challenges in data analytics

Recent advances in computer vision have opened up an exciting new domain for machine learning: video processing. With as many as 500 hours of YouTube content uploaded every minute, it is becoming increasingly necessary for machine learning algorithms to process video content efficiently. Unfortunately, processing these video streams at scale in real time with state-of-the-art computer vision techniques is still too expensive.

VideoStorm is a system from MSR that tries to determine the best way to schedule video processing jobs on a cluster of machines. VideoStorm makes the observation that video analysis algorithms can be run with a number of knobs (image resolution, frame rate, window size) that affect both accuracy and performance. It is able to navigate this accuracy-performance tradeoff space to achieve high accuracy within a given computational budget.
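Concretely, if each knob configuration has been profiled for its compute cost and accuracy, the per-query scheduling problem boils down to something like the toy search below (this is our own illustration, not VideoStorm's scheduler; the knob values and the `profile` function are placeholders):

```python
from itertools import product

# Toy search over the knob space for the most accurate configuration that fits
# within a compute budget. In VideoStorm such (cost, accuracy) profiles come from
# offline profiling; here `profile(cfg)` is an assumed stand-in.

RESOLUTIONS = [240, 480, 720, 1080]   # pixels (short side)
FRAME_RATES = [1, 5, 15, 30]          # frames per second
WINDOW_SIZES = [1, 4, 16]             # frames per analysis window

def best_config(profile, cpu_budget):
    best = None
    for cfg in product(RESOLUTIONS, FRAME_RATES, WINDOW_SIZES):
        cost, accuracy = profile(cfg)
        if cost <= cpu_budget and (best is None or accuracy > best[1]):
            best = (cfg, accuracy)
    return best   # ((resolution, frame_rate, window_size), expected accuracy)
```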

ExCamera (which we describe in more detail later in this post) also looks at the general problem of video processing – in particular, it tries to make video encoding faster.

At DAWN, we strongly believe that being able to effectively run various analyses on video represents the next frontier of data analysis. In a recent arXiv submission, we describe NoScope, a system that trains cheap filters to produce the same output as much more expensive convolutional neural networks on certain binary classification problems.
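The core idea is a cascade: a cheap, query-specific model handles the easy frames, and only the uncertain ones fall through to the reference CNN. A simplified sketch (the model objects and thresholds here are placeholders; NoScope learns its thresholds per query):

```python
# Simplified NoScope-style cascade for a binary "is the object in this frame?" query.
# `cheap_model` and `expensive_cnn` are placeholders for the specialized filter
# and the reference network.

LOW, HIGH = 0.1, 0.9   # example confidence thresholds

def classify_frame(frame, cheap_model, expensive_cnn):
    score = cheap_model.score(frame)       # fast, specialized binary classifier
    if score <= LOW:
        return False                       # confidently "object absent"
    if score >= HIGH:
        return True                        # confidently "object present"
    return expensive_cnn.classify(frame)   # fall back to the expensive reference model
```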

Use of accelerators & programmable hardware to ease performance bottlenecks

With 40 GbE being deployed and 100 GbE around the corner, packet processing on general-purpose CPUs is becoming increasingly difficult. Unfortunately, the trend toward faster networks and fewer cycles per packet runs counter to the rising complexity of packet processing algorithms driven by the growth of the cloud, virtualization, and increasing concerns about security. Specialized packet processing hardware seems to be the only solution.

The first paper we saw in this area was APUNet, from authors at KAIST, which proposes the use of a die-integrated GPU for packet processing. Packet processing with discrete GPUs is often limited by data transfer over PCIe, and die-integrated GPUs avoid this cost. The paper shows some good results, with speedups of up to 4x over optimized CPU implementations for functions like intrusion detection and checksumming. The evaluation of the integrated GPU’s cost-effectiveness presented in the paper seems a bit questionable, however: the authors use AMD’s list prices for components, which may be influenced by marketing and business factors rather than just manufacturing cost. A better evaluation would have compared the performance of CPU- and GPU-based systems in terms of throughput per unit die area and per watt.
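To make that suggestion concrete, the comparison we have in mind looks like the snippet below; all of the numbers are invented placeholders, not figures from the paper:

```python
# Placeholder comparison of packet-processing throughput normalized by die area
# and power. Every number below is invented for illustration only.

systems = {
    "cpu_only":       {"gbps": 40.0,  "die_mm2": 300.0, "watts": 150.0},
    "integrated_gpu": {"gbps": 160.0, "die_mm2": 350.0, "watts": 95.0},
}

for name, s in systems.items():
    print(f"{name}: {s['gbps'] / s['die_mm2']:.2f} Gbps/mm^2, "
          f"{s['gbps'] / s['watts']:.2f} Gbps/W")
```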

The second paper related to this trend was VFP from Microsoft. As users of the Microsoft Azure cloud demand more functionality (such as ACLs and fine-grained billing) from the network, packet processing tasks are becoming more complex even as network link speeds continue to increase. Microsoft has observed that most of the network functionality they require can be implemented at the host level and does not require network-wide state, but an FPGA is still needed at each host to keep up with the traffic. Maintaining a large team of digital circuit engineers to develop an FPGA-based platform must be costly, and it remains to be seen whether commodity NICs will catch up to Microsoft’s needs.

Advent of distributed computation frameworks with fine-grained parallelism

Despite much work in the distributed systems community, there are still no systems that can efficiently execute applications composed of very fine-grained tasks.

ExCamera, a paper written by colleagues at Stanford, proposes a new video encoding algorithm that runs on top of AWS Lambda. Conventionally, video codecs parallelize encoding by breaking a large video into smaller chunks and then encoding each of these chunks independently. Each chunk, however, must then start with a large key frame, which inflates the size of the encoded video. Thus, the naïve way of parallelizing an encoding task creates fairly large encoded videos.

ExCamera does better by encoding all of the chunks in parallel and then passing decoder state from each chunk to the next to shrink the chunks’ initial frames. ExCamera is able to use thousands of threads in the cloud, seeing speedups of 56x on a representative encoding task compared to the multi-threaded vpxenc implementation.
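Structurally, the pipeline looks roughly like the sketch below: a fully parallel encode pass followed by a pass that threads decoder state through the chunks (shown here as a simple serial loop; the paper pipelines this step). `encode_chunk` and `rebase_chunk` are placeholders for the paper's codec operators:

```python
from concurrent.futures import ThreadPoolExecutor

# Rough sketch of the ExCamera pipeline shape. encode_chunk(chunk) -> encoded chunk;
# rebase_chunk(chunk, prev_state) -> (smaller chunk, new decoder state). Both are
# placeholders, not the paper's actual implementation.

def parallel_encode(chunks, encode_chunk, rebase_chunk):
    # Phase 1: fully parallel, conceptually one cloud worker per chunk.
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        encoded = list(pool.map(encode_chunk, chunks))

    # Phase 2: pass decoder state from chunk to chunk so that only the very first
    # chunk needs to keep a full key frame.
    state, output = None, []
    for chunk in encoded:
        chunk, state = rebase_chunk(chunk, state)
        output.append(chunk)
    return output
```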

While it focused primarily on video encoding, the techniques presented in this paper can be generalized to other computation-intensive tasks such as compilation, interactive machine learning and visualization.

Desire for performance without compromising on programmability

Flexplane is a system that allows software emulation of network resource-management algorithms at near line rate. As opposed to running these algorithms completely in simulation, Flexplane keeps the end hosts unchanged and only emulates the parts of the computation that run on switches. This ensures that the nuances of hardware stacks and NICs are captured, while also making integration with real applications like Spark easy, since the interfaces to applications are unchanged. Flexplane places an emphasis on ease of programmability, allowing users to quickly iterate on their ideas and experiments without compromising on performance. Flexplane communicates with the emulator (which resides on a stand-alone multicore machine) through “abstract packets” (at a high level, the original packet without its payload).
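Our reading of the “abstract packet” is that it carries just enough header state for the emulated scheme to make its decisions, while the payload stays on the end host. A small illustrative sketch (the field names and marking logic are ours, not Flexplane's):

```python
from dataclasses import dataclass

# Illustrative "abstract packet" plus a toy decision an emulated switch scheme
# might make on it. Field names and thresholds are invented for illustration.

@dataclass
class AbstractPacket:
    flow_id: int      # identifies the sending flow
    src: int          # source end host
    dst: int          # destination end host
    size_bytes: int   # original packet length, needed for queueing decisions
    mark: int = 0     # e.g. ECN bits set by the emulated scheme

def enqueue(queue, packet, capacity):
    if len(queue) >= capacity:
        return False              # drop: the real packet is never released
    if len(queue) > 0.8 * capacity:
        packet.mark = 1           # congestion-mark instead of dropping
    queue.append(packet)
    return True                   # the real packet may be sent on the network
```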

mOS (recipient of this year’s Best Paper Award) is a reusable networking stack for middleboxes that monitors flow state. Middleboxes are network appliances that perform complex tasks on the traffic flowing through them; common examples include intrusion detection systems and accounting devices. Unfortunately, programming these middleboxes is very difficult: there is no standard API for tasks like detecting packet retransmissions, reconstructing TCP streams, and inspecting HTTP headers for malicious payloads. mOS provides such an API, taking care to cleanly separate it from the application-specific logic of the middlebox. Developers can track network state for both the client and the server by listening for particular events that occur in flows. To address the scalability problem of maintaining a list of registered events for each connection, events are shared between sockets that look at similar classes of data.
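The programming model, as we understood it, is event-driven: the middlebox registers callbacks on flow-level events, and the stack handles TCP reconstruction and per-flow state. The snippet below is our own Python rendering of that pattern, not mOS's actual C API; every name in it is hypothetical:

```python
# Pseudocode-style illustration of an event-driven middlebox built on a
# flow-monitoring stack. All names here are hypothetical; mOS exposes a C API.

def on_retransmission(flow, packet):
    flow.stats["retransmissions"] += 1       # e.g. feed an accounting device

def on_new_payload(flow, data):
    if b"malicious-signature" in data:       # toy stand-in for IDS pattern matching
        flow.drop()

def setup(stack):
    monitor = stack.create_flow_monitor("tcp and dst port 80")   # hypothetical filter API
    monitor.register(event="PKT_RETRANSMIT", callback=on_retransmission)
    monitor.register(event="NEW_PAYLOAD", callback=on_new_payload)
```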

Both Flexplane and mOS recognize that programmability needs to go hand-in-hand with performance: hard-to-use programming interfaces often make good performance hard to achieve in practice. Weld, a system we’ve been working on at DAWN, tries to address similar challenges in the data analytics domain; even though this is a significantly different domain than packet processing, we believe the general idea of designing systems that offer performance without sacrificing programmability is an important one to keep in mind.