A Five-Year Research Project to Democratize AI

Stanford DAWN (Data Analytics for What’s Next)  ·  2018–2022

Making a machine learning system is a complicated process — but with better tools, we believe any organization could do it.

Data is a barrier to ML

Making a machine learning system involves collecting, processing, distributing, and monitoring huge amounts of data. It takes work to build infrastructure that can handle so much data, and it takes work to execute the parts of the process that must be done by hand. As a result, it has often taken hundreds or thousands of ML experts to build production-quality systems. Most organizations cannot afford this, so the transformative potential of ML goes untapped in many fields.

Better tools are needed

With better data management tools, the process would become easier. The DAWN project set out to research and build these tools. Our vision is that anyone with expertise in their domain — such as a medical lab optimizing clinical procedures or a business group addressing its field-specific problems — can build their own production-quality data products without requiring a team of experts in machine learning.

Collaboration to empower everyone

To take steps toward this vision, a group of faculty, students, and industry partners got together to collaborate on a wide range of projects. These projects addressed every step in the ML production pipeline, across every layer of the hardware/software stack. These collaborations produced new tools and companies, and they prepared the next generation of researchers and innovators to continue searching for solutions throughout their careers.

“It’s hard in grad school to find a project that pulls together so many different collaborators. It was a really cool team, both from industry and grad students. It was really fun rather than the typical grad school solo-journey student experience. I feel grateful about that.”
— Firas Abuzaid

DAWN addressed every step of the ML production process

Today it is easier than ever to choose, adjust, and train machine learning models — the core algorithms that learn from data to produce the desired results. But a model can only do its job if people have gathered a lot of good data for it to learn from, and it can only be useful if people make it widely available and monitor its output for errors. DAWN aimed to make all these steps easier, streamlining the process from beginning to end.

Collecting and preparing data

One of the greatest challenges is to acquire or produce enough data in the first place. Many ML models require huge amounts of training data, and the data often have to be cleaned of errors and labeled with additional information. These tasks often need to be done by hand.
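
For example, rather than labeling every example by hand, a domain expert can write heuristic labeling functions and combine their noisy votes. This is the idea behind Snorkel, one of the flagship projects listed below. Here is a minimal sketch in plain Python; the rules and the example text are invented for illustration:

    # Weak-supervision sketch: combine noisy heuristic "labeling functions"
    # by majority vote instead of hand-labeling every example.
    # The rules and example text are hypothetical, for illustration only.
    SPAM, HAM, ABSTAIN = 1, 0, -1

    def lf_mentions_urgent(text):
        return SPAM if "urgent" in text.lower() else ABSTAIN

    def lf_has_unsubscribe(text):
        return SPAM if "unsubscribe" in text.lower() else ABSTAIN

    def lf_polite_greeting(text):
        return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

    LABELING_FUNCTIONS = [lf_mentions_urgent, lf_has_unsubscribe, lf_polite_greeting]

    def weak_label(text):
        votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
        if not votes:
            return ABSTAIN                       # no rule fired; leave unlabeled
        return max(set(votes), key=votes.count)  # majority vote

    print(weak_label("URGENT: click here to unsubscribe"))  # -> 1 (SPAM)

Systems like Snorkel go further and learn how much to trust each labeling function, but even simple voting can turn weeks of hand-labeling into a few lines of code.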

Selecting and extracting features

For the model to be effective, the data need to be reduced to the most important features. While each data point likely includes many values, some of these values may be redundant, some may be irrelevant to the task at hand, and some may need to be combined to represent the subject in terms that are meaningful to a domain expert.
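
One common starting point is univariate feature selection, which scores each column by how predictive it is of the label and keeps only the top scorers. Here is a minimal sketch using scikit-learn on synthetic data; the three feature roles are invented for the example:

    # Feature-selection sketch: keep the columns most predictive of the
    # label and drop the rest. Synthetic data, for illustration only.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(0)
    n = 200
    informative = rng.normal(size=n)                           # drives the label
    redundant = 2.0 * informative + 0.01 * rng.normal(size=n)  # near-copy of it
    irrelevant = rng.normal(size=n)                            # pure noise
    X = np.column_stack([informative, redundant, irrelevant])
    y = (informative > 0).astype(int)

    selector = SelectKBest(score_func=f_classif, k=2)
    X_reduced = selector.fit_transform(X, y)
    print(selector.get_support())  # expect [ True  True False]: noise dropped

Note that univariate scores keep both the informative column and its near-copy; spotting that redundancy takes a multivariate check, such as the correlation between columns, or a domain expert's judgment.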

Training and running the model

Thanks to years of ML research, the models and algorithms themselves are often good enough out of the box. The main challenge here is one that affects every step in the process: running systems quickly and cost-effectively, because many ML applications are assembled from disparate parts that weren't designed to work together efficiently.
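
To make that concrete, consider a chained array computation: each library call materializes a full intermediate array before the next call runs, while a fused loop would make a single pass. A toy numpy illustration of the gap that cross-library runtimes such as Weld (listed below) aim to close:

    # Toy illustration of cross-library overhead: each chained call
    # materializes a full temporary array before the next one runs.
    import numpy as np

    x = np.arange(10_000_000, dtype=np.float64)

    # Three passes over the data and two temporary arrays:
    chained = ((x * 2.0) + 1.0).sum()

    # A fused version would compute 2*x[i] + 1 and accumulate the sum in
    # one pass with no temporaries; runtimes like Weld derive such fusion
    # automatically. Here we only verify the value algebraically.
    fused = 2.0 * x.sum() + 1.0 * x.size
    assert np.isclose(chained, fused)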

Productionizing the system

Once an ML product is built, it requires substantial effort to deploy, operate, and monitor at scale, especially if critical business processes rely on it. An organization using ML needs to check that the algorithm is working, debug issues that arise, and make the system robust to changes in data.
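
One simple, widely used safeguard is to compare the statistics of live inputs against the training distribution and raise an alert when they drift apart. A minimal sketch; the threshold and the data are illustrative:

    # Monitoring sketch: flag input drift by comparing live feature means
    # against the training distribution. Threshold is illustrative only.
    import numpy as np

    def drift_alerts(train_X, live_X, z_threshold=3.0):
        """Return indices of features whose live mean has drifted."""
        mu = train_X.mean(axis=0)
        se = train_X.std(axis=0) / np.sqrt(len(live_X)) + 1e-12
        z = np.abs(live_X.mean(axis=0) - mu) / se
        return np.where(z > z_threshold)[0]

    rng = np.random.default_rng(1)
    train = rng.normal(size=(10_000, 3))
    live = rng.normal(size=(500, 3))
    live[:, 2] += 0.5                  # simulate drift in the third feature
    print(drift_alerts(train, live))   # expect [2]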

Flagship projects

  • MacroBase

    MacroBase is a new analytic monitoring engine designed to prioritize human attention in large-scale datasets and data streams.
  • Snorkel

    A system for rapidly creating, modeling, and managing training data, focused on accelerating the development of structured or “dark” data extraction applications for domains in which large labeled training sets are not available or easy to obtain.
  • Spatial

    A new domain-specific language (DSL) for programming reconfigurable hardware from a parameterized, high-level abstraction.
  • Weld

    A runtime for improving the performance of data-intensive applications. It optimizes across libraries and functions by expressing each library's core computations in a small common intermediate representation, similar to CUDA and OpenCL.
  • NoScope

    A system for querying videos at scale using neural networks, which accelerates inference by over 1000× through model specialization and dynamic cascades (sketched after this list).
  • DAWNBench

    A benchmark suite for end-to-end deep learning training and inference.
  • HyperMapper

    A multi-objective black-box optimization tool based on Bayesian optimization.
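
The dynamic-cascade idea behind NoScope fits in a few lines: a cheap specialized model answers the easy cases, and the expensive reference model runs only when the cheap one is unsure. In this sketch both "models" are hypothetical stand-ins, not NoScope's actual networks:

    # Dynamic-cascade sketch (NoScope-style): a cheap specialized model
    # filters easy inputs; the expensive model runs only on uncertain ones.
    # Both "models" are hypothetical stand-ins for real neural networks.
    import random

    def cheap_model(frame):
        """Fast specialized model: returns (label, confidence)."""
        score = frame["brightness"]                 # toy feature
        return ("car" if score > 0.5 else "no_car", abs(score - 0.5) * 2.0)

    def expensive_model(frame):
        """Slow reference model, assumed accurate."""
        return frame["true_label"]

    def cascade(frame, confidence_threshold=0.8):
        label, confidence = cheap_model(frame)
        if confidence >= confidence_threshold:
            return label                            # fast path: skip big model
        return expensive_model(frame)               # fall back when unsure

    random.seed(0)
    frames = [{"brightness": random.random(), "true_label": "car"}
              for _ in range(5)]
    print([cascade(f) for f in frames])

The speedup comes from the specialized model being confident on most frames, so the expensive network runs only on a small fraction of the stream.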