DeepFreak: Learning Crystallography Diffraction Patterns with Automated Machine Learning

A detailed description can be found in this paper. Crystallography is the science that studies the properties of crystals. It has been a central tool in many disciplines, including chemistry, geology, biology, materials science, metallurgy, and physics, and has led to substantial advances in, for instance, drug development for fighting diseases. In crystallography, a crystal is irradiated with an X-ray beam that strikes the crystal and produces an image with a diffraction pattern (Figure 1, see this video for more...

Massive Multi-Task Learning with Snorkel MeTaL: Bringing More Supervision to Bear

TL;DR: We use Snorkel MeTaL to construct a simple model (pretrained BERT + linear task heads) and incorporate a variety of supervision signals (traditional supervision, transfer learning, multi-task learning, weak supervision, and ensembling) in a Massive Multi-Task Learning (MMTL) setting, achieving a new state-of-the-art score on the GLUE Benchmark and four of its nine component tasks (CoLA, SST-2, MRPC, STS-B). Research is ongoing, with a code release of the MMTL package coming in Snorkel MeTaL v0.5 in April 2019. Designing...

Model Assertions as a Tool for Quality Assurance and Improving ML Models

Machine learning is increasingly being used in real-world domains, such as self-driving cars or healthcare. However, ML models can fail in confusing or complicated ways. For example, autonomous vehicles have suffered multiple incidents where they accelerated into one type of highway lane divider. We believe it is critical to develop tools to ensure model quality and to improve models over time, especially as ML is deployed in mission-critical domains. Prior work on quality assurance for machine learning has focused...

DAWN PI Delivers NeurIPS Keynote

We are excited to have DAWN Principal Investigator Kunle Olukotun presenting some of our latest research advancements in his keynote at the NeurIPS conference tomorrow. To accompany his keynote, we are providing a reading list for some of the topics that will be covered during his talk. HALP: [Updated Manuscript (12/5/18)] [Manuscript (3/9/18)] [Blog] Stay tuned (updated results and code coming soon)! [Initial Definition and Study of Hardware Versus Statistical Efficiency] Snorkel: [Paper] [Website] Multi-Task Learning (Snorkel MeTaL): [Paper] [Code] Software 2.0: [Paper] [Blog] Spatial: [Paper] [Code]...

Debugging Machine Learning - Reflections from DAWN Retreat

“What do you spend time on while debugging machine learning pipelines?” Responses to this question at the Fall 2018 DAWN Retreat ranged from “finding the best way to use transfer learning” to “systematically sampling from raw data”. We identify three broad themes from our discussions and explore them in this post: (1) shaping training data, (2) exploiting log data, and (3) model introspection. Check out our other blogs related to debugging machine learning: using provenance to debug training sets and...

Earthquake Hunting with Efficient Time Series Similarity Search

Worldwide, major earthquakes (magnitude 7+) occur approximately once a month, while magnitude 2 and smaller earthquakes can happen up to several thousand times a day. In fact, earthquake frequency is inversely proportional to magnitude, meaning most earthquakes are very small. An estimated 1% of these small-magnitude events are detected and recorded in public catalogs (Figure 1), yet these low-magnitude earthquakes are used by scientists to uncover unknown seismic sources, understand earthquake mechanics, and predict major seismic events. Figure 1:...

Sketching Classifiers with Limited Memory, or Better Feature Hashing with One Simple Trick

This post accompanies the paper “Sketching Linear Classifiers over Data Streams” by Kai Sheng Tai, Vatsal Sharan, Peter Bailis and Gregory Valiant, which was presented at SIGMOD 2018. Check out our code on GitHub. In online learning, we learn a predictor by continuously updating its weights according to a stream of labelled examples. For example, in spam classification, an online learning approach allows the spam classifier to dynamically adapt to newly-observed features, even those introduced by an adversary attempting to...
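
To make the online-learning setup above concrete, here is a minimal sketch of plain feature hashing, the memory-limited baseline that the post's title alludes to. It is not the sketching method from the paper. The class name `HashedOnlineClassifier`, the bucket count, the learning rate, and the use of CRC32 as the hash function are all assumptions chosen for illustration.

```python
import math
import zlib
from typing import Dict, Tuple


class HashedOnlineClassifier:
    """Online logistic regression over a fixed-size hashed weight vector.

    Hypothetical illustration of feature hashing; not the paper's sketch.
    """

    def __init__(self, num_buckets: int = 1024, learning_rate: float = 0.1) -> None:
        self.num_buckets = num_buckets
        self.learning_rate = learning_rate
        # Memory use is fixed up front, however many distinct features arrive.
        self.weights = [0.0] * num_buckets

    def _slot(self, feature: str) -> Tuple[int, float]:
        # Map a feature name to a bucket and a +/-1 sign using two CRC32 hashes.
        data = feature.encode("utf-8")
        bucket = zlib.crc32(data) % self.num_buckets
        sign = 1.0 if zlib.crc32(b"sign:" + data) & 1 else -1.0
        return bucket, sign

    def predict_proba(self, features: Dict[str, float]) -> float:
        score = 0.0
        for name, value in features.items():
            bucket, sign = self._slot(name)
            score += sign * self.weights[bucket] * value
        # Numerically stable logistic function.
        if score >= 0:
            return 1.0 / (1.0 + math.exp(-score))
        e = math.exp(score)
        return e / (1.0 + e)

    def update(self, features: Dict[str, float], label: int) -> None:
        # One stochastic gradient step on the logistic loss for one example.
        error = self.predict_proba(features) - label
        for name, value in features.items():
            bucket, sign = self._slot(name)
            self.weights[bucket] -= self.learning_rate * error * sign * value


if __name__ == "__main__":
    clf = HashedOnlineClassifier(num_buckets=64)
    clf.update({"free": 1.0, "winner": 1.0}, label=1)  # a spam example
    print(clf.predict_proba({"free": 1.0, "winner": 1.0}))
```

Because features are addressed by hash, a previously unseen feature needs no vocabulary update, which is what lets the classifier adapt to a stream. The cost is that colliding features share a weight, so individual feature weights are hard to recover from the hashed vector.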

Debugging Training Data for Software 2.0

Training data is playing an increasingly important role in defining the performance of modern machine learning systems. The goal of this blog post is to maintain a “checklist” of the types of errors that can be introduced by unaccounted phenomena in the data and their labels, and simple ways to check for these errors. We would love to hear about other errors you have encountered and how you identify and correct them! Check out our previous blog on using the...

Moment-based quantile sketches for efficient aggregation

Quantiles or their equivalents (percentiles) are commonly used in data exploration workflows. However, they can be expensive to compute on increasingly high-volume, multi-dimensional datasets. In order to reduce query response times, data systems make use of sketch data structures to accelerate quantile computations and deliver approximate results. In this post, we show how a set of statistics including \(\sum x^2, \sum x^3,...\) can be used to define a compact and efficient sketch: the moments sketch. The key property of this...
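
As a rough illustration of why power sums make a compact, mergeable summary, the sketch below tracks a count, the minimum, the maximum, and the sums of x, x^2, ..., x^k. The class name `PowerSumSketch` and the default `k = 4` are assumptions for this example, and the step that turns these statistics into quantile estimates is not shown here.

```python
from typing import Iterable, List


class PowerSumSketch:
    """Fixed-size summary: count, min, max, and sums of x^1 .. x^k.

    Hypothetical illustration; quantile estimation itself is omitted.
    """

    def __init__(self, k: int = 4) -> None:
        self.k = k
        self.count = 0
        self.minimum = float("inf")
        self.maximum = float("-inf")
        self.power_sums: List[float] = [0.0] * k

    def add(self, x: float) -> None:
        self.count += 1
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        power = 1.0
        for i in range(self.k):
            power *= x  # power == x ** (i + 1)
            self.power_sums[i] += power

    def merge(self, other: "PowerSumSketch") -> "PowerSumSketch":
        # Merging is element-wise addition, so its cost depends only on k.
        if self.k != other.k:
            raise ValueError("sketches must track the same number of moments")
        merged = PowerSumSketch(self.k)
        merged.count = self.count + other.count
        merged.minimum = min(self.minimum, other.minimum)
        merged.maximum = max(self.maximum, other.maximum)
        merged.power_sums = [a + b for a, b in zip(self.power_sums, other.power_sums)]
        return merged

    def mean(self) -> float:
        return self.power_sums[0] / self.count

    def variance(self) -> float:
        # Population variance: E[x^2] - (E[x])^2.
        return self.power_sums[1] / self.count - self.mean() ** 2


def build(values: Iterable[float], k: int = 4) -> PowerSumSketch:
    sketch = PowerSumSketch(k)
    for v in values:
        sketch.add(v)
    return sketch


if __name__ == "__main__":
    left, right = build([1.0, 2.0]), build([3.0, 4.0])
    combined = left.merge(right)
    print(combined.count, combined.mean(), combined.variance())
```

Merging two sketches gives the same statistics as building one sketch over the combined data, which is the property that makes pre-aggregated summaries useful for multi-dimensional queries.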

Filter Before You Parse: Faster Analytics on Raw Data with Sparser

Many big data applications run on raw, unstructured or semi-structured data formats, such as JSON. Querying these files is often very time-consuming, especially for exploratory applications, where data scientists run queries to explore and better understand their data. Surprisingly, 80-90% of the execution time in these applications is actually spent on parsing the data, not on evaluating the actual query itself. Parsing is, in fact, the bottleneck. In this post, we introduce Sparser (code here), a recent research project...
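
The "filter before you parse" idea in the title can be illustrated with a small sketch: run a cheap substring test on each raw line, and pay for JSON parsing only when the line might match. This is not Sparser's implementation. The function name `filter_then_parse` and its parameters are hypothetical, and the sketch assumes the search term appears verbatim (unescaped) in the serialized record.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple


def filter_then_parse(
    lines: Iterable[str], needle: str, field: str, value: Any
) -> Tuple[List[Dict[str, Any]], int]:
    """Return records where record[field] == value, plus how many lines were parsed.

    Hypothetical illustration of raw filtering ahead of parsing.
    """
    matches: List[Dict[str, Any]] = []
    parsed = 0
    for line in lines:
        # Cheap raw filter: may admit false positives, but under the stated
        # assumption it never drops a line that truly matches.
        if needle not in line:
            continue
        record = json.loads(line)
        parsed += 1
        # The exact predicate runs on the parsed record to remove false positives.
        if record.get(field) == value:
            matches.append(record)
    return matches, parsed


if __name__ == "__main__":
    raw = [
        '{"user": "a", "lang": "en", "text": "hello"}',
        '{"user": "b", "lang": "fr", "text": "I speak en sometimes"}',
        '{"user": "c", "lang": "de", "text": "hallo"}',
    ]
    results, parsed_count = filter_then_parse(raw, "en", "lang", "en")
    print(results, parsed_count)
```

In this example the third line is skipped without being parsed, while the second line passes the raw filter but is rejected after parsing. The savings grow as the query becomes more selective.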