The Last Decade of Database Research and Its Blindingly Bright Future, or, Database Research: A Love Song


To go by Twitter and many hallway conversations, the database research community has been unsettled lately in a way that we have never seen before. Many people are unhappy with the review process, many types of useful work seem to be more difficult to pursue, and our relationship with adjacent fields such as machine learning is unclear. Turing Award winner – and giant of the field – Mike Stonebraker made some (though not all) of these points in a recent talk that, like everything Mike says, is worth taking seriously.

All of these points of view have merit and deserve consideration. But we think it is worth reflecting on a different viewpoint.

Data management has had an impact that has surpassed our wildest dreams, and it is arguably the most exciting time for data management research. Ever.

The Hallowed Recent Past (The Enormous Burrito)

What has happened in data management in the last ten years? Try this on for size:

  • Structured data in billions of pockets. The iPhone came out in 2007. Every iPhone and Android device — billions of them — has an SQL engine (SQLite) in it.

  • Hadoop, Spark, and other open-source triumphs. The first Hadoop Summit was in 2008; now Hadoop powers Facebook, Twitter, the NSA (we think!), and underlies $3B+ in market cap at Cloudera and Hortonworks. The terrific Spark and SparkSQL projects have had huge impact, and Hadoop and Spark aren’t the end of it. According to http://projects.apache.org/statistics.html, 8 of the 10 busiest (by number of commits) Apache open-source projects in the past year are data-oriented: Ambari, Ignite, Hadoop, Beam, HBase, Flink, Lucene-solr, and Spark. Spark and Flink even have founding members from the DB research community. Some might object that these projects are not from the SIGMOD community per se; we would answer that they embody many (not all!) of our community’s ideas. We should pitch a big tent around as much of data management as possible.

  • Information extraction from intellectual project to mainstream. In 2008, information extraction was a weird corner of AI and database conferences. The database community took a huge leadership role on this topic, with Yago, WebTables, DeepDive, and many more systems. The technology advanced enough to allow the authors of this post to found Lattice Data, which was purchased by Apple last year.

  • Cloud and infrastructure. DB people run, or are influential in, many groups at data-centric companies, including Google, Microsoft (both Office and the cloud), Twitter, Amazon, and many more!

  • Analytics goes mainstream. OLAP used to be an obscure database research topic and an add-on for certain Oracle products. Now Actian Vector (VectorWise) and MonetDB are high-quality analytics systems, Tableau is worth $6.5B, Facebook and Google are unimaginable without analytics, and these analytics may have the power to shape democracy[1].
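The "SQL engine in billions of pockets" above is SQLite, which also ships in Python's standard library. A minimal sketch (the table and values here are invented for illustration):

```python
import sqlite3

# SQLite, the embedded SQL engine that ships on iOS and Android,
# is also bundled with Python; spin one up entirely in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, phone TEXT)")
conn.execute("INSERT INTO contacts VALUES (?, ?)", ("Ada", "555-0100"))

# A relational query, running on the same engine a phone uses.
rows = conn.execute(
    "SELECT name FROM contacts WHERE phone LIKE '555%'").fetchall()
print(rows)  # [('Ada',)]
conn.close()
```

The same engine, with the same SQL surface, sits under contacts, messages, and browser history on those billions of devices.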

The Juicy Middle

Second, the field hasn’t just rested on its laurels, exploiting past discoveries. The last decade of database research has made progress on a lot of hard intellectual problems that underpin our technical world:

  • Approximate query answering

  • Data management for machine learning primitives

  • Distributed RDBMSes (with transactional guarantees) on huge clusters

  • Transaction processing in peer-to-peer (Blockchain)

  • Improved models of data privacy

  • New and asymptotically improved algorithms for graph querying, relational querying, and parallel query processing[2]
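To make the first of those bullets concrete, here is a toy sketch of approximate query answering: scan a uniform sample and scale up. This illustrates the general idea only, not any particular system's algorithm; the table and sample sizes are arbitrary.

```python
import random

# Stand-in for a large fact table we'd rather not scan in full.
random.seed(0)
table = [random.randint(0, 100) for _ in range(1_000_000)]

exact = sum(table)                     # SELECT SUM(x) FROM table
sample = random.sample(table, 10_000)  # 1% uniform sample
# Scale the sample aggregate up by the inverse sampling fraction.
estimate = sum(sample) * len(table) / len(sample)

rel_error = abs(estimate - exact) / exact
print(f"exact={exact}, estimate={estimate:.0f}, relative error={rel_error:.3%}")
```

Real systems add confidence intervals, stratification, and sample maintenance on top of this core idea, trading a little accuracy for orders of magnitude less work.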

That’s not even counting all the interesting data work going on in machine learning and visualization conferences. It’s not in SIGMOD, but it’s very relevant to us, and it’s a good thing we have connections to those fields. It’s amazing that the field has achieved truly international reach as industry and research roar ahead together in the US, China, Europe, the Middle East, and the rest of the world. We should be proud of our contributions and thrilled that we are able to contribute to the most exciting problems of the day.

The Blindingly Bright Future

That said, many of the points we hear about – bad paper reviews, projects that are worthy but hard to pursue, too many papers – have a grain of truth. In many cases, reviews are not what they should be. It is true that we are no longer the only game in town for data management. There are more conferences, more intellectual threads, and our big claim to fame – the RDBMS – is now a much smaller fraction of the data management systems picture. Maybe we have to pick and choose our intellectual agenda more carefully than we used to, in order to make an impact. And, yes, it’s harder to build (and get funding for) really big software projects than it used to be. These are all problems, but most are also symptoms of data management’s incredible ongoing success.

All in all, the horizons and opportunities for data management are far broader and more exciting than they were 10 (or 20 or 30 or 40) years ago. Our broader field is at the forefront of many problems. Here is a woefully incomplete list of threads that we think are insanely exciting (apologies for inevitably missing so many other threads!):

  • The golden age of ML data management is upon us, both intellectually and in terms of commercial investment.

    • ML eats the stack! Kraska et al.’s learned indexes, Andy Pavlo’s self-driving database, Barzan Mozafari’s database learning, and ML-driven analytics in MacroBase – all following in the footsteps of the RAD Lab’s vision.

    • Programming is changing. The world builds ML into every data product, but has essentially no compiler or debugging infrastructure for it. Projects like Snorkel fundamentally re-examine how to program the ML stack.

    • The next generation of frameworks. Tuning of ML models and reinforcement learning in Ray; integration of linear algebra and machine learning into core SQL-like primitives; and more!

  • Hardware is changing the core of data processing. Projects like Quickstep, the use of FPGAs in data processing, rethought query architectures like HyPer, and massively influential projects like column-store pioneer MonetDB.

  • The rise of the data enthusiast. More people than ever are using data processing, so work like natural-language interfaces and GestureDB is paving the way.

  • Data cleaning has been a hugely important problem, with great progress from companies like Tamr and Trifacta, and from research projects like BoostClean and HoloClean – and many more! Much of this work builds on methods for managing uncertainty from Lise Getoor and others.

  • Data science is an organizing principle at many campuses, with impacts on almost every aspect of society.

  • Database people have leadership roles in new data-centric institutions. The Moore-Sloan centers are spearheaded by core DB people at UW’s eScience Institute (like Bill Howe) and at NYU (like Juliana Freire). The initiative at UChicago is run by Mike Franklin, who also cofounded the Berkeley initiative. Hector Garcia-Molina and Chris were among the cofounders of the Stanford Data Science Initiative. Internationally, QCRI is led by Ahmed Elmagarmid.

We may not have the same level of ownership over these topics as we did over the RDBMS[3], but the chance for our ideas to have impact is manyfold greater than it has ever been. It is thrilling to participate in such a wide range of societal-scale problems.

A Different Take on the Challenges Facing the Field

We have some problems that are “good problems to have,” but are still problems. The field can always improve, and in particular, we agree with many of Mike Stonebraker’s concerns–the guy knows his stuff. We think an effective mindset when considering these problems is: how can we continue to attract the best minds on the planet to our area, and how can we build a community that allows those people to do their best work? Here are some ideas:

  1. The paper-as-a-prize model is broken. We agree with Mike: no paper counting! But that’s not because we want people to save up their paper writing for a few large tomes. Conference papers should be a way to share progress on shared endeavors, not a reward at the finish line.

    a. LPUs aren’t the issue. Surajit Chaudhuri once persuasively argued we need papers closer to LPUs to share progress more rapidly. We agree! The field is larger, and frequent structured communication is a good way to disseminate good ideas efficiently and quickly. We should look for ways to disseminate good ideas even more quickly, perhaps by encouraging different paper lengths, or by having paper deadlines immediately prior to conferences, or creating more high-visibility venues (a la CIDR) outside the summertime conference season.

    b. It’s true there’s no single center, and papers are much harder to track. That’s not because people are bad or lazy, but because the world is bigger and better. This is one reason why reviews have probably gone down in quality, despite many quality controls around the review-writing process itself: shared related work is declining. More focused subtracks are one option. We should also consider simply admitting a lot more papers, and recognizing that the average paper’s fit-and-finish will go down, the career value of a paper acceptance will go down, but the stress of a single publication decision will also go down, and the average utility to the reader will go up. Papers will become less like a high-stakes grant application, and more like a careful note to peers[4]. This path seems better than the current cycle of violence, and more practical than reducing the number of papers (thereby forcing the career value of a single acceptance even higher).

  2. Projects should live fully, then die explicitly. If we are reconciled to a world of lots of papers, researchers can at least do everyone a favor and intellectually organize their efforts into a small number of projects. A project should have an online presence that supports the goal of effective peer communication and shared progress. A few ideas on what that could include:

    • An obvious homepage
    • A list of associated publications, with advice on what to read first
    • Use-case background notes: informal feedback or anecdotes from discussions with potential users, quick observations that don’t rise to the level of a formal result, etc.
    • Well-documented open-source code
    • Well-documented reusable datasets
    • A small blog, updated regularly
    • Occasional “office hours” when anyone can get a researcher on Skype
    • For systems, a downloadable VM that allows someone to test-drive the system with a minimum of fuss
    • A Viking funeral when no more updates are coming

These ideas are hardly breakthroughs, but they are observed only haphazardly today (including by the authors of this post!), and they would help enormously.

  3. We have done a better job than most fields at recognizing impact via software and startups, not just papers; let’s extend that same generosity to datasets, models, and data science findings. Building the RDBMS provided fantastic focus for the community, and room for people to contribute via software, often commercialized via startups. It’s a tradition we can be proud of. But many intrinsically interesting (if currently zero-billion-dollar) problems don’t have much of a total addressable market and don’t lend themselves to successful startups. The community has taken steps in the right direction when it comes to recognizing reproducible results, with appropriate awards and an almost-standard practice of open-sourcing experimental code. We can follow the same playbook and recognize a new class of interesting problems by taking data science outputs seriously. We should consider adding SIGMOD awards to recognize the best dataset, the best data science analysis, and so on. This is the right thing to do intellectually, but it’s also pragmatic: zero-billion-dollar problems don’t always stay that way.

  4. We need both theory and systems work. Theory publications shouldn’t come at the expense of systems-centric ones, but neither should we miss out on theoretical advances. We need them! A lot of critical data management topics – including data privacy, machine learning, and data cleaning – remain relatively poorly understood. These curiosity-driven investigations have a way of attracting the best minds and opening new vistas to explore; more pragmatically, the right theory will make the systems sing. One cannot imagine building privacy or ML tools without at least basic guidance from theory.

  5. Let’s define our field by intellectual challenges, not tools. We should focus not on the RDBMS binary, but on the ideas it contains: the world’s first massively successful DSL, one of the maybe four data models that ever saw wide adoption[5], query optimization, transactions, recovery, and more. We may well miss the next grand challenge if we insist that database research involve the binary or all of the previous ideas.

  6. We should pitch a big tent. Data management is huge and exciting, and we have a better shot than anyone else at making contributions. We should avoid the tendency to argue about what really constitutes data management research. It is worth noting that the machine learning community is massive, has huge impact, and is not homogeneous at all. OSDI and NSDI have also pitched a big tent. If we enforce a purity test, we will alienate the brightest young minds on the planet.

This is in some ways the golden age of data management – but it is also fraught with real and public risks. Let’s take our task seriously, do the hard work to change the world for the better with data, and have a good time doing it.

Thanks to many other unnamed folks who helped read and contribute to this post!

  1. Having a lot of impact doesn’t mean the impact will always be positive. This is something for us to work on. 

  2. See the pioneering work of Ngo and others at LogicBlox, Ullman’s work on MapReduce, or Suciu, Koutris, and Salihoglu’s new book.

  3. Though Larry Ellison might disagree about how much ownership we truly had.

  4. In economics, the bar for journal publication has become so high – publication of an article can take years – that it has ceased to function as an effective method for rapid peer communication. Instead, researchers share (privately and publicly) non-peer-reviewed paper drafts, with the expectation that the drafts will be revised and improved. It’s great! 

  5. Text documents, relations, graphs… maybe spreadsheets?