Peter's Paper Musings

Microblogging my favorite systems and database papers

By Peter Kraft

OLTP Through the Looking Glass, and What We Found There

This paper, co-authored by database legend Mike Stonebraker, is an absolute classic. It profiles a transactional database in great detail, determining the purpose of each CPU instruction. The authors find that <10% of CPU instructions are actually performing useful work. The remainder is split roughly evenly among four sources of overhead:

  • Buffer management (moving pages between a buffer pool and disk)
  • Locking (heavyweight row-level locks providing transaction concurrency control)
  • Latching (lightweight locks protecting data structure internals from concurrent accesses)
  • Logging (recording operations before executing them to enable recovery)

On one hand, these results are discouraging because there's no "high pole in the tent." Overhead comes from multiple sources, all of which are critical for traditional database functionality. On the other hand, these results suggest a radically different database architecture that could achieve incredible performance: a database that's wholly in-memory (no buffer pool), single-threaded per partition (no locking or latching), and replicated (no logging).

This insight spawned a long-running research project (H-Store, and later VoltDB), which has shown that if you're willing to accept slightly less flexibility from your database, truly incredible performance is possible.

Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask

I like this paper because it goes out of its way to do a fair experimental comparison of two state-of-the-art query processing techniques: vectorization and data-centric code generation. These are two different ways of executing operations in in-memory columnar databases.

At a high level, the difference between vectorization and data-centric code generation is that vectorized queries process many tuples at once, while code-generated/compiled queries process many operators at once.

In systems that perform vectorization, each operation executes on a large batch--a vector--of data. For example, if you're summing a column, you might fetch 1000 numbers from the column at once and add them all up. The idea is to amortize iterator call overhead over many tuples and take advantage of CPU parallelism (SIMD).
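To make that concrete, here's a tiny Python sketch of the vectorized style. This is my own toy example, not the paper's code: a real vectorized engine operates on columnar data and uses SIMD instructions, which plain Python can't express.

    # A toy vectorized sum: each operator call handles a whole batch ("vector")
    # of values, so per-tuple iterator overhead is paid once per batch.

    VECTOR_SIZE = 1000

    def scan_in_vectors(column, vector_size=VECTOR_SIZE):
        # Yield the column one fixed-size batch at a time.
        for start in range(0, len(column), vector_size):
            yield column[start:start + vector_size]

    def vectorized_sum(column):
        total = 0
        for vector in scan_in_vectors(column):
            total += sum(vector)   # one call processes 1000 values
        return total

    print(vectorized_sum(list(range(10_000))))  # 49995000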

In systems that perform data-centric code generation, multiple operations are fused together and applied to each tuple one at a time. For example, if your query is for green cars with four tires, you might fuse those together into a tight loop that iterates through every record and checks both its color and its number of tires. The idea is to amortize iterator call overhead over many iterators and take advantage of compiler optimizations on the generated code.
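And here's the same kind of sketch for the data-centric style, using the green-car example. Again, this is just an illustration: a real engine would generate and compile this loop (for example, to LLVM IR) specifically for the query rather than hand-writing it.

    # Both predicates are fused into one tight per-tuple loop: each record is
    # touched exactly once and never materialized into intermediate batches.
    # Pretend this loop was emitted by the query compiler for this query.

    from dataclasses import dataclass

    @dataclass
    class Car:
        color: str
        num_tires: int

    def fused_scan(cars):
        results = []
        for car in cars:                                      # tuple at a time...
            if car.color == "green" and car.num_tires == 4:   # ...through both operators
                results.append(car)
        return results

    cars = [Car("green", 4), Car("red", 4), Car("green", 3)]
    print(fused_scan(cars))  # [Car(color='green', num_tires=4)]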

To really analyze both types of systems in a fair environment, the authors built two custom query engines that were identical in every way except their query execution strategy. They found that both strategies are good!

  • Code generation: Slightly better for compute-intensive queries due to better cache performance for individual operations. Also works better for smaller (OLTP) queries because it's tuple-at-a-time.
  • Vectorization: Slightly better at parallel data access for memory-bound queries and doesn't require a compilation step.

But both work well and are used in widely adopted systems.

Column-Stores vs. Row-Stores: How Different Are They Really?

I like this paper from the early days of column stores because it explains how column stores are so effective for analytics and why their benefits are hard to replicate with the traditional row store architecture.

The obvious reason column stores are effective compared to row stores is that they allow queries to only access the columns they need and ignore the others. However, even when that behavior is simulated in row stores (for example, using materialized views containing only the columns needed by a query), column stores perform far better on analytical workloads. Why? The authors find a few reasons!

  1. Better block iteration. On analytical queries that evaluate a large number of tuples, a row store needs to iterate through tuples one at a time, extracting data from each tuple to perform predicate evaluation or aggregation. A column store can operate on blocks of values from the same column in a single function call, which takes advantage of CPU caching and parallelism (which is even more true now than it was back then). See the code sketch after this list.
  2. Compression. Column stores put similar data from the same column together, which can enable enormous space savings, and potentially enormous speedups, through compression, for example on a very sparse column. By contrast, a row store has to store whole rows, so even if one column is very sparse or otherwise low-entropy, it is still stored in its entirety.
  3. Late materialization. Even if a column store needs to return rows to answer a query, it can perform most operations (predicates, joins) only on a handful of columns and then materialize rows later for the data the query returns. This can greatly reduce time spent processing values in columns that are returned but not evaluated.
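Here's a tiny Python sketch of reasons 1 and 3 together (my own toy example, nothing from the paper): the query "SELECT name FROM t WHERE price > 100" scans the price column a block at a time, then materializes only the matching name values at the very end.

    BLOCK_SIZE = 4

    # Columnar layout: one Python list per column, values aligned by position.
    table = {
        "name":  ["a", "b", "c", "d", "e", "f", "g", "h"],
        "price": [10, 250, 40, 300, 90, 120, 5, 700],
    }

    def selection_positions(column, predicate, block_size=BLOCK_SIZE):
        # Evaluate the predicate over the column one block at a time,
        # returning the positions (row IDs) of matching values.
        matches = []
        for start in range(0, len(column), block_size):
            block = column[start:start + block_size]   # one tight loop per block
            matches.extend(start + i for i, v in enumerate(block) if predicate(v))
        return matches

    def materialize(table, column_name, positions):
        # Late materialization: fetch only the needed column, only at the end.
        column = table[column_name]
        return [column[pos] for pos in positions]

    positions = selection_positions(table["price"], lambda p: p > 100)
    print(materialize(table, "name", positions))  # ['b', 'd', 'f', 'h']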

Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads

I really like this paper because it identifies a narrow but critical problem in ultra-low-latency systems, proposes a solution, then exhaustively benchmarks it.

The key observation of this paper is that efficiently serving requests with ultra-low latencies is hard because the operating system allocates cores at a granularity of several milliseconds. Therefore, even with an ultra-fast networking stack that can serve requests in microseconds, when a new request comes in you still need to wait milliseconds to get CPU time.

Prior systems got around that problem through static allocation: they had many cores busy-spin waiting for requests so that if a new request came in, it could be handled immediately. However, that's inefficient, because those cores are doing nothing most of the time.

The main idea in Shenango is to have a single busy-spinning thread, the IOKernel, which both makes core allocation decisions and handles network I/O for a number of application runtimes. The applications all run on user threads on top of kernel threads allocated to Shenango. This setup allows Shenango to efficiently allocate cores between the applications, letting them all serve requests with single-digit microsecond latency even as their relative loads change.

The IOKernel polls the NIC receive queue directly to find packets to forward to applications and polls application egress queues to forward their packets to the NIC. While doing this, it keeps track of how many packets are queued for processing at each application, granting cores to applications with packets waiting in their queues (and reclaiming cores from applications with nothing queued). Because all applications run in user threads, these core allocation decisions execute in microseconds. There are also a ton of optimizations under the hood, particularly around ensuring application cache locality.
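Here's a heavily simplified Python sketch of that control loop. This is entirely my own stand-in: the real IOKernel is written in C on top of DPDK, revisits these decisions every few microseconds, and none of the names below are Shenango's actual API.

    import collections

    class App:
        def __init__(self, name):
            self.name = name
            self.ingress = collections.deque()   # packets waiting to be processed
            self.cores = set()                   # cores currently granted to this app

    def iokernel_round(apps, idle_cores, nic_rx, route):
        # The real IOKernel busy-spins, running rounds like this continuously.
        # 1. Poll the NIC receive queue and steer each packet to its application.
        #    (It also polls application egress queues to feed the NIC; omitted here.)
        for packet in nic_rx():
            route(packet).ingress.append(packet)

        # 2. Core allocation: an app with queued packets gets an idle core;
        #    an app with nothing queued hands its cores back to the idle pool.
        for app in apps:
            if app.ingress and idle_cores:
                app.cores.add(idle_cores.pop())
            elif not app.ingress and app.cores:
                idle_cores.add(app.cores.pop())

    # Tiny demo: two apps, packets arrive only for "memcached".
    apps = [App("memcached"), App("batch-job")]
    idle = {1, 2, 3}
    iokernel_round(apps, idle, nic_rx=lambda: ["pkt-a", "pkt-b"],
                   route=lambda pkt: apps[0])
    print([(a.name, sorted(a.cores)) for a in apps])  # memcached gets a core; batch-job gets none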

The extensive evaluation section shows incredible performance--serving cache requests for memcached on a single 12-core machine, it could handle 5M requests/second with a median response time of 37 microseconds and a p99.9 of 93 microseconds. Try comparing that to your own stack!

Main takeaway? Modern computers can be unbelievably, incredibly fast if we really want them to. We often don't take advantage of this performance because we like having high-level abstractions like OS thread scheduling or an OS network stack, but it's good to know what's possible for when it's really needed.

Kangaroo: Caching Billions of Tiny Objects on Flash

I really like this paper because it not only presents a clever algorithmic solution to an important systems problem, but also thoroughly evaluates it on real-world data.

The basic challenge here is that many systems (especially, but not only, in social media) want to cache billions of tiny objects (like new posts/messages) on SSDs to improve serving performance. However, existing cache strategies don't work well.

Log-structured caches write objects sequentially and index them in memory, but for tiny objects that index grows too large to fit in memory. Set-associative caches hash objects into "sets" so you don't need an index--you can look up an object's page by its hashed key--but every update requires an entire page write, which rapidly degrades the SSD (you can only write to an SSD so many times before it wears out).

This paper's clever idea is to combine the two cache strategies to get their advantages without their disadvantages. Incoming writes are buffered in a small log-structured cache, which writes to the SSD efficiently (sequentially, a full page at a time) but doesn't need much memory (because it's small). Periodically, objects are flushed from the log into a much larger set-associative cache, batched so that objects bound for the same set are written together, which avoids degrading the SSD. When a read comes in, it first checks the log-structured cache, then goes to the larger set-associative cache.
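Here's a minimal Python sketch of that read/write flow. It's my own simplification: Python dicts stand in for the on-flash log and sets, and the sizes are made up.

    NUM_SETS = 8
    LOG_CAPACITY = 4

    log_cache = {}                                   # small: key -> value
    set_cache = [dict() for _ in range(NUM_SETS)]    # large: one dict per "set" (page)

    def _set_for(key):
        return hash(key) % NUM_SETS

    def put(key, value):
        log_cache[key] = value                       # cheap sequential append on flash
        if len(log_cache) >= LOG_CAPACITY:
            flush_log()

    def flush_log():
        # Group buffered objects by destination set so each set (one flash page)
        # is rewritten once per flush instead of once per object.
        by_set = {}
        for key, value in log_cache.items():
            by_set.setdefault(_set_for(key), {})[key] = value
        for set_id, objects in by_set.items():
            set_cache[set_id].update(objects)        # one page write per set
        log_cache.clear()

    def get(key):
        if key in log_cache:                         # recent writes live in the log
            return log_cache[key]
        return set_cache[_set_for(key)].get(key)     # older objects live in their set

    for i in range(6):
        put(f"post:{i}", f"payload {i}")
    print(get("post:0"), get("post:5"))              # one flushed to a set, one still in the log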

This design produces a cache that's fast, doesn't require much memory, and doesn't degrade SSDs. The authors prove this with an extensive evaluation on production Facebook traces, verifying all these objectives.

One big takeaway: There are only so many ways you can optimize a system, no matter how large or complex. Caching and buffering are basic strategies, but if used cleverly are very effective!

How Good Are Query Optimizers, Really?

Query optimizers are a magic part of modern databases. You just write your query and the database figures out the fastest way to execute it! But how well do query optimizers work, really?

I like this paper because it empirically analyzes query optimization techniques on real-world data and draws some surprising conclusions. The authors focus in particular on the join order problem--in a query with multiple joins, in what order and how does the database execute the joins? This is important because the correct join execution strategy can make queries an order of magnitude faster.

The trickiest part of join order optimization is cardinality estimation, figuring out how many rows would be returned by a join. Good cardinality estimates allow the optimizer to pick joins that return few rows, making the query much faster. What the authors find is that popular benchmarks like TPC-H are nearly useless for measuring the accuracy of cardinality estimation. That's because the data in these benchmarks is synthetic and unrealistically uniform and independent in a way that trivializes cardinality estimation. On real data, the authors find that cardinality estimates are often off by a factor of more than 1,000, because real data is not uniform and independent in the way query optimizers expect.
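Here's a toy Python illustration of how the independence assumption goes wrong on correlated data. The example and numbers are my own, not the paper's benchmark; a textbook optimizer estimates the selectivity of a conjunction by multiplying the individual selectivities.

    import random
    random.seed(0)

    # A synthetic "cars" table where model is perfectly correlated with make.
    makes_models = [("Honda", "Civic"), ("Honda", "Accord"),
                    ("Ford", "F150"), ("Ford", "Focus"), ("Toyota", "Corolla")]
    rows = [random.choice(makes_models) for _ in range(100_000)]

    sel_make = sum(1 for make, _ in rows if make == "Honda") / len(rows)
    sel_model = sum(1 for _, model in rows if model == "Civic") / len(rows)

    estimated = sel_make * sel_model * len(rows)   # the independence assumption
    actual = sum(1 for make, model in rows if make == "Honda" and model == "Civic")

    print(f"estimated rows: {estimated:.0f}, actual rows: {actual}")
    # The estimate lands well below the truth, because every Civic is a Honda.
    # With more correlated predicates and more joins, such errors compound
    # multiplicatively, which is how 1000x misestimates happen.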

Interestingly, the authors find that these disastrous estimates don't always lead to disastrously slow queries. The most common cause of awful performance is when the optimizer tries to be clever and chooses an unindexed nested-loop join based on a low cardinality estimate when the real cardinality is much higher. When those are excluded, the largest gaps in performance come in queries with many joins on foreign key indexes, which expand the size of the search space and make the gap between the fastest and slowest query plans much larger.

What are the takeaways? Well, first, be very careful benchmarking on synthetic data--simplifications like independence and uniformity can easily creep in. Second, query optimization matters much more for especially complex queries with many joins and many indexes. If you have those, watch out for what the query optimizer is doing!

NanoLog: A Nanosecond-Scale Logging System

This cool paper from John Ousterhout's group shows how to create an incredibly fast logging system. I like it because it combines deep performance analysis with clever optimizations to make a system as fast as possible.

NanoLog is implemented as a more-or-less drop-in replacement for printf that's ~100x faster. The key observation is that the vast majority of log messages are never actually read. Thus, NanoLog tries to do as little work as possible at logging time, deferring work to a postprocessor that runs when someone actually reads a log entry.

When you log a message in NanoLog, it records only the dynamic part of the log message to an in-memory "staging buffer." To avoid overhead from synchronization or cache effects, these are single-producer, single-consumer per-thread circular buffers. Then, a background thread reads from each circular buffer, performs lightweight compression on each record, and writes it to disk. When you actually read a message, a postprocessor decompresses the records, combines them with their original static log messages, and presents them to you just like a conventional logger.
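Here's a minimal Python sketch of that division of labor. It's my own stand-in: real NanoLog is C++, uses lock-free per-thread ring buffers and a compact binary on-disk format, and none of the names below come from its API.

    import time

    FORMATS = {}            # static format strings, registered once per call site
    staging_buffer = []     # stand-in for a per-thread ring buffer
    log_file = []           # stand-in for the compressed on-disk log

    def register(fmt):
        FORMATS[len(FORMATS)] = fmt
        return len(FORMATS) - 1

    def nano_log(fmt_id, *args):
        # Hot path: no string formatting, no I/O; just append a small tuple.
        staging_buffer.append((fmt_id, time.monotonic_ns(), args))

    def background_flush():
        # Normally a separate thread: drain staging buffers, lightly compress,
        # and write binary records to disk. Compression is omitted here.
        log_file.extend(staging_buffer)
        staging_buffer.clear()

    def postprocess():
        # Runs only when a human actually reads the log: re-inflate each record
        # by combining the static format string with its saved dynamic arguments.
        return [f"{ts}: " + FORMATS[fmt_id].format(*args) for fmt_id, ts, args in log_file]

    REQ_DONE = register("handled request {} in {} us")
    nano_log(REQ_DONE, 42, 17)
    background_flush()
    print(postprocess())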

The paper has a detailed evaluation section showing which optimizations actually work. The key takeaway is that the two most effective optimizations were 1) logging only dynamic content, not static strings, and 2) performing lightweight compression before writing to disk. I wish more papers would do analyses like this; it's good science!

Serializable Snapshot Isolation in PostgreSQL

Up until 2011, Postgres had a subtle lie: it nominally supported running transactions with the strongest level of isolation--serializable isolation--but if you actually tried, your transactions would run with snapshot isolation, a weaker level.

The reason Postgres had difficulty supporting serializable isolation is that its concurrency control is fundamentally multi-version and snapshot-based. This enables much better concurrency and performance than traditional two-phase locking, but makes providing serializability much harder.

To make Postgres serializable, the database community had to invent a whole new technique extending Postgres's native snapshot isolation: the aptly named serializable snapshot isolation.

The core algorithm is complex, but the main idea is to build a "serialization graph" tracking the read/write dependencies between concurrent transactions. Then, after the transaction's work is done but before committing, check the serialization graph for "dangerous structures" that indicate the transaction may not be serializable and abort/retry the transaction if they're found.

Because these checks run after the transaction finishes rather than blocking it while it runs, and because they have relatively low false positive rates, SSI provides serializable isolation with a much higher level of concurrency (and better performance) than traditional lock-based techniques.
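Here's a stripped-down Python sketch of the dangerous-structure check described above (my own simplification; the real implementation tracks predicate locks, commit ordering, and many subtle cases): a transaction is flagged when it's the "pivot" of two consecutive read-write conflicts, meaning a concurrent transaction overwrote something it read, and it overwrote something another concurrent transaction read.

    def has_dangerous_structure(txn, rw_conflicts):
        # rw_conflicts is a set of (reader, writer) pairs, meaning the reader read a
        # version that the (concurrent) writer later overwrote.
        has_in_conflict = any(writer == txn for _, writer in rw_conflicts)   # T_in -rw-> txn
        has_out_conflict = any(reader == txn for reader, _ in rw_conflicts)  # txn -rw-> T_out
        return has_in_conflict and has_out_conflict

    # T1 read something T2 overwrote, and T2 read something T3 overwrote,
    # so T2 is a pivot and must be aborted and retried.
    conflicts = {("T1", "T2"), ("T2", "T3")}
    print(has_dangerous_structure("T2", conflicts))  # True
    print(has_dangerous_structure("T1", conflicts))  # False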

What Goes Around Comes Around: A Database History

Want to learn some database history? Check out this classic paper by Joe Hellerstein and Mike Stonebraker! It covers the history of databases up to 2005, and there's a sequel that goes up to the present.

What I found most interesting about this paper is that the relational model we all use now was neither obvious nor inevitable. Instead, early databases used different models that seem simpler and are easier to implement, but are much more complex to use in practice.

For example, IBM's IMS database from the 1960s had a hierarchical data model in which every record has a unique key and a single parent record. To query the data you would imperatively "walk the tree" and iterate through records in subtrees to find the data you're looking for. This kind of model is easy to understand and relatively simple to implement, but it makes large queries hard because you have to optimize them yourself (figuring out the most efficient way to walk the tree). It also can't model non-hierarchical relationships.

The contrast with these early data models shows why the relational model is so powerful: it can model almost any relationship, and, when combined with a declarative query language like SQL, it facilitates query optimizers that almost magically figure out the fastest way to give you the data you want.

Building An Elastic Query Engine on Disaggregated Storage

I love this paper from Snowflake diving deep into some of the design choices in their signature data warehouse.

The key idea behind Snowflake is compute-storage disaggregation. Instead of storing persistent data on compute nodes like in a traditional database, Snowflake stores it in S3 and downloads it to compute nodes only when needed for a query. This architecture is incredibly elastic because compute resources are allocated only to users who actually need them.

What's especially interesting about this paper is the set of optimizations it describes to make a disaggregated data warehouse work at scale. One particular challenge is managing ephemeral data, data needed only for a specific query (for example, tables exchanged during joins). This is hard because, unlike persistent data, its volume is highly unpredictable. To solve this problem, Snowflake uses a custom storage system that couples ephemeral data to compute, storing it in memory and on local disk and spilling it to S3 when necessary. They also use this system as a write-through cache for persistent data to minimize the cost of moving it around under heavy traffic.
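Here's a toy Python sketch of that tiering idea. It's entirely my own stand-in, not Snowflake's code: the three tiers are just dicts and the capacities are made up, but it shows ephemeral data landing in the fastest tier with room and spilling downward as tiers fill up.

    MEMORY_CAPACITY = 2      # number of objects, for illustration only
    LOCAL_DISK_CAPACITY = 2

    memory, local_disk, object_store = {}, {}, {}    # fastest -> slowest tier

    def put_ephemeral(key, value):
        if len(memory) < MEMORY_CAPACITY:
            memory[key] = value
        elif len(local_disk) < LOCAL_DISK_CAPACITY:
            local_disk[key] = value                  # spill to local disk
        else:
            object_store[key] = value                # spill to S3 as a last resort

    def get_ephemeral(key):
        for tier in (memory, local_disk, object_store):
            if key in tier:
                return tier[key]
        return None

    for i in range(5):
        put_ephemeral(f"join-partition-{i}", b"...")
    print(len(memory), len(local_disk), len(object_store))  # 2 2 1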

A Survey of B-Tree Locking Techniques

Want to learn how database locks actually work? Check out this incredibly thorough review by database legend Goetz Graefe, which dives deep into how databases use locks to protect your data and the integrity of your transactions.

One of the most interesting distinctions in this paper is between locks and latches:

  • Locks: Provide concurrency control between transactions. They're heavyweight, meant to be held for a long time, and support complex scheduling and deadlock detection policies. However, as a result, they're expensive to acquire and release, requiring thousands of CPU cycles.
  • Latches: Protect individual data structures from concurrent accesses by different threads/processes. They're lightweight (tens of CPU cycles per acquire/release), are held only while the data structure is being read or updated, and have minimal scheduling or deadlock detection capabilities and thus must be used very carefully. You might grab a latch before physically modifying a B-tree page in memory to ensure no one else concurrently writes to that page (see the sketch right after this list).
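Here's a tiny Python analogy for the latch half (not real database internals, just an illustration): a latch behaves like a short-lived mutex around one in-memory structure, grabbed only for the duration of the physical change, while a lock would be a logical, transaction-duration object managed by a lock table.

    import threading

    class BTreePage:
        def __init__(self):
            self.latch = threading.Lock()   # lightweight, held only during the change
            self.keys = []

        def insert_key(self, key):
            with self.latch:                # grab the latch, modify, release immediately
                self.keys.append(key)
                self.keys.sort()

    page = BTreePage()
    page.insert_key(42)
    page.insert_key(7)
    print(page.keys)  # [7, 42]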

An Empirical Evaluation of In-Memory Multi-Version Concurrency Control

I really like this paper because it dives deep into the most widely used type of concurrency control today: multi-version concurrency control (MVCC).

The basic idea behind MVCC is for a database to store multiple versions of each data item so that transactions can read from items being written to by other transactions. This can greatly improve transaction throughput by allowing any number of reads to occur concurrently with a write.
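Here's a bare-bones Python sketch of that core idea (my own simplification: a single item, no garbage collection, no write-write conflict handling, no commit protocol). Readers pick the newest version that existed at their snapshot, so they never block on an in-progress writer.

    versions = []   # list of (begin_timestamp, value), oldest first

    def write(value, commit_ts):
        versions.append((commit_ts, value))       # a write adds a new version

    def read(snapshot_ts):
        visible = [v for ts, v in versions if ts <= snapshot_ts]
        return visible[-1] if visible else None   # newest version visible to the snapshot

    write("v1", commit_ts=10)
    write("v2", commit_ts=20)
    print(read(snapshot_ts=15))  # 'v1' -- a reader at ts=15 ignores the later write
    print(read(snapshot_ts=25))  # 'v2'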

Of course, there are many ways to implement MVCC, and the paper describes several of the most popular approaches and their tradeoffs, with detailed experiments to show how much those tradeoffs matter in practice.

When I was working on concurrency control in grad school, I spent a lot of time looking at this paper and its references to learn the field!

Metastable Failures in the Wild

Want to learn how systems fail at scale? Check out this paper on metastable failures, a complex class of failures characteristic of big distributed systems. A metastable failure is a bad state that continues even after the triggering event is removed. For example, if an overloaded system continues to perform badly even after load decreases, that's a metastable failure.

Here's a simple example of a metastable failure:

  1. An application server is operating healthily, but near saturation.
  2. A network issue takes it offline for a few seconds.
  3. When the server comes back online, it experiences a huge surge of traffic from clients retrying failed requests.
  4. This surge may cause requests to time out, triggering more retries (if the retry policy is naive).
  5. This cascades and prevents the server from doing useful work until the retry policy is fixed.

Notice how counterintuitive this is! The system was healthy until a brief outage happened, but that brief outage sent it into a catastrophic persistent failure even though nothing else about the system or workload changed.
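Here's a toy Python simulation of that feedback loop. All the numbers and the reduced "overloaded capacity" model are my own invention for illustration (the paper's case studies are much richer): once the server is overloaded, work wasted on requests that will time out anyway cuts its useful throughput, so the retry backlog never drains.

    HEALTHY_CAPACITY = 100     # requests served per tick when not overloaded
    OVERLOADED_CAPACITY = 60   # effective throughput while thrashing on timeouts
    NEW_REQUESTS = 90          # steady client load per tick (fine in the healthy state)

    backlog = 0                # failed requests that will be retried next tick
    for tick in range(10):
        outage = tick in (2, 3)                    # the trigger: a brief outage
        offered = NEW_REQUESTS + backlog
        if outage:
            capacity = 0
        elif offered <= HEALTHY_CAPACITY:
            capacity = HEALTHY_CAPACITY
        else:
            capacity = OVERLOADED_CAPACITY
        served = min(capacity, offered)
        backlog = offered - served
        print(f"tick {tick:2d}: offered={offered:4d} served={served:3d} backlog={backlog:4d}")
    # The outage is over by tick 4, but the backlog keeps growing: 90 new requests
    # arrive each tick while the overloaded server completes only 60, so the system
    # never recovers without outside intervention (shedding load or fixing retries).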

Check out the paper to see more examples, detailed real-world case studies, and proposed solutions!

Consistency Tradeoffs in Modern Distributed Database Design

The famous CAP theorem is overrated. What the CAP theorem says is that in the event of a network partition, a system cannot be both consistent and available. This tradeoff is important, but not often relevant because network partitions are rare.

Instead, the important tradeoff is between consistency and latency. In a distributed system, these are fundamentally at odds, as stronger consistency models require more coordination between nodes, which increases latency. Unlike the CAP tradeoff, the consistency-latency tradeoff is one you always have to worry about, not just during a network partition.

Thus, when developers adopt a weaker consistency model, they're usually not doing it "because of CAP," they're doing it to improve latency.

This paper captures this idea in the "PACELC" theorem: in case of a Partition (P), you have to choose between Availability (A) and Consistency (C); Else (E), you choose between Latency (L) and Consistency (C).