Distributed Transactions in Databases: From EPaxos to Accord

Replication vs. Sharding

Before we start, let’s recap two terms that come up constantly in distributed systems: replication and sharding. They solve different problems, but they are closely related and often appear together, so people sometimes confuse them.

Replication

Replication is when you store the same piece of data on multiple nodes (usually on different physical machines) to ensure high availability. Most, if not all, databases employ some form of replication. In general, replication is challenging (messages can be lost and nodes can fail), so you need a consensus protocol to ensure that different replicas agree on the same value and don’t diverge. Depending on the availability and consistency guarantees they target, different databases choose different replication strategies.

Sharding

Sharding comes into play when your data cannot fit on a single machine, or when you deliberately want to spread it across different nodes. Sharding, a.k.a. partitioning, is supported by most, if not all, NoSQL databases. Distributed transactions, i.e. transactions across shards, are very challenging.
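To make sharding concrete, here is a minimal Java sketch of hash-based routing (the class name, the fixed shard count, and shardFor are all made up for illustration; real systems typically use consistent hashing or range partitioning):

public class ShardRouter {
    // Hypothetical fixed shard count. Real systems usually use consistent
    // hashing or range partitioning so that adding a shard does not remap
    // most keys.
    private static final int NUM_SHARDS = 4;

    // Route a key to a shard by hashing it.
    static int shardFor(String key) {
        // Math.floorMod avoids negative shard ids for negative hash codes.
        return Math.floorMod(key.hashCode(), NUM_SHARDS);
    }

    public static void main(String[] args) {
        System.out.println(shardFor("user:42"));
        System.out.println(shardFor("user:1337"));
    }
}

A transaction that touches keys living on different shards is a distributed transaction, which is exactly the hard case this post builds toward.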

Intro to Consensus Algorithms

Let’s now get started by talking about replication. As we just said, consensus algorithms are often used in databases to ensure that different replicas see the same value. Note that some of these algorithms are unsafe under certain failure modes, and some only ensure eventual consistency. Today we are only talking about consensus algorithms that guarantee strong consistency. The term “strong consistency” is overloaded and means different things in different contexts. Here, we mean that these algorithms guarantee linearizability for a single key, even in the presence of node failures.

Paxos (1998)

In classic Paxos, any replica can make a proposal, so concurrent proposers may race for the same decision. To stay safe despite such races, a proposal goes through two phases, prepare/promise and accept/accepted, and each phase must reach a majority of replicas. As a result, a Paxos proposal takes two round trips to commit.
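For intuition, here is a toy single-decree Paxos in Java. It runs in one process with no networking or failures, and every name in it is mine rather than the paper’s; it exists only to make the two round trips and the safety rule visible:

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Toy single-decree Paxos: one process, no networking, no failures.
public class PaxosSketch {

    static class Acceptor {
        long promised = -1;        // highest ballot promised so far
        long acceptedBallot = -1;  // ballot of the last accepted value
        String acceptedValue = null;

        // Phase 1 (round trip 1): prepare / promise.
        boolean prepare(long ballot) {
            if (ballot <= promised) return false; // reject stale ballots
            promised = ballot;
            return true;
        }

        // Phase 2 (round trip 2): accept / accepted.
        boolean accept(long ballot, String value) {
            if (ballot < promised) return false;
            promised = ballot;
            acceptedBallot = ballot;
            acceptedValue = value;
            return true;
        }
    }

    static Optional<String> propose(List<Acceptor> acceptors, long ballot, String value) {
        int majority = acceptors.size() / 2 + 1;

        // Round trip 1: collect promises from a majority.
        List<Acceptor> promisers = new ArrayList<>();
        for (Acceptor a : acceptors) {
            if (a.prepare(ballot)) promisers.add(a);
        }
        if (promisers.size() < majority) return Optional.empty();

        // If some acceptor already accepted a value, we must re-propose the
        // one with the highest ballot; this rule is what keeps racing
        // proposers from deciding different values.
        String chosen = value;
        long best = -1;
        for (Acceptor a : promisers) {
            if (a.acceptedValue != null && a.acceptedBallot > best) {
                best = a.acceptedBallot;
                chosen = a.acceptedValue;
            }
        }

        // Round trip 2: ask the acceptors to accept the chosen value.
        int acks = 0;
        for (Acceptor a : acceptors) {
            if (a.accept(ballot, chosen)) acks++;
        }
        return acks >= majority ? Optional.of(chosen) : Optional.empty();
    }

    public static void main(String[] args) {
        List<Acceptor> acceptors = List.of(new Acceptor(), new Acceptor(), new Acceptor());
        System.out.println(propose(acceptors, 1, "x=42")); // Optional[x=42]
    }
}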

Multi-Paxos (2001) & Raft (2014)

Two round trips for a single operation sounds bad, doesn’t it? Multi-Paxos (proposed by the author of Paxos in 2001) and Raft (proposed in 2014 by researchers in search of an understandable consensus algorithm) reduce the two round trips to one, greatly improving the efficiency of reaching consensus. The trick is simple: let one replica be the leader, and require that all client write and read requests go through the leader. Remember that Paxos needs two round trips because there can be concurrent proposals from different replicas? With a single leader, the leader orders all requests, and there are no conflicts anymore.
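Here is the same idea as a toy Java sketch (again in-process and failure-free; real Multi-Paxos and Raft add ballots/terms, leader election, and log repair on top of this):

import java.util.ArrayList;
import java.util.List;

// Toy leader-based replication (Multi-Paxos / Raft flavor). Because a
// single leader orders all requests, committing a write takes only one
// round trip to the followers.
public class LeaderSketch {

    static class Follower {
        final List<String> log = new ArrayList<>();

        // Followers simply apply entries in the order the leader assigned.
        boolean append(int index, String entry) {
            if (index != log.size()) return false; // out of order: reject
            log.add(entry);
            return true;
        }
    }

    static class Leader {
        final List<String> log = new ArrayList<>();
        final List<Follower> followers;

        Leader(List<Follower> followers) { this.followers = followers; }

        // One round trip: assign the next log index, replicate, and commit
        // once a majority of the cluster has the entry.
        boolean write(String entry) {
            int index = log.size();
            log.add(entry);
            int acks = 1; // the leader's own copy counts
            for (Follower f : followers) {
                if (f.append(index, entry)) acks++;
            }
            return acks > (followers.size() + 1) / 2;
        }
    }

    public static void main(String[] args) {
        Leader leader = new Leader(List.of(new Follower(), new Follower()));
        System.out.println(leader.write("x=1")); // true: committed in one round trip
    }
}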

EPaxos (2012)

Similar to Multi-Paxos and Raft, EPaxos also aims to avoid the costly two round trips and use only one whenever possible. Unlike those leader-based protocols, however, EPaxos is leaderless. How come? Didn’t we just say that concurrent proposals from multiple proposers (a replica that starts a proposal is called a proposer) can race? Well, think about why concurrency causes conflicts in the first place: with concurrent proposals, different replicas may receive requests in different orders. But does order really matter? If two commands touch disjoint keys, they commute: applying them in either order yields the same state, so replicas only need to agree on the order of conflicting commands (a small conflict check is sketched after the figures below).

Left: no conflict, fast path for both C1 and C2. Right: conflict, fast path for C4 but slow path for C3
EPaxos vs. Multi-Paxos
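To make “conflict” concrete, here is a tiny Java sketch of a conflict check in the spirit of EPaxos’s interference relation (the class and method names are my own, and Java 17 records are assumed):

import java.util.Set;

// Two commands need to be ordered only if they overlap on some key and
// at least one of them writes it; otherwise they commute and both can
// take the fast path.
public class ConflictSketch {

    record Command(Set<String> keys, boolean isWrite) {}

    static boolean conflicts(Command a, Command b) {
        if (!a.isWrite() && !b.isWrite()) return false; // reads always commute
        return a.keys().stream().anyMatch(b.keys()::contains);
    }

    public static void main(String[] args) {
        Command w1 = new Command(Set.of("x"), true);
        Command w2 = new Command(Set.of("y"), true);
        Command w3 = new Command(Set.of("x"), true);
        System.out.println(conflicts(w1, w2)); // false: different keys, fast path
        System.out.println(conflicts(w1, w3)); // true: same key, must be ordered
    }
}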

Intro to Distributed Transactions

Finally, we have come to distributed transactions. Wait… but why did we spend so much time discussing distributed replication? Because modern distributed databases have both replication and sharding, and you need to take replication into consideration when designing a distributed transaction algorithm!

Single leader

FaunaDB and FoundationDB use a single leader to order conflicting transactions. This is great because it is (relatively) easy to implement and straightforward to understand, but the single leader becomes a throughput bottleneck, making it impractical for very large clusters with hundreds of nodes.

Multiple leaders

Many other databases use multiple leaders, but then the biggest challenge is ensuring a consistent order for concurrent transactions that span different shards. A natural question to ask is: why not just use wall clocks to determine the order of transactions? The answer is that wall clocks are inaccurate! Your laptop’s clock might differ from someone else’s by a few milliseconds or even seconds. In a highly concurrent environment, even a few milliseconds of clock drift can cause serious inconsistencies.
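A toy simulation of the problem, with invented numbers:

// Node B's clock runs 5 ms slow, so a transaction that happens strictly
// after one on node A can still receive a smaller wall-clock timestamp.
public class ClockSkewSketch {
    public static void main(String[] args) {
        long trueTimeMs = 1_000_000;  // some instant in "real" time
        long skewBMs = -5;            // node B's clock is 5 ms behind

        long tsA = trueTimeMs;                 // txn 1 commits on node A
        long tsB = (trueTimeMs + 2) + skewBMs; // txn 2 starts 2 ms later on node B

        // txn 2 causally follows txn 1, yet its timestamp is smaller:
        System.out.println(tsB < tsA); // true, so timestamp order is wrong
    }
}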

Intro to Accord

Accord uses a completely different approach. Recall that in leaderless replication protocols (EPaxos and its variants), conflicting operations (writes to the same key) must be ordered. Also recall that in distributed transactions, conflicting transactions (those whose reads/writes overlap) must be ordered. Sound similar? This is indeed the motivation of Accord: since replication and transactions share the same ordering requirement, can we use a single protocol to achieve both?

Flexible Fast-path Electorates

You may remember we said earlier that the fast-path quorum requires a supermajority rather than a simple majority. Why is that the case? Here is an intuitive explanation (with a concrete count after the figures):

5-replica set with fast quorum = 4
5-replica set with fast quorum = 3
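To make the counting concrete, here is one way to derive the bound, borrowing the quorum-intersection condition from Fast Paxos (my simplification, not necessarily the formulation used in the Accord paper): a value decided on the fast path must survive recovery, so any recovery quorum of size m has to intersect any two fast quorums of size F, i.e. m + 2F > 2n. With n = 5 replicas and a simple-majority recovery quorum of m = 3, we get 3 + 2F > 10, hence F >= 4, matching the first figure. As I understand it, Accord’s flexible electorates can still run with a fast quorum of 3 (the second figure) by restricting which replicas may vote on the fast path, so the same intersection argument holds within that smaller electorate.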

Message Reorder Buffer

When we discussed EPaxos, we said that if the voting replicas don’t witness the same ordering of transactions, a second round (the slow path) is needed so that the coordinator can broadcast the total order (a dependency graph). If we can reduce the chance of replicas witnessing different orders, we increase the chance of achieving a fast-path quorum. The trick used by Accord is to deliberately buffer incoming messages until an estimated bound on clock skew plus network latency has elapsed, as sketched below.

Left: original message arrival time. Right: delayed message arrival time.
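Here is a minimal Java sketch of such a reorder buffer (all names and constants are invented for illustration; Accord’s real implementation is more subtle):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy reorder buffer: instead of processing a message the moment it
// arrives, hold it until (timestamp + estimated skew + latency) has
// passed, then release buffered messages in timestamp order.
public class ReorderBufferSketch {

    record Msg(long timestamp, String payload) {}

    private final long holdMs; // estimated clock skew + network latency
    private final PriorityQueue<Msg> buffer =
            new PriorityQueue<>(Comparator.comparingLong(Msg::timestamp));

    ReorderBufferSketch(long holdMs) { this.holdMs = holdMs; }

    void onReceive(Msg m) { buffer.add(m); }

    // Called periodically; releases every message whose hold time has
    // expired, in timestamp order even if arrival order was different.
    List<Msg> drain(long nowMs) {
        List<Msg> ready = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek().timestamp() + holdMs <= nowMs) {
            ready.add(buffer.poll());
        }
        return ready;
    }

    public static void main(String[] args) {
        ReorderBufferSketch buf = new ReorderBufferSketch(10);
        buf.onReceive(new Msg(105, "t2")); // arrives first...
        buf.onReceive(new Msg(100, "t1")); // ...but has the earlier timestamp
        System.out.println(buf.drain(120)); // releases t1 before t2
    }
}

Note the trade-off: buffering adds a little latency to every message in exchange for a much higher fast-path hit rate, which saves an entire round trip.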

What’s next for Cassandra

Recently (Jun 2022), a query syntax prototype was discussed. The Accord prototype can be found here (in Java, of course). From what I have heard, the new distributed transaction functionality is estimated to be released by the end of 2022.

References

  1. https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
  2. http://charap.co/reading-group-special-session-fast-general-purpose-transactions-in-apache-cassandra/
  3. https://www.youtube.com/watch?v=YAE7E-QEAvk
  4. https://www.synergylabs.org/courses/15-440/lectures/12-replication.pdf
  5. https://dl.acm.org/doi/abs/10.1145/3335772.3335939
  6. https://www.microsoft.com/en-us/research/publication/paxos-made-simple/
  7. https://www.usenix.org/conference/atc14/technical-sessions/presentation/ongaro
  8. https://dl.acm.org/doi/abs/10.1145/2517349.2517350
  9. http://muratbuffalo.blogspot.com/2022/02/efficient-replication-via-timestamp.html
  10. https://www.usenix.org/conference/nsdi21/presentation/tollman

Boxuan Li

Maintainer of JanusGraph, a popular distributed graph database