Distributed Transaction in Database: From EPaxos to Accord
NoSQL databases were popular for years until people realized they still wanted the ACID guarantees of traditional databases. In recent years, NoSQL-style databases with transaction support (Spanner, CockroachDB, TiDB, YugabyteDB, etc.) have emerged quickly. Apache Cassandra is finally joining the party and will add distributed transaction support soon. In this post, we will first introduce distributed replication and distributed transactions to give you some background. Then we will look at how Accord, the protocol behind distributed transactions in Cassandra, evolved from EPaxos and why it is claimed to be the first leaderless protocol that is stable enough for large industrial databases.
Replication vs. Sharding
Before we start, let's recap two terms you often see in distributed systems: replication and sharding. They are different things, but they are related enough that people sometimes confuse them.
Replication means storing the same piece of data on multiple nodes (usually on different physical machines) to ensure high availability. Most, if not all, databases employ some form of replication. In general, replication is challenging (messages can be lost and nodes can fail), and you need a consensus protocol to ensure that different replicas agree on the same value and don't diverge. Depending on the availability and consistency levels they aim for, different databases choose different replication strategies.
Sharding comes into play when your data cannot fit on a single machine, or when you deliberately want to spread your data across different nodes. Sharding, a.k.a. partitioning, is supported by most, if not all, NoSQL databases. Distributed transactions, i.e. transactions across shards, are very challenging.
Usually, a database that supports sharding also supports replication, because when you have more nodes (shards), the probability that at least one of them fails is higher. This means that to run distributed transactions, these databases must take both sharding and replication into account, which is even more challenging.
Intro to Consensus Algorithms
Let's get started with replication. As we just said, consensus algorithms are often used in databases to ensure different replicas see the same value. Note that some of these algorithms are unsafe under certain failures, and some only ensure eventual consistency. Today we are only talking about consensus algorithms that guarantee strong consistency. The term "strong consistency" is overloaded and has different meanings in different contexts. Here, we mean that these algorithms guarantee linearizability for a single key even in the presence of node failures.
There is only one consensus protocol, and that’s Paxos.
The original Paxos paper is very fun and worth reading. It describes an imaginary parliament as a metaphor for distributed systems. The paper rarely even mentions computers, which is why people struggled to understand it for years. Believe it or not, most, if not all, correct consensus protocols can be reduced to Paxos, so it is a foundational paper in this field. Paxos itself is difficult to understand, though, so in this post let's just learn the intuition behind it.
In the Paxos protocol, every replica can accept a client request and propose a value for a single key. It needs to broadcast the "proposal" to other replicas (a simple majority is enough) so that everyone agrees on a single value. Intuitively, this should take only one round-trip, shouldn't it? The tricky part is that different replicas might propose different values at the same time, and it has been proved that with only one round there is no guarantee that all replicas agree on the same value. That is why Paxos needs two round-trips in the normal case: a "prepare" phase to win promises from a majority, and an "accept" phase to actually choose the value. Remember, network messages can be arbitrarily delayed or lost, and nodes can fail too. Paxos stays safe under such failures and makes progress as long as a majority of nodes are alive.
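To make the two phases concrete, here is a toy, single-process sketch of single-decree Paxos. All names (`Acceptor`, `propose`) are invented for this illustration; real implementations exchange these messages over the network and retry with higher ballots on rejection.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest ballot this acceptor promised
        self.accepted = None    # (ballot, value) of the last accepted proposal

    def prepare(self, ballot):
        # Phase 1: promise to ignore any proposal with a lower ballot.
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, ballot, value):
        # Phase 2: accept the value unless a higher ballot was promised.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "reject"

def propose(acceptors, ballot, value):
    # First round-trip: gather promises from a majority.
    promises = [a.prepare(ballot) for a in acceptors]
    grants = [p for p in promises if p[0] == "promise"]
    if len(grants) <= len(acceptors) // 2:
        return None  # no majority; a real proposer retries with a higher ballot
    # Safety rule: if any acceptor already accepted a value, propose
    # the value with the highest ballot instead of our own.
    prior = [p[1] for p in grants if p[1] is not None]
    if prior:
        value = max(prior)[1]
    # Second round-trip: ask the majority to accept.
    votes = [a.accept(ballot, value) for a in acceptors]
    return value if votes.count("accepted") > len(acceptors) // 2 else None
```

Note how the second proposer in a race is forced to adopt the already-chosen value; that is exactly why two round-trips are needed when there is no single leader.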
Multi-Paxos (2001) & Raft (2014)
Two round-trips for a single operation sounds bad, doesn't it? Multi-Paxos (proposed by the author of Paxos in 2001) and Raft (proposed in 2014 by researchers in search of an understandable consensus algorithm) reduce the two round-trips to one, greatly improving the efficiency of reaching consensus. The trick is simple: let one replica be the leader, and require that all client write and read requests be submitted to the leader. Remember that Paxos needs two round-trips because there can be concurrent proposals from different replicas? With a single leader ordering all requests, that conflict disappears.
Multi-Paxos and Raft are leader-based approaches. They successfully reduce the normal case from two round-trips to one, but they have other drawbacks. First, what if the leader fails? The replicas must elect a new leader, and until the election finishes, clients can experience service degradation, which is not good. Second, the leader carries much more load than the other replicas, causing a workload imbalance among nodes. Finally, a less well-known drawback of leader-based approaches is that it is difficult to build distributed (cross-shard) transactions on top of them. We will come back to this point later.
EPaxos (2013)
Similar to Multi-Paxos and Raft, EPaxos also aims to avoid the costly two round-trips and use only one round-trip when possible. Unlike the leader-based protocols, however, EPaxos is leaderless. How come? Didn't we say that concurrent proposals from multiple proposers (a replica that starts a proposal is called a proposer) cause races? Well, think about it: why exactly does concurrency cause conflicts? Because under concurrency, different replicas might receive requests in different orders. But does the order always matter?
It turns out that ordering does not matter when concurrent requests do not conflict (i.e., they do not write to the same key). This is one of the key ideas behind EPaxos! When there is no conflict, only one round-trip is needed to reach consensus; this is called the "fast path". Note that in ordinary Paxos you only need a simple majority of votes to reach consensus, but here you need a super majority (we will discuss this super majority later; it is usually close to 3/4 of the replicas). When there are conflicts, on the other hand, and some replicas participating in the fast path see different dependencies, a second round is needed; this is called the "slow path". In the slow path, the coordinator merges the dependencies observed by the replicas and broadcasts the unified dependency graph, so that all replicas execute the operations in the same deterministic order.
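The fast-path/slow-path decision can be sketched in a few lines. This is an illustrative simplification (the names `record` and `coordinate` are invented, and real EPaxos runs a second commit round on the slow path); the point is that the coordinator compares the dependency sets reported by the replicas.

```python
def record(replica_log, cmd_id, key):
    # Each replica's dependencies for a new command are the earlier
    # commands it has seen on the same key. Log the command afterwards.
    deps = set(replica_log.get(key, []))
    replica_log.setdefault(key, []).append(cmd_id)
    return deps

def coordinate(replica_logs, cmd_id, key):
    # Send a pre-accept to every replica and compare reported dependencies.
    reported = [record(log, cmd_id, key) for log in replica_logs]
    if all(d == reported[0] for d in reported):
        # Everyone saw the same history: commit in one round-trip.
        return ("fast path", reported[0])
    # Disagreement: union the dependencies; a second round would then
    # broadcast this merged set so all replicas execute in the same order.
    return ("slow path", set().union(*reported))
```

A non-conflicting command takes the fast path; a conflicting command that only some replicas have seen forces the slow path with the merged dependency set.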
In leader-based approaches, when the leader fails, the service may stall for some time (possibly a few seconds). Leaderless approaches like EPaxos don't suffer from this problem. As shown in the benchmark below, EPaxos maintains almost the same throughput even when a node fails.
There is also another less discussed benefit of leaderless approaches: they make it easier to build distributed transactions. We will discuss that in the intro to Accord section.
Intro to Distributed Transaction
Finally, we come to distributed transactions. Wait... why did we spend so much time discussing distributed replication? Because modern distributed databases have both replication and sharding, and you must take distributed replication into account when designing distributed transaction algorithms!
Let's say you have shards A, B, C, ..., and each shard has replicas A1, A2, A3, ..., B1, B2, B3, ... We already learned that distributed replication requires coordination so that all replicas of a given shard reach consensus (agree on a single value). A distributed transaction is a transaction that spans multiple shards, which also requires coordination. For example, if Alice transfers 100 dollars to Bob, then shard A, which stores Alice's account, and shard B, which stores Bob's account, should both succeed or both fail. Moreover, if there are conflicting transactions (transactions with overlapping read or write keys), we need to ensure those transactions are executed in the same order across shards. Most, if not all, databases build distributed transactions (if any) on top of a consensus (replication) layer. That is, "replicas" are encapsulated and transparent to the transaction algorithms. This makes building distributed transaction algorithms easier, but it is also costly, as you pay for two separate kinds of coordination (the replication layer and the transaction layer).
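The notion of "conflicting transactions" used above can be pinned down with a small helper. This is a toy predicate under the usual definition (two transactions conflict when one's write set overlaps the other's read or write set); the account names are made up for the example.

```python
def conflicts(t1, t2):
    # Each transaction is a (read_set, write_set) pair of keys.
    reads1, writes1 = t1
    reads2, writes2 = t2
    # Write-write or read-write overlap in either direction is a conflict;
    # two reads of the same key never conflict.
    return bool(writes1 & (reads2 | writes2)) or bool(writes2 & reads1)

transfer  = ({"alice", "bob"}, {"alice", "bob"})  # reads and writes both accounts
audit     = ({"alice"}, set())                    # read-only on Alice
unrelated = ({"carol"}, {"carol"})                # touches a different key
```

Conflicting transactions (like `transfer` and `audit` here) are exactly the ones that must be ordered the same way on every shard they touch.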
Now let's outline the approaches of a few popular databases that support distributed transactions. Please note that this is only to the best of my knowledge, and databases in general evolve quickly (who would have anticipated ACID support in Cassandra?), so take the following summary with a grain of salt.
FaunaDB and FoundationDB use a single leader to order conflicting transactions. This is great because it is (relatively) easy to implement and straightforward to understand, but it is impractical for super large clusters with hundreds of nodes.
TiDB uses a global timestamp oracle to assign and distribute timestamps to transactions. Again, this timestamp oracle makes horizontal scaling and geo-replication hard, if not impossible.
Many other databases use multiple leaders, but the biggest challenge is how to ensure a consistent order for concurrent transactions that span different shards. A natural question to ask is: why not just use wall clocks to determine the order of transactions? The answer is that wall clocks are inaccurate! Your laptop might have a few milliseconds or even seconds of time difference compared to someone else’s laptop. In a highly concurrent environment, even a clock drift of a few milliseconds could cause serious inconsistencies.
CockroachDB uses hybrid logical clocks (a logical clock combined with a wall clock) to determine order. This works well, except that it cannot provide "strict serializability", sometimes also known as "linearizability". Consequently, your queries can "time travel", in the sense that a read request does not necessarily return the latest committed value.
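For the curious, here is a minimal hybrid-logical-clock sketch. It is illustrative only (a production HLC, as in CockroachDB, also bounds drift and packs the counter into the timestamp); the class and method names are invented for this example.

```python
class HLC:
    def __init__(self):
        self.wall = 0      # highest physical timestamp seen so far
        self.logical = 0   # tie-breaker counter for equal wall times

    def now(self, physical):
        # Local event: move forward past both physical time and our last stamp.
        if physical > self.wall:
            self.wall, self.logical = physical, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def update(self, physical, remote):
        # On message receipt: merge the sender's timestamp so that the
        # receiver's next timestamp is causally after the sender's.
        rw, rl = remote
        m = max(physical, self.wall, rw)
        if m == self.wall == rw:
            self.logical = max(self.logical, rl) + 1
        elif m == self.wall:
            self.logical += 1
        elif m == rw:
            self.logical = rl + 1
        else:
            self.logical = 0
        self.wall = m
        return (self.wall, self.logical)
```

Even if the receiver's physical clock lags behind the sender's, the merged timestamp still orders the receive event after the send event, which is the whole point of the hybrid scheme.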
Google's Spanner is a special one. It also uses physical time, but it does not suffer from the clock drift problem. Why? The key is special hardware: GPS receivers and atomic clocks keep the clock uncertainty very small. The uncertainty is not entirely negligible, but it is small enough that software techniques (such as waiting out the uncertainty interval before committing) can overcome it. Of course, this is expensive, and it seems only Google has this offering.
Intro to Accord
Accord uses a completely different approach. Recall that in leaderless replication protocols (EPaxos and its variants), conflicting operations (writes to the same key) must be ordered. Also recall that in distributed transactions, conflicting transactions (reads/writes overlapping) must be ordered. Do they sound similar? This is indeed the motivation of Accord: since both replication and transactions have the same requirements, can we use a single protocol to achieve both?
The answer is yes. Accord is a protocol that achieves both distributed replication and distributed transactions. The authors claim that Accord is the first leaderless protocol that is stable for large industrial databases, and the first protocol that offers strict serializable transactions across regions in a single wide-area round-trip.
Essentially, Accord is a variant of EPaxos, with engineering optimizations that increase the chance of fast-path success. Also note that EPaxos is only a consensus protocol for replication, while Accord combines distributed replication and distributed transactions. I strongly recommend reading the Accord white paper, but here is a summary of two important improvements Accord makes over EPaxos.
Flexible Fast-path Electorates
You may remember we said earlier that the fast-path quorum requires a super majority rather than a simple majority. Why is that the case? Here is an intuitive explanation:
In the 5-replica set shown in the picture above, the fast-path quorum must be at least 4. This is because, under a 2-node failure (black nodes on the right), there should still be at least 1 surviving node (purple node on the right) that is part of both fast-path1 and fast-path2, so that a recovering coordinator can learn about both. If a fast path only required 3 nodes, then under a 2-node failure there is a chance that no surviving node knows about both fast-path1 and fast-path2.
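You can brute-force this intuition for 5 replicas and up to 2 failures: a quorum size is "safe" only if every pair of fast paths still shares a surviving node after any 2 nodes fail. This enumeration is just a sanity check of the argument above, not part of any real protocol.

```python
from itertools import combinations

def safe(n, quorum_size, max_failures):
    # True iff any two fast-path quorums of this size always share at
    # least one node that survives any max_failures-node failure.
    nodes = set(range(n))
    for q1 in combinations(nodes, quorum_size):
        for q2 in combinations(nodes, quorum_size):
            for failed in combinations(nodes, max_failures):
                if not (set(q1) & set(q2)) - set(failed):
                    return False  # both fast paths could be forgotten
    return True
```

Running it confirms the picture: with 5 replicas, size-4 fast quorums are always recoverable under 2 failures, while size-3 (a simple majority) fast quorums are not.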
A super majority is a costly requirement because it raises the chance of falling back to the slow path. In the example above, a failure of any two nodes makes every fast path impossible. Accord, in contrast, proposes a concept called flexible fast-path electorates. The idea is: what if we exclude 2 nodes from the fast-path electorate?
We can dynamically adjust the fast-path electorate based on estimates of network latency and node conditions. Say we exclude the two bottom (grey) nodes from the fast-path electorate, as shown in the graph above. Now only 3 nodes are needed for the fast path. Even if two nodes fail (black nodes on the right), we still have one overlapping node (purple node on the right) to recover both fast paths. In the best case, even under a two-node failure (grey nodes), we can still take the fast path. This makes Accord more efficient than EPaxos, because it has a lower chance of falling back to the slow path (two round-trips) under node failure.
Message Reorder Buffer
When we discussed EPaxos, we said that if the voting replicas don't witness the same ordering of transactions, a second round (the slow path) is needed so that the coordinator can broadcast a total ordering (the dependency graph). If we can reduce the chance of different replicas witnessing different orderings, then there is a higher chance of achieving a fast-path quorum. The trick Accord uses is to deliberately buffer messages until an estimated skew-plus-latency window expires.
For example, as shown in the graph above, if node A and node B each start a transaction at T0, then nodes C and D will see the events in different orders, because C is closer to A than to B, while D is closer to B than to A. In Accord, A can estimate the latest time at which any replica will receive its message, which is T20 in this example. A then tells all replicas to buffer its message until T20. B does the same and tells replicas to buffer its message until T10. This way, both C and D observe the message from B before the message from A. This buffering is done via a message reorder buffer. The idea comes from the paper "EPaxos Revisited", but Accord adds some engineering optimizations, which we will not discuss in detail here.
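The mechanism above can be sketched as a tiny buffer that holds each message until its sender-provided deadline, then releases messages in timestamp order. This is a toy model with integer times; the class name and API are invented for illustration.

```python
import heapq

class ReorderBuffer:
    def __init__(self):
        self.heap = []  # min-heap of (timestamp, deadline, payload)

    def receive(self, timestamp, deadline, payload):
        # The sender estimated the latest arrival time and attached it
        # as the deadline ("buffer me until then").
        heapq.heappush(self.heap, (timestamp, deadline, payload))

    def deliver(self, now):
        # Release messages in timestamp order, but only once the earliest
        # pending message's deadline has passed. Every replica that waits
        # out the deadlines delivers messages in the same order.
        out = []
        while self.heap and self.heap[0][1] <= now:
            out.append(heapq.heappop(self.heap)[2])
        return out
```

A replica that receives A's message first but B's message with an earlier timestamp will, after the deadlines expire, still deliver B before A, which is exactly what makes C and D agree in the example above.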
What’s next for Cassandra
Recently (Jun 2022), a query syntax prototype has been discussed. The Accord prototype can be found here (in Java, of course). From what I have heard, the new distributed transaction functionality is estimated to be released by the end of 2022.
Thanks for your patience in reading such a long post! I hope it sheds some light on the field of distributed transactions. If you are interested in more details about the Accord protocol, check out this Cassandra Enhancement Proposal.