Objectives of Dynamo

  • dealing with failures in an infrastructure composed of millions of components
  • a data store that needs only primary-key access
    • use a key-value store, since an RDBMS would limit scale and availability

Techniques used

  • data partitioned and replicated using consistent hashing
  • object versioning to maintain consistency
  • quorum-like technique and synchronization protocol to maintain consistency among replicas

§2.1 System Assumptions & Requirements

  • Query: Key → Blob (blob size < 1 MB)
  • Give up ACID guarantees, since full ACID compliance comes at the cost of availability
    • no isolation as in ACID; only single-key updates are allowed
  • Use the 99.9th percentile of latency as the efficiency metric and the basis of SLAs
  • Assumed to run in an internal, non-hostile environment

§2.3 Design considerations

Synchronous replica coordination gives strong consistency, but trades away availability under failure scenarios

  • Dynamo maintains only eventual consistency

Update conflict resolution at write vs at read

  • traditional: resolve at write time, but writes are rejected if the store cannot reach all (or a majority of) the replicas
  • Dynamo: conflict resolution at read time, so that writes are never rejected

Conflict resolution scheme

  • at the data store: only simple policies such as “last write wins” are possible
  • at the application: allows an app-specific merge algorithm (e.g. merging divergent shopping carts)

Cluster design:

  • incremental scale out
  • symmetry: all nodes have same responsibilities
  • decentralization

§3 Related work

  • a P2P network that keeps enough routing information at each peer can achieve O(1) routing
  • designed to be robust against network partitions
  • does not focus on data integrity or security; targets an “always writable” data store

§4.1 System interface

get(key): returns a single object, or a list of objects with conflicting versions, together with a context

put(key, context, object): writes the object’s replicas to disk; the context carries system metadata such as the object’s version
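
A minimal sketch of how this interface might look to a client (the class and type names here are illustrative assumptions, not Dynamo’s actual API):

  # Sketch of the client-facing interface; names and types are illustrative.
  from dataclasses import dataclass
  from typing import List

  @dataclass
  class VersionedObject:
      value: bytes    # opaque blob, expected to be under 1 MB
      context: dict   # system metadata, e.g. the object's version information

  class DynamoClient:
      def get(self, key: bytes) -> List[VersionedObject]:
          """Return one object, or several conflicting versions, each with its context."""
          raise NotImplementedError

      def put(self, key: bytes, context: dict, value: bytes) -> None:
          """Write replicas of value; context carries the version being superseded."""
          raise NotImplementedError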

§4.2 Partitioning

Partition of data over many storage nodes: use consistent hashing

  • ring of hash values, and “virtual nodes” placed on the ring
  • virtual nodes assigned a random value in the ring
  • coordinator node responsible for the range between it and its predecessor
  • each range is replicated to next \(N-1\) nodes in the ring in clockwise direction
  • each physical Dynamo host runs multiple virtual nodes

A physical node hosts multiple virtual nodes, so load is distributed uniformly even when node positions on the ring are not evenly spread

Always \(N\) replicas for each data object: for high availability

§4.3 Replication

Every key has \(N\) replicas in the system. The coordinator node replicates a key to its \(N-1\) clockwise successor nodes on the ring (see the sketch below).
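
A rough sketch of the ring with virtual nodes and the resulting preference list (MD5 hashing of keys is per the paper; the random-token scheme and node names are illustrative, roughly Strategy 1):

  # Sketch: consistent-hashing ring with virtual nodes and an N-successor preference list.
  import bisect, hashlib, random

  def h(data: bytes) -> int:
      return int.from_bytes(hashlib.md5(data).digest(), "big")

  class Ring:
      def __init__(self, tokens_per_node=8):
          self.tokens = []          # sorted virtual-node positions on the ring
          self.owner = {}           # position -> physical node name
          self.tokens_per_node = tokens_per_node

      def add_node(self, node: str):
          for _ in range(self.tokens_per_node):
              pos = random.getrandbits(128)     # random token on the 128-bit ring
              self.owner[pos] = node
              bisect.insort(self.tokens, pos)

      def preference_list(self, key: bytes, n: int):
          """Coordinator plus clockwise successors, skipping repeated physical nodes."""
          start = bisect.bisect_right(self.tokens, h(key))
          nodes = []
          for i in range(len(self.tokens)):
              node = self.owner[self.tokens[(start + i) % len(self.tokens)]]
              if node not in nodes:
                  nodes.append(node)
              if len(nodes) == n:
                  break
          return nodes

  ring = Ring()
  for name in ["A", "B", "C", "D"]:
      ring.add_node(name)
  print(ring.preference_list(b"cart:12345", n=3))   # e.g. ['B', 'D', 'A'], depends on tokens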

§4.4 Data versioning

  • for eventual consistency
  • example: shopping cart is data, add to cart or remove from cart are put requests that create new version of the data
  • divergent versions are reconciled later
    • syntactic reconciliation: system knows one version is older than another
    • semantic reconciliation: only client knows how to collapse multiple branches of data evolution into one
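
The paper tracks versions with vector clocks; a minimal sketch of syntactic reconciliation under that assumption (the shopping-cart data is illustrative):

  # Sketch: vector clocks, one (node, counter) entry per coordinating node.
  def descends(a: dict, b: dict) -> bool:
      """True if clock a is equal to or a descendant of clock b."""
      return all(a.get(node, 0) >= counter for node, counter in b.items())

  def syntactic_reconcile(versions):
      """Drop any version whose clock is strictly dominated by another version's clock.
      If more than one version survives, only the application can merge them
      (semantic reconciliation), e.g. unioning two divergent shopping carts."""
      clocks = [clock for _, clock in versions]
      return [(v, c) for v, c in versions
              if not any(descends(other, c) and other != c for other in clocks)]

  v1 = (["book"],        {"A": 1})            # original cart
  v2 = (["book", "pen"], {"A": 2})            # add-to-cart coordinated by node A
  v3 = (["book", "mug"], {"A": 1, "B": 1})    # concurrent add coordinated by node B
  print(syntactic_reconcile([v1, v2, v3]))    # v1 dropped; v2 and v3 both survive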

§4.5 Get and put operations

Quorum system:

  • get or put requests are sent to the first \(N\) reachable nodes in the preference list; the coordinator waits for \(W\) or \(R\) responses before returning to the client (see the sketch after this list)
  • a successful read needs responses from \(R\) nodes
  • a successful write needs responses from \(W\) nodes
  • \(R+W > N\), typically \(R=W=2\), \(N=3\)
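
A sketch of the quorum read/write path under these parameters, using an in-memory dict per node as a stand-in for real storage and ignoring timeouts:

  # Sketch: quorum-style read/write. Each node's store is a plain dict; an
  # unreachable node is simply absent from `stores`.
  N, R, W = 3, 2, 2                      # typical Dynamo setting, R + W > N

  def quorum_put(stores, preference_list, key, value):
      acks = 0
      for node in preference_list[:N]:
          if node in stores:             # reachable: write the replica locally
              stores[node][key] = value
              acks += 1
      if acks < W:
          raise RuntimeError("write failed: fewer than W replicas acknowledged")

  def quorum_get(stores, preference_list, key):
      replies = [stores[node][key] for node in preference_list[:N]
                 if node in stores and key in stores[node]]
      if len(replies) < R:
          raise RuntimeError("read failed: fewer than R replicas responded")
      return set(replies)                # may contain conflicting versions

  stores = {"A": {}, "B": {}, "C": {}}
  quorum_put(stores, ["A", "B", "C"], "k", "v1")
  del stores["C"]                                   # C becomes unreachable
  print(quorum_get(stores, ["A", "B", "C"], "k"))   # {'v1'}, served by A and B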

§4.6 Hinted hand-off

Always maintain \(N\) replicas, placed on the first \(N\) healthy nodes from the key’s preference list

Example: the key of an object should go to node A, with replicas on B and C. If node A is unreachable, a replica is sent to D instead. That replica carries a hint in its metadata that A is the intended coordinator, so when A is restored, D forwards the object back to A and can then delete its local copy.
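
A sketch of this hinted-handoff bookkeeping (the node names, the tuple layout, and the recovery hook are illustrative assumptions):

  # Sketch: hinted handoff. If an intended owner is down, the replica goes to the next
  # healthy node with a hint naming the intended owner; on recovery it is forwarded back.
  def put_with_handoff(stores, ring_order, key, value, n=3):
      intended = ring_order[:n]                        # first N nodes of the preference list
      spares = iter(ring_order[n:])                    # fallback nodes further round the ring
      for owner in intended:
          if owner in stores:                          # healthy: normal replica, no hint
              stores[owner][key] = (value, None)
          else:                                        # down: hinted replica on a stand-in
              stand_in = next(s for s in spares if s in stores)
              stores[stand_in][key] = (value, owner)   # hint = intended coordinator

  def handoff_on_recovery(stores, recovered):
      """When the intended owner comes back, stand-ins forward hinted replicas to it."""
      for node, store in stores.items():
          for key, (value, hint) in list(store.items()):
              if hint == recovered:
                  stores[recovered][key] = (value, None)
                  del store[key]                       # stand-in drops its copy

  stores = {"B": {}, "C": {}, "D": {}}                 # A is unreachable
  put_with_handoff(stores, ["A", "B", "C", "D"], "k", "v", n=3)
  stores["A"] = {}                                     # A recovers
  handoff_on_recovery(stores, "A")
  print(sorted(n for n, s in stores.items() if "k" in s))   # ['A', 'B', 'C']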

§4.7 Replica synchronization

Merkle tree: Hash tree such that

  • leaves are hashes of values of individual keys
  • parent nodes = hashes of their children

Each Dynamo node maintains a separate Merkle tree for each key range it hosts; two replicas exchange and traverse their trees to find differing keys and synchronize only those (see the sketch below).
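
A minimal Merkle-tree sketch for one key range (a simple binary hash tree; the paper does not spell out the exact tree layout, so this is only an assumption):

  # Sketch: Merkle tree over a key range. Leaves hash individual values; parents hash
  # the concatenation of their children, so equal roots mean the ranges are in sync.
  import hashlib

  def md5(b: bytes) -> bytes:
      return hashlib.md5(b).digest()

  def build(items):
      """items: sorted list of (key, value) pairs in this key range."""
      level = [md5(k + b"\x00" + v) for k, v in items] or [md5(b"")]
      tree = [level]
      while len(level) > 1:
          level = [md5(level[i] + (level[i + 1] if i + 1 < len(level) else b""))
                   for i in range(0, len(level), 2)]
          tree.append(level)
      return tree                              # tree[-1][0] is the root hash

  a = build([(b"k1", b"v1"), (b"k2", b"v2")])
  b = build([(b"k1", b"v1"), (b"k2", b"v2-stale")])
  print(a[-1][0] == b[-1][0])                  # False: roots differ, so the replicas
                                               # walk down to find the differing leaves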

§4.8.1 Ring membership

Explicit mechanism to add or remove a node from a Dynamo ring.

A persistent store holds the membership. A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership.

  • each node periodically contacts a randomly chosen peer, and the two reconcile their persisted membership-change histories
  • partitioning and placement information is also propagated via the gossip protocol
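
A sketch of that gossip reconciliation, assuming membership histories are summarized as a per-peer change counter (a simplification of the paper’s history reconciliation):

  # Sketch: gossip of membership changes. Each node's view maps peer -> (change counter,
  # status); on contact, two nodes merge views by keeping the entry with the higher
  # counter, so all views converge to the same membership.
  import random

  def gossip_round(views, nodes):
      a, b = random.sample(nodes, 2)                 # a contacts a random peer b
      merged = dict(views[a])
      for peer, entry in views[b].items():
          if peer not in merged or entry[0] > merged[peer][0]:
              merged[peer] = entry
      views[a], views[b] = dict(merged), dict(merged)

  nodes = ["A", "B", "C"]
  views = {n: {m: (1, "joined") for m in nodes} for n in nodes}
  views["A"]["D"] = (2, "joined")                    # A records that D joined the ring
  for _ in range(50):
      gossip_round(views, nodes)
  print(all("D" in view for view in views.values()))   # almost surely True by now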

§4.8.2 External discovery

  • some nodes act as “seed nodes”: every node joining the ring knows them and reconciles its membership with them
  • seed nodes are discovered by external mechanism such as configuration service

§4.8.3 Failure detection

  • failure detection avoids attempts to communicate with unreachable peers
    • local notion of failure: a peer does not respond to messages sent to it
    • decentralized failure detection: failure state is spread over the gossip protocol

Only temporary failures should be detected and learned through the gossip protocol; permanent node additions and removals are done through the explicit membership mechanism to avoid confusion.

§4.9 Add/remove storage nodes

New node into system:

  • assigned a set of tokens scattered randomly on the ring
  • key ranges change, so some existing nodes must transfer parts of their ranges to the new node
  • transfer offer and confirmation messages are exchanged between nodes
    • confirmation avoids duplicate transfers

§5 Implementation

Each Dynamo node has the following components:

  • request coordination (handles get and put)
  • membership & failure detection
  • local persistence engine (non-volatile storage of key-value data)

Local persistence: use MySQL, Berkeley DB, etc.

Request coordination: handling of read/write requests

  • event-driven messaging substrate used
  • communication = Java NIO channels
  • example of read
    1. send read request to nodes
    2. wait for min number of required responses
    3. if too few responses within time bound, fail the request
    4. gather all responses and determine which one to return
    5. syntactic reconciliation if needed and generate write request
  • the coordinator performs “read repair” if stale versions are found among the responses (see the sketch after this list)
  • optimization for read-your-writes: the coordinator for a write is chosen to be the node that replied fastest to the previous read
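
A sketch of the read path with read repair, reusing the vector-clock dominance check from §4.4 (the in-memory node model and parameter names are assumptions):

  # Sketch: coordinator read path with read repair. stores maps node -> {key: (value,
  # clock)}; versions dominated by a newer clock are dropped (cf. the §4.4 sketch).
  def descends(a, b):
      return all(a.get(node, 0) >= counter for node, counter in b.items())

  def coordinate_read(stores, preference_list, key, n=3, r=2):
      replies = {node: stores[node][key]                    # 1. ask the first N nodes
                 for node in preference_list[:n]
                 if node in stores and key in stores[node]}
      if len(replies) < r:                                  # 2-3. need at least R replies
          raise RuntimeError("read failed: fewer than R responses")
      latest = [(v, c) for v, c in replies.values()         # 4. keep non-dominated versions
                if not any(descends(c2, c) and c2 != c for _, c2 in replies.values())]
      if len(latest) == 1:                                  # read repair: push the winning
          for node in preference_list[:n]:                  #    version to stale replicas
              if node in stores and stores[node].get(key) != latest[0]:
                  stores[node][key] = latest[0]
      return latest                                         # 5. may hold several versions

  stores = {"A": {"k": ("v2", {"A": 2})}, "B": {"k": ("v1", {"A": 1})}, "C": {}}
  print(coordinate_read(stores, ["A", "B", "C"], "k"))      # [('v2', {'A': 2})]
  print(stores["B"]["k"])                                   # B repaired to the new version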

§6 Performance

Write latency is longer than read latency because of disk access for persistent storage. The 99.9th percentile latency is around 200ms.

§6.2 Uniform load distribution

Different partitioning schemes for key ranges on the ring were evaluated (see fig 3 of the paper):

Strategy 1: \(T\) random tokens per node, partition by token value

  • when nodes join or leave, key ranges change
  • the space each node needs for membership information is linear in the number of nodes in the system
  • node join = stealing key ranges from other nodes
    • those nodes must scan their persistent stores to hand off the data
    • slow bootstrapping, and Merkle trees must be recalculated on many nodes

Strategy 2: \(T\) random tokens per node, equal-sized partitions

  • hash space divided into \(Q\) equal-sized partitions
  • each node is assigned \(T\) tokens; there are \(S\) nodes in the system
  • \(Q \gg N\) and \(Q \gg ST\); \(Q\) is a fixed constant

Strategy 3: \(Q/S\) tokens per node, equal-sized partitions

  • each node has \(Q/S\) tokens
  • node leaves the system = redistribute its tokens to remaining nodes
  • node joins the system = steal tokens from other nodes to maintain \(Q/S\) tokens per node

Strategy 3 has the best load-balancing efficiency

  • fast to bootstrap
  • easy to archive

Because the partition boundaries are fixed, a node can take on or give up responsibility for a partition simply by transferring the partition’s file, and the Merkle tree for that partition does not need to be recalculated.
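
A sketch of Strategy 3’s bookkeeping: \(Q\) fixed equal-sized partitions handed out so that each of the \(S\) nodes holds \(Q/S\) of them, with joins moving whole partitions (the rebalancing policy here is illustrative, not the paper’s):

  # Sketch: Strategy 3. The hash space is split into Q fixed, equal-sized partitions;
  # each of the S nodes owns Q/S of them, and a join or leave moves whole partitions
  # (i.e. whole files) between nodes, so per-partition Merkle trees stay valid.
  Q = 40                                               # fixed number of partitions

  def rebalance(assignment, nodes):
      """Hand whole partitions to under-loaded nodes until each owns about Q/S."""
      target = Q // len(nodes)
      owned = {n: [p for p, o in assignment.items() if o == n] for n in nodes}
      surplus = [p for n in nodes for p in owned[n][target:]]   # partitions to give away
      for n in nodes:
          while len(owned[n]) < target and surplus:
              p = surplus.pop()
              assignment[p] = n
              owned[n].append(p)
      return assignment

  nodes = ["A", "B", "C", "D"]
  assignment = {p: nodes[p % len(nodes)] for p in range(Q)}     # Q/S = 10 partitions each
  nodes.append("E")                                             # a new node joins
  rebalance(assignment, nodes)
  counts = {n: sum(1 for o in assignment.values() if o == n) for n in nodes}
  print(counts)                                  # {'A': 8, 'B': 8, 'C': 8, 'D': 8, 'E': 8}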

§6.3 Divergent versions

Divergent versions due to (1) node failure and (2) concurrent write

§6.4 Coordination

Request coordination component: client vs server

Server-side: the client sends requests to any node on the ring

  • requests are routed through a load balancer
  • any Dynamo node can coordinate a read; a write is coordinated by a node in the key’s preference list

Client-side: a library in the application implements the request coordination logic

  • the client periodically syncs membership state with a random Dynamo node
  • lower latency, since the extra hop through the load balancer is avoided
  • efficiency depends on how fresh the membership information at the client is

References

Understanding the architecture of Cassandra: some authors of Dynamo later built Cassandra, and the two systems share several architectural strategies

Bibliographic data

@inproceedings{DeCandia2007,
   title = "Dynamo: Amazon's Highly Available Key-value Store",
   author = "Giuseppe DeCandia and Deniz Hastorun and Madan Jampani and Gunavardhan Kakulapati and Avinash Lakshman and Alex Pilchin and Swaminathan Sivasubramanian and Peter Vosshall and Werner Vogels",
   booktitle = "Proc. SOSP'07",
   pages = "205--220",
   address = "Stevenson, WA",
   month = "October",
   year = "2007",
}