Distributed Systems
Distributed systems are the foundation of modern scalable architectures. Understanding these concepts is essential for designing systems that handle failures gracefully while maintaining performance at scale.
CAP Theorem
The CAP theorem states that a distributed system can only guarantee two of three properties simultaneously:
                ┌─────────────────┐
                │   Consistency   │
                │ (All nodes see  │
                │   same data)    │
                └────────┬────────┘
                         │
          ┌──────────────┼──────────────┐
          │              │              │
          ▼              │              ▼
┌─────────────────┐      │      ┌─────────────────┐
│  Availability   │◄─────┴─────►│    Partition    │
│ (Every request  │             │    Tolerance    │
│ gets response)  │             │ (Works despite  │
└─────────────────┘             │ network splits) │
                                └─────────────────┘
CAP Trade-offs
| Choice | Sacrifices | Use Case | Examples |
|---|---|---|---|
| CP | Availability | Financial systems, inventory | MongoDB, HBase, ZooKeeper |
| AP | Consistency | Social feeds, caching | Cassandra, DynamoDB, CouchDB |
| CA | Partition Tolerance | Single-node systems | Traditional RDBMS (theoretical) |
Important Clarification
- Partition tolerance is not optional in distributed systems—network partitions will happen
- The real choice is between consistency and availability during a partition
- Most systems are CP or AP, not purely one or the other
PACELC Theorem
PACELC extends CAP by addressing behavior when there’s no partition:
If (Partition) then (Availability vs Consistency)
Else (Latency vs Consistency)| System | During Partition | Normal Operation |
|---|---|---|
| DynamoDB | AP | EL (Eventual, Low latency) |
| MongoDB | CP | EC (Consistent, higher latency) |
| Cassandra | AP | EL (tunable) |
| Spanner | CP | EC (TrueTime) |
Consistency Models
Strong Consistency
Every read returns the most recent write. All nodes see the same data at the same time.
Client A writes X=5 ─────► All nodes: X=5
Client B reads X ─────► Returns 5 (guaranteed)

Pros: Simple to reason about, no stale reads
Cons: Higher latency, reduced availability during partitions
Eventual Consistency
Given enough time without updates, all replicas converge to the same value.
Client A writes X=5 ─────► Node 1: X=5
Node 2: X=3 (stale)
Node 3: X=5
... time passes ...
All nodes: X=5

Pros: High availability, low latency
Cons: Stale reads possible, complex conflict resolution
Causal Consistency
Operations that are causally related are seen in the same order by all nodes.
User A posts: "Hello" (happens first)
User B replies: "Hi there" (causally dependent)
All nodes must show A's post before B's reply
But independent posts can appear in any order
Read Your Writes
A client always sees their own writes, even if other clients don’t yet.
Monotonic Reads
Once a client reads a value, subsequent reads won’t return older values.
Consensus Algorithms
Consensus ensures all nodes agree on a single value, even with failures.
Raft Consensus
Raft is designed for understandability. It breaks consensus into three sub-problems:
1. Leader Election
┌─────────────────────────────────────────────────────────┐
│ RAFT STATES │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ timeout ┌─────────────┐ │
│ │ Follower │──────────────►│ Candidate │ │
│ └──────────┘ └─────────────┘ │
│ ▲ │ │
│ │ │ wins election │
│ │ discovers leader ▼ │
│ │ ┌──────────┐ │
│ └─────────────────────│ Leader │ │
│ └──────────┘ │
└─────────────────────────────────────────────────────────┘
Election Process:
- Follower times out (no heartbeat from leader)
- Becomes Candidate, increments term, votes for self
- Requests votes from other nodes
- Wins with majority votes → becomes Leader
- Sends heartbeats to maintain leadership
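The election steps above can be sketched in Python. This is a minimal illustration, not a full Raft implementation: the `Node` class, the 5-node cluster, and the single-round vote collection are simplifying assumptions.

```python
# Minimal sketch of Raft leader election (illustrative only).
# Each node grants at most one vote per term, and a candidate
# needs a strict majority of the cluster to win.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, candidate_id, candidate_term):
        """Grant a vote only if the candidate's term is newer than ours."""
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.voted_for = candidate_id
            return True
        return False

def run_election(candidate, peers):
    """Candidate increments its term, votes for itself, then asks peers."""
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id
    votes = 1  # its own vote
    for peer in peers:
        if peer.request_vote(candidate.node_id, candidate.current_term):
            votes += 1
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2  # majority wins

nodes = [Node(i) for i in range(5)]
print(run_election(nodes[0], nodes[1:]))  # True: all peers were at term 0
```

A real implementation also needs randomized election timeouts, log-up-to-date checks in `request_vote`, and heartbeats to suppress new elections.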
2. Log Replication
Leader receives write ──► Appends to local log
──► Sends AppendEntries to followers
──► Waits for majority acknowledgment
──► Commits entry
──► Responds to client
3. Safety Properties
- Election Safety: At most one leader per term
- Leader Append-Only: Leader never overwrites/deletes entries
- Log Matching: If two logs have same index/term, all preceding entries match
- State Machine Safety: If a server applies a log entry, no other server applies a different entry for same index
Paxos (Simplified)
Paxos is harder to understand but more flexible. Basic roles:
- Proposers: Propose values
- Acceptors: Accept/reject proposals
- Learners: Learn the chosen value
Two-Phase Protocol:
Phase 1 (Prepare):
Proposer ──► Acceptors: "Prepare(n)"
Acceptors ──► Proposer: "Promise(n)" or reject
Phase 2 (Accept):
Proposer ──► Acceptors: "Accept(n, value)"
Acceptors ──► Proposer: "Accepted" or reject
Value chosen when majority of Acceptors accept
Raft vs Paxos
| Aspect | Raft | Paxos |
|---|---|---|
| Understandability | Designed for clarity | Notoriously complex |
| Leader | Strong leader required | Flexible, leaderless possible |
| Implementation | Easier to implement correctly | Many variants, easy to get wrong |
| Use Cases | etcd, Consul, CockroachDB | Chubby, Spanner (Multi-Paxos) |
Replication Strategies
Single-Leader (Master-Slave)
┌──────────────┐
│ Leader │◄──── All Writes
│ (Primary) │
└──────┬───────┘
│ Replication
▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Follower 1 │ │ Follower 2 │ │ Follower 3 │
│ (Replica) │ │ (Replica) │ │ (Replica) │
└──────────────┘ └──────────────┘ └──────────────┘
▲ ▲ ▲
└─────────────────┴─────────────────┘
Reads (scale out)

Pros: Simple, no write conflicts, easy to reason about
Cons: Single point of failure (mitigated by failover), write bottleneck
Multi-Leader (Master-Master)
┌──────────────┐ ┌──────────────┐
│ Leader 1 │◄───────►│ Leader 2 │
│ (Datacenter │ sync │ (Datacenter │
│ US) │ │ EU) │
└──────────────┘ └──────────────┘
│ │
▼ ▼
US Clients                EU Clients

Pros: Lower latency for geographically distributed users, no single write bottleneck
Cons: Write conflicts possible, complex conflict resolution needed
Leaderless (Quorum-based)
Write to any node ──► Replicate to W nodes
Read from any node ──► Read from R nodes
Consistency guaranteed when: W + R > N
Where:
N = Total replicas
W = Write quorum
R = Read quorum

Example (N=3):
- W=2, R=2: Strong consistency (overlap guaranteed)
- W=1, R=1: Eventual consistency (no overlap)
- W=3, R=1: Write-heavy optimization
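The quorum-overlap rule above reduces to a one-line check. A minimal sketch:

```python
# W + R > N guarantees that every read quorum overlaps every write
# quorum in at least one replica, so reads see the latest write.

def is_strongly_consistent(n, w, r):
    """True if read and write quorums must share at least one replica."""
    return w + r > n

print(is_strongly_consistent(3, 2, 2))  # True: quorums of 2 always overlap
print(is_strongly_consistent(3, 1, 1))  # False: a read may miss the write
print(is_strongly_consistent(3, 3, 1))  # True: every replica has the write
```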
Conflict Resolution:
- Last-Write-Wins (LWW) using timestamps
- Version vectors
- CRDTs (Conflict-free Replicated Data Types)
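As an example of the CRDT approach, here is a sketch of a grow-only counter (G-Counter), one of the simplest CRDTs: each replica increments only its own slot, and merge takes per-slot maxima, so replicas converge regardless of merge order.

```python
# G-Counter sketch: merge is commutative, associative, and idempotent,
# which is what makes conflict-free convergence possible.

class GCounter:
    def __init__(self, replica_id, num_replicas):
        self.replica_id = replica_id
        self.counts = [0] * num_replicas

    def increment(self):
        self.counts[self.replica_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        """Element-wise max of the two vectors."""
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

a = GCounter(0, 2)
b = GCounter(1, 2)
a.increment(); a.increment()   # replica A counts 2
b.increment()                  # replica B counts 1
a.merge(b); b.merge(a)         # exchange state in any order
print(a.value(), b.value())    # 3 3
```

Decrements require a PN-Counter (two G-Counters, one for increments and one for decrements); this sketch covers only growth.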
Partitioning (Sharding)
Hash Partitioning
hash(key) mod N = partition number
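The churn this causes when N changes can be shown with a short sketch (key names are hypothetical; md5 is used only to make the hash deterministic across runs):

```python
import hashlib

# With hash(key) mod N, growing the cluster from 4 to 5 nodes remaps
# most keys -- roughly the 1 - 1/5 = 80% predicted by the math.

def partition(key, n):
    """Deterministic hash so results are stable across runs."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big") % n

keys = [f"user:{i}" for i in range(1000)]
moved = sum(1 for k in keys if partition(k, 4) != partition(k, 5))
print(f"{moved}/1000 keys moved after adding one node")
```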
Problems:
- Adding/removing nodes requires rehashing everything
- Hot spots if hash function not uniform
Consistent Hashing
0°
│
┌────┴────┐
╱ ╲
│ Node A │ ──► Keys 0-90°
│ ● │
│ ● │ ──► Node B: Keys 91-180°
╲ Node B ╱
└────┬────┘
│
180°
│
┌────┴────┐
╱ ╲
│ ● │ ──► Node C: Keys 181-270°
│ Node C │
│ ● │ ──► Node D: Keys 271-359°
╲ Node D ╱
└────┬────┘
│
360°

Benefits:
- Adding node: Only K/N keys need to move (not all)
- Virtual nodes: Better load distribution
- Used by: DynamoDB, Cassandra, Memcached
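A minimal consistent-hash ring with virtual nodes, as a sketch (the class name, vnode count, and md5 hashing are illustrative choices, not any particular system's implementation):

```python
import bisect
import hashlib

# Keys map to the first (virtual) node clockwise from their hash.
# Adding a node steals only the arcs its vnodes land on, so only
# roughly K/N keys move.

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def get_node(self, key):
        # First ring position at or after the key's hash, wrapping at the end.
        idx = bisect.bisect_left(self.ring, (self._hash(key), ""))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing(["A", "B", "C"])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}
ring.add_node("D")
moved = sum(1 for k in keys if ring.get_node(k) != before[k])
print(f"{moved}/1000 keys moved")  # roughly a quarter, not all
```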
Range Partitioning
Partition 1: A-F
Partition 2: G-L
Partition 3: M-R
Partition 4: S-Z

Pros: Range queries efficient, natural ordering
Cons: Hot spots if data not uniformly distributed
Leader Election
Bully Algorithm
1. Node detects leader failure
2. Sends ELECTION to all higher-ID nodes
3. If no response → becomes leader, announces COORDINATOR
4. If response → waits for new leader announcement

Pros: Simple, guarantees highest-ID node wins
Cons: High message overhead, assumes reliable failure detection
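The bully election can be sketched as follows, simplified to a single pass and assuming the set of alive node IDs is known (real deployments must discover liveness via timeouts):

```python
# Bully algorithm sketch: the initiator challenges every higher-ID node;
# if none is alive it wins, otherwise the challenge cascades up until the
# highest alive ID declares itself coordinator.

def bully_election(initiator_id, alive_ids):
    """Return the elected leader's ID among the alive nodes."""
    higher = [n for n in alive_ids if n > initiator_id]
    if not higher:
        return initiator_id  # no higher-ID node responded: we lead
    return bully_election(max(higher), alive_ids)

print(bully_election(2, {1, 2, 4, 5}))  # 5: highest alive ID wins
print(bully_election(5, {1, 2, 5}))     # 5: initiator already highest
```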
Ring Algorithm
1. Node sends ELECTION message around ring
2. Each node adds its ID to message
3. When message returns to originator
4. Highest ID in message becomes leader
5. COORDINATOR message sent around ring
Distributed Transactions
Two-Phase Commit (2PC)
Phase 1: Prepare
Coordinator ──► Participants: "Prepare to commit"
Participants ──► Coordinator: "Ready" or "Abort"
Phase 2: Commit/Abort
If all Ready:
Coordinator ──► Participants: "Commit"
Else:
Coordinator ──► Participants: "Abort"Problems:
- Blocking: Participants hold locks until decision
- Coordinator failure: Participants stuck in uncertain state
- Not partition tolerant
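A minimal 2PC coordinator sketch. Participants are stubbed as in-memory objects; a real implementation must also write the prepare/commit decisions to durable logs so participants can recover after the crashes described above.

```python
# Two-phase commit sketch: commit only if every participant votes "ready".

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        """Phase 1: vote ready or abort."""
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect votes
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: broadcast the decision
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    return decision

print(two_phase_commit([Participant(), Participant()]))       # commit
print(two_phase_commit([Participant(), Participant(False)]))  # abort
```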
Saga Pattern
Break transaction into local transactions with compensating actions:
┌─────────────────────────────────────────────────────────┐
│ ORDER SAGA │
├─────────────────────────────────────────────────────────┤
│ │
│ T1: Create Order ──► T2: Reserve Inventory ──► │
│ │
│ T3: Process Payment ──► T4: Ship Order │
│ │
│ If T3 fails: │
│ C2: Release Inventory ◄── C1: Cancel Order │
│ │
└─────────────────────────────────────────────────────────┘

Saga Execution:
- Choreography: Events trigger next step (decentralized)
- Orchestration: Central coordinator manages steps
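An orchestration-style saga can be sketched as a loop over (action, compensation) pairs; on failure, compensations of completed steps run in reverse order. Step names below are illustrative.

```python
# Saga orchestrator sketch: each local transaction has a compensating
# action that undoes it if a later step fails.

def run_saga(steps):
    """steps: list of (action, compensation) callables."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for comp in reversed(completed):  # undo in reverse order
                comp()
            return "rolled back"
    return "completed"

log = []
def ok(name):
    return lambda: log.append(name)
def fail(name):
    def _f():
        raise RuntimeError(name)
    return _f

steps = [
    (ok("create order"),      ok("cancel order")),
    (ok("reserve inventory"), ok("release inventory")),
    (fail("process payment"), ok("refund payment")),
]
print(run_saga(steps))  # rolled back
print(log)  # ['create order', 'reserve inventory', 'release inventory', 'cancel order']
```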
Clocks and Ordering
Physical Clocks
Wall-clock time, synchronized via NTP. Problem: clock skew (drift between machines).
Logical Clocks (Lamport Timestamps)
Rules:
1. Before event, increment counter
2. On send: attach counter to message
3. On receive: counter = max(local, received) + 1
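The three rules above translate directly into a small sketch:

```python
# Lamport clock sketch: one counter per process, following the
# increment / attach / max-and-increment rules.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Rule 1: increment before a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Rule 2: increment, then attach the counter to the message."""
        return self.tick()

    def receive(self, msg_time):
        """Rule 3: take the max of local and received, then increment."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()              # A: local event, time 1
t = a.send()          # A: send, time 2
b.tick()              # B: independent event, time 1
print(b.receive(t))   # 3: max(1, 2) + 1, so send < receive as required
```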
Guarantees: If A happened-before B, then L(A) < L(B)
Limitation: L(A) < L(B) doesn't mean A happened-before B
Vector Clocks
Node A: [A:1, B:0, C:0] ──► Event at A
Node A: [A:2, B:0, C:0] ──► Another event at A
Node B: [A:0, B:1, C:0] ──► Event at B
A sends to B:
Node B: [A:2, B:2, C:0] ──► Merge and increment
Concurrent events detectable:
[A:2, B:0] vs [A:0, B:2] ──► Neither dominates, concurrent
Failure Handling
Types of Failures
| Type | Description | Example |
|---|---|---|
| Crash | Node stops and doesn’t recover | Process killed |
| Crash-Recovery | Node stops but may recover | Server reboot |
| Omission | Messages lost | Network packet drop |
| Byzantine | Node behaves arbitrarily (malicious) | Compromised server |
Failure Detection
Heartbeat-based:
Every T seconds: Node sends "I'm alive" to monitor
If no heartbeat for K*T seconds: Consider node dead

Trade-offs:
- Short timeout: Fast detection, more false positives
- Long timeout: Fewer false positives, slow detection
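The heartbeat scheme above, sketched with explicit timestamps so the timeout logic can be shown without real waiting (class and parameter names are hypothetical):

```python
# Timeout-based failure detector: a node is suspected dead once no
# heartbeat has arrived within K heartbeat intervals.

class HeartbeatDetector:
    def __init__(self, interval, k=3):
        self.timeout = k * interval    # the K*T rule from above
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record an 'I'm alive' message at timestamp `now` (seconds)."""
        self.last_seen[node] = now

    def is_alive(self, node, now):
        last = self.last_seen.get(node)
        return last is not None and (now - last) <= self.timeout

d = HeartbeatDetector(interval=1.0, k=3)
d.heartbeat("node-1", now=0.0)
print(d.is_alive("node-1", now=2.0))   # True: within the 3-second timeout
print(d.is_alive("node-1", now=5.0))   # False: suspected dead
```

Shrinking `k` detects failures faster at the cost of more false positives, which is exactly the trade-off listed above.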
Phi Accrual Failure Detector:
- Outputs probability of failure (0-1) instead of binary
- Adapts to network conditions
- Used by Cassandra, Akka
Interview Quick Reference
Common Questions
- “Explain CAP theorem”
  - State the three properties
  - Explain partition tolerance is mandatory
  - Give examples of CP vs AP systems
- “How does Raft handle leader failure?”
  - Followers timeout without heartbeat
  - Election process with term increment
  - Majority vote required
- “Design a distributed counter”
  - Discuss consistency requirements
  - Options: Single leader, CRDTs, quorum reads/writes
- “How to handle network partition?”
  - Detect partition
  - Choose availability or consistency
  - Heal when partition ends
Decision Framework
Need strong consistency?
├─ Yes → Single leader or CP system
│ ├─ Global scale? → Spanner, CockroachDB
│ └─ Single region? → PostgreSQL with replicas
└─ No → Can tolerate eventual consistency
├─ Write-heavy? → Cassandra, DynamoDB
└─ Read-heavy? → Redis, Memcached
Numbers to Know
| Metric | Typical Value |
|---|---|
| Network round-trip (same DC) | 0.5-1 ms |
| Network round-trip (cross-region) | 50-150 ms |
| Disk seek | 10 ms (HDD), 0.1 ms (SSD) |
| Raft election timeout | 150-300 ms |
| Heartbeat interval | 50-100 ms |