Understanding the CAP Theorem

Koushith

I've been organizing my thoughts on system design, specifically the CAP Theorem. It's a concept that comes up constantly in engineering discussions. I wanted to turn my handwritten notes into a reference that cuts through the noise and focuses on the actual trade-offs we face when building distributed systems.

Here is the breakdown straight from my notebook.

My notes on CAP theoremClick to enlarge

The CAP Theorem states that in a distributed system, you can have only 2 out of 3 of the following properties:

  1. Consistency: All nodes see the same data at the same time. When a write is made to one node, all subsequent reads from any node will return that updated value.
  2. Availability: Every request to a non-failing node receives a response, without a guarantee that it contains the most recent data.
  3. Partition Tolerance: The system continues to operate despite arbitrary message loss or failure of a part of a system (i.e., network partitions between nodes). Even if the network is down, the system continues to work.

In Short: In any distributed system, partition tolerance is a must. Network failure will happen, and your system needs to handle them. Since Partition Tolerance is non-negotiable in distributed systems, the CAP Theorem really boils down to a single choice:

Do you prioritize Consistency or Availability when a network partition occurs?

Note: Sometimes showing stale data is better than showing no data at all.

When to Choose Consistency?

Some systems absolutely require consistency, even at the cost of availability.

  • Ticket Booking (e.g., BookMyShow): Imagine User A is booking a seat. Due to a network issue, User B selects the same seat. If the system is available but inconsistent, you now have 2 people showing up for the same seat.
  • E-commerce Inventory: You can't show the wrong stock numbers.
  • Financial Systems: Stock trading platforms need to show up-to-date order book entries.

When to Choose Availability?

In the majority of systems, we can tolerate some inconsistency and should prioritize availability.

  • Social Media: If User A updates their profile picture, it is perfectly fine if User B sees the old picture for some time.
  • Netflix: If a description is updated, it doesn't have to be updated instantly everywhere.
  • Review Sites: Again, slight delays in data propagation are acceptable.

Ask yourself: Would it be catastrophic if the users briefly saw inconsistent data?

  • YES → Choose Consistency
  • NO → Choose Availability

Advanced Considerations

As systems grow, the choice between both isn't always binary. You can have a Hybrid Approach.

  • BookMyShow: Seat selection needs Strong Consistency, but viewing event descriptions or general details can prioritize Availability (showing outdated data is acceptable).
  • Tinder: Matching needs Consistency (if both users swipe right at the same time, they need to match). However, viewing a user's profile is slightly acceptable if it's outdated (e.g., a user just uploaded a new image).

If you prioritize consistency

  1. **Distributed Transactions

Ensuring multiple data nodes store data (Cache & DB) and remain in sync adds complexity. It guarantees consistency across all nodes, but this means users will likely experience higher latency as the system ensures data is consistent across all nodes.

  1. **Single Node Solutions

Using a single DB instance avoids propagation issues. While this limits scalability, it eliminates consistency challenges by having a single source of truth (e.g., PostgreSQL, SQL, Dynamo, Google Spanner).

If You Prioritize Availability

If your design prioritizes availability, your design can include:

  1. Multiple Replicas: Scaling to additional read replicas with asynchronous replication allows reads to be served from any replica, even if it is slightly behind. This improves read performance and availability at the cost of potential staleness.

    • Tech: Redis Clusters, DynamoDB, Cassandra.
  2. CDC (Change Data Capture): Using CDC to track changes in the primary DB and synchronize data asynchronously. Changes are streamed to a target system like a Data Warehouse or other application.

Different Levels of Consistency

It's not just "consistent" or "not consistent." There are levels:

  • Strong Consistency: All reads reflect the most recent writes. This is the most expensive consistency model in terms of performance. (Used in Banks, Shopping Inventory).
  • Causal Consistency: Related events will appear in the same order to all users. If Process A communicates a value to Process B, any subsequent actions by Process B will happen after Process A's action. (Used in social media comment threads and collaborative applications).
  • Read-Your-Own-Writes Consistency: A user always sees their own updates first, though other users might see older versions. (e.g., Profile updates, user might click and update again and again if they don't see their change).
  • Eventual Consistency: Used for faster, "optimistic" writes. Updates will propagate eventually. The system will become consistent over time but may temporarily have inconsistencies. (Used in social media posts, google reviews, etc.).

TL;DR

Does every read need to read the most recent write?

  • Yes → Prioritize Consistency.
  • No → Prioritize Availability.

Used LLMs to correct grammar and typos.