Sharding
Sharding is a horizontal scaling strategy that distributes data across multiple independent database servers (shards). Each shard holds a subset of the total data and operates as an autonomous database. Unlike partitioning (which splits data within a single instance), sharding splits data across machines, enabling virtually unlimited growth in storage and throughput.
Why Shard
- A single server has hit its CPU, memory, or disk ceiling (vertical scaling limit reached)
- Write throughput exceeds what one machine can handle
- Data must be geographically distributed for latency or compliance (GDPR, data residency)
- Read/write ratio demands more parallelism than replicas alone can provide
Horizontal Sharding
The most common form. Rows are distributed across shards based on a shard key. Each shard has the same schema but holds different rows.
graph TD App[Application] --> Router[Shard Router / Proxy] Router --> S1[(Shard 1<br/>users A-F)] Router --> S2[(Shard 2<br/>users G-M)] Router --> S3[(Shard 3<br/>users N-S)] Router --> S4[(Shard 4<br/>users T-Z)] style S1 fill:#e1f5fe style S2 fill:#e8f5e9 style S3 fill:#fff3e0 style S4 fill:#fce4ec
Shard Key Selection
The shard key is the single most critical design decision. It determines which shard holds each row.
| Requirement | Why It Matters |
|---|---|
| High cardinality | Many distinct values → even distribution across shards |
| Query alignment | Most queries should target a single shard (avoid scatter-gather) |
| Write distribution | Avoid hotspots where one shard gets disproportionate writes |
| Stability | Shard key should rarely change (changing it means moving the row) |
Good shard keys: tenant_id (multi-tenant SaaS), user_id (social apps), region (geo-distributed)
Bad shard keys: created_at (all new writes hit the latest shard), status (low cardinality), auto-increment id (sequential writes create hotspots)
Horizontal Sharding Strategies
Range-Based Sharding
Rows are assigned to shards by key ranges.
graph LR subgraph Shard_1["Shard 1"] R1["user_id 1 – 1,000,000"] end subgraph Shard_2["Shard 2"] R2["user_id 1,000,001 – 2,000,000"] end subgraph Shard_3["Shard 3"] R3["user_id 2,000,001 – 3,000,000"] end
| Pros | Cons |
|---|---|
| Range queries stay on one shard | Uneven distribution if data isn’t uniform |
| Simple to understand and implement | New data clusters on the “latest” shard (hotspot) |
| Easy rebalancing by splitting ranges | Requires manual range management |
Hash-Based Sharding
shard_id = hash(shard_key) mod N
graph TD K1["user_id = 42"] -->|"hash(42) mod 4 = 2"| S3["Shard 2"] K2["user_id = 99"] -->|"hash(99) mod 4 = 3"| S4["Shard 3"] K3["user_id = 7"] -->|"hash(7) mod 4 = 3"| S4 K4["user_id = 200"] -->|"hash(200) mod 4 = 0"| S1["Shard 0"]
| Pros | Cons |
|---|---|
| Even distribution regardless of key patterns | Range queries must hit all shards |
| No hotspots from sequential keys | Adding/removing shards remaps almost all keys (unless consistent hashing) |
| Simple routing logic | Harder to reason about data locality |
Directory-Based Sharding
A centralized Lookup Table maps each key (or key range) to a specific shard.
graph TD App[Application] --> Dir[(Shard Directory<br/>key → shard mapping)] Dir --> S1[(Shard 1)] Dir --> S2[(Shard 2)] Dir --> S3[(Shard 3)]
| Pros | Cons |
|---|---|
| Maximum flexibility — any key can go anywhere | Directory is a single point of failure |
| Easy rebalancing (update the directory) | Extra lookup on every query (latency) |
| Supports complex placement policies | Directory itself must be highly available |
Geographic Sharding
Shard by user region for data locality and compliance.
graph TD subgraph US["US Region"] US_DB[(US Shard<br/>us-east-1)] end subgraph EU["EU Region"] EU_DB[(EU Shard<br/>eu-west-1)] end subgraph APAC["APAC Region"] APAC_DB[(APAC Shard<br/>ap-southeast-1)] end US_User[US User] --> US_DB EU_User[EU User] --> EU_DB APAC_User[APAC User] --> APAC_DB
Ideal for GDPR compliance (EU data stays in EU) and reducing latency to end users.
Vertical Sharding
Instead of splitting rows, vertical sharding splits tables (or groups of tables) across different servers. Each shard has a different schema serving different functional areas.
graph TD App[Application] --> Auth[(Auth Shard<br/>users, sessions, roles)] App --> Orders[(Orders Shard<br/>orders, order_items, payments)] App --> Catalog[(Catalog Shard<br/>products, categories, reviews)] App --> Analytics[(Analytics Shard<br/>events, metrics, logs)] style Auth fill:#e1f5fe style Orders fill:#e8f5e9 style Catalog fill:#fff3e0 style Analytics fill:#fce4ec
When Vertical Sharding Makes Sense
- Functional decomposition — Each microservice owns its own data store
- Different access patterns — Auth needs fast reads, Analytics needs bulk writes
- Different DBMS per shard — Auth on PostgreSQL, Catalog on MongoDB, Analytics on ClickHouse
- Team ownership — Each team manages its own shard independently
Vertical + Horizontal Combined
Real-world systems often combine both. First split by domain (vertical), then shard large tables within each domain (horizontal):
graph TD App[Application Layer] --> Auth[Auth Service] App --> Orders[Orders Service] Auth --> Auth_DB[(Auth DB<br/>single instance)] Orders --> Router[Shard Router] Router --> OS1[(Orders Shard 1<br/>tenant A-M)] Router --> OS2[(Orders Shard 2<br/>tenant N-Z)]
Consistent Hashing
Standard modular hashing (hash(key) mod N) breaks when N changes — nearly every key gets remapped. Consistent hashing solves this by mapping both nodes and keys onto a hash ring.
graph TD subgraph Hash_Ring["Hash Ring (0 — 2^32)"] direction TB N1["Node A<br/>position 1000"] N2["Node B<br/>position 4000"] N3["Node C<br/>position 7000"] end K1["Key X<br/>hash = 2500"] -.->|"next node clockwise"| N2 K2["Key Y<br/>hash = 5500"] -.->|"next node clockwise"| N3 K3["Key Z<br/>hash = 8500"] -.->|"wraps around"| N1
When a node is added, only keys between the new node and its predecessor are remapped (~1/N of total keys). When a node is removed, its keys shift to the next node clockwise. Used by Cassandra, DynamoDB, Redis Cluster, and Amazon S3.
Virtual nodes further improve balance — each physical node owns multiple positions on the ring, smoothing out distribution.
Implementation Phases
A structured approach to deploying sharding:
graph LR A[1. Data Modeling] --> B[2. Shard Creation] B --> C[3. Data Migration] C --> D[4. Query Routing] D --> E[5. Shard Management] E --> F[6. Monitoring]
| Phase | Activities |
|---|---|
| 1. Data Modeling | Choose shard key based on access patterns, cardinality, write distribution |
| 2. Shard Creation | Provision physical/virtual/container databases per shard |
| 3. Data Migration | Scripts and tools to redistribute existing data across shards |
| 4. Query Routing | Deploy middleware/proxy layer that routes queries to the correct shard |
| 5. Shard Management | Add/remove shards, rebalance data, handle shard failures |
| 6. Monitoring | Track uneven shard sizes, high latencies, hardware failures; alert and auto-rebalance |
Challenges
Cross-Shard Queries
Joins across shards require a scatter-gather pattern — query all shards, merge results at the application or proxy layer. Dramatically slower than single-shard queries.
Mitigation: Denormalize data so common queries stay within one shard. Co-locate related data on the same shard (e.g., user + user’s orders on the same shard).
Cross-Shard Transactions
ACID transactions spanning multiple shards require Two-Phase Commit (2PC) or application-level coordination (Saga pattern). Both add latency and complexity.
Rebalancing
When a shard becomes a hotspot or storage fills up, data must be migrated to new shards. Strategies:
- Consistent hashing — Minimizes data movement
- Dynamic splitting — Split a shard into two when it exceeds a threshold (like Cassandra’s virtual nodes)
- Online migration — Dual-write to old and new shard, then cut over
Schema Changes
DDL must be coordinated across all shards — ALTER TABLE on 50 shards simultaneously. Tools like gh-ost (GitHub) and pt-online-schema-change (Percona) help.
Global Secondary Indexes
Each shard has its own local indexes. A query filtering on a non-shard-key column must hit every shard. Solutions:
- Global index table — Separate table mapping secondary key → shard + primary key
- Broadcast queries — Accept the fan-out cost for rare queries
- Denormalize — Copy the indexed attribute into a secondary Lookup Table sharded differently
Sharding in Practice
| System | Sharding Model |
|---|---|
| Cassandra | Automatic consistent hashing with virtual nodes |
| MongoDB | Built-in range or hash sharding with config servers |
| PostgreSQL | Manual (Citus extension for automatic), or application-level |
| MySQL | Vitess (YouTube’s sharding proxy), ProxySQL, or application-level |
| CockroachDB | Automatic range-based splitting with Raft consensus |
| Amazon DynamoDB | Automatic hash-based partitioning, fully managed |
| Redis Cluster | Hash slots (16384 slots distributed across nodes) |
Decision Framework
flowchart TD A[Is one server at capacity?] -->|No| B[Use partitioning or read replicas] A -->|Yes| C{What's the bottleneck?} C -->|Write throughput| D[Horizontal sharding by write-heavy key] C -->|Storage| E[Horizontal sharding by data volume] C -->|Different workloads| F[Vertical sharding by domain] C -->|Latency / compliance| G[Geographic sharding] D --> H{Need cross-shard joins?} E --> H H -->|Rarely| I[Hash-based sharding] H -->|Often| J[Range-based or directory sharding<br/>co-locate related data]