Database Sharding and Partitioning

Partitioning divides a large table into smaller, more manageable pieces within a single DBMS instance. Sharding extends this concept across multiple database servers, where each shard is an independent database holding a subset of the data. Both techniques improve scalability, query performance, and manageability for large datasets.

Partitioning (Single Instance)

Horizontal Partitioning

Splits rows across partitions based on a partition key.

-- PostgreSQL declarative partitioning
CREATE TABLE orders (
    id SERIAL,
    order_date DATE,
    customer_id INT,
    amount DECIMAL
) PARTITION BY RANGE (order_date);
 
CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
 
CREATE TABLE orders_2026 PARTITION OF orders
    FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');

Partition strategies:

StrategyHow it worksBest for
RangeKey falls within a range (dates, IDs)Time-series, archival
Hashhash(key) mod NEven distribution
ListKey matches explicit valuesCategorical data (region, status)
CompositeCombination of the aboveComplex access patterns

Vertical Partitioning

Splits columns across tables. Frequently-accessed columns in one partition, rarely-accessed or large columns (BLOBs) in another. Both share the same Primary Key.

-- Hot data (frequently queried)
CREATE TABLE user_profile (id, name, email, status);
 
-- Cold data (rarely queried)
CREATE TABLE user_details (id, bio, avatar_blob, preferences_json);

Benefits of Partitioning

  • Partition pruning — The optimizer scans only relevant partitions
  • Parallel scans — Different partitions can be scanned concurrently
  • Maintenance — Drop or archive entire partitions instead of DELETE operations
  • Index size — Smaller per-partition indexes fit in memory

Sharding (Multi-Instance)

Sharding distributes partitions across separate database servers. Each shard is a standalone DBMS handling its own subset of data.

Shard Key Selection

The shard key determines which shard holds each row. Critical design decision:

  • High cardinality — Shard key should have many distinct values for even distribution
  • Query alignment — Most queries should target a single shard (avoid cross-shard joins)
  • Write distribution — Avoid hotspots where one shard receives disproportionate writes

Sharding Strategies

  • Range-based — Shard by key ranges (user IDs 1-1M on shard 1, 1M-2M on shard 2). Risk: uneven distribution if ranges aren’t balanced.
  • Hash-based — hash(shard_key) mod N. Even distribution but range queries hit all shards.
  • Directory-based — A lookup table maps keys to shards. Most flexible but the directory is a single point of failure.
  • Geographic — Shard by user region for data locality compliance (GDPR) and lower latency.

Challenges

  • Cross-shard queriesJoins across shards require scatter-gather, dramatically slower
  • Cross-shard transactions — Require Two-Phase Commit (2PC) or application-level coordination
  • Rebalancing — Adding/removing shards requires data migration (consistent hashing reduces this)
  • Schema changes — Must be coordinated across all shards
  • No global indexes — Each shard maintains its own indexes; global uniqueness requires coordination

Consistent Hashing

Maps both shard nodes and keys onto a hash ring. When a node is added or removed, only 1/N of keys need to be remapped (vs. nearly all keys with simple modular hashing). Used by Cassandra, DynamoDB, and Redis Cluster.

When to Use What

ScenarioApproach
Table too large for efficient queriesPartitioning
Single server at capacitySharding
Time-series data with retention policiesRange partitioning
Multi-region deploymentGeographic sharding
Need cross-row transactionsPartitioning (avoid sharding)

Sources

  • Sharding — Deep-dive on horizontal, vertical, and geographic sharding with strategies and diagrams
  • Fragmentation — Theoretical foundation: horizontal, vertical, hybrid fragmentation with relational algebra