Sharding

Sharding is horizontal partitioning of a database: splitting a large dataset across multiple database servers, each holding a subset of the data. Each partition is called a shard. Together, the shards hold the complete dataset. Sharding allows a system to scale storage and write throughput beyond what a single server can handle.

What Is Sharding

Vertical scaling (adding more CPU, RAM, or faster disks to one server) has a ceiling. At some point, a single machine cannot handle the data volume or write rate a growing system demands. Sharding breaks the data horizontally: instead of one table with 10 billion rows on one server, you have ten servers each holding 1 billion rows of the same table.

Unlike replication (which copies the same data to multiple nodes for read scaling and fault tolerance), sharding distributes different data to different nodes. Each row lives on exactly one shard. This distributes both storage and write load proportionally.

💡
Sharding vs Partitioning

In many databases, "partitioning" refers to dividing a table within a single server (PostgreSQL table partitioning, MySQL partitioning). "Sharding" typically means partitioning across multiple servers. The logical concepts are the same; the operational difference is that sharding requires routing logic to direct queries to the correct server.

Sharding: data is partitioned across multiple database nodes; each shard holds a subset of rows

The Shard Key

The shard key is the column (or combination of columns) used to determine which shard a row belongs to. Choosing the right shard key is the most critical sharding decision — it affects data distribution, query efficiency, and operational complexity for the lifetime of the system.

A good shard key:

- Has high cardinality, so data can spread across many shards
- Distributes reads and writes evenly across its value space
- Appears in the application's most common queries, so those queries route to a single shard
- Is immutable, because changing a row's shard key value means physically moving the row to a different shard

Common shard key choices: user ID (for user-centric data), tenant ID (for SaaS), geographic region, or a combination. The right choice is application-specific.

Range Sharding

In range sharding, the shard key space is divided into contiguous ranges. Each shard owns one range.

Shard 1: user_id 1        – 10,000,000
Shard 2: user_id 10,000,001 – 20,000,000
Shard 3: user_id 20,000,001 – 30,000,000
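As a sketch, range routing amounts to a binary search over sorted range boundaries. The bounds below mirror the example above; `range_shard` is a hypothetical helper, not any particular database's API:

```python
import bisect

# Upper bound (inclusive) of each shard's user_id range, ascending.
RANGE_UPPER_BOUNDS = [10_000_000, 20_000_000, 30_000_000]

def range_shard(user_id: int) -> int:
    """Return the 0-based index of the shard owning this user_id."""
    idx = bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)
    if idx == len(RANGE_UPPER_BOUNDS):
        raise ValueError(f"user_id {user_id} is outside all shard ranges")
    return idx

print(range_shard(5_000_000))   # shard index 0
print(range_shard(10_000_001))  # shard index 1
```

Adding a shard for a new range is just appending a boundary, which is why range sharding grows easily at the top end.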

Advantages: Range queries (e.g., "all users with IDs between 5M and 8M") hit a single shard. Adding new shards for new ranges is straightforward. Data locality: users in the same range co-locate, which is useful if related entities reference each other.

Disadvantages: Sequential inserts (auto-increment IDs, timestamps) always go to the latest range shard, creating a write hot spot. As data grows unevenly in ranges, some shards become much larger than others, requiring periodic rebalancing.

Hash Sharding

In hash sharding, the shard key is passed through a hash function, and the result modulo the number of shards determines the target shard.

shard = hash(user_id) % num_shards

Advantages: Hash functions distribute data uniformly regardless of key distribution patterns. Sequential keys (IDs, timestamps) scatter evenly. No single shard receives all new writes.

Disadvantages: Range queries across shard keys require querying all shards (scatter-gather). Adding or removing shards changes the modulus, which reroutes most existing data to different shards — a resharding problem. Consistent hashing (see the Consistent Hashing guide) mitigates this by only remapping a fraction of keys when nodes are added or removed.
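A minimal hash-routing sketch of the `hash(user_id) % num_shards` formula above. MD5 is used here only because it is stable across processes (Python's built-in `hash()` is randomized per run); `hash_shard` and the shard count are assumptions for illustration:

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration

def hash_shard(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """shard = hash(user_id) % num_shards, using a process-stable hash."""
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Sequential IDs scatter across shards instead of piling onto one:
print([hash_shard(uid) for uid in range(1, 9)])
```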

Directory-Based Sharding

A lookup service (the directory) maintains a mapping from shard key values to the shard that holds them. When routing a query, the application first queries the directory to find the correct shard, then queries that shard.

Advantages: Maximum flexibility — the directory can implement any mapping logic. Moving data between shards only requires updating the directory entry, not rehashing. Supports heterogeneous shard sizes.

Disadvantages: The directory is a single point of failure and a performance bottleneck (every query requires a directory lookup). Caching the directory at the application layer reduces latency but adds consistency complexity. The directory itself must be highly available and consistent.
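A toy in-memory directory illustrates the idea. In production the directory would be a replicated, highly available service; the function names here (`assign`, `route`, `move_key`) are hypothetical:

```python
# Directory: shard key value -> shard id.
directory: dict[int, int] = {}

def assign(user_id: int, shard_id: int) -> None:
    directory[user_id] = shard_id

def route(user_id: int) -> int:
    # Every query pays a directory lookup before reaching a shard.
    return directory[user_id]

def move_key(user_id: int, new_shard: int) -> None:
    # Rebalancing is just updating the mapping; no other keys are touched.
    directory[user_id] = new_shard

assign(42, 0)
print(route(42))   # shard 0
move_key(42, 3)
print(route(42))   # shard 3
```

The `move_key` call is what makes this scheme flexible: relocating a key never requires rehashing anything else.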

Hot Spots

A hot spot occurs when traffic concentrates on one shard disproportionately. Common causes:

- A "celebrity" key: one user, tenant, or entity receives far more traffic than the rest
- Skewed key distribution: a few key values account for most of the rows
- Sequential keys under range sharding: every new write lands on the newest range's shard
- Time-correlated access: recent data is queried far more heavily than old data

Mitigation strategies:

- Choose a higher-cardinality or composite shard key that spreads the skewed traffic
- Salt hot keys by appending a suffix, splitting one logical key across several shards
- Split oversized or overloaded shards and rebalance their ranges
- Cache hot reads in front of the database so they never reach the shard
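Key salting, one common hot-spot mitigation, can be sketched as follows. The `SALT_BUCKETS` count and the `key#suffix` format are assumptions for illustration; note the trade-off that reads for a salted key must now fan out to every sub-key:

```python
import hashlib
import random

SALT_BUCKETS = 8  # number of sub-keys one hot key is spread across (assumed)

def salted_shard(key: str, num_shards: int) -> int:
    # On write, append a random suffix so one hot logical key becomes
    # several physical sub-keys landing on different shards.
    salted = f"{key}#{random.randrange(SALT_BUCKETS)}"
    digest = hashlib.md5(salted.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# A single hot key now lands on several different shards:
print({salted_shard("celebrity_user", 16) for _ in range(100)})
```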

Cross-Shard Queries

Once data is spread across shards, queries that span multiple shards become expensive. A query without the shard key in its WHERE clause must be sent to all shards (scatter-gather) and the results merged at the application or middleware layer.

Problems with cross-shard queries:

- Latency is bounded by the slowest shard, and tail latency grows with shard count
- Every scatter-gather query multiplies load: N shards each do work for a single request
- Joins across shards must happen at the application layer, often over large intermediate results
- Sorting, pagination, and aggregation require merging partial results from every shard
- Cross-shard transactions need distributed commit protocols, which most sharded deployments avoid

The solution is to design your access patterns around the shard key. If a query is always "get all orders for user X," and user_id is the shard key, every order query hits a single shard. If the query is "get all orders in the last 24 hours across all users," it's a scatter-gather — consider keeping a secondary index service or a separate analytics replica for such queries.
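The contrast between a targeted query and a scatter-gather can be sketched with toy in-memory shards. The order schema and the modulo routing here are assumptions for illustration only:

```python
from datetime import datetime

# Two toy "shards", routed by user_id % 2: user 2 -> shard 0, user 1 -> shard 1.
shards = [
    [{"user_id": 2, "created_at": datetime(2024, 1, 1)}],  # shard 0
    [{"user_id": 1, "created_at": datetime(2024, 1, 2)}],  # shard 1
]

def orders_for_user(user_id: int) -> list:
    # Shard key in the predicate: route to exactly one shard.
    shard = shards[user_id % len(shards)]
    return [o for o in shard if o["user_id"] == user_id]

def recent_orders(since: datetime) -> list:
    # No shard key: query every shard and merge (scatter-gather).
    results = []
    for shard in shards:
        results.extend(o for o in shard if o["created_at"] >= since)
    return sorted(results, key=lambda o: o["created_at"], reverse=True)

print(orders_for_user(1))                       # touches 1 shard
print(recent_orders(datetime(2024, 1, 1)))      # touches all shards
```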

Resharding

Resharding — moving data between shards, usually because existing shards are too large or too hot — is one of the most operationally painful aspects of sharding. With modulo-N hashing, going from N to N+1 shards remaps approximately N/(N+1) of all keys (about 80% when growing from four shards to five) — nearly all data must move.
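A quick simulation illustrates the remap fraction: hash the same keys modulo 4 and modulo 5 shards and count how many change shard. This is a standalone sketch, not tied to any particular database:

```python
import hashlib

def shard_of(key: int, num_shards: int) -> int:
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

keys = range(20_000)
moved = sum(1 for k in keys if shard_of(k, 4) != shard_of(k, 5))
# Roughly 4/5 of keys land on a different shard after growing 4 -> 5.
print(f"{moved / 20_000:.1%} of keys moved")
```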

Approaches to reduce resharding pain:

- Consistent hashing, which remaps only a fraction of keys when a node is added or removed
- Pre-splitting: create many small logical shards up front and map them to a few physical servers; resharding then moves whole logical shards rather than rehashing keys
- Directory-based mapping, where moving data only updates directory entries
- Online migration: dual-write to old and new shards, backfill historical data, verify, then cut reads over

Design Considerations