Storage
Every system stores data somewhere. The question is not whether to store it but how — and the answer has profound consequences for performance, cost, scalability, and durability. Block storage, file storage, and object storage are fundamentally different models for organizing and accessing data, each suited to different workloads. Understanding them clearly lets you make deliberate choices instead of defaulting to whatever is familiar.
Storage Types
There are three primary storage abstractions used in modern system design:
- Block storage — raw storage volumes treated like physical hard drives. Used by operating systems and databases.
- File storage — a hierarchical directory tree of named files. Used for shared filesystems and networked file access.
- Object storage — a flat namespace of immutable objects addressed by unique keys. Used for large-scale unstructured data like images, videos, and backups.
Each model exposes a different interface to the data, has different performance characteristics, and scales differently. They are not interchangeable.
Block Storage
Block storage presents raw storage as a sequence of fixed-size blocks (typically 512 bytes to 4 KB). The storage system has no knowledge of the data structure — it just reads and writes blocks at addresses. The operating system or application is responsible for organizing data on top of the raw blocks, typically using a filesystem (ext4, XFS, NTFS) or a database engine.
How It Works
A block storage volume appears to the OS as a local disk. The OS formats it with a filesystem, and applications access files through normal file I/O syscalls. Alternatively, a database can use a block device directly without a filesystem (raw I/O), bypassing the OS page cache for more predictable performance.
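To make the block abstraction concrete, the sketch below reads and writes fixed-size blocks at byte offsets. A temporary file stands in for the raw device; a real volume would be a device path such as `/dev/nvme0n1`, opened the same way (the helper names are illustrative, not a real API):

```python
import os
import tempfile

BLOCK_SIZE = 4096  # 4 KB blocks, as described above

# Simulate a block device with a temporary file; the storage layer sees
# only numbered blocks, with no knowledge of what the bytes mean.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, BLOCK_SIZE * 16)  # a 16-block "volume"

def write_block(fd: int, block_no: int, data: bytes) -> None:
    assert len(data) == BLOCK_SIZE, "block storage writes whole blocks"
    os.pwrite(fd, data, block_no * BLOCK_SIZE)

def read_block(fd: int, block_no: int) -> bytes:
    return os.pread(fd, BLOCK_SIZE, block_no * BLOCK_SIZE)

write_block(fd, 3, b"x" * BLOCK_SIZE)
assert read_block(fd, 3) == b"x" * BLOCK_SIZE
os.close(fd)
os.remove(path)
```

A filesystem or database engine is exactly this kind of code at scale: a data structure that maps names and records onto numbered blocks.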
Characteristics
- Low latency: Direct block access enables sub-millisecond I/O. SSDs attached via NVMe achieve <0.1ms latency. This is essential for databases.
- High IOPS: Block storage can sustain hundreds of thousands of random I/O operations per second, making it suitable for transactional workloads.
- Attached to a single host: Traditional block storage volumes mount to one server at a time. Multiple servers cannot share the same block device (with exceptions like clustered filesystems).
- Fixed capacity: Block volumes have a defined size. Expanding requires resizing the volume and then the filesystem, often with a brief interruption.
Examples
- Cloud: AWS EBS (Elastic Block Store), GCP Persistent Disk, Azure Managed Disks
- On-premises: Local NVMe SSDs, SAN-attached LUNs
- Use cases: OS boot volumes, relational databases (PostgreSQL, MySQL), NoSQL databases (MongoDB, Cassandra), VM disk images
HDDs (spinning disks) are cheap per GB but have high seek latency (~5–10ms) due to mechanical movement. SSDs have no moving parts, delivering ~0.1ms latency and much higher IOPS at higher cost per GB. NVMe SSDs (attached directly to the PCIe bus rather than through SATA/SAS) are another 5–10× faster than SATA SSDs. For databases, NVMe SSDs are the default choice whenever the budget allows. HDDs survive for cold archival storage where cost dominates over latency.
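The latency figures above translate directly into a rough ceiling on random IOPS: with a single outstanding request, the next random I/O cannot start until the previous access completes. A back-of-the-envelope sketch (ignoring queuing and device parallelism, which let real devices do better):

```python
# Rough upper bound on random IOPS for one outstanding request:
# each operation must wait out the full access latency.
def max_random_iops(latency_ms: float) -> float:
    return 1000.0 / latency_ms

assert round(max_random_iops(8)) == 125       # HDD at ~8 ms seek: ~125 IOPS
assert round(max_random_iops(0.1)) == 10000   # SSD at ~0.1 ms: ~10,000 IOPS
```

This two-orders-of-magnitude gap in random access is why SSDs, not HDDs, back transactional databases.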
File Storage
File storage organizes data as a hierarchy of directories and named files, accessible through standard filesystem protocols (POSIX, NFS, SMB/CIFS). Clients mount the storage and interact with it using familiar file operations: open, read, write, seek, close.
How It Works
A file storage server (NAS appliance or cloud service) exports one or more shares over a network protocol. Clients mount the share and it appears as a local filesystem. Multiple clients can mount the same share simultaneously, enabling shared access to the same files from many servers — the key capability that distinguishes file storage from block storage.
Characteristics
- Shared access: Multiple clients can read and write to the same filesystem concurrently. The storage server handles locking and consistency.
- Familiar interface: Applications use standard filesystem APIs. No code changes needed to use network file storage instead of local disk.
- Hierarchical namespace: Data is organized in directories, supporting human-readable paths and metadata (permissions, timestamps, ownership).
- Higher latency than block: Network overhead adds latency. NFS is typically 1–5ms vs <0.1ms for local NVMe. Not suitable for latency-sensitive database workloads.
- Scales to hundreds of terabytes: NAS appliances and cloud file services (AWS EFS, Azure Files) can store very large amounts of data, though not at the scale of object storage.
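Because file storage exposes the standard filesystem API, clients coordinate with ordinary POSIX mechanisms. A minimal sketch of advisory byte-range locking follows; the path here is a local temp file for illustration, but the same calls work against an NFS mount (assuming NFSv4, which supports byte-range locks natively):

```python
import fcntl
import os
import tempfile

# A local path stands in for a shared NFS-mounted file.
path = os.path.join(tempfile.gettempdir(), "shared-config.txt")

with open(path, "w") as f:
    fcntl.lockf(f, fcntl.LOCK_EX)  # exclusive lock; blocks while another client holds it
    f.write("setting=1\n")         # no other cooperating client writes concurrently
    f.flush()
    fcntl.lockf(f, fcntl.LOCK_UN)  # release before closing
```

Locks acquired this way are advisory: they protect only against clients that also take the lock, which is the standard convention for coordinating writers on a shared filesystem.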
Examples
- Cloud: AWS EFS (Elastic File System), Azure Files, GCP Filestore
- Protocols: NFS (Linux), SMB/CIFS (Windows), AFP (legacy macOS; modern macOS uses SMB)
- Use cases: Shared application configuration, home directories, content management systems, media production workflows where multiple machines need read/write access to the same files
Object Storage
Object storage stores data as discrete objects in a flat namespace. Each object consists of the data itself, a unique key (the object’s identifier), and metadata (content type, size, custom attributes). There is no directory hierarchy — all objects live in a flat bucket, though keys can contain slashes to simulate folder paths.
How It Works
Objects are accessed via an HTTP API (typically S3-compatible). You PUT an object to store it, GET it to retrieve it, and DELETE it to remove it. Objects are immutable — you cannot update a portion of an object in place; you must write a new version. This immutability is fundamental to how object storage achieves its durability and scalability guarantees.
Characteristics
- Virtually unlimited scale: Object storage scales to exabytes. AWS S3 stores trillions of objects. No capacity planning required — you pay for what you use.
- High durability: S3 offers 99.999999999% (eleven nines) durability by redundantly storing objects across multiple availability zones. Data loss is extraordinarily unlikely.
- High throughput for large objects: Excellent for streaming large files (videos, backups, dataset exports). Multipart upload enables parallel upload of large objects.
- Higher latency for small objects: An HTTP round-trip to retrieve a small file is 10–100ms, far higher than block storage. Not suitable for databases or any workload requiring frequent random access to small pieces of data.
- Immutable writes: Objects cannot be partially updated. This simplifies replication and enables strong durability guarantees, but means you cannot append to a log file stored as an object.
- Eventual consistency (historically): S3 now offers strong read-after-write consistency, but many object stores are eventually consistent for list operations.
Examples
- Cloud: AWS S3, GCP Cloud Storage, Azure Blob Storage, Cloudflare R2
- Self-hosted: MinIO, Ceph
- Use cases: User-uploaded images and videos, static website assets, data lake storage, ML training datasets, database backups, log archives, build artifacts
The AWS S3 API has become the de facto standard for object storage. Every major cloud provider, most self-hosted solutions (MinIO, Ceph), and dozens of data tools (Spark, Flink, DVC, MLflow) support the S3 API natively. When designing a system that needs object storage, designing for the S3 API gives you maximum portability and tool compatibility, regardless of which backend you actually use.
Comparing Storage Types
| | Block | File | Object |
|---|---|---|---|
| Access method | Raw I/O / filesystem | NFS, SMB (POSIX) | HTTP API (S3) |
| Namespace | Blocks at addresses | Hierarchical (directories) | Flat (key-value) |
| Latency | Sub-ms (NVMe) | 1–5ms (network) | 10–100ms (HTTP) |
| Throughput | Very high (random I/O) | High | Very high (large sequential) |
| Shared access | No (single host) | Yes (multi-client) | Yes (HTTP) |
| Mutability | Mutable | Mutable | Immutable (replace only) |
| Scale | TB per volume | Hundreds of TB | Exabytes |
| Durability | Depends on RAID/replication | Depends on NAS config | 11 nines (e.g. S3) |
| Best for | Databases, OS volumes | Shared filesystems | Media, backups, data lakes |
RAID
RAID (Redundant Array of Independent Disks) combines multiple physical drives into a single logical volume that provides redundancy, performance, or both. RAID is implemented either in hardware (a dedicated RAID controller) or in software (Linux mdadm, ZFS).
Common RAID Levels
RAID 0 — Striping
Data is split (striped) across multiple disks. Reads and writes are parallelized, roughly doubling throughput with two disks and tripling it with three. No redundancy — if any single disk fails, all data is lost. Used only when performance is paramount and data loss is acceptable (scratch disks, temporary caches).
RAID 1 — Mirroring
Every write goes to two (or more) disks simultaneously. The content of all disks is identical. If one disk fails, the other contains a complete copy. Read performance can be doubled (reads served from either disk). 50% storage efficiency — two 1 TB disks give 1 TB usable. Simple and robust. Commonly used for OS boot drives and small critical datasets.
RAID 5 — Striping with Distributed Parity
Data and parity information are striped across at least three disks. Parity allows reconstruction of any single failed disk. Reads are fast (parallelized). Writes require parity calculation (some overhead). Can tolerate one disk failure. Storage efficiency is (N-1)/N — three 1 TB disks give 2 TB usable. The most common RAID level for NAS devices.
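The parity math is plain XOR: the parity block is the XOR of all data blocks, so any one missing block can be rebuilt by XORing the surviving blocks with parity. A small sketch (tiny two-byte blocks for illustration; real stripes use KB-to-MB chunks):

```python
from functools import reduce

def xor_parity(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of the given blocks -- RAID 5's redundancy primitive."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Blocks from three data disks, plus the computed parity block.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\x0f"
parity = xor_parity([d0, d1, d2])

# Disk 1 fails: XOR the survivors with parity to reconstruct its block.
recovered = xor_parity([d0, d2, parity])
assert recovered == d1
```

The same property explains the write overhead: every write to a data block must also read and rewrite the stripe's parity block to keep the XOR relationship true.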
RAID 6 — Striping with Double Parity
Like RAID 5 but with two parity blocks, allowing two simultaneous disk failures. Requires at least four disks. Higher write overhead than RAID 5. Preferred for large arrays where the probability of a second disk failing during a rebuild is non-negligible.
RAID 10 — Mirroring + Striping
Combines RAID 1 mirroring with RAID 0 striping. Requires at least four disks. Provides both the redundancy of mirroring and the performance of striping. 50% storage efficiency. The preferred choice for databases: high IOPS, high throughput, and can survive multiple simultaneous failures as long as the failed drives are not the same mirror pair.
| Level | Redundancy | Min disks | Efficiency | Best for |
|---|---|---|---|---|
| RAID 0 | None | 2 | 100% | Max performance, no durability |
| RAID 1 | 1 disk failure | 2 | 50% | Simple redundancy, small arrays |
| RAID 5 | 1 disk failure | 3 | (N-1)/N | NAS, balanced performance/redundancy |
| RAID 6 | 2 disk failures | 4 | (N-2)/N | Large arrays, higher safety |
| RAID 10 | 1 per mirror pair | 4 | 50% | Databases, high-performance workloads |
RAID protects against disk hardware failure. It does not protect against accidental deletion, ransomware, filesystem corruption, or datacenter failure. A RAID 1 mirror that contains corrupted data mirrors the corruption to both drives. Always maintain separate backups (offsite, point-in-time snapshots) in addition to RAID.
NAS and SAN
NAS and SAN are two approaches to providing networked storage, differing fundamentally in what they expose over the network.
NAS — Network Attached Storage
A NAS device presents a filesystem over the network using file-level protocols (NFS, SMB). Clients mount the share and see files and directories. The NAS handles the filesystem internally; clients just see file operations. Multiple clients can share the same NAS simultaneously. NAS is easy to set up and manage, and is the standard approach for shared file storage in small-to-medium environments.
SAN — Storage Area Network
A SAN presents raw block devices over a dedicated high-speed network (typically Fibre Channel or iSCSI). The client OS sees a disk device, formats it with a filesystem, and manages it like a local drive. SANs provide very low latency (comparable to local block storage) and are used for performance-critical workloads like large databases and virtualization platforms. SANs are expensive and complex to operate — primarily used in enterprise data centers.
| | NAS | SAN |
|---|---|---|
| What it exposes | Files (NFS, SMB) | Raw blocks (FC, iSCSI) |
| Who manages filesystem | NAS device | Client OS |
| Simultaneous clients | Yes (file sharing) | Typically one per LUN |
| Performance | Good | Excellent (near-local) |
| Complexity | Low | High |
| Cost | Moderate | High |
Distributed File Systems
When storage requirements exceed what a single machine can provide — petabytes of data, thousands of concurrent clients — distributed file systems spread data across many machines while presenting a single unified namespace.
HDFS — Hadoop Distributed File System
HDFS was designed for batch processing of very large files (hundreds of megabytes to gigabytes each). It stores files by splitting them into large blocks (default 128 MB) and replicating each block across multiple DataNodes (typically 3×). A centralized NameNode tracks the namespace and block locations.
- Write-once, read-many: Files are not updated in place. Optimized for sequential streaming reads, not random access.
- Designed for MapReduce: Computation is moved to the data. Jobs run on the same nodes that store the data, avoiding network transfer.
- Not suitable for: Small files (NameNode memory overhead), low-latency access, frequent updates.
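The block-splitting arithmetic can be sketched directly, using the 128 MB default block size and 3× replication described above (`split_into_blocks` is an illustrative helper, not an HDFS API):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        blocks.append((offset, min(block_size, file_size - offset)))
        offset += block_size
    return blocks

# A 300 MB file becomes three blocks: 128 MB, 128 MB, and 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
assert len(blocks) == 3
assert blocks[-1][1] == 44 * 1024 * 1024
# With 3x replication, the cluster stores 9 block replicas in total.
```

This also shows the small-file problem: a 1 KB file still occupies one block entry (and one NameNode record), so millions of tiny files exhaust NameNode memory long before disk space runs out.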
Ceph
A self-hosted distributed storage platform that provides block storage (RBD), file storage (CephFS), and object storage (RGW with S3-compatible API) over the same cluster. Designed for large-scale on-premises deployments where you need the flexibility of all three storage types without separate systems.
GFS — Google File System
The predecessor to HDFS (and published before it), GFS was designed for Google’s workloads: large files, high sequential throughput, fault tolerance at scale. Its design influenced HDFS and much of the thinking behind distributed storage systems. GFS is not publicly available; HDFS is the open-source equivalent.
Storage in System Design
Every component of a system stores data somewhere. Matching the storage type to the workload is one of the most important architectural decisions:
Decision Framework
- Need low latency random I/O? → Block storage (database volumes, NVMe SSDs). Relational databases, NoSQL stores, and message queues all need block storage.
- Need multiple servers to share the same files? → File storage (NFS/EFS). Application config, shared media, build artifacts that multiple CI workers need simultaneously.
- Need to store large amounts of unstructured data durably and cheaply? → Object storage (S3). User uploads, backups, ML datasets, logs, static assets. This is the default for anything that isn’t a database or a shared filesystem.
- Need to process petabytes in batch? → Distributed file system (HDFS) or object storage as a data lake (S3 + Spark/Athena).
Common Patterns
Application servers are stateless: Application code and configuration come from object storage or are baked into container images. Local disk on application servers is ephemeral. User-generated content goes directly to object storage, never to the local filesystem of an application server.
Databases use block storage: Attach a high-performance block volume (EBS io2, local NVMe) to your database server. The database manages the filesystem. Use RAID 10 or cloud-managed replication for durability.
Object storage as the data lake: Ingest raw data into S3 (or equivalent). Query it with serverless tools (AWS Athena, BigQuery) or batch processing frameworks (Spark). Avoid moving large datasets to HDFS on-premises unless you have a specific reason to — managed object storage is simpler to operate and often cheaper.
When a component needs to store data, explicitly name the storage type and justify it. “User profile photos go to S3 — object storage gives us eleven-nines durability and virtually unlimited scale at low cost” is far stronger than “we store images somewhere.” For databases, mention block storage and the volume type (SSD/NVMe vs HDD). If asked about large-scale data processing, distinguish HDFS from modern S3-based data lake architectures. Bring up RAID only if the question involves on-premises infrastructure or data durability at the disk level.