Normalization & Denormalization
Normalization is the process of organizing a relational database schema to reduce data redundancy and eliminate update anomalies. Denormalization is the deliberate reversal of normalization to optimize read performance at the cost of controlled redundancy. Both are tools — neither is universally correct. Understanding when to apply each is a core skill for database design at scale.
What Is Normalization?
A normalized schema stores each fact exactly once. If a customer’s name changes, you update one row in one table and the change is reflected everywhere it is referenced. A denormalized schema might store the customer name in every order row — a name change requires updating thousands of rows, and if any are missed, the data is inconsistent.
Normalization was formalized by E. F. Codd in the 1970s through a series of normal forms (NF), each building on the previous. For practical database design, understanding 1NF through BCNF covers the vast majority of real-world situations.
Update Anomalies
Before the normal forms, it helps to understand what they prevent. Consider this unnormalized table storing orders with embedded customer and product data:
| order_id | customer_name | customer_city | product_name | product_price | quantity |
|---|---|---|---|---|---|
| 1001 | Alice | London | Laptop | 1200 | 1 |
| 1002 | Alice | London | Mouse | 25 | 2 |
| 1003 | Bob | Paris | Laptop | 1200 | 1 |
Three classes of anomalies arise:
- Update anomaly: Alice moves to Berlin. You must update every row where customer_name = 'Alice'. Miss one row and you have inconsistent data.
- Insert anomaly: You cannot record a new product in the catalog without attaching it to an existing order. Product data is entangled with order data.
- Delete anomaly: If Bob’s single order (1003) is deleted, you lose the fact that a Laptop costs 1200 — the only row containing that product.
Normalization eliminates these anomalies by ensuring data is stored in the right place.
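As a sketch, the unnormalized orders table above could be decomposed so that each fact lives in exactly one place (table and column names follow the example; types are illustrative PostgreSQL):

```sql
CREATE TABLE customers (
    customer_id   serial PRIMARY KEY,
    customer_name text NOT NULL,
    customer_city text NOT NULL
);

CREATE TABLE products (
    product_id    serial PRIMARY KEY,
    product_name  text NOT NULL,
    product_price numeric(10,2) NOT NULL
);

CREATE TABLE orders (
    order_id    serial PRIMARY KEY,
    customer_id int NOT NULL REFERENCES customers,
    product_id  int NOT NULL REFERENCES products,
    quantity    int NOT NULL
);
```

Alice's move to Berlin is now a single UPDATE on customers, a product can exist before any order references it, and deleting order 1003 no longer erases the Laptop's price.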
First Normal Form (1NF)
A table is in 1NF when:
- Every column contains atomic (indivisible) values — no arrays, no comma-separated lists, no nested records.
- Each row is uniquely identifiable (a primary key exists).
- The order of rows and columns is irrelevant to the data’s meaning.
A common 1NF violation: storing multiple phone numbers in a single column as "555-1234, 555-5678". The fix: a separate phone_numbers table with a foreign key to users.
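A minimal sketch of that fix, assuming a users table (names are illustrative):

```sql
CREATE TABLE users (
    user_id serial PRIMARY KEY,
    name    text NOT NULL
);

-- One row per phone number instead of a comma-separated list.
CREATE TABLE phone_numbers (
    user_id      int  NOT NULL REFERENCES users,
    phone_number text NOT NULL,
    PRIMARY KEY (user_id, phone_number)
);

-- "All phone numbers for user 42" becomes an indexed lookup:
SELECT phone_number FROM phone_numbers WHERE user_id = 42;
```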
PostgreSQL natively supports array columns, which technically violate 1NF. Arrays are useful for simple, rarely-queried multi-value data (tags, feature flags), but if you need to filter on individual elements, join against them, or enforce referential integrity, a separate table with a foreign key is almost always the right design. A GIN index can make element lookups fast, but array elements cannot carry foreign-key constraints, and reporting across them is awkward.
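For contrast, here is a sketch of both designs (table and column names are illustrative):

```sql
-- Array column: convenient for simple, rarely-filtered multi-value data.
CREATE TABLE articles (
    article_id serial PRIMARY KEY,
    tags       text[] NOT NULL DEFAULT '{}'
);

-- Filtering on elements uses containment operators rather than plain joins:
SELECT article_id FROM articles WHERE tags @> ARRAY['postgres'];

-- The 1NF alternative: a separate table with referential integrity.
CREATE TABLE article_tags (
    article_id int  NOT NULL REFERENCES articles,
    tag        text NOT NULL,
    PRIMARY KEY (article_id, tag)
);
```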
Second Normal Form (2NF)
A table is in 2NF when it is in 1NF and every non-key column is fully functionally dependent on the entire primary key — not just part of it.
2NF violations only occur in tables with composite primary keys. If part of the primary key determines a non-key column, that column should be in a separate table.
Example: an order_items table with composite key (order_id, product_id). If product_name depends only on product_id (not on the combination of order_id + product_id), it’s a partial dependency — a 2NF violation. The fix: move product_name to a products table and reference it via product_id.
| Violates 2NF | Fixed |
|---|---|
| order_items(order_id, product_id, product_name, quantity) | order_items(order_id, product_id, quantity)<br>products(product_id, product_name, price) |
Third Normal Form (3NF)
A table is in 3NF when it is in 2NF and no non-key column depends on another non-key column. In other words: all non-key columns depend on the key, the whole key, and nothing but the key.
This eliminates transitive dependencies. Example: a customers table with (customer_id, zip_code, city, state). Here city and state depend on zip_code, not directly on customer_id. The transitive chain is: customer_id → zip_code → (city, state).
Fix: extract a zip_codes(zip_code, city, state) table. The customers table keeps only zip_code as a foreign key. Now if a zip code’s city changes (rare, but possible), one row in zip_codes covers every customer with that zip code.
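A sketch of the decomposed schema (types are illustrative):

```sql
-- Transitive dependency removed: city and state live with the zip code.
CREATE TABLE zip_codes (
    zip_code text PRIMARY KEY,
    city     text NOT NULL,
    state    text NOT NULL
);

CREATE TABLE customers (
    customer_id serial PRIMARY KEY,
    name        text NOT NULL,
    zip_code    text NOT NULL REFERENCES zip_codes
);

-- A zip code's city changes: one row updated, every customer sees it.
UPDATE zip_codes SET city = 'New Name' WHERE zip_code = '99999';
```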
Most production schemas aim for 3NF as the baseline. It eliminates the most common sources of redundancy without introducing excessive join complexity.
Boyce-Codd Normal Form (BCNF)
BCNF is a slightly stronger version of 3NF. A table is in BCNF when every determinant is a candidate key. (A determinant is any column or set of columns that functionally determines another column.)
BCNF violations are rarer and arise in tables with overlapping candidate keys. When they occur, they introduce subtle anomalies that 3NF doesn’t catch. In practice, a schema in 3NF rarely violates BCNF; when it does, decomposing to BCNF can lose functional dependencies that the database can no longer enforce without joins — a trade-off that sometimes leads designers to accept the 3NF schema instead.
Denormalization
Fully normalized schemas are clean and consistent, but they can be slow for read-heavy workloads. A query that joins 8 tables to render a user’s dashboard imposes significant database load for every page view. Denormalization trades some data consistency for read performance by introducing controlled redundancy.
Common signals that denormalization is warranted:
- A frequently executed query joins many large tables.
- Read latency is unacceptable and caching doesn’t fit (data is too dynamic or too personalized).
- Reporting queries aggregate data across millions of rows by joining several tables.
- The system is read-heavy; writes are infrequent or can tolerate extra latency.
Denormalization is a trade-off that introduces maintenance burden: the application must keep redundant copies in sync on every write. Start with a normalized schema. Add indexes. Add caching. Profile actual slow queries with EXPLAIN ANALYZE. Only denormalize when you have identified a specific performance problem that these tools cannot solve.
Denormalization Strategies
Duplicated Columns
Copy a column from one table into another to eliminate a join. Store customer_name directly in the orders table alongside customer_id. The application must update both tables when a customer changes their name, or accept that orders display the name at time of purchase (which is often the correct business requirement anyway).
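A sketch of the duplicated-column design, treating the copy as a snapshot (assumes a customers table; names are illustrative):

```sql
-- Denormalized copy: orders carries the customer name at time of purchase.
CREATE TABLE orders (
    order_id      serial PRIMARY KEY,
    customer_id   int  NOT NULL REFERENCES customers,
    customer_name text NOT NULL,  -- snapshot, deliberately not kept in sync
    amount        numeric(10,2) NOT NULL
);

-- The order-listing query no longer joins customers:
SELECT order_id, customer_name, amount FROM orders;
```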
Computed Columns
Precompute and store a derived value instead of calculating it at query time. Store order_total in the orders table rather than summing order_items.price * quantity on every read. The application updates order_total whenever items are added or removed.
PostgreSQL supports generated columns (GENERATED ALWAYS AS) for simple computed columns that are maintained automatically by the database engine.
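A minimal example of a stored generated column (PostgreSQL 12+; table and column names are illustrative):

```sql
-- line_total is computed and stored by the engine; it cannot drift
-- out of sync the way an application-maintained column can.
CREATE TABLE order_items (
    order_id   int NOT NULL,
    product_id int NOT NULL,
    price      numeric(10,2) NOT NULL,
    quantity   int NOT NULL,
    line_total numeric(12,2) GENERATED ALWAYS AS (price * quantity) STORED,
    PRIMARY KEY (order_id, product_id)
);
```

Generated columns only cover derivations within a single row; cross-row aggregates like order_total still need application logic, triggers, or a materialized view.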
Materialized Views
A materialized view is a precomputed query result stored as a physical table. Unlike a regular view (which re-executes the query on every access), a materialized view is computed once and refreshed on demand.
```sql
CREATE MATERIALIZED VIEW monthly_revenue AS
SELECT DATE_TRUNC('month', created_at) AS month,
       SUM(amount) AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY 1;

-- Refresh manually or on a schedule
REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_revenue;
```
Materialized views are excellent for dashboards, reports, and aggregate metrics. The staleness window (time between refreshes) must be acceptable for the use case. REFRESH CONCURRENTLY avoids locking readers during the refresh, at the cost of requiring a unique index on the materialized view.
Embedding in Document Stores
In document databases (MongoDB, Firestore), denormalization is the default design pattern. Instead of referencing a related document by ID (requiring a second query), you embed the related data directly inside the parent document. A blog post embeds its author’s display name and avatar URL rather than storing only an author ID.
This is efficient for reads but creates an update problem: if the author changes their name, every embedded post document must be updated. This is acceptable when the embedded data is stable or when the read-time join alternative would be too expensive.
NoSQL and Denormalization
NoSQL databases are built around denormalization as a first-class design pattern. Because most NoSQL stores do not support multi-table joins, the data model must be designed around query patterns from the start.
In Cassandra, data is modeled by the queries that will read it. If you need to look up a user’s tweets sorted by time, and also look up all tweets containing a hashtag, you create two separate tables (column families): one keyed by user_id, one keyed by hashtag. The same tweet is written to both tables — explicit denormalization in exchange for O(1) lookup on either access pattern.
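A sketch of the two query-driven tables in CQL (names and types are illustrative):

```sql
-- CQL: the same tweet is written to both tables at insert time.
CREATE TABLE tweets_by_user (
    user_id  uuid,
    tweet_id timeuuid,
    body     text,
    PRIMARY KEY (user_id, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC);

CREATE TABLE tweets_by_hashtag (
    hashtag  text,
    tweet_id timeuuid,
    user_id  uuid,
    body     text,
    PRIMARY KEY (hashtag, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC);
```

Each table's partition key matches one access pattern, so either lookup hits a single partition without a join.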
DynamoDB encourages a single-table design where multiple entity types share a table, using carefully chosen partition and sort keys to serve all access patterns without joins.
Design Considerations
- Normalize for writes, denormalize for reads. A write-heavy transactional system (banking, order processing) benefits from normalization. A read-heavy analytical system (dashboards, reports) benefits from denormalization.
- OLTP vs OLAP: Online transaction processing systems run frequent short reads and writes — normalize. Online analytical processing systems run infrequent, expensive read queries — consider a star schema (fact tables + dimension tables) which is a structured form of denormalization.
- Maintain consistency at write time. When you denormalize, the application or a database trigger must keep redundant copies consistent. Decide who owns this logic and make it explicit.
- Version your denormalized snapshots. Sometimes the right answer is to snapshot the data at write time (the address used at order placement, not the current customer address). This is semantically correct, not just a performance hack.
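When the database owns the sync logic, a trigger can maintain a redundant copy automatically. A sketch, assuming a duplicated customer_name column on orders (names are illustrative; PostgreSQL 11+ syntax):

```sql
-- Keep the denormalized copy in sync inside the database.
CREATE FUNCTION sync_order_customer_name() RETURNS trigger AS $$
BEGIN
    UPDATE orders
       SET customer_name = NEW.customer_name
     WHERE customer_id = NEW.customer_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_name_sync
AFTER UPDATE OF customer_name ON customers
FOR EACH ROW EXECUTE FUNCTION sync_order_customer_name();
```

Note that this is the sync-at-write-time choice; if the business requirement is a snapshot of the name at purchase time, no trigger is wanted at all.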
Demonstrate that you think about access patterns when modeling data. “The write path normalizes order data into customers, orders, and order_items tables for consistency and auditability. The read path for the dashboard aggregates over millions of rows, so we maintain a materialized view refreshed every 5 minutes — staleness is acceptable for reporting.” This signals awareness of the OLTP/OLAP trade-off without overengineering the solution.