SQL Databases
SQL databases have been the backbone of production systems for five decades. Despite the NoSQL wave, relational databases remain the default choice for most applications — and for good reason. When you need strong consistency, complex queries, and referential integrity, nothing else comes close. Understanding how they work internally lets you use them effectively and know exactly where their limits are.
What Is SQL?
SQL (Structured Query Language) is the standard language for interacting with relational databases. It covers four areas:
- DDL (Data Definition Language): Creating and modifying schema — CREATE TABLE, ALTER TABLE, DROP.
- DML (Data Manipulation Language): Reading and writing data — SELECT, INSERT, UPDATE, DELETE.
- DCL (Data Control Language): Access control — GRANT, REVOKE.
- TCL (Transaction Control Language): Transaction management — BEGIN, COMMIT, ROLLBACK.
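The categories are easiest to see side by side. A minimal sketch using Python's built-in sqlite3 module (SQLite has no user accounts, so DCL appears only as a comment; GRANT and REVOKE need a server database such as PostgreSQL, and the table and names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage transactions explicitly

# DDL: define the schema
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# TCL + DML: write data inside an explicit transaction
conn.execute("BEGIN")
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))
conn.execute("UPDATE users SET name = 'Ada Lovelace' WHERE name = 'Ada'")
conn.execute("COMMIT")

# DCL (server databases only): GRANT SELECT ON users TO reporting_role;

# DML: read it back
row = conn.execute("SELECT name FROM users").fetchone()
print(row[0])  # Ada Lovelace
```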
SQL is declarative: you describe what you want, not how to get it. The database engine figures out the most efficient execution strategy.
The Relational Model
A relational database organizes data into tables (also called relations). Each table has a fixed set of columns (attributes), and each row (tuple) represents one record. The schema enforces that every row in a table has the same columns, and each column has a defined data type.
Keys and Relationships
- Primary key: A column (or combination of columns) that uniquely identifies each row. Cannot be NULL. Indexed automatically. The physical row order in InnoDB is determined by the primary key (clustered index).
- Foreign key: A column in one table that references the primary key of another. Enforces referential integrity — you cannot insert a row with a foreign key value that doesn’t exist in the referenced table, and you cannot delete a referenced row without handling the dependent rows first.
- Unique key: Ensures all values in a column (or column set) are distinct. Unlike a primary key, unique columns can contain NULL (in most databases).
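All three kinds of keys fit in a small schema. A sketch using SQLite (table and column names are illustrative; note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when asked

conn.execute("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,  -- primary key: unique, non-NULL
        email TEXT UNIQUE           -- unique key: distinct values, NULL allowed
    )""")
conn.execute("""
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id)  -- foreign key
    )""")

conn.execute("INSERT INTO users (id, email) VALUES (1, 'foo@example.com')")
conn.execute("INSERT INTO orders (id, user_id) VALUES (1, 1)")  # OK: user 1 exists

# Referential integrity: an order pointing at a nonexistent user is rejected
try:
    conn.execute("INSERT INTO orders (id, user_id) VALUES (2, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```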
Joins
Joins combine rows from two or more tables based on a related column. They are the defining feature of the relational model — data is stored once (normalized) and assembled at query time:
- INNER JOIN: Returns rows that have matching values in both tables.
- LEFT JOIN: Returns all rows from the left table and matched rows from the right; NULL on the right if no match.
- RIGHT JOIN: Mirror of LEFT JOIN — returns all rows from the right table and matched rows from the left; NULL on the left if no match.
- FULL OUTER JOIN: Returns all rows from both tables, with NULLs where there is no match.
A common misconception is that joins are expensive. A well-indexed join between two tables returns results in milliseconds. Joins become slow when joining large tables without indexes, when the query plan chooses a nested-loop join over a hash join, or when returning millions of rows. The fix is almost always adding the right index or rewriting the query — not switching to a database that doesn’t have joins.
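The difference between INNER and LEFT JOIN is easiest to see on a toy dataset. A sketch in SQLite, with hypothetical users and orders tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER);
    INSERT INTO users  VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 50);
""")

# INNER JOIN: only users with a matching order
inner = conn.execute("""
    SELECT u.name, o.total FROM users u
    JOIN orders o ON o.user_id = u.id
""").fetchall()

# LEFT JOIN: every user; NULL (None in Python) where no order matches
left = conn.execute("""
    SELECT u.name, o.total FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    ORDER BY u.id
""").fetchall()

print(inner)  # [('Ada', 50)]
print(left)   # [('Ada', 50), ('Grace', None)]
```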
ACID Properties
SQL databases guarantee ACID properties for transactions. These four properties are what make relational databases trustworthy for financial, medical, and any other data where correctness is non-negotiable.
Atomicity
A transaction is all-or-nothing. If a transaction contains ten SQL statements and the ninth fails, none of the ten are committed — the database rolls back to its state before the transaction began. This prevents partial updates: a bank transfer that debits one account but never credits the other is impossible under atomicity.
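The bank-transfer case can be sketched directly. An illustrative SQLite example where a CHECK constraint makes the debit fail after the credit has already been applied; the rollback then undoes the credit too, so no partial transfer survives:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
conn.executescript("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    );
    INSERT INTO accounts VALUES (1, 100), (2, 0);
""")

# Transfer 150 out of an account holding only 100. The credit succeeds,
# the debit then violates the CHECK constraint, and ROLLBACK undoes the
# already-applied credit as well.
try:
    conn.execute("BEGIN")
    conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
    conn.execute("COMMIT")
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")

balances = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(1, 100), (2, 0)]
```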
Consistency
A transaction must transition the database from one valid state to another. All integrity constraints (foreign keys, unique indexes, check constraints, NOT NULL) must hold before the transaction commits. If a statement would violate a constraint, the transaction is rolled back.
Isolation
Concurrent transactions must not interfere with each other. Without isolation, you get anomalies:
- Dirty read: Reading uncommitted data from another transaction.
- Non-repeatable read: Reading the same row twice in a transaction and getting different values because another transaction modified it in between.
- Phantom read: A query returns different rows when re-run within a transaction because another transaction inserted or deleted rows matching the query.
SQL defines four isolation levels that trade off anomaly prevention for concurrency:
| Isolation Level | Dirty Read | Non-repeatable Read | Phantom Read |
|---|---|---|---|
| READ UNCOMMITTED | Possible | Possible | Possible |
| READ COMMITTED | Prevented | Possible | Possible |
| REPEATABLE READ | Prevented | Prevented | Possible |
| SERIALIZABLE | Prevented | Prevented | Prevented |
Most databases default to READ COMMITTED (PostgreSQL) or REPEATABLE READ (MySQL/InnoDB). SERIALIZABLE is the strictest — transactions execute as if they were run one at a time — but at the cost of lower concurrency. PostgreSQL implements SERIALIZABLE through Serializable Snapshot Isolation (SSI), which detects conflicts at commit time rather than blocking.
Durability
Once a transaction commits, it is permanent. The database writes the transaction to the Write-Ahead Log (WAL) on disk before acknowledging the commit. If the server crashes immediately after the commit acknowledgment, the transaction is recovered on restart by replaying the WAL. Durability is only as strong as your storage — losing the WAL (e.g. from a storage failure with fsync disabled) can corrupt the database.
Some guides suggest setting fsync=off in PostgreSQL or innodb_flush_log_at_trx_commit=0 in MySQL to speed up writes. This disables the durability guarantee: a system crash after a commit can leave your database in a corrupt, unrecoverable state. Only do this in disposable test environments where data loss is acceptable.
Indexes
An index is a separate data structure that the database maintains alongside the table to speed up data retrieval. Without an index, a query that filters by a non-primary-key column must scan every row in the table (a full table scan). With an index, the database jumps directly to the matching rows.
B-Tree Indexes
The default index type in virtually all relational databases. Data is stored in a balanced tree structure, sorted by the indexed column(s). B-trees support:
- Exact lookups: WHERE email = 'foo@example.com'
- Range queries: WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31'
- Prefix matches: WHERE name LIKE 'John%' (but not LIKE '%John')
- Sorting: ORDER BY indexed_column can use the index to avoid a sort step
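These capabilities are easy to verify from the query plan. A sketch using SQLite's EXPLAIN QUERY PLAN (index and table names are made up; the same checks work with EXPLAIN in PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows end with a human-readable detail string
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

exact = plan("SELECT * FROM users WHERE email = 'foo@example.com'")
rng = plan("SELECT * FROM users WHERE email BETWEEN 'a' AND 'b'")

# Both the exact lookup and the range query use the B-tree index
print("USING INDEX" in exact, "USING INDEX" in rng)  # True True
```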
Composite Indexes
An index on multiple columns. A composite index on (last_name, first_name) can be used for queries filtering by last_name alone, or by both last_name and first_name, but not by first_name alone. Column order matters — the index is sorted first by the leftmost column. This is the leftmost prefix rule.
Covering Indexes
An index that contains all columns needed to satisfy a query — no need to read the actual table rows. If a query selects only email and user_id and both are in the index, the database reads the index only. This is called an index-only scan and is significantly faster than touching the table heap.
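SQLite labels such plans explicitly as covering. A sketch with a hypothetical index on (user_id, email):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, email TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_user_email ON users (user_id, email)")

# The query touches only indexed columns, so the table itself is never read
rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT user_id, email FROM users WHERE user_id = 1"
).fetchall()
detail = " ".join(r[-1] for r in rows)
print("COVERING INDEX" in detail)  # True
```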
Index Trade-offs
Indexes speed up reads but slow down writes. Every INSERT, UPDATE, or DELETE that touches indexed columns must also update the index. A table with 10 indexes pays up to 10 extra index writes per row modification. Add indexes for columns that appear in WHERE, JOIN, and ORDER BY clauses on hot queries. Remove indexes that are never used.
Query Planning
When you submit a SQL query, the database does not execute it directly. The query planner (or optimizer) generates multiple possible execution plans and estimates the cost of each, then executes the cheapest one. Costs are estimated using table statistics (row counts, column cardinality, value histograms) that the database collects periodically.
Understanding query plans is the most important skill for diagnosing slow queries. In PostgreSQL, use EXPLAIN ANALYZE to see the actual plan and timing:
```sql
EXPLAIN ANALYZE
SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.country = 'US'
  AND o.created_at > '2024-01-01';
```
Key things to look for in a query plan:
- Seq Scan on a large table: the query is doing a full table scan. Usually means a missing index.
- Nested Loop with a large outer result set: O(N×M) complexity. Check if the inner table has an index on the join column.
- Hash Join: typically efficient for joining large tables with no usable index on the join key. Uses memory to build a hash table.
- Rows estimate vs. actual: large discrepancies mean stale statistics — run ANALYZE on the table.
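The statistics refresh can be seen in miniature in SQLite, where ANALYZE writes its results into the sqlite_stat1 table (PostgreSQL's ANALYZE similarly refreshes the pg_statistic catalog; table and index names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
conn.execute("CREATE INDEX idx_t_x ON t (x)")

conn.execute("ANALYZE")  # collect planner statistics

# sqlite_stat1 now holds one row per index: (table, index, row-count info)
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)
```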
Popular SQL Databases
| Database | Strengths | Best For |
|---|---|---|
| PostgreSQL | Full SQL compliance, JSONB, extensions, advanced indexing (GIN, GiST, BRIN), excellent MVCC | General purpose, complex queries, geospatial, full-text search, JSON workloads |
| MySQL / MariaDB | Widely deployed, InnoDB engine, good replication ecosystem, simple to operate | Web applications, read-heavy workloads, large ecosystem support |
| SQLite | Embedded, zero-config, single-file, no server process | Mobile apps, desktop apps, testing, small embedded databases |
| SQL Server | Deep Windows/Azure integration, business intelligence tooling | Enterprise Microsoft stacks, .NET applications |
| CockroachDB | Distributed SQL with horizontal scaling, strong consistency, Postgres-compatible wire protocol | Global applications needing SQL semantics with distributed scale |
| PlanetScale | Horizontal sharding on MySQL, online schema changes with no locking, branch-based deployments | MySQL workloads at large scale with zero-downtime migrations |
When to Use SQL
SQL databases are the right default for most systems. Reach for them when:
- Your data has clear relationships — users have orders, orders have line items, line items reference products. The relational model handles this naturally and efficiently.
- You need transactions across multiple entities — transferring money between accounts, placing an order that decrements inventory, any workflow where partial completion is unacceptable.
- You need ad-hoc queries — SQL’s expressiveness lets you answer questions you didn’t anticipate at schema design time. NoSQL stores often only support the access patterns you designed for.
- You need strong consistency — when reading stale data would cause correctness problems.
- Your schema is relatively stable — though modern SQL databases handle schema migrations well, frequent radical restructuring is more painful than with schemaless stores.
When in doubt, start with PostgreSQL. It handles relational workloads, JSONB for semi-structured data, full-text search via tsvector, geospatial queries via PostGIS, time-series via TimescaleDB extension, and analytical queries via columnar extensions. Most systems that reach for specialized NoSQL databases could have stayed on PostgreSQL longer than they thought. Add NoSQL only when PostgreSQL genuinely cannot meet a specific requirement.