SQL Databases
SQL databases have been the backbone of production systems for five decades. Despite the NoSQL wave, relational databases remain the default choice for most applications — and for good reason. When you need strong consistency, complex queries, and referential integrity, nothing else comes close. Understanding how they work internally lets you use them effectively and know exactly where their limits are.
What Is SQL?
SQL (Structured Query Language) is the standard language for interacting with relational databases. It covers four areas:
- DDL (Data Definition Language): Creating and modifying schema — CREATE TABLE, ALTER TABLE, DROP.
- DML (Data Manipulation Language): Reading and writing data — SELECT, INSERT, UPDATE, DELETE.
- DCL (Data Control Language): Access control — GRANT, REVOKE.
- TCL (Transaction Control Language): Transaction management — BEGIN, COMMIT, ROLLBACK.
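The categories are easiest to see side by side. A minimal sketch using Python's built-in sqlite3 module (SQLite has no user accounts, so DCL appears only as a comment; GRANT and REVOKE need a server database such as PostgreSQL, and the table and names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage transactions explicitly

# DDL: define the schema
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# TCL + DML: write data inside an explicit transaction
conn.execute("BEGIN")
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))
conn.execute("UPDATE users SET name = 'Ada Lovelace' WHERE name = 'Ada'")
conn.execute("COMMIT")

# DCL (server databases only): GRANT SELECT ON users TO reporting_role;

# DML: read it back
row = conn.execute("SELECT name FROM users").fetchone()
print(row[0])  # Ada Lovelace
```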
SQL is declarative: you describe what you want, not how to get it. The database engine figures out the most efficient execution strategy.
The Relational Model
A relational database organizes data into tables (also called relations). Each table has a fixed set of columns (attributes), and each row (tuple) represents one record. The schema enforces that every row in a table has the same columns, and each column has a defined data type.
Keys and Relationships
- Primary key: A column (or combination of columns) that uniquely identifies each row. Cannot be NULL. Indexed automatically. The physical row order in InnoDB is determined by the primary key (clustered index).
- Foreign key: A column in one table that references the primary key of another. Enforces referential integrity — you cannot insert a row with a foreign key value that doesn’t exist in the referenced table, and you cannot delete a referenced row without handling the dependent rows first.
- Unique key: Ensures all values in a column (or column set) are distinct. Unlike a primary key, unique columns can contain NULL (in most databases).
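All three kinds of keys fit in a small schema. A sketch using SQLite (table and column names are illustrative; note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when asked

conn.execute("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,  -- primary key: unique, non-NULL
        email TEXT UNIQUE           -- unique key: distinct values, NULL allowed
    )""")
conn.execute("""
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id)  -- foreign key
    )""")

conn.execute("INSERT INTO users (id, email) VALUES (1, 'foo@example.com')")
conn.execute("INSERT INTO orders (id, user_id) VALUES (1, 1)")  # OK: user 1 exists

# Referential integrity: an order pointing at a nonexistent user is rejected
try:
    conn.execute("INSERT INTO orders (id, user_id) VALUES (2, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```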
Joins
Joins combine rows from two or more tables based on a related column. They are the defining feature of the relational model — data is stored once (normalized) and assembled at query time:
- INNER JOIN: Returns rows that have matching values in both tables.
- LEFT JOIN: Returns all rows from the left table and matched rows from the right; NULL on the right if no match.
- RIGHT JOIN: Mirror of LEFT JOIN — returns all rows from the right table and matched rows from the left; NULL on the left if no match.
- FULL OUTER JOIN: Returns all rows from both tables, with NULLs where there is no match.
A common misconception is that joins are expensive. A well-indexed join between two tables returns results in milliseconds. Joins become slow when joining large tables without indexes, when the query plan chooses a nested-loop join over a hash join, or when returning millions of rows. The fix is almost always adding the right index or rewriting the query — not switching to a database that doesn’t have joins.
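The difference between INNER and LEFT JOIN is easiest to see on a toy dataset. A sketch in SQLite, with hypothetical users and orders tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER);
    INSERT INTO users  VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 50);
""")

# INNER JOIN: only users with a matching order
inner = conn.execute("""
    SELECT u.name, o.total FROM users u
    JOIN orders o ON o.user_id = u.id
""").fetchall()

# LEFT JOIN: every user; NULL (None in Python) where no order matches
left = conn.execute("""
    SELECT u.name, o.total FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    ORDER BY u.id
""").fetchall()

print(inner)  # [('Ada', 50)]
print(left)   # [('Ada', 50), ('Grace', None)]
```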
ACID Properties
SQL databases guarantee ACID properties for transactions. These four properties are what make relational databases trustworthy for financial, medical, and any other data where correctness is non-negotiable.
Atomicity
A transaction is all-or-nothing. If a transaction contains ten SQL statements and the ninth fails, none of the ten are committed — the database rolls back to its state before the transaction began. This prevents partial updates: a bank transfer that debits one account but never credits the other is impossible under atomicity.
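The bank-transfer case can be sketched directly. An illustrative SQLite example where a CHECK constraint makes the debit fail after the credit has already been applied; the rollback then undoes the credit too, so no partial transfer survives:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
conn.executescript("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    );
    INSERT INTO accounts VALUES (1, 100), (2, 0);
""")

# Transfer 150 out of an account holding only 100. The credit succeeds,
# the debit then violates the CHECK constraint, and ROLLBACK undoes the
# already-applied credit as well.
try:
    conn.execute("BEGIN")
    conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
    conn.execute("COMMIT")
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")

balances = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(1, 100), (2, 0)]
```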
Consistency
A transaction must transition the database from one valid state to another. All integrity constraints (foreign keys, unique indexes, check constraints, NOT NULL) must hold before the transaction commits. If a statement would violate a constraint, the transaction is rolled back.
Isolation
Concurrent transactions must not interfere with each other. Without isolation, you get anomalies:
- Dirty read: Reading uncommitted data from another transaction.
- Non-repeatable read: Reading the same row twice in a transaction and getting different values because another transaction modified it in between.
- Phantom read: A query returns different rows when re-run within a transaction because another transaction inserted or deleted rows matching the query.
SQL defines four isolation levels that trade off anomaly prevention for concurrency:
| Isolation Level | Dirty Read | Non-repeatable Read | Phantom Read |
|---|---|---|---|
| READ UNCOMMITTED | Possible | Possible | Possible |
| READ COMMITTED | Prevented | Possible | Possible |
| REPEATABLE READ | Prevented | Prevented | Possible |
| SERIALIZABLE | Prevented | Prevented | Prevented |
Most databases default to READ COMMITTED (PostgreSQL) or REPEATABLE READ (MySQL/InnoDB). SERIALIZABLE is the strictest — transactions execute as if they were run one at a time — but at the cost of lower concurrency. PostgreSQL implements SERIALIZABLE through Serializable Snapshot Isolation (SSI), which detects conflicts at commit time rather than blocking.
Durability
Once a transaction commits, it is permanent. The database writes the transaction to the Write-Ahead Log (WAL) on disk before acknowledging the commit. If the server crashes immediately after the commit acknowledgment, the transaction is recovered on restart by replaying the WAL. Durability is only as strong as your storage — losing the WAL (e.g. from a storage failure with fsync disabled) can corrupt the database.
Some guides suggest setting fsync=off in PostgreSQL or innodb_flush_log_at_trx_commit=0 in MySQL to speed up writes. This disables the durability guarantee: a system crash after a commit can leave your database in a corrupt, unrecoverable state. Only do this in disposable test environments where data loss is acceptable.
Indexes
An index is a separate data structure that the database maintains alongside the table to speed up data retrieval. Without an index, a query that filters by a non-primary-key column must scan every row in the table (a full table scan). With an index, the database jumps directly to the matching rows.
B-Tree Indexes
The default index type in virtually all relational databases. Data is stored in a balanced tree structure, sorted by the indexed column(s). B-trees support:
- Exact lookups: WHERE email = 'foo@example.com'
- Range queries: WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31'
- Prefix matches: WHERE name LIKE 'John%' (but not LIKE '%John')
- Sorting: ORDER BY indexed_column can use the index to avoid a sort step
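These capabilities are easy to verify from the query plan. A sketch using SQLite's EXPLAIN QUERY PLAN (index and table names are made up; the same checks work with EXPLAIN in PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows end with a human-readable detail string
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

exact = plan("SELECT * FROM users WHERE email = 'foo@example.com'")
rng = plan("SELECT * FROM users WHERE email BETWEEN 'a' AND 'b'")

# Both the exact lookup and the range query use the B-tree index
print("USING INDEX" in exact, "USING INDEX" in rng)  # True True
```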
Composite Indexes
An index on multiple columns. A composite index on (last_name, first_name) can be used for queries filtering by last_name alone, or by both last_name and first_name, but not by first_name alone. Column order matters — the index is sorted first by the leftmost column. This is the leftmost prefix rule.
Covering Indexes
An index that contains all columns needed to satisfy a query — no need to read the actual table rows. If a query selects only email and user_id and both are in the index, the database reads the index only. This is called an index-only scan and is significantly faster than touching the table heap.
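SQLite labels such plans explicitly as covering. A sketch with a hypothetical index on (user_id, email):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, email TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_user_email ON users (user_id, email)")

# The query touches only indexed columns, so the table itself is never read
rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT user_id, email FROM users WHERE user_id = 1"
).fetchall()
detail = " ".join(r[-1] for r in rows)
print("COVERING INDEX" in detail)  # True
```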
Index Trade-offs
Indexes speed up reads but slow down writes. Every INSERT, UPDATE, or DELETE that touches indexed columns must also update the index. A table with 10 indexes pays up to 10 extra index writes per row modification. Add indexes for columns that appear in WHERE, JOIN, and ORDER BY clauses on hot queries. Remove indexes that are never used.
Query Planning
When you submit a SQL query, the database does not execute it directly. The query planner (or optimizer) generates multiple possible execution plans and estimates the cost of each, then executes the cheapest one. Costs are estimated using table statistics (row counts, column cardinality, value histograms) that the database collects periodically.
Understanding query plans is the most important skill for diagnosing slow queries. In PostgreSQL, use EXPLAIN ANALYZE to see the actual plan and timing:
```sql
EXPLAIN ANALYZE
SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.country = 'US'
  AND o.created_at > '2024-01-01';
```
Key things to look for in a query plan:
- Seq Scan on a large table: the query is doing a full table scan. Usually means a missing index.
- Nested Loop with a large outer result set: O(N×M) complexity. Check if the inner table has an index on the join column.
- Hash Join: typically efficient for joining large tables with no usable index on the join key. Uses memory to build a hash table.
- Rows estimate vs. actual: large discrepancies mean stale statistics — run ANALYZE on the table.
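The statistics refresh can be seen in miniature in SQLite, where ANALYZE writes its results into the sqlite_stat1 table (PostgreSQL's ANALYZE similarly refreshes the pg_statistic catalog; table and index names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
conn.execute("CREATE INDEX idx_t_x ON t (x)")

conn.execute("ANALYZE")  # collect planner statistics

# sqlite_stat1 now holds one row per index: (table, index, row-count info)
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)
```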
Popular SQL Databases
| Database | Strengths | Best For |
|---|---|---|
| PostgreSQL | Full SQL compliance, JSONB, extensions, advanced indexing (GIN, GiST, BRIN), excellent MVCC | General purpose, complex queries, geospatial, full-text search, JSON workloads |
| MySQL / MariaDB | Widely deployed, InnoDB engine, good replication ecosystem, simple to operate | Web applications, read-heavy workloads, large ecosystem support |
| SQLite | Embedded, zero-config, single-file, no server process | Mobile apps, desktop apps, testing, small embedded databases |
| SQL Server | Deep Windows/Azure integration, business intelligence tooling | Enterprise Microsoft stacks, .NET applications |
| CockroachDB | Distributed SQL with horizontal scaling, strong consistency, Postgres-compatible wire protocol | Global applications needing SQL semantics with distributed scale |
| PlanetScale | Horizontal sharding on MySQL, online schema changes with no locking, branch-based deployments | MySQL workloads at large scale with zero-downtime migrations |
When to Use SQL
SQL databases are the right default for most systems. Reach for them when:
- Your data has clear relationships — users have orders, orders have line items, line items reference products. The relational model handles this naturally and efficiently.
- You need transactions across multiple entities — transferring money between accounts, placing an order that decrements inventory, any workflow where partial completion is unacceptable.
- You need ad-hoc queries — SQL’s expressiveness lets you answer questions you didn’t anticipate at schema design time. NoSQL stores often only support the access patterns you designed for.
- You need strong consistency — when reading stale data would cause correctness problems.
- Your schema is relatively stable — though modern SQL databases handle schema migrations well, frequent radical restructuring is more painful than with schemaless stores.
When in doubt, start with PostgreSQL. It handles relational workloads, JSONB for semi-structured data, full-text search via tsvector, geospatial queries via PostGIS, time-series via TimescaleDB extension, and analytical queries via columnar extensions. Most systems that reach for specialized NoSQL databases could have stayed on PostgreSQL longer than they thought. Add NoSQL only when PostgreSQL genuinely cannot meet a specific requirement.