Write-Ahead Logging (WAL)
Write-Ahead Logging is a protocol that ensures database recoverability by requiring that all modifications are written to a sequential log before being applied to the actual data pages on disk. WAL is the foundation of the ARIES Recovery Algorithm and is used by virtually every production DBMS including PostgreSQL, MySQL (InnoDB), SQLite, Oracle, and SQL Server.
The WAL Rule
No data page is written to disk until all log records describing changes to that page have been flushed to stable storage.
This single invariant guarantees that after a crash, the system can reconstruct the database state by replaying or undoing log records.
How It Works
- A transaction modifies data in the buffer pool (in-memory cache)
- A log record describing the change is appended to the WAL buffer
- Before the dirty data page can be flushed to disk, the WAL buffer must be flushed first (write-ahead guarantee)
- On COMMIT, the log records up to the commit point are forced to stable storage
- The actual data pages can be written to disk lazily (at checkpoint time or by the buffer manager)
Log Record Structure
Each log record typically contains:
| Field | Description |
|---|---|
| LSN | Log Sequence Number — monotonically increasing identifier |
| Transaction ID | Which transaction generated this record |
| Page ID | Which data page was modified |
| Undo info | Previous value (for rollback) |
| Redo info | New value (for replay after crash) |
| Prev LSN | Previous log record for this transaction (forms a chain) |
Recovery with WAL
After a crash, the recovery manager uses the log to:
- Redo all committed transactions whose data pages may not have been flushed (ensures durability)
- Undo all uncommitted transactions whose data pages may have been flushed (ensures atomicity)
This is the core of the ARIES Recovery Algorithm’s three-phase recovery (Analysis → Redo → Undo).
Checkpointing
Periodically, the DBMS writes a checkpoint record to the log and flushes all dirty pages. This bounds recovery time — only log records after the last checkpoint need to be processed.
- Sharp Checkpoint — Pauses all transactions, flushes everything. Simple but causes downtime.
- Fuzzy Checkpoint — Records the current state without pausing transactions. More complex but non-blocking. Used by ARIES.
SQLite WAL Mode
SQLite offers two journaling modes:
- Rollback Journal (default) — The original page is copied to a journal file before modification. On crash, the journal restores original pages.
- WAL Mode — Modifications go to a separate WAL file. Readers see a consistent snapshot from before the WAL. Enables concurrent reads during writes.
PRAGMA journal_mode = WAL;WAL mode advantages in SQLite:
- Readers don’t block writers
- Writers don’t block readers
- Faster for write-heavy workloads (sequential writes to WAL vs random writes to DB file)
MySQL / InnoDB
InnoDB uses a three-log architecture:
| Log | Purpose | Structure | Used For |
|---|---|---|---|
| Redo log | WAL for crash recovery | Fixed-size circular buffer; records physical page changes | Redo phase — replay committed changes not yet flushed to data files |
| Undo log | Rollback + MVCC | Records old row versions (before-images) | Transaction rollback; MVCC read snapshots — readers follow undo chain to find their version |
| Binary log | Replication + PITR | Logical log of all DDL/DML statements | Not used for crash recovery directly; essential for primary→replica replication and point-in-time recovery |
Redo Log Flow
graph LR TX[Transaction] -->|1. Write| RB[Redo Log Buffer] RB -->|2. Flush on COMMIT| RL[(Redo Log on Disk)] TX -->|3. Modify| BP[Buffer Pool - in memory] BP -->|4. Lazy flush| DF[(Data Files on Disk)] RL -->|Crash recovery| DF
On crash, InnoDB reads the redo log and replays any committed changes that weren’t flushed from the buffer pool to data files. The undo log then rolls back any uncommitted transactions. Long-running transactions cause the undo log to grow, which can degrade MVCC read performance — keep transactions short.