Write-Ahead Logging (WAL)

Write-Ahead Logging is a protocol that ensures database recoverability by requiring that all modifications are written to a sequential log before being applied to the actual data pages on disk. WAL is the foundation of the ARIES Recovery Algorithm and is used by virtually every production DBMS including PostgreSQL, MySQL (InnoDB), SQLite, Oracle, and SQL Server.

The WAL Rule

No data page is written to disk until all log records describing changes to that page have been flushed to stable storage.

This single invariant guarantees that after a crash, the system can reconstruct the database state by replaying or undoing log records.

How It Works

  1. A transaction modifies data in the buffer pool (in-memory cache)
  2. A log record describing the change is appended to the WAL buffer
  3. Before the dirty data page can be flushed to disk, the WAL buffer must be flushed first (write-ahead guarantee)
  4. On COMMIT, the log records up to the commit point are forced to stable storage
  5. The actual data pages can be written to disk lazily (at checkpoint time or by the buffer manager)

Log Record Structure

Each log record typically contains:

FieldDescription
LSNLog Sequence Number — monotonically increasing identifier
Transaction IDWhich transaction generated this record
Page IDWhich data page was modified
Undo infoPrevious value (for rollback)
Redo infoNew value (for replay after crash)
Prev LSNPrevious log record for this transaction (forms a chain)

Recovery with WAL

After a crash, the recovery manager uses the log to:

  1. Redo all committed transactions whose data pages may not have been flushed (ensures durability)
  2. Undo all uncommitted transactions whose data pages may have been flushed (ensures atomicity)

This is the core of the ARIES Recovery Algorithm’s three-phase recovery (Analysis → Redo → Undo).

Checkpointing

Periodically, the DBMS writes a checkpoint record to the log and flushes all dirty pages. This bounds recovery time — only log records after the last checkpoint need to be processed.

  • Sharp Checkpoint — Pauses all transactions, flushes everything. Simple but causes downtime.
  • Fuzzy Checkpoint — Records the current state without pausing transactions. More complex but non-blocking. Used by ARIES.

SQLite WAL Mode

SQLite offers two journaling modes:

  • Rollback Journal (default) — The original page is copied to a journal file before modification. On crash, the journal restores original pages.
  • WAL Mode — Modifications go to a separate WAL file. Readers see a consistent snapshot from before the WAL. Enables concurrent reads during writes.
PRAGMA journal_mode = WAL;

WAL mode advantages in SQLite:

  • Readers don’t block writers
  • Writers don’t block readers
  • Faster for write-heavy workloads (sequential writes to WAL vs random writes to DB file)

MySQL / InnoDB

InnoDB uses a three-log architecture:

LogPurposeStructureUsed For
Redo logWAL for crash recoveryFixed-size circular buffer; records physical page changesRedo phase — replay committed changes not yet flushed to data files
Undo logRollback + MVCCRecords old row versions (before-images)Transaction rollback; MVCC read snapshots — readers follow undo chain to find their version
Binary logReplication + PITRLogical log of all DDL/DML statementsNot used for crash recovery directly; essential for primary→replica replication and point-in-time recovery

Redo Log Flow

graph LR
    TX[Transaction] -->|1. Write| RB[Redo Log Buffer]
    RB -->|2. Flush on COMMIT| RL[(Redo Log on Disk)]
    TX -->|3. Modify| BP[Buffer Pool - in memory]
    BP -->|4. Lazy flush| DF[(Data Files on Disk)]
    RL -->|Crash recovery| DF

On crash, InnoDB reads the redo log and replays any committed changes that weren’t flushed from the buffer pool to data files. The undo log then rolls back any uncommitted transactions. Long-running transactions cause the undo log to grow, which can degrade MVCC read performance — keep transactions short.

Sources