Query Processing and Optimization

Query processing is the sequence of steps a DBMS uses to translate a SQL statement into an efficient execution plan. The query optimizer selects the plan with the lowest estimated cost from among many equivalent Relational Algebra expressions. Understanding this pipeline is essential for writing performant queries and diagnosing slow ones with Explain Analyze.

Query Processing Pipeline

flowchart TD
    A[SQL Query] --> B[Parser]
    B -->|Syntax check, AST generation| C[Query Rewrite]
    C -->|View expansion, predicate simplification| D[Optimizer]
    D -->|Generate candidate plans, estimate costs, select best| E[Executor]
    E -->|Execute the chosen plan| F[Result Set]

Optimization Strategies

Rule-Based (Heuristic)

Apply transformation rules that are generally beneficial:

  • Push selections down — Filter rows as early as possible (reduces intermediate result size)
  • Push projections down — Drop unused columns early (reduces I/O and memory)
  • Convert subqueries to joins — Often more efficient than correlated subqueries
  • Predicate simplification — Combine and simplify WHERE conditions
  • Join reordering — Commutative and associative properties of joins allow reordering

Cost-Based

Estimate the cost of each candidate plan using:

  • Statistics — Table sizes, column cardinality, histogram distributions, null fractions
  • Cost model — Estimated I/O cost (disk pages read) + CPU cost (comparisons, hashing)
  • Access path selection — Sequential scan vs index scan vs index-only scan

The optimizer evaluates many plans and picks the cheapest. Statistics are maintained by ANALYZE (PostgreSQL) or auto-updated.

Join Algorithms

AlgorithmHow It WorksBest When
Nested LoopFor each row in outer table, scan inner tableSmall tables, indexed inner table
Hash JoinBuild hash table on smaller relation, probe with largerEqui-joins, no useful index
Sort-Merge JoinSort both inputs on join key, mergeBoth inputs already sorted or large
Index Nested LoopLike nested loop but inner table accessed via indexIndex exists on join column

Access Methods

  • Sequential (Full Table) Scan — Read every page. Used when no useful index or when most rows needed.
  • Index Scan — Traverse B-tree index to find matching rows, then fetch data pages. Efficient for selective queries.
  • Index-Only Scan — All needed columns are in the index; no heap access required. Fastest.
  • Bitmap Index Scan — Build a bitmap of matching pages from the index, then scan those pages. Good for medium selectivity or combining multiple indexes.

EXPLAIN and EXPLAIN ANALYZE

View the query plan without and with execution:

-- Show the plan (no execution)
EXPLAIN SELECT * FROM orders WHERE status = 'shipped';
 
-- Show the plan with actual execution times
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'shipped';

Key things to look for:

  • Scan type — Seq Scan vs Index Scan (is your index being used?)
  • Estimated vs actual rows — Large discrepancy indicates stale statistics
  • Sort operations — Disk-based sorts are expensive; consider adding indexes
  • Nested loops on large tables — May indicate a missing index

See Explain Analyze for detailed interpretation.

Common Optimization Techniques

Sources