Query Processing and Optimization

Query processing is the sequence of steps a DBMS uses to translate a SQL statement into an efficient execution plan. The query optimizer selects the plan with the lowest estimated cost from among many equivalent Relational Algebra expressions. Understanding this pipeline is essential for writing performant queries and diagnosing slow ones with Explain Analyze.

Query Processing Pipeline

flowchart TD
    A[SQL Query] --> B[Parser]
    B -->|Syntax check, AST generation| C[Query Rewrite]
    C -->|View expansion, predicate simplification| D[Optimizer]
    D -->|Generate candidate plans, estimate costs, select best| E[Executor]
    E -->|Execute the chosen plan| F[Result Set]

Optimization Strategies

Rule-Based (Heuristic)

Apply transformation rules that are generally beneficial:

Push selections down — Filter rows as early as possible (reduces intermediate result size)
Push projections down — Drop unused columns early (reduces I/O and memory)
Convert subqueries to joins — Often more efficient than correlated subqueries
Predicate simplification — Combine and simplify WHERE conditions
Join reordering — Commutative and associative properties of joins allow reordering

Cost-Based

Estimate the cost of each candidate plan using:

Statistics — Table sizes, column cardinality, histogram distributions, null fractions
Cost model — Estimated I/O cost (disk pages read) + CPU cost (comparisons, hashing)
Access path selection — Sequential scan vs index scan vs index-only scan

The optimizer evaluates many plans and picks the cheapest. Statistics are maintained by ANALYZE (PostgreSQL) or auto-updated.

Join Algorithms

Algorithm	How It Works	Best When
Nested Loop	For each row in outer table, scan inner table	Small tables, indexed inner table
Hash Join	Build hash table on smaller relation, probe with larger	Equi-joins, no useful index
Sort-Merge Join	Sort both inputs on join key, merge	Both inputs already sorted or large
Index Nested Loop	Like nested loop but inner table accessed via index	Index exists on join column

Access Methods

Sequential (Full Table) Scan — Read every page. Used when no useful index or when most rows needed.
Index Scan — Traverse B-tree index to find matching rows, then fetch data pages. Efficient for selective queries.
Index-Only Scan — All needed columns are in the index; no heap access required. Fastest.
Bitmap Index Scan — Build a bitmap of matching pages from the index, then scan those pages. Good for medium selectivity or combining multiple indexes.

EXPLAIN and EXPLAIN ANALYZE

View the query plan without and with execution:

-- Show the plan (no execution)
EXPLAIN SELECT * FROM orders WHERE status = 'shipped';
 
-- Show the plan with actual execution times
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'shipped';

Key things to look for:

Scan type — Seq Scan vs Index Scan (is your index being used?)
Estimated vs actual rows — Large discrepancy indicates stale statistics
Sort operations — Disk-based sorts are expensive; consider adding indexes
Nested loops on large tables — May indicate a missing index

See Explain Analyze for detailed interpretation.

Common Optimization Techniques

Create indexes on columns used in WHERE, JOIN, and ORDER BY
Use CTEs or views for readability, but be aware of materialization behavior
Avoid SELECT * — project only needed columns
Use EXISTS over IN for correlated subqueries when only existence matters
Partition large tables — see Database Sharding and Partitioning
Keep statistics up to date (ANALYZE in PostgreSQL)

Quartz 5

Explorer

Query Processing and Optimization

Query Processing and Optimization

Query Processing Pipeline

Optimization Strategies

Rule-Based (Heuristic)

Cost-Based

Join Algorithms

Access Methods

EXPLAIN and EXPLAIN ANALYZE

Common Optimization Techniques

Sources

Graph View

Table of Contents

Backlinks