Query Processing and Optimization
Query processing is the sequence of steps a DBMS uses to translate a SQL statement into an efficient execution plan. The query optimizer selects the plan with the lowest estimated cost from among many equivalent Relational Algebra expressions. Understanding this pipeline is essential for writing performant queries and diagnosing slow ones with Explain Analyze.
Query Processing Pipeline
flowchart TD A[SQL Query] --> B[Parser] B -->|Syntax check, AST generation| C[Query Rewrite] C -->|View expansion, predicate simplification| D[Optimizer] D -->|Generate candidate plans, estimate costs, select best| E[Executor] E -->|Execute the chosen plan| F[Result Set]
Optimization Strategies
Rule-Based (Heuristic)
Apply transformation rules that are generally beneficial:
- Push selections down — Filter rows as early as possible (reduces intermediate result size)
- Push projections down — Drop unused columns early (reduces I/O and memory)
- Convert subqueries to joins — Often more efficient than correlated subqueries
- Predicate simplification — Combine and simplify WHERE conditions
- Join reordering — Commutative and associative properties of joins allow reordering
Cost-Based
Estimate the cost of each candidate plan using:
- Statistics — Table sizes, column cardinality, histogram distributions, null fractions
- Cost model — Estimated I/O cost (disk pages read) + CPU cost (comparisons, hashing)
- Access path selection — Sequential scan vs index scan vs index-only scan
The optimizer evaluates many plans and picks the cheapest. Statistics are maintained by ANALYZE (PostgreSQL) or auto-updated.
Join Algorithms
| Algorithm | How It Works | Best When |
|---|---|---|
| Nested Loop | For each row in outer table, scan inner table | Small tables, indexed inner table |
| Hash Join | Build hash table on smaller relation, probe with larger | Equi-joins, no useful index |
| Sort-Merge Join | Sort both inputs on join key, merge | Both inputs already sorted or large |
| Index Nested Loop | Like nested loop but inner table accessed via index | Index exists on join column |
Access Methods
- Sequential (Full Table) Scan — Read every page. Used when no useful index or when most rows needed.
- Index Scan — Traverse B-tree index to find matching rows, then fetch data pages. Efficient for selective queries.
- Index-Only Scan — All needed columns are in the index; no heap access required. Fastest.
- Bitmap Index Scan — Build a bitmap of matching pages from the index, then scan those pages. Good for medium selectivity or combining multiple indexes.
EXPLAIN and EXPLAIN ANALYZE
View the query plan without and with execution:
-- Show the plan (no execution)
EXPLAIN SELECT * FROM orders WHERE status = 'shipped';
-- Show the plan with actual execution times
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'shipped';Key things to look for:
- Scan type — Seq Scan vs Index Scan (is your index being used?)
- Estimated vs actual rows — Large discrepancy indicates stale statistics
- Sort operations — Disk-based sorts are expensive; consider adding indexes
- Nested loops on large tables — May indicate a missing index
See Explain Analyze for detailed interpretation.
Common Optimization Techniques
- Create indexes on columns used in WHERE, JOIN, and ORDER BY
- Use CTEs or views for readability, but be aware of materialization behavior
- Avoid
SELECT *— project only needed columns - Use
EXISTSoverINfor correlated subqueries when only existence matters - Partition large tables — see Database Sharding and Partitioning
- Keep statistics up to date (
ANALYZEin PostgreSQL)