Normalization
Normalization is the process of organizing a relational database to reduce redundancy and prevent update anomalies. It decomposes relations based on Functional Dependencies to satisfy progressively stricter normal forms. The guiding principles:
- Do Not Replicate Data, instead reference them. Point at data.
- Use integers for keys and for references.
- Add a special
keycolumn to each table, which you will make references to.
Why Normalize
Without normalization, a poorly designed table can suffer from:
- Insertion anomaly — Cannot add data without unrelated data (e.g., can’t add a department without an employee)
- Update anomaly — Changing one fact requires updating multiple rows (e.g., renaming a department across all employee rows)
- Deletion anomaly — Deleting data loses unrelated facts (e.g., deleting the last employee loses the department info)
Normal Forms
First Normal Form (1NF)
Every attribute contains only atomic (indivisible) values. No repeating groups, no arrays, no comma-separated lists in a single column.
-- Violates 1NF (multi-valued)
| student | courses |
|---------|------------------|
| Alice | CS5200, CS5800 |
-- Satisfies 1NF
| student | course |
|---------|---------|
| Alice | CS5200 |
| Alice | CS5800 |
Second Normal Form (2NF)
Satisfies 1NF and every non-key attribute is fully functionally dependent on the entire Primary Key. No partial dependencies — relevant only when the primary key is a Composite Key.
-- Violates 2NF (student_name depends only on student_id, not the full key)
PK: (student_id, course_id)
student_id | course_id | student_name | grade
-- Fix: decompose
Students(student_id PK, student_name)
Enrollments(student_id FK, course_id FK, grade)
Third Normal Form (3NF)
Satisfies 2NF and no non-key attribute is transitively dependent on the Primary Key. Every non-key attribute depends directly on the key and nothing but the key.
-- Violates 3NF (dept_name depends on dept_id, not directly on emp_id)
emp_id PK | dept_id | dept_name
-- Fix: decompose
Employees(emp_id PK, dept_id FK)
Departments(dept_id PK, dept_name)
Boyce-Codd Normal Form (BCNF)
A stricter version of 3NF. For every non-trivial functional dependency X → Y, X must be a Superkey. Handles edge cases where 3NF allows certain anomalies with overlapping candidate keys.
3NF vs BCNF — the difference: 3NF allows a non-superkey determinant if the dependent attribute is a prime attribute (part of a Candidate Key). BCNF does not — every determinant must be a Superkey, period.
R(StudentID, CourseCode, InstructorName)
FDs: {StudentID, CourseCode} → InstructorName
InstructorName → CourseCode
-- 3NF? YES — InstructorName → CourseCode is okay because CourseCode is prime
-- BCNF? NO — InstructorName is not a superkey
-- Fix: decompose
R1(InstructorName, CourseCode) -- InstructorName is now a key
R2(StudentID, CourseCode) -- enrollment relation
Note: a well-designed database based on a sound conceptual model is most often already in 3NF. Experienced practitioners verify normal forms through inspection rather than formally computing functional dependencies.
Fourth Normal Form (4NF)
Satisfies BCNF and no multivalued dependencies unless the determinant is a Superkey. Addresses independent multi-valued facts about an entity.
Fifth Normal Form (5NF)
Satisfies 4NF and every join dependency is implied by the candidate keys. Ensures lossless decomposition for all possible joins.
graph TD UNF[Unnormalized] --> NF1[1NF - Atomic values] NF1 --> NF2[2NF - No partial dependencies] NF2 --> NF3[3NF - No transitive dependencies] NF3 --> BCNF[BCNF - Every determinant is a superkey] BCNF --> NF4[4NF - No multivalued dependencies] NF4 --> NF5[5NF - All join dependencies implied by keys]
Denormalization
Deliberately introducing redundancy to improve read performance at the cost of write complexity. Common in:
- Data Warehousing / OLAP — star schemas denormalize dimensions for fast aggregation
- Caching layers — Materialized aggregates
- High-read workloads where joins become bottlenecks
The decision to denormalize should be driven by measured performance data (see Explain Analyze), not assumptions.