Fact Table
A fact table is the central table in a Star Schema or Snowflake Schema, storing quantitative measures (metrics) and foreign keys to dimension tables. It represents the events or transactions being analyzed — each row is one measurable occurrence.
Structure
| Column Type | Purpose | Examples |
|---|---|---|
| Foreign keys | Link to dimensions (who, what, where, when) | date_key, product_key, store_key |
| Measures | Numeric values to aggregate | revenue, quantity, cost, duration |
| Degenerate dimensions | Dimension attributes without their own table | invoice_number, order_number |
Fact Types
| Type | Grain | Example |
|---|---|---|
| Transaction | One row per event | Each sale, each click, each shipment |
| Periodic snapshot | One row per time period | Daily account balance, monthly inventory |
| Accumulating snapshot | One row per entity lifecycle | Order tracking (ordered → shipped → delivered) |
Measure Types
- Additive — Can be summed across all dimensions (revenue, quantity). Most common.
- Semi-additive — Can be summed across some dimensions but not time (account balance — average over time, sum across accounts).
- Non-additive — Cannot be summed (ratios, percentages). Must be computed from additive components.
Granularity
Granularity defines what one row in the fact table represents — the most important design decision:
- Fine grain (e.g., individual transactions) → More detail, larger table, flexible drill-down
- Coarse grain (e.g., monthly summaries) → Smaller table, faster aggregation, but can’t drill to detail
Factors: business requirements (what questions must be answerable?), query performance, storage costs, data availability/quality.
Factless Fact Tables
A fact table with no measures — it records the occurrence of events or relationships between dimensions:
CREATE TABLE enrollment_fact (
student_id INT REFERENCES dim_student(id),
course_id INT REFERENCES dim_course(id),
term_id INT REFERENCES dim_term(id),
PRIMARY KEY (student_id, course_id, term_id)
-- No measure columns!
);Enables questions like “How many CS majors take 4+ courses in Fall 2025?” without querying the live OLTP database. Key advantage: can have indexes on all columns since factless fact tables are refreshed in batch (unlike operational tables where many indexes would slow writes during active registration).
Sparsity
When not all dimension combinations have facts, the table becomes sparse (many nulls/zeros):
| Design | Strategy | Trade-off |
|---|---|---|
| Sparse | Store only non-null records | Smaller table, more complex queries (outer joins, COALESCE) |
| Dense | Store all possible combinations with null placeholders | Simpler queries, much larger table |
| Hybrid | Dense for low-sparsity dimensions, sparse for high-sparsity | Balanced |
One Big Table (OBT) Pattern
An alternative to star schema — denormalize everything into a single wide table (hundreds of columns):
- Advantages: No joins, simpler queries, faster reads, easier data ingestion
- Disadvantages: Massive redundancy, consistency challenges, adding new dimensions requires schema change, limited governance
- Best for: Big data environments (Apache Hive, Spark, BigQuery) where joins are expensive across distributed nodes
- Significant limitation: Cannot add new dimensions without schema migration (unless using schema-less NoSQL, but then SQL analytics aren’t possible)
Performance Optimization
Seven techniques for fast analytical queries:
- Indexing — Composite indexes on frequently co-used dimension keys
- Partitioning — Time-based (year/quarter/month) or geography-based; see Database Sharding and Partitioning
- Aggregation tables — Separate pre-calculated summary tables at higher hierarchy levels
- Materialized views — Pre-computed results of frequent complex queries
- Query optimization — Use explain plans, filter early, proper join conditions
- Caching — Redis/Memcached for frequently accessed results
- Parallel processing — Distributed computing (Hadoop/Spark) for concurrent execution across partitions