Fact Table

A fact table is the central table in a Star Schema or Snowflake Schema, storing quantitative measures (metrics) and foreign keys to dimension tables. It represents the events or transactions being analyzed — each row is one measurable occurrence.

Structure

Column TypePurposeExamples
Foreign keysLink to dimensions (who, what, where, when)date_key, product_key, store_key
MeasuresNumeric values to aggregaterevenue, quantity, cost, duration
Degenerate dimensionsDimension attributes without their own tableinvoice_number, order_number

Fact Types

TypeGrainExample
TransactionOne row per eventEach sale, each click, each shipment
Periodic snapshotOne row per time periodDaily account balance, monthly inventory
Accumulating snapshotOne row per entity lifecycleOrder tracking (ordered → shipped → delivered)

Measure Types

  • Additive — Can be summed across all dimensions (revenue, quantity). Most common.
  • Semi-additive — Can be summed across some dimensions but not time (account balance — average over time, sum across accounts).
  • Non-additive — Cannot be summed (ratios, percentages). Must be computed from additive components.

Granularity

Granularity defines what one row in the fact table represents — the most important design decision:

  • Fine grain (e.g., individual transactions) → More detail, larger table, flexible drill-down
  • Coarse grain (e.g., monthly summaries) → Smaller table, faster aggregation, but can’t drill to detail

Factors: business requirements (what questions must be answerable?), query performance, storage costs, data availability/quality.

Factless Fact Tables

A fact table with no measures — it records the occurrence of events or relationships between dimensions:

CREATE TABLE enrollment_fact (
    student_id INT REFERENCES dim_student(id),
    course_id INT REFERENCES dim_course(id),
    term_id INT REFERENCES dim_term(id),
    PRIMARY KEY (student_id, course_id, term_id)
    -- No measure columns!
);

Enables questions like “How many CS majors take 4+ courses in Fall 2025?” without querying the live OLTP database. Key advantage: can have indexes on all columns since factless fact tables are refreshed in batch (unlike operational tables where many indexes would slow writes during active registration).

Sparsity

When not all dimension combinations have facts, the table becomes sparse (many nulls/zeros):

DesignStrategyTrade-off
SparseStore only non-null recordsSmaller table, more complex queries (outer joins, COALESCE)
DenseStore all possible combinations with null placeholdersSimpler queries, much larger table
HybridDense for low-sparsity dimensions, sparse for high-sparsityBalanced

One Big Table (OBT) Pattern

An alternative to star schema — denormalize everything into a single wide table (hundreds of columns):

  • Advantages: No joins, simpler queries, faster reads, easier data ingestion
  • Disadvantages: Massive redundancy, consistency challenges, adding new dimensions requires schema change, limited governance
  • Best for: Big data environments (Apache Hive, Spark, BigQuery) where joins are expensive across distributed nodes
  • Significant limitation: Cannot add new dimensions without schema migration (unless using schema-less NoSQL, but then SQL analytics aren’t possible)

Performance Optimization

Seven techniques for fast analytical queries:

  1. Indexing — Composite indexes on frequently co-used dimension keys
  2. Partitioning — Time-based (year/quarter/month) or geography-based; see Database Sharding and Partitioning
  3. Aggregation tables — Separate pre-calculated summary tables at higher hierarchy levels
  4. Materialized views — Pre-computed results of frequent complex queries
  5. Query optimization — Use explain plans, filter early, proper join conditions
  6. CachingRedis/Memcached for frequently accessed results
  7. Parallel processing — Distributed computing (Hadoop/Spark) for concurrent execution across partitions