Data Warehousing

A data warehouse is a centralized repository of integrated data from multiple sources, optimized for analytical queries and reporting rather than transactional processing. It stores historical, subject-oriented, time-variant data that supports business intelligence, decision-making, and OLAP operations. Data warehouses sit at the core of the ETL → Warehouse → BI pipeline.

Key Characteristics

  • Subject-Oriented — Organized around business subjects (sales, customers, products) rather than applications
  • Integrated — Data from disparate sources is cleaned, transformed, and unified into a consistent schema
  • Time-Variant — Stores historical data with timestamps; supports trend analysis across time periods
  • Non-Volatile — Data is loaded in bulk and rarely modified; optimized for read-heavy workloads

Data Warehouse vs Data Lake

AspectData WarehouseData Lake
Data formatStructured, schema-on-writeRaw, schema-on-read
ProcessingETL (Extract, Transform, Load)ELT (Extract, Load, Transform)
SchemaStar Schema / Snowflake SchemaNo predefined schema
UsersBusiness analysts, BI toolsData scientists, ML engineers
Query performanceOptimized (pre-aggregated)Variable (depends on compute)
CostHigher storage (curated)Lower storage (raw)

Architecture Approaches

Inmon (Top-Down)

Bill Inmon’s approach builds a centralized, normalized enterprise data warehouse first, then derives department-specific data marts from it. Formal definition: “A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decisions.”

Kimball (Bottom-Up)

Ralph Kimball’s approach builds dimensional data marts first (using star schemas), which collectively form the data warehouse via the Data Warehouse Bus Architecture — a set of conformed dimensions shared across marts.

Detailed Comparison

AspectInmon (Top-Down)Kimball (Bottom-Up)
ArchitectureEnterprise-centric, single source of truthDepartment-centric, Data Warehouse Bus
Data modelHighly normalized (3NF)Denormalized dimensional (Star Schema)
Data martsDependent — derived from centralized warehouseIndependent — built first, aggregated later
MethodologyLong-term strategic, data governance focusAgile, iterative, quick value delivery
Time to valueLonger (build warehouse first)Faster (marts deliver value immediately)
StrengthsGovernance, consistency, reduced redundancy, complex historical analysisRapid delivery, flexibility, user-friendly ad-hoc querying
Key tradeoffGovernance & control vs. business agilityQuality & consistency vs. user empowerment

ETL Pipeline

  1. Extract — Pull data from source systems (OLTP databases, APIs, files)
  2. Transform — Clean, deduplicate, conform, aggregate, and apply business rules
  3. Load — Write transformed data into the warehouse

Modern alternatives:

  • ELT — Load raw data first (into a Data Lake or cloud warehouse), transform in place using SQL
  • Streaming ETL — Real-time ingestion with tools like Kafka, Apache Flink

ETL Integration Challenges

The Transform stage must resolve data quality issues from heterogeneous sources:

  • Category encoding conflicts — Gender stored as "m"/"f" in one system, "M"/"F"/"B"/"H"/"U"/"T" in another, 1/0 in a third
  • Field naming conflicts — Same concept stored under different column names across sources
  • Primary key selection — Choosing a unified key when sources use different identifier schemes
  • Duplicate data — Same record appearing in multiple sources
  • Missing data — Default values vs imputation strategies
  • Inconsistent values — Conflicting data for the same entity across sources
  • Backflushing — Updated, cleansed data is sometimes copied back to operational sources to improve source data quality

Database Models for Warehousing

ModelExamplesBest For
ColumnarAmazon Redshift, Snowflake, ClickHouseAnalytical aggregations over wide tables
RelationalPostgreSQL, MySQLSmaller-scale warehouses
In-MemorySAP HANA, MemSQLUltra-low latency analytics
NoSQLCassandra, HBaseSemi-structured high-volume data
Time-SeriesInfluxDB, Prometheus, TimescaleDBIoT, monitoring, financial ticks
NewSQLGoogle Spanner, CockroachDBSQL + horizontal scale with ACID
Distributed ProcessingApache Hadoop, SparkBig data analytics, ML pipelines

Dimensional Modeling

The core design methodology for data warehouses (Kimball approach):

  • Fact Table — Central table containing quantitative measures (revenue, quantity, duration) and foreign keys to dimension tables
  • Dimension Table — Descriptive context for facts (who, what, where, when, why)
  • Star Schema — Fact table surrounded by denormalized dimension tables
  • Snowflake Schema — Star schema with normalized dimension tables

Sources