Data Warehousing
A data warehouse is a centralized repository of integrated data from multiple sources, optimized for analytical queries and reporting rather than transactional processing. It stores historical, subject-oriented, time-variant data that supports business intelligence, decision-making, and OLAP operations. Data warehouses sit at the core of the ETL → Warehouse → BI pipeline.
Key Characteristics
- Subject-Oriented — Organized around business subjects (sales, customers, products) rather than applications
- Integrated — Data from disparate sources is cleaned, transformed, and unified into a consistent schema
- Time-Variant — Stores historical data with timestamps; supports trend analysis across time periods
- Non-Volatile — Data is loaded in bulk and rarely modified; optimized for read-heavy workloads
Data Warehouse vs Data Lake
| Aspect | Data Warehouse | Data Lake |
|---|---|---|
| Data format | Structured, schema-on-write | Raw, schema-on-read |
| Processing | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
| Schema | Star Schema / Snowflake Schema | No predefined schema |
| Users | Business analysts, BI tools | Data scientists, ML engineers |
| Query performance | Optimized (pre-aggregated) | Variable (depends on compute) |
| Cost | Higher storage (curated) | Lower storage (raw) |
Architecture Approaches
Inmon (Top-Down)
Bill Inmon’s approach builds a centralized, normalized enterprise data warehouse first, then derives department-specific data marts from it. Formal definition: “A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decisions.”
Kimball (Bottom-Up)
Ralph Kimball’s approach builds dimensional data marts first (using star schemas), which collectively form the data warehouse via the Data Warehouse Bus Architecture — a set of conformed dimensions shared across marts.
Detailed Comparison
| Aspect | Inmon (Top-Down) | Kimball (Bottom-Up) |
|---|---|---|
| Architecture | Enterprise-centric, single source of truth | Department-centric, Data Warehouse Bus |
| Data model | Highly normalized (3NF) | Denormalized dimensional (Star Schema) |
| Data marts | Dependent — derived from centralized warehouse | Independent — built first, aggregated later |
| Methodology | Long-term strategic, data governance focus | Agile, iterative, quick value delivery |
| Time to value | Longer (build warehouse first) | Faster (marts deliver value immediately) |
| Strengths | Governance, consistency, reduced redundancy, complex historical analysis | Rapid delivery, flexibility, user-friendly ad-hoc querying |
| Key tradeoff | Governance & control vs. business agility | Quality & consistency vs. user empowerment |
- Faster time to value
- Star Schema / Snowflake Schema design
- Data marts are independent
- Easier for business users to query
ETL Pipeline
- Extract — Pull data from source systems (OLTP databases, APIs, files)
- Transform — Clean, deduplicate, conform, aggregate, and apply business rules
- Load — Write transformed data into the warehouse
Modern alternatives:
- ELT — Load raw data first (into a Data Lake or cloud warehouse), transform in place using SQL
- Streaming ETL — Real-time ingestion with tools like Kafka, Apache Flink
ETL Integration Challenges
The Transform stage must resolve data quality issues from heterogeneous sources:
- Category encoding conflicts — Gender stored as
"m"/"f"in one system,"M"/"F"/"B"/"H"/"U"/"T"in another,1/0in a third - Field naming conflicts — Same concept stored under different column names across sources
- Primary key selection — Choosing a unified key when sources use different identifier schemes
- Duplicate data — Same record appearing in multiple sources
- Missing data — Default values vs imputation strategies
- Inconsistent values — Conflicting data for the same entity across sources
- Backflushing — Updated, cleansed data is sometimes copied back to operational sources to improve source data quality
Database Models for Warehousing
| Model | Examples | Best For |
|---|---|---|
| Columnar | Amazon Redshift, Snowflake, ClickHouse | Analytical aggregations over wide tables |
| Relational | PostgreSQL, MySQL | Smaller-scale warehouses |
| In-Memory | SAP HANA, MemSQL | Ultra-low latency analytics |
| NoSQL | Cassandra, HBase | Semi-structured high-volume data |
| Time-Series | InfluxDB, Prometheus, TimescaleDB | IoT, monitoring, financial ticks |
| NewSQL | Google Spanner, CockroachDB | SQL + horizontal scale with ACID |
| Distributed Processing | Apache Hadoop, Spark | Big data analytics, ML pipelines |
Dimensional Modeling
The core design methodology for data warehouses (Kimball approach):
- Fact Table — Central table containing quantitative measures (revenue, quantity, duration) and foreign keys to dimension tables
- Dimension Table — Descriptive context for facts (who, what, where, when, why)
- Star Schema — Fact table surrounded by denormalized dimension tables
- Snowflake Schema — Star schema with normalized dimension tables