Data Warehousing

A data warehouse is a centralized repository of integrated data from multiple sources, optimized for analytical queries and reporting rather than transactional processing. It stores historical, subject-oriented, time-variant data that supports business intelligence, decision-making, and OLAP operations. Data warehouses sit at the core of the ETL → Warehouse → BI pipeline.

Key Characteristics

Subject-Oriented — Organized around business subjects (sales, customers, products) rather than applications
Integrated — Data from disparate sources is cleaned, transformed, and unified into a consistent schema
Time-Variant — Stores historical data with timestamps; supports trend analysis across time periods
Non-Volatile — Data is loaded in bulk and rarely modified; optimized for read-heavy workloads

Data Warehouse vs Data Lake

Aspect	Data Warehouse	Data Lake
Data format	Structured, schema-on-write	Raw, schema-on-read
Processing	ETL (Extract, Transform, Load)	ELT (Extract, Load, Transform)
Schema	Star Schema / Snowflake Schema	No predefined schema
Users	Business analysts, BI tools	Data scientists, ML engineers
Query performance	Optimized (pre-aggregated)	Variable (depends on compute)
Cost	Higher storage (curated)	Lower storage (raw)

Architecture Approaches

Inmon (Top-Down)

Bill Inmon’s approach builds a centralized, normalized enterprise data warehouse first, then derives department-specific data marts from it. Formal definition: “A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decisions.”

Kimball (Bottom-Up)

Ralph Kimball’s approach builds dimensional data marts first (using star schemas), which collectively form the data warehouse via the Data Warehouse Bus Architecture — a set of conformed dimensions shared across marts.

Detailed Comparison

Aspect	Inmon (Top-Down)	Kimball (Bottom-Up)
Architecture	Enterprise-centric, single source of truth	Department-centric, Data Warehouse Bus
Data model	Highly normalized (3NF)	Denormalized dimensional (Star Schema)
Data marts	Dependent — derived from centralized warehouse	Independent — built first, aggregated later
Methodology	Long-term strategic, data governance focus	Agile, iterative, quick value delivery
Time to value	Longer (build warehouse first)	Faster (marts deliver value immediately)
Strengths	Governance, consistency, reduced redundancy, complex historical analysis	Rapid delivery, flexibility, user-friendly ad-hoc querying
Key tradeoff	Governance & control vs. business agility	Quality & consistency vs. user empowerment

Faster time to value
Star Schema / Snowflake Schema design
Data marts are independent
Easier for business users to query

ETL Pipeline

Extract — Pull data from source systems (OLTP databases, APIs, files)
Transform — Clean, deduplicate, conform, aggregate, and apply business rules
Load — Write transformed data into the warehouse

Modern alternatives:

ELT — Load raw data first (into a Data Lake or cloud warehouse), transform in place using SQL
Streaming ETL — Real-time ingestion with tools like Kafka, Apache Flink

ETL Integration Challenges

The Transform stage must resolve data quality issues from heterogeneous sources:

Category encoding conflicts — Gender stored as "m"/"f" in one system, "M"/"F"/"B"/"H"/"U"/"T" in another, 1/0 in a third
Field naming conflicts — Same concept stored under different column names across sources
Primary key selection — Choosing a unified key when sources use different identifier schemes
Duplicate data — Same record appearing in multiple sources
Missing data — Default values vs imputation strategies
Inconsistent values — Conflicting data for the same entity across sources
Backflushing — Updated, cleansed data is sometimes copied back to operational sources to improve source data quality

Database Models for Warehousing

Model	Examples	Best For
Columnar	Amazon Redshift, Snowflake, ClickHouse	Analytical aggregations over wide tables
Relational	PostgreSQL, MySQL	Smaller-scale warehouses
In-Memory	SAP HANA, MemSQL	Ultra-low latency analytics
NoSQL	Cassandra, HBase	Semi-structured high-volume data
Time-Series	InfluxDB, Prometheus, TimescaleDB	IoT, monitoring, financial ticks
NewSQL	Google Spanner, CockroachDB	SQL + horizontal scale with ACID
Distributed Processing	Apache Hadoop, Spark	Big data analytics, ML pipelines

Dimensional Modeling

The core design methodology for data warehouses (Kimball approach):

Fact Table — Central table containing quantitative measures (revenue, quantity, duration) and foreign keys to dimension tables
Dimension Table — Descriptive context for facts (who, what, where, when, why)
Star Schema — Fact table surrounded by denormalized dimension tables
Snowflake Schema — Star schema with normalized dimension tables

Quartz 5

Explorer

Data Warehousing

Data Warehousing

Key Characteristics

Data Warehouse vs Data Lake

Architecture Approaches

Inmon (Top-Down)

Kimball (Bottom-Up)

Detailed Comparison

ETL Pipeline

ETL Integration Challenges

Database Models for Warehousing

Dimensional Modeling

Sources

Graph View

Table of Contents

Backlinks