1. Data Warehouse (DW)
A centralized repository designed for structured data (tables, rows, columns). Optimized for business intelligence (BI), reporting, and analytics.
Data type: Structured (relational, transactional, processed).
Schema: Schema-on-write (define schema before loading).
Cost: Expensive (compute + storage tightly coupled).
Use cases: Dashboards, trend analysis, financial reporting.
Examples: Snowflake, Amazon Redshift, Google BigQuery, Teradata.
2. Data Lake
A storage system that holds raw data of all types (structured, semi-structured, unstructured). Designed for flexibility and large-scale storage, not optimized for BI directly.
Data type: Structured (CSV, Parquet), semi-structured (JSON, XML), unstructured (logs, images, videos).
Schema: Schema-on-read (define schema when querying).
Cost: Cheaper (storage & compute decoupled).
Use cases: Data science, machine learning, big data analytics.
Examples: Amazon S3 + Athena, Azure Data Lake Storage, Hadoop HDFS.
3. Data Lakehouse
A hybrid approach that combines the best of both data lakes and warehouses. Stores all data types like a data lake, but supports structured querying and ACID transactions like a warehouse.
Data type: All types (structured + unstructured).
Schema: Flexible → supports both schema-on-write and schema-on-read.
Cost: More cost-effective than DW, scalable like DL.
Use cases: BI + ML/AI + advanced analytics in one platform.
Examples: Databricks Delta Lake, Apache Iceberg, Snowflake (newer versions).
Feature | Data Warehouse | Data Lake | Data Lakehouse |
---|---|---|---|
Data Types | Structured only | All (structured → raw) | All (structured + raw) |
Schema | Schema-on-write | Schema-on-read | Both |
Performance | High for BI | Slower (raw queries) | Optimized (BI + ML) |
Cost | High | Low | Medium/Low |
Best For | Business reporting | ML, AI, Big Data | Unified analytics |