Data Lakes vs Data Warehouses: Key Differences
After completing this topic, you will be able to:
- Differentiate between OLTP, OLAP, data lakes, and data warehouses
- Analyze star schema vs snowflake schema trade-offs for analytical workloads
- Compare data lake and data warehouse architectures for different business needs
TL;DR
Data lakes store raw, unstructured data in flat object storage for exploratory analytics and ML, while data warehouses store structured, schema-on-write data optimized for fast SQL queries and BI dashboards. OLTP databases handle transactional workloads (writes, updates, ACID), while OLAP systems handle analytical workloads (aggregations, scans, read-heavy). Modern architectures often combine both: raw data lands in lakes, transformed data moves to warehouses, and lakehouses attempt to unify both paradigms.
Cheat Sheet: Data Lake = schema-on-read, cheap storage, flexible but slower queries. Data Warehouse = schema-on-write, expensive storage, fast structured queries. OLTP = row-oriented, normalized, transactional. OLAP = column-oriented, denormalized, analytical.
Context
When Netflix wants to analyze viewing patterns across 200 million users, they don’t query their production MySQL databases—they’d bring the system to its knees. Instead, they replicate data into analytical systems optimized for scanning billions of rows. This is the fundamental split between operational and analytical data systems, and understanding it is critical for any system design interview involving data at scale.
The choice between data lakes and warehouses isn’t academic—it determines your storage costs, query performance, data governance, and team productivity. Amazon stores petabytes of raw clickstream data in S3 (their data lake) but loads aggregated metrics into Redshift (their warehouse) for executive dashboards. Uber’s data scientists explore raw trip data in their lake but business analysts query pre-modeled tables in their warehouse. The distinction matters because each architecture makes different trade-offs between flexibility, performance, and cost.
Interviewers ask about this topic because it reveals whether you understand the entire data lifecycle: how transactional data flows from OLTP systems into analytical stores, when to transform data (ETL vs ELT), and how to design schemas for analytical queries. A senior engineer should know that OLTP databases use row-oriented storage with B-trees for fast point lookups, while OLAP systems use columnar storage with compression for fast aggregations. The architecture you choose affects everything from query latency to monthly AWS bills.
Side-by-Side Comparison
OLTP vs OLAP
| Dimension | OLTP (Transactional) | OLAP (Analytical) |
|---|---|---|
| Primary Use | Handle business transactions (orders, payments, user updates) | Answer business questions (revenue trends, user cohorts, forecasts) |
| Query Pattern | Point lookups, small updates (SELECT/UPDATE single rows) | Large scans, aggregations (SUM/AVG across millions of rows) |
| Storage | Row-oriented (entire row stored together) | Column-oriented (each column stored separately) |
| Schema | Normalized (3NF) to avoid update anomalies | Denormalized (star/snowflake) for query performance |
| Example | PostgreSQL storing e-commerce orders | Snowflake analyzing quarterly sales by region |
| When Better | User-facing apps needing <100ms response, frequent writes | Batch reports, dashboards, ML training needing to scan TBs |
Data Lake vs Data Warehouse
| Dimension | Data Lake | Data Warehouse |
|---|---|---|
| Schema | Schema-on-read (applied when querying) | Schema-on-write (enforced when loading) |
| Data Types | Raw, unstructured (logs, JSON, images, video) | Structured, cleaned, business-ready tables |
| Storage | Object storage (S3, ADLS) with flat namespace | Proprietary columnar formats (Redshift, BigQuery) |
| Cost | Very cheap (~$0.023/GB/month on S3) | Higher total cost (storage is comparable at ~$23/TB/month, but per-second compute billing dominates the bill) |
| Query Speed | Slow (scan raw files, no indexes) | Fast (optimized storage, statistics, caching) |
| Users | Data scientists, ML engineers exploring data | Business analysts, executives running dashboards |
| Example | Netflix storing raw viewing logs in S3 | Netflix aggregating daily watch hours by title in Redshift |
| When Better | Exploratory analysis, ML feature engineering, archival | Repeated queries, BI tools, regulatory reporting |
Star Schema vs Snowflake Schema
| Dimension | Star Schema | Snowflake Schema |
|---|---|---|
| Structure | Fact table surrounded by denormalized dimension tables | Fact table with normalized dimension tables (sub-dimensions) |
| Joins | Single join from fact to dimension | Multiple joins through dimension hierarchies |
| Storage | More redundant (dimension data duplicated) | Less redundant (dimensions normalized) |
| Query Speed | Faster (fewer joins, better for BI tools) | Slower (more joins, complex execution plans) |
| Example | Sales fact table joins directly to Product dimension | Sales fact joins Product, which joins Category, which joins Department |
| When Better | BI dashboards, simple aggregations, query performance critical | Complex hierarchies, storage optimization, data integrity important |
OLTP vs OLAP: Storage and Query Patterns
graph LR
subgraph OLTP System
A["User Request<br/><i>UPDATE user SET email=...</i>"]
B["Row-Oriented Storage<br/><i>B-Tree Index</i>"]
C[("PostgreSQL<br/>User Table")]
D["Single Row<br/><i>user_id: 123</i><br/>name, email, address..."]
end
subgraph OLAP System
E["Analytics Query<br/><i>SELECT AVG(amount) FROM sales</i>"]
F["Columnar Storage<br/><i>Compressed Columns</i>"]
G[("Snowflake<br/>Sales Fact")]
H["Amount Column Only<br/><i>Skip other 20 columns</i><br/>Scan 10M values"]
end
A --"1. Point Lookup"--> B
B --"2. Read Row"--> C
C --"3. Return"--> D
E --"1. Scan Column"--> F
F --"2. Read Compressed"--> G
G --"3. Aggregate"--> H
OLTP systems use row-oriented storage with B-tree indexes for fast point lookups of entire rows, while OLAP systems use columnar storage to scan only relevant columns across millions of rows. This fundamental difference in storage layout drives the performance characteristics of each system.
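To make the storage-layout difference concrete, here is a minimal sketch (toy data, not a real database engine): the same table held row-wise and column-wise, with an OLTP-style point lookup against the rows and an OLAP-style aggregation that touches only one column.

```python
# Illustrative sketch: the same table laid out row-wise and column-wise,
# showing why point lookups favor rows and aggregations favor columns.

# Row-oriented: each record stored together -- ideal for fetching one entity.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 120.0, "region": "EU"},
    {"order_id": 2, "customer": "bo",  "amount": 80.0,  "region": "US"},
    {"order_id": 3, "customer": "cy",  "amount": 100.0, "region": "EU"},
]

# Column-oriented: each column stored separately -- ideal for scans.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["ana", "bo", "cy"],
    "amount":   [120.0, 80.0, 100.0],
    "region":   ["EU", "US", "EU"],
}

# OLTP-style point lookup: fetch one whole record by key.
order_2 = next(r for r in rows if r["order_id"] == 2)

# OLAP-style aggregation: read only the one column the query needs,
# skipping every other column entirely.
avg_amount = sum(columns["amount"]) / len(columns["amount"])

print(order_2["customer"])  # bo
print(avg_amount)           # 100.0
```

In a real engine the row layout is backed by a B-tree index and the column layout by compressed, block-aligned segments, but the access-pattern asymmetry is exactly this.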
Deep Analysis
OLTP: The Transactional Workhorse
OLTP databases like PostgreSQL and MySQL are optimized for the write-heavy, low-latency workloads that power user-facing applications. They store data in row-oriented format because transactions typically touch entire rows: when you update a user’s email address, you need the entire user record. B-tree indexes provide O(log n) lookups for primary keys, and ACID guarantees ensure your bank transfer either completes fully or rolls back entirely.
The problem is that OLTP databases are terrible at analytical queries. Scanning millions of rows to calculate average order value requires reading entire rows even though you only need two columns (order_id and amount). The row-oriented storage means you’re reading 90% irrelevant data. Worse, running heavy analytics on production databases starves transactional queries of resources—your checkout page slows down because someone’s running a quarterly report.
OLAP: Built for Analytics
OLAP systems like Snowflake, BigQuery, and Redshift solve this by storing data in columnar format. Each column is stored separately and heavily compressed (sorted timestamps often compress ~10x; low-cardinality repeated values can compress 100x or more). When you calculate SUM(revenue), the database reads only the revenue column, not entire rows. Columnar storage also enables vectorized execution: modern CPUs process many values per instruction using SIMD (for example, 16 32-bit integers per AVX-512 instruction).
The trade-off is that OLAP systems are slow at updates. Updating a single row requires rewriting entire column chunks. That’s why OLAP databases are append-only: you load data in batches (hourly, daily) rather than updating individual records. For detailed normalization concepts, see Databases Overview, but the key insight is that OLAP systems intentionally denormalize data to avoid expensive joins during queries.
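The compression claim above is easy to see with run-length encoding, one of the simplest schemes columnar stores apply to low-cardinality columns. A minimal sketch:

```python
# Minimal run-length encoding sketch: the kind of compression that makes
# low-cardinality columns (e.g. "region") shrink dramatically when stored
# column-wise, where runs of equal adjacent values are long.
def rle_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

region_column = ["EU"] * 1000 + ["US"] * 500 + ["EU"] * 250
runs = rle_encode(region_column)

print(len(region_column))  # 1750 raw values
print(len(runs))           # 3 runs
print(runs)                # [['EU', 1000], ['US', 500], ['EU', 250]]
```

This also explains why warehouses sort data within blocks: sorted columns maximize run lengths, and the same principle (applied per column, with dictionary and delta encodings layered on) is why a row-oriented layout, which interleaves unrelated values, compresses far worse.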
Data Lakes: Flexibility at Scale
Data lakes emerged when companies realized they were throwing away valuable data because it didn’t fit warehouse schemas. Netflix receives terabytes of device telemetry daily—playback errors, network conditions, A/B test events. Defining schemas upfront is impossible when you don’t know what questions you’ll ask.
A data lake stores raw data in object storage (S3, Azure Data Lake) organized by ingestion date: s3://lake/events/2024/01/15/. Data is typically partitioned by date for efficient scanning. The “schema-on-read” approach means you define structure when querying: Spark reads JSON files and infers schema, Athena queries Parquet files using SQL. This flexibility enables exploratory analysis—data scientists can experiment with raw data without waiting for ETL pipelines.
The downside is governance chaos. Without enforced schemas, data quality degrades: field names change, types drift, documentation disappears. Query performance suffers because you’re scanning raw files without indexes or statistics. This is why companies build “data lake zones”: raw data lands in Bronze, cleaned data moves to Silver, business-ready aggregates live in Gold. Each zone adds structure while preserving raw data for reprocessing.
Data Warehouses: Performance and Governance
Data warehouses enforce schema-on-write: data is validated, cleaned, and transformed before loading. This upfront cost pays dividends in query performance and data quality. Redshift pre-computes statistics, builds zone maps (min/max values per block), and materializes aggregates. Queries that would scan terabytes in a lake complete in seconds in a warehouse.
The star schema is the workhorse of warehouse design. A central fact table stores metrics (sales_amount, quantity) with foreign keys to dimension tables (product, customer, date). Dimensions are denormalized: the product table includes category_name and department_name directly rather than joining to separate tables. This denormalization trades storage for query speed—exactly what analytics needs. For more on denormalization strategies, see Denormalization.
Snowflake schemas normalize dimensions into hierarchies (product → category → department), reducing redundancy but requiring more joins. They’re rare in practice because storage is cheap and query performance matters more. The exception is when dimension hierarchies change frequently—normalizing makes updates easier.
ETL vs ELT: Where Transformation Happens
ETL (Extract-Transform-Load) transforms data before loading into the warehouse. You run Spark jobs that clean data, join tables, and aggregate metrics, then load results into Redshift. This was necessary when warehouse compute was expensive—you offloaded transformation to cheaper Hadoop clusters.
ELT (Extract-Load-Transform) loads raw data into the warehouse first, then transforms using SQL. Modern warehouses like Snowflake and BigQuery have cheap, elastic compute, making in-warehouse transformation viable. ELT is simpler (no separate Spark cluster) and more flexible (analysts can re-transform without re-extracting). The trade-off is that you’re storing more raw data in expensive warehouse storage rather than cheap S3.
Star Schema vs Snowflake Schema Structure
graph TB
subgraph Star Schema - Denormalized
SF["Sales Fact<br/><i>order_id, product_id, customer_id</i><br/>amount, quantity, date"]
PD["Product Dimension<br/><i>product_id, name</i><br/>category_name, brand_name<br/>department_name"]
CD["Customer Dimension<br/><i>customer_id, name</i><br/>city, state, country"]
DD["Date Dimension<br/><i>date_id, date</i><br/>month, quarter, year"]
SF --"Single Join"--> PD
SF --"Single Join"--> CD
SF --"Single Join"--> DD
end
subgraph Snowflake Schema - Normalized
SF2["Sales Fact<br/><i>order_id, product_id, customer_id</i><br/>amount, quantity, date"]
PD2["Product<br/><i>product_id, name</i><br/>category_id"]
CAT["Category<br/><i>category_id, name</i><br/>department_id"]
DEPT["Department<br/><i>department_id, name</i>"]
CD2["Customer<br/><i>customer_id, name</i><br/>city_id"]
CITY["City<br/><i>city_id, name</i><br/>state_id"]
SF2 --"Join 1"--> PD2
PD2 --"Join 2"--> CAT
CAT --"Join 3"--> DEPT
SF2 --"Join 1"--> CD2
CD2 --"Join 2"--> CITY
end
Star schema denormalizes dimensions for single-join queries (faster, more storage), while snowflake schema normalizes dimensions into hierarchies (slower queries, less redundancy). Star schema is the default choice for most analytical workloads where query performance outweighs storage costs.
Data Lake Architecture with Processing Zones
graph LR
subgraph Data Sources
APP["Application Logs<br/><i>JSON</i>"]
DB["Database CDC<br/><i>Change Events</i>"]
API["External APIs<br/><i>Third-party Data</i>"]
end
subgraph Bronze Zone - Raw Data
S3B[("S3 Bronze<br/><i>s3://lake/raw/2024/01/15/</i><br/>Schema-on-read<br/>Partitioned by date")]
end
subgraph Silver Zone - Cleaned Data
SPARK1["Spark Job<br/><i>Validate, Dedupe</i><br/>Parse JSON"]
S3S[("S3 Silver<br/><i>Parquet format</i><br/>Cleaned & Validated<br/>Typed columns")]
end
subgraph Gold Zone - Business Ready
SPARK2["Spark Job<br/><i>Aggregate, Join</i><br/>Business Logic"]
S3G[("S3 Gold<br/><i>Denormalized tables</i><br/>Pre-aggregated metrics<br/>Ready for BI")]
end
DW[("Data Warehouse<br/><i>Snowflake/Redshift</i><br/>Fast SQL queries")]
BI["BI Tools<br/><i>Tableau, Looker</i>"]
ML["ML Pipelines<br/><i>Feature Engineering</i>"]
APP & DB & API --"1. Ingest Raw"--> S3B
S3B --"2. Clean & Validate"--> SPARK1
SPARK1 --"3. Write Parquet"--> S3S
S3S --"4. Transform"--> SPARK2
SPARK2 --"5. Write Aggregates"--> S3G
S3G --"6. Load"--> DW
DW --"7. Query"--> BI
S3S & S3G --"Read Raw/Processed"--> ML
Data lakes organize data into Bronze (raw ingestion), Silver (cleaned and validated), and Gold (business-ready aggregates) zones. This medallion architecture preserves raw data for reprocessing while providing curated datasets for analytics and ML. Data scientists explore Silver/Gold zones while BI tools query the warehouse loaded from Gold.
ETL vs ELT: Where Transformation Happens
graph TB
subgraph ETL - Transform Before Load
SRC1[("Source DB<br/><i>PostgreSQL</i>")]
EXT1["Extract<br/><i>Read tables</i>"]
SPARK["Transform<br/><i>Spark Cluster</i><br/>Clean, Join, Aggregate<br/>Heavy compute"]
LOAD1["Load<br/><i>INSERT cleaned data</i>"]
WH1[("Warehouse<br/><i>Redshift</i><br/>Only final tables")]
SRC1 --"1. Extract"--> EXT1
EXT1 --"2. Raw data"--> SPARK
SPARK --"3. Transformed"--> LOAD1
LOAD1 --"4. Load clean"--> WH1
end
subgraph ELT - Transform After Load
SRC2[("Source DB<br/><i>PostgreSQL</i>")]
EXT2["Extract<br/><i>Read tables</i>"]
LOAD2["Load<br/><i>INSERT raw data</i>"]
WH2[("Warehouse<br/><i>Snowflake</i><br/>Raw + transformed")]
SQL["Transform<br/><i>SQL in warehouse</i><br/>CREATE TABLE AS SELECT<br/>Elastic compute"]
SRC2 --"1. Extract"--> EXT2
EXT2 --"2. Raw data"--> LOAD2
LOAD2 --"3. Load raw"--> WH2
WH2 --"4. Transform SQL"--> SQL
SQL --"5. Materialized views"--> WH2
end
NOTE1["ETL: Offload compute to Spark<br/>when warehouse is expensive<br/>Less warehouse storage"]
NOTE2["ELT: Use warehouse compute<br/>when elastic and cheap<br/>Simpler, more flexible"]
ETL transforms data in external compute (Spark) before loading into the warehouse, minimizing warehouse storage and compute costs. ELT loads raw data first and transforms using SQL in the warehouse, leveraging elastic compute and enabling faster iteration. Modern cloud warehouses with cheap, scalable compute favor ELT for simplicity.
Decision Framework
Choose OLTP When:
- You need sub-100ms point lookups by primary key (user profile, order details)
- Workload is write-heavy with frequent updates (inventory counts, account balances)
- ACID guarantees are critical (financial transactions, booking systems)
- Data size is <1TB and fits on a single powerful instance
- Example: E-commerce checkout, social media posts, real-time bidding
Choose OLAP When:
- Queries scan millions of rows for aggregations (revenue by region, user cohorts)
- Workload is read-heavy with batch updates (nightly ETL, hourly refreshes)
- You need to join large fact tables with dimensions (sales × products × customers)
- Data size is >10TB requiring distributed storage
- Example: Executive dashboards, quarterly reports, churn analysis
Choose Data Lake When:
- Data is unstructured or semi-structured (logs, JSON, images, video)
- Schema is unknown or evolving (exploratory analysis, new data sources)
- Storage cost is primary concern (archival, compliance, long-term retention)
- Users are data scientists comfortable with Spark/Python
- You need to preserve raw data for reprocessing (ML feature engineering)
- Example: Netflix device telemetry, Uber trip GPS traces, genomics research
Choose Data Warehouse When:
- Data is structured and schema is stable (sales, customers, products)
- Query performance is critical (sub-second dashboard loads)
- Users are business analysts using BI tools (Tableau, Looker)
- Data governance and quality are important (regulatory reporting)
- You need complex joins and aggregations (star schema queries)
- Example: Amazon retail analytics, Stripe payment reporting, Airbnb booking metrics
Hybrid Approach (Most Common):
Modern architectures use both: raw data lands in S3 (lake), transformed data loads into Snowflake (warehouse), and BI tools query the warehouse while ML pipelines read the lake. This “lakehouse” pattern combines flexibility and performance. Companies like Databricks and Snowflake are converging: Databricks added Delta Lake for ACID on S3, Snowflake added external tables to query S3 directly.
Schema Design Decision:
- Star Schema: Default choice for most warehouses. Use when query performance matters more than storage efficiency. Denormalize dimensions fully.
- Snowflake Schema: Only when dimension hierarchies change frequently (organizational structures, product categories) and you need to update once rather than in many places. Accept slower queries.
Real-World Examples
Netflix: Lake-to-Warehouse Pipeline
Netflix ingests 500TB of viewing data daily into S3 (their data lake). Raw events include every play, pause, seek, and error from 200 million users. Data scientists use Spark to explore this raw data for ML models (recommendation algorithms, encoding optimization). Meanwhile, ETL jobs aggregate data into hourly metrics (watch time by title, completion rates by region) and load into their data warehouse. Business analysts query the warehouse using Tableau for executive dashboards showing daily active users and content performance. The interesting detail: Netflix keeps 7 years of raw data in S3 for $0.023/GB/month but only 90 days of aggregated data in their expensive warehouse. When they need to reprocess historical data (new ML feature, bug fix), they re-run Spark jobs against S3.
Amazon Retail: Star Schema at Scale
Amazon’s retail data warehouse uses a massive star schema with a central sales fact table (order_id, product_id, customer_id, timestamp, amount, quantity) surrounded by dimension tables (products, customers, sellers, warehouses). The product dimension is denormalized: it includes category_name, brand_name, and department_name directly rather than joining to separate tables. This denormalization means the product dimension has 50+ columns and significant redundancy (“Electronics” appears millions of times), but queries like “total sales by category” require a single join instead of three. Amazon runs thousands of such queries daily for pricing algorithms, inventory optimization, and seller analytics. The warehouse is partitioned by date (order_date) and distributed by customer_id for parallel processing across Redshift nodes.
Uber: ELT with Incremental Materialization
Uber loads raw trip data (GPS traces, driver events, rider actions) into their data warehouse using ELT. Rather than pre-aggregating in Spark, they load raw events and use SQL to create materialized views: hourly trips by city, daily earnings by driver, weekly surge patterns. These views refresh incrementally—only new data is processed, not the entire history. This approach lets analysts iterate quickly: they write SQL transformations in the warehouse rather than waiting for Spark jobs. The trade-off is higher warehouse compute costs, but Uber’s data team found that analyst productivity gains outweighed the expense. The interesting detail: Uber uses dbt (data build tool) to manage these SQL transformations as code with version control and testing, treating analytics like software engineering.
Netflix Lake-to-Warehouse Pipeline
graph LR
subgraph Data Sources
DEVICES["200M Devices<br/><i>Play, Pause, Seek</i><br/>500TB/day"]
end
subgraph Data Lake - S3
RAW[("Raw Events<br/><i>s3://netflix/events/</i><br/>7 years retention<br/>$0.023/GB/month")]
SPARK_ML["Spark Jobs<br/><i>ML Feature Engineering</i><br/>Recommendation models<br/>Encoding optimization"]
end
subgraph ETL Pipeline
SPARK_AGG["Spark Aggregation<br/><i>Hourly rollups</i><br/>Watch time by title<br/>Completion rates"]
end
subgraph Data Warehouse
WH[("Warehouse<br/><i>90 days retention</i><br/>Aggregated metrics<br/>Expensive storage")]
TABLEAU["Tableau Dashboards<br/><i>Executive reports</i><br/>DAU, Content performance"]
end
REPROCESS["Reprocess Historical<br/><i>New ML features</i><br/>Bug fixes<br/>Re-run Spark on S3"]
DEVICES --"1. Stream events"--> RAW
RAW --"2. Explore raw"--> SPARK_ML
RAW --"3. Aggregate"--> SPARK_AGG
SPARK_AGG --"4. Load metrics"--> WH
WH --"5. Query"--> TABLEAU
RAW -."6. Reprocess when needed".-> REPROCESS
REPROCESS -."7. New features".-> SPARK_ML
Netflix stores 7 years of raw viewing events in S3 for cheap long-term retention and ML exploration, while loading only 90 days of aggregated metrics into their expensive data warehouse for BI dashboards. This hybrid approach balances storage costs with query performance, allowing reprocessing of historical data when needed.
Interview Essentials
Mid-Level
Explain the difference between OLTP and OLAP databases. Why can’t you run analytics on your production PostgreSQL?
What is a star schema? Walk through a sales fact table with product, customer, and date dimensions.
Describe a data lake. How does schema-on-read differ from schema-on-write?
When would you choose a data lake over a data warehouse?
What is columnar storage and why does it help analytical queries?
Senior
Design a data pipeline that ingests clickstream events and powers both real-time dashboards and ML models. Where does data land first?
Compare star schema vs snowflake schema. When would you normalize dimensions?
Explain ETL vs ELT. How has cloud warehouse pricing changed this trade-off?
Your data warehouse queries are slow. Walk through your debugging process: partitioning, distribution keys, sort keys, statistics.
How would you handle slowly changing dimensions (customer addresses change over time)? Discuss Type 1 vs Type 2 SCDs.
Design a lakehouse architecture that supports both SQL analytics and Spark ML pipelines on the same data.
Staff+
You’re migrating from on-prem Teradata to cloud. Discuss the trade-offs between Snowflake, BigQuery, and Redshift for a 100TB warehouse with 500 analysts.
Design a data platform that handles 10PB of data across lakes and warehouses. How do you ensure data quality, lineage, and governance at scale?
Your company wants a single source of truth but data scientists need raw data and analysts need clean tables. Design the architecture and organizational model.
Discuss the convergence of lakes and warehouses (lakehouse). What are the technical challenges of ACID transactions on object storage?
How would you design a cost-effective archival strategy that keeps 10 years of data queryable but minimizes storage and compute costs?
Common Interview Questions
Why not just use a data lake for everything? (Answer: Query performance, data quality, analyst productivity)
Can you update data in a data warehouse? (Answer: Yes, but it’s slow—OLAP systems are optimized for append-only workloads)
What’s the difference between a data warehouse and a database? (Answer: Warehouses are specialized OLAP databases with columnar storage, MPP architecture, and denormalized schemas)
How do you partition a fact table? (Answer: By date for time-series queries, by customer_id for user-centric queries, or both using composite partitioning)
Red Flags to Avoid
Confusing OLTP and OLAP—saying you’d run analytics on production MySQL without explaining replication to a read replica or warehouse
Not knowing what columnar storage is or why it helps aggregations
Suggesting a snowflake schema without justifying why you’d accept slower queries for normalized dimensions
Claiming data lakes are always better because they’re cheaper, ignoring query performance and governance challenges
Not mentioning partitioning when discussing large fact tables—partitioning by date is table stakes for analytical workloads
Designing a star schema with normalized dimensions (that’s a snowflake schema) or vice versa
Key Takeaways
OLTP vs OLAP is about workload, not technology: OLTP handles transactional writes with row-oriented storage and B-trees; OLAP handles analytical scans with columnar storage and compression. Never run heavy analytics on production OLTP databases—replicate to a warehouse first.
Data lakes provide flexibility, warehouses provide performance: Lakes store raw data cheaply in object storage with schema-on-read for exploration; warehouses store structured data expensively with schema-on-write for fast queries. Modern architectures use both: raw data in lakes, curated data in warehouses.
Star schemas are the default for analytical workloads: Denormalize dimensions into a central fact table surrounded by dimension tables. Accept storage redundancy for query performance. Only use snowflake schemas when dimension updates are frequent and complex.
ETL vs ELT depends on where compute is cheaper: ETL transforms before loading (offload to Spark when warehouse compute is expensive); ELT loads then transforms (use warehouse SQL when compute is cheap and elastic). Modern cloud warehouses favor ELT for simplicity and flexibility.
Partitioning and distribution are critical for scale: Partition fact tables by date for time-series queries, distribute by high-cardinality keys (customer_id) for parallel processing. Without proper partitioning, queries scan entire tables and timeout.