Data Lakes vs Data Warehouses: Key Differences
After completing this topic, you will be able to:
- Differentiate between OLTP, OLAP, data lakes, and data warehouses
- Analyze star schema vs snowflake schema trade-offs for analytical workloads
- Compare data lake and data warehouse architectures for different business needs
TL;DR
Data lakes store raw, unstructured data in flat object storage for exploratory analytics and ML, while data warehouses store structured, schema-on-write data optimized for fast SQL queries and BI dashboards. OLTP databases handle transactional workloads (writes, updates, ACID), while OLAP systems handle analytical workloads (aggregations, scans, read-heavy). Modern architectures often combine both: raw data lands in lakes, transformed data moves to warehouses, and lakehouses attempt to unify both paradigms.
Cheat Sheet: Data Lake = schema-on-read, cheap storage, flexible but slower queries. Data Warehouse = schema-on-write, expensive storage, fast structured queries. OLTP = row-oriented, normalized, transactional. OLAP = column-oriented, denormalized, analytical.
Context
When Netflix wants to analyze viewing patterns across 200 million users, they don’t query their production MySQL databases—they’d bring the system to its knees. Instead, they replicate data into analytical systems optimized for scanning billions of rows. This is the fundamental split between operational and analytical data systems, and understanding it is critical for any system design interview involving data at scale.
The choice between data lakes and warehouses isn’t academic—it determines your storage costs, query performance, data governance, and team productivity. Amazon stores petabytes of raw clickstream data in S3 (their data lake) but loads aggregated metrics into Redshift (their warehouse) for executive dashboards. Uber’s data scientists explore raw trip data in their lake but business analysts query pre-modeled tables in their warehouse. The distinction matters because each architecture makes different trade-offs between flexibility, performance, and cost.
Interviewers ask about this topic because it reveals whether you understand the entire data lifecycle: how transactional data flows from OLTP systems into analytical stores, when to transform data (ETL vs ELT), and how to design schemas for analytical queries. A senior engineer should know that OLTP databases use row-oriented storage with B-trees for fast point lookups, while OLAP systems use columnar storage with compression for fast aggregations. The architecture you choose affects everything from query latency to monthly AWS bills.
Side-by-Side Comparison
OLTP vs OLAP
| Dimension | OLTP (Transactional) | OLAP (Analytical) |
|---|---|---|
| Primary Use | Handle business transactions (orders, payments, user updates) | Answer business questions (revenue trends, user cohorts, forecasts) |
| Query Pattern | Point lookups, small updates (SELECT/UPDATE single rows) | Large scans, aggregations (SUM/AVG across millions of rows) |
| Storage | Row-oriented (entire row stored together) | Column-oriented (each column stored separately) |
| Schema | Normalized (3NF) to avoid update anomalies | Denormalized (star/snowflake) for query performance |
| Example | PostgreSQL storing e-commerce orders | Snowflake analyzing quarterly sales by region |
| When Better | User-facing apps needing <100ms response, frequent writes | Batch reports, dashboards, ML training needing to scan TBs |
Data Lake vs Data Warehouse
| Dimension | Data Lake | Data Warehouse |
|---|---|---|
| Schema | Schema-on-read (applied when querying) | Schema-on-write (enforced when loading) |
| Data Types | Raw, unstructured (logs, JSON, images, video) | Structured, cleaned, business-ready tables |
| Storage | Object storage (S3, ADLS) with flat namespace | Proprietary columnar formats (Redshift, BigQuery) |
| Cost | Very cheap (~$0.023/GB/month on S3) | Higher total cost (storage is comparable at ~$23/TB/month, but per-second compute billing dominates the bill) |
| Query Speed | Slow (scan raw files, no indexes) | Fast (optimized storage, statistics, caching) |
| Users | Data scientists, ML engineers exploring data | Business analysts, executives running dashboards |
| Example | Netflix storing raw viewing logs in S3 | Netflix aggregating daily watch hours by title in Redshift |
| When Better | Exploratory analysis, ML feature engineering, archival | Repeated queries, BI tools, regulatory reporting |
Star Schema vs Snowflake Schema
| Dimension | Star Schema | Snowflake Schema |
|---|---|---|
| Structure | Fact table surrounded by denormalized dimension tables | Fact table with normalized dimension tables (sub-dimensions) |
| Joins | Single join from fact to dimension | Multiple joins through dimension hierarchies |
| Storage | More redundant (dimension data duplicated) | Less redundant (dimensions normalized) |
| Query Speed | Faster (fewer joins, better for BI tools) | Slower (more joins, complex execution plans) |
| Example | Sales fact table joins directly to Product dimension | Sales fact joins Product, which joins Category, which joins Department |
| When Better | BI dashboards, simple aggregations, query performance critical | Complex hierarchies, storage optimization, data integrity important |
OLTP vs OLAP: Storage and Query Patterns
graph LR
subgraph OLTP System
A["User Request<br/><i>UPDATE user SET email=...</i>"]
B["Row-Oriented Storage<br/><i>B-Tree Index</i>"]
C[("PostgreSQL<br/>User Table")]
D["Single Row<br/><i>user_id: 123</i><br/>name, email, address..."]
end
subgraph OLAP System
E["Analytics Query<br/><i>SELECT AVG(amount) FROM sales</i>"]
F["Columnar Storage<br/><i>Compressed Columns</i>"]
G[("Snowflake<br/>Sales Fact")]
H["Amount Column Only<br/><i>Skip other 20 columns</i><br/>Scan 10M values"]
end
A --"1. Point Lookup"--> B
B --"2. Read Row"--> C
C --"3. Return"--> D
E --"1. Scan Column"--> F
F --"2. Read Compressed"--> G
G --"3. Aggregate"--> H
OLTP systems use row-oriented storage with B-tree indexes for fast point lookups of entire rows, while OLAP systems use columnar storage to scan only relevant columns across millions of rows. This fundamental difference in storage layout drives the performance characteristics of each system.
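To make the storage-layout difference concrete, here is a minimal sketch (toy data, not a real database engine): the same table held row-wise and column-wise, with an OLTP-style point lookup against the rows and an OLAP-style aggregation that touches only one column.

```python
# Illustrative sketch: the same table laid out row-wise and column-wise,
# showing why point lookups favor rows and aggregations favor columns.

# Row-oriented: each record stored together -- ideal for fetching one entity.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 120.0, "region": "EU"},
    {"order_id": 2, "customer": "bo",  "amount": 80.0,  "region": "US"},
    {"order_id": 3, "customer": "cy",  "amount": 100.0, "region": "EU"},
]

# Column-oriented: each column stored separately -- ideal for scans.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["ana", "bo", "cy"],
    "amount":   [120.0, 80.0, 100.0],
    "region":   ["EU", "US", "EU"],
}

# OLTP-style point lookup: fetch one whole record by key.
order_2 = next(r for r in rows if r["order_id"] == 2)

# OLAP-style aggregation: read only the one column the query needs,
# skipping every other column entirely.
avg_amount = sum(columns["amount"]) / len(columns["amount"])

print(order_2["customer"])  # bo
print(avg_amount)           # 100.0
```

In a real engine the row layout is backed by a B-tree index and the column layout by compressed, block-aligned segments, but the access-pattern asymmetry is exactly this.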
Deep Analysis
OLTP: The Transactional Workhorse
OLTP databases like PostgreSQL and MySQL are optimized for the write-heavy, low-latency workloads that power user-facing applications. They store data in row-oriented format because transactions typically touch entire rows: when you update a user’s email address, you need the entire user record. B-tree indexes provide O(log n) lookups for primary keys, and ACID guarantees ensure your bank transfer either completes fully or rolls back entirely.
The problem is that OLTP databases are terrible at analytical queries. Scanning millions of rows to calculate average order value requires reading entire rows even though you only need two columns (order_id and amount). The row-oriented storage means you’re reading 90% irrelevant data. Worse, running heavy analytics on production databases starves transactional queries of resources—your checkout page slows down because someone’s running a quarterly report.
OLAP: Built for Analytics
OLAP systems like Snowflake, BigQuery, and Redshift solve this by storing data in columnar format. Each column is stored separately and heavily compressed (sorted timestamps often compress ~10x; low-cardinality repeated values can compress 100x or more). When you calculate SUM(revenue), the database reads only the revenue column, not entire rows. Columnar storage also enables vectorized execution: modern CPUs process many values per instruction using SIMD (for example, 16 32-bit integers per AVX-512 instruction).
The trade-off is that OLAP systems are slow at updates. Updating a single row requires rewriting entire column chunks. That’s why OLAP databases are append-only: you load data in batches (hourly, daily) rather than updating individual records. For detailed normalization concepts, see Databases Overview, but the key insight is that OLAP systems intentionally denormalize data to avoid expensive joins during queries.
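The compression claim above is easy to see with run-length encoding, one of the simplest schemes columnar stores apply to low-cardinality columns. A minimal sketch:

```python
# Minimal run-length encoding sketch: the kind of compression that makes
# low-cardinality columns (e.g. "region") shrink dramatically when stored
# column-wise, where runs of equal adjacent values are long.
def rle_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

region_column = ["EU"] * 1000 + ["US"] * 500 + ["EU"] * 250
runs = rle_encode(region_column)

print(len(region_column))  # 1750 raw values
print(len(runs))           # 3 runs
print(runs)                # [['EU', 1000], ['US', 500], ['EU', 250]]
```

This also explains why warehouses sort data within blocks: sorted columns maximize run lengths, and the same principle (applied per column, with dictionary and delta encodings layered on) is why a row-oriented layout, which interleaves unrelated values, compresses far worse.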
Data Lakes: Flexibility at Scale
Data lakes emerged when companies realized they were throwing away valuable data because it didn’t fit warehouse schemas. Netflix receives terabytes of device telemetry daily—playback errors, network conditions, A/B test events. Defining schemas upfront is impossible when you don’t know what questions you’ll ask.
A data lake stores raw data in object storage (S3, Azure Data Lake) organized by ingestion date: s3://lake/events/2024/01/15/. Data is typically partitioned by date for efficient scanning. The “schema-on-read” approach means you define structure when querying: Spark reads JSON files and infers schema, Athena queries Parquet files using SQL. This flexibility enables exploratory analysis—data scientists can experiment with raw data without waiting for ETL pipelines.
The downside is governance chaos. Without enforced schemas, data quality degrades: field names change, types drift, documentation disappears. Query performance suffers because you’re scanning raw files without indexes or statistics. This is why companies build “data lake zones”: raw data lands in Bronze, cleaned data moves to Silver, business-ready aggregates live in Gold. Each zone adds structure while preserving raw data for reprocessing.
Data Warehouses: Performance and Governance
Data warehouses enforce schema-on-write: data is validated, cleaned, and transformed before loading. This upfront cost pays dividends in query performance and data quality. Redshift pre-computes statistics, builds zone maps (min/max values per block), and materializes aggregates. Queries that would scan terabytes in a lake complete in seconds in a warehouse.
The star schema is the workhorse of warehouse design. A central fact table stores metrics (sales_amount, quantity) with foreign keys to dimension tables (product, customer, date). Dimensions are denormalized: the product table includes category_name and department_name directly rather than joining to separate tables. This denormalization trades storage for query speed—exactly what analytics needs. For more on denormalization strategies, see Denormalization.
Snowflake schemas normalize dimensions into hierarchies (product → category → department), reducing redundancy but requiring more joins. They’re rare in practice because storage is cheap and query performance matters more. The exception is when dimension hierarchies change frequently—normalizing makes updates easier.
ETL vs ELT: Where Transformation Happens
ETL (Extract-Transform-Load) transforms data before loading into the warehouse. You run Spark jobs that clean data, join tables, and aggregate metrics, then load results into Redshift. This was necessary when warehouse compute was expensive—you offloaded transformation to cheaper Hadoop clusters.
ELT (Extract-Load-Transform) loads raw data into the warehouse first, then transforms using SQL. Modern warehouses like Snowflake and BigQuery have cheap, elastic compute, making in-warehouse transformation viable. ELT is simpler (no separate Spark cluster) and more flexible (analysts can re-transform without re-extracting). The trade-off is that you’re storing more raw data in expensive warehouse storage rather than cheap S3.
Star Schema vs Snowflake Schema Structure
graph TB
subgraph Star Schema - Denormalized
SF["Sales Fact<br/><i>order_id, product_id, customer_id</i><br/>amount, quantity, date"]
PD["Product Dimension<br/><i>product_id, name</i><br/>category_name, brand_name<br/>department_name"]
CD["Customer Dimension<br/><i>customer_id, name</i><br/>city, state, country"]
DD["Date Dimension<br/><i>date_id, date</i><br/>month, quarter, year"]
SF --"Single Join"--> PD
SF --"Single Join"--> CD
SF --"Single Join"--> DD
end
subgraph Snowflake Schema - Normalized
SF2["Sales Fact<br/><i>order_id, product_id, customer_id</i><br/>amount, quantity, date"]
PD2["Product<br/><i>product_id, name</i><br/>category_id"]
CAT["Category<br/><i>category_id, name</i><br/>department_id"]
DEPT["Department<br/><i>department_id, name</i>"]
CD2["Customer<br/><i>customer_id, name</i><br/>city_id"]
CITY["City<br/><i>city_id, name</i><br/>state_id"]
SF2 --"Join 1"--> PD2
PD2 --"Join 2"--> CAT
CAT --"Join 3"--> DEPT
SF2 --"Join 1"--> CD2
CD2 --"Join 2"--> CITY
end
Star schema denormalizes dimensions for single-join queries (faster, more storage), while snowflake schema normalizes dimensions into hierarchies (slower queries, less redundancy). Star schema is the default choice for most analytical workloads where query performance outweighs storage costs.
Data Lake Architecture with Processing Zones
graph LR
subgraph Data Sources
APP["Application Logs<br/><i>JSON</i>"]
DB["Database CDC<br/><i>Change Events</i>"]
API["External APIs<br/><i>Third-party Data</i>"]
end
subgraph Bronze Zone - Raw Data
S3B[("S3 Bronze<br/><i>s3://lake/raw/2024/01/15/</i><br/>Schema-on-read<br/>Partitioned by date")]
end
subgraph Silver Zone - Cleaned Data
SPARK1["Spark Job<br/><i>Validate, Dedupe</i><br/>Parse JSON"]
S3S[("S3 Silver<br/><i>Parquet format</i><br/>Cleaned & Validated<br/>Typed columns")]
end
subgraph Gold Zone - Business Ready
SPARK2["Spark Job<br/><i>Aggregate, Join</i><br/>Business Logic"]
S3G[("S3 Gold<br/><i>Denormalized tables</i><br/>Pre-aggregated metrics<br/>Ready for BI")]
end
DW[("Data Warehouse<br/><i>Snowflake/Redshift</i><br/>Fast SQL queries")]
BI["BI Tools<br/><i>Tableau, Looker</i>"]
ML["ML Pipelines<br/><i>Feature Engineering</i>"]
APP & DB & API --"1. Ingest Raw"--> S3B
S3B --"2. Clean & Validate"--> SPARK1
SPARK1 --"3. Write Parquet"--> S3S
S3S --"4. Transform"--> SPARK2
SPARK2 --"5. Write Aggregates"--> S3G
S3G --"6. Load"--> DW
DW --"7. Query"--> BI
S3S & S3G --"Read Raw/Processed"--> ML
Data lakes organize data into Bronze (raw ingestion), Silver (cleaned and validated), and Gold (business-ready aggregates) zones. This medallion architecture preserves raw data for reprocessing while providing curated datasets for analytics and ML. Data scientists explore Silver/Gold zones while BI tools query the warehouse loaded from Gold.
ETL vs ELT: Where Transformation Happens
graph TB
subgraph ETL - Transform Before Load
SRC1[("Source DB<br/><i>PostgreSQL</i>")]
EXT1["Extract<br/><i>Read tables</i>"]
SPARK["Transform<br/><i>Spark Cluster</i><br/>Clean, Join, Aggregate<br/>Heavy compute"]
LOAD1["Load<br/><i>INSERT cleaned data</i>"]
WH1[("Warehouse<br/><i>Redshift</i><br/>Only final tables")]
SRC1 --"1. Extract"--> EXT1
EXT1 --"2. Raw data"--> SPARK
SPARK --"3. Transformed"--> LOAD1
LOAD1 --"4. Load clean"--> WH1
end
subgraph ELT - Transform After Load
SRC2[("Source DB<br/><i>PostgreSQL</i>")]
EXT2["Extract<br/><i>Read tables</i>"]
LOAD2["Load<br/><i>INSERT raw data</i>"]
WH2[("Warehouse<br/><i>Snowflake</i><br/>Raw + transformed")]
SQL["Transform<br/><i>SQL in warehouse</i><br/>CREATE TABLE AS SELECT<br/>Elastic compute"]
SRC2 --"1. Extract"--> EXT2
EXT2 --"2. Raw data"--> LOAD2
LOAD2 --"3. Load raw"--> WH2
WH2 --"4. Transform SQL"--> SQL
SQL --"5. Materialized views"--> WH2
end
NOTE1["ETL: Offload compute to Spark<br/>when warehouse is expensive<br/>Less warehouse storage"]
NOTE2["ELT: Use warehouse compute<br/>when elastic and cheap<br/>Simpler, more flexible"]
ETL transforms data in external compute (Spark) before loading into the warehouse, minimizing warehouse storage and compute costs. ELT loads raw data first and transforms using SQL in the warehouse, leveraging elastic compute and enabling faster iteration. Modern cloud warehouses with cheap, scalable compute favor ELT for simplicity.
Decision Framework
Choose OLTP When:
- You need sub-100ms point lookups by primary key (user profile, order details)
- Workload is write-heavy with frequent updates (inventory counts, account balances)
- ACID guarantees are critical (financial transactions, booking systems)
- Data size is <1TB and fits on a single powerful instance
- Example: E-commerce checkout, social media posts, real-time bidding
Choose OLAP When:
- Queries scan millions of rows for aggregations (revenue by region, user cohorts)
- Workload is read-heavy with batch updates (nightly ETL, hourly refreshes)
- You need to join large fact tables with dimensions (sales × products × customers)
- Data size is >10TB requiring distributed storage
- Example: Executive dashboards, quarterly reports, churn analysis
Choose Data Lake When:
- Data is unstructured or semi-structured (logs, JSON, images, video)
- Schema is unknown or evolving (exploratory analysis, new data sources)
- Storage cost is primary concern (archival, compliance, long-term retention)
- Users are data scientists comfortable with Spark/Python
- You need to preserve raw data for reprocessing (ML feature engineering)
- Example: Netflix device telemetry, Uber trip GPS traces, genomics research
Choose Data Warehouse When:
- Data is structured and schema is stable (sales, customers, products)
- Query performance is critical (sub-second dashboard loads)
- Users are business analysts using BI tools (Tableau, Looker)
- Data governance and quality are important (regulatory reporting)
- You need complex joins and aggregations (star schema queries)
- Example: Amazon retail analytics, Stripe payment reporting, Airbnb booking metrics
Hybrid Approach (Most Common):
Modern architectures use both: raw data lands in S3 (lake), transformed data loads into Snowflake (warehouse), and BI tools query the warehouse while ML pipelines read the lake. This “lakehouse” pattern combines flexibility and performance. Companies like Databricks and Snowflake are converging: Databricks added Delta Lake for ACID on S3, Snowflake added external tables to query S3 directly.
Schema Design Decision:
- Star Schema: Default choice for most warehouses. Use when query performance matters more than storage efficiency. Denormalize dimensions fully.
- Snowflake Schema: Only when dimension hierarchies change frequently (organizational structures, product categories) and you need to update once rather than in many places. Accept slower queries.
Real-World Examples
Netflix: Lake-to-Warehouse Pipeline
Netflix ingests 500TB of viewing data daily into S3 (their data lake). Raw events include every play, pause, seek, and error from 200 million users. Data scientists use Spark to explore this raw data for ML models (recommendation algorithms, encoding optimization). Meanwhile, ETL jobs aggregate data into hourly metrics (watch time by title, completion rates by region) and load into their data warehouse. Business analysts query the warehouse using Tableau for executive dashboards showing daily active users and content performance. The interesting detail: Netflix keeps 7 years of raw data in S3 for $0.023/GB/month but only 90 days of aggregated data in their expensive warehouse. When they need to reprocess historical data (new ML feature, bug fix), they re-run Spark jobs against S3.
Amazon Retail: Star Schema at Scale
Amazon’s retail data warehouse uses a massive star schema with a central sales fact table (order_id, product_id, customer_id, timestamp, amount, quantity) surrounded by dimension tables (products, customers, sellers, warehouses). The product dimension is denormalized: it includes category_name, brand_name, and department_name directly rather than joining to separate tables. This denormalization means the product dimension has 50+ columns and significant redundancy (“Electronics” appears millions of times), but queries like “total sales by category” require a single join instead of three. Amazon runs thousands of such queries daily for pricing algorithms, inventory optimization, and seller analytics. The warehouse is partitioned by date (order_date) and distributed by customer_id for parallel processing across Redshift nodes.
Uber: ELT with Incremental Materialization
Uber loads raw trip data (GPS traces, driver events, rider actions) into their data warehouse using ELT. Rather than pre-aggregating in Spark, they load raw events and use SQL to create materialized views: hourly trips by city, daily earnings by driver, weekly surge patterns. These views refresh incrementally—only new data is processed, not the entire history. This approach lets analysts iterate quickly: they write SQL transformations in the warehouse rather than waiting for Spark jobs. The trade-off is higher warehouse compute costs, but Uber’s data team found that analyst productivity gains outweighed the expense. The interesting detail: Uber uses dbt (data build tool) to manage these SQL transformations as code with version control and testing, treating analytics like software engineering.
Netflix Lake-to-Warehouse Pipeline
graph LR
subgraph Data Sources
DEVICES["200M Devices<br/><i>Play, Pause, Seek</i><br/>500TB/day"]
end
subgraph Data Lake - S3
RAW[("Raw Events<br/><i>s3://netflix/events/</i><br/>7 years retention<br/>$0.023/GB/month")]
SPARK_ML["Spark Jobs<br/><i>ML Feature Engineering</i><br/>Recommendation models<br/>Encoding optimization"]
end
subgraph ETL Pipeline
SPARK_AGG["Spark Aggregation<br/><i>Hourly rollups</i><br/>Watch time by title<br/>Completion rates"]
end
subgraph Data Warehouse
WH[("Warehouse<br/><i>90 days retention</i><br/>Aggregated metrics<br/>Expensive storage")]
TABLEAU["Tableau Dashboards<br/><i>Executive reports</i><br/>DAU, Content performance"]
end
REPROCESS["Reprocess Historical<br/><i>New ML features</i><br/>Bug fixes<br/>Re-run Spark on S3"]
DEVICES --"1. Stream events"--> RAW
RAW --"2. Explore raw"--> SPARK_ML
RAW --"3. Aggregate"--> SPARK_AGG
SPARK_AGG --"4. Load metrics"--> WH
WH --"5. Query"--> TABLEAU
RAW -."6. Reprocess when needed".-> REPROCESS
REPROCESS -."7. New features".-> SPARK_ML
Netflix stores 7 years of raw viewing events in S3 for cheap long-term retention and ML exploration, while loading only 90 days of aggregated metrics into their expensive data warehouse for BI dashboards. This hybrid approach balances storage costs with query performance, allowing reprocessing of historical data when needed.
Interview Essentials
Mid-Level
Explain the difference between OLTP and OLAP databases. Why can’t you run analytics on your production PostgreSQL?
What is a star schema? Walk through a sales fact table with product, customer, and date dimensions.
Describe a data lake. How does schema-on-read differ from schema-on-write?
When would you choose a data lake over a data warehouse?
What is columnar storage and why does it help analytical queries?
Senior
Design a data pipeline that ingests clickstream events and powers both real-time dashboards and ML models. Where does data land first?
Compare star schema vs snowflake schema. When would you normalize dimensions?
Explain ETL vs ELT. How has cloud warehouse pricing changed this trade-off?
Your data warehouse queries are slow. Walk through your debugging process: partitioning, distribution keys, sort keys, statistics.
How would you handle slowly changing dimensions (customer addresses change over time)? Discuss Type 1 vs Type 2 SCDs.
Design a lakehouse architecture that supports both SQL analytics and Spark ML pipelines on the same data.
Staff+
You’re migrating from on-prem Teradata to cloud. Discuss the trade-offs between Snowflake, BigQuery, and Redshift for a 100TB warehouse with 500 analysts.
Design a data platform that handles 10PB of data across lakes and warehouses. How do you ensure data quality, lineage, and governance at scale?
Your company wants a single source of truth but data scientists need raw data and analysts need clean tables. Design the architecture and organizational model.
Discuss the convergence of lakes and warehouses (lakehouse). What are the technical challenges of ACID transactions on object storage?
How would you design a cost-effective archival strategy that keeps 10 years of data queryable but minimizes storage and compute costs?
Common Interview Questions
Why not just use a data lake for everything? (Answer: Query performance, data quality, analyst productivity)
Can you update data in a data warehouse? (Answer: Yes, but it’s slow—OLAP systems are optimized for append-only workloads)
What’s the difference between a data warehouse and a database? (Answer: Warehouses are specialized OLAP databases with columnar storage, MPP architecture, and denormalized schemas)
How do you partition a fact table? (Answer: By date for time-series queries, by customer_id for user-centric queries, or both using composite partitioning)
Red Flags to Avoid
Confusing OLTP and OLAP—saying you’d run analytics on production MySQL without explaining replication to a read replica or warehouse
Not knowing what columnar storage is or why it helps aggregations
Suggesting a snowflake schema without justifying why you’d accept slower queries for normalized dimensions
Claiming data lakes are always better because they’re cheaper, ignoring query performance and governance challenges
Not mentioning partitioning when discussing large fact tables—partitioning by date is table stakes for analytical workloads
Designing a star schema with normalized dimensions (that’s a snowflake schema) or vice versa
Key Takeaways
OLTP vs OLAP is about workload, not technology: OLTP handles transactional writes with row-oriented storage and B-trees; OLAP handles analytical scans with columnar storage and compression. Never run heavy analytics on production OLTP databases—replicate to a warehouse first.
Data lakes provide flexibility, warehouses provide performance: Lakes store raw data cheaply in object storage with schema-on-read for exploration; warehouses store structured data expensively with schema-on-write for fast queries. Modern architectures use both: raw data in lakes, curated data in warehouses.
Star schemas are the default for analytical workloads: Denormalize dimensions into a central fact table surrounded by dimension tables. Accept storage redundancy for query performance. Only use snowflake schemas when dimension updates are frequent and complex.
ETL vs ELT depends on where compute is cheaper: ETL transforms before loading (offload to Spark when warehouse compute is expensive); ELT loads then transforms (use warehouse SQL when compute is cheap and elastic). Modern cloud warehouses favor ELT for simplicity and flexibility.
Partitioning and distribution are critical for scale: Partition fact tables by date for time-series queries, distribute by high-cardinality keys (customer_id) for parallel processing. Without proper partitioning, queries scan entire tables and timeout.