Web Server Caching: Nginx & Varnish Guide
TL;DR
Web server caching places a reverse proxy (Nginx, Varnish, Apache Traffic Server) between clients and application servers to cache HTTP responses. This reduces application load by 70-90% for cacheable content, improves response times from 200ms to 5ms, and handles traffic spikes without scaling backend infrastructure. Cheat Sheet: Use for static assets, API responses with predictable TTLs, and read-heavy endpoints. Avoid for user-specific content without proper cache key design.
The Analogy
Think of a web server cache as a restaurant’s expediting station. Instead of every order going back to the kitchen (application servers), the expediter keeps popular dishes warm and ready at the pass. When a waiter orders a burger, the expediter can hand it over in 5 seconds instead of waiting 3 minutes for the kitchen to cook it. The expediter only sends orders to the kitchen for custom requests or when the warming trays run empty. This is exactly how Nginx or Varnish sits between users and your application—serving cached responses instantly while only forwarding cache misses to the backend.
Why This Matters in Interviews
Web server caching is a fundamental interview topic because it’s the first line of defense in system scalability. Interviewers use it to assess whether you understand the HTTP caching model, can distinguish between different caching layers (CDN vs web server vs application), and know when caching at the reverse proxy makes sense versus other strategies. Senior candidates are expected to discuss cache key design, invalidation strategies, and trade-offs between memory usage and hit rates. This topic often appears when designing read-heavy systems like content platforms, e-commerce product pages, or API gateways.
Core Concept
Web server caching operates at the reverse proxy layer—the first server-side component that receives client requests. Unlike CDN caching (covered in CDN Caching) which operates at the network edge, web server caching sits in your data center, directly in front of application servers. Popular implementations include Nginx (with proxy_cache or fastcgi_cache), Varnish Cache, and Apache Traffic Server. The cache stores HTTP responses keyed by request attributes (URL, headers, cookies) and serves them directly without touching application code. This is distinct from Application Caching, which caches data structures within application memory. Web server caching is particularly effective for content that’s expensive to generate but identical across users—think product listings, blog posts, or API responses that change infrequently.
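As a concrete starting point, a minimal Nginx configuration along these lines enables proxy caching. This is a sketch, not a production config: the cache path, zone name, and the app_backend upstream are placeholders, though the directives themselves are standard Nginx.

```nginx
http {
    # Metadata (keys, timers) lives in the shared memory zone "cache";
    # response bodies go to disk under /var/cache/nginx, capped at 1 GB.
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=cache:10m
                     max_size=1g inactive=60m;

    server {
        listen 80;

        location /api/ {
            proxy_cache cache;
            proxy_cache_valid 200 10m;     # cache successful responses for 10 minutes
            add_header X-Cache-Status $upstream_cache_status;  # HIT/MISS, useful for debugging
            proxy_pass http://app_backend; # upstream app servers, assumed defined elsewhere
        }
    }
}
```

The X-Cache-Status header is a common operational trick: it lets you verify hit/miss behavior from curl without touching backend logs.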
Cache Key Design Impact on Hit Rate
graph TB
Request["Incoming Request<br/>URL: /api/video/abc123<br/>Cookie: session=xyz<br/>Accept-Language: en<br/>Accept-Encoding: gzip"]
subgraph Bad["❌ Poor Cache Key Design"]
BadKey["Cache Key:<br/>URL + ALL headers + cookies"]
BadResult["Result: Every user creates<br/>separate cache entry<br/>Hit Rate: 5-10%"]
end
subgraph Good["✅ Optimized Cache Key Design"]
GoodKey["Cache Key:<br/>URL + Accept-Language + Accept-Encoding<br/>(exclude session cookie)"]
GoodResult["Result: Users share cache entries<br/>by language & encoding<br/>Hit Rate: 85-95%"]
end
Request --> Bad
Request --> Good
BadKey --> BadResult
GoodKey --> GoodResult
Including unnecessary request attributes (like session cookies) in the cache key creates over-segmentation, where each user gets a separate cache entry. Optimal cache keys include only attributes that affect the response content.
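In Nginx terms, the optimized key on the right might look like the sketch below; the location path and upstream name are placeholders, and the directives are standard Nginx.

```nginx
location /api/ {
    proxy_cache cache;
    # Key on scheme + host + URI plus the language header only.
    # Session cookies are deliberately NOT part of the key, so all users
    # requesting the same URL in the same language share one entry.
    # Accept-Encoding could be added similarly, ideally normalized to a
    # few canonical values first to avoid fragmenting the cache.
    proxy_cache_key "$scheme$host$request_uri$http_accept_language";
    proxy_pass http://app_backend;
}
```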
Full Page vs Fragment Caching Comparison
graph LR
subgraph FullPage["Full Page Caching"]
FP_Request["Request: /product/123"]
FP_Cache[("Cache Entry:<br/>Complete HTML page")]
FP_Response["Response: Entire page<br/>(5ms from cache)"]
FP_Request --> FP_Cache
FP_Cache --> FP_Response
end
subgraph Fragment["Fragment Caching (ESI)"]
FR_Request["Request: /product/123"]
FR_Proxy["Varnish Proxy"]
FR_Desc[("Cached:<br/>Product description")]
FR_Reviews[("Cached:<br/>Reviews section")]
FR_Cart["Dynamic:<br/>User's cart"]
FR_Assemble["Assemble page<br/>from fragments"]
FR_Response["Response: Mixed<br/>cached + dynamic"]
FR_Request --> FR_Proxy
FR_Proxy --> FR_Desc
FR_Proxy --> FR_Reviews
FR_Proxy --> FR_Cart
FR_Desc --> FR_Assemble
FR_Reviews --> FR_Assemble
FR_Cart --> FR_Assemble
FR_Assemble --> FR_Response
end
Full page caching stores complete responses for maximum speed but requires all content to be identical across users. Fragment caching (ESI) allows mixing cached public content with dynamic personalized sections, trading some performance for flexibility.
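In Varnish, fragment caching works by flagging responses for ESI processing; a minimal VCL 4.0 sketch follows (the URL pattern and TTL are illustrative). The backend's HTML would then contain tags like <esi:include src="/fragments/cart"/> that Varnish resolves when assembling the page.

```vcl
vcl 4.0;

sub vcl_backend_response {
    # Scan product pages for <esi:include> tags and assemble them
    # from separately cached (or uncached) fragments.
    if (bereq.url ~ "^/product/") {
        set beresp.do_esi = true;
        set beresp.ttl = 10m;   # TTL for the outer page shell
    }
}
```

Each included fragment is fetched as its own request, so the cart fragment can be marked uncacheable while the description and reviews fragments keep long TTLs.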
TTL Strategy Decision Tree
graph TB
Start["Content Type Analysis"]
Immutable{"Content<br/>immutable?"}
UserSpecific{"User-specific<br/>content?"}
ChangeFreq{"Update<br/>frequency?"}
Staleness{"Staleness<br/>tolerance?"}
Start --> Immutable
Immutable -->|Yes<br/>versioned assets| Long["TTL: 1 year<br/>Example: app.v123.js<br/>Hit Rate: 99%"]
Immutable -->|No| UserSpecific
UserSpecific -->|Yes| NoCache["No caching or<br/>user-keyed cache<br/>Example: /my/profile"]
UserSpecific -->|No| ChangeFreq
ChangeFreq -->|Seconds| Micro["Microcaching: 1-5s<br/>Example: viral news<br/>Protects from spikes"]
ChangeFreq -->|Minutes| Staleness
ChangeFreq -->|Hours/Days| Medium["TTL: 1-24 hours<br/>+ active invalidation<br/>Example: product catalog"]
Staleness -->|High tolerance| Short["TTL: 5-15 min<br/>Example: news feed<br/>Balance freshness/load"]
Staleness -->|Low tolerance| VeryShort["TTL: 1-2 min<br/>+ purge on update<br/>Example: inventory count"]
TTL selection depends on content mutability, update frequency, and staleness tolerance. Immutable versioned assets can be cached indefinitely, while frequently changing content requires short TTLs or active invalidation strategies.
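Assuming Nginx, the tiers in this decision tree map naturally onto per-location TTLs. The paths below are illustrative; proxy_cache_valid takes a static time per location, which is why each tier gets its own block.

```nginx
location ~* \.(css|js)$ {          # versioned, immutable assets
    proxy_cache cache;
    proxy_cache_valid 200 365d;
    proxy_pass http://app_backend;
}
location /products/ {              # semi-static catalog pages
    proxy_cache cache;
    proxy_cache_valid 200 30m;
    proxy_pass http://app_backend;
}
location /feed/ {                  # frequently changing content
    proxy_cache cache;
    proxy_cache_valid 200 5m;
    proxy_pass http://app_backend;
}
```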
Cache Thrashing from Undersized Cache
sequenceDiagram
participant Client
participant Cache as Undersized Cache<br/>(100MB for 500MB working set)
participant App as Application Server
Note over Cache: Cache fills to capacity
Client->>Cache: Request popular item A
Cache->>Cache: Evict item B to make room
Cache->>App: Fetch item A (miss)
App->>Cache: Return item A
Cache->>Client: Serve item A
Note over Cache: Item B requested again
Client->>Cache: Request item B
Cache->>Cache: Evict item C to make room
Cache->>App: Fetch item B (miss)
App->>Cache: Return item B
Cache->>Client: Serve item B
Note over Cache: Item A requested again (still within TTL)
Client->>Cache: Request item A
Cache->>Cache: Item A was evicted!<br/>Evict item D to make room
Cache->>App: Fetch item A again (miss)
App->>Cache: Return item A
Cache->>Client: Serve item A
Note over Cache,App: Result: 0% hit rate despite caching<br/>Popular items evicted before TTL expires
When cache size is smaller than the working set, popular items are evicted before their TTL expires, causing repeated cache misses. This thrashing defeats the purpose of caching and can increase backend load beyond having no cache at all.
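The thrashing pattern in the diagram can be reproduced with a toy LRU cache; the capacity and item names below are illustrative, not taken from any real deployment.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache to illustrate thrashing; not production code."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, fetch):
        if key in self.store:
            self.store.move_to_end(key)    # mark as most recently used
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = fetch(key)                  # simulate a backend fetch on miss
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used item
        return value

# Working set of 3 popular items cycled through a cache that holds only 2:
cache = LRUCache(capacity=2)
for _ in range(100):
    for item in ("A", "B", "C"):
        cache.get(item, fetch=lambda k: f"body-{k}")

print(cache.hits, cache.misses)  # -> 0 300: every access is a miss
```

With capacity 2 and a cycling working set of 3, each popular item is evicted just before it is requested again, so the hit rate is exactly zero despite the cache doing constant work.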
How It Works
When a request arrives at the reverse proxy, the cache generates a cache key from the request (typically the URL plus specific headers like Accept-Encoding). If the key exists in cache and hasn’t expired, the proxy returns the cached response immediately—this is a cache hit. On a cache miss, the proxy forwards the request to the application server, receives the response, stores it in cache according to HTTP headers (Cache-Control, Expires), and returns it to the client. The cache respects HTTP semantics: Cache-Control: private forbids shared caches like the reverse proxy from storing the response, max-age sets the TTL, and Vary headers tell the cache which request headers affect the response. For example, if your API returns Vary: Accept-Language, the cache stores separate entries for English and Spanish responses to the same URL. Nginx uses a two-tier storage structure: a shared memory zone for metadata (keys, expiration times) and disk storage for response bodies. Varnish keeps everything in memory for maximum speed but requires careful capacity planning.
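The hit/miss flow described above can be modeled in a few lines. This is a toy sketch of the logic, not how Nginx or Varnish is implemented: the backend stub, TTL value, and helper names are invented for illustration.

```python
import time

def cache_key(url, request_headers, vary_headers):
    """Key = URL plus only the request headers the backend says matter (Vary)."""
    parts = [url] + [f"{h}={request_headers.get(h, '')}" for h in sorted(vary_headers)]
    return "|".join(parts)

cache = {}  # key -> (body, expires_at)

def handle(url, request_headers, backend, vary=("Accept-Language",), ttl=300):
    key = cache_key(url, request_headers, vary)
    entry = cache.get(key)
    if entry and entry[1] > time.time():    # fresh entry -> cache hit
        return entry[0], "HIT"
    body = backend(url, request_headers)    # miss -> forward to the application
    cache[key] = (body, time.time() + ttl)  # store with a TTL
    return body, "MISS"

# Same URL, different Accept-Language -> separate cache entries:
backend = lambda url, h: f"{url} in {h.get('Accept-Language', 'en')}"
print(handle("/api/video/1", {"Accept-Language": "en"}, backend)[1])  # MISS
print(handle("/api/video/1", {"Accept-Language": "en"}, backend)[1])  # HIT
print(handle("/api/video/1", {"Accept-Language": "es"}, backend)[1])  # MISS
```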
Web Server Cache Request Flow
graph LR
Client["Client Browser"]
Proxy["Reverse Proxy<br/>(Nginx/Varnish)"]
Cache[("Cache Storage<br/>Memory + Disk")]
App["Application Server"]
Client --"1. GET /api/products"--> Proxy
Proxy --"2. Generate cache key<br/>(URL + headers)"--> Cache
Cache --"3a. Cache HIT<br/>(return cached response)"--> Proxy
Proxy --"3b. Cache MISS"--> App
App --"4. Generate response<br/>+ Cache-Control headers"--> Proxy
Proxy --"5. Store in cache<br/>(respect TTL)"--> Cache
Proxy --"6. Return response"--> Client
On cache hit (3a), the proxy returns the response immediately without touching the application server. On cache miss (3b), the proxy forwards the request, caches the response according to HTTP headers, and returns it to the client.
Key Principles
principle: Cache Key Design Determines Effectiveness
explanation: The cache key must capture all request dimensions that affect the response while avoiding over-segmentation. A poorly designed key either caches user-specific data incorrectly (security issue) or creates too many cache entries (low hit rate). The default key is usually the full URL, but you often need to include headers like Accept-Encoding (gzip vs brotli), Accept (JSON vs XML), or custom headers for API versioning. Conversely, you must exclude session cookies from the cache key for public content, or every logged-in user creates a separate cache entry.
example: YouTube’s video metadata API caches responses keyed by video_id and Accept-Language, but explicitly ignores the user’s authentication cookie. This allows millions of users to share the same cached metadata while still serving localized titles and descriptions.
principle: TTL Balancing: Freshness vs Hit Rate
explanation: Longer TTLs increase cache hit rates but risk serving stale data. Shorter TTLs keep data fresh but increase backend load. The optimal TTL depends on how often your data changes and your tolerance for staleness. Most systems use tiered TTLs: 1 year for immutable assets (versioned CSS/JS), 5 minutes for frequently updated content (news feeds), and 1 hour for semi-static content (product catalogs).
example: E-commerce sites typically cache product detail pages for 10-30 minutes. When inventory drops to zero, they use cache invalidation (see Cache Invalidation) to purge the entry immediately rather than waiting for TTL expiration.
principle: Bypass Mechanisms for Dynamic Content
explanation: Not all requests should hit the cache. User-specific pages, POST requests, and authenticated API calls typically bypass the cache entirely. The reverse proxy needs rules to identify cacheable vs non-cacheable traffic. This is configured through cache bypass conditions: skip caching if the request has a session cookie, if the URL contains /admin/, or if the response sets Set-Cookie.
example: Nginx configuration: proxy_cache_bypass $cookie_session; ensures that any request with a session cookie goes directly to the application server, preventing accidental caching of personalized content.
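A fuller sketch of those bypass rules follows; the cookie name and paths are examples, not a standard. Note that by default Nginx also refuses to cache responses that carry a Set-Cookie header, which covers the third bypass condition automatically.

```nginx
location / {
    proxy_cache cache;
    # Skip the cache lookup AND skip storing when a session cookie is present
    proxy_cache_bypass $cookie_session;
    proxy_no_cache     $cookie_session;
    proxy_pass http://app_backend;
}
location /admin/ {
    proxy_pass http://app_backend;   # never cached: no proxy_cache directive here
}
```

proxy_cache_bypass controls the read path and proxy_no_cache controls the write path; setting both prevents personalized responses from being either served from or written into the shared cache.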
Deep Dive
Types / Variants
Full Page Caching
Caches the entire HTTP response body for a URL. This is the most common and effective form, used for static pages, API responses, and any content that’s identical for all users. The cache stores the complete response including headers and body. Configuration is straightforward: set a TTL and define cache keys. Full page caching can achieve 95%+ hit rates for read-heavy content like blog posts or product catalogs.
Fragment Caching
Caches portions of a page while dynamically generating others. This is less common at the web server level (more typical in Application Caching) but possible with Edge Side Includes (ESI) in Varnish. For example, an e-commerce page might cache the product description and reviews (ESI fragments) while dynamically inserting the user’s shopping cart. ESI allows the reverse proxy to assemble pages from multiple cached fragments plus dynamic content.
Microcaching
Extremely short TTLs (1-5 seconds) that protect against traffic spikes without serving stale data. Even a 1-second cache can absorb a sudden surge of requests to the same URL—think of a news article going viral. The first request in each second hits the backend, but subsequent requests within that second are served from cache. This is particularly effective for APIs under heavy load where even brief caching provides massive relief.
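A microcaching setup in Nginx might look like the sketch below (the path is illustrative). proxy_cache_lock adds request coalescing, so during a spike only one request per cache key reaches the backend while the rest wait for, or are served, the cached copy.

```nginx
location /api/hot/ {
    proxy_cache cache;
    proxy_cache_valid 200 1s;        # serve each generated response for 1 second
    proxy_cache_lock on;             # coalesce concurrent misses into one backend request
    proxy_cache_use_stale updating;  # serve the stale copy while a refresh is in flight
    proxy_pass http://app_backend;
}
```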
Trade-offs
dimension: Memory vs Disk Storage
option_a: Varnish (memory-only) provides 10-100x faster cache hits (sub-millisecond) and simpler architecture but requires expensive RAM and loses cache on restart.
option_b: Nginx (memory + disk) uses less RAM, persists cache across restarts, and handles larger cache sizes, but disk I/O adds 5-20ms latency on cache hits.
decision_framework: Use Varnish for high-traffic, read-heavy workloads where cache hit rate is critical and you can afford RAM (e.g., media streaming). Use Nginx for general-purpose caching with mixed workloads and when cache persistence matters.
dimension: Cache Invalidation Strategy
option_a: TTL-based expiration is simple and requires no coordination but serves stale data until expiration and wastes cache space on unchanged content.
option_b: Active invalidation (purging specific cache keys when data changes) keeps cache fresh but requires application integration and can create race conditions.
decision_framework: Start with TTL-based expiration for simplicity. Add selective purging for critical paths where staleness is unacceptable (inventory, pricing). See Cache Invalidation for implementation patterns.
dimension: Cache Warming vs Cold Start
option_a: Pre-warming the cache by crawling popular URLs before traffic arrives ensures high initial hit rates but requires build-time integration and delays deployments.
option_b: Cold start (empty cache) is simpler and faster to deploy but causes a thundering herd on the backend when cache is empty after restarts.
decision_framework: Use cache warming for predictable traffic patterns (e.g., product launches, scheduled content releases). Accept cold starts for unpredictable workloads and implement rate limiting on the backend to handle the initial surge.
Common Pitfalls
pitfall: Caching User-Specific Content Without Proper Keys
why_it_happens: Developers forget to include user identifiers in cache keys, causing one user’s personalized content to be served to another. This is a critical security and privacy issue.
how_to_avoid: Always audit cache key configuration. For user-specific content, either include user ID in the cache key or bypass caching entirely. Use Cache-Control: private headers to prevent intermediate caches from storing the response.
pitfall: Ignoring Cache-Control Headers from Application
why_it_happens: The reverse proxy is configured to cache everything for a fixed TTL, overriding the application’s Cache-Control headers. This breaks the application’s ability to control caching behavior.
how_to_avoid: Configure the proxy to respect Cache-Control headers from the backend. Use proxy_cache_valid in Nginx to set default TTLs only when the backend doesn’t specify caching headers.
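A sketch of the safe configuration: in Nginx, proxy_cache_valid acts only as a fallback, because X-Accel-Expires, Cache-Control, and Expires headers from the backend take precedence unless you explicitly tell the proxy to ignore them.

```nginx
location / {
    proxy_cache cache;
    proxy_cache_valid 200 5m;   # fallback TTL, overridden by backend caching headers
    # Do NOT add: proxy_ignore_headers Cache-Control Expires;
    # that directive is exactly what breaks the application's control over caching.
    proxy_pass http://app_backend;
}
```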
pitfall: Undersized Cache Leading to Thrashing
why_it_happens: The cache is too small for the working set, causing constant evictions and low hit rates. Popular items are evicted before their TTL expires, defeating the purpose of caching.
how_to_avoid: Monitor cache hit rates and eviction rates. Size the cache to hold at least your hot dataset (typically 20% of total content that receives 80% of traffic). For Nginx, this means setting proxy_cache_path with adequate max_size. For Varnish, allocate sufficient RAM.
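The 20%/80% sizing rule above can be turned into a quick back-of-envelope estimate; all of the input numbers below are illustrative, not measurements.

```python
def required_cache_bytes(total_objects, avg_object_bytes, hot_fraction=0.2):
    """Bytes needed to hold the hot set that serves most of the traffic."""
    hot_objects = int(total_objects * hot_fraction)
    return hot_objects * avg_object_bytes

# 1M cacheable pages averaging 50 KB: the hot 20% needs ~10 GB,
# so an Nginx proxy_cache_path would want max_size=10g or more.
size = required_cache_bytes(total_objects=1_000_000, avg_object_bytes=50_000)
print(size // 10**9)  # -> 10 (GB)
```

If the measured eviction rate stays high at that size, the true hot fraction is larger than assumed and the estimate should be revised upward.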
Real-World Examples
company: YouTube
system: Video Metadata API
usage_detail: YouTube uses Nginx as a reverse proxy cache for video metadata (titles, descriptions, view counts). Metadata is cached for 5 minutes with cache keys that include video ID and language. This reduces load on the metadata service by 85% while ensuring view counts stay reasonably fresh. When a video goes viral, the cache absorbs millions of requests without overwhelming the backend. Cache invalidation is triggered when creators update video details, using Nginx’s proxy_cache_purge module to immediately remove stale entries.
company: The Guardian (News Media)
system: Article Delivery
usage_detail: The Guardian uses Varnish to cache article pages with a 1-minute TTL (microcaching). This extremely short TTL ensures breaking news updates appear quickly while still protecting the CMS from traffic spikes when articles go viral on social media. During peak events, a single article might receive 10,000 requests per second, but only about one request per minute reaches the backend to refresh the cached copy. Varnish’s ESI support allows them to cache the article body while dynamically inserting real-time comment counts and personalized recommendation widgets.
company: Shopify
system: Storefront Pages
usage_detail: Shopify’s edge infrastructure uses Nginx to cache merchant storefront pages with TTLs ranging from 5 minutes (home pages) to 1 hour (product pages). Cache keys include the store domain, URL path, and device type (mobile vs desktop). When merchants update product information, Shopify’s backend publishes invalidation events that purge affected cache entries across all edge locations. This architecture allows Shopify to serve millions of storefronts without proportionally scaling application servers—most traffic is served directly from Nginx cache.
Interview Expectations
Mid-Level
Explain the difference between web server caching and CDN caching. Describe how a reverse proxy cache works (request flow, cache hit vs miss). Discuss basic cache key design (URL-based) and TTL configuration. Understand when to use web server caching (static content, read-heavy APIs) versus when to bypass it (user-specific data, POST requests). Be able to configure basic Nginx caching with proxy_cache directives.
Senior
Design cache key strategies for complex scenarios (multi-tenant systems, API versioning, internationalization). Explain trade-offs between different reverse proxy solutions (Nginx vs Varnish vs Apache Traffic Server). Discuss cache invalidation approaches and their consistency guarantees. Calculate cache sizing requirements based on traffic patterns and hit rate targets. Address cache stampede problems and mitigation strategies (request coalescing, probabilistic early expiration). Explain how web server caching fits into a multi-layer caching strategy alongside CDN and application caches.
Staff+
Architect a global caching strategy that coordinates web server caches across multiple data centers with consistency requirements. Design cache warming pipelines for predictable traffic events (product launches, content releases). Optimize cache memory allocation and eviction policies for cost efficiency. Implement sophisticated cache key normalization to maximize hit rates (URL canonicalization, header normalization). Design monitoring and alerting for cache health (hit rates, eviction rates, memory pressure). Explain how to handle cache poisoning attacks and implement cache security controls. Discuss advanced patterns like stale-while-revalidate and cache hierarchies.
Common Interview Questions
How would you design caching for a multi-tenant SaaS application where each tenant has custom branding?
Your cache hit rate dropped from 80% to 40% after a deployment. How do you debug this?
Explain the difference between proxy_cache and fastcgi_cache in Nginx. When would you use each?
How do you prevent cache stampede when a popular cache entry expires under high load?
Design a cache invalidation strategy for an e-commerce site where product prices change frequently.
What’s the impact of adding Vary: User-Agent to your cache headers? How would you optimize this?
Key Takeaways
Web server caching operates at the reverse proxy layer (Nginx, Varnish) between clients and application servers, distinct from CDN caching (edge) and application caching (in-process).
Cache key design is critical: include all request dimensions that affect the response (URL, language, encoding) while excluding user-specific attributes for public content.
TTL selection balances freshness and hit rate—use tiered TTLs (1 year for immutable assets, minutes for dynamic content) and consider microcaching (1-5 seconds) for traffic spike protection.
Web server caching is most effective for read-heavy, publicly accessible content (static assets, API responses, product pages) and can reduce backend load by 70-90%.
Common pitfalls include caching user-specific content without proper keys, ignoring Cache-Control headers from applications, and undersizing cache leading to thrashing and low hit rates.
Related Topics
Prerequisites
HTTP Protocol Basics - Understanding HTTP headers and caching semantics
Reverse Proxy Architecture - How reverse proxies fit into system architecture
Related
CDN Caching - Edge caching upstream from web servers
Application Caching - In-process caching downstream from web servers
Cache Invalidation - Strategies for keeping cached data fresh
Next Steps
Cache Consistency - Maintaining consistency across cache layers
Cache Eviction Policies - LRU, LFU, and other eviction strategies