Security Patterns in System Design

intermediate 34 min read Updated 2026-02-11

TL;DR

Security in distributed systems protects confidentiality, integrity, and availability (the CIA triad) against attackers. It’s not a feature you add at the end—it’s a design constraint that shapes every architectural decision from trust boundaries to data flow. In interviews, security demonstrates your ability to think adversarially and design systems that fail safely under attack.

Cheat Sheet: CIA triad (confidentiality, integrity, availability) + defense in depth + principle of least privilege + trust boundaries + threat modeling = security-first design.

The Analogy

Think of security like designing a bank vault system. You don’t just build a strong door—you create concentric layers of protection: perimeter fencing (network security), lobby checkpoints (authentication), vault access controls (authorization), cameras everywhere (audit logging), and time-delayed locks (rate limiting). Each layer assumes the previous one might fail. The vault designer thinks like a thief, imagining every possible attack vector, then builds defenses that make each attack economically infeasible. That’s exactly how you approach distributed system security: assume breach, verify everything, and make attacks too expensive to execute.

Why This Matters in Interviews

Security comes up in nearly every system design interview, either explicitly (“design a secure payment system”) or implicitly (“how do you prevent abuse?”). Interviewers use security questions to assess three things: (1) your ability to think adversarially and anticipate attack vectors, (2) your understanding of security fundamentals beyond surface-level buzzwords, and (3) your judgment in balancing security with usability and performance. Strong candidates naturally weave security into their designs rather than treating it as an afterthought. They demonstrate depth by discussing specific mechanisms (JWT vs session tokens, TLS 1.3 vs 1.2) and breadth by considering the entire attack surface from network to application to data layer.


Core Concept

Security in distributed systems is fundamentally about protecting three properties—confidentiality (data isn’t exposed to unauthorized parties), integrity (data isn’t tampered with), and availability (legitimate users can access the system)—across multiple machines, networks, and trust boundaries. Unlike monolithic applications where you control the entire execution environment, distributed systems expose attack surfaces at every network hop, service boundary, and data store. This expanded attack surface means you can’t rely on perimeter security alone; you need defense in depth with security controls at every layer.

The challenge intensifies because distributed systems involve multiple actors with different trust levels: end users, internal services, third-party APIs, and infrastructure components. Each interaction crosses a trust boundary where you must verify identity, enforce authorization, and audit actions. At companies like Netflix, a single user request might traverse 50+ microservices, each needing to make independent security decisions without creating a performance bottleneck. This requires rethinking security from “castle and moat” (strong perimeter, trusted interior) to “zero trust” (verify everything, trust nothing).

Effective security design starts with threat modeling—systematically identifying what you’re protecting (assets), who might attack it (threat actors), how they might attack (attack vectors), and what damage they could cause (impact). This analysis drives your security architecture. For a payment system, you’d identify credit card data as a high-value asset, consider both external attackers and malicious insiders, analyze vectors like SQL injection and man-in-the-middle attacks, and recognize that a breach could mean regulatory fines, customer loss, and reputational damage. Only after this analysis can you design proportionate controls.

CIA Triad in Distributed Systems

graph TB
    subgraph Security Properties
        CIA["CIA Triad"]
        C["Confidentiality<br/><i>Data not exposed to<br/>unauthorized parties</i>"]
        I["Integrity<br/><i>Data not tampered<br/>with or modified</i>"]
        A["Availability<br/><i>Legitimate users<br/>can access system</i>"]
    end
    
    subgraph Threats
        T1["Data Breach<br/><i>Stolen credentials,<br/>SQL injection</i>"]
        T2["Data Tampering<br/><i>MITM attacks,<br/>unauthorized edits</i>"]
        T3["DDoS Attack<br/><i>Resource exhaustion,<br/>service disruption</i>"]
    end
    
    subgraph Controls
        CT1["Encryption + Access Control"]
        CT2["Signatures + Audit Logs"]
        CT3["Rate Limiting + Redundancy"]
    end
    
    CIA --> C
    CIA --> I
    CIA --> A
    
    T1 -."threatens".-> C
    T2 -."threatens".-> I
    T3 -."threatens".-> A
    
    C --> CT1
    I --> CT2
    A --> CT3

The CIA triad defines three core security properties that must be protected in distributed systems. Each property faces specific threats and requires different security controls. Unlike monolithic systems, distributed architectures must maintain these properties across multiple trust boundaries and network hops.

How It Works

Security in distributed systems operates through layered controls that work together to prevent, detect, and respond to attacks. Here’s how security flows through a typical request:

Step 1: Network Security (Perimeter Defense) The request first hits your network edge, where firewalls, DDoS protection (like Cloudflare or AWS Shield), and Web Application Firewalls (WAF) filter obvious attacks. The WAF inspects HTTP traffic for SQL injection attempts, XSS payloads, and known exploit patterns. Stripe’s edge layer blocks 99.9% of malicious traffic before it reaches application servers, using rate limiting (max 100 requests/second per IP) and geographic restrictions (blocking requests from sanctioned countries). This layer provides availability protection—keeping attackers from overwhelming your infrastructure.
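The per-IP rate limiting described above is commonly implemented as a token bucket. The sketch below is illustrative, assuming a refill rate and burst capacity like the 100 req/sec figure mentioned; it is not Stripe’s actual implementation, and a production limiter would live in a shared store (e.g. Redis) rather than process memory.

```python
import time

class TokenBucket:
    """Per-client token bucket rate limiter (illustrative sketch).

    Each client IP gets a bucket that refills at `rate` tokens/second
    up to `capacity`; a request is allowed if a token is available.
    """

    def __init__(self, rate=100.0, capacity=100.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = {}          # client_ip -> remaining tokens
        self.last = {}            # client_ip -> last refill timestamp

    def allow(self, client_ip, now=None):
        now = time.monotonic() if now is None else now
        tokens = self.tokens.get(client_ip, self.capacity)
        last = self.last.get(client_ip, now)
        # Refill based on elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        self.tokens[client_ip] = tokens - 1.0 if allowed else tokens
        self.last[client_ip] = now
        return allowed
```

The `now` parameter exists so the clock can be injected for testing; callers normally omit it.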

Step 2: Transport Security (Encryption in Transit) Legitimate traffic proceeds over TLS 1.3, which encrypts data in transit and provides mutual authentication. The client verifies the server’s certificate (signed by a trusted CA like Let’s Encrypt), and optionally the server verifies the client’s certificate (mTLS for service-to-service communication). This prevents man-in-the-middle attacks where an attacker intercepts and modifies traffic. Netflix enforces TLS for all inter-service communication, with automatic certificate rotation every 24 hours to limit the blast radius if a private key is compromised.
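Enforcing mTLS for service-to-service traffic might look like the following sketch using Python’s standard `ssl` module. The file paths are hypothetical placeholders; in a setup like the one described, certificates would be issued and rotated by an internal CA.

```python
import ssl

def mtls_server_context(cert_file, key_file, client_ca_file):
    """Build a TLS 1.3 server context that requires client certificates
    (mTLS). Paths are hypothetical; certs come from your CA in practice."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3   # refuse downgrade to older protocols
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=client_ca_file)
    ctx.verify_mode = ssl.CERT_REQUIRED            # reject clients without a valid cert
    return ctx
```

Setting `verify_mode = CERT_REQUIRED` is what turns one-way TLS into mutual TLS: the handshake fails unless the client presents a certificate chaining to the trusted CA.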

Step 3: Authentication (Identity Verification) The API gateway authenticates the request by validating credentials—typically a JWT token, OAuth access token, or API key. Authentication answers “who are you?” The gateway verifies the token’s signature (using the issuer’s public key), checks expiration, and validates claims (issuer, audience, scope). If valid, the gateway extracts the user identity and passes it downstream. Google’s infrastructure uses short-lived tokens (15-minute expiry) with refresh tokens for extended sessions, balancing security (stolen tokens expire quickly) with usability (users don’t constantly re-authenticate).
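The signature, expiry, and audience checks described above can be sketched with stdlib primitives. This uses HS256 (shared-secret HMAC) for self-containment, whereas the text also mentions public-key verification; real services should use a vetted JWT library rather than hand-rolling this.

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def mint_jwt_hs256(claims: dict, secret: bytes) -> str:
    """Create a signed HS256 JWT (for the example only)."""
    enc = lambda b: base64.urlsafe_b64encode(b).rstrip(b"=").decode()
    header = enc(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = enc(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{enc(sig)}"

def verify_jwt_hs256(token: str, secret: bytes, audience: str) -> dict:
    """Validate signature, expiry, and audience; raises ValueError on failure."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected alg")   # prevent algorithm-confusion attacks
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    if claims.get("aud") != audience:
        raise ValueError("wrong audience")
    return claims
```

Note the explicit `alg` check and the constant-time `compare_digest`: both close well-known attack vectors against naive JWT verifiers.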

Step 4: Authorization (Access Control) Each service independently checks “what are you allowed to do?” using the authenticated identity. This might be role-based access control (RBAC: “admins can delete users”), attribute-based access control (ABAC: “users can edit their own posts”), or relationship-based access control (ReBAC: “editors of document X can share it”). Uber’s authorization system evaluates policies like “riders can view their own trips” by checking the trip’s rider_id against the authenticated user_id. Authorization happens at every service boundary—never trust that upstream services enforced the right policies.
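The “riders can view their own trips” check above combines an RBAC rule with an ABAC-style ownership attribute. A minimal sketch, with hypothetical names (`Trip`, `support_agent`) not taken from Uber’s actual system:

```python
from dataclasses import dataclass

@dataclass
class Trip:
    trip_id: str
    rider_id: str

def can_view_trip(user_id: str, roles: set, trip: Trip) -> bool:
    """Authorize viewing a trip: role grants broad access (RBAC),
    otherwise ownership is checked against the authenticated identity (ABAC)."""
    if "support_agent" in roles:        # RBAC: role-based grant
        return True
    return trip.rider_id == user_id     # ABAC: attribute (ownership) check
```

Crucially, this check runs inside the trip service itself, per the “never trust upstream enforcement” rule.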

Step 5: Data Security (Encryption at Rest) When data is stored, it’s encrypted using keys managed by a key management service (AWS KMS, HashiCorp Vault). Sensitive fields like credit card numbers use field-level encryption with separate keys per customer, so compromising one key doesn’t expose all data. Airbnb encrypts personally identifiable information (PII) at the application layer before writing to databases, ensuring that even database administrators can’t read sensitive data without application-level access.
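The “separate keys per customer” idea is a key hierarchy: a master key (held in the KMS) derives or wraps a distinct data key per customer. The sketch below shows only the derivation step, HKDF-extract style with stdlib HMAC; actual field encryption would use AES-GCM via a vetted library, and the master key would never leave the KMS.

```python
import hashlib
import hmac

def customer_data_key(master_key: bytes, customer_id: str) -> bytes:
    """Derive a distinct 256-bit data key per customer from a master key.

    Compromising one derived key exposes one customer's data, not the
    master key or other customers' keys -- the blast-radius property
    described above. Illustrative only; production systems wrap keys
    inside a KMS rather than deriving them in application code.
    """
    return hmac.new(master_key, customer_id.encode(), hashlib.sha256).digest()
```

HMAC is one-way, so an attacker holding `customer_data_key(m, "cust-1")` cannot recover `master_key` or compute any other customer’s key.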

Step 6: Audit and Monitoring (Detection and Response) Every security-relevant action—authentication attempts, authorization decisions, data access—generates audit logs sent to a centralized SIEM (Security Information and Event Management) system. Anomaly detection algorithms flag suspicious patterns: a user suddenly accessing 10,000 records, login attempts from a new country, or API calls outside normal business hours. When Dropbox detects unusual access patterns, they automatically trigger additional authentication challenges (step-up authentication) or temporarily lock the account pending investigation.
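An audit record with the context anomaly detection needs might be structured like this. Field names are illustrative, not a real SIEM schema; note that no sensitive payload data (passwords, card numbers) appears in the event.

```python
import json
import time
import uuid

def audit_event(actor: str, action: str, resource: str,
                outcome: str, reason: str) -> str:
    """Serialize one security-relevant event as JSON for the SIEM pipeline."""
    return json.dumps({
        "event_id": str(uuid.uuid4()),       # unique ID for deduplication
        "ts_ms": int(time.time() * 1000),    # millisecond-precision timestamp
        "actor": actor,                      # user ID or service identity
        "action": action,                    # e.g. "records.read"
        "resource": resource,                # what was accessed
        "outcome": outcome,                  # "allow" / "deny" / "error"
        "reason": reason,                    # which policy matched
    })
```

In the flow described above, these events would be sent asynchronously (fire-and-forget to a queue) so logging never blocks the request path, and stored immutably so attackers can’t erase their tracks.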

Step 7: Incident Response (Containment and Recovery) When an attack is detected, automated playbooks revoke compromised credentials, isolate affected services, and trigger alerts to the security team. GitHub’s incident response includes automatically rotating all OAuth tokens if their signing key is compromised, forcing all users to re-authenticate. This limits the attack window to minutes rather than days.

Request Flow Through Security Layers

graph LR
    Client["Client<br/><i>Web/Mobile App</i>"]
    
    subgraph Edge Layer
        WAF["WAF<br/><i>SQL injection,<br/>XSS filtering</i>"]
        DDoS["DDoS Protection<br/><i>Rate limiting:<br/>100 req/sec/IP</i>"]
    end
    
    subgraph API Gateway
        TLS["TLS 1.3<br/><i>Encrypted transport</i>"]
        Auth["Authentication<br/><i>JWT validation</i>"]
    end
    
    subgraph Application Layer
        Service1["User Service<br/><i>Authorization:<br/>RBAC check</i>"]
        Service2["Payment Service<br/><i>Authorization:<br/>ABAC check</i>"]
    end
    
    subgraph Data Layer
        DB[("Database<br/><i>Encrypted at rest<br/>AES-256</i>")]
        KMS["Key Management<br/><i>AWS KMS</i>"]
    end
    
    SIEM["SIEM<br/><i>Audit logs +<br/>Anomaly detection</i>"]
    
    Client --"1. HTTPS Request"--> WAF
    WAF --"2. Filter malicious<br/>patterns"--> DDoS
    DDoS --"3. Check rate limit"--> TLS
    TLS --"4. Decrypt + verify<br/>certificate"--> Auth
    Auth --"5. Validate JWT<br/>signature + expiry"--> Service1
    Service1 --"6. Check user<br/>permissions"--> Service2
    Service2 --"7. Verify payment<br/>authorization"--> DB
    DB --"8. Decrypt data<br/>using key"--> KMS
    
    Auth -."Log: auth attempt".-> SIEM
    Service1 -."Log: access decision".-> SIEM
    Service2 -."Log: payment action".-> SIEM
    DB -."Log: data access".-> SIEM

A typical request traverses seven security layers, each providing defense in depth. The WAF and DDoS protection filter attacks at the edge (99.9% of malicious traffic blocked here). TLS encrypts transport. Authentication validates identity. Each service independently authorizes actions. Data is encrypted at rest with keys from KMS. Every layer logs to SIEM for anomaly detection, creating an audit trail for forensic analysis.

Key Principles

Principle 1: Defense in Depth (Layered Security) Never rely on a single security control. Assume every layer will eventually fail, and design so that breaching one layer doesn’t compromise the entire system. If an attacker bypasses your WAF (maybe through a zero-day vulnerability), they still face authentication, authorization, encryption, rate limiting, and audit logging. Twitter’s security architecture includes network segmentation (DMZ for public services, private VPC for databases), application-level authentication, database-level access controls, and encryption at rest. When their network perimeter was breached in 2020, the attackers still couldn’t access encrypted user data without application-level credentials. This principle means accepting redundancy—you’ll have both network firewalls and application-level input validation, even though they partially overlap.

Principle 2: Principle of Least Privilege Grant the minimum permissions necessary for each component to function, and nothing more. A service that only reads user profiles shouldn’t have write access to the database. A background job that processes images shouldn’t have network access to payment systems. At Amazon, services run with IAM roles that explicitly list allowed actions (“s3:GetObject on bucket X”) rather than broad permissions (“s3:*”). When the Capital One breach happened, the attacker exploited overly permissive IAM roles—a web application had permissions to list and read all S3 buckets, not just its own. Least privilege limits blast radius: if one service is compromised, the attacker can’t pivot to other systems. This extends to human access too—engineers get read-only production access by default, requiring explicit approval and time-limited elevation for write operations.

Principle 3: Zero Trust Architecture Never assume that requests from inside your network are trustworthy. Traditional security models trusted anything inside the corporate firewall, but modern threats include compromised internal services, malicious insiders, and lateral movement after initial breach. Zero trust means every request—even between internal microservices—must be authenticated, authorized, and encrypted. Google’s BeyondCorp initiative eliminated their corporate VPN, instead requiring every service call to present a cryptographically verified identity and pass authorization checks. When a Netflix microservice calls another, it presents a short-lived certificate proving its identity, and the receiving service validates that the caller is authorized for that specific operation. This prevents a compromised service from freely accessing other internal systems.

Principle 4: Fail Securely (Secure Defaults) When something goes wrong—a service crashes, a network partition occurs, a configuration is missing—the system should fail in a way that preserves security, even at the cost of availability. If your authorization service is unreachable, deny the request rather than allowing it through. If certificate validation fails, reject the connection rather than falling back to unencrypted communication. Stripe’s payment processing fails closed: if they can’t verify a merchant’s identity or validate fraud checks, they decline the transaction rather than risking fraudulent charges. This principle conflicts with availability goals, requiring careful judgment. For non-critical features (like recommendation systems), you might fail open to preserve user experience. For security-critical operations (authentication, payment authorization), always fail closed.
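The fail-closed default can be captured in a tiny wrapper: any error in the check itself (policy service down, timeout, bad config) denies the request, and failing open is an explicit opt-in reserved for low-risk paths. A minimal sketch, not tied to any particular authorization service:

```python
def authorize_or_fail_closed(check, *, fail_open=False):
    """Run an authorization check, denying by default if the check errors.

    `check` is any zero-argument callable returning a truthy decision
    (e.g. a call to a remote policy engine). `fail_open=True` should only
    be set for low-criticality, read-only operations, as discussed above.
    """
    try:
        return bool(check())
    except Exception:
        return fail_open  # fail closed (deny) unless explicitly opted out
```

Making fail-open a keyword-only flag forces the caller to state the exception in code review, rather than inheriting an unsafe default.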

Principle 5: Security Through Transparency (Audit Everything) You can’t detect or respond to attacks if you don’t know what’s happening. Log every security-relevant event with sufficient context to reconstruct what happened: who made the request (user ID, service identity), what they tried to do (API endpoint, operation), when it happened (timestamp with millisecond precision), whether it succeeded (HTTP status, error code), and why it was allowed or denied (authorization policy matched). Dropbox logs every file access with user ID, file ID, IP address, and access method (web, mobile, API). These logs feed into real-time anomaly detection and provide forensic evidence after incidents. The challenge is balancing completeness with privacy and performance—don’t log sensitive data (passwords, credit card numbers) and use sampling for high-volume operations. Audit logs must be immutable and stored separately from application data so attackers can’t cover their tracks by deleting logs.

Zero Trust Architecture: Service-to-Service Communication

graph TB
    subgraph Traditional["Traditional: Castle and Moat"]
        Firewall1["Corporate Firewall"]
        Internal1["Internal Network<br/><i>All services trusted</i>"]
        S1["Service A"]
        S2["Service B"]
        S3["Service C"]
        
        Firewall1 --> Internal1
        Internal1 --> S1
        Internal1 --> S2
        Internal1 --> S3
        S1 --"Unencrypted,<br/>no auth"--> S2
        S2 --"Unencrypted,<br/>no auth"--> S3
    end
    
    subgraph ZeroTrust["Zero Trust: Verify Everything"]
        S4["Service A<br/><i>mTLS cert</i>"]
        S5["Service B<br/><i>mTLS cert</i>"]
        S6["Service C<br/><i>mTLS cert</i>"]
        AuthZ["Authorization Service<br/><i>Policy engine</i>"]
        
        S4 --"1. Present cert +<br/>request permission"--> AuthZ
        AuthZ --"2. Verify identity +<br/>check policy"--> S5
        S4 --"3. Encrypted call<br/>with short-lived token"--> S5
        S5 --"4. Validate token +<br/>enforce authorization"--> S5
        S5 --"5. Similar flow"--> S6
    end
    
    Attacker["Attacker<br/><i>Compromised Service A</i>"]
    
    Attacker -."Can access B & C<br/>freely".-> S2
    Attacker -."Blocked: no valid cert<br/>or authorization".-> S5

Traditional ‘castle and moat’ security trusts everything inside the network perimeter, allowing lateral movement after initial breach. Zero trust requires every service call to present cryptographic identity (mTLS certificate) and pass authorization checks, even between internal services. When Service A is compromised in a zero trust architecture, the attacker cannot freely access Services B and C without valid certificates and authorization policies.


Deep Dive

Types / Variants

Authentication Mechanisms

Session-Based Authentication uses server-side session storage. After login, the server creates a session ID, stores user data in Redis or a database, and returns the session ID in a cookie. Subsequent requests include the cookie, and the server looks up session data. This provides strong security (server controls session lifetime, can revoke instantly) but requires shared session storage across servers, creating a stateful bottleneck. Netflix used session-based auth in their early monolith but migrated to stateless tokens when scaling to thousands of microservices. Use session-based auth when you need instant revocation (banking, admin panels) and have a small number of servers.

Token-Based Authentication (JWT) embeds user data in a cryptographically signed token that the client stores and includes in requests. The server validates the signature without database lookups, making it stateless and scalable. JWTs contain claims (user ID, roles, expiration) encoded as JSON and signed with HMAC or RSA. The downside is you can’t revoke tokens before expiration—if a token is stolen, it’s valid until it expires. Spotify uses short-lived JWTs (15-minute expiry) with refresh tokens (30-day expiry) stored server-side, combining stateless performance with revocation capability. Use JWTs for microservices and mobile apps where stateless scaling matters more than instant revocation.

OAuth 2.0 / OpenID Connect delegates authentication to a specialized identity provider (Google, Okta, Auth0). Users authenticate with the provider, which issues access tokens that your application validates. This centralizes security expertise, provides single sign-on across applications, and shifts liability for credential storage. Slack uses OAuth to let users sign in with Google or Microsoft accounts, avoiding the need to store passwords. The complexity is managing token refresh, handling provider outages, and dealing with multiple token types (access, refresh, ID). Use OAuth when integrating with third-party identity providers or building multi-tenant SaaS platforms.

Mutual TLS (mTLS) requires both client and server to present certificates, proving identity at the transport layer. This is common for service-to-service authentication in microservices. Uber’s infrastructure uses mTLS for all internal communication, with certificates automatically issued and rotated by their internal CA. The challenge is certificate management at scale—issuing, distributing, rotating, and revoking certificates for thousands of services. Use mTLS for high-security environments (financial services, healthcare) and service meshes (Istio, Linkerd) that automate certificate lifecycle.

Authorization Models

Role-Based Access Control (RBAC) assigns users to roles (admin, editor, viewer) and grants permissions to roles. Simple to understand and implement, but becomes unwieldy with many roles and doesn’t handle contextual permissions well (“editors can modify their own posts”). GitHub uses RBAC for repository permissions (read, write, admin) combined with team-based inheritance. Use RBAC for internal tools and applications with clear organizational hierarchies.

Attribute-Based Access Control (ABAC) evaluates policies based on attributes of the user (department, clearance level), resource (classification, owner), and environment (time of day, IP address). Policies like “users in the finance department can access confidential documents during business hours from office IPs” provide fine-grained control. AWS IAM uses ABAC with condition keys that check request context. The complexity is policy authoring and debugging—it’s hard to understand why a request was denied. Use ABAC for complex enterprise environments with dynamic access requirements.

Relationship-Based Access Control (ReBAC) models permissions as relationships in a graph. “Alice can edit document X because she’s in the ‘editors’ group for folder Y, which contains X.” Google Drive uses ReBAC to handle nested folder permissions and sharing. Zanzibar, Google’s authorization system, evaluates billions of permission checks per second by precomputing and caching relationship graphs. Use ReBAC for collaborative applications (Google Docs, Notion) where permissions are defined by user relationships to resources.
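The ReBAC example above (“Alice can edit document X because she’s in the editors group for folder Y, which contains X”) evaluates by traversing relationship tuples. The sketch below uses a toy version of Zanzibar-style `(subject, relation, object)` tuples; the schema and relation names are illustrative, not Google’s actual API.

```python
# Relationship tuples: (subject, relation, object).
TUPLES = {
    ("alice", "member", "group:editors"),
    ("group:editors", "editor", "folder:Y"),
    ("folder:Y", "parent", "doc:X"),
}

def can_edit(user: str, doc: str, tuples=TUPLES) -> bool:
    """Decide edit access by expanding group membership, collecting
    directly editable objects, then propagating rights down containment."""
    # 1. Everything the user "acts as": themselves plus transitive groups.
    subjects = {user}
    changed = True
    while changed:
        changed = False
        for (s, rel, obj) in tuples:
            if rel == "member" and s in subjects and obj not in subjects:
                subjects.add(obj)
                changed = True
    # 2. Objects any of those subjects can edit directly.
    editable = {obj for (s, rel, obj) in tuples if rel == "editor" and s in subjects}
    # 3. Edit rights flow down containment (folder -parent-> doc).
    changed = True
    while changed:
        changed = False
        for (s, rel, obj) in tuples:
            if rel == "parent" and s in editable and obj not in editable:
                editable.add(obj)
                changed = True
    return doc in editable
```

Systems like Zanzibar answer such queries at massive scale by precomputing and caching these graph traversals rather than walking tuples per request.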

Encryption Approaches

Encryption at Rest protects stored data using symmetric encryption (AES-256). Keys are managed by a KMS, with separate keys per tenant or data classification. Slack encrypts all messages at rest using AWS KMS, with keys rotated annually. The challenge is key management—you need secure key storage, access controls on keys, and key rotation without downtime. Use encryption at rest for compliance (GDPR, HIPAA) and protecting against physical theft of storage media.

Encryption in Transit uses TLS to protect data moving between services. TLS 1.3 provides forward secrecy (compromising today’s keys doesn’t decrypt past traffic) and faster handshakes. Cloudflare enforces TLS 1.3 for all traffic, with automatic fallback to 1.2 for legacy clients. The challenge is certificate management and performance overhead (TLS adds 1-2ms latency per connection). Use TLS everywhere—it’s now fast enough that there’s no excuse for unencrypted traffic.

End-to-End Encryption (E2EE) encrypts data on the client, so even the service provider can’t read it. WhatsApp uses the Signal protocol for E2EE messaging—messages are encrypted on the sender’s device and only decrypted on the recipient’s device. The tradeoff is losing server-side features like search, content moderation, and backup. Use E2EE for privacy-critical applications (messaging, healthcare) where users must trust no one but themselves.

Authentication Mechanisms Comparison

graph TB
    subgraph Session-Based
        Client1["Client"]
        Server1["Server"]
        SessionStore[("Redis<br/><i>Session storage</i>")]
        
        Client1 --"1. Login credentials"--> Server1
        Server1 --"2. Create session ID<br/>+ store user data"--> SessionStore
        Server1 --"3. Return session ID<br/>in cookie"--> Client1
        Client1 --"4. Request + cookie"--> Server1
        Server1 --"5. Lookup session"--> SessionStore
    end
    
    subgraph Token-Based JWT
        Client2["Client"]
        Server2["Server<br/><i>Stateless</i>"]
        
        Client2 --"1. Login credentials"--> Server2
        Server2 --"2. Generate JWT<br/>{user_id, roles, exp}<br/>+ sign with secret"--> Client2
        Client2 --"3. Store JWT locally"--> Client2
        Client2 --"4. Request + JWT<br/>in Authorization header"--> Server2
        Server2 --"5. Verify signature<br/>+ check expiration<br/>(no DB lookup)"--> Server2
    end
    
    subgraph OAuth 2.0
        Client3["Client App"]
        AuthProvider["Identity Provider<br/><i>Google, Okta</i>"]
        Server3["Resource Server"]
        
        Client3 --"1. Redirect to<br/>provider"--> AuthProvider
        AuthProvider --"2. User authenticates"--> AuthProvider
        AuthProvider --"3. Return auth code"--> Client3
        Client3 --"4. Exchange code<br/>for access token"--> AuthProvider
        Client3 --"5. Request + token"--> Server3
        Server3 --"6. Validate token<br/>with provider"--> AuthProvider
    end
    
    Props1["✓ Instant revocation<br/>✓ Strong security<br/>✗ Stateful bottleneck<br/>✗ Shared storage needed"]
    Props2["✓ Stateless scaling<br/>✓ No DB lookups<br/>✗ Can't revoke before expiry<br/>✗ Token size overhead"]
    Props3["✓ Centralized identity<br/>✓ Single sign-on<br/>✗ Provider dependency<br/>✗ Complex token refresh"]
    
    SessionStore -.-> Props1
    Server2 -.-> Props2
    AuthProvider -.-> Props3

Three primary authentication mechanisms with different tradeoffs. Session-based provides instant revocation but requires shared state. JWT enables stateless scaling but can’t be revoked before expiration (mitigated with short expiry + refresh tokens). OAuth delegates to specialized identity providers, centralizing security expertise but adding external dependencies. Netflix migrated from sessions to JWT when scaling to thousands of microservices.

Trade-offs

Security vs. Performance

Every security control adds latency and computational overhead. TLS handshakes add 1-2 round trips (50-200ms on high-latency connections). JWT signature validation takes 0.1-1ms per request. Database-level encryption adds 5-10% CPU overhead. At Netflix’s scale (millions of requests per second), these costs multiply into significant infrastructure expenses. The decision framework: (1) measure the actual performance impact with realistic load testing, (2) identify the minimum security controls required for compliance and risk tolerance, (3) optimize hot paths (cache JWT validation results, use TLS session resumption, hardware-accelerated encryption). For read-heavy workloads, cache authorization decisions with short TTLs (30-60 seconds). For write-heavy workloads, use asynchronous audit logging so security doesn’t block the critical path. Stripe accepts 2-5ms additional latency for encryption and fraud checks because payment security is non-negotiable, but they optimize aggressively to keep it under 5ms.
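The short-TTL authorization cache mentioned above can be sketched as follows. This is illustrative only (no size bound, eviction, or locking), with the 30-second TTL taken from the range discussed; a production system would need bounded memory and coordinated invalidation.

```python
import time

class TTLCache:
    """Tiny TTL cache for authorization decisions (30-60s TTLs as discussed).

    A miss or expired entry returns None, signalling the caller to
    re-evaluate the policy and `put` the fresh decision.
    """

    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self._store = {}  # key -> (decision, stored_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        hit = self._store.get(key)
        if hit is None or now - hit[1] > self.ttl:
            return None          # miss or expired
        return hit[0]            # cached allow/deny decision

    def put(self, key, decision, now=None):
        now = time.monotonic() if now is None else now
        self._store[key] = (decision, now)
```

The short TTL bounds the staleness window: a revoked permission lingers for at most `ttl` seconds, the trade-off that makes caching acceptable for read-heavy workloads.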

Security vs. Usability

Strict security often frustrates users. Multi-factor authentication reduces account takeovers by 99.9% but adds friction to every login. Short token expiration (5 minutes) limits stolen token damage but forces frequent re-authentication. IP-based restrictions prevent unauthorized access but break for users on VPNs or traveling. The decision framework: (1) risk-based authentication—require MFA only for sensitive operations (changing password, large transfers) or anomalous behavior (new device, unusual location), (2) progressive security—start with low-friction methods (biometrics, push notifications) and escalate only when needed, (3) measure user impact with metrics like authentication success rate and time-to-complete. Google uses risk-based authentication: logging in from your usual device and location requires just a password, but accessing from a new country triggers MFA. Dropbox allows users to stay logged in for 30 days on trusted devices while requiring re-authentication on new devices.

Centralized vs. Decentralized Security

Centralized security (API gateway handles all authentication/authorization) simplifies management and provides a single enforcement point, but creates a bottleneck and single point of failure. Decentralized security (each service validates independently) scales better and provides defense in depth, but risks inconsistent enforcement and configuration drift. The decision framework: (1) use centralized authentication (API gateway validates tokens) for consistency and performance, (2) use decentralized authorization (each service checks permissions) for defense in depth and context-aware decisions, (3) provide shared libraries and policy engines (OPA, Casbin) to ensure consistent policy evaluation. Netflix’s architecture has edge gateways handle authentication and rate limiting, but each microservice independently validates that the caller is authorized for the specific operation. This prevents a compromised gateway from granting unauthorized access to backend services.

Fail Open vs. Fail Closed

When a security service is unavailable, you must choose: fail open (allow requests through, risking security breach) or fail closed (deny requests, impacting availability). For authentication failures, always fail closed—better to have an outage than allow unauthorized access. For authorization failures, the decision depends on criticality: fail closed for payment processing and data modification, but consider failing open for read-only operations on non-sensitive data. The decision framework: (1) classify operations by security criticality (critical, high, medium, low), (2) define failure modes for each class (critical always fails closed, low might fail open), (3) implement circuit breakers with different thresholds (fail closed after 3 consecutive errors, fail open after 50% error rate for 5 minutes). Uber’s authorization service fails closed for ride requests (can’t allow unauthorized rides) but fails open for viewing past trips (low risk, high user impact if unavailable).
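The criticality classification in step (1) above can be encoded as a small policy table consulted when a security check itself errors. The operation classes and the default are illustrative; the key design point is that anything unclassified denies.

```python
# Failure-mode policy per criticality class (hypothetical classification).
FAILURE_POLICY = {
    "critical": "fail_closed",   # payments, authentication, data modification
    "high":     "fail_closed",
    "medium":   "fail_closed",
    "low":      "fail_open",     # read-only views of non-sensitive data
}

def decide_on_error(criticality: str) -> bool:
    """Return the decision (True=allow, False=deny) to apply when the
    security check errors. Unknown classes deny: secure by default."""
    mode = FAILURE_POLICY.get(criticality, "fail_closed")
    return mode == "fail_open"
```

Making the unknown-class case fail closed means a forgotten classification produces an availability bug (visible, fixable) rather than a silent security hole.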

Compliance vs. Flexibility

Regulatory requirements (PCI-DSS, HIPAA, GDPR) mandate specific security controls, but these often conflict with engineering velocity and system flexibility. PCI-DSS requires network segmentation and quarterly vulnerability scans, adding operational overhead. GDPR’s right to deletion requires building data deletion pipelines across all systems. The decision framework: (1) identify minimum compliance requirements and implement them first, (2) design systems to be compliance-friendly from the start (data classification, audit logging, encryption), (3) automate compliance checks (infrastructure as code, policy as code) to reduce manual overhead. Stripe built their infrastructure to be PCI-DSS compliant by default, with automatic network segmentation, encrypted storage, and audit logging, so individual teams don’t need to think about compliance for every feature.

Security vs. Performance: Optimization Strategies

```mermaid
graph LR
    subgraph Request Path
        Client["Client"]
        Edge["Edge<br/><i>TLS termination<br/>+1-2ms</i>"]
        Gateway["API Gateway<br/><i>JWT validation<br/>+0.5ms</i>"]
        Service["Service<br/><i>Authorization<br/>+0.3ms</i>"]
        DB[("Database<br/><i>Encryption overhead<br/>+5-10% CPU</i>")]
    end

    subgraph Optimization Layer
        TLSCache["TLS Session Cache<br/><i>Reuse handshake<br/>-50% latency</i>"]
        JWTCache["JWT Validation Cache<br/><i>Cache results 60s<br/>-90% validation cost</i>"]
        AuthZCache["AuthZ Decision Cache<br/><i>Cache policies 30s<br/>-80% checks</i>"]
        HWAccel["Hardware Acceleration<br/><i>AES-NI instructions<br/>-60% CPU overhead</i>"]
    end

    subgraph Async Layer
        AuditQueue["Audit Log Queue<br/><i>Async logging<br/>non-blocking</i>"]
        SIEM["SIEM"]
    end

    Client --"HTTPS"--> Edge
    Edge --"Validated request"--> Gateway
    Gateway --"Authenticated request"--> Service
    Service --"Query"--> DB

    Edge -."Cache hit: skip handshake".-> TLSCache
    Gateway -."Cache hit: skip validation".-> JWTCache
    Service -."Cache hit: skip policy eval".-> AuthZCache
    DB -."Use CPU instructions".-> HWAccel

    Service --"Fire and forget"--> AuditQueue
    AuditQueue --"Batch process"--> SIEM

    Metrics["Baseline: 50ms total<br/>TLS: +2ms → 52ms<br/>JWT: +0.5ms → 52.5ms<br/>AuthZ: +0.3ms → 52.8ms<br/>Encryption: +5% CPU<br/><br/>With optimizations:<br/>TLS cache: +1ms (-50%)<br/>JWT cache: +0.05ms (-90%)<br/>AuthZ cache: +0.06ms (-80%)<br/>HW accel: +2% CPU (-60%)<br/><br/>Final: ~51.1ms, +2% CPU"]

    DB -.-> Metrics
```

Security controls add measurable latency and CPU overhead at each layer. Optimization strategies include: (1) TLS session caching to reuse handshakes, (2) JWT validation result caching with short TTLs, (3) authorization decision caching for repeated checks, (4) hardware-accelerated encryption (AES-NI), and (5) asynchronous audit logging to keep the critical path non-blocking. Together, these cut the per-request security overhead in the example above from 2.8ms to roughly 1.1ms, and encryption CPU overhead from 5% to 2%, while maintaining the same security guarantees.
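Strategy (2) can be sketched as a TTL-bounded token-validation cache. Here `verify_fn` stands in for real cryptographic verification (e.g. a JWT library), and the 60-second default TTL mirrors the diagram; both are assumptions for the sketch.

```python
import time

class ValidationCache:
    """Cache token-validation results for a short TTL to skip repeat verification."""

    def __init__(self, verify_fn, ttl_seconds=60.0, clock=time.monotonic):
        self._verify = verify_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # token -> (expires_at, claims); unbounded for brevity

    def validate(self, token):
        now = self._clock()
        hit = self._cache.get(token)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh cache hit: skip signature verification
        claims = self._verify(token)  # miss or stale entry: full validation
        self._cache[token] = (now + self._ttl, claims)
        return claims
```

Note the tradeoff this implies: a token revoked mid-window stays usable for up to one TTL, which is why the cache window must stay short.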


Math & Calculations

Token Expiration and Security Window

When designing token-based authentication, you must balance security (short expiration limits stolen token damage) with usability (long expiration reduces re-authentication friction). The security window is the maximum time an attacker can use a stolen token.

Formula: Security Window = Token Expiration Time

Variables:

  • Token Expiration Time: How long the token remains valid
  • Refresh Token Expiration: How long the refresh token remains valid (typically much longer)
  • Re-authentication Frequency: How often users must enter credentials

Worked Example: Spotify’s token strategy

  • Access Token Expiration: 15 minutes
  • Refresh Token Expiration: 30 days
  • Security Window for Stolen Access Token: 15 minutes maximum
  • Security Window for Stolen Refresh Token: 30 days maximum (but requires detection and revocation)

If an attacker steals an access token at minute 7 of its lifetime, they have 8 minutes of access before it expires. If they steal a refresh token, they can generate new access tokens for 30 days unless the user changes their password or the system detects anomalous usage.

Tradeoff Calculation: Reducing access token expiration from 60 minutes to 15 minutes reduces the security window by 75%, but increases token refresh requests by 4x (more load on auth servers). For a system with 10 million active users, that’s 30 million additional refresh requests per hour on top of the 10 million baseline—40 million total, or about 11,000 requests/second. At 1ms per refresh, that’s roughly 11 CPU cores dedicated to token refresh.
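The tradeoff arithmetic can be reproduced directly. All inputs come from the text; the 1ms cost per refresh is the article’s assumption, and the model assumes every active user refreshes once per token lifetime.

```python
def refreshes_per_hour(active_users, ttl_minutes):
    """Hourly refresh load if every active user refreshes once per TTL."""
    return active_users * (60 / ttl_minutes)

baseline = refreshes_per_hour(10_000_000, 60)   # 10M/hour at 60-min tokens
shortened = refreshes_per_hour(10_000_000, 15)  # 40M/hour at 15-min tokens
additional = shortened - baseline               # 30M/hour of extra load
peak_rps = shortened / 3600                     # ~11,111 requests/second
cpu_cores = peak_rps * 0.001                    # ~11 cores at 1ms per refresh
```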

Rate Limiting and Attack Economics

Rate limiting makes attacks economically infeasible by limiting how many attempts an attacker can make.

Formula: Attack Cost = (Target Attempts / Rate Limit) × Cost Per Second

Variables:

  • Target Attempts: Number of attempts needed to succeed (e.g., 1 million for brute force)
  • Rate Limit: Maximum attempts per second per IP
  • Cost Per Second: Attacker’s cost (proxy IPs, compute time)

Worked Example: Brute forcing a 6-digit PIN

  • Total Possible PINs: 1,000,000 (000000 to 999999)
  • Rate Limit: 3 attempts per minute per account
  • Time to Exhaust Space: 1,000,000 / 3 = 333,333 minutes = 231 days
  • With Account Lockout After 5 Failed Attempts: Attack becomes infeasible

If the attacker parallelizes across 10,000 accounts, they need 10,000 valid account identifiers and can still only make 30,000 attempts per minute in aggregate (500 per second). At this rate, covering the full PIN space takes about 33 minutes—but that spreads only ~100 guesses per account, and the attack generates obvious anomalies (10,000 accounts all failing authentication simultaneously).
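A quick sketch of the brute-force timeline arithmetic, using the figures from the worked example:

```python
def minutes_to_exhaust(space_size, attempts_per_minute):
    """Minutes needed to try every candidate at a fixed attempt rate."""
    return space_size / attempts_per_minute

single_account_minutes = minutes_to_exhaust(1_000_000, 3)     # ~333,333 min
single_account_days = single_account_minutes / (60 * 24)      # ~231 days
parallel_minutes = minutes_to_exhaust(1_000_000, 3 * 10_000)  # ~33 min aggregate
```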

Encryption Key Rotation and Blast Radius

Key rotation limits the amount of data exposed if a key is compromised.

Formula: Exposed Data = Encryption Rate × Key Lifetime

Variables:

  • Encryption Rate: GB of data encrypted per day
  • Key Lifetime: How long before rotating to a new key
  • Exposed Data: Amount of data decryptable with a compromised key

Worked Example: Slack’s message encryption

  • Message Volume: 100 TB of new messages per day
  • Key Rotation: Every 30 days
  • Exposed Data per Key: 100 TB × 30 = 3,000 TB (3 PB)

If a key is compromised, the attacker can decrypt 3 PB of messages. By rotating keys daily instead of monthly, the exposure drops to 100 TB (30x reduction), but increases operational complexity (30x more keys to manage). The decision depends on data sensitivity and key management overhead.
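The blast-radius formula applied to the Slack figures above:

```python
def exposed_data_tb(encryption_rate_tb_per_day, key_lifetime_days):
    """TB of data decryptable if a single data-encryption key is compromised."""
    return encryption_rate_tb_per_day * key_lifetime_days

monthly_rotation = exposed_data_tb(100, 30)    # 3,000 TB (3 PB) per key
daily_rotation = exposed_data_tb(100, 1)       # 100 TB per key
reduction = monthly_rotation / daily_rotation  # 30x smaller blast radius
```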

Audit Log Retention and Forensic Window

Audit logs enable detecting and investigating security incidents, but storage costs scale with retention period.

Formula: Storage Cost = Log Volume × Retention Period × Storage Price

Variables:

  • Log Volume: GB of logs per day
  • Retention Period: How long to keep logs
  • Storage Price: Cost per GB per month (e.g., $0.023/GB for S3)

Worked Example: GitHub’s audit logs

  • Log Volume: 10 TB per day (100 million events × 100 KB per event)
  • Retention Period: 90 days (compliance requirement)
  • Total Storage: 10 TB × 90 = 900 TB
  • Monthly Cost: 900,000 GB × $0.023 = $20,700

Extending retention to 365 days increases cost to about $84,000/month. The decision framework: (1) keep hot logs (last 30 days) in fast storage for real-time analysis, (2) move warm logs (31-90 days) to cheaper infrequent-access storage for investigations, (3) archive cold logs (90+ days) to glacier storage ($0.004/GB) for compliance. With 30 days hot at $0.023/GB, 60 days warm at roughly $0.0125/GB, and 275 days cold at $0.004/GB, 365-day retention costs about $25,400/month—a ~70% reduction.
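Recomputing the storage costs above: the S3 ($0.023/GB-month) and glacier ($0.004/GB-month) prices come from the text, while the $0.0125/GB warm tier is an assumed infrequent-access price.

```python
def tier_cost(tb_per_day, days_in_tier, price_per_gb_month):
    """Monthly cost of the steady-state log volume held in one storage tier."""
    return tb_per_day * 1000 * days_in_tier * price_per_gb_month

flat_90 = tier_cost(10, 90, 0.023)    # $20,700: all 90 days in S3 standard
flat_365 = tier_cost(10, 365, 0.023)  # $83,950: all 365 days in S3 standard
tiered_365 = (tier_cost(10, 30, 0.023)      # hot:  $6,900
              + tier_cost(10, 60, 0.0125)   # warm: $7,500 (assumed price)
              + tier_cost(10, 275, 0.004))  # cold: $11,000
```

Steady state is what matters here: each day’s 10 TB spends 30 days hot, 60 warm, and 275 cold, so the tiers hold 300 TB, 600 TB, and 2,750 TB respectively at any moment.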