DNS Explained: How Domain Resolution Works
After this topic, you will be able to:
- Explain the DNS resolution process from client query to IP address response
- Differentiate between authoritative and recursive DNS servers in the resolution chain
- Analyze TTL configuration trade-offs for different application requirements
- Compare DNS record types (A, AAAA, CNAME, MX, TXT) and their use cases in system design
TL;DR
DNS (Domain Name System) is the internet’s distributed directory service that translates human-readable domain names like www.netflix.com into IP addresses like 52.85.151.23. It operates through a hierarchical system of servers (root, TLD, authoritative) with multiple caching layers to optimize performance. Understanding DNS is critical for system design because it’s the first step in every user request and directly impacts latency, availability, and traffic routing strategies.
Cheat Sheet: DNS query flow → Recursive resolver checks cache → Queries root server → Queries .com TLD server → Queries authoritative server for netflix.com → Returns IP → Client connects. TTL controls cache duration. A records map names to IPs, CNAME creates aliases, MX routes email.
The Analogy
DNS is like calling 411 directory assistance before the smartphone era. You tell the operator “I need the phone number for Joe’s Pizza on Main Street” (domain name), and they look through a hierarchical filing system: first finding the city directory (root server), then the street directory (TLD server), then the specific business listing (authoritative server), and finally giving you the phone number (IP address). The operator might remember recent lookups (caching) to answer faster next time. Just as you’d write down frequently called numbers in your own phonebook (browser cache), DNS systems cache results at multiple levels to avoid repeating the full lookup process.
Why This Matters in Interviews
DNS questions appear in virtually every system design interview because DNS is the entry point for all internet traffic. Interviewers use DNS to assess your understanding of distributed systems, caching strategies, and performance optimization. A mid-level engineer should explain the resolution process and basic record types. Senior engineers must discuss TTL trade-offs, failover strategies, and how DNS integrates with load balancers and CDNs. Staff+ engineers should architect DNS solutions for global services, including geographic routing, health checks, and disaster recovery. The depth of your DNS knowledge signals whether you understand how real production systems handle billions of requests. Weak DNS understanding is a red flag that you haven’t worked on internet-facing systems at scale.
Core Concept
DNS is a globally distributed, hierarchical database that maps domain names to IP addresses and other resources. Created in 1983 to replace manually maintained host files, DNS now handles trillions of queries daily across millions of domains. The system’s genius lies in its hierarchical structure combined with aggressive caching—no single server knows all mappings, yet queries typically resolve in under 50 milliseconds.
DNS serves multiple critical functions beyond simple name-to-IP translation. It enables load balancing by returning different IPs based on geography or server health. It routes email through MX records. It proves domain ownership through TXT records used by services like Google Workspace and SSL certificate authorities. It even enables service discovery in microservices architectures through SRV records. Understanding DNS is understanding how the internet routes traffic, which makes it foundational for designing any distributed system.
The protocol operates primarily over UDP port 53 for speed, falling back to TCP for large responses or zone transfers. This choice reflects DNS’s performance-critical role—every web request, API call, and mobile app interaction starts with a DNS lookup. A slow or unavailable DNS service doesn’t just delay your application; it makes it completely unreachable.
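To make the wire protocol concrete, here is a minimal sketch of building a DNS query packet by hand per the RFC 1035 message format (12-byte header followed by a length-prefixed question). The query ID and domain are arbitrary example values; actually sending the packet over UDP is shown only as a comment.

```python
import struct

def build_dns_query(domain: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (RFC 1035 wire format).

    qtype 1 = A record. Flags 0x0100 set the RD (recursion desired) bit,
    asking a recursive resolver to do the full lookup on our behalf.
    """
    # Header: ID, flags, QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0 (12 bytes)
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label length-prefixed, terminated by a zero byte
    qname = b"".join(bytes([len(label)]) + label.encode("ascii")
                     for label in domain.split("."))
    question = qname + b"\x00" + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    return header + question

packet = build_dns_query("www.netflix.com")
# To send it for real, you would use a UDP socket to port 53, e.g.:
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.sendto(packet, ("1.1.1.1", 53))
```

The fixed 12-byte header plus a compact question section is why a typical query fits comfortably in a single UDP datagram; only large responses force the TCP fallback.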
DNS Hierarchical Architecture
graph TB
subgraph Root Level
Root["Root Servers (13 addresses)<br/><i>Know about all TLDs</i>"]
end
subgraph TLD Level
TLD_com[".com TLD Servers<br/><i>Verisign</i>"]
TLD_org[".org TLD Servers<br/><i>PIR</i>"]
TLD_net[".net TLD Servers<br/><i>Verisign</i>"]
TLD_uk[".uk TLD Servers<br/><i>Nominet</i>"]
end
subgraph Authoritative Level
Auth_netflix["netflix.com NS<br/><i>ns-1234.awsdns.com</i>"]
Auth_stripe["stripe.com NS<br/><i>ns1.stripe.com</i>"]
Auth_google["google.com NS<br/><i>ns1.google.com</i>"]
Auth_bbc["bbc.co.uk NS<br/><i>ns1.bbc.co.uk</i>"]
end
Root -->|"Delegates .com"| TLD_com
Root -->|"Delegates .org"| TLD_org
Root -->|"Delegates .net"| TLD_net
Root -->|"Delegates .uk"| TLD_uk
TLD_com -->|"Delegates netflix.com"| Auth_netflix
TLD_com -->|"Delegates stripe.com"| Auth_stripe
TLD_com -->|"Delegates google.com"| Auth_google
TLD_uk -->|"Delegates bbc.co.uk"| Auth_bbc
DNS operates as a hierarchical tree where each level delegates authority to the next. Root servers know about TLD servers, TLD servers know about authoritative nameservers for each domain, and authoritative servers hold the actual records. This delegation allows the system to scale to billions of domains without any single point of knowledge.
Multi-Layer DNS Caching Strategy
graph LR
User["👤 User"] -->|"1. www.netflix.com?"| Browser
subgraph Client Device
Browser["Browser Cache<br/><i>TTL: 5 min</i>"]
OS["OS Cache<br/><i>TTL: 1 min</i>"]
end
subgraph ISP/Public Resolver
Resolver["Recursive Resolver<br/><i>1.1.1.1 / 8.8.8.8</i><br/>TTL: 60-300s"]
end
subgraph DNS Infrastructure
Root["Root Servers"]
TLD[".com TLD"]
Auth["Authoritative NS<br/><i>netflix.com</i>"]
end
Browser -->|"2. Cache miss"| OS
OS -->|"3. Cache miss"| Resolver
Resolver -->|"4. Cache miss<br/>Full recursive lookup"| Root
Root --> TLD
TLD --> Auth
Auth -->|"5. IP + TTL=60s"| Resolver
Resolver -->|"6. Cached result"| OS
OS -->|"7. Cached result"| Browser
Browser -->|"8. IP address"| User
Resolver -.->|"Subsequent queries<br/>(within TTL)"| User
DNS caching occurs at multiple layers with different TTL values. After the first lookup, subsequent queries hit cache at the nearest layer, dramatically reducing latency and load. The recursive resolver’s cache serves all users of that resolver, making it the most impactful cache layer for overall system performance.
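The layered cache-then-fall-through behavior in the diagram can be sketched as a small simulation. The class and function names here are illustrative, not any real resolver's API; the `now` parameter just makes TTL expiry easy to reason about.

```python
import time

class TTLCache:
    """One DNS cache layer: stores a record until its TTL expires."""
    def __init__(self):
        self._store = {}  # name -> (ip, expiry_timestamp)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(name)
        if entry and entry[1] > now:
            return entry[0]          # cache hit, record still fresh
        return None                  # miss, or TTL expired

    def put(self, name, ip, ttl, now=None):
        now = time.time() if now is None else now
        self._store[name] = (ip, now + ttl)

def resolve(name, layers, full_lookup, now=None):
    """Check each cache layer (browser -> OS -> resolver) in order; only on
    a miss everywhere do the expensive recursive lookup, then populate
    every layer on the way back so later queries stop at the first layer."""
    for layer in layers:
        ip = layer.get(name, now)
        if ip is not None:
            return ip
    ip, ttl = full_lookup(name)      # recursive resolution (root -> TLD -> auth)
    for layer in layers:
        layer.put(name, ip, ttl, now)
    return ip
```

With a 60-second TTL, a query at t=30 is served from the browser cache, while a query at t=100 falls through all layers and repeats the full recursive lookup.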
Netflix Global DNS-Based Traffic Routing
graph TB
User_US["👤 User in US<br/><i>New York</i>"]
User_EU["👤 User in EU<br/><i>London</i>"]
User_ASIA["👤 User in Asia<br/><i>Tokyo</i>"]
subgraph Route 53 Authoritative DNS
DNS["netflix.com<br/>Authoritative NS<br/><i>Geolocation Routing</i>"]
end
subgraph US Region
CDN_US1["Open Connect CDN<br/>52.85.151.23<br/><i>Virginia</i>"]
CDN_US2["Open Connect CDN<br/>54.192.45.67<br/><i>California</i>"]
end
subgraph EU Region
CDN_EU1["Open Connect CDN<br/>13.224.78.12<br/><i>Ireland</i>"]
CDN_EU2["Open Connect CDN<br/>13.225.89.34<br/><i>Frankfurt</i>"]
end
subgraph Asia Region
CDN_ASIA1["Open Connect CDN<br/>13.35.12.45<br/><i>Tokyo</i>"]
CDN_ASIA2["Open Connect CDN<br/>13.36.23.56<br/><i>Singapore</i>"]
end
User_US -->|"1. DNS query<br/>netflix.com"| DNS
User_EU -->|"1. DNS query<br/>netflix.com"| DNS
User_ASIA -->|"1. DNS query<br/>netflix.com"| DNS
DNS -->|"2. Return US IPs<br/>TTL: 60s<br/>Weighted routing"| User_US
DNS -->|"2. Return EU IPs<br/>TTL: 60s<br/>Weighted routing"| User_EU
DNS -->|"2. Return Asia IPs<br/>TTL: 60s<br/>Weighted routing"| User_ASIA
User_US -.->|"3. Connect to<br/>nearest CDN"| CDN_US1
User_US -.->|"Failover"| CDN_US2
User_EU -.->|"3. Connect to<br/>nearest CDN"| CDN_EU1
User_EU -.->|"Failover"| CDN_EU2
User_ASIA -.->|"3. Connect to<br/>nearest CDN"| CDN_ASIA1
User_ASIA -.->|"Failover"| CDN_ASIA2
Note1["Health checks remove<br/>failed nodes within 60s"]
Note2["Weighted routing enables<br/>gradual rollouts (1%→5%→25%)"]
Netflix uses DNS geolocation routing to direct users to the nearest CDN node, minimizing latency. Route 53 returns different IP addresses based on the client’s location, with short 60-second TTLs enabling rapid failover. Weighted routing allows gradual rollouts of new CDN nodes, and health checks automatically remove failed nodes from DNS responses.
How It Works
When you type www.netflix.com into your browser, a sophisticated resolution process begins. Your browser first checks its own cache—if you visited Netflix recently, it might already know the IP address. If not, your operating system checks its cache. Still no match? The query goes to your configured recursive DNS resolver, typically provided by your ISP or a service like Cloudflare (1.1.1.1) or Google (8.8.8.8).
The recursive resolver acts as your agent, doing the heavy lifting of the hierarchical lookup. It first checks its own cache. On a cache miss, it contacts a root nameserver—there are 13 root server addresses (actually hundreds of physical servers using anycast) that know about all top-level domains. The root server responds: “I don’t know about netflix.com, but the .com TLD servers do—here are their addresses.”
The resolver then queries a .com TLD nameserver, which responds: “I don’t have the specific IP, but the authoritative nameservers for netflix.com do—here are their addresses.” Finally, the resolver queries Netflix’s authoritative nameserver, which responds with the actual IP address (or multiple IPs for load balancing). The resolver caches this result based on the TTL (Time To Live) value and returns it to your browser.
This process seems slow, but caching makes it lightning fast in practice. After the first user in your city looks up netflix.com, the recursive resolver caches the result. Subsequent queries from anyone using that resolver get instant responses from cache. Your browser might cache for 5 minutes, your OS for 1 minute, and the recursive resolver for whatever TTL Netflix specifies (often 60-300 seconds for frequently changing services).
The beauty of this hierarchical design is delegation and distribution. Netflix controls its own authoritative nameservers and can update records instantly. The .com registry only needs to know which nameservers are authoritative for each domain, not every individual record. Root servers only need to know about TLD servers. This delegation allows the system to scale to billions of domains without any single point of knowledge or failure.
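The delegation chain described above can be modeled with a toy data structure: each level knows only who to ask next, never the final answer, until the authoritative server. The server names and dictionaries are invented for illustration.

```python
# Toy delegation tree, mirroring how root servers refer to TLD servers,
# which refer to each domain's authoritative nameservers.
ROOT = {"com": "tld-com"}                                  # root knows TLDs only
TLDS = {"tld-com": {"netflix.com": "ns-netflix"}}          # TLD knows domain NS only
AUTH = {"ns-netflix": {"www.netflix.com": ("52.85.151.23", 60)}}  # actual records

def recursive_resolve(fqdn: str):
    """Walk the hierarchy the way a recursive resolver does on a cache miss."""
    tld = fqdn.rsplit(".", 1)[-1]                # "www.netflix.com" -> "com"
    tld_server = ROOT[tld]                       # 1. root refers us to the TLD
    domain = ".".join(fqdn.split(".")[-2:])      # -> "netflix.com"
    auth_server = TLDS[tld_server][domain]       # 2. TLD refers us to the domain NS
    ip, ttl = AUTH[auth_server][fqdn]            # 3. authoritative answer + TTL
    return ip, ttl
```

Note that updating a record only touches `AUTH`; the root and TLD tables never change, which is exactly why Netflix can update records instantly without coordinating upstream.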
DNS Resolution Flow: Hierarchical Lookup Process
sequenceDiagram
participant Browser
participant OS as OS Cache
participant Resolver as Recursive Resolver<br/>(ISP/1.1.1.1)
participant Root as Root Server
participant TLD as .com TLD Server
participant Auth as Authoritative NS<br/>(netflix.com)
Browser->>Browser: 1. Check browser cache
Note over Browser: Cache miss
Browser->>OS: 2. Query OS cache
Note over OS: Cache miss
OS->>Resolver: 3. DNS query: www.netflix.com
Resolver->>Resolver: 4. Check resolver cache
Note over Resolver: Cache miss - start recursive lookup
Resolver->>Root: 5. Query: www.netflix.com?
Root-->>Resolver: 6. Refer to .com TLD servers
Resolver->>TLD: 7. Query: www.netflix.com?
TLD-->>Resolver: 8. Refer to netflix.com NS
Resolver->>Auth: 9. Query: www.netflix.com?
Auth-->>Resolver: 10. A record: 52.85.151.23<br/>TTL: 60s
Resolver->>Resolver: 11. Cache result (60s)
Resolver-->>OS: 12. Return IP: 52.85.151.23
OS->>OS: 13. Cache result
OS-->>Browser: 14. Return IP
Browser->>Browser: 15. Cache result
Note over Browser,Auth: Subsequent queries hit cache until TTL expires
Complete DNS resolution flow showing the hierarchical lookup process. The recursive resolver performs the multi-hop query on behalf of the client, querying root, TLD, and authoritative servers in sequence. Caching at each layer (browser, OS, resolver) eliminates most of these steps for subsequent queries until the TTL expires.
Key Principles
Hierarchical Delegation: DNS distributes authority through a tree structure where each level delegates responsibility to the next. Root servers delegate to TLD servers (.com, .org, .uk), which delegate to authoritative nameservers for specific domains. This prevents any single entity from needing to know all mappings and allows domain owners to control their own records. Example: When Stripe adds a new subdomain like api.stripe.com, they update their own authoritative nameservers. The .com registry doesn’t need to know, and root servers certainly don’t. This delegation means Stripe can make changes instantly without coordinating with upstream authorities. Compare this to the old HOSTS.TXT system where every change required updating a central file distributed to all computers.
Aggressive Multi-Layer Caching: DNS caches at every possible layer—browser, operating system, recursive resolver—and even authoritative servers cache their own zone data. Each cache has a TTL that balances freshness against query load. This caching is what makes DNS fast enough for real-time use despite the multi-hop resolution process. Example: Cloudflare’s 1.1.1.1 resolver handles over 1.5 trillion DNS queries daily, but only a tiny fraction require full recursive lookups. Most queries hit cache. When Netflix updates a record, they set a low TTL (60 seconds) during deployments so changes propagate quickly. For stable services, TTLs of 3600 seconds (1 hour) or 86400 seconds (24 hours) reduce query load by 60-99%.
Eventual Consistency Through TTL: DNS accepts eventual consistency rather than strong consistency. When you update a DNS record, different resolvers see the change at different times based on when their cached copies expire. This trade-off enables massive scale and performance but requires careful TTL management during changes. Example: When Amazon deploys a new version of their API, they might lower the TTL from 300 seconds to 60 seconds an hour before deployment. This ensures most clients see the new IP within a minute of the change. After the deployment stabilizes, they raise the TTL back to 300 seconds to reduce query load. During the transition period, some clients might still hit old servers—applications must handle this gracefully.
Redundancy at Every Level: DNS achieves high availability through redundancy. Root servers use anycast (multiple physical servers sharing one IP). TLD operators run multiple nameservers in different locations. Domain owners should run at least two authoritative nameservers in different data centers. Recursive resolvers query multiple servers if one fails. Example: Route 53 provides four nameservers for each hosted zone, distributed across AWS regions. If you configure ns-1234.awsdns-12.com as your primary nameserver, Route 53 automatically provides three more (ns-5678, ns-9012, ns-3456) in different regions. Even if an entire AWS region fails, DNS resolution continues. This is why you see multiple NS records in DNS configurations.
Separation of Recursive and Authoritative Roles: DNS servers play two distinct roles that should never be mixed. Recursive resolvers serve clients and perform lookups on their behalf. Authoritative nameservers answer queries about domains they control. Separating these roles improves security, performance, and scalability. Example: Google’s 8.8.8.8 is a recursive resolver—it performs lookups for anyone who asks but has no authority over any domains. Google’s authoritative nameservers (ns1.google.com, ns2.google.com) only answer queries about Google-owned domains and never perform recursive lookups. Mixing these roles would allow cache poisoning attacks and create performance bottlenecks.
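The eventual-consistency window described in these principles has a simple bound worth internalizing: after a record change, the last resolver to converge is the one that cached the old value an instant before the change. A tiny sketch (function name is illustrative, timeline uses the Amazon-style figures from the example above):

```python
def propagation_window(change_time: int, record_ttl: int) -> int:
    """Worst-case moment by which every cache sees a record change: a
    resolver that cached the record just before the change keeps serving
    the old value for one full TTL afterwards."""
    return change_time + record_ttl

# Timeline for a deployment using the figures above:
# t=0:   TTL lowered from 300s to 60s; entries cached with TTL=300 drain by t=300
# t=300: IP switched; every resolver sees the new IP no later than t=360
```

This is why the standard playbook is: lower the TTL, wait out the old TTL, then make the actual change.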
Deep Dive
Types / Variants
DNS record types serve different purposes in system design. A records map domain names to IPv4 addresses (e.g., www.stripe.com → 54.187.174.169). These are the most common records and what most people think of as “DNS.” AAAA records do the same for IPv6 addresses, increasingly important as the IPv4 address space is exhausted. CNAME records create aliases, pointing one name to another (e.g., blog.stripe.com → stripe.ghost.io). CNAMEs are crucial for integrating third-party services—you can point your subdomain to a SaaS provider’s domain, and they handle the underlying IP management.
MX records specify mail servers for a domain, with priority values for fallback (e.g., 10 mail1.google.com, 20 mail2.google.com). Lower priority numbers are tried first. TXT records store arbitrary text, used for domain verification (Google Search Console), email authentication (SPF, DKIM, DMARC), and service discovery. NS records specify which nameservers are authoritative for a domain or subdomain—these are how delegation works. SRV records enable service discovery by specifying the hostname and port for specific services, used heavily in Kubernetes and microservices.
SOA (Start of Authority) records contain metadata about a DNS zone: the primary nameserver, the email of the domain administrator, the zone’s serial number (for change tracking), and timing parameters for zone transfers. Every zone must have exactly one SOA record. PTR records enable reverse DNS lookups, mapping IP addresses back to domain names, primarily used for email server reputation and security logging.
Managed DNS providers like Route 53 and Cloudflare offer advanced record types beyond the standard set. Alias records (Route 53) or CNAME flattening (Cloudflare) solve the problem that you can’t use a CNAME at the zone apex (example.com itself). These proprietary solutions let you point your root domain to a load balancer or CDN while maintaining the performance benefits of A records. Geolocation records return different IPs based on the client’s location, enabling geographic load balancing. Weighted records distribute traffic across multiple IPs based on percentages, useful for gradual rollouts and A/B testing.
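Weighted records are conceptually just cumulative-weight bucketing. A minimal sketch of the selection mechanism (not any provider's actual implementation; the IPs are documentation-range examples), with the randomness factored out as a `roll` argument so it is deterministic:

```python
def weighted_pick(records, roll):
    """records: list of (ip, weight); roll: a number in [0, total_weight),
    normally drawn uniformly at random per query. Walk cumulative weight
    ranges until one contains the roll -- the mechanism behind gradual
    1% -> 5% -> 25% rollouts."""
    cumulative = 0
    for ip, weight in records:
        cumulative += weight
        if roll < cumulative:
            return ip
    raise ValueError("roll must be below the total weight")

# A 1% rollout of a new node (hypothetical IPs): rolls 0 hit the new
# node, rolls 1-99 hit the established one.
records = [("203.0.113.10", 1), ("52.85.151.23", 99)]
```

Shifting traffic is then just editing the weights; clients see nothing but a changing mix of A records.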
Common DNS Record Types and Use Cases
graph TB
subgraph Domain: stripe.com
Root["stripe.com<br/><i>Zone Apex</i>"]
end
subgraph A Records - IPv4 Mapping
A1["A: 54.187.174.169<br/><i>Primary web server</i>"]
A2["A: 52.24.56.78<br/><i>Secondary web server</i>"]
end
subgraph CNAME Records - Aliases
CNAME1["blog.stripe.com<br/>CNAME → stripe.ghost.io<br/><i>Third-party blog platform</i>"]
CNAME2["www.stripe.com<br/>CNAME → stripe.com<br/><i>Alias www to apex</i>"]
end
subgraph MX Records - Email Routing
MX1["MX 10: mail1.google.com<br/><i>Primary mail server</i>"]
MX2["MX 20: mail2.google.com<br/><i>Backup mail server</i>"]
end
subgraph TXT Records - Verification
TXT1["TXT: v=spf1 include:_spf.google.com<br/><i>Email authentication</i>"]
TXT2["TXT: google-site-verification=abc123<br/><i>Domain ownership proof</i>"]
end
subgraph NS Records - Delegation
NS1["NS: ns1.stripe.com<br/><i>Primary nameserver</i>"]
NS2["NS: ns2.stripe.com<br/><i>Secondary nameserver</i>"]
end
Root --> A1
Root --> A2
Root --> CNAME1
Root --> CNAME2
Root --> MX1
Root --> MX2
Root --> TXT1
Root --> TXT2
Root --> NS1
Root --> NS2
DNS record types serve different purposes in system architecture. A records map names to IPs for direct access, CNAMEs create aliases for third-party integrations, MX records route email with priority-based fallback, TXT records prove domain ownership and configure email security, and NS records delegate authority to nameservers.
Trade-offs
TTL Configuration presents the classic trade-off between freshness and load. Short TTLs (60-300 seconds) mean changes propagate quickly and you can rapidly shift traffic during incidents. However, they generate 12-24x more DNS queries, increasing costs and load on your nameservers. Long TTLs (3600-86400 seconds) dramatically reduce query volume and improve client performance (fewer lookups), but changes take hours to propagate. The decision depends on your change frequency. Netflix uses short TTLs because they deploy constantly and need rapid failover. A corporate website that changes monthly can use 24-hour TTLs.
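The query-load side of this trade-off is worth doing as a back-of-envelope calculation. Assuming each recursive resolver re-fetches the record once per TTL expiry (an upper bound; the resolver count is an assumed figure for illustration):

```python
def queries_per_day(resolver_count: int, ttl_seconds: int) -> int:
    """Upper bound on authoritative-server queries per day, assuming each
    recursive resolver re-fetches the record once every time its cached
    copy expires."""
    return resolver_count * (86_400 // ttl_seconds)

# Assumed figure: 10,000 distinct recursive resolvers query the domain daily.
short_ttl = queries_per_day(10_000, 300)    # 5-minute TTL
long_ttl = queries_per_day(10_000, 3_600)   # 1-hour TTL
```

Dropping from a 1-hour to a 5-minute TTL multiplies authoritative query volume 12x, which is where the 12-24x figure above comes from (24x when comparing 24-hour against 1-hour TTLs).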
Authoritative Nameserver Placement involves choosing between self-hosted and managed DNS. Running your own authoritative nameservers (BIND, PowerDNS) gives complete control and eliminates third-party dependencies, but requires expertise in DNS security, DDoS mitigation, and global distribution. Managed services like Route 53 or Cloudflare provide instant global distribution, DDoS protection, and 100% SLA guarantees, but add vendor lock-in and ongoing costs. Most companies choose managed DNS for internet-facing services and self-host only for internal infrastructure.
Record Type Selection requires understanding the implications. CNAME records are convenient but add an extra lookup—the resolver must first resolve the CNAME target, then resolve that to an IP. This adds 10-30ms of latency. A records are faster but require updating multiple records when IPs change. For high-traffic services, use A records at the zone apex and CNAMEs only for subdomains where the extra latency is acceptable. For services behind load balancers, use A records pointing to multiple load balancer IPs rather than CNAMEs.
Recursive Resolver Choice impacts privacy and performance. ISP-provided resolvers are geographically close (low latency) but often have poor cache hit rates, log queries, and inject ads. Public resolvers like Cloudflare 1.1.1.1 or Google 8.8.8.8 have massive cache hit rates and better performance but centralize DNS traffic with a few companies. For enterprise systems, running your own recursive resolvers (Unbound, BIND) provides control but requires maintenance and doesn’t benefit from global cache sharing.
DNS-Based Load Balancing vs. Application-Level is a critical architectural decision. DNS load balancing (returning multiple A records or using geographic routing) is simple and works at the network layer, but clients might cache one IP and stick to it, creating imbalanced load. DNS also can’t detect server health in real-time—if a server fails, DNS might still return its IP until the TTL expires. Application-level load balancing (using a load balancer like ELB or HAProxy behind a single DNS name) provides better health checking and traffic distribution but adds a single point of failure and extra network hop. The solution: use DNS for geographic distribution (route US traffic to US data centers) and application load balancers within each region for fine-grained balancing.
Common Pitfalls
Pitfall: Forgetting DNS Propagation During Deployments. Why it happens: Engineers update DNS records and expect immediate changes, forgetting that clients and resolvers have cached the old values. This causes partial outages where some users see the new system and others see the old, leading to inconsistent behavior or errors. How to avoid: Lower TTLs to 60 seconds at least one TTL period before making changes. If your current TTL is 3600 seconds, lower it an hour before deployment. After the change, wait for the old TTL to expire before considering the migration complete. Monitor both old and new endpoints during the transition. After stabilization, raise TTLs back to normal to reduce query load.
Pitfall: Using CNAMEs at the Zone Apex. Why it happens: The DNS specification prohibits CNAME records at the zone apex (example.com) because it conflicts with required SOA and NS records. Engineers try to point their root domain to a CDN or load balancer using a CNAME and wonder why it doesn’t work. How to avoid: Use A records at the zone apex, pointing to the IP addresses of your load balancers or CDN edge servers. If your CDN provides dynamic IPs, use provider-specific solutions like Route 53 Alias records or Cloudflare CNAME flattening. These act like CNAMEs but return A records to clients, satisfying both requirements.
Pitfall: Not Running Multiple Authoritative Nameservers. Why it happens: Cost-cutting or laziness leads to running a single authoritative nameserver. When that server fails or is DDoS’d, the entire domain becomes unreachable—not just the website, but email, APIs, everything. How to avoid: Always run at least two authoritative nameservers in different data centers or availability zones. Managed DNS providers do this automatically. If self-hosting, use different providers or cloud regions for redundancy. Configure your domain registrar with all nameserver addresses. Test failover by blocking one nameserver and verifying resolution still works.
Pitfall: Ignoring DNS Query Volume in Capacity Planning. Why it happens: DNS seems simple and cheap, so engineers don’t monitor query volume or costs. Then a viral event or DDoS attack generates billions of queries, overwhelming nameservers or generating massive bills from managed DNS providers. How to avoid: Monitor DNS query volume and set up alerts for unusual spikes. Understand your DNS provider’s pricing model—Route 53 charges per million queries, which adds up at scale. Use longer TTLs for stable records to reduce query volume. Implement rate limiting on authoritative nameservers. For very high-traffic services, consider anycast DNS or multiple DNS providers for redundancy and load distribution.
Pitfall: Exposing Internal Infrastructure Through DNS. Why it happens: Engineers create DNS records for internal services (database.internal.company.com) in public DNS zones, revealing infrastructure details to attackers. Even if the services aren’t publicly accessible, the DNS records leak information about your architecture. How to avoid: Use split-horizon DNS: public DNS zones for internet-facing services, private DNS zones for internal infrastructure. AWS Route 53 Private Hosted Zones, Azure Private DNS, and Google Cloud DNS private zones provide this separation. Never put internal hostnames, database servers, or infrastructure details in public DNS records.
Real-World Examples
Netflix (Global Content Delivery): Netflix uses DNS as the first layer of their sophisticated traffic routing system. When you visit netflix.com, their authoritative nameservers (managed by AWS Route 53) return different IP addresses based on your geographic location, directing you to the nearest Open Connect CDN node. They use short TTLs (60 seconds) to enable rapid traffic shifting during deployments or outages. During peak hours, Netflix’s DNS infrastructure handles over 100 million queries per hour. They use weighted routing to gradually roll out new CDN nodes—starting with 1% of traffic, then 5%, then 25%, monitoring error rates at each step. If a CDN node fails health checks, Route 53 automatically stops returning its IP within 60 seconds, before most users notice. Netflix also uses DNS for A/B testing infrastructure changes, routing a small percentage of users to experimental configurations. Their DNS strategy is so critical that they maintain relationships with multiple DNS providers and can switch between them within minutes if one experiences issues.
Cloudflare (1.1.1.1 Public Resolver): Cloudflare’s 1.1.1.1 is one of the world’s largest recursive DNS resolvers, handling over 1.5 trillion queries daily. They achieve this scale through aggressive caching and anycast routing—the IP address 1.1.1.1 is announced from over 275 data centers worldwide, so your query goes to the nearest location. Their cache hit rate exceeds 95%, meaning only 5% of queries require full recursive resolution. Cloudflare uses this massive query volume to detect DNS-based attacks and malware—they can see when a domain suddenly receives millions of queries from infected computers. They also offer DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) to encrypt DNS queries, preventing ISPs from seeing or modifying your DNS traffic. For their authoritative DNS service, Cloudflare provides free DNS hosting with DDoS protection, making money by upselling CDN and security services. They use their global network to absorb massive DNS DDoS attacks—in 2021, they mitigated a 2 Tbps attack targeting a customer’s DNS infrastructure without the customer experiencing any downtime.
Amazon Route 53 (AWS Service Discovery): Route 53 powers DNS for millions of AWS customers and AWS’s own infrastructure. The name “Route 53” refers to TCP/UDP port 53, the standard DNS port. Beyond basic DNS, Route 53 provides health checking and automatic failover—it continuously monitors endpoints and stops returning IPs for failed servers. For AWS services like ELB and CloudFront, Route 53 offers Alias records that provide CNAME-like functionality at the zone apex without the performance penalty. Route 53 also integrates with AWS’s service discovery for ECS and EKS, automatically creating and updating DNS records as containers start and stop. Their traffic flow feature enables complex routing policies: geographic routing to direct users to the nearest region, latency-based routing to send users to the fastest region (not necessarily the closest), weighted routing for gradual migrations, and failover routing for disaster recovery. Route 53’s pricing model charges per hosted zone ($0.50/month) and per million queries ($0.40), making it cost-effective for small sites but expensive at scale—a site with 1 billion queries/month pays $400 just for DNS. This pricing drives many high-traffic sites to use longer TTLs or multiple DNS providers to reduce costs.
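The pricing figures quoted above make for a quick sanity check. A minimal estimator using those same rates (check current pricing before relying on the numbers; tiered discounts at higher volumes are ignored here):

```python
def route53_monthly_cost(hosted_zones: int, queries: int,
                         per_zone: float = 0.50,
                         per_million_queries: float = 0.40) -> float:
    """Monthly cost estimate using the per-zone and per-million-query
    rates quoted in this section, ignoring volume discounts."""
    return hosted_zones * per_zone + (queries / 1_000_000) * per_million_queries

# One hosted zone serving 1 billion queries/month:
cost = route53_monthly_cost(hosted_zones=1, queries=1_000_000_000)
```

One billion queries at $0.40 per million is $400 in query charges alone, which is exactly why high-traffic sites lean on longer TTLs to push query volume into resolver caches.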
Interview Expectations
Mid-Level
Explain the DNS resolution process from client to authoritative nameserver, including the role of recursive resolvers, root servers, TLD servers, and authoritative servers. Describe common record types (A, CNAME, MX) and when to use each. Discuss basic caching and TTL concepts. Demonstrate understanding that DNS is a distributed system with eventual consistency. Be able to debug why a DNS change isn’t visible yet (TTL hasn’t expired). Show awareness that DNS is often the first point of failure in outages.
Senior
Design DNS architecture for a multi-region application, including authoritative nameserver placement, TTL strategy, and failover mechanisms. Explain trade-offs between DNS-based load balancing and application-level load balancing. Discuss how to minimize DNS propagation time during deployments. Describe how DNS integrates with CDNs and load balancers for traffic routing. Calculate DNS query volume and costs for a given traffic pattern. Explain DNS security concerns (cache poisoning, DDoS) and mitigation strategies. Design a migration strategy from one DNS provider to another with zero downtime.
Staff+
Architect DNS solutions for global-scale services handling billions of requests. Design geographic routing strategies that balance latency, cost, and data sovereignty requirements. Explain how to use DNS for disaster recovery and multi-cloud failover. Discuss advanced DNS features like EDNS Client Subnet for CDN optimization. Design DNS infrastructure that survives regional outages, DDoS attacks, and DNS provider failures. Calculate the impact of DNS resolution time on overall system latency and user experience. Propose DNS strategies for complex scenarios like gradual traffic migration between cloud providers, A/B testing infrastructure changes, or handling sudden traffic spikes. Explain how DNS fits into the broader system architecture, including its interaction with service mesh, API gateways, and observability systems.
Common Interview Questions
Walk me through what happens when a user types www.example.com into their browser
How would you minimize downtime when changing DNS providers?
Your DNS TTL is 3600 seconds and you need to do an emergency failover. What do you do?
How does DNS-based load balancing differ from using a load balancer? When would you use each?
A customer reports they’re still seeing the old IP address 10 minutes after you updated DNS. Why?
How would you design DNS for a system that needs to route users to the nearest data center?
What’s the difference between a CNAME and an A record? When would you use each?
How do CDNs use DNS to route users to edge servers?
Your DNS provider is under DDoS attack. How do you maintain service availability?
Explain how DNS caching works at different layers and why it matters for system design
Red Flags to Avoid
Not knowing the difference between recursive and authoritative DNS servers
Thinking DNS changes are instant without understanding TTL and caching
Suggesting CNAME records at the zone apex (example.com)
Not considering DNS as a potential single point of failure
Ignoring DNS query volume and costs in capacity planning
Not understanding how DNS integrates with load balancers and CDNs
Proposing overly complex DNS schemes when simple solutions work better
Not knowing common DNS record types beyond A records
Failing to mention redundancy (multiple nameservers) in DNS design
Not considering DNS propagation time when planning deployments or migrations
Key Takeaways
DNS is a hierarchical, distributed system that translates domain names to IP addresses through a chain of servers: recursive resolver → root → TLD → authoritative. Understanding this flow is essential because DNS is the entry point for all internet traffic and a common source of production issues.
Caching at multiple layers (browser, OS, recursive resolver) makes DNS fast, but creates eventual consistency challenges. TTL configuration is a critical trade-off: short TTLs (60-300s) enable rapid changes but increase query load and costs; long TTLs (3600-86400s) reduce load but delay propagation. Always lower TTLs before planned changes.
DNS serves multiple roles beyond name resolution: load balancing through multiple A records, geographic routing, email routing (MX records), domain verification (TXT records), and service discovery. Modern systems use DNS as the first layer of traffic routing, integrated with CDNs and load balancers.
Redundancy is non-negotiable: run multiple authoritative nameservers in different locations, use managed DNS providers with global distribution, and consider multiple DNS providers for critical services. DNS failures make your entire system unreachable, not just slow.
DNS-based load balancing is simple but limited—clients cache IPs and DNS can’t detect real-time health. Use DNS for geographic distribution (routing users to the nearest region) and application load balancers within regions for fine-grained traffic management and health checking.
Related Topics
Prerequisites
IP Addressing and Subnets - Understanding IP addresses is essential before learning how DNS maps names to them
HTTP/HTTPS Protocols - DNS resolution happens before HTTP requests, so understanding the full request flow is important
Next Steps
CDN Overview - CDNs use DNS for geographic routing to edge servers, building on DNS fundamentals
Load Balancing Strategies - Understanding how DNS integrates with application load balancers for traffic distribution
Service Discovery - Modern microservices use DNS for service discovery, extending DNS concepts to internal infrastructure
Related
Caching Strategies - DNS caching is a specific application of general caching principles
High Availability Patterns - DNS redundancy and failover are critical HA patterns
DDoS Mitigation - DNS is a common DDoS target, requiring specific protection strategies