DNS is Simple. DNS is Hard.

How a "simple" lookup system turns into a distributed systems problem

Posted by Adam Wespiser


DNS looks like a simple mapping:

DNS :: Domain Name → IP Address

Mental Model

Let’s start with the base case: a user opens their browser and navigates to this page.

At a glance, the DNS model is brutally simple:

wespiser.com → 104.21.13.171

However, when your application makes a DNS request, it doesn’t go straight to the authoritative server. It goes to a recursive resolver—usually run by your ISP, your company, or a public provider (like 8.8.8.8).

That resolver does the actual work:

  1. Query root servers
  2. Follow referrals to TLD servers
  3. Query the authoritative name server
  4. Cache the result
  5. Return the answer
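The five steps above can be sketched as a toy model. Everything here is an illustrative stand-in (made-up zone data and server names), not live DNS:

```python
# A toy model of recursive resolution: root -> TLD -> authoritative, with a cache.
# Zone contents are invented for illustration; the IP matches the example above.

ROOT = {"com.": "tld-server"}              # root servers know the TLDs
TLD = {"wespiser.com.": "auth-server"}     # the .com servers know the delegation
AUTH = {"wespiser.com.": "104.21.13.171"}  # the authoritative answer

CACHE = {}  # the recursive resolver's cache: name -> answer

def resolve(name: str) -> str:
    """Walk root -> TLD -> authoritative, caching the final answer."""
    if name in CACHE:
        return CACHE[name]          # a cache hit skips the whole chain
    referral = ROOT["com."]         # 1. query a root server
    referral = TLD[name]            # 2. follow the referral to the TLD
    answer = AUTH[name]             # 3. query the authoritative server
    CACHE[name] = answer            # 4. cache the result
    return answer                   # 5. return the answer

assert resolve("wespiser.com.") == "104.21.13.171"  # first call walks the chain
assert "wespiser.com." in CACHE                     # later calls never leave the resolver
```

The cache in step 4 is the whole trick: it is what makes DNS fast at small scale, and what makes it a distributed system at large scale.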

At small scale, this feels like a lookup.

At internet scale, this is a distributed system.


Internet building block

For a taste of how critical this building block is: on October 21, 2016, Dyn, a DNS provider serving many of the most popular web platforms, went down for hours.

The attack was simple: have your botnet send DNS requests that are more expensive to resolve than they are to generate. Millions of unique subdomains forced resolvers to bypass caches, triggering a flood of upstream lookups that overwhelmed Dyn’s infrastructure.

The result? Reddit, Twitter, PayPal, and others were unavailable for hours.

The real failure wasn’t that Dyn went down.

The failure was that everyone depended on Dyn.

DNS is one of the few systems where you ship a change, or suffer a failure, and then wait for independent caches across the internet to agree with you.
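That waiting period falls out of how TTL-based caching works. A minimal sketch, with invented record values and a simulated clock:

```python
# Minimal sketch of TTL-driven convergence: after a record changes upstream,
# a cache keeps serving the old answer until its TTL expires.

class TtlCache:
    def __init__(self):
        self.entries = {}  # name -> (answer, expires_at)

    def get(self, name, now, authoritative):
        answer, expires = self.entries.get(name, (None, 0))
        if now < expires:
            return answer                       # still "fresh" as far as TTL says
        fresh, ttl = authoritative(name)        # expired: refetch upstream
        self.entries[name] = (fresh, now + ttl)
        return fresh

records = {"wespiser.com.": ("1.1.1.1", 300)}   # invented answer + 300s TTL
lookup = lambda name: records[name]

cache = TtlCache()
assert cache.get("wespiser.com.", now=0, authoritative=lookup) == "1.1.1.1"

records["wespiser.com."] = ("2.2.2.2", 300)     # you "ship" a change
# at t=100 this cache still believes the old answer:
assert cache.get("wespiser.com.", now=100, authoritative=lookup) == "1.1.1.1"
# only after the TTL expires does it agree with you:
assert cache.get("wespiser.com.", now=301, authoritative=lookup) == "2.2.2.2"
```

Now multiply this one cache by every resolver, runtime, and cluster on the internet, each with its own clock and its own expiry time.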

DNS is hard.


Where it breaks down

Close your eyes and imagine: your phone rings. An exasperated manager pulls you into a service outage. You don’t know anything yet.

What do you check?

Are the servers turned on and getting power?

Is the network connected and are nodes receiving messages?

Does DNS work?

This was the path AWS engineers found themselves walking on the night of October 19–20, 2025, when US-EAST-1 began failing.

By 12:26 AM PDT, the team had narrowed the event to DNS resolution issues for the regional DynamoDB endpoint. The underlying problem: a race condition in DynamoDB’s DNS management system.

In simple terms: the database servers were still there, the network mostly still existed, but the naming layer that told systems how to reach DynamoDB had broken.

The failure wasn’t just a race condition.

It was a race condition in a system where partial state is globally visible—and cached.

Multiple automation paths were updating DNS without coordination. When those updates collided, DNS didn’t fail cleanly. It propagated inconsistent state outward.
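The shape of that failure can be shown deterministically. This is not AWS’s system; the endpoint name and plan format are invented, and the interleaving is staged rather than raced:

```python
# Sketch of the failure mode: two automation paths apply DNS "plans" without
# coordination, and a stale plan lands last. All names and plans are invented.

dns = {}  # the DNS management state: endpoint name -> plan

def apply_uncoordinated(name, plan):
    dns[name] = plan  # last write wins; no version check

# Path A applies plan v2; path B still holds a stale, empty v1 and applies it after.
apply_uncoordinated("db.region.example", {"version": 2, "ips": ["10.0.0.2"]})
apply_uncoordinated("db.region.example", {"version": 1, "ips": []})

assert dns["db.region.example"]["ips"] == []  # the record is now broken

# Coordinated variant: refuse any write that would move the version backwards.
def apply_coordinated(name, plan):
    current = dns.get(name)
    if current and plan["version"] <= current["version"]:
        return False  # reject the stale plan instead of clobbering state
    dns[name] = plan
    return True

apply_coordinated("db.region.example", {"version": 2, "ips": ["10.0.0.2"]})
assert not apply_coordinated("db.region.example", {"version": 1, "ips": []})
assert dns["db.region.example"]["ips"] == ["10.0.0.2"]
```

The fix isn’t faster automation. It’s making stale writers lose instead of letting the last writer win.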

Once that happened, everything depending on DynamoDB couldn’t reliably find it.

DNS looks like configuration. But it behaves like a control plane.

DNS is hard.


Check the cache

A few years ago, I worked as an infrastructure engineer at a cloud database company. Our mission was straightforward: take a database, put it in the cloud, and make it reliable for our customers and cheap to run for us.

Also: pick up the phone when things weren’t working, and build the system to minimize such calls.

The DNS portion of this story starts with a desire to save money by removing expensive dependencies like ELB from a simple ingress route:

Route53 → ELB → compute clusters

to something more flexible:

Route53 → Cloudflare Tunnels → compute clusters

On paper, this wasn’t especially complicated.

From a systems perspective, this felt controlled.

From a DNS perspective, we were about to push a global change into a system we didn’t control—and couldn’t observe.


The Plan

The rollout strategy was straightforward: we targeted a two-hour migration window during working hours.

From a systems perspective, this felt safe.

From a DNS perspective, we were initiating a global convergence event and hoping it behaved for our control plane.


The Reality

We had no reliable way to know whether the DNS change was correct. No global signal, no encompassing metrics dashboard to check, nothing that told us what the system actually believed.

Most of the migration went smoothly. Changes applied, traffic flowed, TLS held.
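The signal we wished we had is simple to describe: ask several vantage points the same question and report disagreement. A sketch with faked resolvers (a real version would query 8.8.8.8, 1.1.1.1, and the in-cluster resolver):

```python
# Sketch of a convergence check: compare answers from several vantage points.
# The resolvers here are fakes returning invented IPs; real ones would do
# actual DNS queries against different servers.

def check_convergence(name, resolvers):
    """Return (converged, answers) across all vantage points."""
    answers = {label: fn(name) for label, fn in resolvers.items()}
    return len(set(answers.values())) == 1, answers

resolvers = {
    "public-a":  lambda n: "104.21.13.171",
    "public-b":  lambda n: "104.21.13.171",
    "cluster":   lambda n: "172.64.80.1",   # a stale in-cluster cache
}
converged, answers = check_convergence("wespiser.com", resolvers)
assert not converged  # the cluster still believes the old record
```

Even this crude check tells you something Route53’s console never will: what the rest of the world currently believes.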

Then we hit an issue.

Some Kubernetes clusters were holding onto DNS state longer than expected. Even after the change, parts of the system were still resolving the old configuration.

Nothing in Route53 was wrong.
Nothing in Cloudflare was wrong.

But the system wasn’t converging.

We eventually tracked it down to DNS caching inside the clusters. We had to manually restart services to clear the cached state.
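Why a restart was the fix: some process-level resolvers cache answers for the lifetime of the process, so no upstream change can reach them. A sketch of that behavior, with invented names and IPs:

```python
# Sketch of the in-cluster failure: a resolver that caches for the process
# lifetime (as some runtimes do by default), so only a restart clears the
# stale entry. Zone contents are invented.

class ProcessResolver:
    def __init__(self, upstream):
        self.upstream = upstream
        self.cache = {}  # lives exactly as long as the process does

    def resolve(self, name):
        if name not in self.cache:
            self.cache[name] = self.upstream(name)
        return self.cache[name]  # TTL is never consulted again

zone = {"db.internal.example": "old-elb-ip"}
proc = ProcessResolver(lambda n: zone[n])
proc.resolve("db.internal.example")            # answer is now pinned

zone["db.internal.example"] = "tunnel-ip"      # the migration happens upstream
assert proc.resolve("db.internal.example") == "old-elb-ip"  # still stale

proc = ProcessResolver(lambda n: zone[n])      # "restart" the service
assert proc.resolve("db.internal.example") == "tunnel-ip"   # now converged
```

From Route53’s point of view, nothing was ever wrong. The stale state lived entirely inside processes we had to go find.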


The Lesson

From our planning, review, and execution, the migration was correct.

From DNS’s perspective, the system was still in transition somewhere.

That gap is where things break.

DNS doesn’t give you a clean cutover.

It gives you a period where different parts of the world believe different things about your system.

And unless you explicitly account for that, you don’t have a deployment.

You have a coordination problem.

DNS is hard.


How things fail

To summarize where things break:

1. No global view of state

There is no “current DNS state,” only:
“what does resolution look like from here, right now?”


2. Caching

Caching happens everywhere: browsers, operating-system stub resolvers, recursive resolvers, language runtimes and libraries, cluster DNS.

You can’t find them all, and you definitely can’t clear them all.


3. Time is a hidden variable

TTL settings exist, but they are not strictly enforced.

DNS doesn’t change instantly. It converges over time—and not all at once.
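One reason TTLs are advisory: many intermediate resolvers clamp the TTL they receive into their own configured bounds before caching. A sketch with made-up bounds (real resolvers vary):

```python
# Sketch of TTL clamping: a cache honors not your TTL, but your TTL forced
# into its own [min, max] range. The bounds below are invented examples.

def effective_ttl(origin_ttl, min_ttl=60, max_ttl=86400):
    """The TTL a clamping cache actually honors."""
    return max(min_ttl, min(origin_ttl, max_ttl))

assert effective_ttl(30) == 60         # your "fast cutover" TTL gets raised
assert effective_ttl(604800) == 86400  # a week-long TTL gets capped at a day
assert effective_ttl(300) == 300       # only in-range TTLs survive intact
```

So even a carefully lowered TTL doesn’t guarantee fast convergence; some caches will hold your record longer than you asked.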


4. Multi-provider complexity

Route53, Cloudflare, internal DNS—all need to work together.

Each layer adds more state and more ways to be wrong.


5. It’s part of everything

TLS validation, service discovery, load balancing, failover.

When DNS is wrong, infrastructure breaks.


DNS is hard because it’s a distributed system with no global view of state, pervasive caching, time as a hidden variable, and dependencies woven through everything else.


Conclusion

DNS is simple. It’s a name resolution model that fits in your head.

In reality, it’s a globe-spanning distributed system with low visibility, weak consistency, and pervasive caching.

It looks like configuration.

It behaves like a control plane.

The gap between those two is where outages live.

DNS is hard.