DNS is Simple. DNS is Hard.
How a "simple" lookup system turns into a distributed systems problem

DNS looks like a simple mapping:
DNS :: Domain Name → IP Address
Mental Model
Let’s start with the base case: a user opens their browser and navigates to this page.
At a glance, the DNS model is brutally simple:
wespiser.com → 104.21.13.171
However, when your application makes a DNS request, it doesn’t go straight to the authoritative server. It goes to a recursive resolver—usually run by your ISP, your company, or a public provider (like 8.8.8.8).
That resolver does the actual work:
- Query root servers
- Follow referrals to TLD servers
- Query the authoritative name server
- Cache the result
- Return the answer
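The stub side of this exchange is small enough to sketch by hand. Below is a minimal DNS A-record query built directly in the RFC 1035 wire format; sending it to a recursive resolver (8.8.8.8 here, as an example) delegates all the root/TLD/authoritative legwork described above. This is a sketch, not a full client — it ignores retries, truncation, and response parsing.

```python
import struct

def build_query(name: str, qid: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query in RFC 1035 wire format."""
    # Header: ID, flags (RD=1 asks the resolver to recurse), QDCOUNT=1
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # Question: length-prefixed labels, then QTYPE=A (1), QCLASS=IN (1)
    labels = b"".join(
        bytes([len(part)]) + part.encode() for part in name.split(".")
    )
    question = labels + b"\x00" + struct.pack(">HH", 1, 1)
    return header + question

query = build_query("wespiser.com")

# To actually ask a recursive resolver (it does the iterative legwork):
#   import socket
#   s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   s.sendto(query, ("8.8.8.8", 53))
#   response = s.recv(512)
```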
At small scale, this feels like a lookup.
At internet scale, this is a distributed system.
Internet building block
For a taste of how critical this layer is: on October 21, 2016, Dyn, a DNS provider behind many of the most popular web platforms, went down for hours.
The attack was simple: have your botnet send DNS requests that are more expensive to resolve than they are to generate. Millions of unique subdomains forced resolvers to bypass caches, triggering a flood of upstream lookups that overwhelmed Dyn’s infrastructure.
The result? Reddit, Twitter, PayPal, and others were unavailable for hours.
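The mechanics of the attack fit in a few lines. The toy model below is an assumption-laden sketch, not Dyn’s actual infrastructure: a resolver that answers from cache when it can, and asks upstream when it can’t. Repeated queries for one name cost the upstream almost nothing; random-subdomain queries miss the cache every time.

```python
import random
import string

class Resolver:
    """Toy recursive resolver: answers from cache, else asks upstream."""
    def __init__(self):
        self.cache = {}
        self.upstream_queries = 0  # load pushed to the authoritative provider

    def resolve(self, name: str) -> str:
        if name not in self.cache:
            self.upstream_queries += 1      # cache miss -> upstream lookup
            self.cache[name] = "192.0.2.1"  # placeholder answer
        return self.cache[name]

resolver = Resolver()

# Normal traffic: many clients ask for the same name -> one upstream query.
for _ in range(10_000):
    resolver.resolve("example.com")

# Attack traffic: every query uses a fresh random label -> every one misses.
rng = random.Random(0)
for _ in range(10_000):
    label = "".join(rng.choices(string.ascii_lowercase, k=12))
    resolver.resolve(f"{label}.example.com")

print(resolver.upstream_queries)  # 10001: one legit miss + 10,000 attack misses
```

Ten thousand legitimate queries cost the upstream one lookup; ten thousand attack queries cost it ten thousand.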
The real failure wasn’t that Dyn went down.
The failure was that everyone depended on Dyn.
DNS is one of the few systems where you ship a change, or suffer a failure, and then wait for independent caches across the internet to agree with you.
DNS is hard.
Where it breaks down
Close your eyes and imagine: your phone rings. An exasperated manager pulls you into a service outage. You don’t know anything yet.
What do you check?
- Are the servers turned on and getting power?
- Is the network connected, and are nodes receiving messages?
- Does DNS work?
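That checklist can be (partially) automated. A minimal sketch, using only the standard library; the hostname in the comment is a placeholder, not a real endpoint:

```python
import socket

def dns_resolves(hostname: str) -> bool:
    """Last step of the checklist: can this name be resolved at all?"""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Network step: is there a path to the service, DNS aside?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# "db.internal.example" is a hypothetical service name:
# print(dns_resolves("db.internal.example"),
#       port_reachable("db.internal.example", 5432))
```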
This was the path AWS engineers found themselves walking on the night of October 19–20, 2025, when US-EAST-1 began failing.
By 12:26 AM PDT, the team had narrowed the event to DNS resolution issues for the regional DynamoDB endpoint. The underlying problem: a race condition in DynamoDB’s DNS management system.
In simple terms: the database servers were still there, the network mostly still existed, but the naming layer that told systems how to reach DynamoDB had broken.
The failure wasn’t just a race condition.
It was a race condition in a system where partial state is globally visible—and cached.
Multiple automation paths were updating DNS without coordination. When those updates collided, DNS didn’t fail cleanly. It propagated inconsistent state outward.
Once that happened, everything depending on DynamoDB couldn’t reliably find it.
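The shape of the bug is easy to model. Below is a toy simulation of uncoordinated writers, not AWS’s actual system: two automation paths doing blind read-modify-write on a record set, versus a versioned compare-and-swap that rejects a plan built on stale state.

```python
# A DNS record set that two automation paths both want to update.
record_set = {"ips": ["10.0.0.1"], "version": 1}

def blind_update(rs, new_ips):
    """Uncoordinated read-modify-write: last writer silently wins."""
    rs["ips"] = new_ips

def cas_update(rs, expected_version, new_ips):
    """Versioned compare-and-swap: a plan built on stale state is rejected."""
    if rs["version"] != expected_version:
        return False  # lost the race; re-read and re-plan
    rs["ips"] = new_ips
    rs["version"] += 1
    return True

# Without coordination, whichever automation writes last wins, regardless
# of which one had the correct picture of the world:
blind_update(record_set, ["10.0.0.2"])
blind_update(record_set, ["10.0.0.3"])   # silently clobbers the first write

# With versioning, both writers plan against version 1, but only one lands:
record_set = {"ips": ["10.0.0.1"], "version": 1}
v = record_set["version"]
assert cas_update(record_set, v, ["10.0.0.2"]) is True
assert cas_update(record_set, v, ["10.0.0.3"]) is False  # must re-plan
```

The second writer losing cleanly is the point: a rejected write is recoverable, a silent clobber is not — and with DNS, the clobbered state gets cached worldwide.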
DNS looks like configuration. But it behaves like a control plane.
DNS is hard.
Check the cache
A few years ago, I worked as an infrastructure engineer at a cloud database company. Our mission was straightforward: take a database, put it in the cloud, and make it reliable for our customers and cheap to run for us.
Also: pick up the phone when things weren’t working, and build the system to minimize such calls.
The DNS portion of this story starts with a desire to save money by removing expensive dependencies like ELB from a simple ingress route:
Route53 → ELB → compute clusters
to something more flexible:
Route53 → Cloudflare Tunnels → compute clusters
On paper, this wasn’t especially complicated.
From a systems perspective, this felt controlled.
From a DNS perspective, we were about to push a global change into a system we didn’t control—and couldn’t observe.
The Plan
The rollout strategy was straightforward:
- Stand up Cloudflare Tunnels alongside existing ELB ingress
- Route traffic through both paths
- Flip DNS one provider/region at a time
- Verify traffic flow before proceeding
We targeted a two-hour migration window during working hours.
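The rollout loop we had in mind looks roughly like this. Everything here is a stand-in — the region names are invented and verify() is a placeholder for the real checks (dig plus a control-plane probe):

```python
# Hypothetical staged cutover: flip DNS one region at a time and verify
# before proceeding; roll back and halt on the first failure.
regions = ["us-east-1", "eu-west-1", "ap-south-1"]
ingress = {r: "elb" for r in regions}

def verify(region: str) -> bool:
    """Stand-in for real traffic verification after the flip."""
    return ingress[region] == "tunnel"   # toy check: did the flip apply?

for region in regions:
    ingress[region] = "tunnel"           # flip DNS for this region only
    if not verify(region):
        ingress[region] = "elb"          # roll back and stop the rollout
        break

print(ingress)
```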
From a systems perspective, this felt safe.
From a DNS perspective, we were initiating a global convergence event and hoping it behaved for our control plane.
The Reality
We only had two ways to know if the DNS change was correct:
- running dig from wherever we happened to be
- querying our control plane to see if it could connect to the data plane
We had no global signal, no encompassing metrics dashboard to check. Nothing that told us what the system actually believed.

Most of the migration went smoothly. Changes applied, traffic flowed, TLS held.
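The check we wished we had is conceptually simple: compare answers from several vantage points and report which ones still disagree. A sketch, with invented resolver names and example IPs:

```python
def converged(views, expected):
    """True only when every vantage point already serves the new records."""
    return all(answer == expected for answer in views.values())

expected = {"198.51.100.7"}            # post-migration answer (example IP)
views = {                              # answers as seen from each vantage point
    "8.8.8.8":  {"198.51.100.7"},      # public resolver: converged
    "1.1.1.1":  {"198.51.100.7"},      # public resolver: converged
    "office":   {"203.0.113.9"},       # corporate resolver: still stale
}

stale = sorted(r for r, a in views.items() if a != expected)
print(converged(views, expected), stale)  # False ['office']
```

Running dig from one machine samples exactly one of these views; the disagreement is invisible unless you sample several.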
Then we hit an issue.
Some Kubernetes clusters were holding onto DNS state longer than expected. Even after the change, parts of the system were still resolving the old configuration.
Nothing in Route53 was wrong.
Nothing in Cloudflare was wrong.
But the system wasn’t converging.
We eventually tracked it down to DNS caching inside the clusters. We had to manually restart services to clear the cached state.
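The failure mode generalizes beyond Kubernetes: any service that resolves a name once at startup and never again holds stale state until it restarts. A minimal sketch with an invented hostname (some runtimes and client libraries really do cache this way):

```python
dns = {"ingress.example": "203.0.113.10"}      # pre-migration answer

class Service:
    """Resolves its upstream once at startup and never refreshes."""
    def __init__(self, resolve):
        self.endpoint = resolve("ingress.example")  # cached for the lifetime

old = Service(lambda name: dns[name])

dns["ingress.example"] = "198.51.100.7"        # the DNS cutover lands

assert old.endpoint == "203.0.113.10"          # still serving stale state
new = Service(lambda name: dns[name])          # "restart": re-resolve
assert new.endpoint == "198.51.100.7"
```

Nothing upstream is wrong here — the record changed correctly — yet the old process will route to the old address until someone restarts it.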
The Lesson
As far as our planning, review, and execution could tell, the migration was correct.
From DNS’s perspective, the system was still in transition somewhere.
That gap is where things break.
DNS doesn’t give you a clean cutover.
It gives you a period where different parts of the world believe different things about your system.
And unless you explicitly account for that, you don’t have a deployment.
You have a coordination problem.
DNS is hard.
How things fail
To summarize where things break:
1. No global view of state
There is no “current DNS state,” only:
“what does resolution look like from here, right now?”
2. Caching
Caching happens everywhere:
- clusters
- browsers
- operating systems
- recursive resolvers
- load balancers
- even inside your own services
You can’t find them all, and you definitely can’t clear them all.
3. Time is a hidden variable
TTL settings exist, but they are not strictly enforced: caches may clamp, extend, or simply ignore them.
DNS doesn’t change instantly. It converges over time—and not all at once.
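Even with honest TTLs, convergence is a window, not an instant. A toy model: each cache serves the old record until its own copy expires, so in the worst case a cache that fetched just before the change serves stale data for a full TTL afterward.

```python
TTL = 300  # seconds; illustrative value

def answer_at(fetch_time, now, old, new, change_time=0):
    """What a cache that last fetched at `fetch_time` serves at time `now`."""
    if fetch_time < change_time:
        # Fetched before the change: serves old until that entry expires.
        return old if now < fetch_time + TTL else new
    return new

old_ip, new_ip = "203.0.113.9", "198.51.100.7"

# Three caches fetched at t=-290, t=-30, t=-1; the change lands at t=0.
# Sampled at t=0, 60, 300, they converge at different times:
for fetched in (-290, -30, -1):
    print(fetched, [answer_at(fetched, t, old_ip, new_ip) for t in (0, 60, 300)])
```

Full convergence arrives only at change_time + TTL — and only if every cache honors the TTL, which, as noted above, is not guaranteed.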
4. Multi-provider complexity
Route53, Cloudflare, internal DNS—all need to work together.
Each layer adds more state and more ways to be wrong.
5. It’s part of everything
TLS validation, service discovery, load balancing, failover.
When DNS is wrong, infrastructure breaks.
DNS is hard because it’s a distributed system with:
- very large blast radius
- weak, implicit consistency
- hidden state
Conclusion
DNS is simple. It’s a name resolution model that fits in your head.
In reality, it’s a globe-spanning distributed system with low visibility, weak consistency, and pervasive caching.
It looks like configuration.
It behaves like a control plane.
The gap between those two is where outages live.
DNS is hard.