Unexplained DNS Glitch Threatened Internet Stability, Now Fixed

For more than four days, a server at the core of the Internet’s domain name system was out of sync with its 12 root server peers due to an unexplained glitch. This server, maintained by Internet carrier Cogent Communications, is one of the 13 root servers that provision the Internet’s root zone, which sits at the top of the hierarchical distributed database known as the domain name system, or DNS.

The DNS process begins when a user enters a domain name in their browser. The browser queries the stub resolver in the local operating system, which forwards the query to a recursive resolver. If the answer isn't already cached, the recursive resolver contacts a root server, which refers it to the name servers for the top-level domain (such as .com). Those servers in turn refer it to the domain's authoritative name server, which returns the IP address.
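The lookup chain described above can be sketched as a toy model. This is only an illustration under stated assumptions: the three dictionaries stand in for the root zone, a top-level-domain server, and an authoritative server, and every name and address in them (`tld-server.example`, `ns.example.com.`, `192.0.2.10`) is a made-up placeholder rather than real DNS data. No network traffic is involved.

```python
# Toy model of how a recursive resolver walks the DNS hierarchy.
# All zone contents below are hypothetical placeholders, not real DNS data.
ROOT_ZONE = {"com.": "tld-server.example"}  # TLD -> TLD name server
TLD_ZONES = {"tld-server.example": {"example.com.": "ns.example.com."}}
AUTH_ZONES = {"ns.example.com.": {"www.example.com.": "192.0.2.10"}}


def resolve(hostname, cache):
    """Return (address, steps) for hostname, consulting the cache first."""
    fqdn = hostname if hostname.endswith(".") else hostname + "."
    if fqdn in cache:
        return cache[fqdn], ["cache"]

    labels = fqdn.rstrip(".").split(".")
    tld = labels[-1] + "."
    domain = ".".join(labels[-2:]) + "."

    steps = []
    tld_server = ROOT_ZONE[tld]                  # root refers us to the TLD server
    steps.append("root")
    auth_server = TLD_ZONES[tld_server][domain]  # TLD refers us to the domain's server
    steps.append("tld")
    address = AUTH_ZONES[auth_server][fqdn]      # authoritative server answers
    steps.append("authoritative")

    cache[fqdn] = address                        # later lookups skip the hierarchy
    return address, steps


cache = {}
print(resolve("www.example.com", cache))
print(resolve("www.example.com", cache))
```

The second call is answered from the cache, which is why root servers are consulted only "if needed" in practice.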

Given the crucial role root servers play in ensuring one device can find any other device on the Internet, there are 13 of them, dispersed all over the world. Each root server is, in fact, a cluster of servers that are themselves geographically dispersed, providing even more redundancy.

Normally, the 13 root servers, each operated by a different entity, march in lockstep. When a change is made to the contents they host, it generally takes effect on all of them within a few seconds, or minutes at most. This tight synchronization is crucial for ensuring stability. If one root server directs lookups to one intermediate server and another root server sends them to a different intermediate server, important parts of the Internet as we know it could collapse.
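One common way to spot this kind of drift is to compare the serial number each server reports in the zone's SOA record, since a server that has stopped updating keeps an older serial. The sketch below assumes the serials have already been collected; the `find_stale` helper and the sample values are hypothetical and do not reflect measurements from the incident.

```python
def find_stale(serials):
    """Return the servers whose zone serial is behind the newest one seen.

    serials maps a server label (e.g. "a", "c") to the integer SOA serial
    it reported. Higher serials are treated as newer.
    """
    newest = max(serials.values())
    return sorted(name for name, serial in serials.items() if serial < newest)


# Hypothetical sample data: one server is several zone versions behind.
sample = {"a": 2024010400, "b": 2024010400, "c": 2024010100}
print(find_stale(sample))
```

A plain integer comparison keeps the sketch short; a production check would also have to handle serial-number wraparound.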

More importantly, root servers store the cryptographic keys necessary to authenticate some of the intermediate servers under a mechanism known as DNSSEC. If keys aren’t identical across all 13 root servers, there’s an increased risk of attacks such as DNS cache poisoning.

For reasons that remain unclear outside of Cogent, all 12 instances of the c-root server it maintains suddenly stopped updating on Saturday. The lag could have caused stability and security problems worldwide. The issue has now been fixed, and the cause is still under investigation.

Read more: arstechnica.com