Resolver centricity experiment

The Domain Name System (DNS) is a distributed database with a hierarchical structure. A complete DNS resolution is a time-consuming process composed of repeated network exchanges between the DNS resolver and a chain of name servers. Caching was therefore introduced to speed up the process and reduce Internet traffic. Since resolvers are our main traffic source, their caching behaviour directly affects the traffic we receive at the .nz DNS infrastructure.

Resolver implementations differ in which NS RRset they prefer for a domain, depending on whether it came from the parent zone or the child zone. These implementation differences result in different traffic volumes at authoritative name servers when the NS RRset is configured with different TTL (Time To Live) values in the parent and child zones. DNS standards do not prescribe this aspect of resolver behaviour, known as its centricity.

In this post we outline an experiment designed to detect the centricity patterns of major local ISP resolvers in New Zealand and of public DNS resolvers. The work is inspired by a presentation from Ólafur Guðmundsson available here.

DNS Resolution

The figure below shows the non-cached resolution process for the address of the domain name string.sub.experiment.nz.

The process involves the following steps (for a clean cache):

  1. The client (end-user) sends a DNS query to the resolver configured in their network settings (normally the ISP's DNS resolver or a public DNS resolver).
  2. The resolver doesn't know the address for string.sub.experiment.nz, but knows the IP addresses for the root name servers, so it asks one of them.
  3. The root server also doesn't have the information about string.sub.experiment.nz, but it holds the locations of the name servers for each top-level domain (such as .nz, .au, .com, etc.), so it answers with a referral (NS RRset) to the .nz name servers, which might know the address of string.sub.experiment.nz.
  4. The resolver then starts another iterative query against one of the .nz name servers.
  5. The .nz name server holds a list of name servers for each domain delegated in its zone rather than the address of string.sub.experiment.nz, so it responds with a referral to the experiment.nz name servers.
  6. The resolver makes its third iterative query against one of the experiment.nz name servers.
  7. Like the preceding servers, the experiment.nz name server replies with a referral to the sub.experiment.nz name servers, which hold the information about string.sub.experiment.nz.
  8. The resolver sends the query to the sub.experiment.nz name server.
  9. The name server replies with the A or AAAA records (depending on the query type) containing IPv4 or IPv6 addresses for string.sub.experiment.nz.
  10. The resolver sends back the final answer to the client.

During this process, the resolver handles the recursive resolution on behalf of the client, repeating the query and following successive referrals until it gets the final answer.
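The referral chain above can be sketched with a toy lookup table. The zone contents and final address here are illustrative only, not the real production data:

```python
# Toy referral tables: each zone maps the child zones it delegates
# onto their name servers; a zone with no delegations answers directly.
ZONES = {
    ".":                  {"nz.": "ns.nz.example."},
    "nz.":                {"experiment.nz.": "ns1.experiment.nz."},
    "experiment.nz.":     {"sub.experiment.nz.": "ns.sub.experiment.nz."},
    "sub.experiment.nz.": {},
}
ANSWERS = {"string.sub.experiment.nz.": "127.0.53.53"}

def resolve(qname):
    """Follow referrals from the root down to the authoritative answer."""
    zone = "."
    path = [zone]
    while True:
        # Look for a delegation in the current zone that covers qname.
        child = next((c for c in ZONES[zone] if qname.endswith("." + c)), None)
        if child is None:
            return ANSWERS[qname], path  # authoritative zone reached
        zone = child
        path.append(zone)
```

Calling resolve("string.sub.experiment.nz.") walks the same root → .nz → experiment.nz → sub.experiment.nz chain as steps 2-9 above.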

Resolver caching and time-to-live (TTL)

In normal operation, a resolver caches the answers it gets from name servers for future reference. This speeds up DNS resolution and reduces traffic to name servers. To keep caches up to date with changes in zone data, every DNS record has a TTL, which controls how long the record can be cached before it is no longer considered valid.

Besides caching answers to clients' queries, a resolver also caches the referrals (NS RRsets) to name servers that it receives during resolution. These are used later when an answer is not in its cache. Depending on the implementation, some resolvers only keep the NS RRsets from the parent zone; these are called parent-centric resolvers. Others overwrite the parent NS RRsets with the child zone's NS RRsets when receiving answers from the child zone; these are called child-centric resolvers.

In practice, it is very common for NS RRsets to have different TTL values configured in the parent and child zones, so the caching time for the same query differs between parent-centric and child-centric resolvers.
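A minimal sketch of the two update policies might look like the hypothetical class below (this is not any real resolver's code, just the cache rule reduced to its essentials):

```python
class NSCache:
    """Minimal NS RRset cache illustrating parent- vs child-centricity."""

    def __init__(self, centricity):
        self.centricity = centricity  # "parent" or "child"
        self.entry = None             # (nsset, expiry time)

    def store(self, nsset, ttl, source, now):
        # A parent-centric resolver keeps the referral from the parent;
        # a child-centric one lets the child zone's RRset overwrite it.
        if source == "parent" or self.centricity == "child":
            self.entry = (nsset, now + ttl)

    def lookup(self, now):
        # Return the cached NS RRset only while it is within its TTL.
        if self.entry and now < self.entry[1]:
            return self.entry[0]
        return None
```

With a short parent TTL and a long child TTL, the parent-centric cache expires early while the child-centric cache keeps serving the overwritten RRset, which is exactly the difference the experiment is designed to expose.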

Parent and Child Zone Setup

To detect a resolver's centricity pattern, we set up a parent zone, a child zone, and the delegation chain between them, as shown in the figure below.

The steps are outlined below:

  1. We registered experiment.nz with two name servers and, for simplicity, pointed both name servers to the same IP address. The following records were added to the .nz zone:

    experiment.nz. 86400 IN NS ns1.experiment.nz.
    experiment.nz. 86400 IN NS ns2.experiment.nz.
    ns1.experiment.nz. 86400 IN A 150.242.40.246
    ns2.experiment.nz. 86400 IN A 150.242.40.246

    (Note: 86400 is the default TTL for all delegations and cannot be altered)

  2. We set up the experiment.nz zone:

    experiment.nz. 86400 IN NS ns1.experiment.nz.
    experiment.nz. 86400 IN NS ns2.experiment.nz.
    ns1.experiment.nz. 86400 IN A 150.242.40.246
    ns2.experiment.nz. 86400 IN A 150.242.40.246
    sub.experiment.nz. 10 IN NS ns.sub.experiment.nz.
    ns.sub.experiment.nz. 10 IN A 150.242.41.248

  3. We set up sub.experiment.nz zone:

    sub.experiment.nz. 120 IN NS ns.sub.experiment.nz.
    ns.sub.experiment.nz. 120 IN A 150.242.41.248
    * 0 IN A 127.0.53.53

    We use the wildcard to match any query name in our experiment and set its TTL to 0 so the answer is not cached, which lets us focus on the caching time of the referral.

RIPE Atlas

We used RIPE Atlas, an Internet measurement platform that provides thousands of active probes around the world and a REST API for conducting measurements, to simulate DNS queries from end users. We selected 80 active probes located in New Zealand, all of which could be used to query public resolvers such as GoogleDNS and OpenDNS, and compiled a list of ISP resolver addresses for the providers hosting the probes.

Experiment

Each probe sent DNS queries to its resolver's IP addresses at a fixed interval for a period of time. Measurements for different resolvers were run in parallel for efficiency, while measurements for the same resolver (multiple IPs) were run one after another to ensure they did not interfere with each other's cached data.

To help identify queries from different measurements, we generated a unique query name for each run by concatenating the resolver name, the resolver IP, the probe ID, and the timestamp of the run. For example, a query for the name opendns208.67.222.222-10269-1480038557.sub.experiment.nz was sent by probe #10269 to OpenDNS's 208.67.222.222 address, and 1480038557 was the start timestamp of the run.
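The naming scheme can be expressed as a small helper. This is a sketch of the composition rule, not the actual measurement code:

```python
def make_qname(resolver_name, resolver_ip, probe_id, start_ts,
               zone="sub.experiment.nz"):
    # resolver name + resolver IP + "-" + probe ID + "-" + run timestamp,
    # all under the experiment's child zone.
    return f"{resolver_name}{resolver_ip}-{probe_id}-{start_ts}.{zone}"
```

The wildcard record in sub.experiment.nz guarantees that every such name resolves, so uniqueness costs nothing.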

Result Analysis

From the query log we extract four fields to build query sequences at both parent and child name servers:

  • timestamp (ts)
  • resolver's service IP (probing target)
  • probe ID
  • resolver's source IP querying the name server

As the TTL of the queried record is set to 0, every query should reach the child name server. The query reaches the parent server as well only when the NS record is not found in the cache (on the first query, or after the TTL expires) and the resolver has to ask the parent name server for the referral. Since each query name is unique, by checking the parent and child query logs we can mark a query with 'P' if it was sent to both servers, or 'C' if it was sent only to the child server.
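Because every query name is unique, the labelling reduces to a set-membership check against the parent log. A sketch, with the parsing of the real log fields omitted:

```python
def label_queries(parent_qnames, child_qnames):
    """Mark each child-server query 'P' if it also reached the parent,
    'C' if it was answered from the cached referral."""
    seen_at_parent = set(parent_qnames)
    return "".join("P" if q in seen_at_parent else "C" for q in child_qnames)
```

Sorting the child-log queries by timestamp before labelling yields the P/C sequences shown below.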

With a fixed query interval of 60s, we tested three different combinations of parent TTL and child TTL, and verified the expected patterns shown below:

  • parent-ttl=10s, child-ttl=120s

    PPPPPPPPPPPPPPPPP <- Parent Centric
    PCPCPCPCPCPCPCPCP <- Child Centric

  • parent-ttl=5m, child-ttl=30s

    PCCCCPCCCCPCCCCPC <- Parent Centric
    PPPPPPPPPPPPPPPPP <- Child Centric

  • parent-ttl=5m, child-ttl=10m

    PCCCCPCCCCPCCCCPC <- Parent Centric
    PCCCCCCCCCPCCCCCC <- Child Centric
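These expected sequences can be reproduced with a simplified cache model in which only a parent query refreshes the cached NS RRset (an assumption of this sketch, not a claim about any specific resolver):

```python
def expected_pattern(interval, parent_ttl, child_ttl, centricity, n=17):
    """Simulate which of n queries, sent `interval` seconds apart, must
    also hit the parent server ('P') versus only the child ('C')."""
    marks, expires = [], None
    for i in range(n):
        t = i * interval
        if expires is None or t >= expires:
            marks.append("P")  # referral not cached: ask the parent
            # The cached RRset's TTL depends on the resolver's centricity.
            ttl = parent_ttl if centricity == "parent" else child_ttl
            expires = t + ttl
        else:
            marks.append("C")  # referral still cached: child only
    return "".join(marks)
```

For example, with a 60s interval, expected_pattern(60, 10, 120, "parent") yields all 'P's, while the child-centric run alternates 'P' and 'C', matching the first combination above.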

There could be other patterns indicating minimum TTL enforcement, TTL stretching, etc.

We found that for a single measurement (same target and same probe), the source IP address in the query log changed from query to query. This may be due to the architecture of a high-performance resolver, where a load balancer announces the service IP address with two or more DNS servers behind it doing the recursive resolution. The choice of server is unpredictable and could be affected by many factors, such as policy, server load, network performance, or probe location.

We could also tell whether the balanced servers were sharing caches (behaving as one server) by analysing consecutive queries to the same service IP. If they were not, each source IP was treated as an independent caching server and analysed separately.

For local ISP resolvers, we observed 33 servers (source IP addresses) in the query log, with the following patterns:

  • parent-centric: 5 servers
  • child-centric: 13 servers
  • minimum (parent-ttl, child-ttl): 11 servers
  • mixed pattern: 2 servers from the same resolver show a pattern mixing child-centric and NS non-cached behaviour.
  • NS non-cached: 2 servers from the same resolver send queries to both parent and child at the probing interval, regardless of the NS TTL value.

Among all the servers above, two from the same resolver were sending queries to the child server at a larger interval than the probing interval, possibly because the 0 TTL of the queried record was replaced by a minimum TTL enforcement value.

For public DNS resolvers, we obtained the following results:

  • 6 addresses from OpenDNS, all of which behave as child-centric
  • 73 addresses in four prefixes (173.194.171/24, 173.194.93/24, 74.125.41/24 and 103.9.106/24) from GoogleDNS, which showed quite unique behaviour: most queries used two different IPs to query the parent and the child name servers. As GoogleDNS's addresses 8.8.8.8 and 8.8.4.4 are very popular probing targets on RIPE Atlas and prone to cache conflicts, we only probed them with parent-ttl=10s and child-ttl=120s, which showed a steady parent-centric pattern; the other TTL combinations could be tested in the future.

Future Work

In this post we detailed an experiment to detect the centricity of local ISP resolvers and public resolvers, using RIPE Atlas probes located in New Zealand. We found interesting patterns and behaviours indicating different resolver implementations and architectures. Besides centricity, other factors, such as TTL values and query intervals for different domain names, also play a critical role in how much of the traffic to the .nz name servers is reduced by caching; we will explore these in the future.