RFC 8767 Section 7: stale data vs. new lookup

Solution 1:

The text you are quoting has to be read in the context of the passage immediately before it:

Stale data is used only when refreshing has failed in order to adhere to the original intent of the design of the DNS and the behavior expected by operators. If stale data were to always be used immediately and then a cache refresh attempted after the client response has been sent, the resolver would frequently be sending data that it would have had no trouble refreshing.

What it means is that the resolver is allowed to serve stale data (with a shortened TTL) only when it does not manage to fetch new data before the "client response timer" fires. This is where the suggested 1.8s, from earlier in the text, is relevant.

This also does not mean that it stops processing this lookup just because it served stale data to the client. It should still try to finish the lookup and update the cache for the future to try to avoid serving stale data the next time.
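
As a rough illustration of that behavior (a sketch only, not how any particular resolver implements it), the following Python uses asyncio to answer from stale data once the client response timer fires while the refresh keeps running in the background. The names `resolve`, `lookup`, `fresh_cache` and `stale_cache` are hypothetical, and the caches are plain dicts to keep the example short.

```python
import asyncio

CLIENT_RESPONSE_TIMER = 1.8  # seconds; the value suggested in RFC 8767


async def resolve(name, fresh_cache, stale_cache, lookup,
                  client_response_timer=CLIENT_RESPONSE_TIMER):
    """Return (answer, is_stale) for `name`.

    `lookup` is a hypothetical coroutine function that performs the real
    (possibly slow or failing) resolution.
    """
    if name in fresh_cache:
        return fresh_cache[name], False

    task = asyncio.create_task(lookup(name))

    def store_result(finished):
        # Runs whenever the lookup ends, even after the client was answered.
        if not finished.cancelled() and finished.exception() is None:
            fresh_cache[name] = finished.result()

    task.add_done_callback(store_result)

    # Wait for the lookup, but only until the client response timer fires.
    done, _pending = await asyncio.wait({task}, timeout=client_response_timer)
    if done and task.exception() is None:
        return task.result(), False

    # Timer fired or the lookup failed: fall back to stale data if we have it.
    # The task is not cancelled, so a slow lookup can still update the cache.
    if name in stale_cache:
        return stale_cache[name], True

    # No stale data: the client gets whatever the lookup produces (or its error).
    return await task, False
```

The point of the sketch is that the lookup task is never cancelled when stale data is returned, so the next request can find fresh data in the cache.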

If the lookup is not merely slower than the "client response timer" (suggested 1.8s) but slower than the "query resolution timer" (the maximum time for a query, suggested 10-30s), or even outright impossible, then the resolver only needs to attempt the query again once every "failure recheck timer" seconds (suggested 30s).

I.e., if you tried to refresh within the last 30s (or whatever failure recheck timer is used) and that attempt outright failed (or possibly is still ongoing), you are good to serve stale data. Otherwise you are not.
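
The failure recheck rule above boils down to a single comparison. A minimal sketch in Python, assuming the resolver records the time of its last failed refresh per name; the function name and its parameters are made up for illustration:

```python
FAILURE_RECHECK_TIMER = 30.0  # seconds; the value suggested in RFC 8767


def may_serve_stale_immediately(last_failed_refresh, now,
                                recheck_timer=FAILURE_RECHECK_TIMER):
    """True if a refresh failed recently enough that retrying can be skipped.

    `last_failed_refresh` is the time of the most recent failed attempt
    (None if there was none); both times are seconds on the same clock,
    for example time.monotonic().
    """
    if last_failed_refresh is None:
        return False
    return now - last_failed_refresh < recheck_timer
```

With the default of 30s, a failure at t=100 allows an immediate stale answer at t=110, while a request at t=131 has to trigger a new refresh attempt first.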

You should treat not serving stale data as the goal. What RFC 8767 says is essentially that when the traditional behavior would have been to time out (from the client's point of view) or to return an error because you can't get current data, you are allowed to temporarily serve stale data while you continue trying to get the current data.

From Section 5. Example Method:

When a recursive resolver receives a request, it should start the
client response timer. This timer is used to avoid client timeouts.
It should be configurable, with a recommended value of 1.8 seconds as
being just under a common timeout value of 2 seconds while still
giving the resolver a fair shot at resolving the name.

The resolver then checks its cache for any unexpired records that
satisfy the request and returns them if available. If it finds no
relevant unexpired data and the Recursion Desired flag is not set in
the request, it should immediately return the response without
consulting the cache for expired records. Typically, this response
would be a referral to authoritative nameservers covering the zone,
but the specifics are implementation dependent.

If iterative lookups will be done, then the failure recheck timer
is consulted. Attempts to refresh from non-responsive or otherwise
failing authoritative nameservers are recommended to be done no more
frequently than every 30 seconds. If this request was received
within this period, the cache may be immediately consulted for stale
data to satisfy the request.

Outside the period of the failure recheck timer, the resolver
should start the query resolution timer and begin the iterative
resolution process. This timer bounds the work done by the
resolver when contacting external authorities and is commonly
around 10 to 30 seconds. If this timer expires on an attempted
lookup that is still being processed, the resolution effort is
abandoned.

If the answer has not been completely determined by the time the
client response timer has elapsed, the resolver should then check its
cache to see whether there is expired data that would satisfy the
request. If so, it adds that data to the response message with a TTL
greater than 0 (as specified in Section 4). The response is then
sent to the client while the resolver continues its attempt to
refresh the data.

When no authorities are able to be reached during a resolution
attempt, the resolver should attempt to refresh the delegation and
restart the iterative lookup process with the remaining time on the
query resolution timer. This resumption should be done only once per
resolution effort.

Outside the resolution process, the maximum stale timer is used for
cache management and is independent of the query resolution process.
This timer is conceptually different from the maximum cache TTL that
exists in many resolvers, the latter being a clamp on the value of
TTLs as received from authoritative servers and recommended to be 7
days in the TTL definition in Section 4. The maximum stale timer
should be configurable. It defines the length of time after a record
expires that it should be retained in the cache. The suggested value
is between 1 and 3 days.
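
To keep the four timers from the quoted section apart, it may help to see them side by side. This is only a sketch: the class and field names are invented here, the defaults are the values the RFC text above suggests, and where the RFC gives a range one value was picked arbitrarily.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServeStaleTimers:
    """All values in seconds; the names are hypothetical."""
    client_response: float = 1.8     # answer the client before it times out
    query_resolution: float = 30.0   # "commonly around 10 to 30 seconds"
    failure_recheck: float = 30.0    # minimum gap between failed refresh attempts
    max_stale: float = 86400.0       # 1 day; the suggested range is 1 to 3 days

    def still_retained(self, expired_at, now):
        """True while an expired record is within the maximum stale timer."""
        return now - expired_at <= self.max_stale
```

Only the first three timers take part in handling a query. The maximum stale timer decides how long an expired record stays in the cache at all, which is why it is modeled as a separate check here.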