Force request to miss cache but still store the response

I have a slow web app that I've placed Varnish in front of. All of the pages are static (they don't vary for a different user), but they need to be updated every 5 minutes so they contain recent data.

I have a simple script (wget --mirror) that crawls the entire website every 15 minutes. Each crawl takes about 5 minutes. The point of the crawl is to update every page in the Varnish cache so that a user never has to wait for the page to generate (since all pages have been generated recently thanks to the spider).

The timeline looks like this:

  • 00:00:00: Cache flushed
  • 00:00:00: Spider starts crawling to update cache with new pages
  • 00:05:00: Spider finishes crawling, all pages are updated until 00:15:00

A request that comes in between 00:00:00 and 00:05:00 might hit a page that hasn't been updated yet, and will be forced to wait a few seconds for a response. This isn't acceptable.

What I'd like to do is, perhaps using some VCL magic, always forward requests from the spider to the backend, but still store the response in the cache. This way, a user will never have to wait for a page to generate, since there is no 5-minute window in which parts of the cache are empty (except perhaps at server startup).

How can I do this?


Solution 1:

req.hash_always_miss should do the trick.

Don't do a full cache flush at the start of the spider run. Instead, just set the spider to work, and in your vcl_recv, make the spider's requests always miss the cache lookup; they'll fetch a new copy from the backend.

acl spider {
  "127.0.0.1";
  /* or wherever the spider comes from */
}

sub vcl_recv {
  if (client.ip ~ spider) {
    set req.hash_always_miss = true;
  }
  /* ... and continue as normal with the rest of the config */
}

While that's happening, and until the new response is in the cache, clients will continue to be served the older cached copy seamlessly (as long as it's still within its TTL).
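Because this approach depends on the old copy still being within its TTL when the spider comes back around, it may help to set the TTL explicitly. The following is only a sketch, assuming the Varnish 3-era vcl_fetch / beresp syntax used elsewhere on this page; the 20m value is an assumption based on the 15-minute crawl interval plus the roughly 5-minute crawl duration from the question, not a recommended setting.

```vcl
sub vcl_fetch {
  /* Assumed value: 15 minutes between crawls plus about 5 minutes
     of crawl time, so the old copy should still be valid when the
     spider's refreshed copy arrives. Adjust to your own schedule. */
  set beresp.ttl = 20m;
}
```

If the backend already sends suitable Cache-Control headers, this override may not be needed at all.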

Solution 2:

Shane's answer above is better than this one. This is an alternative solution which is more complicated and has additional problems. Please upvote Shane's response, not this one. I am just showing another method of solving the problem.


My initial thought was to return (pass); in vcl_recv and then, after the request has been fetched, in vcl_fetch, somehow instruct Varnish that it should cache the object, even though it was specifically passed earlier.

It turns out this isn't possible:

If you chose to pass the request in an earlier VCL function (e.g.: vcl_recv), you will still execute the logic of vcl_fetch, but the object will not enter the cache even if you supply a cache time.

So the next-best thing is to trigger a lookup just like a normal request, but make sure it always fails. There's no way to influence the lookup process, so it's always going to hit (assuming the object is cached; if it's not, then it's going to miss and store anyway). But we can influence vcl_hit:

sub vcl_hit {
    # is this our spider?
    if (req.http.user-agent ~ "Wget" && client.ip ~ spider) {
        # it's the spider, so purge the existing object
        set obj.ttl = 0s;
        return (restart);
    }

    return (deliver);
}

We can't force it not to use the cache, but we can purge that object from the cache and restart the entire process. Now it goes back to the beginning, at vcl_recv, where it eventually does another lookup. Since we have already purged the object we're trying to update, the lookup will miss, then fetch the data and update the cache.

A little complicated, but it works. The only window for a user getting stuck between a purge and the response being stored is the time for the single request to process. Not perfect, but pretty good.