Mirror/Cache Websites on a Mobile Server

I think the feasibility is pretty low. You have to consider the following issues, which combined I think rule out the possibility of a "usable" or transparent offline mirror.

  • HTTPS traffic is increasingly common these days, and you won't be able to cache it without installing a CA certificate on each user's device, which users should be very hesitant to allow.

  • Many websites rely heavily on client-side HTTP requests (i.e. AJAX) to function, and in most cases sites go out of their way to prevent AJAX responses from being cached, appending a timestamp to the URL so that every request is treated as a unique URL.

  • You can basically rule out any stateful site (i.e. one that requires a login). Obviously you can't cache person X's Facebook profile unless they've already viewed it, and even if they have, the value of these sites is severely diminished without real-time updates. Plus, your cache lookup will have to depend on the value of a cookie, which decreases the chance that you'll get a hit on a page requested earlier by someone else.

  • How do most people get to sites? People rarely type URLs; typically they search for things, even things whose URL they know, such as Facebook. It would be a challenge to cache complex search engine results because they're likely to be stateful (e.g. if you search Google while logged into a Google account, your results will be different).

  • When browsing the web, what percentage of content is new versus content you've seen before? Even on a site you visit a lot, like Facebook, you'll frequently click through to new pages.

  • Some sites now use WebSockets. I'm not sure about the exact details, but I imagine it would be difficult to emulate or replay a WebSockets interaction.
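To illustrate the cache-busting point above: many sites (and, for example, jQuery's `cache: false` option) append a current-timestamp parameter so that each request URL is unique and can never be answered from a cache. A minimal sketch of the technique:

```javascript
// Typical AJAX cache-busting: append the current time so every
// request URL is distinct and a proxy cache can never get a hit.
function bustCache(url) {
  const sep = url.includes("?") ? "&" : "?";
  return url + sep + "_=" + Date.now();
}

// bustCache("/api/feed") yields something like "/api/feed?_=1699999999999",
// different on every call.
```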

If you have some reason to believe that your users will be visiting the same set of pages (e.g. a set of documentation) a large percentage of the time, and that content is not stateful, then it might be feasible.


I actually set up something very similar to this once a year for a week-long event that's held in the middle of nowhere, so I have a little experience to share.

First, the TL;DR: You can do it, but it won't work nearly as well as you (or your higher-ups) might hope. It might not be worth bothering, especially if the interruptions are brief. But you might want to do it anyway, in order to save bandwidth and provide a faster experience when you are connected to 3G.


The component you're looking for is a transparent proxy: one which intercepts outgoing HTTP requests that weren't intended by the client to be proxied, and diverts them to a proxy server. Squid is the most common software used for transparent proxying, and it's what I use.

The way this works is: a switch or router intercepts packets destined for port 80 of a remote address and mangles them so that they connect to the proxy instead. The proxy then checks its cache, and on a miss it goes out to the network; typical proxy stuff. I do this diversion with some simple Linux iptables rules, though many routers and switches can also be configured to do it.
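As a sketch, the iptables side of the diversion can be as simple as this (the interface name and port here are assumptions; the port must match the `intercept` listener in your squid.conf):

```shell
# Redirect outbound HTTP (port 80) arriving on the LAN interface
# to a local Squid instance listening in intercept mode.
# eth0 and 3129 are examples; adjust to your setup.
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 \
    -j REDIRECT --to-ports 3129

# The matching Squid listener would be configured with:
#   http_port 3129 intercept
```

Requires root, and the box must be in the traffic path (acting as the gateway) for PREROUTING to see the packets.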

For your purposes, you will also need to do some significant tweaking to Squid's configuration to override its cache handling. In particular, you will want it to serve a stale cached item when it fails to revalidate that item on the network. I don't have the configuration for this offhand, since it isn't necessary in my design, where I'm at a fixed point and have continuous wireless service. But some careful documentation reading ought to suggest a way to do it.
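As a hedged sketch (I haven't needed these in my own setup), Squid has directives along these lines; check your version's documentation before relying on them:

```
# squid.conf fragment -- untested sketch

# Allow serving cached objects past their freshness lifetime when
# the origin can't be revalidated, instead of returning an error.
max_stale 1 week

# Keep and reuse cached copies aggressively, overriding some
# origin cache-control headers. This deliberately violates HTTP
# caching semantics, so use with care.
refresh_pattern . 0 50% 10080 override-expire ignore-reload ignore-no-store

# offline_mode on   # even more aggressive: never validate, serve
                    # whatever is cached (could be toggled when the
                    # uplink drops)
```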

You will also want to create some custom Squid error pages which refer to your company and explain the various out of service conditions to be expected.

And now for the down side.

You won't be able to do this with HTTPS requests at all. While Squid does support a method of intercepting HTTPS requests similarly to HTTP requests, you won't be able to use it, as it would require creating a CA and installing a certificate in every client's browser. That's easy enough in an enterprise, but not something you can do for a public service. And even if you could, it is not at all user friendly, it will set off alarms in any privacy-minded person's mind, and it is illegal in some countries.

In addition, WebSockets, used by many web sites these days, will almost always fail when a transparent proxy is involved, because the proxy, doing what it is supposed to do, mangles the Upgrade request beyond recognition. There is little you can do about this, except advise users to explicitly configure the proxy server. In that case the browser knows to format the request differently, using HTTP CONNECT, so that it passes through the proxy unmolested.
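For illustration, the difference is visible on the wire. With an explicitly configured proxy, the browser first asks the proxy to open a raw tunnel (hostname below is an example), and the TLS or WebSocket traffic then flows through that tunnel opaquely:

```
CONNECT chat.example.com:443 HTTP/1.1
Host: chat.example.com:443

HTTP/1.1 200 Connection established

(encrypted TLS / WebSocket bytes follow; the proxy relays them
 without parsing, so nothing gets mangled -- and nothing gets cached)
```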

Finally, after having spoken to some people familiar with traveling on Australia's trains, I learned that these outages can sometimes last 10 to 15 minutes. There's very little you can do about this: someone browsing the web during that time is quite likely to click on a link to a site you haven't yet cached, and then you are not much better off than you are now. If you have the cache in place, though, you can at least advise the passenger of the situation (at least over HTTP). While the Internet is out, passengers might be better served by looking out the windows and trying to spot the Nullarbor Nymph.


And some basic stats. Last year the service pulled 42 GB of data over the network and served an additional 17 GB from cache, a hit ratio of roughly 29% by volume. This year the service pulled 87 GB and served just 744 MB from cache, under 1%. That's not a miscalculation, nor, as far as I can tell, a configuration error. The majority of the difference between last year's caching and this year's seems to be that more major web sites are now forcing HTTPS. For instance, last year I was able to cache some YouTube videos; this year I could not, because they are now served over HTTPS.

With more and more web sites moving to HTTPS, this caching strategy becomes less and less viable every year, and running the cache at all seems to be more and more pointless.

My recommendation is that you not bother. But you could set one up, run a trial on one train, and then measure the results.

You might also experiment with instructing users to configure the proxy explicitly, so that you can handle HTTPS and WebSockets, though in my experience this is something users find difficult to get right. You might be able to implement WPAD to configure some users automatically, but be aware that Android and iOS devices have poor or no support for it.
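If you do try WPAD, the moving parts are a DNS entry (or DHCP option 252) pointing browsers at a host conventionally named `wpad`, which serves a proxy auto-config (PAC) file, typically at `http://wpad/wpad.dat`. A minimal PAC file might look like this (the proxy hostname and port are assumptions for illustration):

```javascript
// wpad.dat -- proxy auto-config sketch; hostname and port assumed.
// Browsers that support WPAD call FindProxyForURL for each request.
function FindProxyForURL(url, host) {
  // Send everything through the proxy, including HTTPS and
  // WebSockets (which arrive as CONNECT tunnels). "DIRECT" is
  // the fallback if the proxy itself is unreachable.
  return "PROXY proxy.train.example:3128; DIRECT";
}
```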