Best practice for proxying package repositories

Solution 1:

We use Squid for this; the nice thing about Squid is that you can fairly easily set individual expiry of objects based on a pattern match, which allows the metadata from the yum repo to be purged quickly while the packages themselves stay cached for a long time. Here is the config we use to implement this:

refresh_pattern (Release|Packages(.gz)*)$      0       20%     2880
refresh_pattern (\.xml|xml\.gz)$      0       20%     2880
refresh_pattern (sqlite\.bz2)$      0       20%     2880
refresh_pattern (\.deb|\.udeb)$   1296000 100% 1296000
refresh_pattern (\.rpm|\.srpm)$   1296000 100% 1296000
refresh_pattern .        0    20%    4320

http://www.squid-cache.org/Doc/config/refresh_pattern/
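
The three numeric fields after each regex are, roughly, the minimum freshness (in minutes), the percentage of the object's age, and the maximum freshness (in minutes). So the values above translate as:

# refresh_pattern <regex> <min minutes> <percent> <max minutes>
#   2880 minutes    =   2 days  (repo metadata goes stale quickly)
#   1296000 minutes = 900 days  (package files effectively never expire)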

Solution 2:

That's a textbook use case for a proxy. A normal forward proxy, not a reverse proxy (a.k.a. load balancer).

The best-known free and open-source one is Squid. Luckily, it's one of the few pieces of good open-source software that can easily be installed with a single apt-get install squid3 and configured with a single file, /etc/squid3/squid.conf.

We'll go over the good practices and the lessons to know about.

The official configuration file, slightly modified (the 5000 useless commented lines were removed):

#       WELCOME TO SQUID 3.4.8
#       ----------------------------
#
#       This is the documentation for the Squid configuration file.
#       This documentation can also be found online at:
#               http://www.squid-cache.org/Doc/config/
#
#       You may wish to look at the Squid home page and wiki for the
#       FAQ and other documentation:
#               http://www.squid-cache.org/
#               http://wiki.squid-cache.org/SquidFaq
#               http://wiki.squid-cache.org/ConfigExamples
#

###########################################################
# ACL
###########################################################

acl SSL_ports port 443
acl Safe_ports port 80          # http
acl Safe_ports port 21          # ftp
acl Safe_ports port 443         # https
acl Safe_ports port 1025-65535  # unregistered ports

acl CONNECT method CONNECT

#####################################################
# Recommended minimum Access Permission configuration
#####################################################
# Deny requests to certain unsafe ports
http_access deny !Safe_ports

# Deny CONNECT to other than secure SSL ports
http_access deny CONNECT !SSL_ports

# Only allow cachemgr access from localhost
http_access allow localhost manager
http_access deny manager

#####################################################
# ACL
#####################################################

# access is limited to our subnets
acl mycompany_net   src 10.0.0.0/8

# access is limited to whitelisted domains
# ".example.com" includes all subdomains of example.com
acl repo_domain dstdomain .keyserver.ubuntu.com
acl repo_domain dstdomain .debian.org
acl repo_domain dstdomain .python.org

# clients come from a known subnet AND go to a known domain
http_access allow repo_domain mycompany_net

# And finally deny all other access to this proxy
http_access deny all

#####################################################
# Other
#####################################################

# default proxy port is 3128
http_port 0.0.0.0:3128

# don't forward internal private IP addresses
forwarded_for off

# disable ALL caching
# bandwidth is cheap. debugging cache related bugs is expensive.
cache deny all

# logs
# Note: not sure if squid configures logrotate or not
access_log daemon:/var/log/squid3/access.log squid
access_log syslog:squid.INFO squid


# leave coredumps in the first cache dir
coredump_dir /var/spool/squid3

# force immediate expiry of items in the cache.
# caching is already disabled; this is just an additional precaution.
refresh_pattern .               0       0%      0

Client Configuration - Environment Variables

Configure these two environment variables on all systems.

http_proxy=squid.internal.mycompany.com:3128
https_proxy=squid.internal.mycompany.com:3128

Most HTTP client libraries (libcurl, httpclient, ...) configure themselves using these environment variables. Most applications use one of the common libraries and thus support proxying out-of-the-box (without the dev necessarily knowing that they do).

Note that the syntax is strict:

  1. The variable name http_proxy MUST be lowercase on most Linux systems.
  2. The variable value MUST NOT begin with http(s):// (the proxying protocol is NOT http(s)).
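
One way to make these persistent on every machine is a small profile snippet. A minimal sketch, assuming a Debian-like system (the path and the no_proxy entries are illustrative):

# /etc/profile.d/proxy.sh
export http_proxy=squid.internal.mycompany.com:3128
export https_proxy=squid.internal.mycompany.com:3128
# hosts that should be reached directly, bypassing the proxy
export no_proxy=localhost,127.0.0.1,.internal.mycompany.com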

Client Configuration - Specific

Some applications ignore the environment variables and/or run as a service before the variables can be set (e.g. Debian apt).

These applications will require special configuration (e.g. /etc/apt/apt.conf).
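
For apt, a minimal sketch is a snippet along these lines (the file name is arbitrary; note that apt expects a full URL here, unlike the environment variables above):

# /etc/apt/apt.conf.d/80proxy
Acquire::http::Proxy  "http://squid.internal.mycompany.com:3128";
Acquire::https::Proxy "http://squid.internal.mycompany.com:3128";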

HTTPS Proxying - Connect

HTTPS proxying is fully supported by design. It uses a special "CONNECT" method which establishes a tunnel between the client and the destination server, through the proxy.

I don't know much about the internals, but I've never had an issue with it in years. It just works.
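
For the curious, the exchange looks roughly like this: the client asks the proxy to open a raw tunnel to the destination, and everything after that (including the TLS handshake) flows through it opaquely.

CONNECT deb.debian.org:443 HTTP/1.1
Host: deb.debian.org:443

HTTP/1.1 200 Connection established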

HTTPS Special Case - Transparent Proxy

A note on transparent proxies (i.e. the proxy is hidden and intercepts client requests, man-in-the-middle style).

Transparent proxies break HTTPS. The client doesn't know that there is a proxy and has no reason to use the special CONNECT method.

The client tries a direct HTTPS connection... which gets intercepted. The interception is detected and errors are thrown all over the place (HTTPS is meant to detect man-in-the-middle attacks).

Domain and CDN whitelisting

Domain and subdomain whitelisting is fully supported by squid. Nonetheless, it's bound to fail in unexpected ways from time to time.

Modern websites can have all sorts of domain redirections and CDNs. That will break the ACLs whenever the site owners didn't go the extra mile to put everything neatly under a single domain.
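
PyPI is a typical example: the index lives on pypi.org but the actual package files are served from a separate CDN domain, so both have to be whitelisted. A sketch, to be adapted to whatever your clients actually hit:

acl repo_domain dstdomain .pypi.org
acl repo_domain dstdomain .pythonhosted.org   # CDN domain serving the actual package files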

Sometimes there will be an installer or a package that wants to phone home or retrieve external dependencies before running. It will fail every single time and there is nothing you can do about it.

Caching

The provided configuration file disables all forms of caching. Better safe than sorry.

Personally, I'm running things in the cloud at the moment: all instances have at least 100 Mbps connectivity, and the provider runs its own repos for popular stuff (e.g. Debian) which are discovered automatically. That makes bandwidth a commodity I couldn't care less about.

I'd rather disable caching entirely than hit a single caching bug that melts my brain in troubleshooting. Practically nobody on the internet gets their caching headers right.

Not all environments have the same requirements though. You may go the extra mile and configure caching.
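
If you do go down that road, a minimal sketch combining an on-disk cache with the package-oriented refresh_pattern rules from Solution 1 could look like this (sizes are illustrative; note that squid's default maximum_object_size is far too small for package files):

# enable a 20 GB on-disk cache with the default directory layout
cache_dir ufs /var/spool/squid3 20000 16 256
# allow large package files to be cached (the default limit is only a few MB)
maximum_object_size 512 MB
# cache immutable package files for a long time, everything else briefly
refresh_pattern (\.deb|\.rpm)$   1296000 100% 1296000
refresh_pattern .                0       20%  4320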

NEVER EVER require authentication on the proxy

There is an option to require password authentication from clients, typically with their LDAP accounts. It will break every browser and every command line tool in the universe.

If you want to do authentication on the proxy, don't.

If management wants authentication, explain that it's not possible.

If you're a dev and you just joined a company that is blocking direct internet AND forcing proxy authentication, RUN AWAY WHILE YOU CAN.

Conclusion

We went through the common configuration, common mistakes and things one must know about proxying.

Lessons learnt:

  • There is good open-source software for proxying (Squid)
  • It's simple and easy to configure (a single short file)
  • All (optional) security measures have tradeoffs
  • Most advanced options will break stuff and come back to haunt you
  • Transparent proxies break HTTPS
  • Proxy authentication is evil

As usual in programming and system design, it's critical to manage requirements and expectations.

I'd recommend sticking to the basics when setting up a proxy. Generally speaking, a plain proxy without any particular filtering will work well and not give any trouble. Just remember to (auto) configure the clients.

Solution 3:

This won't cover all your use cases, but maybe it's still helpful. Despite the name, apt-cacher-ng doesn't only work with Debian and derivatives; it is

a caching proxy. Specialized for package files from Linux distributors, primarily for Debian (and Debian based) distributions but not limited to those.

I'm using this in production in a similar (Debian-based) environment to yours.
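
On the client side it's just a one-line proxy setting for apt (apt-cacher-ng listens on port 3142 by default; the hostname below is illustrative):

# /etc/apt/apt.conf.d/02proxy
Acquire::http::Proxy "http://apt-cacher.internal.mycompany.com:3142";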

However, AFAIK, it won't support RubyGems, PyPI, PECL, CPAN or npm, and it doesn't provide granular ACLs.

Personally, I think that investigating Squid is a good idea. If you implement a setup in the end, could you please share your experiences? I'm quite interested in how it goes.

Solution 4:

We had a similar challenge and solved it using local repos and a snapshot-based storage system. We basically update the development repository, clone it for testing, clone that for staging and finally for production. The amount of disk used is limited that way; plus it's all slow SATA storage and that's OK.

The clients get the repository info from our configuration management so switching is easy if necessary.
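
As an illustration of what the configuration management pushes out, each environment simply points at its own snapshot (hostname, path and release name here are made up):

# /etc/apt/sources.list.d/internal.list  (staging hosts)
deb http://repo.internal.mycompany.com/staging/debian bookworm main
deb http://repo.internal.mycompany.com/staging/debian-security bookworm-security main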

You could achieve what you want using ACLs on the proxy server (matching user-agent strings or source IP/mask combinations) and restricting their access to certain domains, but if you do that, one problem I see is that of different versions of packages/libraries. If one of the hosts is allowed to access CPAN and requests module xxx::yyy, then unless the client asks for a specific version it will pull the latest from CPAN (or PyPI or RubyGems), which may or may not be the one that was already cached in the proxy. So you might end up with different versions in the same environment. You will not have that problem if you use local repositories.