Configuration management: push versus pull based topology

The more established configuration management (CM) systems like Puppet and Chef use a pull-based approach: clients poll a centralized master periodically for updates. Some of them offer a masterless (and thus push-based) approach as well, but state that it is 'not for production' (SaltStack) or 'less scalable' (Puppet). The only system that I know of that is push-based from the start is runner-up Ansible.

What is the specific scalability advantage of a pull-based system? Why is it supposedly easier to add more pull-masters than push-agents?

For example, agiletesting.blogspot.nl writes:

in a 'pull' system, clients contact the server independently of each other, so the system as a whole is more scalable than a 'push' system

On the other hand, Rackspace demonstrates that they can handle 15K systems with a push-based model.

infrastructures.org writes:

We swear by a pull methodology for maintaining infrastructures, using a tool like SUP, CVSup, an rsync server, or cfengine. Rather than push changes out to clients, each individual client machine needs to be responsible for polling the gold server at boot, and periodically afterwards, to maintain its own rev level. Before adopting this viewpoint, we developed extensive push-based scripts based on ssh, rsh, rcp, and rdist. The problem we found with the r-commands (or ssh) was this: When you run an r-command based script to push a change out to your target machines, odds are that if you have more than 30 target hosts one of them will be down at any given time. Maintaining the list of commissioned machines becomes a nightmare. In the course of writing code to correct for this, you will end up with elaborate wrapper code to deal with: timeouts from dead hosts; logging and retrying dead hosts; forking and running parallel jobs to try to hit many hosts in a reasonable amount of time; and finally detecting and preventing the case of using up all available TCP sockets on the source machine with all of the outbound rsh sessions. Then you still have the problem of getting whatever you just did into the install images for all new hosts to be installed in the future, as well as repeating it for any hosts that die and have to be rebuilt tomorrow. After the trouble we went through to implement r-command based replication, we found it's just not worth it. We don't plan on managing an infrastructure with r-commands again, or with any other push mechanism for that matter. They don't scale as well as pull-based methods.

Isn't that an implementation problem instead of an architectural one? Why is it harder to write a threaded push client than a threaded pull server?
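For illustration, the wrapper logic the quote describes (timeouts, failure logging for retries, parallel fan-out) boils down to surprisingly little code today. A minimal sketch, with a simulated transport standing in for ssh (host names and the failing host are assumptions for the demo):

```shell
# Sketch of the "elaborate wrapper code" from the quote: fan the push out in
# parallel, and record failures for a later retry pass instead of aborting.
# push_cmd is a stand-in for the real transport; for actual use, replace it
# with e.g.: ssh -o ConnectTimeout=5 "$host" 'some idempotent patch command'
HOSTS="web1 web2 db1 dead-host"     # assumed inventory; dead-host is down
FAILED_LOG="failed_hosts.txt"
: > "$FAILED_LOG"

push_cmd() {
    case "$1" in
        dead-host) return 1 ;;      # simulate an unreachable host
        *)         return 0 ;;
    esac
}

for host in $HOSTS; do
    (
        if ! push_cmd "$host"; then
            echo "$host" >> "$FAILED_LOG"   # log for retry
        fi
    ) &
done
wait    # all hosts are attempted in parallel rather than serially
```

In a real deployment the fan-out would also need a cap on concurrent sessions (e.g. via `xargs -P`) to avoid the socket-exhaustion problem the quote mentions, but the skeleton is the same.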


Solution 1:

The problem with push-based systems is that you have to have a complete model of the entire architecture on the central push node: you can't push to a machine that you don't know about.

It can obviously work, but it takes a lot of work to keep it in sync.

Using things like MCollective, you can convert Puppet and other CMs into a push-based system. Generally, it's trivial to convert a pull system into a push-based one, but not always simple to go the other way.

There is also the question of organizational politics. A push-based system puts all the control in the hands of the central admins, and it can be very hard to manage complexity that way. I think the scaling issue is a red herring: either approach scales if you just look at the number of clients, and in many ways push is easier to scale. However, dynamic configuration more or less implies that you have at least a pull-based form of client registration.
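That pull-based client registration can itself be tiny. A toy sketch, where a shared file stands in for whatever endpoint the master actually exposes (all names here are hypothetical):

```shell
# Toy client self-registration: each client announces itself to the master,
# so the central node never has to maintain the complete model by hand.
# A shared file stands in for the master's registration endpoint.
INVENTORY="registered_hosts.txt"
touch "$INVENTORY"

register_self() {
    host="$1"
    # Idempotent: re-registering an already-known host is a no-op.
    grep -qxF "$host" "$INVENTORY" || echo "$host" >> "$INVENTORY"
}

register_self "app01.internal"
register_self "app01.internal"   # duplicate registration, ignored
register_self "app02.internal"
```

With registration handled this way, the push side only ever has to read the inventory, never to discover machines itself.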

Ultimately, it's about which system matches the workflow and ownership in your organization. As a general rule, pull systems are more flexible.

Solution 2:

In case it is of interest to anyone, I can at minimum offer a user-experience report, having made my first use of Ansible's out-of-the-box push capability in the context of patch management of multi-host setups of mission-critical systems in the Amazon cloud. To understand my preconceptions or biases, I should explain that I have a preference for Ruby at the automation-scripting level and have in the past set up projects to use a master-agent Puppet configuration per project VPC. So my experience belies past prejudices, if there were any.

My recent experience was very favourable to dynamic push onto a changing estate of anywhere from dozens to many hundreds of servers, which can scale up or down, be terminated and refreshed. In my situation a simple Ansible 1.7 ad hoc command was all that I needed to make the patch. However, in view of the effectiveness of setting up an AnsibleController (on a t2.micro) per VPC for the purpose, I intend to expand the technique for more complex requirements in future.
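For readers unfamiliar with ad hoc mode: a single command pushes one module invocation to every host in the inventory. The command below is an illustrative guess at the shape of such a patch push, not the author's actual command; the module arguments, user and package name are assumptions:

```shell
# Hypothetical Ansible 1.x ad hoc patch push against the whole inventory.
# "all" targets every host in /etc/ansible/hosts; --sudo escalates privileges.
ansible all -i /etc/ansible/hosts -u admin --sudo \
    -m yum -a "name=bash state=latest"
```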

So let me return to the question asked in this thread: pros and cons of push in a dynamically changing estate.

The assumptions about the kind of server estate I targeted were:

  • No assumption that IP addresses or Amazon-generated local hostnames would be long-lasting: they can both come and go
  • All instances were created from machine images which already allowed ssh access from a single privileged administrative user
  • Servers would be individuated, and potentially partitioned into groups (according to function, or according to the stage of development, e.g. test or prod), through launch-specific Amazon tags with agreed conventional names
  • I would administer patches to Linux and Windows servers separately, with different ad hoc commands; simply allowing Linux-specific logins to fail when contacting a Windows server was therefore perfectly acceptable

With these conditions in mind, creating a machine image of an AnsibleController to drop into numerous VPCs and configure (with credentials) in situ within the existing server accounts is very simple. Automated within each instance created from the image are:

  1. A cron job to push the patch to running servers at regular intervals, so that the required estate is accessed continually
  2. A way of recomputing the Ansible inventory at each such interval.

The second item can be made relatively sophisticated if needed (for example via the INI structure of the Ansible inventory). But if sophistication is not needed, here is a very straightforward example of a script that computes all running Amazon EC2 instances at each cron interval and directs the results into an appropriate inventory file (e.g. /etc/ansible/hosts):

#!/bin/bash
# Assumes aws-cli/1.3.4 Python/2.6.9 Linux/3.4.73-64.112.amzn1.x86_64 or greater
# http://aws.amazon.com/releasenotes/8906204440930658
# To check: yum list aws-cli
# Assumes that the server is equipped with AWS keys and is able to access some
# or all instances in the account within which it is running.
# Prints a list of private host IPs, one per line; instance-state-code 16
# means "running".
# If an argument is passed, treat it as the filename, whether local or absolute
# path, to which the list is written.

function list-of-ips {
    /usr/bin/aws ec2 describe-instances \
        --filters '[{"Name": "instance-state-code", "Values": ["16"]}]' \
        --query 'Reservations[].Instances[].PrivateIpAddress' \
        --output text | tr '\t' '\n' | sort -u
}

if [ -n "$1" ]; then
    list-of-ips > "$1"
else
    list-of-ips
fi
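The first item, the cron job, then only needs to chain the inventory refresh and the push. A hypothetical crontab for the AnsibleController image (paths, schedule, and the ad hoc command are assumptions, not the author's actual setup):

```shell
# Hypothetical AnsibleController crontab: refresh the inventory on the hour,
# then push the (idempotent) patch a few minutes later.
0 * * * * /usr/local/bin/list-of-ips.sh /etc/ansible/hosts
5 * * * * ansible all -i /etc/ansible/hosts --sudo -m yum -a "name=bash state=latest"
```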

The only caveat for the use case is that the patch command should be idempotent. It is desirable to pre-test to make perfectly sure that this is satisfied, as part of making sure that the patch does exactly what is intended.
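Idempotence is easy to pre-test mechanically: apply the change twice and verify that the second application is a no-op. A toy sketch, with a hypothetical config file and setting standing in for the real patch:

```shell
# Toy idempotence pre-test: an idempotent "patch" applied twice must leave
# the system in the same state as applying it once.
CONF="sshd_config.test"
printf 'Port 22\n' > "$CONF"

apply_patch() {
    # Idempotent edit: only append the setting if it is not already present.
    grep -q '^PermitRootLogin no$' "$CONF" || echo 'PermitRootLogin no' >> "$CONF"
}

apply_patch
ONCE=$(cat "$CONF")
apply_patch                      # second run must change nothing
TWICE=$(cat "$CONF")
[ "$ONCE" = "$TWICE" ]           # succeeds: the patch is idempotent
```

A blind `echo ... >> "$CONF"` without the guard would fail this check, appending a duplicate line on every cron interval.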

So to sum up, I have illustrated a use case where dynamic push is effective against the goals I set. It is a repeatable solution, in the sense of being encapsulated in an image which can be rolled out in multiple accounts and regions. In my experience to date, the dynamic push technique is much easier to provide, and to get into action, than the alternatives in the toolsets available to us at the moment.

Solution 3:

This is an old post, but interestingly enough history repeats itself.

Now embedded IoT devices need configuration management, and the infrastructure and network topology seem to be even more complex, with firewalls, NATs and even mobile networks in the mix.

The push-or-pull decision is again just as important, but the number of devices is even higher. When we developed our IoT embedded-device configuration management tool qbee.io, we selected a pull-based approach, with an agent whose foundation is in promise theory. That means the agent pulls configuration and converges autonomously to the desired state. The advantage is that configuration is actively maintained even if the master server is down, and the system does not need to track which device has received which configuration change. In addition, it is often difficult to know what the local network conditions for a device are, so we do not care until the device pings the server.

An additional example and argument for a pull-based solution in the embedded use case is the long lifecycle of these devices. If a device fails and is replaced by a spare (e.g. on an oil rig), the spare will immediately receive the configuration for its specific group and converge towards it. If, for example, ssh keys are rotated every 6 months for security reasons, then the last valid key for the spare device's group will automatically be applied.
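At its heart, such an agent is a convergence loop: fetch the desired state, compare, repair, and keep the last known desired state when the master is unreachable. A toy sketch under assumed file names (the real agent is of course far more involved):

```shell
# Toy pull-based convergence: files stand in for the master's API and the
# device's local state. One converge() pass repairs drift toward the desired
# state; if the master is unreachable, the last converged state is kept.
DESIRED="server_desired_state.txt"              # stands in for the master
LOCAL="local_state.txt"                         # the device's actual state

echo "ssh_key=key-2024-H2" > "$DESIRED"         # e.g. the rotated key
echo "ssh_key=key-2024-H1" > "$LOCAL"           # e.g. a freshly imaged spare

converge() {
    if [ -r "$DESIRED" ]; then
        cp "$DESIRED" "$LOCAL"                  # repair drift
    fi                                          # else: master down, keep state
}

converge    # the spare device now carries its group's current configuration
```

Run from a periodic timer on the device, this is the pull-side mirror image of the cron-driven push in Solution 2.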

It will be interesting to follow how this discussion continues over the years, especially with containers and disposable infrastructure emerging as an alternative to systems that maintain configuration over a longer period of time.