What is reasonable performance for a simple Ansible playbook against ~100 hosts?

We are starting to look at Ansible to replace an old cfengine2 installation. I have a simple playbook that:

  • copies a sudoers file
  • copies a templated resolv.conf (fed with group_vars and host_vars data)
  • checks a couple of services are running
  • checks for presence of a local user

The playbook takes over 4 minutes of wallclock time to run against 97 machines (all connected over fast 1 Gbit or 10 Gbit networking, with sub-1 ms LAN latency) and consumes over 50% of the CPU on the 2-core, 4 GB VM it runs on.

It takes about 11 seconds to run against a single machine, with about 4 s of user+sys CPU time consumed, which TBH still seems a bit excessive for the amount of work involved.

The obvious bits:

  • I have pipelining explicitly enabled in a playbook-dir local ansible.cfg
  • I have fact caching to jsonfile enabled, same local ansible.cfg
  • I have forks set to 50, same (I have tried other values)
  • I am sure that Ansible is using SSH, not Paramiko, and that it is using the persistent control socket - I can see the SSH processes being started and persisting during the run.
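
For reference, a minimal sketch of what the settings listed above look like in an ansible.cfg. It is an illustration, not my exact file: the cache path and timeout are placeholder values, and `gathering = smart` is an assumption about how the fact cache is meant to be consulted.

```ini
[defaults]
# number of hosts to work on in parallel
forks = 50

# only gather facts for hosts that are not already in the cache (assumed setting)
gathering = smart

# cache facts as JSON files on the control machine
fact_caching = jsonfile
# placeholder directory for the cache files
fact_caching_connection = /tmp/ansible_fact_cache
# placeholder cache lifetime in seconds (one day)
fact_caching_timeout = 86400

[ssh_connection]
# run modules over the existing SSH session instead of copying them to the host first
pipelining = True
```

Note that pipelining only works if `requiretty` is not set in sudoers on the target hosts.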

Is this level of performance normal, or is something wrong with my setup? If so, how can I go about determining what?

Edit: As of Aug 2017, we're still seeing this problem. The Ansible version is 2.2.1 and the playbook has grown since. Up-to-date numbers:

  • 98 hosts
  • ansible -m ping all takes 4.6s real, 3.2s user, 2.5s sys times
  • a full playbook run takes 4 minutes, using 100% user and ~35% system CPU while doing it (on a 2-core VM deployment server, 100% being one full CPU)
  • target OS is largely CentOS 7, some CentOS 6
  • profiling does not reveal any specific task hotspots AFAICT

Although the playbook is now much bigger, I still don't think there is anything in there to justify that level of CPU load on the playbook server. Wallclock time, perhaps, but the deployment server should be largely idle for most of the run. As far as I can see, it's mostly file copies and some template expansions.

Note that we are making quite extensive use of host_vars and group_vars.

Several people have asked about profiling; here is the tail of a run with profiling enabled:

Tuesday 01 August 2017  16:02:24 +0100 (0:00:00.539)       0:06:22.991 ******** 
=============================================================================== 
yumrepo : centos repos -------------------------------------------------- 9.77s
sshd : copy CentOS 6 sshd config ---------------------------------------- 7.41s
sshd : copy CentOS 7 sshd config ---------------------------------------- 6.94s
core : ensure core packages are present --------------------------------- 6.28s
core : remove packages on VM guests ------------------------------------- 5.39s
resolv : stop NetworkManager changing resolv.conf ----------------------- 5.25s
yumrepo : epel6 gpg key ------------------------------------------------- 3.94s
yumrepo : epel7 gpg key ------------------------------------------------- 3.71s
yumrepo : nsg gpg key --------------------------------------------------- 3.57s
resolv : build resolv.conf ---------------------------------------------- 3.30s
yumrepo : nsg repo ------------------------------------------------------ 2.66s
resolv : check NetworkManager running ----------------------------------- 2.63s
yumrepo : psp repo ------------------------------------------------------ 2.62s
yumrepo : ucs repo ------------------------------------------------------ 2.44s
yumrepo : epel repo ----------------------------------------------------- 2.27s
resolv : check for nmcli ------------------------------------------------ 2.08s
core : remove various unwanted files ------------------------------------ 1.42s
telegraf : write telegraf.conf file ------------------------------------- 1.13s
core : copy sudoers in place -------------------------------------------- 0.94s
core : ensure sshd is running ------------------------------------------- 0.90s

In your ansible.cfg, set the following:

[defaults]

# profile each task
callback_whitelist = profile_tasks

# don't validate host keys, see http://docs.ansible.com/ansible/intro_configuration.html#host-key-checking
host_key_checking = False

[ssh_connection]
pipelining = True

Also, in your playbook, set the strategy to 'free', so that each host runs through its tasks without waiting for the other hosts to finish each one:

- hosts: all
  strategy: free
  tasks: [...]

Finally, disable fact gathering on your play: gather_facts: false

If, after profiling, you are seeing a lot of this:

TASK [pip foo]
ok: [10.192.197.252] => (item=ansible)
ok: [10.192.197.252] => (item=boto)
ok: [10.192.197.252] => (item=boto3)
ok: [10.192.197.252] => (item=passlib)
ok: [10.192.197.252] => (item=cryptography)

squash those actions in ansible.cfg under [defaults]:

e.g. squash_actions = yum,pip,bar