What is reasonable performance for a simple Ansible playbook against ~100 hosts?
We are starting to look at Ansible to replace an old cfengine2 installation. I have a simple playbook that:
- copies a sudoers file
- copies a templated resolv.conf (fed with group_vars and host_vars data)
- checks a couple of services are running
- checks for presence of a local user
The playbook takes over 4 minutes of wallclock time to run against 97 machines (all connected over fast 1gig or 10gig networking, with sub-1ms LAN latency) and consumes over 50% of CPU on the 2-core 4G memory VM when I'm running it.
It takes about 11 seconds to run against a single machine, with about 4sec of user+sys CPU time consumed, which TBH still seems a bit excessive for the amount of work involved.
The obvious bits:
- I have pipelining explicitly enabled in a playbook-dir local ansible.cfg
- I have fact caching to jsonfile enabled, same local ansible.cfg
- I have forks set to 50, same (I have tried other values)
- I am sure that Ansible is using SSH not Paramiko and it is using the persistent control socket - I can see the SSH processes being started and persisting during the run.
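For reference, the settings listed above correspond roughly to an ansible.cfg like the sketch below. The `gathering` policy, cache path and timeout are placeholder assumptions for illustration, not the exact values in use:

```ini
[defaults]
# number of parallel worker processes
forks = 50
# assumption: only re-gather facts when the cache has none for a host
gathering = smart
fact_caching = jsonfile
# hypothetical cache directory, relative to the playbook dir
fact_caching_connection = ./fact_cache
# hypothetical cache lifetime, in seconds
fact_caching_timeout = 86400

[ssh_connection]
pipelining = True
```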
Is this level of performance normal, or is something wrong with my setup? If something is wrong, how can I go about determining what?
Edit: As of Aug 2017, we're still seeing this problem. The Ansible version is 2.2.1 and the playbook has grown since. Up-to-date numbers:
- 98 hosts
- `ansible -m ping all` takes 4.6s real, 3.2s user, 2.5s sys
- a full playbook run takes 4 minutes, using 100% user and ~35% system CPU while doing it (on a 2-core VM deployment server, 100% being one full CPU)
- target OS is largely CentOS 7, some CentOS 6
- profiling does not reveal any specific task hotspots AFAICT
Although the playbook is now much bigger, I still don't think there is anything in it to justify that level of CPU load on the playbook server. Wallclock time, perhaps, but as far as I can see the deployment server should be largely idle for most of the run: it's mostly file copies and some template expansions.
Note that we make quite extensive use of host_vars and group_vars.
Several people have asked about profiling, so here is the tail of a run with profiling enabled:
Tuesday 01 August 2017 16:02:24 +0100 (0:00:00.539) 0:06:22.991 ********
===============================================================================
yumrepo : centos repos -------------------------------------------------- 9.77s
sshd : copy CentOS 6 sshd config ---------------------------------------- 7.41s
sshd : copy CentOS 7 sshd config ---------------------------------------- 6.94s
core : ensure core packages are present --------------------------------- 6.28s
core : remove packages on VM guests ------------------------------------- 5.39s
resolv : stop NetworkManager changing resolv.conf ----------------------- 5.25s
yumrepo : epel6 gpg key ------------------------------------------------- 3.94s
yumrepo : epel7 gpg key ------------------------------------------------- 3.71s
yumrepo : nsg gpg key --------------------------------------------------- 3.57s
resolv : build resolv.conf ---------------------------------------------- 3.30s
yumrepo : nsg repo ------------------------------------------------------ 2.66s
resolv : check NetworkManager running ----------------------------------- 2.63s
yumrepo : psp repo ------------------------------------------------------ 2.62s
yumrepo : ucs repo ------------------------------------------------------ 2.44s
yumrepo : epel repo ----------------------------------------------------- 2.27s
resolv : check for nmcli ------------------------------------------------ 2.08s
core : remove various unwanted files ------------------------------------ 1.42s
telegraf : write telegraf.conf file ------------------------------------- 1.13s
core : copy sudoers in place -------------------------------------------- 0.94s
core : ensure sshd is running ------------------------------------------- 0.90s
In your ansible.cfg, set the following:
[defaults]
# profile each task
callback_whitelist = profile_tasks
# don't validate host keys (see http://docs.ansible.com/ansible/intro_configuration.html#host-key-checking)
host_key_checking = False
[ssh_connection]
pipelining = True
Also, in your playbook, set the strategy to 'free':
- hosts: all
  strategy: free
  tasks: [...]
Finally, disable fact gathering on your play: gather_facts: false
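Put together, a play with both settings might look like the sketch below. The single task and its file names are hypothetical placeholders standing in for your own tasks:

```yaml
# Sketch only: the task below is a hypothetical placeholder.
- hosts: all
  # let each host run through the tasks without waiting for the others
  strategy: free
  # skip the implicit setup (fact gathering) step at the start of the play
  gather_facts: false
  tasks:
    - name: copy sudoers in place
      copy:
        src: sudoers
        dest: /etc/sudoers
        owner: root
        group: root
        mode: "0440"
        # refuse to install a file that visudo rejects
        validate: "visudo -cf %s"
```

Bear in mind that with fact gathering off, any task or template that refers to ansible_* facts (for example to tell CentOS 6 from CentOS 7) needs those facts to come from somewhere else, such as the fact cache or an explicit setup task.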
If, after profiling, you are seeing a lot of this:
TASK [pip foo]
ok: [10.192.197.252] => (item=ansible)
ok: [10.192.197.252] => (item=boto)
ok: [10.192.197.252] => (item=boto3)
ok: [10.192.197.252] => (item=passlib)
ok: [10.192.197.252] => (item=cryptography)
squash those actions in ansible.cfg under [defaults], e.g.:
squash_actions = yum,pip,bar
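An alternative that avoids the per-item loop altogether is to hand the whole list to the module in one call. This is a sketch that assumes your Ansible version's pip module accepts a list for `name`; as far as I know, later Ansible releases deprecate squash_actions in favour of this form:

```yaml
# Sketch only: assumes the pip module accepts a list for "name".
- name: install Python packages in a single pip invocation
  pip:
    name:
      - ansible
      - boto
      - boto3
      - passlib
      - cryptography
```

The same pattern applies to yum, which also takes a list of package names.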