Ansible stuck on gathering facts
I'm having some odd issues with my Ansible box (Vagrant).
Everything worked yesterday, and my playbook ran fine.
Today, Ansible hangs on "gathering facts".
Here is the verbose output:
<5.xxx.xxx.xxx> ESTABLISH CONNECTION FOR USER: deploy
<5.xxx.xxx.xxx> REMOTE_MODULE setup
<5.xxx.xxx.xxx> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-
o', 'ControlPersist=60s', '-o', 'ControlPath=/home/vagrant/.ansible/cp/ansible-s
sh-%h-%p-%r', '-o', 'Port=2221', '-o', 'KbdInteractiveAuthentication=no', '-o',
'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o
', 'PasswordAuthentication=no', '-o', 'User=deploy', '-o', 'ConnectTimeout=10',
'5.xxx.xxx.xxx', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1411372677
.18-251130781588968 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1411372677.18-2
51130781588968 && echo $HOME/.ansible/tmp/ansible-tmp-1411372677.18-251130781588
968'"]
I was having a similar issue with Ansible ping on Vagrant: it suddenly got stuck for no apparent reason, although it had previously worked absolutely fine. Unlike an ordinary ssh or connectivity problem, it just hung forever with no timeout.
One thing I did to resolve this issue was to clean the ~/.ansible directory, and it just worked again. I couldn't find out why, but it did get resolved.
If you get a chance to see it happen again, try cleaning the ~/.ansible folder before you refresh your Vagrant box.
Ansible can hang like this for a number of reasons, usually because of a connection problem or because the setup module hangs. Here's how to narrow the problem down so you can solve it.
Ansible cannot connect to the destination host
Host Key (known_hosts) Problems
1) On older versions of Ansible (2.1 or older), Ansible would not always tell you if the host key for the destination does not exist on the source, or if there is a mismatch.
Solution: try opening an SSH connection with the same parameters to that destination. You may find SSH errors you need to resolve, and then the command will work.
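A minimal sketch of that manual check, assuming a POSIX shell. The helper name ssh_probe_cmd is hypothetical; the options it prints mirror a subset of those visible in the verbose output in the question (port, user, connect timeout, password prompts disabled). It only prints the command, so you can review it before running it.

```shell
# Print an ssh command that mirrors the main options Ansible passed.
# ssh_probe_cmd is a hypothetical helper name, not part of Ansible or OpenSSH.
# Arguments: host, port (default 22), user (default: current user).
ssh_probe_cmd() {
    host=$1
    port=${2:-22}
    user=${3:-$(id -un)}
    # "true" is the remote command: connect, authenticate, exit immediately.
    printf 'ssh -v -o Port=%s -o User=%s -o ConnectTimeout=10 -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no %s true\n' \
        "$port" "$user" "$host"
}
```

Running the printed command in an interactive terminal should surface any host key prompt or key mismatch warning that Ansible was hiding.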
2) Sometimes Ansible displays an SSH connection message to you in the midst of other statuses, causing Ansible to "freeze" on that task:
Warning: the ECDSA host key for 'myhost' differs from the key for the IP address '10.10.1.10'
Offending key for IP in /etc/ssh/ssh_known_hosts:246
Matching host key in /etc/ssh/ssh_known_hosts:477
Are you sure you want to continue connecting (yes/no)?
In this case, simply typing "yes" for as many SSH questions as you were asked will permit the play to continue. Afterwards you can fix the root known_hosts problems.
Private Key Authentication Problems
If using key-based authentication vs password, other problems include:
- The public key may not be installed properly on the destination (in the remote user's ~/.ssh/authorized_keys)
- Private key might have incorrect permissions locally (should be readable only by the user running the Ansible job)
Solution: try running ansible -m ping <destination> -k against the problem host (-k prompts for the SSH password instead of relying on the key). If that doesn't work, try the Host Key Problems solutions above.
Ansible cannot quickly gather facts
The setup module (when run automatically at the beginning of an ansible-playbook run, or when run manually as ansible -m setup <host>) can often hang when gathering hardware facts (e.g. if getting disk information from hosts with high i/o, bad mount entries, etc.).
Solution: try running ansible -m setup -a 'gather_subset=!all' <destination> (the quotes keep an interactive shell from interpreting the "!"). If this works, you should consider setting this line in your ansible.cfg:
gather_subset=!hardware
For me, the setup module was stuck on a dead NFS mount.
If you run "df" on your machine and nothing happens, you may be in the same situation.
PS: if you can't umount the NFS share/mountpoint, consider a lazy unmount with "umount -l" as a last resort.
There are many reasons why ansible may hang at fact gathering, but before going any further, here is the first test you should be making in any such situation :
ansible -m ping <hostname>
This test just connects to the host, and executes enough code to return :
<hostname> | SUCCESS => {
"changed": false,
"ping": "pong"
}
If this works, you can pretty much rule out any setup or connectivity issue, as it proves that you could resolve target hostname, open a connection, authenticate, and execute an ansible module with the remote python interpreter.
Now, here is a (non-exhaustive) list of things that can go wrong at the beginning of a playbook :
The command executed by ansible is waiting for an interactive input
I can remember this happening on older ansible versions, where a command would wait for an interactive input that would never come, such as a sudo password (when you forgot a -K switch), or acceptance of a new ssh host fingerprint (for a new target host).
Modern versions of ansible handle both these cases gracefully and raise an error immediately for normal use cases, so unless you're doing things such as calling ssh or sudo yourself, you shouldn't have this kind of issue. And even if you did, it would be after fact gathering.
Dead ssh master connection
There are some very interesting options passed to the ssh client, in the debug log given here :
ControlMaster=auto
ControlPersist=60s
ControlPath=/home/vagrant/.ansible/cp/ansible-ssh-%h-%p-%r
These options are documented in man ssh_config.
By default, ansible will try and be smart regarding its ssh connection use. For a given host, instead of creating a new connection for each and every task in the play, it will open it once, and keep it open for the whole playbook (and even across playbooks).
That's good, as establishing a new connection is far slower and computation-intensive than using an already existing one.
In practice, every ssh connection will check for the existence of a socket at ~/.ansible/cp/some-host-specific-path.
The first connection cannot find it, so it connects normally, and then creates it.
Every subsequent connection will then just use this socket to go through the already established connection.
When the established connection eventually times out and closes after going unused for long enough, the socket is removed too, and we're back to square one.
So far so good.
Sometimes, however, the connection actually dies, but the ssh client still considers it established. This typically happens when you execute the playbook from your laptop, and you lose your WiFi connection (or switch from WiFi to Ethernet, etc…).
This last example is a terrible situation : you can ssh to the target machine with a default ssh config, but as long as your previous connection is still considered active, ansible won't even try establishing a new one.
At this point, we just want to get rid of this old socket, and the simplest way to do that is to remove it:
# Delete all the current sockets (may disrupt currently running playbooks)
rm -r ~/.ansible/cp
# Delete only the affected socket (requires knowing which one it is)
rm ~/.ansible/cp/<replace-by-your-socket>
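Before deleting anything, it can help to see which sockets actually exist. The following is a small sketch, assuming a POSIX shell and a find that supports -type s; list_control_sockets is a hypothetical helper name, and the default directory is the one shown in the ControlPath option above.

```shell
# List ssh control sockets under a directory.
# Defaults to ~/.ansible/cp, the directory from the ControlPath option above.
# list_control_sockets is a hypothetical helper name.
list_control_sockets() {
    dir=${1:-"$HOME/.ansible/cp"}
    # Nothing to list if the directory does not exist yet.
    [ -d "$dir" ] || return 0
    # -type s matches Unix domain sockets only, not regular files.
    find "$dir" -type s
}

# Example usage (paths and host are placeholders):
#   list_control_sockets
#   ssh -S ~/.ansible/cp/<replace-by-your-socket> -O exit <hostname>
# "-O exit" is OpenSSH's control command asking the master to terminate,
# which may be a gentler alternative to rm when the master still responds.
```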
This is perfect for a one-shot fix, but if it happens too often, you may need to look for a longer-term fix. Here are some pointers that might help towards this goal :
- Start playbooks from a server (with a network connection way more stable than your laptop's)
- Use ansible configuration, or directly ssh client configuration to disable connection sharing
- Use the same configuration mechanisms to fine-tune timeouts, so that a crashed master connection actually times out faster
Please note that at the time of writing, a few options have changed (for example, my latest run gave me ControlPath=/home/toadjaune/.ansible/cp/871b533295), but the general idea is still valid.
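As a sketch of the last two pointers, the ssh options Ansible passes can be overridden through the ANSIBLE_SSH_ARGS environment variable. The variable names no_sharing_args and keepalive_args are made up for illustration, and the interval values (15 seconds, 3 probes) are assumptions to tune for your own network, not recommendations from the original answer.

```shell
# Option A: disable connection sharing entirely.
# Slower (one ssh handshake per task), but no stale master socket can exist.
no_sharing_args='-o ControlMaster=no -o ControlPath=none'

# Option B: keep sharing, but make ssh probe the link so that a dead
# master gives up after roughly ServerAliveInterval * ServerAliveCountMax
# seconds (here about 45s; both values are illustrative assumptions).
keepalive_args='-o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3'

# Pick one of the two for the current shell session.
export ANSIBLE_SSH_ARGS="$keepalive_args"
```

The same options could instead go into ansible.cfg or ~/.ssh/config; an environment variable is used here only because it is easy to try and easy to undo.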
Fact gathering actually taking too much time
At the beginning of every play, ansible collects a lot of information about the target system and stores it as facts. These are variables that you can then use in your playbook, and they are usually really handy, but sometimes gathering this information can take a very long time (bad mount points, disks with high i/o, high load…).
This being said, you don't strictly need facts to run a playbook, and almost certainly not all of them, so let's try and disable what we don't need. Several options for that :
- Completely disable the setup module
- Change the configuration of the setup module to include only certain parts of it.
- Via command-line arguments
- Via ansible configuration files
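One way to try the first option without editing any playbook is the ANSIBLE_GATHERING environment variable, sketched below. With the value explicit, facts are only gathered by plays that ask for them; whether your playbook still works without facts depends on which variables it uses, so treat this as a diagnostic rather than a fix.

```shell
# Skip implicit fact gathering for commands started from this shell.
# Plays that set "gather_facts: true" themselves will still gather facts.
export ANSIBLE_GATHERING=explicit

# Example usage (playbook name is a placeholder):
#   ansible-playbook site.yml
# To restore the default behaviour afterwards:
#   unset ANSIBLE_GATHERING
```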
For debugging purposes, it is really convenient to invoke the setup module directly from the command-line :
ansible -m setup <hostname>
This last command should hang just like your playbook does, and eventually time out (or succeed). Now, let's execute the module again, disabling everything we can:
ansible -m setup -a gather_subset='!all' <hostname>
If this still hangs, you can always try disabling the module entirely in your play, but it's really likely that your problem is somewhere else.
If, however, it works fine (and quickly), then have a look at the module documentation. You have two options:
- Limit the fact gathering to a subset, excluding what you don't need (see possible values for gather_subset)
- gather_timeout can also help you fix your issue, by allowing more time (although that would be to fix a timeout error, not a hang)
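When comparing the two setup runs above, a hang is easier to spot if each command is time-boxed. Here is a rough sketch assuming the timeout utility from GNU coreutils is installed (it may be missing on macOS); timebox is a hypothetical helper name.

```shell
# Run a command with a time limit and flag it clearly if the limit is hit.
# Relies on GNU coreutils `timeout`, which exits with status 124 on expiry.
# timebox is a hypothetical helper name.
timebox() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HUNG after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Example usage (hostname is a placeholder):
#   timebox 60 ansible -m setup -a 'gather_subset=!all' <hostname>
#   timebox 60 ansible -m setup <hostname>
```

If the first call returns quickly and the second one is reported as hung, the problem is most likely in one of the fact subsets rather than in the connection.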
Other issues
Obviously, other things can go wrong. A few pointers to help debugging:
- Use ansible maximum verbosity level (-vvvv), as it will show you every command executed
- Use ping and setup modules directly from the command-line as explained above
- Try to ssh manually if ansible -m ping doesn't work