php-fpm php_network_getaddresses calls randomly start failing with bad udp cksum
We're running a number of web servers (nginx, php5.6-fpm) on Ubuntu instances on AWS. They had been running fine for several months, but in the past few days we've started seeing a problem: everything is fine when an instance spins up, but after 12 hours or so network calls start to fail (in this case, TCP socket connections to Redis).
Having done some digging with tcpdump, it looks like the DNS lookups are being thrown out because of a UDP checksum failure:
17:13:38.013346 IP (tos 0x0, ttl 64, id 46236, offset 0, flags [DF], proto UDP (17), length 103) 10.0.0.121.34071 > 10.0.0.2.53: [bad udp cksum 0x14df -> 0x3ae1!] 25855+ Type20736? xxxxxxxx.us-east-1.rds.amazonaws.com. (75)
If I use telnet to connect to the Redis server from the same instance, it works fine; the problem only seems to affect fpm. Equally strange, it only starts a while after the instance has booted: initially all the requests go through fine. Restarting the php5.6-fpm service also clears the issue for a time.
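Since the failures only show up in a long-running process, one way to narrow this down is to leave a small probe running next to fpm and log when lookups start to fail. The following is only a rough sketch: the function names, the default port 6379, and the one-minute interval are my own assumptions, and a Python process will not necessarily misbehave in the same way as a PHP worker.

```python
import socket
import time


def check_resolution(host, port=6379):
    """Attempt a single getaddrinfo() lookup and return (ok, detail)."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # Resolution failed; report the resolver's error text.
        return False, str(exc)
    # Each entry's sockaddr starts with the resolved address string.
    addresses = sorted({info[4][0] for info in infos})
    return True, ", ".join(addresses)


def watch(host, interval=60.0, rounds=None):
    """Log one lookup result per interval; rounds=None keeps going forever."""
    done = 0
    while rounds is None or done < rounds:
        ok, detail = check_resolution(host)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {host} {'OK' if ok else 'FAIL'} {detail}", flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling watch() with the Redis hostname from the fpm configuration and leaving it running would show whether lookups in another long-lived process begin failing at around the same time, or whether only fpm is affected.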
I'm pretty much at the end of my knowledge at this point, so hopefully someone can point me in the right direction!
You have a defective security fix installed -- this sounds like the issue from USN-3239-2.
A security update for GNU libc that addressed (among other things)...
"an unbounded stack allocation in the getaddrinfo() function of the GNU C Library."
...contained a regression -- an unintended ABI change -- that seems to have caused issues similar to what you describe: DNS resolution would eventually stop working until the affected processes were restarted.
The original update was released 2017-03-20 and the fix was released 2017-03-21. Applying the latest OS security fixes should remedy the issue, if that's what this is.
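To see which libc build an instance actually has, so that it can be compared against the versions listed in the USN, something along these lines may help. It is a minimal sketch that assumes a Debian/Ubuntu system with dpkg-query available; the helper name and the injectable run parameter are purely illustrative.

```python
import subprocess


def installed_version(package="libc6", run=subprocess.run):
    """Return the installed version of a dpkg package, or None if not found."""
    result = run(
        ["dpkg-query", "-W", "-f=${Version}", package],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # dpkg-query exits non-zero when the package is not installed.
        return None
    return result.stdout.strip() or None
```

Printing installed_version() on an affected instance gives the string to check against the advisory. Workers that are already running most likely keep the library they started with, so php5.6-fpm would probably still need a restart after the package is upgraded.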