Likely causes of NTPD dying unexpectedly and solutions
On a web application that uses S3 for physical document storage, we are experiencing issues with NTP repeatedly dying. This seems to happen roughly once or twice a day. Very little information is provided when this occurs, other than that the PID file exists but the service is dead when I check its status.
Can anyone suggest likely causes of NTPD dying? I am assuming that maybe clock drift is causing it to die but I am not sure what would cause that either. There is more than enough memory and available disk space.
The last time the service died, this was the output:
Sep 6 06:15:25 vm02 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="988" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Sep 6 06:17:06 vm02 ntpd[10803]: 0.0.0.0 0618 08 no_sys_peer
Sep 6 08:01:10 vm02 ntpd[10803]: 0.0.0.0 0617 07 panic_stop -28101 s; set clock manually within 1000 s.
Solution 1:
I would say there is no quick way to pin down the exact cause.
We had similar issues in our ESXi environment. To cut a long story short, we found that the ESXi host's clock drifted a lot and the guest VMs were syncing time from both the ESXi host and the upstream NTP server. This confused NTPd on the VMs, so it died quite often.
We also found that in some rare cases random packet loss caused NTPd to quit, because the round-trip time between your server and the upstream NTP server is used to calculate the clock offset.
In both cases, if NTPd sees a massive time offset (by default, more than 1000 s), it quits. The -g option will help a bit.
-g Normally, ntpd exits with a message to the system log if the offset exceeds the panic threshold, which is 1000 s by default. This option allows the time to be set to any value without restriction; however, this can happen only once. If the threshold is exceeded after that, ntpd will exit with a message to the system log. This option can be used with the -q and -x options. See the tinker command for other options.
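To make the panic threshold concrete, here is a minimal Python sketch that pulls the offset and threshold out of a panic_stop line like the one in the question and compares them. The regular expression is an assumption based only on that single log line, and the function names are made up for illustration; other ntpd versions may word the message differently.

```python
import re

# Assumed pattern, derived from the one panic_stop line quoted in the question:
# "... panic_stop -28101 s; set clock manually within 1000 s."
PANIC_RE = re.compile(r"panic_stop\s+([+-]?\d+(?:\.\d+)?)\s*s;.*within\s+(\d+)\s*s")

def parse_panic_stop(line):
    """Return (offset_seconds, threshold_seconds), or None if this is not a panic_stop line."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2))

def format_offset(seconds):
    """Render a signed offset in seconds as hours, minutes and seconds."""
    sign = "-" if seconds < 0 else "+"
    hours, rem = divmod(int(abs(seconds)), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{sign}{hours}h{minutes:02d}m{secs:02d}s"

line = ("Sep 6 08:01:10 vm02 ntpd[10803]: 0.0.0.0 0617 07 "
        "panic_stop -28101 s; set clock manually within 1000 s.")
offset, threshold = parse_panic_stop(line)
print(format_offset(offset), abs(offset) > threshold)  # prints: -7h48m21s True
```

For the line in the question this works out to an offset of nearly eight hours, far beyond the 1000 s threshold.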
Have a look at the system log, which should contain messages that may give you a hint. You could also monitor the "ntpq -p" output to get a rough idea of how the offset develops.
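Watching the offset can be scripted. The sketch below parses text in the usual "ntpq -p" layout and flags peers whose offset is large. The sample output, the addresses in it, and the 100 ms warning level are all made up for illustration; the column order (remote, refid, st, t, when, poll, reach, delay, offset, jitter, with offset in milliseconds) is assumed to match your ntpq version.

```python
# Hypothetical sample in the usual "ntpq -p" layout; not output from a real host.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .GPS.            1 u   33   64  377    1.234   -0.512   0.210
+192.0.2.11      192.0.2.10       2 u   12   64  377    5.678  120.400   3.100
-192.0.2.12      192.0.2.1        3 u   40   64  377   20.100  -15.250   8.000
"""

WARN_OFFSET_MS = 100.0  # arbitrary warning level chosen for this example

def parse_ntpq_peers(text):
    """Parse peer lines into dicts with remote, tally code, reach and offset (ms)."""
    peers = []
    for line in text.splitlines():
        # Skip blank lines, the column header and the "====" separator.
        if not line.strip() or line.lstrip().startswith("remote") or line.startswith("="):
            continue
        # The first character is the tally code ('*' marks the current system peer;
        # it can also be a space); the rest is whitespace-separated columns.
        tally, fields = line[0], line[1:].split()
        if len(fields) != 10:
            continue  # unexpected layout, ignore rather than guess
        peers.append({
            "remote": fields[0],
            "tally": tally,
            "reach": int(fields[6], 8),      # reach is an octal bitmask
            "offset_ms": float(fields[8]),
        })
    return peers

peers = parse_ntpq_peers(SAMPLE)
for p in peers:
    if abs(p["offset_ms"]) > WARN_OFFSET_MS:
        print(f"{p['remote']} offset {p['offset_ms']} ms")
```

To use it against a live daemon, feed it the output of "ntpq -pn" (for example via subprocess.run) on a schedule and log the result, so you can see how the offset behaved in the hours before ntpd exits.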
Solution 2:
The log message clearly indicates that clock drift is the reason for the exit. Possible solutions:
- Start ntpd with the -g flag; however, this won't fix the root cause, which is clock skew.
- Run ntpdate before starting ntpd; probably same caveat.
- Add more time sources; NTP needs 4-6 sources to maintain good accuracy. A simple way to do this is to include repeated references to [0-3].YOURREGION.pool.ntp.org in your config, e.g.
  server 0.au.pool.ntp.org iburst
  server 1.au.pool.ntp.org iburst
  server 2.au.pool.ntp.org iburst
  server 3.au.pool.ntp.org iburst
  server 0.au.pool.ntp.org iburst
  server 1.au.pool.ntp.org iburst
  server 2.au.pool.ntp.org iburst
  server 3.au.pool.ntp.org iburst
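As a quick sanity check on the last point, a small Python sketch can count the server lines in an ntp.conf. The config text is inlined here as an example, the function name is hypothetical, and only "server" directives are counted; addresses starting with 127.127. are skipped because ntpd uses them for reference clock drivers such as the local clock.

```python
# Example config text; in practice you would read your own ntp.conf instead.
CONF = """\
driftfile /var/lib/ntp/drift
server 0.au.pool.ntp.org iburst
server 1.au.pool.ntp.org iburst
server 2.au.pool.ntp.org iburst
server 3.au.pool.ntp.org iburst
server 0.au.pool.ntp.org iburst
server 1.au.pool.ntp.org iburst
server 2.au.pool.ntp.org iburst
server 3.au.pool.ntp.org iburst
server 127.127.1.0  # local clock driver, not a network source
"""

def network_sources(conf_text):
    """Return the hosts named on 'server' lines, ignoring comments and refclock drivers."""
    sources = []
    for raw in conf_text.splitlines():
        parts = raw.split("#", 1)[0].split()  # drop trailing comments
        if len(parts) >= 2 and parts[0] == "server" and not parts[1].startswith("127.127."):
            sources.append(parts[1])
    return sources

sources = network_sources(CONF)
print(len(sources), "network sources;", "OK" if len(sources) >= 4 else "consider adding more")
```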
Solution 3:
Another option you can try is chrony. In our testing it performs more stably than ntpd and handles time skew experienced in virtual environments better.
http://chrony.tuxfamily.org/