How do I determine the current state of an NTP server on Linux?
I have a rather simple setup where I have two computers:
Computer A. has a normal NTP setup and uses the standard Internet sources (As per Ubuntu) to determine time. It also allows for query on IP 10.0.2.0/24
:
restrict 10.0.2.0 mask 255.255.255.0 nomodify notrap
Computer B. has a normal NTP setup, except for all the sources are changed to use 10.0.2.1
(which is Computer A).
Once in a while, Computer A gets a Kiss-of-Death signal from one of its source. As a result, it totally kills Computer B's NTP (i.e. it looks like the KoD is transmitted directly).
Is there a way to know the state of an NTP server in terms of whether it will just be sending KoD message or not? (also, how do I get out of that situation? When I looked at it, all the IP addresses shown in the log were not used by the server?! so I don't understand why it insists on sending KoD to its client).
I've found two things so far:
-
ntpq
I can run
ntpq
like so:ntpq -pn
When the NTP server works, I can see an asterisk in front of the IP address the computer is happy about. In my case, all of the status flags (first column
+
,-
,*
,#
, etc.) all disappear. As far as I know, that means the NTP service is not happy and no synchronization is being performed.Here is an example when it still works (i.e. there are flags in the very first column):
remote refid st t when poll reach delay offset jitter ============================================================================== 10.0.2.255 .BCST. 16 B - 64 0 0.000 0.000 0.000 #51.77.203.211 134.59.1.5 3 u 4 64 1 171.248 -743.64 691.917 +72.5.72.15 216.218.254.202 2 u 2 64 1 19.223 -778.34 686.200 +159.69.25.180 192.53.103.103 2 u 3 64 1 237.733 -775.41 701.376 +173.0.48.220 43.77.130.254 2 u 2 64 1 35.489 -778.85 669.187 38.229.56.9 172.16.21.35 2 u 31 64 1 153.976 -268.90 122.557 +137.190.2.4 .PPS. 1 u 31 64 1 93.797 -253.69 116.289 +150.136.0.232 185.125.206.71 3 u 35 64 1 95.667 -178.19 114.912 94.154.96.7 194.29.130.252 2 u 31 64 1 237.560 -231.88 107.230 +162.159.200.123 10.4.1.175 3 u 34 64 1 16.246 -199.68 115.561 *216.218.254.202 .CDMA. 1 u 35 64 1 52.906 -193.84 131.148 91.189.91.157 132.163.96.1 2 u 45 64 1 87.772 -5.716 0.000 +204.2.134.163 44.24.199.34 3 u 34 64 1 16.711 -199.12 116.777 +74.6.168.73 208.71.46.33 2 u 35 64 1 69.772 -189.21 128.119 91.189.89.199 17.253.34.123 2 u 45 64 1 165.471 -3.708 0.000 +216.229.0.49 216.218.192.202 2 u 35 64 1 71.437 -178.94 97.505 91.189.89.198 17.253.34.123 2 u 44 64 1 172.852 -17.899 0.000
-
ntpdate -q <ip>
The
ntpdate
command will actually tell me whether the NTP is accepting packets. This is because it gives an error message if not:$ sudo ntpdate -q 10.0.2.1 server 10.0.2.1, stratum 4, offset 5.194725, delay 0.02652 21 Jul 15:22:48 ntpdate[13086]: no server suitable for synchronization found
This happens after a little while when my main server loses the
*
status on the one server it was first happy to synchronize with...
Now... I still need to understand what I have to do to fix this issue...
This may be helpful, here are the logs about a restart from a full reboot:
Jul 21 18:29:13 vm-ve-ctl kernel: [ 434.275481] audit: type=1400 audit(1626917353.636:43): apparmor="DENIED" operation="open" profile="/usr/sbin/ntp
d" name="/snap/bin/" pid=3896 comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Jul 21 18:29:13 vm-ve-ctl ntpd[3896]: ntpd [email protected] (1): Starting
Jul 21 18:29:13 vm-ve-ctl ntpd[3896]: Command line: /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 126:129
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: proto: precision = 0.190 usec (-22)
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Cannot open logfile /var/log/ntp.log: Permission denied
Jul 21 18:29:13 vm-ve-ctl kernel: [ 434.291490] audit: type=1400 audit(1626917353.652:44): apparmor="DENIED" operation="capable" profile="/usr/sbin/
ntpd" pid=3901 comm="ntpd" capability=1 capname="dac_override"
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): good hash signature
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): loaded, expire=2021-12-28T00:00:00Z last=2017-01-01T
00:00:00Z ofs=37
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen and drop on 0 v6wildcard [::]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen and drop on 1 v4wildcard 0.0.0.0:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 2 lo 127.0.0.1:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 3 enp0s3 192.168.2.120:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 4 enp0s8 10.0.2.1:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 5 lo [::1]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 6 enp0s3 [fe80::a00:27ff:fe25:38ff%2]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listen normally on 7 enp0s8 [fe80::a00:27ff:fe35:c30b%3]:123
Jul 21 18:29:13 vm-ve-ctl ntpd[3901]: Listening on routing socket on fd #24 for interface updates
Jul 21 18:29:14 vm-ve-ctl ntpd[3901]: Soliciting pool server 51.77.203.211
Jul 21 18:29:15 vm-ve-ctl ntpd[3901]: Soliciting pool server 159.69.25.180
Jul 21 18:29:15 vm-ve-ctl ntpd[3901]: Soliciting pool server 72.5.72.15
Jul 21 18:29:16 vm-ve-ctl ntpd[3901]: Soliciting pool server 198.251.86.68
Jul 21 18:29:16 vm-ve-ctl ntpd[3901]: Soliciting pool server 173.0.48.220
Jul 21 18:29:16 vm-ve-ctl ntpd[3901]: Soliciting pool server 38.229.56.9
Jul 21 18:29:17 vm-ve-ctl ntpd[3901]: Soliciting pool server 150.136.0.232
Jul 21 18:29:17 vm-ve-ctl ntpd[3901]: Soliciting pool server 94.154.96.7
Jul 21 18:29:17 vm-ve-ctl ntpd[3901]: Soliciting pool server 137.190.2.4
Jul 21 18:29:18 vm-ve-ctl ntpd[3901]: Soliciting pool server 162.159.200.123
Jul 21 18:29:18 vm-ve-ctl ntpd[3901]: Soliciting pool server 216.218.254.202
Jul 21 18:29:18 vm-ve-ctl ntpd[3901]: Soliciting pool server 91.189.91.157
Jul 21 18:29:19 vm-ve-ctl ntpd[3901]: Soliciting pool server 91.189.89.199
Jul 21 18:29:19 vm-ve-ctl ntpd[3901]: Soliciting pool server 74.6.168.73
Jul 21 18:29:19 vm-ve-ctl ntpd[3901]: Soliciting pool server 204.2.134.163
Jul 21 18:29:20 vm-ve-ctl ntpd[3901]: Soliciting pool server 91.189.89.198
Jul 21 18:29:20 vm-ve-ctl ntpd[3901]: Soliciting pool server 216.229.0.49
Jul 21 18:29:20 vm-ve-ctl ntpd[3901]: Soliciting pool server 2604:ed40:1000:1711:d862:f5ff:fe4e:41c4
Jul 21 18:29:21 vm-ve-ctl ntpd[3901]: receive: Unexpected origin timestamp 0xe4a34871.ac57f05d does not match aorg 0000000000.00000000 from [email protected] xmt 0xe4a34871.65648c54
I do not know exactly when it starts going bad. I've also seen the following which I thought could have something to do with it (i.e. when that happens, the corresponding IP is removed from the list!), but it is already bad now and no such error occurred in my last run.
Jul 21 18:08:57 vm-ve-ctl ntpd[9764]: 92.243.6.5 local addr 192.168.2.120 -> <null>
Note: the 192.168.2.120 is the IP of the failing computer. It is a VirtualBox. It has been working for months... though, maybe something changed which makes it unhappy.
I found this post about an issue with the ... -> <null>
message. However, I think we have a newer version on Ubuntu 18.04:
SUSE minimum recommended version: ntp-4.2.8p7-11.1
Ubuntu 18.04 version: 1:4.2.8p10+dfsg-5ubuntu7.3
Just in case, I tried to connect the VM to the host and I still get a huge offset and jitter. What could have changed?!
remote refid st t when poll reach delay offset jitter
==============================================================================
10.0.2.10 .POOL. 16 p - 64 0 0.000 0.000 0.000
10.0.2.10 132.163.97.6 2 u 54 64 3 0.457 -5254.2 3917.68
As asked by Paul Gear, here is the output of ntpq with additional details:
$ ntpq -ncrv
associd=0 status=0028 leap_none, sync_unspec, 2 events, no_sys_peer,
version="ntpd [email protected] (1)", processor="x86_64",
system="Linux/4.15.0-151-generic", leap=00, stratum=4, precision=-23,
rootdelay=17.930, rootdisp=5019.260, refid=173.255.215.209,
reftime=e4a44f7a.1c2ec778 Thu, Jul 22 2021 13:11:38.110,
clock=e4a45030.c8a4b259 Thu, Jul 22 2021 13:14:40.783, peer=0, tc=6,
mintc=3, offset=-109.527915, frequency=-1.707, sys_jitter=0.000000,
clk_jitter=38.724, clk_wander=0.000, tai=37, leapsec=201701010000,
expire=202112280000
Here is the list of available clocks and the one currently used:
$ grep . /sys/devices/system/clocksource/clocksource*/[ac]*clocksource
/sys/devices/system/clocksource/clocksource0/available_clocksource:kvm-clock tsc acpi_pm
/sys/devices/system/clocksource/clocksource0/current_clocksource:kvm-clock
And finally, the dmesg
output about the clocksource selection process:
$ dmesg | grep clocksource
[ 0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[ 0.283117] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[ 1.161844] clocksource: Switched to clocksource kvm-clock
[ 1.208316] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 2.329228] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1db81a3240f, max_idle_ns: 440795250379 ns
Solution 1:
It looks like I found a solution. I'm not too sure why it would have been working before, though.
After many searches, I found this VirtualBox ticket:
https://www.virtualbox.org/ticket/15179
which says that you shouldn't use ntpd
because the VM already does keep time using the Host time to adjust the VM's virtual clock.
However, even withouth ntpd
running, my VM's clock was off and would bounce between ±30 seconds in a short period of time.
Reading that post further, they offered to adjust the paravirtualization settings to "None". That didn't seem to work for me. I restarted the VM and it got stuck. So I instead tried with "Minimal" and now the clock is working. The ntpq -np
output looks a lot better:
Thu Jul 22 12:57:57 PDT 2021
remote refid st t when poll reach delay offset jitter
==============================================================================
1.ubuntu.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.008
2.ubuntu.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.008
3.ubuntu.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.008
ntp.ubuntu.com .POOL. 16 p - 64 0 0.000 0.000 0.008
+23.157.160.168 129.6.15.28 2 u 20 128 377 83.831 0.783 1.745
-104.131.139.195 163.237.218.19 2 u 28 128 377 20.162 3.622 1.451
+144.76.64.40 36.224.68.195 2 u 18 128 377 179.329 2.557 6.671
-159.89.86.140 192.5.41.40 2 u 126 128 377 87.016 3.094 1.385
-23.175.208.10 239.9.71.195 2 u 35 128 377 82.350 2.311 0.794
+206.82.16.3 47.187.174.51 2 u 127 128 377 84.683 1.385 0.753
+92.243.6.5 85.199.214.98 2 u 25 128 377 163.041 4.275 4.025
*200.160.7.186 .ONBR. 1 u 47 128 377 191.078 1.126 1.865
+51.255.197.148 217.91.44.17 2 u 96 128 367 154.201 1.225 4.742
As we can see, the Offset (max. 4.3) and Jitter (max. 6.7) are minuscule compare to what I was getting before. Now my other computer also works and can synchronize itself to this clock. The delay is around 0.7, which is enough for me (in my case, enough is 12.0 or less).
IMPORTANT NOTE: I rebooted that VM 2 or 3 times using Minimal, the boot time is excruciatingly long. So I do not recommend using this mode except as a the very last resort if you absolutely need a working system clock. If you have a better solution that works, I'm all ears!
Just in case, I wanted to see the difference in regard to the clock sources in Minimal mode. We just get the acpi_pm
clock.
$ grep . /sys/devices/system/clocksource/clocksource*/[ac]*clocksource
/sys/devices/system/clocksource/clocksource0/available_clocksource:acpi_pm
/sys/devices/system/clocksource/clocksource0/current_clocksource:acpi_pm
Now I'm wondering whether this could have an effect on the host clock. Since I also have ntp on my host, I don't think this is an issue.