Why is NTP considering my server inadequate?
I have an embedded Linux device connected directly to my Windows desktop via a USB/Net interface. It's based on the Freescale iMX6 boards so I believe the clock hardware is the SNVS RTC.
On the desktop (192.0.0.10), I have W32Time running as an NTP server, and the embedded device (192.0.0.100) is (I think) correctly configured to use it, as per the ntp.conf file:
server 192.0.0.10 iburst minpoll 5 maxpoll 7
driftfile /data/ntp.drift
restrict default nomodify nopeer noquery limited kod
restrict 127.0.0.1
restrict [::1]
Connectivity is not an issue (a), since I can, on the embedded device, execute:
ntpdate -uq 192.0.0.10
ntpdate -ub 192.0.0.10
and this will successfully query and update the time.
However, I find that the clock, which is supposed to be kept in sync by ntpd, is drifting quite a bit. I started and synced ntpd about 18 hours ago and the offset gradually rose to about 5 seconds:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 192.0.0.10      192.168.0.4      4 u   31   32  377    1.452  4941.57  11.927
Over the last few hours, it's actually started coming back but it's still 3.2 seconds away from what it should be. In any case, I'm not convinced that's any more than a coincidence, for the following reasons.
When I saw it rising consistently, I did some digging. The output of the ntpq associations command was (and still is):
# ntpq -c as
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1 62876  9024   yes   yes  none    reject   reachable  2
This appears to indicate that, though reachable, the server is being filtered out for some reason. Based on the status 9024 (see here), it appears to be explained by "discarded as not valid (TEST10-TEST13)".
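As a cross-check on that reading, the status word can be split into its fields by hand. The following is a minimal sketch, assuming the peer status word layout described in the ntpq documentation (flag bits and a 3-bit select code in the high byte, then a 4-bit event count and a 4-bit event code). The function name decode_peer_status and the lookup tables are illustrative only and are not part of any NTP tool.

```python
# Select codes, assumed from the ntpq "peer status word" documentation.
SELECT = [
    "sel_reject", "sel_falsetick", "sel_excess", "sel_outlier",
    "sel_candidate", "sel_backup", "sel_sys.peer", "sel_pps.peer",
]

# Flag bits in the high byte of the status word (same assumption).
STATUS_BITS = {0x80: "conf", 0x40: "authenb", 0x20: "auth", 0x10: "reach", 0x08: "bcst"}


def decode_peer_status(word):
    """Split a 16-bit ntpq peer status word into its documented fields."""
    high = (word >> 8) & 0xFF
    return {
        "flags": [name for bit, name in STATUS_BITS.items() if high & bit],
        "select": SELECT[high & 0x07],
        "event_count": (word >> 4) & 0x0F,
        "event_code": word & 0x0F,  # 4 is listed as "reachable"
    }


decoded = decode_peer_status(0x9024)
print(decoded)
```

For 9024 this yields the conf and reach flags, a select code of sel_reject, two events, and event code 4, which agrees with the "conf, reach, sel_reject, 2 events, reachable" text that ntpq prints for the association.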
So, then I go and look at the ntpq variables for that association:
# ntpq -c rv 62876
associd=62876 status=9024 conf, reach, sel_reject, 2 events, reachable,
srcadr=192.0.0.10, srcport=123, dstadr=192.0.0.100, dstport=123, leap=00,
stratum=4, precision=-6, rootdelay=129.150, rootdisp=2193.741,
refid=192.168.0.4,
reftime=ddd30907.eff60ee5 Thu, Dec 7 2017 0:25:43.937,
rec=ddd31287.4db24cd8 Thu, Dec 7 2017 1:06:15.303, reach=377,
unreach=0, hmode=3, pmode=4, hpoll=5, ppoll=5, headway=21,
flash=400 peer_dist, keyid=0, offset=3186.569, delay=1.446,
dispersion=16.036, jitter=11.844, xleave=0.093,
filtdelay= 1.45 1.42 1.41 1.47 1.44 1.43 1.44 1.48,
filtoffset= 3186.57 3189.58 3192.08 3194.56 3197.13 3199.58 3202.57 3205.06,
filtdisp= 15.63 16.12 16.60 17.08 17.58 18.06 18.54 19.03
I see that the flash variable is set to 400 which, based on that same page linked to above, corresponds to 0400/TEST11/peer_dist/peer distance exceeded.
Now I gather that's not physical distance (both client and server are on my desktop) or network distance (the two devices are directly connected). The only useful reference I've been able to find on the net is on Google Groups where one David Woolley states:
Distance exceeded means that the combination of worst case round trip time induced error and an assumed drift of 15ppm since the last valid time on the root server (plus a few minor components) has exceeded 1 second.
It commonly happens with w32time servers that have been synchronized once but left to drift. It can also happen if the servers are orphan mode, and haven't had a real time source for too long, and you are not using the very latest orphan mode code.
Unfortunately, I have no idea how to calculate the "worst case round trip time induced error" so I'm not sure how to proceed from here. I'm pretty certain my desktop is synchronising with the corporate time server (mine and a smattering of other desktops all seem to be very close in time), though I'm also not sure how I'd check that definitively.
So, my question is, therefore, where can I go from here? I can't seem to get any more useful information out of ntpq, and even running ntpd -dd in the foreground doesn't seem to clear up why the server time is being rejected.
Any help would be greatly appreciated.
(a) As further indicated by the logs on the Windows side, enabled with:
w32tm /debug /enable /file:C:\w32time.log /size:10000000 /entries:0-300
and producing:
152281 02:06:57.1968483s - ListeningThread -- DataAvailEvent set for socket 1 (0.0.0.0:123)
152281 02:06:57.1973483s - ListeningThread -- response heard from 192.0.0.100:123 <- 192.0.0.10:123
152281 02:06:57.1973483s - /-- NTP Packet:
152281 02:06:57.1973483s - | LeapIndicator: 3 - not synchronized; VersionNumber: 4; Mode: 3 - Client; LiVnMode: 0xE3
152281 02:06:57.1973483s - | Stratum: 0 - unspecified or unavailable
152281 02:06:57.1973483s - | Poll Interval: 5 - 32s; Precision: -20 - 953.674ns per tick
152281 02:06:57.1973483s - | RootDelay: 0x0000.0000s - unspecified; RootDispersion: 0x0000.F1A0s - 0.943848s
152281 02:06:57.1973483s - | ReferenceClockIdentifier: 0x494E4954 - source name: "INIT"
152281 02:06:57.1973483s - | ReferenceTimestamp: 0x0000000000000000 - unspecified
152281 02:06:57.1973483s - | OriginateTimestamp: 0xDDD320A033087D7D - 13157085984199348300ns - 152281 02:06:24.1993483s
152281 02:06:57.1973483s - | ReceiveTimestamp: 0xDDD3209D4DB18BA5 - 13157085981303490400ns - 152281 02:06:21.3034904s
152281 02:06:57.1973483s - | TransmitTimestamp: 0xDDD320BE4D535D3F - 13157086014302053300ns - 152281 02:06:54.3020533s
152281 02:06:57.1973483s - >-- Non-packet info:
152281 02:06:57.1973483s - | DestinationTimestamp: 152281 02:06:57.1973483s - 0xDDD320C132856B0E152281 02:06:57.1973483s - - 13157086017197348300ns152281 02:06:57.1973483s - - 152281 02:06:57.1973483s
152281 02:06:57.1973483s - | RoundtripDelay: -562900ns (0s)
152281 02:06:57.1973483s - | LocalClockOffset: -2895576400ns - 0:02.895576400s
152281 02:06:57.1973483s - \--
152281 02:06:57.1973483s - TransmitResponse: sent 0.0.0.0:123(192.0.0.10:123)->192.0.0.100:123
Update on the comment "Over the last few hours, it's actually started coming back": it's actually started drifting out again (currently at 3.7 seconds) so my thoughts that this was a coincidence seem to be supported.
Your client is refusing to synchronize to the server because its "root dispersion" (the server's own estimate of its error from "true" time, and one of the variables that contributes to peer distance) is around 2.2 seconds, which is greater than the default tolerance of one second.
Although it's best to debug the server and figure out why it has such a bad estimate of its own timekeeping abilities, you can force the client to synchronize to it anyway by providing a larger value for the tos maxdist option in ntp.conf.
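To see how far over the limit this peer is, the root distance can be approximated from the values in the ntpq -c rv output. This is a rough sketch, assuming the usual root distance formula (half the total round-trip delay, plus both dispersions, plus jitter, plus an aging term of 15 ppm since the last update). The function name root_distance_ms and its parameters are hypothetical, and the aging term is left at zero because the output does not give the elapsed time directly.

```python
def root_distance_ms(delay, rootdelay, dispersion, rootdisp, jitter,
                     age_s=0.0, phi_ppm=15.0):
    """Approximate the NTP root distance in milliseconds.

    All inputs are in milliseconds except age_s (seconds since the last
    update) and phi_ppm (assumed frequency tolerance in parts per million).
    """
    aging_ms = phi_ppm * 1e-6 * age_s * 1000.0
    return (delay + rootdelay) / 2.0 + dispersion + rootdisp + jitter + aging_ms


# Values copied from the rv output in the question.
peer = dict(delay=1.446, rootdelay=129.150, dispersion=16.036,
            rootdisp=2193.741, jitter=11.844)

distance = root_distance_ms(**peer)
print(f"root distance is roughly {distance:.1f} ms")
```

Under these assumptions the total comes to about 2287 ms, almost all of it from rootdisp. The tos maxdist value is given in seconds, so a setting such as tos maxdist 3 would probably admit this server for now, but the figure may keep growing if the Windows host is not actually staying synchronized to its own source.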