Why is NTP syncing to LOCAL rather than remote server?
So, I'm trying to debug my current NTP setup, and found that the offset from my single configured server is over 3 seconds, and not adjusting. The asterisk on the LOCAL(0) line in the ntpq output seems to indicate that the system is happily syncing with itself rather than with the 10.130.33.201 server (which is another Linux box on our system that we want everything to sync to).
ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.130.33.201   LOCAL(0)         9 u   49   64  377    0.242  -3742.2   1.049
*LOCAL(0)        .LOCL.          10 l    2   64  377    0.000    0.000   0.001
And this is my ntp.conf file. Written by someone else, so I'm not 100% sure that everything is correct.
server 10.130.33.201 burst iburst minpoll 4 maxpoll 11
driftfile /mnt/active/etc/ntp.drift
restrict -4 default nomodify nopeer notrap
restrict -6 default ignore
# Undisciplined Local Clock. This is a fake driver intended for backup
# and when no outside source of synchronized time is available.
server 127.127.1.0 # local clock
fudge 127.127.1.0 stratum 10
I've read about the burst and iburst and minpoll/maxpoll, so I realize that those might not be needed, but I don't think that has anything to do with my current issue.
Also, because of how it is deployed, that config file will take a lot of work to change, so I hope that there's nothing that really must be changed. I'm hoping that this is a case of me not understanding how NTP works.
EDIT -
So, it looks like this is a duplicate of this question, but I don't feel that poster got a sufficient answer, so I would still like to know why the local time is being preferred over the server. Also, as per one of the answers below, I tried to use the prefer keyword on the server line of the config and restart, but that does not seem to have had an effect.
If I do remove all of the "local" lines in the config as the answer to the other question suggests, what will happen if the server is unreachable? Does NTP die or does it just keep trying?
IMPORTANT EDIT --
Ok, normally, 10.130.33.201 (The "server") has no access to the internet, and does not have a GPS time source to use. The important part is that all the devices on the system have the same time as the server, regardless of how correct that time actually is.
So, just to see what would happen, I added one of the NTP pool servers to the config file of the server so it would get time from there rather than getting time from local. It now correctly gets time from the NTP time server.
After I did that, the clients now sync with the server rather than preferring LOCAL(0):
ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.130.33.201   38.229.71.1      3 u   58   64  377    0.216  715621.   1.001
 LOCAL(0)        .LOCL.          10 l   18   64  377    0.000    0.000   0.001
NEW QUESTION - When my server is using local (original example that was given), it seems like the clients are saying, "Oh, 10.130.33.201 is using LOCAL(0). Hmm, I also have a LOCAL(0) server -- I'll just use that directly rather than getting the same information via 10.130.33.201".
Is that the case? Are they trying to go "directly to the source" which is incorrectly LOCAL(0)? I need my server to get time from LOCAL(0), and I need the clients to get time from the server. Right now removing the "local" server from the client config files is the only option, but I would like to understand why this is happening, and if at all possible, avoid changing their configs (config change will be a lot of work because of our environment...).
Also, this looks like another duplicate without a good answer.
Solution 1:
With only one NTP server configured, the algorithm isn't entirely sure whom to trust. Even though the remote host has a lower stratum, I bet the algorithm thinks local time is more trustworthy.
Try using the prefer keyword with your server statement to set that as a preferential time source.
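As a minimal sketch, this is the server line from the question's ntp.conf with prefer added. All other options are copied unchanged from the question; only the prefer keyword is new.

```
# ntp.conf on the client: mark the remote server as the preferred source
server 10.130.33.201 prefer burst iburst minpoll 4 maxpoll 11
```

ntpd has to be restarted after the change for the new option to be read.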
EDIT -
So, it looks like this is a duplicate of This question, but I don't feel that poster got a sufficient answer, so I would still like to know why the local time is being preferred over the server.
For a truly sufficient answer, you are going to be digging into the bowels of a very complex algorithm. The documentation doesn't even get too specific but I am sure there's a white paper or specification out there.
If I do remove all of the "local" lines in the config as the answer to the other question suggests, what will happen if the server is unreachable? Does NTP die or does it just keep trying?
The NTP daemon doesn't die or stop, but it does quit synchronizing time once it can no longer reach the remote server. This is why best practice is to configure at least three remote servers and not to use the local clock (LOCAL) driver unless you are disconnected from the network. Three servers are suggested because when there are only two and they disagree, which one should it choose? The third server helps the algorithm eliminate the bogus one.
Lastly, I just noticed that you do not define a driftfile. This might help?
Solution 2:
It looks to me like the offset (the difference between your system time and the NTP host's time) is too large for NTP to correct on its own.
My suggestion:
1. Stop the NTP service
2. As root, run ntpdate -bs 10.130.33.201 to reset your time to something close
3. Start the NTP service
3. Start the NTP service
You should have no problems after that.
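The three steps above can be sketched as a small shell helper. The function name step_clock_from is hypothetical, and the init script name "ntpd" is an assumption (some distributions call it "ntp"); only the ntpdate -bs invocation comes from the answer itself.

```shell
# Hypothetical helper: step the clock once from a server, then restart ntpd.
# Assumption: the service is managed with "service" and is named "ntpd".
step_clock_from() {
    server="$1"
    service ntpd stop || return 1   # ntpdate cannot bind port 123 while ntpd runs
    ntpdate -bs "$server"           # -b: step the clock, -s: log to syslog
    status=$?
    service ntpd start              # restart ntpd even if ntpdate failed
    return "$status"
}
```

It would be run as root, for example: step_clock_from 10.130.33.201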
Solution 3:
I know this is old, but I think you are right. No one shows any way to debug ntpd issues. Turns out it is doable.
I think you were on the right track when you suspected that use of LOCAL(0) locally and on upstream server may be an issue.
It certainly was on a time island of 4 servers I had a similar issue with. These were all set to be peers of each other, so possibly a different issue to yours.
First though, there is a better way of handling time islands called orphan mode that is supported with ntpd versions of the last few years:
Orphan mode on doc.ntp.org
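A minimal sketch of what enabling orphan mode might look like, based on the tos orphan directive described in that documentation. This fragment is not from my own configuration, and the stratum value 6 is an arbitrary illustration.

```
# ntp.conf on each host in the isolated group.
# Assumption: this takes the place of the "server 127.127.1.0" and "fudge"
# lines for the local clock driver instead of being added next to them.
tos orphan 6    # stratum to fall back to when no real time source is reachable
```

The chosen stratum should be higher than that of any real source the hosts might ever reach, so that a real source always wins when one is available.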
Initially all 4 servers had the same stratum of 10 and preferred their local clock. I fixed that and still they preferred their local clock (the stratum does seem to be important though).
I used the ntpq commands pe (peers), as (associations), and rv (readvar) to get a handle on what was happening. You need to run rv on the association number for the server to dump the information. pe and as seem to be sorted by the same index, so you can get the association number that way. The as output has a field called condition that may show the value reject if ntpd doesn't like the server.
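As a rough sketch, the same sequence can be run non-interactively with ntpq -c. The wrapper name inspect_peer is hypothetical; the association ID is whatever the assid column of the associations listing shows for the server in question.

```shell
# Hypothetical helper: list peers and associations, then dump one association.
inspect_peer() {
    assoc_id="$1"                     # value from the "assid" column of "as"
    ntpq -c peers -c associations     # same as "pe" and "as" at the ntpq prompt
    ntpq -c "readvar $assoc_id"       # same as "rv <assid>"; output includes flash
}
```

Running ntpq -c associations on its own first gives the ID to pass in.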
In the rv output is a field called flash. If all is well, this will be zero. If not, it is a bitmask (displayed in hex) of the issues. The bits can be looked up here:
ntpd internal decodes
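To illustrate reading the bitmask, here is a small decoder sketch in POSIX shell. The function name decode_flash is made up for this example, and the bit names are my transcription of the flash status table in the decode documentation, so check them against the page for your ntpd version.

```shell
# Decode an ntpd "flash" word (hex, with or without 0x) into test names.
# The function body runs in a subshell so its variables do not leak.
decode_flash() (
    value=$((0x${1#0x}))
    result=""
    bit=0
    # Names listed from the lowest bit (0001) to the highest (1000).
    for name in pkt_dup pkt_bogus pkt_unsync pkt_denied pkt_auth \
                pkt_stratum pkt_header pkt_autokey pkt_crypto \
                peer_stratum peer_dist peer_loop peer_unreach; do
        if [ $(( value & (1 << bit) )) -ne 0 ]; then
            result="${result:+$result }$name"
        fi
        bit=$((bit + 1))
    done
    echo "${result:-ok}"
)

decode_flash 0800
```

A value with several bits set prints all matching names separated by spaces, and a value of zero prints ok.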
The issue I had was 0800 peer_loop. It turned out that the refid of the clock is important. Seeing LOCAL(0) both on the local clock and from the remote server had ntpd thinking there was a loop. David Mills confirms that in posts on comp.protocols.time, 'How to avoid loop in NTP' (I have reached my limit of 2 links, sorry!)
Using the refid argument to fudge to set a unique refid did not work: it still shows up as LOCAL(0) at the recipient.
What did seem to work was using a unique instance number for the local driver on each server: 127.127.1.0 through 127.127.1.3. Use the same address on both the server and fudge lines. When I did this, the servers generally synced to the lowest-stratum server, which usually used its local clock. It occasionally tried to use one of the other servers that was using it as a source, but the times got in sync and seem to be staying that way.
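A sketch of what that looks like for two of the hosts. The host labels and the stratum values are illustrative assumptions, not my exact configuration; the point is only that the last octet of the local driver address differs per host and matches between the server and fudge lines.

```
# ntp.conf on host A: local clock driver, instance 0
server 127.127.1.0
fudge  127.127.1.0 stratum 10

# ntp.conf on host B: local clock driver, instance 1
server 127.127.1.1
fudge  127.127.1.1 stratum 11
```

With distinct instances, each host's local clock no longer presents the same LOCAL(0) reference that it sees coming from its neighbours, which is what appeared to trigger the peer_loop rejection.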
Probably far too late to help, but I offer it up to show NTP is amenable to logic and troubleshooting. I took hours reaching the answer by trial and error and then found the docs later.