Setting up NTP servers

I have a problem setting up NTP to maintain time on a stand-alone network, which will run as an isolated time island. The problem is that the servers' clocks drift apart, even after they have been initially synchronised.

There are two redundant NTP servers running RHEL 5.4 and several Windows XP clients. The requirements are that the network syncs to server A whilst server B acts as a backup. We do have a GPS that acts as a time server controlling both server A and server B, but it is not always available. When the GPS is present, both servers sync to the GPS.

The XP clients seem to divide into two groups once the servers drift apart, with some following server A and others following server B.

How can I prevent my two servers from drifting apart?

Can I control which server the XP clients follow?

The two ntp.conf files are as follows:

ntp.conf for Server A (10.203.224.13)

# Tweak NTP's behavior
tinker panic 0 step 0.01 stepout 64

# GPS
server 10.203.220.12 burst iburst minpoll 4 maxpoll 6

# Server A
server 10.203.224.13 burst iburst minpoll 4 maxpoll 6

# Server B
server 10.203.224.14 burst iburst minpoll 4 maxpoll 6

# Configure the local clock to serve from
server 127.127.1.1
fudge 127.127.1.1 stratum 11

# Establish the drift file location
driftfile /etc/ntp.drift 

ntp.conf for Server B (10.203.224.14)

# Tweak NTP's behavior
tinker panic 0 step 0.01 stepout 64

# GPS
server 10.203.220.12 burst iburst minpoll 4 maxpoll 6

# Server A
server 10.203.224.13 burst iburst minpoll 4 maxpoll 6

# Server B
server 10.203.224.14 burst iburst minpoll 4 maxpoll 6

# Configure the local clock to serve from
server 127.127.1.1
fudge 127.127.1.1 stratum 13

# Establish the drift file location
driftfile /etc/ntp.drift

On Server A

[root@serverA]# ntpq -p

     remote           refid          st t when poll reach   delay   offset  jitter
==============================================================================
 10.203.220.12   .INIT.          16 u    -   64    0    0.000    0.000   0.000
 10.203.224.13   .INIT.          16 u    -   64    0    0.000    0.000   0.000
 10.203.224.14   LOCAL(1)        14 u   27   64  377    0.312  359.753   0.289
*LOCAL(1)       .LOCL.          11 l   55   64  377    0.000    0.000   0.001

On Server B

[root@serverB]# ntpq -p

     remote           refid          st t when poll reach   delay   offset  jitter
==============================================================================
 10.203.220.12   .INIT.          16 u    -   64    0    0.000    0.000   0.000
 10.203.224.13   LOCAL(1)        12 u   55   64  377    0.346  -359.56   0.107
 10.203.224.14   .INIT.          16 u    -   64    0    0.000    0.000   0.000
*LOCAL(1)       .LOCL.          13 l   54   64  377    0.000    0.000   0.001

Solution 1:

On server A, remove the lines pointing to itself and server B, leaving only the "fudge" local clock line and the GPS. On server B, remove the "fudge" line and the server B line, leaving only the server A line and the GPS.

The idea is that server A should use the GPS if it's available; otherwise it should trust its own clock. Server B should use the GPS or server A, however server A is getting its time. If server B is allowed to trust itself, it will advertise a reliable time source to its clients, even though that time is different from server A's, which is what you're seeing.
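
As a rough sketch of what the two files might look like after those removals (the tinker and driftfile lines are assumed to stay as they were, and on server B the "server 127.127.1.1" line is assumed to go along with its "fudge" line, since the point is that B must not trust itself):

```
# ---- ntp.conf for Server A (10.203.224.13), sketch only ----
tinker panic 0 step 0.01 stepout 64

# GPS (preferred whenever it is reachable)
server 10.203.220.12 burst iburst minpoll 4 maxpoll 6

# Local clock as the fallback when the GPS is absent
server 127.127.1.1
fudge 127.127.1.1 stratum 11

driftfile /etc/ntp.drift

# ---- ntp.conf for Server B (10.203.224.14), sketch only ----
tinker panic 0 step 0.01 stepout 64

# GPS
server 10.203.220.12 burst iburst minpoll 4 maxpoll 6

# Server A (B follows A whenever the GPS is absent)
server 10.203.224.13 burst iburst minpoll 4 maxpoll 6

# No local clock entry here: B never advertises its own time

driftfile /etc/ntp.drift
```

With this layout server B has nothing to serve if both the GPS and server A are unreachable, which is the trade-off for keeping the clients on a single timeline.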

Solution 2:

There are a number of problems here:

  1. The GPS device isn't working correctly. This is most likely a connectivity issue: either a firewall is blocking the packets, or it isn't listening on the correct interface, or it can't receive a GPS signal, or something similar. It could be the intermittent unavailability that you mentioned. If so, try to post the ntpq -p output from a time when it's working.
  2. The GPS is shown at stratum 16. When it's working, this should be 1. Any source at stratum 11 or higher is going to cause the same issue you are having, because server A will trust its local clock (fudged to stratum 11) more than anything at 11 or higher.
  3. Server A is configured to get time from Server B and Server B is configured to get time from Server A. This kind of setup should be a peering relationship rather than a circular master/slave relationship. Use the peer keyword rather than the server keyword for this.
  4. Server A and Server B are both set up to use themselves as a time source via the NTP protocol. This is redundant and isn't working. Either the connection is failing or the current stratum is 16 and can't go any higher.
  5. Both servers have selected their own clocks as the most reliable time source (indicated by the * next to the LOCAL source). They have also both managed to connect to each other. I'm not sure why Server B didn't choose Server A as the best time source, as it has the lowest stratum value, but it's probably because it has a significantly higher jitter than the LOCAL time source.

Get the GPS working, change the two servers to peer with each other and remove the lines to get time from their own IP address. (The local clock is fine but adding in the latency of a network protocol for a local clock is silly.)
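
A minimal sketch of that change for both servers, assuming the GPS line, the tinker line, the driftfile line and the original local clock strata are kept unchanged. The burst and iburst options are dropped from the peer lines because, as far as I know, ntpd only accepts them on server lines:

```
# ---- ntp.conf for Server A (10.203.224.13), sketch only ----
tinker panic 0 step 0.01 stepout 64

# GPS
server 10.203.220.12 burst iburst minpoll 4 maxpoll 6

# Peer with Server B instead of treating it as a server
peer 10.203.224.14 minpoll 4 maxpoll 6

# Local clock, referenced directly rather than via the server's own IP
server 127.127.1.1
fudge 127.127.1.1 stratum 11

driftfile /etc/ntp.drift

# ---- ntp.conf for Server B (10.203.224.14), sketch only ----
tinker panic 0 step 0.01 stepout 64

# GPS
server 10.203.220.12 burst iburst minpoll 4 maxpoll 6

# Peer with Server A
peer 10.203.224.13 minpoll 4 maxpoll 6

# Local clock, at a worse stratum than Server A's
server 127.127.1.1
fudge 127.127.1.1 stratum 13

driftfile /etc/ntp.drift
```

After restarting ntpd on both machines, ntpq -p should no longer list either server's own IP address, and the * marker shows which source each server has actually selected.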