NTP: How to establish a redundant solution for NTP servers?

In my company's infrastructure there are 5 data centers in remote locations.

In each remote location there is a pair of servers that provide DNS and NTP, and every other server in that location is configured to use those two servers for its DNS and NTP queries.

All servers are CentOS 6.x machines.

The goal is to make these two servers redundant for both DNS and NTP.

The DNS part is covered; my only problem is with NTP.

What is the correct way to make sure that when one NTP server fails, the remaining server(s) continue serving the clients as if nothing happened?

I've Googled around and found a Red Hat solution that designates one of the servers as primary (by marking it as "true" in the clients' configuration), but if that primary server fails, clients stop getting NTP updates from it, so it's not a truly redundant solution.
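
For reference, the configuration described there would look roughly like this in /etc/ntp.conf on each client (a sketch only; the 10.X.X.x addresses stand in for the two local servers, and the exact keyword depends on the article):

# Client /etc/ntp.conf sketch of the "preferred primary" approach:
server 10.X.X.38 true      # or "prefer": mark this server as the primary
server 10.X.X.39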

Does anyone have experience configuring such a solution?

Edit #1:

To test MadHatter's answer I did the following:

  1. I stopped NTPd on the server that is configured as "preferred" on each of the NTP clients.
  2. I'm waiting for the NTP client to stop working against this server and start working against its partner NTPd server.
  3. I'm running ntpq -p on the client to see the change. This is the output of ntpq -p:
[root@ams2proxy10 ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.X.X.38      .INIT.          16 u    -  128    0    0.000    0.000   0.000
*10.X.X.39      131.211.8.244    2 u    2   64  377    0.123    0.104   0.220

What is "as in ntpq" ? which command shall I run please?

Edit #2: The output of as:

[root@ams2proxy10 ~]# ntpq
ntpq> as

ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1 64638  8011   yes    no  none    reject    mobilize  1
  2 64639  963a   yes   yes  none  sys.peer    sys_peer  3
ntpq>

The output of pe:

ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.X.X.38      .INIT.          16 u    -  512    0    0.000    0.000   0.000
*10.X.X.39      131.211.8.244    2 u   36   64  377    0.147    0.031 18874.7
ntpq>
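
(For the record, the same listings can be pulled non-interactively, and the association IDs in the "as" output line up, in order, with the peers shown by "pe" / ntpq -p. A sketch, using the assid from the output above:)

# Run the association and peer listings in one shot:
ntpq -c as -c pe

# Show the variables (including the remote address) of one association,
# e.g. the entry listed above as assid 64639:
ntpq -c "rv 64639"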

Solution 1:

I suspect this is a non-problem: NTP is resilient to this already.

You don't have a "primary" NTP server and some secondaries: you have a set of configured servers. NTPd will decide which of them are reliable and which is most likely to offer a good time signal, and it will constantly re-evaluate those decisions.
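
So on each client you just list both local servers with equal standing and let NTPd choose; a minimal sketch, using the two addresses from the question:

# Client /etc/ntp.conf: no "prefer"/"true" needed, ntpd fails over by itself.
server 10.X.X.38 iburst
server 10.X.X.39 iburst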

This is the set of bindings from my NTP pool server over the past month or so:

[munin NTP state graph]

As you can see, most of the time state 6 (system peer) is occupied by the green line, ntp0.jonatkins.com, which is a stratum 1 server to which I bind with permission (all my other servers are stratum 2, so NTPd prefers the stratum 1 server when no other factors apply).

But you can see a dip in that line early in week 44, and the numerical values below the image confirm that during the period of the graph, ntp0.jonatkins.com fell to state 4 (outlyer), while linnaeus.inf.ed.ac.uk, which spent much of its time at state 5 (candidate), briefly peaked at 6 (system peer). (The lines don't go all the way down to 4 or up to 6 because these are 2-hour averages of 5-minute raw data; presumably whatever happened lasted noticeably less than 2 hours and has therefore been smoothed out.)

This shows that, without any input on my part, NTPd decided at some point that its usual peer was insufficiently reliable, and picked the best alternative source during the "outage". As soon as its preferred peer passed its internal QA tests again, it was restored to peer status.
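
If you want to watch that happen during a failover test of your own, something as simple as the following is enough; the tally character in the first column tells you which peer is currently selected:

# Poll the peer list every 30 seconds; '*' marks the current system peer,
# '+' a candidate, and a leading blank, '-' or 'x' a rejected/discarded source.
watch -n 30 ntpq -pn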

Solution 2:

Four or more NTP peers provide falseticker detection and n+1 redundancy. This is also Red Hat's recommendation (although it appears to be subscriber-only content now).

Select 4 or more Internet sources, or use the NTP Pool project. Add non-Internet sources such as GPS clocks if you have them. Point all of your NTP servers at all of these sources.
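
On each of your NTP servers that might look roughly like this (the hostnames below are just the public CentOS pool zone; substitute your own sources or refclocks):

# /etc/ntp.conf on each internal NTP server: four independent upstream sources,
# enough for the selection algorithm to out-vote a single falseticker.
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst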

Make sure your NTP servers are spread throughout your infrastructure and share as few single points of failure as possible: different racks, power distribution, network and Internet connectivity, data centers, etc.

Configure all "client" hosts to use all of your NTP servers, with at least 4 per client.
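
A client-side sketch (the hostnames are placeholders for your internal NTP servers):

# /etc/ntp.conf on each client: at least four internal servers,
# ideally spread across racks and data centers.
server ntp1.example.com iburst
server ntp2.example.com iburst
server ntp3.example.com iburst
server ntp4.example.com iburst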

This configuration is very resilient. You can lose any single NTP peer and still detect falsetickers, throwing out one crazy clock.