DNS problem, possibly due to out-of-sync clock


#1

DNS broke on my mailserver recently. This meant that the system status checks reported that “Nameserver glue records are incorrect.” and “The nameservers set on this domain are incorrect. They are currently [Not Set].”

When I looked in /var/log/syslog after running dig @localhost google.com as a test, I found this:

Oct  1 00:21:30 ubuntu named[882]:   validating @0x72f00468: com DS: bad cache hit (./DNSKEY)
Oct  1 00:21:30 ubuntu named[882]: error (broken trust chain) resolving 'google.com/A/IN': 216.239.32.10#53
Oct  1 00:21:32 ubuntu named[882]:   validating @0x72e2c770: com DS: bad cache hit (./DNSKEY)
Oct  1 00:21:32 ubuntu named[882]: error (broken trust chain) resolving 'google.com/A/IN': 216.239.34.10#53

Strangely, the IPs listed are valid Google addresses (the #53 suffix suggests they are the nameservers that named was querying), so the DNS lookup is sort of succeeding, but named doesn’t want to allow the answers to be returned as valid.
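
For anyone who wants to check quickly whether their own log shows the same failure, here is a minimal sketch that counts the "broken trust chain" lines. The function name is made up for illustration, and the pattern assumes the syslog format shown in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: count DNSSEC validation failures logged by named.
# Reads syslog-style lines on stdin and prints how many of them are
# "broken trust chain" errors. Prints 0 when there are none.
count_trust_chain_errors() {
    # grep -c exits non-zero when nothing matches; "|| true" keeps the
    # function's exit status clean while still printing the count.
    grep -c 'named\[[0-9]*]: error (broken trust chain)' || true
}
```

Usage would be something like `count_trust_chain_errors < /var/log/syslog`. A non-zero count only says that validation is failing, not why.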

I fixed the problem by resetting the clock like this, based on http://www.thedumbterminal.co.uk/posts/2015/03/correcting_bind_errors_due_to_an_out_of_sync_clock.html

Stop time and DNS daemons:

/etc/init.d/ntp stop
/etc/init.d/bind9 stop

Find the address of a public NTP server:

nslookup pool.ntp.org 8.8.8.8

Set the time correctly:

ntpdate 209.114.111.1

Restart DNS and time daemons:

/etc/init.d/bind9 start
/etc/init.d/ntp start

After this, DNS started working again and I started receiving mail again too.

I am still not sure exactly how my system time went wrong, but I hope this is useful to someone . . .


#2

Wow. I would have loved to know what the system clock was before and after these steps.

Thanks for posting this.


#3

Here are the relevant excerpts from my console as I was fixing it.

root@ubuntu:~# ntpdate 193.188.204.101
 1 Oct 00:30:20 ntpdate[15236]: adjust time server 193.188.204.101 offset 0.127124 sec
root@ubuntu:~# date
Thu Oct  1 00:30:30 UTC 2015
root@ubuntu:~# ntpdate 193.188.204.101
 1 Oct 00:30:45 ntpdate[15373]: adjust time server 193.188.204.101 offset 0.118377 sec
root@ubuntu:~# ntpdate 193.188.204.101
 1 Oct 00:30:53 ntpdate[15379]: adjust time server 193.188.204.101 offset 0.113422 sec
root@ubuntu:~# ntpdate 209.114.111.1
 1 Oct 00:31:20 ntpdate[15388]: adjust time server 209.114.111.1 offset 0.095278 sec

Looking at this now, it’s hard to believe that an error of less than a second could make named choke, but maybe I’m wrong about that.
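
To make that comparison less of a judgment call, here is a rough sketch that pulls the offset out of an ntpdate line and tests it against a threshold. Both function names are invented for this post, and the threshold is whatever you pass in; I don't know what tolerance named actually applies.

```shell
#!/bin/sh
# Hypothetical helper: print the offset (in seconds) found in ntpdate
# output lines read from stdin, e.g. "... offset 0.127124 sec".
ntpdate_offset() {
    sed -n 's/.* offset \(-\{0,1\}[0-9.]*\) sec.*/\1/p'
}

# Hypothetical helper: succeed if the absolute value of offset $1
# is greater than the threshold $2 (both in seconds).
offset_exceeds() {
    awk -v off="$1" -v max="$2" 'BEGIN {
        off += 0; max += 0            # force numeric comparison
        if (off < 0) off = -off       # absolute value
        exit !(off > max)
    }'
}
```

For example, `offset_exceeds 0.127124 300` fails (the offset is well under five minutes), which matches my impression that the clock was already close by the time I captured these lines. As far as I know, ntpdate also prints "adjust time server" when it slews a small offset and "step time server" when it jumps a large one, and every line above says "adjust".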

An alternative explanation could be that restarting ntp and bind9 fixed something, but that seems even less likely, given that named was at least alive and listening before, and all ntp could have fixed was the time.


#4

Update a few weeks later: I think the problem has nothing to do with NTP or time.

The same symptoms recurred a couple of days ago and just restarting bind9 fixed it. I’m not sure what the root cause is, but it appears that time has nothing to do with it.


#5

Update 1.5 years later: the same problem recurred again.

I’m now on MIAB 0.21c. DNS failed on a Saturday afternoon. When I logged in, I discovered the system time had been reset to what I think was probably Unix epoch 0 (it was definitely 1970; I didn’t write down the exact time).
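
Since I didn't record the bad time, a sanity check like the following would at least flag a clock that has fallen back to 1970. This is only a sketch: the function name is made up, and the floor value is an assumption you would pick yourself (1443657600 is 2015-10-01 00:00:00 UTC, roughly when this thread started).

```shell
#!/bin/sh
# Hypothetical helper: succeed if the epoch timestamp in $1 is not
# earlier than the floor in $2 (both in seconds since 1970-01-01 UTC).
clock_is_plausible() {
    [ "$1" -ge "$2" ]
}
```

It could be called as `clock_is_plausible "$(date +%s)" 1443657600 || echo "system clock looks wrong: $(date)"`, for example from a cron job, so that the date gets logged before anything is restarted.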

Restarting bind9 solved the problem as before. Before restarting bind9, I was able to query the daemon, but it would respond with empty records (maybe this was actually named responding?). Oddly, rebooting the system did not solve the problem.


#6

Just one other update: it appears that I have to actually reboot and subsequently restart bind9 as well.

Still not sure what’s going wrong.


#7

Also, restarting bind9 and then rebooting (the reverse of the order above) fixes the system clock, but does not fix DNS. A second restart of bind9 fixes DNS.

There was a power glitch in the building around the same time DNS failed, so it might have something to do with the server not rebooting cleanly, or something like that. Maybe the system clock is set before the network connection comes up? Maybe DNS resolution is needed to set the system clock?


#8

A few pointers:

  1. ntpdate is deprecated; you ought to use ntpd (and ntpq -p to check NTP server status on Red Hat/CentOS machines).
  2. Make sure you have several NTP servers configured. Sometimes one or another fails for whatever reason, so having a few gives you the redundancy you need. It’s very probable the single NTP server you added is failing, hence causing the time sync issues.
  3. That error message comes from DNSSEC, which validates your DNS responses (to prevent DNS cache poisoning). Part of the security mechanism needs properly synced time; out-of-sync time creates all sorts of issues in cryptography, so having the correct time is vital.
  4. NTP syncing can take up to 30 minutes. So don’t expect it to work as soon as you add a server, or reconnect to the internet after significant downtime. That’s why you need to check the status using commands like ntpq.
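
As a rough illustration of points 2 and 4, the sketch below reads the output of ntpq -p on stdin. It relies on two things I believe are true of the standard peer listing: the peer ntpd has selected for synchronisation is marked with a leading `*`, and the seventh column is the reach register (0 means no recent replies). The function names are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if ntpq -p output (on stdin) contains a
# line starting with "*", i.e. ntpd has selected a system peer.
ntp_has_system_peer() {
    grep -q '^\*'
}

# Hypothetical helper: print how many peers in ntpq -p output (on stdin)
# have a non-zero reach value. The first two lines are the header.
count_reachable_peers() {
    awk 'NR > 2 && $7 != 0 { n++ } END { print n + 0 }'
}
```

Something like `ntpq -p | ntp_has_system_peer || echo "not synced yet"` can then be run a few times after a restart, and `ntpq -p | count_reachable_peers` shows whether you really have the redundancy you think you have.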