Why would a domain controller encounter a USN rollback after an unclean shutdown?

Solution 1:

I thought about this for a few hours today. It's a bit perplexing, but as I indicated in my comment, my best guess is one of two things. Either you have some sort of disk caching going on, and the power outage/dirty shutdown wipes out the contents of the cache before they are committed to disk. Or, since ntds.dit lives on a RAID volume, the power outage might be causing that volume to temporarily break or become incoherent, even if only for a moment.

The party line on USN rollbacks is that they happen when a DC is restored to a state it was in earlier in time, the classic example being a virtualized DC restored from a snapshot. I know that doesn't apply to you exactly... but even in the case of a disk with a write cache, you can think of the data that is physically on the disk as containing a "previous state," while the write cache is what actually contains the most up-to-date state of the DC... even if the two states are only half a second apart. If the cache is lost, the DC comes back up in that previous state.
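To make the "previous state" idea concrete, here is a toy sketch. It is not how AD actually stores or compares USNs; the class, the two counters, and the `looks_like_rollback` helper are all hypothetical names made up for illustration. The only real concepts it leans on are that a DC hands out increasing USNs and that partners remember the highest USN they have seen from a given database identity (invocation ID).

```python
from dataclasses import dataclass


@dataclass
class ToyDC:
    """Toy domain controller: not real AD code, just two counters."""
    invocation_id: str
    durable_usn: int = 0   # highest USN physically written to disk
    current_usn: int = 0   # highest USN handed out (may live only in a cache)

    def commit_change(self) -> int:
        """Assign the next USN; the write may still be sitting in a cache."""
        self.current_usn += 1
        return self.current_usn

    def flush(self) -> None:
        """Everything handed out so far reaches the disk."""
        self.durable_usn = self.current_usn

    def dirty_shutdown(self) -> None:
        """Power loss: unflushed writes vanish, the DC restarts from disk."""
        self.current_usn = self.durable_usn


def looks_like_rollback(dc: ToyDC, seen_invocation_id: str, seen_usn: int) -> bool:
    """True if a partner already saw a higher USN from this same database identity."""
    return dc.invocation_id == seen_invocation_id and dc.current_usn < seen_usn


dc = ToyDC(invocation_id="dc1-db")
for _ in range(100):
    dc.commit_change()
dc.flush()                      # USNs 1..100 are on disk
for _ in range(5):
    dc.commit_change()          # USNs 101..105 exist only in the cache

partner_seen = dc.current_usn   # a partner replicated up to USN 105
dc.dirty_shutdown()             # cache contents are lost

print(dc.current_usn, partner_seen, looks_like_rollback(dc, "dc1-db", partner_seen))
# prints: 100 105 True
```

The DC is back at USN 100 with the same identity, while its partner remembers 105. That gap is what a snapshot restore produces too, just over a much longer time span.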

Ruminate on these comments from Microsoft:

Guidelines for virtualized domain controllers

Virtual SCSI disks provide increased performance compared to virtual IDE and they support Forced Unit Access (FUA). FUA ensures that the operating system writes and reads data directly from the media bypassing any and all caching mechanisms.

I know that your DC is not a VM, but the concept still applies. Disk caching and DCs do not mix, which is why installing Active Directory turns write caching off as a Windows policy. You can still have caching mechanisms in your hardware RAID controller, etc., though.

Scenario B: Starting Active Directory from other drives in a broken mirror

  1. Promote a domain controller. Locate the Ntds.dit file on a mirrored drive.

  2. Break the mirror.

  3. Continue to inbound replicate and outbound replicate by using the Ntds.dit file on the first drive in the mirror.

  4. Start the domain controller by using the Ntds.dit file on the second drive in the mirror.

That's a replication killer that has bitten me a lot on physical DCs with RAID 1 volumes. I've never personally had an actual USN rollback caused by it, but it will kill replication on that DC. I mean, imagine a RAID 1 volume of 2 disks. 1 drive dies. You remove it, pop in a new drive... aaaaaaand DSA Not Writable.
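Scenario B can be sketched the same way. This is a rough illustration under heavy assumptions: each mirror half is modelled as a plain dict of USN to change, and `replicate` is a hypothetical stand-in for a partner that only pulls changes above the highest USN it has already seen. Real replication is far more involved, but the sketch shows why a DC started from the stale half is a problem.

```python
from typing import Dict


def replicate(source: Dict[int, str], partner_db: Dict[int, str], watermark: int) -> int:
    """Partner pulls only changes with a USN above its watermark; returns the new watermark."""
    for usn in sorted(source):
        if usn > watermark:
            partner_db[usn] = source[usn]
            watermark = usn
    return watermark


# Both halves of the mirror start out identical: USNs 1..3.
drive_a = {1: "create alice", 2: "create bob", 3: "create carol"}
drive_b = dict(drive_a)          # the mirror breaks here; drive_b goes stale

partner_db = {}
watermark = replicate(drive_a, partner_db, 0)

# The DC keeps running on drive A and originates two more changes.
drive_a[4] = "delete bob"
drive_a[5] = "create dave"
watermark = replicate(drive_a, partner_db, watermark)   # partner has now seen USN 5

# The DC is then started from stale drive B and reuses USNs 4 and 5.
drive_b[4] = "create erin"
drive_b[5] = "reset carol password"
watermark = replicate(drive_b, partner_db, watermark)   # nothing is above 5, so nothing is pulled

print(partner_db[4], "|", drive_b[4])
# prints: delete bob | create erin
```

The partner and the restarted DC now disagree about what USN 4 means, and the partner never asks for the new changes because it believes it already has them.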

From the AskDS blog:

If you do not have uninterruptable power supplies (UPS) for your VM hosts or the storage disk where the active directory database resides, then ensure write-caching is disabled on the virtual machine’s host computer. Please refer this link for additional guidance. Conversely, if the write caching needs to stay enabled for the VM host which hosts the DC, then install a UPS to avoid damage to the DC(s).

Again, it's talking about virtualized DCs, but the disk caching concept applies to physical DCs as well.

So there's my idea. I think it's got something to do with your storage system. You definitely want to disable any and all caching mechanisms, at least on the volume holding ntds.dit, especially if you're prone to power outages.