Windows DFSR - Changed replicated directory permissions and now have a 350,000 backlog for more than a week

Solution 1:

Very strange problem, especially after reviewing the edit.

I would inspect the DFSR debug log, which is located in %systemroot%\debug. By default there should be 9 previous log files that have been GZ archived, and one that is currently being written to.

Open the current log in a text editor and search for the text "warning" or "error". You can check out this blog series for more detailed information on the debug logs: http://blogs.technet.com/b/askds/archive/2009/03/23/understanding-dfsr-debug-logging-part-1-logging-levels-log-format-guid-s.aspx
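If you would rather not open each archive by hand, the same search can be scripted. This is only a rough sketch: it assumes the logs are named Dfsr*.log (current) and Dfsr*.log.gz (archived), that they can be read as text, and that a plain case-insensitive substring match is good enough. The function names are made up for illustration.

```python
import gzip
import re
from pathlib import Path

# Assumption: current and archived logs both match this pattern.
LOG_PATTERN = "Dfsr*.log*"

# Case-insensitive substring match, like a text editor search.
KEYWORDS = re.compile(r"warning|error", re.IGNORECASE)


def open_log(path):
    """Open a plain or gzip-compressed log file as text."""
    if path.suffix.lower() == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def scan_logs(log_dir):
    """Return (file name, line number, line text) for every matching line."""
    hits = []
    for path in sorted(Path(log_dir).glob(LOG_PATTERN)):
        with open_log(path) as handle:
            for line_no, line in enumerate(handle, start=1):
                if KEYWORDS.search(line):
                    hits.append((path.name, line_no, line.rstrip()))
    return hits
```

Calling scan_logs(r"C:\Windows\debug") on the server (or on a copy of the log folder) and printing the result gives one list of hits across all log files, which makes it easier to spot a warning that repeats.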

Other questions/suggestions:

Is there anything out of place when looking at the Resource Monitor, such as disk or CPU activity well above your normal baseline?

If possible, I'd restart both the Alpha and Beta servers. If that resolves your issue you may never know what the real problem was, but if it's critical that this is resolved soon, it is worth a try.

Edit based on Question Update

You mentioned two entries related to an 850 MB file, as well as an error within the DFSR debug log.

Can you try changing the Staging Location to a different folder or drive on each server? That would rule out the possibility that the files currently being staged are corrupt or are blocking replication in some way.

Solution 2:

You can tweak the replication schedule to allow DFS-R to replicate at full-speed during off hours (or even on hours if appropriate).

You can also try increasing the staging size on the backlogged server. It should improve performance in this situation.

You don't mention whether or not replication bandwidth is capped, but I assume it is since you are replicating across a WAN.
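For the staging size, a commonly cited Microsoft guideline is that the staging quota should be at least the combined size of the 32 largest files in the replicated folder. Here is a rough sketch for estimating that number. It assumes the script can read the whole replicated folder and that the default DfsrPrivate subfolder should be skipped. The function names and the count parameter are just for illustration, so treat the result as a starting point rather than a tuned value.

```python
import heapq
import os


def largest_file_sizes(root, count=32):
    """Return the sizes in bytes of the `count` largest files under root."""
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip DFSR's own working folder (staging, conflict data, etc.).
        dirnames[:] = [d for d in dirnames if d.lower() != "dfsrprivate"]
        for name in filenames:
            try:
                sizes.append(os.path.getsize(os.path.join(dirpath, name)))
            except OSError:
                # File vanished or is inaccessible; ignore it.
                continue
    return heapq.nlargest(count, sizes)


def minimum_staging_quota_mb(root, count=32):
    """Sum the largest files and round up to whole megabytes."""
    total_bytes = sum(largest_file_sizes(root, count))
    return -(-total_bytes // (1024 * 1024))  # ceiling division
```

Run minimum_staging_quota_mb against the replicated folder path on the backlogged server and compare the result with the staging quota currently configured. If the configured quota is smaller, that is a good reason to raise it.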

Solution 3:

My experience is that this is Just How It Works.

I stumbled across this after updating security on a fairly small collection of 4 DFS replication groups (550 GB data, 58k files, 3.4k folders total). Data actually transmitted on the wire is low so it appears not to be moving entire files for just security changes, but disk activity feels like the entire hierarchy is being recopied -- sustained disk transfer rates between 60-100 MB/sec, and disk queues of 30, peaking as high as 500 on SSD tiered storage space.

My sense is that DFS has a lot of churn in its staging and destaging process, which results in extreme disk I/O. An initial replication between two gigabit LAN connected boxes takes several times longer than a plain file copy of the same data between them, which suggests that every byte replicated requires multiple bytes of disk reads and writes.

Security updates don't seem to have any special replication logic barring the use of the 2012 claims-based security (which isn't widely used AFAICT), resulting in the same stage/destage churn you would get for data changes.