Technical details for Server 2012 de-duplication feature

Now that Windows Server 2012 ships with de-duplication for NTFS volumes, I am having a hard time finding technical details about it. I can deduce from the TechNet documentation that the de-duplication action itself is an asynchronous process - not unlike how the SIS Groveler used to work - but there is virtually no detail about the implementation (algorithms used, resources needed; even the information on performance considerations is little more than a bunch of rule-of-thumb-style recommendations).

Insights and pointers are greatly appreciated; a comparison to Solaris' ZFS de-duplication efficiency for a set of scenarios would be wonderful.


Solution 1:

As I suspected, it's based on the VSS subsystem (source), which also explains its asynchronous nature. The de-dupe chunks are stored in \System Volume Information\Dedup\ChunkStore\*, with settings in \System Volume Information\Dedup\Settings\*. This has significant implications for how your backup software interacts with such volumes, which is explained in the linked article (in brief: without dedupe support your backups will be the same size as they always are; with dedupe support you'll just back up the much smaller dedupe store).
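
To get a feel for why the post-dedupe backup can be so much smaller than the logical data, here's a minimal sketch of chunk-level dedupe accounting. It's my own illustration, not Microsoft's implementation: it uses naive fixed-size chunks where the real feature uses variable-sized ones, and SHA-256 stands in for whatever chunk hash the chunk store actually keys on. It estimates what fraction of a directory tree would survive dedupe:

```python
import hashlib
import os
import sys

CHUNK_SIZE = 64 * 1024  # fixed-size chunks for simplicity; the real feature uses variable sizes

def estimate_dedupe(root):
    """Hash fixed-size chunks of every file under `root`; report total vs. unique bytes."""
    seen = set()
    total = unique = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while True:
                        chunk = f.read(CHUNK_SIZE)
                        if not chunk:
                            break
                        total += len(chunk)
                        digest = hashlib.sha256(chunk).digest()
                        if digest not in seen:
                            seen.add(digest)       # first time we see this chunk
                            unique += len(chunk)   # only unique chunks cost storage
            except OSError:
                continue  # skip unreadable files
    return total, unique

if __name__ == "__main__":
    total, unique = estimate_dedupe(sys.argv[1])
    pct = 100.0 * unique / total if total else 100.0
    print(f"total: {total} bytes, unique: {unique} bytes ({pct:.1f}% left after dedupe)")
```

The `unique` figure is roughly what a dedupe-aware backup would have to move; the `total` figure is what a dedupe-unaware backup moves after the volume rehydrates everything.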

As for the methods used, the best I could find was a research paper published by a Microsoft researcher in 2011 (source, fulltext) at the USENIX FAST '11 conference. Section 3.3 goes into Deduplication in Primary Storage. It seems likely that this data was used in the development of the NTFS dedupe feature. One quote in particular stands out:

The canonical algorithm for variable-sized content-defined blocks is Rabin Fingerprints [25].
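
To make the quoted technique concrete, here is a minimal Python sketch of variable-sized, content-defined chunking. It uses a plain polynomial rolling hash as a simplified stand-in for a true Rabin fingerprint (which works over an irreducible polynomial in GF(2)), and the window size, mask, and chunk limits are illustrative parameters of my choosing, not anything Microsoft has published:

```python
WINDOW = 48            # bytes in the rolling window
BASE = 257             # multiplier for the polynomial rolling hash
MOD = (1 << 61) - 1    # large prime modulus
MASK = (1 << 13) - 1   # boundary when the low 13 bits are all set -> avg chunk ~8 KiB
MIN_CHUNK = 2 * 1024   # never cut chunks smaller than this
MAX_CHUNK = 64 * 1024  # force a cut if no boundary appears by this point

def chunk_boundaries(data: bytes):
    """Yield end offsets of variable-sized, content-defined chunks.

    Boundaries depend on the bytes inside the rolling window, not on file
    offsets, so an insertion near the start of a file only disturbs the
    chunks around the edit; everything after them re-aligns.
    """
    pow_w = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    start = 0
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # slide the window: drop the contribution of the outgoing byte
            h = (h - data[i - WINDOW] * pow_w) % MOD
        h = (h * BASE + b) % MOD
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == MASK) or length >= MAX_CHUNK:
            yield i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield len(data)  # final partial chunk
```

The payoff over fixed-size chunking: shift a file's contents by a byte and most boundaries still land on the same content, so the shifted data keeps hashing to the same chunk-store entries. That resilience to insertions is exactly why variable-sized chunks are the canonical choice for a primary-storage chunk store.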

There is a lot of data in the paper to sift through, but the complexity of the toolset they used, combined with the features we know are already in 2012, strongly suggests that the reasoning in the paper informed the development of the feature. We can't know for certain without MSDN articles, but this is as close as we're likely to get for the time being.

Performance comparisons with ZFS will have to wait until the benchmarkers get done with it.