Given a 1 TB data set on disk with around 1 KB per data record, how can I find duplicates using 512 MB RAM and infinite disk space?

The solutions offered so far seem too complicated. A Bloom filter, while being the data structure du jour for the last several years, isn't best applied in a situation like this: because no data can be associated with the hashed content, you must not only maintain the Bloom filter, but you must still record each (only 6-bit!) hash value and record to disk, destroying the benefit of the Bloom filter and leaving you with a preposterously high collision rate.

On the other hand, merge sorting the entire terabyte is not only going to take O(n log n) comparisons, but O(n log n) disk traffic, since the majority of the intermediate files would have to be merged from disk, rather than from memory. Any real solution should try to reduce disk traffic as much as possible, since that's our primary bottleneck.

My solution is simple, making one assumption: that the terabyte of data is recorded in what's effectively one file.

Iterate through the records of the terabyte file and hash them. A cryptographic hash is unnecessary, costly and too large here; instead, use something like the 64-bit version of MurmurHash. It can hash more than 2 GiB/sec (far faster than we'll likely need, given the speed of storage these days) and has excellent (though not cryptographically secure) collision resistance. With a 64-bit hash, we would expect our first collision around 2^32 records (the birthday bound), so it's probable that our approximately one billion records will not have any collisions at all.

Write the hashes and their associated record offsets out to another file. Since the hashes and offsets are arbitrary binary data, we can't rely on Unix's sort(1) to sort that file, because some of the bytes may be interpreted as newlines. We'll simply write the entries out as fixed-width records (probably 16 bytes: 8 bytes for the 64-bit MurmurHash2 hash, and 8 bytes for the offset in the terabyte file). The resulting file should be about 16 GB, given our number of records.
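For concreteness, here's a rough sketch of that first pass. It assumes fixed 1 KB records (the record layout isn't specified in the question), substitutes 64-bit FNV-1a for MurmurHash just to keep the sketch self-contained, and makes up the file names:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // 64-bit FNV-1a, standing in for MurmurHash64A to keep the sketch self-contained.
    static uint64_t hash64(const unsigned char* data, size_t len) {
        uint64_t h = 14695981039346656037ULL;
        for (size_t i = 0; i < len; ++i) {
            h ^= data[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    struct Entry {            // fixed-width 16-byte entry: hash + offset
        uint64_t hash;
        uint64_t offset;
    };

    int main() {
        const size_t kRecordSize = 1024;                  // assumed: fixed 1 KB records
        std::vector<unsigned char> record(kRecordSize);
        std::FILE* in  = std::fopen("data.bin", "rb");    // the 1 TB input (name assumed)
        std::FILE* out = std::fopen("hashes.bin", "wb");  // ~16 GB of (hash, offset) entries
        if (!in || !out) return 1;                        // error handling kept minimal
        uint64_t offset = 0;
        while (std::fread(record.data(), 1, kRecordSize, in) == kRecordSize) {
            Entry e = { hash64(record.data(), kRecordSize), offset };
            std::fwrite(&e, sizeof e, 1, out);
            offset += kRecordSize;
        }
        std::fclose(in);
        std::fclose(out);
        return 0;
    }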

We can sort this file by reading as many entries as will safely fit into memory, sorting them, and flushing the sorted chunks back to disk. We can fit more entries into memory with a heapsort (it uses O(1) space) than with a quicksort (which uses O(log n) memory for the call stack), but in most implementations quicksort wins by virtue of its memory locality and lower instruction count. These intermediate files (there should be 35-40 of them) will be written out to disk.
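A minimal sketch of the chunked sort, reusing the 16-byte entry layout from the previous sketch; the chunk size of 28M entries (about 448 MiB) is an arbitrary choice that leaves some headroom under 512 MB:

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Entry { uint64_t hash, offset; };          // same 16-byte layout as before

    int main() {
        const size_t kChunkEntries = 28u * 1024 * 1024;   // ~448 MiB per chunk
        std::vector<Entry> chunk(kChunkEntries);
        std::FILE* in = std::fopen("hashes.bin", "rb");
        if (!in) return 1;
        int run = 0;
        for (;;) {
            size_t n = std::fread(chunk.data(), sizeof(Entry), kChunkEntries, in);
            if (n == 0) break;
            // Sort this chunk by hash and flush it as one sorted run.
            std::sort(chunk.begin(), chunk.begin() + n,
                      [](const Entry& a, const Entry& b) { return a.hash < b.hash; });
            std::string name = "run." + std::to_string(run++);
            std::FILE* out = std::fopen(name.c_str(), "wb");
            std::fwrite(chunk.data(), sizeof(Entry), n, out);
            std::fclose(out);
        }
        std::fclose(in);
        return 0;
    }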

The last step is to merge these files (in memory; there's no need to store a result on disk for this), collecting all hash collisions, looking up the associated records in the terabyte file, comparing the records for duplication, and emitting the records (or their offsets) in whatever way the problem specifies.
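The merge could look roughly like the sketch below, again with assumed file and run names; the byte-for-byte verification against the terabyte file is only indicated by a comment:

    #include <cstdint>
    #include <cstdio>
    #include <queue>
    #include <string>
    #include <vector>

    struct Entry  { uint64_t hash, offset; };
    struct Cursor { Entry e; std::FILE* f; };
    struct ByHash {
        bool operator()(const Cursor& a, const Cursor& b) const { return a.e.hash > b.e.hash; }
    };

    int main() {
        const int kRuns = 36;                         // number of sorted run files (assumed)
        std::priority_queue<Cursor, std::vector<Cursor>, ByHash> heap;
        for (int i = 0; i < kRuns; ++i) {
            std::string name = "run." + std::to_string(i);
            std::FILE* f = std::fopen(name.c_str(), "rb");
            Cursor c{{}, f};
            if (std::fread(&c.e, sizeof(Entry), 1, f) == 1) heap.push(c);
            else std::fclose(f);
        }
        bool have_prev = false;
        Entry prev{};
        while (!heap.empty()) {
            Cursor c = heap.top();                    // next entry in global hash order
            heap.pop();
            if (have_prev && c.e.hash == prev.hash) {
                // Candidate duplicate: read the records at prev.offset and c.e.offset
                // from the terabyte file and compare them byte for byte before reporting.
                std::printf("candidate: offsets %llu and %llu\n",
                            (unsigned long long)prev.offset,
                            (unsigned long long)c.e.offset);
            }
            prev = c.e;
            have_prev = true;
            if (std::fread(&c.e, sizeof(Entry), 1, c.f) == 1) heap.push(c);
            else std::fclose(c.f);
        }
        return 0;
    }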

As far as I can tell, this approach hits the disk significantly less than any other solution offered so far, and it's conceptually very simple: hash the records, look for duplicates among the hashes, and verify against the actual records.

For disk I/O, it would read the terabyte data file, write 16 GB to disk, read that 16 GB back and write it out sorted, then read the sorted data and report the duplicates. As an optimization, the hashing process could accumulate entries in memory and sort them before flushing to disk: that cuts out the unsorted 16 GB intermediate file and lets the process move directly from hashing to merging and reporting duplicates.


Use a Bloom filter: one large bit table addressed by several hashes at once. According to Wikipedia, the optimal number of hash functions is ln(2) * 2^32 / 2^30 ≈ 2.77, so round to 3. (Hmm, plugging in 4 gives fewer false positives, but 3 is still better for this application: it means less hash data written per record.) This means you have a table of 512 megabytes, i.e. 4 gigabits, and processing each record sets three new bits in that vast sea. If all three bits were already set, it's a potential match: record the three hash values to a file. Otherwise, record them to another file. Note the record index along with each match.
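A minimal sketch of such a filter, assuming the three 32-bit hashes are derived from a single 64-bit hash by double hashing (any three reasonably independent hash functions would do):

    #include <cstdint>
    #include <vector>

    class BloomFilter {
    public:
        BloomFilter() : bits_(1ULL << 29, 0) {}       // 2^32 bits = 512 MiB, stored as bytes

        // Sets the record's three bits and returns true if all three were already set
        // (i.e. the record is a potential match).
        bool test_and_set(uint64_t h64, uint32_t out_hashes[3]) {
            uint32_t h1 = (uint32_t)h64;
            uint32_t h2 = (uint32_t)(h64 >> 32) | 1u; // odd step, so it can reach all 2^32 bits
            bool all_set = true;
            for (uint32_t i = 0; i < 3; ++i) {
                uint32_t h = h1 + i * h2;             // i-th derived hash, implicitly mod 2^32
                out_hashes[i] = h;
                uint8_t mask = (uint8_t)(1u << (h & 7u));
                all_set = all_set && (bits_[h >> 3] & mask) != 0;
                bits_[h >> 3] |= mask;
            }
            return all_set;
        }

    private:
        std::vector<uint8_t> bits_;                   // the 4-gigabit table
    };

For each record you would compute one 64-bit hash, call test_and_set, and then append the three 32-bit hash values (plus, for matches, the record index) to the match file or to the other file, as described above.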

(If a 5% error rate is tolerable, omit the large file and use the small file as your results.)

When finished, you should have a file of about 49M possible positive matches and a file of 975M negatives which may yet match positives. Read the former into a vector<pair<vector<uint32_t>,vector<uint32_t> > > (the first vector holds the three hash values and could just be an array; the second will collect the indexes of matching records) and sort it. Put the positives' record indexes in another vector<uint32_t>; they're already sorted. Now read the large file, but instead of setting bits in a table, look each record's hash values up in the sorted vector (for example, with equal_range). Use the list of positive-record indices to reconstruct the original index of the current record in the negative file. If no match is found, ignore the record. Otherwise, append the record's index: match->second.push_back(current_negative_record_index).
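Roughly, that second pass could look like the sketch below. The file names and on-disk layout (three uint32 hashes plus a uint32 record index per positive entry, three hashes per negative entry) are assumptions, and fitting ~49M such buckets into the 512 MB budget would require a more compact representation than this illustration uses:

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <utility>
    #include <vector>

    using Triple = std::array<uint32_t, 3>;
    using Bucket = std::pair<Triple, std::vector<uint32_t>>;  // hash triple -> record indices

    struct ByTriple {
        bool operator()(const Bucket& a, const Bucket& b) const { return a.first < b.first; }
        bool operator()(const Bucket& a, const Triple& t) const { return a.first < t; }
        bool operator()(const Triple& t, const Bucket& b) const { return t < b.first; }
    };

    int main() {
        // Load the ~49M positives: three 32-bit hashes plus the record index each.
        std::vector<Bucket> positives;
        std::vector<uint32_t> pos_indices;       // positives' own record indices, already in order
        std::FILE* pos = std::fopen("positives.bin", "rb");
        uint32_t rec[4];
        while (std::fread(rec, sizeof(uint32_t), 4, pos) == 4) {
            Bucket b;
            b.first = {rec[0], rec[1], rec[2]};
            b.second.push_back(rec[3]);          // seed the bucket with the positive's own index
            positives.push_back(std::move(b));
            pos_indices.push_back(rec[3]);
        }
        std::fclose(pos);
        std::sort(positives.begin(), positives.end(), ByTriple{});

        // Stream the ~975M negatives (three hashes each). The original index of each
        // negative is reconstructed by skipping the indices that went to the positives.
        std::FILE* neg = std::fopen("negatives.bin", "rb");
        uint32_t h[3];
        uint32_t record_index = 0;
        size_t next_pos = 0;
        while (std::fread(h, sizeof(uint32_t), 3, neg) == 3) {
            while (next_pos < pos_indices.size() && pos_indices[next_pos] == record_index) {
                ++next_pos;                      // this index belongs to a positive record
                ++record_index;
            }
            Triple t{h[0], h[1], h[2]};
            auto range = std::equal_range(positives.begin(), positives.end(), t, ByTriple{});
            for (auto it = range.first; it != range.second; ++it)
                it->second.push_back(record_index);
            ++record_index;
        }
        std::fclose(neg);
        // Any bucket whose index vector now has more than one entry is a candidate set of
        // duplicates to verify against the original records in the terabyte file.
        return 0;
    }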

Finally, iterate through the map and the vectors of record-indices. Any bucket with more than one entry is "almost" certain to contain a set of duplicates, but you've come this far, so look them up and compare them completely to be sure.

Total synchronous disk I/O: (one pass = 1 TiB) + (96 hash bits per record = 12 GiB) + (32 index bits per positive = ~200 MiB).

Final edit (seriously): on second thought, the Bloom filter aspect might not really be helping here. The amount of hash data is more of a limiting factor than the number of false positives. With just one hash function, the total amount of hash data would be 4 GiB, and the indexes of the 124 million expected false positives would be ~500 MiB. That should globally optimize this strategy.

Clarification (got a downvote): there's a distinction between a false positive from the Bloom filter and a hash collision. A hash collision can't be resolved except by returning to the original records and comparing them, which is expensive. A Bloom false positive can be resolved by returning to the original hash values and comparing them, which is what the second pass of this algorithm does. So on second thought, the one-hash filter described in the "final" edit would cause an undue number of disk seeks. A two-hash Bloom filter would put more false positives into a single bucket of the match map, but it would bring the total number of false positives back down to the tens of millions.


That's a lot of records ;-) on the order of 1,000,000,000. We'd better be smart about it...

The nature of the records is unspecified: do we just discover them, one at a time, by reading them sequentially, or is there some kind of index, or are they perhaps stored as files in various directories? Also unspecified in the question is the availability of a DBMS which we could use for index-like data (rather than having to sort it with our own code). Finally, an [even rough] idea of the number of duplicates would help direct some of the choices towards an efficient process.

If no index exists, we can/should create one; this could be done as the first pass through the data. The same pass would be used to produce a message digest (a hash) of sorts for each record (or possibly, for efficiency purposes, for the first few hundred bytes of the record).

The general idea is to quickly produce an index that can be used to identify possible duplicates, and then to finalize the list of actual duplicates, possibly through parallel processing.

The info useful in the index would be:

  • length of record
  • first few bytes of the text
  • hash code (more on this below)
  • also the offset in the file, or whatever pointer locates the data; of course, unlike the above 3 elements, this can't be used for identifying potential matches.

The choice of the hash is critical: we should favor a fast algorithm at the expense of one that is perfectly distributed. The number of bytes hashed for each record is also a compromise; maybe 100 to 200 bytes (i.e. circa 10 to 20% of the average record size) is a good value, depending on the expected ratio of duplicates and on the time saving this provides compared with hashing the whole record. (See edit below.)
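To make the index concrete, here is one possible shape for an entry, with purely illustrative choices (8 prefix bytes, hashing the first 128 bytes) and FNV-1a standing in for whichever fast hash is picked:

    #include <algorithm>
    #include <cstdint>
    #include <cstring>

    // FNV-1a here is only a stand-in for whichever fast, non-cryptographic hash is chosen.
    static uint64_t fast_hash64(const uint8_t* p, size_t n) {
        uint64_t h = 14695981039346656037ULL;
        for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 1099511628211ULL; }
        return h;
    }

    struct IndexEntry {
        uint32_t length;      // length of the record
        uint8_t  prefix[8];   // first few bytes of the record
        uint64_t hash;        // fast hash of the first ~128 bytes (see discussion above)
        uint64_t offset;      // where the record lives; not used for matching
    };

    // Build an entry from a record that has already been read into memory.
    IndexEntry make_entry(const uint8_t* record, uint32_t length, uint64_t offset) {
        IndexEntry e{};
        e.length = length;
        std::memcpy(e.prefix, record, std::min<size_t>(sizeof e.prefix, length));
        e.hash   = fast_hash64(record, std::min<uint32_t>(length, 128u));
        e.offset = offset;
        return e;
    }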

Once such an index is available, we can [relatively quickly/effortlessly] obtain a count of possible duplicates; based on this result, a second pass aimed at improving the quality of the index can be made if it is not deemed selective enough (leaving out the records which are readily deemed unique). This second pass can compute another hash, on the whole record (excluding the first x bytes covered by the first hash), or on yet another subset of the record. Note that thanks to the index, this second pass can be multi-threaded where possible.

The second or final pass requires sorting the records within each group of possible matches (same length, same hash code(s), same first x bytes). This can be achieved as described by Pax Diablo; the advantage of the index is that such an operation can, again, be multi-threaded and involves much smaller sets (many of them). Added: here again Nick Johnson makes the great point that the second pass could possibly be unnecessary if we used a long hash code (he suggests a long hash such as SHA-1). Assuming there is no gain in partially hashing the records, this is a very plausible solution, since the index could reside on disk and yet be sorted and stored more quickly than if we were sorting/storing the whole records.
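A sketch of that final pass, reusing the illustrative index entry from above: entries that agree on (length, prefix, hash) form a group, and only the records within a group need to be fetched and compared byte for byte; since the groups are independent, they could be handed out to worker threads.

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <tuple>
    #include <utility>
    #include <vector>

    struct IndexEntry {       // same shape as the index entry sketched earlier
        uint32_t length;
        uint8_t  prefix[8];
        uint64_t hash;
        uint64_t offset;
    };

    // Group key: records can only be duplicates if length, prefix and hash all agree.
    static std::tuple<uint32_t, uint64_t, uint64_t> key(const IndexEntry& e) {
        uint64_t p;                              // pack the 8 prefix bytes into one integer
        std::memcpy(&p, e.prefix, sizeof p);
        return std::make_tuple(e.length, p, e.hash);
    }

    // Returns the groups of record offsets that still need a full byte-for-byte comparison.
    // Each group is independent, so they could be handed out to worker threads.
    std::vector<std::vector<uint64_t>> candidate_groups(std::vector<IndexEntry> index) {
        std::sort(index.begin(), index.end(),
                  [](const IndexEntry& a, const IndexEntry& b) { return key(a) < key(b); });
        std::vector<std::vector<uint64_t>> groups;
        for (size_t i = 0; i < index.size(); ) {
            size_t j = i + 1;
            while (j < index.size() && key(index[j]) == key(index[i])) ++j;
            if (j - i > 1) {                     // more than one entry: possible duplicates
                std::vector<uint64_t> group;
                for (size_t k = i; k < j; ++k) group.push_back(index[k].offset);
                groups.push_back(std::move(group));
            }
            i = j;
        }
        return groups;
    }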


Edit: Nick Johnson makes the excellent point that the latency of seeks in disk storage may be such that a plain sequential read is faster, and that since the bottleneck is disk I/O, a fast hash function run concurrently with the read may effectively be faster than the sequential read itself and hence not add to the overall time. This is a likely possibility (particularly if a sequential read is effectively required to detect each record's start/end, etc.), and that's why I "hedged my bet" by writing "depending on the time saving this provides...". That said, the actual structure of the records on disk is one of the open parameters of the question (for example, if we're just reading individual files in various directories, a non-sequential read is imposed on us), and terabyte-sized storage is likely backed by a fancy RAID, where seek latency, while remaining a concern, is typically much improved.
I stand by my suggestion that a two-pass approach may be more efficient than one in which each record is completely hashed, but I wish I had stressed the possibility and benefits of the single-pass approach. As with many interview questions, several characteristics of the situation at hand were left unspecified; the idea is not so much to see the applicant supply the absolute right answer (although some answers may be quite wrong!) but rather to gain insight into his/her thought process and ability to identify options and decision points.