How can I get the most frequent 100 numbers out of 4,000,000,000 numbers?

Yesterday in a coding interview I was asked how to get the most frequent 100 numbers out of 4,000,000,000 integers (may contain duplicates), for example:

813972066
908187460
365175040
120428932
908187460
504108776

The first approach that came to my mind was using HashMap:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

static void printMostFrequent100Numbers() throws FileNotFoundException {
    
    // Group unique numbers, key=number, value=frequency
    Map<String, Integer> unsorted = new HashMap<>();
    try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
        while (scanner.hasNextLine()) {
            String number = scanner.nextLine();
            unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
        }
    }

    // Sort by frequency in descending order
    List<Map.Entry<String, Integer>> sorted = new LinkedList<>(unsorted.entrySet());
    sorted.sort((o1, o2) -> o2.getValue().compareTo(o1.getValue()));

    // Print first 100 numbers
    int count = 0;
    for (Map.Entry<String, Integer> entry : sorted) {
        System.out.println(entry.getKey());
        if (++count == 100) {
            return;
        }
    }
}

But it probably would throw an OutOfMemoryError for a data set of 4,000,000,000 numbers. Moreover, since 4,000,000,000 exceeds the maximum length of a Java array, let's say the numbers are in a text file and they are not sorted. I assume multithreading or MapReduce would be more appropriate for a big data set?

How can the top 100 values be calculated when the data does not fit into the available memory?


If the data is sorted, you can collect the top 100 in O(n), where n is the size of the data. Because the data is sorted, the occurrences of each distinct value are contiguous, so counting them while traversing the data once gives you each value's global frequency, which is not available to you when the data is not sorted.

See the sample code below for how this can be done. There is also an implementation (in Kotlin) of the entire approach on GitHub.

Note: Full sorting is not strictly required. What is required is that equal values are contiguous, so no particular ordering needs to be defined. Sorting gives us this property, but perhaps it can be achieved more efficiently.

You can sort the data file using an (external) merge sort in roughly O(n log n): split the input file into smaller files that fit into memory, sort each of them in memory and write them out as sorted files, and then merge the sorted files. A rough sketch of this step follows the example below.



About this code sample:

  • Sorted data is represented by a long[]. Because the logic reads values one by one, it's an OK approximation of reading the data from a sorted file.

  • The OP didn't specify how multiple values with equal frequency should be treated. Consequently, the code only guarantees that the result is some set of top-N values in no particular order; it does not imply that there are no other values with the same frequency.

import java.util.*;
import java.util.Map.Entry;

class TopN {
    private final int maxSize;
    private Map<Long, Long> countMap;

    public TopN(int maxSize) {
        this.maxSize = maxSize;
        this.countMap = new HashMap<>(maxSize);
    }

    private void addOrReplace(long value, long count) {
        if (countMap.size() < maxSize) {
            countMap.put(value, count);
        } else {
            Optional<Entry<Long, Long>> opt = countMap.entrySet().stream().min(Entry.comparingByValue());
            Entry<Long, Long> minEntry = opt.get();
            if (minEntry.getValue() < count) {
                countMap.remove(minEntry.getKey());
                countMap.put(value, count);
            }
        }
    }

    public Set<Long> get() {
        return countMap.keySet();
    }

    public void process(long[] data) {
        long value = data[0];
        long count = 0;

        for (long current : data) {
            if (current == value) {
                ++count;
            } else {
                addOrReplace(value, count);
                value = current;
                count = 1;
            }
        }
        addOrReplace(value, count);
    }

    public static void main(String[] args) {
        long[] data = {0, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7};
        TopN topMap = new TopN(2);

        topMap.process(data);
        System.out.println(topMap.get()); // [5, 6]
    }
}
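
As for the external merge sort mentioned above, a rough sketch could look like the following; the chunk size, file names, and temp-file handling are illustrative assumptions rather than a tuned implementation.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.*;

class ExternalSort {
    // Numbers per in-memory chunk; pick this to fit comfortably in the heap.
    private static final int CHUNK_SIZE = 10_000_000;

    // Split the input into sorted chunk files.
    static List<Path> sortChunks(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<Long> chunk = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(Long.parseLong(line.trim()));
                if (chunk.size() == CHUNK_SIZE) {
                    chunkFiles.add(writeSortedChunk(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                chunkFiles.add(writeSortedChunk(chunk));
            }
        }
        return chunkFiles;
    }

    private static Path writeSortedChunk(List<Long> chunk) throws IOException {
        Collections.sort(chunk);
        Path file = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter writer = Files.newBufferedWriter(file)) {
            for (long value : chunk) {
                writer.write(Long.toString(value));
                writer.newLine();
            }
        }
        return file;
    }

    // One open chunk file plus its current (smallest unread) value.
    private static class Source {
        final BufferedReader reader;
        long current;
        Source(BufferedReader reader, long current) { this.reader = reader; this.current = current; }
    }

    // K-way merge of the sorted chunk files into a single sorted output file.
    static void merge(List<Path> chunkFiles, Path output) throws IOException {
        PriorityQueue<Source> heap =
                new PriorityQueue<>(Comparator.comparingLong((Source s) -> s.current));
        for (Path file : chunkFiles) {
            BufferedReader reader = Files.newBufferedReader(file);
            String line = reader.readLine();
            if (line != null) {
                heap.offer(new Source(reader, Long.parseLong(line)));
            } else {
                reader.close();
            }
        }
        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                Source smallest = heap.poll();
                writer.write(Long.toString(smallest.current));
                writer.newLine();
                String line = smallest.reader.readLine();
                if (line != null) {
                    smallest.current = Long.parseLong(line);
                    heap.offer(smallest);
                } else {
                    smallest.reader.close();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Input file name taken from the question; the output name is made up.
        merge(sortChunks(Paths.get("numbers.txt")), Paths.get("numbers-sorted.txt"));
    }
}

Once the sorted file exists, the TopN logic above can consume its values in place of the in-memory long[].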


Integers are signed 32-bit values, so if only non-negative integers occur, there are at most 2^31 distinct values to track. An array of about 2^31 bytes (roughly 2 GB) sits right at the maximum Java array length, so in practice the largest value or two may need special-casing, but the idea holds.

But that can't hold frequencies higher than 255, you would say? Yes, you're right.

So we add a hashmap for all entries that exceed the maximum count the array can hold (255, or, if the bytes are treated as signed, just start counting at -128). There are at most about 16 million entries in this hashmap (4 billion divided by 255), which is easily manageable.


We have two data structures:

  • a large byte array, indexed by the value read (0 .. 2^31), holding one byte of count per value.
  • a hashmap of (value read, frequency) for the values whose count no longer fits in the byte array.

Algorithm:

 while reading next number 'x'
 {
   if (hashmap.contains(x))
   {
     hashmap[x]++;
   }
   else
   {
     bigarray[x]++;
     if (bigarray[x] > 250)
     {
       hashmap[x] = bigarray[x];
     }
   }
 }

 // when done:
 // Look up top-100 in hashmap
 // if not 100 yet, add more from bigarray, skipping those already taken from the hashmap

I'm not fluent in Java, so can't give a better code example.
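
For illustration only, a rough Java sketch of the pseudocode above (untested, assuming non-negative int inputs; the class and method names are made up for this example) might look like this:

import java.util.*;

// Rough sketch of the "big byte[] plus overflow hashmap" idea above.
class ByteArrayCounter {
    // One byte of count per possible value. The maximum Java array length is
    // slightly below 2^31, so the largest couple of values would need
    // special-casing in a real implementation.
    private final byte[] small = new byte[Integer.MAX_VALUE - 2];
    // Values whose count has outgrown the byte array.
    private final Map<Integer, Long> overflow = new HashMap<>();

    void add(int x) {
        Long big = overflow.get(x);
        if (big != null) {
            overflow.put(x, big + 1);
            return;
        }
        int count = (small[x] & 0xFF) + 1;   // read the byte as unsigned
        if (count > 250) {
            overflow.put(x, (long) count);   // promote to the hashmap
            small[x] = 0;                    // so the final scan won't double-count it
        } else {
            small[x] = (byte) count;
        }
    }

    // Top-n values by frequency, in no particular order. Every overflow entry
    // has a count above 250, so it outranks anything still only in the array.
    List<Integer> top(int n) {
        List<Integer> result = new ArrayList<>(n);
        overflow.entrySet().stream()
                .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
                .limit(n)
                .forEach(e -> result.add(e.getKey()));
        int remaining = n - result.size();
        if (remaining > 0) {
            // Keep the 'remaining' largest counts still in the byte array.
            PriorityQueue<long[]> heap =
                    new PriorityQueue<>(Comparator.comparingLong((long[] a) -> a[1]));
            for (int value = 0; value < small.length; value++) {
                int count = small[value] & 0xFF;
                if (count == 0) continue;
                heap.offer(new long[] {value, count});
                if (heap.size() > remaining) heap.poll();
            }
            while (!heap.isEmpty()) result.add((int) heap.poll()[0]);
        }
        return result;
    }
}

You would call add(x) for every number read from the file and top(100) at the end.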


Note that this algorithm is single-pass, works on unsorted input, and doesn't use external pre-processing steps.

All it assumes is a maximum for the numbers read. It works if the input consists of non-negative integers, which are at most 2^31 - 1. The sample input satisfies that constraint.


The algorithm above should satisfy most interviewers who ask this question. Whether you can code in Java should be established by a different question. This question is about designing data structures and efficient algorithms.


In pseudocode:

  1. Perform an external sort
  2. Do a pass to collect the top 100 frequencies (not which values have them)
  3. Do another pass to collect the values that have those frequencies

Assumption: There are clear winners - no ties (outside the top 100).

Time complexity: O(n log n) (approx) due to sort. Space complexity: Available memory, again due to sort.

Steps 2 and 3 are both O(n) time and O(1) space.


If there are no ties (outside the top 100), steps 2 and 3 can be combined into one pass, which wouldn’t improve the time complexity, but would improve the run time slightly.

If there are ties that would make the number of winners large, you couldn't discover that and take special action (e.g., throw an error or discard all the tied values) without two passes. You could, however, find the smallest 100 values among the ties with one pass.
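
To make steps 2 and 3 concrete, here is one possible sketch over a sorted long[] standing in for the sorted file from step 1; the helper names are hypothetical:

import java.util.*;

// Illustrative sketch of steps 2 and 3 over already-sorted data.
class TwoPassTop100 {
    // Step 2: collect the n largest run lengths (frequencies).
    static PriorityQueue<Long> topFrequencies(long[] sorted, int n) {
        PriorityQueue<Long> top = new PriorityQueue<>(); // min-heap of frequencies
        long current = sorted[0];
        long count = 0;
        for (long v : sorted) {
            if (v == current) {
                count++;
            } else {
                keepLargest(top, count, n);
                current = v;
                count = 1;
            }
        }
        keepLargest(top, count, n);
        return top;
    }

    private static void keepLargest(PriorityQueue<Long> top, long count, int n) {
        top.offer(count);
        if (top.size() > n) top.poll(); // drop the smallest frequency kept so far
    }

    // Step 3: collect the values whose frequency reaches the cut-off.
    static List<Long> valuesWithFrequencyAtLeast(long[] sorted, long cutOff) {
        List<Long> result = new ArrayList<>();
        long current = sorted[0];
        long count = 0;
        for (long v : sorted) {
            if (v == current) {
                count++;
            } else {
                if (count >= cutOff) result.add(current);
                current = v;
                count = 1;
            }
        }
        if (count >= cutOff) result.add(current);
        return result;
    }

    public static void main(String[] args) {
        long[] sorted = {0, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7};
        long cutOff = topFrequencies(sorted, 2).peek(); // smallest of the top frequencies
        System.out.println(valuesWithFrequencyAtLeast(sorted, cutOff)); // [5, 6]
    }
}

Under the stated no-ties assumption, exactly the winning values reach the cut-off frequency.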


But it probably would throw an OutOfMemoryError for a data set of 4,000,000,000 numbers. Moreover, since 4,000,000,000 exceeds the maximum length of a Java array, let's say the numbers are in a text file and they are not sorted.

That depends on the value distribution. If you have 4E9 numbers, but the numbers are integers 1-1000, then you will end up with a map of 1000 entries. If the numbers are doubles or the value space is unrestricted, then you may have an issue.

One note on the counting line:

unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);

With getOrDefault this is correct (a plain get would throw a NullPointerException the first time a number is seen). I would personally use an AtomicLong as the value; it allows incrementing the count without replacing the HashMap entries.
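
For example (a small self-contained sketch, not a drop-in change to the question's code):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

class AtomicLongCounting {
    public static void main(String[] args) {
        Map<String, AtomicLong> counts = new HashMap<>();
        String[] numbers = {"42", "7", "42"};   // stand-in for the lines read from the file
        for (String number : numbers) {
            // The AtomicLong is created once per key and then mutated in place,
            // so later increments don't write a new value into the map.
            counts.computeIfAbsent(number, k -> new AtomicLong()).incrementAndGet();
        }
        System.out.println(counts);             // e.g. {42=2, 7=1}
    }
}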

I assume multithreading or MapReduce would be more appropriate for a big data set? What would be the most efficient solution for this problem?

This is a typical map-reduce exercise, so in theory you could use a multithreaded or MapReduce approach. Maybe that is the goal of your exercise, and you are supposed to implement multithreaded map-reduce tasks regardless of whether that is the most efficient way.

In reality, you should calculate whether it is worth the effort. If you are reading the input serially (as your code does with the Scanner), then it is definitely not. If you can split the input file and read multiple parts in parallel, then, depending on the I/O throughput, it may be (a rough sketch of such parallel counting follows at the end of this answer).

Or, if the value space is too large to fit into memory and you need to downscale the data set, you may have to consider a different approach.
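
To make the parallel map-reduce idea above concrete, here is a hedged sketch of counting chunks in parallel and then merging the partial maps; how the real input file is split into chunks is assumed to happen elsewhere.

import java.util.*;
import java.util.concurrent.*;

// Hedged sketch: count each chunk of the input in parallel ("map"), then
// merge the partial maps ("reduce").
class ParallelCount {
    static Map<Long, Long> countChunk(long[] chunk) {
        Map<Long, Long> counts = new HashMap<>();
        for (long value : chunk) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        long[][] chunks = {            // stand-ins for partitions of the input file
            {5, 5, 6, 7},
            {5, 6, 6, 8}
        };

        ExecutorService pool = Executors.newFixedThreadPool(chunks.length);
        List<Future<Map<Long, Long>>> partials = new ArrayList<>();
        for (long[] chunk : chunks) {
            Callable<Map<Long, Long>> task = () -> countChunk(chunk);
            partials.add(pool.submit(task));
        }

        // Reduce: merge the partial counts into one map.
        Map<Long, Long> total = new HashMap<>();
        for (Future<Map<Long, Long>> partial : partials) {
            partial.get().forEach((value, count) -> total.merge(value, count, Long::sum));
        }
        pool.shutdown();

        System.out.println(total);     // e.g. {5=3, 6=3, 7=1, 8=1}
    }
}

The top-100 selection would then run on the merged map, or on a bounded structure like the TopN class from the first answer.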