Efficiency of very large collections; iteration and sort

Solution 1:

This might not be a direct answer, but : it is a way that I've used successfully for a similar system of similar scale. This is for the "tag engine" that drives the question lists here on Stack Overflow; Essentially, I have a:

struct Question {
    // basic members - score, dates, id, etc - no text
}

and basically an oversized Question[] (actually I use a Question* in unmanaged memory, but that's because I need to be able to share it with some GPU code for unrelated reasons). Populating the data is just taking out successive rows in the Question[]. This data is never sorted - it is left alone as the source data - with just append (new key) or overwrite (same key); at worst we might need to reallocate and block-copy the data to a new array if we reach max capacity.

Now, instead of sorting that data, I separately keep an int[] (actually int* for the same reason as before, but... meh), where each value in the int[] is the index of the actual data in the Question[]. So initially it may be 0, 1, 2, 3, 4, 5, ... (although I pre-filter this, so it only contains the rows I want to keep - removing "deleted" etc).

using either a modifier parallel quicksort (see http://stackoverflow.com/questions/1897458/parallel-sort-algorithm) or a modified "introspective sort" (like here) - so at the end of the sort, I might have 0, 3, 1, 5, ....

Now: to iterate through the data, I just iterate through the int[], and use that as a lookup to the actual data in the Question[]. This minimizes the amount of data movement during a sort, and allows me to keep multiple separate sorts (perhaps with different pre-filters) very efficiently. It takes milliseconds only to sort the 15M data (which happens every minute or so to bring in new questions into Stack Overflow, or to note changes to existing questions).

To make the sort as fast as possible, I try to write my sort code such that a composite sort can be represented by a single integer value, allowing very effective sort (usable by the introspective sort). For example, here's the code for the "last activity date, then question id" sort:

public override bool SupportsNaturallySortableUInt64 => true;
public override unsafe ulong GetNaturallySortableUInt64(Question* question)
{
    // compose the data (MSB) and ID (LSB)
    var val = Promote(question->LastActivityDate) << 32
        | Promote(question->Id);
    return ~val; // the same as ulong.MaxValue - val (which reverses order) but much cheaper
}

This works by treating the LastActivityDate as a 32-bit integer, left shifting by 32 bits and composing it with the Id as a 32-bit integer, meaning we can compare the date and the id in a single operation.

Or for "score, then answer score, then id":

public override unsafe ulong GetNaturallySortableUInt64(Question* question)
{
    // compose the data
    var val = Promote(question->Score) << 48
        | Promote(question->AnswerScore) << 32
        | Promote(question->Id);
    return ~val; // the same as ulong.MaxValue - val (which reverses order) but much cheaper
}

Note that GetNaturallySortableUInt64 is only called once per element - into a working area of a ulong[] (yes, actually a ulong*) of the same size, so initially the two workspaces are something like:

int[]    ulong[]
0        34243478238974
1        12319388173
2        2349245938453
...      ...

Now I can do the entire sort by looking just at an int[] and a ulong[], such that the ulong[] vector ends up in the sorted order, and the int[] contains the indices of the items to look at.