Random sample from an IEnumerable generated by yielding elements?

If a sequence is infinite in length, you can't select N elements from it in less than infinite time where the chance of each element in the sequence being selected is the same.

However it IS possible to select N items from a sequence of unknown, but finite, length. You can do that using reservoir sampling.

Here's an example implementation:

/// <summary>
/// This uses Reservoir Sampling to select <paramref name="n"/> items from a sequence of items of unknown length.
/// The sequence must contain at least <paramref name="n"/> items.
/// </summary>
/// <typeparam name="T">The type of items in the sequence from which to randomly choose.</typeparam>
/// <param name="items">The sequence of items from which to randomly choose.</param>
/// <param name="n">The number of items to randomly choose from <paramref name="items"/>.</param>
/// <param name="rng">A random number generator.</param>
/// <returns>The randomly chosen items.</returns>

public static T[] RandomlySelectedItems<T>(IEnumerable<T> items, int n, System.Random rng)
{
    // See http://en.wikipedia.org/wiki/Reservoir_sampling for details.

    var result = new T[n];
    int index = 0;
    int count = 0;

    foreach (var item in items)
    {
        if (index < n)
        {
            result[count++] = item;
        }
        else
        {
            int r = rng.Next(0, index + 1);

            if (r < n)
                result[r] = item;
        }

        ++index;
    }

    if (index < n)
        throw new ArgumentException("Input sequence too short");

    return result;
}

This must still iterate over the entire sequence, however, so it does NOT work for an infinitely long sequence.


If you want to support input sequences longer than 2^31, you can use longs in the implementation like so:

public static T[] RandomlySelectedItems<T>(IEnumerable<T> items, int n, System.Random rng)
{
    // See http://en.wikipedia.org/wiki/Reservoir_sampling for details.

    var  result = new T[n];
    long index  = 0;
    int  count  = 0;

    foreach (var item in items)
    {
        if (index < n)
        {
            result[count++] = item;
        }
        else
        {
            long r = rng.NextInt64(0, index + 1);

            if (r < n)
                result[r] = item;
        }

        ++index;
    }

    if (index < n)
        throw new ArgumentException("Input sequence too short");

    return result;
}

Note that this implementation requires .Net 6.0 or higher because of the rng.NextInt64().

Also note that there's no point in making n long because you can't have an array that exceeds ~2^31 elements, so it wouldn't be possible to fill it. You could in theory fix that by returning multiple arrays, but I'll leave that as an exercise for the reader. ;)