Random sample from an IEnumerable generated by yielding elements?
If a sequence is infinite in length, you can't select N elements from it in less than infinite time where the chance of each element in the sequence being selected is the same.
However it IS possible to select N items from a sequence of unknown, but finite, length. You can do that using reservoir sampling.
Here's an example implementation:
/// <summary>
/// This uses Reservoir Sampling to select <paramref name="n"/> items from a sequence of items of unknown length.
/// The sequence must contain at least <paramref name="n"/> items.
/// </summary>
/// <typeparam name="T">The type of items in the sequence from which to randomly choose.</typeparam>
/// <param name="items">The sequence of items from which to randomly choose.</param>
/// <param name="n">The number of items to randomly choose from <paramref name="items"/>.</param>
/// <param name="rng">A random number generator.</param>
/// <returns>The randomly chosen items.</returns>
public static T[] RandomlySelectedItems<T>(IEnumerable<T> items, int n, System.Random rng)
{
// See http://en.wikipedia.org/wiki/Reservoir_sampling for details.
var result = new T[n];
int index = 0;
int count = 0;
foreach (var item in items)
{
if (index < n)
{
result[count++] = item;
}
else
{
int r = rng.Next(0, index + 1);
if (r < n)
result[r] = item;
}
++index;
}
if (index < n)
throw new ArgumentException("Input sequence too short");
return result;
}
This must still iterate over the entire sequence, however, so it does NOT work for an infinitely long sequence.
If you want to support input sequences longer than 2^31, you can use longs in the implementation like so:
public static T[] RandomlySelectedItems<T>(IEnumerable<T> items, int n, System.Random rng)
{
// See http://en.wikipedia.org/wiki/Reservoir_sampling for details.
var result = new T[n];
long index = 0;
int count = 0;
foreach (var item in items)
{
if (index < n)
{
result[count++] = item;
}
else
{
long r = rng.NextInt64(0, index + 1);
if (r < n)
result[r] = item;
}
++index;
}
if (index < n)
throw new ArgumentException("Input sequence too short");
return result;
}
Note that this implementation requires .Net 6.0 or higher because of the rng.NextInt64()
.
Also note that there's no point in making n
long because you can't have an array that exceeds ~2^31 elements, so it wouldn't be possible to fill it. You could in theory fix that by returning multiple arrays, but I'll leave that as an exercise for the reader. ;)