Generating m distinct random numbers in the range [0..n-1]
Pure mathematics:
Let's count the number of rand() function calls in both cases and compare the results:
Case 1:
Let's look at the expected number of calls at step i = k, when you already have k numbers chosen. The probability of getting a new number with one rand() call is p = (n-k)/n. We want the expected number of calls needed to obtain a number we don't have yet.
The probability of getting it with exactly 1 call is p. With exactly 2 calls it is q * p, where q = 1 - p. In general, the probability of getting it after exactly j calls is q^(j-1) * p. Thus the expected number of calls is Sum[ j * q^(j-1) * p ], j = 1 --> INF. This sum equals 1/p (it is the mean of a geometric distribution; Wolfram Alpha confirms it as well).
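For completeness, one way to see that closed form (a standard geometric-series manipulation, written in LaTeX):

\sum_{j=1}^{\infty} j\,q^{j-1} p = p \frac{d}{dq}\sum_{j=0}^{\infty} q^{j} = p \frac{d}{dq}\frac{1}{1-q} = \frac{p}{(1-q)^{2}} = \frac{1}{p}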
So, at step i = k you will perform 1/p = n/(n-k) calls of the rand() function on average.
Summing over all steps, the expected number of rand() calls in method 1 is
Sum[ n/(n-k) ], k = 0 --> m-1, which equals n * T,
where T = Sum[ 1/(n-k) ], k = 0 --> m-1.
Case 2:
Here rand() is called inside random_shuffle n - 1 times (in most implementations).
Now, to choose the method, we compare these two values: n * T vs. n - 1.
So, calculate T as described above. If T < (n - 1)/n, the first method is expected to make fewer rand() calls; otherwise use the second method.
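A minimal sketch of that decision rule (the function name and the example values are just illustrative; it directly implements the formulas above):

#include <cstdio>

// Method 1 is expected to cost about n * T rand() calls, method 2 costs n - 1,
// so comparing T against (n - 1)/n decides which one is cheaper on average.
bool prefer_method_1(int n, int m) {
    double T = 0.0;
    for (int k = 0; k < m; ++k)
        T += 1.0 / (n - k);              // T = Sum[ 1/(n-k) ], k = 0 --> m-1
    return T < (double)(n - 1) / n;      // true -> method 1 expected to be cheaper
}

int main() {
    // Example: n = 100, m = 10 gives T ~ 0.105, so method 1 wins easily.
    std::printf("%s\n", prefer_method_1(100, 10) ? "method 1" : "method 2");
}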
Check the Wikipedia description of the original Fisher-Yates algorithm. It advocates using essentially your method 1 for up to n/2, and your method 2 for the remainder.
Personally, I would use Method 1, and if M > N/2, choose N-M values instead and then invert the selection (return the numbers that were not picked). For example, if N is 1000 and you want 950 of them, choose 50 values using Method 1 and then return the other 950.
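A rough sketch of that complement trick, assuming Method 1 means "keep calling rand() and reject duplicates" (the helper names are made up, and modulo bias is ignored for brevity):

#include <cstdlib>
#include <unordered_set>
#include <vector>

// Method 1: pick m distinct values from [0, n) by rejecting duplicates.
std::unordered_set<int> pick_by_rejection(int m, int n) {
    std::unordered_set<int> chosen;
    while ((int)chosen.size() < m)
        chosen.insert(std::rand() % n);   // duplicates are simply ignored by the set
    return chosen;
}

// If m > n/2, pick the n-m values to *exclude* and return everything else.
std::vector<int> pick_via_complement(int m, int n) {
    if (m <= n / 2) {
        auto s = pick_by_rejection(m, n);
        return std::vector<int>(s.begin(), s.end());
    }
    auto excluded = pick_by_rejection(n - m, n);
    std::vector<int> result;
    result.reserve(m);
    for (int v = 0; v < n; ++v)
        if (!excluded.count(v))           // keep the values that were not picked
            result.push_back(v);
    return result;
}

Note that the complement branch returns its values in increasing order; shuffle the result if the ordering itself needs to be random.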
Edit: Though, if consistent performance is your goal, I would use a modified method 2, which doesn't do the full shuffle but only shuffles the first M elements of your N-length array.
#include <cstdlib>   // std::rand
#include <numeric>   // std::iota
#include <utility>   // std::swap
#include <vector>

std::vector<int> pick_m_of_n(int m, int n) {
    std::vector<int> arr(n);
    std::iota(arr.begin(), arr.end(), 0);     // arr[i] = i
    for (int i = 0; i < m; ++i) {
        int j = std::rand() % (n - i);        // pick 0 <= j < n-i; use your favorite (unbiased) method
        // j == 0 means don't swap, otherwise swap with the element j away
        if (j != 0)
            std::swap(arr[i], arr[i + j]);
    }
    arr.resize(m);                            // result = first m elements in arr
    return arr;
}
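Calling it is then just, for example:

std::vector<int> result = pick_m_of_n(950, 1000);  // 950 distinct values from [0, 1000)

Unlike the complement trick above, this makes exactly M calls to rand() regardless of how close M is to N, which is where the consistent performance comes from.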
Here's an algorithm that will work in O(n) memory and O(n) time (where n is the number of returned results, not the size of the set you're selecting from) for any result set. It's in Python for convenience because it uses a hashtable:
import random

def random_elements(num_elements, set_size):
    state = {}
    for i in range(num_elements):
        # Swap state[i] with a random element in [i, set_size)
        swap_with = random.randint(i, set_size - 1)
        state[i], state[swap_with] = state.get(swap_with, swap_with), state.get(i, i)
    return [state[i] for i in range(num_elements)]  # effectively state[:num_elements] if it were a list/array
This is just a partial Fisher-Yates shuffle, with the array being shuffled implemented as a sparse hashtable: any element that is not present is equal to its index. We shuffle the first num_elements indices and return those values. In the case that num_elements = 1, this is equivalent to picking a single random number in the range, and in the case that num_elements = set_size, it is equivalent to a standard Fisher-Yates shuffle.
It's trivial to observe that this is O(n) time, and because each iteration of the loop initializes at most two new indices in the hashtable, it's O(n) space, too.
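If you want the same idea back in C++, here is a rough sketch using std::unordered_map as the sparse array (the names simply mirror the Python above and are not from any standard library):

#include <cstdlib>
#include <unordered_map>
#include <vector>

// Partial Fisher-Yates over a "virtual" array 0..set_size-1 stored sparsely:
// any index missing from the map is treated as holding its own value.
std::vector<int> random_elements(int num_elements, int set_size) {
    std::unordered_map<int, int> state;
    auto value_at = [&](int idx) {
        auto it = state.find(idx);
        return it == state.end() ? idx : it->second;
    };
    std::vector<int> result;
    result.reserve(num_elements);
    for (int i = 0; i < num_elements; ++i) {
        int swap_with = i + std::rand() % (set_size - i);  // random index in [i, set_size); modulo bias ignored
        int vi = value_at(i), vs = value_at(swap_with);
        state[i] = vs;                                     // swap the two (possibly implicit) entries
        state[swap_with] = vi;
        result.push_back(vs);                              // the value now sitting at position i
    }
    return result;
}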