What's the best way to return a random line in a text file using C?
Read each line, and use a random number to choose whether to keep that line or ignore it. For the first line, you want odds of 1:1 to keep; for the second, you want odds of 1:2, etc.
count = 0;
while (fgets(line, length, stream) != NULL)
{
count++;
if ((rand() * count) / RAND_MAX == 0)
strcpy(keptline, line);
}
I haven't verified that this has the proper random qualities, but it seems right at first glance.
It has been pointed out that integer overflow would quickly become a problem with the way the comparison is coded, and I had independently reached the same conclusion myself. There are probably many ways to fix it, but this is the first that comes to mind:
if ((rand() / (float)RAND_MAX) <= (1.0 / count))
Mark's answer is almost correct except for two issues:
- If a line is longer than
length - 1
characters (including the newline), then thewhile
loop will incrementcount
at least twice for the same line: once for the firstlength - 1
characters, another for the nextlength - 1
characters, etc. - The calculation of
rand() * count
can cause an integer overflow.
To solve the first problem, you can call fgets
into a trash buffer until it returns NULL
(indicating an I/O error or EOF with no data read) or the trash buffer contains a newline:
count = 0;
while (fgets(line, length, stream) != NULL)
{
char *p = strchr(line, '\n');
if (p != NULL) {
assert(*p == '\n');
*p = '\0'; // trim the newline
}
else { // haven't reached EOL yet. Read & discard the rest of the line.
#define TRASH_LENGTH 1024
char trash[TRASH_LENGTH];
while((p = fgets(trash, TRASH_LENGTH, stream)) != NULL) {
if ((p = strchr(trash, '\n')) != NULL) // reached EOL
break;
}
}
assert(strchr(line, '\n') == NULL); // `line` does not contain a newline
count++;
// ...
The second problem can be solved with @tvanfosson's suggestion if floating-point arithmetic is not available:
int one_chance_in(size_t n)
{
if (rand() % n == 0) // `rand` returns an integer in [0, `RAND_MAX`]
return 1;
else
return 0;
}
But note that rand() % n
is not a uniform, discrete random variable even if rand()
is assumed to be one because the probability that rand() % n == 0
can be as much as 1/RAND_MAX
higher than the desired probability 1/n
. On my machine, RAND_MAX
is 2147483647, so the difference is 4.66 × 10-10, but the C standard only requires that RAND_MAX
be at least 32767 (3.05 × 10-5 difference).
Also, for anyone left wondering why this scheme works (as I was), it might be helpful to work through the calculation of the probability that the first line remains in keptline
if there are m lines and generalize: In the first iteration of the loop, the probability that the first line is copied to keptline
is 1/1. In the second iteration of the loop, the probability that the second line does not overwrite the first line is 1/2. In the third iteration, the probability that the third line does not overwrite the first line is 2/3. Continuing, the probability that the last line does not overwrite the first line is (m - 1)/m. Thus, the probability that the first line remains in keptline
after iterating over all lines is:
1/1 × 1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m
The probability that the second line remains in keptline
is:
1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m
The probability that the third line remains in keptline
is:
1/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m
Etc. They're all 1/m.
This method is good because:
i) You can keep generating random lines at no big cost
ii) You only have to read the file a total of 1 time + 1 line at a time per random line you want. The excess read data is only equal to the size of the file.
iii) It gives each line a fair chance no matter what its position is in the file.
iv) It gives each line a fair chance no matter what its length is in the file.
The suggestion:
I would suggest a 2 pass algorithm. Well really it's a 1 pass + N lines. Where N is the number of random lines you want.
The first pass you would use to calculate how many lines and the start positions of each line.
You then take a random number from 0 to the number of lines minus 1. Use that random number, which is your line index, get the start position for that line index. Seek to that position.
You then have only 1 more read needed, and you know the exact size. (until the start index of the next line)
How to store the number of lines and the index of each line:
To store the number of lines, you can obviously just use an int.
If you can use a vector then you can add each line index into the vector. If not you can just create an array of ints with the max number of lines you think there will be. Then index into that array.
Other answers:
Another answer mentioned that you can pick a random number from 1 to the size of the file, and then use the closest newline. But this won't work. For example you might have 1 line that is really long and the others that are not that long. In that case you would have an uneven distribution.