What's the best way to return a random line in a text file using C?

Read each line, and use a random number to choose whether to keep that line or ignore it. For the first line, you want odds of 1:1 to keep; for the second, you want odds of 1:2, etc.

count = 0;
while (fgets(line, length, stream) != NULL)
{
    count++;
    if ((rand() * count) / RAND_MAX == 0)
        strcpy(keptline, line);
}

I haven't verified that this has the proper random qualities, but it seems right at first glance.

It has been pointed out that integer overflow would quickly become a problem with the way the comparison is coded, and I had independently reached the same conclusion myself. There are probably many ways to fix it, but this is the first that comes to mind:

if ((rand() / (float)RAND_MAX) <= (1.0 / count))

Mark's answer is almost correct except for two issues:

If a line is longer than length - 1 characters (including the newline), then the while loop will increment count at least twice for the same line: once for the first length - 1 characters, another for the next length - 1 characters, etc.
The calculation of rand() * count can cause an integer overflow.

To solve the first problem, you can call fgets into a trash buffer until it returns NULL (indicating an I/O error or EOF with no data read) or the trash buffer contains a newline:

count = 0;
while (fgets(line, length, stream) != NULL)
{
    char *p = strchr(line, '\n');
    if (p != NULL) {
        assert(*p == '\n');
        *p = '\0'; // trim the newline
    }
    else { // haven't reached EOL yet. Read & discard the rest of the line.
#define TRASH_LENGTH 1024
        char trash[TRASH_LENGTH];
        while((p = fgets(trash, TRASH_LENGTH, stream)) != NULL) {
            if ((p = strchr(trash, '\n')) != NULL) // reached EOL
                break;
        }
    }
    assert(strchr(line, '\n') == NULL); // `line` does not contain a newline
    count++;
    // ...

The second problem can be solved with @tvanfosson's suggestion if floating-point arithmetic is not available:

int one_chance_in(size_t n)
{
    if (rand() % n == 0) // `rand` returns an integer in [0, `RAND_MAX`]
        return 1;
    else
        return 0;
}

But note that rand() % n is not a uniform, discrete random variable even if rand() is assumed to be one because the probability that rand() % n == 0 can be as much as 1/RAND_MAX higher than the desired probability 1/n. On my machine, RAND_MAX is 2147483647, so the difference is 4.66 × 10^-10, but the C standard only requires that RAND_MAX be at least 32767 (3.05 × 10^-5 difference).

Also, for anyone left wondering why this scheme works (as I was), it might be helpful to work through the calculation of the probability that the first line remains in keptline if there are m lines and generalize: In the first iteration of the loop, the probability that the first line is copied to keptline is 1/1. In the second iteration of the loop, the probability that the second line does not overwrite the first line is 1/2. In the third iteration, the probability that the third line does not overwrite the first line is 2/3. Continuing, the probability that the last line does not overwrite the first line is (m - 1)/m. Thus, the probability that the first line remains in keptline after iterating over all lines is:

1/1 × 1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

The probability that the second line remains in keptline is:

1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

The probability that the third line remains in keptline is:

1/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

Etc. They're all 1/m.

This method is good because:

i) You can keep generating random lines at no big cost

ii) You only have to read the file a total of 1 time + 1 line at a time per random line you want. The excess read data is only equal to the size of the file.

iii) It gives each line a fair chance no matter what its position is in the file.

iv) It gives each line a fair chance no matter what its length is in the file.

The suggestion:

I would suggest a 2 pass algorithm. Well really it's a 1 pass + N lines. Where N is the number of random lines you want.

The first pass you would use to calculate how many lines and the start positions of each line.

You then take a random number from 0 to the number of lines minus 1. Use that random number, which is your line index, get the start position for that line index. Seek to that position.

You then have only 1 more read needed, and you know the exact size. (until the start index of the next line)

How to store the number of lines and the index of each line:

To store the number of lines, you can obviously just use an int.

If you can use a vector then you can add each line index into the vector. If not you can just create an array of ints with the max number of lines you think there will be. Then index into that array.

Other answers:

Another answer mentioned that you can pick a random number from 1 to the size of the file, and then use the closest newline. But this won't work. For example you might have 1 line that is really long and the others that are not that long. In that case you would have an uneven distribution.

What's the best way to return a random line in a text file using C?

Related

Recent Posts