Fast way to generate pseudo-random bits with a given probability of 0 or 1 for each bit

Normally, a random number generator returns a stream of bits in which the probability of observing a 0 or a 1 in each position is equal (i.e. 50%). Let's call this an unbiased PRNG.

I need to generate a string of pseudo-random bits with the following property: the probability of seeing a 1 in each position is p (i.e. the probability of seeing a 0 is 1 - p). The parameter p is a real number between 0 and 1; in my problem it happens to have a resolution of 0.5%, i.e. it can take the values 0%, 0.5%, 1%, 1.5%, ..., 99.5%, 100%.

Note that p is a probability and not an exact fraction. The actual number of bits set to 1 in a stream of n bits must follow the binomial distribution B(n, p).

There is a naive method that can use an unbiased PRNG to generate the value of each bit (pseudocode):

generate_biased_stream(n, p):
  result = []
  for i in 1 to n:
    if random_uniform(0, 1) < p:
      result.append(1)
    else:
      result.append(0)
  return result

Such an implementation is much slower than one generating an unbiased stream, since it calls the random number generator once per bit, while an unbiased stream generator calls it once per word (e.g. it can generate 32 or 64 random bits with a single call).

I want a faster implementation, even if it sacrifices randomness slightly. An idea that comes to mind is to precompute a lookup table: for each of the 200 possible values of p, compute C 8-bit values using the slower algorithm and save them in a table. Then the fast algorithm would just pick one of these at random to generate 8 skewed bits.

A back-of-the-envelope calculation of how much memory is needed: C should be at least 256 (the number of possible 8-bit values), and probably more to avoid sampling effects; let's say 1024. Maybe the number should vary depending on p, but let's keep it simple and say the average is 1024. Since there are 200 values of p, the total memory usage is 200 KB. This is not bad, and might fit in the L2 cache (256 KB). I still need to evaluate it to see if there are sampling effects that introduce biases, in which case C will have to be increased.
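
To make the idea concrete, a rough sketch of what I have in mind is below (the class name, the use of std::mt19937_64 for both the slow and the fast generator, and C = 1024 are just placeholders, not a finished implementation):

#include <cstdint>
#include <random>
#include <vector>

// Sketch of the lookup-table idea for a single value of p: precompute C bytes
// of biased bits with the slow per-bit method, then serve 8 biased bits per
// call by picking one precomputed entry at random.
class BiasedTable {
  public:
    explicit BiasedTable(double p, std::size_t C = 1024, std::uint64_t seed = 1)
        : entries_(C), fast_rng_(seed) {
      std::mt19937_64 slow_rng(seed + 1);
      std::bernoulli_distribution bit(p);
      for (std::uint8_t &e : entries_) {
        e = 0;
        for (int b = 0; b < 8; ++b)
          e = static_cast<std::uint8_t>((e << 1) | bit(slow_rng));
      }
    }

    // Fast path: one PRNG call selects one of the C precomputed bytes.
    // C must be a power of two for the mask below to be unbiased.
    std::uint8_t next8() {
      return entries_[fast_rng_() & (entries_.size() - 1)];
    }

  private:
    std::vector<std::uint8_t> entries_;
    std::mt19937_64 fast_rng_;
};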

A deficiency of this solution is that it can generate only 8 bits at once, and even that takes a lot of work, while an unbiased PRNG can generate 64 at once with just a few arithmetic instructions.

I would like to know if there is a faster method, based on bit operations instead of lookup tables, for example by modifying the random number generation code directly to introduce a bias into each bit. That would achieve the same performance as an unbiased PRNG.


Edit March 5

Thank you all for your suggestions; I got a lot of interesting ideas. Here are the top ones:

  • Change the problem requirements so that p has a resolution of 1/256 instead of 1/200. This allows using bits more efficiently, and also gives more opportunities for optimization. I think I can make this change.
  • Use arithmetic coding to efficiently consume bits from an unbiased generator. With the above change of resolution this becomes much easier.
  • A few people suggested that PRNGs are very fast, so using arithmetic coding might actually make the code slower because of the overhead it introduces. Instead, I should always consume the worst-case number of bits and optimize that code. See the benchmarks below.
  • @rici suggested using SIMD. This is a nice idea, which works only if we always consume a fixed number of bits.

Benchmarks (without arithmetic decoding)

Note: as many of you have suggested, I changed the resolution from 1/200 to 1/256.

I wrote several implementations of the naive method that simply takes 8 random unbiased bits and generates 1 biased bit (a scalar sketch of this approach follows the list):

  • Without SIMD
  • With SIMD using Agner Fog's vectorclass library, as suggested by @rici
  • With SIMD using intrinsics
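
For reference, the scalar variant is essentially the following sketch (rng64 stands for whichever unbiased 64-bit generator is plugged in; this is a simplification, not the exact benchmarked code):

#include <cstdint>

std::uint64_t rng64();  // placeholder for the unbiased 64-bit PRNG in use

// 1/8 efficiency: each of the 8 bytes of one unbiased 64-bit word is compared
// against the 8-bit threshold p, producing one bit with P(bit = 1) = p / 256.
std::uint8_t biased8(unsigned p) {
  std::uint64_t u = rng64();
  std::uint8_t out = 0;
  for (int i = 0; i < 8; ++i) {
    out |= static_cast<std::uint8_t>(((u & 0xFF) < p) << i);
    u >>= 8;
  }
  return out;
}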

I use two unbiased pseudo-random number generators:

  • xorshift128plus
  • Ranvec1 (Mersenne Twister-like) from Agner Fog's library.

I also measure the speed of the unbiased PRNG for comparison. Here are the results:


RNG: Ranvec1(Mersenne Twister for Graphics Processors + Multiply with Carry)

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 16.081 16.125 16.093 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Gbps/s: 0.778 0.783 0.812 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 2.176 2.184 2.145 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 2.129 2.151 2.183 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600

SIMD increases performance by a factor of 3 compared to the scalar method. It is 8 times slower than the unbiased generator, as expected.

The fastest biased generator achieves 2.1 Gb/s.


RNG: xorshift128plus

Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 18.300 21.486 21.483 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical   : 104,857,600

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 22.660 22.661 24.662 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Gbps/s: 1.065 1.102 1.078 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 4.972 4.971 4.970 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 4.955 4.971 4.971 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical   : 104,857,600

For xorshift, SIMD increases performance by a factor of 5 compared to the scalar method. It is 4 times slower than the unbiased generator. Note that this is a scalar implementation of xorshift.

The fastest biased generator achieves 4.9 Gb/s.


RNG: xorshift128plus_avx2

Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 18.754 21.494 21.878 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical   : 104,857,600

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 54.126 54.071 54.145 [Gb/s]
Number of ones: 536,874,540 536,880,718 536,891,316
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Gbps/s: 1.093 1.103 1.063 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 19.567 19.578 19.555 [Gb/s]
Number of ones: 104,836,115 104,846,215 104,835,129
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 19.551 19.589 19.557 [Gb/s]
Number of ones: 104,831,396 104,837,429 104,851,100
Theoretical   : 104,857,600

This implementation uses AVX2 to run 4 unbiased xorshift generators in parallel.

The fastest biased generator achieves 19.5 Gb/s.

Benchmarks for arithmetic decoding

Simple tests show that the arithmetic decoding code is the bottleneck, not the PRNG. So I am only benchmarking the most expensive PRNG.


RNG: Ranvec1(Mersenne Twister for Graphics Processors + Multiply with Carry)

Method: Arithmetic decoding (floating point)
Gbps/s: 0.068 0.068 0.069 [Gb/s]
Number of ones: 10,235,580 10,235,580 10,235,580
Theoretical   : 10,240,000

Method: Arithmetic decoding (fixed point)
Gbps/s: 0.263 0.263 0.263 [Gb/s]
Number of ones: 10,239,367 10,239,367 10,239,367
Theoretical   : 10,240,000

Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 12.687 12.686 12.684 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical   : 104,857,600

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 14.536 14.536 14.536 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Gbps/s: 0.754 0.754 0.754 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 2.094 2.095 2.094 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 2.094 2.094 2.095 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600

The simple fixed-point method achieves 0.25 Gb/s, while the naive scalar method is 3x faster and the naive SIMD method is 8x faster. There might be ways to optimize and/or parallelize the arithmetic decoding method further, but due to its complexity I have decided to stop here and choose the naive SIMD implementation.

Thank you all for the help.


One thing you can do is sample the underlying unbiased generator multiple times, getting several 32-bit or 64-bit words, and then perform bitwise boolean arithmetic on them. As an example, for 4 words b1, b2, b3, b4, you can get the following distributions:

    expression             | p(bit is 1)
    -----------------------+-------------
    b1 & b2 & b3 & b4      |  6.25%
    b1 & b2 & b3           | 12.50%
    b1 & b2 & (b3 | b4)    | 18.75%
    b1 & b2                | 25.00%
    b1 & (b2 | (b3 & b4))  | 31.25%
    b1 & (b2 | b3)         | 37.50%
    b1 & (b2 | b3 | b4)    | 43.75%
    b1                     | 50.00%

Similar constructions can be made for finer resolutions. It gets a bit tedious and still requires more generator calls, but at least not one per bit. This is similar to a3f's answer, but is probably easier to implement and, I suspect, faster than scanning words for 0xF nybbles.

Note that for your desired 0.5% resolution, you would need 8 unbiased words for one biased word, which would give you a resolution of 0.5^8 = 1/256, i.e. about 0.39% (finer than the desired 0.5%).
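
If you relax p to a resolution of 1/256 (as discussed in the edit above), the construction can be mechanized: process the bits of p from least significant to most significant, ANDing with a fresh unbiased word when the bit is 0 and ORing when it is 1. A rough sketch, assuming a hypothetical rng64() unbiased generator:

#include <cstdint>

std::uint64_t rng64();  // placeholder for an unbiased 64-bit generator

// 64 bits, each set with probability p/256, from exactly 8 unbiased words.
// Going through the bits of p from LSB to MSB, AND halves the current
// probability q of a 1 and OR maps it to (1 + q) / 2, so after 8 steps the
// probability is exactly p / 256.
std::uint64_t biased64(unsigned p) {
  if (p >= 256) return ~std::uint64_t(0);  // treat p = 256 as "always 1"
  std::uint64_t r = 0;                     // current P(bit set) = 0
  for (int i = 0; i < 8; ++i) {
    std::uint64_t b = rng64();
    r = (p & 1) ? (r | b) : (r & b);
    p >>= 1;
  }
  return r;
}

Trailing zero bits of p only AND fresh words into an all-zero result, so those iterations (and their generator calls) can be skipped in practice.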


If you're prepared to approximate p based on 256 possible values, and you have a PRNG which can generate uniform values in which the individual bits are independent of each other, then you can use vectorized comparison to produce multiple biased bits from a single random number.

That's only worth doing if (1) you worry about random number quality and (2) you are likely to need a large number of bits with the same bias. The second requirement seems to be implied by the original question, which criticizes a proposed solution, as follows: "A deficiency of this solution is that it can generate only 8 bits at once, even that with a lot of work, while an unbiased PRNG can generate 64 at once with just a few arithmetic instructions." Here, the implication seems to be that it is useful to generate a large block of biased bits in a single call.

Random-number quality is a difficult subject. It's hard if not impossible to measure, and therefore different people will propose different metrics which emphasize and/or devalue different aspects of "randomness". It is generally possible to trade off speed of random-number generation for lower "quality"; whether this is worth doing depends on your precise application.

The simplest possible tests of random number quality involve the distribution of individual values and the cycle length of the generator. Standard implementations of the C library rand and POSIX random functions will typically pass the distribution test, but the cycle lengths are not adequate for long-running applications.

These generators are typically extremely fast, though: the glibc implementation of random requires only a few cycles, while the classic linear congruential generator (LCG) requires a multiply and an addition. (Or, in the case of the glibc implementation, three of the above to generate 31 bits.) If that's sufficient for your quality requirements, then there is little point trying to optimize, particularly if the bias probability changes frequently.
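
As a concrete illustration, such a generator is little more than the following (the constants are the well-known Numerical Recipes values; the struct itself is just a sketch):

#include <cstdint>

// Classic 32-bit linear congruential generator: one multiply and one add
// per output word. Seed handling here is purely illustrative.
struct Lcg32 {
  std::uint32_t state = 12345u;
  std::uint32_t next() {
    state = state * 1664525u + 1013904223u;
    return state;
  }
};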

Bear in mind that the cycle length should be a lot longer than the number of samples expected; ideally, it should be greater than the square of that number, so a linear congruential generator (LCG) with a cycle length of 2^31 is not appropriate if you expect to generate gigabytes of random data. Even the Gnu trinomial nonlinear additive-feedback generator, whose cycle length is claimed to be approximately 2^35, shouldn't be used in applications which will require millions of samples.

Another quality issue, which is much harder to test, relates to the independence of consecutive samples. Short cycle lengths completely fail on this metric, because once the repeat starts, the generated random numbers are precisely correlated with historical values. The Gnu trinomial algorithm, although its cycle is longer, has a clear correlation as a result of the fact that the i-th random number generated, r(i), is always one of the two values r(i-3) + r(i-31) or r(i-3) + r(i-31) + 1. This can have surprising or at least puzzling consequences, particularly with Bernoulli experiments.

Here's an implementation using Agner Fog's useful vector class library, which abstracts away a lot of the annoying details of SSE intrinsics and also helpfully comes with a fast vectorized random number generator (found in special.zip inside the vectorclass.zip archive). That lets us generate 256 bits from eight calls to the 256-bit PRNG. You can read Dr. Fog's explanation of why he finds even the Mersenne twister to have quality issues, along with his proposed solution; I'm not qualified to comment, really, but it does at least appear to give the expected results in the Bernoulli experiments I have tried with it.

#include "vectorclass/vectorclass.h"
#include "vectorclass/ranvec1.h"

class BiasedBits {
  public:
    // Default constructor, seeded with fixed values
    BiasedBits() : BiasedBits(1)  {}
    // Seed with a single seed; other possibilities exist.
    BiasedBits(int seed) : rng(3) { rng.init(seed); }

    // Generate 256 random bits, each with probability `p/256` of being 1.
    Vec8ui random256(unsigned p) {
      if (p >= 256) return Vec8ui{ 0xFFFFFFFF };
      Vec32c output{ 0 };
      Vec32c threshold{ 127 - p };
      for (int i = 0; i < 8; ++i) {
        // Shift the bits accumulated so far one position to the left.
        output += output;
        // The byte-wise comparison yields -1 (all ones) in each lane where the
        // random byte exceeds the threshold, so subtracting it sets the new
        // low bit of that lane with probability p/256.
        output -= Vec32c(Vec32c(rng.uniform256()) > threshold);
      }
      return Vec8ui(output);
    }

  private:
    Ranvec1 rng;
};

In my test, that produced and counted 268435456 bits in 260 ms, or one bit per nanosecond. The test machine is an i5, so it doesn't have AVX2; YMMV.

In the actual use case, with 201 possible values for p, the computation of 8-bit threshold values will be annoyingly imprecise. If that imprecision is undesired, you could adapt the above to use 16-bit thresholds, at the cost of generating twice as many random numbers.

Alternatively, you could hand-roll a vectorization based on 10-bit thresholds, which would give you a very good approximation to 0.5% increments, using the standard bit-manipulation hack of doing the vectorized threshold comparison by checking for a borrow on every 10th bit of the subtraction of the vector of values and the repeated threshold. Combined with, say, std::mt19937_64, that would give you an average of six biased bits per 64-bit random number.


From an information-theoretic point of view, a biased stream of bits (with p != 0.5) has less information in it than an unbiased stream, so in theory it should take (on average) less than 1 bit of the unbiased input to produce a single bit of the biased output stream. For example, the entropy of a Bernoulli random variable with p = 0.1 is -0.1 * log2(0.1) - 0.9 * log2(0.9) bits, which is around 0.469 bits. That suggests that for the case p = 0.1 we should be able to produce a little over two bits of the output stream per unbiased input bit (1 / 0.469 ≈ 2.13).

Below, I give two methods for producing the biased bits. Both achieve close to optimal efficiency, in the sense of requiring as few input unbiased bits as possible.

Method 1: arithmetic (de)coding

A practical method is to decode your unbiased input stream using arithmetic (de)coding, as already described in the answer from alexis. For a case this simple, it's not hard to code something up. Here's some unoptimised pseudocode (cough, Python) that does this:

import random

def random_bits():
    """
    Infinite generator generating a stream of random bits,
    with 0 and 1 having equal probability.
    """
    global bit_count  # keep track of how many bits were produced
    while True:
        bit_count += 1
        yield random.choice([0, 1])

def bernoulli(p):
    """
    Infinite generator generating 1-bits with probability p
    and 0-bits with probability 1 - p.
    """
    bits = random_bits()

    low, high = 0.0, 1.0
    while True:
        if high <= p:
            # Generate 1, rescale to map [0, p) to [0, 1)
            yield 1
            low, high = low / p, high / p
        elif low >= p:
            # Generate 0, rescale to map [p, 1) to [0, 1)
            yield 0
            low, high = (low - p) / (1 - p), (high - p) / (1 - p)
        else:
            # Use the next random bit to halve the current interval.
            mid = 0.5 * (low + high)
            if next(bits):
                low = mid
            else:
                high = mid

Here's an example usage:

import itertools
bit_count = 0

# Generate a million deviates.
results = list(itertools.islice(bernoulli(0.1), 10**6))

print("First 50:", ''.join(map(str, results[:50])))
print("Biased bits generated:", len(results))
print("Unbiased bits used:", bit_count)
print("mean:", sum(results) / len(results))

The above gives the following sample output:

First 50: 00000000000001000000000110010000001000000100010000
Biased bits generated: 1000000
Unbiased bits used: 469036
mean: 0.100012

As promised, we've generated 1 million bits of our output biased stream using fewer than five hundred thousand bits from the source unbiased stream.

For optimisation purposes, when translating this into C/C++ it may make sense to use integer-based fixed-point arithmetic rather than floating-point.

Method 2: integer-based algorithm

Rather than trying to convert the arithmetic decoding method to use integers directly, here's a simpler approach. It's not quite arithmetic decoding any more, but it's not totally unrelated, and it achieves close to the same output-biased-bit / input-unbiased-bit ratio as the floating-point version above. It's organised so that all quantities fit into an unsigned 32-bit integer, so should be easy to translate to C / C++. The code is specialised to the case where p is an exact multiple of 1/200, but this approach would work for any p that can be expressed as a rational number with reasonably small denominator.

def bernoulli_int(p):
    """
    Infinite generator generating 1-bits with probability p
    and 0-bits with probability 1 - p.

    p should be an integer multiple of 1/200.
    """
    bits = random_bits()
    # Assuming that p has a resolution of 0.005, find p / 0.005.
    p_int = int(round(200*p))

    value, high = 0, 1
    while True:
        if high < 2**31:
            high = 2 * high
            value = 2 * value + next(bits)
        else:
            # Throw out everything beyond the last multiple of 200, to
            # avoid introducing a bias.
            discard = high - high % 200
            split = high // 200 * p_int
            if value >= discard:  # rarer than 1 time in 10 million
                value -= discard
                high -= discard
            elif value >= split:
                yield 0
                value -= split
                high = discard - split
            else:
                yield 1
                high = split

The key observation is that every time we reach the beginning of the while loop, value is uniformly distributed amongst all integers in [0, high), and is independent of all previously-output bits. If you care about speed more than perfect correctness, you can get rid of discard and the value >= discard branch: that's just there to ensure that we output 0 and 1 with exactly the right probabilities. Leave that complication out, and you'll just get almost the right probabilities instead. Also, if you make the resolution for p equal to 1/256 rather than 1/200, then the potentially time-consuming division and modulo operations can be replaced with bit operations.

With the same test code as before, but using bernoulli_int in place of bernoulli, I get the following results for p=0.1:

First 50: 00000010000000000100000000000000000000000110000100
Biased bits generated: 1000000
Unbiased bits used: 467997
mean: 0.099675
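
To illustrate the remark above about using a resolution of 1/256, here is a rough C++ translation of bernoulli_int in which the division and modulo become a shift and a mask. The class name, the use of std::mt19937_64 as the unbiased source, and the bit buffering are my own illustrative choices rather than part of the algorithm:

#include <cstdint>
#include <random>

class BernoulliInt256 {
  public:
    // p256 = round(p * 256), i.e. p with a resolution of 1/256.
    explicit BernoulliInt256(unsigned p256, std::uint64_t seed = 1)
        : p_(p256), rng_(seed) {}

    // Returns the next biased bit (1 with probability p256/256).
    int next() {
      for (;;) {
        if (high_ < (1u << 31)) {
          // Pull in one unbiased bit and double the interval.
          high_ <<= 1;
          value_ = (value_ << 1) | next_bit();
        } else {
          std::uint32_t discard = high_ & ~255u;    // high - high % 256
          std::uint32_t split = (high_ >> 8) * p_;  // high / 256 * p256
          if (value_ >= discard) {                  // rare correction branch
            value_ -= discard;
            high_ -= discard;
          } else if (value_ >= split) {
            value_ -= split;
            high_ = discard - split;
            return 0;
          } else {
            high_ = split;
            return 1;
          }
        }
      }
    }

  private:
    int next_bit() {
      if (bits_left_ == 0) { buffer_ = rng_(); bits_left_ = 64; }
      int b = static_cast<int>(buffer_ & 1);
      buffer_ >>= 1;
      --bits_left_;
      return b;
    }

    unsigned p_;
    std::uint32_t value_ = 0, high_ = 1;
    std::uint64_t buffer_ = 0;
    int bits_left_ = 0;
    std::mt19937_64 rng_;
};

For example, BernoulliInt256 gen(26) approximates p = 0.1 as 26/256 ≈ 0.1016; the mean will differ slightly from the 1/200 version because 0.1 is an exact multiple of 1/200 but not of 1/256.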