AVX/SSE version of xorshift128+

Solution 1:

For anyone else who might reach this question, I think this C++ code implements correctly 4 xorshift128plus generators running in parallel, using AVX2:

__m256i xorshift128plus_avx2(__m256i &state0, __m256i &state1)
{
    __m256i s1 = state0;
    const __m256i s0 = state1;
    state0 = s0;
    s1 = _mm256_xor_si256(s1, _mm256_slli_epi64(s1, 23));
    state1 = _mm256_xor_si256(_mm256_xor_si256(_mm256_xor_si256(s1, s0),
                                               _mm256_srli_epi64(s1, 18)),
                              _mm256_srli_epi64(s0, 5));
    return _mm256_add_epi64(state1, s0);
}

The scalar implementation I used is:

u64 xorshift128plus(u64 &state0, u64 &state1)
{
    u64 s1 = state0;
    const u64 s0 = state1;
    state0 = s0;
    s1 ^= s1 << 23;                              // a
    state1 = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
    return state1 + s0;
}

Which is the same one in the xorshiftplus paper. Note that the right-shift constants from the original question do not correspond to the ones in the paper.

Solution 2:

XorShift is indeed a good choice. It is so good, so fast and requires so little state that I'm surprised to see so little adoption. It should be the standard generator on all platforms. I have implemented it myself 8 years ago and even then it could generate 800MB/s of random bytes.

You cannot use vector instructions to speed up generating a single random number. There is too little instruction-level parallelism in those few instructions.

But you can easily speed up generating N numbers where N is the vector size of your target instruction set. Just run N generators in parallel. Keep state for N generators and generate N numbers at the same time.

If client code demands numbers one at a time you could keep a buffer of N (or more) numbers. If the buffer is empty you fill it using vector instructions. If the buffer is not empty you just return the next number.