AVX/SSE version of xorshift128+
Solution 1:
For anyone else who might reach this question, I think this C++ code implements correctly 4 xorshift128plus generators running in parallel, using AVX2:
__m256i xorshift128plus_avx2(__m256i &state0, __m256i &state1)
{
__m256i s1 = state0;
const __m256i s0 = state1;
state0 = s0;
s1 = _mm256_xor_si256(s1, _mm256_slli_epi64(s1, 23));
state1 = _mm256_xor_si256(_mm256_xor_si256(_mm256_xor_si256(s1, s0),
_mm256_srli_epi64(s1, 18)),
_mm256_srli_epi64(s0, 5));
return _mm256_add_epi64(state1, s0);
}
The scalar implementation I used is:
u64 xorshift128plus(u64 &state0, u64 &state1)
{
u64 s1 = state0;
const u64 s0 = state1;
state0 = s0;
s1 ^= s1 << 23; // a
state1 = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
return state1 + s0;
}
Which is the same one in the xorshiftplus paper. Note that the right-shift constants from the original question do not correspond to the ones in the paper.
Solution 2:
XorShift is indeed a good choice. It is so good, so fast and requires so little state that I'm surprised to see so little adoption. It should be the standard generator on all platforms. I have implemented it myself 8 years ago and even then it could generate 800MB/s of random bytes.
You cannot use vector instructions to speed up generating a single random number. There is too little instruction-level parallelism in those few instructions.
But you can easily speed up generating N numbers where N is the vector size of your target instruction set. Just run N generators in parallel. Keep state for N generators and generate N numbers at the same time.
If client code demands numbers one at a time you could keep a buffer of N (or more) numbers. If the buffer is empty you fill it using vector instructions. If the buffer is not empty you just return the next number.