Flaw or no flaw in MS Excel's RNG?
I have a question about my understanding of an article of B.D. McCullough (2008) about Excel's implementation of the Wichmann-Hill random number generator (1982).
First, a bit of context
The Wichmann-Hill algorithm is given in AS 183 here. As one can see in the program's comments, it's merely three linear congruential generators (LCG) combined to make a single RNG:
$$ix = 171 \times ix \mod p\\ iy = 172 \times iy \mod q\\ iz = 170 \times iz \mod r$$
With $p=30269, q=30307, r=30323$
Then the three integer values are combined to yield a (pseudo-random) real number $x$ in $[0,1]$, with
$$x=\frac{ix}{p}+\frac{iy}{q}+\frac{iz}{r} \mod 1$$
The actual program is a bit more complicated because it's designed to run on machines that can't represent $172\times30306=5212632$ (for example, 16-bit machines, which where not rare back then), but it's equivalent to the formulas above.
In a subsequent article (1986), H. Zeisel remarked that the three LCG can be reduced into a single one, thanks to chinese remainder theorem:
$$u = a \times u \mod m$$
With $a=16555425264690$ and $m=pqr=27817185604309$, and $a$ is found by solving the following (it's where chinese remainder theorem is used):
$$a\equiv171\pmod{30269}\\ a\equiv172\pmod{30307}\\ a\equiv170\pmod{30323}$$
Notice that $30269$, $30307$ and $30323$ are all primes.
The claim
Now, the claim in McCullough's article is that Microsoft didn't implement the Wichmann-Hill correctly, and to prove this, one has to show the successive values computed by Excel don't agree with the values computed by a reference implementation of the WH RNG.
Since Excel does not give access to the seed, the seed is computed by multiplying a random number returned by Excel, by $m$, to get the seed (see section 3 of the article).
I don't understand this.
If I understand correctly, if one assumes the WH RNG has been implemented according to the original algorithm AS 183, one can't recover the seed by just multiplying by $m$. For example, imagine the seed is $u=1$ in Zeisel's formulation. Then trivially, the seed is $ix=iy=iz=1$ in AS 183's formulation. But the random number must then be
$$x=\frac1p+\frac1q+\frac1r\simeq0.000099011$$
Notice the "modulo 1" has no effect in this case.
And by multiplying $x$ by $m$, one finds
$$mx=pqr\left(\frac1p+\frac1q+\frac1r\right)=2754208631\neq1$$
So we actually didn't recover the seed.
The reason it's not the same is that, while one can switch back and forth from AS 183 to Zeisel seeds by chinese remainder thorem, the combination in the last step (divide each of $ix, iy, iz$ by the corresponding moduli, and take the sum modulo $1$), is not equivalent to dividing the Zeisel seed $u$ by the "combined" module $m=pqr$. One would have to recover first $ix, iy, iz$, by taking respectively
$$ix=u\mod p\\iy=u \mod q\\iz=u \mod r$$
And only then to finish with AS 183 combination. But this is not easy to revert, as is necessary for the purpose of checking conformity of Excel's RNG.
My actual question
First, I would like to know if my reasoning is correct, which would mean there is a flaw in the article.
I would like to know if there is some simple way to find the seed $(ix,iy,iz)$, given one or several successive random values from the (correct) Wichmann-Hill. It does not look obvious, as McCullough himself remarked in the same section 3 of his article. That would allow me to assess the correctness of Excel's RNG (I am not sure Excel still uses the Wichmann-Hill, anyway).
Notice that Wichmann and Hill have published a revised generator (2005), available for free on NPL's website. According to the link to Microsoft's site above, Excel implemented the older one, in Excel 2003.
Afterthought: the problem is even more difficult than expected. The article states in section 2 that Excel produces random real numbers in single precision, thus there is a loss of precision. However, it's well known that a good RNG satisfies an important property: two almost equal seeds give random numbers that may be far apart. Hence, knowing only a unique approximate random number from the generator may not be enough to recover the seed, so we can expect we will need several in a row.
Edit
As it happens, Ronald A. Thisted, of the university of Chicago, has worked on a second volume of his Elements of Statistical Computing. The draft notes, dated january 2005, can be found here. The Wichmann-Hill generator is evoked on page 9, which ends with
However, the uniform random number returned by the Wichmann-Hill algorithm is not simply the ratio of the single generator’s output to its modulus. While the properties of the latter would be easily studied, there have been no analytic studies of the actual Wichmann-Hill generator. Nonetheless, it appears to perform well in practice.
That's a definitive answer to my first question, but I was already rather confident about this. However, I am still lost on the second question, about actually "reversing" the generator.
Edit (2016-10-15)
Since I asked the question, I have been able to implement the WH rng in Python, along with a program to find the exact seed from the floating point random values, either produced in double or single precision. It's enough to have two or three consecutive random values: two should be enough, but because of precision loss, one has to check with another one. This involves checking possible possible seeds in a relativeley narrow range, and it makes use of the solution of the modular equation given in the answer of this question. It's easy afterwards to check that the seed is correct by comparing to several consecutive values returned by the RNG.
In the meantime, I have also been able to get random values from Excel 2003 and 2007, which are claimed by Microsoft to implement the WH RNG. I have not been able to confirm this: the program does not find a seed that allows one to reproduce the sequence. This may mean that the WH RNG is badly implemented, or implemented in some nonstandard way.
I have tried another approach: look for the numbers 30269, 30307, 30323 in the binary content of files in the Office distribution. It's relatively easy to look for a file containing precisely these numbers (stored as 2 bytes each). Sadly, it didn't work either. This is not a proof either : the numbers could be hidden in compressed or encrypted data, or stored in some other way.
I have tried on the entire hard disk, and I actually found these numbers only in a DLL from the SAS statistical software (I later disassembled part of it, and it indeed implemented the WH RNG): it seems to be used in computations of generalized linear models, probably to initialize an optimization algorithm.
Thus, so far, I have not been able to prove the implementation of the WH RNG in Excel is good. I don't consider this to be a proof of the contrary : any operation done on the sequence could complicate seed finding program.For instance, if two random numbers are used to produce one floating point output, my program is bound to fail. Unless I am able to disassemble the code that does actually compute random numbers, there is some doubt.
Now, does it really matter? I would be happy to find the definitive answer to this question, of course, but there have been three versions of Excel since (2010, 2013, 2016), none of which is supposed to use the Whichmann-Hill random generator. My employer has even switched to LibreOffice in the meantime. And there is still much to do to assess the quality of spreadsheets, and more generally of numerical and statistical software.
Solution 1:
Given the random number $x = \frac{ix}{p} + \frac{iy}{q} + \frac{iz}{r}$ we can multiply by $m$ to obtain
$$mx = (qr)ix + (pr)iy + (pq)iz$$
and it follows that
$$(qr)ix \equiv mx \mod p$$ $$(pr)iy \equiv mx \mod q$$ $$(pq)iz \equiv mx \mod r$$
These are three linear congurences which we can solve to obtain the seed $ix,iy,iz$.
The only caveat here is that numerically $x$ might not have the sufficient precision that when we multiply by $m$ we don't obtain $mx$ as given in the expression above. If this is the case we can still obtain the seed by using a try-fail method by testing all values in an interval around $mx$ (the size of the interval is given by the precision limit of the numerical system) untill we obtain the correct $mx$-value - the one where the seed we get gives us back $x$ when used in the algorithm.