Is there a simple test for uniform distributions?
I have a function that (more or less) is supposed to select a small number $m$ of random numbers from the range $[1,n]$ (for some $n \gg m$) and I need to test that it works. Is there an easy-to-implement test that will give good confidence that it is working correctly? (For reference, the full-on chi-squared test is not simple enough.)
Edit:
- $m$ in $[5,500]$, $n$ is small enough I can use normal int/float types for most math.
- I can get as many sets as I'm willing to wait for (more than 100 sets per second)
The reason I'm looking for something simple is that the code I'm checking isn't that complicated. If the test code is more complex than the code under test, then the potential for the test being wrong becomes a real problem.
Edit 2:
I took another look at chi-squared and it's not as bad as I remembered... as long as I have a fixed number of buckets and a fixed significance threshold. (Otherwise I'd need a canned CDF, and I don't think I have one.)
You could use a chi-square test. Basically, this tests whether the number of draws that fall into various intervals is consistent with a uniform random distribution. See Knuth's TAOCP Volume 2, *Seminumerical Algorithms*, for an explanation of the chi-square test.
Update: If the chi-square test is too complicated, you could plot a histogram of your output and eyeball it to see if it looks uniform. But if you'd like to quantify whether it's even enough, that's exactly what a chi-square test is. The test will also tell you whether the data are *too* evenly spread out, i.e. not random enough.
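If you fix the number of buckets and the significance threshold up front (as the OP's second edit suggests), the whole test needs no CDF code at run time. Here is a minimal Python sketch; the function name and the choice of 10 buckets are my assumptions, and 16.919 is the standard chi-square critical value for 9 degrees of freedom at the 0.05 level:

```python
def chi_square_uniform(draws, n, k=10, critical=16.919):
    """Chi-square goodness-of-fit test for draws from Uniform[1, n],
    using k equal-width buckets. `critical` is a canned chi-square
    critical value for k-1 degrees of freedom (16.919 = df 9, alpha
    0.05), so no CDF computation is needed at test time."""
    counts = [0] * k
    for x in draws:
        # map x in [1, n] onto a bucket index in [0, k-1]
        counts[min((x - 1) * k // n, k - 1)] += 1
    expected = len(draws) / k
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat < critical  # True = consistent with uniform at alpha 0.05

# Deterministic sanity check: a perfectly even sample over [1, 1000]
draws = list(range(1, 1001))
print(chi_square_uniform(draws, 1000))  # -> True
```

Note that with real random draws a correct generator will still fail this check about 5% of the time at this threshold, so run it on several independent sample sets before concluding the code is broken.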
You can use the Kolmogorov-Smirnov test too. It is a non-parametric test and will work on many distributions, including the uniform. The advantage it has over some other tests is that it looks at the whole distribution. It is also fairly easy to use.
The idea is to compare the CDF of the $H_0$ distribution with the empirical CDF of your observations, and find the maximum difference. The sampling distribution of this difference does not depend on the hypothesized distribution (hence the test is non-parametric); it depends only on $n$, the number of observations you are drawing. You then look it up in a table to see how far it falls outside the expected range.
The empirical CDF of your observations $\{x_1,x_2,\dots,x_n\}$ is $F_n(x)=\frac{1}{n}\sum_{i=1}^n I_{x_i\le x}$. Your test statistic is then $D_n=\sup_x|F(x)-F_n(x)|$. Assuming you sort the $x_i$ in ascending order, and assuming your numbers come from Uniform$[0,1]$ (without loss of generality, since you can rescale them appropriately), this works out to $D_n=\max_i\left(\max\left(\left|x_i-\tfrac{i}{n}\right|,\left|x_i-\tfrac{i-1}{n}\right|\right)\right)$.
Once you compute that, it's only a matter of looking it up in a table or using the right software. I recommend this site, which has an excellent tool to compute KS along with many other tests. For KS, select "Dn", enter the value you got, press the button, and interpret the result: the data are consistent with uniformity if the computed $P$ is at least your significance level $\alpha$, and you reject uniformity if $P \lt \alpha$.
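As a concrete sketch of the formula above (the function name is mine), here is the $D_n$ computation for samples rescaled to $[0,1]$, compared against the standard large-$n$ approximation $1.36/\sqrt{n}$ for the 0.05-level critical value, so no table lookup is needed:

```python
import math

def ks_uniform(samples):
    """One-sample Kolmogorov-Smirnov statistic D_n against Uniform[0,1].
    Returns (D_n, consistent), where `consistent` compares D_n to the
    asymptotic 0.05-level cutoff ~1.36/sqrt(n) (a canned critical value,
    reasonable for large n)."""
    xs = sorted(samples)
    n = len(xs)
    # at each sorted point, the ECDF steps from i/n to (i+1)/n
    d_n = max(max(abs(x - (i + 1) / n), abs(x - i / n))
              for i, x in enumerate(xs))
    return d_n, d_n < 1.36 / math.sqrt(n)

# Evenly spaced points give the smallest possible D_n for this n
d, ok = ks_uniform([(2 * i + 1) / 200 for i in range(100)])
print(round(d, 3), ok)  # -> 0.005 True
```

Scale your integer draws via `(x - 0.5) / n` (or similar) before calling this, so they land in $[0,1]$.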
One issue with the chi-square goodness-of-fit test is that, because it operates on the empirical PDF rather than the CDF, the outcome depends on the bin size (class-interval width) you choose: if your data actually come from a non-uniform distribution, wider bins create a bigger chance of reaching the wrong conclusion (imagine the extreme of putting all your data in a single bin: a perfectly rectangular histogram!). Also, if a bin count falls below 5 (as a rule of thumb), you need to merge bins, which again affects the outcome of the test. CDF-based tests do not use binning and hence do not suffer from this problem.

The Kolmogorov-Smirnov test on the CDF is rather crude, though, because it considers only one data point (the one of maximum deviation). A better one is the Cramer-von Mises test, which is similar to K-S but instead integrates the squared difference between the theoretical and empirical CDFs across the entire data range. There are other tests as well (Anderson-Darling, Shapiro-Wilk, etc.).
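For comparison, the Cramer-von Mises statistic is barely harder to compute than $D_n$ once the sample is sorted. A sketch under the same assumptions as before (samples rescaled to $[0,1]$; the function name is mine, and 0.461 is the standard asymptotic 0.05-level critical value for $W^2$):

```python
def cvm_uniform(samples):
    """Cramer-von Mises statistic W^2 against Uniform[0,1]. Unlike K-S,
    which keeps only the single largest deviation, W^2 accumulates the
    squared CDF differences across the whole sample. 0.461 is a canned
    asymptotic 0.05-level critical value."""
    xs = sorted(samples)
    n = len(xs)
    # W^2 = 1/(12n) + sum_i (x_(i) - (2i-1)/(2n))^2, here with 0-based i
    w2 = 1 / (12 * n) + sum((x - (2 * i + 1) / (2 * n)) ** 2
                            for i, x in enumerate(xs))
    return w2, w2 < 0.461

# Evenly spaced points hit the minimum W^2 = 1/(12n) exactly
w2, ok = cvm_uniform([(2 * i + 1) / 200 for i in range(100)])
print(ok)  # -> True
```

Like the K-S sketch, this trades exactness for simplicity: a fixed canned critical value instead of a full distribution table, which matches the OP's constraint that the test code stay simpler than the code under test.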