Excel - stipulate an average for random number function
Solution 1:
Consider a lottery in which the sponsor sells a bunch of tickets (let’s say 100), puts them into a container (let’s call it a hat), and then randomly selects a ticket from the hat to decide who wins a prize. Now suppose there are multiple prizes; the sponsor draws the first prize winner, then the second, and then the third. Naturally, the winning tickets are not put back into the hat after being drawn, because that would allow them to be selected again, resulting in a single ticket winning more than one prize, which is against the rules. Therefore, all selections after the first are constrained. In a sense, the game is rigged, so that subsequent winners can come only from the pool of non-winners, as a mechanism of reaching the goal of not awarding multiple prizes to any single ticket.
Now suppose that there are 100 prizes – everybody wins something. By the time the final ticket is pulled out of the hat, there’s only one ticket there. The selection is constrained to the point of being pre-determined. And you know who the 100th winner is – it’s the sad guy who’s standing in the crowd, waiting for his name to be called; the last remaining non-winner.
Your problem can be solved by a similar strategy:
- start with “truly” random numbers (by which I mean numbers statistically distributed according to a probability distribution of your choosing, e.g., uniform distribution over the range 25-75) and then
- constrain the later numbers (by adjusting the distribution) as a mechanism of reaching the goal of the stipulated average.
And the logic for constraining the probability distribution is simple: don’t generate (and enter into the pricing sheet) any number that makes it impossible to reach the goal.
I find it easier to think of the goal as achieving a specified total rather than a particular average. Of course these are logically equivalent. For example, suppose that you want only 4 numbers instead of 100. You want the average to be 50, so the total will need to be 4×50=200. Let’s say you generate the first two numbers, v1=68 and v2=70. These add up to 138, so v3+v4 must be 62. If v3 is 50, then v4 will have to be 12, which is outside the valid range of 25-75. Solving 68+70+v3+25=200 yields v3=200−(68+70+25)=37, so we constrain v3 to lie between 25 and 37. Once v3 is chosen, v4 is determined; it must be 62−v3; i.e., 200−(v1+v2+v3). This is analogous to the situation with the 100th lottery ticket; it must be what’s left over after the first n−1 choices have been made.
This sounds like a job for VBA, but I found a worksheet-formulas solution.
In the following, I have chosen
to make the bounds of the desired probability distribution (25 and 75) parameters;
i.e., put them in individual cells (A1
and B1
)
rather than hard-coding them into the formulas. And I use a few helper columns.
- Set
C1
to the desired total of the numbers. You can just enter the constant value, or compute it as average_number × number_of_numbers, i.e.,=AVERAGE(A1, B1) * 100
. - Set
C2
to=C$1-SUM(G$1:G1)
. - Set
D1
to=100-ROW()
. - Set
E1
to=MAX(C1-D1*B$1, A$1)
. - Set
F1
to=MIN(C1-D1*A$1, B$1)
. - Set
G1
to=RANDBETWEEN(E1, F1)
. - Drag
C2
,D1
,E1
,F1
, andG1
down to Row 100.
Discussion:
Let n be 100, the number of random numbers, and hence the number of rows. The following is the general implementation for the four-row example I gave earlier.
For each Row i (1 ≤ i ≤ n),
-
Gi
is vi, the i-th random number. -
Ei
andFi
are the lower and upper bounds of the range from whichGi
is selected (as is obvious from the formula forGi
). -
Di
counts down from n−1 in Row 1 to 0 in Row n; i.e., it gives the number of rows below the current one. If you start in a row other than 1, adjust this formula accordingly. -
C1
is v1+v2+…+vn−1+vnCi
=C1
− (v1+v2+…+vi−2+vi−1), which equals (the goal value for) vi+vi+1+…+vn−1+vn -
D1*A$1
→Di
×A1
is the minimum valid value for vi+1+vi+2+…+vn−1+vn
SoC1-D1*A$1
(Ci
−Di
×A1
) is an upper bound for vi for it to be possible to reach the goal total.Let’s revisit the four-row example. If (hypothetically) v1 were 126, it would be impossible to achieve the goal of v1+v2+v3+v4=200 because of the constraint that v2, v3, and v4 be ≥ 25. But if v1 were 125, it would be possible to achieve the goal with v2 = v3 = v4 = 25. So 125 is an upper bound for v1. But, of course, v1 is also required to be ≤ 75; the 125 limit is moot. The v values must satisfy both sets of constraints (both the 25-75 range and the constraints required to reach the total of n × 50), so they must satisfy the tighter of the two bounds on each side (above and below).
SoFi
is the upper bound for vi. - And
Ei
is the lower bound for vi.
Here’s a snapshot of what I got with n = 10:
Naturally, pressing F9 causes Excel to recalculate the sheet
and generate a new set of random numbers.
But G1:G10
always add up to 500; i.e., the average is 50.
If vn is determined by v1, v2, …, vn−2, and vn−1, is it still really “random”?
This is an interesting question.
- If n is 1, v1 is forced to be 50. This is clearly not random. But, of course, this is an edge case.
-
If n is 2, v1 is chosen randomly, then v2 is set to 100−v1. Clearly v2 is not independent of v1. But
- P(v2=25) = P(v1=75) = 1/51.
- P(v2=26) = P(v1=74) = 1/51.
- ︙
- P(v2=75) = P(v1=25) = 1/51.
So v2 is uniformly distributed over the range 25-75, the same as v1.
- Going back to the lottery example – the winner of the 100th prize is determined by the winners of the first 99. But, before any tickets are taken out of the hat, every contestant has an equal chance of winning that 100th prize. So the winner of the 100th prize is random.
-
I’m having trouble doing the math for this question for n > 2. So I did an empirical experiment. I generated 100 random numbers between 25 and 75, and I recalculated 250 times, accumulating the results. Then I charted the frequency distributions for several of the v values:
(The images are links to full-sized versions of themselves.)
Up through v91, the distributions look fairly uniform:
But, starting at v92, something interesting starts to happen:
Note that I had to change the scale at v94 and beyond:
It looks like the last few v values have non-uniform distributions. I guess that, as n gets larger, it becomes more likely that the first n−10 (or so) values will drift away from the desired average, forcing the last few values to the extreme numbers at the ends of the range, in order to bring the average back on track. But, at least, all of the values should have symmetric distributions.
- There’s no reason for them to be asymmetric.
- My charts appear to be symmetric.
If you’re interested in pursuing this issue, you might want to ask it on Math Stack Exchange – they eat questions like this for breakfast.
Solution 2:
It is not theoretically possible to generate a random (or even pseudo-random) list with a fixed average because this requires that the later values depend on the earlier values and could even force you to go back and change some of the earlier values in some cases.