Excel - stipulate an average for random number function

Solution 1:

Consider a lottery in which the sponsor sells a bunch of tickets (let’s say 100), puts them into a container (let’s call it a hat), and then randomly selects a ticket from the hat to decide who wins a prize. Now suppose there are multiple prizes; the sponsor draws the first prize winner, then the second, and then the third. Naturally, the winning tickets are not put back into the hat after being drawn, because that would allow them to be selected again, resulting in a single ticket winning more than one prize, which is against the rules. Therefore, all selections after the first are constrained. In a sense, the game is rigged, so that subsequent winners can come only from the pool of non-winners, as a mechanism of reaching the goal of not awarding multiple prizes to any single ticket.

Now suppose that there are 100 prizes – everybody wins something. By the time the final ticket is pulled out of the hat, there’s only one ticket there. The selection is constrained to the point of being pre-determined. And you know who the 100th winner is – it’s the sad guy who’s standing in the crowd, waiting for his name to be called; the last remaining non-winner.

Your problem can be solved by a similar strategy:

start with “truly” random numbers (by which I mean numbers statistically distributed according to a probability distribution of your choosing, e.g., uniform distribution over the range 25-75) and then
constrain the later numbers (by adjusting the distribution) as a mechanism of reaching the goal of the stipulated average.

And the logic for constraining the probability distribution is simple: don’t generate (and enter into the pricing sheet) any number that makes it impossible to reach the goal.

I find it easier to think of the goal as achieving a specified total rather than a particular average. Of course these are logically equivalent. For example, suppose that you want only 4 numbers instead of 100. You want the average to be 50, so the total will need to be 4×50=200. Let’s say you generate the first two numbers, v₁=68 and v₂=70. These add up to 138, so v₃+v₄ must be 62. If v₃ is 50, then v₄ will have to be 12, which is outside the valid range of 25-75. Solving 68+70+v₃+25=200 yields v₃=200−(68+70+25)=37, so we constrain v₃ to lie between 25 and 37. Once v₃ is chosen, v₄ is determined; it must be 62−v₃; i.e., 200−(v₁+v₂+v₃). This is analogous to the situation with the 100th lottery ticket; it must be what’s left over after the first n−1 choices have been made.

This sounds like a job for VBA, but I found a worksheet-formulas solution. In the following, I have chosen to make the bounds of the desired probability distribution (25 and 75) parameters; i.e., put them in individual cells (A1 and B1) rather than hard-coding them into the formulas. And I use a few helper columns.

Set C1 to the desired total of the numbers. You can just enter the constant value, or compute it as average_number × number_of_numbers, i.e., =AVERAGE(A1, B1) * 100.
Set C2 to =C$1-SUM(G$1:G1).
Set D1 to =100-ROW().
Set E1 to =MAX(C1-D1*B$1, A$1).
Set F1 to =MIN(C1-D1*A$1, B$1).
Set G1 to =RANDBETWEEN(E1, F1).
Drag C2, D1, E1, F1, and G1 down to Row 100.

Discussion:

Let n be 100, the number of random numbers, and hence the number of rows. The following is the general implementation for the four-row example I gave earlier.

For each Row i (1 ≤ i ≤ n),

G_i is v_i, the i-th random number.
E_i and F_i are the lower and upper bounds of the range from which G_i is selected (as is obvious from the formula for G_i).
D_i counts down from n−1 in Row 1 to 0 in Row n; i.e., it gives the number of rows below the current one. If you start in a row other than 1, adjust this formula accordingly.
C1 is v₁+v₂+…+v_n−1+v_n
C_i = C1 − (v₁+v₂+…+v_i−2+v_i−1), which equals (the goal value for) v_i+v_i+1+…+v_n−1+v_n
D1*A$1 → D_i × A1 is the minimum valid value for v_i+1+v_i+2+…+v_n−1+v_n
So C1-D1*A$1 (C_i − D_i×A1) is an upper bound for v_i for it to be possible to reach the goal total.

Let’s revisit the four-row example. If (hypothetically) v₁ were 126, it would be impossible to achieve the goal of v₁+v₂+v₃+v₄=200 because of the constraint that v₂, v₃, and v₄ be ≥ 25. But if v₁ were 125, it would be possible to achieve the goal with v₂ = v₃ = v₄ = 25. So 125 is an upper bound for v₁. But, of course, v₁ is also required to be ≤ 75; the 125 limit is moot. The v values must satisfy both sets of constraints (both the 25-75 range and the constraints required to reach the total of n × 50), so they must satisfy the tighter of the two bounds on each side (above and below).
So F_i is the upper bound for v_i.
And E_i is the lower bound for v_i.

Here’s a snapshot of what I got with n = 10:

snapshot of sheet

Naturally, pressing F9 causes Excel to recalculate the sheet and generate a new set of random numbers. But G1:G10 always add up to 500; i.e., the average is 50.

If v_n is determined by v₁, v₂, …, v_n−2, and v_n−1, is it still really “random”?

This is an interesting question.

If n is 1, v₁ is forced to be 50. This is clearly not random. But, of course, this is an edge case.
If n is 2, v₁ is chosen randomly, then v₂ is set to 100−v₁. Clearly v₂ is not independent of v₁. But
- P(v₂=25) = P(v₁=75) = 1/51.
- P(v₂=26) = P(v₁=74) = 1/51.
- ︙
- P(v₂=75) = P(v₁=25) = 1/51.
So v₂ is uniformly distributed over the range 25-75, the same as v₁.
Going back to the lottery example – the winner of the 100th prize is determined by the winners of the first 99. But, before any tickets are taken out of the hat, every contestant has an equal chance of winning that 100th prize. So the winner of the 100th prize is random.
I’m having trouble doing the math for this question for n > 2. So I did an empirical experiment. I generated 100 random numbers between 25 and 75, and I recalculated 250 times, accumulating the results. Then I charted the frequency distributions for several of the v values:



^{(The images are links to full-sized versions of themselves.)}

Up through v₉₁, the distributions look fairly uniform:



But, starting at v₉₂, something interesting starts to happen:



Note that I had to change the scale at v₉₄ and beyond:





It looks like the last few v values have non-uniform distributions. I guess that, as n gets larger, it becomes more likely that the first n−10 (or so) values will drift away from the desired average, forcing the last few values to the extreme numbers at the ends of the range, in order to bring the average back on track. But, at least, all of the values should have symmetric distributions.
- There’s no reason for them to be asymmetric.
- My charts appear to be symmetric.

If you’re interested in pursuing this issue, you might want to ask it on Math Stack Exchange – they eat questions like this for breakfast.

Solution 2:

It is not theoretically possible to generate a random (or even pseudo-random) list with a fixed average because this requires that the later values depend on the earlier values and could even force you to go back and change some of the earlier values in some cases.

Excel - stipulate an average for random number function

Solution 1:

Discussion:

Solution 2:

Related

Recent Posts