Probability for any ball to be picked X times.

My team member and I came across a problem in a non-Math-related research area. I am paraphrasing the problem here in order to seek your help:

There are 10000 different numbered balls in a bowl. Each time, 10 balls are picked at random from the bowl and put back into the bowl. This process is repeated 5 times.

What is the probability that at least one numbered ball is picked at least 3 times?

Our first idea is to use cumulative Binomial probability to find the probability that a particular ball is picked more than 2 times. But this only gives the probability from the "perspective of a single ball". How can we expand this to seek the answer we need? We highly appreciate any help regarding this.


Addendum-2 added to provide an exact answer.

Addendum-3 added to discuss sanity-checking, show an alternative (direct) approach to the problem, and to display key computations.


Analytical mistake found and corrected.
See the Addendum-1 for a discussion of my mistake.


$\underline{\textbf{Overview}}$

There are 10000 different numbered balls in a bowl. Each time, 10 balls are picked at random from the bowl and put back into the bowl. This process is repeated 5 times.

What is the probability that at least one numbered ball is picked at least 3 times?

This response represents a long-winded use of Inclusion-Exclusion to determine an exact closed form computation of the answer. Relevant equations are tagged, and the end of this response contains a Summary section that puts all of the relevant items together.


$\underline{\textbf{Combinatorics : The Denominator}}$

The (Probability) problem can be attacked exclusively through Combinatorics, with a computation of

$$ 1 - \frac{N\text{(umerator)}}{D\text{(enominator)}}, \tag1$$

where $D$ represents the total number of ways that $5$ random selections can be made. Here, each selection is of $10$ balls, out of the $10000$, with the sampling of the $10$ balls done without replacement.

I am going to construe that there were (in effect) $50$ different selections made, and that the order that these selections were made is relevant. This will facilitate the computation of $N$. However, this implies that the computation of $D$ must be made in a consistent manner.

Therefore,

$$D = \left[\frac{(10000)!}{(10000 - 10)!}\right]^5. \tag2 $$


$\underline{\textbf{Combinatorics : The Numerator}}$

The approach that I will take for computing $N$ will be to have the primary focus be on the $(50)$ slots, (i.e. ball selection positions), rather than the $(10000)$ balls.

Then, the number of slot-pairings where the same ball was selected in a pair of slots, must be some element $k$ in $\{0,1,2,\cdots, 25\}.$ To clarify, this means that $k$ numbers from $\{1,2,\cdots, 10000\}$ were each used exactly twice, and then $(50 - 2k)$ other numbers from $\{1,2,\cdots, 10000\}$ were each used exactly once. This means that none of the numbers from $\{1,2,\cdots, 10000\}$ were used more than twice.

For $k \in \{0,1,2,\cdots,25\}$, let $E_k$ denote the event that there were exactly $k$ slot pairings. Then there are $26$ mutually exclusive events, $E_0, E_1, E_2, \cdots, E_{25}$.

Let $|E_k|$ represent the number of ways that event $E_k$ can occur. Then, because the events are mutually exclusive, you have that

$$N = \sum_{k=0}^{25} |E_k|. \tag3 $$

So, the problem has been reduced to the computation of $|E_k| ~: k \in \{0,1,2,\cdots, 25\}.$


$\underline{\textbf{Computation of} ~|E_k|}$

In order to implement a generic method for computing $|E_k|$, I will employ the helper functions $A(k)$ and $B(k)$.

$A(k)$ will denote the number of ways of selecting $k$ slot pairs out of the $(50)$ slots so that none of the $k$ slot pairs are in the same group of $(10)$ slots.

Then, $B(k)$ will represent the number of ways that $(50 - k)$ different numbers can be selected, in some order, from the set $\{1,2,\cdots, 10000\}$, under the assumption that $k$ of the selected numbers will be repeated (i.e. $k$ slot pairs will be formed).

Then, you will have that

$$|E_k| = A(k) \times B(k) ~: ~B(k) = \frac{[10000]!}{[10000 - (50 - k)]!}. \tag5 $$

Consequently, the entire problem has been reduced to the computation of $A(k)$, for $k \in \{0,1,2,\cdots, 25\}.$


$\underline{\textbf{Helper Function} ~F(x,y) ~x ~\text{slot pairs within} ~y ~\text{slot positions}}$

The computations are better organized by first creating the helper function $F(x,y)$. This represents the number of distinct ways that $x$ slot pairs can be formed within $y$ slot positions, where $x,y \in \Bbb{Z^+}$ and $(2x) \leq y$.

$\displaystyle F(x,y) = \binom{y}{2} \times \binom{y-2}{2} \times \cdots \times \binom{y - 2x + 2}{2} \times \frac{1}{x!}$.

The $~\displaystyle \left(\frac{1}{x!}\right)~$ factor is an overcounting management factor that is designed to accommodate that the order that the $x$ slot-pairs are made within the group of $y$ slot positions is irrelevant.

The computation simplifies to

$$F(x,y) = \frac{y!}{[(y - 2x)!] \times (2)^x \times [x!]}. \tag6$$


$\underline{\textbf{Computation of} ~A(k)}$

The computation of $A(k)$, which is by far the most complicated portion of the algorithm, will be done via Inclusion-Exclusion.

For any set $S$, with a finite number of elements, let $|S|$ denote the number of elements in the set $S$.

Let $T_k$ denote the set that represents the formation of $k$ slot pairs, from the slot-positions $\{1,2,\cdots,50\}.$

Let $T(k,0)$ denote $|T_k|$.

Then,

$$T(k,0) = F(k,50).\tag7 $$

Clearly,

$$A(0) = T(0,0) = 1.\tag8 $$

For the remainder of this answer, it will be assumed that $k \in \{1,2,\cdots,25\}.$


$\underline{\textbf{Inclusion-Exclusion Variables}}$

There are $(5)$ groups of $(10)$, that I will refer to as $G^1, G^2, G^3, G^4, G^5$. If at least one of the $(k)$ slot pairs is completely contained in group $G^r ~: ~r \in \{1,2,3,4,5\}$, then I will regard group $G^r$ as having been violated.

For $~r \in \{1,2,3,4,5\}, ~k \in \{1,2,\cdots, 25\}$, I will let $H_k^r$ refer to the subset of $T_k$ where group $G^r$ was violated.

Then, for $k \in \{1,2,\cdots,25\}$ you have that

$$A_k = T(k,0) - |H_k^1 \cup H_k^2 \cup H_k^3 \cup H_k^4 \cup H_k^5|.\tag9$$

Let $T(k,1)$ denote $\displaystyle \sum_{1 \leq i_1 \leq 5} |H_k^{i_1}|.$
That is, $T(k,1)$ represents the summation of $\displaystyle \binom{5}{1}$ terms.

For $k \geq 2$,
let $T(k,2)$ denote $\displaystyle \sum_{1 \leq i_1 < i_2 \leq 5} |H_k^{i_1} \cap H_k^{i_2}|.$
That is, $T(k,2)$ represents the summation of $\displaystyle \binom{5}{2}$ terms.

For $k \geq 3$,
let $T(k,3)$ denote $\displaystyle \sum_{1 \leq i_1 < i_2 < i_3 \leq 5} |H_k^{i_1} \cap H_k^{i_2} \cap H_k^{i_3}|.$
That is, $T(k,3)$ represents the summation of $\displaystyle \binom{5}{3}$ terms.

For $k \geq 4$,
let $T(k,4)$ denote $\displaystyle \sum_{1 \leq i_1 < i_2 < i_3 < i_4 \leq 5} |H_k^{i_1} \cap H_k^{i_2} \cap H_k^{i_3} \cap H_k^{i_4}|.$
That is, $T(k,4)$ represents the summation of $\displaystyle \binom{5}{4}$ terms.

For $k \geq 5$,
let $T(k,5)$ denote $|H_k^1 \cap H_k^2 \cap H_k^3 \cap H_k^4 \cap H_k^5|.$

Then, in accordance with Inclusion-Exclusion, for $k \geq 5$:

$$A_k = \sum_{i = 0}^5 (-1)^i T(k,i).\tag{10} $$

For $k \in \{1,2,3,4\}$, the alteration to the formula for $A_k$, in (10) above is

$$A_k = \sum_{i = 0}^k (-1)^i T(k,i).\tag{11} $$

Equation (7) above provides the formula for $T(k,0)$.

Equation (8) above provides the formula for $A(0)$.

Equations (10) and (11) above provide the formula(s) for $A(k) ~: k \geq 1$
in terms of $~T(k,0), ~T(k,1), ~\cdots, ~T(k,5)$.

Therefore, the problem has been completely reduced to the computation of
$T(k,1), ~T(k,2), ~T(k,3), ~T(k,4) ~$ and $~ T(k,5),~$ for $~k \in \{1,2,\cdots, 25\}.$


$\underline{\textbf{Programmatic Computation of} ~~~T(k,r)}$

First, see Addendum-1, which discusses (in broad terms), the analytical mistake that I made, when computing $T(k,1)$. Correcting this mistake is so complicated, that I have decided to completely overhaul the method used to compute $~T(k,r) ~: ~r \in \{1,2,3,4,5\}$.

This section examines the complication that I originally overlooked, and then discusses how to remedy the situation, as cleanly as possible.

First, for illustrative purposes, assume that $k = 15$, $r = 3$, and it is desired to compute $T(15,3)$.
By symmetry considerations, you know that:

$\displaystyle T(k,3) = \binom{5}{3} |H_k^1 \cap H_k^2 \cap H_k^3|.$

Therefore, the attack of $T(15,3)$ has been reduced to attacking $|H_{15}^1 \cap H_{15}^2 \cap H_{15}^3|.$

Per the definition of $H_k^r$, there must be at least one slot pair completely contained in each of $G^1, G^2, G^3.$ This establishes that $G_1, G_2, G_3$ have each been violated. Let $D$ denote the region of $20$ slot positions that represents the original $(50)$ slot positions, with each of $G_1, G_2, G_3$ excluded.

Superficially, one might suppose that the remaining $(15 - 3 = 12)$ slot pairs must be distributed so that which ever slot pairs are not completely contained in $G_1, G_2,$ or $G_3$ must be completely contained in $D$.

This supposition, which I originally fell into, is false. Whichever of the $(12)$ remaining slot pairs, that are not completely contained in $G_1, G_2, G_3$, or $D$ may be shared by any two of the $4$ regions.

The following constraints must be satisfied, under the assumption that $k = 15, r = 3,$ and that you are specifically examining $|H_{15}^1 \cap H_{15}^2 \cap H_{15}^3|.$

  • Each of $H_{15}^1, H_{15}^2, H_{15}^3$ must be assigned at least one slot pair. This means that the slot pair is completely contained in the pertinent group.

  • The remaining $(12)$ slot pairs, may be similarly assigned to $H_{15}^1, H_{15}^2, H_{15}^3,$ and $D$. Here, $H_{15}^1, H_{15}^2, H_{15}^3$ must not be assigned more than $5$ slot pairs each. Further, since $D$ represents only $(20)$ slot positions, $D$ can not completely contain more than $(10)$ slot pairs.

  • However many slot pairs are completely contained by $H_{15}^1, H_{15}^2, H_{15}^3,$ or $D$, the enumeration of the number of ways of assigning such completely contained slot pairs is handled by the helper function $F(x,y)$.

In enumerating the number of ways of sharing those slot pairs (out of the $15$ slot pairs) not completely contained in a region:

  • Since there are $(4)$ distinct regions, there are $\displaystyle \binom{4}{2} = (6)$ different ways that the regions may be paired up. This means that there are $(6)$ choices for which two regions will share a specific slot pair.

  • Suppose that (for example) you are examining the situation where $H_{15}^1$ completely contains $x_1$ slot pairs and $H_{15}^2$ completely contains $x_2$ slot pairs. This means that $H_{15}^1$ only has $(10 - 2x_1)$ available slot positions for sharing, and that $H_{15}^2$ only has $(10 - 2x_2)$ available slot positions for sharing. This means that the most slot positions that can be shared between $H_{15}^1$ and $H_{15}^2$ are $\min[10 - 2x_1, 10 - 2x_2]$ slot positions. It also means (for example) that the total number of slot pairs that $H_{15}^1$ shares with the other $(3)$ regions combined can not exceed $(10 - 2x_1)$.

Suppose that you are exploring the possibility of sharing $(5)$ slot pairs between $H_{15}^1$ and $D$ which has a size of $(20)$ slot positions. Further suppose that $H_{15}^1$ has, (at this point in the computation), $(7)$ available slot positions, while $D$ has $(13)$ available slot positions. What is the enumeration for sharing the $(5)$ slot pairs, in this scenario?

  • There are $\displaystyle \frac{(7)!}{(7 - 5)!}$ and $\displaystyle \frac{(13)!}{(13 - 5)!}$ choices for selecting slot positions from the $H_{15}^1$ and $D$ regions, respectively. Further, each distinct sharing is counted $(5!)$ ways, since there are $(5!)$ ways of ordering how the shared positions are selected. Therefore, the enumeration is

$$\frac{(7)!}{(7 - 5)!} \times \frac{(13)!}{(13 - 5)!} \times \frac{1}{5!}.\tag{12}$$


$\underline{\textbf{Pseudocode For Computation of } ~T(k,r)}$
$\underline{: ~r \in \{1,2,3,4,5\}, ~k \geq r}$

First, assume that $r \leq 4$. The end of this section will discuss the change in the pseudocode for $r = 5$.

As discussed in the previous section, the idea of explicitly specifiying a nesting of $\displaystyle \left[(r + 1) + \binom{r + 1}{2}\right]$ summations does not seem practical. It seems better to establish a generic (software) algorithm to handle the computation.

Assume that $c = \displaystyle \binom{r + 1}{2}.$

You will have $(r + 1) + c$ variables. Denote them as $e_1, e_2, \cdots, e_r, f, g_1, g_2, \cdots, g_c$. Loop through all of the possibilities of these $(r + 1) + c$ variables, rejecting any collection of $(r + 1) + c$ values that do not satisfy all of the following constraints:

  • All $(r + 1) + c$ variables must be non-negative integers.

  • $e_1, e_2, \cdots, e_r$ must all be positive.

  • $\displaystyle e_1 + e_2 + \cdots + e_r + f + g_1 + g_2 + \cdots + g_c = k$, where $k$ is the number of slot pairs to be formed.

  • $e_1, \cdots, e_r$ must each be $\leq 5$. Since one of the regions will have $(50 - 10r)$ slot positions, $f$ must be $\leq (25 - 5r)$.

  • The $\displaystyle \binom{r+1}{2} = c$ sharing variables, $g_1, g_2, \cdots, g_c$ must each be assigned to two of the regions so that each region has $r$ sharing variables assigned to it, and each of the $c$ ways of combining two regions has its own variable.

  • For each region $H_k^r$, the sum of $2 \times$ the pair-containment variable, $(2 \times e_r)$, and the $r$ sharing variables must not exceed $10$. Similarly, for the region that corresponds to containment variable $f$, the sum of $2 \times $ this containment variable and its associated $r$ sharing variables must not exceed $(50 - 10r)$.

  • Given any collection of $(r + 1 + c)$ satisfying variables, you can compute the total number of slot pair containment possibilities as the product of $(r + 1)$ factors, where the $F(x,y)$ formula in (6) above is used. Similarly, you would compute the total number of slot pair sharing possibilities as the product of $\displaystyle \binom{r+1}{2} = c$ factors, where you can follow the model given in (12) above.

  • Then, for the specific satisfying collection of $(r + 1 + c)$ variables, you then take the slot pair containment product times the slot pair sharing product. Then, by the symmetry considerations discussed in the previous section, you also apply the factor of $\displaystyle \binom{5}{r}$.

  • The computations represented in the last two bullet points must be repeated for each set of satisfying values for the variables. You would then take the sum of these computations, with a separate term for each set of satisfying values.

When $r = 5$ there is very little difference. You will have (only) $5$ slot pair containment variables, each of which corresponds to a region of $(10)$ slot positions. Then, you will have $\displaystyle\binom{5}{2} = 10$ slot pair sharing variables.


$\underline{\textbf{Summary: Putting It All Together}}$

There are 10000 different numbered balls in a bowl. Each time, 10 balls are picked at random from the bowl and put back into the bowl. This process is repeated 5 times.

What is the probability that at least one numbered ball is picked at least 3 times?

The probability is

$$ 1 - \frac{N}{D}. $$

$$D = \left[\frac{(10000)!}{(10000 - 10)!}\right]^5. $$

For $k \in \{0,1,2,\cdots,25\}$, let $E_k$ denote the event that there were exactly $k$ slot pairings. Then there are $26$ mutually exclusive events, $E_0, E_1, E_2, \cdots, E_{25}$.

Let $|E_k|$ represent the number of ways that event $E_k$ can occur. Then, because the events are mutually exclusive, you have that

$$N = \sum_{k=0}^{25} |E_k|. $$


$A(k)$ will denote the number of ways of selecting $k$ slot pairs out of the $(50)$ slots so that none of the $k$ slot pairs are in the same group of $(10)$ slots.

Then, $B(k)$ will represent the number of ways that $(50 - k)$ different numbers can be selected, in some order, from the set $\{1,2,\cdots, 10000\}$, under the assumption that $k$ of the selected numbers will be repeated (i.e. $k$ slot pairs will be formed).

Then, you will have that

$$|E_k| = A(k) \times B(k) ~: ~B(k) = \frac{[10000]!}{[10000 - (50 - k)]!}. $$


$F(x,y) ~$ represents the number of distinct ways that $x$ slot pairs can be formed within $y$ slot positions, where $x,y \in \Bbb{Z^+}$ and $(2x) \leq y$.

$$F(x,y) = \frac{y!}{[(y - 2x)!] \times (2)^x \times [x!]}. $$

Let $T_k$ denote the set that represents the formation of $k$ slot pairs, from the slot-positions $\{1,2,\cdots,50\}.$

Let $T(k,0)$ denote $|T_k|$.

$$T(k,0) = F(k,50). $$

$$A(0) = T(0,0) = 1. $$


There are $(5)$ groups of $(10)$, that I will refer to as $G^1, G^2, G^3, G^4, G^5$.
If at least one of the $(k)$ slot pairs is completely contained in group $G^r ~: ~r \in \{1,2,3,4,5\}$, then I will regard group $G^r$ as having been violated.

For $~r \in \{1,2,3,4,5\}, ~k \in \{1,2,\cdots, 25\}$, I will let $H_k^r$ refer to the subset of $T_k$ where group $G^r$ was violated.

Then, for $k \in \{1,2,\cdots,25\}$ you have that

$$A_k = T(k,0) - |H_k^1 \cup H_k^2 \cup H_k^3 \cup H_k^4 \cup H_k^5|. $$

Let $T(k,1)$ denote $\displaystyle \sum_{1 \leq i_1 \leq 5} |H_k^{i_1}|.$
That is, $T(k,1)$ represents the summation of $\displaystyle \binom{5}{1}$ terms.

For $k \geq 2$,
let $T(k,2)$ denote $\displaystyle \sum_{1 \leq i_1 < i_2 \leq 5} |H_k^{i_1} \cap H_k^{i_2}|.$
That is, $T(k,2)$ represents the summation of $\displaystyle \binom{5}{2}$ terms.

For $k \geq 3$,
let $T(k,3)$ denote $\displaystyle \sum_{1 \leq i_1 < i_2 < i_3 \leq 5} |H_k^{i_1} \cap H_k^{i_2} \cap H_k^{i_3}|.$
That is, $T(k,3)$ represents the summation of $\displaystyle \binom{5}{3}$ terms.

For $k \geq 4$,
let $T(k,4)$ denote $\displaystyle \sum_{1 \leq i_1 < i_2 < i_3 < i_4 \leq 5} |H_k^{i_1} \cap H_k^{i_2} \cap H_k^{i_3} \cap H_k^{i_4}|.$
That is, $T(k,4)$ represents the summation of $\displaystyle \binom{5}{4}$ terms.

For $k \geq 5$,
let $T(k,5)$ denote $|H_k^1 \cap H_k^2 \cap H_k^3 \cap H_k^4 \cap H_k^5|.$

Then, in accordance with Inclusion-Exclusion, for $k \geq 5$:

$$A_k = \sum_{i = 0}^{\min(5,k)} (-1)^i T(k,i). $$


For the computation of $T(k,r)$, see the following sections:

  • Programmatic Computation of $~~~T(k,r)$

  • Pseudocode For Computation of $~~~T(k,r)$


$\underline{\textbf{Addendum-1: Discussion of Analytical Mistake}}$

Consider the following excerpt from the section:

Computation of $~T(k,1) ~: k \geq 1$.

To compute $|H_k^1|$ consider that the number of violations that are occurring inside of $H_k^1$ is some element $i$ in $\{1,2,\cdots, \min[5,k]\}$, and the number of violations that are occurring outside of $H_k^1$ are $(k - i)$.

The above excerpt, which has been corrected, is wrong. I discovered the mistake when using a computer simulation to manually compute (for example) $A(2)$. The mistake is that I misinterpreted what $H_k^1$ needs to represent.

$T_k$ denotes the set that represents that exactly $k$ slot pairs have been formed from the slot positions $\{1,2,\cdots, 50\}$.

$T_k(0)$, which denotes $|T_k|,$ is equal to $F(k,50)$.

$H_k^1$ necessarily represents the subset of $T_k$ where group $G^1$ was violated. This means that there were exactly $k$ slot pairs formed, and at least $(1)$ of the $k$ slot pairs is totally contained in $G^1$.

Part of the above excerpt is correct. That is, the number of slot pairs totally contained in $G^1$ must be some element $(i)$ in $\{1,2,\cdots, \min[5,k]\}$.

However, this does not imply that the other $(k - i)$ slot pairs must occur totally outside of $G^1$. Instead, it merely implies that the other $(k - i)$ slot pairs are not totally contained in $G^1$. That is, one or more of these other $(k - i)$ slot pairs may still have one element inside of $G^1$, and one element outside of $G_1$.

I made similar mistakes with respect to the computations of $T_k(2), T_k(3), T_k(4),$ and $T_k(5)$. Because of the added complication in computing $~T_k(r) ~: ~r \in \{1,2,3,4,5\}$, I have abandoned my original approach of explicitly specifying the pertinent (multiple) summation(s) that represent $T_k(r)$.

Instead, I have replaced the corresponding sections with new sections:

  • Programmatic Computation of $~~~T(k,r)$

  • Pseudocode For Computation of $~~~T(k,r)$

These new sections represent the cleanest approach to the problem that I could conjure. See also, Addendum-2, which provides an exact answer, and Addendum-3, which discusses sanity-checking $A(k)$.


$\underline{\textbf{Addendum-2: The Exact Answer}}$

See also Addendum-3, which discusses my attempt to eliminate all analytical/arithmetic errors. If there is an error here, I can't find it.

The exact value of the numerator is

97764 66436  56143 77455  88343 04066  53926 52678
85054 65050  00769 58007  81459 03523  37082 73950
59966 68940  01160 25362  28962 22626  90344 95207
15411 55691  57470 71405  37027 65275  58261 08373
14668 57820  89474 16370  82644 48000  00000 00000

The exact value of the denominator is

97774 42674  38827 31934  47266 26439  36204 96249
83050 55177  80234 33824  19788 80516  11695 32304
64391 00450  04771 83954  44808 91188  58949 42381
77798 78757  84541 52208  66813 35192  47968 47657
33912 35915  44832 00000  00000 00000  00000 00000

Each of the above numbers should be read left to right, top to bottom.

Therefore, the respective approximate values of the numerator and denominator are

$$N \approx 9.776466436 \times (10)^{(199)}, ~~ D \approx 9.777442674 \times (10)^{(199)}.$$

Therefore,

$$\frac{N}{D} \approx \frac{9.776466436}{9.777442674} \approx 0.999900154 \implies $$

$$1 - \frac{N}{D} \approx 9.984595 \times (10)^{-(5)}.$$


$\underline{\textbf{Addendum-3: Sanity-Checking}}$

First of all, does the answer make sense?

For $~n \in \{1,2,\cdots, 10000\},~$ let $G(n)$ denote the event that the number $n$ is picked at least $3$ times, and let $P(n)$ denote the probability of event $G(n)$ occurring.

$G(1), G(2), \cdots, G(10000)$ are close to being independent events. Further, $P(n)$ can be reasonably approximated to be the probability of the number $n$ being picked exactly $3$ times.

So,

$$P(n) \approx \binom{5}{3} \times \left[\frac{10}{10000}\right]^{(3)} \approx (10)^{(-8)}.$$

Therefore, the chance that none of the events $G(1), G(2), \cdots, G(10000)$ occur is

$$\approx \left[1 - (10)^{(-8)}\right]^{(10000)} \approx 0.999900005.$$

Therefore, the chance that at least one of the events $G(1), G(2), \cdots, G(10000)$ occurs is

$$\approx 9.999501 \times (10)^{-(5)}.$$

If there is an analytical/arithmetic error in my Addendum-2, what is it's most likely source?

The computation of

$$D = \left[\frac{(10000)!}{(10000 - 10)!}\right]^5 $$

is straightforward.

In calculating

$$N = \sum_{k=0}^{(25)} \left[A(k) \times B(k)\right] ~: ~B(k) = \frac{[10000]!}{[10000 - (50 - k)]!},$$

the calculation of $B(k)$ is also straightforward.

As discussed in Addendum-1, I originally had analytical errors re my computation of $A(k)$. These errors involved overlooking the possibility that some slot-pair positions could be shared.

Therefore, it became important to sanity-check my computation of $A(k) ~: ~k \in \{0,1,2,\cdots, 25\}.$

Unfortunately, my pc can't reasonably handle more than $(10)^8$ iterations. Therefore, I was blocked from brute-force verification on $A(k)$ for $k \geq 4$.

Then:

Strangely enough, I discovered a (much simpler) direct approach to calculating $A(k)$, for this problem. I independently applied this approach, and then manually verified each of $A(0), A(1), \cdots, A(25)$.

The direct approach is as follows:

Consider the $50$ slot pair positions as being partitioned into:

  • Group 1 : positions $1 - 10$.
  • Group 2 : positions $11 - 20$.
  • Group 3 : positions $21 - 30$.
  • Group 4 : positions $31 - 40$.
  • Group 5 : positions $41 - 50$.

Assume that each slot-pair has a lower end $L$, and an upper end $U$. You can then assume that if $L$ is in Group $a ~: a \in \{1,2,3,4\}$, then $U$ must be in one of Groups $(a+1), (a+2), \cdots, 5$.

Further, you can assume that when you have exactly $k$ slot pairs, with the lower ends of these slot pairs given by $L_1, L_2, \cdots, L_k$, that these lower ends are in strictly ascending order. Note, that you must still allow (for example) that $U_1 > U_2$.

The algorithm for programmatically (directly) computing $A(k)$ is as follows:

For a specific fixed value of $k \in \{0,1,2,\cdots, 25\}$, let $x_1, x_2, x_3, x_4$ denote the number of slot pairs whose lower end is in Group-1, Group-2, Group-3, Group-4, respectively.

Consider all of the solutions to the following equation:

  • $x_1 + x_2 + x_3 + x_4 = k$.

  • $x_i \in \{0,1,2,\cdots,10\} ~: ~i \in \{1,2,3,4\}$.

  • $\{[2 \times (x_4)] + x_3\} \leq 20.$

  • $\{[2 \times (x_4 + x_3)] + x_2\} \leq 30.$

  • $\{[2 \times (x_4 + x_3 + x_2)] + x_1\} \leq 40.$

From Stars and Bars, you know that the number of such solutions will be somewhat less than

$$\binom{k + 3}{3} \leq \binom{28}{3} < 4000.$$

Therefore, for each value of $k$, the pc can routinely identify each of the $(< 4000)$ individual corresponding solutions. Then, the problem of directly computing $A(k)$ is reduced to computing the number of non-violating slot pair collections that correspond to each individual solution $x_1, x_2, x_3, x_4$. Here, the phrase non-violating refers to the fact that for any slot pair $(L,U)$, $U$ must be in a higher group than $L$.

The formula, for a specific ordered quadruple $(x_1, x_2, x_3, x_4)$ is:

$$F_4 \times F_3 \times F_2 \times F_1, ~~\text{where}$$

$$F_4 = \binom{10}{x_4} \times \frac{\{10\}!}{\{10 - x_4\}!}.$$

$$F_3 = \binom{10}{x_3} \times \frac{\{20 - [2 \times (x_4)]\}!}{\{20 - [2 \times (x_4)] - x_3\}!}.$$

$$F_2 = \binom{10}{x_2} \times \frac{\{30 - [2 \times (x_4 + x_3)]\}!}{\{30 - [2 \times (x_4 + x_3)] - x_2\}!}.$$

$$F_1 = \binom{10}{x_1} \times \frac{\{40 - [2 \times (x_4 + x_3 + x_2)]\}!}{\{40 - [2 \times (x_4 + x_3 + x_2)] - x_1\}!}.$$

The formula is explained as follows:

In $F_4$, there are $\displaystyle \binom{10}{x_4}$ ways that the lower ends of the $x_4$ slot pairs may be assigned from Group-4. Here, to prevent over-counting, the lower ends of the $x_4$ slot pairs are required to be in strictly ascending order.

Then, there are $10$ choices within Group-5, for the upper portion of the first slot pair. Then, $9$ choices for the upper portion of the second slot pair. Continuing, there are $(10 + 1 - x_4)$ choices for the upper portion of the last slot pair.

Now, exactly $[20 - (2 \times x_4)]$ positions remain available in Group-4 + Group-5 combined. This explains the 2nd factor in the computation of $F_3$ above. The computations of $F_2$ and $F_1$ follow the same pattern.

The following chart shows the computed values for $A(k)$:

 A(0) :                                            1

 A(1) :                                         1000
 A(2) :                                      4 60500
 A(3) :                                   1297 44000
 A(4) :                               2  50637 71500
 A(5) :                             352  61309 44800

 A(6) :                           37458  72313 20000
 A(7) :                        30 74788  17892 80000
 A(8) :                      1979 54756  73036 00000
 A(9) :                  1  00898 14147  37856 00000
A(10) :                 40  93170 95795  63040 00000


A(11) :               1324  19254 69663  99488 00000
A(12) :              34131  66649 07978  08800 00000
A(13) :            6 98386  59025 92299  91168 00000
A(14) :          112 71439  52456 71088  37888 00000
A(15) :         1421 37303  02822 93615  23507 20000

A(16) :        13824 72885  14168 68454  63040 00000
A(17) :     1  01934 79915  95713 56769  28000 00000
A(18) :     5  56883 54182  65931 61802  85440 00000
A(19) :    21  86233 68201  45292 49240  67840 00000
A(20) :    59  15174 34316  84268 46809  88672 00000


A(21) :   103  95174 45983  61574 05474  81600 00000
A(22) :   108  53091 64003  32554 37705  21600 00000
A(23) :    58  07909 81968  70316 80827  39200 00000
A(24) :    11  91737 92380  01009 23359  23200 00000
A(25) :        39136 39359  78934 00414  12608 00000

The answer is going to be a pain to calculate exactly but very small.

As it is small, you will get a reasonable approximation if you find the expected number of balls picked at least $3$ times, which is $10000$ times the probability a particular ball is picked more than $2$ times. That you have found using the binomial distribution (easiest with $1$ minus the probability of appearing $0$ or $1$ or $2$ times).

The expected number of balls picked more than $2$ times is an upper bound on the probability that at least one ball is picked more than $2$ times, and this is easy to prove. Suppose $N$ is the number of balls picked more than $2$ times. Then

$$\mathbb P(N>0) \\ = P(N=1)+P(N=2)+P(N=3)+\cdots \\= 0 \times P(N=0) +1\times P(N=1)+1\times P(N=2)+1\times P(N=3)+\cdots \\ \le 0 \times P(N=0) +1\times P(N=1)+2\times P(N=2)+3\times P(N=3)+\cdots \\ = \mathbb E[N]$$