How much space do I need to sort my socks?

In my pile of finished laundry, there are $2n$ socks of $n$ types, comprising $n$ easily distinguishable pairs. I sort the socks into pairs by picking one sock at a time randomly from the pile and either (1) laying it in a row of distinct socks or (2) pairing it with a sock already in the row, if there is such a sock, and putting the pair away. What is the expected maximum number of socks in the row?

Let $N$ be the peak number of socks in the row. Clearly $1\leqslant N\leqslant n$. And obviously, if $n=1$, then $N=\mathrm E[N]=1$. For $n=2$, I get $\mathrm E[N]=\frac53$. As $n$ increases, the calculation gets more complicated, and I don't know whether $\mathrm E[N]$ has a reasonable explicit general formulation. If not, perhaps it still has a nice asymptotic expression for large $n$. My guess is that there are positive constants $\alpha$ and $\beta$ such that $\lim_{\,n\rightarrow\infty}\mathrm E[N]/n^\beta=\alpha$.
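
(The $n=2$ value comes from a short case check: the first sock always starts the row; the second sock drawn matches it with probability $\frac13$, in which case the remaining pair is processed the same way and the peak stays at $1$, while otherwise the row reaches $2$ socks. Hence $\mathrm E[N]=\tfrac13\cdot 1+\tfrac23\cdot 2=\tfrac53$.)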


Solution 1:

Expanding my comments into an answer per request.

Let $N_n$ be the random variable for the peak number of socks for a given $n$. It is known that $$\overline{N}_n := \mathbf{E}[N_n] = 1, \frac{5}{3}, \frac{7}{3}, \frac{311}{105}, \frac{3377}{945}, \frac{3943}{945}, \ldots $$ or in general $\displaystyle\overline{N}_n = \frac{a_n}{b_n}$ where $a_n$ = OEIS A225177 and $b_n = (2n-1)!! = \prod\limits_{k=1}^n (2k-1)$.
As of this moment, a closed form for the sequence is unknown. For more information, one can follow the references in OEIS A225177 or consult the M/J2 section of this article.

There are two things in the above article I would like to comment on.

  1. It describes an algorithm to compute the probability distribution of $N_n$.
  2. It mentions that as $n \to \infty$, the probability distribution of $N_n$ becomes normal with mean $\frac{n}{2}$ and variance $\frac{n}{4}$.

I have implemented an algorithm similar to the one described in the article and performed some numerical experiments on the probability distribution.
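
To make the setup concrete, here is a minimal sketch of one way to compute the exact distribution by dynamic programming (not necessarily the algorithm from the article; the names `peak_cdf` and `peak_distribution` are mine). The state is the current number of unpaired socks in the row; discarding every history in which the row ever exceeds a threshold $t$ leaves exactly the probability $\mathbf{Prob}[N_n \le t]$.

```python
from fractions import Fraction

def peak_cdf(n, t):
    """P(N_n <= t): probability the row never holds more than t socks.

    prob[k] = probability of having k unpaired socks in the row after d
    draws, restricted to histories whose row never exceeded t; the mass
    of histories that would exceed t is dropped, so the surviving
    probability at the end equals P(N_n <= t)."""
    prob = {0: Fraction(1)}
    for d in range(2 * n):
        m = 2 * n - d                      # socks still in the pile
        nxt = {}
        for k, p in prob.items():
            if k > 0:                      # next sock matches one of the k row socks
                nxt[k - 1] = nxt.get(k - 1, Fraction(0)) + p * Fraction(k, m)
            if k + 1 <= t:                 # next sock starts a new row entry
                nxt[k + 1] = nxt.get(k + 1, Fraction(0)) + p * Fraction(m - k, m)
        prob = nxt
    return prob.get(0, Fraction(0))

def peak_distribution(n):
    """Exact probabilities P(N_n = t) for t = 1, ..., n."""
    cdf = [peak_cdf(n, t) for t in range(n + 1)]   # cdf[0] = 0
    return [cdf[t] - cdf[t - 1] for t in range(1, n + 1)]

if __name__ == "__main__":
    for n in range(1, 7):
        dist = peak_distribution(n)
        mean = sum((t + 1) * p for t, p in enumerate(dist))
        print(n, mean)   # reproduces 1, 5/3, 7/3, 311/105, ... from above
```

This sketch uses exact rational arithmetic and is only meant for small $n$; for the table sizes below one would switch to floating point.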

Let $\mathbf{Var}_n = \mathbf{E}\left[ \left(N_n - \overline{N}_n \right)^2 \right]$ be the variance of $N_n$. The following table summarizes what I get: $$\begin{array}{|r|rr:rr:r|} \hline n & \overline{N}_n & \overline{N}_n - \frac{n}{2} & \mathbf{Var}_n & \mathbf{Var}_n/n & \Delta_n(2)\\ \hline 200 & 105.107856 & 5.107856 & 40.733642 & 20.366821\% & 2.175188\%\\ 400 & 206.615246 & 6.615246 & 84.774930 & 21.193732\% & 1.600965\%\\ 600 & 307.674159 & 7.674159 & 129.707716 & 21.617953\% & 1.291395\%\\ 800 & 408.517714 & 8.517714 & 175.150639 & 21.893830\% & 1.124633\%\\ 1000 & 509.230343 & 9.230343 & 220.941243 & 22.094124\% & 1.009333\%\\ 1200 & 609.853430 & 9.853430 & 266.990895 & 22.249241\% & 0.923493\%\\ 1400 & 710.410713 & 10.410713 & 313.244447 & 22.374603\% & 0.856246\%\\ 1600 & 810.917224 & 10.917224 & 359.664606 & 22.479038\% & 0.802341\%\\ 1800 & 911.383157 & 11.383157 & 406.224651 & 22.568036\% & 0.757826\%\\ 2000 & 1011.815776 & 11.815776 & 452.904598 & 22.645230\% & 0.719742\%\\ \hline \end{array}$$

  • As one can see, as $n$ becomes large, the leading behavior of $\overline{N}_n$ is indeed $\frac{n}{2}$.
  • The difference between $\overline{N}_n$ and $\frac{n}{2}$ is a very slowly growing function of $n$; it seems to grow like $O(n^\beta)$ with $\beta \sim 0.37$ (see the quick fit after this list).
  • The variance $\mathbf{Var}_n$ does seem to scale linearly in $n$. However, the actual coefficient in front of $n$ is smaller than $\frac14$. It is not clear whether this is a finite-size effect or whether the limiting coefficient is indeed different from $\frac14$.
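
As a quick check on that exponent, here is a least-squares fit of $\log\left(\overline{N}_n - \frac{n}{2}\right)$ against $\log n$ using the ten values from the table above (plain `numpy`; nothing here is specific to the article's algorithm):

```python
import numpy as np

# (n, mean) pairs copied from the table above
data = [(200, 105.107856), (400, 206.615246), (600, 307.674159),
        (800, 408.517714), (1000, 509.230343), (1200, 609.853430),
        (1400, 710.410713), (1600, 810.917224), (1800, 911.383157),
        (2000, 1011.815776)]

n = np.array([row[0] for row in data], dtype=float)
diff = np.array([row[1] for row in data]) - n / 2   # mean minus n/2

# fit log(diff) = beta * log(n) + c
beta, c = np.polyfit(np.log(n), np.log(diff), 1)
print(f"beta ~ {beta:.3f}, prefactor ~ {np.exp(c):.3f}")
# over this range of n the fitted exponent comes out near 0.36-0.37
```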

As for the claim that the probability distribution of $N_n$ becomes normal, it actually works surprisingly well.

Let $\sigma_{n} = \sqrt{\mathbf{Var}_n}$ and let $P_n(k) = \mathbf{Prob}[ N_n = k ]$ be the probability that $N_n$ equals a particular value $k$. At the end is a picture comparing $P_n(k)$ with a normal distribution for various $n$ (a small plotting sketch follows the list below).

  • The $x$ axis is $\displaystyle\;x = \frac{k - \overline{N}_n}{\sigma_n}$.
  • The $y$ axis is the scaled probability $\sigma_n P_n(k)$.
  • The $\color{orange}{\verb/orange/}$ curve is the PDF $\displaystyle\;\rho(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}}$ for a normal distribution of unit variance.
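
For reference, a small plotting sketch of this scaling (it assumes the exact distribution is available, e.g. from the `peak_distribution` function sketched earlier, so it is only practical for moderate $n$; the particular $n$ values are just for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_scaled(dist, label):
    """Plot the scaled probabilities sigma_n * P_n(k) against x = (k - mean) / sigma_n."""
    p = np.array([float(q) for q in dist])
    k = np.arange(1, len(p) + 1)
    mean = np.sum(k * p)
    sigma = np.sqrt(np.sum((k - mean) ** 2 * p))
    plt.plot((k - mean) / sigma, sigma * p, '.', label=label)

# unit-variance normal PDF, drawn in orange
x = np.linspace(-4, 4, 401)
plt.plot(x, np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), color='orange', label='normal PDF')

for n in (10, 20, 40):
    plot_scaled(peak_distribution(n), f"n = {n}")

plt.xlabel("x")
plt.ylabel("scaled probability")
plt.legend()
plt.show()
```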

As one can see, for all the cases tested, the scaled probabilities $\sigma_n P_n(k)$ fall nearly right on top of $\rho(x)$. To measure how much the computed probabilities differ from the PDF of a normal distribution, let us define the following quantity:

$$\Delta_n(s) = \max\left\{ \left|\frac{\sigma_nP_n(k)}{\rho(x)} - 1\right| : |x| < s \right\}$$ This measures the biggest deviation of the ratio $\frac{\text{scaled probability}}{\text{reference PDF}}$ from $1$ for $x$ within $s$ standard deviations. As shown in the last column of the table above, as long as we stay within $2$ standard deviations, the difference between the ratio and $1$ is at most a few percent and improves as $n$ increases.
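
For completeness, here is a direct implementation of $\Delta_n(s)$ as defined above (the function name `delta` is mine; it takes the exact distribution as input, e.g. the output of the `peak_distribution` sketch):

```python
import math

def delta(probs, s=2.0):
    """Max relative deviation of the scaled probabilities from the
    standard normal PDF over |x| < s, where probs[t] = P(N_n = t + 1)."""
    p = [float(q) for q in probs]
    mean = sum((t + 1) * q for t, q in enumerate(p))
    sigma = math.sqrt(sum((t + 1 - mean) ** 2 * q for t, q in enumerate(p)))
    worst = 0.0
    for t, q in enumerate(p):
        x = (t + 1 - mean) / sigma
        if abs(x) < s:
            rho = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
            worst = max(worst, abs(sigma * q / rho - 1))
    return worst
```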

[Figure: scaled probabilities $\sigma_n P_n(k)$ of $N_n$ plotted against $x$ for various $n$, overlaid on the standard normal PDF]