Expected length of the shortest polygonal path connecting random points

Given: $(X,Y)$ is Uniformly distributed on a disc of unit radius. Then, the joint distribution of $n$ such points, $((X_1,Y_1), ..., (X_n,Y_n))$, each drawn independently, will have joint pdf:

$$f\big( (x_1,y_1), \ldots, (x_n,y_n) \big) = \begin{cases}\pi^{-n} & \text{if } x_i^2 + y_i^2 < 1 \text{ for all } i = 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$

Next, let $Z_n$ denote the Euclidean length of the shortest polygonal path that visits each point (at least) once. For any given $n$, it is possible to express $Z_n$ as an exact symbolic construct. This rapidly becomes tiresome to do manually, so I have set up a function PolygonPathMinDistance[points] to automate it (the code is at the bottom of this post). For example:

  • Case $n = 2$:
    With 2 points, the shortest polygonal path $Z_2$ is, of course, the single segment joining the two points:

$$Z_2 = \sqrt{(X_1-X_2)^2+(Y_1-Y_2)^2}$$

and so the desired exact solution is:

$$E[Z_2] = E\left[\sqrt{(X_1-X_2)^2+(Y_1-Y_2)^2}\right] \approx 0.9054$$

where the random variables $(X_1, Y_1, X_2, Y_2)$ have joint pdf $f(\cdots)$ given above.

Although a closed-form solution for the expectation integral has thus far proved elusive (perhaps transforming to polar coordinates is a path worth pursuing - see below), the result is just a number, and we can find that number to arbitrary precision using numerical rather than symbolic integration. Doing so yields $E[Z_2] = 0.9054\ldots$
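The same number can also be checked by simulation. Here is a minimal Monte Carlo sketch in Python (an illustration only; the post itself works in Mathematica, and the function names below are my own):

```python
# Monte Carlo check of E[Z_2]: sample pairs of uniform points on the unit
# disc and average their distance.
import math
import random

def uniform_disc_point(rng):
    # Inverse-CDF sampling: for a uniform point on the disc, R^2 is uniform,
    # so R = sqrt(U); the angle is uniform on [0, 2*pi).
    r = math.sqrt(rng.random())
    t = 2 * math.pi * rng.random()
    return r * math.cos(t), r * math.sin(t)

def estimate_EZ2(samples=200_000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x1, y1 = uniform_disc_point(rng)
        x2, y2 = uniform_disc_point(rng)
        total += math.hypot(x1 - x2, y1 - y2)
    return total / samples

print(estimate_EZ2())  # close to 128/(45*pi), i.e. about 0.9054
```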

  • Case $n = 3$:
    With 3 points, the shortest polygonal path $Z_3$ is:

[figure: shortest path through 3 random points]

and we similarly obtain:

$$E[Z_3] = 1.49\ldots$$

  • Case $n = 4$:
    With 4 points, the shortest polygonal path $Z_4$ is:

[figure: shortest path through 4 random points]

and we similarly obtain:

$$E[Z_4] = 1.96\ldots$$

... and so on. For $n \ge 5$, the method still works, but numerical integration becomes less reliable, and other methods, such as Monte Carlo simulation [i.e. actually generating $n$ pseudorandom points in the disc (say 100,000 times), evaluating the shortest path for each $n$-point drawing, and calculating the sample mean], appear to work more reliably. The latter is, of course, no longer an exact methodology, but as a practical evaluation method for larger $n$, it becomes the more practical option.
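The Monte Carlo recipe just described can be sketched as follows (in Python for illustration; the function names are my own):

```python
# Brute-force the shortest open path over all orderings (identifying each
# ordering with its reverse), then average over random draws from the disc.
import itertools
import math
import random

def shortest_path_length(points):
    n = len(points)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        if perm[0] > perm[-1]:
            continue  # a path and its reverse have the same length
        length = sum(
            math.dist(points[perm[i]], points[perm[i + 1]])
            for i in range(n - 1)
        )
        best = min(best, length)
    return best

def estimate_EZn(n, samples=20_000, seed=2):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        pts = []
        for _ in range(n):
            r = math.sqrt(rng.random())  # R = sqrt(U) is uniform-on-disc radius
            t = 2 * math.pi * rng.random()
            pts.append((r * math.cos(t), r * math.sin(t)))
        total += shortest_path_length(pts)
    return total / samples

print(estimate_EZn(3))  # roughly 1.49
print(estimate_EZn(4))  # roughly 1.96
```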

Summary Plots and Fitting

$$\color{red}{\text{Expected shortest polygonal path connecting $n$ random points in a disc of unit radius}}$$

[plot: expected shortest path length against $n$, for $n = 2, \ldots, 20$]

... and with $n=50, 100$ and $150$ points added:

[plot: the same, with $n = 50, 100, 150$ added]

The fitted model here is simply:

$$a + b \frac{n-1}{\sqrt{n}}$$

with $a = -0.11$, $b = 1.388$.

Fitting curves

A number of 'natural' contenders present themselves, including models of form:

  • $a + b n + c \sqrt{n}$ ... this performs neatly within sample, but as soon as one extends to larger values of $n$, the model performs poorly out of sample, and needs to be re-fitted given the extra data points. This suggests that the true model is not just square root $n$.

  • $a + b n^c$ ... this works very neatly within sample, and outside sample, but the optimal parameter values show some instability as more data values are added.

  • $a + b \frac{n-1}{\sqrt{n}}$ ... just 2 parameters, and it works absolutely beautifully. The idea for this model was inspired by Ju'x's posting. The parameter values are remarkably stable as more data points are added: they remain constant to 2 decimal places whether we use just $n = 2, \ldots, 20$, add $n = 50$, add $n = 100$, or go all the way up to $n = 300$, as shown here:
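To illustrate the fitting step, here is an ordinary least-squares sketch in Python using only the three estimates quoted earlier in this post (the actual fit above used data up to $n = 300$, so its coefficients differ somewhat):

```python
# Least-squares fit of the two-parameter model a + b*(n-1)/sqrt(n): with a
# single regressor x = (n-1)/sqrt(n), the slope and intercept have the usual
# closed form.  Only the three estimates quoted in the post are used here.
import math

data = {2: 0.9054, 3: 1.49, 4: 1.96}  # n -> estimated E[Z_n]

xs = [(n - 1) / math.sqrt(n) for n in data]
ys = [data[n] for n in data]
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
print(round(a, 3), round(b, 3))  # roughly -0.037 and 1.329 on these three points
```

With the full data set up to $n = 300$ the same computation gives the quoted $a = -0.11$, $b = 1.388$.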

[plot: fitted model $a + b \frac{n-1}{\sqrt{n}}$ against the data]

The parameter values for $a$ and $b$ are the same as for the diagram above.


Code for the PolygonPathMinDistance function

Here is some Mathematica code for the exact PolygonPathMinDistance function. It enumerates all relevant permutations involved in finding the shortest polygonal path, as an exact symbolic/algebraic construct:

PolygonPathMinDistance[points_] :=
 Module[{orderings, pointorderings, pathsToCheck, ww},
  (* an open path has the same length as its reverse, so keep one of each pair *)
  orderings = Union[Permutations[Range[Length[points]]],
    SameTest -> (#1 == Reverse[#2] &)];
  pointorderings = Map[points[[#]] &, orderings];
  pathsToCheck = Map[Partition[#, 2, 1] &, pointorderings];
  (* strip Abs for symbolic real arguments, then take the minimum total length *)
  Min[Map[
    Total[Map[EuclideanDistance @@ # &, #] /. Abs[ww_] -> ww] &,
    pathsToCheck]]
 ]
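For readers without Mathematica, the same brute-force idea can be sketched numerically in Python (the symbolic Abs-stripping step has no analogue here; the function name is my own):

```python
# Numeric analogue of PolygonPathMinDistance: enumerate orderings, discard
# each ordering's reverse (as Union[..., SameTest] does above), and take the
# minimum total edge length.
import itertools
import math

def polygon_path_min_distance(points):
    best = float("inf")
    for perm in itertools.permutations(range(len(points))):
        if perm[0] > perm[-1]:
            continue  # skip reversed duplicates
        length = sum(
            math.dist(points[a], points[b]) for a, b in zip(perm, perm[1:])
        )
        best = min(best, length)
    return best

print(polygon_path_min_distance([(0, 0), (2, 0), (1, 0)]))  # 2.0: visit in x-order
```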


Alternative specification of problem in Polar Coordinates

If points are Uniformly distributed on a disc of unit radius and independent, the joint distribution of $n$ points, $((X_1,Y_1), ..., (X_n,Y_n))$, has joint pdf $f(\cdots)$ given above. Alternatively, consider the transformation to polar coordinates, $X_i = R_i \cos \Theta_i$, $Y_i = R_i \sin \Theta_i$, for $i = 1, \ldots, n$. Then the joint pdf is:

$$f( (r_1,\theta_1), ..., (r_n,\theta_n) ) = \frac{r_1 \times r_2 \times \cdots \times r_n}{\pi^{n}}$$

with support $0 < r_i < 1$ and $0 < \theta_i < 2\pi$. The distance between any two random points, say $(X_i,Y_i)$ and $(X_j,Y_j)$, is then given by the distance formula in polar coordinates:

$$\sqrt{(X_j-X_i)^2+(Y_j-Y_i)^2} = \sqrt{R_i^2 + R_j^2 - 2 R_i R_j \cos \left(\Theta_i-\Theta_j\right)}$$

Taking expectations then yields the same results as above. For the $n = 2$ case, symbolic integration is a little easier, and one can obtain the result:

$$\begin{align*}\displaystyle E\left[ Z_2\right] &= E\left[ \sqrt{R_1^2 + R_2^2 - 2 R_1 R_2 \cos \left(\Theta _1-\Theta _2\right)}\right] \\ &= \frac{8}{\pi} \int_0^1\!\!\int_0^1 |r_1 - r_2| \operatorname{EllipticE}\!\left(\frac{-4 r_1 r_2}{(r_1-r_2)^2}\right) r_1 r_2 \, dr_1\,dr_2 \\ &= \frac{8}{\pi} \cdot \frac{16}{45} = \frac{128}{45\pi} \approx 0.9054 \quad \text{(as above)}\end{align*}$$
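The elliptic-integral formula can be verified numerically. The sketch below (Python, for illustration) uses the identity $|r_1-r_2|\operatorname{EllipticE}\!\left(\frac{-4 r_1 r_2}{(r_1-r_2)^2}\right) = \int_0^{\pi/2}\sqrt{(r_1-r_2)^2 + 4 r_1 r_2 \sin^2 t}\,dt$ (Mathematica's parameter convention), which removes the apparent singularity at $r_1 = r_2$:

```python
# Numerical check of the polar-coordinate formula for E[Z_2], with the
# elliptic integral rewritten as a smooth inner integral and everything
# evaluated by a composite midpoint rule on (r1, r2, t).
import math

def estimate_EZ2_polar(k=60):
    total = 0.0
    for i in range(k):
        r1 = (i + 0.5) / k
        for j in range(k):
            r2 = (j + 0.5) / k
            inner = 0.0
            for q in range(k):
                t = (q + 0.5) / k * (math.pi / 2)
                s = math.sin(t)
                # equals |r1 - r2| * EllipticE(-4 r1 r2 / (r1 - r2)^2) after
                # integrating over t in (0, pi/2)
                inner += math.sqrt((r1 - r2) ** 2 + 4 * r1 * r2 * s * s)
            inner *= (math.pi / 2) / k
            total += r1 * r2 * inner
    return total * (8 / math.pi) / (k * k)

print(estimate_EZ2_polar())      # approximately 0.9054
print(128 / (45 * math.pi))      # the exact value 128/(45*pi)
```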

Extending from a disc to a ball

There is no inherent reason the same methods cannot be extended to a ball ... just more dimensions (and more computational grunt required).
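For instance, the $n = 2$ case in the ball is easy to simulate (a sketch in Python; the reported exact mean distance between two uniform points in a unit 3-ball is $\frac{36}{35} \approx 1.0286$):

```python
# Monte Carlo sketch for n = 2 in the unit ball: rejection-sample uniform
# points from the bounding cube and average their distance.
import math
import random

def uniform_ball_point(rng):
    # Rejection sampling from [-1, 1]^3; acceptance rate is pi/6.
    while True:
        p = (rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
        if p[0] ** 2 + p[1] ** 2 + p[2] ** 2 <= 1:
            return p

def estimate_EZ2_ball(samples=100_000, seed=3):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += math.dist(uniform_ball_point(rng), uniform_ball_point(rng))
    return total / samples

print(estimate_EZ2_ball())  # near 36/35, i.e. about 1.0286
```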


There is a very elegant theorem published by Beardwood, Halton and Hammersley in 1959 that gives almost-sure convergence for the length of the shortest path divided by $\sqrt{N}$. See this poster for a statement in a general setting.

Also, the convergence holds for expectations. In 1989, Rhee and Talagrand showed that the length of the shortest path is very tightly concentrated around its mean.


Let us give a very elementary proof that there exist constants $c, C > 0$ such that $$ c \sqrt{N} \leq L(N) \leq C \sqrt{N}. $$ In order to simplify the presentation I will do the proof for a unit square instead of a disk, but it should work just the same.

1. Upper bound (deterministic)

Take the unit square to simplify and split it into $N$ small boxes of area $1/N$ and side length $\frac{1}{\sqrt{N}}$, according to the following figure (to be rigorous this is only possible if $N$ is a perfect square, but by monotonicity it is sufficient to treat only this case).

[figure: the unit square split into $N$ boxes of side $1/\sqrt{N}$, with a red serpentine linking path]

In each box, consider an arbitrary polygonal line that joins all of the points contained in the box. Each of its segments is at most the box diagonal $\sqrt{2/N}$ long, so its total length is at most $\sqrt{2/N}$ times (the number of points in the box minus one).

Now link the small chains together (for instance, follow the red path): this is done by adding at most $N-1$ lines of length at most $\sqrt{\frac{5}{N}}$. Thus this procedure gives a full polygonal path that visits each of the points. If $B_i$ denotes the number of points in the $i$th box, the length of the total path is at most $$ \sum_{i=1}^N \sqrt{\frac{2}{N}}(B_i-1)_+ + (N-1)\times \sqrt{\frac{5}{N}} \leq (\sqrt{2}+\sqrt{5})\times \frac{N-1}{\sqrt{N}} $$ since $\sum_{i=1}^N B_i = N$. This gives an upper bound for the length of the shortest path, hence for its expectation. Of course, the constant $\sqrt{2}+\sqrt{5}$ is far from optimal and can easily be improved.
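The construction can be checked empirically. The sketch below (Python; a serpentine ordering of the boxes stands in for the red path) orders $N$ random points box-by-box along the snake and verifies that the resulting path length never exceeds $(\sqrt{2}+\sqrt{5})\frac{N-1}{\sqrt{N}}$:

```python
# Empirical check of the upper-bound construction: sort N points in the unit
# square by a serpentine sweep over sqrt(N) x sqrt(N) boxes and compare the
# resulting path length with (sqrt(2) + sqrt(5)) * (N - 1) / sqrt(N).
import math
import random

def serpentine_path_length(points, k):
    # k*k boxes of side 1/k; sweep rows bottom to top, alternating direction.
    def box_key(p):
        row = min(int(p[1] * k), k - 1)
        col = min(int(p[0] * k), k - 1)
        if row % 2 == 1:
            col = k - 1 - col  # reverse direction on odd rows
        return (row, col)

    ordered = sorted(points, key=box_key)
    return sum(math.dist(a, b) for a, b in zip(ordered, ordered[1:]))

rng = random.Random(4)
N = 100                      # a perfect square, as in the proof
k = math.isqrt(N)
bound = (math.sqrt(2) + math.sqrt(5)) * (N - 1) / math.sqrt(N)
for trial in range(20):
    pts = [(rng.random(), rng.random()) for _ in range(N)]
    assert serpentine_path_length(pts, k) <= bound
print("bound", round(bound, 3), "holds in all trials")
```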

2. Lower bound (probabilistic)

Let us scale the $N$ previous boxes by a factor $\frac{1}{2}$, so that the area of each of them is now $\frac{1}{4N}$.

[figure: the boxes scaled by $\frac{1}{2}$]

As before, let $B_i$ denote the number of points in the $i$th box. It follows a Binomial$\left(N,\frac{1}{4N}\right)$ distribution. The shortest path must at least link together the nonempty boxes, and the cost of these links is at least (thanks to the triangle inequality) $\frac{1}{2\sqrt{N}}$ for each nonempty box except the first one. Using the linearity of expectation, we see that $L(N)$ is bounded from below by $$ \mathbb{E}\left[\frac{1}{2\sqrt{N}} \left(\sum_{i=1}^N \mathbf{1}_{B_i \neq 0}-1\right)\right] = \frac{\sqrt{N}}{2}\Pr(B_1 \neq 0) - \frac{1}{2\sqrt{N}}. $$ Note finally that $$ \Pr(B_i \neq 0) = 1-\left(1-\frac{1}{4N}\right)^N \geq 1-e^{-\frac{1}{4}} > 0. $$
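The final inequality relies on $\left(1-\frac{1}{4N}\right)^N$ increasing to $e^{-1/4}$, which is easy to check numerically:

```python
# Numeric check of the last step: (1 - 1/(4N))^N increases to e^(-1/4), so
# Pr(B_i != 0) = 1 - (1 - 1/(4N))^N stays above 1 - e^(-1/4) for every N.
import math

floor_prob = 1 - math.exp(-0.25)  # about 0.2212
for N in (1, 10, 100, 10_000, 1_000_000):
    p_nonempty = 1 - (1 - 1 / (4 * N)) ** N
    assert p_nonempty >= floor_prob
    print(N, round(p_nonempty, 6))
```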

3. Extension to dimension $d\geq 3$: the proof readily extends to show $c\cdot N^{\frac{d-1}{d}} \leq L(N) \leq C\cdot N^{\frac{d-1}{d}}$.


$L(2)=\frac{128}{45\pi}$

$L(3)$ may or may not be extractable from here.

I suspect that's all that's known about exact values.

Also, your problem is a variant of the Traveling Salesman Problem (the open-path version, with no return to the start).