From the principle of the "jacobi-rotation" method for obtaining the principal components it is clear, that some ellipsoid (based on the data) is defined by the following idea:

twodimensional case:

  1. rotate the cloud of datapoints such that the sum-of-squared x-coordinates ("ssq(x)") is maximum

  2. the sum-of-squared y-coordinates ("ssq(y)") is then minimal.

multidimensional case:

  1. you do 1) with all pairs of axes x-y, x-z, x-w,... This gives a "temporary" maximum of ssq(x)

  2. then you do 1) with all pairs y-z,y-w,... to obtain a "temporary" maximum of ssq(y) and so on with z-w,....

  3. Then you repeat the whole process until convergence

In principle this only defines the direction of the axes of some ellipsoid. Then, for instance in the monography "Faktorenanalyse" ("factoranalysis") of K. Ueberla, the size of the ellipsoid is defined by: "length of the axes are the roots of ssq(axis)" (paraphrased) Here no direct reference is made to the property of radial deviation (when representated in terms of polar-coordinates).

In fact, the concept of minimizing the radial distances is different by a simple consideration.
If we have -for instance- only three datapoints which are nearly on one straight line, then the center of the best-fitting-circle tends to infinity and not to the mean. The same may occur with some cloud of datapoints whose overall shape is that of a small but sufficient segment of a circumference of an ellipse. Well, we may force the origin into the arithmetic center of the cloud - but this introduces a likely artificial restriction. Actually, if you search for "circular regression" you'll arrive at nonlinear, iterative methods much different of the method of principal components.

[update] Here I show the difference between the two concepts in a two-dimensional example. I generated slightly correlated random-data. Rotated to pca-position, such that the x-axis of the plot is the first princ.comp. (eigenvector) and so on.

The blue ellipse is that ellipse using the axis-lengthes of stddev of the principal components.

Then I computed another ellipse by minimizing the sum-of-squares of radial distances of datapoints to the ellipses circumference (just manual checking), this is the red ellipse. Here is the plot:

ellipsefitting

[end update]


One more comment for the geometric visualization (it is just for the two-dimensional case).

Consider the cloud of data as endpoints of vectors from the origin. Then the common concept of the resultant is the mean of the arithmetic/geometric sum of all vectors. This is also called "centroid" and in early times of factor-analysis, without the help of computers, it was often an accepted approximation for the "central tendency": the mean of the coordinates are the best fit in the sense of least "squares-of-distances" .

However, if some vectors have "opposing" (better word?) direction, their influence neutralizes, although they would nicely define an axis. In the old centroid-method such vectors were "inflected" (sign changed) to correct for that effect.

Another option is to double the angle of each vector - then opposing vectors overlay and strengthen each others influence instead of neutralizing. Then take the mean of the new vectors and half its angle. This is then the principal component/axis and thus represents another least-squares estimation for the fit. (The computation of the rotation-criterion agrees perfectly to the minimizing formula of the Jacobi-rotation for PCA)


Just fleshing out Rahul's comment: probably this choice of $\mu$ and $\Sigma$ maximizes the likelihood of the points given a multivariate Gaussian distribution $N(\mu,\Sigma)$.