Expected Value of R squared

This problem seems simple...but it's not. For example, see here for a rather complex analysis of the prima facie simple cases of ratios of normal random variables and ratios of sums of uniforms.

In general, if your pairs are not drawn from a bivariate Gaussian, there is no nice closed-form formula for $E[R^2]$.

Note:

$$R_n=\frac{n\sum x_iy_i-\sum x_i\sum y_i}{n^2s_Xs_Y}$$

This mess will have some distribution $f_{R_n}(r)$ that will be very sensitive to $n$.
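As a sanity check on the formula above, here is a minimal sketch (assuming NumPy is available) that computes $R_n$ from the raw sums, taking $s_X, s_Y$ as the population (divide-by-$n$) standard deviations, and compares it to `np.corrcoef`:

```python
import numpy as np

def r_from_formula(x, y):
    """Sample correlation via the raw-sum formula, with s_X, s_Y the
    population (divide-by-n) standard deviations."""
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = n**2 * np.std(x) * np.std(y)  # np.std defaults to ddof=0
    return num / den

rng = np.random.default_rng(0)
x = rng.uniform(size=50)
y = rng.uniform(size=50)
# Agrees with NumPy's built-in correlation coefficient
print(r_from_formula(x, y), np.corrcoef(x, y)[0, 1])
```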

I think your best bet is to simulate this (Monte Carlo) for $n \in \{2, \dots, N\}$ using a large number of trials. (You can check convergence by running each simulation twice with randomly chosen seeds, comparing the results to each other and to the results for $n-1$.)
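A minimal sketch of that Monte Carlo procedure (assuming NumPy, and independent $\mathrm{Uniform}(0,1)$ pairs as the data-generating process), running each $n$ under two seeds as the crude convergence check:

```python
import numpy as np

def mc_mean_r2(n, trials=20000, seed=0):
    """Monte Carlo estimate of E[R_n^2] for independent Uniform(0,1) pairs."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(trials, n))
    y = rng.uniform(size=(trials, n))
    # Center each row, then form the sample correlation of each (x, y) pair
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    r = (xc * yc).sum(axis=1) / np.sqrt(
        (xc**2).sum(axis=1) * (yc**2).sum(axis=1))
    return np.mean(r**2)

# Two seeds per n: the estimates should agree to Monte Carlo error
for n in (3, 5, 10, 25):
    print(n, mc_mean_r2(n, seed=1), mc_mean_r2(n, seed=2))
```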

Once you have this data, you can fit a curve to it (or to some transformation thereof). Your general equation looks reasonable in terms of how the curve will look, since:

$$E[R^2_n] \longrightarrow 0 \quad \text{as } n \to \infty$$ for the correlation between independent variables.

Possible Solution

Since your variables are independent, I realized that we are really looking for the variance of the sample correlation coefficient, i.e., the square of its standard error (see p. 6 of the linked notes):

$$se_{R_n}=\sqrt{\frac{1-R^2}{n-2}}$$

However, you already know the true value of $R^2$, so you don't lose a degree of freedom estimating it, and you can increase the df in the denominator from $n-2$ to $n-1$. Since $R^2=0$ for independent variables, this reduces to:

$$(se_{R_n})^2=\sigma^2_{R_n}=E[R^2_n]=\frac{1}{n-1}$$
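A quick numerical check of this $1/(n-1)$ formula (a sketch, assuming NumPy; independent $\mathrm{Uniform}(0,1)$ pairs):

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 8, 100_000
x = rng.uniform(size=(trials, n))
y = rng.uniform(size=(trials, n))
# Sample correlation for each of the `trials` independent (x, y) samples
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
r = (xc * yc).sum(axis=1) / np.sqrt(
    (xc**2).sum(axis=1) * (yc**2).sum(axis=1))
print(np.mean(r**2), 1 / (n - 1))  # the two values should be close
```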

There you have it...it matches your empirical results. As Wolfies notes, I should point out that this is an asymptotic result, but sums of uniform RVs generally exhibit good convergence à la the CLT, which may explain the good fit.

For further reading, see @soakley's nice reference. I was able to pull the relevant page from JSTOR:

[Scanned excerpt of the relevant page from the JSTOR article]

Or, if you're really motivated, you can get this recent article (2005) on your exact problem.


According to Kendall's Advanced Theory of Statistics (Exercise 16.17 in the 5th edition of Volume 1), Pitman (1937) showed that the sample correlation coefficient $r$ has zero mean and variance (equivalently, second moment, since the mean is zero) $$\sigma^2_{r}=E[r^2] = {1 \over {n-1}}$$ for any sample of size $n$ in which $x$ and $y$ are independent continuous variates.

Checking the reference, we find he shows $r^2$ has an approximate $\mathrm{Beta} \left( {1 \over 2}, {{n-2} \over {2}}\right)$ distribution.
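That Beta approximation is easy to check by simulation. A sketch (assuming NumPy and SciPy; here with Gaussian data, for which the result is exact) comparing the empirical moments of $r^2$ to those of $\mathrm{Beta}\left(\frac{1}{2}, \frac{n-2}{2}\right)$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, trials = 10, 100_000
x = rng.standard_normal((trials, n))
y = rng.standard_normal((trials, n))
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
# Squared sample correlation for each trial
r2 = (xc * yc).sum(axis=1)**2 / ((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

dist = stats.beta(0.5, (n - 2) / 2)
print(r2.mean(), dist.mean())  # both near 1/(n-1) = 1/9
print(r2.var(), dist.var())
```

Note that the Beta mean $\frac{1/2}{(n-1)/2} = \frac{1}{n-1}$ recovers Pitman's second-moment result above.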

Reference: Pitman, E. J. G. (1937). Significance tests which may be applied to samples from any population (v. 4, No. 1); II. The correlation coefficient test (v. 4, No. 2). $\textit{Supp. J. R. Statist. Soc.}$


I'm just copying the section from

http://en.wikipedia.org/wiki/Coefficient_of_determination

I think it is what you are looking for.

A data set has $n$ values, written $y_1,\dots,y_n$ (collectively known as $y_i$), each associated with a predicted (or modeled) value $f_1,\dots,f_n$ (known as $f_i$, or sometimes $\hat{y}_i$).

If $\bar{y}$ is the mean of the observed data:

$$\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i,$$ then the variability of the data set can be measured using three sums-of-squares formulas:

- The total sum of squares (proportional to the variance of the data): $SS_\text{tot}=\sum_i (y_i-\bar{y})^2$
- The regression sum of squares, also called the explained sum of squares: $SS_\text{reg}=\sum_i (f_i -\bar{y})^2$
- The sum of squares of residuals, also called the residual sum of squares: $SS_\text{res}=\sum_i (y_i - f_i)^2$

The notations $SS_\text{R}$ and $SS_\text{E}$ should be avoided, since in some texts their meanings are reversed to residual sum of squares and explained sum of squares, respectively.

The most general definition of the coefficient of determination is

$R^2 \equiv 1 - {SS_{\rm res}\over SS_{\rm tot}}.$
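To make this concrete, here is a small worked sketch (assuming NumPy; the toy data is made up for illustration) that fits a least-squares line with `np.polyfit` and computes $R^2$ from the sums of squares defined above:

```python
import numpy as np

# Toy data and a least-squares line fit, then R^2 from the
# sum-of-squares definition.
x = np.arange(5.0)
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
b1, b0 = np.polyfit(x, y, 1)   # slope, intercept
f = b0 + b1 * x                # fitted (predicted) values

ss_tot = np.sum((y - y.mean())**2)
ss_reg = np.sum((f - y.mean())**2)
ss_res = np.sum((y - f)**2)

r2 = 1 - ss_res / ss_tot
print(r2)
# For a linear fit with an intercept, SS_tot = SS_reg + SS_res,
# so r2 also equals ss_reg / ss_tot.
```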


If nothing else, look at the inset figure to the right.

Here is the link to the graphic, which compares the squared deviations of the data from $\bar{y}$ (left) with the squared residuals from the fitted line (right).

http://en.wikipedia.org/wiki/Coefficient_of_determination#mediaviewer/File:Coefficient_of_Determination.svg