Correlation Coefficient and Determination Coefficient
I'm new to linear regression and am trying to teach myself.
In my textbook there's a problem that asks "why is $R^{2}$ in the regression of $Y$ on $X$ equal to the square of the sample correlation between $X$ and $Y$?"
I've been banging my head against this for a while and I keep getting stuck, because the correlation coefficient contains $X$ and $\bar{X}$ terms, whilst the expression for $R^{2}$ contains no such terms.
Can anyone provide a derivation as to why $R^{2}$ is the correlation coefficient squared?
Thanks!
Suppose that we have $n$ observations $(x_1,y_1),\ldots,(x_n,y_n)$ from a simple linear regression
$$ Y_i=\alpha+\beta x_i+\varepsilon_i, $$
where $i=1,\ldots,n$. Let us denote $\hat y_i=\hat\alpha+\hat\beta x_i$ for $i=1,\ldots,n$, where $\hat\alpha$ and $\hat\beta$ are the ordinary least squares estimators of the parameters $\alpha$ and $\beta$. The coefficient of determination $r^2$ is defined by
$$ r^2=\frac{\sum_{i=1}^n(\hat y_i-\bar y)^2}{\sum_{i=1}^n(y_i-\bar y)^2}. $$
The $x_i$ and $\bar x$ terms enter through the fitted values $\hat y_i$. Using the facts that
$$ \hat\beta=\frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^n(x_i-\bar x)^2} $$
and $\hat\alpha=\bar y-\hat\beta\bar x$, we obtain
\begin{align*}
\sum_{i=1}^n(\hat y_i-\bar y)^2
&=\sum_{i=1}^n(\hat\alpha+\hat\beta x_i-\bar y)^2\\
&=\sum_{i=1}^n(\bar y-\hat\beta\bar x+\hat\beta x_i-\bar y)^2\\
&=\hat\beta^2\sum_{i=1}^n(x_i-\bar x)^2\\
&=\frac{[\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)]^2\sum_{i=1}^n(x_i-\bar x)^2}{[\sum_{i=1}^n(x_i-\bar x)^2]^2}\\
&=\frac{[\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)]^2}{\sum_{i=1}^n(x_i-\bar x)^2}.
\end{align*}
Hence,
\begin{align*}
r^2
&=\frac{\sum_{i=1}^n(\hat y_i-\bar y)^2}{\sum_{i=1}^n(y_i-\bar y)^2}\\
&=\frac{[\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)]^2}{\sum_{i=1}^n(x_i-\bar x)^2\sum_{i=1}^n(y_i-\bar y)^2}\\
&=\biggl(\frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n(x_i-\bar x)^2\sum_{i=1}^n(y_i-\bar y)^2}}\biggr)^2.
\end{align*}
The quantity inside the parentheses is exactly the sample correlation coefficient of $(x_1,y_1),\ldots,(x_n,y_n)$, so the coefficient of determination of a simple linear regression is the square of that correlation coefficient.
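If it helps to see the identity numerically, here is a minimal Python sketch that computes $r^2$ from the fitted values and compares it with the squared sample correlation. The function names (`fit_simple_ols`, `r_squared`, `sample_correlation`) and the data are made up for illustration; only the standard library is used, and the formulas are the ones from the derivation above.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def fit_simple_ols(x, y):
    """Return (alpha_hat, beta_hat) for the regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def r_squared(x, y):
    """Coefficient of determination: explained sum of squares over total."""
    alpha_hat, beta_hat = fit_simple_ols(x, y)
    y_bar = mean(y)
    y_hat = [alpha_hat + beta_hat * xi for xi in x]
    explained = sum((yh - y_bar) ** 2 for yh in y_hat)
    total = sum((yi - y_bar) ** 2 for yi in y)
    return explained / total


def sample_correlation(x, y):
    """Pearson sample correlation coefficient of the pairs (x_i, y_i)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


# Illustrative data only (not taken from any textbook problem).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# The two printed numbers should agree up to floating-point rounding.
print(r_squared(x, y), sample_correlation(x, y) ** 2)
```

The check is not a proof, of course, but it is a quick way to convince yourself that the $x_i-\bar x$ terms really are hidden inside $\hat y_i$.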
A complete proof that the coefficient of determination $R^{2}$ equals the squared Pearson correlation coefficient between the observed values $y_{i}$ and the fitted values $\hat{y}_{i}$ can be found at the following link:
http://economictheoryblog.wordpress.com/2014/11/05/proof/
In my view it should be fairly easy to understand if you work through the steps one at a time.
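For the simple-regression case, the connection between the observed and fitted values can be sketched in a few lines. This is only a sketch under the assumption $\hat\beta\neq 0$, using $\hat y_i=\hat\alpha+\hat\beta x_i$ and $\hat\alpha=\bar y-\hat\beta\bar x$; it is not necessarily the argument used in the linked post. The fitted values have mean $\hat\alpha+\hat\beta\bar x=\bar y$, and $\hat y_i-\bar y=\hat\beta(x_i-\bar x)$, so
$$
\begin{align*}
\operatorname{corr}(y,\hat y)^2
&=\frac{[\sum_{i=1}^n(y_i-\bar y)(\hat y_i-\bar y)]^2}{\sum_{i=1}^n(y_i-\bar y)^2\sum_{i=1}^n(\hat y_i-\bar y)^2}\\
&=\frac{\hat\beta^2[\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)]^2}{\hat\beta^2\sum_{i=1}^n(x_i-\bar x)^2\sum_{i=1}^n(y_i-\bar y)^2}\\
&=\frac{[\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)]^2}{\sum_{i=1}^n(x_i-\bar x)^2\sum_{i=1}^n(y_i-\bar y)^2}.
\end{align*}
$$
The last expression is the squared sample correlation between $x$ and $y$, so in simple linear regression the squared correlation between $y_i$ and $\hat y_i$ coincides with the squared correlation between $x_i$ and $y_i$.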