$\text{MSE}$ as an unbiased estimator for $\sigma^2$ [duplicate]
Solution 1:
It is better to follow the lucid method provided by @RCL under the general setup.
However, you can still find the result by doing some simple calculations. Here $SS_E$ is called the Residual Sum of Squares (RSS). It is defined as the sum of squared residuals (the differences between observed and predicted values). Suppose $\hat{y_{i}}$ is the predicted value obtained from the linear model. Then
\begin{equation}
\begin{aligned}
SS_E&=\sum_{i=1}^{n}(y_i-\hat{y_{i}})^2\\
&=\sum_{i=1}^{n}(y_i-\hat{\beta_0}-\hat{\beta_1}x_i)^2\\
&=\sum_{i=1}^{n}(y_i-\overline{y}+\hat{\beta_1}\overline{x}-\hat{\beta_1}x_i)^2 \quad (\text{using } \hat{\beta_0}=\overline{y}-\hat{\beta_1}\overline{x})\\
&=\sum_{i=1}^{n}((y_i-\overline{y})-\hat{\beta_1}(x_i-\overline{x}))^2\\
&=\sum_{i=1}^{n} (y_i-\overline{y})^2+\hat{\beta_1}^2\sum_{i=1}^{n} (x_i-\overline{x})^2-2\hat{\beta_1}\sum_{i=1}^{n}(x_i-\overline{x})(y_i-\overline{y})\\
&=\sum_{i=1}^{n} (y_i-\overline{y})^2+\hat{\beta_1}^2 S_{xx}-2\hat{\beta_1}^2S_{xx} \quad (\text{using } \sum_{i=1}^{n}(x_i-\overline{x})(y_i-\overline{y})=S_{xy}=\hat{\beta_1}S_{xx})\\
&=\sum_{i=1}^{n} (y_i-\overline{y})^2-\hat{\beta_1}^2 S_{xx}
\end{aligned}
\end{equation}
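As a quick numerical sanity check, here is a minimal numpy sketch (the seed, design, and data-generating values are arbitrary, chosen only for illustration) confirming that the residual sum of squares computed directly matches $\sum_{i=1}^{n}(y_i-\overline{y})^2-\hat{\beta_1}^2 S_{xx}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data (any x, y will do; the identity is purely algebraic).
n = 30
x = rng.uniform(0, 10, size=n)
y = 1.5 + 2.0 * x + rng.normal(0, 1, size=n)

x_bar, y_bar = x.mean(), y.mean()
S_xx = np.sum((x - x_bar) ** 2)
S_xy = np.sum((x - x_bar) * (y - y_bar))

beta1_hat = S_xy / S_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Left-hand side: residual sum of squares from the fitted line.
ss_e = np.sum((y - beta0_hat - beta1_hat * x) ** 2)

# Right-hand side of the identity derived above.
rhs = np.sum((y - y_bar) ** 2) - beta1_hat ** 2 * S_xx

print(ss_e, rhs)  # the two numbers agree up to floating-point error
```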
Thus $E(SS_E)=E(\sum_{i=1}^{n} (y_i-\overline{y})^2)-E(\hat{\beta_1}^2 S_{xx})$.
Let us focus on $E\left(\sum_{i=1}^{n} (y_i-\overline{y})^2\right)$. Rather than expanding the square right away, it is more efficient to substitute $y_i=\beta_0+\beta_1x_i+\epsilon_i$, because the $\epsilon_i$'s are easier to work with: they have zero mean!
Denote $\overline{\epsilon}=\frac{1}{n}\sum_{i=1}^{n}\epsilon_i$.
Now,
\begin{equation}
\begin{aligned}
\sum_{i=1}^{n} (y_i-\overline{y})^2&=\sum_{i=1}^{n} (\beta_0+\beta_1x_i+\epsilon_i-\beta_0-\beta_1\overline{x}-\overline{\epsilon})^2\\
&=\sum_{i=1}^{n}(\beta_1(x_i-\overline{x})+(\epsilon_i-\overline{\epsilon}))^2\\
&=\sum_{i=1}^{n} \beta_1^2(x_i-\overline{x})^2+\sum_{i=1}^{n} (\epsilon_i-\overline{\epsilon})^2+2\beta_1 \sum_{i=1}^{n} (x_i-\overline{x})(\epsilon_i-\overline{\epsilon})
\end{aligned}
\end{equation}
So, $E\left(\sum_{i=1}^{n} (y_i-\overline{y})^2\right)=\sum_{i=1}^{n} \beta_1^2(x_i-\overline{x})^2+\sum_{i=1}^{n}E(\epsilon_i-\overline{\epsilon})^2+2\beta_1 \sum_{i=1}^{n} (x_i-\overline{x})E(\epsilon_i-\overline{\epsilon})$. Clearly, the last term is $0$.
And, $\sum_{i=1}^{n}E(\epsilon_i-\overline{\epsilon})^2=E\left(\sum_{i=1}^{n}\epsilon_i^2-n\overline{\epsilon}^2\right)=\sum_{i=1}^{n}E(\epsilon_i^2)-\frac{1}{n}E\left(\sum_{i=1}^{n} \epsilon_i^2 +\sum_{i \neq j} \epsilon_i \epsilon_j\right)=n \sigma^2 -\sigma^2=(n-1)\sigma^2$.
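If you want to check the intermediate result $E\left(\sum_{i=1}^{n}(y_i-\overline{y})^2\right)=\beta_1^2 S_{xx}+(n-1)\sigma^2$ numerically, here is a small Monte Carlo sketch (the true parameters $\beta_0=1.5$, $\beta_1=2$, $\sigma=1$ and the fixed design are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed design and assumed (illustrative) true parameters.
n, beta0, beta1, sigma = 30, 1.5, 2.0, 1.0
x = rng.uniform(0, 10, size=n)
S_xx = np.sum((x - x.mean()) ** 2)

reps = 20000
total = 0.0
for _ in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    total += np.sum((y - y.mean()) ** 2)

print(total / reps)                              # Monte Carlo estimate of E(sum (y_i - ybar)^2)
print(beta1 ** 2 * S_xx + (n - 1) * sigma ** 2)  # theoretical value derived above
```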
Finally, since $E(\hat{\beta_1}^2 S_{xx})=S_{xx}\left(Var(\hat{\beta_1})+[E(\hat{\beta_1})]^2\right)=S_{xx}\left(\frac{\sigma^2}{S_{xx}}+\beta_1^2\right)=\sigma^2+\beta_1^2 S_{xx}$, we get $E(SS_E)=\beta_{1}^2 S_{xx} +(n-1)\sigma^2-\sigma^2- \beta_1^2 S_{xx}=(n-2)\sigma^2$, so $MSE=\frac{SS_E}{n-2}$ is an unbiased estimator of $\sigma^2$.
(I have skipped the verification that $\hat{\beta_1}$ is unbiased with $Var(\hat{\beta_1})=\frac{\sigma^2}{S_{xx}}$. You should check it; it is not difficult once you write $\hat{\beta_1}=\frac{1}{S_{xx}}\sum_{i=1}^{n}(x_i-\overline{x})y_i$ and use that the $y_i$'s are uncorrelated with common variance $\sigma^2$.)
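Here is a small Monte Carlo sketch of the final claim, that $MSE=SS_E/(n-2)$ averages out to $\sigma^2$ (the parameter values and design are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

n, beta0, beta1, sigma = 25, 1.5, 2.0, 1.0   # illustrative values
x = rng.uniform(0, 10, size=n)               # fixed design across replications
x_bar = x.mean()
S_xx = np.sum((x - x_bar) ** 2)

reps = 20000
mse_values = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - x_bar) * (y - y.mean())) / S_xx
    b0 = y.mean() - b1 * x_bar
    ss_e = np.sum((y - b0 - b1 * x) ** 2)
    mse_values[r] = ss_e / (n - 2)           # MSE = SS_E / (n - 2)

print(mse_values.mean())   # should be close to sigma^2 = 1
```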
Solution 2:
I'll show you a general way to prove it...
Note that $SS_E=\sum_i(Y_i-\hat{\beta}_0-\hat{\beta}_1x_i)^2$. There are at least two ways to show the result; both are easy, but it is convenient to work with vectors and matrices.
Define the model as $Y_{(n\times 1)}=X_{(n\times k)}\beta_{(k\times 1)}+\epsilon_{(n\times 1)}$ (in your case $k=2$) with $E[\epsilon]=0_{(n\times 1)}$ and $Cov(\epsilon)=\sigma^2I_{(n\times n)}$. With this framework, $$SS_E=(Y-X\hat{\beta})^{\top}(Y-X\hat{\beta})=Y^{\top}(I-P)Y,$$ where $P$ is the projection matrix on the column space of $X$. It is a fact that $\hat{\beta}$ is such that $PY=X\hat{\beta}$, and if $X$ has full column rank, $\hat{\beta}=(X^{\top}X)^{-1}X^{\top}Y$ and $P=X(X^{\top}X)^{-1}X^{\top}$.
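As an illustration of this setup, the following numpy sketch (with a hypothetical simple-regression design, so $k=2$) checks that $PY=X\hat{\beta}$ and that $Y^{\top}(I-P)Y$ equals the residual sum of squares computed from $\hat{\beta}$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simple-regression design matrix (k = 2: intercept and slope), illustrative data.
n = 20
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.5, 2.0]) + rng.normal(0, 1.0, size=n)

# Projection matrix onto the column space of X (X is full rank here).
P = X @ np.linalg.inv(X.T @ X) @ X.T
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

print(np.allclose(P @ y, X @ beta_hat))      # P Y = X beta_hat
ss_e_quad = y @ (np.eye(n) - P) @ y          # Y' (I - P) Y
ss_e_res  = np.sum((y - X @ beta_hat) ** 2)  # (Y - X beta_hat)'(Y - X beta_hat)
print(ss_e_quad, ss_e_res)                   # agree up to floating-point error
```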
If $\epsilon\sim N_n(0,\sigma^2I)$, the result is immediate, because $Y\sim N_n(X\beta,\sigma^2I_n)$ and $$\dfrac{SS_E}{\sigma^2}=\dfrac{Y^{\top}(I-P)Y}{\sigma^2}\sim\chi^2_{(n-k)},$$ because $I-P$ is a projection matrix of rank $n-k$ and $(I-P)X\beta=0$. Since a $\chi^2_{(n-k)}$ variable has mean $n-k$, it follows that $E[SS_E]=(n-k)\sigma^2$.
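A quick simulation sketch is consistent with this (the design and parameters below are illustrative assumptions): across replications, $SS_E/\sigma^2$ has mean close to $n-k$ and variance close to $2(n-k)$, as a $\chi^2_{(n-k)}$ variable should.

```python
import numpy as np

rng = np.random.default_rng(4)

n, k, sigma = 20, 2, 1.0
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
P = X @ np.linalg.inv(X.T @ X) @ X.T
I_minus_P = np.eye(n) - P
beta = np.array([1.5, 2.0])                  # illustrative true coefficients

reps = 20000
stats = np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(0, sigma, size=n)
    stats[r] = y @ I_minus_P @ y / sigma ** 2  # SS_E / sigma^2

# A chi^2_(n-k) variable has mean n-k and variance 2(n-k).
print(stats.mean(), n - k)        # ~ 18
print(stats.var(), 2 * (n - k))   # ~ 36
```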
The second way doesn't need $\epsilon\sim N_n(0,\sigma^2I)$, only $E[\epsilon]=0_{(n\times 1)}$ and $Cov(\epsilon)=\sigma^2I_{(n\times n)}$. But you need to show that for any random vector $Z_{(n\times 1)}$ with $E[Z]=\mu$ and $Cov(Z)=\Sigma$, and any symmetric matrix $A_{(n\times n)}$, $$E[Z^{\top}AZ]=tr(A\Sigma)+\mu^{\top}A\mu.$$ So in this case \begin{align*} E[SS_E]&=tr(\sigma^2(I-P))+(X\beta)^{\top}(I-P)X\beta\\ &=\sigma^2(n-k)+0, \end{align*} where we use that $PX=X$ (by definition of $P$) and $tr(I-P)=n-k$, because $P$ has only eigenvalues $0$ and $1$, so its trace (the sum of its eigenvalues) equals its rank $k$. Hence $MSE=\dfrac{SS_E}{n-k}$ is unbiased for $\sigma^2$.
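If you want to convince yourself of the quadratic-form identity $E[Z^{\top}AZ]=tr(A\Sigma)+\mu^{\top}A\mu$ before proving it, here is a minimal Monte Carlo sketch (the choice of $\mu$, $\Sigma$, $A$, and the Gaussian sampling distribution are arbitrary; the identity itself only involves the first two moments):

```python
import numpy as np

rng = np.random.default_rng(5)

# Arbitrary mean vector, covariance matrix, and symmetric A (all illustrative).
p = 4
mu = rng.normal(size=p)
B = rng.normal(size=(p, p))
Sigma = B @ B.T + np.eye(p)   # positive definite covariance
A = rng.normal(size=(p, p))
A = (A + A.T) / 2             # symmetrize

reps = 200000
Z = rng.multivariate_normal(mu, Sigma, size=reps)   # Gaussian is just a convenient choice
quad = np.einsum('ij,jk,ik->i', Z, A, Z)            # Z_r' A Z_r for each replication r

print(quad.mean())                        # Monte Carlo estimate of E[Z'AZ]
print(np.trace(A @ Sigma) + mu @ A @ mu)  # tr(A Sigma) + mu' A mu
```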