Why is polynomial regression considered a kind of linear regression?

Here is what I mean by polynomial regression. As an example, the hypothesis function is

$$h(x; t_0, t_1, t_2) = t_0 + t_1 x + t_2 x^2 ,$$

and the sample points are

$$ (x_1, y_1), (x_2, y_2), \ldots$$


This is a form of linear regression because it takes the form

$$h(x)=\sum_i t_if_i(x)\;,$$

which is a linear combination of the functions $f_i(x)$ and can be solved using only linear algebra. The non-linearity of the $f_i(x)$ doesn't complicate the solution; it enters only in calculating the values $f_i(x_j)$, and everything is then linear in those values. What matters is that the hypothesis is linear in the parameters $t_i$; otherwise the parameters would have to be found by non-linear optimization.
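
To make that concrete, here is a minimal sketch in NumPy (the sample points below are made up for illustration): the quadratic fit is just an ordinary linear least-squares problem once the design matrix with columns $f_0(x)=1$, $f_1(x)=x$, $f_2(x)=x^2$ has been built.

```python
import numpy as np

# Hypothetical sample points (x_j, y_j); any data would do.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.1, 1.4, 2.3, 3.9, 6.2, 9.0, 12.8])

# Design matrix with columns f_0(x)=1, f_1(x)=x, f_2(x)=x^2.
# The non-linearity enters only here, when evaluating f_i(x_j).
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary linear least squares for the parameters t_0, t_1, t_2.
t_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(t_hat)  # estimated [t_0, t_1, t_2]
```

The polynomial appears only when the columns of `X` are evaluated; the solve itself is plain linear algebra.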


Everything in "polynomial regression" for which linearity matters is linear. In linear regression, the vector of least-squares estimators $\widehat\alpha,\widehat\beta,\widehat\gamma,\ldots$ depends in a linear way on the vector of response variables $y_1,y_2,y_3,\ldots,y_n$. The response variables are in effect treated as random and the predictor variables are in effect treated as fixed, i.e. non-random. That can make sense despite the fact that if you take a new sample, both the response variables and the predictor variables change. The reason for that is that you're interested in the conditional distribution of the response variable given the predictor variable.

So say we have $$ y_i = \alpha_0 + \alpha_1 x_i + \alpha_2 x_i^2 + \mathrm{error}_i. $$ We observe the $x$s and the $y$s and then find the least-squares estimates $\widehat\alpha_0,\widehat\alpha_1,\widehat\alpha_2$. Then we take a new sample with the same $x$s, but instead of the $y$s we observe $w_1,\ldots,w_n$, and we again find least-squares estimates; call them $\widehat\beta_0,\widehat\beta_1,\widehat\beta_2$. Now suppose in place of the response variables we put $y_1+w_1,\ldots,y_n+w_n$ and again find the least-squares estimates. What do we get? The answer is just $\widehat\alpha_0+\widehat\beta_0,\ \widehat\alpha_1+\widehat\beta_1,\ \widehat\alpha_2+\widehat\beta_2$. That's linearity. (A similar proposition applies to scalar multiples.)
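
A quick numerical check of that additivity (again just a sketch, with made-up responses at a fixed set of $x$s):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed predictors and the corresponding quadratic design matrix.
x = np.linspace(0.0, 3.0, 8)
X = np.column_stack([np.ones_like(x), x, x**2])

# Two different response vectors observed at the same x's.
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)
w = -0.5 + 1.0 * x - 0.2 * x**2 + rng.normal(scale=0.3, size=x.size)

def fit(resp):
    """Least-squares coefficients for responses `resp` at the fixed design X."""
    return np.linalg.lstsq(X, resp, rcond=None)[0]

alpha_hat = fit(y)      # estimates from the y's
beta_hat = fit(w)       # estimates from the w's
sum_hat = fit(y + w)    # estimates from the summed responses

# The estimates from y + w are the sums of the separate estimates.
print(np.allclose(sum_hat, alpha_hat + beta_hat))  # True
```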

We want to know about the probability distribution of the least-squares estimates when we know the joint probability distribution of the $y$s. Linearity makes that easy.
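
For example, writing the model in matrix form with design matrix $X$ (rows $(1, x_i, x_i^2)$, assumed to have full column rank) and response vector $y$, the least-squares estimator is

$$\widehat\alpha = (X^\top X)^{-1} X^\top y\;,$$

which is linear in $y$. So if, say, the $y$s are jointly normal with $y \sim N(X\alpha, \sigma^2 I)$, then $\widehat\alpha \sim N\big(\alpha,\ \sigma^2 (X^\top X)^{-1}\big)$, and the distribution of the estimates follows directly from the distribution of the $y$s.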