How is the formula for linear regression connected to the fact that the conditional mean is the optimal estimator?
It’s not the linear function that minimizes the MSE; minimizing the MSE is how you estimate its parameters. If you are trying to minimize the MSE of a linear estimator, then there are nice linear algebra formulas, called the normal equations, that save you from having to numerically optimize over the $\beta_i$.
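To make this concrete: with a design matrix $X$ and response vector $y$, the normal equations are $X^\top X \hat\beta = X^\top y$. Below is a minimal sketch for the simplest case, a line with intercept $\beta_0$ and slope $\beta_1$, where the system is only 2×2 and can be solved by hand. The toy data and the name `fit_line` are made up for illustration.

```python
# Hypothetical toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def fit_line(xs, ys):
    """Least-squares intercept and slope via the 2x2 normal equations."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # With design-matrix columns [1, x], X^T X beta = X^T y reads:
    #   [ n   sx  ] [b0]   [ sy  ]
    #   [ sx  sxx ] [b1] = [ sxy ]
    # Solve it with Cramer's rule (assumes the xs are not all equal).
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1


b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

No iterative optimizer is involved: the estimates come out of a single linear solve.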
However, if you are in nonlinear territory, then the MSE may be just as hard to optimize as any other loss function.
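As a rough illustration of the nonlinear case, take the one-parameter model $\hat y = e^{bx}$. Setting the derivative of the MSE with respect to $b$ to zero does not give a linear system, so the sketch below falls back on plain gradient descent. The data, starting point, learning rate, and step count are all assumptions picked for this toy example, not a general recipe.

```python
import math

# Hypothetical noiseless data generated from y = exp(0.7 * x).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [math.exp(0.7 * x) for x in xs]


def mse(b):
    """Mean squared error of the model y_hat = exp(b * x)."""
    return sum((math.exp(b * x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(b):
    """Derivative of the MSE with respect to b."""
    return sum(
        2.0 * (math.exp(b * x) - y) * x * math.exp(b * x)
        for x, y in zip(xs, ys)
    ) / len(xs)


def fit_exponential(b_init=0.0, lr=0.1, steps=2000):
    """Minimize the MSE by gradient descent (settings chosen for this toy data)."""
    b = b_init
    for _ in range(steps):
        b -= lr * mse_grad(b)
    return b


b_hat = fit_exponential()
print(b_hat, mse(b_hat))
```

Here the estimate is reached only by iterating, and on harder nonlinear models the same loop can stall or land in a local minimum.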