What is the theoretical justification for alternatives to MSE minimization?

I'm trying to wrap my head around the connection between statistical regression and its probability theoretical justification. In many books on statistics/machine learning, one is introduced to the idea of the loss function, which is then typically followed by a phrase of the flavour 'a popular choice for this function is mean squared loss'. As far as I understand, the justification for this choice stems from the theorem that

$$ \arg\min_{Z \in L^2(\mathcal{G})} \ \mathbb{E} \left[ (X - Z)^2 \right] = \mathbb{E} \left[ X \mid \mathcal{G} \right] \tag{*} $$

where $X$ is the random variable to be estimated based on the information contained in $\mathcal{G}$. As far as I understand, probability theory teaches us that the conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ is the best such estimate. If that's the case, why should our loss function still be a choice? Clearly we should be statistically estimating $\mathbb{E}[X \mid \mathcal{G}]$, which by (*) implies minimizing the MSE.
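For reference, as I understand it (*) follows from the orthogonality of $X - \mathbb{E}[X \mid \mathcal{G}]$ to every $\mathcal{G}$-measurable $Z \in L^2$, which gives the decomposition

$$ \mathbb{E}\left[(X - Z)^2\right] = \mathbb{E}\left[\left(X - \mathbb{E}[X \mid \mathcal{G}]\right)^2\right] + \mathbb{E}\left[\left(\mathbb{E}[X \mid \mathcal{G}] - Z\right)^2\right], $$

so the first term is an irreducible floor and the second vanishes exactly when $Z = \mathbb{E}[X \mid \mathcal{G}]$ almost surely.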

You could argue that such reasoning is circular because we define the conditional expectation to satisfy (*), but that doesn't seem true, as we have conditional expectations for any random variable in $L^1$, and moreover there have been many eloquent posts on this website explaining how the $L^1$ definition can be intuitively interpreted in terms of measurability capturing the information contained in $\mathcal{G}$, etc. I would greatly appreciate it if someone could clear up my confusion.


When considering a loss function, you also need to consider its finite-sample convergence properties, the variance of the resulting estimator, and whether the conditional mean even exists.

For example:

If the data are heavy tailed, the sample mean can swing wildly until $n$ is large, or the mean may not exist at all (e.g. Cauchy error terms). You may then opt for a more robust loss function such as the Huber loss (https://en.m.wikipedia.org/wiki/Huber_loss); see the sketch below.
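As a rough illustration (my own sketch, assuming only `numpy` and `scipy`; the seed, sample size, and tuning constant `delta = 1.345` are arbitrary choices), here is how the MSE minimizer and a Huber-loss minimizer behave on Cauchy data:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.standard_cauchy(500)   # Cauchy errors: the population mean does not exist

def huber(r, delta=1.345):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

mse_fit = x.mean()                                            # minimizer of mean squared error
huber_fit = minimize_scalar(lambda m: huber(x - m).mean()).x  # minimizer of mean Huber loss

print(f"sample mean (MSE minimizer): {mse_fit:.3f}")   # unstable across seeds
print(f"Huber location estimate:     {huber_fit:.3f}") # stays near the Cauchy center (0)
```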

Another example is when the impact of errors is not symmetric: you may want the estimate to be a little biased based on economic considerations (asymmetric loss), as in the sketch after this paragraph.
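Again as a sketch rather than a recipe (the pinball/quantile loss is just one common asymmetric loss; the asymmetry `tau = 0.9` and the gamma-distributed "demand" data are made up for illustration), minimizing an asymmetric loss shifts the answer away from the mean toward a quantile:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
demand = rng.gamma(shape=2.0, scale=50.0, size=2000)  # hypothetical skewed "demand" data

def pinball(r, tau=0.9):
    """Pinball (quantile) loss: under-prediction (r > 0) costs tau per unit,
    over-prediction costs 1 - tau per unit."""
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

mse_forecast = demand.mean()   # the symmetric (MSE) answer: the sample mean
asym_forecast = minimize_scalar(
    lambda q: pinball(demand - q).mean(),
    bounds=(demand.min(), demand.max()),
    method="bounded",
).x                            # the asymmetric answer: roughly the 0.9-quantile

print(f"MSE forecast:        {mse_forecast:.1f}")
print(f"asymmetric forecast: {asym_forecast:.1f}  (compare np.quantile(demand, 0.9))")
```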

In the end, you need to get the best decisions from your algorithm given the quantity and quality of the data you have, and quadratic loss (MSE) may not be the best choice for that.