What is the theoretical justification for alternatives to MSE minimization?
I'm trying to wrap my head around the connection between statistical regression and its probability-theoretic justification. In many books on statistics/machine learning, one is introduced to the idea of the loss function, which is then typically followed by a phrase of the flavour 'a popular choice for this function is the mean squared error'. As far as I understand, the justification for this choice stems from the theorem that
$$ \arg\min_{Z \in L^2(\mathcal{G})} \ \mathbb{E} \left[ (X - Z)^2 \right] = \mathbb{E} \left[ X \Vert \mathcal{G} \right] \tag{*} $$
where $X$ is the random variable to be estimated based on the information contained in $\mathcal{G}$. As far as I understand, probability theory teaches us that the conditional expectation $\mathbb{E}[X \Vert \mathcal{G}]$ is the best such estimate. If that's the case, why should our loss function still be a choice? Clearly we should be statistically estimating $\mathbb{E}[X \Vert \mathcal{G}]$, which by (*) implies minimizing the MSE.
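(As I understand it, the usual argument for (*) is that for any $Z \in L^2(\mathcal{G})$ the cross term vanishes by the tower property, giving the decomposition
$$ \mathbb{E}\left[ (X - Z)^2 \right] = \mathbb{E}\left[ \left( X - \mathbb{E}[X \Vert \mathcal{G}] \right)^2 \right] + \mathbb{E}\left[ \left( \mathbb{E}[X \Vert \mathcal{G}] - Z \right)^2 \right], $$
which is minimized exactly at $Z = \mathbb{E}[X \Vert \mathcal{G}]$.)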
You could argue that such reasoning is circular because we define the conditional expectation to satisfy (*), but that doesn't seem to be the case: the conditional expectation is defined for any random variable in $L^1$, and moreover there have been many eloquent posts on this website explaining how the $L^1$ definition can be intuitively interpreted in terms of measurability capturing the information contained in $\mathcal{G}$, etc. I would greatly appreciate it if someone could clear up my confusion.
When choosing a loss function you also need to consider finite-sample convergence properties, the variance of the resulting estimator, and whether the conditional mean even exists.
For example:
If the data are heavy-tailed, the sample mean can swing around substantially until $n$ is large, or the mean may not exist at all (e.g. Cauchy error terms). You may then opt for a more robust loss function (e.g., https://en.m.wikipedia.org/wiki/Huber_loss), as in the sketch below.
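A rough sketch in Python (assuming numpy and scipy are available) of what this looks like on Cauchy-distributed data, comparing the MSE minimizer (the sample mean) with a Huber-loss location estimate:

```python
# Illustrative sketch: location estimation under heavy tails.
# The sample mean is the constant minimizing squared error; on Cauchy data
# (no finite mean) it is unstable, while a Huber-loss minimizer is not.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.standard_cauchy(500)  # heavy-tailed sample, true location 0

def huber(r, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

mean_est = x.mean()                                           # MSE minimizer
huber_est = minimize_scalar(lambda c: huber(x - c).mean()).x  # Huber minimizer

print(f"sample mean : {mean_est:+.3f}")   # can be pulled far from 0 by outliers
print(f"Huber       : {huber_est:+.3f}")  # typically stays close to 0
print(f"median      : {np.median(x):+.3f}")
```

On draws like this the sample mean is often dragged away from zero by a handful of extreme observations, while the Huber estimate stays close to the median.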
Another example is when the impact of errors is not symmetric: based on economic considerations you may want the prediction to be deliberately biased in one direction (an asymmetric loss), as sketched below.
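For instance, with a pinball-style loss in which under-prediction costs (hypothetically) four times as much per unit as over-prediction, the best constant prediction is no longer the mean but roughly the 80th percentile. A sketch, again assuming numpy and scipy:

```python
# Illustrative sketch: an asymmetric (pinball-style) loss where
# under-predicting demand is 4x as costly per unit as over-predicting.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
demand = rng.lognormal(mean=2.0, sigma=0.5, size=10_000)

def asymmetric_loss(err, under_cost=4.0, over_cost=1.0):
    """Piecewise-linear loss: err > 0 means we predicted too low."""
    return np.where(err > 0, under_cost * err, -over_cost * err)

# Best constant prediction under squared error: the sample mean.
mse_pred = demand.mean()

# Best constant prediction under the asymmetric loss: a higher quantile,
# at level under_cost / (under_cost + over_cost) = 0.8 for these costs.
asym_pred = minimize_scalar(
    lambda q: asymmetric_loss(demand - q).mean(),
    bounds=(demand.min(), demand.max()),
    method="bounded",
).x

print(f"mean (MSE-optimal)      : {mse_pred:.2f}")
print(f"asymmetric-loss optimal : {asym_pred:.2f}")
print(f"80th percentile         : {np.percentile(demand, 80):.2f}")
```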
In the end, you need to get the best decisions from your algorithm given the quantity and quality of the data you have, and quadratic loss / MSE may not be the best choice for that.