Suppose we observe one draw of the random variable $X$, which is normally distributed, $X\sim\mathcal{N}(\mu,\sigma^2)$. The variance $\sigma^2$ is known; the mean $\mu$ is not. We want to estimate $\mu$.

Suppose further that the prior distribution is a truncated normal distribution $\mathcal{N}(\mu_0,\sigma^2_0,t)$, i.e., with density $f(\mu)=(c/\sigma_0)\,\phi((\mu-\mu_0)/\sigma_0)$ if $\mu<t$, and $f(\mu)=0$ otherwise, where $t$ is a known truncation point and $c$ is a normalizing constant. (Interpretation: we get noisy signals about $\mu$, which are known to be normally distributed with known variance---this is the draw of $X$---but we have prior knowledge that values $\mu\ge t$ are not possible.)
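(For concreteness, here is a minimal sketch of this prior in Python with scipy; the parameter values below are purely illustrative and not part of the question.)

```python
import numpy as np
from scipy import stats

# Purely illustrative numbers; nothing below depends on these specific choices.
mu0, sigma0, t = 1.0, 2.0, 1.5   # prior mean, prior sd, truncation point
sigma = 1.0                      # known sd of the signal X

# scipy parametrizes truncnorm by the truncation bounds in standardized units.
prior = stats.truncnorm(a=-np.inf, b=(t - mu0) / sigma0, loc=mu0, scale=sigma0)

print(prior.pdf(2.0))  # 0.0: values above t are impossible under the prior
print(prior.pdf(0.0))  # positive: below t, a rescaled normal density
```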

In this setup, is the resulting posterior a truncated normal distribution (truncated at $t$, like the prior)? I tried to adapt the derivation of the posterior for the well-known conjugate normal pair (e.g., here and here), and it seems to work. Do you see any mistake in this derivation?

The likelihood function is given by $$f(x|\mu)=\frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2} \right\}.$$ The prior density is ($\Phi(\cdot)$ is the cdf of the standard normal distribution) $$f(\mu)=\begin{cases} \frac{1}{\sigma_0\sqrt{2\pi}\,\Phi((t-\mu_0)/\sigma_0)} \exp\left\{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2} \right\} &\text{ if } \mu< t, \\ 0 & \text{else}. \end{cases}$$ The prior density can be rewritten as $$f(\mu)=c\, \phi((\mu-\mu_0)/\sigma_0)\,\mathbf{1}\{\mu<t\},$$ where $c$ is the normalizing constant (independent of $\mu$, but dependent on $t$). Now, by Bayes' rule, \begin{align} f(\mu|x)&\propto f(x|\mu) f(\mu)\\ &\propto\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2} \right\} \exp\left\{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2} \right\}\mathbf{1}\{\mu<t\} \\ &=\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2} -\frac{(\mu-\mu_0)^2}{2\sigma_0^2} \right\}\mathbf{1}\{\mu<t\}\\ &\propto \exp\left\{-\frac{1}{2\sigma^2\sigma_0^2/(\sigma^2+\sigma_0^2)} \left(\mu-\frac{\sigma^2\mu_0+\sigma_0^2 x}{\sigma^2+\sigma_0^2}\right)^2 \right\}\mathbf{1}\{\mu<t\}, \end{align} where the last step follows from completing the square in $\mu$ and dropping all factors that do not depend on $\mu$.

This is the kernel of the normal distribution with the usual posterior mean and variance (as if we had done the derivation for an untruncated prior), but truncated from above at $t$. In other words, ignoring the truncation in the prior distribution, applying the usual updating rule for the conjugate normal pair, and then applying the truncation gives the same result as the derivation above (assuming it is correct). Is it correct? All I do is add the indicator function (and adapt the normalizing constant); does that introduce problems somewhere?
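As a numerical sanity check on this derivation, one can compare a brute-force posterior, computed on a grid, against the claimed "update as usual, then truncate" answer; a minimal sketch, assuming the same illustrative parameter values as above:

```python
import numpy as np
from scipy import stats, integrate

# Illustrative numbers only; any sigma, sigma0 > 0 behave the same way.
mu0, sigma0, t, sigma, x = 1.0, 2.0, 1.5, 1.0, 0.7

# Brute force: likelihood times truncated-normal prior kernel, normalized on a grid.
grid = np.linspace(mu0 - 8 * sigma0, t, 20001)
kernel = (stats.norm.pdf(x, loc=grid, scale=sigma)
          * stats.norm.pdf(grid, loc=mu0, scale=sigma0))
post_numeric = kernel / integrate.trapezoid(kernel, grid)

# Claimed answer: update (mu0, sigma0) as in the untruncated case, then truncate at t.
m = (sigma**2 * mu0 + sigma0**2 * x) / (sigma**2 + sigma0**2)
tau = np.sqrt(sigma**2 * sigma0**2 / (sigma**2 + sigma0**2))
post_closed = stats.truncnorm(a=-np.inf, b=(t - m) / tau, loc=m, scale=tau)

# Maximum pointwise discrepancy is at the level of the quadrature error.
print(np.max(np.abs(post_numeric - post_closed.pdf(grid))))
```

The printed discrepancy is at floating-point/quadrature noise level, consistent with the claimed posterior.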


Solution 1:

Your derivation is correct.

I think the result is also very intuitive. As you pointed out, if you have a prior which is a normal distribution and a likelihood which is also normal, then the posterior will be another normal distribution.

$$f(\mu|x)\propto f(x|\mu) f(\mu)$$

Now suppose I came along, set a region of $f(\mu)$ to zero, and scaled the rest by $c$ to renormalize it. For values of $\mu$ where the prior was not set to zero, the right-hand side of the above equation is the same except that $f(\mu)$ is replaced by $c f(\mu)$. Therefore the left-hand side is also just scaled by $c$, but retains the exact shape of a normal distribution. So we end up with a scaled normal density, except of course at points where $f(\mu)$ is zero, where the left-hand side is also zero.
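A quick numerical illustration of this scaling argument (a sketch under the same illustrative setup as in the question): below $t$, the truncated posterior is the untruncated normal posterior multiplied by a single constant.

```python
import numpy as np
from scipy import stats

# Same illustrative numbers as before.
mu0, sigma0, t, sigma, x = 1.0, 2.0, 1.5, 1.0, 0.7
m = (sigma**2 * mu0 + sigma0**2 * x) / (sigma**2 + sigma0**2)
tau = np.sqrt(sigma**2 * sigma0**2 / (sigma**2 + sigma0**2))

# Ratio of the truncated posterior to the untruncated normal posterior, below t.
grid = np.linspace(m - 4 * tau, t - 1e-9, 1000)
trunc = stats.truncnorm(a=-np.inf, b=(t - m) / tau, loc=m, scale=tau)
ratio = trunc.pdf(grid) / stats.norm.pdf(grid, loc=m, scale=tau)

# Constant everywhere below t, and equal to 1 / Phi((t - m) / tau).
print(ratio.min(), ratio.max(), 1 / stats.norm.cdf((t - m) / tau))
```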

It won't cause problems, since it's correct, although the posterior might not be a nice function to work with if you're trying to derive something analytically. For example, the mean of your posterior is a very long, complicated expression when written out in the original parameters (which I was able to find in Mathematica).
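For what it's worth, expressed in terms of the updated parameters $m$ and $\tau$ from the derivation, the posterior mean takes the standard compact form for an upper-truncated normal, $E[\mu\mid x]=m-\tau\,\phi(\alpha)/\Phi(\alpha)$ with $\alpha=(t-m)/\tau$; it is only long when expanded in the original parameters. A sketch with the same illustrative numbers as above:

```python
import numpy as np
from scipy import stats

# Same illustrative numbers; m and tau are the updated parameters.
mu0, sigma0, t, sigma, x = 1.0, 2.0, 1.5, 1.0, 0.7
m = (sigma**2 * mu0 + sigma0**2 * x) / (sigma**2 + sigma0**2)
tau = np.sqrt(sigma**2 * sigma0**2 / (sigma**2 + sigma0**2))

# Mean of a normal N(m, tau^2) truncated from above at t.
alpha = (t - m) / tau
post_mean = m - tau * stats.norm.pdf(alpha) / stats.norm.cdf(alpha)

# Cross-check against scipy's truncated normal.
trunc = stats.truncnorm(a=-np.inf, b=alpha, loc=m, scale=tau)
print(post_mean, trunc.mean())  # the two agree
```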

Solution 2:

The new density is again a normal distribution truncated at $t$, with updated parameters $$m=\frac{\sigma^2\mu_0+\sigma_0^2 x}{\sigma^2+\sigma_0^2},\qquad \tau^2=\frac{\sigma^2\sigma_0^2}{\sigma^2+\sigma_0^2}.$$ The new normalising constant is $$\Phi\left( \frac{t-m}{\tau}\right),$$ where $\tau=\sqrt{\sigma^2\sigma_0^2/(\sigma^2+\sigma_0^2)}$ is the posterior standard deviation (note that the standard deviation, not the variance, appears in the denominator).
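A quick check of this constant (same illustrative numbers as in the question): $\Phi((t-m)/\tau)$ should equal the probability mass that the untruncated posterior places below $t$, which can also be obtained by integrating the posterior kernel numerically.

```python
import numpy as np
from scipy import stats, integrate

# Same illustrative numbers as before.
mu0, sigma0, t, sigma, x = 1.0, 2.0, 1.5, 1.0, 0.7
m = (sigma**2 * mu0 + sigma0**2 * x) / (sigma**2 + sigma0**2)
tau = np.sqrt(sigma**2 * sigma0**2 / (sigma**2 + sigma0**2))

# Phi((t - m) / tau) is the mass the untruncated posterior puts below t ...
Z = stats.norm.cdf((t - m) / tau)

# ... which we can also recover by integrating the posterior kernel directly.
kernel = lambda mu: stats.norm.pdf(x, mu, sigma) * stats.norm.pdf(mu, mu0, sigma0)
num, _ = integrate.quad(kernel, -np.inf, t)
den, _ = integrate.quad(kernel, -np.inf, np.inf)
print(Z, num / den)  # the two agree
```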

So the interesting thing is that conjugacy is preserved under truncation of the prior for the mean. It would be nice to study these posteriors for a fixed $t$ and different values of the prior parameters.
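For example, a minimal sweep of that kind in Python (with arbitrary illustrative choices of $t$, $\sigma$, and $x$) might look like this:

```python
import numpy as np
from scipy import stats

t, sigma, x = 1.5, 1.0, 0.7  # truncation point and data held fixed (arbitrary values)

def posterior_mean(mu0, sigma0):
    """Posterior mean under the truncated-normal prior N(mu0, sigma0^2, t)."""
    m = (sigma**2 * mu0 + sigma0**2 * x) / (sigma**2 + sigma0**2)
    tau = np.sqrt(sigma**2 * sigma0**2 / (sigma**2 + sigma0**2))
    alpha = (t - m) / tau
    return m - tau * stats.norm.pdf(alpha) / stats.norm.cdf(alpha)

for mu0 in (0.0, 1.0, 2.0):
    for sigma0 in (0.5, 1.0, 2.0):
        print(f"mu0={mu0}, sigma0={sigma0}: posterior mean = {posterior_mean(mu0, sigma0):.4f}")
```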