In what sense is the Jeffreys prior invariant?

Having come back to this question and thought about it a bit more, I believe I have finally worked out how to formally express the sense of "invariance" that applies to Jeffreys' priors, as well as the logical issue that prevented me from seeing it before.

The following lecture notes were helpful in coming to this conclusion, as they contain an explanation that is clearer than anything I could find at the time of writing the question: https://www2.stat.duke.edu/courses/Fall11/sta114/jeffreys.pdf

My key stumbling point was that the phrase "the Jeffreys prior is invariant" is strictly speaking incorrect: the invariance in question is not a property of any given prior, but rather a property of a method of constructing priors from likelihood functions.

That is, we want something that takes a likelihood function and gives us a prior for the parameters, and does so in such a way that reparametrizing first and then applying the method gives the same result as applying the method first and then transforming the resulting prior to the new parametrization. I was looking for an invariance property that would apply to a particular prior generated by Jeffreys' method, whereas the desired invariance in fact applies to Jeffreys' method itself.

To flesh this out, let's say that a "prior construction method" is a functional $M$ that maps the function $f(x \mid \theta)$ (the conditional probability density function of some data $x$ given some parameters $\theta$, considered as a function of both $x$ and $\theta$) to another function $\rho(\theta)$, which is to be interpreted as a prior probability density function for $\theta$. That is, $\rho(\theta) = M\{ f(x\mid \theta) \}$.

What we seek is a construction method $M$ with the following property: $$ M\{ f(x\mid h(\theta)) \}(\theta) = M\{ f(x \mid \theta) \}(h(\theta))\,\bigl|h'(\theta)\bigr|, $$ for any smooth monotonic transformation $h$. On the left, $M$ is applied to the likelihood reparametrized by $h$; on the right, $M$ is applied to the likelihood in its original parametrization, and the resulting prior is carried over to the new parametrization by the usual change-of-variables formula for densities, which is where the Jacobian factor $|h'(\theta)|$ comes from. That is, we can either apply $h$ to reparametrize the likelihood function and then use $M$ to obtain a prior, or we can first use $M$ on the original likelihood function and then transform the resulting prior (as a density), and the end result will be the same.
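
To see what this property rules out, here is a small example of my own: consider the naive method that ignores the likelihood entirely and always returns a flat prior, $M\{ f(x\mid\theta) \} \equiv 1$. Then the left-hand side above is identically $1$, while the right-hand side is $1 \cdot \lvert h'(\theta)\rvert$, which is not identically $1$ unless $\lvert h' \rvert \equiv 1$. So "always use a uniform prior" is not an invariant construction method, which is just the classic observation that a prior flat in $\theta$ is not flat in, say, $\theta^2$ or $\log\theta$.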

What Jeffreys provides is a prior construction method $M$ which has this property. My problem arose from looking at a particular example of a prior constructed by Jeffreys' method (i.e. the function $M\{ f(x\mid \theta )\}$ for some particular likelihood function $f(x \mid \theta)$) and trying to see that it has some kind of invariance property. In fact the desired invariance is a property of $M$ itself, rather than of the priors it generates.
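
For reference, Jeffreys' method is the rule $$ M\{ f(x\mid\theta) \}(\theta) \;\propto\; \sqrt{I(\theta)}, \qquad I(\theta) = \mathbb{E}_{x\mid\theta}\!\left[ \left( \frac{\partial}{\partial\theta} \log f(x\mid\theta) \right)^{\!2} \right], $$ the square root of the Fisher information. For a binomial likelihood with success probability $\theta$, for example, $I(\theta) \propto 1/(\theta(1-\theta))$, so this method produces the familiar $\mathrm{Beta}(1/2, 1/2)$ prior $\rho(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$.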

I do not currently know whether the particular prior construction method supplied by Jeffreys is unique in having this property. This seems rather an important question: if there is some other functional $M'$ that is also invariant and gives a different prior for the parameter of a binomial distribution, then there doesn't seem to be anything that picks out the Jeffreys prior for a binomial trial as particularly special. On the other hand, if no such $M'$ exists, then the Jeffreys prior does have a special property: it is the only prior that can be produced by a prior construction method that is invariant under parameter transformations. It would therefore be valuable to find either a proof that Jeffreys' prior construction method is unique in having this invariance property, or an explicit counterexample showing that it is not.


Maybe the problem is that you are forgetting the Jacobian of the transformation in (ii).

I suggest that you carefully check the formulas here (hint: $\left| \frac{d \Phi^{- 1}}{d y} \right|$ is the Jacobian, where $\Phi^{- 1}$ is the inverse transformation). Then work through some simple monotonic transformations in order to see the invariance; good ones to start with are $\varphi(\theta)=2\theta$ and $\varphi(\theta)=1-\theta$.
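
For instance, here is a quick symbolic check of those two transformations, a sketch assuming a single Bernoulli observation as the model (the helper function `jeffreys` is just my illustration):

```python
import sympy as sp

theta, phi, x = sp.symbols('theta phi x', positive=True)

def jeffreys(success_prob, param):
    """Unnormalized Jeffreys prior, sqrt(Fisher information), for one Bernoulli
    observation whose success probability is `success_prob`, a function of `param`."""
    loglik = x * sp.log(success_prob) + (1 - x) * sp.log(1 - success_prob)
    d2 = sp.diff(loglik, param, 2)
    # Fisher information: I(param) = -E[d^2 loglik], expectation over x ~ Bernoulli
    info = -(success_prob * d2.subs(x, 1) + (1 - success_prob) * d2.subs(x, 0))
    return sp.sqrt(sp.simplify(info))

# Jeffreys prior in the original parametrization: proportional to 1/sqrt(theta*(1-theta))
prior_theta = jeffreys(theta, theta)

# theta written as a function of the new parameter phi, for the two suggested maps
for theta_of_phi, label in [(phi / 2, "phi = 2*theta"), (1 - phi, "phi = 1 - theta")]:
    direct = jeffreys(theta_of_phi, phi)              # build the prior in phi directly
    jacobian = sp.Abs(sp.diff(theta_of_phi, phi))     # |d theta / d phi|
    transformed = prior_theta.subs(theta, theta_of_phi) * jacobian
    # the two routes agree (compare squares to avoid sqrt-simplification issues)
    print(label, sp.simplify(direct**2 - transformed**2) == 0)
```

Both comparisons come out equal: constructing the prior directly in the $\varphi$ parametrization matches transforming the original prior with the Jacobian.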

Also, to answer your question, the normalizing constants do not matter here; in (i) the constant is $\pi$. Redo the calculation with the $\pi$ kept in to see that point. Let me know if you are stuck somewhere.

Edit: The dependence on the likelihood is essential for the invariance to hold, because the information is a property of the likelihood and because the object of interest is ultimately the posterior. However, whichever likelihood you use, the invariance will hold. It works through the relationship $ \sqrt{I (\theta)} = \sqrt{I (\varphi (\theta))}\, | \varphi' (\theta) | $, which links the information under the original parametrization to the information under the transformed parametrization. Here $| \varphi' (\theta) |$ is the Jacobian of the transformation $\theta \mapsto \varphi(\theta)$; its reciprocal is the Jacobian of the inverse transformation that appears in the change-of-variables formula for densities. (I will let you verify this relationship by deriving the information from the likelihood: just use the chain rule after applying the definition of the information as the expected value of the square of the score.)

Now, for the prior. Write $p(\theta) = \sqrt{I(\theta)}$ for the (unnormalized) Jeffreys prior in the original parametrization, and let $p(\varphi(\theta))$ denote the density it induces on $\varphi(\theta)$ by change of variables. Then \begin{eqnarray*} p (\varphi (\theta) ) & = & \frac{1}{| \varphi' (\theta) |}\, p (\theta )\\ & = & \frac{1}{| \varphi' (\theta) |} \sqrt{I (\theta)} \\ & = & \sqrt{I (\varphi (\theta))}, \end{eqnarray*} which is exactly the Jeffreys prior constructed directly in the $\varphi$ parametrization. The first line just applies the Jacobian formula for transforming a density. The second line applies the definition of the Jeffreys prior. The third line applies the relationship between the two informations, and what it produces is, by definition, the Jeffreys prior for $\varphi(\theta)$. You can see that the use of the Jeffreys prior was essential for the $\frac{1}{| \varphi' (\theta) |}$ to cancel out.
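
For completeness, the chain-rule calculation alluded to above goes like this: since the same sampling density can be written in either parametrization, $p(y \mid \theta) = p(y \mid \varphi(\theta))$, the chain rule gives $$ \frac{\partial}{\partial \theta} \log p(y \mid \theta) = \left. \frac{\partial}{\partial \varphi} \log p(y \mid \varphi) \right|_{\varphi = \varphi(\theta)} \varphi'(\theta). $$ Squaring both sides and taking the expectation over $y$ (the definition of the information as the expected square of the score) yields $I(\theta) = I(\varphi(\theta))\, \varphi'(\theta)^2$, and taking square roots gives the relationship used above.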

Look again at what happens to the posterior ($y$ is the observed sample here): \begin{eqnarray*} p (\varphi (\theta) \mid y) & = & \frac{1}{| \varphi' (\theta) |} p (\theta \mid y)\\ & \propto & \frac{1}{| \varphi' (\theta) |} p (\theta)\, p (y \mid \theta)\\ & \propto & \frac{1}{| \varphi' (\theta) |} \sqrt{I (\theta)}\, p (y \mid \theta)\\ & \propto & \sqrt{I (\varphi (\theta))}\, p (y \mid \theta)\\ & \propto & p (\varphi (\theta))\, p (y \mid \theta) \end{eqnarray*} The only difference from before is that the second line applies Bayes' rule. Since the sampling density does not change under the reparametrization, $p(y \mid \theta) = p(y \mid \varphi(\theta))$, so the last line is again prior times likelihood, now in the $\varphi$ parametrization.

As I explained earlier in the comments, it is essential to understand how Jacobians (or, more generally, differential forms) work.