Deterministic Policy Gradient Theorem Proof

This paper states the deterministic policy gradient theorem. The proof of the theorem is provided in a supplement. This question is about the very first step in the proof. It begins with

$$ \nabla_\theta V^{\mu_\theta}(s) = \nabla_\theta Q^{\mu_\theta}(s, \mu_\theta(s)) = \nabla_\theta \left( r(s, \mu_\theta(s)) + \int_{\mathcal{S}} \gamma p(s' | s, \mu_\theta(s)) V^{\mu_\theta} (s')ds' \right) $$

and then more steps follow. The first equality is clear: it follows from the relation between the state value function $V$ and the state-action value function $Q$. In the more general case of a stochastic policy $\pi$ we have $$ V^\pi (s) = \sum_a \pi(a|s) Q^\pi(s, a). $$ When the policy is deterministic and $\pi$ is replaced by $\mu_\theta$, only one term of the sum survives and we get $$ V^{\mu_\theta}(s) = Q^{\mu_\theta}(s, \mu_\theta(s)). $$
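Equivalently, since the action space in this setting is continuous, the deterministic policy can be viewed as a point mass $\pi(a \mid s) = \delta(a - \mu_\theta(s))$, and the sum becomes an integral that collapses to a single term:

$$ V^{\mu_\theta}(s) = \int_{\mathcal{A}} \delta\big(a - \mu_\theta(s)\big)\, Q^{\mu_\theta}(s, a)\, da = Q^{\mu_\theta}(s, \mu_\theta(s)). $$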

Now we apply $\nabla_\theta$ to each side. This is where the confusion starts. On the right-hand side we can use the chain rule and get

$$ \nabla_\theta V^{\mu_\theta}(s) = \nabla_\theta Q^{\mu_\theta}(s, \mu_\theta(s)) = \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s, a)|_{a=\mu_\theta(s)} $$ and be done. Doesn't this directly prove the theorem?

Surely I must have missed something. Can you please explain why this simple differentiation rule is not used in the proof, why the long substitution in the next step is needed, and where that substitution comes from?

Also, it would be helpful to see what the Bellman equations look like for a deterministic policy.

Many thanks


Solution 1:

The basic issue is that $Q^{\mu_{\theta}}(s,\mu_{\theta}(s))$ depends on $\theta$ both through the $\mu_{\theta}(s)$ in the argument and through the sequence of all future states that will be visited under $\mu_{\theta}$. Your expression with the chain rule is incorrect, since you are only differentiating with respect to the first kind of dependence.
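To make that explicit, write $f(\theta', a) = Q^{\mu_{\theta'}}(s, a)$, treating the parameter in the superscript and the action argument as separate variables. The full chain rule then has two terms, and your expression is only the second one:

$$ \nabla_{\theta} V^{\mu_{\theta}}(s) = \nabla_{\theta} f(\theta, \mu_{\theta}(s)) = \nabla_{\theta'} f(\theta', a)\Big|_{\theta'=\theta,\, a=\mu_{\theta}(s)} + \nabla_{\theta}\mu_{\theta}(s)\, \nabla_{a} f(\theta, a)\Big|_{a=\mu_{\theta}(s)}. $$

The first term captures how the value of the policy itself changes with $\theta$ (through all future actions), and it is what the supplement deals with by substituting the Bellman equation and unrolling it recursively; as far as I can tell, that is where the long substitution comes from. As for the Bellman equations under a deterministic policy, they are just the stochastic ones with the expectation over actions removed:

$$ V^{\mu_{\theta}}(s) = r(s,\mu_{\theta}(s)) + \gamma \int_{\mathcal{S}} p(s' \mid s,\mu_{\theta}(s))\, V^{\mu_{\theta}}(s')\, ds', \qquad Q^{\mu_{\theta}}(s,a) = r(s,a) + \gamma \int_{\mathcal{S}} p(s' \mid s,a)\, V^{\mu_{\theta}}(s')\, ds'; $$

the first of these is precisely the expression inside the parentheses in the step you quoted.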

Here's a simple, but somewhat contrived, example to illustrate. Let the state space and the action space of the MDP both be the real line, and let the transition structure be $p(s'|s,a)=\delta(s'-(s-a))$ (that is, taking action $a$ deterministically moves the agent from $s$ to $s-a$). Consider the reward $r(s,a)=s-a$. In particular, $r(s_n,a_n)=s_{n+1}$ (i.e., the reward is numerically equal to the agent's position at the next timepoint).

Define the deterministic policy $\mu_{\theta}(s)=\theta s$, for $0<\theta<1$. Under this policy, we have $s'=s-\theta s=(1-\theta)s$.

We want to calculate $Q^{\theta}(s,a)$, the return from taking action $a$ in state $s$ and then following the policy $\mu_{\theta}$ thereafter. The sequence of states is $s,\ s-a,\ (1-\theta)(s-a),\ (1-\theta)^2(s-a),\dots$, and the discounted sum of rewards is correspondingly:

$$Q^{\theta}(s,a)=s-a+\gamma(1-\theta)(s-a)+\gamma^2(1-\theta)^2(s-a)+...={\frac {s-a}{1-\gamma(1-\theta)}}$$

In particular, $V^{\theta}(s)=Q^{\theta}(s,\theta s)={\frac {s(1-\theta)}{1-\gamma(1-\theta)}}$.

You can check that $\nabla_{\theta} V^{\theta}(s)=-{\frac {s}{(1-\gamma(1-\theta))^2}}$. On the other hand, $\nabla_{a} Q^{\theta}(s,a)\big|_{a=\theta s}=-{\frac {1}{1-\gamma(1-\theta)}}$ (which happens not to depend on $a$ here) and $\nabla_{\theta}\mu_{\theta}(s)=s$, so their product is $-{\frac {s}{1-\gamma(1-\theta)}}$, which does not equal $\nabla_{\theta}V^{\theta}(s)$.
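If it helps, here is a small numerical sanity check of this example (my own sketch, not from the paper): it rolls out the MDP above, estimates both quantities by finite differences, and confirms that they disagree.

```python
# Numerical check of the counterexample: s' = s - a, r(s, a) = s - a,
# deterministic policy mu_theta(s) = theta * s, discount factor gamma.

gamma, theta, s0 = 0.9, 0.5, 2.0

def Q(s, a, th, n_steps=500):
    """Discounted return for taking action a in state s, then following mu_th."""
    total = s - a            # first reward r(s, a)
    state = s - a            # next state after the initial action
    discount = gamma
    for _ in range(n_steps):
        action = th * state              # follow the policy mu_th(s) = th * s
        total += discount * (state - action)
        state -= action
        discount *= gamma
    return total

def V(s, th):
    """Value of following mu_th from state s: V(s) = Q(s, mu_th(s))."""
    return Q(s, th * s, th)

eps = 1e-5
# True gradient of V with respect to theta (both dependences on theta vary).
grad_V = (V(s0, theta + eps) - V(s0, theta - eps)) / (2 * eps)
# The expression from the question: grad_theta mu(s) * grad_a Q(s, a)|_{a = mu(s)}.
grad_mu = s0                                              # d/dtheta of theta * s
grad_a_Q = (Q(s0, theta * s0 + eps, theta) - Q(s0, theta * s0 - eps, theta)) / (2 * eps)

print(grad_V)              # about -6.61  =  -s / (1 - gamma*(1 - theta))**2
print(grad_mu * grad_a_Q)  # about -3.64  =  -s / (1 - gamma*(1 - theta))
```

The two printed numbers match the closed forms above: the true gradient carries an extra factor of $\frac{1}{1-\gamma(1-\theta)}$ that the question's chain-rule expression misses.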