Markov Decision Process
Consider a Markov decision process $(S, A, P, R)$ where $S=\{0,1,2,3,4\}$ is the state space, $A=\left\{a^{1}, a^{2}\right\}$ is the action space, and $P$ is the transition probability such that $$ \begin{aligned} &P\left((s+1) \% 5 \mid s, a^{1}\right)=P\left((s+2) \% 5 \mid s, a^{1}\right)=P\left((s+4) \% 5 \mid s, a^{1}\right)=\frac{1}{3}, \\ &P\left((s+1) \% 5 \mid s, a^{2}\right)=P\left((s+3) \% 5 \mid s, a^{2}\right)=\frac{1}{2} \end{aligned} $$ (where $a \% b$ means the remainder when $a$ is divided by $b$) for all $s \in S$. The reward $R$ is such that $R\left(s, a, s^{\prime}\right)$ follows Bernoulli$\left(\frac{1}{2}\right)$ if $s^{\prime}=0$ and equals $0$ otherwise. When the process reaches state $4$, it is terminated (i.e. $4$ is the terminal state). Assume the initial state is $s_{0}=0$ and set the discount factor $\gamma=0.9.$ Let $\pi$ be a Markovian randomized stationary policy such that $\pi\left(a^{1} \mid s\right)=\pi\left(a^{2} \mid s\right)=0.5$ for $s=0,2$ and $\pi\left(a^{1} \mid s\right)=0.7$, $\pi\left(a^{2} \mid s\right)=0.3$ for $s=1,3.$ In this problem, you may use Python code for matrix calculations, e.g. matrix addition, multiplication, inversion, $\ldots$
Given this setup, how can we calculate $V^{\pi}(s)$ and $Q^{\pi}(s, a)$?
Having gone through several ideas for MDPs, I find it somewhat confusing how to tackle the problem theoretically once we adopt the policy $\pi$, especially how to evaluate the probability that a given trajectory is sampled. It would be great if anyone could share some insight on this calculation.
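For reference, here is a minimal sketch of how the problem data could be encoded in NumPy (the names `P1`, `P2`, `R_bar`, `pi1` are just placeholders I chose; the policy entry for the terminal state $4$ is an arbitrary filler since it never matters):

```python
import numpy as np

n = 5  # states 0, 1, 2, 3, 4

# Transition matrices: P1[s, s'] = P(s' | s, a^1), P2[s, s'] = P(s' | s, a^2)
P1 = np.zeros((n, n))
P2 = np.zeros((n, n))
for s in range(n):
    for step in (1, 2, 4):              # a^1 moves +1, +2 or +4 (mod 5), each with prob 1/3
        P1[s, (s + step) % n] = 1 / 3
    for step in (1, 3):                 # a^2 moves +1 or +3 (mod 5), each with prob 1/2
        P2[s, (s + step) % n] = 1 / 2

# Expected reward E[R(s, a, s')]: 1/2 whenever s' = 0, otherwise 0 (same for both actions)
R_bar = np.zeros((n, n))
R_bar[:, 0] = 0.5

# Policy pi(a^1 | s) for s = 0, ..., 4 (the entry for s = 4 is a filler: 4 is terminal)
pi1 = np.array([0.5, 0.7, 0.5, 0.7, 0.5])
pi2 = 1.0 - pi1

print(P1.sum(axis=1), P2.sum(axis=1))   # sanity check: every row sums to 1
```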
Solution 1:
The Bellman equation for this problem is given as a system of $|S|$ equations; since the reward is random, only its expectation $\mathbb{E}[R(s,a,s')]$ enters:
$$V^\pi(s)=\sum_{actions\ a}\pi(a\mid s)\sum_{states\ s'}P(s'\mid s,a)\bigl(\mathbb{E}[R(s,a,s')]+\gamma\cdot V^\pi(s')\bigr)$$
The trick is to rewrite this in matrix form (with $\odot$ denoting the entrywise product): $$V^\pi=\Pi_1\bigl((P_1\odot\bar R_1)\mathbf 1+\gamma P_1V^\pi\bigr)+\Pi_2\bigl((P_2\odot\bar R_2)\mathbf 1+\gamma P_2V^\pi\bigr)$$
where $$\Pi_i=\operatorname{diag}\bigl(\pi(a^i\mid 0),\ldots,\pi(a^i\mid 4)\bigr),\qquad (P_i)_{ss'}=P(s'\mid s,a^i)$$ $$\bar R_i=\begin{pmatrix} \mathbb{E}[R(0,a^i,0)]&\ldots &\mathbb{E}[R(0,a^i,4)]\\ \vdots&\ddots&\vdots\\ \mathbb{E}[R(4,a^i,0)]&\ldots& \mathbb{E}[R(4,a^i,4)] \end{pmatrix}$$ $$\mathbf 1=(1,\ldots,1)^\top$$ Since $\mathbb{E}[R(s,a,s')]=\frac12$ if $s'=0$ and $0$ otherwise, each $\bar R_i$ has $\frac12$ in its first column and zeros elsewhere.
Then, you can regroup and get $$V^\pi=\bigl(\Pi_1(P_1\odot\bar R_1)+\Pi_2(P_2\odot\bar R_2)\bigr)\mathbf 1+\gamma(\Pi_1P_1+\Pi_2P_2)V^\pi$$
Which finally results in $$V^\pi=\bigl(I-\gamma(\Pi_1P_1+\Pi_2P_2)\bigr)^{-1}\bigl(\Pi_1(P_1\odot\bar R_1)+\Pi_2(P_2\odot\bar R_2)\bigr)\mathbf 1$$ The inverse exists for $\gamma<1$ because $\Pi_1P_1+\Pi_2P_2$ has spectral radius at most $1$. Please check this carefully. Since state $4$ is terminal, zero out the rows of $P_1$ and $P_2$ corresponding to $s=4$ before solving; this forces $V^\pi(4)=0$. Once $V^\pi$ is known, the action values follow from one more backup: $Q^\pi(s,a)=\sum_{s'}P(s'\mid s,a)\bigl(\mathbb{E}[R(s,a,s')]+\gamma V^\pi(s')\bigr)$. The formula is a bit more involved than what you usually find in the literature, because the rewards are random and depend on the state transition as well.
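Here is a sketch of that closed-form solve in NumPy, under the conventions above (the variable names are mine; the rows of $P_1,P_2$ at the terminal state $4$ are zeroed, and the policy entry for $s=4$ is an irrelevant filler):

```python
import numpy as np

n, gamma = 5, 0.9

# Transition matrices: P_i[s, s'] = P(s' | s, a^i)
P1 = np.zeros((n, n))
P2 = np.zeros((n, n))
for s in range(n):
    for step in (1, 2, 4):
        P1[s, (s + step) % n] = 1 / 3
    for step in (1, 3):
        P2[s, (s + step) % n] = 1 / 2
P1[4, :] = 0.0                      # state 4 is terminal: no further transitions
P2[4, :] = 0.0                      # or rewards once it is reached

# Expected rewards E[R(s, a^i, s')] and the policy as diagonal matrices Pi_i
R_bar = np.zeros((n, n))
R_bar[:, 0] = 0.5                   # E[R] = 1/2 exactly when s' = 0
Pi1 = np.diag([0.5, 0.7, 0.5, 0.7, 0.5])   # pi(a^1 | s); the s = 4 entry never matters
Pi2 = np.eye(n) - Pi1                      # pi(a^2 | s) = 1 - pi(a^1 | s)

# V^pi = (I - gamma * P^pi)^{-1} r^pi, with the Hadamard product giving r^pi
ones = np.ones(n)
P_pi = Pi1 @ P1 + Pi2 @ P2
r_pi = Pi1 @ ((P1 * R_bar) @ ones) + Pi2 @ ((P2 * R_bar) @ ones)
V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# One extra backup gives the action values Q^pi(s, a^i)
Q1 = (P1 * R_bar) @ ones + gamma * P1 @ V
Q2 = (P2 * R_bar) @ ones + gamma * P2 @ V
print(V)
print(Q1)
print(Q2)
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse of $I-\gamma(\Pi_1P_1+\Pi_2P_2)$ explicitly is the usual, numerically safer choice; both give the same result here.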
If you would like to check whether your computations are correct, try implementing the recursion starting from $V^\pi(0)=(0,0,0,0,0)$ with the update $$V^\pi(k+1,s)\leftarrow\sum_{actions\ a}\pi(a\mid s)\sum_{states\ s'}P(s'\mid s,a)\bigl(\mathbb{E}[R(s,a,s')]+\gamma\cdot V^\pi(k,s')\bigr)$$
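A minimal sketch of that check (it repeats the same encoding as the snippet above so that it runs on its own; a couple of thousand sweeps are more than enough for $\gamma=0.9$):

```python
import numpy as np

n, gamma = 5, 0.9

# Same encoding as before: P[i] holds P(s' | s, a^{i+1}), terminal row zeroed
P = np.zeros((2, n, n))
for s in range(n):
    for step in (1, 2, 4):
        P[0, s, (s + step) % n] = 1 / 3
    for step in (1, 3):
        P[1, s, (s + step) % n] = 1 / 2
P[:, 4, :] = 0.0

R_bar = np.zeros((n, n))
R_bar[:, 0] = 0.5
pi = np.array([[0.5, 0.7, 0.5, 0.7, 0.5],   # pi(a^1 | s)
               [0.5, 0.3, 0.5, 0.3, 0.5]])  # pi(a^2 | s); the s = 4 column is a filler

ones = np.ones(n)
V = np.zeros(n)                              # V^pi(0, .) = (0, 0, 0, 0, 0)
for _ in range(2000):                        # gamma^2000 is negligible
    V = sum(pi[i] * ((P[i] * R_bar) @ ones + gamma * P[i] @ V) for i in range(2))
print(V)                                     # should match the closed-form V above
```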
No guarantee of correctness here. I am almost sure I made some mistake somewhere and could have written parts of it more elegantly.