Dynamic Programming

Policy Evaluation

Policy evaluation is the process of computing the value function of a given policy \(\pi\). This value function is the expected return when starting in a given state and following \(\pi\) thereafter; it is denoted \(V(s)\) and defined as:

\[\begin{aligned} V(s) &= \mathbb{E}[G_t | S_t = s] \\ &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ &= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V(s')] \end{aligned}\]

The value function is computed iteratively, using the Bellman expectation equation as an update rule and sweeping over all states until the value function converges:

\[V(s) \leftarrow \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V(s')]\]
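
As a concrete illustration, here is a minimal sketch of iterative policy evaluation. It assumes a tabular MDP whose dynamics are given as `P[s][a]`, a list of `(prob, next_state, reward)` tuples representing \(p(s', r | s, a)\), and a policy given as an array of action probabilities; these names and the function `policy_evaluation` are illustrative, not fixed by the text.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation for a tabular MDP.

    P[s][a] is a list of (prob, next_state, reward) tuples, i.e. p(s', r | s, a).
    policy[s][a] is the probability pi(a | s).
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman expectation backup for state s
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s_next])
                                   for p, s_next, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # value function has (approximately) converged
            break
    return V
```

This sketch updates the values in place, so later states in a sweep already see the updated values of earlier ones; a synchronous variant that keeps a separate copy of \(V\) also converges.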

Policy Improvement

Given a policy \(\pi\), a new policy \(\pi^{\prime}\) is said to be better than or equal to \(\pi\) if and only if \(V_{\pi^{\prime}}(s) \geq V_{\pi}(s)\) for all \(s \in S\). The policy improvement theorem states that if \(q_{\pi}(s, \pi^{\prime}(s)) \geq V_{\pi}(s)\) for all \(s \in S\), then \(\pi^{\prime}\) is better than or equal to \(\pi\).

The policy improvement process is defined as:

\[\begin{aligned} \pi^{\prime}(s) &= \arg\max_{a} q_{\pi}(s, a) \\ &= \arg\max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma V_{\pi}(s')] \end{aligned}\]

The improvement is justified by the action-value function \(q_{\pi}(s, a)\): starting from \(V_{\pi}(s) \leq q_{\pi}(s, \pi^{\prime}(s))\) and repeatedly expanding \(q_{\pi}\),

\[\begin{aligned} V_{\pi}(s) &\leq q_{\pi}(s, \pi^{\prime}(s)) \\ &= \mathbb{E}[R_{t+1} + \gamma V_{\pi}(S_{t+1}) | S_t = s, A_t = \pi^{\prime}(s)] \\ &= \mathbb{E}_{\pi^{\prime}}[R_{t+1} + \gamma V_{\pi}(S_{t+1}) | S_t = s] \\ &\leq \mathbb{E}_{\pi^{\prime}}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi^{\prime}(S_{t+1})) | S_t = s] \\ &= \mathbb{E}_{\pi^{\prime}}[R_{t+1} + \gamma \mathbb{E}_{\pi^{\prime}}[R_{t+2} + \gamma V_{\pi}(S_{t+2}) | S_{t+1}] | S_t = s] \\ &= \mathbb{E}_{\pi^{\prime}}[R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{\pi}(S_{t+2}) | S_t = s] \\ &\;\;\vdots \\ &\leq \mathbb{E}_{\pi^{\prime}}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots | S_t = s] \\ &= V_{\pi^{\prime}}(s) \end{aligned}\]
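
The greedy improvement step itself is just a one-step lookahead over the current value function. A minimal sketch, using the same hypothetical `P[s][a]` convention as the policy evaluation sketch above:

```python
import numpy as np

def policy_improvement(P, V, gamma=0.9):
    """Greedy policy improvement: pi'(s) = argmax_a q_pi(s, a)."""
    n_states, n_actions = len(P), len(P[0])
    new_policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        # One-step lookahead: q_pi(s, a) for every action under the current V
        q = [sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
             for a in range(n_actions)]
        new_policy[s][int(np.argmax(q))] = 1.0  # deterministic greedy policy
    return new_policy
```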

Policy Iteration

Policy iteration alternates between policy evaluation and policy improvement until the policy converges to an optimal policy. It proceeds as follows, with a code sketch after the list:

  1. Initialize the policy \(\pi\)
  2. Repeat until the policy is stable:
    1. Policy evaluation: apply the Bellman update sweeps until the value function converges
    2. Policy improvement: make the policy greedy with respect to the current value function
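
Under the same assumptions as the sketches above (and reusing the hypothetical `policy_evaluation` and `policy_improvement` functions), policy iteration might look like this:

```python
import numpy as np

def policy_iteration(P, n_actions, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states = len(P)
    # Start from a uniformly random policy
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    while True:
        V = policy_evaluation(P, policy, gamma)       # evaluate the current policy
        new_policy = policy_improvement(P, V, gamma)  # greedy improvement
        if np.array_equal(new_policy, policy):        # stable policy -> optimal
            return policy, V
        policy = new_policy
```

Because a finite MDP has only finitely many deterministic policies and each improvement step yields a strictly better policy unless the current one is already optimal, this loop terminates.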

Policy iteration updates the policy each time the value function converges for the current policy; value iteration, introduced next, updates the policy only once, after the value function itself has converged.

Value Iteration

In policy iteration, each step runs policy evaluation until convergence, then updates the policy and repeats. In contrast, value iteration performs only a single sweep of policy evaluation per step and, for each state, takes the maximum action value as the new state-value estimate. The value iteration process is defined as:

  1. Initialize the value function \(V(s)\)
  2. Repeat
    1. Update the action-value estimate \(q(s, a)\) for every state-action pair
    2. Update the value function \(V(s)\) with the maximum action value over \(a\)
\[\begin{aligned} q(s, a) &\leftarrow \sum_{s', r} p(s', r | s, a) [r + \gamma V(s')] \\ V(s) &\leftarrow \max_{a} q(s, a) \end{aligned}\]

The sweeps continue until the value function converges. The optimal policy is then extracted greedily from the converged values:

\[\pi^{\ast}(s) = \arg\max_{a} q(s, a)\]
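
Under the same assumptions as the earlier sketches, value iteration collapses evaluation and improvement into a single max-backup per state and extracts the policy only at the end:

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Value iteration: one max-backup per state per sweep."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # q(s, a) for each action, then take the max as the new V(s)
            q = [sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                 for a in range(len(P[s]))]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # value function has (approximately) converged
            break
    # Extract the greedy (optimal) policy only once, after V has converged
    policy = []
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
             for a in range(len(P[s]))]
        policy.append(int(np.argmax(q)))
    return V, policy
```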

The difference between policy iteration and value iteration is that policy iteration runs policy evaluation to convergence before every policy update, whereas value iteration folds the greedy improvement into every backup by taking the maximum over actions and extracts an explicit policy only once, after the value function has converged. In this sense, value iteration is the greedier of the two.
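
As a quick sanity check, here is a made-up two-state MDP (purely illustrative) run through both sketches; they should agree on the greedy policy and, up to the convergence tolerance, on the values.

```python
# Toy MDP: in each state, action 0 stays put with reward 0,
# action 1 moves to the other state with reward 1.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 1.0)]},
}

pi_star, V_pi = policy_iteration(P, n_actions=2, gamma=0.9)
V_star, greedy = value_iteration(P, gamma=0.9)

# Both should prefer action 1 everywhere, with V(s) = 1 / (1 - 0.9) = 10.
print(pi_star, V_pi)
print(greedy, V_star)
```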