Chapter 2 REINFORCE
Equation 2.1 misplaces the prime symbol due to a LaTeX formatting error. It was:
Instead, it should have been:
Equation 2.3 contains a typo. Following from Equation 2.2, the max should be applied to both sides of the equation. It was:
Instead, it should have been:
In the derivation chain, Equation 2.9 labels the step used as (chain-rule), but it is in fact (product-rule).
The derivation of Equation 2.21 is missing a step. The text states:
By substituting Equation 2.20 into 2.15 and bringing in the multiplier R(τ), we obtain
Actually, the substitution yields:
There is an additional step which modifies R(τ) to give us Equation 2.21. The form above has high variance because of the many possible actions over a trajectory. One way to reduce the variance is to account for causality by considering only the future rewards at any given time step t. This makes sense because an event occurring at time step t can affect only the future, not the past. To do so, we modify R(τ) as follows:
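For reference, the two forms involved can be written out as below. This is a reconstruction in common policy-gradient notation rather than a copy of the book's typesetting. The symbols γ (discount factor), r_t (reward at step t), T (final time step) and the expectation over τ ∼ π_θ are assumptions here. The first line is the result of the direct substitution, where every time step is weighted by the full return R(τ). The second line replaces that weight with the return counted from step t onward.

```latex
% Direct substitution of Equation 2.20 into 2.15 (assumed notation):
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} R(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]

% Causality modification: keep only rewards from time step t onward.
R(\tau) = R_0(\tau) = \sum_{t'=0}^{T} \gamma^{t'} r_{t'}
  \;\longrightarrow\;
  \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} = R_t(\tau)
```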
With this, we obtain equation 2.21.
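The difference between weighting every step by the full return and weighting each step by its own future return can be seen in a small numerical sketch. The function names `full_return` and `rewards_to_go` are hypothetical, chosen for illustration only, and the discounting convention (γ raised to the number of steps elapsed since t) is an assumption about the intended definition of the per-step return.

```python
from typing import List


def full_return(rewards: List[float], gamma: float) -> float:
    """Discounted return of the whole trajectory, counted from t = 0."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def rewards_to_go(rewards: List[float], gamma: float) -> List[float]:
    """Discounted return from each time step t onward.

    Entry t depends only on rewards at steps t, t+1, ..., which is the
    causality argument: an action at step t cannot affect earlier rewards.
    """
    returns = [0.0] * len(rewards)
    future = 0.0
    # Walk backward so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    gamma = 0.5
    # Without the modification, every step would be weighted by this value.
    print(full_return(rewards, gamma))
    # With the modification, step t is weighted by entry t of this list.
    print(rewards_to_go(rewards, gamma))
```

In this sketch the first entry of `rewards_to_go` equals `full_return`, and later entries drop the rewards that were collected before their time step.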