Foundations of Deep Reinforcement Learning
Errata

Chapter 2 REINFORCE

Page 26, Section 2.2 The Objective Function, Equation 2.1

Equation 2.1 misplaces the prime symbol due to a LaTeX formatting error. It was

R_t(\tau) = \sum_{t'=t}^T \gamma^{t'-t} r_t' \qquad (\text{misplaced prime on } r')

Instead it should have been

R_t(\tau) = \sum_{t'=t}^T \gamma^{t'-t} r_{t'}
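
For concreteness, here is a minimal Python sketch (not from the book) of the corrected Equation 2.1; the function name and the reward values are made up for illustration.

import numpy as np

# Hedged sketch of Equation 2.1: R_t(tau) = sum_{t'=t}^T gamma^(t'-t) * r_{t'}.
def reward_to_go(rewards, gamma, t):
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards) - 1                              # final time step
    discounts = gamma ** (np.arange(t, T + 1) - t)    # gamma^(t'-t) for t' = t..T
    return float(np.sum(discounts * rewards[t:]))

rewards = [1.0, 0.0, 2.0, 0.0, 1.0]                   # made-up r_0..r_T
print([reward_to_go(rewards, gamma=0.99, t=t) for t in range(len(rewards))])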

Page 27, Section 2.3 The Policy Gradient, Equation 2.3

Equation 2.3 contains a typo. Following from Equation 2.2, the max operator should appear on both sides of the equation. It was

\max_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \qquad (\text{missing } \max \text{ on the right})

Instead it should have been

\max_\theta J(\pi_\theta) = \max_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
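
As an illustration of what the objective means, here is a minimal Python sketch (not from the book): J(π_θ) = E_{τ∼π_θ}[R(τ)] can be estimated by averaging the return R(τ) over trajectories sampled from the current policy. The reward sequences below are made up.

import numpy as np

# Hedged sketch: Monte Carlo estimate of J(pi_theta) = E_{tau ~ pi_theta}[R(tau)].
def discounted_return(rewards, gamma):
    # R(tau) = sum_{t=0}^T gamma^t * r_t
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Each inner list stands in for the rewards of one trajectory sampled from pi_theta.
sampled_rewards = [[1.0, 0.0, 1.0], [0.0, 2.0], [1.0, 1.0, 1.0, 0.0]]
j_estimate = np.mean([discounted_return(rs, gamma=0.99) for rs in sampled_rewards])
print(j_estimate)   # maximizing this over theta is the goal stated in Equation 2.3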

Page 28, Section 2.3.1 Policy Gradient Derivation, Equation 2.9

In the chain of derivation, Equation 2.9 labels the step used as (chain-rule), but it is in fact (product-rule).
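
Equation 2.9 itself is not reproduced in this erratum. For reference, and assuming the standard score-function derivation (so this is a sketch of the step, not a quote of the book), the product rule in question distributes the gradient over the product R(τ) p(τ∣θ):

\nabla_\theta \big( R(\tau)\, p(\tau \mid \theta) \big) = R(\tau)\, \nabla_\theta p(\tau \mid \theta) + p(\tau \mid \theta)\, \nabla_\theta R(\tau)

Since R(τ) does not depend on θ directly, the second term vanishes, leaving R(τ) ∇_θ p(τ∣θ).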

Page 30, Section 2.3.1 Policy Gradient Derivation, Equation 2.21

Equation 2.21 misses a step in the derivation. The book reads:

By substituting Equation 2.20 into 2.15 and bringing in the multiplier R(τ), we obtain

\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \Big[ \sum_{t=0}^T R_t(\tau) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big] \qquad (2.21)

Actually, the substitution yields:

\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \Big[ \sum_{t=0}^T R(\tau) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big] \qquad (\text{missing step 1})

There is an additional step which modifies R(τ) to give us Equation 2.21. The form above has high variance due to the many possible actions over a trajectory. One way to reduce the variance is to account for causality by considering only the future rewards for any given time step t. This makes sense since an event occurring at time step t can only affect the future, not the past. To do so, we modify R(τ) as follows:

R(\tau) = R_0(\tau) = \sum_{t'=0}^T \gamma^{t'} r_{t'} \quad \rightarrow \quad \sum_{t'=t}^T \gamma^{t'-t} r_{t'} = R_t(\tau) \qquad (\text{missing step 2})

With this, we obtain Equation 2.21.
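
To make the corrected estimator concrete, here is a minimal PyTorch sketch (not from the book or its SLM Lab code) of the Equation 2.21 estimate for a single sampled trajectory; the log-probabilities and rewards below are made-up stand-ins, and in a real agent log_probs would come from the policy network.

import torch

# Hedged sketch of Equation 2.21 for one trajectory:
# grad J(pi_theta) ~= sum_t R_t(tau) * grad log pi_theta(a_t | s_t).
log_probs = torch.randn(5, requires_grad=True)   # stand-in for log pi_theta(a_t | s_t)
rewards = [1.0, 0.0, 2.0, 0.0, 1.0]              # made-up rewards r_t
gamma = 0.99

# Reward-to-go R_t(tau) = sum_{t'=t}^T gamma^(t'-t) r_{t'}, i.e. the "missing step 2" form.
rets, future = [], 0.0
for r in reversed(rewards):
    future = r + gamma * future
    rets.append(future)
returns = torch.tensor(list(reversed(rets)))

# Minimizing the negative weighted log-likelihood ascends the policy gradient.
loss = -(returns * log_probs).sum()
loss.backward()
print(log_probs.grad)   # equals -returns: the per-step weights R_t(tau), negated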
