Foundations of Deep Reinforcement Learning
Chapter 7 PPO


Page 176, Section 7.2 Proximal Policy Optimization (PPO)

Thanks to Jérémie Clair Coté for suggesting we clarify this and for the discussion, and to HyeAnn Lee for the correction.

Page 176, the last sentence of the 1st paragraph and the first two sentences of the 2nd paragraph read:

To see why this is the case, consider when $r_t(\theta)A_t$ would assume large positive values, which is either $A_t > 0, r_t(\theta) > 0$, or $A_t < 0, r_t(\theta) < 0$.

When $A_t > 0, r_t(\theta) > 0$, if $r_t(\theta)$ becomes much larger than 1, the upper clip term $1 - \epsilon$ applies to upper-bound $r_t(\theta) \le 1 + \epsilon$, hence $J^{CLIP} \le (1 + \epsilon)A_t$. On the other hand, when $A_t < 0, r_t(\theta) < 0$, if $r_t(\theta)$ becomes much smaller than 1, the lower clip term $1 - \epsilon$ applies to again upper-bound $J^{CLIP} \le (1 - \epsilon)A_t$.

This is confusing because (1) $r_t(\theta)$ cannot be $< 0$ because it is a ratio of two probabilities, and (2) there is a typo when referring to the upper clip term. The sentences should be replaced with:

To see why this is the case, let's consider when $A_t$ is either $> 0$ or $< 0$. Note that $r_t(\theta)$ is always $\ge 0$ because it is a ratio of two probabilities.

When $A_t > 0$, if $r_t(\theta) > 1 + \epsilon$, the upper clip term $1 + \epsilon$ applies to upper-bound $r_t(\theta) \le 1 + \epsilon$, hence $J^{CLIP} \le (1 + \epsilon)A_t$. On the other hand, when $A_t < 0$, if $r_t(\theta) < 1 - \epsilon$, the lower clip term $1 - \epsilon$ applies to again upper-bound $J^{CLIP} \le (1 - \epsilon)A_t$.
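
As a concrete illustration of these bounds, here is a minimal PyTorch sketch of the clipped surrogate objective $J^{CLIP}$. The tensor names (log_probs, old_log_probs, advantages) and the value epsilon=0.2 are illustrative assumptions, not necessarily the book's companion code.

```python
# Minimal sketch of the clipped surrogate objective J^CLIP discussed above.
# The tensor names and epsilon value are illustrative assumptions.
import torch

def clipped_surrogate_objective(log_probs, old_log_probs, advantages, epsilon=0.2):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t); always >= 0
    ratios = torch.exp(log_probs - old_log_probs)
    # Unclipped term r_t(theta) * A_t
    surrogate = ratios * advantages
    # Clipped term: r_t(theta) restricted to [1 - epsilon, 1 + epsilon]
    clipped = torch.clamp(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # J^CLIP takes the elementwise minimum, so it is upper-bounded by
    # (1 + epsilon) * A_t when A_t > 0 and by (1 - epsilon) * A_t when A_t < 0.
    return torch.min(surrogate, clipped).mean()
```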

Page 178, Section 7.3 PPO Algorithm, Algorithm 7.2

Thanks to Jérémie Clair Coté for this correction.

Algorithm 7.2 PPO with clipping, line 35:

$$\theta_C = \theta_C + \alpha_C \nabla_{\theta_C} L_{val}(\theta_C)$$

contains a typo. The second term on the right-hand side of the equation should be subtracted, not added, since the loss is being minimized. It should read:

$$\theta_C = \theta_C - \alpha_C \nabla_{\theta_C} L_{val}(\theta_C)$$

Note that the actor parameter update on line 33 of Algorithm 7.2 is correct because the policy "loss" for PPO is formulated as an objective to be maximized (see Equation 7.39 on page 177).
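
For completeness, here is a minimal PyTorch sketch of the corrected critic update: minimizing $L_{val}$ with a standard optimizer performs exactly the subtraction on line 35, whereas the actor objective (Equation 7.39) is maximized, which is why line 33 keeps the plus sign. The stand-in critic network, batch tensors, and learning rate below are illustrative assumptions, not the book's code.

```python
# Illustrative stand-ins (not from the book): a tiny value network, a dummy
# batch of states, and dummy value targets.
import torch
import torch.nn as nn
import torch.nn.functional as F

critic = nn.Linear(4, 1)                                      # V(s; theta_C)
critic_optim = torch.optim.SGD(critic.parameters(), lr=0.01)  # alpha_C

states = torch.randn(8, 4)     # batch of states
v_targets = torch.randn(8, 1)  # value targets

# L_val(theta_C): squared error between V(s) and the value targets
val_loss = F.mse_loss(critic(states), v_targets)

critic_optim.zero_grad()
val_loss.backward()
critic_optim.step()  # theta_C <- theta_C - alpha_C * grad L_val (gradient descent)
```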