Page 60, Section 3.2.1 Intuition for Temporal Difference Learning
Page 60, the first square bullet:
(s0,aUP): The agent moves out of the corridor, receives a reward of 0, and the episode terminates, so Q∗(s0,aUP)=0.
contains a typo in the index of the state. It should read:
(s1,aUP): The agent moves out of the corridor, receives a reward of 0, and the episode terminates, so Q∗(s1,aUP)=0.
Page 62, Section 3.2.1 Intuition for Temporal Difference Learning
Page 62, Figure 3.3, Episode 4, Time step 5, the target value calculation:
0+0.9+0.9=0.81
contains a typo: the second + should have been ×. It should read:
0+0.9×0.9=0.81
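The corrected arithmetic can be checked with a minimal sketch, assuming the expression is a TD target of the form reward + discount × next Q-value. The function name `td_target` and the assignment of the two 0.9 factors to the discount and the next Q-value are illustrative assumptions, not taken from the book.

```python
import math


def td_target(reward: float, gamma: float, next_q: float) -> float:
    """TD target: immediate reward plus the discounted next Q-value."""
    return reward + gamma * next_q


# Assumed reading of the figure entry: reward 0, discount 0.9, next Q-value 0.9.
target = td_target(reward=0.0, gamma=0.9, next_q=0.9)

# The misprinted expression sums the terms instead of multiplying.
misprint = 0.0 + 0.9 + 0.9

print(round(target, 2))    # 0.81
print(round(misprint, 2))  # 1.8
```

Only the multiplicative form agrees with the stated result of 0.81; the additive form as printed would give 1.8.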
Page 67, Section 3.4 SARSA Algorithm, Algorithm 3.1
Algorithm 3.1 SARSA, line 13:
Contains a typo; the loss function J should have been L: