Env Spec: A2C on Pong
In this tutorial we look at how to use an env spec to specify an environment used to train an agent. We will train an A2C agent on the Atari Pong environment.
The environment is specified using the env key in a spec file with the following format:
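(The sketch below lists only the keys discussed in this tutorial; the actual spec format may include additional fields.)

```json
"env": [{
  "name": str,
  "frame_op": str,
  "frame_op_len": int,
  "reward_scale": str,
  "num_envs": int,
  "max_frame": int
}]
```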
The env spec is a list to accommodate the multi-environment setting planned for a future version of SLM Lab.
As an example, let's look at the env spec for A2C on Pong from slm_lab/spec/benchmark/a2c/a2c_gae_pong.json.
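The env portion of that spec, reconstructed here from the values discussed below (the string values "concat" and "sign" are inferred from the descriptions and may differ slightly from the file; check the spec file in the repo for the exact contents):

```json
"env": [{
  "name": "PongNoFrameskip-v4",
  "frame_op": "concat",
  "frame_op_len": 4,
  "reward_scale": "sign",
  "num_envs": 16,
  "max_frame": 1e7
}]
```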
Here, we are using the "PongNoFrameskip-v4" environment from gym. This allows us to specify our own state-preprocessing method. Each frame is an automatically-processed greyscale image of shape (1, 84, 84). To restore the temporal aspect of the frames, it is useful to concatenate 4 successive frames together as specified by "frame_op" and "frame_op_len", so the shape of the input state to A2C becomes (4, 84, 84).
For Atari environments, we also apply frame-skipping internally, before any state-preprocessing, using the env.wrapper.MaxAndSkipEnv wrapper. Effectively, we are concatenating 4 frames selected from a span of 16 raw frames.
To standardize the range of rewards in Atari environments, we also preprocess the rewards to -1, 0, or +1 by taking the sign of the raw reward, as specified in "reward_scale".
In order to speed up training, we asynchronously parallelize the stepping of the environment by using a vector of 16 sub-environments, as specified in "num_envs". This means that at every environment step, A2C gets a batch of 16 states from 16 instances of the same environment, each with a different random seed.
Finally, we train A2C for a total of 10 million frames, as specified in "max_frame". Note that when parallelizing with a vector environment, the vector environment actually steps (10 million / 16) = 625,000 times. This is so that the total frame count sums up to "max_frame" (up to a small remainder) regardless of how many vector environments are used.
Let's run a Trial using the spec file above. First, run it in dev mode to see the renderings of the 16 vector environments.
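Assuming the usual run_lab.py invocation of spec file, spec name, and lab mode, and that the spec name inside the file is a2c_gae_pong, the dev-mode command would be:

```bash
python run_lab.py slm_lab/spec/benchmark/a2c/a2c_gae_pong.json a2c_gae_pong dev
```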
We can see these environments are in fact different in their progression, so our agent will obtain a diverse set of experience from them.
Next, terminate the run (Ctrl+C) and rerun it in train mode for the full 10 million frames.
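Again assuming the spec name a2c_gae_pong, the same command with the lab mode switched to train would be:

```bash
python run_lab.py slm_lab/spec/benchmark/a2c/a2c_gae_pong.json a2c_gae_pong train
```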
The maximum score for Pong is 21; our learning agent will converge to an average score close to that. When the trial completes, we will see graphs similar to the ones below, generated and saved to the data/a2c_gae_pong_{ts} folder.
The trial graph is an average of the 4 session graphs, each of which plots the episodic rewards, averaged over all the vector environments, once every 10,000 steps.
We can also smooth the trial graph by plotting its moving average over a window of 100 to obtain the graph below.
This trial will complete within a day if no GPU is used, which is still fairly quick in RL terms. In the next section, we will see how SLM Lab can easily enable GPU usage to speed up training on image-based environments like the Atari games.