Memory
Overview
Memory classes handle experience storage and retrieval for RL training. Every time an agent acts, it stores a transition (state, action, reward, next_state, done) in memory. When it's time to train, the algorithm samples from this memory.
Code: slm_lab/agent/memory
Memory Types
On-Policy vs Off-Policy
The key distinction in RL memory is whether the algorithm can learn from past experience:
On-policy (PPO, A2C, REINFORCE):
- Can only learn from data collected by the current policy
- Must discard data after each training update
- Requires fresh data for each update
- Uses OnPolicyReplay or OnPolicyBatchReplay

Off-policy (DQN, SAC):
- Can learn from data collected by any policy
- Stores data in a replay buffer for reuse
- More sample-efficient (reuses data multiple times)
- Uses Replay or PrioritizedReplay
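The difference in data handling can be sketched in a few lines. The classes below are illustrative toy buffers, not the actual SLM Lab implementations:

```python
import random

class OffPolicyBufferSketch:
    """Toy replay buffer: keeps a bounded window of transitions and reuses them."""
    def __init__(self, max_size=10000):
        self.max_size = max_size
        self.data = []

    def add(self, transition):
        self.data.append(transition)
        if len(self.data) > self.max_size:
            self.data.pop(0)  # evict the oldest transition

    def sample(self, batch_size=32):
        # old data stays in the buffer and can be sampled again later
        return random.sample(self.data, min(batch_size, len(self.data)))

class OnPolicyBufferSketch:
    """Toy on-policy store: data is consumed once, then discarded."""
    def __init__(self):
        self.data = []

    def add(self, transition):
        self.data.append(transition)

    def sample(self):
        batch = list(self.data)
        self.data.clear()  # current-policy data cannot be reused after the update
        return batch
```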
Memory Interface
All memory classes implement this interface:
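A minimal sketch of the shape of that interface is shown below. Method names follow the description above; constructor arguments and exact signatures are approximations, so check the base class in slm_lab/agent/memory for the authoritative version:

```python
class Memory:
    """Approximate sketch of the common memory interface."""

    def __init__(self, memory_spec, body):
        self.memory_spec = memory_spec  # the "memory" block of the agent spec
        self.body = body                # the agent body this memory serves

    def reset(self):
        """Clear stored experience, e.g. at episode or session boundaries."""
        raise NotImplementedError

    def update(self, state, action, reward, next_state, done):
        """Store one transition after each environment step."""
        raise NotImplementedError

    def sample(self):
        """Return a batch of experience for training."""
        raise NotImplementedError
```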
Training Signal
Memory controls when training happens via the to_train flag:
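A simplified sketch of that handshake is shown below; attribute placement and names (e.g. train_step) are illustrative rather than the exact SLM Lab internals:

```python
class MemorySketch:
    """Toy memory that raises the to_train flag every training_frequency steps."""
    def __init__(self, training_frequency=32):
        self.training_frequency = training_frequency
        self.to_train = 0
        self.data = []

    def update(self, state, action, reward, next_state, done):
        self.data.append((state, action, reward, next_state, done))
        if len(self.data) % self.training_frequency == 0:
            self.to_train = 1  # signal that a training step is due

    def sample(self):
        return self.data[-self.training_frequency:]  # most recent batch

def maybe_train(algorithm, memory):
    """Called once per environment step by the control loop."""
    if memory.to_train:
        batch = memory.sample()
        algorithm.train_step(batch)  # hypothetical training call
        memory.to_train = 0          # clear the flag until the next signal
```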
Different memory types set to_train = True based on different conditions:
| Memory class | When to_train is set |
| --- | --- |
| Replay | Every training_frequency steps |
| PrioritizedReplay | Every training_frequency steps |
| OnPolicyReplay | End of episode |
| OnPolicyBatchReplay | Every training_frequency steps |
Memory Spec
Configure memory in the agent spec:
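A representative fragment for an off-policy Replay memory is sketched below; the available keys depend on the memory class, and the values shown are placeholders rather than recommended settings:

```json
"memory": {
  "name": "Replay",
  "batch_size": 32,
  "max_size": 10000,
  "use_cer": false
}
```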
Choosing the Right Memory
| Algorithm | Memory | Notes |
| --- | --- | --- |
| PPO | OnPolicyBatchReplay | Collects fixed-size batches, trains multiple epochs |
| A2C | OnPolicyBatchReplay | Similar to PPO but a single epoch |
| REINFORCE | OnPolicyReplay | Trains on complete episodes |
| DQN | Replay | Standard experience replay |
| DDQN+PER | PrioritizedReplay | Prioritizes high-error transitions |
| SAC | Replay | Off-policy, high replay ratio |
Advanced: Combined Experience Replay (CER)
CER (de Bruin et al., 2018) always includes the most recent transition in each batch:
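The sampling rule amounts to a one-line change. The helper below is a hypothetical sketch of the idea, not the SLM Lab code path (in SLM Lab this behavior is typically toggled through the memory spec, e.g. a use_cer key):

```python
import random

def sample_with_cer(buffer, batch_size):
    """Hypothetical sketch: uniform sample plus the most recent transition."""
    batch = random.sample(buffer, min(batch_size - 1, len(buffer)))
    batch.append(buffer[-1])  # always include the newest transition
    return batch
```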
This helps in environments where recent experience is particularly valuable.
Data Format
Memory stores transitions in numpy arrays for efficiency. When training, data is converted to PyTorch tensors:
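A sketch of what that conversion looks like for one sampled batch (the field names and shapes here are illustrative):

```python
import numpy as np
import torch

# a sampled batch as numpy arrays (batch of 32, 4-dimensional states)
batch = {
    'states': np.zeros((32, 4), dtype=np.float32),
    'actions': np.zeros((32,), dtype=np.int64),
    'rewards': np.zeros((32,), dtype=np.float32),
    'next_states': np.zeros((32, 4), dtype=np.float32),
    'dones': np.zeros((32,), dtype=np.float32),
}

# convert every field to a PyTorch tensor for the training step
tensor_batch = {k: torch.from_numpy(v) for k, v in batch.items()}
```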
The terminated vs truncated distinction (new in v5/Gymnasium) enables correct value bootstrapping; see the Changelog for details.
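As a quick sketch of why the distinction matters for the one-step bootstrapped target (a generic TD target, not a specific SLM Lab function):

```python
gamma = 0.99
reward, next_value = 1.0, 5.0

# terminated: the MDP truly ended, so there is no future value to bootstrap from
target_terminated = reward + gamma * 0.0        # = 1.0

# truncated: the episode was cut short (e.g. a time limit); the next state
# still has value, so the target should still bootstrap from it
target_truncated = reward + gamma * next_value  # = 5.95
```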