💾 Memory

Overview

Memory classes handle experience storage and retrieval for RL training. Every time an agent acts, it stores a transition (state, action, reward, next_state, done) in memory. When it's time to train, the algorithm samples from this memory.

Code: slm_lab/agent/memory

Memory Types

| Memory Type | Algorithm Type | Behavior | Use Case |
| --- | --- | --- | --- |
| Replay | Off-policy (DQN, SAC) | Ring buffer, random sampling | Standard experience replay |
| PrioritizedReplay | Off-policy (DDQN+PER) | Priority-based sampling | Learn more from surprising transitions |
| OnPolicyReplay | On-policy (REINFORCE) | Flush after each episode | Episode-based training |
| OnPolicyBatchReplay | On-policy (PPO, A2C) | Flush after N steps | Batch-based training |

On-Policy vs Off-Policy

The key distinction in RL memory is whether the algorithm can learn from past experience:

On-policy (PPO, A2C, REINFORCE):

  • Can only learn from data collected by the current policy

  • Must discard data after each training update

  • Requires fresh data for each update

  • Uses OnPolicyReplay or OnPolicyBatchReplay

Off-policy (DQN, SAC):

  • Can learn from data collected by any policy

  • Stores data in a replay buffer for reuse

  • More sample-efficient (reuses data multiple times)

  • Uses Replay or PrioritizedReplay (see the sketch below)
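
The practical difference can be sketched as follows. The names here are illustrative assumptions, not the library's exact API: train_step is a stand-in for whatever update the algorithm runs, and whether the on-policy flush happens inside sample() or via a separate reset() call is an implementation detail.

```python
def off_policy_update(replay_memory, algorithm):
    """Off-policy: the buffer persists, so old transitions are resampled across many updates."""
    batch = replay_memory.sample()      # random minibatch drawn from the whole buffer
    return algorithm.train_step(batch)  # data stays in the buffer for future updates

def on_policy_update(onpolicy_memory, algorithm):
    """On-policy: use everything gathered since the last update once, then discard it."""
    batch = onpolicy_memory.sample()    # all transitions collected by the current policy
    loss = algorithm.train_step(batch)
    onpolicy_memory.reset()             # flush; the next update needs fresh rollouts
    return loss
```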

Memory Interface

All memory classes implement this interface:
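
The sketch below captures the general shape, assuming the reset / update / sample methods referred to throughout this page; the exact signatures live in the code under slm_lab/agent/memory and may differ.

```python
from abc import ABC, abstractmethod

class Memory(ABC):
    """Minimal sketch of the shared memory interface (names assumed from this page)."""

    def __init__(self, memory_spec, body):
        self.memory_spec = memory_spec  # the "memory" block of the agent spec
        self.body = body                # bookkeeping link back to the agent/env
        self.to_train = 0               # training signal, see "Training Signal" below

    @abstractmethod
    def reset(self):
        """Clear stored transitions (at init and, for on-policy memories, after training)."""

    @abstractmethod
    def update(self, state, action, reward, next_state, done):
        """Store one transition; may also set the training signal when a condition is met."""

    @abstractmethod
    def sample(self):
        """Return a batch of stored transitions for training."""
```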

Training Signal

Memory controls when training happens via the to_train flag:
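
A hypothetical view of how a training loop consumes the flag. Where the flag actually lives (on the memory, the body, or the algorithm) varies by implementation; this sketch assumes memory.to_train and an algorithm method named train_step.

```python
def maybe_train(memory, algorithm):
    """Check the memory's training signal, train once, then clear it (illustrative wiring)."""
    if memory.to_train:
        batch = memory.sample()             # batch of stored transitions
        loss = algorithm.train_step(batch)  # algorithm-specific update (method name assumed)
        memory.to_train = 0                 # reset the signal until the next trigger
        return loss
    return None
```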

Different memory types set to_train = True based on different conditions:

| Memory Type | Training Trigger |
| --- | --- |
| Replay | Every training_frequency steps |
| PrioritizedReplay | Every training_frequency steps |
| OnPolicyReplay | End of episode |
| OnPolicyBatchReplay | Every training_frequency steps |
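
Expressed as predicates, the triggers above look roughly like this; function and parameter names are illustrative, not the library's exact code.

```python
def replay_should_train(step, training_frequency):
    """Replay / PrioritizedReplay: train every `training_frequency` environment steps."""
    return step % training_frequency == 0

def onpolicy_replay_should_train(done):
    """OnPolicyReplay: train once the current episode has ended."""
    return bool(done)

def onpolicy_batch_should_train(num_collected, training_frequency):
    """OnPolicyBatchReplay: train once `training_frequency` new transitions are collected."""
    return num_collected >= training_frequency
```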

Memory Spec

Configure memory in the agent spec:
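
A hedged example of what the memory block might look like for a DQN-style Replay memory. It is shown as a Python dict; the actual spec files are JSON, and key names such as batch_size, max_size, and use_cer follow common SLM Lab specs but should be checked against the spec files shipped with the library.

```python
# Excerpt of an agent spec (illustrative; the real spec files are JSON)
agent_spec = {
    "memory": {
        "name": "Replay",    # or PrioritizedReplay / OnPolicyReplay / OnPolicyBatchReplay
        "batch_size": 32,    # minibatch size returned by sample()
        "max_size": 10000,   # ring-buffer capacity (off-policy memories only)
        "use_cer": False,    # Combined Experience Replay, see below
    }
}
```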

Choosing the Right Memory

| Algorithm | Memory | Why |
| --- | --- | --- |
| PPO | OnPolicyBatchReplay | Collects fixed-size batches, trains multiple epochs |
| A2C | OnPolicyBatchReplay | Similar to PPO but single epoch |
| REINFORCE | OnPolicyReplay | Trains on complete episodes |
| DQN | Replay | Standard experience replay |
| DDQN+PER | PrioritizedReplay | Prioritize high-error transitions |
| SAC | Replay | Off-policy, high replay ratio |

Advanced: Combined Experience Replay (CER)

CER (de Bruin et al., 2018) always includes the most recent transition in each batch:
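
A minimal sketch of the sampling tweak, using a hypothetical helper: random indices are drawn as usual, and the index of the newest transition simply replaces one of them.

```python
import numpy as np

def sample_idxs_with_cer(buffer_size, batch_size, latest_idx, rng=np.random):
    """Hypothetical CER sampling: uniform random indices, with the newest transition forced in."""
    idxs = rng.randint(0, buffer_size, size=batch_size)  # standard uniform sample
    idxs[-1] = latest_idx                                # always include the most recent transition
    return idxs
```

With use_cer enabled in the memory spec, the effective change is only this one index swap, so the overhead is negligible.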

This helps with environments where recent experience is particularly valuable.

Data Format

Memory stores transitions in numpy arrays for efficiency. When training, data is converted to PyTorch tensors:
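
A hedged sketch of that round trip; the field names and shapes below are illustrative, not the library's exact batch layout.

```python
import numpy as np
import torch

# Illustrative sampled batch: numpy arrays keyed by field name
batch = {
    'states': np.zeros((32, 4), dtype=np.float32),
    'actions': np.zeros((32,), dtype=np.int64),
    'rewards': np.zeros((32,), dtype=np.float32),
    'next_states': np.zeros((32, 4), dtype=np.float32),
    'dones': np.zeros((32,), dtype=np.float32),
}

# Convert to PyTorch tensors just before computing the loss
tensor_batch = {k: torch.from_numpy(v) for k, v in batch.items()}
```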

The terminated vs truncated distinction (new in v5/Gymnasium) enables correct value bootstrapping; see the Changelog for details.
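
A small illustration of why the distinction matters (variable names are illustrative): bootstrapping is cut off only on true termination, while a time-limit truncation still bootstraps from the next state's value.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """1-step TD target: bootstrap from next_value unless the episode truly terminated.
    A truncated (time-limit) step still bootstraps, because the final state was not terminal."""
    return reward + gamma * next_value * (1.0 - float(terminated))
```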
