💾 Memory

Overview

Memory classes handle experience storage and retrieval for RL training. Every time an agent acts, it stores a transition (state, action, reward, next_state, done) in memory. When it's time to train, the algorithm samples from this memory.

Code: slm_lab/agent/memory

Memory Types

| Memory Type | Algorithm Type | Behavior | Use Case |
|---|---|---|---|
| Replay | Off-policy (DQN, SAC) | Ring buffer, random sampling | Standard experience replay |
| PrioritizedReplay | Off-policy (DDQN+PER) | Priority-based sampling | Learn more from surprising transitions |
| OnPolicyReplay | On-policy (REINFORCE) | Flush after each episode | Episode-based training |
| OnPolicyBatchReplay | On-policy (PPO, A2C) | Flush after N steps | Batch-based training |
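
As a rough illustration of the ring-buffer behavior listed for Replay, the sketch below overwrites the oldest transition once capacity is reached and samples uniformly at random. The class name `RingBufferSketch` and its `add` and `sample` methods are hypothetical and are not taken from slm_lab/agent/memory.

```python
import random


class RingBufferSketch:
    """Fixed-capacity store that overwrites the oldest transition when full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.data = [None] * max_size
        self.head = -1  # index of the most recently written slot
        self.size = 0   # number of valid transitions currently stored

    def add(self, transition):
        # Advance the write position, wrapping around at capacity.
        self.head = (self.head + 1) % self.max_size
        self.data[self.head] = transition
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size, rng=random):
        # Uniform sampling with replacement over the valid slots.
        idxs = [rng.randrange(self.size) for _ in range(batch_size)]
        return [self.data[i] for i in idxs]
```

Because old entries are overwritten in place, memory use stays constant no matter how long training runs.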

On-Policy vs Off-Policy

The key distinction in RL memory is whether the algorithm can learn from experience collected by earlier versions of its policy:

On-policy (PPO, A2C, REINFORCE):

  • Can only learn from data collected by the current policy

  • Must discard data after each training update

  • Requires fresh data for each update

  • Uses OnPolicyReplay or OnPolicyBatchReplay

Off-policy (DQN, SAC):

  • Can learn from data collected by any policy

  • Stores data in a replay buffer for reuse

  • More sample-efficient (reuses data multiple times)

  • Uses Replay or PrioritizedReplay
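
The difference in data reuse can be sketched as follows. This is a minimal illustration, not the library's code: `FlushingMemorySketch` is a hypothetical name, and it shows only the on-policy behavior of handing over all collected transitions once and then discarding them.

```python
class FlushingMemorySketch:
    """On-policy style storage: each transition feeds one update, then is dropped."""

    def __init__(self):
        self.transitions = []

    def update(self, transition):
        self.transitions.append(transition)

    def sample(self):
        # Return everything collected so far and start over with an empty store,
        # so the next update only sees data gathered by the updated policy.
        batch, self.transitions = self.transitions, []
        return batch
```

An off-policy buffer would instead keep its contents after sampling, so later updates can draw the same transitions again.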

Memory Interface

All memory classes share a common interface for storing and sampling transitions.
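
One way such a shared interface could look is sketched below. The class name `MemorySketch` and the method names `reset`, `update`, and `sample` are assumptions made for illustration; the code under slm_lab/agent/memory is the authoritative reference for the actual signatures.

```python
from abc import ABC, abstractmethod


class MemorySketch(ABC):
    """Hypothetical base class; method names are assumptions of this sketch."""

    def __init__(self):
        self.to_train = False  # raised when enough data is ready for an update
        self.reset()

    @abstractmethod
    def reset(self):
        """Clear all stored transitions."""

    @abstractmethod
    def update(self, state, action, reward, next_state, done):
        """Store one transition."""

    @abstractmethod
    def sample(self):
        """Return a batch of stored transitions for training."""


class ListMemory(MemorySketch):
    """Smallest possible concrete memory, backed by a Python list."""

    def reset(self):
        self.transitions = []

    def update(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self):
        return list(self.transitions)
```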

Training Signal

Memory controls when training happens via the to_train flag. The algorithm runs an update only when this flag is set.
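
A training step that respects this flag might look like the sketch below. It assumes a hypothetical memory object exposing a `to_train` attribute and a `sample()` method; the helper name `train_if_ready`, and the choice to reset the flag here rather than inside the memory, are illustrative assumptions.

```python
def train_if_ready(memory, train_fn):
    """Run one update only if the memory has raised its to_train flag."""
    if not memory.to_train:
        return None  # not enough fresh data yet; keep collecting
    batch = memory.sample()
    loss = train_fn(batch)
    memory.to_train = False  # wait for the memory to raise the flag again
    return loss
```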

Different memory types set to_train = True based on different conditions:

| Memory Type | Training Trigger |
|---|---|
| Replay | Every training_frequency steps |
| PrioritizedReplay | Every training_frequency steps |
| OnPolicyReplay | End of episode |
| OnPolicyBatchReplay | Every training_frequency steps |
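
The two kinds of trigger in the table can be sketched with a single toy class. `TriggerSketch` and its `on_step` method are hypothetical names; passing `training_frequency=None` to select the end-of-episode trigger is a convention of this sketch only.

```python
class TriggerSketch:
    """Raises to_train either every N steps or at the end of an episode."""

    def __init__(self, training_frequency=None):
        # None selects the end-of-episode trigger; an integer selects the step trigger.
        self.training_frequency = training_frequency
        self.steps = 0
        self.to_train = False

    def on_step(self, done):
        self.steps += 1
        if self.training_frequency is None:
            if done:
                self.to_train = True
        elif self.steps % self.training_frequency == 0:
            self.to_train = True
```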

Memory Spec

Memory is configured in the agent spec.
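
A memory entry in a spec might look roughly like the following, written here as a Python dict for illustration. The key names (`name`, `batch_size`, `max_size`, `use_cer`) and all values are assumptions of this sketch; the spec files in the repository are the authoritative reference for the actual format and supported keys.

```python
# Hypothetical memory section of an agent spec (keys and values are illustrative).
memory_spec = {
    "name": "Replay",    # which memory class to use
    "batch_size": 32,    # transitions per sampled batch
    "max_size": 10000,   # ring-buffer capacity
    "use_cer": False,    # include the latest transition in every batch (CER)
}
```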

Choosing the Right Memory

| Algorithm | Memory | Why |
|---|---|---|
| PPO | OnPolicyBatchReplay | Collects fixed-size batches, trains multiple epochs |
| A2C | OnPolicyBatchReplay | Similar to PPO but single epoch |
| REINFORCE | OnPolicyReplay | Trains on complete episodes |
| DQN | Replay | Standard experience replay |
| DDQN+PER | PrioritizedReplay | Prioritize high-error transitions |
| SAC | Replay | Off-policy, high replay ratio |
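
To make "prioritize high-error transitions" concrete, the sketch below draws indices with probability proportional to a power of the absolute TD error. The function name and the `alpha` and `epsilon` defaults are illustrative assumptions, and a practical implementation would typically use a sum tree rather than recomputing weights for every draw.

```python
import random


def sample_by_priority(td_errors, batch_size, alpha=0.6, epsilon=1e-4, rng=random):
    """Sample transition indices with probability proportional to (|error| + epsilon) ** alpha."""
    # epsilon keeps zero-error transitions sampleable; alpha controls how strongly
    # large errors are favored (alpha = 0 would give uniform sampling).
    priorities = [(abs(e) + epsilon) ** alpha for e in td_errors]
    return rng.choices(range(len(td_errors)), weights=priorities, k=batch_size)
```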

Advanced: Combined Experience Replay (CER)

CER (Zhang and Sutton, 2017) always includes the most recent transition in each sampled batch. This helps in environments where recent experience is particularly valuable.
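
The idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: `sample_with_cer` is a hypothetical helper, and `buffer` is assumed to be a non-empty list ordered from oldest to newest.

```python
import random


def sample_with_cer(buffer, batch_size, rng=random):
    """Uniformly sample batch_size - 1 transitions, then append the newest one."""
    if not buffer:
        raise ValueError("buffer is empty")
    batch = [rng.choice(buffer) for _ in range(batch_size - 1)]
    batch.append(buffer[-1])  # the most recent transition is always included
    return batch
```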

Data Format

Memory stores transitions in NumPy arrays for efficiency. At training time, sampled batches are converted to PyTorch tensors.
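
A batch in this layout could be assembled as in the sketch below, which stacks a list of transitions into one array per field. The function name `to_batch` and the dictionary keys are hypothetical and may not match the names used in slm_lab/agent/memory.

```python
import numpy as np


def to_batch(transitions):
    """Stack (state, action, reward, next_state, done) tuples into per-field arrays."""
    states, actions, rewards, next_states, dones = zip(*transitions)
    return {
        "states": np.asarray(states, dtype=np.float32),
        "actions": np.asarray(actions),
        "rewards": np.asarray(rewards, dtype=np.float32),
        "next_states": np.asarray(next_states, dtype=np.float32),
        "dones": np.asarray(dones, dtype=np.float32),  # booleans become 0.0 / 1.0
    }
```

Each array could then be passed to `torch.from_numpy` to obtain a tensor that shares memory with the array.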

The terminated vs. truncated distinction (new in v5/Gymnasium) enables correct value bootstrapping; see the Changelog for details.
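
Why the distinction matters for bootstrapping can be shown with a one-step TD target. In this sketch, the function name and the `gamma` default are illustrative assumptions: the target drops the next-state value only when the episode truly terminated, and keeps it when the episode was merely truncated (for example by a time limit).

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step TD target that bootstraps unless the episode truly terminated."""
    # A truncated episode is passed as terminated=False, so its next-state value
    # is still used; only a real terminal state zeroes out the bootstrap term.
    bootstrap = 0.0 if terminated else 1.0
    return reward + gamma * next_value * bootstrap
```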
