This page describes SLM-Lab's architecture and control flow.
Control Hierarchy
SLM-Lab organizes training into a hierarchical structure:
```
CLI (slm-lab command)
└── Experiment (hyperparameter search)
    └── Trial (one configuration)
        └── Session (one random seed)
            ├── Agent
            │   ├── Algorithm
            │   └── Memory
            ├── Env
            └── MetricsTracker
```
| Level | Purpose | Configured By |
| --- | --- | --- |
| Experiment | Orchestrates hyperparameter search via Ray Tune | `meta.max_trial`, `search` block |
| Trial | Runs multiple sessions with one configuration | `meta.max_session` |
| Session | Single training run, owns Agent and Env | Random seed (one per session) |
In most cases, you run a single Trial (which creates multiple Sessions). Experiments are used for hyperparameter tuning.
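For example, these counts live in the spec's `meta` block (a minimal sketch; `meta.max_trial` is named above, and `meta.max_session` is assumed to be the companion key that sets sessions per trial; other `meta` keys are omitted):

```json
"meta": {
  "max_session": 4,
  "max_trial": 1
}
```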
Agent Components
The Agent is a container that wires together three components:
Algorithm
Implements the RL algorithm: network architecture, action selection, and gradient updates.
Class Hierarchy:
Each algorithm extends its parent, adding only the differences:
| Algorithm | Parent | Key Difference |
| --- | --- | --- |
| VanillaDQN | SARSA | Neural network Q-function |
| DQN | VanillaDQN | Adds target network infrastructure |
| DoubleDQN | DQN | Uses online network for action selection |
| ActorCritic | Reinforce | Adds value function (critic), supports GAE and n-step |
| PPO | ActorCritic | Adds clipped surrogate objective, minibatch training |
| SAC | ActorCritic | Adds entropy regularization, twin Q-networks |
See Class Inheritance: A2C > PPO for a deep dive.
Key Algorithm Methods:
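The sketch below outlines that interface. Method names follow SLM-Lab's `Algorithm` base class, but bodies are omitted and signatures simplified, so treat it as an outline rather than the actual implementation:

```python
# Simplified outline of the Algorithm interface (check the base Algorithm
# class in slm_lab/agent/algorithm/ for the authoritative signatures).
class Algorithm:
    def init_algorithm_params(self):
        '''Set hyperparameter defaults from the algorithm spec.'''

    def init_nets(self, global_nets=None):
        '''Build the neural networks declared in the net spec.'''

    def act(self, state):
        '''Select an action for the current state (e.g. epsilon-greedy or sampled from a policy).'''

    def sample(self):
        '''Fetch a training batch from memory.'''

    def train(self):
        '''Compute the loss and take gradient steps when it is time to train.'''

    def update(self):
        '''Post-training bookkeeping, e.g. decay exploration or entropy variables.'''
```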
Memory
Stores and retrieves experience for training.
| Type | Algorithms | Behavior | Key Config |
| --- | --- | --- | --- |
| OnPolicyReplay | On-policy (Reinforce, A2C, PPO) | Fixed-size buffer, cleared after use | — |
| Replay | Off-policy (DQN, DoubleDQN, SAC) | Ring buffer with random sampling | `max_size`, `batch_size` |
| PrioritizedReplay | DQN variants | Samples by TD-error priority | `alpha`, `epsilon` |
Memory Interface:
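A comparable outline of the memory interface (simplified; check the base `Memory` class for the exact signatures):

```python
# Simplified outline of the Memory interface.
class Memory:
    def reset(self):
        '''Clear stored experience (on-policy memories call this after each training pass).'''

    def update(self, state, action, reward, next_state, done):
        '''Add one transition from the latest environment step.'''

    def sample(self):
        '''Return a batch of states, actions, rewards, next_states, dones for training.'''
```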
Net
Neural network architectures are configured in the net spec:
| Type | Use Case | Input | Architecture |
| --- | --- | --- | --- |
| MLPNet | Low-dimensional state vectors | 1D vector | Fully connected layers |
| ConvNet | Image observations (e.g. Atari) | Image tensor | Convolutional layers + FC head |
| RecurrentNet | Sequential or partially observable tasks | Sequence of states | Recurrent (GRU/LSTM) cells + FC head |
Network Configuration Example:
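A hedged sketch of an MLP net block (key names follow common SLM-Lab specs; the values are illustrative defaults, not recommendations):

```json
"net": {
  "type": "MLPNet",
  "hid_layers": [64, 64],
  "hid_layers_activation": "relu",
  "clip_grad_val": 0.5,
  "loss_spec": {"name": "MSELoss"},
  "optim_spec": {"name": "Adam", "lr": 0.001},
  "gpu": false
}
```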
Training Loop
A Session runs this loop until max_frame is reached:
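A simplified sketch of that loop (pseudocode-level; the real Session also handles vectorized envs, checkpointing, and metric logging):

```python
def run_session(agent, env, max_frame):
    '''Simplified sketch of the Session loop, not the actual implementation.'''
    frame = 0
    state = env.reset()
    while frame < max_frame:
        action = agent.act(state)                              # Algorithm selects the action
        next_state, reward, done, info = env.step(action)      # Env advances one step
        agent.update(state, action, reward, next_state, done)  # Memory stores; train when due
        state = env.reset() if done else next_state
        frame += 1
```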
Training Frequency
How often training happens depends on the algorithm:
| Algorithm Type | Training Trigger | Example |
| --- | --- | --- |
| On-policy (Reinforce, A2C, PPO) | When the memory has collected a full batch of fresh experience | PPO trains every 128 steps |
| Off-policy (DQN, DoubleDQN, SAC) | Every `training_frequency` steps, sampling from the replay buffer | — |
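For reference, the interval is set in the spec's `algorithm` block, e.g. (a fragment only; other required keys are omitted and the values are illustrative):

```json
"algorithm": {
  "name": "DQN",
  "training_frequency": 4
}
```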
Environments
SLM-Lab uses Gymnasium environments with automatic vectorization:
Environment Creation Flow
An environment is created and then passed through a stack of wrappers that:
- Track frame/episode counts
- Apply grayscale, resize, and frame skip (for image observations)
- Apply running mean/std observation normalization
- Apply running reward normalization
- Track true episode rewards (before normalization) for metrics
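The same pipeline can be illustrated with plain Gymnasium wrappers (a sketch of the idea, not SLM-Lab's internal wrapper classes):

```python
import gymnasium as gym
from gymnasium.wrappers import (
    RecordEpisodeStatistics,  # tracks episode returns/lengths (true rewards)
    NormalizeObservation,     # running mean/std observation normalization
    NormalizeReward,          # running reward normalization
)

env = gym.make("CartPole-v1")
env = RecordEpisodeStatistics(env)  # record true episode rewards before normalization
env = NormalizeObservation(env)
env = NormalizeReward(env)
# For pixel-based envs, an Atari-style preprocessing wrapper would additionally
# apply grayscale conversion, resizing, and frame skip.
```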
Vectorized Environments
The `num_envs` env spec key controls parallelization. With `num_envs=4`:
- Each `env.step()` returns batched data: states of shape `(4, state_dim)`, rewards of shape `(4,)`, etc.
- The frame count increments by 4 per step.
- This is useful for on-policy algorithms that need diverse samples.
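You can see the batched shapes with plain Gymnasium (illustrative only; SLM-Lab constructs its vector envs internally from `num_envs`):

```python
import gymnasium as gym
from gymnasium.vector import SyncVectorEnv

# Four copies of CartPole stepped in lockstep, mirroring num_envs=4
env = SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])
states, infos = env.reset(seed=0)
print(states.shape)  # (4, 4): 4 envs x 4-dim CartPole state

actions = env.action_space.sample()  # one action per env
states, rewards, terminated, truncated, infos = env.step(actions)
print(rewards.shape)  # (4,)
```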
Spec Files
JSON specs fully configure experiments:
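A skeletal spec showing the overall shape (key names follow common SLM-Lab specs; values are placeholders and most required keys are omitted):

```json
{
  "ppo_cartpole": {
    "agent": [{
      "name": "PPO",
      "algorithm": {"name": "PPO", "gamma": 0.99},
      "memory": {"name": "OnPolicyReplay"},
      "net": {"type": "MLPNet", "hid_layers": [64, 64]}
    }],
    "env": [{
      "name": "CartPole-v1",
      "num_envs": 4,
      "max_frame": 100000
    }],
    "meta": {
      "max_session": 4,
      "max_trial": 1
    }
  }
}
```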
Variable Substitution
Specs support `${var}` placeholders whose values are substituted at runtime:
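For example (an illustrative fragment; `env_name` is a hypothetical variable name):

```json
"env": [{
  "name": "${env_name}",
  "max_frame": 1000000
}]
```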
Hyperparameter Search
The `search` block of the spec defines hyperparameter ranges for Ray Tune:
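A hedged sketch of a `search` block using the suffix-based sampling syntax (suffixes such as `__uniform` and `__grid_search` are assumed here; check the version's search documentation for the exact set and ranges):

```json
"search": {
  "agent": [{
    "algorithm": {
      "gamma__uniform": [0.95, 0.999],
      "training_frequency__grid_search": [64, 128, 256]
    }
  }]
}
```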
Extending SLM-Lab
Adding a New Algorithm
1. Create `slm_lab/agent/algorithm/your_algo.py`
2. Inherit from the appropriate base (`Algorithm`, `ActorCritic`, `DQN`, etc.)
3. Override the necessary methods:
   - `init_algorithm_params()` - set hyperparameter defaults
   - `init_nets()` - create networks
4. Register the new class in `slm_lab/agent/algorithm/__init__.py`
5. Create a spec file for testing
Example: Custom DQN Variant
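A minimal sketch of what such a variant could look like, assuming the `DQN` base class exposes a `calc_q_loss(batch)` hook and an `algorithm_spec` dict; names and hooks should be checked against the actual base class:

```python
# Hypothetical file: slm_lab/agent/algorithm/my_dqn.py
from slm_lab.agent.algorithm.dqn import DQN


class MyDQN(DQN):
    '''Illustrative DQN variant: reuses everything from DQN and only tweaks the loss.'''

    def init_algorithm_params(self):
        # Keep the parent's hyperparameter defaults, then read any extra
        # keys this variant adds to the algorithm spec (hypothetical key).
        super().init_algorithm_params()
        self.loss_scale = self.algorithm_spec.get('loss_scale', 1.0)

    def calc_q_loss(self, batch):
        # Scale the parent's Q-loss; a real variant would instead change
        # the target computation or exploration behavior here.
        return self.loss_scale * super().calc_q_loss(batch)
```

The new class would then be registered in `slm_lab/agent/algorithm/__init__.py` so a spec can reference it by name.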
Adding Environment Support
SLM-Lab works with any gymnasium-compatible environment; built-in Gymnasium environments are simply referenced by name in the env spec.
For custom environments, ensure gymnasium API compliance:
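A minimal Gymnasium-compliant environment illustrating the required `reset`/`step` API (a sketch with dummy dynamics):

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class MyCustomEnv(gym.Env):
    '''Dummy environment showing the Gymnasium API shape.'''

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self._steps = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random
        self._steps = 0
        obs = self.observation_space.sample()
        return obs, {}  # (observation, info)

    def step(self, action):
        self._steps += 1
        obs = self.observation_space.sample()
        reward = 1.0
        terminated = self._steps >= 100  # episode ends after 100 steps
        truncated = False
        return obs, reward, terminated, truncated, {}
```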
Register and use:
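Registration can use Gymnasium's registry, after which the registered id becomes the env `name` in a spec (the module path and spec values below are hypothetical):

```python
import gymnasium as gym

# Register the environment so it can be looked up by id
gym.register(id="MyCustomEnv-v0", entry_point="my_package.envs:MyCustomEnv")

# Sanity check outside SLM-Lab (requires my_package to be importable)
env = gym.make("MyCustomEnv-v0")
```

```json
"env": [{
  "name": "MyCustomEnv-v0",
  "max_frame": 100000
}]
```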