🧠 Algorithm Families

v4 Book Readers: This documentation uses the v5 spec format. If you are using v4.1.1, see the Changelog for spec format differences.

Overview

Algorithm classes implement the RL algorithms: how networks are built and used, how actions are selected, and how gradient updates are computed. SLM Lab's algorithms use a taxonomy-based inheritance design in which each algorithm extends its parent and adds only its distinguishing features.

Code: slm_lab/agent/algorithm

Algorithm Taxonomy

Algorithm (base class)
 ├── SARSA (on-policy TD learning)
 │    └── VanillaDQN → DQNBase → DQN → DoubleDQN
 └── Reinforce (policy gradient)
      └── ActorCritic (adds value function, GAE/n-step)
           ├── PPO (adds clipped objective)
           └── SoftActorCritic (adds entropy regularization)

Each level adds only its distinguishing features. For example, PPO inherits everything from ActorCritic and only overrides the policy loss calculation. Note: ActorCritic is A2C; there is no separate A2C class.

See Class Inheritance: A2C > PPO for a detailed example.
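
The sketch below illustrates the pattern only; it is not the actual PPO source. The calc_policy_loss hook name follows the v4 code, so verify it against slm_lab/agent/algorithm/ppo.py before relying on it.

```python
# Illustrative sketch of the inheritance pattern, not the real PPO implementation.
from slm_lab.agent.algorithm.actor_critic import ActorCritic


class PPO(ActorCritic):
    def calc_policy_loss(self, batch, pdparams, advs):
        # Replace the vanilla policy-gradient loss with the clipped surrogate
        # objective; value loss, GAE, and optimizer steps are all inherited
        # from ActorCritic.
        ...
```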

Implemented Algorithms

| Algorithm | Type | Action Space | Key Features |
| --- | --- | --- | --- |
| SARSA | Value-based | Discrete | On-policy TD learning |
| VanillaDQN | Value-based | Discrete | Basic Q-learning with a neural network |
| DQN | Value-based | Discrete | + Target network |
| DoubleDQN | Value-based | Discrete | + Double Q-learning |
| REINFORCE | Policy gradient | Both | Monte Carlo policy gradient |
| ActorCritic | Actor-Critic | Both | Separate actor and critic |
| A2C | Actor-Critic | Both | + Synchronized updates |
| PPO | Actor-Critic | Both | + Clipped surrogate objective |
| SAC | Actor-Critic | Continuous | + Maximum entropy RL |

Algorithm Interface

All algorithms implement this interface:
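
The sketch below paraphrases the core methods from slm_lab/agent/algorithm/base.py (v4 naming); check the linked source for the exact signatures.

```python
# Paraphrased interface sketch; see slm_lab/agent/algorithm/base.py for the
# authoritative definition.
class Algorithm:
    def __init__(self, agent, global_nets=None):
        # Store the agent reference, then initialize params and networks.
        ...

    def init_algorithm_params(self):
        '''Read algorithm hyperparameters from the spec.'''

    def init_nets(self, global_nets=None):
        '''Build networks and optimizers from the net spec.'''

    def act(self, state):
        '''Select an action for the given state using action_policy.'''

    def sample(self):
        '''Sample a batch of experiences from the agent's memory.'''

    def train(self):
        '''Compute losses, run gradient updates, and return the loss.'''

    def update(self):
        '''Update internal variables (e.g. explore_var) after training.'''
```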

Algorithm Spec

Configure algorithms in the agent spec:
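
A minimal sketch of where the algorithm block lives inside a spec. It is shown as a Python dict for readability; actual SLM Lab spec files are JSON with the same keys, and the values here are illustrative, not tuned defaults.

```python
spec = {
    "agent": [{
        "name": "PPO",
        "algorithm": {
            "name": "PPO",               # algorithm class to instantiate
            "action_pdtype": "default",  # action probability distribution
            "action_policy": "default",  # action selection function
            "gamma": 0.99,               # discount factor
            # algorithm-specific parameters go here; see the tables below
        },
        # "memory": {...} and "net": {...} blocks omitted for brevity
    }],
    # "env" and "meta" sections omitted
}
```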

Key Parameters

Common Parameters

| Parameter | Description | Typical Values |
| --- | --- | --- |
| gamma | Discount factor (how much to value future rewards) | 0.99 (long-horizon), 0.9 (short-horizon) |
| action_pdtype | Probability distribution for actions | "default" (auto-select), "Categorical", "Normal" |
| action_policy | How to select actions | "default", "epsilon_greedy", "boltzmann" |

Policy Gradient Parameters (A2C, PPO)

| Parameter | Description | Typical Values |
| --- | --- | --- |
| lam | GAE lambda (bias-variance tradeoff) | 0.95 (balanced), 0.99 (high variance), 0.7 (low variance) |
| entropy_coef_spec | Entropy bonus for exploration | 0.01 (typical), 0.001 (less exploration) |
| val_loss_coef | Value loss weight | 1.0 (default) |

PPO-Specific Parameters

| Parameter | Description | Typical Values |
| --- | --- | --- |
| time_horizon | Steps collected before each update | 128 (typical), 2048 (MuJoCo) |
| minibatch_size | Samples per gradient step | 64-256 |
| training_epoch | Passes through collected data | 4-10 |
| clip_eps_spec | Clipping parameter | 0.1-0.2 |

DQN-Specific Parameters

| Parameter | Description | Typical Values |
| --- | --- | --- |
| explore_var_spec | Epsilon schedule | Start 1.0, end 0.01 |
| training_frequency | Steps between updates | 1-4 |
| training_start_step | Steps before training starts | 1000-10000 |

Exploration Schedules

Many parameters use schedules for decay during training:
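
For example, an epsilon-greedy exploration schedule is itself a small spec. The field names below follow the v4 spec format and the values are illustrative; check your spec version for the exact keys.

```python
# Decay epsilon linearly from 1.0 to 0.01 over the first 10,000 steps.
explore_var_spec = {
    "name": "linear_decay",
    "start_val": 1.0,
    "end_val": 0.01,
    "start_step": 0,
    "end_step": 10000,
}
```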

Available schedules:

  • "no_decay" - Constant value

  • "linear_decay" - Linear interpolation

  • "rate_decay" - Exponential decay

Example Specs

PPO for CartPole (Discrete)
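
A sketch of the algorithm block only; values are illustrative and drawn from the parameter tables above, not from a tuned benchmark spec. Schedule entries (clip_eps_spec, entropy_coef_spec) may also take end_val/start_step/end_step fields, as described under Exploration Schedules.

```python
# Illustrative PPO algorithm block for CartPole (discrete actions).
ppo_cartpole_algorithm = {
    "name": "PPO",
    "action_pdtype": "default",
    "action_policy": "default",
    "gamma": 0.99,
    "lam": 0.95,
    "clip_eps_spec": {"name": "no_decay", "start_val": 0.2},
    "entropy_coef_spec": {"name": "no_decay", "start_val": 0.01},
    "val_loss_coef": 1.0,
    "time_horizon": 128,
    "minibatch_size": 64,
    "training_epoch": 4,
}
```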

DQN for LunarLander (Discrete)
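
Again only the algorithm block, with illustrative values; a full spec also needs a replay memory and a net section.

```python
# Illustrative DQN algorithm block for LunarLander (discrete actions).
dqn_lunarlander_algorithm = {
    "name": "DQN",
    "action_pdtype": "default",
    "action_policy": "epsilon_greedy",
    "gamma": 0.99,
    "explore_var_spec": {"name": "linear_decay", "start_val": 1.0,
                         "end_val": 0.01, "start_step": 0, "end_step": 10000},
    "training_frequency": 4,
    "training_start_step": 1000,
}
```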

SAC for MuJoCo (Continuous)
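
SAC-specific parameters are not covered in the tables on this page, so the sketch below shows only the common ones; values are illustrative.

```python
# Illustrative SAC algorithm block for a continuous-control MuJoCo task.
sac_mujoco_algorithm = {
    "name": "SoftActorCritic",
    "action_pdtype": "default",  # resolves to a continuous distribution
    "action_policy": "default",
    "gamma": 0.99,
    # SAC-specific settings (entropy/temperature handling, update cadence)
    # are omitted here; see the SAC spec files shipped with SLM Lab.
}
```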

Adding a New Algorithm

  1. Create slm_lab/agent/algorithm/your_algo.py

  2. Inherit from the appropriate base class

  3. Override only the methods that differ

  4. Register in slm_lab/agent/algorithm/__init__.py

Example: Custom DQN Variant
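
A minimal sketch of steps 2-3, assuming DQN exposes a calc_q_loss(batch) hook as in the v4 source; verify the hook name against slm_lab/agent/algorithm/dqn.py before copying.

```python
# your_algo.py: hypothetical DQN variant that only rescales the Q-loss.
from slm_lab.agent.algorithm.dqn import DQN


class ScaledLossDQN(DQN):
    '''DQN variant overriding a single method; everything else is inherited.'''

    def calc_q_loss(self, batch):
        # Reuse the parent's loss computation and modify only the result.
        q_loss = super().calc_q_loss(batch)
        return 0.5 * q_loss
```

The new class still needs to be exported from slm_lab/agent/algorithm/__init__.py (step 4) so the spec loader can find it by name.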

See Architecture for more on extending SLM Lab.

Algorithm Performance Notes

Based on v5 benchmark results, here's guidance on algorithm selection:

| Environment Type | Best Algorithm | Notes |
| --- | --- | --- |
| Classic Control | PPO, A2C | Fast convergence, reliable |
| Box2D Discrete | DDQN+PER | Better than DQN, PPO close second |
| Box2D Continuous | SAC | Best for continuous LunarLander |
| MuJoCo | PPO | Robust across all 11 envs |
| Atari | PPO | Validated on 54 games |

Known Limitations

These algorithm-environment combinations underperform:

| Algorithm | Environment | Issue | Alternative |
| --- | --- | --- | --- |
| DQN | CartPole | Slow convergence (188 vs. 499 for PPO) | Use DDQN+PER or PPO |
| A2C | LunarLander | Fails on both discrete (9.5) and continuous (-38) | Use PPO or SAC |
| A2C | Pendulum | Poor performance (-553 vs. -168 for PPO) | Use PPO or SAC |
| SAC | Discrete envs | Mixed results, high variance | Use PPO or DDQN+PER |


SAC on MuJoCo: Not included in v5 benchmarks due to compute requirements. Off-policy algorithms require significantly more resources for systematic benchmarking. Use PPO for validated MuJoCo results.

Lambda Tuning for Atari

Different games benefit from different GAE lambda values:

| Lambda | Best For | Examples |
| --- | --- | --- |
| 0.95 | Strategic games | Qbert, BeamRider, Seaquest |
| 0.85 | Mixed games | Pong, MsPacman, Enduro |
| 0.70 | Action games | Breakout, KungFuMaster |

See Atari Benchmark for per-game results.

Learning Resources

For deep dives into these algorithms:
