🧠 Algorithm Families

Note for v4 book readers: this documentation uses the v5 spec format. If you are using v4.1.1, see the Changelog for spec format differences.

Overview

Algorithm classes implement RL algorithms: network architecture, action selection, and gradient updates. SLM Lab's algorithms use a taxonomy-based inheritance design where each algorithm extends its parent by adding only its distinguishing features.

Code: slm_lab/agent/algorithm

Algorithm Taxonomy

Algorithm (base class)
 ├── SARSA (on-policy TD learning)
 │    └── VanillaDQN → DQNBase → DQN → DoubleDQN
 └── Reinforce (policy gradient)
      └── ActorCritic (adds value function, GAE/n-step)
           ├── PPO (adds clipped objective)
           └── SoftActorCritic (adds entropy regularization)
                └── CrossQ (eliminates target networks via cross batch norm)

Each level adds only its distinguishing features. For example, PPO inherits everything from ActorCritic and only overrides the policy loss calculation. Note: ActorCritic is A2C; there is no separate A2C class.

See Class Inheritance: A2C > PPO for a detailed example.
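To make the pattern concrete, here is a schematic of how PPO extends ActorCritic. This is a simplified sketch, not the verbatim source; the method name calc_policy_loss follows the v4 codebase and may differ in v5.

```python
# Schematic of the taxonomy pattern (simplified sketch; see
# slm_lab/agent/algorithm/ppo.py for the real implementation).
from slm_lab.agent.algorithm.actor_critic import ActorCritic

class PPO(ActorCritic):
    def calc_policy_loss(self, batch, pdparams, advs):
        '''Swap the A2C policy-gradient loss for PPO's clipped surrogate
        objective. Networks, GAE/n-step advantages, the value loss, and
        the training loop are all inherited unchanged from ActorCritic.'''
        ...
```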

Implemented Algorithms

| Algorithm | Type | Action Space | Key Features |
| --- | --- | --- | --- |
| SARSA | Value-based | Discrete | On-policy TD learning |
| VanillaDQN | Value-based | Discrete | Basic Q-learning with a neural network |
| DQN | Value-based | Discrete | + Target network |
| DoubleDQN | Value-based | Discrete | + Double Q-learning |
| REINFORCE | Policy gradient | Both | Monte Carlo policy gradient |
| ActorCritic | Actor-Critic | Both | Separate actor and critic |
| A2C | Actor-Critic | Both | + Synchronized updates |
| PPO | Actor-Critic | Both | + Clipped surrogate objective |
| SAC | Actor-Critic | Both | + Maximum entropy RL, auto-tuned temperature |
| CrossQ | Actor-Critic | Both | + No target networks, cross batch norm |

Algorithm Interface

All algorithms implement this interface:
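The sketch below summarizes the core methods, following the v4-era base class in slm_lab/agent/algorithm/base.py; v5 signatures may differ slightly, so treat it as an outline rather than the authoritative definition.

```python
# Outline of the Algorithm base-class contract (a sketch; see
# slm_lab/agent/algorithm/base.py for the authoritative version).
from abc import ABC, abstractmethod

class Algorithm(ABC):
    def __init__(self, agent, global_nets=None):
        self.agent = agent            # owning agent; provides spec and memory
        self.init_algorithm_params()  # read hyperparameters from the spec
        self.init_nets(global_nets)   # build networks and optimizers

    @abstractmethod
    def init_algorithm_params(self):
        '''Set hyperparameters from the spec's algorithm block.'''

    @abstractmethod
    def init_nets(self, global_nets=None):
        '''Build the networks this algorithm uses.'''

    @abstractmethod
    def act(self, state):
        '''Select an action for the given state via action_policy.'''

    @abstractmethod
    def sample(self):
        '''Sample a batch of experiences from the agent's memory.'''

    @abstractmethod
    def train(self):
        '''Compute losses, take gradient steps, and return the loss.'''

    @abstractmethod
    def update(self):
        '''Update internal variables (e.g. explore_var) after training.'''
```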

Algorithm Spec

Configure algorithms in the agent spec:
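As a sketch (v4-style layout; exact v5 field names may differ), the algorithm block sits inside each entry of the spec's agent list. Spec files are JSON; an equivalent Python dict is shown here:

```python
# Sketch of an agent spec's algorithm block (v4-style layout as a
# Python dict; actual spec files are JSON, and v5 fields may differ).
agent_spec = {
    "name": "PPO",
    "algorithm": {
        "name": "PPO",               # algorithm class to instantiate
        "action_pdtype": "default",  # auto-select the action distribution
        "action_policy": "default",  # how actions are sampled
        "gamma": 0.99,               # discount factor
        "lam": 0.95,                 # GAE lambda
        "entropy_coef_spec": {"name": "no_decay", "start_val": 0.01},
        "val_loss_coef": 1.0,        # value loss weight
    },
    # "memory" and "net" blocks omitted; see the example specs below
}
```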

Key Parameters

Common Parameters

| Parameter | Description | Typical Values |
| --- | --- | --- |
| gamma | Discount factor (how much to value future rewards) | 0.99 (long-horizon), 0.9 (short-horizon) |
| action_pdtype | Probability distribution for actions | "default" (auto-select), "Categorical", "Normal" |
| action_policy | How to select actions | "default", "epsilon_greedy", "boltzmann" |

Policy Gradient Parameters (A2C, PPO)

| Parameter | Description | Typical Values |
| --- | --- | --- |
| lam | GAE lambda (bias-variance tradeoff) | 0.95 (balanced), 0.99 (high variance), 0.7 (low variance) |
| entropy_coef_spec | Entropy bonus for exploration | 0.01 (typical), 0.001 (less exploration) |
| val_loss_coef | Value loss weight | 1.0 (default) |

PPO-Specific Parameters

| Parameter | Description | Typical Values |
| --- | --- | --- |
| time_horizon | Steps collected before each update | 128 (typical), 2048 (MuJoCo) |
| minibatch_size | Samples per gradient step | 64-256 |
| training_epoch | Passes through the collected data per update | 4-10 |
| clip_eps_spec | Clipping parameter for the surrogate objective | 0.1-0.2 |

DQN-Specific Parameters

| Parameter | Description | Typical Values |
| --- | --- | --- |
| explore_var_spec | Epsilon schedule | Start at 1.0, end at 0.01 |
| training_frequency | Steps between updates | 1-4 |
| training_start_step | Steps before training starts | 1000-10000 |

Exploration Schedules

Many parameters use schedules to decay their values during training; an example follows the list below.

Available schedules:

- "no_decay" - constant value
- "linear_decay" - linear interpolation from start to end value
- "rate_decay" - exponential decay

Example Specs

PPO for CartPole (Discrete)
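A minimal sketch, assuming the v4-style spec layout (shown as a Python dict; real specs are JSON). Values are illustrative, not the tuned benchmark settings:

```python
# Sketch of a PPO spec for CartPole; values are illustrative.
spec = {
    "ppo_cartpole": {
        "agent": [{
            "name": "PPO",
            "algorithm": {
                "name": "PPO",
                "action_pdtype": "default",
                "action_policy": "default",
                "gamma": 0.99,
                "lam": 0.95,
                "clip_eps_spec": {"name": "no_decay", "start_val": 0.2},
                "entropy_coef_spec": {"name": "no_decay", "start_val": 0.01},
                "val_loss_coef": 1.0,
                "time_horizon": 128,
                "minibatch_size": 64,
                "training_epoch": 4,
            },
            "memory": {"name": "OnPolicyBatchReplay"},
            "net": {
                "type": "MLPNet",
                "hid_layers": [64],
                "hid_layers_activation": "tanh",
                "optim_spec": {"name": "Adam", "lr": 3e-4},
            },
        }],
        "env": [{"name": "CartPole-v1", "max_frame": 100000}],
    }
}
```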

DQN for LunarLander (Discrete)
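The same kind of sketch for DQN; note the epsilon-greedy schedule and the replay warm-up controlled by training_start_step:

```python
# Sketch of a DQN spec for LunarLander; values are illustrative.
spec = {
    "dqn_lunarlander": {
        "agent": [{
            "name": "DQN",
            "algorithm": {
                "name": "DQN",
                "action_pdtype": "Argmax",
                "action_policy": "epsilon_greedy",
                "explore_var_spec": {
                    "name": "linear_decay",
                    "start_val": 1.0,
                    "end_val": 0.01,
                    "start_step": 0,
                    "end_step": 50000,
                },
                "gamma": 0.99,
                "training_frequency": 4,      # steps between updates
                "training_start_step": 1000,  # replay warm-up
            },
            "memory": {"name": "Replay", "max_size": 100000, "batch_size": 64},
            "net": {"type": "MLPNet", "hid_layers": [256, 128]},
        }],
        "env": [{"name": "LunarLander-v2", "max_frame": 300000}],
    }
}
```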

SAC for MuJoCo (Continuous)
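A continuous-control sketch for SAC; the environment name and all values are illustrative, and SAC's auto-tuned entropy temperature typically needs no explicit setting:

```python
# Sketch of a SAC spec for a MuJoCo task; values are illustrative.
spec = {
    "sac_halfcheetah": {
        "agent": [{
            "name": "SoftActorCritic",
            "algorithm": {
                "name": "SoftActorCritic",
                "action_pdtype": "default",  # Normal for continuous actions
                "action_policy": "default",
                "gamma": 0.99,
                "training_frequency": 1,
                "training_start_step": 1000,
            },
            "memory": {"name": "Replay", "max_size": 1000000, "batch_size": 256},
            "net": {"type": "MLPNet", "hid_layers": [256, 256]},
        }],
        "env": [{"name": "HalfCheetah-v4", "max_frame": 1000000}],
    }
}
```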

Adding a New Algorithm

  1. Create slm_lab/agent/algorithm/your_algo.py

  2. Inherit from the appropriate base class

  3. Override only the methods that differ

  4. Register in slm_lab/agent/algorithm/__init__.py

Example: Custom DQN Variant
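A minimal sketch of the pattern under v4-era assumptions: MyDQN and the file name are hypothetical, and calc_q_loss is the hook DQN variants typically override (verify against slm_lab/agent/algorithm/dqn.py):

```python
# slm_lab/agent/algorithm/my_dqn.py (hypothetical file)
# Sketch of a custom DQN variant: inherit, override only what differs.
from slm_lab.agent.algorithm.dqn import DQN

class MyDQN(DQN):
    def calc_q_loss(self, batch):
        '''Customize the Q-target/loss computation; networks, action
        selection, and replay sampling are all inherited from DQN.'''
        # compute a custom loss here; fall back to the parent for brevity
        return super().calc_q_loss(batch)
```

Then export MyDQN from slm_lab/agent/algorithm/__init__.py so specs can reference it by name.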

See Architecture for more on extending SLM Lab.

Algorithm Performance Notes

Based on v5 benchmark results, here's guidance on algorithm selection:

| Environment Type | Best Algorithm | Notes |
| --- | --- | --- |
| Classic Control | PPO, SAC | Fast convergence, reliable |
| Box2D Discrete | DDQN+PER | Better than DQN; PPO a close second |
| Box2D Continuous | SAC, CrossQ | SAC reliable; CrossQ 2–7x faster |
| MuJoCo | PPO, SAC, CrossQ | All validated on 11 envs; CrossQ fastest |
| Atari | PPO | Validated on 57 games; SAC on 48 games |

Known Limitations

These algorithm-environment combinations underperform:

| Algorithm | Environment | Issue | Alternative |
| --- | --- | --- | --- |
| DQN | CartPole | Slow convergence (188 vs 499 for PPO) | Use DDQN+PER or PPO |
| A2C | LunarLander | Fails both discrete (9.5) and continuous (-38) | Use PPO or SAC |
| A2C | Pendulum | Poor performance (-553 vs -168 for PPO) | Use PPO or SAC |
| CrossQ | Atari | Experimental; underperforms SAC/PPO on most games | Use PPO or SAC |

Lambda Tuning for Atari

Different games benefit from different GAE lambda values:

| Lambda | Best For | Examples |
| --- | --- | --- |
| 0.95 | Strategic games | Qbert, BeamRider, Seaquest |
| 0.85 | Mixed games | Pong, MsPacman, Enduro |
| 0.70 | Action games | Breakout, KungFuMaster |

See Atari Benchmark for per-game results.

Learning Resources

For deep dives into these algorithms:
