🤖 Agent Spec

This tutorial shows how to configure an agent's algorithm, memory, and neural network. We'll train a DDQN+PER agent on LunarLander, a spacecraft landing task.

What is DDQN+PER?

| Component | What It Does |
| --- | --- |
| DDQN (Double DQN) | Reduces overestimation of Q-values by using separate networks for action selection and evaluation |
| PER (Prioritized Experience Replay) | Learns faster by prioritizing surprising transitions (high TD error) |

Together, they create a more stable, sample-efficient agent than vanilla DQN.

> **Note:** You don't need to understand these algorithms in detail to follow this tutorial. The goal is to show how SLM Lab's spec system works.

The Agent Spec Structure

Every agent in SLM Lab is configured with three components:

```json
{
  "spec_name": {
    "agent": {
      "name": "AgentName",       // For logging
      "algorithm": {...},        // Algorithm configuration
      "memory": {...},           // Experience storage
      "net": {...}               // Neural network
    },
    "env": {...},
    "meta": {...}
  }
}
```

Available Algorithms

| Algorithm | Type | Best For | Validated Environments |
| --- | --- | --- | --- |
| REINFORCE | On-policy | Learning/teaching | Classic |
| SARSA | On-policy | Tabular-like | Classic |
| DQN/DDQN+PER | Off-policy | Discrete actions | Classic, Box2D, Atari |
| A2C | On-policy | Fast iteration | Classic, Box2D, Atari |
| PPO | On-policy | General purpose | Classic, Box2D, MuJoCo (11), Atari (54) |
| SAC | Off-policy | Continuous control | Classic, Box2D, MuJoCo |

See Benchmark Specs for complete spec files for each algorithm.

Example: DDQN+PER Spec

The full spec lives in `slm_lab/spec/benchmark/dqn/ddqn_per_lunar.json`.
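As a rough map before the detailed breakdowns, here's a condensed sketch of its agent section, assembled from the parameters discussed below. The agent `name` and the net `type` are assumptions; consult the file for the authoritative, complete version (including the env and meta sections).

```json
"agent": {
  "name": "DoubleDQN",                      // assumed; used for logging
  "algorithm": {
    "name": "DoubleDQN",                    // assumed class name for DDQN
    "action_policy": "epsilon_greedy",
    "explore_var_spec": {...},              // epsilon decay schedule, expanded below
    "gamma": 0.99,
    "training_frequency": 1,
    "training_start_step": 32
  },
  "memory": {
    "name": "PrioritizedReplay",
    "alpha": 0.6,
    "batch_size": 32,
    "max_size": 50000
  },
  "net": {
    "type": "MLPNet",                       // assumed feedforward net type
    "hid_layers": [256, 128],
    "loss_spec": {"name": "SmoothL1Loss"},
    "update_type": "replace",
    "update_frequency": 100
  }
}
```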

Algorithm Spec Breakdown

Key concepts:

| Parameter | Effect |
| --- | --- |
| `action_policy: "epsilon_greedy"` | Random action with probability ε, greedy otherwise |
| `gamma: 0.99` | Value future rewards highly (long-horizon) |
| `training_frequency: 1` | Train on every environment step |
| `training_start_step: 32` | Collect initial batch before training |
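
The ε used by `epsilon_greedy` is governed by the algorithm's `explore_var_spec` (listed later under Key Parameters). A typical linearly decaying schedule might look like the sketch below; the decay name follows SLM Lab's convention, but the step counts and end value here are illustrative assumptions.

```json
"explore_var_spec": {
  "name": "linear_decay",   // anneal epsilon linearly over training
  "start_val": 1.0,         // start fully random
  "end_val": 0.01,          // end mostly greedy
  "start_step": 0,
  "end_step": 10000         // illustrative; check ddqn_per_lunar.json for the real horizon
}
```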

Memory Spec Breakdown

Key concepts:

| Parameter | Effect |
| --- | --- |
| `alpha: 0.6` | Moderate prioritization (0 = uniform sampling, 1 = strict priority) |
| `batch_size: 32` | Each training step uses 32 transitions |
| `max_size: 50000` | Store up to 50k transitions (oldest deleted when full) |

Net Spec Breakdown

Key concepts:

| Parameter | Effect |
| --- | --- |
| `hid_layers: [256, 128]` | First layer has 256 units, second has 128 |
| `SmoothL1Loss` | Huber loss; less sensitive to outliers than MSE |
| `update_type: "replace"` | Periodically copy online network to target |
| `update_frequency: 100` | Copy weights every 100 training steps |

Running the Experiment

Dev Mode (Quick Test)
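
Using the `slm-lab run <spec_file> <spec_name> <mode>` pattern shown later, and assuming the spec name matches the result directory used below (`ddqn_per_concat_lunar`), a quick dev run looks like this:

```bash
slm-lab run slm_lab/spec/benchmark/dqn/ddqn_per_lunar.json ddqn_per_concat_lunar dev
```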

You'll see the LunarLander environment rendering:

LunarLander environment

LunarLander-v3 is a classic control task: land a spacecraft safely between two flags using four discrete actions (left thruster, right thruster, main engine, or nothing). The agent receives reward for moving toward the landing pad and penalty for crashing or using fuel.

> **Gymnasium Note:** LunarLander-v3 (Gymnasium) has stricter termination conditions than LunarLander-v2 (OpenAI Gym). Scores are typically lower than older benchmarks. See the Gymnasium docs for details.

Train Mode (Full Training)
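
With the same assumed spec name, switch the mode to train:

```bash
slm-lab run slm_lab/spec/benchmark/dqn/ddqn_per_lunar.json ddqn_per_concat_lunar train
```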

This runs 4 sessions with different random seeds. Expect ~1-2 hours for completion.

Results

After training, graphs are saved to data/ddqn_per_concat_lunar_{timestamp}/ (e.g., ddqn_per_concat_lunar_2026_01_30_215532):

Trial graph (average of 4 sessions):

DDQN+PER LunarLander trial graph

Moving average (100-checkpoint window):

DDQN+PER LunarLander trial graph MA

The target score for LunarLander-v3 is 200. DDQN+PER reaches 261.5 MA with this configuration. See Discrete Benchmark for full results.

Trained models are available on HuggingFace.

Modifying the Spec

Change the Algorithm

Switch from DDQN to plain DQN:
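
In the algorithm block, only the name needs to change (assuming "DQN" is the registered class name, as suggested by the dqn benchmark specs):

```json
"algorithm": {
  "name": "DQN",                     // was "DoubleDQN"
  "action_policy": "epsilon_greedy", // remaining fields unchanged
  "gamma": 0.99,
  "training_frequency": 1,
  "training_start_step": 32
}
```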

Change the Memory

Switch from PER to uniform replay:
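
Swap the memory class and drop the PER-only `alpha` parameter (field names assumed to follow the uniform Replay memory used by the plain DQN specs):

```json
"memory": {
  "name": "Replay",                  // was "PrioritizedReplay"
  "batch_size": 32,
  "max_size": 50000
}
```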

Change the Network

Use a larger network:
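
For example, widen the hidden layers; the sizes below are illustrative, and more capacity generally means slower training:

```json
"net": {
  "type": "MLPNet",                  // assumed net type
  "hid_layers": [512, 256],          // was [256, 128]
  "loss_spec": {"name": "SmoothL1Loss"},
  "update_type": "replace",
  "update_frequency": 100
}
```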

Using Other Algorithms

To use a different algorithm, find its spec file and change algorithm.name. All algorithm specs are in slm_lab/spec/benchmark/:

| Algorithm | Spec Directory | Example Spec |
| --- | --- | --- |
| REINFORCE | `slm_lab/spec/benchmark/reinforce/` | `reinforce_cartpole.json` |
| SARSA | `slm_lab/spec/benchmark/sarsa/` | `sarsa_cartpole.json` |
| DQN | `slm_lab/spec/benchmark/dqn/` | `dqn_cartpole.json`, `dqn_lunar.json` |
| DDQN+PER | `slm_lab/spec/benchmark/dqn/` | `ddqn_per_lunar.json` |
| A2C | `slm_lab/spec/benchmark/a2c/` | `a2c_cartpole.json`, `a2c_gae_lunar.json` |
| PPO | `slm_lab/spec/benchmark/ppo/` | `ppo_cartpole.json`, `ppo_lunar.json`, `ppo_atari.json` |
| SAC | `slm_lab/spec/benchmark/sac/` | `sac_lunar.json`, `sac_pendulum.json` |

Switching Algorithms

  1. Find a spec for your target algorithm in the directories above

  2. Copy and modify the agent section, or use the spec directly

  3. Run with slm-lab run <spec_file> <spec_name> train

Example: switch from DDQN to PPO on the same environment:
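
Assuming the spec name inside `ppo_lunar.json` matches its file name, this is just:

```bash
slm-lab run slm_lab/spec/benchmark/ppo/ppo_lunar.json ppo_lunar train
```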

Algorithm-Specific Notes

| Algorithm | Memory Type | Action Space | Key Parameters |
| --- | --- | --- | --- |
| REINFORCE | OnPolicyReplay | Discrete/Continuous | `gamma` |
| SARSA | OnPolicyReplay | Discrete | `gamma`, `lam` |
| DQN/DDQN | Replay, PrioritizedReplay | Discrete | `gamma`, `explore_var_spec`, `training_frequency` |
| A2C | OnPolicyBatchReplay | Discrete/Continuous | `gamma`, `lam`, `entropy_coef` |
| PPO | OnPolicyBatchReplay | Discrete/Continuous | `gamma`, `lam`, `clip_eps`, `time_horizon` |
| SAC | Replay | Continuous | `gamma`, `alpha` (entropy), `training_iter` |

> **Finding more specs:** Run `ls slm_lab/spec/benchmark/` to see all algorithm directories. Each contains spec files for various environments.

What's Next
