📋 Running Benchmarks

This guide covers how to run reproducible benchmarks with SLM Lab, including hyperparameter search methodology and best practices.

Quick Start

After installation, copy a SPEC_FILE and SPEC_NAME from the result tables on the benchmark pages.

Running Benchmarks

Local - runs on your machine (Classic Control jobs finish in minutes):

slm-lab run SPEC_FILE SPEC_NAME train

Remote - runs on a cloud GPU via dstack and auto-syncs results to HuggingFace:

source .env && slm-lab run-remote --gpu SPEC_FILE SPEC_NAME train -n NAME

Remote setup: cp .env.example .env, then set HF_TOKEN. See Remote Training for dstack configuration.
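A minimal .env sketch (the token value is a placeholder; only HF_TOKEN is shown here, and your .env.example may list additional variables):

```bash
# .env -- copied from .env.example; the token value below is a placeholder
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx
```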


Recommended: Use run-remote for MuJoCo and Atari benchmarks. Cloud GPUs are faster and cheaper than local training for longer runs.

Atari

All games share one spec file (54 games tested; 5 hard-exploration games are skipped). Use -s env=ENV to substitute the game:

source .env && slm-lab run-remote --gpu -s env=ALE/Pong-v5 slm_lab/spec/benchmark/ppo/ppo_atari.json ppo_atari train -n pong

Download Results

Trained models and metrics sync to HuggingFace. Pull them locally:
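One way to pull a synced run locally is the standard huggingface-cli, shown here as a sketch; the repo id and local directory are placeholders rather than the project's actual naming:

```bash
# Placeholder repo id and local dir -- substitute the repo linked from your run / BENCHMARKS.md
# add --repo-type dataset if the results are stored in a dataset repo
huggingface-cli download USER/RUN_REPO --local-dir data/RUN_NAME
```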

Replay Trained Model
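A sketch of replaying a pulled checkpoint, assuming the enjoy@ replay mode from classic SLM Lab carries over to the slm-lab CLI; the session name is a placeholder:

```bash
# Assumption: slm-lab accepts the classic enjoy@SESSION mode; the session name is a placeholder
slm-lab run SPEC_FILE SPEC_NAME enjoy@SPEC_NAME_t0_s0
```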


Authoritative source: BENCHMARKS.md in the repository contains exact reproduction commands, current results, and HuggingFace links.

Standardized Settings

Fair comparison requires consistent configurations across environment categories:

| Category | num_envs | max_frame | log_frequency | max_session |
|---|---|---|---|---|
| Classic Control | 4 | 2e5-3e5 | 500 | 4 |
| Box2D | 8 | 3e5 | 1000 | 4 |
| MuJoCo | 16 | 1e6-10e6 | 10000 | 4 |
| Atari | 16 | 10e6 | 10000 | 4 |
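In a spec file, these four settings map onto the env and meta sections. A minimal sketch for a Classic Control spec, assuming the usual SLM Lab spec layout (the spec name and env name are illustrative; the agent section is omitted):

```json
{
  "ppo_cartpole": {
    "env": [{"name": "CartPole-v1", "num_envs": 4, "max_frame": 200000}],
    "meta": {"log_frequency": 500, "max_session": 4, "max_trial": 1}
  }
}
```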


Three-Stage Search Process

When tuning hyperparameters or adding new environments, use this systematic approach:

| Stage | Mode | Config | Purpose |
|---|---|---|---|
| 1. ASHA | search | max_session=1, search_scheduler enabled | Wide exploration with early termination |
| 2. Multi | search | max_session=4, no search_scheduler | Validate top configs with multiple seeds |
| 3. Final | train | Best hyperparameters committed to spec | Confirmation run for benchmark table |

ASHA (Asynchronous Successive Halving) terminates unpromising trials early, focusing compute on promising configurations.
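Stage 1 uses the same CLI pattern as Quick Start, just in search mode (a sketch; run locally or via run-remote as above):

```bash
# Stage 1: wide ASHA exploration -- the spec has max_session=1 and search_scheduler enabled
slm-lab run SPEC_FILE SPEC_NAME search
```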

Stage 2: Multi-Seed Validation

After ASHA, validate top 3-5 configurations with multiple seeds (no early stopping):
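A sketch of the meta block for this stage: several sessions per trial and no search_scheduler entry, so every remaining trial runs to max_frame. The trial count and exact key placement are assumptions; mirror an existing search spec:

```json
{
  "meta": {
    "max_session": 4,
    "max_trial": 4
  }
}
```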

Single runs can be lucky; averaging 4 independent runs reveals the true performance.

Stage 3: Final Validation

Update spec defaults with best hyperparameters, then run in train mode:
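With the winning values committed as the spec defaults, the final run is just the Quick Start command in train mode:

```bash
# Stage 3: confirmation run with the tuned, committed spec
slm-lab run SPEC_FILE SPEC_NAME train
```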


Search Space Sizing

Rule of thumb: allow at least ~3-4 trials per search dimension. For example, a 4-dimension search calls for a max_trial of roughly 12-16.

| max_trial | Max Dimensions | Use Case |
|---|---|---|
| 8 | 2-3 | Focused refinement |
| 12-16 | 3-4 | Typical search |
| 20 | 5 | Wide exploration |
| 30 | 6-7 | Broad ASHA search |

High-Impact Hyperparameters

Focus on these first; they have the largest effect on performance:

| Priority | Parameter | Path | Typical Range |
|---|---|---|---|
| 1 | Learning rate | agent.net.optim_spec.lr | 1e-5 to 1e-3 |
| 2 | Discount factor | agent.algorithm.gamma | 0.98-0.999 |
| 3 | GAE lambda | agent.algorithm.lam | 0.9-0.99 |
| 4 | Entropy coefficient | agent.algorithm.entropy_coef_spec.start_val | 0.001-0.1 |
| 5 | Clip epsilon | agent.algorithm.clip_eps_spec.start_val | 0.1-0.3 |

Less impactful (fix based on successful runs): minibatch_size, training_epoch, network architecture.
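A sketch of a search block over the top two parameters, assuming SLM Lab's key__distribution search syntax; the distribution names shown (loguniform, uniform) are illustrative and may differ from what the shipped search specs use:

```json
{
  "search": {
    "agent": [{
      "net": {"optim_spec": {"lr__loguniform": [1e-5, 1e-3]}},
      "algorithm": {"gamma__uniform": [0.98, 0.999]}
    }]
  }
}
```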


Iterative narrowing: After finding good ranges, narrow the search space and re-run rather than continuing broad exploration.

Grace Period by Environment

The grace_period sets the minimum number of frames a trial must run before ASHA may terminate it:

| Environment | grace_period | Reasoning |
|---|---|---|
| Classic Control | 10000-50000 | Fast learning, quick signal |
| Box2D | 50000-100000 | Medium complexity |
| MuJoCo | 100000-1000000 | Slower learning curves |
| Atari | 500000-1000000 | Need significant training for signal |
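A sketch of where grace_period might be set, using the MuJoCo row above; treating search_scheduler as a meta-level block is an assumption, so copy the exact structure from an existing ASHA search spec:

```json
{
  "meta": {
    "max_session": 1,
    "search_scheduler": {"grace_period": 100000}
  }
}
```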

Template Specs

Template specs use ${var} placeholders for flexibility across similar environments:

| Template | Variables | Environments |
|---|---|---|
| ppo_mujoco.json | env, max_frame | All 11 MuJoCo |
| ppo_atari.json | env | All 54 Atari games |
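Variable substitution works like the Atari example above. A sketch for the MuJoCo template: the spec path follows the ppo_atari.json pattern, and passing two substitutions in one -s flag is an assumption about the CLI syntax:

```bash
# Assumption: multiple substitutions are comma-separated in a single -s flag
source .env && slm-lab run-remote --gpu -s env=HalfCheetah-v4,max_frame=1e6 \
  slm_lab/spec/benchmark/ppo/ppo_mujoco.json ppo_mujoco train -n halfcheetah
```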

MuJoCo Tips

Unified vs Individual Specs

  • ppo_mujoco: HalfCheetah, Walker2d, Humanoid, HumanoidStandup (gamma=0.99, lam=0.95)

  • ppo_mujoco_longhorizon: Reacher, Pusher (gamma=0.997, lam=0.97)

  • Individual specs: Hopper, Swimmer, Ant; each has environment-specific tuning

Common Issues

| Problem | Solution |
|---|---|
| Reward not improving | Try a higher training_iter (8-16) for more gradient updates |
| Unstable learning | Try a lower learning rate or enable clip_vloss: true |
| Large reward variance | Enable normalize_v_targets: true for value normalization |
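A sketch of where these fixes might sit in a spec, assuming they are algorithm-level keys; the placement is an assumption, so mirror whichever shipped spec already enables them:

```json
{
  "agent": [{
    "algorithm": {
      "training_iter": 8,
      "clip_vloss": true,
      "normalize_v_targets": true
    }
  }]
}
```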

Atari Tips

Lambda Variants

Different games benefit from different lambda values:

| Spec Name | Lambda | Best For |
|---|---|---|
| ppo_atari | 0.95 | Strategic games (Qbert, Seaquest) |
| ppo_atari_lam85 | 0.85 | Mixed games (MsPacman) |
| ppo_atari_lam70 | 0.70 | Action games (Breakout, Pong) |

Best practice: Test all three variants per game; use the best result.
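Trying a variant is the same launch with a different SPEC_NAME. A sketch for an action game, assuming all three variants live alongside ppo_atari in the same spec file:

```bash
# Low-lambda variant for an action game (Breakout)
source .env && slm-lab run-remote --gpu -s env=ALE/Breakout-v5 \
  slm_lab/spec/benchmark/ppo/ppo_atari.json ppo_atari_lam70 train -n breakout-lam70
```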

v5 Environment Difficulty

Gymnasium ALE v5 uses sticky actions (25% action-repeat probability), following Machado et al. (2018). This makes the environments harder than OpenAI Gym v4; expect 10-40% lower scores.

Troubleshooting

When Progress Stalls

  1. Check GPU metrics (dstack metrics <run-name>); low GPU utilization points to a bottleneck in environment stepping or a config issue (see the command sketch after this list)

  2. Compare with successful specs: review what worked for similar environments

  3. Look for patterns: the same failure across runs suggests a framework issue, not hyperparameters

  4. Research reference implementations: check CleanRL or Stable-Baselines3 configs

  5. Kill unpromising runs early: iterate faster with new approaches
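For steps 1 and 5, the relevant commands look roughly like this (dstack metrics is taken from step 1; dstack ps and dstack stop are standard dstack CLI commands listed here for convenience):

```bash
dstack ps                 # list runs and their status
dstack metrics RUN_NAME   # GPU utilization and memory for a run (step 1)
dstack stop RUN_NAME      # kill an unpromising run early (step 5)
```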

Common Mistakes

| Mistake | Fix |
|---|---|
| Too many search dimensions | Focus on 2-3 high-impact parameters per search |
| Skipping multi-seed validation | Always run max_session=4 before finalizing |
| Using search results directly | Always run a final train mode with the committed spec |
| Inconsistent settings | Verify the spec matches the standardized settings table |

Recording Results

After a successful run:

  1. Extract the final score from the run logs (steps 1-2 are sketched after this list):

  2. Pull results from HuggingFace (see Download Results above):

  3. Update spec defaults with best hyperparameters

  4. Commit spec file for reproducibility
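A rough sketch for steps 1-2; the metric name, data path, and repo id are hypothetical, so take the real names from your run's logs and BENCHMARKS.md:

```bash
# Step 1: hypothetical metric name and output path -- grep the run's session logs
grep -R "total_reward" data/SPEC_NAME*/ | tail

# Step 2: pull the synced run (placeholder repo id, as in Download Results above)
huggingface-cli download USER/RUN_REPO --local-dir data/RUN_NAME
```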

Algorithms

| Algorithm | Type | Best For | Validated Environments |
|---|---|---|---|
| REINFORCE | On-policy | Learning/teaching | Classic |
| SARSA | On-policy | Tabular-like | Classic |
| DQN/DDQN+PER | Off-policy | Discrete actions | Classic, Box2D, Atari |
| A2C | On-policy | Fast iteration | Classic, Box2D, Atari |
| PPO | On-policy | General purpose | Classic, Box2D, MuJoCo (11), Atari (54) |
| SAC | Off-policy | Continuous control | Classic, Box2D, MuJoCo |

Environments

| Category | Examples | Difficulty |
|---|---|---|
| Classic Control | CartPole, Pendulum, Acrobot | Easy |
| Box2D | LunarLander, BipedalWalker | Medium |
| MuJoCo | Hopper, HalfCheetah, Humanoid | Hard |
| Atari | Qbert, MsPacman, and 54 more | Varied |

Benchmark Spec Reference

All benchmark specs are in slm_lab/spec/benchmark/, organized by algorithm.

REINFORCE / SARSA

Simple algorithms for learning fundamentals. CartPole only.

DQN Family

Value-based algorithms for discrete action spaces.

A2C

On-policy actor-critic with synchronized updates. Two variants: GAE (Generalized Advantage Estimation) and n-step returns.

PPO

Proximal Policy Optimization; robust across all environment types.

SAC

Soft Actor-Critic; best for continuous control.

A3C (Async)

Asynchronous Advantage Actor-Critic using Hogwild!. See Async Training.

Async SAC

SAC with Hogwild! for parallel training. See Async Training.


Performance Results

For scores, training curves, and trained models, see BENCHMARKS.md in the repository and the per-environment benchmark pages.
