📋 Running Benchmarks

This guide covers how to run reproducible benchmarks with SLM Lab, including hyperparameter search methodology and best practices.

Quick Start

After installation, copy a SPEC_FILE and SPEC_NAME from the result tables on the benchmark pages.

Running Benchmarks

Local - runs on your machine (Classic Control jobs finish in minutes):

slm-lab run SPEC_FILE SPEC_NAME train

Remote - runs on a cloud GPU via dstack and auto-syncs results to HuggingFace:

source .env && slm-lab run-remote --gpu SPEC_FILE SPEC_NAME train -n NAME

Remote setup: cp .env.example .env, then set HF_TOKEN. See Remote Training for dstack configuration.
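A minimal .env sketch (the token value is a placeholder; only HF_TOKEN is shown here, and your .env.example may list additional variables):

```bash
# .env -- copied from .env.example; the token value below is a placeholder
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx
```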


Recommended: Use run-remote for MuJoCo and Atari benchmarks. Cloud GPUs are faster and cheaper than local training for longer runs.

Atari

All games share one spec file (54 games tested; 5 hard-exploration games are skipped). Use -s env=ENV to substitute the game:

source .env && slm-lab run-remote --gpu -s env=ALE/Pong-v5 slm_lab/spec/benchmark/ppo/ppo_atari.json ppo_atari train -n pong

Download Results

Trained models and metrics sync to HuggingFace. Pull them locally:
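One way to pull a synced run locally is the standard huggingface-cli, shown here as a sketch; the repo id and local directory are placeholders rather than the project's actual naming:

```bash
# Placeholder repo id and local dir -- substitute the repo linked from your run / BENCHMARKS.md
# add --repo-type dataset if the results are stored in a dataset repo
huggingface-cli download USER/RUN_REPO --local-dir data/RUN_NAME
```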

Replay Trained Model
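A sketch of replaying a pulled checkpoint, assuming the enjoy@ replay mode from classic SLM Lab carries over to the slm-lab CLI; the session name is a placeholder:

```bash
# Assumption: slm-lab accepts the classic enjoy@SESSION mode; the session name is a placeholder
slm-lab run SPEC_FILE SPEC_NAME enjoy@SPEC_NAME_t0_s0
```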


Authoritative source: BENCHMARKS.md in the repository contains exact reproduction commands, current results, and HuggingFace links.

Standardized Settings

Fair comparison requires consistent configurations across environment categories:

| Category | num_envs | max_frame | log_frequency | max_session |
|---|---|---|---|---|
| Classic Control | 4 | 2e5-3e5 | 500 | 4 |
| Box2D | 8 | 3e5 | 1000 | 4 |
| MuJoCo | 16 | 1e6-10e6 | 10000 | 4 |
| Atari | 16 | 10e6 | 10000 | 4 |
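In a spec file, these four settings map onto the env and meta sections. A minimal sketch for a Classic Control spec, assuming the usual SLM Lab spec layout (the spec name and env name are illustrative; the agent section is omitted):

```json
{
  "ppo_cartpole": {
    "env": [{"name": "CartPole-v1", "num_envs": 4, "max_frame": 200000}],
    "meta": {"log_frequency": 500, "max_session": 4, "max_trial": 1}
  }
}
```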


Three-Stage Search Process

When tuning hyperparameters or adding new environments, use this systematic approach:

| Stage | Mode | Config | Purpose |
|---|---|---|---|
| 1. ASHA | search | max_session=1, search_scheduler enabled | Wide exploration with early termination |
| 2. Multi | search | max_session=4, no search_scheduler | Validate top configs with multiple seeds |
| 3. Final | train | Best hyperparameters committed to spec | Confirmation run for benchmark table |

ASHA (Asynchronous Successive Halving) terminates unpromising trials early, focusing compute on promising configurations.
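Stage 1 uses the same CLI pattern as Quick Start, just in search mode (a sketch; run locally or via run-remote as above):

```bash
# Stage 1: wide ASHA exploration -- the spec has max_session=1 and search_scheduler enabled
slm-lab run SPEC_FILE SPEC_NAME search
```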

Stage 2: Multi-Seed Validation

After ASHA, validate top 3-5 configurations with multiple seeds (no early stopping):
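A sketch of the meta block for this stage: several sessions per trial and no search_scheduler entry, so every remaining trial runs to max_frame. The trial count and exact key placement are assumptions; mirror an existing search spec:

```json
{
  "meta": {
    "max_session": 4,
    "max_trial": 4
  }
}
```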

Single runs can be lucky; averaging 4 independent runs reveals the true performance.

Stage 3: Final Validation

Update spec defaults with best hyperparameters, then run in train mode:
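With the winning values committed as the spec defaults, the final run is just the Quick Start command in train mode:

```bash
# Stage 3: confirmation run with the tuned, committed spec
slm-lab run SPEC_FILE SPEC_NAME train
```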


Search Space Sizing

Rule of thumb: allow at least ~3-4 trials per search dimension. For example, a 4-dimension search calls for a max_trial of roughly 12-16.

| max_trial | Max Dimensions | Use Case |
|---|---|---|
| 8 | 2-3 | Focused refinement |
| 12-16 | 3-4 | Typical search |
| 20 | 5 | Wide exploration |
| 30 | 6-7 | Broad ASHA search |

High-Impact Hyperparameters

Focus on these first; they have the largest effect on performance:

| Priority | Parameter | Path | Typical Range |
|---|---|---|---|
| 1 | Learning rate | agent.net.optim_spec.lr | 1e-5 to 1e-3 |
| 2 | Discount factor | agent.algorithm.gamma | 0.98-0.999 |
| 3 | GAE lambda | agent.algorithm.lam | 0.9-0.99 |
| 4 | Entropy coefficient | agent.algorithm.entropy_coef_spec.start_val | 0.001-0.1 |
| 5 | Clip epsilon | agent.algorithm.clip_eps_spec.start_val | 0.1-0.3 |

Less impactful (fix based on successful runs): minibatch_size, training_epoch, network architecture.
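A sketch of a search block over the top two parameters, assuming SLM Lab's key__distribution search syntax; the distribution names shown (loguniform, uniform) are illustrative and may differ from what the shipped search specs use:

```json
{
  "search": {
    "agent": [{
      "net": {"optim_spec": {"lr__loguniform": [1e-5, 1e-3]}},
      "algorithm": {"gamma__uniform": [0.98, 0.999]}
    }]
  }
}
```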


Iterative narrowing: After finding good ranges, narrow the search space and re-run rather than continuing broad exploration.

Grace Period by Environment

The grace_period sets the minimum number of frames a trial must run before ASHA may terminate it:

| Environment | grace_period | Reasoning |
|---|---|---|
| Classic Control | 10000-50000 | Fast learning, quick signal |
| Box2D | 50000-100000 | Medium complexity |
| MuJoCo | 100000-1000000 | Slower learning curves |
| Atari | 500000-1000000 | Need significant training for signal |
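A sketch of where grace_period might be set, using the MuJoCo row above; treating search_scheduler as a meta-level block is an assumption, so copy the exact structure from an existing ASHA search spec:

```json
{
  "meta": {
    "max_session": 1,
    "search_scheduler": {"grace_period": 100000}
  }
}
```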

Template Specs

Template specs use ${var} placeholders for flexibility across similar environments:

| Template | Variables | Environments |
|---|---|---|
| ppo_mujoco.json | env, max_frame | All 11 MuJoCo |
| ppo_atari.json | env | All 54 Atari games |
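Variable substitution works like the Atari example above. A sketch for the MuJoCo template: the spec path follows the ppo_atari.json pattern, and passing two substitutions in one -s flag is an assumption about the CLI syntax:

```bash
# Assumption: multiple substitutions are comma-separated in a single -s flag
source .env && slm-lab run-remote --gpu -s env=HalfCheetah-v4,max_frame=1e6 \
  slm_lab/spec/benchmark/ppo/ppo_mujoco.json ppo_mujoco train -n halfcheetah
```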

MuJoCo Tips

Unified vs Individual Specs

  • ppo_mujoco: HalfCheetah, Walker2d, Humanoid, HumanoidStandup (gamma=0.99, lam=0.95)

  • ppo_mujoco_longhorizon: Reacher, Pusher (gamma=0.997, lam=0.97)

  • Individual specs: Hopper, Swimmer, Ant; each has environment-specific tuning

Common Issues

| Problem | Solution |
|---|---|
| Reward not improving | Try a higher training_iter (8-16) for more gradient updates |
| Unstable learning | Try a lower learning rate or enable clip_vloss: true |
| Large reward variance | Enable normalize_v_targets: true for value normalization |
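A sketch of where these fixes might sit in a spec, assuming they are algorithm-level keys; the placement is an assumption, so mirror whichever shipped spec already enables them:

```json
{
  "agent": [{
    "algorithm": {
      "training_iter": 8,
      "clip_vloss": true,
      "normalize_v_targets": true
    }
  }]
}
```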

Atari Tips

Lambda Variants

Different games benefit from different lambda values:

| Spec Name | Lambda | Best For |
|---|---|---|
| ppo_atari | 0.95 | Strategic games (Qbert, Seaquest) |
| ppo_atari_lam85 | 0.85 | Mixed games (MsPacman) |
| ppo_atari_lam70 | 0.70 | Action games (Breakout, Pong) |

Best practice: Test all three variants per game; use the best result.
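Trying a variant is the same launch with a different SPEC_NAME. A sketch for an action game, assuming all three variants live alongside ppo_atari in the same spec file:

```bash
# Low-lambda variant for an action game (Breakout)
source .env && slm-lab run-remote --gpu -s env=ALE/Breakout-v5 \
  slm_lab/spec/benchmark/ppo/ppo_atari.json ppo_atari_lam70 train -n breakout-lam70
```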

v5 Environment Difficulty

Gymnasium ALE v5 uses sticky actions (25% action-repeat probability), following Machado et al. (2018). This makes the environments harder than OpenAI Gym v4; expect 10-40% lower scores.

Troubleshooting

When Progress Stalls

  1. Check GPU metrics (dstack metrics <run-name>); low GPU utilization points to a bottleneck in environment stepping or a config issue (see the command sketch after this list)

  2. Compare with successful specs: review what worked for similar environments

  3. Look for patterns: the same failure across runs suggests a framework issue, not hyperparameters

  4. Research reference implementations: check CleanRL or Stable-Baselines3 configs

  5. Kill unpromising runs early: iterate faster with new approaches
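For steps 1 and 5, the relevant commands look roughly like this (dstack metrics is taken from step 1; dstack ps and dstack stop are standard dstack CLI commands listed here for convenience):

```bash
dstack ps                 # list runs and their status
dstack metrics RUN_NAME   # GPU utilization and memory for a run (step 1)
dstack stop RUN_NAME      # kill an unpromising run early (step 5)
```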

Common Mistakes

| Mistake | Fix |
|---|---|
| Too many search dimensions | Focus on 2-3 high-impact parameters per search |
| Skipping multi-seed validation | Always run max_session=4 before finalizing |
| Using search results directly | Always run a final train mode with the committed spec |
| Inconsistent settings | Verify the spec matches the standardized settings table |

Recording Results

After a successful run:

  1. Extract the final score from the run logs (steps 1-2 are sketched after this list):

  2. Pull results from HuggingFace (see Download Results above):

  3. Update spec defaults with best hyperparameters

  4. Commit spec file for reproducibility
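A rough sketch for steps 1-2; the metric name, data path, and repo id are hypothetical, so take the real names from your run's logs and BENCHMARKS.md:

```bash
# Step 1: hypothetical metric name and output path -- grep the run's session logs
grep -R "total_reward" data/SPEC_NAME*/ | tail

# Step 2: pull the synced run (placeholder repo id, as in Download Results above)
huggingface-cli download USER/RUN_REPO --local-dir data/RUN_NAME
```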

Algorithms

| Algorithm | Type | Best For | Validated Environments |
|---|---|---|---|
| REINFORCE | On-policy | Learning/teaching | Classic |
| SARSA | On-policy | Tabular-like | Classic |
| DQN/DDQN+PER | Off-policy | Discrete actions | Classic, Box2D, Atari |
| A2C | On-policy | Fast iteration | Classic, Box2D, Atari |
| PPO | On-policy | General purpose | Classic, Box2D, MuJoCo (11), Atari (54) |
| SAC | Off-policy | Continuous control | Classic, Box2D, MuJoCo |

Environments

| Category | Examples | Difficulty |
|---|---|---|
| Classic Control | CartPole, Pendulum, Acrobot | Easy |
| Box2D | LunarLander, BipedalWalker | Medium |
| MuJoCo | Hopper, HalfCheetah, Humanoid | Hard |
| Atari | Qbert, MsPacman, and 54 more | Varied |

Benchmark Spec Reference

All benchmark specs are in slm_lab/spec/benchmark/, organized by algorithm.

REINFORCE / SARSA

Simple algorithms for learning fundamentals. CartPole only.

DQN Family

Value-based algorithms for discrete action spaces.

A2C

On-policy actor-critic with synchronized updates. Two variants: GAE (Generalized Advantage Estimation) and n-step returns.

PPO

Proximal Policy Optimization; robust across all environment types.

SAC

Soft Actor-Critic; best for continuous control.

A3C (Async)

Asynchronous Advantage Actor-Critic using Hogwild!. See Async Training.

Async SAC

SAC with Hogwild! for parallel training. See Async Training.


Performance Results

For scores, training curves, and trained models, see BENCHMARKS.md in the repository and the per-environment benchmark pages.
