This guide covers how to run reproducible benchmarks with SLM Lab, including hyperparameter search methodology and best practices.
After installation, copy SPEC_FILE and SPEC_NAME from the result tables in the benchmark pages.
Running Benchmarks
Local - runs on your machine (Classic Control: minutes):
    slm-lab run SPEC_FILE SPEC_NAME train

Remote - cloud GPU via dstack, auto-syncs to HuggingFace:

    source .env && slm-lab run-remote --gpu SPEC_FILE SPEC_NAME train -n NAME

Remote setup: cp .env.example .env, then set HF_TOKEN. See Remote Training for dstack config.
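Concretely, the remote setup amounts to these two steps (a minimal sketch; the token value shown is a placeholder for your own HuggingFace access token):

```bash
cp .env.example .env
# edit .env and set your token, e.g.
# HF_TOKEN=hf_xxxxxxxx   # placeholder, use your own HuggingFace token
```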
All Atari games share one spec file (54 tested, 5 hard-exploration games skipped). Use -s env=ENV to substitute the game:
    source .env && slm-lab run-remote --gpu -s env=ALE/Pong-v5 slm_lab/spec/benchmark/ppo/ppo_atari.json ppo_atari train -n pong

Download Results
Trained models and metrics sync to HuggingFace. Pull locally:
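As a sketch, assuming the run artifacts live in a HuggingFace repo you have access to, the huggingface_hub CLI can fetch them (the repo id below is a placeholder):

```bash
# Download a synced run into the local data/ directory.
# <username>/<repo> is a placeholder for the repo your run pushed to.
huggingface-cli download <username>/<repo> --repo-type model --local-dir data/
```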
Replay Trained Model
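As a sketch, assuming the slm-lab CLI exposes SLM Lab's enjoy mode in the same MODE position as train, replaying a trained model would look something like this:

```bash
# Assumption: "enjoy" mode replays a trained model from a finished run;
# check `slm-lab run --help` for the exact mode name and any checkpoint argument.
slm-lab run SPEC_FILE SPEC_NAME enjoy
```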
Standardized Settings
Fair comparison requires consistent configurations across environment categories:
Before running: Verify spec settings match the table above. Inconsistent settings make results incomparable.
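One quick way to eyeball a spec's settings before launching is to print its env and meta blocks (a sketch; it assumes the SLM Lab convention that the spec file is keyed by SPEC_NAME at the top level, and that jq is installed):

```bash
# Print the environment settings of the ppo_atari spec for a quick sanity check.
jq '.ppo_atari.env' slm_lab/spec/benchmark/ppo/ppo_atari.json
# Session/trial counts live under meta in SLM Lab specs:
jq '.ppo_atari.meta' slm_lab/spec/benchmark/ppo/ppo_atari.json
```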
Three-Stage Search Process
When tuning hyperparameters or adding new environments, use this systematic approach:
| Stage | Settings | Purpose |
| --- | --- | --- |
| 1. ASHA Search | max_session=1, search_scheduler enabled | Wide exploration with early termination |
| 2. Multi-Seed Validation | max_session=4, no search_scheduler | Validate top configs with multiple seeds |
| 3. Final Validation | Best hyperparameters committed to spec | Confirmation run for benchmark table |
Stage 1: ASHA Search
ASHA (Asynchronous Successive Halving) terminates unpromising trials early, focusing compute on promising configurations.
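The commands in this guide use train mode; for the search stage, assuming the CLI accepts SLM Lab's search mode in the same MODE position, a run would look like:

```bash
# Stage 1 sketch: hyperparameter search with ASHA-style early termination.
# Assumes the spec's search block and search_scheduler are already configured,
# and that "search" is accepted as the MODE argument.
slm-lab run SPEC_FILE SPEC_NAME search
```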
Stage 2: Multi-Seed Validation
After ASHA, validate top 3-5 configurations with multiple seeds (no early stopping):
Single runs can be lucky; averaging 4 independent runs reveals true performance.
Stage 3: Final Validation
Update spec defaults with best hyperparameters, then run in train mode:
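The final run is the standard train command from the top of this page, now pointing at the spec with the committed hyperparameters:

```bash
# Stage 3: confirmation run with the committed spec defaults.
slm-lab run SPEC_FILE SPEC_NAME train
```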
Never use raw search results in benchmark tables. Always run a final validation with the committed spec file.
Search Space Sizing
Rule: ~3-4 trials per search dimension minimum.
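For example, a search over 3 hyperparameters should budget roughly 9-12 trials (3 dimensions × 3-4 trials each), and a 5-dimensional search roughly 15-20.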
High-Impact Hyperparameters
Focus on these first; they have the largest effect on performance:
| Priority | Parameter | Path | Typical Range |
| --- | --- | --- | --- |
| 1 | entropy_coef | agent.algorithm.entropy_coef_spec.start_val | |
| 2 | clip_eps | agent.algorithm.clip_eps_spec.start_val | |
Less impactful (fix based on successful runs): minibatch_size, training_epoch, network architecture.
Grace Period by Environment
The grace_period sets the minimum number of frames a trial must run before ASHA can terminate it:

- Fast-learning environments give a quick signal, so a short grace_period suffices
- Environments that need significant training before showing a signal require a longer grace_period
Template specs use ${var} placeholders for flexibility across similar environments:
| Template | Variables | Environments |
| --- | --- | --- |
| ppo_atari | env | Atari (54 games tested) |
Unified vs Individual Specs
- ppo_mujoco: HalfCheetah, Walker2d, Humanoid, HumanoidStandup (gamma=0.99, lam=0.95)
- ppo_mujoco_longhorizon: Reacher, Pusher (gamma=0.997, lam=0.97)
- Individual specs: Hopper, Swimmer, Ant (each has environment-specific tuning)
- Try a higher training_iter (8-16) for more gradient updates
- Try a lower learning rate or enable clip_vloss: true
- Enable normalize_v_targets: true for value normalization (see the spec-editing sketch below)
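To try the last two tweaks without hand-editing, a quick filter over the spec file works; the sketch below assumes the keys sit under agent.algorithm, as the dotted paths on this page suggest, and uses a hypothetical individual spec name and file:

```bash
# Sketch only: enable value-loss clipping and value-target normalization.
# "ppo_hopper" (spec name and file) is a placeholder; adjust to the real spec layout.
jq '.ppo_hopper.agent.algorithm.clip_vloss = true
  | .ppo_hopper.agent.algorithm.normalize_v_targets = true' \
  slm_lab/spec/benchmark/ppo/ppo_hopper.json > /tmp/ppo_hopper_tuned.json
```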
Lambda Variants
Different games benefit from different lambda values:
- Strategic games (Qbert, Seaquest)
- Action games (Breakout, Pong)
Best practice: Test all three variants per game; use the best result.
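As a sketch, testing the variants on one game is just a loop over spec names; the variant names below are placeholders, not actual specs in the repo:

```bash
# Placeholder variant spec names; substitute the real lambda-variant specs.
source .env
for spec_name in ppo_atari ppo_atari_lam_a ppo_atari_lam_b; do
  slm-lab run-remote --gpu -s env=ALE/Breakout-v5 \
    slm_lab/spec/benchmark/ppo/ppo_atari.json "$spec_name" train -n "breakout-$spec_name"
done
```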
v5 Environment Difficulty
Gymnasium ALE v5 uses sticky actions (25% repeat probability) per Machado et al. 2018. This makes environments harder than OpenAI Gym v4; expect 10-40% lower scores.
Troubleshooting
When Progress Stalls
- Check GPU metrics (dstack metrics <run-name>): low GPU utilization means a bottleneck in env stepping or a config issue
- Compare with successful specs: review what worked for similar environments
- Look for patterns: the same failure across runs suggests a framework issue, not hyperparameters
- Kill unpromising runs early: iterate faster on new approaches
Common Mistakes
| Mistake | Fix |
| --- | --- |
| Too many search dimensions | Focus on 2-3 high-impact parameters per search |
| Skipping multi-seed validation | Always run max_session=4 before finalizing |
| Using search results directly | Always run final train mode with the committed spec |
| Inconsistent settings | Verify the spec matches the standardized settings table |
Recording Results
After a successful run:
1. Extract the final score from the logs (see the sketch below)
2. Update spec defaults with the best hyperparameters
3. Commit the spec file for reproducibility
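The exact log format depends on the SLM Lab version; as a sketch, the final score is usually a moving-average return reported near the end of the run logs, which can be pulled out with grep (the metric name and data path below are assumptions):

```bash
# Assumptions: runs write under data/<run_dir>/ and report a moving-average
# return named something like "total_reward_ma"; adjust both to your logs.
grep -R "total_reward_ma" data/<run_dir>/ | tail -n 5
```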
| Algorithm | Type | Best For | Validated Environments |
| --- | --- | --- | --- |
| PPO | | | Classic, Box2D, MuJoCo (11), Atari (54) |
| Category | Examples | Difficulty | Docs |
| --- | --- | --- | --- |
| Classic Control | CartPole, Pendulum, Acrobot | | |
| Box2D | LunarLander, BipedalWalker | | |
| MuJoCo | Hopper, HalfCheetah, Humanoid | | |
| Atari | Qbert, MsPacman, and 54 more | | |
Benchmark Spec Reference
All benchmark specs are in slm_lab/spec/benchmark/, organized by algorithm.
REINFORCE / SARSA
Simple algorithms for learning fundamentals. CartPole only.
DQN
Value-based algorithms for discrete action spaces.
A2C
On-policy actor-critic with synchronized updates. Two variants: GAE (Generalized Advantage Estimation) and n-step returns.
| Environment | A2C GAE | A2C n-step |
| --- | --- | --- |
PPO
Proximal Policy Optimization: robust across all environment types.
SAC
Soft Actor-Critic: best for continuous control.
A3C
Asynchronous Advantage Actor-Critic using Hogwild!. See Async Training.
Async SAC
SAC with Hogwild! for parallel training. See Async Training.
For scores, training curves, and trained models: