Public Benchmark Data

Overview

All SLM Lab benchmark results are publicly available on HuggingFace for reproducibility and comparison.

Each experiment includes:

  • Trained models - PyTorch checkpoints (*_ckpt-best_net_model.pt)

  • Training curves - Full learning history (*_session_df_{train,eval}.csv)

  • Specs - Exact configurations for reproduction (*_spec.json)

  • Graphs - Plotly visualizations (PNG and HTML)

Setup

To access the public benchmarks, set HF_REPO in your .env file:
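For example (the repo ID below is a placeholder; substitute the actual public benchmark repo ID):

```bash
# .env — HF_REPO points at the public benchmark repo (placeholder value shown)
HF_REPO=SLM-Lab/benchmark
```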

Then source it before running commands:
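One way to do this so the variables are also visible to child processes (python, huggingface-cli, etc.):

```bash
# Load .env into the current shell and export its variables
set -a; source .env; set +a
```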


No token needed for read-only access. HF_TOKEN is only required for uploading to your own repo.

Accessing Results

List Available Experiments

Shows all experiments on HuggingFace.
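The exact lab command may vary by version; a generic read-only alternative uses huggingface_hub directly (this assumes one top-level folder per experiment):

```bash
# List top-level experiment folders in the public benchmark repo (no token needed)
python3 - <<'EOF'
import os
from huggingface_hub import list_repo_files

repo = os.environ["HF_REPO"]  # set via .env as shown in Setup
files = list_repo_files(repo)  # pass repo_type="dataset" if the repo is a dataset repo
print("\n".join(sorted({f.split("/")[0] for f in files if "/" in f})))
EOF
```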

Download an Experiment

Downloads to data/ppo_hopper_*/ (see the sketch after this list), including:

  • Model checkpoints (model/*_ckpt-best_net_model.pt)

  • Training metrics (info/*_session_df_{train,eval}.csv)

  • Saved spec (*_spec.json)
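A generic way to fetch a single experiment folder is the huggingface_hub CLI (requires huggingface_hub 0.19+; the folder name below is illustrative, pick one from the listing above):

```bash
# Download one experiment folder into data/ (no token needed for the public repo)
huggingface-cli download "$HF_REPO" \
  --include "ppo_hopper_2024_01_01_000000/*" \
  --local-dir data/
```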

Replay a Trained Agent
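The replay command depends on the lab version; with the v4-style CLI, the enjoy mode replayed a saved session roughly as follows (all paths and names are illustrative, and it is an assumption that this mode carries over unchanged):

```bash
# v4-style replay of the best checkpoint from trial 0, session 0 (illustrative names)
python run_lab.py \
  data/ppo_hopper_2024_01_01_000000/ppo_hopper_spec.json \
  ppo_hopper \
  enjoy@ppo_hopper_t0_s0
```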

Browse on HuggingFace

Direct links to experiment folders (example):

See the benchmark pages for complete lists:

v5 Benchmark Coverage

Environments Tested

| Category | Environments | Algorithms |
| --- | --- | --- |
| Classic Control | CartPole-v1, Acrobot-v1, Pendulum-v1 | REINFORCE, SARSA, DQN, DDQN+PER, A2C, PPO, SAC |
| Box2D | LunarLander-v3 (discrete & continuous) | DQN, DDQN+PER, A2C, PPO, SAC |
| MuJoCo | 11 environments (Hopper, HalfCheetah, etc.) | PPO |
| Atari | 54 games | PPO (3 lambda variants) |

Benchmark Pages

| Benchmark Page | Environments |
| --- | --- |
| Classic + Box2D | CartPole, Acrobot, Pendulum, LunarLander |
| MuJoCo | Hopper, HalfCheetah, Humanoid, etc. |
| Atari | 54 games |

Methodology

How Scores Are Reported

Results show Trial-level performance:

  1. Trial = 4 Sessions with different random seeds

  2. Session = One complete training run

  3. Score = Final 100-checkpoint moving average (total_reward_ma)

The trial score is the mean across the 4 sessions, which averages out seed-to-seed variance and gives a statistically meaningful result.
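As a rough illustration of this aggregation (the file layout and the total_reward_ma column are assumptions based on the artifact names above):

```bash
# Recompute a trial score from the eval-session CSVs of a downloaded experiment
python3 - <<'EOF'
import glob
import pandas as pd

finals = []
for path in sorted(glob.glob("data/ppo_hopper_*/info/*_session_df_eval.csv")):
    df = pd.read_csv(path)
    finals.append(df["total_reward_ma"].iloc[-1])  # final 100-checkpoint moving average

print(f"{len(finals)} sessions, trial score = {sum(finals) / len(finals):.2f}")
EOF
```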

Training Details

| Setting | Value |
| --- | --- |
| Sessions per trial | 4 (different random seeds) |
| Checkpoint frequency | Varies by env (see below) |
| Moving average window | 100 checkpoints |
| Hardware | Cloud GPUs (L4/A10G via dstack) |

Environment Settings

Standardized settings for fair comparison across environment categories:

| Category | num_envs | max_frame | log_frequency | ASHA grace_period |
| --- | --- | --- | --- | --- |
| Classic Control | 4 | 2e5-3e5 | 500 | 1e4 |
| Box2D | 8 | 3e5 | 1000 | 5e4 |
| MuJoCo | 16 | 4e6-10e6 | 10000 | 1e5-1e6 |
| Atari | 16 | 10e6 | 10000 | 5e5 |

The grace_period is the minimum number of frames ASHA must observe before it can terminate an underperforming trial. Set it high enough to capture a meaningful learning signal (typically 5-10% of max_frame).

Hardware Requirements

| Category | GPU Required | Typical Runtime | Recommendation |
| --- | --- | --- | --- |
| Classic Control | No | Minutes | Local CPU is fine |
| Box2D | Optional | 10-30 min | Local or remote |
| MuJoCo | Yes | 1-4 hours | Use run-remote --gpu |
| Atari | Yes | 2-3 hours | Use run-remote --gpu |


Cloud GPUs are recommended for MuJoCo and Atari: L4/A10G instances via dstack are faster and often cheaper than local training. See Remote Training for setup.

Contributing Benchmark Results

When adding or updating benchmarks:

  1. Audit spec settings: Ensure your spec.json matches the Settings line in the benchmark table (a quick spot-check is sketched after this list)

  2. Run and commit: Execute the benchmark, then commit the spec file to the repo

  3. Record scores: Extract total_reward_ma from the logs and add a link to the HuggingFace experiment folder

  4. Generate plots: Use slm-lab plot -t "EnvName" -f folder1,folder2,...
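For step 1, a quick spot-check of the standardized settings in a spec file (the path is illustrative):

```bash
# Print the spec lines that define the standardized benchmark settings
grep -nE '"(num_envs|max_frame|log_frequency|grace_period)"' data/ppo_hopper_*/*_spec.json
```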


Only use final validation runs (not search results) for benchmark tables. Search is for hyperparameter discovery; validation confirms performance with the committed specs.

Reproducibility

Every experiment can be exactly reproduced from its saved spec:

For the exact code version, check out the git SHA recorded in the spec file.
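A sketch of that workflow, assuming the SHA is stored under meta.git_sha in the spec (the key path is an assumption):

```bash
# Check out the exact code version recorded in a downloaded spec, then rerun it
SPEC=$(ls data/ppo_hopper_*/*_spec.json | head -n 1)
SHA=$(python3 -c "import json, sys; print(json.load(open(sys.argv[1]))['meta']['git_sha'])" "$SPEC")
git checkout "$SHA"
# Rerun the experiment with the same spec file to reproduce the reported result
```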

Historical Data

v4 Results (Google Drive)

v4 benchmarks used OpenAI Gym and Roboschool (both now deprecated). The results remain available on Google Drive for historical reference:


Using Your Own HuggingFace Repo

Set up credentials in .env:
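For example (both values are placeholders):

```bash
# .env — use your own HuggingFace access token and repo ID
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
HF_REPO=your-username/slm-lab-results
```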

Then push your results:
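The lab can also upload automatically (see Contributing Benchmarks below); as a generic manual alternative, the huggingface_hub CLI can push a folder directly (the experiment folder name is illustrative):

```bash
# Upload one experiment folder to your own repo (write access requires HF_TOKEN)
set -a; source .env; set +a
huggingface-cli upload "$HF_REPO" \
  data/ppo_hopper_2024_01_01_000000 \
  ppo_hopper_2024_01_01_000000 \
  --token "$HF_TOKEN"
```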

Terminology

| Abbreviation | Meaning |
| --- | --- |
| A2C | Advantage Actor-Critic |
| DDQN | Double Deep Q-Network |
| DQN | Deep Q-Network |
| GAE | Generalized Advantage Estimation |
| PER | Prioritized Experience Replay |
| PPO | Proximal Policy Optimization |
| SAC | Soft Actor-Critic |
| CER | Combined Experience Replay |
| MA | Moving Average |

Contributing Benchmarks

To contribute new benchmark results:

  1. Run experiments with the --upload-hf flag (or source .env for auto-upload)

  2. Ensure HF_TOKEN and HF_REPO are configured

  3. Results automatically upload to your HuggingFace repo

For official SLM Lab benchmarks, see docs/BENCHMARKS.md in the code repository.
