Public Benchmark Data

Overview

All SLM Lab benchmark results are publicly available on HuggingFace:

Each experiment includes:

  • Trained models — PyTorch checkpoints (`*_ckpt-best_net_model.pt`)

  • Training curves — Full learning history (`*_session_df_{train,eval}.csv`)

  • Specs — Exact configurations for reproduction (`*_spec.yaml`)

  • Graphs — Plotly visualizations (PNG and HTML)

Algorithm Coverage

Which algorithms are benchmarked in each environment category. ✓ = benchmarked.

| Algorithm | Classic Control | Box2D | MuJoCo | Atari | Playground |
|-----------|-----------------|-------|--------|-------|------------|
| REINFORCE | ✓ | | | | |
| SARSA | ✓ | | | | |
| DQN | ✓ | ✓ | | | |
| DDQN+PER | ✓ | ✓ | | | |
| A2C | ✓ | ✓ | | ✓ | |
| PPO | ✓ | ✓ | ✓ | ✓ | ✓ |
| SAC | ✓ | ✓ | ✓ | ✓ | |
| CrossQ | ✓ | ✓ | ✓ | ✓ | |

| Category | Environments | Algorithms |
|----------|--------------|------------|
| Classic Control | CartPole-v1, Acrobot-v1, Pendulum-v1 | REINFORCE, SARSA, DQN, DDQN+PER, A2C, PPO, SAC, CrossQ |
| Box2D | LunarLander-v3 (discrete & continuous) | DQN, DDQN+PER, A2C, PPO, SAC, CrossQ |
| MuJoCo | 11 environments (Hopper, HalfCheetah, Humanoid, etc.) | PPO, SAC, CrossQ |
| Atari | 57 games (6 hard-exploration skipped) | A2C, PPO, SAC, CrossQ |
| Playground | 54 MuJoCo Playground environments (DM Control, Robots, Manipulation) | PPO |

Detailed Results

| Benchmark | Environments |
|-----------|--------------|
| Classic + Box2D | CartPole, Acrobot, Pendulum, LunarLander |
| MuJoCo | Hopper, HalfCheetah, Humanoid, etc. |
| Atari | 57 games |
| Playground | 54 MuJoCo Playground envs (DM Control, Locomotion, Manipulation) |

Accessing Results

List and Download

> Note: No token needed for read-only access. `HF_TOKEN` is only required for uploading to your own repo.
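A minimal sketch of the read-only flow using the `huggingface_hub` client. The repo id below is a placeholder, not the actual benchmark repo; the filename patterns come from the artifact list above.

```python
# Read-only listing/download of benchmark artifacts. No token is needed
# for public repos; "org/slm-lab-benchmarks" is a placeholder repo id.
from huggingface_hub import HfApi, snapshot_download


def fetch_experiment(repo_id: str = "org/slm-lab-benchmarks") -> str:
    """List repo contents, then download specs and training curves."""
    api = HfApi()  # anonymous client: read-only access
    for path in api.list_repo_files(repo_id):
        print(path)
    # grab only the reproducibility artifacts, not the large checkpoints
    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=["*_spec.yaml", "*_session_df_*.csv"],
    )
```

`allow_patterns` keeps the download small by skipping model checkpoints and graphs.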

Replay a Trained Agent
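Replaying uses SLM Lab's own eval/enjoy mode; as a sketch of the checkpoint format itself, the `*_ckpt-best_net_model.pt` files are ordinary PyTorch state dicts. The two-layer MLP here is a stand-in for the trained policy/value network:

```python
import torch
import torch.nn as nn

# Stand-in network; real checkpoints store the trained agent's weights.
net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
torch.save(net.state_dict(), "demo_ckpt-best_net_model.pt")

# Load the checkpoint and restore the weights before running the agent.
state_dict = torch.load("demo_ckpt-best_net_model.pt", map_location="cpu")
net.load_state_dict(state_dict)
```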

Browse on HuggingFace

Direct links to experiment folders (example):

Methodology

How Scores Are Reported

  1. Trial = 4 sessions with different random seeds

  2. Session = one complete training run

  3. Score = final 100-checkpoint moving average (`total_reward_ma`)

The trial score is the mean across the 4 sessions.
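The score computation above can be sketched from a session's checkpoint rewards (the `total_reward` column name is an assumption; the CSVs store one row per checkpoint):

```python
import pandas as pd

# Stand-in for df["total_reward"] read from a *_session_df_eval.csv file.
rewards = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])

# 100-checkpoint moving average; min_periods=1 so early checkpoints are defined.
ma = rewards.rolling(window=100, min_periods=1).mean()
score = ma.iloc[-1]  # final moving average = the reported session score
print(score)  # mean of all 5 stand-in values: 14.0
```

The trial score is then the plain mean of the 4 session scores.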

Environment Settings

Standardized settings for fair comparison:

| Category | num_envs | max_frame | log_frequency | ASHA grace_period |
|----------|----------|-----------|---------------|-------------------|
| Classic Control | 4 | 2e5-3e5 | 500 | 1e4 |
| Box2D | 8 | 3e5 | 1000 | 5e4 |
| MuJoCo | 16 | 1e6-10e6 | 1e4 | 1e5-1e6 |
| Atari | 16 | 10e6 | 1e4 | 5e5 |
| Playground | 2048 | 100e6 | 1e4 | — |

Hardware Requirements

| Category | GPU Required | Typical Runtime | Recommendation |
|----------|--------------|-----------------|----------------|
| Classic Control | No | Minutes | Local CPU is fine |
| Box2D | Optional | 10-30 min | Local or remote |
| MuJoCo | Yes | 1-4 hours | Use `run-remote --gpu` |
| Atari | Yes | 2-3 hours | Use `run-remote --gpu` |
| Playground | Yes (CUDA) | 1-6 hours | Use `run-remote --gpu` |

> Note: Cloud GPUs recommended for MuJoCo and Atari. Cloud L4/A10G via dstack is faster and often cheaper than local training. See Remote Training for setup.

Contributing Benchmarks

Follow these steps when adding or updating benchmark results.

1. Audit Spec Settings

Ensure your `spec.yaml` matches the Settings line in the benchmark tables. Example: `max_frame 3e5 | num_envs 4 | max_session 4 | log_frequency 500`.
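Such an audit can be scripted; the key paths inside the spec below are illustrative stand-ins, so adjust them to the real `spec.yaml` layout:

```python
# Audit a spec against the benchmark-table settings line
# "max_frame 3e5 | num_envs 4 | max_session 4 | log_frequency 500".
import yaml

expected = {"max_frame": 3e5, "num_envs": 4, "max_session": 4, "log_frequency": 500}

spec_text = """
env:
  num_envs: 4
  max_frame: 300000
meta:
  max_session: 4
  log_frequency: 500
"""  # stand-in for open("my_spec.yaml").read(); key paths are assumptions

spec = yaml.safe_load(spec_text)
found = {
    "max_frame": spec["env"]["max_frame"],
    "num_envs": spec["env"]["num_envs"],
    "max_session": spec["meta"]["max_session"],
    "log_frequency": spec["meta"]["log_frequency"],
}
mismatches = {k: (found[k], v) for k, v in expected.items() if found[k] != v}
assert not mismatches, f"spec drifted from benchmark settings: {mismatches}"
```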

2. Run Benchmark and Commit Specs

Always commit the spec.yaml file to the repo after a successful run.

3. Record Scores and Plots

  • Extract total_reward_ma from logs (trial_metrics)

  • Add HuggingFace folder link to the benchmark table

  • Generate plots:

When an algorithm fails to reach its target, run search before the final validation:

| Stage | Mode | Config | Purpose |
|-------|------|--------|---------|
| ASHA | search | `max_session=1`, `search_scheduler` enabled | Wide exploration with early stopping |
| Multi | search | `max_session=4`, no `search_scheduler` | Robust validation with averaging |
| Validate | train | Final spec | Confirmation run |

Search budget: ~3-4 trials per search dimension (8 trials for 2-3 dims, 16 for 3-4 dims).
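The staged settings might look like the following spec fragment. This is purely illustrative; consult SLM Lab's own spec files for the actual `search_scheduler` schema:

```yaml
# Stage 1 (ASHA): wide search, 1 session per trial, early stopping on.
# Field names and nesting here are assumptions, not the real schema.
meta:
  max_session: 1
  max_trial: 16
  search_scheduler:
    name: ASHA
    grace_period: 5e4   # see the ASHA grace_period column above

# Stage 2 (Multi): re-run top candidates with 4 seeds, no scheduler.
# meta:
#   max_session: 4
```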


Reproducibility

Every experiment can be exactly reproduced:

For the exact code version, check out the git SHA recorded in the spec file.
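A sketch of pulling that SHA out of a downloaded spec; the `git_sha` key name and its location under `meta` are assumptions, since the docs only say the SHA is recorded in the spec file:

```python
# Extract the recorded git SHA so the exact code version can be checked out.
import yaml

spec_text = """
meta:
  git_sha: 0a1b2c3d4e5f60718293a4b5c6d7e8f901234567
"""  # stand-in for open("downloaded_spec.yaml").read(); key name assumed

spec = yaml.safe_load(spec_text)
sha = spec["meta"]["git_sha"]
print(f"git checkout {sha}")  # run inside a clone of the SLM Lab repo
```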

Using Your Own HuggingFace Repo
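A sketch of the upload side with `huggingface_hub`. Unlike reads, this requires `HF_TOKEN` (see the note above); the repo id and folder path are placeholders:

```python
import os

from huggingface_hub import HfApi


def upload_experiment(folder: str,
                      repo_id: str = "your-name/slm-lab-benchmarks") -> None:
    """Push a finished experiment folder to your own HuggingFace repo."""
    api = HfApi(token=os.environ["HF_TOKEN"])  # write token required
    api.create_repo(repo_id, exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="model")
```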

Historical Data (v4)

v4 benchmarks used OpenAI Gym and Roboschool (both deprecated). Available for historical reference:


Terminology

| Abbreviation | Meaning |
|--------------|---------|
| A2C | Advantage Actor-Critic |
| CrossQ | Cross-batch Normalized Q-learning |
| DDQN | Double Deep Q-Network |
| DQN | Deep Q-Network |
| GAE | Generalized Advantage Estimation |
| MA | Moving Average |
| MJWarp | Warp-accelerated MJX (GPU physics) |
| PER | Prioritized Experience Replay |
| PPO | Proximal Policy Optimization |
| SAC | Soft Actor-Critic |
