🏃 Continuous Benchmark

MuJoCo Benchmark Results (v5)

SLM Lab v5 validates PPO on Gymnasium MuJoCo environments. MuJoCo (Multi-Joint dynamics with Contact) provides physics simulation for continuous control tasks ranging from simple pendulums to complex humanoid locomotion.

Results below are from January 2026 benchmark reruns using MuJoCo v5 environments.

All trained models and metrics are publicly available on HuggingFace.

Methodology

Results show Trial-level performance:

  1. Trial = 4 Sessions with different random seeds

  2. Session = One complete training run

  3. Score = Final 100-checkpoint moving average (total_reward_ma)

The trial score is the mean across the 4 sessions, which makes the reported number less sensitive to any single random seed.
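The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not SLM Lab's implementation: the function names `session_score` and `trial_score` are hypothetical, the data is synthetic, and it assumes each session is a plain list of per-checkpoint total rewards from which the final 100-checkpoint average is computed.

```python
from statistics import mean


def session_score(total_rewards, window=100):
    """Final moving-average value: mean of the last `window` checkpoints."""
    return mean(total_rewards[-window:])


def trial_score(sessions, window=100):
    """Trial score: mean of the per-session scores."""
    return mean(session_score(rewards, window) for rewards in sessions)


# Synthetic example: 4 sessions (seeds 0..3), 400 checkpoints each.
# These numbers are made up for illustration, not benchmark results.
sessions = [[float(seed + i) for i in range(400)] for seed in range(4)]
score = trial_score(sessions)
print(score)
```

Averaging per session first and then across sessions gives every seed equal weight in the trial score.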

Standardized Settings

Category | num_envs | max_frame | log_frequency | ASHA grace_period
MuJoCo | 16 | 4e6-10e6 | 10000 | 1e5-1e6

The grace_period is the minimum frames before ASHA early stopping can terminate underperforming trials.
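To make the role of grace_period concrete, here is a small sketch of how ASHA-style schedulers typically place their decision points ("rungs"): the first rung sits at grace_period and later rungs grow geometrically. The helper names and the reduction factor of 3 are assumptions for illustration. The source states neither the reduction factor nor the scheduler configuration SLM Lab uses.

```python
def asha_rungs(grace_period, max_frame, reduction_factor=3):
    """Frames at which an ASHA-style scheduler may compare and stop trials.

    reduction_factor=3 is an assumed value, not taken from the benchmark.
    """
    rungs = []
    frame = grace_period
    while frame < max_frame:
        rungs.append(frame)
        frame *= reduction_factor
    return rungs


def can_stop_early(frames_trained, grace_period):
    """A trial is protected from early stopping until grace_period frames."""
    return frames_trained >= grace_period


# Lower ends of the ranges in the table above: grace_period 1e5, max_frame 4e6.
print(asha_rungs(100_000, 4_000_000))
print(can_stop_early(50_000, 100_000))
```

A larger grace_period gives slow-starting configurations more time before they can be cut, at the cost of spending more frames on weak ones.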

Running Benchmarks

Remote (recommended) - cloud GPU via dstack, auto-syncs to HuggingFace:

Remote setup: cp .env.example .env then set HF_TOKEN. See Remote Training for dstack config.

Local - runs on your machine (requires a decent GPU; takes 1-4 hours):

Download and Replay


Results

January 2026 Rerun: SAC benchmarks are omitted due to compute constraints (off-policy algorithms require significantly more resources). PPO results cover all 11 MuJoCo environments.

Hopper-v5

Docs | State: Box(11) | Action: Box(3) | Target: ~2000

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Algorithm | Status | MA | Spec | HuggingFace
Hopper-v5
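The settings lines can be related to the 100-checkpoint score window with some simple arithmetic. The sketch below assumes one checkpoint is logged every log_frequency frames, which the source implies but does not state. `checkpoint_budget` is a hypothetical helper name.

```python
def checkpoint_budget(max_frame, log_frequency, ma_window=100):
    """Return (number of checkpoints, frames covered by the score window,
    fraction of training covered by the score window).

    Assumes one checkpoint per `log_frequency` frames.
    """
    n_checkpoints = int(max_frame // log_frequency)
    frames_in_window = int(ma_window * log_frequency)
    return n_checkpoints, frames_in_window, frames_in_window / max_frame


# Hopper-v5 settings: max_frame 4e6, log_frequency 1e4
print(checkpoint_budget(4e6, 1e4))
# Environments trained for 10e6 frames with the same log_frequency
print(checkpoint_budget(10e6, 1e4))
```

Under this assumption the score window spans the final 1e6 frames, which is the last quarter of a 4e6-frame run and the last tenth of a 10e6-frame run.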

HalfCheetah-v5

Docs | State: Box(17) | Action: Box(6) | Target: >5000

Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4

HalfCheetah-v5

Walker2d-v5

Docs | State: Box(17) | Action: Box(6) | Target: >3500

Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Walker2d-v5

Ant-v5

Docs | State: Box(105) | Action: Box(8) | Target: >2000

Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Algorithm | Status | MA | Spec | HuggingFace
Ant-v5

Swimmer-v5

Docs | State: Box(8) | Action: Box(2) | Target: >200

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Algorithm | Status | MA | Spec | HuggingFace
Swimmer-v5

Reacher-v5

Docs | State: Box(10) | Action: Box(2) | Target: >-10

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Reacher-v5

Pusher-v5

Docs | State: Box(23) | Action: Box(7) | Target: >-50

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Pusher-v5

InvertedPendulum-v5

Docs | State: Box(4) | Action: Box(1) | Target: ~1000

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

InvertedPendulum-v5

InvertedDoublePendulum-v5

Docs | State: Box(9) | Action: Box(1) | Target: ~8000

Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4

InvertedDoublePendulum-v5

Humanoid-v5

Docs | State: Box(348) | Action: Box(17) | Target: >1000

Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Humanoid-v5

HumanoidStandup-v5

Docs | State: Box(348) | Action: Box(17) | Target: >100k

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

HumanoidStandup-v5

Legend: ✅ Solved | ⚠️ Close (>80%) | ❌ Failed
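One way to read the legend is as a threshold rule on the trial score relative to the target. The sketch below is an interpretation, not the project's actual rule: it assumes "Close (>80%)" means the score exceeds 80% of the target, and it only handles positive targets, since the 80% rule is ambiguous for negative ones such as Reacher-v5 and Pusher-v5. The function name and the example scores are hypothetical.

```python
def benchmark_status(score, target, close_ratio=0.8):
    """Classify a trial score against a positive target.

    Assumed interpretation of the legend: Solved if score >= target,
    Close if score > close_ratio * target, otherwise Failed.
    """
    if target <= 0:
        raise ValueError("this sketch only handles positive targets")
    if score >= target:
        return "Solved"
    if score > close_ratio * target:
        return "Close"
    return "Failed"


# Made-up scores against the Hopper-v5 target of about 2000.
for example_score in (2100, 1700, 1000):
    print(example_score, benchmark_status(example_score, 2000))
```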


Historical Results (v4)

Roboschool Results (v4)

Env. \ Alg. | A2C (GAE) | A2C (n-step) | PPO | SAC
RoboschoolAnt | 787 | 1396 | 1843 | 2915
RoboschoolHalfCheetah | 712 | 439 | 1960 | 2497
RoboschoolHopper | 710 | 285 | 2042 | 2045
RoboschoolInvertedDoublePendulum | 996 | 4410 | 8076 | 8085
RoboschoolInvertedPendulum | 995 | 978 | 986 | 941
RoboschoolReacher | 12.9 | 10.16 | 19.51 | 19.99
RoboschoolWalker2d | 280 | 220 | 1660 | 1894
RoboschoolHumanoid | 99.31 | 54.58 | 2388 | 2621*

Episode score at the end of training. Reported scores are the average over the last 100 checkpoints, averaged over 4 Sessions. Results marked with * required 50M-100M frames using async SAC.
