๐ŸƒContinuous Benchmark

MuJoCo Benchmark Results

SLM Lab v5.2 validates PPO, SAC, and CrossQ on Gymnasium MuJoCo environments. MuJoCo (Multi-Joint dynamics with Contact) provides physics simulation for continuous control tasks ranging from simple pendulums to complex humanoid locomotion.

Results below are from January–March 2026 benchmark runs using MuJoCo v5 environments.

All trained models and metrics are publicly available on HuggingFace.

Methodology

Results show Trial-level performance:

  1. Trial = 4 Sessions with different random seeds

  2. Session = One complete training run

  3. Score = Final 100-checkpoint moving average (total_reward_ma)

The trial score is the mean over the 4 sessions' final moving averages, averaging out seed-dependent variance.
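As a sketch of this scoring scheme (function and variable names here are illustrative, not SLM Lab's API):

```python
import statistics

def session_score(checkpoint_rewards, window=100):
    """Final moving average over the last `window` checkpoints (total_reward_ma)."""
    return statistics.mean(checkpoint_rewards[-window:])

def trial_score(sessions, window=100):
    """Trial score = mean of the sessions' final moving averages."""
    return statistics.mean(session_score(s, window) for s in sessions)

# Toy example: 4 seeds, each with a slowly rising reward trace
sessions = [[100 + i + 0.1 * t for t in range(500)] for i in range(4)]
print(trial_score(sessions))
```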

Standardized Settings

| Category | num_envs | max_frame | log_frequency | ASHA grace_period |
| --- | --- | --- | --- | --- |
| MuJoCo | 16 | 4e6–10e6 | 10000 | 1e5–1e6 |

The grace_period is the minimum number of frames a trial must run before ASHA early stopping may terminate it for underperforming.
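For intuition, a minimal ASHA-style stopping rule might look like the following (a simplification for illustration, not SLM Lab's scheduler; `cull_fraction` is a hypothetical parameter):

```python
def should_stop(frames, score, rung_scores, grace_period=1e5, cull_fraction=0.5):
    """ASHA-style check (simplified): never stop before grace_period;
    afterwards, stop if score falls in the bottom fraction of trials at this rung."""
    if frames < grace_period:
        return False  # still within the grace period, keep training
    cutoff = sorted(rung_scores)[int(len(rung_scores) * cull_fraction)]
    return score < cutoff

# A below-median trial is kept before grace_period, culled after it
print(should_stop(5e4, score=10, rung_scores=[10, 50, 90, 130]))  # False: grace period
print(should_stop(2e5, score=10, rung_scores=[10, 50, 90, 130]))  # True: bottom half
```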

Algorithms: PPO, SAC, and CrossQ. Network: MLP [256,256] with orthogonal initialization. PPO uses tanh activations; SAC and CrossQ use relu. CrossQ applies Batch Renormalization in its critics and has no target networks.

Note on frame budgets: SAC uses a higher update-to-data ratio, making it more sample-efficient but slower per frame than PPO (1-4M frames vs. PPO's 4-10M). CrossQ uses UTD=1 (like PPO) but eliminates target-network overhead, reaching ~700 fps; its frame budgets (3-7.5M) reflect this speed advantage. Scores may still be improving at the frame cutoff.


Spec Files

There is one spec file per algorithm, covering all environments: each file defines a base config shared via YAML anchors, with per-env overrides:

| SPEC_NAME | Envs | Key Config |
| --- | --- | --- |
| ppo_mujoco_arc | HalfCheetah, Walker, Humanoid, HumanoidStandup | Base: gamma=0.99, lam=0.95, lr=3e-4 |
| ppo_mujoco_longhorizon_arc | Reacher, Pusher | gamma=0.997, lam=0.97, lr=2e-4, entropy=0.001 |
| ppo_{env}_arc | Ant, Hopper, Swimmer, IP, IDP | Per-env tuned (gamma, lam, lr) |
| sac_mujoco_arc | (generic, use with -s flags) | Base: gamma=0.99, iter=4, lr=3e-4, [256,256] |
| sac_{env}_arc | All 11 envs | Per-env tuned (iter, gamma, lr, net size) |
| crossq_mujoco | (generic base) | Base: gamma=0.99, iter=1, lr=1e-3, policy_delay=3 |
| crossq_{env} | All 11 envs | Per-env tuned (critic width, actor LN) |
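To see how the anchor-plus-override pattern behaves, here is a toy YAML spec resolved with PyYAML (the keys below are simplified stand-ins, not the exact spec schema):

```python
import yaml  # PyYAML; safe_load supports YAML merge keys

SPEC = """
base: &base
  gamma: 0.99
  lam: 0.95
  lr: 3.0e-4
ppo_halfcheetah:
  <<: *base                 # inherit the shared base unchanged
ppo_reacher:
  <<: *base
  gamma: 0.997              # per-env override wins over the anchor
  lr: 2.0e-4
"""

spec = yaml.safe_load(SPEC)
print(spec["ppo_halfcheetah"]["gamma"])  # 0.99 (inherited)
print(spec["ppo_reacher"]["gamma"])      # 0.997 (overridden)
```

Explicit keys always take precedence over merged (`<<`) keys, so only the values listed under a variant differ from the base.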

Running Benchmarks

Reproduce: Copy SPEC_NAME and MAX_FRAME from the table below.

| ENV | SPEC_NAME | MAX_FRAME |
| --- | --- | --- |
| Ant-v5 | ppo_ant_arc | 10e6 |
| Ant-v5 | sac_ant_arc | 2e6 |
| Ant-v5 | crossq_ant | 3e6 |
| HalfCheetah-v5 | ppo_mujoco_arc | 10e6 |
| HalfCheetah-v5 | sac_halfcheetah_arc | 4e6 |
| Hopper-v5 | ppo_hopper_arc | 4e6 |
| Hopper-v5 | sac_hopper_arc | 3e6 |
| Humanoid-v5 | ppo_mujoco_arc | 10e6 |
| Humanoid-v5 | sac_humanoid_arc | 1e6 |
| HumanoidStandup-v5 | ppo_mujoco_arc | 4e6 |
| HumanoidStandup-v5 | sac_humanoid_standup_arc | 1e6 |
| InvertedDoublePendulum-v5 | ppo_inverted_double_pendulum_arc | 10e6 |
| InvertedDoublePendulum-v5 | sac_inverted_double_pendulum_arc | 2e6 |
| InvertedPendulum-v5 | ppo_inverted_pendulum_arc | 4e6 |
| InvertedPendulum-v5 | sac_inverted_pendulum_arc | 2e6 |
| Pusher-v5 | ppo_mujoco_longhorizon_arc | 4e6 |
| Pusher-v5 | sac_pusher_arc | 1e6 |
| Reacher-v5 | ppo_mujoco_longhorizon_arc | 4e6 |
| Reacher-v5 | sac_reacher_arc | 1e6 |
| Swimmer-v5 | ppo_swimmer_arc | 4e6 |
| Swimmer-v5 | sac_swimmer_arc | 2e6 |
| Walker2d-v5 | ppo_mujoco_arc | 10e6 |
| Walker2d-v5 | sac_walker2d_arc | 3e6 |
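When copying MAX_FRAME, note that together with log_frequency it determines how many checkpoints a run logs and how much of training the final 100-checkpoint moving average covers (illustrative arithmetic; function name is not part of the lab):

```python
def checkpoint_stats(max_frame, log_frequency=1e4, ma_window=100):
    """Number of logged checkpoints, and the fraction of training
    spanned by the final 100-checkpoint moving average."""
    n_checkpoints = int(max_frame / log_frequency)
    return n_checkpoints, ma_window / n_checkpoints

# e.g. a 10e6-frame PPO run vs. a 1e6-frame SAC run
print(checkpoint_stats(10e6))  # (1000, 0.1): MA covers the last 10% of training
print(checkpoint_stats(1e6))   # (100, 1.0): MA spans the entire run
```

Short SAC budgets therefore average over the whole run, while long PPO budgets score only the final stretch.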

Remote setup: `cp .env.example .env`, then set `HF_TOKEN`. See Remote Training for the dstack config.


Download and Replay


Results

Ant-v5

Docs | State: Box(105) | Action: Box(8) | Target: >2000

Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4

HalfCheetah-v5

Docs | State: Box(17) | Action: Box(6) | Target: >5000

Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Hopper-v5

Docs | State: Box(11) | Action: Box(3) | Target: ~2000

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Humanoid-v5

Docs | State: Box(348) | Action: Box(17) | Target: >1000

Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4

HumanoidStandup-v5

Docs | State: Box(348) | Action: Box(17) | Target: >100k

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

InvertedDoublePendulum-v5

Docs | State: Box(9) | Action: Box(1) | Target: ~8000

Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4

InvertedPendulum-v5

Docs | State: Box(4) | Action: Box(1) | Target: ~1000

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Pusher-v5

Docs | State: Box(23) | Action: Box(7) | Target: >-50

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Reacher-v5

Docs | State: Box(10) | Action: Box(2) | Target: >-10

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Swimmer-v5

Docs | State: Box(8) | Action: Box(2) | Target: >200

Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Walker2d-v5

Docs | State: Box(17) | Action: Box(6) | Target: >3500

Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4

Legend: ✅ Solved | ⚠️ Close (>80%) | ❌ Failed


CrossQ Wall-Clock Speedup vs SAC

CrossQ eliminates target networks via Batch Renormalization in its critics, enabling UTD=1 at ~700 fps: 3.5–6.7x faster than SAC on the same hardware.

| Env | CrossQ FPS | SAC FPS | Speedup |
| --- | --- | --- | --- |
| HalfCheetah-v5 | 705 | 200 | 3.5x |
| Hopper-v5 | 693 | 104 | 6.7x |
| Walker2d-v5 | ~700 | 104 | 6.7x |
| Ant-v5 | ~700 | 200 | 3.5x |
| Humanoid-v5 | ~350 | 53 | 6.6x |
| HumanoidStandup-v5 | 340 | 53 | 6.4x |

Measured on RTX 3090. CrossQ achieves comparable scores at significantly lower wall-clock time.
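Given these fps figures, wall-clock training time is roughly frames / fps. For example (illustrative arithmetic only; real runs add evaluation and logging overhead):

```python
def wall_clock_hours(frames, fps):
    """Approximate training time in hours, ignoring eval/logging overhead."""
    return frames / fps / 3600

# Hopper-v5 over a 3e6-frame budget: CrossQ vs SAC
print(round(wall_clock_hours(3e6, 693), 1))  # 1.2 h for CrossQ
print(round(wall_clock_hours(3e6, 104), 1))  # 8.0 h for SAC
```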


Historical Results (v4)

Roboschool Results (v4)
| Env. \ Alg. | A2C (GAE) | A2C (n-step) | PPO | SAC |
| --- | --- | --- | --- | --- |
| RoboschoolAnt | 787 | 1396 | 1843 | 2915 |
| RoboschoolHalfCheetah | 712 | 439 | 1960 | 2497 |
| RoboschoolHopper | 710 | 285 | 2042 | 2045 |
| RoboschoolInvertedDoublePendulum | 996 | 4410 | 8076 | 8085 |
| RoboschoolInvertedPendulum | 995 | 978 | 986 | 941 |
| RoboschoolReacher | 12.9 | 10.16 | 19.51 | 19.99 |
| RoboschoolWalker2d | 280 | 220 | 1660 | 1894 |
| RoboschoolHumanoid | 99.31 | 54.58 | 2388 | 2621* |

Episode score at the end of training. Reported scores are the average over the last 100 checkpoints, averaged over 4 Sessions. Results marked with * required 50M-100M frames using async SAC.
