Continuous Benchmark
MuJoCo Benchmark Results
SLM Lab v5.2 validates PPO, SAC, and CrossQ on Gymnasium MuJoCo environments. MuJoCo (Multi-Joint dynamics with Contact) provides physics simulation for continuous control tasks ranging from simple pendulums to complex humanoid locomotion.
Results below are from January–March 2026 benchmark runs using MuJoCo v5 environments.
All trained models and metrics are publicly available on HuggingFace.
Methodology
Results show Trial-level performance:
- Trial = 4 Sessions with different random seeds
- Session = one complete training run
- Score = final 100-checkpoint moving average (`total_reward_ma`)
The trial score is the mean across 4 sessions, providing statistically meaningful results.
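The aggregation above can be sketched in a few lines (illustrative only; the function name and shapes are hypothetical, not SLM Lab's actual code):

```python
import numpy as np

def trial_score(session_rewards, window=100):
    """Mean over sessions of each session's final `window`-checkpoint moving average."""
    finals = []
    for rewards in session_rewards:  # one list of checkpoint rewards per session
        ma = np.convolve(rewards, np.ones(window) / window, mode="valid")
        finals.append(ma[-1])  # final moving-average value (total_reward_ma)
    return float(np.mean(finals))
```

Averaging the moving average first (within a session), then across sessions, smooths out both checkpoint noise and seed variance.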
Standardized Settings

| Suite | num_envs | max_frame | log_frequency | grace_period |
|---|---|---|---|---|
| MuJoCo | 16 | 4e6-10e6 | 10000 | 1e5-1e6 |
The grace_period is the minimum frames before ASHA early stopping can terminate underperforming trials.
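The grace-period gate can be sketched as follows (an illustrative simplification with a hypothetical function name; real ASHA promotes trials through successive rungs rather than applying a single quantile check):

```python
import numpy as np

def asha_may_stop(frames, score, peer_scores, grace_period=1e5, quantile=0.5):
    """Decide whether ASHA may terminate an underperforming trial.

    A trial is never stopped before `grace_period` frames; afterwards it
    becomes eligible for termination if its score falls below the given
    quantile of its peers' scores at the same checkpoint.
    """
    if frames < grace_period:
        return False
    return score < float(np.quantile(peer_scores, quantile))
```

The grace period matters for off-policy algorithms like SAC, whose scores often stay flat while the replay buffer fills.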
Algorithms: PPO, SAC, and CrossQ. Network: MLP [256,256], orthogonal init. PPO uses tanh activation; SAC and CrossQ use relu. CrossQ uses Batch Renormalization in critics (no target networks).
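A minimal numpy sketch of the [256,256] tanh MLP with orthogonal initialization (illustrative only; SLM Lab builds its networks in PyTorch, and the function names here are hypothetical):

```python
import numpy as np

_rng = np.random.default_rng(0)

def orthogonal(shape, gain=1.0):
    """Orthogonal init via QR decomposition of a Gaussian matrix."""
    a = _rng.standard_normal((max(shape), min(shape)))
    q, _ = np.linalg.qr(a)                     # orthonormal columns
    q = q.T if shape[0] < shape[1] else q
    return gain * q[: shape[0], : shape[1]]

def mlp_forward(x, sizes=(17, 256, 256, 6), act=np.tanh):
    """Forward pass of an MLP with orthogonally initialized weights."""
    weights = [orthogonal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    for i, w in enumerate(weights):
        x = x @ w
        if i < len(weights) - 1:               # no activation on the output layer
            x = act(x)
    return x
```

Orthogonal init keeps the singular values of each weight matrix at 1, which helps gradient propagation through the tanh layers PPO uses.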
Note on frame budgets: SAC uses higher update-to-data ratios, making it more sample-efficient but slower per frame than PPO (1-4M frames vs PPO's 4-10M). CrossQ uses UTD=1 (like PPO) but eliminates target network overhead, achieving ~700 fps; its frame budgets (3-7.5M) reflect this speed advantage. Scores may still be improving at cutoff.
v5 vs v4 Difficulty: Gymnasium MuJoCo v5 environments are significantly harder than v4. Key changes include:
- Updated physics engine with more accurate contact dynamics
- Revised reward functions with stricter success criteria
- Termination conditions that more closely match real-world failure modes
Expect 10-30% lower scores compared to v4 benchmarks. See Gymnasium Migration Guide for details.
Spec Files
One spec file per algorithm covers all environments via YAML anchors:

- PPO: `ppo_mujoco_arc.yaml`
- SAC: `sac_mujoco_arc.yaml`
- CrossQ: `crossq_mujoco.yaml`
Spec Variants: Each file has a base config (shared via YAML anchors) with per-env overrides:
| Spec | Environments | Settings |
|---|---|---|
| `ppo_mujoco_arc` | HalfCheetah, Walker, Humanoid, HumanoidStandup | Base: gamma=0.99, lam=0.95, lr=3e-4 |
| `ppo_mujoco_longhorizon_arc` | Reacher, Pusher | gamma=0.997, lam=0.97, lr=2e-4, entropy=0.001 |
| `ppo_{env}_arc` | Ant, Hopper, Swimmer, IP, IDP | Per-env tuned (gamma, lam, lr) |
| `sac_mujoco_arc` | (generic, use with -s flags) | Base: gamma=0.99, iter=4, lr=3e-4, [256,256] |
| `sac_{env}_arc` | All 11 envs | Per-env tuned (iter, gamma, lr, net size) |
| `crossq_mujoco` | (generic base) | Base: gamma=0.99, iter=1, lr=1e-3, policy_delay=3 |
| `crossq_{env}` | All 11 envs | Per-env tuned (critic width, actor LN) |
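The anchor-plus-override pattern looks roughly like this (a schematic fragment; the keys and the override value are illustrative, not the exact SLM Lab spec schema):

```yaml
sac_base: &sac_base        # anchor: shared base config
  gamma: 0.99
  training_iter: 4
  lr: 3e-4
  net: [256, 256]

sac_hopper_arc:
  <<: *sac_base            # merge the base config
  lr: 1e-4                 # per-env override (hypothetical value)
```

Note that the `<<:` merge key is a YAML 1.1 convention; support depends on the parser (PyYAML supports it).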
Running Benchmarks
Reproduce: Copy SPEC_NAME and MAX_FRAME from the table below.
| Env | SPEC_NAME | MAX_FRAME |
|---|---|---|
| Ant-v5 | `ppo_ant_arc` | 10e6 |
| Ant-v5 | `sac_ant_arc` | 2e6 |
| Ant-v5 | `crossq_ant` | 3e6 |
| HalfCheetah-v5 | `ppo_mujoco_arc` | 10e6 |
| HalfCheetah-v5 | `sac_halfcheetah_arc` | 4e6 |
| Hopper-v5 | `ppo_hopper_arc` | 4e6 |
| Hopper-v5 | `sac_hopper_arc` | 3e6 |
| Humanoid-v5 | `ppo_mujoco_arc` | 10e6 |
| Humanoid-v5 | `sac_humanoid_arc` | 1e6 |
| HumanoidStandup-v5 | `ppo_mujoco_arc` | 4e6 |
| HumanoidStandup-v5 | `sac_humanoid_standup_arc` | 1e6 |
| InvertedDoublePendulum-v5 | `ppo_inverted_double_pendulum_arc` | 10e6 |
| InvertedDoublePendulum-v5 | `sac_inverted_double_pendulum_arc` | 2e6 |
| InvertedPendulum-v5 | `ppo_inverted_pendulum_arc` | 4e6 |
| InvertedPendulum-v5 | `sac_inverted_pendulum_arc` | 2e6 |
| Pusher-v5 | `ppo_mujoco_longhorizon_arc` | 4e6 |
| Pusher-v5 | `sac_pusher_arc` | 1e6 |
| Reacher-v5 | `ppo_mujoco_longhorizon_arc` | 4e6 |
| Reacher-v5 | `sac_reacher_arc` | 1e6 |
| Swimmer-v5 | `ppo_swimmer_arc` | 4e6 |
| Swimmer-v5 | `sac_swimmer_arc` | 2e6 |
| Walker2d-v5 | `ppo_mujoco_arc` | 10e6 |
| Walker2d-v5 | `sac_walker2d_arc` | 3e6 |
Remote setup: cp .env.example .env then set HF_TOKEN. See Remote Training for dstack config.
A GPU is strongly recommended for MuJoCo. These benchmarks run 4M-10M frames and take 1-4 hours on a cloud GPU (L4/A10G). Local CPU training is not practical; cloud GPUs via dstack are faster and often cheaper than local hardware.
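The 1-4 hour figure follows directly from the frame budget and throughput:

```python
def est_hours(frames, fps):
    """Estimate wall-clock training hours from a frame budget and throughput."""
    return frames / fps / 3600

# e.g. a 10M-frame run at ~700 fps:
print(f"{est_hours(10e6, 700):.1f} h")  # prints "4.0 h"
```

SAC's lower fps (see the speedup table below) is what pushes some runs toward the upper end of that range despite smaller frame budgets.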
Download and Replay
Results
Ant-v5
Docs | State: Box(105) | Action: Box(8) | Target: >2000
Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4
| Algorithm | Status | Score | Spec |
|---|---|---|---|
| PPO | ✅ | 2138.28 | `ppo_ant_arc` |
| SAC | ✅ | 4942.91 | `sac_ant_arc` |
| CrossQ | ✅ | 4517.00 | `crossq_ant` |

HalfCheetah-v5
Docs | State: Box(17) | Action: Box(6) | Target: >5000
Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4
| Algorithm | Status | Score | Spec |
|---|---|---|---|
| PPO | ✅ | 6240.68 | `ppo_mujoco_arc` |
| SAC | ✅ | 9815.16 | `sac_halfcheetah_arc` |
| CrossQ | ✅ | 8616.52 | `crossq_halfcheetah` |

Hopper-v5
Docs | State: Box(11) | Action: Box(3) | Target: ~2000
Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4
| Algorithm | Status | Score | Spec |
|---|---|---|---|
| PPO | ⚠️ | 1653.74 | `ppo_hopper_arc` |
| SAC | ⚠️ | 1416.52 | `sac_hopper_arc` |
| CrossQ | ⚠️ | 1168.53 | `crossq_hopper` |

Humanoid-v5
Docs | State: Box(348) | Action: Box(17) | Target: >1000
Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4
| Algorithm | Status | Score | Spec |
|---|---|---|---|
| PPO | ✅ | 2661.26 | `ppo_mujoco_arc` |
| SAC | ✅ | 1989.65 | `sac_humanoid_arc` |
| CrossQ | ✅ | 1755.29 | `crossq_humanoid` |

HumanoidStandup-v5
Docs | State: Box(348) | Action: Box(17) | Target: >100k
Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4
| Algorithm | Status | Score | Spec |
|---|---|---|---|
| PPO | ✅ | 150104.59 | `ppo_mujoco_arc` |
| SAC | ✅ | 137357.00 | `sac_humanoid_standup_arc` |
| CrossQ | ✅ | 150912.66 | `crossq_humanoid_standup` |

InvertedDoublePendulum-v5
Docs | State: Box(9) | Action: Box(1) | Target: ~8000
Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4
| Algorithm | Status | Score | Spec |
|---|---|---|---|
| PPO | ✅ | 8383.76 | `ppo_inverted_double_pendulum_arc` |
| SAC | ✅ | 9032.67 | `sac_inverted_double_pendulum_arc` |
| CrossQ | ✅ | 8027.38 | `crossq_inverted_double_pendulum` |

InvertedPendulum-v5
Docs | State: Box(4) | Action: Box(1) | Target: ~1000
Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4
| Algorithm | Status | Score | Spec |
|---|---|---|---|
| PPO | ✅ | 949.94 | `ppo_inverted_pendulum_arc` |
| SAC | ✅ | 928.43 | `sac_inverted_pendulum_arc` |
| CrossQ | ⚠️ | 877.83 | `crossq_inverted_pendulum` |

Pusher-v5
Docs | State: Box(23) | Action: Box(7) | Target: >-50
Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4
| Algorithm | Status | Score | Spec |
|---|---|---|---|
| PPO | ✅ | -49.59 | `ppo_mujoco_longhorizon_arc` |
| SAC | ✅ | -43.00 | `sac_pusher_arc` |
| CrossQ | ✅ | -37.08 | `crossq_pusher` |

Reacher-v5
Docs | State: Box(10) | Action: Box(2) | Target: >-10
Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4
| Algorithm | Status | Score | Spec |
|---|---|---|---|
| PPO | ✅ | -5.03 | `ppo_mujoco_longhorizon_arc` |
| SAC | ✅ | -6.31 | `sac_reacher_arc` |
| CrossQ | ✅ | -5.65 | `crossq_reacher` |

Swimmer-v5
Docs | State: Box(8) | Action: Box(2) | Target: >200
Settings: max_frame 4e6 | num_envs 16 | max_session 4 | log_frequency 1e4
| Algorithm | Status | Score | Spec |
|---|---|---|---|
| PPO | ✅ | 282.44 | `ppo_swimmer_arc` |
| SAC | ✅ | 301.34 | `sac_swimmer_arc` |
| CrossQ | ✅ | 221.12 | `crossq_swimmer` |

Walker2d-v5
Docs | State: Box(17) | Action: Box(6) | Target: >3500
Settings: max_frame 10e6 | num_envs 16 | max_session 4 | log_frequency 1e4
| Algorithm | Status | Score | Spec |
|---|---|---|---|
| PPO | ✅ | 4378.62 | `ppo_mujoco_arc` |
| SAC | ⚠️ | 3123.66 | `sac_walker2d_arc` |
| CrossQ | ✅ | 4389.62 | `crossq_walker2d` |

Legend: ✅ Solved | ⚠️ Close (>80%) | ❌ Failed
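The legend's thresholds can be expressed compactly (illustrative; this assumes the >80% rule applies to the score-to-target ratio for positive targets):

```python
def classify(score, target):
    """Classify a benchmark score against a positive target.

    Returns 'solved' at/above target, 'close' above 80% of target,
    otherwise 'failed'. (Negative-reward targets like Pusher's would
    need a different comparison.)
    """
    if score >= target:
        return "solved"   # ✅
    if score >= 0.8 * target:
        return "close"    # ⚠️
    return "failed"       # ❌
```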
CrossQ Wall-Clock Speedup vs SAC
CrossQ eliminates target networks via cross batch normalization, enabling UTD=1 at ~700 fps, i.e. 3.5-6.7x faster than SAC on the same hardware.
| Env | CrossQ fps | SAC fps | Speedup |
|---|---|---|---|
| HalfCheetah-v5 | 705 | 200 | 3.5x |
| Hopper-v5 | 693 | 104 | 6.7x |
| Walker2d-v5 | ~700 | 104 | 6.7x |
| Ant-v5 | ~700 | 200 | 3.5x |
| Humanoid-v5 | ~350 | 53 | 6.6x |
| HumanoidStandup-v5 | 340 | 53 | 6.4x |
Measured on RTX 3090. CrossQ achieves comparable scores at significantly lower wall-clock time.
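The trick that removes the target network is running current and next state-action pairs through the critic as one concatenated batch, so the normalization statistics are shared. A numpy sketch of that shared normalization (illustrative; plain batch norm stands in for Batch Renormalization, and the function name is hypothetical):

```python
import numpy as np

def crossq_critic_inputs(s, a, s_next, a_next, eps=1e-5):
    """Normalize (s,a) and (s',a') with statistics from the joint batch."""
    sa = np.concatenate([s, a], axis=1)
    sa_next = np.concatenate([s_next, a_next], axis=1)
    joint = np.concatenate([sa, sa_next], axis=0)   # one shared batch
    mu, var = joint.mean(axis=0), joint.var(axis=0)
    joint = (joint - mu) / np.sqrt(var + eps)       # shared statistics
    n = sa.shape[0]
    return joint[:n], joint[n:]                     # split back
```

Because both halves see the same statistics, the critic's value estimates for next states stay consistent without a slowly updated target copy, which is where the wall-clock savings come from.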
Historical Results (v4)
Roboschool Results (v4) - click to expand
Deprecated: Roboschool is abandoned (MuJoCo became free in 2022). These v4 results are preserved for historical reference only. Use Gymnasium MuJoCo environments for new work.
Environment mapping: RoboschoolHopper-v1 → Hopper-v5, RoboschoolHalfCheetah-v1 → HalfCheetah-v5, etc.
| Env | A2C (GAE) | A2C (n-step) | PPO | SAC |
|---|---|---|---|---|
| RoboschoolAnt | 787 | 1396 | 1843 | 2915 |
| RoboschoolHalfCheetah | 712 | 439 | 1960 | 2497 |
| RoboschoolHopper | 710 | 285 | 2042 | 2045 |
| RoboschoolInvertedDoublePendulum | 996 | 4410 | 8076 | 8085 |
| RoboschoolInvertedPendulum | 995 | 978 | 986 | 941 |
| RoboschoolReacher | 12.9 | 10.16 | 19.51 | 19.99 |
| RoboschoolWalker2d | 280 | 220 | 1660 | 1894 |
| RoboschoolHumanoid | 99.31 | 54.58 | 2388 | 2621* |
Scores are the episode score at the end of training, averaged over the last 100 checkpoints and over 4 Sessions. Results marked with * required 50M-100M frames using async SAC.