SLM Lab v5 validates PPO on Gymnasium MuJoCo environments. MuJoCo (Multi-Joint dynamics with Contact) provides physics simulation for continuous control tasks ranging from simple pendulums to complex humanoid locomotion.
Results below are from January 2026 benchmark reruns using MuJoCo v5 environments.
All trained models and metrics are publicly available on HuggingFace.
## Methodology
Results show Trial-level performance:

- Trial = 4 Sessions run with different random seeds
- Session = one complete training run
- Score = the final 100-checkpoint moving average (`total_reward_ma`)

The trial score is the mean across the 4 sessions, which averages out seed-to-seed variance and gives a statistically meaningful result.
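As an illustration of the scoring, here is a minimal sketch of how a trial score could be computed from per-session checkpoint rewards. The function names and data layout are assumptions for illustration, not SLM Lab internals:

```python
import numpy as np

def session_score(checkpoint_rewards: list[float], window: int = 100) -> float:
    """Moving average of total reward over the last `window` checkpoints."""
    return float(np.mean(checkpoint_rewards[-window:]))

def trial_score(sessions: list[list[float]]) -> float:
    """Trial score = mean of the sessions' final moving averages."""
    return float(np.mean([session_score(s) for s in sessions]))

# e.g. 4 sessions, each a list of per-checkpoint total rewards:
# sessions = [session_0, session_1, session_2, session_3]
# print(trial_score(sessions))
```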
## Standardized Settings

| Category | num_envs | max_frame | log_frequency | ASHA grace_period |
| --- | --- | --- | --- | --- |
| MuJoCo | 16 | 4e6-10e6 | 10000 | 1e5-1e6 |
The `grace_period` is the minimum number of frames a trial must run before ASHA early stopping can terminate it for underperforming.
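To make the gate concrete, here is a minimal ASHA-style sketch assuming a simple keep-the-top-half rule; the function and parameter names are illustrative, not SLM Lab's scheduler code:

```python
from statistics import median

def asha_should_stop(frames: int, score: float, grace_period: int,
                     peer_scores: list[float]) -> bool:
    """Illustrative ASHA-style gate: never stop a trial before
    grace_period frames; afterwards, stop trials scoring below the
    median of their peers at the same checkpoint."""
    if frames < grace_period:
        return False  # within the grace period, always keep training
    if not peer_scores:
        return False  # no peers to compare against yet
    return score < median(peer_scores)
```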
v5 vs v4 Difficulty: Gymnasium MuJoCo v5 environments are significantly harder than v4. Key changes include:

- An updated physics engine with more accurate contact dynamics
- Revised reward functions with stricter success criteria
- Termination conditions that more closely match real-world failure modes
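The v5 tasks above are standard Gymnasium environments, so they can be inspected directly. A minimal sketch, assuming `gymnasium[mujoco]` is installed:

```python
import gymnasium as gym

env = gym.make("Hopper-v5")  # harder than the v4 variant, per the notes above
obs, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()  # random policy, just to exercise the env
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()

# The benchmarks run 16 parallel copies; in Gymnasium that is e.g.:
# envs = gym.make_vec("Hopper-v5", num_envs=16)
```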
Remote (recommended) - runs on a cloud GPU via dstack and auto-syncs results to HuggingFace. Setup: `cp .env.example .env`, then set `HF_TOKEN`; see Remote Training for the dstack config.

Local - runs on your machine (requires a decent GPU; a run takes 1-4 hours):
```bash
# Train PPO on a single environment spec, e.g. Hopper
slm-lab run slm_lab/spec/benchmark/ppo/ppo_hopper.json ppo_hopper train

# Or override the shared MuJoCo spec, e.g. Humanoid with a 10M-frame budget
slm-lab run -s env=Humanoid-v5 -s max_frame=10e6 slm_lab/spec/benchmark/ppo/ppo_mujoco.json ppo_mujoco train
```

A GPU is strongly recommended for MuJoCo: these benchmarks run 4M-10M frames and take 1-4 hours on a cloud GPU (L4/A10G), so local CPU training is not practical. Cloud GPUs via dstack are faster and often cheaper than local hardware.

## Download and Replay
```bash
# List all available experiments (requires HF_REPO=SLM-Lab/benchmark in .env)
source .env && slm-lab list

# Download a specific experiment
source .env && slm-lab pull ppo_hopper

# Replay the trained agent
slm-lab run slm_lab/spec/benchmark/ppo/ppo_hopper.json ppo_hopper enjoy@data/ppo_hopper_2026_01_31_105438/ppo_hopper_t0_spec.json
```

## Results

January 2026 Rerun: SAC benchmarks are omitted due to compute constraints (off-policy algorithms require significantly more resources); PPO results cover all 11 MuJoCo environments.

Deprecated: Roboschool is abandoned (MuJoCo became free in 2022). These v4 results are preserved for historical reference only; use Gymnasium MuJoCo environments for new work.

Environment mapping: RoboschoolHopper-v1 → Hopper-v5, RoboschoolHalfCheetah-v1 → HalfCheetah-v5, etc.

Episode score at the end of training: reported scores are the average over the last 100 checkpoints, averaged over the 4 Sessions. Results marked with * required 50M-100M frames using async SAC.