🎯 Discrete Benchmark
Classic Control & Box2D Results
SLM Lab v5.2 validates algorithms on Gymnasium discrete environments using the TorchArc architecture. These benchmarks cover:
Classic Control: CartPole, Acrobot, Pendulum - simple physics tasks ideal for algorithm validation
Box2D: LunarLander - 2D physics with more complex dynamics
Results below are from February-March 2026 benchmark runs using Gymnasium v5 environments. Existing algorithms use TorchArc specs; CrossQ uses standard MLP specs.
All trained models and metrics are publicly available on HuggingFace.
Methodology
Results show Trial-level performance:
Trial = 4 Sessions with different random seeds
Session = One complete training run
Score = Final 100-checkpoint moving average (total_reward_ma)
The trial score is the mean across 4 sessions, providing statistically meaningful results.
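The scoring rule above can be written out in a few lines. This is an illustrative sketch, not SLM Lab's internals; the `trial_score` helper and the synthetic session data are made up for the example:

```python
def trial_score(session_rewards, window=100):
    """Mean over sessions of each session's final `window`-checkpoint average.

    session_rewards: one list of checkpoint rewards (total_reward_ma inputs)
    per session; the trial score averages the 4 per-session scores.
    """
    session_scores = []
    for rewards in session_rewards:      # one reward series per session/seed
        tail = rewards[-window:]         # final 100 checkpoints
        session_scores.append(sum(tail) / len(tail))
    return sum(session_scores) / len(session_scores)

# 4 sessions (different seeds) whose checkpoints sit at 450, 460, 470, 480
sessions = [[450.0 + 10 * i] * 500 for i in range(4)]
print(trial_score(sessions))  # -> 465.0
```

Averaging over a trailing window and then over seeds is what makes the reported score robust to single lucky episodes or seeds.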
Standardized Settings
| Environments | num_envs | max_frame | log_frequency | grace_period |
| --- | --- | --- | --- | --- |
| Classic Control | 4 | 2e5-3e5 | 500 | 1e4 |
| Box2D | 8 | 3e5 | 1000 | 5e4 |
The grace_period is the minimum frames before ASHA early stopping can terminate underperforming trials.
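The ASHA rule the grace_period feeds into can be sketched simply. This is a generic illustration of successive-halving-style stopping, not SLM Lab's or Ray Tune's actual scheduler code; `asha_should_stop` and its arguments are hypothetical:

```python
def asha_should_stop(frame, score, peer_scores, grace_period, quantile=0.5):
    """Sketch of ASHA-style early stopping.

    A trial is never cut before grace_period frames; after that, it is cut
    if its score falls below the chosen quantile of its peers' scores.
    """
    if frame < grace_period:
        return False                      # still inside the grace period
    if not peer_scores:
        return False                      # nothing to compare against
    cutoff = sorted(peer_scores)[int(len(peer_scores) * quantile)]
    return score < cutoff

# Inside the grace period, even a weak trial survives...
assert asha_should_stop(5_000, 10.0, [100.0, 200.0], grace_period=10_000) is False
# ...but past it, clear underperformers are terminated.
assert asha_should_stop(20_000, 10.0, [100.0, 200.0], grace_period=10_000) is True
```

The grace period exists because early scores are noisy: algorithms like DQN often look flat until the replay buffer warms up, and cutting them at frame 0 would discard viable configurations.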
v5 vs v4 Difficulty: Gymnasium environments have stricter termination and reward handling:
LunarLander-v3 is notably harder than v2 - stricter landing criteria, lower typical scores
Pendulum-v1 uses different reward scaling than v0
Expect 5-15% lower scores compared to OpenAI Gym benchmarks
See Gymnasium docs for environment-specific changes.
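One concrete piece of the stricter termination handling: Gymnasium splits the old `done` flag into `terminated` (true end state) and `truncated` (time limit). A minimal sketch of why the distinction matters for value targets - this is the standard handling, shown with a hypothetical `td_target` helper rather than SLM Lab's code:

```python
def td_target(reward, next_value, terminated, truncated, gamma=0.99):
    """One-step TD target under Gymnasium's terminated/truncated split.

    Only a true termination (terminated=True) zeroes out the bootstrap;
    a time-limit truncation is not a real end state, so we still bootstrap
    from the next state's value. `truncated` is accepted to make that
    explicit: it does not change the target.
    """
    if terminated:
        return reward                     # episode truly ended; no future value
    return reward + gamma * next_value    # bootstrap, even when truncated

# True termination (e.g. a crash in LunarLander): no bootstrap
assert td_target(-100.0, 50.0, terminated=True, truncated=False) == -100.0
# Time-limit truncation: the next state's value still counts
assert td_target(1.0, 50.0, terminated=False, truncated=True) == 50.5
```

Conflating truncation with termination biases value estimates downward, which is one reason scores drop when moving from older Gym versions to Gymnasium.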
Running Benchmarks
Local - runs on your machine (Classic Control completes in minutes on CPU):
Remote - cloud GPU via dstack, auto-syncs to HuggingFace:
Remote setup: cp .env.example .env then set HF_TOKEN. See Remote Training for dstack config.
GPU not required for Classic Control. These environments train fast on CPU. Box2D (LunarLander) benefits from GPU but still runs fine locally.
Download and Replay
Results
Classic Control
CartPole-v1
Docs | State: Box(4) | Action: Discrete(2) | Target: >400
Settings: max_frame 2e5 | num_envs 4 | max_session 4 | log_frequency 500
| Algorithm | Status | Score | Spec |
| --- | --- | --- | --- |
| REINFORCE | ✅ | 483.31 | reinforce_cartpole_arc |
| SARSA | ✅ | 430.95 | sarsa_boltzmann_cartpole_arc |
| DQN | ⚠️ | 239.94 | dqn_boltzmann_cartpole_arc |
| DDQN+PER | ✅ | 451.51 | ddqn_per_boltzmann_cartpole_arc |
| A2C | ✅ | 496.68 | a2c_gae_cartpole_arc |
| PPO | ✅ | 498.94 | ppo_cartpole_arc |
| SAC | ✅ | 406.09 | sac_cartpole_arc |
| CrossQ | ⚠️ | 334.59 | crossq_cartpole |

Acrobot-v1
Docs | State: Box(6) | Action: Discrete(3) | Target: >-100
Settings: max_frame 3e5 | num_envs 4 | max_session 4 | log_frequency 500
| Algorithm | Status | Score | Spec |
| --- | --- | --- | --- |
| DQN | ✅ | -94.17 | dqn_boltzmann_acrobot_arc |
| DDQN+PER | ✅ | -83.92 | ddqn_per_acrobot_arc |
| A2C | ✅ | -83.99 | a2c_gae_acrobot_arc |
| PPO | ✅ | -81.28 | ppo_acrobot_arc |
| SAC | ✅ | -92.60 | sac_acrobot_arc |
| CrossQ | ❌ | -103.13 | crossq_acrobot |

Pendulum-v1
Docs | State: Box(3) | Action: Box(1) | Target: >-200
Settings: max_frame 3e5 | num_envs 4 | max_session 4 | log_frequency 500
| Algorithm | Status | Score | Spec |
| --- | --- | --- | --- |
| A2C | ❌ | -820.74 | a2c_gae_pendulum_arc |
| PPO | ✅ | -174.87 | ppo_pendulum_arc |
| SAC | ✅ | -150.97 | sac_pendulum_arc |
| CrossQ | ✅ | -145.66 | crossq_pendulum |

Box2D
LunarLander-v3 (Discrete)
Docs | State: Box(8) | Action: Discrete(4) | Target: >200
Settings: max_frame 3e5 | num_envs 8 | max_session 4 | log_frequency 1000
| Algorithm | Status | Score | Spec |
| --- | --- | --- | --- |
| DQN | ⚠️ | 195.21 | dqn_concat_lunar_arc |
| DDQN+PER | ✅ | 265.90 | ddqn_per_concat_lunar_arc |
| A2C | ❌ | 27.38 | a2c_gae_lunar_arc |
| PPO | ⚠️ | 183.30 | ppo_lunar_arc |
| SAC | ⚠️ | 106.17 | sac_lunar_arc |
| CrossQ | ❌ | 139.21 | crossq_lunar |

LunarLander-v3 (Continuous)
Docs | State: Box(8) | Action: Box(2) | Target: >200
Settings: max_frame 3e5 | num_envs 8 | max_session 4 | log_frequency 1000
| Algorithm | Status | Score | Spec |
| --- | --- | --- | --- |
| A2C | ❌ | -76.81 | a2c_gae_lunar_continuous_arc |
| PPO | ⚠️ | 132.58 | ppo_lunar_continuous_arc |
| SAC | ⚠️ | 125.00 | sac_lunar_continuous_arc |
| CrossQ | ✅ | 268.91 | crossq_lunar_continuous |

Legend: ✅ Solved | ⚠️ Close (>80%) | ❌ Failed
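The legend's thresholds can be written out explicitly. A sketch with a hypothetical `status` helper; note that the 80% rule only makes sense as written for positive targets (for negative targets like Acrobot's, 0.8 × target is stricter than the target itself):

```python
def status(score, target):
    """Map a trial score to the legend symbol, assuming a positive target."""
    if score >= target:
        return "solved"       # ✅ at or above the target
    if score >= 0.8 * target:
        return "close"        # ⚠️ above 80% of the target
    return "failed"           # ❌ below 80% of the target

assert status(483.31, 400) == "solved"   # REINFORCE on CartPole
assert status(334.59, 400) == "close"    # CrossQ on CartPole
assert status(27.38, 200) == "failed"    # A2C on LunarLander
```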
Historical Results (v4)
OpenAI Gym Results (v4)
These results from SLM Lab v4 used OpenAI Gym environments (now deprecated). Environment versions differ from current Gymnasium versions. Unity environments are no longer included in the core package.
| Env | DQN | DDQN+PER | A2C (GAE) | A2C (n-step) | PPO | SAC |
| --- | --- | --- | --- | --- | --- | --- |
| Breakout | 80.88 | 182 | 377 | 398 | 443 | 3.51* |
| Pong | 18.48 | 20.5 | 19.31 | 19.56 | 20.58 | 19.87* |
| Qbert | 5494 | 11426 | 12405 | 13590 | 13460 | 923* |
| Seaquest | 1185 | 4405 | 1070 | 1684 | 1715 | 171* |
| LunarLander | 192 | 233 | 25.21 | 68.23 | 214 | 276 |
Scores are the episode score at the end of training: the average over the last 100 checkpoints, averaged over 4 Sessions. Results marked with * used async SAC.
For the full Atari benchmark, see Atari Benchmark.