🎯 Discrete Benchmark

Classic Control & Box2D Results (v5)

SLM Lab v5 validates algorithms on Gymnasium discrete environments. These benchmarks cover the Classic Control and Box2D suites.

The results below are from January 2026 benchmark reruns of SLM Lab v5 on Gymnasium environments.

All trained models and metrics are publicly available on HuggingFace.

Methodology

Results show Trial-level performance:

  1. Trial = 4 Sessions with different random seeds

  2. Session = One complete training run

  3. Score = Final 100-checkpoint moving average (total_reward_ma)

The trial score is the mean over the 4 sessions, which averages out seed-to-seed variance and gives a statistically meaningful result.
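Concretely, the aggregation works as in the sketch below. The array of checkpoint returns is illustrative; in a real run the values come from each session's logged total_reward_ma.

```python
import numpy as np

# Illustrative checkpoint returns: 4 sessions x 400 logged checkpoints.
# In a real run these come from each session's total_reward_ma log.
rng = np.random.default_rng(0)
session_returns = rng.normal(loc=400.0, scale=20.0, size=(4, 400))

# Session score: moving average over the final 100 checkpoints.
session_scores = session_returns[:, -100:].mean(axis=1)

# Trial score: mean of the session scores across the 4 seeds.
trial_score = session_scores.mean()
print(f"session scores: {session_scores.round(1)}, trial score: {trial_score:.1f}")
```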

Standardized Settings

| Category | num_envs | max_frame | log_frequency | ASHA grace_period |
| --- | --- | --- | --- | --- |
| Classic Control | 4 | 2e5-3e5 | 500 | 1e4 |
| Box2D | 8 | 3e5 | 1000 | 5e4 |

The grace_period is the minimum number of frames a trial must run before ASHA early stopping can terminate it for underperforming.
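For readers unfamiliar with ASHA, the settings above correspond to a scheduler configured roughly as in the sketch below. It uses Ray Tune's ASHAScheduler; whether v5 drives its sweeps through Ray Tune is an assumption here, not something this page states.

```python
from ray.tune.schedulers import ASHAScheduler  # assumes ray[tune] is installed

# ASHA for the Box2D row above: trials report total_reward_ma as they train,
# and none may be stopped before grace_period frames have elapsed.
scheduler = ASHAScheduler(
    time_attr="frame",         # budget is measured in environment frames
    metric="total_reward_ma",  # the score defined under Methodology
    mode="max",                # higher reward is better
    grace_period=int(5e4),     # Box2D grace_period from the table
    max_t=int(3e5),            # Box2D max_frame from the table
)
```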


Running Benchmarks

Local - runs on your machine. Classic Control completes in minutes on CPU.

Remote - runs on a cloud GPU via dstack and auto-syncs results to HuggingFace.

Remote setup: run `cp .env.example .env`, then set `HF_TOKEN`. See Remote Training for the dstack config.
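As a quick check that the token is in place, here is a minimal sketch assuming the python-dotenv package; the `HF_TOKEN` name comes from `.env.example`, and how SLM Lab itself loads the file is not shown here.

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # read .env from the current directory into os.environ

token = os.environ.get("HF_TOKEN")
if not token:
    raise SystemExit("HF_TOKEN is not set; copy .env.example to .env and fill it in.")
print("HF_TOKEN found; remote runs can sync to HuggingFace.")
```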


GPU not required for Classic Control. These environments train fast on CPU. Box2D (LunarLander) benefits from GPU but still runs fine locally.

Download and Replay
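The exact download commands live in the repo; as a sketch, a benchmark run can be pulled with the huggingface_hub client. The `repo_id` below is a hypothetical placeholder, not the real repository name; use the repository linked from the HuggingFace page above.

```python
from huggingface_hub import snapshot_download  # assumes huggingface_hub is installed

# Pull all files from a benchmark repo into the local HF cache.
# "SLM-Lab/benchmark" is a hypothetical placeholder repo_id.
local_dir = snapshot_download(repo_id="SLM-Lab/benchmark")
print(f"Models and metrics downloaded to {local_dir}")
```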


Results

Classic Control

CartPole-v1

Docs | State: Box(4) | Action: Discrete(2) | Target: >400

Settings: max_frame 2e5 | num_envs 4 | max_session 4 | log_frequency 500

CartPole-v1 Multi-Trial Graph
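As a concrete example of how the standardized settings above are expressed, here is a sketch of a spec fragment modeled on the SLM Lab v4 JSON layout; the key names and nesting are assumptions about v5, not a verbatim spec.

```python
# Sketch of a CartPole spec fragment, modeled on the v4 JSON spec layout.
# Exact v5 key names and nesting may differ.
cartpole_spec = {
    "env": [{
        "name": "CartPole-v1",
        "num_envs": 4,        # parallel vector environments
        "max_frame": 2e5,     # total training frames
    }],
    "meta": {
        "max_session": 4,      # 4 seeded sessions per trial
        "log_frequency": 500,  # frames between metric checkpoints
    },
}
```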

Acrobot-v1

Docs | State: Box(6) | Action: Discrete(3) | Target: >-100

Settings: max_frame 3e5 | num_envs 4 | max_session 4 | log_frequency 500

Acrobot-v1 Multi-Trial Graph

Pendulum-v1

Docs | State: Box(3) | Action: Box(1) | Target: >-200

Settings: max_frame 3e5 | num_envs 4 | max_session 4 | log_frequency 500

Pendulum-v1 Multi-Trial Graph

Box2D

LunarLander-v3 (Discrete)

Docs | State: Box(8) | Action: Discrete(4) | Target: >200

Settings: max_frame 3e5 | num_envs 8 | max_session 4 | log_frequency 1000

LunarLander-v3 (Discrete) Multi-Trial Graph

LunarLander-v3 (Continuous)

Docs | State: Box(8) | Action: Box(2) | Target: >200

Settings: max_frame 3e5 | num_envs 8 | max_session 4 | log_frequency 1000

LunarLander-v3 (Continuous) Multi-Trial Graph

Legend: ✅ Solved | ⚠️ Close (>80%) | ❌ Failed


Historical Results (v4)

OpenAI Gym Results (v4)

These results are from SLM Lab v4, which used the now-deprecated OpenAI Gym environments. Environment versions differ from current Gymnasium versions. Unity environments are no longer included in the core package.

| Env. \ Alg. | DQN | DDQN+PER | A2C (GAE) | A2C (n-step) | PPO | SAC |
| --- | --- | --- | --- | --- | --- | --- |
| Breakout | 80.88 | 182 | 377 | 398 | 443 | 3.51* |
| Pong | 18.48 | 20.5 | 19.31 | 19.56 | 20.58 | 19.87* |
| Qbert | 5494 | 11426 | 12405 | 13590 | 13460 | 923* |
| Seaquest | 1185 | 4405 | 1070 | 1684 | 1715 | 171* |
| LunarLander | 192 | 233 | 25.21 | 68.23 | 214 | 276 |

Episode score at the end of training. Reported scores are the average over the last 100 checkpoints, averaged over 4 Sessions. Results marked with * used async SAC.

For the full Atari benchmark, see Atari Benchmark.
