
Env Spec

This tutorial shows how to configure the env spec for continuous control environments. We'll train PPO on HalfCheetah, a MuJoCo locomotion task.

The Env Spec

The environment is specified using the env key in a spec file:

{
  "spec_name": {
    "agent": {...},
    "env": {
      // Environment name (must be in gymnasium registry)
      "name": str,

      // Number of parallel environment instances
      "num_envs": int,

      // Maximum timesteps per episode (null = use environment default)
      "max_t": int|null,

      // Total training frames
      "max_frame": int,

      // Optional: Online state normalization (recommended for MuJoCo)
      "normalize_obs": bool,

      // Optional: Online reward normalization (recommended for MuJoCo)
      "normalize_reward": bool,

      // Optional: Clip observations to [-bound, bound] (default: 10.0 if normalize_obs)
      "clip_obs": float,

      // Optional: Clip rewards to [-bound, bound] (default: 10.0 if normalize_reward)
      "clip_reward": float
    },
    ...
  }
}
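To make the schema concrete, here is a minimal Python sketch that checks an env block against the keys and types listed above. `check_env_spec` and its helpers are hypothetical and not part of SLM Lab. The sketch assumes that keys not marked optional are required, and it accepts whole-valued floats for max_frame because JSON parses values such as 3e5 as floats. Extra keys are not flagged.

```python
def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_whole_number(value):
    # JSON parses 3e5 as a float, so accept whole-valued floats too.
    return _is_int(value) or (isinstance(value, float) and value.is_integer())


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# key -> (required?, validity check); mirrors the documented env keys.
# Treating unmarked keys as required is an assumption of this sketch.
CHECKS = {
    "name": (True, lambda v: isinstance(v, str)),
    "num_envs": (True, _is_int),
    "max_t": (True, lambda v: v is None or _is_int(v)),
    "max_frame": (True, _is_whole_number),
    "normalize_obs": (False, lambda v: isinstance(v, bool)),
    "normalize_reward": (False, lambda v: isinstance(v, bool)),
    "clip_obs": (False, _is_number),
    "clip_reward": (False, _is_number),
}


def check_env_spec(env):
    """Return a list of problems found in an env spec dict (empty if none)."""
    problems = []
    for key, (required, is_valid) in CHECKS.items():
        if key not in env:
            if required:
                problems.append(f"missing required key: {key}")
        elif not is_valid(env[key]):
            problems.append(f"bad value for {key}: {env[key]!r}")
    return problems


example = {
    "name": "HalfCheetah-v5",
    "num_envs": 16,
    "max_t": None,       # null in JSON: use the environment default
    "max_frame": 4e6,
    "normalize_obs": True,
    "normalize_reward": True,
}
print(check_env_spec(example))
```

A check like this only catches type and presence errors; whether the name is actually registered can only be confirmed by Gymnasium itself.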

Supported Environments

SLM Lab uses Gymnasium (the maintained fork of OpenAI Gym):

| Category | Examples | Difficulty |
| --- | --- | --- |
| Classic Control | CartPole, Pendulum, Acrobot | Easy |
| Box2D | LunarLander, BipedalWalker | Medium |
| MuJoCo | Hopper, HalfCheetah, Humanoid | Hard |
| Atari | Qbert, MsPacman, and 54 more | Varied |

Any Gymnasium-compatible environment works: just specify its name in the spec.

Environment-Specific Settings

| Category | num_envs | max_frame | Normalization | GPU |
| --- | --- | --- | --- | --- |
| Classic Control | 4 | 2e5-3e5 | No | Optional |
| Box2D | 8 | 3e5 | No | Optional |
| MuJoCo | 16 | 1e6-10e6 | normalize_obs, normalize_reward | Optional |
| Atari | 16 | 10e6 | No | Recommended |

See Benchmark Specs for complete spec files for each environment.

Example: PPO on HalfCheetah

HalfCheetah-v5 is a classic MuJoCo benchmark: a 2D cheetah robot that learns to run forward. It has a 17-dimensional observation space and a 6-dimensional continuous action space.

The PPO MuJoCo spec from slm_lab/spec/benchmark/ppo/ppo_mujoco.json:

Key env settings for MuJoCo:

| Setting | Meaning | Why |
| --- | --- | --- |
| num_envs: 16 | 16 parallel environments | Faster data collection for on-policy learning |
| normalize_obs: true | Normalize observations | MuJoCo observations have varying scales |
| normalize_reward: true | Normalize rewards | Stabilizes value function learning |

MuJoCo has been free to use since 2021 and open source since 2022, so no license is needed. Gymnasium's MuJoCo environments run on the open-source mujoco Python bindings.

Running PPO on HalfCheetah

The variable substitution (-s env=...) lets you use the same spec for different MuJoCo environments.

Results

PPO achieves a moving-average (MA) return of 5852 on HalfCheetah-v5 with this configuration.

Training curves (average of 4 sessions):

[Figures: HalfCheetah training curve; HalfCheetah moving average]

Trained models are available on Hugging Face.

Using Other Environments

To use a different environment, find a spec for that environment category and modify env.name. Spec files are organized by algorithm in slm_lab/spec/benchmark/.

Environment Categories

| Category | Environments | Spec Examples |
| --- | --- | --- |
| Classic Control | CartPole-v1, Acrobot-v1, Pendulum-v1, MountainCar-v0 | ppo_cartpole.json, dqn_cartpole.json |
| Box2D | LunarLander-v3, BipedalWalker-v3 | ppo_lunar.json, ddqn_per_lunar.json |
| MuJoCo | Hopper-v5, HalfCheetah-v5, Walker2d-v5, Ant-v5, Humanoid-v5, Swimmer-v5, etc. | ppo_mujoco.json, sac_mujoco.json |
| Atari | 54 games (ALE/Qbert-v5, ALE/MsPacman-v5, etc.) | ppo_atari.json, dqn_atari.json |

Switching Environments

  1. Find a spec for your target environment category

  2. Change env.name to the Gymnasium environment name

  3. Adjust settings as needed (num_envs, max_frame, normalization)

Example: use the same algorithm on different environments:
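The three steps above can be sketched in Python on a spec that has already been loaded into a dict. This is only an illustration: `switch_env` and the spec name `ppo_example` are hypothetical, the agent block is left empty, and the numbers are taken from the tables on this page rather than from a real benchmark spec file.

```python
import copy


def switch_env(spec, spec_name, env_name, **overrides):
    """Return a copy of `spec` with env.name (and optional env settings) replaced."""
    new_spec = copy.deepcopy(spec)   # leave the original spec untouched
    env = new_spec[spec_name]["env"]
    env["name"] = env_name           # step 2: change env.name
    env.update(overrides)            # step 3: adjust other settings as needed
    return new_spec


# Step 1: start from a spec for the target category (MuJoCo here).
base = {
    "ppo_example": {                 # hypothetical spec name
        "agent": {},                 # agent settings omitted in this sketch
        "env": {
            "name": "Hopper-v5",
            "num_envs": 16,
            "max_frame": 4e6,
            "normalize_obs": True,
            "normalize_reward": True,
        },
    }
}

walker = switch_env(base, "ppo_example", "Walker2d-v5")
humanoid = switch_env(base, "ppo_example", "Humanoid-v5", max_frame=10e6)
print(walker["ppo_example"]["env"]["name"])
print(humanoid["ppo_example"]["env"]["max_frame"])
```

Copying the spec instead of editing it in place keeps the original available as the baseline for later comparisons.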

Template Specs with Variable Substitution

Some specs use ${var} placeholders for flexibility. Use -s var=value to substitute:
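The idea behind `${var}` placeholders can be illustrated with Python's standard `string.Template`, which uses the same placeholder syntax. This is a sketch of the concept, not SLM Lab's implementation: the `substitute` helper, the spec name `ppo_example`, and the `max_frame` variable are assumptions made for the example.

```python
import json
from string import Template


def substitute(spec_text, **variables):
    """Fill ${var} placeholders in raw spec text, then parse the result as JSON."""
    # Template.substitute raises KeyError if a placeholder has no value,
    # so a forgotten variable fails loudly instead of leaving "${env}" behind.
    return json.loads(Template(spec_text).substitute(variables))


# Hypothetical template spec; only the env block is shown.
template_text = """
{
  "ppo_example": {
    "env": {"name": "${env}", "num_envs": 16, "max_frame": ${max_frame}}
  }
}
"""

spec = substitute(template_text, env="HalfCheetah-v5", max_frame="4e6")
print(spec["ppo_example"]["env"])
```

Substituting on the raw text before parsing lets one placeholder stand in for either a string (the environment name) or a number (the frame budget).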

Finding Environment Specs

Any Gymnasium environment works. Just set env.name to a valid Gymnasium environment ID. Use the benchmark specs as starting points for hyperparameters.

Standard Settings for Fair Comparison

When comparing algorithms, use consistent environment settings. Different num_envs or max_frame values make comparisons invalid.

| Category | num_envs | max_frame | log_frequency | Notes |
| --- | --- | --- | --- | --- |
| Classic Control | 4 | 2e5-3e5 | 500 | Fast training |
| Box2D | 8 | 3e5 | 1000 | Medium complexity |
| MuJoCo | 16 | 4e6-10e6 | 10000 | Use normalization |
| Atari | 16 | 10e6 | 10000 | GPU recommended |

What to Keep Consistent

When comparing algorithms on the same environment:

| Parameter | Keep Same? | Why |
| --- | --- | --- |
| num_envs | Yes | Affects data collection rate and batch statistics |
| max_frame | Yes | Total training budget must match |
| max_t | Yes | Episode length affects learning signal |
| normalize_obs | Yes | Changes observation distribution |
| normalize_reward | Yes | Changes reward scale |

Example: Fair Algorithm Comparison

To compare DQN vs PPO on LunarLander fairly:

Check that both specs have matching env settings before comparing results.
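One way to run that check is to diff the env blocks of the two specs on the parameters from the table above. The sketch below is illustrative: `env_mismatches` is a hypothetical helper, and the two env dicts are made-up values (with `num_envs` deliberately mismatched), not the contents of any benchmark spec.

```python
# The parameters the table above says to keep identical.
COMPARED_KEYS = ("num_envs", "max_frame", "max_t", "normalize_obs", "normalize_reward")


def env_mismatches(env_a, env_b, keys=COMPARED_KEYS):
    """Return {key: (value_a, value_b)} for every compared key that differs."""
    # A missing key is read as None, so an absent key and an explicit
    # false are reported as different; resolve defaults first if that matters.
    return {
        key: (env_a.get(key), env_b.get(key))
        for key in keys
        if env_a.get(key) != env_b.get(key)
    }


# Made-up env blocks for illustration.
dqn_env = {"name": "LunarLander-v3", "num_envs": 8, "max_frame": 3e5}
ppo_env = {"name": "LunarLander-v3", "num_envs": 16, "max_frame": 3e5}

print(env_mismatches(dqn_env, ppo_env))  # {'num_envs': (8, 16)}
```

An empty result means the two runs share the same environment budget and preprocessing, at least for the keys compared.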

Advanced Env Options

Environment Kwargs

Any additional keys in the env spec are passed to gymnasium.make():
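Conceptually, the env block is split into the settings documented on this page and everything else, with the remainder forwarded as keyword arguments. The sketch below shows that split under stated assumptions: `split_env_spec` is hypothetical, and the set of reserved keys is taken from the schema at the top of this page, so SLM Lab itself may reserve additional keys. `continuous` is a real keyword argument of Gymnasium's LunarLander environment.

```python
# Keys documented in the env spec schema; assumed to be the reserved set.
KNOWN_KEYS = {
    "name", "num_envs", "max_t", "max_frame",
    "normalize_obs", "normalize_reward", "clip_obs", "clip_reward",
}


def split_env_spec(env):
    """Split an env spec into documented settings and extra keyword arguments."""
    settings = {k: v for k, v in env.items() if k in KNOWN_KEYS}
    make_kwargs = {k: v for k, v in env.items() if k not in KNOWN_KEYS}
    return settings, make_kwargs


env = {
    "name": "LunarLander-v3",
    "num_envs": 8,
    "max_frame": 3e5,
    "continuous": True,   # extra key: selects the continuous-action variant
}

settings, make_kwargs = split_env_spec(env)
print(make_kwargs)
# The extras would then be forwarded along the lines of:
#   gymnasium.make(settings["name"], **make_kwargs)
```

Because unknown keys are forwarded rather than rejected, a misspelled setting ends up as an unexpected keyword argument to the environment, so it is worth double-checking key names.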

Normalization Details

The normalization wrappers maintain running statistics:

| Option | What It Does | When to Use |
| --- | --- | --- |
| normalize_obs | Centers observations, scales to unit variance | MuJoCo, continuous control |
| normalize_reward | Scales rewards using running std | Environments with varying reward scales |
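To show what "running statistics" means, here is a minimal scalar sketch in plain Python. It is a simplification and not the wrappers' actual code: real wrappers work on whole observation vectors, and reward normalizers commonly track the spread of discounted returns rather than raw rewards. The names `RunningStat`, `normalize_obs`, and `normalize_reward` (as functions) and the `eps` value are assumptions; the default bound of 10.0 matches the clip defaults documented above.

```python
import math


class RunningStat:
    """Running mean and variance of a scalar stream (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation; 0.0 before any sample is seen.
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


def clip(x, bound):
    return max(-bound, min(bound, x))


def normalize_obs(x, stat, bound=10.0, eps=1e-8):
    """Center and scale one observation value, then clip to [-bound, bound]."""
    stat.update(x)
    return clip((x - stat.mean) / (stat.std + eps), bound)


def normalize_reward(r, stat, bound=10.0, eps=1e-8):
    """Scale (without centering) a reward by the running std, then clip."""
    stat.update(r)
    return clip(r / (stat.std + eps), bound)


obs_stat = RunningStat()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    normalized = normalize_obs(value, obs_stat)
print(obs_stat.mean, obs_stat.std)
```

Because the statistics keep updating during training, the same raw observation can map to different normalized values early and late in a run, which is one reason to keep normalization settings identical across compared runs.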

Next, we'll use a GPU to train on Atari games, where image processing is the bottleneck.
