๐ŸŒEnv Spec

This tutorial shows how to configure the env spec for continuous control environments. We'll train PPO on HalfCheetah, a MuJoCo locomotion task.

The Env Spec

The environment is specified using the env key in a spec file:

{
  "spec_name": {
    "agent": {...},
    "env": {
      // Environment name (must be in gymnasium registry)
      "name": str,

      // Number of parallel environment instances
      "num_envs": int,

      // Maximum timesteps per episode (null = use environment default)
      "max_t": int|null,

      // Total training frames
      "max_frame": int,

      // Optional: Online state normalization (recommended for MuJoCo)
      "normalize_obs": bool,

      // Optional: Online reward normalization (recommended for MuJoCo)
      "normalize_reward": bool,

      // Optional: Clip observations to [-bound, bound] (default: 10.0 if normalize_obs)
      "clip_obs": float,

      // Optional: Clip rewards to [-bound, bound] (default: 10.0 if normalize_reward)
      "clip_reward": float
    },
    ...
  }
}
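
For instance, a minimal filled-in env block for CartPole might look like the sketch below (num_envs and max_frame follow the Classic Control defaults listed later on this page; the exact values in the shipped spec may differ):

```json
"env": {
  "name": "CartPole-v1",
  "num_envs": 4,
  "max_t": null,
  "max_frame": 2e5
}
```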

Supported Environments

SLM Lab uses Gymnasium (the maintained fork of OpenAI Gym):

| Category | Examples | Difficulty |
| --- | --- | --- |
| Classic Control | CartPole, Pendulum, Acrobot | Easy |
| Box2D | LunarLander, BipedalWalker | Medium |
| MuJoCo | Hopper, HalfCheetah, Humanoid | Hard |
| Atari | Qbert, MsPacman, and 54 more | Varied |

Any gymnasium-compatible environment works; just specify its name in the spec.

Environment-Specific Settings

| Category | num_envs | max_frame | Normalization | GPU |
| --- | --- | --- | --- | --- |
| Classic Control | 4 | 2e5-3e5 | No | Optional |
| Box2D | 8 | 3e5 | No | Optional |
| MuJoCo | 16 | 1e6-10e6 | normalize_obs, normalize_reward | Optional |
| Atari | 16 | 10e6 | No | Recommended |

See Benchmark Specs for complete spec files for each environment.

Example: PPO on HalfCheetah

HalfCheetah-v5 is a classic MuJoCo benchmark: a 2D cheetah robot that learns to run forward. It has a 17-dimensional observation space and a 6-dimensional continuous action space.

The PPO MuJoCo spec lives at slm_lab/spec/benchmark/ppo/ppo_mujoco.json.
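
Its env section looks roughly like the sketch below (values are illustrative rather than copied from the repo; the ${env} placeholder is filled in at launch, as described later on this page):

```json
"env": {
  "name": "${env}",
  "num_envs": 16,
  "max_t": null,
  "max_frame": 1e6,
  "normalize_obs": true,
  "normalize_reward": true
}
```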

Key env settings for MuJoCo:

| Parameter | Value | Why |
| --- | --- | --- |
| num_envs: 16 | 16 parallel environments | Faster data collection for on-policy learning |
| normalize_obs: true | Normalize observations | MuJoCo observations have varying scales |
| normalize_reward: true | Normalize rewards | Stabilizes value function learning |

MuJoCo became free in 2022. No license needed; Gymnasium includes MuJoCo out of the box.

Running PPO on HalfCheetah
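
A launch command along these lines should work (a sketch: the run_lab.py entry point and train mode follow SLM Lab's usual usage, and the spec name ppo_mujoco inside the file is an assumption):

```bash
# Train PPO on HalfCheetah-v5 using the shared MuJoCo spec
python run_lab.py slm_lab/spec/benchmark/ppo/ppo_mujoco.json ppo_mujoco train -s env=HalfCheetah-v5
```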

The variable substitution (-s env=...) lets you use the same spec for different MuJoCo environments.

Results

PPO reaches a moving-average (MA) return of 5852 on HalfCheetah-v5 with this configuration.

Training curves (average of 4 sessions):

[Figures: HalfCheetah training curve and HalfCheetah moving average]

Trained models available on HuggingFace.

Using Other Environments

To use a different environment, find a spec for that environment category and modify env.name. Spec files are organized by algorithm in slm_lab/spec/benchmark/:

Environment Categories

| Category | Environments | Spec Examples |
| --- | --- | --- |
| Classic Control | CartPole-v1, Acrobot-v1, Pendulum-v1, MountainCar-v0 | ppo_cartpole.json, dqn_cartpole.json |
| Box2D | LunarLander-v3, BipedalWalker-v3 | ppo_lunar.json, ddqn_per_lunar.json |
| MuJoCo | Hopper-v5, HalfCheetah-v5, Walker2d-v5, Ant-v5, Humanoid-v5, Swimmer-v5, etc. | ppo_mujoco.json, sac_mujoco.json |
| Atari | 54 games (ALE/Qbert-v5, ALE/MsPacman-v5, etc.) | ppo_atari.json, dqn_atari.json |

Switching Environments

  1. Find a spec for your target environment category

  2. Change env.name to the Gymnasium environment name

  3. Adjust settings as needed (num_envs, max_frame, normalization)

Example: use the same algorithm on different environments.
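
For instance (a sketch: the spec file paths come from the tables on this page, while the spec names passed on the command line and the run_lab.py invocation are assumptions):

```bash
# Same PPO algorithm, three different environment categories
python run_lab.py slm_lab/spec/benchmark/ppo/ppo_cartpole.json ppo_cartpole train
python run_lab.py slm_lab/spec/benchmark/ppo/ppo_lunar.json ppo_lunar train
python run_lab.py slm_lab/spec/benchmark/ppo/ppo_mujoco.json ppo_mujoco train -s env=Ant-v5
```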

Template Specs with Variable Substitution

Some specs use ${var} placeholders for flexibility. Use -s var=value to substitute:
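
For example, a template spec's env block might contain a placeholder like this (illustrative fragment):

```json
"env": {
  "name": "${env}",
  "num_envs": 16
}
```

Launching with -s env=Hopper-v5 would then substitute Hopper-v5 for ${env} before the spec is used.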

Finding Environment Specs
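
The benchmark specs live under slm_lab/spec/benchmark/, grouped by algorithm, so a quick directory listing shows what is available (generic shell commands; the exact layout may vary by version):

```bash
ls slm_lab/spec/benchmark/          # one folder per algorithm (ppo, dqn, sac, ...)
ls slm_lab/spec/benchmark/ppo/      # ppo_cartpole.json, ppo_lunar.json, ppo_mujoco.json, ...
```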

Any Gymnasium environment works. Just set env.name to a valid Gymnasium environment ID. Use the benchmark specs as starting points for hyperparameters.

Standard Settings for Fair Comparison

When comparing algorithms, use consistent environment settings. Different num_envs or max_frame values make comparisons invalid.

| Category | num_envs | max_frame | log_frequency | Notes |
| --- | --- | --- | --- | --- |
| Classic Control | 4 | 2e5-3e5 | 500 | Fast training |
| Box2D | 8 | 3e5 | 1000 | Medium complexity |
| MuJoCo | 16 | 4e6-10e6 | 10000 | Use normalization |
| Atari | 16 | 10e6 | 10000 | GPU recommended |

What to Keep Consistent

When comparing algorithms on the same environment:

| Parameter | Keep Same? | Why |
| --- | --- | --- |
| num_envs | Yes | Affects data collection rate and batch statistics |
| max_frame | Yes | Total training budget must match |
| max_t | Yes | Episode length affects learning signal |
| normalize_obs | Yes | Changes observation distribution |
| normalize_reward | Yes | Changes reward scale |

Example: Fair Algorithm Comparison

To compare DQN vs PPO on LunarLander fairly:
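
One way is to make sure both spec files share the same env block, along these lines (a sketch; the values follow the Box2D row above):

```json
// In both dqn_lunar.json and ppo_lunar.json, the env settings should match, e.g.:
"env": {
  "name": "LunarLander-v3",
  "num_envs": 8,
  "max_t": null,
  "max_frame": 3e5
}
```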

Check that both specs have matching env settings before comparing results.

Advanced Env Options

Environment Kwargs

Any additional keys in the env spec are passed to gymnasium.make():
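
For example, MuJoCo constructor arguments can be set this way (a sketch: ctrl_cost_weight and exclude_current_positions_from_observation are standard HalfCheetah-v5 kwargs in Gymnasium, and forwarding them verbatim is assumed from the statement above):

```json
"env": {
  "name": "HalfCheetah-v5",
  "num_envs": 16,
  "max_frame": 1e6,
  // extra keys are passed through to gymnasium.make() as kwargs
  "ctrl_cost_weight": 0.1,
  "exclude_current_positions_from_observation": false
}
```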

Normalization Details

The normalization wrappers maintain running statistics:

| Option | What It Does | When to Use |
| --- | --- | --- |
| normalize_obs | Centers observations, scales to unit variance | MuJoCo, continuous control |
| normalize_reward | Scales rewards using running std | Environments with varying reward scales |
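
Conceptually, this mirrors Gymnasium's NormalizeObservation and NormalizeReward wrappers: track running statistics, rescale each step, then apply the configured clipping. A sketch of the intent, not SLM Lab's exact code (the epsilon is an assumption):

```text
obs_out    = clip((obs - running_mean) / sqrt(running_var + 1e-8), -clip_obs, clip_obs)
reward_out = clip(reward / sqrt(running_return_var + 1e-8), -clip_reward, clip_reward)
```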

Next, we'll use GPU to train on Atari games where image processing is the bottleneck.
