โ–ถ๏ธTrain: PPO on CartPole

Run a full training with saved results.

Train vs Dev Mode

In Quick Start, we used dev mode for quick verification. Now we use train mode:

| Mode  | Sessions    | Rendering | Saves results |
|-------|-------------|-----------|---------------|
| dev   | 1           | optional  | no            |
| train | 4 (default) | disabled  | yes           |

Train mode runs multiple sessions with different random seeds for statistical reliability, and disables rendering for faster training.
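Because each session trains the same spec under a different random seed, the trial-level result is an aggregate across sessions. A minimal sketch of that kind of aggregation, using made-up scores (SLM Lab computes its own statistics internally; the numbers and variable names here are purely illustrative):

```python
import statistics

# Hypothetical final scores from a 4-session trial
# (illustrative numbers, not real benchmark results).
session_scores = [500.0, 487.3, 500.0, 495.1]

# A trial's headline number is typically the mean across sessions;
# the spread indicates how seed-sensitive training is.
mean_score = statistics.mean(session_scores)
stdev_score = statistics.stdev(session_scores)
print(f"trial score: {mean_score:.1f} ± {stdev_score:.1f}")
```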

Run Training

slm-lab run slm_lab/spec/benchmark/ppo/ppo_cartpole.json ppo_cartpole train

This runs a Trial with 4 Sessions; training completes in about 5-10 minutes. Watch total_reward_ma climb toward 500, CartPole's maximum episode return, at which point the task is solved.

Training curves (from benchmark run):

PPO CartPole Training Curve

Moving average smooths out episode-to-episode noise to show the learning trend:

PPO CartPole Moving Average
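The smoothing shown above is a trailing window mean over episode returns. A sketch, assuming a fixed window size (the window SLM Lab actually uses may differ):

```python
def moving_average(rewards, window=100):
    """Mean of the trailing `window` rewards at each episode.

    Early episodes with fewer than `window` values average whatever
    history exists, so the curve starts immediately.
    """
    averaged = []
    for i in range(len(rewards)):
        start = max(0, i - window + 1)
        chunk = rewards[start:i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged

# Noisy toy rewards: the raw values jump around, the average trends up.
raw = [10, 50, 20, 80, 40, 120, 90, 200, 150, 300]
print(moving_average(raw, window=3))
```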

Output Folder

Results are saved to a timestamped folder.

Best vs final checkpoints: "best" is the checkpoint with the highest evaluation score during training; "final" is the checkpoint at the end. The two are usually similar, but "best" helps if performance dropped late in training.
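Selecting "best" amounts to an argmax over recorded evaluation scores. A sketch with hypothetical checkpoint names and scores (SLM Lab tracks this internally; everything here is made up for illustration):

```python
# Hypothetical (checkpoint name, eval score) pairs recorded during training.
eval_history = [
    ("ckpt_10000", 180.5),
    ("ckpt_20000", 430.0),
    ("ckpt_30000", 500.0),  # peak performance
    ("ckpt_40000", 470.2),  # slight drop near the end
]

# "best" is the highest-scoring checkpoint; "final" is simply the last one.
best_name, best_score = max(eval_history, key=lambda pair: pair[1])
final_name, final_score = eval_history[-1]
print(f"best:  {best_name} ({best_score})")
print(f"final: {final_name} ({final_score})")
```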

See Data Locations for full details.

Auto-Upload to HuggingFace

If you have set up the required environment variables, results auto-upload to HuggingFace after training, which is useful for remote training.

Next Steps

Learn how SLM Lab organizes experiments in Core Concepts.
