This runs a Trial with 4 Sessions. Training completes in about 5 to 10 minutes. Watch total_reward_ma, the moving average of total reward, climb toward 500, at which point the environment counts as solved.
Training curves (from benchmark run):
[Figure: PPO CartPole training curve]
The moving average smooths out episode-to-episode noise to show the learning trend:
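To make the smoothing concrete, here is a minimal sketch of a trailing moving average over per-episode rewards. The function name `moving_average`, the window size, and the sample rewards are illustrative assumptions; the window the framework actually uses for total_reward_ma is not stated here.

```python
from collections import deque


def moving_average(values, window=100):
    """Trailing moving average (hypothetical helper, not the library's API).

    Early entries average over however many points exist so far, so the
    output has the same length as the input.
    """
    buf = deque(maxlen=window)  # holds at most the last `window` values
    running_sum = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            # The oldest value is about to be evicted by the append below.
            running_sum -= buf[0]
        buf.append(v)
        running_sum += v
        out.append(running_sum / len(buf))
    return out


# Made-up episode rewards, only to show the smoothing effect.
rewards = [10, 30, 20, 60, 40]
smoothed = moving_average(rewards, window=3)
print(smoothed)
```

A larger window gives a smoother curve but reacts more slowly to real changes in performance, which is why a moving-average plot lags behind the raw one.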
[Figure: PPO CartPole moving average]
Output Folder
Results are saved to a timestamped folder:
Best vs final checkpoints: "best" is the checkpoint with the highest evaluation score reached during training; "final" is the checkpoint saved when training ends. The two are usually similar, but "best" is the safer choice if performance dropped near the end.
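The distinction can be illustrated with a small sketch that picks both checkpoints from a list of evaluation scores. The function `select_checkpoints` and the scores are hypothetical and only illustrate the idea; they are not how the framework stores or selects checkpoints.

```python
def select_checkpoints(eval_scores):
    """Return (best_index, final_index) for a list of evaluation scores.

    Hypothetical helper: "best" is the evaluation with the highest score
    (the earliest one on ties), "final" is simply the last evaluation.
    """
    if not eval_scores:
        raise ValueError("need at least one evaluation score")
    best_index = max(range(len(eval_scores)), key=eval_scores.__getitem__)
    final_index = len(eval_scores) - 1
    return best_index, final_index


# Made-up scores where performance dips at the end of training.
scores = [120.0, 310.0, 498.0, 455.0]
best, final = select_checkpoints(scores)
print(f"best: eval {best} ({scores[best]}), final: eval {final} ({scores[final]})")
```

In this made-up run the two indices differ, which is the case where loading "best" rather than "final" matters.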