Class Inheritance: A2C > PPO

REINFORCE → ActorCritic → PPO

Proximal Policy Optimization (PPO) (Schulman et al., 2017) demonstrates SLM Lab's taxonomy-based inheritance. While PPO appears complex as a standalone algorithm, it differs from Actor-Critic in only a few key ways:

  1. Clipped surrogate objective for policy updates (shown below)

  2. Minibatch training over multiple epochs

  3. Old network for computing probability ratios
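
For reference, the clipped surrogate objective from Schulman et al. (2017), where $r_t(\theta)$ is the probability ratio between the current and old policies (the role of the old network) and $\hat{A}_t$ is the advantage estimate; the `clip_eps` hyperparameter listed later corresponds to $\epsilon$:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$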

Code Comparison

The PPO class inherits from ActorCritic and overrides only what's different:
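
A minimal sketch of that structure, assuming the method names from the tables below; the bodies, default hyperparameter values, and attribute names such as `self.net` and `self.critic_net` are illustrative stand-ins, not SLM Lab's actual code:

```python
import copy

import torch
import torch.nn as nn


class ActorCritic:
    """Parent algorithm: advantage actor-critic (A2C)."""

    def init_algorithm_params(self):
        self.gamma, self.lam, self.entropy_coef = 0.99, 0.95, 0.01

    def init_nets(self):
        # Toy stand-ins for the real network builders.
        self.net = nn.Linear(4, 2)         # actor
        self.critic_net = nn.Linear(4, 1)  # critic

    def calc_policy_loss(self, log_probs, advs, entropy):
        # Vanilla policy gradient with an entropy bonus.
        return -(log_probs * advs).mean() - self.entropy_coef * entropy.mean()

    def calc_val_loss(self, v_preds, v_targets):
        return nn.functional.mse_loss(v_preds, v_targets)

    def train(self):
        ...  # single gradient step per batch


class PPO(ActorCritic):
    """Child algorithm: overrides only what differs from A2C."""

    def init_algorithm_params(self):
        super().init_algorithm_params()
        self.clip_eps, self.minibatch_size, self.time_horizon = 0.2, 64, 128

    def init_nets(self):
        super().init_nets()
        self.old_net = copy.deepcopy(self.net)  # old policy copy used for probability ratios

    def calc_policy_loss(self, log_probs, old_log_probs, advs, entropy):
        # Clipped surrogate objective in place of the vanilla policy gradient.
        ratio = torch.exp(log_probs - old_log_probs)
        clipped = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps)
        surrogate = torch.min(ratio * advs, clipped * advs).mean()
        return -surrogate - self.entropy_coef * entropy.mean()

    def train(self):
        ...  # multi-epoch minibatch training (sketched after the tables below)

    # Everything else (e.g. calc_v, calc_advs_v_targets in the real ActorCritic)
    # is inherited unchanged.
```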

Inheritance Benefits

| Benefit | How PPO Demonstrates It |
| --- | --- |
| Code reuse | `calc_v()`, `calc_advs_v_targets()`, and network initialization are all inherited |
| Fewer bugs | Only ~280 lines to review vs. thousands for a standalone implementation |
| Fair comparison | Differences between A2C and PPO are exactly the overridden methods |
| Easy extension | Adding PPO variants (e.g., with different clipping) requires minimal code |
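
As an illustration of the last row, a hypothetical variant with a stricter lower clip bound for negative advantages (a dual-clip-style rule) only needs to override one method. It builds on the `PPO` sketch above; the class name and `dual_clip` hyperparameter are invented for this example:

```python
import torch


class DualClipPPO(PPO):
    """Hypothetical PPO variant: adds a second, lower clip bound for
    negative advantages; everything else is inherited from PPO unchanged."""

    def init_algorithm_params(self):
        super().init_algorithm_params()
        self.dual_clip = 3.0  # assumed extra hyperparameter

    def calc_policy_loss(self, log_probs, old_log_probs, advs, entropy):
        ratio = torch.exp(log_probs - old_log_probs)
        clipped = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps)
        surr = torch.min(ratio * advs, clipped * advs)
        # Extra lower bound when the advantage is negative ("dual clip").
        surr = torch.where(advs < 0, torch.max(surr, self.dual_clip * advs), surr)
        return -surr.mean() - self.entropy_coef * entropy.mean()
```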

Method Override Summary

| Method | ActorCritic | PPO Override |
| --- | --- | --- |
| `init_algorithm_params` | Sets `gamma`, `lam`, `entropy_coef` | Adds `clip_eps`, `minibatch_size`, `time_horizon` |
| `init_nets` | Creates actor/critic networks | Adds `old_net` copy |
| `calc_policy_loss` | Policy gradient with entropy | Clipped surrogate objective |
| `calc_val_loss` | MSE loss | Optional CleanRL-style clipping |
| `train` | Single gradient step | Multi-epoch minibatch training |
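
The last row is the biggest behavioral change. A rough sketch of its shape, where `algo.evaluate` and the `optimizer` attribute are assumed helpers rather than SLM Lab's actual API:

```python
import torch


def ppo_train(algo, batch, epochs=4):
    """Multi-epoch minibatch training over one collected batch,
    in contrast to A2C's single gradient step."""
    n = batch["states"].shape[0]
    for _ in range(epochs):
        for idx in torch.randperm(n).split(algo.minibatch_size):
            advs, v_targets = batch["advs"][idx], batch["v_targets"][idx]
            old_log_probs = batch["old_log_probs"][idx]  # computed once from old_net
            # Re-evaluate the current policy so the probability ratio carries gradients.
            log_probs, entropy, v_preds = algo.evaluate(
                batch["states"][idx], batch["actions"][idx])  # assumed helper
            loss = (algo.calc_policy_loss(log_probs, old_log_probs, advs, entropy)
                    + algo.calc_val_loss(v_preds, v_targets))
            algo.optimizer.zero_grad()
            loss.backward()
            algo.optimizer.step()
```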

Full Class Hierarchy
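
A sketch of the chain as class declarations; the `Algorithm` root class and the one-line summaries are assumptions about the framework rather than verbatim code:

```python
class Algorithm: ...               # shared plumbing: spec parsing, networks, training hooks

class REINFORCE(Algorithm): ...    # Monte Carlo policy gradient

class ActorCritic(REINFORCE): ...  # adds a learned critic and advantage estimation (gamma, lam)

class PPO(ActorCritic): ...        # adds clip_eps, old_net, and minibatch-epoch training
```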

Each level adds only its distinguishing features, making the codebase:

  • Readable: Understanding PPO requires reading ~280 lines, not thousands

  • Testable: Test ActorCritic once, trust it works for PPO

  • Comparable: Performance differences between algorithms are attributable to their actual differences

Running the Comparison

Both use the same network architecture, memory, and environment setup; only the algorithm differs.
