Proximal Policy Optimization (PPO) (Schulman et al., 2017) demonstrates SLM Lab's taxonomy-based inheritance. While PPO appears complex as a standalone algorithm, it differs from Actor-Critic in only a few key ways:
- Clipped surrogate objective for policy updates (sketched just below)
- Minibatch training over multiple epochs
- An old network for computing probability ratios
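The clipped surrogate objective is the most consequential of these differences. Below is a minimal PyTorch sketch of that loss, not the project's exact implementation; the tensor names and the default `clip_eps` are illustrative.

```python
import torch

def clipped_surrogate_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Sketch of the PPO clipped surrogate objective, returned as a loss to minimize."""
    # Probability ratio r_t = pi_theta(a|s) / pi_theta_old(a|s), computed in log space
    ratios = torch.exp(log_probs - old_log_probs)
    # Unclipped and clipped surrogate terms
    surr = ratios * advantages
    clipped_surr = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (elementwise minimum) term and negate it, since optimizers minimize
    return -torch.min(surr, clipped_surr).mean()
```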
## Code Comparison
The PPO class inherits from ActorCritic and overrides only what's different:
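(Simplified sketch: method bodies are elided, and details such as the import path, constructor signatures, attribute names like `self.net`, and default hyperparameter values are assumptions rather than the exact source.)

```python
from copy import deepcopy

from slm_lab.agent.algorithm.actor_critic import ActorCritic  # import path assumed


class PPO(ActorCritic):
    """PPO as a thin extension of ActorCritic (sketch; bodies elided or simplified)."""

    def init_algorithm_params(self):
        super().init_algorithm_params()  # gamma, lam, entropy_coef, ... inherited
        # PPO-specific hyperparameters (illustrative values; normally read from the spec)
        self.clip_eps = 0.2
        self.minibatch_size = 64
        self.time_horizon = 128

    def init_nets(self, global_nets=None):
        super().init_nets(global_nets)     # actor/critic networks exactly as in A2C
        self.old_net = deepcopy(self.net)  # frozen copy used for probability ratios

    def calc_policy_loss(self, batch, pdparams, advs):
        ...  # clipped surrogate objective (see the loss sketch above)

    def train(self):
        ...  # multi-epoch minibatch updates instead of a single gradient step
```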
## Inheritance Benefits
| Benefit | How PPO Demonstrates It |
| --- | --- |
| Code reuse | `calc_v()`, `calc_advs_v_targets()`, and network initialization are all inherited |
| Fewer bugs | Only ~280 lines to review vs. thousands for a standalone implementation |
| Fair comparison | Differences between A2C and PPO are exactly the overridden methods |
| Easy extension | Adding PPO variants (e.g., with different clipping) requires minimal code (see the sketch just after this table) |
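As an example of the last row, a hypothetical variant that linearly anneals the clip range could be added in a handful of lines on top of the `PPO` sketch above (the class and method names here are illustrative and not part of the codebase).

```python
class PPOLinearClipDecay(PPO):
    """Hypothetical PPO variant: linearly anneal clip_eps over training (illustrative only)."""

    def init_algorithm_params(self):
        super().init_algorithm_params()
        self.clip_eps_start = self.clip_eps  # reuse the inherited value as the starting range
        self.clip_eps_end = 0.05             # illustrative final clip range

    def update_clip_eps(self, frame, max_frame):
        # Linearly interpolate from clip_eps_start to clip_eps_end over training
        frac = min(frame / max_frame, 1.0)
        self.clip_eps = self.clip_eps_start + frac * (self.clip_eps_end - self.clip_eps_start)
```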
## Method Override Summary
| Method | ActorCritic | PPO Override |
| --- | --- | --- |
| `init_algorithm_params` | Sets `gamma`, `lam`, `entropy_coef` | Adds `clip_eps`, `minibatch_size`, `time_horizon` |
| `init_nets` | Creates actor/critic networks | Adds `old_net` copy |
| `calc_policy_loss` | Policy gradient with entropy | Clipped surrogate objective |
| `calc_val_loss` | MSE loss | Optional CleanRL-style clipping |
| `train` | Single gradient step | Multi-epoch minibatch training (sketched below) |
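The `train` override in the last row is where the procedure visibly diverges from A2C's single update. Below is a minimal sketch of multi-epoch minibatch training that reuses `clipped_surrogate_loss` from earlier; `policy.log_prob(...)` and the other argument names are assumptions, not the project's API.

```python
import torch

def ppo_update(policy, optimizer, states, actions, old_log_probs, advantages,
               clip_eps=0.2, epochs=4, minibatch_size=64):
    """Sketch of PPO's multi-epoch minibatch update loop (illustrative only)."""
    n = states.shape[0]
    for _ in range(epochs):
        # Reshuffle the rollout each epoch and iterate over minibatches
        for idxs in torch.randperm(n).split(minibatch_size):
            log_probs = policy.log_prob(states[idxs], actions[idxs])  # hypothetical helper
            loss = clipped_surrogate_loss(
                log_probs, old_log_probs[idxs], advantages[idxs], clip_eps)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```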
## Full Class Hierarchy
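For the policy-gradient family discussed here, the chain in SLM Lab's taxonomy is linear (other families, such as the value-based algorithms, are omitted):

- `Algorithm`: abstract base interface
  - `Reinforce`: adds the basic policy-gradient loss
    - `ActorCritic`: adds a critic and advantage estimation
      - `PPO`: adds the clipped objective, `old_net`, and minibatch epochs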
Each level adds only its distinguishing features, making the codebase:

- **Readable**: Understanding PPO requires reading ~280 lines, not thousands
- **Testable**: Test ActorCritic once, trust it works for PPO
- **Comparable**: Performance differences between algorithms are attributable to their actual differences
## Running the Comparison
Both use the same network architecture, memory, and environment setup; only the algorithm differs.
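Schematically, the two experiment specs share every block except `algorithm`; the keys and values below are illustrative, not the project's actual spec format.

```python
shared = {
    'net': {'type': 'MLPNet', 'hid_layers': [64, 64]},  # same network architecture
    'memory': {'name': 'OnPolicyBatchReplay'},          # same rollout memory
    'env': {'name': 'CartPole-v1'},                     # same environment
}

a2c_spec = {**shared, 'algorithm': {'name': 'ActorCritic', 'gamma': 0.99, 'lam': 0.95}}
ppo_spec = {**shared, 'algorithm': {'name': 'PPO', 'gamma': 0.99, 'lam': 0.95,
                                    'clip_eps': 0.2, 'minibatch_size': 64,
                                    'time_horizon': 128}}
```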