🧩 Modular Design

Why Modularity Matters

Deep RL research requires comparing algorithms fairly. If two implementations differ in subtle ways beyond the algorithm itself (network initialization, gradient clipping, learning rate schedules), performance differences become meaningless.

SLM Lab solves this with modular design: algorithms share the same base code, networks, memory, and training loops. When you compare PPO to A2C in SLM Lab, the only differences are the algorithmic ones.

Benefits:

| Benefit | How SLM Lab Achieves It |
| --- | --- |
| Fair comparison | Algorithms share base classes; only algorithmic differences matter |
| Code reuse | New algorithms inherit tested components |
| Fewer bugs | Less code to review and maintain |
| Easy extension | Add new algorithms by overriding specific methods |

Core Components

SLM Lab is built around three base classes:

Agent
 ├── Algorithm    Handles interaction with env, computes loss, runs training
 │    └── Net     Neural network function approximator
 └── Memory       Data storage and retrieval for training
SLM Lab Class Diagram

Algorithm

Implements the RL algorithm logic:

  • Action selection (exploration vs exploitation)

  • Loss computation (policy loss, value loss)

  • Training step (gradient updates)

Key methods:
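A simplified sketch of this interface is shown below. The method names mirror SLM Lab's Algorithm API and the roles described above; the bodies, docstrings, and exact signatures are illustrative only, not the real implementation.

```python
# Hedged sketch of the Algorithm base interface (illustrative, not SLM Lab source).
from abc import ABC, abstractmethod


class Algorithm(ABC):
    """Owns the Net(s); defines how the agent acts and trains."""

    def __init__(self, agent):
        self.agent = agent
        self.init_algorithm_params()  # read hyperparameters from the spec
        self.init_nets()              # build the Net function approximators

    @abstractmethod
    def init_algorithm_params(self):
        """Set algorithm hyperparameters (gamma, entropy coefficient, ...) from the spec."""

    @abstractmethod
    def init_nets(self):
        """Construct the neural network(s) the algorithm uses."""

    @abstractmethod
    def act(self, state):
        """Select an action for the current state (exploration vs. exploitation)."""

    @abstractmethod
    def sample(self):
        """Pull a batch of experience from the agent's Memory."""

    @abstractmethod
    def train(self):
        """Compute losses and run gradient updates when Memory signals it is time."""

    def update(self):
        """Update internal variables (e.g. exploration schedules) after each step."""
```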

Memory

Handles experience storage:

  • Store transitions (s, a, r, s', done)

  • Sample batches for training

  • Signal when it's time to train

Types:

  • Replay - Off-policy ring buffer

  • PrioritizedReplay - Priority-based sampling

  • OnPolicyBatchReplay - On-policy with fixed batch size
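To make the storage-and-sampling contract concrete, here is a minimal ring-buffer sketch in the spirit of Replay. Only update and sample mirror real method names; SimpleReplay and ready_to_train are illustrative stand-ins, and the real classes add batching, prioritization, and training-frequency logic.

```python
# Hedged sketch of a Replay-style memory: a fixed-size ring buffer with uniform sampling.
import random
from collections import deque


class SimpleReplay:
    def __init__(self, max_size, batch_size):
        self.buffer = deque(maxlen=max_size)  # ring buffer: oldest transitions fall off
        self.batch_size = batch_size

    def update(self, state, action, reward, next_state, done):
        """Store one transition (s, a, r, s', done)."""
        self.buffer.append((state, action, reward, next_state, done))

    def ready_to_train(self):
        """Signal when enough experience has accumulated for a training batch."""
        return len(self.buffer) >= self.batch_size

    def sample(self):
        """Uniformly sample a training batch (off-policy reuse of stored transitions)."""
        return random.sample(list(self.buffer), self.batch_size)
```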

Net

Neural network architectures:

  • Configurable hidden layers and activations

  • Automatic input/output sizing

  • Optimizer and scheduler management

Types:

  • MLPNet - Fully connected layers

  • ConvNet - Convolutional + fully connected

  • RecurrentNet - LSTM/GRU for sequences
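As a sketch of what configurable means here, the snippet below builds an MLP from a spec-like dict, assuming PyTorch. The key names (hid_layers, hid_layers_activation) approximate the net spec; the real MLPNet also wires up the optimizer, learning-rate scheduler, and gradient clipping.

```python
# Hedged sketch of a configurable MLP in the spirit of MLPNet.
import torch.nn as nn


def build_mlp(net_spec, in_dim, out_dim):
    """Build a fully connected network from a config dict (illustrative only)."""
    activation = {'relu': nn.ReLU, 'tanh': nn.Tanh, 'selu': nn.SELU}[net_spec['hid_layers_activation']]
    dims = [in_dim] + net_spec['hid_layers']
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), activation()]
    layers.append(nn.Linear(dims[-1], out_dim))  # output head sized to the action space
    return nn.Sequential(*layers)


# Example: a 4-observation, 2-action CartPole-style network
net = build_mlp({'hid_layers': [64, 64], 'hid_layers_activation': 'relu'}, in_dim=4, out_dim=2)
```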

Inheritance in Action

Consider how algorithms are related:

PPO inherits from ActorCritic and only overrides:

  1. init_algorithm_params() - Add PPO-specific hyperparameters

  2. init_nets() - Add old_net for ratio computation

  3. calc_policy_loss() - Implement clipped surrogate objective

  4. train() - Multi-epoch minibatch training

Everything else (value function computation, advantage estimation, network initialization) is inherited from ActorCritic.

Result: PPO is ~280 lines of code, rather than the thousands a standalone implementation would need.
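In code, the pattern is plain subclass-and-override. The skeleton below is a hedged outline with simplified signatures and elided bodies, not the actual SLM Lab source:

```python
# Hedged skeleton of the inheritance pattern: PPO reuses ActorCritic and
# overrides only the four methods listed above (signatures simplified).

class ActorCritic:
    def init_algorithm_params(self): ...   # shared hyperparameter setup
    def init_nets(self): ...               # build the actor and critic networks
    def calc_policy_loss(self, batch): ... # vanilla policy-gradient loss
    def calc_val_loss(self, batch): ...    # value loss, inherited unchanged by PPO
    def train(self): ...                   # single-pass training loop


class PPO(ActorCritic):
    def init_algorithm_params(self):
        super().init_algorithm_params()
        # add PPO-specific hyperparameters, e.g. the clip range epsilon

    def init_nets(self):
        super().init_nets()
        # add old_net, a frozen policy copy used for the probability ratio

    def calc_policy_loss(self, batch):
        # clipped surrogate objective: min(r*A, clip(r, 1-eps, 1+eps)*A)
        ...

    def train(self):
        # multi-epoch minibatch training over the collected on-policy batch
        ...
```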

See Class Inheritance: A2C > PPO for the detailed walkthrough.

Composability

Components can be mixed and matched via the spec file:
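For illustration, here is a trimmed, hedged sketch of such a spec, written as a Python dict for brevity (the real specs are JSON files and carry more sections and keys than shown):

```python
# Hedged, trimmed sketch of a spec. The point: algorithm, memory, and net
# are chosen independently and composed by name.
spec = {
    "dqn_cartpole": {
        "agent": [{
            "name": "DQN",
            "algorithm": {"name": "DQN", "gamma": 0.99},
            "memory": {"name": "Replay", "max_size": 10000, "batch_size": 32},
            "net": {"type": "MLPNet", "hid_layers": [64], "hid_layers_activation": "selu"},
        }],
        "env": [{"name": "CartPole-v0", "max_frame": 10000}],
    }
}
# Swapping "Replay" for "PrioritizedReplay", or "MLPNet" for "RecurrentNet",
# changes the composition without touching the algorithm code.
```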

Example compositions:

| Combination | What It Creates |
| --- | --- |
| DQN + Replay + MLPNet | Standard DQN |
| DoubleDQN + PrioritizedReplay + MLPNet | DDQN+PER |
| DQN + Replay + RecurrentNet | DRQN (recurrent DQN) |
| PPO + OnPolicyBatchReplay + ConvNet | PPO for Atari |

Why This Design?

Henderson et al. (2017) showed that implementation details dramatically affect RL results. Two "correct" implementations of the same algorithm can have vastly different performance due to:

  • Random seed selection

  • Network initialization

  • Hyperparameter defaults

  • Gradient clipping values

  • Frame preprocessing

SLM Lab addresses this by:

  1. Shared base code - Algorithms inherit common functionality

  2. Explicit configuration - All hyperparameters in spec files

  3. Reproducibility - Spec + git SHA fully defines an experiment

  4. Benchmarking - Validated implementations with published results

Extending SLM Lab

To add a new algorithm:
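As a hedged outline, a new DQN variant might look like the following; the module path, base class, and the calc_q_loss hook are assumptions to verify against your checkout:

```python
# Hedged outline (module path, base class, and hook names assumed; verify against the repo).
# File: slm_lab/agent/algorithm/my_dqn.py
from slm_lab.agent.algorithm.dqn import DQN


class MyDQN(DQN):
    """A DQN variant: override only what differs from the parent."""

    def init_algorithm_params(self):
        super().init_algorithm_params()
        # read any new hyperparameters from the algorithm spec here

    def calc_q_loss(self, batch):
        # swap in your own Q-loss; action selection, memory, and the
        # training loop are inherited from the parent class
        ...
```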

Then register in slm_lab/agent/algorithm/__init__.py and create a spec file.

Further Reading
