REINFORCE

REINFORCE (Williams, 1992) directly learns a parameterized policy, $\pi$, which maps states to probability distributions over actions.

Starting with random parameter values, the agent uses this policy to act in an environment and receive rewards. After an episode has finished, the "goodness" of each action, represented by $f(\tau)$, is calculated using the episode trajectory. The parameters of the policy are then updated in a direction which makes good actions ($f(\tau) > 0$) more likely, and bad actions ($f(\tau) < 0$) less likely. Good actions are reinforced, bad actions are discouraged.

The agent then uses the updated policy to act in the environment, and the training process repeats.

REINFORCE is an on-policy algorithm: only data gathered using the current policy can be used to update the parameters. Once the policy parameters have been updated, all previously gathered data must be discarded and collection must start again with the new policy.

There are a number of different approaches to calculating $f(\tau)$. Method 3, outlined below, is common. It captures the idea that the absolute quality of the actions matters less than their quality relative to some baseline. One option for a baseline is the average of $f(\tau)$ over the training data (typically one episode trajectory).

Algorithm: REINFORCE with baseline

$$
\begin{aligned}
& \text{Initialize weights } \theta \text{, learning rate } \alpha \\
& \text{for each episode (trajectory) } \tau = \{s_0, a_0, r_0, s_1, \cdots, r_T\} \sim \pi_\theta \\
& \quad \text{for } t = 0 \text{ to } T \text{ do} \\
& \quad \quad \theta \leftarrow \theta + \alpha \, f(\tau)_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
& \quad \text{end for} \\
& \text{end for}
\end{aligned}
$$

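The inner loop above is a gradient ascent step on $f(\tau)_t \log \pi_\theta(a_t \mid s_t)$. In practice it is usually implemented as a single loss over the whole trajectory and handed to an autograd optimizer. Below is a minimal sketch in PyTorch, assuming a policy_net that maps states to action logits for a discrete action space and precomputed f_tau values; the helper name reinforce_update is hypothetical, and this is an illustration of the update rather than SLM Lab's actual implementation.

    import torch

    def reinforce_update(policy_net, optimizer, states, actions, f_tau):
        """One REINFORCE update over a full episode.

        states: (T, state_dim) tensor; actions: (T,) tensor of action indices;
        f_tau: (T,) tensor holding f(tau)_t for each timestep.
        """
        logits = policy_net(states)                        # (T, num_actions)
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(actions)                 # log pi_theta(a_t | s_t)
        # Maximizing sum_t f(tau)_t * log pi is the same as minimizing its negation.
        loss = -(f_tau * log_probs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
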
Methods for calculating $f(\tau)_t$:

$$
\begin{aligned}
& \text{Given } \nabla_\theta J(\theta) \approx \sum_{t \geq 0} f(\tau) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t), \text{ improve the estimate of } f(\tau) \text{ with:} \\
& \quad 1.\ \text{reward as weightage: } f(\tau) = \sum_{t' \geq t} r_{t'} \\
& \quad 2.\ \text{add a discount factor: } f(\tau) = \sum_{t' \geq t} \gamma^{t'-t} r_{t'} \\
& \quad 3.\ \text{introduce a baseline: } f(\tau) = \sum_{t' \geq t} \gamma^{t'-t} r_{t'} - b(s_t)
\end{aligned}
$$

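As a concrete reference for the three estimates, the sketch below computes $f(\tau)_t$ from one episode's rewards, using the mean return as the simple baseline $b$ described above. The function name compute_f_tau and the mean-return baseline are illustrative assumptions, not SLM Lab's exact code.

    import numpy as np

    def compute_f_tau(rewards, gamma=0.99, method=3):
        """Return f(tau)_t for each timestep of one episode.

        method=1: undiscounted reward-to-go; method=2: discounted reward-to-go;
        method=3: discounted reward-to-go minus a mean-return baseline.
        """
        T = len(rewards)
        rets = np.zeros(T)
        future = 0.0
        for t in reversed(range(T)):
            discount = gamma if method >= 2 else 1.0
            future = rewards[t] + discount * future  # sum_{t' >= t} gamma^(t'-t) r_t'
            rets[t] = future
        if method == 3:
            rets = rets - rets.mean()                # subtract baseline b(s_t)
        return rets
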
See reinforce.json for example specs of variations of the REINFORCE algorithm.

Basic Parameters

    "agent": [{
      "name": str,
      "algorithm": {
        "name": str,
        "action_pdtype": str,
        "action_policy": str,
        "gamma": float,
        "training_frequency": int,
        "add_entropy": bool,
        "entropy_coef": float,
      },
      "memory": {
        "name": str,
        "max_size": int,
        "batch_size": int
      },
      "net": {
        "type": str,
        "hid_layers": list,
        "hid_layers_activation": str,
        "optim_spec": dict,
      }
    }],
    ...
}

Parameters listed below as general param are common spec parameters shared across algorithms rather than specific to REINFORCE.

  • algorithm

    • name general param

    • action_pdtype general param

    • action_policy string specifying which policy to use to act. For example, "Categorical" (for discrete action spaces), "Normal" (for continuous action spaces with one dimension), or "default" to automatically switch between the two depending on the environment.

    • gamma general param

    • training_frequency how many episodes of data to collect before each training iteration. A common value is 1.

    • add_entropy whether to add an entropy term to $f(\tau)_t$ to encourage exploration

    • entropy_coef coefficient to multiply the entropy of the distribution by when adding it to $f(\tau)_t$

  • memory

    • name general param. Compatible types: "OnPolicyReplay", "OnPolicyBatchReplay"

    • batch_size number of examples to collect before training. Only relevant for the batch on-policy memory "OnPolicyBatchReplay"

  • net

    • type general param. Compatible types: all networks

    • hid_layers general param

    • hid_layers_activation general param

    • optim_spec general param

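For concreteness, here is an illustrative agent spec with the basic fields filled in. The key structure follows the skeleton above, but the values are placeholder assumptions for illustration and are not copied from reinforce.json. SLM Lab spec files are JSON; the fragment is shown here as the equivalent Python dict.

    # Illustrative only: values are placeholder assumptions, not reinforce.json.
    reinforce_agent_spec = [{
        "name": "Reinforce",
        "algorithm": {
            "name": "Reinforce",
            "action_pdtype": "default",
            "action_policy": "default",
            "gamma": 0.99,
            "training_frequency": 1,   # train after every episode
            "add_entropy": True,
            "entropy_coef": 0.01,
        },
        "memory": {
            "name": "OnPolicyReplay",
            "max_size": 10000,
            "batch_size": 32,
        },
        "net": {
            "type": "MLPNet",          # assumed net type name
            "hid_layers": [64],
            "hid_layers_activation": "relu",
            "optim_spec": {"name": "Adam", "lr": 0.002},
        },
    }]
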
Advanced Parameters

    "agent": [{
      "net": {
        "rnn_hidden_size": int,
        "rnn_num_layers": int,
        "seq_len": int,
        "clip_grad": bool,
        "clip_grad_val": float,
        "lr_decay": str,
        "lr_decay_frequency": int,
        "lr_decay_min_timestep": int,
        "lr_anneal_timestep": int,
        "gpu": int
      }
    }],
    ...
}
  • net

    • rnn_hidden_size, rnn_num_layers, seq_len, clip_grad, clip_grad_val, lr_decay, lr_decay_frequency, lr_decay_min_timestep, lr_anneal_timestep, and gpu are all general params; none are specific to REINFORCE.

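Below is an illustrative "net" section with the advanced fields filled in. All values, and the lr_decay option name, are placeholder assumptions; the rnn_* and seq_len fields only apply when a recurrent network type is used.

    # Illustrative only: values and the lr_decay option name are placeholders.
    advanced_net_spec = {
        "type": "RecurrentNet",      # assumed recurrent net type name
        "rnn_hidden_size": 64,
        "rnn_num_layers": 1,
        "seq_len": 4,                # number of consecutive states fed to the RNN
        "clip_grad": True,
        "clip_grad_val": 1.0,        # assumed to be the maximum gradient norm
        "lr_decay": "linear_decay",  # placeholder schedule name
        "lr_decay_frequency": 1000,
        "lr_decay_min_timestep": 1000,
        "lr_anneal_timestep": 100000,
        "gpu": 0,                    # GPU device index
    }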