Actor Critic

Actor-Critic algorithms combine value function and policy estimation. They consist of an actor, which learns a parameterized policy π mapping states to probability distributions over actions, and a critic, which learns a parameterized function that assigns a value to the actions taken.

There are a variety of approaches to training the critic. Three options are provided with the baseline algorithms.

  • Learn the V function and use it to approximate the Q function. See ac.json for some example specs.

  • Advantage with n-step forward returns, from Mnih et al., 2016. See a2c.json for some example specs.

  • Generalized advantage estimation, from Schulman et al., 2015. See a2c.json for some example specs. A sketch of both advantage estimators follows this list.
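
The sketch below illustrates how these two advantage targets can be computed from a stored trajectory. It is a minimal NumPy illustration under simplifying assumptions (a single trajectory with precomputed critic values), not SLM Lab's own implementation, and the helper names are hypothetical.

    import numpy as np

    def nstep_return(rewards, v_boot, gamma=0.99):
        """Hypothetical helper: n-step forward return for one state.
        rewards: the next n rewards r_t ... r_{t+n-1};
        v_boot: the critic's estimate V(s_{t+n}) used to bootstrap."""
        ret = v_boot
        for r in reversed(rewards):
            ret = r + gamma * ret
        return ret  # r_t + gamma*r_{t+1} + ... + gamma^n * V(s_{t+n})

    def gae_advantages(rewards, v, v_next, dones, gamma=0.99, lam=0.95):
        """Hypothetical helper: Generalized Advantage Estimation over a trajectory.
        v[t] = V(s_t), v_next[t] = V(s_{t+1}), dones[t] = 1.0 at episode end."""
        T = len(rewards)
        advs = np.zeros(T)
        future = 0.0
        for t in reversed(range(T)):
            not_done = 1.0 - dones[t]
            delta = rewards[t] + gamma * v_next[t] * not_done - v[t]  # TD error
            future = delta + gamma * lam * not_done * future
            advs[t] = future
        return advs

With lam = 0 the GAE estimate reduces to the one-step TD error; with lam = 1 it approaches the full discounted return minus the value baseline.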

The actor and critic can be trained separately or jointly, depending on whether the networks are structured to share parameters.
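
When parameters are shared, a common arrangement is a single network body with two output heads: one producing the actor's action distribution parameters and one producing the critic's scalar state value. The PyTorch module below is a minimal sketch of this idea for a discrete action space; it is illustrative and not SLM Lab's actual network classes.

    import torch.nn as nn

    class SharedActorCritic(nn.Module):
        """Illustrative shared-parameter network: one body, two heads."""
        def __init__(self, state_dim, action_dim, hid=64):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(state_dim, hid), nn.ReLU(),
                nn.Linear(hid, hid), nn.ReLU(),
            )
            self.actor_head = nn.Linear(hid, action_dim)  # action logits
            self.critic_head = nn.Linear(hid, 1)          # state value V(s)

        def forward(self, state):
            feat = self.body(state)
            return self.actor_head(feat), self.critic_head(feat).squeeze(-1)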

The critic is trained using temporal difference learning, similarly to the DQN algorithm. The actor is trained using policy gradients, similarly to REINFORCE. However, in this case the estimate of the "goodness" of actions is the advantage, A^π, which is calculated using the critic's estimate of V. A^π measures how much better the action taken in state s was compared with the expected value of being in state s, i.e. the expectation over all possible actions in state s.
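
Concretely, with a one-step temporal difference target the advantage of a transition (s, a, r, s′) is estimated as

    A^π(s, a) ≈ r + γ V(s′) − V(s)

which is positive when taking action a in state s turned out better than the critic expected.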

Actor-Critic algorithms are on-policy. Only data gathered using the current policy can be used to update the parameters. Once the policy parameters have been updated, all previously gathered data must be discarded and the collection process started again with the new policy.

Algorithm: Simple Actor Critic with separate actor and critic parameters

    For i = 1 ... N:
        1. Gather data (s_i, a_i, r_i, s'_i) by acting in the environment using your policy
        2. Update V:
            for j = 1 ... K:
                Calculate target values y_i for each example
                Update critic network parameters using a regression loss, e.g. MSE:
                    L(θ) = 1/2 Σ_i ||y_i − V(s_i; θ)||²
        3. Calculate the advantage A^π for each example, using the value function V
        4. Calculate the policy gradient:
            ∇_φ J(φ) ≈ Σ_i A^π(s_i, a_i) ∇_φ log π_φ(a_i | s_i)
        5. Use the gradient to update the actor parameters φ
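
A condensed PyTorch version of this procedure, with separate actor and critic networks and a discrete action space, is sketched below. It is a minimal illustration of the pseudocode above rather than SLM Lab's ActorCritic class; the placeholder networks, optimizers, and one-step TD targets are assumptions made for brevity.

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    def mlp(in_dim, out_dim, hid=64):
        # small placeholder network; SLM Lab builds its networks from the "net" spec
        return nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, out_dim))

    def actor_critic_update(actor, critic, actor_optim, critic_optim,
                            states, actions, rewards, next_states, dones,
                            gamma=0.99, critic_epochs=4):
        """One iteration of the simple actor-critic pseudocode above."""
        # 2. TD targets y_i = r_i + gamma * V(s'_i), with no bootstrap at episode end
        with torch.no_grad():
            targets = rewards + gamma * critic(next_states).squeeze(-1) * (1 - dones)

        # 2. Fit the critic with an MSE regression loss for K epochs
        for _ in range(critic_epochs):
            critic_loss = 0.5 * (targets - critic(states).squeeze(-1)).pow(2).mean()
            critic_optim.zero_grad()
            critic_loss.backward()
            critic_optim.step()

        # 3. Advantage A^pi(s_i, a_i) = y_i - V(s_i), held constant for the actor update
        with torch.no_grad():
            advantages = targets - critic(states).squeeze(-1)

        # 4.-5. Policy-gradient step: ascend sum_i A_i * log pi(a_i | s_i)
        dist = Categorical(logits=actor(states))
        actor_loss = -(advantages * dist.log_prob(actions)).mean()
        actor_optim.zero_grad()
        actor_loss.backward()
        actor_optim.step()
        return actor_loss.item(), critic_loss.item()

In SLM Lab the batches come from the on-policy memory classes listed below and are discarded after each update, consistent with the on-policy constraint described above.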

See ac.json and a2c.json for example specs of variations of the Actor-Critic algorithm.

Basic Parameters

    "agent": [{
      "name": str,
      "algorithm": {
        "name": str,
        "action_pdtype": str,
        "action_policy": str,
        "gamma": float,
        "use_gae": bool,
        "lam": float,
        "use_nstep": bool,
        "num_step_returns": int,
        "add_entropy": bool,
        "entropy_coef": float,
        "training_frequency": int,
      },
      "memory": {
        "name": str,
      },
      "net": {
        "type": str,
        "shared": bool,
        "hid_layers": list,
        "hid_layers_activation": str,
        "actor_optim_spec": dict,
        "critic_optim_spec": dict,
      }
    }],
    ...
}
  • algorithm

    • name general param

    • action_pdtype general param

    • action_policy string specifying which policy to use to act. "Categorical" (for discrete action spaces), "Normal" (for continuous action spaces with one dimension), or "default" to automatically switch between the two depending on the environment.

    • gamma general param

    • use_gae whether to calculate the advantage using generalized advantage estimation.

    • lam ∈ [0, 1] trade-off between bias and variance when using generalized advantage estimation. 0 corresponds to low variance and high bias; 1 corresponds to high variance and low bias.

    • use_nstep whether to calculate the advantage using n-step forward returns. If use_gae is true this parameter is ignored.

    • num_step_returns number of forward steps to use when calculating the target for advantage estimation with n-step forward returns.

    • add_entropy whether to add entropy to the advantage to encourage exploration.

    • entropy_coef coefficient by which to multiply the entropy of the distribution when adding it to the advantage.

    • training_frequency when using episodic memory such as "OnPolicyReplay", the number of episodes of data to collect before each training iteration (a common value is 1); when using batch memory such as "OnPolicyBatchReplay", how often to train the algorithm. A value of 32 means the agent is trained every 32 environment steps, using the 32 examples collected since it was last trained.

  • memory

    • name general param. Compatible types: "OnPolicyReplay", "OnPolicyBatchReplay".

  • net

    • type general param. All networks are compatible.

    • shared whether the actor and the critic should share parameters.

    • hid_layers general param

    • hid_layers_activation general param

    • actor_optim_spec optimizer for the actor

    • critic_optim_spec optimizer for the critic
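
For concreteness, a basic-parameter block filled in with hypothetical values might look like the sketch below. It is written as a Python dict for readability; actual spec files such as ac.json and a2c.json are JSON, and these values are illustrative rather than copied from them.

    # Hypothetical basic parameters for an Actor-Critic agent (illustrative values only)
    agent_spec = [{
        "name": "ActorCritic",
        "algorithm": {
            "name": "ActorCritic",
            "action_pdtype": "default",
            "action_policy": "default",
            "gamma": 0.99,
            "use_gae": True,
            "lam": 0.95,
            "use_nstep": False,
            "num_step_returns": 5,
            "add_entropy": True,
            "entropy_coef": 0.01,
            "training_frequency": 1,
        },
        "memory": {"name": "OnPolicyReplay"},
        "net": {
            "type": "MLPNet",
            "shared": False,
            "hid_layers": [64, 64],
            "hid_layers_activation": "relu",
            "actor_optim_spec": {"name": "Adam", "lr": 3e-4},
            "critic_optim_spec": {"name": "Adam", "lr": 3e-4},
        },
    }]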

Advanced Parameters

    "agent": [{
    "algorithm" : {
        "training_epoch": int,
        "policy_loss_coef": float,
        "val_loss_coef": float
      }
      "net": {
        "use_same_optim": bool,
        "rnn_hidden_size": int,
        "rnn_num_layers": int,
        "seq_len": int,
        "clip_grad": bool,
        "clip_grad_val": float,
        "lr_decay": str,
        "lr_decay_frequency": int,
        "lr_decay_min_timestep": int,
        "lr_anneal_timestep": int,
        "gpu": int
      }
    }],
    ...
}
  • algorithm

    • training_epoch how many gradient steps to take when training the critic. Only applies when the actor and critic have separate parameters.

    • policy_loss_coef how much weight to give to the policy (actor) component of the loss when the actor and critic share parameters and are trained jointly.

    • val_loss_coef how much weight to give to the critic component of the loss when the actor and critic share parameters and are trained jointly. See the sketch after this list.

  • net

    • use_same_optim whether to use the actor_optim_spec optimizer for both the actor and the critic. This can be useful when conducting a parameter search.

    • rnn_hidden_size general param

    • rnn_num_layers general param

    • seq_len general param

    • clip_grad general param

    • clip_grad_val general param

    • lr_decay general param

    • lr_decay_frequency general param

    • lr_decay_min_timestep general param

    • lr_anneal_timestep general param

    • gpu general param
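
As noted for policy_loss_coef and val_loss_coef above, when the networks are shared the two loss components are combined into one objective before a single backward pass. A minimal sketch of that weighting (not SLM Lab's exact loss code) is:

    import torch

    def combined_loss(policy_loss: torch.Tensor, value_loss: torch.Tensor,
                      policy_loss_coef: float, val_loss_coef: float) -> torch.Tensor:
        """Weighted joint objective for a shared actor-critic network; the
        coefficients correspond to policy_loss_coef and val_loss_coef above."""
        return policy_loss_coef * policy_loss + val_loss_coef * value_loss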
