🎭 Actor-Critic

Actor-Critic algorithms combine value function and policy estimation. They consist of an actor, which learns a parameterized policy π mapping states to probability distributions over actions, and a critic, which learns a parameterized value function used to evaluate the actions the actor takes.

There are a variety of approaches to training the critic. Three options are provided with the baseline algorithms.

See slm_lab/spec/benchmark/a2c/ for example A2C specs.

The actor and critic can be trained separately or jointly, depending on whether the networks are structured to share parameters.
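
As a concrete illustration of the shared-parameter case, here is a minimal PyTorch sketch (names such as SharedActorCritic are illustrative, not SLM Lab's implementation): a single trunk feeds a policy head and a value head, so one combined loss trains both. The separate-parameter case simply uses two independent networks.

    # Illustrative shared-parameter actor-critic: one trunk, two heads (not SLM Lab code).
    import torch
    import torch.nn as nn

    class SharedActorCritic(nn.Module):
        def __init__(self, state_dim, action_dim, hidden=64):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.policy_head = nn.Linear(hidden, action_dim)  # actor: logits of pi(a|s)
            self.value_head = nn.Linear(hidden, 1)            # critic: V(s)

        def forward(self, state):
            features = self.trunk(state)
            return self.policy_head(features), self.value_head(features).squeeze(-1)

    # Usage: the logits parameterize the policy, the scalar estimates the state value.
    net = SharedActorCritic(state_dim=4, action_dim=2)
    logits, value = net(torch.randn(1, 4))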

The critic is trained using temporal difference learning, similarly to the DQN algorithm. The actor is trained using policy gradients, similarly to REINFORCE. However, in this case the "goodness" of actions is evaluated using the advantage, A^π, which is calculated from the critic's estimate of V. A^π measures how much better the action taken in state s was compared with the expected value of being in state s, i.e. the expectation over all possible actions in state s.
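
With a single-step temporal-difference target, for example, the advantage of an action can be estimated from the critic and the observed reward alone; n-step returns and GAE (see lam below) generalize this single-step estimate:

$$
A^{\pi}(s_t, a_t) \approx r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)
$$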

Actor-Critic algorithms are on-policy. Only data that is gathered using the current policy can be used to update the parameters. Once the policy parameters have been updated, all previously gathered data must be discarded and the collection process started again with the new policy.

Algorithm: Simple Actor-Critic with separate actor and critic parameters

$$
\begin{aligned}
&\text{For } i = 1 \dots N: \\
&\quad \text{1. Gather data } (s_i, a_i, r_i, s'_i) \text{ by acting in the environment using your policy} \\
&\quad \text{2. Update } V: \\
&\qquad \text{for } j = 1 \dots K: \\
&\qquad\quad \text{calculate target values } y_i \text{ for each example} \\
&\qquad\quad \text{update critic network parameters using a regression loss, e.g. MSE:} \\
&\qquad\qquad L(\theta) = \tfrac{1}{2} \sum_i \lVert y_i - V(s_i; \theta) \rVert^2 \\
&\quad \text{3. Calculate the advantage } A^{\pi} \text{ for each example using the value function } V \\
&\quad \text{4. Calculate the policy gradient:} \\
&\qquad \nabla_{\phi} J(\phi) \approx \sum_i A^{\pi}(s_i, a_i)\, \nabla_{\phi} \log \pi_{\phi}(a_i \mid s_i) \\
&\quad \text{5. Use the gradient to update the actor parameters } \phi
\end{aligned}
$$
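
A rough PyTorch sketch of one iteration of this loop, assuming discrete actions, separate actor and critic MLPs, and single-step TD targets (all names and hyperparameters here are illustrative; this is not SLM Lab's implementation):

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    def make_mlp(in_dim, out_dim, hidden=64):
        return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                             nn.Linear(hidden, out_dim))

    state_dim, action_dim, gamma, K = 4, 2, 0.99, 4
    actor = make_mlp(state_dim, action_dim)    # logits of pi_phi(a|s)
    critic = make_mlp(state_dim, 1)            # V(s; theta)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

    def train_step(states, actions, rewards, next_states, dones):
        # 2. Update V: regress V(s_i; theta) toward single-step TD targets y_i
        for _ in range(K):
            with torch.no_grad():
                y = rewards + gamma * (1 - dones) * critic(next_states).squeeze(-1)
            v = critic(states).squeeze(-1)
            critic_loss = 0.5 * ((y - v) ** 2).sum()
            critic_opt.zero_grad()
            critic_loss.backward()
            critic_opt.step()
        # 3. Advantage A^pi(s_i, a_i) = y_i - V(s_i), using the updated critic
        with torch.no_grad():
            y = rewards + gamma * (1 - dones) * critic(next_states).squeeze(-1)
            advantage = y - critic(states).squeeze(-1)
        # 4.-5. Policy gradient step: maximize sum_i A_i * log pi_phi(a_i | s_i)
        log_probs = Categorical(logits=actor(states)).log_prob(actions)
        actor_loss = -(advantage * log_probs).sum()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        # On-policy: the collected transitions are discarded after this update.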

See slm_lab/spec/benchmark/a2c/ for example Actor-Critic specs.

Basic Parameters

    "agent": {
      "name": str,
      "algorithm": {
        "name": str,
        "action_pdtype": str,
        "action_policy": str,
        "gamma": float,
        "lam": float,
        "entropy_coef_spec": {...},
        "training_frequency": int,
      },
      "memory": {
        "name": str,
      },
      "net": {
        "type": str,
        "shared": bool,
        "hid_layers": list,
        "hid_layers_activation": str,
        "optim_spec": dict,
      }
    },
    ...
}
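
Purely for illustration, the agent portion of such a spec (written here as a Python dict) might be filled in as below; the values are hypothetical placeholders, not one of the benchmark specs shipped under slm_lab/spec/benchmark/a2c/.

    # Hypothetical, illustrative values only -- consult the benchmark specs for real settings.
    agent_spec = {
        "name": "A2C",
        "algorithm": {
            "name": "ActorCritic",
            "action_pdtype": "default",
            "action_policy": "default",
            "gamma": 0.99,
            "lam": 0.95,    # GAE; omit and set num_step_returns for n-step returns
            "entropy_coef_spec": {"name": "no_decay", "start_val": 0.01, "end_val": 0.01,
                                  "start_step": 0, "end_step": 0},
            "training_frequency": 1,
        },
        "memory": {"name": "OnPolicyReplay"},
        "net": {
            "type": "MLPNet",
            "shared": False,
            "hid_layers": [64, 64],
            "hid_layers_activation": "relu",
            "optim_spec": {"name": "Adam", "lr": 3e-4},
        },
    }
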
  • algorithm

    • action_pdtype general param

    • action_policy string specifying which policy to use to act. "Categorical" (for discrete action spaces), "Normal" (for continuous action spaces with one dimension), or "default" to automatically switch between the two depending on the environment.

    • lam ∈ [0, 1] GAE lambda parameter. If set, uses Generalized Advantage Estimation. Trade-off between bias and variance: 0 = low variance/high bias, 1 = high variance/low bias. Typical value: 0.95-0.97. (See the GAE sketch after this list.)

    • num_step_returns if set (and lam is not), uses n-step returns instead of GAE. Number of forward steps for advantage estimation. Note: When using n-step returns, training_frequency is automatically set to num_step_returns.

    • entropy_coef_spec schedule for entropy coefficient added to the loss to encourage exploration. Example: {"name": "no_decay", "start_val": 0.01, "end_val": 0.01, "start_step": 0, "end_step": 0}

    • training_frequency with episodic memory such as "OnPolicyReplay", the number of episodes to collect before each training iteration (a common value is 1); with batch memory such as "OnPolicyBatchReplay", the number of environment steps between training iterations. A value of 32 means the agent trains every 32 steps it takes in the environment, using the 32 examples gathered since it was last trained.

  • net
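
To make the lam trade-off concrete, here is a minimal NumPy sketch of Generalized Advantage Estimation (an illustrative helper, not the library's implementation). With lam near 0 it collapses toward the single-step TD advantage (low variance, high bias); with lam near 1 it approaches the full-return advantage (high variance, low bias).

    import numpy as np

    def gae_advantages(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
        """A_t = sum_k (gamma*lam)^k * delta_{t+k}, with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
        advantages = np.zeros(len(rewards), dtype=np.float32)
        running = 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * next_values[t] * (1 - dones[t]) - values[t]
            running = delta + gamma * lam * (1 - dones[t]) * running
            advantages[t] = running
        return advantages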

Advanced Parameters

  • algorithm

    • training_epoch how many gradient steps to take when training the critic. Only applies when the actor and critic have separate parameters.

    • policy_loss_coef how much weight to give to the policy (actor) component of the loss when the actor and critic share parameters and are trained jointly.

    • val_loss_coef how much weight to give to the critic component of the loss when the actor and critic share parameters and are trained jointly (see the sketch after this list).

    • normalize_v_targets normalize value targets to prevent gradient explosion. Uses running statistics normalization (like SB3's VecNormalize).

  • net
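
For the shared-parameter case, the weighted combination of the actor and critic terms (plus the entropy bonus) can be sketched as follows; the coefficient defaults here are illustrative, not SLM Lab's defaults.

    import torch

    def combined_loss(log_probs, advantages, v_pred, v_targets, entropy,
                      policy_loss_coef=1.0, val_loss_coef=0.5, entropy_coef=0.01):
        policy_loss = -(advantages.detach() * log_probs).mean()   # actor term
        val_loss = 0.5 * ((v_targets - v_pred) ** 2).mean()       # critic term
        # Entropy is subtracted so that maximizing entropy lowers the loss.
        return policy_loss_coef * policy_loss + val_loss_coef * val_loss - entropy_coef * entropy.mean()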

PPO (Proximal Policy Optimization)

PPO extends Actor-Critic with a clipped surrogate objective and minibatch updates. See slm_lab/spec/benchmark/ppo/ for example PPO specs.
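
The clipped surrogate objective at the heart of PPO can be sketched roughly as follows (illustrative, not SLM Lab's code); clip_eps here corresponds to clip_eps_spec below.

    import torch

    def ppo_policy_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
        ratio = torch.exp(log_probs - old_log_probs)              # pi_new(a|s) / pi_old(a|s)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()              # negate to minimize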

PPO-Specific Parameters

  • clip_eps_spec PPO clipping parameter, typically starts at 0.2. Can use a schedule for decay.

  • minibatch_size number of samples per minibatch during training

  • time_horizon number of environment steps collected before each training update (training_frequency = num_envs × time_horizon)

  • training_epoch number of passes through collected data per update

  • normalize_v_targets (v5) normalize value targets using running statistics to prevent gradient explosion with varying reward scales

  • clip_vloss (v5) CleanRL-style value loss clipping: clips value predictions relative to old predictions using clip_eps. Improves stability for some environments.
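
A sketch of the CleanRL-style clipped value loss described for clip_vloss (illustrative, not the exact SLM Lab code): new value predictions are kept within clip_eps of the old predictions before the regression loss is taken.

    import torch

    def clipped_value_loss(v_pred, v_pred_old, v_targets, clip_eps=0.2):
        unclipped = (v_pred - v_targets) ** 2
        v_clipped = v_pred_old + torch.clamp(v_pred - v_pred_old, -clip_eps, clip_eps)
        clipped = (v_clipped - v_targets) ** 2
        return 0.5 * torch.max(unclipped, clipped).mean()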

Note: for MuJoCo continuous control tasks, use normalize_v_targets: true; for image-based Atari tasks, use clip_vloss: true.
