Actor Critic

There are a variety of approaches to training the critic. Three options are provided with the baseline algorithms.

The actor and critic can be trained separately or jointly, depending on whether the networks are structured to share parameters.

Actor-Critic algorithms are on-policy: only data gathered with the current policy can be used to update its parameters. Once the policy parameters have been updated, all previously gathered data must be discarded and collection must start again with the new policy.

Algorithm: Simple Actor Critic with separate actor and critic parameters
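
The listing below is a minimal PyTorch sketch of this scheme, assuming a discrete action space and a plain Monte Carlo return as the critic target. It is illustrative only and does not reproduce the library's actual implementation; names such as `train_step` and `discounted_returns` are placeholders.

    # Minimal sketch of a simple Actor-Critic update with separate actor and
    # critic networks (illustrative only; not the library's implementation).
    import torch
    import torch.nn as nn

    obs_dim, act_dim, gamma = 4, 2, 0.99

    actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
    critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    actor_optim = torch.optim.Adam(actor.parameters(), lr=3e-4)
    critic_optim = torch.optim.Adam(critic.parameters(), lr=3e-4)

    def discounted_returns(rewards, gamma):
        # Discounted Monte Carlo return G_t for every step of one episode
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        return torch.tensor(list(reversed(returns)))

    def train_step(states, actions, rewards):
        # One on-policy update; the episode data must be discarded afterwards
        returns = discounted_returns(rewards, gamma)
        values = critic(states).squeeze(-1)
        advantages = returns - values.detach()  # simple advantage estimate

        # Actor update: policy gradient weighted by the advantage
        dist = torch.distributions.Categorical(logits=actor(states))
        policy_loss = -(dist.log_prob(actions) * advantages).mean()
        actor_optim.zero_grad()
        policy_loss.backward()
        actor_optim.step()

        # Critic update: regress V(s) toward the observed return
        value_loss = nn.functional.mse_loss(values, returns)
        critic_optim.zero_grad()
        value_loss.backward()
        critic_optim.step()

    # Toy episode of length 5 with random states, just to show the shapes
    states = torch.randn(5, obs_dim)
    actions = torch.randint(0, act_dim, (5,))
    rewards = [1.0, 1.0, 0.0, 1.0, 0.0]
    train_step(states, actions, rewards)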

See ac.json and a2c.json for example specs of variations of the Actor-Critic algorithm.

Basic Parameters

    "agent": [{
      "name": str,
      "algorithm": {
        "name": str,
        "action_pdtype": str,
        "action_policy": str,
        "gamma": float,
        "use_gae": bool,
        "lam": float,
        "use_nstep": bool,
        "num_step_returns": int,
        "add_entropy": bool,
        "entropy_coef": float,
        "training_frequency": int,
      },
      "memory": {
        "name": str,
      },
      "net": {
        "type": str,
        "shared": bool,
        "hid_layers": list,
        "hid_layers_activation": str,
        "actor_optim_spec": dict,
        "critic_optim_spec": dict,
      }
    }],
    ...
}
  • algorithm

    • action_pdtype general parameter specifying the type of action probability distribution to sample actions from.

    • action_policy string specifying which policy to use to act. "Categorical" (for discrete action spaces), "Normal" (for continuous action spaces with one dimension), or "default" to automatically switch between the two depending on the environment.

    • use_gae whether to calculate the advantage using Generalized Advantage Estimation (GAE). A numerical comparison of GAE and n-step returns is sketched after this list.

    • use_nstep whether to calculate the advantage using n-step forward returns. If use_gae is true, this parameter is ignored.

    • num_step_returns number of forward steps to use when calculating the target for advantage estimation with n-step forward returns.

    • add_entropy whether to add the entropy of the policy distribution to the advantage to encourage exploration.

    • entropy_coef coefficient by which the entropy of the distribution is multiplied before it is added to the advantage.

    • training_frequency how often to train the algorithm. When using episodic memory such as "OnPolicyReplay", this is the number of episodes of data to collect before each training iteration; a common value is 1. When using batch memory such as "OnPolicyBatchReplay", this is the number of environment steps between training iterations; a value of 32 means the agent is trained every 32 steps, using the 32 examples gathered since the previous training iteration.

  • net
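
To make the two advantage estimators above concrete, here is a small numerical sketch of n-step returns and GAE computed on the same toy rollout. It is illustrative only: the rollout values are made up, the episode is treated as terminating at the last step, and the code is not the library's internal implementation.

    # Illustrative comparison of n-step returns and GAE on a toy rollout
    # (not the library's implementation; all values are made up).
    import torch

    gamma, lam, n = 0.99, 0.95, 3
    rewards = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])
    values = torch.tensor([0.5, 0.4, 0.6, 0.7, 0.3])            # V(s_t) from the critic
    next_values = torch.cat([values[1:], torch.tensor([0.0])])  # V(s_{t+1}), 0 at episode end
    T = len(rewards)

    # n-step forward returns:
    # R_t = r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1} + gamma^n * V(s_{t+n})
    nstep_returns = torch.zeros(T)
    for t in range(T):
        ret, discount = 0.0, 1.0
        for k in range(n):
            if t + k >= T:
                break
            ret += discount * rewards[t + k]
            discount *= gamma
        if t + n < T:
            ret += discount * values[t + n]  # bootstrap from the critic
        nstep_returns[t] = ret
    nstep_advantages = nstep_returns - values

    # Generalized Advantage Estimation:
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    # A_t = sum_k (gamma * lam)^k * delta_{t+k}
    deltas = rewards + gamma * next_values - values
    gae_advantages = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        gae_advantages[t] = running

    print(nstep_advantages, gae_advantages)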

Advanced Parameters

    "agent": [{
    "algorithm" : {
        "training_epoch": int,
        "policy_loss_coef": float,
        "val_loss_coef": float
      }
      "net": {
        "use_same_optim": bool,
        "rnn_hidden_size": int,
        "rnn_num_layers": int,
        "seq_len": int,
        "clip_grad": bool,
        "clip_grad_val": float,
        "lr_decay": str,
        "lr_decay_frequency": int,
        "lr_decay_min_timestep": int,
        "lr_anneal_timestep": int,
        "gpu": int
      }
    }],
    ...
}
  • algorithm

    • training_epoch how many gradient steps to take when training the critic. Only applies when the actor and critic have separate parameters.

    • policy_loss_coef how much weight to give to the policy (actor) component of the loss when the actor and critic share parameters and are therefore trained jointly (see the sketch after this list).

    • val_loss_coef how much weight to give to the value (critic) component of the loss when the actor and critic share parameters and are therefore trained jointly.

  • net
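
As a rough illustration of how these parameters are typically used, the sketch below combines the two loss terms with policy_loss_coef and val_loss_coef for a shared network, and takes training_epoch gradient steps on a separately parameterized critic. It is a sketch of the usual conventions, not the library's actual code; all function names are placeholders.

    # Illustrative use of policy_loss_coef, val_loss_coef and training_epoch
    # (a sketch of common conventions, not the library's implementation).
    import torch
    import torch.nn as nn

    policy_loss_coef, val_loss_coef, training_epoch = 1.0, 0.5, 4

    def shared_update(policy_loss, value_loss, optimizer):
        # Shared actor-critic parameters: one weighted joint loss, one optimizer
        loss = policy_loss_coef * policy_loss + val_loss_coef * value_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    def separate_critic_update(critic, critic_optim, states, returns):
        # Separate parameters: the critic takes training_epoch gradient steps
        for _ in range(training_epoch):
            value_loss = nn.functional.mse_loss(critic(states).squeeze(-1), returns)
            critic_optim.zero_grad()
            value_loss.backward()
            critic_optim.step()

    # Example with a tiny critic and random data, just to show the call
    critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
    critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-3)
    separate_critic_update(critic, critic_optim, torch.randn(8, 4), torch.randn(8))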
