Actor-Critic
Actor-Critic algorithms combine value function and policy estimation. They consist of an actor, which learns a parameterized policy, π, mapping states to probability distributions over actions, and a critic, which learns a parameterized value function that assigns a value to the actions taken.
There are a variety of approaches to training the critic. Three options are provided with the baseline algorithms.
Learn the V function and use it to approximate the Q function.
Advantage with n-step forward returns, from Mnih et al., 2016.
Generalized advantage estimation, from Schulman et al., 2015; both of these advantage estimators are sketched below.
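As a rough illustration of the last two options, here is a minimal NumPy sketch of n-step forward returns and of GAE. It is not SLM Lab's implementation: rewards and values are assumed to be 1-D arrays from a single rollout, next_v is the critic's value of the final state, and episode-boundary (done) handling is omitted for brevity.

    import numpy as np

    def n_step_returns(rewards, values, next_v, gamma=0.99, n=5):
        # n-step forward returns in the spirit of Mnih et al., 2016:
        # sum up to n discounted rewards, then bootstrap with the critic's value.
        T = len(rewards)
        rets = np.zeros(T)
        for t in range(T):
            ret, discount = 0.0, 1.0
            end = min(t + n, T)
            for k in range(t, end):
                ret += discount * rewards[k]
                discount *= gamma
            bootstrap = values[end] if end < T else next_v
            rets[t] = ret + discount * bootstrap
        return rets

    def gae_advantages(rewards, values, next_v, gamma=0.99, lam=0.95):
        # Generalized Advantage Estimation (Schulman et al., 2015):
        # an exponentially weighted (by gamma * lam) sum of one-step TD errors.
        v = np.append(values, next_v)
        deltas = rewards + gamma * v[1:] - v[:-1]
        advs = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = deltas[t] + gamma * lam * running
            advs[t] = running
        return advs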
See slm_lab/spec/benchmark/a2c/ for example A2C specs.
The actor and critic can be trained separately or jointly, depending on whether the networks are structured to share parameters.
The critic is trained using temporal-difference learning, similarly to the DQN algorithm. The actor is trained using policy gradients, similarly to REINFORCE. However, in this case the estimate of the "goodness" of actions is the advantage, Aπ, which is calculated using the critic's estimate of V. Aπ measures how much better the action taken in state s was compared with the expected value of being in that state, i.e. the expectation over all possible actions in state s.
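Concretely, for a single transition the critic target, the advantage, and the two losses look roughly like the following PyTorch-style sketch; critic, action_dist, gamma, and the tensor names are assumptions for illustration, not SLM Lab's API.

    import torch
    import torch.nn.functional as F

    # One transition (state, action, reward, next_state, done) gathered by the current policy.
    v_s = critic(state)                          # critic's estimate of V(s)
    with torch.no_grad():
        v_next = critic(next_state)              # V(s'), held fixed inside the TD target
    td_target = reward + gamma * (1 - done) * v_next
    advantage = (td_target - v_s).detach()       # estimate of Aπ(s, a)

    critic_loss = F.mse_loss(v_s, td_target)     # temporal-difference learning for the critic
    log_prob = action_dist.log_prob(action)      # from the actor's output distribution
    actor_loss = -(log_prob * advantage).mean()  # policy gradient weighted by the advantage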
Actor-Critic algorithms are on-policy: only data gathered using the current policy can be used to update the parameters. Once the policy parameters have been updated, all previously gathered data must be discarded and collection started again with the new policy.
Algorithm: Simple Actor Critic with separate actor and critic parameters
See slm_lab/spec/benchmark/a2c/ for example Actor-Critic specs.
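As a hedged sketch of that algorithm (not the exact SLM Lab code), the loop below keeps separate actor and critic parameters and refreshes its data every iteration; collect_episodes and compute_advantages are illustrative placeholders.

    for it in range(num_iterations):
        # 1. Gather data with the current policy only (on-policy requirement).
        batch = collect_episodes(env, actor, n_episodes=training_frequency)

        # 2. Train the critic with temporal-difference learning toward bootstrapped targets.
        v_targets, advantages = compute_advantages(batch, critic, gamma, lam)
        critic_loss = ((critic(batch.states) - v_targets) ** 2).mean()
        critic_optim.zero_grad(); critic_loss.backward(); critic_optim.step()

        # 3. Train the actor with policy gradients weighted by the advantage estimates.
        log_probs = actor(batch.states).log_prob(batch.actions)
        actor_loss = -(log_probs * advantages.detach()).mean()
        actor_optim.zero_grad(); actor_loss.backward(); actor_optim.step()

        # 4. Discard the batch; the next iteration must collect data with the updated policy.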
Basic Parameters
"agent": {
"name": str,
"algorithm": {
"name": str,
"action_pdtype": str,
"action_policy": str,
"gamma": float,
"lam": float,
"entropy_coef_spec": {...},
"training_frequency": int,
},
"memory": {
"name": str,
},
"net": {
"type": str,
"shared": bool,
"hid_layers": list,
"hid_layers_activation": str,
"optim_spec": dict,
}
},
...
}

algorithm
name: general param
action_pdtype: general param
action_policy: string specifying which policy to use to act. "Categorical" (for discrete action spaces), "Normal" (for continuous action spaces with one dimension), or "default" to automatically switch between the two depending on the environment.
gamma: general param
lam: ∈ [0, 1]. GAE lambda parameter. If set, uses Generalized Advantage Estimation. Trades off bias against variance: 0 = low variance / high bias, 1 = high variance / low bias. Typical values: 0.95-0.97.
num_step_returns: if set (and lam is not), uses n-step returns instead of GAE. Number of forward steps for advantage estimation. Note: when using n-step returns, training_frequency is automatically set to num_step_returns.
entropy_coef_spec: schedule for the entropy coefficient added to the loss to encourage exploration. Example: {"name": "no_decay", "start_val": 0.01, "end_val": 0.01, "start_step": 0, "end_step": 0}
training_frequency: when using an episodic memory such as "OnPolicyReplay", how many episodes of data to collect before each training iteration (a common value is 1); when using a batch memory such as "OnPolicyBatchReplay", how often to train. A value of 32 means the agent trains every 32 environment steps, using the 32 examples collected since it was previously trained.
memory
name: general param. Compatible types: "OnPolicyReplay", "OnPolicyBatchReplay"
net
type: general param. All networks are compatible.
shared: whether the actor and the critic should share parameters
hid_layers: general param
hid_layers_activation: general param
actor_optim_spec: general param; optimizer for the actor
critic_optim_spec: general param; optimizer for the critic
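For orientation, a filled-in version of the skeleton above might look like the Python dict below. The values are illustrative rather than a tuned benchmark configuration; see slm_lab/spec/benchmark/a2c/ for real specs.

    agent_spec = {
        "name": "A2C",
        "algorithm": {
            "name": "ActorCritic",
            "action_pdtype": "default",
            "action_policy": "default",
            "gamma": 0.99,
            "lam": 0.95,               # use GAE; set num_step_returns instead for n-step returns
            "entropy_coef_spec": {
                "name": "no_decay", "start_val": 0.01, "end_val": 0.01,
                "start_step": 0, "end_step": 0,
            },
            "training_frequency": 1,   # one episode per training iteration with OnPolicyReplay
        },
        "memory": {"name": "OnPolicyReplay"},
        "net": {
            "type": "MLPNet",
            "shared": False,           # separate actor and critic parameters
            "hid_layers": [64],
            "hid_layers_activation": "relu",
            "actor_optim_spec": {"name": "Adam", "lr": 0.001},
            "critic_optim_spec": {"name": "Adam", "lr": 0.001},
        },
    }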
Advanced Parameters
algorithm
training_epoch: how many gradient steps to take when training the critic. Only applies when the actor and critic have separate parameters.
policy_loss_coef: how much weight to give to the policy (actor) component of the loss when the actor and critic share parameters and are trained jointly.
val_loss_coef: how much weight to give to the critic component of the loss when the actor and critic share parameters and are trained jointly.
normalize_v_targets: normalize value targets to prevent gradient explosion. Uses running-statistics normalization (similar to SB3's VecNormalize).
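When the actor and critic share parameters, these coefficients simply weight one combined objective; a minimal sketch of that combination, with the entropy bonus from entropy_coef_spec, is:

    # Joint loss for a shared actor-critic network (illustrative variable names).
    loss = (policy_loss_coef * actor_loss
            + val_loss_coef * critic_loss
            - entropy_coef * entropy.mean())   # entropy bonus encourages exploration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()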
net
use_same_optim: whether to use the same optimizer for both the actor and the critic. Useful for parameter search.
rnn_hidden_size: general param
rnn_num_layers: general param
seq_len: general param
clip_grad_val: general param
lr_scheduler_spec: optional learning rate scheduler config
gpu: general param
PPO (Proximal Policy Optimization)
PPO extends Actor-Critic with a clipped surrogate objective and minibatch updates. See slm_lab/spec/benchmark/ppo/ for example PPO specs.
PPO-Specific Parameters
clip_eps_spec: PPO clipping parameter, typically starting at 0.2. Can use a schedule for decay.
minibatch_size: number of samples per minibatch during training
time_horizon: number of environment steps collected before each training update (training_frequency = num_envs × time_horizon)
training_epoch: number of passes through the collected data per update
normalize_v_targets: (v5) normalize value targets using running statistics to prevent gradient explosion with varying reward scales
clip_vloss: (v5) CleanRL-style value loss clipping; clips value predictions relative to old predictions using clip_eps. Improves stability for some environments.
MuJoCo: Use normalize_v_targets: true for continuous control tasks. Atari: Use clip_vloss: true for image-based tasks.
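As a rough sketch of the two PPO-specific pieces described above (the clipped surrogate objective and the clip_vloss option), assuming per-minibatch tensors of log-probabilities, advantages, and value predictions collected under the old policy:

    import torch

    # Probability ratio between the current policy and the policy that collected the data.
    ratio = (new_log_probs - old_log_probs).exp()
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()   # clipped surrogate objective

    # clip_vloss-style value loss: keep new value predictions within clip_eps of the
    # predictions made when the data was collected, and take the worse of the two errors.
    v_clipped = old_values + torch.clamp(v_pred - old_values, -clip_eps, clip_eps)
    val_loss = torch.max((v_pred - v_targets) ** 2,
                         (v_clipped - v_targets) ** 2).mean()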