Actor Critic
Actor-Critic algorithms combine value function and policy estimation. They consist of an actor, which learns a parameterized policy, π(a|s), mapping states to probability distributions over actions, and a critic, which learns a parameterized function that assigns a value to the actions taken.
There are a variety of approaches to training the critic. Three options are provided with the baseline algorithms.
Learn the V function and use it to approximate the Q function. See ac.json for some example specs.
Advantage with n-step forward returns from Mnih et al., 2016. See a2c.json for some example specs.
Generalized advantage estimation from Schulman et al., 2015. See a2c.json for some example specs. A minimal sketch of the n-step and GAE advantage targets follows this list.
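For concreteness, below is a minimal NumPy sketch of the n-step return and GAE targets for a single trajectory. The function and variable names are illustrative assumptions, not SLM-Lab's implementation, and episode-termination masking is omitted for brevity.

import numpy as np

def nstep_returns(rewards, v_preds, gamma, n):
    # n-step forward return target for each timestep t:
    # R_t = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n})
    T = len(rewards)
    rets = np.zeros(T)
    for t in range(T):
        ret, discount = 0.0, 1.0
        for k in range(t, min(t + n, T)):
            ret += discount * rewards[k]
            discount *= gamma
        if t + n < T:
            # bootstrap from the critic's value estimate if the trajectory continues
            ret += discount * v_preds[t + n]
        rets[t] = ret
    return rets

def gae_advantages(rewards, v_preds, v_pred_last, gamma, lam):
    # Generalized advantage estimation (Schulman et al., 2015):
    # delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),  A_t = sum_k (gamma*lam)^k * delta_{t+k}
    rewards = np.asarray(rewards, dtype=float)
    v_preds = np.asarray(v_preds, dtype=float)
    v_next = np.append(v_preds[1:], v_pred_last)
    deltas = rewards + gamma * v_next - v_preds
    advs = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advs[t] = running
    return advs

Setting lam close to 1 makes the GAE target resemble a full-return estimate (high variance, low bias), while lam close to 0 reduces it to the one-step TD error (low variance, high bias).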
The actor and critic can be trained separately or jointly, depending on whether the networks are structured to share parameters.
The critic is trained using temporal difference learning, similarly to the DQN algorithm. The actor is trained using policy gradients, similarly to REINFORCE. However, in this case the estimate of the "goodness" of actions is the advantage, A(s, a) = Q(s, a) - V(s), which is calculated using the critic's estimate of V(s). A(s, a) measures how much better the action a taken in state s was compared with the expected value of being in state s, i.e. the expectation over all possible actions in state s.
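To make the two updates concrete, here is a minimal PyTorch-style sketch: a TD-style regression loss for the critic and an advantage-weighted policy-gradient loss for the actor, with an optional entropy bonus. The function and tensor names are assumptions for illustration, not SLM-Lab's actual code.

import torch
import torch.nn.functional as F

def critic_loss(v_preds, v_targets):
    # Temporal-difference regression: push V(s) toward the bootstrapped target.
    return F.mse_loss(v_preds, v_targets)

def actor_loss(log_probs, advantages, entropies=None, entropy_coef=0.01):
    # REINFORCE-style policy gradient, with the advantage as the measure of
    # how good each action was. Advantages are detached so gradients flow
    # only through the policy's log-probabilities.
    loss = -(log_probs * advantages.detach()).mean()
    if entropies is not None:
        # entropy bonus discourages premature collapse to a deterministic policy
        loss = loss - entropy_coef * entropies.mean()
    return loss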
Actor-Critic algorithms are on-policy: only data gathered using the current policy can be used to update the parameters. Once the policy parameters have been updated, all previously gathered data must be discarded and collection must start again with the new policy.
Algorithm: Simple Actor Critic with separate actor and critic parameters
See ac.json and a2c.json for example specs of variations of the Actor-Critic algorithm.
Basic Parameters
"agent": [{
"name": str,
"algorithm": {
"name": str,
"action_pdtype": str,
"action_policy": str,
"gamma": float,
"use_gae": bool,
"lam": float,
"use_nstep": bool,
"num_step_returns": int,
"add_entropy": bool,
"entropy_coef": float,
"training_frequency": int,
},
"memory": {
"name": str,
},
"net": {
"type": str,
"shared": bool,
"hid_layers": list,
"hid_layers_activation": str,
"actor_optim_spec": dict,
"critic_optim_spec": dict,
}
}],
...
}

algorithm
name: general param
action_pdtype: general param
action_policy: string specifying which policy to use to act. "Categorical" (for discrete action spaces), "Normal" (for continuous action spaces with one dimension), or "default" to automatically switch between the two depending on the environment.
gamma: general param
use_gae: whether to calculate the advantage using generalized advantage estimation.
lam: trade-off between bias and variance when using generalized advantage estimation. 0 corresponds to low variance and high bias; 1 corresponds to high variance and low bias.
use_nstep: whether to calculate the advantage using n-step forward returns. If use_gae is true, this parameter will be ignored.
num_step_returns: number of forward steps to use when calculating the target for advantage estimation using n-step forward returns.
add_entropy: whether to add entropy to the advantage to encourage exploration.
entropy_coef: coefficient to multiply the entropy of the distribution by when adding it to the advantage.
training_frequency: when using episodic data storage (memory) such as "OnPolicyReplay", how many episodes of data to collect before each training iteration; a common value is 1. When using batch data storage (memory) such as "OnPolicyBatchReplay", how often to train the algorithm; a value of 32 means train every 32 steps the agent takes in the environment, using the 32 examples gathered since the agent was previously trained.
memory
name: general param. Compatible types: "OnPolicyReplay", "OnPolicyBatchReplay".
net
type: general param. All networks are compatible.
shared: whether the actor and the critic should share parameters.
hid_layers: general param
hid_layers_activation: general param
actor_optim_spec: general param. Optimizer for the actor.
critic_optim_spec: general param. Optimizer for the critic.
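To tie the fields above together, here is a hypothetical spec fragment with plausible values, written as a Python dict mirroring the JSON layout shown earlier. The concrete values are illustrative assumptions, not copied from ac.json or a2c.json.

example_agent_spec = {
    "name": "ActorCritic",
    "algorithm": {
        "name": "ActorCritic",
        "action_pdtype": "default",
        "action_policy": "default",  # auto-select Categorical or Normal per environment
        "gamma": 0.99,
        "use_gae": True,
        "lam": 0.95,                 # bias-variance trade-off for GAE
        "use_nstep": False,          # ignored when use_gae is true
        "num_step_returns": 10,
        "add_entropy": True,
        "entropy_coef": 0.01,
        "training_frequency": 1,     # with "OnPolicyReplay": train after every episode
    },
    "memory": {"name": "OnPolicyReplay"},
    "net": {
        "type": "MLPNet",            # illustrative; all network types are compatible
        "shared": False,             # separate actor and critic networks
        "hid_layers": [64],
        "hid_layers_activation": "relu",
        "actor_optim_spec": {"name": "Adam", "lr": 0.002},
        "critic_optim_spec": {"name": "Adam", "lr": 0.002},
    },
}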
Advanced Parameters
"agent": [{
"algorithm" : {
"training_epoch": int,
"policy_loss_coef": float,
"val_loss_coef": float
},
"net": {
"use_same_optim": bool,
"rnn_hidden_size": int,
"rnn_num_layers": int,
"seq_len": int,
"clip_grad": bool,
"clip_grad_val": float,
"lr_decay": str,
"lr_decay_frequency": int,
"lr_decay_min_timestep": int,
"lr_anneal_timestep": int,
"gpu": int
}
}],
...
}

algorithm
training_epoch: how many gradient steps to take when training the critic. Only applies when the actor and critic have separate parameters.
policy_loss_coef: how much weight to give to the policy (actor) component of the loss when the actor and critic have shared parameters and so are trained jointly.
val_loss_coef: how much weight to give to the critic component of the loss when the actor and critic have shared parameters and so are trained jointly.
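When the actor and critic share parameters, these two coefficients weight a single combined objective, as in the following minimal sketch (names are assumptions, not SLM-Lab's internals):

def shared_network_loss(policy_loss, val_loss, policy_loss_coef, val_loss_coef):
    # Joint objective for a shared actor-critic network: one backward pass
    # updates both the policy head and the value head, weighted by the
    # two coefficients from the algorithm spec.
    return policy_loss_coef * policy_loss + val_loss_coef * val_loss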
net
use_same_optim: whether to use the optim_actor (the actor's optimizer spec) for both the actor and critic. This can be useful when conducting a parameter search.
rnn_hidden_size: general param
rnn_num_layers: general param
seq_len: general param
clip_grad: general param
clip_grad_val: general param
lr_decay: general param
lr_decay_frequency: general param
lr_decay_min_timestep: general param
lr_anneal_timestep: general param
gpu: general param