s was compared with the expected value of being in that state s, i.e. the expectation over all possible actions in state s.
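In standard notation (a restatement of the idea above, not a formula quoted from this text), the advantage compares the value of the action taken with the expected value of the state under the policy:

\[
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s), \qquad V^{\pi}(s) = \mathbb{E}_{a' \sim \pi(\cdot \mid s)}\big[ Q^{\pi}(s, a') \big]
\]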
action_policy: string specifying which policy to use to act. "Categorical" (for discrete action spaces), "Normal" (for continuous action spaces with one dimension), or "default" to automatically switch between the two depending on the environment.
use_gae: whether to calculate the advantage using Generalized Advantage Estimation (GAE).
use_nstep: whether to calculate the advantage using n-step forward returns. If use_gae is true, this parameter will be ignored. (Both estimators are sketched after this list.)
num_step_returns: number of forward steps to use when calculating the target for advantage estimation using n-step forward returns.
add_entropy: whether to add entropy to the advantage to encourage exploration.
entropy_coef: coefficient by which the entropy of the action distribution is multiplied when adding it to the advantage.
training_frequency: when using episodic data storage (memory) such as "OnPolicyReplay", the number of episodes of data to collect before each training iteration; a common value is 1. When using batch data storage (memory) such as "OnPolicyBatchReplay", how often to train the algorithm: a value of 32 means the agent is trained every 32 steps it takes in the environment, using the 32 examples collected since it was last trained.
training_epoch: how many gradient steps to take when training the critic. Only applies when the actor and critic have separate parameters.
policy_loss_coef: how much weight to give to the policy (actor) component of the loss when the actor and critic share parameters and so are trained jointly.
val_loss_coef: how much weight to give to the critic component of the loss when the actor and critic share parameters and so are trained jointly.
use_same_optim: whether to use optim_actor for both the actor and critic. This can be useful when conducting a parameter search.
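To make the difference between the two advantage estimators concrete, below is a minimal NumPy sketch of the n-step and GAE targets. It is not SLM Lab's implementation: it assumes a single non-terminal trajectory segment (no done-flag handling), and the gamma, lam, and n values are illustrative.

```python
import numpy as np

def nstep_advantage(rewards, values, next_value, gamma=0.99, n=5):
    """Advantage from n-step forward returns: sum of up to n discounted rewards,
    plus the bootstrapped value of the state n steps ahead, minus V(s_t)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    values_ext = np.append(values, next_value)  # V(s_0), ..., V(s_T)
    T = len(rewards)
    advs = np.zeros(T)
    for t in range(T):
        ret, discount = 0.0, 1.0
        for i in range(t, min(t + n, T)):
            ret += discount * rewards[i]
            discount *= gamma
        # bootstrap from the value estimate n steps ahead (or the segment's last state)
        ret += discount * values_ext[min(t + n, T)]
        advs[t] = ret - values[t]
    return advs

def gae_advantage(rewards, values, next_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: discounted, exponentially weighted sum of TD errors."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    values_ext = np.append(values, next_value)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]  # one-step TD errors
    advs = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advs[t] = running
    return advs
```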
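As a quick illustration of how these parameters fit together, here is a hypothetical algorithm-spec fragment written as a Python dict. The values are illustrative only, and the exact surrounding structure of a real SLM Lab JSON spec file (and which keys are required) may differ.

```python
# Illustrative actor-critic algorithm parameters; values are not recommendations.
actor_critic_algorithm_spec = {
    "action_policy": "default",   # pick Categorical or Normal based on the env
    "use_gae": True,              # use GAE for the advantage
    "use_nstep": False,           # ignored because use_gae is true
    "num_step_returns": 5,        # only used with n-step forward returns
    "add_entropy": True,          # add entropy to the advantage
    "entropy_coef": 0.01,
    "training_frequency": 1,      # e.g. train after every episode with OnPolicyReplay
    "training_epoch": 8,          # critic gradient steps (separate networks only)
    "policy_loss_coef": 1.0,      # actor loss weight when parameters are shared
    "val_loss_coef": 0.5,         # critic loss weight when parameters are shared
    "use_same_optim": False,      # use optim_actor for both networks if True
}
```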