REINFORCE
REINFORCE (Williams, 1992) directly learns a parameterized policy, $\pi_\theta$, which maps states to probability distributions over actions.
Starting with random parameter values, the agent uses this policy to act in an environment and receive rewards. After an episode has finished, the "goodness" of each action, represented by $\hat{Q}(s_t, a_t)$, is calculated from the episode trajectory. The parameters of the policy are then updated in a direction which makes good actions more likely and bad actions less likely: good actions are reinforced, bad actions are discouraged.
The agent then uses the updated policy to act in the environment, and the training process repeats.
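To make the idea of a parameterized policy concrete, here is a minimal PyTorch sketch, illustrative only and not the library's own implementation (the class name and dimensions are made up), of a network that maps a state to a Categorical distribution over discrete actions, from which the agent samples to act.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """Maps a state to a probability distribution over discrete actions."""
    def __init__(self, state_dim, hid_dim, action_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, action_dim),
        )

    def forward(self, state):
        logits = self.layers(state)
        return Categorical(logits=logits)  # the action distribution for this state

# acting: sample an action and keep its log-probability for the later update
policy = PolicyNet(state_dim=4, hid_dim=64, action_dim=2)
state = torch.zeros(4)            # placeholder state
dist = policy(state)
action = dist.sample()
log_prob = dist.log_prob(action)  # needed for the policy-gradient update
```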
REINFORCE is an on-policy algorithm: only data gathered using the current policy can be used to update the parameters. Once the policy parameters have been updated, all previously gathered data must be discarded and collection restarted with the new policy.
There are a number of different approaches to calculating $\hat{Q}(s_t, a_t)$. Method 3, outlined below, is common. It captures the idea that the absolute quality of an action matters less than its quality relative to some baseline. One option for a baseline is the average of $\hat{Q}(s_t, a_t)$ over the training data (typically one episode trajectory).
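A minimal sketch of one such update, assuming a single collected episode and using the mean return as the baseline (function and variable names are illustrative, not the library's API). Actions whose return exceeds the baseline get a positive weight and become more likely; below-average actions become less likely.

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update with a mean-return baseline over a single episode."""
    # discounted return G_t for every timestep, computed backwards through the episode
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # baseline: the average return over the collected data
    advantages = returns - returns.mean()

    # good actions (positive advantage) are pushed up, bad actions pushed down
    loss = -(torch.stack(log_probs) * advantages).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # on-policy: the trajectory must now be discarded and fresh data collected
    return loss.item()
```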
Algorithm: REINFORCE with baseline
Methods for calculating $\hat{Q}(s_t, a_t)$:
See reinforce.json for example specs of variations of the REINFORCE algorithm.
Basic Parameters
"agent": [{
"name": str,
"algorithm": {
"name": str,
"action_pdtype": str,
"action_policy": str,
"gamma": float,
"training_frequency": int,
"add_entropy": bool,
"entropy_coef": float,
},
"memory": {
"name": str,
"max_size": int
"batch_size": int
},
"net": {
"type": str,
"hid_layers": list,
"hid_layers_activation": str,
"optim_spec": dict,
}
}],
...
}

- algorithm
  - name: general param
  - action_pdtype: general param
  - action_policy: string specifying which policy to use to act. For example, "Categorical" (for discrete action spaces), "Normal" (for continuous action spaces with one dimension), or "default" to automatically switch between the two depending on the environment.
  - gamma: general param
  - training_frequency: how many episodes of data to collect before each training iteration. A common value is 1.
  - add_entropy: whether to add an entropy term to the loss to encourage exploration
  - entropy_coef: coefficient to multiply the entropy of the distribution by when adding it to the loss
- memory
  - name: general param. Compatible types: "OnPolicyReplay", "OnPolicyBatchReplay"
  - batch_size: number of examples to collect before training. Only relevant for the batch on-policy memory "OnPolicyBatchReplay"
- net
  - type: general param. Compatible types: all networks.
  - hid_layers: general param
  - hid_layers_activation: general param
  - optim_spec: general param
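As a rough illustration of how the add_entropy and entropy_coef settings are typically used (a sketch, not the library's exact code), an entropy bonus on the action distribution can be folded into the policy loss:

```python
import torch
from torch.distributions import Categorical

def policy_loss_with_entropy(dist: Categorical, log_probs, advantages, entropy_coef=0.01):
    """Policy-gradient loss with an entropy bonus (illustrative only)."""
    pg_loss = -(log_probs * advantages).sum()
    entropy = dist.entropy().mean()          # average entropy of the action distribution
    return pg_loss - entropy_coef * entropy  # higher entropy lowers the loss, encouraging exploration
```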
Advanced Parameters
"agent": [{
"net": {
"rnn_hidden_size": int,
"rnn_num_layers": int,
"seq_len": int,
"clip_grad": bool,
"clip_grad_val": float,
"lr_decay": str,
"lr_decay_frequency": int,
"lr_decay_min_timestep": int,
"lr_anneal_timestep": int,
"gpu": int
}
}],
...
}

- net
  - rnn_hidden_size: general param
  - rnn_num_layers: general param
  - seq_len: general param
  - clip_grad: general param
  - clip_grad_val: general param
  - lr_decay: general param
  - lr_decay_frequency: general param
  - lr_decay_min_timestep: general param
  - lr_anneal_timestep: general param
  - gpu: general param
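For orientation, gradient clipping and learning-rate decay correspond to standard PyTorch mechanics roughly like the sketch below (illustrative only; the library's scheduler and internal parameter handling may differ):

```python
import torch
import torch.nn as nn

def clipped_step(net: nn.Module, optimizer, loss, clip_grad=True, clip_grad_val=0.5):
    """Backpropagate, optionally cap the gradient norm, then step the optimizer."""
    optimizer.zero_grad()
    loss.backward()
    if clip_grad:
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=clip_grad_val)
    optimizer.step()

def decay_lr(optimizer, decay_rate=0.9):
    """Multiply every parameter group's learning rate by decay_rate."""
    for group in optimizer.param_groups:
        group['lr'] *= decay_rate
```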