t is updated using an estimate from time
t+1. This allows Q-Learning algorithms to be online and incremental, so the agent can be trained during an episode.
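To make this concrete, below is a minimal tabular Q-Learning update, written as an illustrative sketch rather than the library's actual implementation (the names `Q`, `alpha`, and `gamma` are assumptions):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One online Q-Learning step: the estimate for time t is updated
    using the bootstrapped estimate from time t+1."""
    # TD target: reward plus the discounted max Q-value of the next state
    target = r if done else r + gamma * np.max(Q[s_next])
    # Incremental update toward the target; there is no need to wait for
    # the episode to end, so the agent can be trained during an episode
    Q[s, a] += alpha * (target - Q[s, a])
```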
- `action_policy`: string specifying which policy to use to act: "boltzmann" or "epsilon_greedy".
- `explore_var_start`: initial value of the exploration variable (tau or epsilon).
- `explore_var_end`: end value of the exploration variable (tau or epsilon).
- `explore_anneal_epi`: how many episodes to take to reduce the exploration variable from its start value to its end value. Reduction is currently linear; see the decay sketch after this list.
- `training_batch_epoch`: how many gradient updates to make per batch.
- `training_epoch`: how many batches to sample from the replay memory each time the agent is trained.
- `training_frequency`: how often to train the algorithm. A value of 3 means the agent is trained every 3 steps it takes in the environment; see the training-cadence sketch after this list.
- `batch_size`: how many examples to include in each batch when sampling from the replay memory.
- `max_size`: maximum size of the memory. Once the memory reaches maximum capacity, the oldest examples are deleted to make space for new examples; a minimal replay-buffer sketch follows this list.
- `training_min_timestep`: how many time steps to wait before starting to train. It can be useful to set this to 0.5-1x the batch size so that the DQN has a few examples to learn from in the first training iterations.
- `action_policy_update`: how to update the `explore_var` parameter in the action policy each episode. Available options are "linear_decay", "rate_decay", and "periodic_decay". See policy_util.py for more details.
- `update_type`: method of updating the `target_net`: "replace" or "polyak". "replace" replaces the `target_net` with the `net` every `update_frequency` time steps. "polyak" updates the `target_net` every time step as a weighted average of the two networks, beta * `target_net` + (1 - beta) * `net`; see the target-network sketch after this list.
- `update_frequency`: how often to update the `target_net` with the `net` when using the "replace" method of updating the `target_net`.
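As a rough illustration of the linear reduction of the exploration variable described above, here is a sketch; the function name and signature are hypothetical, and the real implementations live in policy_util.py:

```python
def linear_decay(explore_var_start, explore_var_end, explore_anneal_epi, epi):
    """Linearly anneal the exploration variable (tau or epsilon) from its
    start value to its end value over explore_anneal_epi episodes."""
    step = (explore_var_start - explore_var_end) / explore_anneal_epi
    val = explore_var_start - step * epi
    # Clamp so the variable never drops below its end value
    return max(explore_var_end, val)
```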
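The training_* parameters above combine into a training cadence along these lines. This is a hedged sketch: `agent.train_step` and `memory.sample` are assumed interfaces, not the library's actual API:

```python
def maybe_train(agent, memory, t, spec):
    """Train after a warm-up of training_min_timestep steps, then once
    every training_frequency environment steps."""
    if t < spec["training_min_timestep"] or t % spec["training_frequency"] != 0:
        return
    # Sample training_epoch batches from the replay memory...
    for _ in range(spec["training_epoch"]):
        batch = memory.sample(spec["batch_size"])  # assumed interface
        # ...and make training_batch_epoch gradient updates per batch
        for _ in range(spec["training_batch_epoch"]):
            agent.train_step(batch)  # assumed interface
```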
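A minimal replay-buffer sketch showing the `max_size` and `batch_size` behavior described above (the class and method names are illustrative; a deque with `maxlen` drops the oldest examples automatically):

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal replay buffer: once max_size is reached, appending a new
    example evicts the oldest one."""

    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniform random sample of batch_size stored experiences
        return random.sample(self.buffer, batch_size)
```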
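Finally, a sketch of the two `update_type` modes in PyTorch. The Polyak weight is written here as `beta`; its actual name in the spec is an assumption:

```python
import torch

def update_target_net(net, target_net, update_type, beta=0.99):
    """'replace' copies net into target_net wholesale (done once every
    update_frequency time steps); 'polyak' blends them every time step."""
    if update_type == "replace":
        # Hard update: overwrite all target_net weights with net's
        target_net.load_state_dict(net.state_dict())
    elif update_type == "polyak":
        # Soft update: target <- beta * target + (1 - beta) * net
        with torch.no_grad():
            for tgt, src in zip(target_net.parameters(), net.parameters()):
                tgt.copy_(beta * tgt + (1 - beta) * src)
```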