DQN
Q-learning algorithms (Watkins, 1989; Mnih et al., 2013) estimate the optimal Q-function, i.e. the value of taking action a in state s under the optimal policy. Q-learning algorithms have an implicit policy (strategy for acting in the environment). This is typically epsilon-greedy, in which the action with the maximum Q-value is selected with probability (1 - epsilon) and a random action is taken with probability epsilon, or Boltzmann (see definition below). Random actions encourage exploration of the state space and help prevent algorithms from getting stuck in local minima.
Q-learning algorithms are off-policy algorithms because the target value used to train the network is independent of the policy used to generate the training data. This makes it possible to use experience replay to train an agent.
It is a bootstrapped algorithm (updates to the Q-function are based on existing estimates) and a temporal difference algorithm (the estimate at time t is updated using an estimate from time t+1). This allows Q-learning algorithms to be online and incremental, so the agent can be trained during an episode.
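To make the bootstrapped, temporal-difference update concrete, here is a minimal tabular Q-learning sketch. The q_table, learning rate alpha, and transition variables are illustrative assumptions; the algorithm described here estimates Q-values with a neural network rather than a table.

```python
from collections import defaultdict

q_table = defaultdict(float)   # maps (state, action) -> Q-value estimate (illustrative)
alpha, gamma = 0.1, 0.99       # learning rate and discount factor (assumed values)

def q_update(s, a, r, s_next, done, actions):
    # bootstrapped TD target: reward plus the discounted value of the best next action
    target = r if done else r + gamma * max(q_table[(s_next, a2)] for a2 in actions)
    # move the current estimate a small step toward the target
    q_table[(s, a)] += alpha * (target - q_table[(s, a)])
```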
See dqn.json for example specs of variations of the DQN algorithm (e.g. DQN, DoubleDQN, DRQN). Parameters are explained below.
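For orientation, the relevant part of a spec might be organized roughly as in the sketch below. The nesting and the values here are illustrative assumptions (as is the "MLPNet" network type); dqn.json remains the authoritative reference.

```json
{
  "algorithm": {
    "name": "DQN",
    "action_policy": "epsilon_greedy",
    "explore_var_start": 1.0,
    "explore_var_end": 0.1,
    "explore_anneal_epi": 100,
    "gamma": 0.99,
    "training_batch_epoch": 4,
    "training_epoch": 4,
    "training_frequency": 4
  },
  "memory": {
    "name": "Replay",
    "batch_size": 32,
    "max_size": 10000
  },
  "net": {
    "type": "MLPNet",
    "hid_layers": [64],
    "hid_layers_activation": "relu"
  }
}
```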
algorithm
name
general param
action_pdtype
general param
action_policy
string specifying which policy to use for acting: "boltzmann" or "epsilon_greedy".
"boltzmann" policy selects actions by sampling from a probability distribution over the actions. This is generated by taking a softmax over all the Q-values (estimated by a neural network) for a state, adjusted by the temperature parameter, tau.
"epsilon_greedy" policy selects a random action with probability epsilon, and the action corresponding to the maximum Q-value with (1 - epsilon).
explore_var_start
initial value of the exploration variable (tau or epsilon)
explore_var_end
end value of the exploration variable (tau or epsilon)
explore_anneal_epi
how many episodes to take to reduce the exploration parameter value from start to end. Reduction is currently linear.
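A linear annealing schedule of this kind could look like the sketch below; the function name and the episode-indexed form are assumptions for illustration only.

```python
def linear_anneal(epi, explore_var_start, explore_var_end, explore_anneal_epi):
    # linearly interpolate from the start value to the end value over
    # explore_anneal_epi episodes, then hold at the end value
    frac = min(epi / explore_anneal_epi, 1.0)
    return explore_var_start + frac * (explore_var_end - explore_var_start)
```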
gamma
general param
training_batch_epoch
how many gradient updates to make per batch.
training_epoch
how many batches to sample from the replay memory each time the agent is trained.
training_frequency
how often to train the algorithm. A value of 3 means the agent is trained every 3 steps it takes in the environment.
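Taken together, training_frequency, training_epoch, and training_batch_epoch control a training loop roughly like the sketch below; the function and method names (memory.sample, net.train_step) are assumed for illustration and are not the library's API.

```python
def maybe_train(step, memory, net, training_frequency, training_epoch, training_batch_epoch):
    if step % training_frequency != 0:
        return  # only train every training_frequency environment steps
    for _ in range(training_epoch):            # batches sampled per training call
        batch = memory.sample()                # draws batch_size transitions (assumed method)
        for _ in range(training_batch_epoch):  # gradient updates per sampled batch
            net.train_step(batch)              # one gradient update (assumed method)
```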
memory
name
general param. Compatible types: "Replay", "PrioritizedReplay"
batch_size
how many examples to include in each batch when sampling from the replay memory.
max_size
maximum size of the memory. Once the memory has reached maximum capacity, the oldest examples are deleted to make space for new examples.
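A toy replay memory using these two parameters could be sketched as follows; this is an assumption-laden illustration, not the library's Replay class.

```python
import random
from collections import deque

class TinyReplay:
    """Toy replay memory: bounded by max_size, uniform sampling of batch_size items."""

    def __init__(self, max_size, batch_size):
        self.buffer = deque(maxlen=max_size)  # oldest experiences are dropped automatically
        self.batch_size = batch_size

    def add(self, transition):
        self.buffer.append(transition)        # transition = (s, a, r, s_next, done)

    def sample(self):
        return random.sample(list(self.buffer), min(self.batch_size, len(self.buffer)))
```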
net
type
general param. Compatible types: all networks
hid_layers
general param
hid_layers_activation
general param
optim_spec
general param
algorithm
training_min_timestep
how many time steps to wait before starting to train. It can be useful to set this to 0.5 - 1x the batch size so that the DQN has a few examples to learn from in the first training iterations.
action_policy_update
how to update the explore_var parameter in the action policy each episode. Available options are "linear_decay", "rate_decay", and "periodic_decay". See policy_util.py for more details.
memory
use_cer
whether to use Combined Experience Replay.
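Combined Experience Replay simply guarantees that the most recent transition appears in every sampled batch. A minimal sketch, building on the toy replay memory above (names assumed):

```python
def sample_with_cer(memory):
    # sample as usual, then force the newest transition into the batch
    batch = memory.sample()
    latest = memory.buffer[-1]
    if latest not in batch:
        batch[-1] = latest  # replace one sampled item with the latest transition
    return batch
```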
net
rnn_hidden_size
general param
rnn_num_layers
general param
seq_len
general param
update_type
method of updating target_net. "replace" or "polyak". "replace" replaces target_net with net every update_frequency time steps. "polyak" updates target_net with polyak_weight * target_net + (1 - polyak_weight) * net each time step.
update_frequency
how often to update target_net with net when using the "replace" update_type.
clip_grad
general param
clip_grad_val
general param
loss_spec
general param
lr_decay
general param
lr_decay_frequency
general param
lr_decay_min_timestep
general param
lr_anneal_timestep
general param
gpu
general param
polyak_weight
how much weight to give the old target_net when updating the target_net using the "polyak" update_type.