DQN

Deep Q-Learning

Q-learning (Watkins, 1989; Mnih et al., 2013) algorithms estimate the optimal Q function, i.e. the value of taking action a in state s under the optimal policy. Q-learning algorithms have an implicit policy (strategy for acting in the environment). This is typically ε-greedy, in which the action with the maximum Q-value is selected with probability 1 - ε and a random action is taken with probability ε, or Boltzmann (see definition below). Random actions encourage exploration of the state space and help prevent algorithms from getting stuck in local minima.

Q-learning algorithms are off-policy algorithms because the target value used to train the network is independent of the policy used to generate the training data. This makes it possible to use experience replay to train an agent.
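
The replay memory can be as simple as a fixed-size buffer of transitions. Below is a minimal sketch, assuming a deque-backed buffer; the class and method names are illustrative, not SLM Lab's internal API.

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size store of (state, action, reward, next_state, done) transitions."""

        def __init__(self, max_size=10000):
            # a deque with maxlen evicts the oldest transition once the buffer is full
            self.buffer = deque(maxlen=max_size)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # uniform random sampling decorrelates consecutive transitions
            return random.sample(self.buffer, batch_size)

Because the target values do not depend on the behaviour policy, transitions collected by older policies remain valid training data.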

Q-learning is a bootstrapped algorithm: updates to the Q function are based on existing estimates. It is also a temporal difference algorithm: the estimate at time t is updated using an estimate from time t+1. This allows Q-learning algorithms to be online and incremental, so the agent can be trained during an episode.
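
In tabular form this is the standard Q-learning update, which makes the bootstrapping explicit:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
$$

The bracketed term is the temporal difference error; its target is built from the current estimate at time t+1, and the update can be applied after every environment step.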

Algorithm: DQN with target network

$$
\begin{aligned}
& \text{For } k = 1 \ldots N: \\
& \quad \text{Gather data } (s_i, a_i, r_i, s'_i) \text{ by acting in the environment using some policy} \\
& \quad \text{For } j = 1 \ldots M: \\
& \quad \quad \text{Sample a batch of data from the replay memory} \\
& \quad \quad \text{For } p = 1 \ldots Z: \\
& \quad \quad \quad \text{1. Calculate target values for each example:} \\
& \quad \quad \quad \quad y_i = r_i + \gamma \max_{a'} Q(s'_i, a'; \theta_{TAR}) \mid s_i, a_i \\
& \quad \quad \quad \text{2. Update network parameters using MSE loss:} \\
& \quad \quad \quad \quad L_j(\theta) = \frac{1}{2} \sum_i \big\lVert y_i - Q(s_i, a_i; \theta) \big\rVert^2 \\
& \quad \text{Periodically update } \theta_{TAR} \text{ with } \theta \text{ or a mix of } \theta \text{ and } \theta_{TAR}
\end{aligned}
$$
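
The inner update can be written compactly in PyTorch. This is a minimal sketch, assuming q_net and target_net are torch.nn.Module networks mapping a batch of states to per-action Q-values, and batch holds tensors sampled from the replay memory; the function and variable names are illustrative.

    import torch
    import torch.nn.functional as F

    def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
        """One gradient step on a batch of transitions from the replay memory."""
        states, actions, rewards, next_states, dones = batch

        # 1. Target values: y_i = r_i + gamma * max_a' Q(s'_i, a'; theta_TAR).
        #    (1 - dones) zeroes the bootstrap term for terminal transitions,
        #    a standard detail left implicit in the pseudocode above.
        with torch.no_grad():
            next_q = target_net(next_states).max(dim=1).values
            targets = rewards + gamma * (1.0 - dones) * next_q

        # 2. Current estimates Q(s_i, a_i; theta) and MSE loss
        q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q_values, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()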

See slm_lab/spec/benchmark/dqn/ for example specs of DQN variations (DQN, DoubleDQN, DDQN+PER). Parameters are explained below.

Basic Parameters

    "agent": {
      "name": str,
      "algorithm": {
        "name": str,
        "action_pdtype": str,
        "action_policy": str,
        "explore_var_spec": {...},
        "gamma": float,
        "training_batch_iter": int,
        "training_iter": int,
        "training_start_step": int,
        "training_frequency": int,
      },
      "memory": {
        "name": str,
        "batch_size": int,
        "max_size": int
      },
      "net": {
        "type": str,
        "hid_layers": list,
        "hid_layers_activation": str,
        "optim_spec": dict,
      }
    },
    ...
  }
  • algorithm

    • action_pdtype general param

    • action_policy string specifying which policy to use when acting: "boltzmann" or "epsilon_greedy" (a sketch of both policies and the exploration decay follows this parameter list).

      • "boltzmann" policy selects actions by sampling from a probability distribution over the actions. This is generated by taking a softmax over all the Q-values (estimated by a neural network) for a state, adjusted by the temperature parameter, tau.

      • "epsilon_greedy" policy selects a random action with probability epsilon, and the action corresponding to the maximum Q-value with (1 - epsilon).

    • explore_var_spec dict controlling exploration parameter decay:

      • name: decay type ("linear_decay", "rate_decay", "periodic_decay", "no_decay")

      • start_val: initial value for tau (boltzmann) or epsilon (epsilon_greedy)

      • end_val: final value after decay

      • start_step: step to begin decay

      • end_step: step to reach end_val

    • training_batch_iter how many gradient updates to make per batch

    • training_iter how many batches to sample from the replay memory each time the agent is trained

    • training_frequency how often to train the algorithm. A value of 3 means the agent is trained every 3 steps it takes in the environment.

    • training_start_step how many time steps to wait before starting to train. Set this to roughly 0.5-1x the batch size so that DQN has examples to learn from.

  • memory

    • batch_size how many examples to include in each batch when sampling from the replay memory.

    • max_size maximum size of the memory. Once the memory has reached maximum capacity, the oldest examples are deleted to make space for new examples.

  • net
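
Below is a sketch of the two action policies and a linear exploration decay, assuming q_values is a 1-D array of Q-value estimates for the current state; the function names are illustrative, not SLM Lab's API.

    import numpy as np

    def epsilon_greedy(q_values, epsilon):
        """Random action with probability epsilon, otherwise the argmax action."""
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

    def boltzmann(q_values, tau):
        """Sample an action from a softmax over Q-values scaled by the temperature tau."""
        logits = np.asarray(q_values, dtype=np.float64) / tau
        logits -= logits.max()  # subtract the max for numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return int(np.random.choice(len(q_values), p=probs))

    def linear_decay(step, start_val, end_val, start_step, end_step):
        """Anneal an exploration variable (epsilon or tau), as in a linear_decay explore_var_spec."""
        if step < start_step:
            return start_val
        frac = min(1.0, (step - start_step) / (end_step - start_step))
        return start_val + frac * (end_val - start_val)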

Advanced Parameters

  • memory

  • net

    • rnn_hidden_size general param

    • rnn_num_layers general param

    • update_type method of updating target_net: "replace" or "polyak". "replace" replaces target_net with net every update_frequency time steps. "polyak" updates target_net with polyak_coef * target_net + (1 - polyak_coef) * net at each time step (see the sketch after this list).

    • update_frequency how often to update target_net with net when using "replace" update_type.

    • polyak_coef ∈ [0, 1] how much weight to give the old target_net when updating the target_net using the "polyak" update_type.

    • clip_grad_val: general param

    • loss_spec: general param

    • lr_scheduler_spec: optional learning rate scheduler config
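
Below is a sketch of the two target network update rules, assuming net and target_net are torch.nn.Module instances with identical architectures; the function names are illustrative, not SLM Lab's API.

    import torch

    def replace_update(net, target_net):
        # "replace": copy the online network's parameters into the target network,
        # performed every update_frequency time steps
        target_net.load_state_dict(net.state_dict())

    def polyak_update(net, target_net, polyak_coef=0.995):
        # "polyak": move the target network a small step toward the online network,
        # keeping polyak_coef of the old target weights at every time step
        with torch.no_grad():
            for target_param, param in zip(target_net.parameters(), net.parameters()):
                target_param.copy_(polyak_coef * target_param + (1.0 - polyak_coef) * param)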
