DQN

Deep Q-Learning

Q-learning (Watkins, 1989; Mnih et al., 2013) algorithms estimate the optimal Q-function, i.e. the value of taking action A in state S under the optimal policy. Q-learning algorithms have an implicit policy (strategy for acting in the environment). This is typically $\epsilon$-greedy, in which the action with the maximum Q-value is selected with probability $1 - \epsilon$ and a random action is taken with probability $\epsilon$, or Boltzmann (see definition below). Random actions encourage exploration of the state space and help prevent algorithms from getting stuck in local minima.

Q-learning algorithms are off-policy because the target value used to train the network is independent of the policy used to generate the training data. This makes it possible to use experience replay to train an agent, as sketched below.
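
As a concrete illustration, a minimal uniform replay memory can be sketched as follows. This is a simplified, assumed implementation for illustration only, not SLM Lab's Replay class.

    import random
    from collections import deque

    class SimpleReplay:
        def __init__(self, max_size):
            # once max_size is reached, the oldest experiences are discarded
            self.buffer = deque(maxlen=max_size)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # uniform random sample of past experiences, independent of the
            # policy that generated them; this independence is what makes
            # off-policy training from a replay buffer possible
            return random.sample(self.buffer, batch_size)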

Q-learning is a bootstrapped algorithm: updates to the Q-function are based on existing estimates. It is also a temporal difference algorithm: the estimate at time t is updated using an estimate from time t+1. This allows Q-learning algorithms to be online and incremental, so the agent can be trained during an episode.
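
For reference, the classic tabular form of this bootstrapped temporal-difference update, with learning rate $\alpha$, is:

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big)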

Algorithm: DQN with target network

    \begin{aligned}
    & \text{For } k = 1 \dots N: \\
    & \quad \text{Gather data } (s_i, a_i, r_i, s'_i) \text{ by acting in the environment using some policy} \\
    & \quad \text{For } j = 1 \dots M: \\
    & \quad\quad \text{Sample a batch of data from the replay memory} \\
    & \quad\quad \text{For } p = 1 \dots Z: \\
    & \quad\quad\quad \text{1. Calculate target values for each example:} \\
    & \quad\quad\quad\quad y_i = r_i + \gamma \max_{a'} Q(s'_i, a'; \theta_{TAR}) \mid s_i, a_i \\
    & \quad\quad\quad \text{2. Update network parameters using MSE loss:} \\
    & \quad\quad\quad\quad L_j(\theta) = \frac{1}{2} \sum_i \| y_i - Q(s_i, a_i; \theta) \|^2 \\
    & \quad \text{Periodically update } \theta_{TAR} \text{ with } \theta \text{ or a mix of } \theta \text{ and } \theta_{TAR}
    \end{aligned}
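
The inner update step can be sketched in PyTorch roughly as follows. This is an illustrative, assumed implementation (function and variable names are not SLM Lab's); it only shows the target calculation and the MSE loss update from the pseudocode above.

    import torch
    import torch.nn.functional as F

    def dqn_update(net, target_net, optimizer, batch, gamma):
        # batch fields are tensors; dones is a float tensor of 0/1 flags
        states, actions, rewards, next_states, dones = batch

        # 1. target values: y_i = r_i + gamma * max_a' Q(s'_i, a'; theta_TAR)
        #    (1 - dones) zeroes the bootstrap term at episode ends, a standard
        #    detail not shown in the pseudocode above
        with torch.no_grad():
            max_next_q = target_net(next_states).max(dim=1)[0]
            targets = rewards + gamma * (1 - dones) * max_next_q

        # 2. MSE loss between targets and current Q-value estimates
        q_values = net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q_values, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()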

See dqn.json for example specs of variations of the DQN algorithm (e.g. DQN, DoubleDQN, DRQN). Parameters are explained below.
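
As a brief illustration of one variation: Double DQN, as generally described in the literature (the exact spec options live in dqn.json), changes only the target calculation so that the online network selects the action and the target network evaluates it:

    y_i = r_i + \gamma \, Q\big(s'_i, \arg\max_{a'} Q(s'_i, a'; \theta); \theta_{TAR}\big)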

Basic Parameters

    "agent": [{
      "name": str,
      "algorithm": {
        "name": str,
        "action_pdtype": str,
        "action_policy": str,
        "explore_var_start": float,
        "explore_var_end": float,
        "explore_anneal_epi": int,
        "gamma": float,
        "training_batch_epoch": int,
        "training_epoch": int,
        "training_frequency": int,
      },
      "memory": {
        "name": str,
        "batch_size": int,
        "max_size": int
      },
      "net": {
        "type": str,
        "hid_layers": list,
        "hid_layers_activation": str,
        "optim_spec": dict,
      }
    }],
    ...
  }
  • algorithm

    • name, action_pdtype, gamma general parameters.

    • action_policy string specifying which policy to use to act: "boltzmann" or "epsilon_greedy" (see the sketch after this list).

      • "boltzmann" policy selects actions by sampling from a probability distribution over the actions. This distribution is generated by taking a softmax over all the Q-values (estimated by a neural network) for a state, adjusted by the temperature parameter, tau.

      • "epsilon_greedy" policy selects a random action with probability epsilon, and the action corresponding to the maximum Q-value with probability (1 - epsilon).

    • explore_var_start initial value of the exploration parameter (tau or epsilon).

    • explore_var_end final value of the exploration parameter (tau or epsilon).

    • explore_anneal_epi how many episodes to take to reduce the exploration parameter value from start to end. The reduction is currently linear.

    • training_batch_epoch how many gradient updates to make per batch.

    • training_epoch how many batches to sample from the replay memory each time the agent is trained.

    • training_frequency how often to train the algorithm. A value of 3 means the agent is trained every 3 steps it takes in the environment.

  • memory

    • name general parameter. Compatible types: "Replay", "PrioritizedReplay".

    • batch_size how many examples to include in each batch when sampling from the replay memory.

    • max_size maximum size of the memory. Once the memory has reached maximum capacity, the oldest examples are deleted to make space for new examples.

  • net

    • type general parameter. Compatible types: all networks.

    • hid_layers, hid_layers_activation, optim_spec general parameters.
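
As referenced in the action_policy bullet above, here is a minimal sketch of the two exploration policies. It assumes the Q-values for the current state have already been computed by the network; the function names and the use of NumPy are illustrative, not SLM Lab's implementation.

    import numpy as np

    def epsilon_greedy(q_values, epsilon):
        # with probability epsilon take a random action,
        # otherwise take the action with the maximum Q-value
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

    def boltzmann(q_values, tau):
        # softmax over Q-values, sharpened or flattened by the temperature tau,
        # then sample an action from the resulting distribution
        z = q_values / tau
        z = z - np.max(z)  # numerical stability
        probs = np.exp(z) / np.sum(np.exp(z))
        return int(np.random.choice(len(q_values), p=probs))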

Advanced Parameters

    "agent": [{
      "algorithm": {
        "training_min_timestep": int,
        "action_policy_update": str,
      },
      "memory": {
        "use_cer": bool
      },
      "net": {
        "rnn_hidden_size": int,
        "rnn_num_layers": int,
        "seq_len": int,
        "update_type": str,
        "update_frequency": int,
        "polyak_weight": float,
        "clip_grad": bool,
        "clip_grad_val": float,
        "loss_spec": dict
        "lr_decay": str,
        "lr_decay_frequency": int,
        "lr_decay_min_timestep": int,
        "lr_anneal_timestep": int,
        "gpu": int
      }
    }],
    ...
}
  • algorithm

    • training_min_timestep how many time steps to wait before starting to train. It can be useful to set this to 0.5 - 1x the batch size so that the DQN has a few examples to learn from in the first training iterations.

    • action_policy_update how to update the explore_var parameter in the action policy each episode. Available options are "linear_decay", "rate_decay", and "periodic_decay". See policy_util.py for more details.

  • memory

    • use_cer whether to use Combined Experience Replay.

  • net

    • rnn_hidden_size, rnn_num_layers, seq_len general parameters for recurrent networks.

    • update_type method of updating target_net: "replace" or "polyak". "replace" replaces target_net with net every update_frequency time steps. "polyak" updates target_net with polyak_weight * target_net + (1 - polyak_weight) * net at each time step (see the sketch after this list).

    • update_frequency how often to update target_net with net when using the "replace" update_type.

    • polyak_weight $\in [0, 1]$ how much weight to give the old target_net when updating target_net using the "polyak" update_type.

    • clip_grad, clip_grad_val, loss_spec, lr_decay, lr_decay_frequency, lr_decay_min_timestep, lr_anneal_timestep, gpu general parameters.
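
As referenced in the update_type bullet above, the two target-network update methods can be sketched in PyTorch as follows. This is an illustrative sketch under the definitions in this list, not SLM Lab's actual code.

    import torch

    def update_target(net, target_net, update_type, polyak_weight):
        if update_type == "replace":
            # hard update, done every update_frequency time steps:
            # copy the online network's weights wholesale
            target_net.load_state_dict(net.state_dict())
        elif update_type == "polyak":
            # soft update, done every time step:
            # target <- polyak_weight * target + (1 - polyak_weight) * net
            with torch.no_grad():
                for t_param, param in zip(target_net.parameters(), net.parameters()):
                    t_param.mul_(polyak_weight).add_(param, alpha=1 - polyak_weight)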
