# DQN

## **Deep Q-Learning**

Q-learning ([Watkins, 1989](http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf), [Mnih et. al, 2013](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)) algorithms estimate the optimal Q function, i.e the value of taking action A in state S under the optimal policy. Q-learning algorithms have an implicit policy (strategy for acting in the environment). This is typically $$\epsilon$$-greedy, in which the action with the maximum Q value is selected with probability $$1 - \epsilon$$ and a random action is taken with probability $$\epsilon$$, or boltzmann (see definition below). Random actions encourage exploration of the state space and help prevent algorithms from getting stuck in local minima.

Q-learning algorithms are off-policy algorithms because the target value used to train the network is independent of the policy used to generate the training data. This makes it possible to use experience replay to train an agent.

It is bootstrapped algorithm; updates to the Q function are based on existing estimates, and a temporal difference algorithm; the estimate in time `t` is updated using an estimate from time `t+1`. This allows Q-Learning algorithms to be online and incremental, so the agent can be trained during an episode.

### **Algorithm: DQN with target network**

$$
\begin{aligned}
& \text{For k = 1 .... N:} \\
& \quad \text{Gather data } {(s\_i, a\_i, r\_i, s'*i)} \ \text{by acting in the environment using some policy} \\
& \quad \text{for j = 1 ... M:} \\
& \quad \quad \text{Sample a batch of data from the replay memory} \\
& \quad \quad \text{For p = 1 .... Z:} \\
& \quad \quad \quad \quad \text{1. Calculate target values for each example} \\
& \quad \quad \quad \quad \quad \quad y\_i = r\_i + \gamma \ \max\limits*{a'} Q(s'*i, a'; \theta*{TAR})|s\_i, a\_i \\
& \quad \quad \quad \quad\text{2. Update network parameters, using MSE loss} \\
& \quad \quad \quad \quad\quad \quad L\_j(\theta) = \frac{1}{2} \sum\_i || (y\_i - Q(s\_i,a\_i; \theta)) ||^2 \\
& \quad \text{Periodically udpate} ~~\theta\_{TAR} \~ \text{with}~~ \theta ~~\text{or a mix of}~~ \theta ~~\text{and}~~ \theta\_{TAR} \\
\end{aligned}
$$

See [dqn.json](https://github.com/kengz/SLM-Lab/blob/master/slm_lab/spec/dqn.json) for example specs of variations of the DQN algorithm (e.g. DQN, DoubleDQN, DRQN). Parameters are explained below.

### **Basic Parameters**

```python
    "agent": [{
      "name": str,
      "algorithm": {
        "name": str,
        "action_pdtype": str,
        "action_policy": str,
        "explore_var_start": float,
        "explore_var_end": float,
        "explore_anneal_epi": int,
        "gamma": float,
        "training_batch_epoch": int,
        "training_epoch": int,
        "training_frequency": int,
      },
      "memory": {
        "name": str,
        "batch_size": int,
        "max_size": int
      },
      "net": {
        "type": str,
        "hid_layers": list,
        "hid_layers_activation": str,
        "optim_spec": dict,
      }
    }],
    ...
  }
```

* `algorithm`
  * `name` [*general param*](https://kengz.gitbooks.io/slm-lab/content/algorithms.html)
  * `action_pdtype` [*general param*](https://kengz.gitbooks.io/slm-lab/content/algorithms.html)
  * `action_policy` string specifying which policy to use to act. "boltzmann" or "epsilon\_greedy".
    * "boltzmann" policy selects actions by sampling from a probability distribution over the actions. This is generated by taking a softmax over all the Q-values (estimated by a neural network) for a state, adjusted by the temperature parameter, tau.    &#x20;
    * "epsilon\_greedy" policy selects a random action with probability epsilon, and the action corresponding to the maximum Q-value with (1 - epsilon).
  * `explore_var_start` initial value for the exploration parameters (tau or epsilon)
  * `explore_var_end` end value for the exploration parameters (tau or epsilon)
  * `explore_anneal_epi` how many episodes to take to reduce the exploration parameter value from start to end. Reduction is currently linear.
  * `gamma` [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `training_batch_epoch` how many gradient updates to make per batch.
  * `training_epoch` how many batches to sample from the replay memory each time the agent is trained
  * `training_frequency` how often to train the algorithm. Value of 3 means train every 3 steps the agent takes in the environment.
* `memory`
  * `name` [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms). Compatible types; ["Replay", "PrioritizedReplay"](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/memory)
  * `batch_size` how many examples to include in each batch when sampling from the replay memory.
  * `max_size` maximum size of the memory. Once the memory has reached maximum capacity, the oldest examples are deleted to make space for new examples.
* `net`
  * `type` [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms). Compatible types; [all networks](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/neural-networks)
  * `hid_layers` [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `hid_layers_activation` [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `optim_spec` [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)

### **Advanced parameters**

```python
    "agent": [{
      "algorithm": {
        "training_min_timestep": int,
        "action_policy_update": str,
      },
      "memory": {
        "use_cer": bool
      },
      "net": {
        "rnn_hidden_size": int,
        "rnn_num_layers": int,
        "seq_len": int,
        "update_type": str,
        "update_frequency": int,
        "polyak_weight": float,
        "clip_grad": bool,
        "clip_grad_val": float,
        "loss_spec": dict
        "lr_decay": str,
        "lr_decay_frequency": int,
        "lr_decay_min_timestep": int,
        "lr_anneal_timestep": int,
        "gpu": int
      }
    }],
    ...
}
```

* `algorithm`
  * `training_min_timestep` how many time steps to wait before starting to train. It can be useful to set this to 0.5 - 1x the batch size so that the `DQN` has a few examples to learn from in the first training iterations.
  * `action_policy_update` how to update the `explore_var` parameter in the action policy each episode. Available options are "linear\_decay", "rate\_decay", and "periodic\_decay". See [policy\_util.py](https://github.com/kengz/SLM-Lab/blob/master/slm_lab/agent/algorithm/policy_util.py) for more details.
* `memory`
  * `use_cer`: whether to used [Combined Experience Replay](https://arxiv.org/abs/1712.01275)
* `net`
  * `rnn_hidden_size` [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `rnn_num_layers` [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `seq_len` [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `update_type` method of updating `target_net`. "replace" or "polyak". "replace" replaces `target_net` with `net` every `update_frequency` time steps. "polyak" updates `target_net` with `polyak_weight`  *`target_net` + (1 - `polyak_weight`)* `net` each time step.
  * `update_frequency` how often to update `target_net` with `net` when using "replace" `update_type`.
  * `polyak_weight` $$\in \[0, 1]$$ how much weight to give the old `target_net` when updating the `target_net` using "polyak" `update_type`
  * `clip_grad`: [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `clip_grad_val`: [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `loss_spec`: [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `lr_decay`: [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `lr_decay_frequency`: [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `lr_decay_min_timestep`: [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `lr_anneal_timestep`: [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
  * `gpu`: [*general param*](https://slm-lab.gitbook.io/slm-lab/v4.2.0/development/algorithms)
