# REINFORCE

REINFORCE [Williams, 1992](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) directly learns a parameterized policy, $$\pi$$, which maps states to probability distributions over actions.

Starting with random parameter values, the agent uses this policy to act in an environment and receive rewards. After an episode has finished, the "goodness" of each action, represented by $$f(\tau)$$, is calculated from the episode trajectory. The policy parameters are then updated in a direction that makes good actions $$(f(\tau) > 0)$$ more likely and bad actions $$(f(\tau) < 0)$$ less likely. Good actions are reinforced; bad actions are discouraged.

The agent then uses the updated policy to act in the environment, and the training process repeats.
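
As a rough illustration of what a parameterized policy mapping states to distributions over actions looks like in code, here is a minimal PyTorch sketch. It is not SLM-Lab's implementation; the class name, layer sizes, and dimensions are made up for illustration (SLM-Lab builds its networks from the `net` spec described below).

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """Maps a state vector to a probability distribution over discrete actions."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state):
        logits = self.model(state)
        return Categorical(logits=logits)  # pi_theta(a|s)

# Acting: sample an action and keep its log-probability for the update later
policy = PolicyNet(state_dim=4, action_dim=2)
dist = policy(torch.randn(4))
action = dist.sample()
log_prob = dist.log_prob(action)
```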

REINFORCE is an on-policy algorithm: only data gathered using the current policy can be used to update the parameters. Once the policy parameters have been updated, all previously gathered data must be discarded and collection must start again with the new policy.

There are a number of different approaches to calculating $$f(\tau)$$. Method 3, outlined below, is common. It captures the idea that the absolute quality of the actions matters less than their quality relative to some baseline. One option for a baseline is the average of $$f(\tau)$$ over the training data (typically one episode trajectory).

**Algorithm: REINFORCE with baseline**

$$
\begin{aligned}
& \text{Initialize weights } \theta \text{, learning rate } \alpha \\
& \text{for each episode (trajectory) } \tau = \{s_0, a_0, r_0, s_1, \cdots, r_T\} \sim \pi_\theta \\
& \quad \text{for } t = 0 \text{ to } T \text{ do} \\
& \quad \quad \theta \leftarrow \theta + \alpha \, f(\tau)_t \, \nabla_\theta \log \pi_\theta(a_t|s_t) \\
& \quad \text{end for} \\
& \text{end for} \\
\end{aligned}
$$
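
In an automatic-differentiation framework the update above is usually implemented as gradient descent on the negated objective. Below is a minimal PyTorch sketch of one such step, assuming `log_probs` (the stacked $$\log \pi_\theta(a_t|s_t)$$ values) and `f_taus` (the $$f(\tau)_t$$ values) have already been collected over one episode; this is an illustration, not SLM-Lab's code.

```python
import torch

def reinforce_update(optimizer, log_probs, f_taus):
    """One REINFORCE gradient step for a single episode.

    log_probs: 1-D tensor of log pi_theta(a_t|s_t), one entry per time step
    f_taus:    1-D tensor of f(tau)_t values, e.g. discounted returns minus a baseline
    """
    # Gradient ascent on sum_t f(tau)_t * log pi_theta(a_t|s_t)
    # == gradient descent on the negated sum, which is what optimizers minimize.
    loss = -(f_taus.detach() * log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```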

Methods for calculating $$f(\tau)_t$$:

$$
\begin{aligned}
& \text{Given } \nabla_\theta J(\theta) \approx \sum_{t \geq 0} f(\tau) \, \nabla_\theta \log \pi_\theta(a_t|s_t), \text{ the estimate can be improved with:} \\
& \quad \quad 1.\ \text{reward-to-go: } f(\tau) = \sum\limits_{t' \geq t} r_{t'} \\
& \quad \quad 2.\ \text{add a discount factor: } f(\tau) = \sum\limits_{t' \geq t} \gamma^{t'-t} r_{t'} \\
& \quad \quad 3.\ \text{introduce a baseline: } f(\tau) = \sum\limits_{t' \geq t} \gamma^{t'-t} r_{t'} - b(s_t) \\
\end{aligned}
$$
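
The sketch below shows one way $$f(\tau)_t$$ could be computed for methods 2 and 3, using the mean return over the episode as the baseline $$b(s_t)$$ (the simple baseline option mentioned earlier). Again, this is an illustration rather than SLM-Lab's implementation.

```python
import torch

def calc_f_tau(rewards, gamma=0.99, use_baseline=True):
    """f(tau)_t as discounted reward-to-go (method 2), optionally minus a mean baseline (method 3)."""
    returns = []
    running = 0.0
    for r in reversed(rewards):          # G_t = r_t + gamma * G_{t+1}, computed backwards
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    f_tau = torch.tensor(returns, dtype=torch.float32)
    if use_baseline:
        f_tau = f_tau - f_tau.mean()     # simple baseline: average return over the episode
    return f_tau
```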

See [slm\_lab/spec/benchmark/reinforce/](https://github.com/kengz/SLM-Lab/tree/master/slm_lab/spec/benchmark/reinforce) for example REINFORCE specs.

**Basic Parameters**

```python
    "agent": {
      "name": str,
      "algorithm": {
        "name": str,
        "action_pdtype": str,
        "action_policy": str,
        "gamma": float,
        "training_frequency": int,
        "entropy_coef_spec": {...},
      },
      "memory": {
        "name": str,
      },
      "net": {
        "type": str,
        "hid_layers": list,
        "hid_layers_activation": str,
        "optim_spec": dict,
      }
    },
    ...
}
```

* `algorithm`
  * `name` [*general param*](/slm-lab/v5.2.0/development/algorithms.md)
  * `action_pdtype` [*general param*](/slm-lab/v5.2.0/development/algorithms.md)
  * `action_policy` string specifying the policy used to act. For example, "Categorical" (for discrete action spaces), "Normal" (for continuous action spaces with one dimension), or "default" to automatically switch between the two depending on the environment.
  * `gamma` [*general param*](/slm-lab/v5.2.0/development/algorithms.md)
  * `training_frequency` how many episodes of data to collect before each training iteration. A common value is 1.
  * `entropy_coef_spec` schedule for entropy coefficient added to the loss to encourage exploration. Example: `{"name": "no_decay", "start_val": 0.01, "end_val": 0.01, "start_step": 0, "end_step": 0}`
  * `center_return` (optional, default `false`) whether to center returns by subtracting the mean before computing policy gradient. Can improve training stability.
* `memory`
  * `name` [*general param*](/slm-lab/v5.2.0/development/algorithms.md). Compatible types: ["OnPolicyReplay", "OnPolicyBatchReplay"](/slm-lab/v5.2.0/development/memory.md)
* `net`
  * `type` [*general param*](/slm-lab/v5.2.0/development/algorithms.md). Compatible types: [all networks](/slm-lab/v5.2.0/development/neural-networks.md).
  * `hid_layers` [*general param*](/slm-lab/v5.2.0/development/algorithms.md)
  * `hid_layers_activation` [*general param*](/slm-lab/v5.2.0/development/algorithms.md)
  * `optim_spec` [*general param*](/slm-lab/v5.2.0/development/algorithms.md)
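
For orientation, here is a hypothetical spec fragment with the basic parameters filled in. The values are illustrative placeholders, not the tuned settings from the benchmark spec files linked above.

```python
    "agent": {
      "name": "Reinforce",
      "algorithm": {
        "name": "Reinforce",
        "action_pdtype": "default",
        "action_policy": "default",
        "gamma": 0.99,
        "training_frequency": 1,
        "entropy_coef_spec": {
          "name": "no_decay",
          "start_val": 0.01,
          "end_val": 0.01,
          "start_step": 0,
          "end_step": 0
        }
      },
      "memory": {
        "name": "OnPolicyReplay"
      },
      "net": {
        "type": "MLPNet",
        "hid_layers": [64],
        "hid_layers_activation": "relu",
        "optim_spec": {"name": "Adam", "lr": 0.002}
      }
    },
    ...
}
```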

**Advanced Parameters**

```python
    "agent": {
      "net": {
        "rnn_hidden_size": int,
        "rnn_num_layers": int,
        "seq_len": int,
        "clip_grad_val": float,
        "lr_scheduler_spec": dict,
        "gpu": str
      }
    },
    ...
}
```

* `net`
  * `rnn_hidden_size` [*general param*](/slm-lab/v5.2.0/development/algorithms.md)
  * `rnn_num_layers` [*general param*](/slm-lab/v5.2.0/development/algorithms.md)
  * `seq_len` [*general param*](/slm-lab/v5.2.0/development/algorithms.md)
  * `clip_grad_val` [*general param*](/slm-lab/v5.2.0/development/algorithms.md)
  * `lr_scheduler_spec` optional learning rate scheduler configuration
  * `gpu` [*general param*](/slm-lab/v5.2.0/development/algorithms.md)
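
And a hypothetical fragment for the advanced network parameters, again with placeholder values for illustration. The RNN-related fields are typically only relevant when a recurrent network `type` is used; `gpu` is omitted here.

```python
    "agent": {
      "net": {
        "rnn_hidden_size": 64,
        "rnn_num_layers": 1,
        "seq_len": 4,
        "clip_grad_val": 0.5,
        "lr_scheduler_spec": {
          "name": "StepLR",
          "step_size": 1000,
          "gamma": 0.9
        }
      }
    },
    ...
}
```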

