
# SAC

## **Soft Actor-Critic**

SAC ([Haarnoja et al., 2018](https://arxiv.org/abs/1801.01290)) is an off-policy Actor-Critic algorithm that maximizes a trade-off between expected return and policy entropy: the policy is encouraged to act as randomly as possible while still performing well. The entropy term improves exploration and tends to produce more robust policies.

SAC supports both continuous actions (reparameterization trick) and discrete actions (exact expectation, [Christodoulou, 2019](https://arxiv.org/abs/1910.07207)).
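For discrete actions, the expectation over actions can be computed exactly by summing over the action set instead of sampling. A minimal sketch in plain Python, assuming the per-action probabilities and twin Q-values for one state are already available as lists. The function name `soft_state_value` and the sample numbers are illustrative and not part of SLM-Lab:

```python
import math


def soft_state_value(probs, q1, q2, alpha):
    """Exact soft value of one state for a discrete policy.

    V(s) = sum_a pi(a|s) * (min(Q1(s,a), Q2(s,a)) - alpha * log pi(a|s))
    """
    value = 0.0
    for p, qa, qb in zip(probs, q1, q2):
        if p == 0.0:
            continue  # a zero-probability action contributes nothing
        value += p * (min(qa, qb) - alpha * math.log(p))
    return value


# Hypothetical two-action state with a uniform policy
v = soft_state_value(probs=[0.5, 0.5], q1=[1.0, 3.0], q2=[2.0, 2.0], alpha=0.2)
```

Because every action is enumerated, no reparameterized sample is needed in the discrete case.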

See [slm\_lab/spec/benchmark/sac/](https://github.com/kengz/SLM-Lab/tree/master/slm_lab/spec/benchmark/sac) and [slm\_lab/spec/benchmark\_arc/sac/](https://github.com/kengz/SLM-Lab/tree/master/slm_lab/spec/benchmark_arc/sac) for example SAC specs.

### **Algorithm: SAC**

$$
\begin{aligned}
& \text{For } k = 1, \dots, N\text{:} \\
& \quad \text{Sample batch } \{(s_i, a_i, r_i, s'_i)\} \text{ from replay buffer} \\
& \quad \text{Compute soft targets: } y_i = r_i + \gamma\left(\min_{j=1,2} Q_{\bar\theta_j}(s'_i, a'_i) - \alpha \log\pi_\phi(a'_i | s'_i)\right) \\
& \quad \text{Update critics: } J(\theta) = \frac{1}{2}\sum_i \left(Q_\theta(s_i, a_i) - y_i\right)^2 \\
& \quad \text{Update actor: } J(\phi) = \mathbb{E}_{s,a\sim\pi}\left[\alpha\log\pi_\phi(a|s) - \min_j Q_{\theta_j}(s, a)\right] \\
& \quad \text{Update temperature: } J(\alpha) = \mathbb{E}\left[-\alpha\left(\log\pi_\phi(a|s) + \mathcal{H}_{\text{target}}\right)\right] \\
& \quad \text{Soft update target networks: } \bar\theta \leftarrow \tau\theta + (1-\tau)\bar\theta
\end{aligned}
$$
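The soft-target and critic-loss steps above can be checked numerically. The following is an illustrative plain-Python sketch over a batch of scalars, not SLM-Lab's implementation. The function names and sample numbers are assumptions, and like the pseudocode it leaves out any terminal-state masking:

```python
def soft_targets(rewards, next_q1, next_q2, next_log_probs, gamma, alpha):
    """y_i = r_i + gamma * (min(Q1', Q2') - alpha * log pi(a'_i | s'_i))."""
    return [
        r + gamma * (min(qa, qb) - alpha * lp)
        for r, qa, qb, lp in zip(rewards, next_q1, next_q2, next_log_probs)
    ]


def critic_loss(q_values, targets):
    """J(theta) = 1/2 * sum_i (Q_theta(s_i, a_i) - y_i)^2."""
    return 0.5 * sum((q - y) ** 2 for q, y in zip(q_values, targets))


# Hypothetical batch of two transitions
y = soft_targets(
    rewards=[1.0, 0.0],
    next_q1=[2.0, 4.0],
    next_q2=[3.0, 1.0],
    next_log_probs=[-1.0, -2.0],
    gamma=0.5,
    alpha=0.5,
)
loss = critic_loss(q_values=[2.0, 1.0], targets=y)
```

Taking the minimum of the two target critics is what keeps the bootstrap target from following whichever critic is currently overestimating.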

The temperature parameter $$\alpha$$ is automatically tuned to maintain a target entropy $$\mathcal{H}_{\text{target}}$$.
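To see why minimizing the temperature objective steers entropy toward the target, it helps to look at the sign of its gradient. The sketch below takes one plain gradient-descent step directly on the temperature. This is a simplifying assumption for illustration (implementations often optimize the log of the temperature with their own optimizer), and the function names and numbers are hypothetical:

```python
def alpha_loss_grad(log_probs, target_entropy):
    """dJ/d(alpha) for J(alpha) = mean(-alpha * (log_pi + H_target))."""
    mean_log_prob = sum(log_probs) / len(log_probs)
    return -(mean_log_prob + target_entropy)


def update_alpha(alpha, log_probs, target_entropy, lr):
    """One plain gradient-descent step on alpha (illustrative only)."""
    return alpha - lr * alpha_loss_grad(log_probs, target_entropy)


# Entropy estimate (-mean log-prob) is 0.5, below the target of 1.0
alpha_up = update_alpha(
    alpha=0.2, log_probs=[-0.25, -0.75], target_entropy=1.0, lr=0.1
)
# Entropy estimate is 2.0, above the target of 1.0
alpha_down = update_alpha(
    alpha=0.2, log_probs=[-1.5, -2.5], target_entropy=1.0, lr=0.1
)
```

When the policy's entropy falls below the target the temperature rises, which weights the entropy bonus more heavily. When entropy exceeds the target the temperature falls.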

### **Basic Parameters**

```python
"agent": {
  "name": str,
  "algorithm": {
    "name": "SoftActorCritic",
    "action_pdtype": str,
    "action_policy": "default",
    "gamma": float,
    "training_frequency": int,
    "training_iter": int,
    "training_start_step": int,
  },
  "memory": {
    "name": "Replay",
    "batch_size": int,
    "max_size": int
  },
  "net": {
    "type": str,
    "arc": dict,
    "optim_spec": dict,
    "polyak_coef": float,
  }
}
```

* `algorithm`
  * `name`: `"SoftActorCritic"`
  * `action_pdtype`: `"Normal"` for continuous, `"Categorical"` for discrete
  * `action_policy`: `"default"`
  * `gamma` [*general param*](/slm-lab/development/algorithms.md)
  * `training_frequency`: environment steps between updates. `1` updates on every step
  * `training_iter`: gradient steps per update (UTD ratio). Typical: 1–4
  * `training_start_step`: environment steps collected before training begins, so the replay buffer is filled first
* `memory`
  * Compatible types: `"Replay"`, `"PrioritizedReplay"` (see [Memory](/slm-lab/development/memory.md))
  * `batch_size`: examples per training batch. Typical: 256
  * `max_size`: replay buffer capacity. Typical: 1e6
* `net`
  * `polyak_coef`: soft update coefficient for target networks. τ = 1 - polyak\_coef. Typical: 0.995 (τ = 0.005)
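The parameters above can be put together in a filled-in fragment. This is a sketch rather than a tested spec: values marked "assumed" do not come from this page, the remaining `net` keys are omitted, and `polyak_update` is a hypothetical helper. The benchmark specs linked earlier are the authoritative examples.

```python
# Illustrative fragment only: values marked "assumed" are not from this page,
# and the remaining "net" keys (type, arc, optim_spec) are omitted.
agent_spec = {
    "name": "SoftActorCritic",  # assumed agent name
    "algorithm": {
        "name": "SoftActorCritic",
        "action_pdtype": "Normal",  # continuous actions
        "action_policy": "default",
        "gamma": 0.99,  # assumed
        "training_frequency": 1,  # update on every step
        "training_iter": 1,  # within the typical range of 1 to 4
        "training_start_step": 5000,  # assumed
    },
    "memory": {
        "name": "Replay",
        "batch_size": 256,
        "max_size": 1000000,
    },
    "net": {
        "polyak_coef": 0.995,
    },
}

algo = agent_spec["algorithm"]
# Gradient steps per environment step, assuming a single environment
updates_per_step = algo["training_iter"] / algo["training_frequency"]
# Soft-update rate from the relation tau = 1 - polyak_coef
tau = 1 - agent_spec["net"]["polyak_coef"]


def polyak_update(target, online, tau):
    """theta_bar <- tau * theta + (1 - tau) * theta_bar, elementwise."""
    return [tau * w + (1 - tau) * w_bar for w, w_bar in zip(online, target)]


new_target = polyak_update(target=[0.0, 1.0], online=[1.0, 1.0], tau=tau)
```

With `polyak_coef` at 0.995, each soft update moves a target weight only 0.5% of the way toward its online counterpart.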

### **Advanced Parameters**

```python
"algorithm": {
  "policy_delay": int,         # update actor every N critic updates (default 1)
  "fixed_alpha": float,        # disable auto-tuning, use fixed temperature (e.g. 0.02)
  "alpha_lr": float,           # separate learning rate for temperature optimizer
  "spectral_norm": bool,       # spectral norm on penultimate critic layer
  "symlog": bool,              # symlog Q-value compression (DreamerV3)
}
```

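* `policy_delay`: update the actor once every N critic updates (default 1). `2` gives TD3-style delayed actor updates
* `fixed_alpha`: disable automatic entropy tuning and use a fixed temperature instead, e.g. `0.02` for Atari
* `alpha_lr`: separate learning rate for the temperature optimizer
* `spectral_norm`: apply spectral normalization to the penultimate critic layer for stability
* `symlog`: compress Q-values with the symlog transform from DreamerV3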
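Two of these options are easy to illustrate in isolation. The sketch below shows a delayed actor-update schedule and the symlog transform as defined in DreamerV3. The function names are hypothetical, and how SLM-Lab counts updates or applies the transform to Q-values internally is not shown here:

```python
import math


def actor_update_due(critic_update_count, policy_delay=1):
    """True when the actor should update after this many critic updates."""
    return critic_update_count % policy_delay == 0


def symlog(x):
    """symlog(x) = sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


# With policy_delay=2, the actor updates on every second critic update
schedule = [actor_update_due(k, policy_delay=2) for k in range(1, 7)]

# symlog compresses large magnitudes and is undone by symexp
compressed = symlog(1000.0)
restored = symexp(compressed)
```

The transform is symmetric around zero and close to the identity for small inputs, so it mainly affects large-magnitude values.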

### **SAC vs PPO**

|                                | SAC                                   | PPO                              |
| ------------------------------ | ------------------------------------- | -------------------------------- |
| **Type**                       | Off-policy                            | On-policy                        |
| **Data reuse**                 | Replay buffer                         | Discards after update            |
| **Sample efficiency**          | High                                  | Moderate                         |
| **Hyperparameter sensitivity** | Low                                   | Moderate                         |
| **Best for**                   | Continuous control, sample efficiency | Discrete, Atari, stable training |

SAC is the recommended algorithm for continuous control (MuJoCo, LunarLanderContinuous). For discrete environments, PPO is generally more reliable. See [CrossQ](/slm-lab/development/algorithms/a2c/crossq.md) for a faster SAC variant.

