
⚡ Async Training: Hogwild!

This tutorial covers asynchronous training with Hogwild!, a technique for parallelizing network training across multiple processes that share parameters.

Educational Purpose: Hogwild! is included primarily for learning about async RL architectures. For production training, use PPO with vectorized environments (num_envs), which is simpler and more efficient.

How Hogwild! Works

Hogwild! enables lock-free parallel training by having multiple workers update shared network parameters simultaneously. SLM Lab implements this using PyTorch multiprocessing with shared memory.

Worker 1 ─┬─→ Shared Global Network ←─┬─ Worker 3
Worker 2 ─┘        (CPU)              └─ Worker 4

Each worker:

  1. Collects experience from its own environment

  2. Computes gradients on its local network

  3. Pushes gradients to the shared global network

  4. Pulls updated weights from global network
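
The four steps above can be sketched in plain Python. This is a minimal illustration of lock-free updates to shared parameters, not SLM Lab's implementation: it assumes threads and a plain list in place of PyTorch multiprocessing with shared memory, and a toy linear model in place of a network. The names `hogwild_worker` and `train` are hypothetical.

```python
import random
import threading


def hogwild_worker(shared_w, data, lr, steps, seed):
    """Run one worker that updates shared_w in place without taking any lock."""
    rng = random.Random(seed)
    w_local = list(shared_w)  # local copy of the shared parameters
    for _ in range(steps):
        # 1. Collect experience from this worker's own "environment".
        x, y = rng.choice(data)
        # 2. Compute the gradient of 0.5 * (w0 * x + w1 - y) ** 2 on the local copy.
        err = w_local[0] * x + w_local[1] - y
        grad = (err * x, err)
        # 3. Push the gradient step to the shared parameters (no lock held).
        shared_w[0] -= lr * grad[0]
        shared_w[1] -= lr * grad[1]
        # 4. Pull the updated weights back into the local copy.
        w_local = list(shared_w)


def train(num_workers=4, steps=2000, lr=0.05):
    """Fit y = 2x + 1 with several workers sharing one parameter list."""
    data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
    shared_w = [0.0, 0.0]  # stand-in for the shared global network
    workers = [
        threading.Thread(target=hogwild_worker, args=(shared_w, data, lr, steps, seed))
        for seed in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return shared_w
```

Calling `train()` returns weights close to the true slope 2 and intercept 1. Workers may overwrite each other's updates or read slightly stale values, which is the trade-off Hogwild! accepts in exchange for avoiding locks.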

Meta Spec for Hogwild!

Enable distributed training in the meta spec:

{
  "meta": {
    "distributed": "synced",  // or "shared"
    "max_session": 4          // Number of parallel workers
  }
}

Distributed Modes

| Mode | Behavior | Use Case |
| --- | --- | --- |
| "synced" | Sync parameters after each training step | A3C (on-policy) |
| "shared" | Continuous parameter sharing | Async SAC (off-policy) |
| false | Disabled (default) | Standard training |

Key Requirements

  • GlobalAdam or GlobalRMSprop: optimizers that support shared state across processes

  • max_session > 1: the number of parallel workers
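
As a quick sanity check, the requirements above can be expressed as a small validation helper. This is a sketch only: `check_hogwild_setup` is a hypothetical function that is not part of SLM Lab, and passing the optimizer name as a plain string is an assumption made to keep the example self-contained.

```python
def check_hogwild_setup(meta, optimizer_name):
    """Return a list of problems with a Hogwild! configuration (empty if none).

    Hypothetical helper: `meta` is the "meta" dict from a spec, and
    `optimizer_name` is the optimizer's name given as a string.
    """
    problems = []
    mode = meta.get("distributed", False)
    if mode is False:
        return problems  # distributed training is disabled, nothing to check
    if mode not in ("synced", "shared"):
        problems.append(f"unknown distributed mode: {mode!r}")
    if meta.get("max_session", 1) <= 1:
        problems.append("max_session must be greater than 1")
    if optimizer_name not in ("GlobalAdam", "GlobalRMSprop"):
        problems.append("optimizer must be GlobalAdam or GlobalRMSprop")
    return problems
```

For example, `check_hogwild_setup({"distributed": "synced", "max_session": 4}, "GlobalAdam")` returns an empty list, while a spec with `max_session` of 1 and a non-global optimizer reports both problems.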

A3C on Pong

A3C (Mnih et al., 2016) uses "synced" mode for on-policy training.

Spec: slm_lab/spec/benchmark/a3c/a3c_gae_pong.json

Run:

Async SAC on Humanoid

For off-policy algorithms like SAC, use "shared" mode for continuous parameter sharing.

Spec: slm_lab/spec/benchmark/async_sac/async_sac_mujoco.json

Run:

With 16 parallel sessions, a 50M-frame run finishes in far less wall-clock time than sequential training.

Frame counting: The x-axis shows per-session frames. Total frames = per-session frames × max_session.
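
The frame-counting rule is simple arithmetic. The helper below is a hypothetical illustration and the numbers in the usage note are made up for the example, not taken from a benchmark run.

```python
def total_frames(per_session_frames, max_session):
    """Total frames across all workers, given the per-session count on the x-axis."""
    return per_session_frames * max_session
```

For instance, a plot that ends at 1,000,000 per-session frames with `max_session` set to 4 corresponds to `total_frames(1_000_000, 4)`, which is 4,000,000 frames overall.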

Historical Results (v4)

These graphs are from v4 async SAC training:

[Figure: Async SAC Humanoid returns]
[Figure: Async SAC Humanoid moving average]

For validated v5 Humanoid results using synchronous PPO, see Continuous Benchmark: PPO achieves 3774 on Humanoid-v5.

Comparison: Async vs Vectorized

For most use cases, vectorized environments are simpler and faster:

| Aspect | Vectorized (num_envs) | Hogwild! (distributed) |
| --- | --- | --- |
| Parallelism | Environment stepping | Network training |
| Complexity | Simple | Complex (multiprocessing) |
| GPU | Full GPU acceleration | Global nets on CPU |
| Use case | Production | Learning, CPU-bound |

When Hogwild! Helps

Hogwild! can help when:

  • Network training is the bottleneck (not environment stepping)

  • You have many CPU cores available

  • You want to learn about async RL architectures

For most RL workloads, environment stepping is the bottleneck, so vectorized environments (num_envs) are more effective.

Historical Context

A3C was groundbreaking when GPUs were expensive and CPU parallelism was the main scaling strategy. Today, GPU-accelerated vectorized training (PPO, A2C) is more practical for most use cases.

SLM Lab includes async training for:

  • Understanding async RL architectures

  • Reproducing classic papers

  • CPU-only training scenarios
