
☁️ Remote Training

SLM Lab uses dstack for cloud GPU training and HuggingFace for experiment storage.

Setup

1. Install dstack

uv tool install dstack

2. Configure dstack

Create an account at dstack Sky and get your token:

dstack project add --name your-project --url https://sky.dstack.ai --token $DSTACK_TOKEN -y

This saves configuration to ~/.dstack/config.yml.

3. Set up HuggingFace credentials

Create a .env file in your SLM-Lab directory:

HF_TOKEN=hf_xxxxxxxxxxxx
HF_REPO=your-username/slm-lab-results

Source the file in your shell before launching a run:

source .env
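A plain source of KEY=value lines defines shell variables without exporting them, so child processes may not see them. This page does not say whether SLM Lab needs exported variables, so treat that as an assumption. A minimal sketch of one common workaround in a POSIX shell is to wrap the source in set -a / set +a. The scratch directory and sample values below are illustrative only:

```shell
# Illustrative only: use a scratch directory so a real .env is never touched.
demo_dir="$(mktemp -d)"
printf 'HF_TOKEN=hf_xxxxxxxxxxxx\nHF_REPO=your-username/slm-lab-results\n' > "$demo_dir/.env"

# set -a exports every variable assigned until set +a turns it off again.
set -a
. "$demo_dir/.env"
set +a

# A child process can now read the value.
child_repo="$(sh -c 'printf %s "$HF_REPO"')"
echo "child process sees HF_REPO=$child_repo"

rm -rf "$demo_dir"
```

In your SLM-Lab directory the equivalent would be: set -a; source .env; set +a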

Running Remote Experiments

Basic Commands

The -n flag names your run for easy identification.

Monitoring Runs

Checking Results

When a run completes, check the final score in the logs:

The total_reward_ma value is the final moving-average score.

Managing Results

Pull Results

Download completed experiments from HuggingFace:

List Experiments

Push Local Results

Configuration

Hardware

SLM Lab defaults to an L4 GPU ($0.39/hr), which handles all benchmark environments. The configuration is in .dstack/run-gpu-train.yml.

For very large models or faster training, you can switch to V100:

Cost tip: GPU instances are often cheaper than equivalent CPU instances due to fractional GPU sharing. Always use --gpu unless your workload is CPU-bound.

In search mode, multiple trials share one GPU:

With gpu: 0.125, 8 trials run in parallel on a single GPU, which is ideal for ASHA search.
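The trial count follows from dividing one GPU by the per-trial fraction. A quick sanity check of that arithmetic, assuming every trial requests the same fraction (trials_per_gpu is a hypothetical helper, not part of SLM Lab or dstack):

```shell
# Hypothetical helper: how many trials fit on one GPU for a given per-trial fraction.
trials_per_gpu() {
  # The small epsilon guards against floating-point results just below a whole number.
  awk -v frac="$1" 'BEGIN { printf "%d\n", int(1 / frac + 1e-9) }'
}

trials_per_gpu 0.125   # prints 8
trials_per_gpu 0.25    # prints 4
```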

Max Duration

Runs have safeguards to prevent runaway costs: CPU runs are limited to 4 hours, GPU runs to 6 hours. Edit .dstack/*.yml to adjust.
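Combined with the hourly rate, the duration cap bounds the cost of a single run. A rough estimate, assuming the $0.39/hr L4 rate quoted under Hardware applies for the full 6-hour GPU limit (actual billing may differ, and max_run_cost is a hypothetical helper):

```shell
# Hypothetical helper: upper bound on cost = hourly rate times maximum hours.
max_run_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

max_run_cost 0.39 6   # prints 2.34
```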

Fleet Setup (dstack 0.20+)

For dstack 0.20+, create a fleet before running:

Workflow Example

Batch Running

Launch multiple experiments to maximize GPU utilization:

Troubleshooting

Run fails to start

Check fleet status and GPU availability.

Results not uploading

Ensure HF_TOKEN and HF_REPO are set in .env and sourced.
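A small check like the following can confirm that both variables are visible in the current shell before launching a run. It is a sketch for a POSIX shell. check_hf_env is a hypothetical helper, not an SLM Lab command, and it only reports variable names, never the token value:

```shell
# Hypothetical helper: report which of the required variables are unset or empty.
check_hf_env() {
  status=0
  for name in HF_TOKEN HF_REPO; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      status=1
    fi
  done
  return "$status"
}

if check_hf_env; then
  echo "HuggingFace credentials are set"
fi
```

If a name is reported as missing, source .env again in the same shell session that launches the run.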

Low GPU utilization

Low GPU utilization usually has one of these causes:

  • Environment stepping is slow: increase num_envs

  • Batch size is too small: increase minibatch_size

  • Config issue: verify the spec settings

