โ˜๏ธRemote Training

SLM Lab uses dstack for cloud GPU training and HuggingFace for experiment storage.

Setup

1. Install dstack

uv tool install dstack

2. Configure dstack

Create an account at dstack Sky and get your token:

dstack project add --name your-project --url https://sky.dstack.ai --token $DSTACK_TOKEN -y

This saves configuration to ~/.dstack/config.yml.

3. Set up HuggingFace credentials

Create a .env file in your SLM-Lab directory:

HF_TOKEN=hf_xxxxxxxxxxxx
HF_REPO=your-username/slm-lab-results

Source before running:

source .env

Running Remote Experiments

Basic Commands

The -n flag names your run for easy identification.
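A launch sketch using that flag, assuming the run config shipped in this repo's `.dstack/run-gpu-train.yml` and an illustrative run name (the exact launch command and its support for `-n` are assumptions based on the text above):

```shell
# Launch a remote GPU training run; -n names the run, -y skips confirmation
dstack apply -f .dstack/run-gpu-train.yml -n ppo-cartpole -y
```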

Monitoring Runs
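dstack's standard CLI covers monitoring; a sketch with an illustrative run name:

```shell
# List runs and their status
dstack ps

# Stream logs from a specific run
dstack logs ppo-cartpole

# Stop a run early
dstack stop ppo-cartpole -y
```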

Checking Results

When a run completes, check the final score in logs:

total_reward_ma is the moving average of total episode reward; its final value is the run's score.
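For example, the final score can be pulled out of the run's logs (run name illustrative, log format an assumption):

```shell
# Print the last reported moving-average score from the run's logs
dstack logs ppo-cartpole | grep total_reward_ma | tail -n 1
```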

Managing Results

Pull Results

Download completed experiments from HuggingFace:
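One way to do this with the standard `huggingface-cli`, assuming results live in the repo named by `HF_REPO` and a local `data/` target directory (both assumptions):

```shell
# Download all experiment artifacts from the results repo into ./data
huggingface-cli download "$HF_REPO" --local-dir data
```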

List Experiments

Push Local Results
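The reverse direction can be sketched with `huggingface-cli upload`; the experiment directory name here is illustrative:

```shell
# Upload a local experiment directory to the results repo,
# keeping the same path inside the repo
huggingface-cli upload "$HF_REPO" data/ppo_cartpole data/ppo_cartpole
```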

Configuration

Hardware

SLM Lab defaults to an L4 GPU ($0.39/hr), which handles all benchmark environments. The configuration is in .dstack/run-gpu-train.yml.

For very large models or faster training, you can switch to V100:
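A sketch of the relevant fragment in `.dstack/run-gpu-train.yml` (surrounding keys and exact field shape are assumptions based on dstack run configs):

```yaml
# .dstack/run-gpu-train.yml (fragment)
resources:
  gpu: V100   # default is L4; switch for very large models or faster training
```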


Cost tip: GPU instances are often cheaper than equivalent CPU instances due to fractional GPU sharing. Always use --gpu unless your workload is CPU-bound.

In search mode, multiple trials share one GPU:

With gpu: 0.125, 8 trials run in parallel on a single GPU, ideal for ASHA search.
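As a fragment (field placement is an assumption based on the text above):

```yaml
resources:
  gpu: 0.125   # 8 trials x 0.125 = 1.0 GPU, shared across an ASHA search
```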

Max Duration

Runs have safeguards to prevent runaway costs: CPU runs are limited to 4 hours, GPU runs to 6 hours. Edit .dstack/*.yml to adjust.
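dstack run configs support a duration cap; a sketch of the fragment to adjust (placement assumed):

```yaml
# .dstack/run-gpu-train.yml (fragment)
max_duration: 6h   # GPU runs; the CPU config uses 4h
```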

Fleet Setup (dstack 0.20+)

For dstack 0.20+, create a fleet before running:
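A minimal fleet sketch, with illustrative name and resources:

```yaml
# .dstack/fleet.yml (illustrative)
type: fleet
name: slm-lab-fleet
nodes: 1
resources:
  gpu: L4
```

Applying it with `dstack apply -f .dstack/fleet.yml -y` then creates the fleet.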

Workflow Example

Batch Running

Launch multiple experiments to maximize GPU utilization:
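A batch-launch sketch, assuming the launch command above; environment names are illustrative:

```shell
# Launch several named runs back to back against the same fleet
for env in cartpole lunarlander; do
  dstack apply -f .dstack/run-gpu-train.yml -n "ppo-$env" -y
done
```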


Troubleshooting

Run fails to start

Check fleet status and GPU availability.
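Both can be inspected from the dstack CLI:

```shell
# Show provisioned fleets and their instances
dstack fleet

# Show recent runs and any provisioning errors
dstack ps
```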

Results not uploading

Ensure HF_TOKEN and HF_REPO are set in .env and sourced.
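A quick local check before launching:

```shell
# Verify the HuggingFace credentials are visible in the current shell
[ -n "$HF_TOKEN" ] && [ -n "$HF_REPO" ] && echo "HF credentials set" \
  || echo "HF_TOKEN/HF_REPO missing: source .env first"
```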

Low GPU utilization

Low GPU util often means:

  • Environment stepping is slow - increase num_envs

  • Batch size too small - increase minibatch_size

  • Config issue - verify spec settings
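If you can shell into the instance, `nvidia-smi` shows live utilization while diagnosing:

```shell
# Report GPU utilization and memory use every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5
```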
