First-Time AI Jobs With Slurm On Cloud GPU Clusters - v1

  • Writer: Harshal
  • 1 day ago
  • 5 min read

Reflections From A Product Manager Testing AI Infra Firsthand; Not A Guide

I wanted to understand how developers use Slurm to train and fine-tune AI models on cloud GPU clusters, so I ran a personal, independent experiment using three platforms: RunPod, Nebius, and CoreWeave.

This post documents my first-time experience trying to run AI fine-tuning jobs with Slurm on those clusters. I'm a Product Manager, not a Slurm expert or systems engineer, and this is not a tutorial. It's a log of my first attempt: what I tried, what worked, what didn't, and what I wish I'd known.

I cover these steps:

  • Signing up and getting a cluster

  • Slurm setup

  • Submitting test jobs

  • Running fine-tuning scripts

  • Monitoring and debugging

  • CLI setup and tooling

After testing the services, I spent 1 hour and 5 minutes writing this post. You need 4 minutes to read it.

Instruction manual to run AI servers with 3 logos.

Insights Gained

  • Researchers in AI and HPC are expected to use Slurm, but most aren't fluent in it.

  • Many universities publish Slurm guides tailored to their own environments; UMich, UChicago, and UIbk are examples.

  • SSH access is fragile and error-prone. It depends on correctly configuring the MLOps platform, your local machine, and more. I forgot to enable a public IP address and had to consider workarounds like WireGuard, Tailscale, and Teleport.

  • Web-based terminals eliminate setup headaches. They remove the need to install CLI tools or perfect your SSH configuration.

  • One-click Jupyter notebooks offer a smooth developer experience.

  • User interfaces make a difference. Nebius and CoreWeave’s clean UIs made some tasks easier compared to using only the CLI or API.

  • Pre-built Docker images save time. On CoreWeave, I avoided installing everything from scratch thanks to their container setup: I chose the image I wanted, and the platform installed it for me.

  • Error handling was tough. When jobs failed, there wasn't always a clear log or message explaining why, for example when I faced spot instance interruptions on RunPod.

  • Setting up Slurm on a new GPU cluster is challenging. But once you have set it up, there is enough know-how on the Internet to vibe code your way into running AI workloads with Slurm.

  • It is hard to get GPUs from cloud providers under these conditions: a) the latest GPUs, b) immediately, and c) only for a few hours. This use case likely doesn't align with CSP (compute service provider) revenue goals.

  • It is standard practice to launch your training jobs through some abstraction where possible; for example, some MLOps platforms offer hyperparameter sweeps (a plain-Slurm sketch of a sweep follows this list). But it's also standard practice to SSH into your GPUs to monitor or debug.

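I did not run a sweep in this experiment, but for context, a plain-Slurm way to express a small hyperparameter sweep is a job array. Everything below (script name, learning rates, paths) is a hypothetical sketch, not something from my runs.

#!/bin/bash
# sweep.slurm -- hypothetical job-array sweep over learning rates
#SBATCH --job-name=lr-sweep
#SBATCH --array=0-3              # four sweep points, one array task each
#SBATCH --gpus-per-task=1
#SBATCH --time=04:00:00
#SBATCH --output=%x-%A_%a.out    # one log file per array task

LRS=(1e-5 3e-5 1e-4 3e-4)
srun python your_finetune_script.py --lr "${LRS[$SLURM_ARRAY_TASK_ID]}"
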
Step 1: Sign Up

RunPod

  • Sign up

  • Add billing information

  • Add payment method

  • Add an SSH public key if you want to SSH into machines

  • Installed the CLI, which was helpful for transferring files from my local machine or cloud storage to the GPUs (a rough sketch of the transfer flow is below).

brew install runpod/runpodctl/runpodctl
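
A rough sketch of the transfer flow using runpodctl's send/receive pair (the file name is a placeholder; the one-time code is printed by the send command):

# on my local machine: prints a one-time code
runpodctl send train_data.tar.gz

# on the pod: paste the code from the step above
runpodctl receive <one-time-code>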

Nebius

  • Did not have an email sign-up option.

Login screen with multiple login provider options, but no email signup.
  • Installed their CLI and authenticated via the browser; the login-and-callback flow worked smoothly.

curl -sSL https://storage.eu-north1.nebius.cloud/cli/install.sh | bash
nebius version
nebius completion zsh > ~/.nebius/completion.zsh.inc
export NB_PROFILE_NAME=hptest_profile_name
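
The completion command above only writes the file; to have it loaded in new shells, it still needs to be sourced, for example:

# load Nebius CLI completions in every new zsh session
echo 'source ~/.nebius/completion.zsh.inc' >> ~/.zshrc
source ~/.zshrc
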
Slurm requires cluster access, which becomes a sales conversation.

Step 2: Getting A Cluster Up And Running

RunPod

  • Tried to launch an "Instant Cluster" via their UI.

  • Friction: No GPUs were available. Got a sales email instead.

Thanks for reaching out about our new Instant Clusters. To help us prep on our end, could you share your use case? Would be great to line things up — let's set up a quick call for intros.

So I rented on-demand pods instead.

On-demand pod rental is quick and instant.

Nebius

  • Had to request a cluster manually via ticket.

  • Friction: Created a cluster but forgot to enable public IP. Couldn't connect. Had to start over.

  • Auto-generated names were helpful. UI for adding GPUs/CPUs was clean.

Cluster management dashboard from Nebius.

CoreWeave

  • Didn’t go through the whole provisioning process but followed their docs for Slurm-on-Kubernetes (SUNK).

  • Pre-built Docker images available for HuggingFace fine-tuning.

Step 3: Installing And Configuring Slurm

RunPod

  • Followed community repo and docs to set up Slurm.

  • Installed Slurm on each pod manually.

  • Had to add the slurm user to the munge group for authentication.

  • Created necessary directories.

  • Generated slurm.conf and gres.conf.

# slurm.conf
ClusterName=localcluster
SlurmctldHost=node-0
...
NodeName=node-0 CPUs=... RealMemory=... Gres=gpu:8
NodeName=node-1 CPUs=... RealMemory=... Gres=gpu:8
PartitionName=gpupart Nodes=node-0,node-1 Default=YES

# gres.conf
NodeName=node-0 Name=gpu File=/dev/nvidia[0-7]
NodeName=node-1 Name=gpu File=/dev/nvidia[0-7]
  • This tells Slurm each node has 8 GPUs at /dev/nvidia0 through /dev/nvidia7.

  • Started the daemons:

slurmctld -D
slurmd -D
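
slurmctld runs on the controller node and slurmd on every compute node. A quick way to confirm the nodes registered (standard Slurm commands; node names follow the slurm.conf above):

sinfo                                  # partition and node states (idle/alloc/down)
scontrol show node node-0              # per-node detail, including Gres=gpu:8
srun -N2 --gres=gpu:1 nvidia-smi -L    # quick end-to-end check that jobs can see GPUs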

CoreWeave

  • Used their Slurm-compatible Docker image with HuggingFace libraries.

  • No manual installation of Slurm required if using SUNK.

  • I realized that setting up Slurm for the first time as a new AI developer is painful. It is also painful to set up Slurm or Kubernetes on a newly provisioned GPU cluster; the container route (sketched below) avoids some of that.
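
I did not run this end to end, but on a Slurm cluster with a container plugin (SUNK-style setups typically ship pyxis/enroot), a containerized submission can look roughly like this; the image reference and script name are placeholders:

#!/bin/bash
#SBATCH --job-name=hf-finetune
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1

# --container-image comes from the pyxis plugin; the image below is a placeholder
srun --container-image=<registry>/<hf-finetune-image>:latest \
     python your_finetune_script.py --your-args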

Step 4: Submitting A Job

RunPod

On a standalone RunPod instance, I tried running fine-tuning without Slurm, using an AI example I knew well.

nvidia-smi
curl --remote-name-all <URL>/example_data/{train.bin,val.bin}
git clone --branch <personal> https://github.com/<hidden>/nanoGPT.git
cd nanoGPT
python train.py config/train_shakespeare_char.py
  • Friction: had to guess multiple variations of the entrypoint command to figure out the right dataset paths, despite this being a well-known example with sufficient documentation.
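
For context, in the upstream nanoGPT repo the character-level Shakespeare example expects its .bin files under data/shakespeare_char/, generated by the repo's prepare script, so the stock flow looks roughly like this (my fork and downloaded data may differ):

cd nanoGPT
python data/shakespeare_char/prepare.py          # writes data/shakespeare_char/train.bin and val.bin
python train.py config/train_shakespeare_char.py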

Nebius

  • The set of directives below defines how many nodes we are using, how many tasks are launched (one per node, with one GPU per task), how many CPUs each task gets to speed up data loading, a time limit, and the output file name format.

#!/bin/bash
#SBATCH --job-name=llama3b-finetune
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=4
#SBATCH --time=24:00:00
#SBATCH --output=%x-%j.out

srun hostname

  • Ran this script to confirm Slurm was running correctly across nodes.

  • Got expected output: hostnames from two nodes.
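
Submitting the script and checking its output works like this (the script file name is a placeholder; the log name follows the --output=%x-%j.out pattern above):

sbatch hostname_test.slurm
squeue -u $USER                    # wait for the job to run
cat llama3b-finetune-<jobid>.out   # expect one hostname per node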

Used a job script adapted from AMD and HuggingFace reference examples:

export MASTER_ADDR=$(srun --nodes=1 --ntasks=1 hostname)
export MASTER_PORT=29500
...
srun accelerate launch \
  --multi_gpu \
  --num_machines=$SLURM_JOB_NUM_NODES \
  --num_processes=$(($SLURM_GPUS_ON_NODE * $SLURM_JOB_NUM_NODES)) \
  --machine_rank=$SLURM_NODEID \
  --main_process_ip=$MASTER_ADDR \
  --main_process_port=$MASTER_PORT \
  your_finetune_script.py --your-args

Step 5: Monitoring And Debugging

squeue -u $USER         # Check job status
sacct -j <JOBID>        # See exit codes, state
scontrol show hostnames $SLURM_JOB_NODELIST
ssh <node-name>         # SSH into node for debugging
  • Used squeue, sacct, and ssh to monitor jobs.

  • code tunnel let me open VS Code in the browser for live debugging, printing a message similar to what I’ve seen with another product:

"Open this link in your browser https://vscode.dev/tunnel/slurm-h100-209-189"

  • You can SSH into a running training workload. This is standard practice for debugging or monitoring distributed jobs.
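
Besides plain SSH, Slurm can also attach a shell inside an existing allocation, which is handy when direct SSH to compute nodes is restricted. A sketch (the job id is a placeholder; --overlap is needed on recent Slurm versions so the shell can share the job's resources):

# open an interactive shell inside the running job's allocation
srun --jobid=<JOBID> --overlap --pty bash

# then watch GPU utilization on that node
watch -n 1 nvidia-smi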

Next Steps

This was my first experience trying to run fine-tuning jobs with Slurm. Through this bumpy ride, I learned the difference between real-world infra and MLOps abstractions. I also got more exposure to the “GPU scarcity” problem statement.

As next steps, I can:

  • Use Slurm on other MLOps platforms

  • Use Slurm for inference / deploying a model

  • Use abstracted CLI or GUI flows, like Axolotl or Vertex AI

