ML Pipeline Guide for Slurm
This is a guide for building and running ML pipelines with Slurm + Grafana + Prometheus.
Table of Contents
- Overview
- Assumptions
- Pipeline Architecture
- Core Concepts
- Pipeline Stages
- Running the Pipeline
- Monitoring
Overview
This guide explains how to build and run ML training pipelines on Slurm-managed GPU clusters. The pipeline fine-tunes a large language model (20B parameters) using QLoRA and is structured as a chain of dependent Slurm jobs.
Code Repository: https://github.com/ori-edge/slurm-ml-pipelines
Key Features:
- Modular 4-stage pipeline (Setup → Preprocessing → Training → Evaluation)
- Automatic job sequencing via Slurm dependencies
- GPU resource allocation per stage
- Real-time monitoring with Prometheus + Grafana
Assumptions
This guide assumes the following are already in place:
Infrastructure
- VM provisioned with SSH access (IP address known)
- Slurm installed and configured (controller + compute node running)
- GPUs configured in Slurm (2x H100 80GB with GRES setup)
- CUDA drivers installed (nvidia-smi works)
Software
- Python 3 installed with the python3-venv package
- Docker and Docker Compose installed
- Monitoring stack deployed (Prometheus, Grafana, DCGM exporter containers)
- Slurm exporter configured as a systemd service
- Pipeline code already deployed at ~/ml-pipeline/ on the VM
Network & Access
- SSH key configured at ~/.ssh/id_rsa for VM access
- HuggingFace account with API token for model access
For initial Slurm setup, refer to the Slurm Setup Guide.
Pipeline Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ ML PIPELINE WORKFLOW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ │ Job 1 │ │ Job 2 │ │ Job 3 │ │ Job 4 │
│ │ SETUP │────▶│ PREPROCESS │────▶│ TRAINING │────▶│ EVALUATION │
│ │ │ │ │ │ │ │ │
│ │ 01_setup.py │ │02_preproc.py │ │03_training.py│ │04_eval.py │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│ │ │ │ │
│ ▼ ▼ ▼ ▼
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ │ • Validate │ │ • Load data │ │ • Load model │ │ • Load model │
│ │ environment│ │ • Tokenize │ │ • Apply LoRA │ │ • Generate │
│ │ • Cache model│ │ • Pack seqs │ │ • Train DDP │ │ • Calc metrics│
│ │ • Cache data │ │ • Mask prompts│ │ • Save ckpts │ │ • Save results│
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│ │
│ Resources: Resources: Resources: Resources:
│ CPU: 8 cores CPU: 16 cores CPU: 32 cores CPU: 8 cores
│ GPU: 1 GPU: 0 (CPU-only) GPU: 2 GPU: 1
│ Time: 30 min Time: 30 min Time: 2 hours Time: 30 min
│ │
└─────────────────────────────────────────────────────────────────────────┘
Core Concepts
Job Dependencies
Slurm job dependencies ensure jobs execute in the correct order. A dependent job remains in PENDING state until its prerequisite completes successfully.
For ML pipelines, use afterok - this ensures each stage only runs if the previous stage succeeded.
Creating a Job Chain:
# Submit Job 1 (no dependencies)
JOB_ID_1=$(sbatch --parsable jobs/01_setup.sbatch)
# Submit Job 2 (depends on Job 1)
JOB_ID_2=$(sbatch --parsable --dependency=afterok:$JOB_ID_1 jobs/02_preprocessing.sbatch)
# Submit Job 3 (depends on Job 2)
JOB_ID_3=$(sbatch --parsable --dependency=afterok:$JOB_ID_2 jobs/03_training.sbatch)
# Submit Job 4 (depends on Job 3)
JOB_ID_4=$(sbatch --parsable --dependency=afterok:$JOB_ID_3 jobs/04_evaluation.sbatch)
What happens:
- All 4 jobs are submitted immediately to Slurm
- Job 1 starts running (state: RUNNING)
- Jobs 2, 3, 4 wait in queue (state: PENDING with reason: Dependency)
- When Job 1 completes successfully → Job 2 starts
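- If any job fails → the jobs after it never start: their afterok dependency can no longer be satisfied, so Slurm leaves them PENDING with reason DependencyNeverSatisfied (or cancels them if kill_invalid_depend is set in SchedulerParameters)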
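The same chain can be sketched in Python, which may be convenient if the submission logic needs to grow beyond a shell script. This is a minimal sketch, not the repository's submission script: the function names `sbatch_submit` and `submit_chain` are hypothetical, and the only Slurm features used are the `--parsable` and `--dependency=afterok:<id>` options shown above.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def sbatch_submit(script: str, depends_on: Optional[str] = None) -> str:
    """Submit one batch script with sbatch and return its job ID."""
    cmd = ["sbatch", "--parsable"]
    if depends_on is not None:
        cmd.append(f"--dependency=afterok:{depends_on}")
    cmd.append(script)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    return result.stdout.strip().split(";")[0]


def submit_chain(
    scripts: Sequence[str],
    submit: Callable[[str, Optional[str]], str] = sbatch_submit,
) -> List[str]:
    """Submit scripts in order, each one depending on the previous job."""
    job_ids: List[str] = []
    previous: Optional[str] = None
    for script in scripts:
        previous = submit(script, previous)
        job_ids.append(previous)
    return job_ids
```

Calling `submit_chain(["jobs/01_setup.sbatch", "jobs/02_preprocessing.sbatch", "jobs/03_training.sbatch", "jobs/04_evaluation.sbatch"])` on the VM would submit all four jobs at once. The `submit` parameter exists only so the chaining logic can be exercised without a Slurm installation.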
Resource Allocation
Each pipeline stage has different resource requirements. Each job requests only what it needs through #SBATCH directives, and Slurm allocates exactly what was requested.
Resource Directives in sbatch:
#SBATCH --cpus-per-task=32 # Number of CPU cores
#SBATCH --mem=128G # Memory limit
#SBATCH --gres=gpu:h100:2 # GPU type and count
#SBATCH --time=02:00:00 # Maximum runtime (HH:MM:SS)
Resource Planning by Stage:
| Stage | CPUs | Memory | GPUs | Time | Rationale |
|---|---|---|---|---|---|
| Setup | 8 | 32GB | 1 | 30min | Model download, validation |
| Preprocessing | 16 | 64GB | 0 | 30min | CPU-bound tokenization |
| Training | 32 | 128GB | 2 | 2h | Multi-GPU training |
| Evaluation | 8 | 32GB | 1 | 30min | Single-GPU inference |
Why allocate different resources?
- Efficiency: Don't waste GPUs on CPU-bound tasks (preprocessing)
- Throughput: Other jobs can use freed resources
- Cost: GPUs are held only during the stages that actually use them
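To make the saving concrete, the sketch below compares the GPU-hours reserved by the four separate jobs with a hypothetical single job that holds both GPUs for the whole pipeline. The numbers are the time limits from the table above, so they are upper bounds rather than measured runtimes, and the single-job alternative is an assumption introduced only for comparison.

```python
# (GPUs, time limit in hours) per stage, taken from the resource table.
STAGES = {
    "setup": (1, 0.5),
    "preprocessing": (0, 0.5),
    "training": (2, 2.0),
    "evaluation": (1, 0.5),
}


def gpu_hours_per_stage(stages):
    """GPU-hours reserved when every stage requests only its own GPUs."""
    return sum(gpus * hours for gpus, hours in stages.values())


def gpu_hours_single_job(stages):
    """GPU-hours reserved if one job held the peak GPU count throughout."""
    peak_gpus = max(gpus for gpus, _ in stages.values())
    total_hours = sum(hours for _, hours in stages.values())
    return peak_gpus * total_hours


staged = gpu_hours_per_stage(STAGES)      # 0.5 + 0 + 4.0 + 0.5
monolithic = gpu_hours_single_job(STAGES)  # 2 GPUs x 3.5 hours
print(f"staged: {staged} GPU-h, single job: {monolithic} GPU-h")
```

Under these limits the staged pipeline reserves at most 5.0 GPU-hours against 7.0 for the single job.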
Slurm Batch Scripts
Each pipeline stage has a corresponding .sbatch file that defines:
- Resource requirements (SBATCH directives)
- Environment setup (activate venv, set paths)
- Script execution
Anatomy of a Batch Script:
#!/bin/bash
#SBATCH --job-name=03_training # Job identifier
#SBATCH --output=logs/03_training_%j.out # Stdout (%j = job ID)
#SBATCH --error=logs/03_training_%j.err # Stderr
#SBATCH --partition=gpu # Which partition to use
#SBATCH --account=default # Accounting/billing account
# Resource allocation
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --gres=gpu:h100:2
#SBATCH --time=02:00:00
# Environment setup
cd ~/ml-pipeline
source venv/bin/activate
# Execute the Python script
python scripts/03_training.py
Key Points:
- %j in output path → replaced with job ID (e.g., 03_training_12345.out)
- Activate the environment inside the script so the job does not depend on the state of the shell you submitted from
- Exit code of the Python script becomes the job's exit code
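Because afterok only looks at the exit code, each stage script should exit non-zero whenever it fails. One way to make that explicit is a small wrapper like the one below. This is a sketch under assumptions: `run_stage` is a hypothetical helper, not something the repository is known to contain.

```python
import traceback
from typing import Callable


def run_stage(stage: Callable[[], None]) -> int:
    """Run a stage function and translate its outcome into an exit code."""
    try:
        stage()
    except Exception:
        # Print the traceback so it lands in the job's .err log file.
        traceback.print_exc()
        return 1  # non-zero: Slurm marks the job FAILED, afterok blocks the next stage
    return 0      # zero: Slurm marks the job COMPLETED, the next stage may start


# Typical use at the bottom of a stage script (train is a placeholder name):
#     import sys
#     sys.exit(run_stage(train))
```

An uncaught exception already makes Python exit with a non-zero code, so the wrapper mainly helps when a stage would otherwise catch and log errors without re-raising them.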
Pipeline Stages
Stage 1: Environment Setup
Purpose: Validate environment, download and cache model/dataset.
Script: scripts/01_setup.py
Batch Job: jobs/01_setup.sbatch
What it does:
- Validates CUDA/GPU availability
- Installs Python dependencies (if needed)
- Downloads model from HuggingFace Hub → caches locally
- Downloads dataset → caches locally
- Saves configuration for subsequent stages
Resource Requirements:
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:h100:1 # Need 1 GPU to validate CUDA
#SBATCH --time=00:30:00
Why run this as a separate job?
- Downloads can fail (network issues) → retry without re-running training
- Model caching saves time on subsequent pipeline runs
- Validates environment before committing to long training
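The validation step could look roughly like the sketch below. It only checks two things this guide mentions, the nvidia-smi binary and the HuggingFace token. It assumes the token from .env has been exported as the HF_TOKEN environment variable before the script runs. `check_environment` is a hypothetical name, and the real 01_setup.py may check more or do it differently.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional


def check_environment(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    if which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH (CUDA driver missing?)")
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (needed to download the model)")
    return problems


# Typical use in a setup script:
#     problems = check_environment()
#     if problems:
#         raise SystemExit("\n".join(problems))
```

Failing here costs seconds on one GPU instead of surfacing the same problem minutes into the two-GPU training job.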
Stage 2: Data Preprocessing
Purpose: Transform raw data into training-ready format.
Script: scripts/02_preprocessing.py
Batch Job: jobs/02_preprocessing.sbatch
What it does:
- Loads raw dataset (prompt-completion pairs)
- Applies tokenizer chat template
- Prompt masking: Sets labels to -100 for prompt tokens (model only learns to predict responses)
- Packing: Concatenates samples into fixed 1024-token blocks (maximizes GPU utilization)
- Saves processed data to disk
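Prompt masking and packing are easier to follow on plain lists of token IDs. The sketch below is illustrative only: `mask_prompt` and `pack` are hypothetical helpers, and dropping the incomplete final block is an assumption about how the remainder is handled. The value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value skipped by the loss


def mask_prompt(input_ids: List[int], prompt_len: int) -> List[int]:
    """Build labels that hide the prompt and keep the response tokens."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]


def pack(
    samples: List[Tuple[List[int], List[int]]], block_size: int = 1024
) -> List[Tuple[List[int], List[int]]]:
    """Concatenate (input_ids, labels) pairs and cut fixed-size blocks."""
    flat_ids: List[int] = []
    flat_labels: List[int] = []
    for ids, labels in samples:
        flat_ids.extend(ids)
        flat_labels.extend(labels)
    # Assumption: the trailing partial block is dropped, not padded.
    usable = len(flat_ids) // block_size * block_size
    return [
        (flat_ids[i:i + block_size], flat_labels[i:i + block_size])
        for i in range(0, usable, block_size)
    ]


# Tiny example with a block size of 4 instead of 1024.
sample_a = ([1, 2, 3], mask_prompt([1, 2, 3], prompt_len=1))
sample_b = ([4, 5, 6], mask_prompt([4, 5, 6], prompt_len=2))
blocks = pack([sample_a, sample_b], block_size=4)
```

With these inputs, `blocks` holds a single block: input IDs `[1, 2, 3, 4]` with labels `[-100, 2, 3, -100]`. Every position in a packed block carries a real token, so no compute is spent on padding.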
Resource Requirements:
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:0 # No GPU needed - CPU-bound
#SBATCH --time=00:30:00
Why no GPU?
- Tokenization is CPU-bound
- Frees GPUs for other users while preprocessing runs
Stage 3: Model Training
Purpose: Fine-tune the model using QLoRA.
Script: scripts/03_training.py
Batch Job: jobs/03_training.sbatch
What it does:
- Loads base model with 4-bit quantization (QLoRA)
- Applies LoRA adapters (trainable parameters)
- Loads preprocessed data
- Runs distributed training across GPUs (DDP)
- Saves checkpoints and final model
Resource Requirements:
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:h100:2 # Both GPUs for DDP
#SBATCH --time=02:00:00
Training Configuration:
- Batch size: 1 per GPU
- Gradient accumulation: 8 steps
- Effective batch size: 1 × 2 GPUs × 8 = 16
- Learning rate: 2e-4
- Max steps: 200
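The configuration above also fixes how much data the run sees. The arithmetic below assumes that "max steps" counts optimizer steps and that every sequence is a full 1024-token packed block from Stage 2. Both are assumptions about the training script, not something this guide states.

```python
per_gpu_batch = 1
num_gpus = 2
grad_accum_steps = 8
max_steps = 200
block_size = 1024  # packed sequence length from Stage 2

# Sequences contributing to one optimizer step.
effective_batch = per_gpu_batch * num_gpus * grad_accum_steps

# Total sequences and tokens processed over the whole run.
sequences_seen = effective_batch * max_steps
tokens_seen = sequences_seen * block_size

print(effective_batch, sequences_seen, tokens_seen)
```

This gives an effective batch of 16, 3,200 packed sequences, and about 3.3 million tokens for the run.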
Stage 4: Model Evaluation
Purpose: Evaluate fine-tuned model quality.
Script: scripts/04_evaluation.py
Batch Job: jobs/04_evaluation.sbatch
What it does:
- Loads base model + LoRA adapters
- Generates responses for test samples
- Calculates perplexity (model confidence)
- Calculates BERTScore (semantic similarity)
- Saves metrics and sample outputs
Resource Requirements:
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:h100:1 # Single GPU for inference
#SBATCH --time=00:30:00
Metrics Explained:
| Metric | What it Measures | Good Value |
|---|---|---|
| Perplexity | Model confidence (lower = better) | < 10 |
| BERTScore F1 | Semantic similarity to expected | > 0.7 |
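Perplexity is the exponential of the average per-token negative log-likelihood, which is why lower is better. The sketch below shows only that formula. It assumes the per-token losses have already been collected for the response tokens, and it is not the evaluation script's actual code.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every correct token has
# NLL = ln(4) per token, so its perplexity is 4: it is as uncertain
# as a uniform choice among four options.
uniform_four = perplexity([math.log(4)] * 3)
```

Read this way, the "< 10" target in the table means the model is, on average, less uncertain than a uniform choice among ten tokens.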
Running the Pipeline
All pipeline commands must be executed from the VM where Slurm is installed. The pipeline code is already deployed at ~/ml-pipeline/.
Full Pipeline Execution
The recommended way to run the pipeline is using the submission script that chains all jobs.
Step 1: SSH into the VM (from your local machine)
ssh ubuntu@<your-vm-ip> -i ~/.ssh/id_rsa
Step 2: Navigate to pipeline directory (on the VM)
cd ~/ml-pipeline
Step 3: Ensure environment is set up (on the VM)
# Create .env with your HuggingFace token (if not already done)
echo "HF_TOKEN=your_token_here" > .env
chmod 600 .env
# Create logs directory (if not exists)
mkdir -p logs
# Create Python virtual environment (first time only)
# python3 -m venv venv
# source venv/bin/activate
# pip install transformers datasets accelerate peft bitsandbytes bert-score torch
# Activate Python virtual environment
source venv/bin/activate
Step 4: Run the full pipeline (on the VM)
# Make script executable (only needed once)
chmod +x full_ml_pipeline.sh
# Run the pipeline
./full_ml_pipeline.sh
Expected Output:
==============================================
SUBMITTING ML PIPELINE (DEPENDENT CHAIN)
==============================================
Submitted Job 1 (Setup): 100
Submitted Job 2 (Preprocessing): 101 (depends on 100)
Submitted Job 3 (Training): 102 (depends on 101)
Submitted Job 4 (Evaluation): 103 (depends on 102)
==============================================
Pipeline submitted successfully!
Use 'squeue' to monitor the progress.
==============================================
What happens next:
- All 4 jobs are in the queue
- Job 1 starts immediately
- Jobs 2-4 wait with the (Dependency) reason
- As each job completes, the next one starts
Individual Job Execution
For debugging or re-running specific stages (all commands on the VM):
# On the VM
cd ~/ml-pipeline
source venv/bin/activate
# Run only setup
sbatch jobs/01_setup.sbatch
# Run only preprocessing (after setup completed)
sbatch jobs/02_preprocessing.sbatch
# Run only training (after preprocessing completed)
sbatch jobs/03_training.sbatch
# Run only evaluation (after training completed)
sbatch jobs/04_evaluation.sbatch
Re-running a failed stage:
# Check which job failed
sacct --format=JobID,JobName,State,ExitCode -X
# Example: training failed. Re-submit it once and capture the new job ID
JOB_ID=$(sbatch --parsable jobs/03_training.sbatch)
# Then re-submit evaluation with a dependency on the new training job
sbatch --dependency=afterok:$JOB_ID jobs/04_evaluation.sbatch
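Finding the failed stage can also be scripted. The sketch below parses the output of `sacct --format=JobID,JobName,State,ExitCode -X --parsable2 --noheader`, which prints one pipe-delimited line per job. `failed_jobs` is a hypothetical helper, and treating every state other than COMPLETED, RUNNING, and PENDING as a failure is a simplifying assumption.

```python
from typing import List, Tuple

HEALTHY_STATES = ("COMPLETED", "RUNNING", "PENDING")


def failed_jobs(sacct_output: str) -> List[Tuple[str, str, str]]:
    """Return (job_id, job_name, state) for every job that did not succeed."""
    failed: List[Tuple[str, str, str]] = []
    for line in sacct_output.splitlines():
        if not line.strip():
            continue
        job_id, name, state, _exit_code = line.split("|")[:4]
        # States such as "CANCELLED by 1000" carry a suffix, so compare prefixes.
        if not state.startswith(HEALTHY_STATES):
            failed.append((job_id, name, state))
    return failed
```

The first entry returned is the stage to re-submit. Everything after it in the chain needs a fresh dependency on the new job ID.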
Monitoring
CLI Monitoring (on the VM)
These commands are run on the VM via SSH.
Watch job queue in real-time:
# On the VM
watch -n 5 squeue -u $USER
Example output during pipeline run:
JOBID PARTITION NAME STATE TIME NODELIST REASON
100 gpu 01_setup COMPLETED 0:02:15 virtual-m
101 gpu 02_preprocess RUNNING 0:05:32 virtual-m
102 gpu 03_training PENDING 0:00 (Dependency)
103 gpu 04_evaluation PENDING 0:00 (Dependency)
View job logs while running (on the VM):
# Stream current output
tail -f ~/ml-pipeline/logs/03_training_102.out
# View completed job output
cat ~/ml-pipeline/logs/03_training_102.out
Check job history (on the VM):
# Today's jobs
sacct --format=JobID,JobName,State,Elapsed,ExitCode -X
# Specific job details
sacct -j 102 --format=JobID,JobName,State,AllocCPUS,AllocTRES,Elapsed
Grafana Dashboard (via SSH Tunnel)
Grafana runs on the VM but is not directly accessible from the internet due to firewall restrictions. Use SSH tunneling to securely access it from your local machine.
Prerequisites: Ensure monitoring services are running (on the VM)
# Check if services are running
sudo systemctl status slurm-exporter
sudo docker ps # Should show prometheus, grafana, dcgm-exporter
# Start if needed
sudo systemctl start slurm-exporter
cd ~/monitoring && sudo docker compose up -d
Why SSH Tunneling?
- The VM's ports 3000 (Grafana) and 9090 (Prometheus) are blocked by the cloud firewall
- SSH tunneling forwards these ports through your existing SSH connection
- Your browser connects to localhost:3000, which is tunneled to the VM's Grafana
Step 1: Create SSH tunnel (run on your LOCAL machine, not the VM)
Open a new terminal on your local machine and run:
ssh -L 3000:localhost:3000 -L 9090:localhost:9090 ubuntu@<your-vm-ip> -i ~/.ssh/id_rsa -N
Important: Keep this terminal open for the duration of your monitoring session. The -N flag means no remote command is executed - it just forwards ports.
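Once the tunnel is open, a quick check from the local machine can confirm that both forwarded ports respond before opening the browser. The sketch below uses Grafana's `/api/health` and Prometheus's `/-/healthy` endpoints. `is_up` is a hypothetical helper, and the `opener` parameter is there only so the function can be tested without a live tunnel.

```python
import urllib.request
from typing import Callable

ENDPOINTS = {
    "grafana": "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/healthy",
}


def is_up(
    url: str,
    opener: Callable = urllib.request.urlopen,
    timeout: float = 3.0,
) -> bool:
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and urllib's URLError/HTTPError.
        return False


# Typical use on the local machine while the tunnel is running:
#     for name, url in ENDPOINTS.items():
#         print(name, "up" if is_up(url) else "DOWN")
```

If both report DOWN, the tunnel terminal has probably been closed. If only one does, the corresponding container on the VM is the more likely culprit.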
Step 2: Access Grafana (on your LOCAL machine)
Open your browser and go to: http://localhost:3000
- Username: admin
- Password: slurm123
Step 3: View "Slurm + GPU Monitoring" dashboard
Available Panels:
| Panel | Description |
|---|---|
| GPU Utilization Gauges | Real-time % usage per GPU |
| GPU Memory Used | Memory consumption over time |
| Jobs Running/Pending | Current queue status |
| Current Job Runtimes | Bar chart of active job durations |
| Job Wait Times | How long jobs waited in queue |
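The numbers behind the GPU panels can also be read straight from Prometheus through the tunnel, for example from `http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL`. The sketch below parses such a response. It assumes the DCGM exporter exposes its default `DCGM_FI_DEV_GPU_UTIL` metric with a `gpu` label, and `gpu_utilization` is a hypothetical helper, not part of the dashboard.

```python
import json
from typing import Dict


def gpu_utilization(response_body: str) -> Dict[str, float]:
    """Map GPU index to utilization % from a Prometheus instant query."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    utilization: Dict[str, float] = {}
    for series in payload["data"]["result"]:
        gpu = series["metric"].get("gpu", "unknown")
        # An instant-query value is a [timestamp, "string value"] pair.
        utilization[gpu] = float(series["value"][1])
    return utilization
```

Feeding it the body of the query URL above would return one entry per GPU, which is handy for scripted checks such as flagging a training job whose second GPU sits idle.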
What to expect during training:
GPU Utilization Timeline:
─────────────────────────────────────────────────▶ time
GPU 0: ░░░░▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░
Setup │ Training (50-60%) │ Eval
GPU 1: ░░░░▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░
idle │ Training (50-60%) │ idle