HPC Cluster Usage & Slurm

University of Pisa

Secure Shell (SSH) refresher


Generate and manage keys

ssh-keygen -t ed25519 -C "you@uni" -f ~/.ssh/sspa_ed25519
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/sspa_ed25519

Custom SSH config for the mini-cluster

~/.ssh/config

Host sspa-controller
  HostName localhost
  Port 2222
  User sspa
  IdentityFile ~/.ssh/sspa_ed25519
  IdentitiesOnly yes

Chaining hosts with ProxyJump
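
Only the controller's port 2222 is published on the host, so reaching a worker means hopping through the controller. A minimal sketch for ~/.ssh/config, assuming the workers run sshd and accept the same sspa key (the sspa-worker1 alias is illustrative):

Host sspa-worker1
  HostName worker1
  User sspa
  IdentityFile ~/.ssh/sspa_ed25519
  ProxyJump sspa-controller

With this in place, ssh sspa-worker1 first connects to sspa-controller and tunnels through it; the one-off equivalent without a config entry is ssh -J sspa-controller sspa@worker1.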


SSH tunnels (port forwarding)
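
Port forwarding exposes a service running on the cluster to your local machine over the SSH connection. A minimal sketch, assuming something like a notebook server listening on port 8888 on the controller (the port number is illustrative):

# forward local port 8888 to port 8888 on the controller; -N skips the remote shell
ssh -N -p 2222 -L 8888:localhost:8888 sspa@localhost
# then open http://localhost:8888 in a browser on the host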


Push keys for passwordless access

ssh-copy-id -i ~/.ssh/sspa_ed25519.pub -p 2222 sspa@localhost
# manual fallback:
cat ~/.ssh/sspa_ed25519.pub | ssh -p 2222 sspa@localhost \
  "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"

Slurm


Why Slurm?


Learning goals


Mini-cluster layout (Docker)


Start or reset the environment

cd docker_images              # called docker_files in the IDE overview
./build_images.sh             # only first time or after Dockerfile edits
docker compose up -d controller worker1 worker2

SSH into the controller (step-by-step)

  1. Ensure containers are running (docker compose ps).

  2. Password defaults to sspa-password, but reset it anytime from the host:

    cd docker_images
    docker compose exec controller bash -lc 'echo "sspa:sspa-password" | chpasswd'
  3. On the host, trust the controller and log in:

    ssh -p 2222 sspa@localhost
  4. (Optional but safer) copy your SSH key for passwordless logins:

    ssh-copy-id -p 2222 sspa@localhost
  5. Verify you are on the controller: hostname should print controller.

  6. Test Slurm visibility: sinfo must list the debug partition with worker[1-2] nodes.


Troubleshooting the SSH hop
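
A few checks that usually isolate the problem with the compose setup above (container not running, wrong port, stale host key, or authentication trouble):

docker compose ps                              # is the controller up and port 2222 published?
docker compose logs controller | tail -n 20    # did the container's services start cleanly?
ssh -vvv -p 2222 sspa@localhost                # verbose client output: connectivity vs. auth failure
ssh-keygen -R "[localhost]:2222"               # drop a stale host key after rebuilding the container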


Slurm concepts refresher


Step-by-step Slurm workflow

  1. Inspect resources — sinfo -Nl shows nodes, states, core counts.

  2. Author a job script — encode directives (#SBATCH) + workload commands.

  3. Submit — sbatch job.sh returns a job ID.

  4. Monitor — squeue -j <id> during runtime; sacct -j <id> for completed history.

  5. Interact — srun inside scripts launches tasks; salloc grants an interactive shell.

  6. Terminate early — scancel <id> (single job) or scancel -u $USER (bulk) when something goes wrong.

  7. Collect outputs — Slurm writes slurm-<jobid>.out (stdout and stderr combined) unless you override the paths with --output / --error.


Resource requests that matter
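
As a concrete reference, these are the directives that most often shape scheduling. The values are illustrative and should be sized to the actual workload; whether this two-worker mini-cluster enforces memory limits is not guaranteed:

#SBATCH --nodes=1             # machines to allocate
#SBATCH --ntasks=4            # tasks (e.g. MPI ranks) across the allocation
#SBATCH --cpus-per-task=2     # cores per task (e.g. OpenMP threads)
#SBATCH --mem=2G              # memory per node
#SBATCH --time=00:10:00       # wall-clock limit; the job is killed when it expires
#SBATCH --partition=debug     # which partition/queue to target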


Minimal batch script template

#!/bin/bash
#SBATCH --job-name=hello-slurm
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --time=00:02:00
#SBATCH --output=slurm-%j.out

module purge            # no modules on this cluster, but keep the habit
set -euo pipefail

srun -l hostname        # label output with rank IDs
python3 codes/lab02/hello.py

Interactive allocations
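
A minimal sketch using the debug partition from the setup above; salloc holds the resources until you exit, while srun --pty gives a shell directly on a compute node:

# reserve 2 tasks for 10 minutes and get a shell tied to the allocation
salloc --partition=debug --ntasks=2 --time=00:10:00
srun -l hostname          # runs inside the allocation, output labelled by task rank
exit                      # releases the allocation

# alternative: a pseudo-terminal on a compute node in one step
srun --partition=debug --ntasks=1 --time=00:10:00 --pty bash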


Job arrays & dependencies

#!/bin/bash
#SBATCH --job-name=param-scan
#SBATCH --array=0-9
#SBATCH --output=logs/scan_%A_%a.out

PARAMS=(0.1 0.2 0.5 1 2 5 10 20 50 100)
python3 codes/lab02/sweep.py --beta "${PARAMS[$SLURM_ARRAY_TASK_ID]}"
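
The array covers the parameter scan; dependencies chain a follow-up job to it, for example a post-processing step that should start only once every array task has succeeded. A sketch, with scan.sh and postprocess.sh as hypothetical script names:

# --parsable makes sbatch print only the job ID, so it can be captured
SCAN_ID=$(sbatch --parsable scan.sh)
# afterok: start only if the scan (all array tasks) finished with exit code 0
sbatch --dependency=afterok:${SCAN_ID} postprocess.sh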

Monitoring & debugging checklist


Work vs scratch spaces
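
A common pattern is to stage data onto fast scratch storage for the run and copy results back at the end. A sketch only: the /scratch/$USER path, the input file, and the simulate binary are assumptions, and this two-container cluster may not expose a separate scratch filesystem at all.

SCRATCH_DIR=/scratch/$USER/$SLURM_JOB_ID    # hypothetical per-job scratch area
mkdir -p "$SCRATCH_DIR"
cp input.dat "$SCRATCH_DIR"                 # stage input from the shared work space
cd "$SCRATCH_DIR"
srun ./simulate input.dat                   # compute on the faster storage
cp results.dat "$SLURM_SUBMIT_DIR"          # copy results back before the job ends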


Exercises to place under codes/lab02