Slurm User Guide

The Slurm workload manager is used to run jobs on our HPC platform. Users interact with the system through commands such as srun and sbatch, and submit jobs to specific partitions (queues).

Available Partitions

  • cloud – runs on our virtualized cloud infrastructure. Suitable for smaller or flexible workloads.

  • physical – runs on dedicated bare-metal compute nodes. Use this for high-performance CPU jobs.

  • physical-gpu – runs on bare-metal nodes equipped with GPUs (e.g. A100). Use this for GPU-accelerated workloads.
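
To see which partitions are available to you and the current state of their nodes, you can use sinfo, a standard Slurm command (partition names and node counts will vary by cluster):

sinfo                              # summary of all partitions and node states
sinfo --partition=physical-gpu     # details for a single partition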

Interactive Jobs with srun

The srun command launches an interactive session on a compute node. This is useful for debugging, testing code, or short runs.

Example: request 2 CPU cores and 4 GB of memory for 1 hour on the cloud partition:

srun --partition=cloud --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash

Once the session starts, you will be inside a compute node shell where you can run your program interactively.
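
Inside the session, a few quick checks can confirm that you received the resources you requested. This is a minimal sketch using standard Slurm environment variables:

hostname                     # the compute node you landed on
echo $SLURM_JOB_ID           # the job ID assigned to this session
echo $SLURM_CPUS_PER_TASK    # should match --cpus-per-task (2 in the example above)
exit                         # leave the session and release the allocation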

Example CPU Job Script

#!/bin/bash
#SBATCH --job-name=cpu_test
#SBATCH --output=cpu_test_%j.out
#SBATCH --partition=physical
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00

# Load the default compiler runtime
module load gcc-runtime/13.2.0

echo "Running on host: $(hostname)"
echo "Job started at: $(date)"

# Example program: Python hello world
python3 -c "print('Hello from Slurm on $(hostname)!')"

echo "Job finished at: $(date)"

Submit it:

sbatch cpu_test.sh

Check the output:

cat cpu_test_<jobid>.out
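
Once the job has finished, you can also review its accounting record with sacct, a standard Slurm command (the fields available depend on how job accounting is configured on the cluster):

sacct -j <jobid> --format=JobID,JobName,Partition,Elapsed,MaxRSS,State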

Example GPU Job Script

For GPU workloads, request GPUs with --gres=gpu:<N>. The examples below request 1 GPU on the physical-gpu partition.

Interactive session:

srun --partition=physical-gpu --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=01:00:00 --pty bash
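
Once the session starts, you can confirm that a GPU was actually allocated. A quick sketch (nvidia-smi is provided by the NVIDIA driver on the GPU nodes):

nvidia-smi -L                 # list the GPU(s) visible to this job
echo $CUDA_VISIBLE_DEVICES    # typically set by Slurm to the allocated GPU index(es)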

Batch job script:

#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --output=gpu_test_%j.out
#SBATCH --partition=physical-gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=01:00:00

# Load CUDA
module load cuda/12.9.0

echo "Running on host: $(hostname)"
nvidia-smi

# Example: run a CUDA program or container
# ./my_gpu_script

Submit it:

sbatch gpu_test.sh

Check the output:

cat gpu_test_<jobid>.out

Monitoring Jobs

  • List your jobs:

squeue -u $USER
  • Cancel a job:

scancel <jobid>
  • Show job details:

scontrol show job <jobid>
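
The default squeue output can also be customized with format specifiers if you want more detail at a glance. A sketch using standard squeue options:

squeue -u $USER -o "%.10i %.20j %.12P %.8T %.10M %R"   # job ID, name, partition, state, elapsed time, node/reason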

Loading Software with Modules

Our cluster uses Lmod (via Spack) for managing software. Before running your jobs, you may need to load specific compilers, libraries, or applications.

  • To list available modules:

module avail
  • To search for a module by keyword:

module spider gcc
  • To load a module:

module load gcc-runtime/13.2.0
  • To see what modules you have loaded:

module list
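
When switching between toolchains it can help to start from a clean module environment. module purge and module unload are standard Lmod commands, shown here as a sketch:

module purge                        # unload everything currently loaded
module load gcc-runtime/13.2.0      # then load only what the job needs
module unload gcc-runtime/13.2.0    # or unload a single module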

Conda / Miniforge

For users who want to manage Python environments and packages independently of the cluster-wide software modules, we recommend using Miniforge or Conda in your home directory. This allows you to create isolated environments and install Python packages without affecting other users.

Installing Miniforge

  1. Download the latest Miniforge installer:

cd $HOME
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
  2. Run the installer:

bash Miniforge3-Linux-x86_64.sh
  3. Follow the prompts:

    • Accept the license.

    • Install into your home directory (default: $HOME/miniforge3).

    • Allow the installer to initialize Conda by modifying your shell startup file.

  4. Activate Conda:

source $HOME/miniforge3/bin/activate

You can also add the initialization to your shell startup file automatically:

conda init
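
On a shared login node you may prefer that the base environment is not activated automatically on every login. This is a standard Conda setting and entirely optional:

conda config --set auto_activate_base false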

Creating and Managing Environments

  • Create a new environment:

conda create --name my_env python=3.11
  • Activate the environment:

conda activate my_env
  • Deactivate the environment:

conda deactivate
  • List all environments:

conda env list
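
To make an environment reproducible, for example to rebuild it later or share it with a colleague, you can export it to a YAML file and recreate it from that file. These are standard Conda commands:

conda env export --name my_env > environment.yml    # record packages and versions
conda env create --file environment.yml             # recreate the environment from the file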

Installing Packages

Within an activated environment, you can install Python packages independently:

conda install numpy scipy matplotlib

Or use pip inside the environment:

pip install pandas seaborn
  • Update packages:

conda update numpy
  • Remove packages:

conda remove matplotlib
  • Delete an environment:

conda env remove --name my_env
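
To use a Conda environment inside a batch job, activate it in the job script before running your program. The sketch below assumes Miniforge was installed to the default $HOME/miniforge3 location and that an environment named my_env exists; the job name and resource values are placeholders:

#!/bin/bash
#SBATCH --job-name=conda_test
#SBATCH --output=conda_test_%j.out
#SBATCH --partition=cloud
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=00:30:00

# Activate Conda, then the project environment
source $HOME/miniforge3/bin/activate
conda activate my_env

# Confirm the job is using the environment's Python
python -c "import sys; print(sys.executable)"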

Best Practices

  • Always activate your environment before running Python programs.

  • Keep separate environments for different projects to avoid package conflicts.

  • Avoid installing packages directly into the base Conda environment — use named environments instead.

Running Docker Images with Apptainer on HPC

This guide shows how to pull and run a Docker image using Apptainer in an HPC environment managed by Slurm.

Apptainer is the container platform available on our HPC system. It allows you to package applications, dependencies, and environments into portable container images that can run across all partitions (cloud, physical, and physical-gpu).

Step 1: Request an Interactive Slurm Session

Start an interactive session using srun:

srun --partition=cloud --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash
  • --partition=cloud → the partition to use

  • --cpus-per-task=2 → number of CPU cores

  • --mem=4G → memory allocation

  • --time=01:00:00 → run time limit (hh:mm:ss)

  • --pty bash → interactive shell

Step 2: Pull the Docker Image

Pull a Docker image and convert it to a SIF file:

apptainer pull docker://hello-world

Output:

hello-world_latest.sif
  • This creates a file hello-world_latest.sif in your current directory.
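
Pulled image layers are cached under $HOME/.apptainer by default, which can eat into your home-directory quota for large images. The cache location can be redirected with the standard APPTAINER_CACHEDIR environment variable; the path below is a placeholder for whatever scratch space you have available:

export APPTAINER_CACHEDIR=/path/to/scratch/apptainer_cache
apptainer pull docker://hello-world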

Step 3: Run the Container

Run the container with:

apptainer run hello-world_latest.sif

Expected output:

Hello from Docker!
This message shows that your installation appears to be working correctly.

Step 4 (Optional): Inspect or Enter the Container

Inspect metadata:

apptainer inspect hello-world_latest.sif

Run a command inside the container:

apptainer exec hello-world_latest.sif ls /

Interactive shell inside the container:

apptainer shell hello-world_latest.sif
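
Two standard Apptainer options are worth knowing on the cluster: --bind mounts a host directory into the container, and --nv exposes the host's NVIDIA GPUs and driver inside the container (needed for GPU containers on the physical-gpu partition). The paths and image name below are placeholders:

# Mount a project directory into the container
apptainer exec --bind /path/to/project:/project hello-world_latest.sif ls /project

# Inside a GPU allocation, run a GPU-enabled container
apptainer exec --nv my_gpu_image.sif nvidia-smi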

Notes

  • apptainer pull → downloads the image once and converts it to a reusable .sif file.

  • apptainer run → executes the container’s default program.

  • apptainer exec → runs a custom command inside the container.

  • apptainer shell → opens an interactive shell inside the container.