
Hands-On Draco HPC-Cluster

0) Preliminaries

0.1) Draco HPC cluster @ FSU Jena


source: https://zedif.gitpages.uni-jena.de/courses/hpc-intro-2024-04/images/IMG_20230502_1221474_small.jpg


0.2) Material


1) Introducing Draco

1.1) General remarks

  • base system installed in 2021, expansion in 2022/23
  • for all members of Thuringian Universities
  • compute intensive workloads and interactive sessions

1.2) Hardware

  • 108 standard compute + 2 login nodes
    • 8 nodes: each 48 cores + 384GB RAM
    • 92+2 nodes: 48 cores + 256GB RAM
    • 8 nodes: each 16 cores + 192GB RAM
  • 17 compute nodes with GPU accelerators
    • 4 nodes: 4x NVIDIA V100, 32GB
    • 10 nodes: 4x NVIDIA A100, 40/80GB
    • 3 nodes: 1x NVIDIA A100, 80GB

  • 5 high-memory compute nodes
    • 1 node: 72 cores, 4 TB RAM
    • 4 nodes: 64 cores, 2.3 TB RAM
  • 4 visualization nodes
    • backends for "Remote Workstations"
    • 32 cores, 768GB RAM
    • 1 node: NVIDIA Quadro RTX 5000
    • 3 nodes: NVIDIA RTX A5000

1.3) Storage systems

  • BeeGFS
    • parallel file system
    • perfect for streamed I/O
    • "for tasks with few big files"
    • mounted on all nodes:
      • /home: user directories of 197 TB
      • /work: work partition of 524 TB

  • VASTData
    • all-flash parallel file system
    • perfect for random-read access ("SSD-like")
    • "for ML tasks, i.e. many small files"
    • mounted on all nodes:
      • /vast: work partition of 273 TB

1.4) Filesystems


source: https://zedif.gitpages.uni-jena.de/courses/hpc-intro-2024-04/#57


2) Accessing Draco

2.1) Requirements

  • EAH employees must apply for FSU guest account
  • requires VPN connection
  • SSH-Login (standard):
ssh userid@login1.draco.uni-jena.de
ssh userid@login2.draco.uni-jena.de
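To avoid typing the full hostname on every login, an entry in ~/.ssh/config can define a shortcut — a sketch, where the alias draco and the user name userid are placeholders:

```
Host draco
    HostName login1.draco.uni-jena.de
    User userid
```

Afterwards, ssh draco suffices (the VPN connection is still required).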

2.2) "Remote Desktop"

  • Remote Desktop sessions via NiceDCV:
  • Hardware:
    • dedicated nodes on Draco (vis0x)
    • 3D-accelerated graphics (GPU: NVIDIA RTX A5000)

  • shared with other users
  • usage:
    • only for interactive, visual work
    • no workloads


source: https://zedif.gitpages.uni-jena.de/courses/hpc-intro-2024-04/images/enginframe3.png


2.3) Data transfer
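No commands were given here; typical tools are scp and rsync — a sketch, where the file names, userid and target paths are placeholders:

```shell
# copy a single file from your machine to your Draco home directory
scp results.tar.gz userid@login1.draco.uni-jena.de:~/

# recursively synchronize a dataset to /vast; -P shows progress and
# allows resuming interrupted transfers
rsync -avP dataset/ userid@login1.draco.uni-jena.de:/vast/userid/dataset/
```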


2.4) SSH

  • open a terminal and use "ssh" or PuTTY to log in to Draco:
Welcome to the HPC cluster
 ______   ______ _______ _______  _____ 
 |     \ |_____/ |_____| |       |     |
 |_____/ |    \_ |     | |_____  |_____|
-> To stay informed, please sign up for the mailing list at
   https://lserv.uni-jena.de/mailman/listinfo/draco-user
   and regularly browse /cluster/ChangeLog.
-> To report problems or ask for help, please use the ticket system (preferably)
   https://servicedesk.uni-jena.de/servicedesk/customer/portal/121/create/647
   or send a message to draco-admin@listserv.uni-jena.de.
-> To launch compute jobs, get familiar with Slurm (sinfo, sbatch, srun, salloc) 
   and environment modules (module avail) to load software packages.
-> To launch a quick interactive (single-core) bash session on a node, type
   srun --time 01:00:00 --pty bash

3) Login nodes

3.1) General remarks

  • entrance server(s) to a cluster

    source: https://static.wikia.nocookie.net/lotr/images/2/2e/Durin%27s_door.png

  • shared by many users
  • used to...
    • manage (submit) compute jobs
    • prepare / compile software
    • transfer data to / from the cluster

3.2) What you should not do

  • no extensive computations on the login nodes
  • no I/O-intensive tasks on the login nodes, otherwise all users are affected (and complain)
  • all computations and interactive sessions should be submitted to compute nodes via the Slurm workload manager

4) Slurm


source: https://static.wikia.nocookie.net/enfuturama/images/8/80/Slurm-1-.jpg


4.1) General remarks

  • resource manager and job scheduler
  • how it works:
    • (I) user submits job to queue
    • (II) Slurm schedules and allocates resources
  • documentation

  • advantages:
    • resources (CPUs, memory, GPUs, etc.) are fairly shared among users
    • compute nodes are never overloaded
  • disadvantages:
    • the user has to get used to the workflow (different from using a workstation)

4.2) Workflow

  1. write batch script (mostly bash):
    • specify all required resources in #SBATCH header
    • bash commands below header are executed at runtime
  2. submit the job via sbatch; the job script is usually a Bash script, but Perl or Python scripts are possible, too:
[userid@login1 ~] sbatch 01_logical_cpus.sh
Submitted batch job 1125237

  • example slurm script: 01_logical_cpus.sh
#!/bin/bash
#SBATCH --job-name=01_logical_cpus
#SBATCH --partition=short
#SBATCH --ntasks=1
#SBATCH --output=01_logical_cpus.out.%j
#SBATCH --error=01_logical_cpus.err.%j
#SBATCH --time=10:00
echo "Job started at: $(date)"
python3 01_logical_cpus.py
sleep 30
echo "Job finished at: $(date)"

4.3) Watch your jobs

  • squeue: to view the state of submitted jobs and job steps
    • list queued or running jobs:
[login1: ~] squeue --me
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
1125237     short 01_logic  mi74hig PD       0:00      1 (Priority)
  • ...a few seconds later...
[login1: ~] squeue --me -l
Mon Apr 29 22:28:44 2024
JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
1125237     short 01_logic  mi74hig  RUNNING       0:15     10:00      1 node013

  • scancel: to cancel a pending or running job, e.g. if it is no longer needed or the job script requires changes
[userid@login1 ~] sbatch 01_logical_cpus.sh
Submitted batch job 1125238
[userid@login1 ~]$ scancel 1125238
[userid@login1 ~]$ squeue --me
JOBID PARTITION       NAME     USER  ST    TIME  NODES NODELIST(REASON)

  • switch to the node where your job is running: [userid@login1 ~] ssh node0100
  • use htop to view processes: [userid@login1 ~] htop
  • exit the node: exit

  • after completion, we can get job details from the Slurm database using sacct:
[login1: ~] sacct -j 1125237 -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1125237      01_logica+      short  fsu-users          2  COMPLETED      0:0
  • Keep JobID in your logs!

4.4) Work with job outputs

  • the example job created two files: one holds the STDOUT, the other the STDERR output
[login1: ~] ls *.1125237
01_logical_cpus.err.1125237  01_logical_cpus.out.1125237
  • get STDOUT and STDERR outputs:
[login1: ~] cat 01_logical_cpus.out.1125237
Job started at: Mon Apr 29 22:28:29 CEST 2024
Node: node013.cluster
Number of Logical CPU cores: 96
Job finished at: Mon Apr 29 22:28:59 CEST 2024
[login1: ~] cat 01_logical_cpus.err.1125237
<empty>

4.5) Have a look on cluster usage

  • sinfo: provides an overview of all job queues (partitions), their allocations and limits (-s for a summarized view)
 [login1: ~] sinfo
 PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
 standard       up 3-00:00:00      5 drain* node[037-038,041-042,054]
 standard       up 3-00:00:00     14    mix node[043,048,050-052,069,081,087-091,097-098]
 standard       up 3-00:00:00     37  alloc node[039-040,044-047,049,053,055-068,070-072,082-086,092-096,099-100]
 short*         up    3:00:00      4 drain* node[009-012]
 short*         up    3:00:00      4   idle node[013-016]
 long           up 14-00:00:0     12 drain* node[017-020,025-028,033-036]
 long           up 14-00:00:0     11    mix node[022-024,029-030,032,073-074,076,078-079]
 long           up 14-00:00:0      5  alloc node[021,031,075,077,080]
 fat            up 3-00:00:00      1    mix fat01
 fat            up 3-00:00:00      4   down fat[02-05]
 gpu            up 3-00:00:00      8    mix gpu[005-007,013-017]
 ...            ...               ..   .... ...

  • the partition marked with * (here: short) is the default, chosen if no partition is specified
  • alloc / mix nodes are completely / partly allocated by jobs
  • idle nodes are free
  • down nodes are offline
  • drain nodes won't accept new jobs (usually waiting for maintenance)

4.6) Interactive sessions

  • salloc: allocate resources (nodes, cores, gpu, ...) for an interactive session
  • similar to working on your own workstation
  • all requested resources are reserved, even if not used
    • exit session if not used!
  • useful for JupyterLab

  • salloc example:
[login1 ~]$ salloc --partition=gpu --gres=gpu:1  --time=10:00
salloc: Pending job allocation 126922
salloc: job 126922 queued and waiting for resources
salloc: job 126922 has been allocated resources
salloc: Granted job allocation 126922
salloc: Waiting for resource configuration
salloc: Nodes gpu006 are ready for job
[gpu006 ~]$ nvidia-smi  --list-gpus
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ab55fed8-8e56-ead0-31da-564d381a308c)
[gpu006 ~]$ python3 gpu_train_test.py
  .............................
  <output of gpu_train_test.py>
  .............................
[gpu006 ~]$ exit
salloc: Relinquishing job allocation 126922

  • srun: to launch an interactive bash session (implicit slurm allocation)
[userid@login1 ~]$ srun --time 01:00:00 --pty bash
srun: job 821487 queued and waiting for resources
srun: job 821487 has been allocated resources
[userid@node013 ~]$ ...type your commands....
[userid@node013 ~]$ exit
[userid@login1 ~]$

  • srun (inside a batch script): to distribute allocated resources to independent (sub)tasks running simultaneously
[login1 ~]$ cat trivially_parallel.sbatch
#!/bin/bash
#SBATCH --job-name trivially_parallel
#SBATCH --ntasks=3
#SBATCH --output  trivially_parallel.out.%j
#SBATCH --error   trivially_parallel.err.%j
srun -n 1 myprg -i input1 -o results1  &
srun -n 1 myprg -i input2 -o results2  &
srun -n 1 myprg -i input3 -o results3  &
wait

4.7) Job parameters

  • for sbatch and salloc
--partition=standard,short
--nodes=2
--ntasks=1  # Number of unix processes (or MPI ranks)
--cpus-per-task=6  # Number of cores/cpus per task
--hint=nomultithread
--gres=gpu:4  # Number of allocated GPUs
--mem=40G
--mem-per-cpu, --mem-per-gpu
--mail-user, --mail-type
--time=DD-HH:MM:SS
--output, --error
...

  • additional parameters for node selections:
    • -w, --nodelist=[nodelist], to select specific node(s): salloc --partition=short -w node009
    • -x, --exclude=[nodelist], to exclude faulty node(s): sbatch --exclude=node013 myimportant.sbatch

  • additional parameters for exclusive or feature-based node selection:
    • --exclusive reserves the complete node: salloc --partition=standard --time=10:00 --exclusive
    • -C, --constraint=[list], selects node(s) with specific features: salloc --constraint=cpu7762

  • How to get pre-defined features? Use sinfo!
[userid@login1 ~]$ sinfo -o "%20N  %10c  %10m  %35f  %10G "
NODELIST              CPUS        MEMORY      AVAIL_FEATURES                       GRES       
node[001-008]         96          385000      cpu6248r,intel,ram384                (null)     
node[009-100]         96          257000      cpu8360y,intel,ram256                (null)     
node[101-108]         32          192078      cpu6134,intel,ram192                 (null)     
fat01                 144         4127000     cpu8360y,intel,ram4096               (null)     
fat[02-05]            128         2322014     cpu9334,amd,ram2355                  (null)     
gpu[001-004]          256         1031000     cpu7762,amd,ram1024,v100s            gpu:v100s: 
gpu005                256         515000      cpu7762,amd,ram512,a100,a100_40gb    gpu:a100:4 
gpu[006-007]          64          515000      cpu73f3,amd,ram512,a100,a100_40gb    gpu:a100:4 
gpu[008-009,013-017]  64          515000      cpu73f3,amd,ram512,a100,a100_80gb    gpu:a100:4 
gpu[010-012]          64          257000      cpu7343,amd,ram256,a100,a100_80gb    gpu:a100:1 
...

  • sbatch example:
#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --partition=short,standard           # partition or list of partitions
#SBATCH --nodes=1                            # default: 1
#SBATCH --ntasks=1                           # unix processes (or MPI ranks)
#SBATCH --cpus-per-task=4                    # shared-mem parallelization
#SBATCH --hint=nomultithread                 # two logical CPUs never on same core
#SBATCH --mem=4096                           # job gets killed if more is used
#SBATCH --output=testjob.out.%j
#SBATCH --error=testjob.err.%j
#SBATCH --time=01:00:00                      # job will be killed after 1 hour
#SBATCH --mail-user=max.throughput@uni-jena.de
#SBATCH --mail-type=FAIL                     # NONE, BEGIN, END, FAIL, REQUEUE, ALL

  • salloc example:
salloc --partition=gpu --gres=gpu:2 --mem=40G

4.8) Further useful sbatch / salloc parameters

there are more options for job management; see man sbatch and man salloc for the full list.
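A few candidates, as a sketch (the values are placeholders; see man sbatch for the authoritative list):

```shell
#SBATCH --array=1-10%2         # job array: tasks 1..10, at most 2 running at once
#SBATCH --begin=now+1hour      # defer the earliest start time of the job
#SBATCH --chdir=/work/userid   # working directory for the job
```

sbatch --test-only job.sh validates a script and prints an estimated start time without actually submitting it.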


4.9) Notes on software and hardware interactions

  • GPUs must be explicitly requested:
    • GRES ("generic resources", e.g. GPUs) are resources a job may need. They have to be explicitly requested, for example, when submitting to the gpu partition
#SBATCH --partition=gpu 
#SBATCH --gres=gpu:2        # Requests two A100 GPUs on a gpu node

  • logical CPUs vs. cores:
    • hyperthreads (Intel) / SMT (AMD) are enabled
    • each processor core appears as two logical CPUs
    • Slurm CPUS = logical CPUs, i.e. cores per node = CPUS/2

  • Specify: #SBATCH --hint=nomultithread for threaded programs (OpenMP, pthreads,..) if one logical CPU per core is beneficial (often)
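The relation can be checked with simple shell arithmetic (96 is the CPUS value that sinfo reports for the standard nodes):

```shell
# Slurm's CPUS value counts logical CPUs; with SMT / hyper-threading
# enabled, each physical core appears as two logical CPUs.
logical_cpus=96               # CPUS reported for node[001-100]
cores=$((logical_cpus / 2))   # physical cores per node
echo "$cores"                 # prints 48
```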

6) Parallel applications

  • HPC cluster != cluster of many workstations attached to a global file system

  • they may serve as good machines for high-throughput computing (many independent but similar tasks)

  • actual purpose: parallelized applications, which may simultaneously use multiple cores and/or nodes, and/or one or multiple GPU accelerators, in a single task


  • Difficult part: actual parallelization of an application (i.e. algorithms), or parts of it:
    • GPU applications (e.g. using CUDA)
      • usage of GPU and CPU threads hidden in library functions (e.g. Tensorflow or pyTorch)
    • distributed-memory applications (e.g. MPI)
    • threaded shared-memory application (e.g. pthreads, OpenMP)
  • visit the zedif workshop and have a look at the slides
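As a sketch, a distributed-memory (MPI) batch job might look as follows; the binary my_mpi_app is a placeholder, and the module version mirrors the mpi/openmpi/4.1.1 module listed on Draco:

```shell
#!/bin/bash
#SBATCH --job-name=mpi_example
#SBATCH --partition=standard
#SBATCH --nodes=2               # two compute nodes
#SBATCH --ntasks-per-node=48    # one MPI rank per physical core
#SBATCH --time=01:00:00
module purge
module load mpi/openmpi/4.1.1
srun ./my_mpi_app               # srun launches all 2x48 ranks
```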

7) GPU usage

7.1) GPU nodes

  • compute nodes with additional GPU accelerators (e.g. NVIDIA A100)
  • advantages:
    • many more threads than usual compute nodes
    • Example: NVIDIA A100
      • features 19.5 teraflops of FP32 performance (vs. gigaflops per CPU core)
      • 6912 CUDA cores, 40GB memory

  • disadvantages:
    • lightweight threads (cf. OpenMP)
    • limited GPU memory
    • overhead from offloading data to the GPU (via the CPU)
    • requires a sufficient workload for good performance

7.2) Calculations on GPU nodes

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
#SBATCH --job-name=MNiST_beginner
#SBATCH --output=MNiST_beginner.out.%j
#SBATCH --error=MNiST_beginner.err.%j
# Load environment variables (unload all preexisting)
module purge
# Load conda environment with tensorflow
source /home/$USER/miniconda3/etc/profile.d/conda.sh
conda activate tensorflow-gpu
python3 MNiST_beginner.py

8) Software

8.1) General remarks

  • all program installations are maintained, but are perhaps not the latest versions
  • Dnf/Yum package manager of AlmaLinux8 (cf. RedHat8)
  • Spack package manager or manual installation
  • Singularity software container:
    • Docker is not available because of a conflict with (non-root) multi-user environments
    • Singularity / Apptainer is a mature replacement on HPC clusters

8.2) Software environment modules

  • modules provide paths and environment variables for each custom package
  • list of available environment modules:
[userid@login1 ~]$ module avail
...
apps/alphafold/2.2.2   apps/guppy/5.0.7              apps/turbomole/7.6            mpi/openmpi/4.1.1
apps/bowtie2/2.4.2     apps/guppy/5.0.7-cpu          compiler/gcc/10.4.0           nvidia/cuda/11.3
apps/cellranger/7.1.0  apps/mathematica/12.3         compiler/gcc/11.3.0           nvidia/cuda/11.7

  • load and unload a module, e.g. a specific compiler version
[userid@login ~]$ module load compiler/gcc/12.2.0
[userid@login ~]$ module unload compiler/gcc/12.2.0
  • What is the effect?: it changes / sets environment variables
[userid@login1 ~]$ module show apps/mathematica/12.3 
...
prepend-path    PATH /cluster/apps/mathematica/12.3/Executables
setenv          MATHEMATICA_BASE /cluster/apps/mathematica/12.3
setenv          NVIDIA_DRIVER_LIBRARY_PATH /usr/lib64/libnvidia-tls.so.465.19.01
setenv          CUDA_LIBRARY_PATH /usr/lib64/libcuda.so

  • Switch, list and purge loaded modules
[userid@login1 ~]$ module load compiler/gcc/12.2.0  apps/mathematica/12.3
[userid@login1 ~]$ module switch compiler/gcc/12.2.0 compiler/gcc/11.3.0  
[userid@login1 ~]$ module list
Currently Loaded Modulefiles:
1) compiler/gcc/11.3.0   2) apps/mathematica/12.3  
[userid@login1 ~]$ module purge
[userid@login1 ~]$ module list
No Modulefiles Currently Loaded.

8.3) Bring your own software

  • use configure/make/cmake...
  • conda environments:
    • either use the conda which comes with the "Intel Distribution for Python" of the Intel oneAPI AI Toolkit

  • or install Miniconda3 in your private or group folder
[userid@login1 ~]$ wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
[userid@login1 ~]$ bash Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
...
install path: /home/userid/miniconda3/py310_23.3.1-0
  • you should temporarily use /vast for the installation!

  • you may want to prepend Intel's Anaconda channel for performance reasons

[userid@login1 ~]$ conda config --add channels intel
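Putting the pieces together, a one-time setup might look like this — a sketch; the environment name tensorflow-gpu matches the GPU example in section 7.2, the Python version and packages are assumptions:

```shell
# one-time setup on a login node
source /home/$USER/miniconda3/etc/profile.d/conda.sh
conda config --add channels intel          # optional, for performance
conda create -n tensorflow-gpu python=3.10
conda activate tensorflow-gpu
pip install tensorflow
```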

9) General remarks

  • avoid submitting hundreds of short (few-minute) jobs:
    • group tiny jobs into bigger jobs
    • a good job runs between 3h and 48h
  • try to avoid very long (many-day) jobs:
    • nodes may fail; organize your workflows to be fail-safe
    • organize the workflow into a set of concurrent or a chain of subsequent jobs
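A chain of subsequent jobs can be expressed with Slurm job dependencies — a sketch, assuming batch scripts step1.sh, step2.sh and step3.sh exist:

```shell
# --parsable makes sbatch print only the job id
jid1=$(sbatch --parsable step1.sh)
jid2=$(sbatch --parsable --dependency=afterok:"$jid1" step2.sh)
sbatch --dependency=afterok:"$jid2" step3.sh   # runs only if step2 succeeded
```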

  • choose the right number of cores / tasks per node, and mind I/O operations:
    • a job's bottleneck is often the file system (many "I/O operations")
    • independent jobs running on different cores compete for file access
    • avoid 100's of open/close file operations and only log what you need

10) Where to start?

10.1) First hands-on

  1. get a FSU guest account
  2. get familiar with workflow / do's and don'ts...
  3. connect with ssh / ftp
  4. install miniconda to /vast/$USER/ and create an environment
  5. think about your project and workflow: hardware, software requirements, bottlenecks
  6. upload a minimal example of your project (small dataset and model)
  7. write a bash script with respect to the do's and don'ts...
  8. submit and track the job
  9. check output
  10. think about your project and workflow...

10.2) Further directions

  • consider RDM and version control: documentation of input, output, log files, batch files
  • familiarize yourself with job efficiency: start with fewer cores
  • consider the I/O bottleneck: pre-process data if necessary
  • you may need to select a node with certain features