Hands-On Draco HPC-Cluster
- provided by Thomas R. Holy
- last edited: 05.06.2025
0) Preliminaries
0.1) Draco HPC cluster @ FSU Jena

source: https://zedif.gitpages.uni-jena.de/courses/hpc-intro-2024-04/images/IMG_20230502_1221474_small.jpg
0.2) Material
- zedif slides have been adapted for this presentation
- visit zedif training courses
- the documentation of GWDG is also useful
1) Introducing Draco
1.1) General remarks
- base system installed in 2021, expansion in 2022/23
- for all members of Thuringian Universities
- compute intensive workloads and interactive sessions
1.2) Hardware
- 108 standard compute + 2 login nodes
- 8 nodes: each 48 cores + 384GB RAM
- 92+2 nodes: 48 cores + 256GB RAM
- 8 nodes: each 16 cores + 192GB RAM
- 17 compute nodes with GPU accelerators
- 4 nodes: 4x NVIDIA V100, 32GB
- 10 nodes: 4x NVIDIA A100, 40/80GB
- 3 nodes: 1x NVIDIA A100, 80GB
- 5 high-memory compute nodes
- 1 node: 72 cores, 4 TB RAM
- 4 nodes: 64 cores, 2.3 TB RAM
- 4 visualization nodes
- backends for "Remote Workstations"
- 32 cores, 768GB RAM
- 1 node: NVIDIA Quadro RTX 5000
- 3 nodes: NVIDIA RTX A5000
1.3) Storage systems
- BeeGFS
- parallel file system
- perfect for streamed I/O
- "for tasks with few big files"
- mounted on all nodes:
- /home: user directories of 197 TB
- /work: work partition of 524 TB
- VASTData
- all-flash parallel file system
- perfect for random-read access ("SSD-like")
- "for ML tasks, i.e. many small files"
- mounted on all nodes:
- /vast: work partition of 273 TB
1.4) Filesystems

source: https://zedif.gitpages.uni-jena.de/courses/hpc-intro-2024-04/#57
2) Accessing Draco
2.1) Requirements
- EAH employees must apply for an FSU guest account
- requires VPN connection
- SSH-Login (standard):
ssh userid@login1.draco.uni-jena.de
ssh userid@login2.draco.uni-jena.de
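A host entry in ~/.ssh/config can shorten the login command; the alias and key file name here are examples, not prescribed by the cluster:

```
# ~/.ssh/config (alias "draco" and key file are examples)
Host draco
    HostName login1.draco.uni-jena.de
    User userid
    IdentityFile ~/.ssh/id_ed25519
```

With this entry in place, `ssh draco` is equivalent to the full command above.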
2.2) "Remote Desktop"
- remote desktop sessions via NICE DCV:
- Hardware:
- dedicated nodes on Draco (vis0x)
- 3D-accelerated graphics (GPU: NVIDIA RTX A5000)
- shared with other users
- usage:
- only for interactive, visual work
- no workloads

source: https://zedif.gitpages.uni-jena.de/courses/hpc-intro-2024-04/images/enginframe3.png
2.3) Data transfer
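Data is usually moved to and from the cluster over SSH with scp or rsync. A minimal sketch that only builds the rsync command; the target path under /work is an assumption, adjust it to your own directory layout:

```shell
# Hypothetical helper: build the rsync command used to push a local
# directory to Draco. The /work/<user> target path is an assumption.
draco_push_cmd() {
    local src="$1" user="$2"
    # -a: archive mode, -z: compress, --partial: keep partially transferred files
    echo "rsync -az --partial ${src} ${user}@login1.draco.uni-jena.de:/work/${user}/"
}

# Print the command instead of executing it:
draco_push_cmd ./mydata userid
```

scp works the same way for single files; rsync is preferable for large directories because interrupted transfers can be resumed.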
2.4) SSH
- open a terminal and use "ssh" or PuTTY to log in to Draco:
Welcome to the HPC cluster
______ ______ _______ _______ _____
| \ |_____/ |_____| | | |
|_____/ | \_ | | |_____ |_____|
-> To stay informed, please sign up for the mailling list at
https://lserv.uni-jena.de/mailman/listinfo/draco-user
and regularly browse /cluster/ChangeLog.
-> To report problems or ask for help, please use the ticket system (preferably)
https://servicedesk.uni-jena.de/servicedesk/customer/portal/121/create/647
or send a message to draco-admin@listserv.uni-jena.de.
-> To launch compute jobs, get familiar with Slurm (sinfo, sbatch, srun, salloc)
and environment modules (module avail) to load software packages.
-> To launch a quick interactive (single-core) bash session on a node, type
srun --time 01:00:00 --pty bash
3) Login nodes
3.1) General remarks
- entrance server(s) to a cluster

source: https://static.wikia.nocookie.net/lotr/images/2/2e/Durin%27s_door.png
- shared by many users
- used to...
- manage (submit) compute jobs
- preparation / compilation of software
- data transfers to / from cluster
3.2) What you should not do
- no extensive computations on login nodes
- no intensive (I/O) tasks on login nodes, else all users are affected (and complain)
- all computations and interactive sessions shall be submitted to compute nodes via Slurm workload manager
4) Slurm

source: https://static.wikia.nocookie.net/enfuturama/images/8/80/Slurm-1-.jpg
4.1) General remarks
- resource manager and job scheduler
- how it works:
- (I) user submits job to queue
- (II) Slurm schedules and allocates resources
- documentation
- advantages:
- resources (CPUs, memory, GPUs, etc.) are fairly shared among users
- compute nodes are never overloaded
- disadvantages:
- user has to get used to workflow (different to using a workstation)
4.2) Workflow
- write batch script (mostly bash):
- specify all required resources in the #SBATCH header
- bash commands below the header are executed at runtime
- submit the job via sbatch; the job script is usually a Bash script, but Perl or Python scripts are possible, too:
[userid@login1 ~] sbatch 01_logical_cpus.sh
Submitted batch job 1125237
- example slurm script:
01_logical_cpus.sh
#!/bin/bash
#SBATCH --job-name=01_logical_cpus
#SBATCH --partition=short
#SBATCH --ntasks=1
#SBATCH --output=01_logical_cpus.out.%j
#SBATCH --error=01_logical_cpus.err.%j
#SBATCH --time=10:00
echo "Job started at: $(date)"
python3 01_logical_cpus.py
sleep 30
echo "Job finished at: $(date)"
4.3) Watch your jobs
- squeue: to view the state of submitted jobs and job steps
- list queued or running jobs:
[login1: ~] squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1125237 short 01_logic mi74hig PD 0:00 1 (Priority)
- ...a few seconds later...
[login1: ~] squeue --me -l
Mon Apr 29 22:28:44 2024
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1125237 short 01_logic mi74hig RUNNING 0:15 10:00 1 node013
- scancel: to cancel a pending or running job from the queue, if not needed anymore or changes to job scripts are required
[userid@login1 ~] sbatch 01_logical_cpus.sh
Submitted batch job 1125238
[userid@login1 ~]$ scancel 1125238
[userid@login1 ~]$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
- switch to the node where your job is running:
[userid@login1 ~] ssh node0100
- use htop to view processes:
[userid@node0100 ~] htop
- exit the node:
[userid@node0100 ~] exit
- after completion we can get job details from Slurm database using sacct:
[login1: ~] sacct -j 1125237 -X
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1125237 01_logica+ short fsu-users 2 COMPLETED 0:0
- Keep JobID in your logs!
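sacct can report more than the default columns via its --format option; for example, elapsed time and peak memory usage (field names are standard sacct fields):

```
# Detailed accounting for a finished job: runtime and peak memory per step
sacct -j 1125237 --format=JobID,JobName,Partition,Elapsed,MaxRSS,State
```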
4.4) Work with job outputs
- the example job created two files: one keeps the STDOUT, the other the STDERR output
[login1: ~] ls *.1125237
01_logical_cpus.err.1125237 01_logical_cpus.out.1125237
- get STDOUT and STDERR outputs:
[login1: ~] cat 01_logical_cpus.out.1125237
Job started at: Mon Apr 29 22:28:29 CEST 2024
Node: node013.cluster
Number of Logical CPU cores: 96
Job finished at: Mon Apr 29 22:28:59 CEST 2024
[login1: ~] cat 01_logical_cpus.err.1125237
<empty>
4.5) Have a look at cluster usage
- sinfo: provides an overview of all job queues (partitions), their allocations and limits
[login1: ~] sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
standard up 3-00:00:00 5 drain* node[037-038,041-042,054]
standard up 3-00:00:00 14 mix node[043,048,050-052,069,081,087-091,097-098]
standard up 3-00:00:00 37 alloc node[039-040,044-047,049,053,055-068,070-072,082-086,092-096,099-100]
short* up 3:00:00 4 drain* node[009-012]
short* up 3:00:00 4 idle node[013-016]
long up 14-00:00:0 12 drain* node[017-020,025-028,033-036]
long up 14-00:00:0 11 mix node[022-024,029-030,032,073-074,076,078-079]
long up 14-00:00:0 5 alloc node[021,031,075,077,080]
fat up 3-00:00:00 1 mix fat01
fat up 3-00:00:00 4 down fat[02-05]
gpu up 3-00:00:00 8 mix gpu[005-007,013-017]
... ... .. .... ...
- partition short* marks the default which is chosen if no partition is specified
- alloc / mix nodes are completely / partly allocated by jobs
- idle nodes are free
- down nodes are offline
- drain nodes won't schedule new jobs (usually wait for some maintenance work)
4.6) Interactive sessions
- salloc: allocate resources (nodes, cores, gpu, ...) for an interactive session
- similar to working on own workstation
- all requested resources are reserved, even if not used
- exit session if not used!
- useful for JupyterLab
- salloc example:
[login1 ~]$ salloc --partition=gpu --gres=gpu:1 --time=10:00
salloc: Pending job allocation 126922
salloc: job 126922 queued and waiting for resources
salloc: job 126922 has been allocated resources
salloc: Granted job allocation 126922
salloc: Waiting for resource configuration
salloc: Nodes gpu006 are ready for job
[gpu006 ~]$ nvidia-smi --list-gpus
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ab55fed8-8e56-ead0-31da-564d381a308c)
[gpu006 ~]$ python3 gpu_train_test.py
.............................
<output of gpu_train_test.py>
.............................
[gpu006 ~]$ exit
salloc: Relinquishing job allocation 126922
- srun: to launch an interactive bash session (implicit slurm allocation)
[userid@login1 ~]$ srun --time 01:00:00 --pty bash
srun: job 821487 queued and waiting for resources
srun: job 821487 has been allocated resources
[userid@node013 ~]$ ...type your commands....
[userid@node013 ~]$ exit
[userid@login1 ~]$
- srun (I): to distribute allocated resources to independent (sub)tasks running simultaneously
[login1 ~]$ cat trivially_parallel.sbatch
#!/bin/bash
#SBATCH --job-name trivially_parallel
#SBATCH --ntasks=3
#SBATCH --output trivially_parallel.out.%j
#SBATCH --error trivially_parallel.err.%j
srun -n 1 myprg -i input1 -o results1 &
srun -n 1 myprg -i input2 -o results2 &
srun -n 1 myprg -i input3 -o results3 &
wait
4.7) Job parameters
- for sbatch and salloc
--partition=standard,short
--nodes=2
--ntasks=1 # Number of unix processes (or MPI ranks)
--cpus-per-task=6 # Number of cores/cpus per task
--hint=nomultithread
--gres=gpu:4 # Number of allocated GPUs
--mem=40G
--mem-per-cpu, --mem-per-gpu
--mail-user, --mail-type
--time=DD-HH:MM:SS
--output, --error
...
- additional parameters for node selection:
- -w, --nodelist=[nodelist], to select specific node(s):
salloc --partition=short -w node009
- -x, --exclude=[nodelist], to exclude faulty node(s):
sbatch --exclude=node013 myimportant.sbatch
- --exclusive reserves the complete node:
salloc --partition=standard --time=10:00 --exclusive
- -C, --constraint=[list], selects node(s) with specific features:
salloc --constraint=cpu7762
- How to get pre-defined features? Use sinfo!
[userid@login1 ~]$ sinfo -o "%20N %10c %10m %35f %10G "
NODELIST CPUS MEMORY AVAIL_FEATURES GRES
node[001-008] 96 385000 cpu6248r,intel,ram384 (null)
node[009-100] 96 257000 cpu8360y,intel,ram256 (null)
node[101-108] 32 192078 cpu6134,intel,ram192 (null)
fat01 144 4127000 cpu8360y,intel,ram4096 (null)
fat[02-05] 128 2322014 cpu9334,amd,ram2355 (null)
gpu[001-004] 256 1031000 cpu7762,amd,ram1024,v100s gpu:v100s:
gpu005 256 515000 cpu7762,amd,ram512,a100,a100_40gb gpu:a100:4
gpu[006-007] 64 515000 cpu73f3,amd,ram512,a100,a100_40gb gpu:a100:4
gpu[008-009,013-017] 64 515000 cpu73f3,amd,ram512,a100,a100_80gb gpu:a100:4
gpu[010-012] 64 257000 cpu7343,amd,ram256,a100,a100_80gb gpu:a100:1
...
- sbatch example:
#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --partition=short,standard # partition or list of partitions
#SBATCH --nodes=1 # default: 1
#SBATCH --ntasks=1 # unix processes (or MPI ranks)
#SBATCH --cpus-per-task=4 # shared-mem parallelization
#SBATCH --hint=nomultithread # two logical CPUs never on same core
#SBATCH --mem=4096 # job gets killed if more is allocated
#SBATCH --output=testjob.out.%j
#SBATCH --error=testjob.err.%j
#SBATCH --time=01:00:00 # job will be killed after 1 hour
#SBATCH --mail-user=max.throughput@uni-jena.de
#SBATCH --mail-type=FAIL # NONE, BEGIN, END, FAIL, REQUEUE, ALL
- salloc example:
salloc --partition=gpu --gres=gpu:2 --mem=40G
4.8) Further useful sbatch / salloc parameters
- there are more options for job management; see the sbatch and salloc man pages
4.9) Notes on software and hardware interactions
- GPUs must be explicitly requested:
- GRES are generic resources (e.g. GPUs) a job may need. They have to be explicitly requested, for example, when submitting to the gpu partition
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2 # Requests two A100 GPUs on a gpu node
- logical CPUs vs. cores:
- hyperthreads (Intel) / SMT (AMD) are enabled
- each processor core appears as two logical CPUs
- Slurm CPUS = logical CPUs, i.e. cores per node = CPUS/2
- Specify:
#SBATCH --hint=nomultithread for threaded programs (OpenMP, pthreads, ...) if one logical CPU per core is beneficial (often)
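To see the logical-vs-physical distinction yourself, compare nproc with lscpu inside an interactive session on a node; this snippet works on any Linux machine:

```shell
# Count logical CPUs visible to this shell; on Draco nodes this is twice
# the physical core count because hyperthreading/SMT is enabled.
nproc
# Physical layout as reported by lscpu ("Thread(s) per core: 2" with SMT on);
# the || true just ignores machines where lscpu prints different labels.
lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket):' || true
```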
6) Parallel applications
- HPC cluster != cluster of many workstations attached to a global file system
- they may serve as good machines for high-throughput computations (many independent but similar tasks)
- actual purpose: parallelized applications, which may simultaneously use multiple cores and/or nodes, and/or one or multiple GPU accelerators, in a single task
- Difficult part: actual parallelization of an application (i.e. algorithms), or parts of it:
- GPU applications (e.g. using CUDA)
- usage of GPU and CPU threads hidden in library functions (e.g. TensorFlow or PyTorch)
- distributed-memory applications (e.g. MPI)
- threaded shared-memory applications (e.g. pthreads, OpenMP)
- visit the zedif workshop and have a look at the slides
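For the distributed-memory (MPI) case, a minimal batch script could look like this sketch; the module version matches the module avail listing shown later, and the binary my_mpi_app is a placeholder:

```
#!/bin/bash
#SBATCH --job-name=mpi_example
#SBATCH --partition=standard
#SBATCH --nodes=2              # distributed-memory job across two nodes
#SBATCH --ntasks=96            # total number of MPI ranks
#SBATCH --time=02:00:00
#SBATCH --output=mpi_example.out.%j
#SBATCH --error=mpi_example.err.%j

module purge
module load mpi/openmpi/4.1.1  # version is an example

srun ./my_mpi_app              # my_mpi_app is a placeholder for your MPI binary
```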
7) GPU usage
7.1) GPU nodes
- compute nodes with additional GPU accelerators (e.g. NVIDIA A100)
- advantages:
- many more threads than usual compute nodes
- Example: NVIDIA A100
- features 19.5 teraflops of FP32 performance (vs. gigaflops per CPU core)
- 6912 CUDA cores, 40GB memory
- disadvantages:
- lightweight threads (cf OpenMP)
- limited GPU memory
- overhead cost due to offloading data to GPUs (via CPUs)
- requires sufficient work load for performance
7.2) Calculations on GPU nodes
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
#SBATCH --job-name=MNiST_beginner
#SBATCH --output=MNiST_beginner.out.%j
#SBATCH --error=MNiST_beginner.err.%j
# Load environment variables (unload all preexisting)
module purge
# Load conda environment with tensorflow
source /home/$USER/miniconda3/etc/profile.d/conda.sh
conda activate tensorflow-gpu
python3 MNiST_beginner.py
8) Software
8.1) General remarks
- all program installations are maintained, but are perhaps not the latest versions
- Dnf/Yum package manager of AlmaLinux8 (cf. RedHat8)
- Spack package manager or manual installation
- Singularity software containers:
- Docker is not available because of a conflict with (non-root) multi-user environments
- Singularity / Apptainer is a mature replacement on HPC clusters
8.2) Software environment modules
- modules provide paths and environment variables for each custom package
- list of available environment modules:
[userid@login1 ~]$ module avail
...
apps/alphafold/2.2.2 apps/guppy/5.0.7 apps/turbomole/7.6 mpi/openmpi/4.1.1
apps/bowtie2/2.4.2 apps/guppy/5.0.7-cpu compiler/gcc/10.4.0 nvidia/cuda/11.3
apps/cellranger/7.1.0 apps/mathematica/12.3 compiler/gcc/11.3.0 nvidia/cuda/11.7
- load and unload a module, e.g. a specific compiler version
[userid@login ~]$ module load compiler/gcc/12.2.0
[userid@login ~]$ module unload compiler/gcc/12.2.0
- What is the effect? It changes / sets environment variables
[userid@login1 ~]$ module show apps/mathematica/12.3
...
prepend-path PATH /cluster/apps/mathematica/12.3/Executables
setenv MATHEMATICA_BASE /cluster/apps/mathematica/12.3
setenv NVIDIA_DRIVER_LIBRARY_PATH /usr/lib64/libnvidia-tls.so.465.19.01
setenv CUDA_LIBRARY_PATH /usr/lib64/libcuda.so
- Switch, list and purge loaded modules
[userid@login1 ~]$ module load compiler/gcc/12.2.0 apps/mathematica/12.3
[userid@login1 ~]$ module switch compiler/gcc/12.2.0 compiler/gcc/11.3.0
[userid@login1 ~]$ module list
Currently Loaded Modulefiles:
1) compiler/gcc/11.3.0 2) apps/mathematica/12.3
[userid@login1 ~]$ module purge
[userid@login1 ~]$ module list
No Modulefiles Currently Loaded.
8.3) Bring your own software
- use configure/make/cmake...
- conda environments:
- either use conda, which comes with the "Intel Distribution of Python" of the Intel oneAPI AI Toolkit
- or install Miniconda3 in your private or group folder
[userid@login1 ~]$ wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
[userid@login1 ~]$ bash Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
...
install path: /home/userid/miniconda3/py310_23.3.1-0
- you should temporarily use /vast for the installation!
- you may want to prepend Intel's Anaconda channel for performance reasons
[userid@login1 ~]$ conda config --add channels intel
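After installing Miniconda, a typical next step is to create and activate a project environment; the environment name and package list below are examples (the name matches the one used in the GPU batch script earlier):

```
# Create and activate a project environment (name and packages are examples)
conda create -n tensorflow-gpu python=3.10
conda activate tensorflow-gpu
pip install tensorflow
# In batch jobs, initialize conda first, e.g.:
# source /home/$USER/miniconda3/etc/profile.d/conda.sh
```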
9) General remarks
- avoid submitting hundreds of short (few-minute) jobs:
- group tiny jobs into bigger jobs
- a good job runs between 3 h and 48 h
- try to avoid very long (many-day) jobs:
- nodes may fail; organize your workflows to be fail-safe
- organize the workflow into a set of concurrent or a chain of subsequent jobs
- choose the right number of cores / tasks per node, and mind I/O operations:
- a job's bottleneck is often the file system (many "I/O operations")
- independent jobs running on different cores compete for file access
- avoid hundreds of open/close file operations and only log what you need
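One way to group tiny jobs into bigger ones is a Slurm job array; this sketch assumes 100 numbered inputs and a hypothetical process_input.py:

```
#!/bin/bash
#SBATCH --job-name=grouped_tasks
#SBATCH --array=0-9            # 10 array tasks instead of 100 tiny jobs
#SBATCH --ntasks=1
#SBATCH --time=03:00:00
#SBATCH --output=grouped_tasks.out.%A_%a
#SBATCH --error=grouped_tasks.err.%A_%a

# Each array task processes a contiguous chunk of 10 inputs
CHUNK=10
START=$(( SLURM_ARRAY_TASK_ID * CHUNK ))
for i in $(seq ${START} $(( START + CHUNK - 1 ))); do
    python3 process_input.py --index "$i"   # process_input.py is a placeholder
done
```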
10) Where to start?
10.1) First hands-on
- get a FSU guest account
- get familiar with workflow / do's and don'ts...
- connect with ssh / sftp
- install Miniconda to /vast/$USER/ and create an environment
- think about your project and workflow: hardware and software requirements, bottlenecks
- upload a minimal example of your project (small dataset and model)
- write a bash script with respect to the do's and don'ts...
- submit and track the job
- check the output
- think about your project and workflow...
10.2) Further directions
- consider RDM and version control: documentation on input, output, log file, batch file
- familiarize yourself with job efficiency: start with fewer cores
- consider the I/O bottleneck: pre-process data if necessary
- you may need to select a node with certain features