Introduction to Slurm: The Job Scheduler
Slurm is the job scheduler used on Unity. The introduction explains what a job scheduler is; this page goes into more depth on some elements of the scheduler. Slurm has many features beyond the scope of this guide, but everything you need to know as a user should be covered here.
Use salloc interactive sessions to switch from a login node to a compute node.
Core Limits
Each lab currently has a limit of 1000 CPU cores and 64 GPUs, shared by all of its users.
When you try to go over this limit, you will be denied for MaxCpuPerAccount.
To check the resources currently in use by your PI group, use the unity-slurm-account-usage command.
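The arithmetic behind the limit can be sketched in a few lines of shell. This is only an illustration: fits_limit is a hypothetical helper, not a Unity or Slurm command, and the in-use figures are assumed to be read by hand from unity-slurm-account-usage, since its output format is not described here. Only the limit values (1000 CPU cores, 64 GPUs) come from the text above.

```shell
#!/bin/bash
# Per-lab limits as stated in this guide.
CORE_LIMIT=1000
GPU_LIMIT=64

# Hypothetical helper: fits_limit IN_USE REQUESTED LIMIT
# Succeeds (exit status 0) when in_use + requested stays within the limit.
fits_limit() {
    local in_use=$1 requested=$2 limit=$3
    [ $(( in_use + requested )) -le "$limit" ]
}

# Example: the lab already uses 990 cores and a new job asks for 16 more.
if fits_limit 990 16 "$CORE_LIMIT"; then
    echo "request fits under the core limit"
else
    echo "request would exceed the core limit"
fi
```

With the example numbers, 990 + 16 is 1006, so the script reports that the request would exceed the core limit.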
Partitions / Queues
Our cluster defines a number of Slurm partitions, also known as queues. You request a specific partition based on the resources your job needs. Find out which partition is best for your job here.
Jobs
A job is an operation which the user submits to the cluster to run under allocated resources.
There are two commands for this, salloc and sbatch. salloc is tied to your current terminal and lets you interact with your job. sbatch is not tied to your current session, so you can start it and walk away. If you want to interact with your job and still be able to walk away, you can use tmux to make a detachable session (see below).
SALLOC
An salloc job is tied to your ssh session. If you interrupt (Ctrl+C) or close your ssh session during an salloc job, the job will be killed.
You can also make an interactive job, which will allow your job to take input from your keyboard. You can run bash in an interactive job to resume your work on a compute node just as you would on a login node. This is highly recommended.
See SALLOC Jobs for more information.
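As a rough sketch of what an salloc request looks like, the snippet below builds the command line and prints it instead of running it, so it can be reviewed first. salloc_cmd is a hypothetical helper and the values are illustrative; -c (CPU cores per task) and -t (time limit, here in minutes) are standard salloc options.

```shell
#!/bin/bash
# Hypothetical helper: salloc_cmd CORES MINUTES
# Prints an salloc command line without executing it.
salloc_cmd() {
    local cores=$1 minutes=$2
    echo "salloc -c ${cores} -t ${minutes}"
}

# Request two cores for one hour (illustrative values).
salloc_cmd 2 60
```

Running the printed command on a login node starts the interactive job; remember that the job ends when the ssh session does.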
SBATCH
An sbatch job is submitted to the cluster with no information returned to the user other than a Job ID. An sbatch job will try to create a file in your current working directory that contains the results of your job.
See SBATCH Jobs for more information.
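A minimal batch script might look like the sketch below. The file name hello.sh and the resource values are assumptions chosen for illustration; -c, -t and -o are standard sbatch options, and SLURM_JOB_ID is an environment variable Slurm sets inside a running job.

```shell
#!/bin/bash
# Lines starting with #SBATCH are read by sbatch as options.
# One CPU core:
#SBATCH -c 1
# Ten-minute time limit:
#SBATCH -t 00:10:00
# Output file in the submission directory; %j expands to the Job ID:
#SBATCH -o slurm-%j.out

# Everything below runs on the compute node once the job starts.
node="$(uname -n)"
echo "Job ${SLURM_JOB_ID:-unknown} running on ${node}"
```

Saved as hello.sh, it would be submitted with sbatch hello.sh. Run directly with bash outside of Slurm, the #SBATCH lines are plain comments and the job ID prints as "unknown".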
TMUX SALLOC
The tmux command can be used to keep a session open even if your ssh connection drops. This can be useful on spotty Wi-Fi so you don’t lose your work.
tmux
# tmux session opens
salloc -c 1
# interactive job on compute node opens with one cpu core
sleep 3600; echo "done"
# interactive job will have blinking cursor for an hour
# > ctrl+b
# tmux keyboard-shortcut command mode opens
# > d
# tmux session detaches, back to login node
# at this point you can log off and log back in without killing the job
tmux ls
# print list of tmux sessions
# first number on the left (call it X) is needed to re-attach the session
tmux attach-session -t X
# back to interactive job
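The identifier X in the transcript above is the text before the first colon on each line that tmux ls prints. The sketch below pulls it out with cut; session_ids is a hypothetical helper and the sample lines are made up to match the usual shape of tmux ls output.

```shell
#!/bin/bash
# Hypothetical helper: reads `tmux ls`-style lines on standard input
# and prints the session identifier (the text before the first colon).
session_ids() {
    cut -d: -f1
}

# Sample line in the shape `tmux ls` prints; the date is invented.
sample='0: 1 windows (created Mon Jan  1 10:00:00 2024)'
printf '%s\n' "$sample" | session_ids

# On the cluster, the same helper would be used as:
#   tmux ls | session_ids
```

Each identifier printed this way can be passed to tmux attach-session -t to re-attach that session.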