Using SBATCH to Submit Jobs
sbatch is a non-blocking command: it returns as soon as the job is submitted rather than waiting for the job to run. Even if the requested resources are not currently available, the job is placed in the queue and will start once resources become available. You can check a job's status with squeue while it is pending or running, and with sacct at any time.
squeue --me
sacct -j YOUR_JOBID
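If you want more detail than the default sacct output, you can request specific fields with the --format option (YOUR_JOBID below is a placeholder, as above):
sacct -j YOUR_JOBID --format=JobID,JobName,Partition,State,Elapsed,MaxRSS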
sbatch is built around running a single batch file. In most cases you should not need to pass any parameters on the command line other than sbatch <batch file>, because every parameter can be specified inside the file itself.
The following is an example of a batch script. Note that the script must start with #!/bin/bash (or whichever interpreter you need; if you are not sure, use bash), immediately followed by #SBATCH <param> lines. An example of common SBATCH parameters in a simple script is below; this script allocates 4 CPU cores and one GPU in the gpu partition.
#!/bin/bash
#SBATCH -c 4 # Number of Cores per Task
#SBATCH --mem=8192 # Requested Memory (in MB)
#SBATCH -p gpu # Partition
#SBATCH -G 1 # Number of GPUs
#SBATCH -t 01:00:00 # Job time limit
#SBATCH -o slurm-%j.out # %j = job ID
module load cuda/10.1.243
/modules/apps/cuda/10.1.243/samples/bin/x86_64/linux/release/deviceQuery
This script queries the available GPUs and should report exactly one device in the specified output file, since only one GPU was requested. Feel free to remove or modify any of the parameters in the script to suit your needs.
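For example, if you saved the script above as gpu_example.sh (a name chosen here for illustration), you could submit it and watch its status like this:
sbatch gpu_example.sh   # prints "Submitted batch job <jobid>"
squeue --me             # shows whether the job is pending or running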
Slurm can send you email based on the status of your job via the --mail-type argument. Common mail types are BEGIN, END, FAIL, INVALID_DEPEND, and REQUEUE; see the sbatch man page for the full list.
Example (test that email is working for you):
salloc --mail-type=BEGIN /bin/true
or, as a batch script:
#!/bin/bash
#SBATCH --mail-type=BEGIN
/bin/true
There is also the --mail-user argument, but it is optional: by default, our mail server sends job email to the address you used to register your Unity account.
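If you do want the mail to go to a different address, you can set it explicitly in your batch script (the address below is only a placeholder):
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_address@example.com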
Time Limit Email - Preventing Loss of Work
When your job reaches its time limit, it will be killed, even if it’s 99% of the way through its task. Without checkpointing, all those CPU hours will be for nothing and you will have to schedule the job all over again.
One way to prevent this is to check on your job's output as it approaches its time limit. If you specify --mail-type=TIME_LIMIT_80, Slurm will email you once 80% of the time limit has passed and your job is still running. You can then check the job's output and judge whether it will finish in time. If you think it will not, email us at hpc@umass.edu and we can extend your job's time limit.
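As a sketch, a job with a four-hour limit would then email you after roughly 3 hours 12 minutes of run time (the workload script below is a placeholder for your actual commands):
#!/bin/bash
#SBATCH -t 04:00:00                 # Job time limit
#SBATCH --mail-type=TIME_LIMIT_80   # Email when 80% of the time limit has passed
./long_running_task.sh              # placeholder for your actual workload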