Slurm job scheduler — Niflheim 2.0 documentation (2024)

Jump to our top-level Slurm page: Slurm batch queueing system

Prerequisites

Before configuring the Multifactor_Priority_Plugin scheduler, you must first configure Slurm accounting.

Scheduler configuration

The SchedulerType configuration parameter controls how queued jobs are executed, see the Scheduling_Configuration_Guide.

SchedulerType options are sched/backfill, which performs backfill scheduling, and sched/builtin, which attempts to schedule jobs in a strict priority order within each partition/queue.

There is also a SchedulerParameters configuration parameter which can specify a wide range of parameters as described below. This first set of parameters applies to all scheduling configurations. See the slurm.conf man page for more details (an example combining some of these options is sketched after the list):

  • default_queue_depth=# - Specifies the number of jobs to consider for scheduling on each event that may result in a job being scheduled. Default value is 100 jobs. Since this happens frequently, a relatively small number is generally best.

  • defer - Do not attempt to schedule jobs individually at submit time. Can be useful for high-throughput computing.

  • max_switch_wait=# - Specifies the maximum time a job can wait for the desired number of leaf switches. Default value is 300 seconds.

  • partition_job_depth=# - Specifies how many jobs are tested in any single partition; the default value is 0 (no limit).

  • sched_interval=# - Specifies how frequently, in seconds, the main scheduling loop will execute and test all pending jobs. The default value is 60 seconds.
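
As an illustration only (the values are placeholders, not a recommendation), several of these options can be combined on a single SchedulerParameters line in slurm.conf:

SchedulerParameters=default_queue_depth=100,defer,max_switch_wait=300,partition_job_depth=0,sched_interval=60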

Backfill scheduler

We use the backfill scheduler in slurm.conf:

SchedulerType=sched/backfill
SchedulerParameters=kill_invalid_depend,defer,bf_continue

but there are some backfill parameters that should be considered (see slurm.conf), for example:

...bf_interval=60,bf_max_job_start=20,bf_resolution=600,bf_window=11000

The importance of bf_window is explained as:

  • The default value is 1440 minutes (one day). A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation. In order to limit the amount of data managed by the backfill scheduler, if the value of bf_window is increased, then it is generally advisable to also increase bf_resolution.

So you must configure bf_window according to the longest MaxTime of any partition in slurm.conf:

PartitionName= ... MaxTime=XXX
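
One way to review the configured time limits (a quick sketch; bf_window is given in minutes, so for example a MaxTime of 7 days would call for bf_window=10080 or larger):

scontrol show partition | grep MaxTime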

scontrol top command

The scontrol top job_list command is documented as:

Move the specified job IDs to the top of the queue of jobs belonging to the identical user ID, partition name, account, and QOS. The job_list argument is a comma separated ordered list of job IDs. Any job not matching all of those fields will not be effected. Only jobs submitted to a single partition will be effected. This operation changes the order of jobs by adjusting job nice values. The net effect on that user's throughput will be negligible to slightly negative. This operation is disabled by default for non-privileged (non-operator, admin, SlurmUser, or root) users. This operation may be enabled for non-privileged users by the system administrator by including the option "enable_user_top" in the SchedulerParameters configuration parameter.

While scontrol top job_list may be useful for the superuser to help with user requests, it is not recommended to configure SchedulerParameters=enable_user_top. The Slurm 17.11 news page (https://slurm.schedmd.com/news.html) highlights this change:

Regular user use of "scontrol top" command is now disabled. Use the configuration parameter "SchedulerParameters=enable_user_top" to enable that functionality. The configuration parameter "SchedulerParameters=disable_user_top" will be silently ignored.

There does not seem to be any documentation of why scontrol top job_list is discouraged, but we have observed the following bad side effect (a way to detect it is sketched after the list):

  • A user requests a high priority for a job, and the superuser grants a negative nice value with scontrol update jobid=10208 nice=-10000.

  • The user can then assign a negative nice value to their other jobs with scontrol top jobid=10209,10210, thereby jumping ahead of normal jobs in the queue.
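
To detect such adjustments, the nice values of pending jobs can be listed; a minimal sketch using squeue format fields (%y prints the nice value, %Q the priority):

squeue -t PENDING -o "%.10i %.12u %.8y %.12Q"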

Preemption of jobs by high-priority jobs

Slurm supports job preemption, the act of stopping one or more “low-priority” jobs to let a “high-priority” job run. Job preemption is implemented as a variation of Slurm’s Gang Scheduling logic. When a high-priority job has been allocated resources that have already been allocated to one or more low-priority jobs, the low-priority job(s) are preempted. The low-priority job(s) can resume once the high-priority job completes. Alternately, the low-priority job(s) can be requeued and started using other resources if so configured in newer versions of Slurm.

Preemption is configured in slurm.conf.
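
As an illustration only (the partition names are placeholders; see the Slurm Preemption documentation for the full set of options), a partition-priority based setup in slurm.conf might look like:

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=low_prio ... PriorityTier=1
PartitionName=high_prio ... PriorityTier=10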

Multifactor Priority Plugin scheduler

A sophisticated Multifactor_Priority_Plugin provides a very versatile facility for ordering the queue of jobs waiting to be scheduled. See the PriorityXXX parameters in the slurm.conf file.

Multifactor configuration

Fairshare is configured with the PriorityXXX parameters described in the Configuration section of the Multifactor_Priority_Plugin page, and also documented in the slurm.conf page:

  • PriorityType

  • PriorityDecayHalfLife

  • PriorityCalcPeriod

  • PriorityUsageResetPeriod

  • PriorityFavorSmall

  • PriorityMaxAge

  • PriorityWeightAge

  • PriorityWeightFairshare

  • PriorityWeightJobSize

  • PriorityWeightPartition

  • PriorityWeightQOS

  • PriorityWeightTRES

An example slurm.conf fairshare configuration may be:

PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityFavorSmall=NO
PriorityMaxAge=10-0
PriorityWeightAge=100000
PriorityWeightFairshare=1000000
PriorityWeightJobSize=100000
PriorityWeightPartition=100000
PriorityWeightQOS=100000
PropagateResourceLimitsExcept=MEMLOCK
PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
AccountingStorageEnforce=associations,limits,qos,safe

PriorityWeightXXX values are all 32-bit integers. The final Job Priority is a 32-bit integer.

IMPORTANT: Set the PriorityWeightXXX values high in order to generate a wide range of job priorities.
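
For reference, the Multifactor_Priority_Plugin page expresses the resulting job priority as, roughly, a weighted sum of normalized factors (the association, site and TRES terms are omitted in this sketch). Each factor is a floating point number between 0.0 and 1.0, which is why large PriorityWeightXXX values are needed to spread the resulting priorities:

Job_priority =
    (PriorityWeightAge)       * (age_factor) +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize)   * (job_size_factor) +
    (PriorityWeightPartition) * (partition_factor) +
    (PriorityWeightQOS)       * (QOS_factor)
    - nice_factor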

Quality of Service (QOS)

One can specify a Quality of Service (QOS) for each job submitted to Slurm. A description and example are in the QOS page. Example QOS configurations are:

sacctmgr modify qos normal set priority=50
sacctmgr add qos high
sacctmgr modify qos high set priority=100

For example, list the defined QOS values and their priorities with:

sacctmgr show qos format=name,priority

To enforce user jobs to have a QOS you must (at least) have:

AccountingStorageEnforce=qos

see the slurm.conf and Resource_Limits documents. The AccountingStorageEnforce options include:

  • associations - This will prevent users from running jobs if their association is not in the database. This option will prevent users from accessing invalid accounts.

  • limits - This will enforce limits set to associations. By setting this option, the ‘associations’ option is also set.

  • qos - This will require all jobs to specify (either overtly or by default) a valid qos (Quality of Service). QOS values are defined for each association in the database. By setting this option, the ‘associations’ option is also set.

  • safe - limits and associations will automatically be set.

The Quality of Service (QOS) Factor is defined in the Multifactor_Priority_Plugin page as:

Each QOS can be assigned an integer priority. The larger the number, the greater the job priority will be for jobs that request this QOS. This priority value is then normalized to the highest priority of all the QOS's to become the QOS factor.

A non-zero weight must be defined in slurm.conf, for example:

PriorityWeightQOS=100000
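
Users (or accounts) must also be granted access to a QOS before they can request it; a possible sketch (the user name is just a placeholder):

sacctmgr modify user where name=user123 set QOS=normal,high DefaultQOS=normal
sacctmgr show assoc where user=user123 format=account,user,qos,defaultqos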

Resource Limits

To enable any limit enforcement you must at least have:

AccountingStorageEnforce=limits

in your slurm.conf; otherwise, even if you have limits set, they will not be enforced. Other options for AccountingStorageEnforce and the explanation for each are found in the Resource_Limits document.
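
The limits themselves are stored on associations in the Slurm database; a minimal sketch of setting and checking a couple of them (the account name and values are placeholders):

sacctmgr modify account where name=some_account set GrpTRES=cpu=1000 MaxJobs=200
sacctmgr show assoc where account=some_account format=account,user,grptres,maxjobs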

Limiting (throttling) jobs in the queue

It is desirable to prevent individual users from flooding the queue with jobs: if idle nodes are available, all of those jobs may start at once and block future jobs by other users. Note:

With Slurm it appears that the only way to achieve user job throttling is the following:

  • Using the GrpTRESRunMins parameter defined in the Resource_Limits document. See also the TRES definition.

  • The GrpTRESRunMins limits can be applied to associations (accounts or users) as well as to a QOS. Set the limit as follows; a way to inspect the configured limits is sketched after this list:

    sacctmgr modify association where name=XXX set GrpTRESRunMin=cpu=1000000   # For an account/user association
    sacctmgr modify qos where name=some_QOS set GrpTRESRunMin=cpu=1000000      # For a QOS
    sacctmgr modify qos where name=some_QOS set MaxTRESPU=cpu=1000             # QOS Max TRES per user
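
The configured throttling limits can be inspected afterwards, for example:

sacctmgr show assoc format=account,user,grptresrunmins
sacctmgr show qos format=name,grptresrunmins,maxtrespu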

Partition factor priority

If some partition XXX (for example big memory nodes) should have a higher priority, this is explained in Multifactor_Priority_Plugin by:

(PriorityWeightPartition) * (partition_factor) +

The Partition factor is controlled in slurm.conf, for example:

PartitionName=XXX ... PriorityJobFactor=10
PriorityWeightPartition=1000

Scheduling commands

View scheduling information for the Multifactor_Priority_Plugin by the commands:

  • sprio - view the factors that comprise a job’s scheduling priority:

    sprio      # List job priorities
    sprio -l   # List job priorities including username etc.
    sprio -w   # List weight factors used by the multifactor scheduler
  • sshare - Tool for listing the shares of associations to a cluster:

    sshare
    sshare -l  # Long listing with additional information
    sshare -a  # Listing with also user information
  • sdiag - Scheduling diagnostic tool for Slurm
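
The sdiag output includes statistics of the main and backfill scheduling cycles; a brief usage sketch:

sdiag           # Show scheduling statistics, including backfill cycle data
sdiag --reset   # Reset the scheduling statistics counters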

FAQs

What is the queue limit for slurm jobs?

You can access a maximum of 256 cores per queue at a time. Any subsequent jobs will queue until your usage allows for more to start running.

How many jobs can Slurm handle?

There is also a maximum Slurm job limit on the HPC cluster; when that value is reached, no further jobs can be submitted to the cluster. The Slurm job scheduler can handle a maximum of 30,000 jobs before further submissions are refused.

What are the scheduler options in slurm?

SchedulerType options are sched/backfill, which performs backfill scheduling, and sched/builtin, which attempts to schedule jobs in a strict priority order within each partition/queue.

What is a job step in slurm?

Job steps describe individual tasks that are executed within a job. Most often a single job needs to execute several individual computations to complete. Each such partial execution within a job is called a job step. You can launch a job step with the Slurm command srun.
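
A minimal sketch of a batch script with two job steps (script contents and resource values are only illustrative):

#!/bin/bash
#SBATCH --job-name=steps_demo
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

srun --ntasks=4 hostname      # job step 1: run on all allocated tasks
srun --ntasks=1 echo done     # job step 2: run on a single task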

What is the max RAM usage for slurm?

Slurm imposes a memory limit on each job. By default it is deliberately small: 2 GB per node. If your job uses more than that, you'll get an error that your job Exceeded job memory limit.

Why did SLURM job fail?

The job you ran tried using more memory than what was defined in your submission script. As a result, Slurm automatically killed your job. A simple fix is to increase the amount of memory dedicated to your job, using --mem on the command line or #SBATCH --mem in your submission script.
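
For example (the memory size and script name are only placeholders):

#SBATCH --mem=8G               # in the submission script: request 8 GB per node
sbatch --mem=8G jobscript.sh   # or override on the command line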

Is SLURM used in industry?

The basis of Slurm is to allocate resources, manage pending work, and execute jobs, but it's the details of Slurm's architecture that make it the leading work management system in a number of industry trends.

What does SLURM stand for?

SLURM is a queue management system and stands for Simple Linux Utility for Resource Management.

What is the difference between SLURM scheduler and Kubernetes?

kube-scheduler vs Slurm

Slurm is the go-to scheduler for managing the distributed, batch-oriented workloads typical for HPC. kube-scheduler is the go-to for the management of flexible, containerized workloads and microservices. Slurm is a strong candidate due to its ability to integrate with common frameworks.

How do you check how many jobs are running in Slurm?

Information on all running and pending batch jobs managed by Slurm can be obtained from the Slurm command squeue. Note that information on completed jobs is only retained for a limited period. Information on jobs that ran in the past is available via sacct.
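
For example (the start date is only a placeholder):

squeue -u $USER                        # your running and pending jobs
sacct -u $USER --starttime=2024-01-01  # accounting records of completed jobs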

What is the difference between job and task in SLURM?

Slurm encapsulates resources using the idea of jobs and tasks. A job can span multiple compute nodes and is the sum of all task resources. Tasks are a subset of resources in a job and can only exist on a single compute node.

How do you end a job on Slurm?

The normal method to kill a Slurm job is:
  1. $ scancel <jobid>       # cancel a job by its ID
  2. $ squeue -u $USER       # list your jobs to find the ID
  3. $ scancel 1234567       # example: cancel job 1234567

What does PD mean in Slurm?

The "R" in the ST column means that the job is running. You may also see Pending "PD" which means that the job is awaiting resource allocation or Configuring "CF" which means that resources have been allocated but are waiting for them to become ready for use.

How long is the Slurm job?

If the time limit is not specified in the submit script, SLURM will assign the default run time, 3 days. This means the job will be terminated by SLURM in 72 hrs. The maximum allowed run time is two weeks, 14-0:00.

What is the max job number in slurm?

The MaxJobCount parameter sets the maximum number of jobs Slurm can have in its active database at one time. Set the values of MaxJobCount and MinJobAge to ensure that the slurmctld daemon does not exhaust its memory or other resources. Once this limit is reached, requests to submit additional jobs will fail. The default value is 10000 jobs.
