SLURM job resource requests (CPU/memory/time) ignored on Galaxy + Azure CycleCloud setup

Greetings all!

I would really appreciate some help if anyone is able :D.

The setup:
I have set up a Galaxy instance within our Microsoft Azure subscription. Galaxy runs in a container (I wrote the image myself, but based it on the Dockerfile from bgruening/docker-galaxy on GitHub). In addition to that, we set up a CycleCloud instance (see "Overview - Azure CycleCloud" on Microsoft Learn) and, within CycleCloud, a SLURM cluster (see "Overview of Azure CycleCloud Workspace for Slurm" on Microsoft Learn).

I was able to connect Galaxy to this SLURM cluster: jobs launched from Galaxy are automatically submitted to SLURM via sbatch. These jobs run in containers using Apptainer (with images usually pulled from quay.io).

Everything works except that resource requests (CPU, memory, time) defined in the Galaxy job configuration or selected in the tool form are not being honored by SLURM.

My job_conf.yml looks like this:

runners:
  local:
    load: galaxy.jobs.runners.local:LocalJobRunner
    workers: 4
  slurm:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner

handling:
  processes:
    handler0:
    handler1:

execution:
  default: local
  environments:
    local:
      runner: local
      params: {}
    singularity_slurm_hpc:
      runner: slurm
      require_container: true
      params:
        submit_native_specification: >-
          --nodes=1
          --ntasks-per-node=1
          --partition=hpc
          --mem={{ memory | default(15) }}G
          --cpus-per-task={{ processors | default(4) }}
          --time={{ time | default(48) }}:00:00
      resources: all
      use_resource_params: true
      singularity_enabled: true
      singularity_volumes: $defaults,/galaxy
      singularity_run_extra_arguments: '--env APPTAINER_NO_SETGROUPS=1'
      singularity_cleanenv: true
      singularity_sudo: false
      singularity_default_container_id: docker://ubuntu:noble-20250404
      env:
        - name: LC_ALL
          value: C
        - name: APPTAINER_CACHEDIR
          value: /scratch/singularity/containercache
        - name: APPTAINER_TMPDIR
          value: /scratch/singularity/tmpdir
        - name: SINGULARITY_CACHEDIR
          value: /scratch/singularity/containercache
        - name: SINGULARITY_TMPDIR
          value: /scratch/singularity/tmpdir
        - file: /galaxy/.venv/bin/activate

tools:
  - id: minimap2
    destination: singularity_slurm_hpc
    resources: all
  - class: local
    environment: local

resources:
  default: default
  groups:
    default: []
    memoryonly: [memory]
    all: [processors, memory, time]

I also created a job_resource_params_conf.xml:

<parameters>
  <param label="CPUs" name="processors" type="integer" min="1" max="64" value="4" help="Number of CPU cores to allocate (SLURM: --cpus-per-task)" />
  <param label="Memory (GB)" name="memory" type="integer" min="1" max="256" value="15" help="Memory in GB (SLURM: --mem)" />
  <param label="Runtime (hours)" name="time" type="integer" min="1" max="4380" value="48" help="Job time limit in hours (SLURM: --time)" />
</parameters>

And a container_resolvers.yml (although I don’t think this is related to the issue):

- type: explicit_singularity
- type: explicit

The problem:
Despite configuring default and user-selectable resource parameters in job_conf.yml and job_resource_params_conf.xml, SLURM jobs always run with only 2 CPUs and 7.5GB RAM, instead of the requested 4 CPUs and 15GB RAM (or other manual settings).

The node used in the hpc partition has 8 vCPUs and 16GB RAM, so it's not oversubscribed, yet SLURM always seems to allocate half of the available resources (this comes from the configuration CycleCloud writes into slurm.conf).

But when I submit jobs manually using sbatch, from the scheduler node or from the Galaxy container (on a different VM), the job resource requests are honored correctly. So I don’t think the slurm.conf is blocking/overriding the requests.
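
(As a side note: to compare what was requested with what SLURM actually granted for a job submitted through Galaxy, a minimal check along these lines can be used. The job id is illustrative and would come from the Galaxy job's logs, and it assumes the SLURM client tools are available wherever the script runs.)

import subprocess

# Minimal sketch: print a few of the resources SLURM actually granted a job.
# scontrol only knows jobs still tracked by the controller; use sacct for finished jobs.
job_id = "12345"  # illustrative; the real id shows up in the Galaxy job logs

out = subprocess.run(
    ["scontrol", "show", "job", job_id],
    capture_output=True, text=True, check=True,
).stdout

# scontrol prints whitespace-separated key=value tokens; index them by key.
fields = dict(token.split("=", 1) for token in out.split() if "=" in token)
for key in ("Partition", "NumCPUs", "MinMemoryNode", "TimeLimit"):
    print(f"{key} = {fields.get(key)}")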

Question:
How can I get SLURM to actually use the resource requests from Galaxy? Are there Galaxy-side defaults I’m missing? Do I need to configure anything differently in SLURM or CycleCloud? Is there something I did wrong in the job configuration?

Any advice is appreciated!

Hi @JBoom

Let’s ask at the Admin Chat for help with your question. Feel free to join here too! :hammer_and_wrench: You're invited to talk on Matrix

XRef

Let’s start there! :slight_smile:


I’ve made a bit of progress with this issue. I replaced submit_native_specification in my job_conf.yml with the string nativeSpecification. Although the documentation I found mentioned this as a requirement for Sun Grid Engine, it appears to work for SLURM as well.

The next challenge is to actually pass resource values from job_resource_params_conf.xml into job_conf.yml. It seems that the {{ some_value }} or ${some_value} syntax is not being recognised.

Additionally, I had to make some adjustments to my files, as the memory value is interpreted in megabytes rather than gigabytes (a bare --mem value defaults to MB in SLURM).
To that end, here are my updated versions of:

job_conf.yml:

runners:
  local:
    load: galaxy.jobs.runners.local:LocalJobRunner
    workers: 4
  slurm:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner

handling:
  processes:
    handler0:
    handler1:

execution:
  default: local
  environments:
    local:
      runner: local
      params: {}
    singularity_slurm_hpc:
      runner: slurm
      require_container: true
      params:
        nativeSpecification: >-
          --nodes=1
          --partition=hpc
          --mem=${memory_mb}
          --cpus-per-task=${processors}
          --time=${time}:00:00
      resources: all
      use_resource_params: true
      singularity_enabled: true
      singularity_volumes: $defaults,/galaxy
      singularity_run_extra_arguments: '--env APPTAINER_NO_SETGROUPS=1'
      singularity_cleanenv: false
      singularity_sudo: false
      singularity_default_container_id: docker://ubuntu:noble-20250404
      env:
        - name: LC_ALL
          value: C
        - name: APPTAINER_CACHEDIR
          value: /scratch/singularity/containercache
        - name: APPTAINER_TMPDIR
          value: /scratch/singularity/tmpdir
        - name: SINGULARITY_CACHEDIR
          value: /scratch/singularity/containercache
        - name: SINGULARITY_TMPDIR
          value: /scratch/singularity/tmpdir
        - file: /galaxy/.venv/bin/activate

tools:
  - id: minimap2
    destination: singularity_slurm_hpc
    resources: all

resources:
  default: default
  groups:
    default: []
    all: [processors, memory_mb, time]

job_resource_params_conf.xml:

<parameters>
  <param label="CPUs" name="processors" type="integer" min="1" max="96" value="4" help="Number of CPU cores to allocate (SLURM: --cpus-per-task)" />
  <param label="Memory (MB)" name="memory_mb" type="integer" min="1" max="660000" value="15000" help="Memory in MB (SLURM: --mem)" />
  <param label="Runtime (hours)" name="time" type="integer" min="1" max="4380" value="48" help="Job time limit in hours (SLURM: --time)" />
</parameters>

Any guidance on this would be greatly appreciated!

Hi @JBoom

Glad you are making progress!

I’m not sure whether that notation gets passed through or not. I pinged the Admin chat to see if they have some feedback. :hammer_and_wrench:


My first try would be to check whether it works with an XML job_conf.


Thank you for the suggestion, Bernt! I’ve switched my job_conf to XML format, which, interestingly, resolved a completely different problem I’d been having.

That earlier issue was that Galaxy wasn’t reading or honouring the <requirements> section in tool XML files:

<requirements>
  <container type="docker">some-container-address</container>
</requirements>

However, I still can’t seem to pass the resources that users specify in a tool’s run form, or the default values I’ve set in job_resource_params_conf.xml.
I’m beginning to wonder whether I should even be handling that through job_conf.xml at all, or if it ought to be done elsewhere.

<?xml version="1.0"?>

<job_conf>
  <plugins workers="4">
    <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
    <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
  </plugins>

  <handlers>
    <handler id="handler0"/>
    <handler id="handler1"/>
  </handlers>

  <destinations default="local">
    <destination id="local" runner="local"/>
    <destination id="singularity_slurm_hpc" runner="slurm">
      <param id="singularity_enabled">true</param>
      <param id="singularity_volumes">$defaults,/galaxy</param>
      <param id="singularity_run_extra_arguments">--env APPTAINER_NO_SETGROUPS=1</param>
      <param id="singularity_cleanenv">false</param>
      <param id="singularity_sudo">false</param>
      <param id="singularity_default_container_id">docker://ubuntu:noble-20250404</param>
      <param id="require_container">true</param>
      <param id="nativeSpecification">--nodes=1 --partition=hpc --mem={memory_mb} --cpus-per-task={processors} --time={time}:00:00</param>
      <param id="resources">all</param>
      <param id="use_resource_params">true</param>
      <env name="LC_ALL">C</env>
      <env name="APPTAINER_CACHEDIR">/scratch/singularity/containercache</env>
      <env name="APPTAINER_TMPDIR">/scratch/singularity/tmpdir</env>
      <env name="SINGULARITY_CACHEDIR">/scratch/singularity/containercache</env>
      <env name="SINGULARITY_TMPDIR">/scratch/singularity/tmpdir</env>
      <env file="/galaxy/.venv/bin/activate"/>
    </destination>
  </destinations>

  <resources default="default">
    <group id="default"></group>
    <group id="all">processors,memory_mb,time</group>
  </resources>

  <tools>
    <tool id="minimap2" destination="singularity_slurm_hpc" resources="all"/>
    <tool id="flye" destination="singularity_slurm_hpc" resources="all"/>
  </tools>
</job_conf>

Neither ${memory_mb} nor {memory_mb} is accepted; the placeholder reaches the DRMAA runner as a literal string:

Traceback (most recent call last):
  File "/galaxy/lib/galaxy/jobs/runners/drmaa.py", line 188, in queue_job
    external_job_id = self.ds.run_job(**jt)
  File "/galaxy/.venv/lib/python3.10/site-packages/pulsar/managers/util/drmaa/__init>
    return DrmaaSession.session.runJob(template)
  File "/galaxy/.venv/lib/python3.10/site-packages/drmaa/session.py", line 314, in r>
    c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
  File "/galaxy/.venv/lib/python3.10/site-packages/drmaa/helpers.py", line 302, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/galaxy/.venv/lib/python3.10/site-packages/drmaa/errors.py", line 151, in er>
    raise _ERRORS[code - 1](error_string)
drmaa.errors.InvalidArgumentException: code 4: not an number: {memory_mb}

Is this injection of parameters from the job_resource_params_conf.xml into the job_conf.xml supported by default in Galaxy? Or do I need to write some custom code to support this?

Does anyone have experience with this? I would really appreciate the help!

Problem solved! I had simply not read a crucial piece of documentation.
The post that helped me figure it out: https://biostar.galaxyproject.org/p/15058/index.html

I had to configure a dynamic rule for the SLURM destination, which is what allows the parameters from job_resource_params_conf.xml to be picked up and turned into the native specification.

For reference, in case anyone runs into the same issue and comes across this post:

My job_conf.xml:

<?xml version="1.0"?>

<job_conf>
  <plugins workers="4">
    <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
    <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
    <plugin id="dynamic" type="runner">
      <param id="rules_module">galaxy.jobs.rules</param>
    </plugin>
  </plugins>

  <handlers assign_with="db-skip-locked">
    <handler id="handler0"/>
    <handler id="handler1"/>
  </handlers>

  <destinations default="local">
    <destination id="local" runner="local"/>
    <destination id="cyclecloud_slurm" runner="dynamic">
      <param id="type">python</param>
      <param id="function">tool_wrapper</param>
      <env name="LC_ALL">C</env>
      <env name="APPTAINER_CACHEDIR">/scratch/singularity/containercache</env>
      <env name="APPTAINER_TMPDIR">/scratch/singularity/tmpdir</env>
      <env name="SINGULARITY_CACHEDIR">/scratch/singularity/containercache</env>
      <env name="SINGULARITY_TMPDIR">/scratch/singularity/tmpdir</env>
      <env file="/galaxy/.venv/bin/activate"/>
    </destination>
  </destinations>

  <resources default="default">
    <group id="default"></group>
    <group id="all">processors,memory_mb,time</group>
  </resources>

  <tools>
    <tool id="minimap2" destination="cyclecloud_slurm" resources="all"/>
    <tool id="kraken2_inspect" destination="cyclecloud_slurm" resources="all"/>
    <tool id="kraken2_classify" destination="cyclecloud_slurm" resources="all"/>
    <tool id="flye" destination="cyclecloud_slurm" resources="all"/>
    <tool id="extract_kraken_reads" destination="cyclecloud_slurm" resources="all"/>
  </tools>
</job_conf>

My job_resource_params_conf.xml:

<parameters>
  <param label="CPUs" name="processors" type="integer" min="1" max="96" value="4" help="Number of CPU cores to allocate (SLURM: --cpus-per-task)" />
  <param label="Memory (MB)" name="memory_mb" type="integer" min="1" max="660000" value="15000" help="Memory in MB (SLURM: --mem)" />
  <param label="Runtime (hours)" name="time" type="integer" min="1" max="4380" value="48" help="Job time limit in hours (SLURM: --time)" />
</parameters>

And then the Python script that I put in /galaxy/lib/galaxy/jobs/rules/ (I believe the script can have any name; it is the function name that matters):

#!/usr/bin/env python3

# Imports.
import logging
from galaxy.jobs import JobDestination

# Log to galaxy's logger.
log = logging.getLogger(__name__)

# Does a lot more logging when set to true.
verbose = True

def tool_wrapper(app, job, user_email, resource_params, tool_id):
    # Retrieve user specified resources or set default values if user doesn't
    # set anything.
    if tool_id == "kraken2_inspect" or tool_id == "kraken2_classify":
        processors = int(resource_params.get("processors", 16))
        memory_mb = int(resource_params.get("memory_mb", 340000))
    else:
        processors = int(resource_params.get("processors", 4))
        memory_mb = int(resource_params.get("memory_mb", 15000))

    # Set the time limit for the job, either based on user input or the
    # default of 72 hours.
    time_str = f"{int(resource_params.get('time', 72))}:00:00"

    # SLURM nativeSpecification with injection of dynamic resource parameters.
    native_spec = (
        f"--nodes=1 --partition=hpc "
        f"--mem={memory_mb} "
        f"--cpus-per-task={processors} "
        f"--time={time_str}"
    )

    # Standard apptainer slurm parameters.
    params = {
        "singularity_enabled": "true",
        "singularity_volumes": "$defaults,/galaxy",
        "singularity_run_extra_arguments": "--env APPTAINER_NO_SETGROUPS=1",
        "singularity_cleanenv": "false",
        "singularity_sudo": "false",
        "singularity_default_container_id": "docker://ubuntu:noble-20250404",
        "require_container": "true",
        "use_resource_params": "true",
        "nativeSpecification": native_spec,
        "resources": "all",
    }

    if verbose:
        log.info(
            f"Tool: {tool_id}, CPUs: {processors}, "
            f"Mem: {memory_mb}MB, Time: {time_str}"
        )

    # Return JobDestination with the params as dict.
    return JobDestination(id="cyclecloud_slurm", runner="slurm", params=params)
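
(If it helps anyone adapting this: the per-tool defaults could also be kept in a small lookup table instead of the if/else, so adding a tool only means adding an entry. This is just a sketch of a possible variation, reusing the tool ids and values from the rule above.)

# Possible variation: per-tool defaults as data instead of branches.
TOOL_DEFAULTS = {
    "kraken2_inspect": {"processors": 16, "memory_mb": 340000},
    "kraken2_classify": {"processors": 16, "memory_mb": 340000},
}
FALLBACK = {"processors": 4, "memory_mb": 15000}

def defaults_for(tool_id):
    # Tools without a specific entry fall back to the generic values.
    return TOOL_DEFAULTS.get(tool_id, FALLBACK)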

Thank you for your help!
