Greetings all!
I would really appreciate some help if anyone is able :D.
The setup:
I have set up a Galaxy instance within our Microsoft Azure subscription. Galaxy runs in a container (I wrote the image myself, basing it on the Dockerfile from GitHub - bgruening/docker-galaxy: Docker Images tracking the stable Galaxy releases). In addition, we set up a CycleCloud instance (Overview - Azure CycleCloud | Microsoft Learn) and, within CycleCloud, a SLURM cluster (Overview of Azure CycleCloud Workspace for Slurm - Azure CycleCloud | Microsoft Learn).
I was able to connect Galaxy to this SLURM cluster: jobs launched from Galaxy are automatically submitted to SLURM via sbatch, and they run in containers using Apptainer (images usually pulled from quay.io).
Everything works except that the resource requests (CPU, memory, time) defined in the Galaxy job configuration or selected in the tool form are not honored by SLURM.
My job_conf.yml looks like this:
```yaml
runners:
  local:
    load: galaxy.jobs.runners.local:LocalJobRunner
    workers: 4
  slurm:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner

handling:
  processes:
    handler0:
    handler1:

execution:
  default: local
  environments:
    local:
      runner: local
      params: {}
    singularity_slurm_hpc:
      runner: slurm
      require_container: true
      params:
        submit_native_specification: >-
          --nodes=1
          --ntasks-per-node=1
          --partition=hpc
          --mem={{ memory | default(15) }}G
          --cpus-per-task={{ processors | default(4) }}
          --time={{ time | default(48) }}:00:00
      resources: all
      use_resource_params: true
      singularity_enabled: true
      singularity_volumes: $defaults,/galaxy
      singularity_run_extra_arguments: '--env APPTAINER_NO_SETGROUPS=1'
      singularity_cleanenv: true
      singularity_sudo: false
      singularity_default_container_id: docker://ubuntu:noble-20250404
      env:
        - name: LC_ALL
          value: C
        - name: APPTAINER_CACHEDIR
          value: /scratch/singularity/containercache
        - name: APPTAINER_TMPDIR
          value: /scratch/singularity/tmpdir
        - name: SINGULARITY_CACHEDIR
          value: /scratch/singularity/containercache
        - name: SINGULARITY_TMPDIR
          value: /scratch/singularity/tmpdir
        - file: /galaxy/.venv/bin/activate

tools:
  - id: minimap2
    destination: singularity_slurm_hpc
    resources: all
  - class: local
    environment: local

resources:
  default: default
  groups:
    default: []
    memoryonly: [memory]
    all: [processors, memory, time]
```
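One thing I am unsure about: the `{{ memory | default(15) }}`-style templating in `submit_native_specification` is something I put together myself, and I don't know whether Galaxy evaluates it at all. For comparison, the sample job configurations I have seen pass a static `nativeSpecification` to the DRMAA-based SLURM runner. A sketch of that variant, with my default values hard-coded (not something I have verified on this cluster):

```yaml
singularity_slurm_hpc:
  runner: slurm
  require_container: true
  params:
    # static variant using the param name from the sample configs
    nativeSpecification: >-
      --nodes=1
      --ntasks-per-node=1
      --partition=hpc
      --mem=15G
      --cpus-per-task=4
      --time=48:00:00
```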
I also created a job_resource_params_conf.xml:
```xml
<parameters>
  <param label="CPUs" name="processors" type="integer" min="1" max="64" value="4" help="Number of CPU cores to allocate (SLURM: --cpus-per-task)" />
  <param label="Memory (GB)" name="memory" type="integer" min="1" max="256" value="15" help="Memory in GB (SLURM: --mem)" />
  <param label="Runtime (hours)" name="time" type="integer" min="1" max="4380" value="48" help="Job time limit in hours (SLURM: --time)" />
</parameters>
```
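Both files are wired up in galaxy.yml, and the resource selectors do show up in the tool form, so I believe that part is working. As far as I know these are the standard setting names (paths are illustrative for my layout):

```yaml
galaxy:
  # paths below are illustrative; the setting names should be the standard ones
  job_config_file: /galaxy/config/job_conf.yml
  job_resource_params_file: /galaxy/config/job_resource_params_conf.xml
```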
And a container_resolvers.yml (although I don't think this is related to the issue):
```yaml
- type: explicit_singularity
- type: explicit
```
The problem:
Despite the default and user-selectable resource parameters configured in job_conf.yml and job_resource_params_conf.xml, SLURM jobs always run with only 2 CPUs and 7.5 GB RAM, instead of the requested 4 CPUs and 15 GB RAM (or whatever values I select manually).
The node in the hpc partition has 8 vCPUs and 16 GB RAM, so it's not oversubscribed, yet SLURM always allocates half of the available resources (which matches the defaults CycleCloud wrote into slurm.conf).
However, when I submit jobs manually with sbatch, whether from the scheduler node or from the Galaxy container (on a different VM), the resource requests are honored correctly. So I don't think slurm.conf is blocking or overriding the requests.
Question:
How can I get SLURM to actually use the resource requests coming from Galaxy? Are there Galaxy-side defaults I'm missing? Do I need to configure anything differently in SLURM or CycleCloud? Or is there something wrong in my job configuration?
Any advice is appreciated!