Galaxy Job Execution Error with SLURM Setup

Dear Galaxy Team,

I am setting up Galaxy v25 on my local VM in a shared HPC environment, and I have encountered an error related to the SLURM setup in my job_conf.yml. Below are the relevant excerpts and error messages:

job_conf.yml snippet:

execution:
...
    bigmem:
      runner: slurm
      native_specification: '--mem-per-cpu=256000'

tools:
- id: plr_tool_2
  handler: special_handlers
  environment: bigmem
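
For reference, a minimal sketch of how I understand the full nesting (the runners block and the local_env name are placeholders for the parts elided above, and the handling section that defines special_handlers is omitted):

runners:
  local:
    load: galaxy.jobs.runners.local:LocalJobRunner
  slurm:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner

execution:
  default: local_env
  environments:
    local_env:
      runner: local
    bigmem:
      runner: slurm
      native_specification: '--mem-per-cpu=256000'

tools:
- id: plr_tool_2
  handler: special_handlers
  environment: bigmem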

Error output:

/galaxy/lib/galaxy/jobs/__init__.py:1781: SAWarning: TypeDecorator MutableJSONType() will not produce a cache key because the ``cache_ok`` attribute is not set to True. This can have significant performance implications including some performance degradations in comparison to prior SQLAlchemy versions. Set this attribute to True if this type object's state is safe to use in a cache key, or False to disable this warning. (Background on this warning at: ) result = self.sa_session.execute(update_stmt)

python: error: Invalid DebugFlag: AuditRPCs
python: error: DebugFlags invalid: AuditRPCs,Cgroup
python: error: Unable to establish controller machine

Observations:

  • Other tools (e.g., secondary_structure) run fine under the local environment.

  • The error seems specific to SLURM configuration and/or the bigmem environment.

Could you advise on what might be misconfigured in job_conf.yml or SLURM settings to resolve this?

Thank you for your help!

Hi @hohosharon

I’ve asked your question over at the developers’ Matrix chat to see if anyone recognizes what may be going wrong. They will likely reply here, but feel free to join the Dev chat and the Admin chat yourself! :hammer_and_wrench: You're invited to talk on Matrix

Let’s start there! :slight_smile:

Xref

It looks like there is a problem with the combination of natefoo/slurm-drmaa (the DRMAA C bindings for Slurm that Galaxy uses to submit jobs) and the debug flags that you can set in `slurm.conf` (see the Slurm Workload Manager documentation for slurm.conf). Is there a chance you can ask your cluster admin to take out the debug flags to test this?

@nate do you know if there is something that could be done in slurm-drmaa?

From what I’ve found, this issue may be related to the current configuration in /etc/slurm/slurm.conf: DebugFlags=AuditRPCs,Cgroup.
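
For illustration, the relevant portion of that file looks roughly like this (the cluster and controller names are placeholders; all other settings are omitted):

# /etc/slurm/slurm.conf (excerpt)
ClusterName=hpc
SlurmctldHost=ctrl-node01
DebugFlags=AuditRPCs,Cgroup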

I’ll ask our cluster admin to remove this setting for testing and will update you on whether it resolves the problem. Thank you for your guidance!

Hi @mvdbeek

I received the following feedback from our cluster team:

“We cannot remove those debug flags, we need them set. The debug flags are valid. Maybe the software trying to submit the jobs is not configured correctly? I’m not sure why it’s even looking at those instead of ignoring them. These flags do not affect clients—only the controller and the slurmd service daemons are affected.”

Do you have any idea why Galaxy might be reading these flags, whether there is a way to configure Galaxy to ignore them, or how to set up the environment for this situation?

Thank you,

Sharon

I solved the issue by copying /etc/slurm/slurm.conf to my local slurm.conf, commenting out DebugFlags=AuditRPCs,Cgroup, and then exporting it with: export SLURM_CONF=~/slurm.conf
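
In other words, roughly the following (a sketch; the sed command is just one way to comment the line out, and the paths assume the Galaxy user's home directory):

# make a local copy of the cluster's Slurm config
cp /etc/slurm/slurm.conf ~/slurm.conf

# comment out the DebugFlags line that slurm-drmaa fails to parse
sed -i 's/^DebugFlags=/#DebugFlags=/' ~/slurm.conf

# point slurm-drmaa (and Galaxy's slurm runner) at the local copy before starting Galaxy
export SLURM_CONF=~/slurm.conf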
