Infinite spam messages in the log: Job was resubmitted and is being dispatched immediately

Hi,

I run a local Galaxy installation and have set up some resubmission rules. With 24.2.3 I didn’t have any issues with this; however, after upgrading to 25.0.2 the log is flooded with an endless stream of messages like

galaxy.jobs.handler DEBUG 2025-09-23 12:44:10,097 [pN:main.1,p:4176744,tN:JobHandlerQueue.monitor_thread] (82999) Job was resubmitted and is being dispatched immediately
galaxy.jobs.handler DEBUG 2025-09-23 12:44:10,098 [pN:main.1,p:4176744,tN:JobHandlerQueue.monitor_thread] (82999) Dispatching to high_memory_and_cpu runner

I can only stop this by shutting down the server and manipulating the SQL database, setting all jobs with the status ‘resubmitted’ to ‘error’.

I can see that the jobs are successfully resubmitted to the Slurm queue with the correct settings, but after they finish computing, they never appear as completed in the user interface. They stay in their ‘yellow’ running state.

Either there is a bug in this Galaxy version or, more likely, my job_conf.yml is erroneous. I’ve added it here; maybe someone spots an obvious mistake?

runners:
  local:
    load: galaxy.jobs.runners.local:LocalJobRunner
    workers: 4
  high_memory_and_cpu:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    workers: 4
  high_cpu:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    workers: 4
  high_memory_single_core:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    workers: 4
  moderate_memory_single_core:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    workers: 4
  multicore_cpu:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    workers: 4
  singlecore_cpu:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    workers: 4
  ultra_high_memory:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    workers: 4
  multicore_data_fetch:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    workers: 4  
  general:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    workers: 4
  medium_memmory_and_high_cpu:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    workers: 4

execution:
  default: general
  environments:
    general:
      runner: general 
      native_specification: '--mem=10000 --cpus-per-task=3'
      env:
        - name: '_JAVA_OPTIONS'
          value: '-Xmx10G'
      resubmit:
        - condition: memory_limit_reached
          environment: medium_memmory_and_high_cpu
    medium_memmory_and_high_cpu:
      runner: medium_memmory_and_high_cpu
      native_specification: '--cpus-per-task=10 --mem=30000'
      resubmit:
        - condition: memory_limit_reached
          environment: high_memory_and_cpu
    high_memory_and_cpu:
      runner: high_memory_and_cpu
      native_specification: '--cpus-per-task=10 --mem=80000'
      resubmit:
        - condition: memory_limit_reached
          environment: ultra_high_memory
    ultra_high_memory:
      runner: ultra_high_memory
      native_specification: '--cpus-per-task=10 --mem=300000'
    high_cpu:
      runner: high_cpu
      native_specification: '--cpus-per-task=10 --mem=20000'
      env:
        - name: '_JAVA_OPTIONS'
          value: '-Xmx20G'
      resubmit:
        - condition: memory_limit_reached
          environment: medium_memmory_and_high_cpu
    high_memory_single_core:
      runner: high_memory_single_core
      native_specification: '--cpus-per-task=1 --mem=80000'
      resubmit:
        - condition: memory_limit_reached
          environment: ultra_high_memory
    moderate_memory_single_core:
      runner: moderate_memory_single_core
      native_specification: '--cpus-per-task=1 --mem=20000'
    multicore_data_fetch:
      runner: multicore_data_fetch
      native_specification: '--cpus-per-task=1 --mem=5000'
      resubmit:
        - condition: memory_limit_reached
          environment: singlecore_cpu
    multicore_cpu:
      runner: multicore_cpu
      native_specification: '--cpus-per-task=4 --mem=35000'
      resubmit:
        - condition: memory_limit_reached
          environment: high_memory_and_cpu
    singlecore_cpu:
      runner: singlecore_cpu
      native_specification: '--cpus-per-task=1 --mem=10000'
      resubmit:
        - condition: memory_limit_reached
          environment: high_memory_single_core

limits:
  -
    type: environment_user_concurrent_jobs
    tag: medium_memmory_and_high_cpu
    value: 10
  -
    type: environment_user_concurrent_jobs
    tag: high_memory_and_cpu
    value: 7
  -
    type: environment_user_concurrent_jobs
    tag: ultra_high_memory
    value: 1
  -
    type: environment_user_concurrent_jobs
    tag: general
    value: 20
  - 
    type: environment_total_concurrent_jobs
    tag: ultra_high_memory
    value: 3
  -
    type: environment_total_concurrent_jobs
    tag: medium_memmory_and_high_cpu
    value: 30    
  -
    type: environment_total_concurrent_jobs
    tag: general
    value: 105
  -
    type: environment_total_concurrent_jobs
    tag: high_memory_and_cpu
    value: 15
  -
    type: environment_total_concurrent_jobs
    tag: high_cpu
    value: 31
  -
    type: environment_total_concurrent_jobs
    tag: high_memory_single_core
    value: 15
  -
    type: environment_total_concurrent_jobs
    tag: moderate_memory_single_core
    value: 316
  -
    type: environment_total_concurrent_jobs
    tag: multicore_cpu
    value: 79
  -
    type: environment_total_concurrent_jobs
    tag: singlecore_cpu
    value: 316
  -
    type: environment_total_concurrent_jobs
    tag: multicore_data_fetch
    value: 20

tools:
  - id: sortmerna
    environment: high_cpu
  - id: bg_sortmerna
    environment: high_cpu 
  - id: vardict_java
    environment: high_cpu
  - id: __DATA_FETCH__
    environment: multicore_data_fetch
  - id: upload
    environment: multicore_data_fetch
  - id: upload1
    environment: multicore_data_fetch
  - id: bedtools_coveragebed
    environment: ultra_high_memory
  - id: textutil
    environment: singlecore_cpu 
  - id: bwa_mem
    environment: medium_memmory_and_high_cpu
  - id: bwa_mem2
    environment: medium_memmory_and_high_cpu
  - id: rna_star
    environment: medium_memmory_and_high_cpu
  - id: rna_star_index_builder_data_manager
    environment: high_memory_and_cpu
  - id: hisat2_index_builder_data_manager
    environment: high_memory_and_cpu
  - id: bwa_mem_index_builder_data_manager
    environment: high_memory_and_cpu
  - id: bowtie2
    environment: medium_memmory_and_high_cpu
  - id: fastqc
    environment: singlecore_cpu

The error is observed with RNA STAR when it moves from the medium_memmory_and_high_cpu environment into the high_memory_and_cpu environment.

Thanks a lot!

Hello @casio

Thanks for sharing all the details! We’ll need some input from the administrators for this one. Let’s keep the conversation on this forum, but for reference, this is where I cross-posted in their chat.

My first guess is that there is a missing attempt term in the condition blocks, but they can confirm.
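
Something along these lines, just as a rough sketch of the idea (the exact condition expression should be checked against the docs):

      resubmit:
        - condition: 'memory_limit_reached and attempt < 3'
          environment: medium_memmory_and_high_cpu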

More soon! :slight_smile:

XRef
:hammer_and_wrench: Configuration Options — Galaxy Project 25.0.2.dev0 documentation
:page_facing_up: Private Galaxy Servers

One thing that I observe is that you do not need a runner for each environment; one runner for Slurm jobs is sufficient:

runners:
  local:
    load: galaxy.jobs.runners.local:LocalJobRunner
    workers: 4
  slurm:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    workers: 4
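
Each environment then just points at that single Slurm runner; a minimal sketch reusing one of your specifications:

execution:
  default: general
  environments:
    general:
      runner: slurm    # the shared SlurmJobRunner defined above
      native_specification: '--mem=10000 --cpus-per-task=3'
    # ...the other environments likewise, each with runner: slurm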

Also there is a typo: memmory .. but probably unrelated.

Thanks for the hint, that will clean up my job_conf file a bit :slight_smile:

I also noticed the typo after I uploaded it here, but at least it is consistent across all usages of the term.

I investigated a bit more and could narrow it down to an interaction between the resubmission and the global limits.

execution:
  default: medium_memory_and_high_cpu
  environments:
    medium_memory_and_high_cpu:
      runner: slurm
      native_specification: '--cpus-per-task=10 --mem=400'
      resubmit:
        - condition: memory_limit_reached
          environment: high_memory_and_cpu
    high_memory_and_cpu:
      runner: slurm
      native_specification: '--cpus-per-task=10 --mem=800'
      resubmit:
        - condition: memory_limit_reached
          environment: ultra_high_memory
    ultra_high_memory:
      runner: slurm
      native_specification: '--cpus-per-task=10 --mem=300000'

limits:
  - type: destination_user_concurrent_jobs
    id: medium_memory_and_high_cpu
    value: 5
  - type: destination_user_concurrent_jobs
    id: high_memory_and_cpu
    value: 3
  - type: destination_user_concurrent_jobs
    id: ultra_high_memory
    value: 1

In this setting, if the number of jobs given by the global limit is already reached, e.g. one job already submitted to ultra_high_memory, the next job is ‘somehow’ blocked from resubmission and we end up in the endless resubmission loop. If I increase the limit to e.g. 2, then one job can be rescheduled, but we run into the same issue as soon as two jobs need to be resubmitted.
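
Concretely, the only change between “loops forever” and “one job gets through” is this single value (the limit I tested):

limits:
  - type: destination_user_concurrent_jobs
    id: ultra_high_memory
    value: 2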

ChatGPT proposed moving the limits into the environment definition, but that does not work either. It also wanted to add handlers, but again without success.

Does anyone here know something about this? I am very sure that this setup worked absolutely fine in Galaxy version 24.2, and I only see this issue since the upgrade to 25.0.

Hi @casio, I’ve asked again in the Admin chat for feedback. We’d like to get this working again for you! More soon.

Hi Jenna,

any feedback so far?

Best,

casio

It doesn’t look like it – I’m going to ask them again for feedback. Thanks for the nudge, @casio!

Can you post the relevant part of the log file?

May I ask what the limits are for? What is the idea behind them?

Hi Matthias,

The relevant part of the log is, as mentioned in the opening post:

“Job was resubmitted and is being dispatched immediately.”

Unfortunately, there’s no additional information in the logs.

The goal of our setup is to ensure balanced and fair use of our server resources. We have around 160 cores and 1.8 TB of RAM distributed across five machines, which are accessed via Slurm. In total, about 10–20 users share these resources. On one hand, we want to maximize the overall load and the number of parallel jobs; on the other, we want to keep enough resources available so that new users can submit and compute jobs without long waiting times.

This led us to the concept of multiple queues with different limits on the number of parallel jobs per user, combined with automatic job resubmission when a job exceeds its allocated memory. For example, we run RNA STAR with default settings of 10 cores and 40 GB RAM, which is usually sufficient. However, for larger datasets, it may need 80 GB or occasionally even 300 GB. In such cases, jobs are resubmitted to the next queue with higher memory limits.

The logic behind this is that smaller queues (e.g., 40 GB) allow twice as many parallel jobs per user as medium ones (80 GB), while the high-memory queue (> 300 GB RAM) allows only one job per user, as we have just one such node. Limiting parallel jobs in Galaxy helps prevent overloading the cluster. For example, if Slurm can handle 100 jobs in parallel and a user submits 100 jobs, the cluster is full. If Galaxy restricts a user to 10 concurrent jobs, only 10 jobs are forwarded to Slurm, so when another user submits a job, it can still be scheduled immediately.

This approach worked well before the Galaxy 25.0.3 update (we were previously on 24.2). However, since the update, the number of allowed jobs for a queue in the resubmission context must be at least one higher than the number of jobs being submitted. For instance, if the initial queue allows 10 jobs, that works fine. But if those jobs are resubmitted to another queue that also allows 10 jobs, we encounter the “Job was resubmitted…” loop. If the limit were 11, the system would behave as expected.

Since it’s not possible to predict how many jobs might be resubmitted, I can no longer safely enforce job limits for resubmission queues, only for the initial ones.
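
In practice that means I can only keep limits like this, i.e. on the entry environment and not on the resubmission targets (sketch based on the minimal config above):

limits:
  - type: destination_user_concurrent_jobs
    id: medium_memory_and_high_cpu
    value: 5
  # no limits on high_memory_and_cpu or ultra_high_memory, otherwise resubmission loops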

I hope this clarifies the issue and helps in identifying a potential solution.

Best,

casio

Thanks for the explanations. Sounds perfectly reasonable. As a side note: in my setup I submit jobs as the real user, i.e. SLURM jobs are created as the actual users (my Galaxy users = HPC users), then SLURM does the balancing for me.

What happens if your queue gets empty again? Do the jobs start running?

I checked the code a bit and there seems to be at least one relevant change: [24.2] Fix various job concurrency limit issues by mvdbeek · Pull Request #19824 · galaxyproject/galaxy · GitHub. Maybe it’s possible to adapt the unit tests there to check your case?

What happens if your queue gets empty again? Do the jobs start running?

The resubmission queue never fills up and remains empty at all times because no jobs are actually submitted to it. For example, suppose 10 jobs should be resubmitted, and the target queue allows 10 parallel jobs. In that case, none of the 10 jobs are submitted, and all of them log the message: “Job was resubmitted and is being dispatched immediately.” Meanwhile, the queue stays empty the entire time.

Slurm: we submit everything as the ‘galaxy’ user. We could set it up like you do, but the compute nodes would need to be integrated into the Windows domain for that, and our IT department, while it will get to it eventually, is not really helpful or cooperative. Last time, our Galaxy server had already been in use for a year before we finally got approval and the Linux compute nodes were integrated into the Windows domain. Long story short, this is not an alternative for us.