Galaxy keeps submitting the same job over and over again! What can I do to stop this?

I have a local Galaxy installation that so far works fine.
I now work with data collections.
I have Illumina data (128 files) in one collection (each file around 20-30 MB, so not very big).
I am now trying to run jobs on that collection, for example FastQC.

Now Galaxy behaves strangely: when I select the collection as FastQC input and press “Submit”, Galaxy starts to add the jobs to the History while the “Sending…” button is still shown on the tool page, so I do not see the green “submitted” page yet.
However, after approx. 60 sec Galaxy starts to submit the job again (while the “Sending…” button is still showing). The “old” job keeps being processed; Galaxy just adds a new one (again with 128 entries, of course). After some time (~60 sec) it adds it again, and so on. This can easily add >10 jobs with over 1000 files from my initial list, all of which are of course processed.

This strange behavior only stops when Galaxy displays an error message saying “the job submission failed” (red box). From then on no “new” jobs are added.
The thing is that even the first job submission was OK - it just needed some time to be processed due to the large number of files.
To me this looks like a timeout error that is triggered because the tool page stays at the “Sending…” stage for too long.
Is there any way I can increase such a timeout to prevent this “explosion” of jobs?

Many thanks in advance for any help!

Hi @schmitts

Would you be able to share some more details about your local Galaxy configuration?

  1. What version of Galaxy are you running?
  2. When was that last updated/retrieved from GitHub?
  3. Or is it a Docker Galaxy? From what version, source, and when was it retrieved?
  4. Are you the only user of the server? Or have you set up the server to allow for multiple users?
  5. Is this the first time you have used dataset collections on your server?
  6. How much memory is available to Galaxy? 16 GB is the minimum default, but often more is needed.
  7. Any other configuration settings that may seem relevant. Examples: Are you using a cluster? Have you upgraded the database to use Postgres?

If you want to send that info in a direct message to me, instead of posting publicly, that would be fine.

Thanks!

Hi @jennaj
Many thanks for your reply. Sure:

  1. What version of Galaxy are you running?
    It is version 20.01

  2. When was that last updated/retrieved from GitHub?
    I cloned Galaxy from GitHub on 10.4.2020, no Docker (except for GIEs, which run in containers)

  3. Are you the only user of the server? Or have you set up the server to allow for multiple users?
    It's set up to allow for multiple users (use_remote_user: true); however, I am currently the only user who has access to it.

  4. Is this the first time you have used dataset collections on your server?
    I have been using Galaxy for half a year. I have already run many jobs on single files, including FastQC on files >4 GB. When I used dataset collections, they were only very small (3-4 entries), and this worked fine. This is the first larger collection (but with small files).

  5. How much memory is available to Galaxy? 16 GB is the minimum default, but often more is needed.
    I have made 32GB available for Galaxy. Furthermore, I monitored memory usage and CPU load before and during submission. The server consumes only a small amount during both processes (4 GB), CPU load increases during submission. I made 8 virtual cores available for Galaxy Server. See picture below. I have the following settings in the uwsgi-config: buffer-size: 16384, processes: 1, threads: 4, offload-threads: 2 (tried 1-4)

  6. Any other configuration settings that may seem relevant. Examples: Are you using a cluster? Have you upgraded the database to use Postgres?
    The server runs on CentOS 8.1. Galaxy runs behind an Apache reverse proxy on the same server, which uses LDAP authentication. I run a separate server with Postgres for the Galaxy database.

I assembled a small figure illustrating the problem and summarizing the load on the server. The time between the job resubmissions is almost exactly 60 sec. Interesting: for some tests with the same dataset, the “submission error” message already showed up after 60 sec (so the job was submitted only once and ran fine); for other tests with the same dataset, it took up to 6 min until the error message appeared (so the job was submitted 6 times before submission stopped). It's important to highlight that the submission did not really fail. Even after the 6 jobs were submitted, they all got processed nicely.

See figure:

Many thanks in advance for your help!

UPDATE: I have now also tested this without a dataset collection (I just added all 128 files to the history and selected all of them in the tool via “multiple datasets”). The result is similar: I get the “submission failed” message after some seconds, but the jobs are not resubmitted in the meantime. This really seems to be a problem with a timeout during job submission…


Timeouts during job submission do not stop Galaxy from processing your jobs; you just won’t see a summary of what has been submitted. That’s why you see your data appearing multiple times.

Hi @mvdbeek
Thanks - this makes total sense.
Any chance that I can increase the timeout somewhere in the config or code?
I had a look at galaxy.yml - but did not find a flag that would fit. Maybe default_job_resubmission_condition but I am not sure how to use this correctly…
Thanks again!

If you’re using nginx as your proxy and you connect via the uwsgi protocol, you can increase the timeout by setting uwsgi_read_timeout 300; which would set the timeout to 5 minutes. If you’re using uwsgi without a proxy, you should be able to set http-timeout: 300 in the uwsgi: section of galaxy.yml (you can find all valid options for uwsgi at https://uwsgi-docs.readthedocs.io/en/latest/Options.html#http-timeout), but I’m less certain about that being the right option.
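
For the nginx case, here is a minimal sketch of where that directive would sit. It assumes Galaxy's uwsgi is listening on a uwsgi-protocol socket at 127.0.0.1:4001; that address is a placeholder, not something taken from this thread, and it has to match the `socket:` option in the `uwsgi:` section of your galaxy.yml. Note that nginx's default for uwsgi_read_timeout is 60 seconds, which fits the roughly 60 sec interval reported above.

```nginx
# Sketch only: the address below is an assumed placeholder and must match
# the `socket:` setting in the uwsgi section of galaxy.yml.
location / {
    # Talk to Galaxy over the uwsgi protocol instead of plain HTTP.
    uwsgi_pass 127.0.0.1:4001;
    include uwsgi_params;

    # Wait up to 300 s (5 min) for Galaxy to respond before giving up.
    # The nginx default is 60 s.
    uwsgi_read_timeout 300;
}
```

After changing the value, reload nginx and resubmit a large collection to check that the “Sending…” step now completes without the error box.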


@mvdbeek Awesome, I switched to the uwsgi protocol (had httpd before) on my Apache RP. This works now without timeout.
Thanks for the helpful hints!
