I’m running a private instance of the latest Galaxy Helm chart on Kubernetes. Whenever I run a tool that takes a collection of files as one of its inputs, some of the outputs appear empty in the interface, even though on the backend I can see the files being populated with data.
For example:
358/working:
total 4084
-rw-r--r-- 1 10001 10001 4175126 Aug 14 09:44 out.sam
lrwxrwxrwx 1 10001 10001 86 Aug 14 09:44 Reference.fasta -> /galaxy/server/database/objects/9/e/3/dataset_9e34e95e-1973-4ac4-8b02-b270cae3102f.dat
359/working:
total 6404
-rw-r--r-- 1 10001 10001 6552288 Aug 14 09:44 out.sam
lrwxrwxrwx 1 10001 10001 86 Aug 14 09:44 Reference.fasta -> /galaxy/server/database/objects/9/e/3/dataset_9e34e95e-1973-4ac4-8b02-b270cae3102f.dat
360/working:
total 4076
-rw-r--r-- 1 10001 10001 4166132 Aug 14 09:44 out.sam
lrwxrwxrwx 1 10001 10001 86 Aug 14 09:44 Reference.fasta -> /galaxy/server/database/objects/9/e/3/dataset_9e34e95e-1973-4ac4-8b02-b270cae3102f.dat
361/working:
total 3332
-rw-r--r-- 1 10001 10001 3405050 Aug 14 09:45 out.sam
lrwxrwxrwx 1 10001 10001 86 Aug 14 09:44 Reference.fasta -> /galaxy/server/database/objects/9/e/3/dataset_9e34e95e-1973-4ac4-8b02-b270cae3102f.dat
So, 4 SAM files with output, whereas in the interface I see:
In this case, 2 files with output and 2 files empty. When I repeat the job, I get a different combination of populated and empty outputs in the interface. Does anyone have any clue? Also, whenever I try to download these files, the result is always 0 kB.
I tried changing metadata_strategy to extended; no difference. With directory_celery or extended_celery, the jobs crash during metadata collection.
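For reference, this is roughly how I set it; a minimal sketch of the galaxy.yml fragment, assuming the Helm chart passes this through its galaxy.yml config override (the exact values key may differ between chart versions):

galaxy:
  # Strategies I tried: "extended" (no difference for me),
  # "directory_celery" and "extended_celery" (jobs crash during
  # metadata collection). The default is "directory".
  metadata_strategy: extended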
Thanks for posting all of these details! Very strange, but we have seen similar problems before and should be able to help.
Let’s post this over to the Admin chat to see if anyone recognizes what might be going wrong. Please feel free to join the chat, too!
Some other things you could post back while we are waiting for a bit more context:
Does this only happen with specific tools? Just Minimap, or others too?
Does this only happen with a particular file type? BAMs?
Is this a new server? Or is it new behavior? Did you upgrade recently?
What version of Galaxy are you running?
If you run the job on a local or different cluster (if possible), what happens?
I ran into a similar problem after updating our Galaxy instance from 24.0.x to 25.0.1.
For context, we run Galaxy on a separate hardware node and submit jobs to a Slurm compute cluster.
The symptoms were the same: jobs randomly produced a mix of empty and populated outputs in the user interface. This happened with different tools (tested ggplot2_heatmap2 and tp_easyjoin_tool) and different file types (PDF and tabular), and it only happened with tools submitting to Slurm.
Monitoring the Slurm jobs showed that all output files in the job directory were fully populated, but after they were moved to the dataset folder, some of them were randomly truncated.
The fix was reducing the number of Slurm job runner worker threads to 1 (see the sketch below).
On the older version, the default value of 4 worked fine. Some sort of concurrent access or race condition might be happening with multiple worker threads. Could other job runners have the same issue?
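For reference, a sketch of the change, assuming the YAML-style job configuration (job_conf.yml); the runner name here matches our setup, so adjust it to yours:

runners:
  slurm:
    load: galaxy.jobs.runners.slurm:SlurmJobRunner
    # Default is 4 worker threads; with 1 the truncated outputs
    # disappeared for us, at the cost of serializing the runner's work.
    workers: 1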
This might not be the right place to report this, as our setups are different and I don’t know whether the underlying issue is the same. But I figure someone else might find this information useful, since this is the first thread I found when trying to debug the issue.
Thanks for your feedback – and I agree this seems interesting! I pushed a summary over to the Admin’s chat to see if they have any comments.
This forum is a good place to discuss. Then, to report issues, you can open a ticket at the primary Galaxy repository.
Thank you for the response! Unfortunately, I had to push the update out to users over the weekend, so it will take time before I can test the worker-thread theory.
I might have time next week to properly look into the source code and do some restarts.
I am hesitant to report an issue without knowing exactly what is happening. This seems to be a rare problem.
The TRACE logs during debugging didn’t show anything useful (or I didn’t notice anything). Everything I’ve claimed is just a semi-educated guess at the moment.
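In case someone wants to dig through the same logs, this is roughly how verbose runner logging can be turned on; a sketch assuming the dictConfig-style logging section in galaxy.yml (depending on your setup you may need to spell out the full logging section rather than just this fragment):

galaxy:
  logging:
    loggers:
      # My guess at the relevant logger namespace; the job runners log
      # under galaxy.jobs.runners. TRACE is a custom level Galaxy
      # registers below DEBUG.
      galaxy.jobs.runners:
        level: TRACE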