Random empty output files

Hi,

I’m running a private instance of the latest Galaxy Helm chart on Kubernetes. Whenever I run a tool where one of the inputs is a collection of files, some of the outputs show up empty in the interface, even though on the backend I can see the files being populated with data.

For example:

358/working:
total 4084
-rw-r--r-- 1 10001 10001 4175126 Aug 14 09:44 out.sam
lrwxrwxrwx 1 10001 10001 86 Aug 14 09:44 Reference.fasta -> /galaxy/server/database/objects/9/e/3/dataset_9e34e95e-1973-4ac4-8b02-b270cae3102f.dat

359/working:
total 6404
-rw-r--r-- 1 10001 10001 6552288 Aug 14 09:44 out.sam
lrwxrwxrwx 1 10001 10001 86 Aug 14 09:44 Reference.fasta -> /galaxy/server/database/objects/9/e/3/dataset_9e34e95e-1973-4ac4-8b02-b270cae3102f.dat

360/working:
total 4076
-rw-r--r-- 1 10001 10001 4166132 Aug 14 09:44 out.sam
lrwxrwxrwx 1 10001 10001 86 Aug 14 09:44 Reference.fasta -> /galaxy/server/database/objects/9/e/3/dataset_9e34e95e-1973-4ac4-8b02-b270cae3102f.dat

361/working:
total 3332
-rw-r--r-- 1 10001 10001 3405050 Aug 14 09:45 out.sam
lrwxrwxrwx 1 10001 10001 86 Aug 14 09:44 Reference.fasta -> /galaxy/server/database/objects/9/e/3/dataset_9e34e95e-1973-4ac4-8b02-b270cae3102f.dat

So, 4 SAM files with data on the backend, whereas in the interface I see:

In this case, 2 files with output and 2 files empty. When I repeat the job, I get a different combination of populated and empty outputs in the interface. Does anyone have any clue? Also, whenever I try to download these files, the result is always 0 kB.

I tried changing metadata_strategy to extended; no difference. With directory_celery or extended_celery, the jobs crash during metadata collection.

I’ve managed to resolve this by setting ‘outputs_to_working_directory: false’. Metadata is now properly detected. I don’t really understand why.
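
In case it helps anyone trying the same changes, this is roughly where those keys live in my values file. This is only a sketch, assuming the standard galaxy-helm layout where galaxy.yml is supplied under configs; the nesting may differ in other chart versions:

configs:
    galaxy.yml:        # values.yaml excerpt; adjust if your chart nests this differently
        galaxy:
            # metadata_strategy can be set here too if you want to experiment
            # (extended / directory_celery / extended_celery, as mentioned above)
            # metadata_strategy: extended
            # write tool outputs straight to their final dataset paths
            # instead of the per-job working directory
            outputs_to_working_directory: false

If I understand the option correctly, with outputs_to_working_directory disabled the tool writes directly to the final dataset location, so the post-job step that copies outputs out of the working directory (where something seems to be going wrong) is skipped.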

I do still experience the download issue though: any file I download comes out as 0 kB.

Hi @Koen_Nijbroek

Thanks for posting all of these details! Very strange, but we have seen similar problems before and should be able to help.

Let’s post this over to the Admin chat to see if anyone recognizes what might be going wrong. Please feel free to join the chat, too! :hammer_and_wrench: You're invited to talk on Matrix

Some other things you could post back, to give us a bit more context while we are waiting:

  1. Does this only happen with specific tools? Just Minimap, or?
  2. Does this only happen with a particular file type? BAMs?
  3. Is this a new server? Or is it new behavior? Did you upgrade recently?
  4. What version of Galaxy are you running?
  5. If you run the job on a local or different cluster (if possible), what happens?

XRef → Private Galaxy Servers

Hi!

I ran into a similar problem after updating our Galaxy instance from 24.0.x to 25.0.1.

For context, we run Galaxy on a separate hardware node and submit jobs to a Slurm compute cluster.

The symptoms were the same: jobs randomly produced empty and populated outputs in the user interface. This happened with different tools (tested ggplot2_heatmap2 and tp_easyjoin_tool) and different file types (PDF and tabular), and it only happened with tools submitting to Slurm.

Monitoring the Slurm jobs showed that all output files in the job directory were populated, but after they were moved to the dataset folder, some of them were randomly truncated.

The fix was reducing the number of Slurm job runner worker threads to 1.

galaxy:
    job_config:
        runners:
            slurm:
                drmaa_library_path: /usr/local/lib/libdrmaa.so.1
                load: galaxy.jobs.runners.slurm:SlurmJobRunner
                workers: 1

On the older version, the default value of 4 worked fine. Some sort of concurrent access or race condition might be happening with multiple worker threads. Could other job runners have the same issue?

This might not be the right place to report this, as our setups are different and I don’t know whether the underlying issue is the same. But I figure someone else might find this information useful, as this is the first thread I found when trying to debug the issue.

Hi @Ingvar

Thanks for your feedback – and I agree this seems interesting! I pushed a summary over to the Admin’s chat to see if they have any comments. :hammer_and_wrench: You're invited to talk on Matrix

This forum is a good place to discuss. Then, to report issues, you can open a ticket at the primary Galaxy repository.

The docs have 3 workers specified here. I wonder if that is enough to resolve the issue? → Debugging Galaxy: Slurm Compute Cluster — Galaxy Project 25.0.2.dev0 documentation
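
If anyone wants to test that, the change would presumably just be the workers value in the runner block quoted above. A sketch only, reusing the config posted earlier in this thread; I have not checked whether 3 threads is enough to avoid the race:

galaxy:
    job_config:
        runners:
            slurm:
                drmaa_library_path: /usr/local/lib/libdrmaa.so.1
                load: galaxy.jobs.runners.slurm:SlurmJobRunner
                # 3 is the value shown in the linked docs; untested here
                workers: 3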

Hello!

Thank you for the response! Unfortunately, I had to push the update out to users over the weekend, so it will take time before I can test the 3-worker theory :smiley:

I might have time next week to properly look into the source code and do some restarts.

I am hesitant to report an issue without knowing what exactly is happening. This seems to be a rare problem.

The TRACE logs during debugging didn’t show anything useful (or I didn’t notice anything). Everything I claimed is just a semi-educated guess at the moment.

Please make sure that you run a recent commit of 25.0; this sounds like what was fixed in [25.0] Fix bug: tool output file may be overwritten by Runner's multi work t… by jianzuoyi · Pull Request #20639 · galaxyproject/galaxy · GitHub

Yep, that looks like it :person_facepalming:

In that case, I’ll schedule an update to tag 25.0.2 sooner rather than later.

Thank you!