Random failures when concatenating fastqsanger.gz datasets via collections – "Not in GZIP format" errors

Hi all,

I’m encountering a frustrating and seemingly random issue when using the Concatenate Datasets tool in Galaxy to merge fastqsanger.gz files from two different lanes.

:light_bulb: Context:

  • I have a series of lane-specific fastqsanger.gz files, and I want to concatenate them lane-wise.
  • To do this, I’ve placed the two lanes into two dataset collections and launched parallel jobs for each pair.
  • The tool runs 22 parallel jobs, each combining two .gz fastq files.
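For reference on what the tool should produce: gzip files can be concatenated at the byte level, because a gzip file may contain multiple members that decompressors read back to back. A minimal sketch (the file names and contents are made up for the example, not my real lane data):

```shell
# Two tiny single-record FASTQ files, gzipped (example data only).
printf '@read1\nACGT\n+\nIIII\n' | gzip > lane1.fastq.gz
printf '@read2\nTGCA\n+\nIIII\n' | gzip > lane2.fastq.gz

# Byte-level concatenation yields a valid multi-member gzip stream.
cat lane1.fastq.gz lane2.fastq.gz > merged.fastq.gz

gzip -t merged.fastq.gz && echo "valid gzip"
gzip -dc merged.fastq.gz | wc -l   # 8 lines: both 4-line records survive
```

So a plain `cat` of the inputs should always give a readable fastqsanger.gz, which makes the corrupted outputs even stranger.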

:face_with_crossed_out_eyes: Problem:

  • Some of the tasks fail silently. They appear to succeed (green status, file size = sum of inputs), but:
    • Galaxy cannot recognize them as valid fastqsanger.gz files.
    • Downloading and inspecting them shows corrupted binary content: lots of null bytes (0x00), and the files do not start with the 1F 8B magic bytes expected of a valid gzip stream.
  • Most importantly, the failures are random:
    • Running the exact same collections multiple times yields a different set of failed outputs each time.
    • This suggests it’s not related to the input files or tool parameters.
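The magic-byte check can be reproduced like this (the file name is hypothetical; a run of null bytes stands in for one of the corrupted outputs):

```shell
# Simulate a corrupted output: null bytes instead of a gzip stream.
head -c 16 /dev/zero > suspect.fastq.gz

# A healthy gzip file begins with the magic bytes 1f 8b; this one shows 00 00.
head -c 2 suspect.fastq.gz | od -An -tx1

# gzip itself rejects the file.
gzip -t suspect.fastq.gz 2>/dev/null || echo "not a valid gzip stream"
```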

:magnifying_glass_tilted_left: Observation:

I carefully checked the job command lines and noticed a pattern:

:white_check_mark: The working jobs use paths like:

python /cvmfs/main.galaxyproject.org/galaxy/tools/filters/catWrapper.py \
  '/jetstream2/scratch/main/jobs/.../outputs/...dat' \
  '/jetstream2/scratch/main/jobs/.../inputs/...dat' ...

:cross_mark: The failing jobs use paths like:

python /cvmfs/main.galaxyproject.org/galaxy/tools/filters/catWrapper.py \
  '/corral4/main/jobs/.../outputs/...dat' \
  '/corral4/main/objects/.../dataset_....dat' ...

I learned that jetstream2 is a compute environment (Texas Advanced Computing Center), and corral4 is a storage backend. It seems that job dispatching between these environments is random, which could explain the random failures.

I also tried all 4 available “Concatenate Datasets” tools, and none resolved the issue.


:red_question_mark: My Questions:

  1. Why does concatenation fail when inputs come from corral4, but succeed from jetstream2?
  2. Is this a known bug related to dataset mounting/caching between compute nodes and object store?
  3. Is there a workaround (e.g., force copying files to scratch before concatenation)?
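To illustrate what I mean by workaround (3), here is a sketch of the staging idea outside Galaxy (all file and directory names are invented; inside Galaxy this staging would presumably have to happen in the job script or the object store configuration):

```shell
# Example inputs (stand-ins for the corral4-resident datasets).
printf '@a\nAC\n+\nII\n' | gzip > in1.fastq.gz
printf '@b\nGT\n+\nII\n' | gzip > in2.fastq.gz

# Stage to local scratch first, then concatenate, then verify before
# trusting the output -- gzip -t would have caught the corrupted files.
mkdir -p scratch
cp in1.fastq.gz in2.fastq.gz scratch/
cat scratch/in1.fastq.gz scratch/in2.fastq.gz > scratch/merged.fastq.gz
gzip -t scratch/merged.fastq.gz && echo "merge OK"
```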

This issue is blocking my pipeline, and I’d really appreciate any help or suggestions on how to proceed.

Thanks in advance!

Hi @Zhengming_Wu

Yes, there might have been some issues, but those should be resolved now that the release is finalized (we had to re-route some cluster resources, and it wasn’t perfect over the last few weeks). But I’m also wondering if there is a better way to do this in general.

The tools that seem to be better choices are these:

  • Nested Cross Product
  • Merge collections
  • Collapse Collection into single dataset in order of the collection

And, the new collection type:

  • Nested Collection (lists of lists)

This would work on the collection files directly using the element identifiers, without needing to open the files to read the sequence identifiers, and should be faster and maybe more reliable, especially if these are single-cell data with the really long > title lines. Uncompressing the data would be avoided entirely until the final collapse step.

The other option is to try with uncompressed data throughout the early sorting steps, when the files are being read repeatedly, then compress the result at the very end.

Whatever you decide, if you want to share back some examples – maybe just the files with that part of the workflow – we can try to help model this and investigate the processing issues, if any remain. But do try again now, since the cluster reconnections and the full release deployment happened just yesterday.

Let’s start there! :slight_smile:
