Random failures when concatenating fastqsanger.gz datasets via collection: "Not in GZIP format" errors

Hi @Zhengming_Wu

Yes, there may have been some issues, but those should be resolved now that the release is finalized (we had to re-route some cluster resources, and it wasn’t perfect over the last few weeks). I’m also wondering if there is a better way to do this in general.

The tools that seem to be better choices are these:

  • Nested Cross Product
  • Merge collections
  • Collapse Collection into single dataset in order of the collection

And, the new collection type:

  • Nested Collection (lists of lists)

These would work on the collection elements directly using the element identifiers, without needing to open the files to read the sequence identifiers. That should be faster and maybe more reliable, especially if these are single-cell data with the really long > title lines. Uncompressing the data would be avoided entirely until the final collapse step.
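
To make the idea concrete outside of Galaxy, here is a rough Python sketch of collapsing files in element order without ever decompressing them. This is not how the Galaxy tools are implemented; it only relies on the fact that gzip files joined end to end still form a valid gzip stream. The function name `collapse_in_order` and the sample identifiers are made up for illustration. The magic-byte check is one cheap way to spot an element that would trigger a "not in gzip format" type of error before it gets merged.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip member


def collapse_in_order(elements, out_path):
    """Join gzip files byte for byte, in the order given, without decompressing.

    `elements` is a sequence of (element_identifier, path) pairs (hypothetical
    stand-ins for collection elements). Only the first two bytes of each file
    are inspected; the sequence records themselves are never parsed.
    """
    with open(out_path, "wb") as out:
        for identifier, path in elements:
            with open(path, "rb") as src:
                if src.read(2) != GZIP_MAGIC:
                    raise ValueError(f"{identifier}: not in gzip format ({path})")
                src.seek(0)
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        elements = {}
        # Two tiny made-up FASTQ files, created out of order on purpose.
        for name, record in [("sample_B", "@r2\nTTTT\n+\nIIII\n"),
                             ("sample_A", "@r1\nACGT\n+\nIIII\n")]:
            path = tmp / f"{name}.fastqsanger.gz"
            with gzip.open(path, "wt") as fh:
                fh.write(record)
            elements[name] = path

        merged = tmp / "merged.fastqsanger.gz"
        # Order by element identifier only; file contents are not read for this.
        collapse_in_order(sorted(elements.items()), merged)

        with gzip.open(merged, "rt") as fh:
            print(fh.read())
```

The ordering decision is left to the caller, so the same function works whether the order comes from sorted identifiers or from the collection's existing order.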

The other option is to try with uncompressed data throughout the early sorting steps, when the files are being read repeatedly, and then compress the result at the very end.
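
As a rough illustration of that pattern (decompress once, do the repeated reads on plain files, compress once at the end), here is a small Python sketch. The helper names are hypothetical, and the plain concatenation in the middle is only a placeholder for whatever sorting or merging steps the real workflow runs.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_once(gz_path, plain_path):
    """Write an uncompressed copy so later steps never touch gzip."""
    with gzip.open(gz_path, "rb") as src, open(plain_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def concatenate_plain(plain_paths, out_path):
    """Placeholder for the intermediate steps: join plain files in order."""
    with open(out_path, "wb") as out:
        for path in plain_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def compress_at_end(plain_path, gz_path):
    """Compress the final result a single time."""
    with open(plain_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        plain_files = []
        # Made-up inputs standing in for fastqsanger.gz datasets.
        for name, record in [("a", "@r1\nACGT\n+\nIIII\n"),
                             ("b", "@r2\nTTTT\n+\nIIII\n")]:
            gz = tmp / f"{name}.fastqsanger.gz"
            with gzip.open(gz, "wt") as fh:
                fh.write(record)
            plain = tmp / f"{name}.fastqsanger"
            decompress_once(gz, plain)
            plain_files.append(plain)

        combined = tmp / "combined.fastqsanger"
        concatenate_plain(plain_files, combined)
        compress_at_end(combined, tmp / "combined.fastqsanger.gz")

        with gzip.open(tmp / "combined.fastqsanger.gz", "rt") as fh:
            print(fh.read())
```

The trade-off is more temporary disk space for the uncompressed copies in exchange for not decompressing the same data over and over.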

Whatever you decide, if you want to share back some examples (maybe just the files with that part of the workflow), we can try to help model this and investigate the processing issues, if any remain. But do try again now, since the cluster reconnections and the full release deployment just happened late yesterday.

Let’s start there! :slight_smile:
