Concatenation of RNA_Seq technical replicates

In the past I have been able to concatenate the technical replicates in my fastq.gz files without any trouble. However, now all of the concatenate tools seem to work only sporadically. I am currently trying to reanalyze data that I analyzed 6 months ago without a problem, but this is not going well because the concatenation step is hit or miss.
Here are screenshots of a failed concatenation file.


And here are screenshots of a successful concatenation.


Hi @rcaudle,

The ORG server has several concatenate tools. Use the same tool for all jobs; this can be done with the Re-run option.

Do you get empty files? What do you see in the mini-preview in the history panel when you click on the name of the concatenated file, the one with no preview?

Kind regards,
Igor

Hi Igor,

Thank you for your response. I have tried all the concatenation tools and they are all hit or miss. This was not an issue previously with this exact dataset. The tools run as if they are working and the resulting files are the appropriate size, but the view button shows that there are no sequences in the files, and all downstream tools are unable to use them. What is most frustrating is that the concatenation tools sometimes work properly, even on fastq.gz files that they previously failed on, so I am unable to track down what is causing the problem.

Hi @rcaudle

There were some complications during the recent release cycle (25.0 Galaxy Release (June 2025) — Galaxy Project 25.0.2.dev0 documentation) but we expect that processing should be back to stable now.

Would you be able to try again, then let us know whether it works as expected? You can share the history back if there is a problem and I’ll review it. A history with just a small representative sample would be ideal, but however you want to report this will be fine.

I may also have alternative processing suggestions. There is a new way to collapse collections that doesn’t involve uncompressing/recompressing the individual files repeatedly, which may be a more robust solution.

Details were in this topic yesterday:

Please give this a try and we can follow up to get you to a solution! :hammer_and_wrench:

I tried all the concatenation tools this morning and they all failed. Here is the history. Galaxy

I will look into your other suggestions on handling this data. Thanks!


Thank you so much @rcaudle for sharing the simple clear history. Yes, something is going wrong and the Collapse Collection tool failed with my test too. More soon, we are looking at this with priority!

Hi @rcaudle (and others that may run into this issue!)

Thank you for sharing the examples! There are two issues going on: one with the way the cat tools work at the ORG server, and one with the cluster failure handling there.

First: how Text Manipulation tools handle compressed datatypes

At UseGalaxy.org (this may be different at other servers): the concatenate tools might work on compressed fastq data, but please don’t rely on that. Most of the tool forms have a version of this warning.

This tool does not check if the datasets being concatenated are in the same format.

It means that the usual datatype handling and checks that tools perform are not required of this specific tool (and are not declared in the job environment, which matters when choosing suitable clusters to run on). Most of the Text Manipulation tools work the same way. The server, and the cluster nodes involved, might be able to handle data that isn’t in a plain-text format, but it isn’t an explicit requirement. This likely explains the inconsistent results you are seeing.

In short, these tools behave the same way as the equivalent command-line utility. For cat, the tool simply stacks the data files on top of each other. No smart datatype filtering is applied in the job configuration (this is on purpose, to allow complex or mixed data file types to be processed). This gives some flexibility when needed, but can be confusing!
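To illustrate that behavior with a minimal sketch (the filenames here are hypothetical, not from your history), `cat` just appends files byte for byte and performs no datatype checks:

```shell
# Create two tiny plain-text fastq files (hypothetical stand-ins)
printf '@read1\nACGT\n+\nIIII\n' > rep1.fastq
printf '@read2\nTGCA\n+\nIIII\n' > rep2.fastq

# cat simply stacks them on top of each other, no format awareness
cat rep1.fastq rep2.fastq > combined.fastq

wc -l combined.fastq   # 8 lines: two 4-line fastq records
```

This is exactly why plain text is the safe input: the stacking is format-blind.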

Think of the cat tools as just that, cat tools, with the file handling left up to the user. The solution is to add the uncompress/recompress steps into your workflow. Most of the other Text Manipulation tools work this same way: plain text works most reliably across servers and clusters, and you might lose your datatype and need to reassign it!

So, while some cluster node might happen to support the “compressed to uncompressed to cat to re-compressed” transformation in the job environment, that shouldn’t be relied on. You’ll need to make sure the input data is in a plain-text format before using one of the cat functions, and that includes the Collapse Collection function (which, again, might work on compressed data but often will not).
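The manual equivalent of those three steps looks like this on the command line (a sketch with hypothetical filenames; the first two lines just fabricate example inputs):

```shell
# Fabricate two small compressed inputs (hypothetical stand-ins)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > rep1.fastq.gz
printf '@read2\nTGCA\n+\nIIII\n' | gzip -c > rep2.fastq.gz

# Step 1: uncompress to plain text
zcat rep1.fastq.gz > rep1.fastq
zcat rep2.fastq.gz > rep2.fastq

# Step 2: concatenate the plain-text files
cat rep1.fastq rep2.fastq > merged.fastq

# Step 3: recompress the result (produces merged.fastq.gz)
gzip -f merged.fastq
```

The Galaxy sub-workflow below does the same thing with the convert functions, just in bulk across a collection.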


How I would suggest doing this

I would put it all into a sub-workflow that you can recall whenever needed. This also lets you split off the temporary intermediate files, since these are usually uninteresting data to save.

:hammer_and_wrench: https://usegalaxy.org/u/jen-galaxyproject/h/example-cat-for-compressed-fastq-1

  1. :down_arrow:Main workflow (creating the input for the subworkflow): Group the files into a temporary collection (Collection Builder or split off the files you want from the original collection(s) into a new slice with one of the Collection Operations tools/functions). This collection will not increase your storage usage since these are identical clones of pre-existing elements.

  2. :down_right_arrow: Sub-workflow: Uncompress all at once (pencil icon → convert, or in a workflow, the Convert compressed file to uncompressed tool).

  3. :down_right_arrow: Sub-workflow: Collapse Collection (or any other of the Cat tools that you like)

  4. :down_right_arrow::up_arrow: Sub-workflow (creating the output to send back to the main workflow): Recompress all at once (pencil icon → convert, or in a workflow, the Convert uncompressed file to compressed tool). These will be new elements and will increase your storage usage, the same as if steps 2, 3, and 4 were a single tool.
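For reference, steps 2–4 collapse to a single stream on the command line (a sketch with hypothetical filenames, not a Galaxy tool; the first two lines fabricate example inputs):

```shell
# Fabricate two small compressed inputs (hypothetical stand-ins)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > rep1.fastq.gz
printf '@read2\nTGCA\n+\nIIII\n' | gzip -c > rep2.fastq.gz

# zcat decompresses and concatenates in one pass; gzip recompresses.
# No intermediate plain-text files are written to disk.
zcat rep1.fastq.gz rep2.fastq.gz | gzip -c > merged_stream.fastq.gz
```

In Galaxy terms, the sub-workflow plays the role of this pipeline, with the temporary collection as the input and the recompressed collection as the output.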


Second, there are some (known) issues with the clusters, which is why these empty-result jobs are labeled as successful (green) and the job logs do not report the actual issue. The log will just report that the tool can’t read the data, without any details, since it is only looking for data (e.g. any plain text) rather than parsing a specific datatype!

We have been working on this for the last two weeks, and that work should start to roll out to more tools over the next few weeks. Until then, failed jobs may not look failed in the expected way (red, with informative logs) and may instead be green and empty, with cryptic format logs (direct from the cluster, not nicely parsed). The issue will usually be unexpected content in the inputs – but please ask about anything that is not clear and we’ll help to confirm.
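Until the error reporting improves, a few quick checks (shown here on the command line, with a hypothetical filename; the first line just fabricates a stand-in file) can confirm whether a “green” result actually contains reads:

```shell
# Stand-in for a concatenated result (hypothetical, one fastq record)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > check.fastq.gz

gzip -t check.fastq.gz            # exits non-zero if the gzip stream is corrupt
zcat check.fastq.gz | head -n 4   # peek at the first fastq record
zcat check.fastq.gz | wc -l       # line count; divide by 4 for the read count
```

In Galaxy itself, the equivalent spot-check is the dataset preview/view button plus the reported file size, as noted earlier in the thread.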

Hope this helps! :slight_smile: Help for how the sub-workflow creation was streamlined in the last release can be found here → 25.0 Galaxy Release (June 2025) — Galaxy Project 25.0.2.dev0 documentation

Related Q&A Tool for merging 2x single-read illumina sequencing files (fastq) into one?