Concatenation of RNA_Seq technical replicates

In the past I have been able to concatenate the technical replicates in my fastq.gz files without any trouble. However, now all of the concatenate tools seem to work only sporadically. I am currently trying to reanalyze data that I analyzed 6 months ago without a problem, but this is not going well because the concatenation step is hit or miss.
Here are screenshots of a failed concatenation file.


And here are screenshots of a successful concatenation.


Hi @rcaudle,

The ORG server has several concatenate tools. Use the same tool for all jobs; this can be done with the Re-run option.

Do you get empty files? What do you see in the mini-preview in the history panel when you click on the name of the concatenated file, the one with no preview?

Kind regards,
Igor

Hi Igor,

Thank you for your response. I have tried all the concatenation tools and they are all hit or miss. This was not an issue previously with this exact dataset. The tools run as if they are working and the resulting files are the appropriate size, but the view button shows that there are no sequences in the files, and all downstream tools are unable to use them. What is most frustrating is that the concatenation tools sometimes work properly, even on fastq.gz files that they previously failed on, so I am unable to track down what is causing the problem.

Hi @rcaudle

There were some complications during the recent release cycle (25.0 Galaxy Release (June 2025) — Galaxy Project 25.0.2.dev0 documentation) but we expect that processing should be back to stable now.

Would you be able to try again, then let us know whether it works as expected? You can share the history back if there is a problem and I’ll review it. A history with just a small representative sample would be ideal, but however you want to report this will be fine.

I may also have alternative processing suggestions. There is a new way to collapse collections that doesn’t involve uncompressing/recompressing the individual files repeatedly, which may be a more robust solution.

Details were in this topic yesterday:

Please give this a try and we can follow up to get you to a solution! :hammer_and_wrench:

I tried all the concatenation tools this morning and they all failed. Here is the history. Galaxy

I will look into your other suggestions on handling this data. Thanks!


Thank you so much @rcaudle for sharing the simple clear history. Yes, something is going wrong and the Collapse Collection tool failed with my test too. More soon, we are looking at this with priority!

Hi @rcaudle (and others that may run into this issue!)

Thank you for sharing the examples! There are two issues going on: one with the way the cat tools work at the ORG server, and one with the cluster failure handling there.

First: how Text Manipulation tools handle compressed datatypes

At UseGalaxy.org (this may be different at other servers): the concatenate tools might work on compressed fastq data, but please don’t rely on that. Most of the tool forms have a version of this warning.

This tool does not check if the datasets being concatenated are in the same format.

It means that the usual datatype handling and checks that tools perform are not required of this specific tool (and are not declared in the job environment, which matters when choosing suitable clusters to run on). Most of the Text Manipulation tools work the same way. The server, and the cluster nodes involved, might be able to handle data that isn’t in a plain-text format, but it isn’t an explicit requirement. This likely explains the inconsistent results you are seeing.

In short, these tools behave the same way as the equivalent command-line utility. For cat, the tool simply stacks the data files on top of each other. No smart datatype filtering is applied in the job configuration (this is on purpose, to allow complex or mixed data file types to be processed). This gives some flexibility when needed, but can be confusing!
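To illustrate that behavior with a minimal sketch (the filenames here are hypothetical, not from your history), `cat` just appends files byte for byte and performs no datatype checks:

```shell
# Create two tiny plain-text fastq files (hypothetical stand-ins)
printf '@read1\nACGT\n+\nIIII\n' > rep1.fastq
printf '@read2\nTGCA\n+\nIIII\n' > rep2.fastq

# cat simply stacks them on top of each other, no format awareness
cat rep1.fastq rep2.fastq > combined.fastq

wc -l combined.fastq   # 8 lines: two 4-line fastq records
```

This is exactly why plain text is the safe input: the stacking is format-blind.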

Think of the cat tools as just that, cat tools, with the file handling left up to the user. The solution is to add the uncompress/recompress steps into your workflow. Most of the other Text Manipulation tools work this same way: plain text works most reliably across servers and clusters, and you might lose your datatype and need to reassign it!

So, while some cluster node might happen to support the “compressed to uncompressed to cat to re-compressed” transformation in the job environment, that shouldn’t be relied on. You’ll need to make sure the input data is in a plain-text format before using one of the cat functions, and that includes the Collapse Collection function (which, again, might work on compressed data but often will not).
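The manual equivalent of those three steps looks like this on the command line (a sketch with hypothetical filenames; the first two lines just fabricate example inputs):

```shell
# Fabricate two small compressed inputs (hypothetical stand-ins)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > rep1.fastq.gz
printf '@read2\nTGCA\n+\nIIII\n' | gzip -c > rep2.fastq.gz

# Step 1: uncompress to plain text
zcat rep1.fastq.gz > rep1.fastq
zcat rep2.fastq.gz > rep2.fastq

# Step 2: concatenate the plain-text files
cat rep1.fastq rep2.fastq > merged.fastq

# Step 3: recompress the result (produces merged.fastq.gz)
gzip -f merged.fastq
```

The Galaxy sub-workflow below does the same thing with the convert functions, just in bulk across a collection.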


How I would suggest doing this

I would put it all into a sub-workflow that you can recall whenever needed. This also lets you split off the temporary intermediate files, since these are usually uninteresting data to save.

:hammer_and_wrench: https://usegalaxy.org/u/jen-galaxyproject/h/example-cat-for-compressed-fastq-1

  1. :down_arrow:Main workflow (creating the input for the subworkflow): Group the files into a temporary collection (Collection Builder or split off the files you want from the original collection(s) into a new slice with one of the Collection Operations tools/functions). This collection will not increase your storage usage since these are identical clones of pre-existing elements.

  2. :down_right_arrow: Sub-workflow: Uncompress all at once (pencil icon → convert, or in a workflow, the Convert compressed file to uncompressed tool).

  3. :down_right_arrow: Sub-workflow: Collapse Collection (or any other of the Cat tools that you like)

  4. :down_right_arrow::up_arrow: Sub-workflow (creating the output to send back to the main workflow): Recompress all at once (pencil icon → convert, or in a workflow, the Convert uncompressed file to compressed tool). These will be new elements and will increase your storage usage, the same as if steps 2, 3, and 4 were a single tool.
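For reference, steps 2–4 collapse to a single stream on the command line (a sketch with hypothetical filenames, not a Galaxy tool; the first two lines fabricate example inputs):

```shell
# Fabricate two small compressed inputs (hypothetical stand-ins)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > rep1.fastq.gz
printf '@read2\nTGCA\n+\nIIII\n' | gzip -c > rep2.fastq.gz

# zcat decompresses and concatenates in one pass; gzip recompresses.
# No intermediate plain-text files are written to disk.
zcat rep1.fastq.gz rep2.fastq.gz | gzip -c > merged_stream.fastq.gz
```

In Galaxy terms, the sub-workflow plays the role of this pipeline, with the temporary collection as the input and the recompressed collection as the output.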


Second, there are some (known) issues with the clusters, which is why these empty-result jobs are labeled as successful (green) and the job logs do not report the actual issue. The log will just report that the tool can’t read the data, without any details, since it is only looking for data (e.g. any plain text) rather than parsing a specific datatype!

We have been working on this for the last two weeks, and that work should start to roll out to more tools over the next few weeks. Until then, failed jobs may not look failed in the expected way (red, with informative logs) and may instead be green and empty, with cryptic format logs (direct from the cluster, not nicely parsed). The issue will usually be unexpected content in the inputs – but please ask about anything that is not clear and we’ll help to confirm.
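Until the error reporting improves, a few quick checks (shown here on the command line, with a hypothetical filename; the first line just fabricates a stand-in file) can confirm whether a “green” result actually contains reads:

```shell
# Stand-in for a concatenated result (hypothetical, one fastq record)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > check.fastq.gz

gzip -t check.fastq.gz            # exits non-zero if the gzip stream is corrupt
zcat check.fastq.gz | head -n 4   # peek at the first fastq record
zcat check.fastq.gz | wc -l       # line count; divide by 4 for the read count
```

In Galaxy itself, the equivalent spot-check is the dataset preview/view button plus the reported file size, as noted earlier in the thread.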

Hope this helps! :slight_smile: Help for how the sub-workflow creation was streamlined in the last release can be found here → 25.0 Galaxy Release (June 2025) — Galaxy Project 25.0.2.dev0 documentation

Related Q&A Tool for merging 2x single-read illumina sequencing files (fastq) into one?