Is there a way to merge collections with different structures? Or merge collections inside a list?

dede_sa · February 12, 2025, 12:44pm

I am working with a pipeline to analyze sequences, in which I want to upload multiple sequences into a collection, blast each one and retain top 10 results, download this 10 sequences and align the query and the 10 subject results from BLAST.

The worfklow worked only if I retained only one blast hit, as I could simply flaten the collection to it would have the same structure as the query collection for alignment. But if I want to align the query with multiple blast hits, the collection structure prevents me from doing it, as flattening it would result in all blast results in the same collection. I try to illustrate bellow what I want to do

Is there a way to transform the collections so this would work? I couldn’t use Merge Fasta in the nested list to achieve the same structure as my query dataset. If it is impossible I will separate the workflow in two, having to manually adjust datasets prior the final alignment.

jennaj · February 13, 2025, 1:08am

Welcome, @dede_sa

The short answer here is to try using tags (name, group, maybe both) to label your starting samples. These can be added during Upload as Collections or added later with a tool like Apply rules. You can filter on tags to reorganized or split off collections any way that you want.

That way any downstream files that were derived from a single starting sequence would all have the same tag(s) in the final files. Then run everything together and not worry too much about keeping it separate until you actually need the data separate again.

Remember: any particular file only exists on disc once – then collection folders are just references to that same file, and “duplications” of data across collections do not actually duplicate the data. This means you could have many collections, all holding the same files (or more/less), but each with a slightly different structure.

If interested, please

Start here for the tutorial guides → Using Galaxy and Managing your Data / Tutorial List

Then for example workflows that use these, maybe explore some IWC workflows. → GTN Pan-Galactic Workflow Search (owner = iwc)

For what these look like with a bit more context, you could explore here under the Applications menu. → https://galaxyproject.org/. Each project has several workflows – and these can have really complex tagging, collection split offs, sub-workflows, etc

I’m being brief, so please ask if you have more questions about any of this!

dede_sa · February 13, 2025, 12:46pm

I am sorry as to take a lot of time to figure this out, but when I follow the tutorial it works, but when I try with my data it seems different.

I am trying the name tagging for following all results from the same sequence. I figure in the end I can take all files from the collections and then merge the fasta based on the name tag prior to alignment, right? I couldn’t go far in this cause i am failing at the beggining.

First, all tutorials use rule-based upload for this, so that’s what I am trying. However, tutorials get their data from FTP. Rule-based collection is not allowing for me to upload my sequence. And if I upload the sequences as collection, when I try to select the collection for rule based, it does not work. Still, I was able to upload sequences as separate files under Regular Upload, then select it on the history and “build collection from rules” and the rule window opens.

I try to follow this tutorial for name tags Hands-on: Name tags for following complex histories / Name tags for following complex histories / Using Galaxy and Managing your Data
Using the regular expression (.*).ab1 I can create a new collumn with my sequence names, but when I try to add/modify column definition, there is no option for me to set column C as a “name” or “name tag” as the tutorial shows. Following the tutorial this options are available using their data they got from FTP, so maybe something is wrong with uploading my data and then trying to build the collection from it.

I failed to find any solutions to this.

jennaj · February 14, 2025, 1:17am

Hi @dede_sa

You could use this dedicated tool Tag elements.

You will need a simple tabular file with the element names of the data in the collection and the tag you want to add.

Use Extract element identifiers to get the existing element names then use Text Manipulation tools to parse that into the name: tags you wish to apply.

We have an example in a different tutorial here. The example is using group: again but you could use name:.

Hands-on: Reference-based RNA-Seq data analysis / Reference-based RNA-Seq data analysis / Transcriptomics → Click on the “Tag-based” tab and it will be the first section.

We should add that tool to the tutorial that you are following! I’ll ask the GTN to update it! I think the tool is newer than the tutorial and it somehow was missed. Ticket → Update the name tags tutorial to include Tag elements · Issue #5775 · galaxyproject/training-material · GitHub

And if you are not sure how to use Text Manipulation tools, we have a quick Cheatsheet in these with examples, plus the tools in the tool panel each have help. Most of these are analogs of command utilities and will have a similar name that works in a search, too.

Hands-on: Data Manipulation Olympics / Data Manipulation Olympics / Introduction to Galaxy Analyses
All → GTN Materials Search (query=olympics) – these include notebook and R methods

Hopefully this helps, but please let us know if it actually solves the larger processing goals and we can follow up more!

dede_sa · February 17, 2025, 5:10pm

Thanks for the suggestion. I was able to add name tags, but I can’t finish the workflow because I can’t find the right way to build a collection based on the name tags.

In the end, I think I need either to:

Take all sequences from their collection and merge FASTA based on the name tags
or 2) Take all sequences from their collection, build new collections based on name tags

However neither build list or merge FASTA tools shows options to do so based on tags.

jennaj · February 18, 2025, 6:36pm

Hi @dede_sa

I can think of two ways to approach this without seeing the workflow itself.

Filter using element identifiers. These are probably the same as originally assigned, yes?
Use group tags instead, then the Apply Rules tool to do the split, before sending those collections to the target tool(s).

You are still welcome to share the workflow. We can ask one of the developers to help if there is something else complex needed.

dede_sa · February 19, 2025, 1:41pm

Sorry for not sharing the workflow. It’s this one Galaxy

It isn’t finished because I can’t figure out a way to unite the outputs of raw_sequence, trimmed_sequence and blast_sequence in a single file per input ab1 sequence (i.e. I want to align the raw sequence with its quality trimmed version plus 5 blast results from this sequence). The objective is to help analyzing the results of sanger sequence files, given that the alignment with reference and the automated quality trim will help us define good quality basecalls with those of bad quality (which probably will result in gaps and poor alignment with reference).

I saw a few workflows that transforms fasta to tabular but I am missing the point of how this would help me group sequences by their original one.

Also another issue I found with this approach is that, even though I added name tags to the files, when the tool 8:NCBI Accession Download runs, the output file containing the fasta sequence from the BLAST results are not tagged by their query sequence (i.e. the name tags), only the other two output (error log and failed accession) contains the name tag. So, even if I manage to find a way to combine the collections mentioned before, it wouldn’t be by their name tags since the blast output would not be tagged this way.

jennaj · February 19, 2025, 9:39pm

Hi @dede_sa

Thanks for sharing the workflow.

I don’t have the right kind of data but even with “fake” data I can see the outputs. The name tags seems to be persistent throughout.

That said, you could switch to using a group tag instead, since that would allow you to use the Apply Rules tool to filter data based on the tag. Or add both and see what is useful.

You could also add in a sub-workflow. This works a bit like a foreach statement. For each original element (sequence), do this set of tasks. The last step could be the collapse step you were originally interested in. Then the result pushes back up to the main workflow.

Now – if you are running this with a just a few original sequences, this will probably work Ok, if you add in a top hit filter (right now you are keeping all hits!) and maybe adjust the target database (NT is super noisy, NR at a minimum would be a better choice and RefSeq for your species much much better).

I guess I am not really sure what you are trying to do. A batch of fastq sequences, BLAST’d against a noisy public database, then all of those hits pulled back out as sequences individually for an alignment (another blast or pairwise with another tool) … seems very noisy. Maybe there is a better way to process this data.

Meaning, this might run for a few sequences but I am guessing that NCBI is going to throttle this kind of query since they would much prefer that you pull in the full database directly in one go then run queries locally to extract individual sequences (you can also do that local sequence extraction in Galaxy).

If the goal is to do some kind of taxonomic or contamination screening, there are better methods. You can try anything you want to of course but I wanted to give you a warning about how much what you have now will scale in a big way. Too big for the public data repositories to allow that query pattern.

Hope this gives some options!

dede_sa · February 20, 2025, 12:46pm

Thanks for the reply. I think I am going somewhere, there is a last step before I can achieve my goal, but first let me explain it better.

We work with sanger sequence mostly for DNA barcoding diverse taxa. So we usually sequence COI, 12S, 18S or something on sanger. At most we will have up to 96 sequences at a time to analyze. Instead of going through manualy checking the electropherogram for base quality and blasting manually each sequence, I am developing this workflow to filter the sequence by quality and blast it (it will help us confirm the overall similarity and taxa of our sequence, to see if the sequence worked). I think the best database available to me is the NT, because a lot of our taxa are not represented on refseq. I don’t think this will be a problem since we will note be uploading FASTQ files such as those from NGS sequences (e.g. Illumina), so the input will be always small for this workflow. Also I added that the blast output would be just to top 5 hits

Having said that, I was able to Apply Rules to add group tags to all sequences, and merge the sequences into their proper collections. So the collection structure I have now is

Outer Collection
2) 4 Middle collection (one for each input ab1 sequence file) - This would scale based on the number of input ab1 sequence I run on the workflow
3) Up to 7 inner files, including the raw sequence, trimmed sequence and up to 5 BLAST hits fasta files

In order for me to use “FASTA merge” and effectively transform the 7 inner files into a single fasta for alignment, I would need somewhat to take the middle collections out of the outer one, so in the end I want this single collection to become one collection for each input sequence. In the scenario above, the single outer collection would become 4 outer collections. However, the tools “collapse collection”, “Unzip collection”, “flatten collecion” does not do that. I also tried tweaking with apply rules but I couldn’t find a way to do that using that tool. Do you have any idea of how this could be done? “Extract dataset” seems to be the tool for me but I just can’t figure out how to use this for my objective. In the image below I tried to point out the collection structure I have

jennaj · February 24, 2025, 11:27pm

Hi @dede_sa

This is very abstract without a shared workflow to discuss. Do you want to share what you have now? If you can also share a small testing dataset in a history that would also be helpful. Three sequences is probably enough.

dede_sa · February 25, 2025, 5:22pm

I’ve updated the workflow to the point where I am stuck. The last apply rules is not solving my issue to separate collection in a way that I can use Merge FASTA to create a single file fasta file per sequence/input, including the raw and editted sequence + its BLAST results.

In attachment I am adding the 4 sequences I am working on. One of those is poor quality and will result in error on the BLAST, but this shouldn’t affect downstream since I want the pipeline to be flexible in a way that it does not stop if a few of the sequences are poor quality. You can download the ab1 files here.

Thanks!

jennaj · March 24, 2025, 7:13pm

Hi @dede_sa

Apologies, I didn’t see that you had shared back the workflow!

Let’s ask one of our workflow developers to help from here. I think you will need to send each sequence into a sub-workflow where the downstream steps are performed then ending with the merge sequences tool, then that result is sent back up to the main workflow.

I’ve cross posted this over the the IWC chat room. They will probably reply here but also feel free to join the chat. You're invited to talk on Matrix

paulzierep · March 25, 2025, 7:20am

Could you share the workflow that you used when you only took one blast hit ? Maybe I could start from there

dede_sa · March 26, 2025, 12:48pm

Hi guys, thanks for the help. I ended up modifying the previous workshop to this one. I created a copy of it Galaxy but it is similar, only changing the blast output to retain a single result. I was trying to a single blast output because I figured that it would be easier and then I could just change it to retain 5 blast results, in the end it was still difficult and I couldn’t finish it, so I figured I might as well try using 5 blast results since it wasn’t as straigthforward as I hoped.

I couldn’t finish this one also because I was thinking since the begining that the workflow would need work even if low quality sequences were used. If you run the workflow, you can see that the poor quality sequence will be trimmed to a small fasta that does not result in any blast result. So in the end, when we download the NCBI results, the poor quality one will have no result. So we can’t use the merge fasta directly because it would merge collections with 4 sequences (raw fasta, trimmed fasta) with one with 3 sequences (BLAST output), so it won’t do it properly.

In the begining I tried tweaking trying to filter out empty collections (i.e. the blast result from the poor quality sequence), extract the sequences for which there is blast outputs as a list (i.e. the 3 sequences with output), to them use this list to filter all other collections to include only those that have blast output to then use “Merga FASTA”, but I couldn’t find a way to do that. I also didn’t want to exclude entirely the poor quality results, because it would still be able to be aligned with its small trimmed version.

Topic		Replies	Views
How to merge multiple fastq.gz files from one sample into one fastq.gz file usegalaxy.eu support gtn-tutorial , workflow , collections	3	3714	February 27, 2023
Merge 2 paired end collections collections , merge-collections , __apply_rules__	1	163	June 6, 2024
Subsetting collections and doing operations on an arbitary number of groups usegalaxy.eu support workflow , collections , snpeff	9	962	July 1, 2020
join 3 sequences usegalaxy.org support collections	1	524	October 26, 2021
Workflow ending in multiple VCFs instead of one single VCF usegalaxy.eu support workflow	5	21	July 26, 2024

Is there a way to merge collections with different structures? Or merge collections inside a list?

Related topics