sixpack translation problem

why did sixpack translate only a small portion of the >5000 sequences I uploaded?

Welcome, @BenMulder

It is difficult to know for certain, but it seems possible that the server where you are working couldn’t handle the larger processing job if you are inputting over 5k nucleotide sequences into a six-frame translation run.

As a potential workaround, you could split the larger file into smaller files, process each of those, then merge the results back together. Whether this is appropriate depends on the tool, but the one you are using is probably OK for this.

The process usually goes something like this: split, run, then merge. It runs in a batch, and if you put all of the tools into a simple workflow, it runs almost like a single (custom) tool.
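To make the split step concrete, here is a minimal Python sketch of breaking a multi-FASTA text into single-sequence records (this is just an illustration, not what the Galaxy split tool actually runs; the `example` data is made up):

```python
# Minimal sketch: parse a multi-FASTA text into (header, sequence) records,
# which could then each be written to their own single-sequence file.

def split_fasta(text):
    """Return a list of (header, sequence) records from FASTA-formatted text."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:].strip(), []
        elif line.strip():
            seq_lines.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records

# Hypothetical two-sequence input, just for the demo.
example = ">seq1\nATGAAA\n>seq2\nATGCCC\n"
records = split_fasta(example)
for name, seq in records:
    print(name, seq)
```

Each record would then go into its own file (one element of the collection), so every downstream job sees exactly one sequence.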

If this helps, please let us know. If you still need more help, please share more details. From what you describe, the server where you are working and the exact tool name/version are likely the most important details.

Let’s start there, thanks! 🙂

Hi Jennaj,
the split-run-concatenate approach you suggested worked only to a certain extent, as Sixpack translated just the first sequence of each dataset.
I did contemplate splitting my dataset into 5000 single-sequence files, but that is a lot.
Is there anything else I can do?
Any help is appreciated.

Hi @BenMulder

You are doing the split in a batch, then running the collection through in a batch, then merging in a batch, yes? Then the 5000 count doesn’t matter.

That is just three clicks – one per step above – no matter how much you are splitting up. It also distributes the work across cluster nodes, so it will process just as fast as, if not faster than, the merged file since the jobs can run in parallel.

I’m not sure whether the original EMBOSS command-line tool can process more than one sequence at a time, but rereading the tool form as a reminder to myself, I think this tool really does process just one sequence at a time. So what you are explaining and doing now is the best way to get your data through.
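For anyone curious what "six-frame translation per sequence" means in practice, here is a small Python sketch using the standard genetic code. This is only an illustration of the concept, not the EMBOSS sixpack implementation, and the example DNA string is invented:

```python
# Build the standard genetic code table from the conventional TCAG ordering.
bases = "TCAG"
amino = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = {
    b1 + b2 + b3: amino[16 * i + 4 * j + k]
    for i, b1 in enumerate(bases)
    for j, b2 in enumerate(bases)
    for k, b3 in enumerate(bases)
}

def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frame_translate(dna):
    """Translate one DNA sequence in all six reading frames (F1-F3, R1-R3)."""
    frames = {}
    for strand, seq in (("F", dna), ("R", reverse_complement(dna))):
        for offset in range(3):
            protein = "".join(
                codon_table[seq[i:i + 3]]
                for i in range(offset, len(seq) - 2, 3)
            )
            frames[strand + str(offset + 1)] = protein
    return frames

print(six_frame_translate("ATGAAATGA"))
```

Note the function takes a single sequence, which mirrors the one-sequence-at-a-time behavior discussed above; a collection of single-sequence files is the natural way to batch it.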

Maybe I am misunderstanding? So far I think that splitting up the query is the way to get this processed. Let us know how that works!

Later on, if any of the sub-jobs happen to fail, you can rerun just those (after maybe reviewing the input sequence to make sure that wasn’t the issue). When you rerun, there will be an extra option on the form to put the new result back into the original collection. There may be a few of these across 5k jobs since some can fail by chance.

I was able to obtain the 3-frame translation I was looking for by:

1. Uploading the FASTA file.
2. Using SPLITFASTA to split the FASTA file into a collection of 4530 single-sequence FASTA files.
3. Using SIXPACK to translate the collection of 4530 single-sequence FASTA files.
4. Trying Concatenate datasets tail-to-head (did not work).
5. Concatenating the results locally in the terminal.
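The final local concatenation step could be sketched in Python like this (the file names and directory here are made up for the demo, not the actual tool output names):

```python
# Minimal sketch of the local merge step: append the contents of many
# small per-sequence output files into one file, in sorted name order.
import os
import tempfile

def concatenate(paths, dest):
    """Write the contents of each file in `paths` into `dest`, in order."""
    with open(dest, "w") as out:
        for path in paths:
            with open(path) as part:
                out.write(part.read())

# Demonstrate with two throwaway files in a temporary directory.
tmp = tempfile.mkdtemp()
for name, text in [("part_001.fa", ">a\nMK*\n"), ("part_002.fa", ">b\nSFH\n")]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write(text)

parts = sorted(os.path.join(tmp, n) for n in os.listdir(tmp))
merged = os.path.join(tmp, "merged.fa")
concatenate(parts, merged)
print(open(merged).read())
```

Sorting the paths first keeps the merged output in a stable order, which matters if you want the results aligned with the original input order.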
Thank you very much for your help.

Thanks for letting us know, @BenMulder

An alternative tool is Collapse Collection (also in the tool panel).

But super glad this worked out!