Perhaps this is an easy one, but I haven’t yet found the answer. Can anyone tell me if there is a Galaxy tool that will filter a fasta (specifically a protein fasta) to yield only the longest isoform per gene?
BTW, I understand that the longest isoform may not be the best, but this is a way of getting a multi-fasta with one entry per gene.
I have proteomes in fasta format downloaded from UniProt and also NCBI. Something about the formatting of these fasta files is not letting me use the perl, python, and R suggestions I have found on various help forums. Some of them returned a file with almost nothing in it. Part of this difficulty is likely owing to my inexperience with troubleshooting code.
A solution on Galaxy would be good, and would also be invaluable to others who need this as part of their workflows.
Thanks for any tips!
It is an interesting question. I am not aware of a dedicated tool on Galaxy for this purpose. Files from NCBI and UniProt might use different rules for sequence names and hence require different processing.
You can do it in Galaxy. You need a tabular file with columns for gene name, transcript name, sequence length and protein sequence. The order of columns is irrelevant, and the file can have other columns. usegalaxy.* servers usually have tools for all the steps needed, such as getting sequence length, FASTA-to-tabular conversion, and splitting text into smaller strings. The exact processing depends on how the sequence names are formatted. Once you have the tabular file, use Datamash:
group by gene name
Print all fields from input file: set to Yes (default No)
Operation to perform: maximum on column with sequence length
This operation selects the longest sequence per gene and prints the entire line from the input tabular file.
The tabular file with the longest sequences can then be converted back into FASTA using the Tabular-to-FASTA converter.
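For anyone who wants to check the logic outside Galaxy, the same "group by gene, keep the maximum length" idea can be sketched in a few lines of Python. This is only a rough sketch, not a tested pipeline. It assumes UniProt-style headers, where the gene name appears as a GN=... field. Headers without a GN= field (for example typical NCBI protein headers) fall back to the sequence ID, so each such sequence ends up as its own "gene" unless you supply your own mapping. The function names (read_fasta, gene_key, longest_per_gene, write_fasta) and the demo accessions are made up for illustration.

```python
import re

# Assumption: the gene name is stored UniProt-style as "GN=<name>" in the header.
GENE_RE = re.compile(r"\bGN=(\S+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_key(header):
    """Return the GN= value if present, otherwise the first word of the header."""
    match = GENE_RE.search(header)
    if match:
        return match.group(1)
    words = header.split()
    return words[0] if words else header


def longest_per_gene(lines):
    """Keep one record per gene key: the longest sequence (first one wins ties)."""
    best = {}
    for header, seq in read_fasta(lines):
        key = gene_key(header)
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return list(best.values())


def write_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text, wrapped at `width` residues."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Made-up records, for illustration only.
    demo = [
        ">sp|P1|A_HUMAN Protein A OS=Homo sapiens OX=9606 GN=GENE1 PE=1 SV=1",
        "MKT",
        ">sp|P1-2|A_HUMAN Isoform 2 of Protein A OS=Homo sapiens OX=9606 GN=GENE1",
        "MKTAYIAK",
        ">sp|P2|B_HUMAN Protein B OS=Homo sapiens OX=9606 GN=GENE2 PE=1 SV=1",
        "MLLV",
    ]
    print(write_fasta(longest_per_gene(demo)), end="")
```

To run it on a real file, pass an open file handle to longest_per_gene (it only needs an iterable of lines) and write the result of write_fasta to a new file. If the output has as many entries as the input, the headers probably lack a GN= field and the fallback key is being used.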
Develop a workflow using a small subset of the data, check the output, and then use it with the full-sized dataset.
Here is an example: Galaxy | Australia | Accessible History | protein length
Hope this answers the question.
Thank you very much, @igor! Your method is a very good workaround!
Although the answer already given is better, you could also experiment a bit with a clustering tool. Most of the simpler clustering tools, such as cd-hit, return the longest sequence of each cluster as output.
Thanks so much, @gbbio. I didn’t realize cd-hit had a longest sequence algorithm, so I will look into it.
Since posting this question, I found that different platforms (NCBI, UniProt, EMBL, etc.) have different data available – sometimes it is already filtered as one isoform per gene, and other times not. In some cases, the user must specify “canonical and isoforms” in the download to get everything.
At any rate, I would have expected more ready-made tools to filter genomes for longest isoforms, seeing as it’s a very simple and common part of workflows. AGAT has a solution that works well for gff3 files. If anyone has any other ideas for fasta files, please share.