longest isoform per gene

jaredbernard · May 9, 2022, 9:25pm

Perhaps this is an easy one, but I haven’t yet found the answer. Can anyone tell me if there is a Galaxy tool that will filter a fasta (specifically a protein fasta) to yield only the longest isoform per gene?

BTW, I understand that the longest isoform may not be the best, but this is a way of getting a multi-fasta with one entry per gene.

I have proteomes in fasta format downloaded from UniProt and also NCBI. Something about the formatting of these fasta files is not letting me use the perl, python, and R suggestions I have found on various help forums. Some of them returned a file with almost nothing in it. Part of this difficulty is likely owing to my inexperience in troubleshooting code.

A solution on Galaxy would be good, and would also be invaluable to others who need this as part of their workflows.

Thanks for any tips!

igor · May 10, 2022, 6:07am

Hi @jaredbernard
It is an interesting question. I am not aware of a dedicated tool on Galaxy for this purpose. Files from NCBI and UniProt may use different conventions for sequence names and therefore require different processing.

You can do it in Galaxy. You need a tabular file with columns for gene name, transcript name, sequence length, and protein sequence. The order of the columns is irrelevant, and the file can have other columns. usegalaxy.* servers usually have tools for all the steps needed, such as getting sequence length, FASTA-to-tabular conversion, and splitting text into smaller strings. The processing depends on the sequence names. Once you have the tabular file, use Datamash:
group by gene name
Print all fields from input file: set to Yes (default No)
Operation to perform: maximum on the column with sequence length
This operation selects the longest sequence per gene and prints the entire line from the input tabular file.
The tabular file with the longest sequences can then be converted back into FASTA using the Tabular-to-FASTA converter.
Develop the workflow on a small subset of data, check the output, and then run it on the full-sized dataset.
Here is an example: Galaxy | Australia | Accessible History | protein length
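
For anyone who wants to check the Galaxy output outside Galaxy, the same logic (FASTA to records, group by gene, keep the maximum length) can be sketched in a few lines of Python. This is only a rough sketch, not a tested pipeline: it assumes UniProt-style headers where the gene name appears as a GN= field, and the function names (read_fasta, longest_per_gene) and the example sequences are made up for illustration. NCBI headers do not carry a GN= field, so they would need a different rule for finding the gene name.

```python
import re

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Assumption: UniProt-style header with the gene name in a GN= field.
GENE_RE = re.compile(r"\bGN=(\S+)")

def longest_per_gene(records):
    """Return {gene: (header, sequence)} keeping the longest sequence per gene."""
    best = {}
    for header, seq in records:
        match = GENE_RE.search(header)
        # Assumption: entries without GN= are kept, keyed by their full header.
        gene = match.group(1) if match else header
        # Strict ">" means the first entry wins when lengths are tied.
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best

# Made-up example records, for illustration only.
example = """>sp|P1|A_HUMAN Protein A OS=Homo sapiens OX=9606 GN=GENE1 PE=1 SV=1
MKT
AAA
>sp|P1-2|A_HUMAN Isoform 2 of Protein A OS=Homo sapiens OX=9606 GN=GENE1
MKTAAAGGG
>tr|Q2|Q2_HUMAN Protein B OS=Homo sapiens OX=9606 GN=GENE2 PE=1 SV=1
MSS
""".splitlines()

result = longest_per_gene(read_fasta(example))
for header, seq in result.values():
    print(f">{header}\n{seq}")
```

To run it on a real file, pass an open file handle to read_fasta instead of the example lines. If the result is nearly empty or has one entry per sequence, the headers probably do not match the GN= pattern, which may be the same problem that breaks other scripts found on help forums.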

Hope this answers the question.
Kind regards,
Igor

jaredbernard · May 10, 2022, 8:25pm

Thank you very much, @igor! Your method is a very good workaround!

gbbio · May 11, 2022, 6:42am

Although the answer already given is better, you could also experiment a bit with a clustering tool. The simpler clustering tools, such as cd-hit, mostly give back the longest sequence of each cluster as output.

jaredbernard · May 11, 2022, 2:44pm

Thanks so much, @gbbio. I didn’t realize cd-hit had a longest sequence algorithm, so I will look into it.

Since posting this question, I found that different platforms (NCBI, UniProt, EMBL, etc.) have different data available – sometimes it is already filtered as one isoform per gene, and other times not. In some cases, the user must specify “canonical and isoforms” in the download to get everything.

At any rate, I would have expected more ready-made tools to filter genomes for longest isoforms, seeing as it’s a very simple and common part of workflows. AGAT has a solution that works well for gff3 files. If anyone has any other ideas for fasta files, please share.
