Hi,
I want to get gene IDs from the results of Salmon. Which GTF/CSV should be used? Do I need to use tximport? Please help.
For counts "by gene", you'll need to incorporate a gene-transcript mapping file (tabular, two columns) or a reference annotation dataset (GTF/GFF). The transcript names should exactly match how your transcripts are labeled in the input transcript fasta.
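If you only have a GTF and want the two-column tabular form, here is a minimal sketch of how one could be derived. It assumes the GTF carries `gene_id` and `transcript_id` attributes in the ninth column; the function names and the example lines are made up for illustration and are not part of Salmon or Galaxy.

```python
import re

# Matches GTF attribute pairs such as: gene_id "GENE_A"; transcript_id "TX_A.1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_to_tx2gene(gtf_lines):
    """Return sorted, de-duplicated (transcript_id, gene_id) pairs from GTF lines."""
    pairs = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        tx, gene = attrs.get("transcript_id"), attrs.get("gene_id")
        if tx and gene:  # ignore records without both identifiers
            pairs.add((tx, gene))
    return sorted(pairs)


def format_tx2gene(pairs):
    """Render the pairs as tab-separated text: transcript first, gene second."""
    return "".join(f"{tx}\t{gene}\n" for tx, gene in pairs)


# Hypothetical GTF lines, invented for this example only.
example_gtf = [
    'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A.1";\n',
    'chr1\texample\texon\t300\t400\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A.1";\n',
    'chr1\texample\texon\t100\t250\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A.2";\n',
    'chr2\texample\texon\t500\t900\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_B.1";\n',
]
print(format_tx2gene(gtf_to_tx2gene(example_gtf)), end="")
```

Each transcript appears once, and transcripts of the same gene share the value in the second column. That shared value is what allows grouping "by gene".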
Input the annotation data under this option on the Salmon tool form: File containing a mapping of transcripts to genes
In most cases, there is no special way that you need to upload the data. Use the default settings, and if you don't get the expected datatype, try a search at this forum with it as a keyword. You'll find much Q&A about formats, datatypes, tips to standardize content, and how to reassign a type if Galaxy guesses wrong (unlikely with the formats expected for this particular input). If you cannot solve it, send back a picture of the fully expanded fasta and annotation datasets (both), with the datatype and the "peek view" showing, and we can troubleshoot more from there.
Hope that helps!
Thank you @jennaj. I want to know: if we have Salmon output with transcript IDs, is there a way to map and convert them to gene IDs?
I have provided a tabular file with the transcript ID in the first column and the gene ID in the second column. I am getting the NM_ transcript IDs in the Salmon output.
Did you successfully get both outputs after incorporating the mapping file? The second represents counts "by gene". Quote from the tool form:
Salmon will output both quant.sf and quant.genes.sf files, where the latter contains aggregated gene-level abundance estimates.
This will require that the transcript identifiers in the fasta exactly match the transcript identifiers in the mapping file. Maybe check that. From your description, it seems like one of these things might be going on:
- First, confirm that you incorporated gene identifiers. If the transcript identifier is the same as the gene identifier on rows in the mapping file, then you are still counting "by transcript". So, you need to add in the gene identifiers if you want to group/count "by gene". Or, you can use a GTF file (faster). UCSC hosts those in their Downloads area. You'll want the RefGene annotation track. This FAQ lists out a few exact links, but the navigation path is the same for any genome they host that has an annotation track: Help for Differential Expression Analysis - Galaxy Community Hub
- Do the transcript identifiers exactly match between the fasta and mapping file? This means that the content on each ">" title line is only a single transcript identifier. If you need to remove description content from the title lines, try the tool NormalizeFasta using the option to remove anything after the first whitespace. FAQ: Datatypes - Galaxy Community Hub
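Both checks above can be done outside of Galaxy with a few lines of Python. This is a rough sketch, not a Galaxy or Salmon utility: the function names are hypothetical, and it treats the first whitespace-delimited token of each ">" line as the fasta identifier and the first tab-separated column as the mapping identifier.

```python
def fasta_ids(fasta_lines):
    """Yield the first whitespace-delimited token of every '>' title line."""
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                yield parts[0]


def read_mapping(mapping_lines):
    """Return {transcript_id: gene_id} from two-column, tab-separated lines."""
    mapping = {}
    for line in mapping_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0]:
            mapping[fields[0]] = fields[1]
    return mapping


def check_mapping(fasta_lines, mapping_lines):
    """Report identifier mismatches and rows where gene == transcript."""
    in_fasta = set(fasta_ids(fasta_lines))
    mapping = read_mapping(mapping_lines)
    return {
        "matched": sorted(in_fasta & set(mapping)),
        "fasta_only": sorted(in_fasta - set(mapping)),
        "mapping_only": sorted(set(mapping) - in_fasta),
        # Rows like these still count "by transcript"
        "gene_equals_transcript": sorted(t for t, g in mapping.items() if t == g),
    }


# Hypothetical data, invented for this example only.
example_fasta = [">TX_A.1 some description\n", "ACGT\n", ">TX_B.1\n", "GGCC\n"]
example_mapping = ["TX_A.1\tGENE_A\n", "TX_C.1\tTX_C.1\n"]
report = check_mapping(example_fasta, example_mapping)
for key, values in report.items():
    print(key, values)
```

An empty "matched" list usually points to a formatting difference (prefix, version suffix, description text) rather than to missing annotation.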
Notes: Converting transcript identifiers to gene identifiers is not a one-to-one relationship. There are often multiple transcripts per gene. A count file with rows of duplicated gene identifiers won't work with differential expression tools that are based on "gene". Have Salmon summarize counts across all transcripts associated with a particular gene. Use the same transcript fasta file and the same mapping file for all samples. In the end, you'll want output that has: one gene per row, no duplicated genes within any particular file, and the same exact genes in the same exact order in each of the count files input to DESeq2.
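To make the many-to-one relationship concrete, here is a small sketch of what "summarize across transcripts" means. It is not how Salmon is implemented (Salmon writes quant.genes.sf itself when given the mapping file). It only sums the TPM and NumReads columns of a quant.sf-style table per gene, then checks the "one gene per row, same order" requirement. The helper names and the sample values are hypothetical.

```python
import csv
from collections import defaultdict


def aggregate_by_gene(quant_lines, tx2gene):
    """Sum TPM and NumReads per gene; also return transcripts with no gene."""
    totals = defaultdict(lambda: {"TPM": 0.0, "NumReads": 0.0})
    unmapped = []
    for row in csv.DictReader(quant_lines, delimiter="\t"):
        gene = tx2gene.get(row["Name"])
        if gene is None:
            unmapped.append(row["Name"])  # identifier mismatch or missing row
            continue
        totals[gene]["TPM"] += float(row["TPM"])
        totals[gene]["NumReads"] += float(row["NumReads"])
    return dict(sorted(totals.items(), key=lambda kv: kv[0])), unmapped


def same_genes_same_order(gene_lists):
    """True if no list has duplicate genes and all lists are identical."""
    gene_lists = [list(genes) for genes in gene_lists]
    if not gene_lists:
        return True
    no_duplicates = all(len(set(g)) == len(g) for g in gene_lists)
    return no_duplicates and all(g == gene_lists[0] for g in gene_lists)


# Hypothetical quant.sf content and mapping, invented for this example only.
example_quant = [
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n",
    "TX_A.1\t1000\t850.0\t10.0\t100.0\n",
    "TX_A.2\t1200\t1050.0\t5.0\t50.0\n",
    "TX_B.1\t900\t750.0\t2.5\t20.0\n",
    "TX_C.1\t800\t650.0\t1.0\t3.0\n",
]
example_tx2gene = {"TX_A.1": "GENE_A", "TX_A.2": "GENE_A", "TX_B.1": "GENE_B"}

gene_totals, unmapped = aggregate_by_gene(example_quant, example_tx2gene)
print(gene_totals)
print("unmapped transcripts:", unmapped)
```

The two transcripts of GENE_A collapse into one row, and any transcript reported as unmapped is a sign that the fasta and mapping identifiers are not formatted the same way.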
If you need more help, please do post some screenshots of your data, and we might be able to spot the problem.
I see a few problems with these fasta title lines (the ">" lines). These need to be just transcript identifiers in the same exact format as included in your transcript-to-gene mapping file. Item 1 below definitely needs to be addressed. Items 2 and/or 3 might need to be changed.
This formatting problem will definitely cause a mismatch and needs to be fixed before that fasta file is used with any tool. It is the minimal change required, and the tool NormalizeFasta can be used (explained in the prior post).
1. extra content after the first whitespace: "range" and everything after should be removed, as it is not part of the transcript identifier portion of the fasta title lines
These two might cause problems. It depends on how those same identifiers are formatted in your mapping file.
2. extra content before the first whitespace, where the "identifier" is located: the "hg38_knownGene_" portion might need to be removed
3. the transcript identifiers include the version (the extra .N where N is a number). If your mapping file includes the version, then it is a match. If not, then it won't match.
The mapping file wasn't posted, but you can compare it yourself to the fasta identifiers.
- This: ENST00000456328
- Is different than: ENST00000456328.2 or hg38_knownGene_ENST00000456328.2
- And none of those will match up with the full title ">" line as shown in your screenshot.
Identifiers cannot include any whitespace (spaces, tabs). And as long as the identifier used matches in all your files, and doesn't include any whitespace, this tool will work.
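If you would rather script the cleanup than use a tool, the three title-line issues can be sketched like this. It is only an illustration: `normalize_id` is a hypothetical helper, the `range=` text in the example is invented, and whether to drop the prefix or the version depends entirely on how your mapping file is written.

```python
import re


def normalize_id(title_line, prefix="hg38_knownGene_", keep_version=True):
    """Reduce a fasta title line to a bare transcript identifier."""
    parts = title_line.lstrip(">").split()
    if not parts:
        return ""  # empty title line, nothing to extract
    token = parts[0]  # drop everything after the first whitespace
    if prefix and token.startswith(prefix):
        token = token[len(prefix):]  # drop the track prefix
    if not keep_version:
        token = re.sub(r"\.\d+$", "", token)  # drop a trailing .N version
    return token


# Hypothetical title line; the description text is made up for illustration.
title = ">hg38_knownGene_ENST00000456328.2 range=chr1:100-200"

print(normalize_id(title, prefix=""))           # whitespace content removed only
print(normalize_id(title))                      # prefix also removed
print(normalize_id(title, keep_version=False))  # version also removed
```

Pick the variant that reproduces the identifiers in your mapping file exactly, and apply the same choice to every file.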
Hope you solved the problem already, but if not yet, the above should help you to find and resolve the mismatches.
FAQ for fasta: Datatypes - Galaxy Community Hub
Hi,
Thank you for the detailed explanation. But if I give transcript IDs in the fasta file, how will DESeq2 give output in "ENSG" (gene ID) format?
You'll need to provide the transcript-to-gene information when executing Salmon. That is the option discussed earlier in the post: File containing a mapping of transcripts to genes
If you still have a transcript fasta from UCSC's Known Gene track (hg38), the matching annotation file is here: Index of /goldenPath/hg38/bigZips/genes
You'll need to check that the ID formats in the transcript fasta match the IDs in the reference annotation you choose. That was never posted back, but you can examine it yourself. The how-to is also above in prior replies on this topic.
Be aware the Known Genes annotation track combines a few different annotation sources, not just Ensembl. If you only want Ensembl, then choose that single annotation track instead for the transcript fasta and the annotation.
I can't find the Ensembl transcript fasta in the UCSC downloads area, but it might be there, or the UCSC support team can help you to find/get it. Extracting transcript fasta from the Table Browser wasn't a good choice last time I checked (can be too much data for that method). Or you could choose RefSeq instead; both the fasta and GTF are available: Index of /goldenPath/hg38/bigZips