I want to get gene IDs from the results of Salmon. Which GTF/CSV should be used? Do I need to use tximport? Please help.
For counts “by gene”, you’ll need to incorporate a gene-transcript mapping file (tabular, two columns) or a reference annotation dataset (GTF/GFF). The transcript names should exactly match how your transcripts are labeled in the input transcript fasta.
Input the annotation data under this option on the Salmon tool form: File containing a mapping of transcripts to genes
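If you only have a GTF and would rather build the two-column tabular mapping yourself, here is a minimal Python sketch of the idea. It assumes standard GTF attributes named gene_id and transcript_id in the ninth column. The function name and the example rows (TX1, GENE_A, and so on) are made up for illustration and are not from any real annotation.

```python
import re

# Matches GTF attributes such as: gene_id "GENE_A"; transcript_id "TX1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_to_tx2gene(gtf_lines):
    """Return sorted, de-duplicated (transcript_id, gene_id) pairs from GTF lines."""
    pairs = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            pairs.add((attrs["transcript_id"], attrs["gene_id"]))
    return sorted(pairs)


# Illustrative records only (hypothetical identifiers)
example_gtf = [
    'chr1\tsource\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX1";\n',
    'chr1\tsource\texon\t300\t400\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX1";\n',
    'chr1\tsource\texon\t100\t250\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX2";\n',
    'chr2\tsource\texon\t500\t900\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX3";\n',
]

for tx, gene in gtf_to_tx2gene(example_gtf):
    print(f"{tx}\t{gene}")  # transcript first, gene second
```

The printed rows are transcript in the first column and gene in the second, which is the layout described above for the tabular mapping.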
There is no special way that you need to upload the data in most cases. Use default settings, and if you don’t get the expected datatype, try a search at this forum with it as a keyword. You’ll find much Q&A about formats, datatypes, tips to standardize content, and how to reassign a type if Galaxy guesses wrong (unlikely with the formats expected for this particular input). If you cannot solve it, send back a picture of the full expanded fasta and annotation datasets (both), with the datatype and the “peek view” showing, and we can troubleshoot more from there.
Hope that helps!
Thank you @jennaj. I want to know: if we have Salmon output with transcript IDs, is there a way to map and convert them to gene IDs?
I have provided a tabular file with the transcript ID in the first column and the gene ID in the second column. I am still getting the NM_ transcript IDs in the Salmon output.
Did you successfully get both outputs after incorporating the mapping file? The second represents counts “by gene”. Quote from the tool form:
Salmon will output both quant.sf and quant.genes.sf files, where the latter contains aggregated gene-level abundance estimates.
This will require that the transcript identifiers in the fasta exactly match the transcript identifiers in the mapping file. Maybe check that – from your description, it seems like one of these things might be going on:
First, confirm that you incorporated gene identifiers. If the transcript identifier is the same as the gene identifier on rows in the mapping file, then you are still counting “by transcript”. So, you need to add in the gene identifiers if you want to group/count “by gene”. Or, you can use a GTF file (faster). UCSC hosts those in their Downloads area. You’ll want the RefGene annotation track. This FAQ lists out a few exact links, but the navigation path is the same for any genome they host that has an annotation track: Help for Differential Expression Analysis - Galaxy Community Hub
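To check the first point quickly, a small Python sketch like the one below can count how many rows of a mapping file have the same value in both columns. It assumes a two-column, tab-separated file with no header line. The function name and example rows are hypothetical.

```python
def summarize_mapping(lines):
    """Summarize a two-column (transcript, gene) tab-separated mapping."""
    n_rows = 0
    n_identical = 0
    transcripts_per_gene = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        tx, gene = line.split("\t")[:2]
        n_rows += 1
        if tx == gene:
            n_identical += 1  # this row cannot group anything "by gene"
        transcripts_per_gene.setdefault(gene, set()).add(tx)
    return {
        "rows": n_rows,
        "identical_rows": n_identical,
        "distinct_genes": len(transcripts_per_gene),
    }


# Illustrative rows only (hypothetical identifiers)
example_mapping = ["TX1\tGENE_A\n", "TX2\tGENE_A\n", "TX3\tTX3\n"]
print(summarize_mapping(example_mapping))
```

If identical_rows equals rows, the second column is just repeating the transcript identifiers, and the "by gene" output will look the same as the "by transcript" output.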
Do the transcript identifiers exactly match between the fasta and mapping file? This means that any content on the “>” title line is only a single transcript identifier. If you need to remove description content from the title lines, try the tool NormalizeFasta using the option to remove anything after the first whitespace. FAQ: Datatypes - Galaxy Community Hub
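For anyone who wants to see what "remove anything after the first whitespace" means in practice, here is a rough Python sketch of the same idea. It is not how NormalizeFasta is implemented, only an illustration of the effect on the title lines. The function name and the example record are hypothetical.

```python
def trim_fasta_headers(lines):
    """Yield fasta lines with each ">" title line cut at the first whitespace."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            identifier = parts[0] if parts else ""
            yield ">" + identifier + "\n"
        else:
            yield line  # sequence lines are passed through unchanged


# Illustrative record only (hypothetical identifier and description)
example_fasta = [
    ">TX0001.2 range=chr1:100-200 strand=+\n",
    "ACGTACGTAC\n",
    "GGCCTTAA\n",
]
print("".join(trim_fasta_headers(example_fasta)), end="")
```

After this, the title line holds only the identifier, which is what has to match the first column of the mapping file.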
Notes: Converting transcript identifiers to gene identifiers is not a one-to-one relationship. There are often multiple transcripts per gene. A count file with rows of duplicated gene identifiers won’t work with differential expression tools that are based on “gene”. Have Salmon summarize counts across all transcripts associated with a particular gene. Use the same transcript fasta file and the same mapping file for all samples. In the end, you’ll want output that has: one gene per row, no duplicated genes within any particular file, and the same exact genes in the same exact order in each of the count files input to the differential expression tool.
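To make the "multiple transcripts per gene" point concrete, here is a simplified Python sketch that sums the NumReads column of a quant.sf table per gene. Salmon does its own summarization when the mapping is supplied, so this is only an illustration of the grouping idea and not a replacement for quant.genes.sf. The function name, the mapping dictionary, and the example values are all made up.

```python
import csv
import io
from collections import defaultdict


def sum_reads_by_gene(quant_sf_text, tx2gene):
    """Sum NumReads over all transcripts of each gene; genes returned in sorted order."""
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(quant_sf_text), delimiter="\t")
    for row in reader:
        gene = tx2gene.get(row["Name"])
        if gene is None:
            continue  # transcript missing from the mapping, so it cannot be grouped
        totals[gene] += float(row["NumReads"])
    return dict(sorted(totals.items()))


# Illustrative data only (hypothetical identifiers and values)
tx2gene = {"TX1": "GENE_A", "TX2": "GENE_A", "TX3": "GENE_B"}
sample_1 = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "TX1\t1000\t800\t10.0\t10.0\n"
    "TX2\t1200\t1000\t5.0\t5.5\n"
    "TX3\t900\t700\t2.0\t2.0\n"
)
sample_2 = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "TX3\t900\t700\t4.0\t4.0\n"
    "TX1\t1000\t800\t1.0\t1.0\n"
    "TX2\t1200\t1000\t0.0\t0.0\n"
)

genes_1 = sum_reads_by_gene(sample_1, tx2gene)
genes_2 = sum_reads_by_gene(sample_2, tx2gene)
print(genes_1)
print(genes_2)
# Same genes, in the same order, in every sample
print(list(genes_1) == list(genes_2))
```

Each gene appears once per sample, and because every sample uses the same mapping, the gene rows line up across files.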
If you need more help, please do post some screenshots of your data, and we might be able to spot the problem.
I see a few problems with these fasta title lines (the “>” lines). These need to be just transcript identifiers in the same exact format as included in your transcript-to-gene mapping file. Item 1 below definitely needs to be addressed. Items 2 and/or 3 might need to be changed.
1. Extra content after the first whitespace: “range” and everything after should be removed, as it is not part of the transcript identifier portion of the fasta title lines.
This formatting problem will definitely cause a mismatch and needs to be fixed before that fasta file is used with any tool. It is the minimal change required, and the tool NormalizeFasta can be used (explained in the prior post).
These two might cause problems, depending on how those same identifiers are formatted in your mapping file.
2. Extra content before the first whitespace, where the “identifier” is located: the “hg38_knownGene_” portion might need to be removed.
3. The transcript identifiers include the version (the extra .N, where N is a number). If your mapping file includes the version, then it is a match. If not, then it won’t match.
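If the prefix or the version does turn out to be a mismatch with your mapping file, the cleanup could look something like this Python sketch. Whether to strip either part depends entirely on how the identifiers look in the mapping file, so both are optional here. The function name and the example identifier TX0001.2 are hypothetical.

```python
import re


def normalize_id(identifier, prefix="hg38_knownGene_", drop_version=True):
    """Optionally strip a leading prefix and a trailing .N version from an identifier."""
    if prefix and identifier.startswith(prefix):
        identifier = identifier[len(prefix):]
    if drop_version:
        # Remove a final ".N" where N is one or more digits
        identifier = re.sub(r"\.\d+$", "", identifier)
    return identifier


# Illustrative identifier only (hypothetical)
print(normalize_id("hg38_knownGene_TX0001.2"))
print(normalize_id("hg38_knownGene_TX0001.2", drop_version=False))
```

Apply the same choice to every file, so that the fasta and the mapping end up with identifiers in one consistent format.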
The mapping file wasn’t posted, but you can compare it yourself against the fasta identifiers.
- Is different than
- And none of those will match up with the full title “>” line as shown in your screenshot.
Identifiers cannot include any whitespace (spaces, tabs). And as long as the identifier used matches in all your files, and doesn’t include any whitespace, this tool will work.
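A quick way to find the mismatches is to compare the two sets of identifiers directly. The Python sketch below assumes you have already pulled the identifiers out of the fasta title lines and out of the first column of the mapping file. The function name and the example identifiers are made up.

```python
def compare_ids(fasta_ids, mapping_ids):
    """Report which identifiers match, which do not, and which contain whitespace."""
    fasta_ids = set(fasta_ids)
    mapping_ids = set(mapping_ids)
    return {
        "matched": sorted(fasta_ids & mapping_ids),
        "only_in_fasta": sorted(fasta_ids - mapping_ids),
        "only_in_mapping": sorted(mapping_ids - fasta_ids),
        "with_whitespace": sorted(
            i for i in fasta_ids | mapping_ids if any(c.isspace() for c in i)
        ),
    }


# Illustrative identifiers only (hypothetical)
report = compare_ids(
    fasta_ids=["TX0001.2", "TX0002.1", "TX0003.1 range=chr1:1-10"],
    mapping_ids=["TX0001.2", "TX0002", "TX0003.1"],
)
for key, values in report.items():
    print(key, values)
```

Anything listed under only_in_fasta will not be assigned to a gene, so that list is the one to drive toward empty.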
Hope you solved the problem already, but if not yet, the above should help you to find and resolve the mismatches.
fasta: Datatypes - Galaxy Community Hub
Thank you for the detailed explanation. But if I give transcript IDs in the fasta file, how will DESeq2 give output in ‘ENSG’ (gene ID) format?
You’ll need to provide the transcript-to-gene information when executing Salmon. That is the option discussed earlier in the post: File containing a mapping of transcripts to genes
If you still have a transcript fasta from UCSC’s Known Gene track (hg38), the matching annotation file is here: Index of /goldenPath/hg38/bigZips/genes
You’ll need to check that the ID formats in the transcript fasta match the IDs in the reference annotation you choose. That was never posted back, but you can examine it – how-to is also above in prior replies on this topic.
Be aware the Known Genes annotation track combines a few different annotation sources, not just Ensembl. If you only want Ensembl, then choose that single annotation track instead for the transcript fasta and the annotation.
I can’t find the Ensembl transcript fasta in the UCSC downloads area, but it might be there or the UCSC support team can help you to find/get it. Extracting transcript fasta from the Table Browser wasn’t a good choice last time I checked (can be too much data for that method). Or you could choose RefSeq instead, both the fasta and GTF are available. Index of /goldenPath/hg38/bigZips