Correct reference transcriptome for Salmon quant on existing RNASTAR alignments

SRFlin · May 15, 2024, 5:30am

I’m trying to use Salmon quant to generate TPM count files on alignments produced using RNASTAR (human samples aligned to the HG38 genome build). I’m aware that I need a reference transcriptome for salmon and downloaded what I believed to be the correct one from the UCSC downloads page (Index of /goldenPath/hg38/bigZips). I have attempted using both the mrna.fa and refMrna.fa files but am still getting errors when I try to run the tool. I am unsure whether the problem is that I still have the wrong reference file or whether the issue is something else I’m not understanding.

The error looks like the enclosed screenshot, and this continues for numerous other transcript references.

For everything else I’ve done so far I’ve been following the ‘reference based RNA-seq data analysis’ tutorial on the galaxy training page (Hands-on: Reference-based RNA-Seq data analysis / Transcriptomics), but unfortunately this doesn’t deal with generating TPM count files or salmon and I can’t find a suitable tutorial which does.

jennaj · May 15, 2024, 6:36pm

Hi @SRFlin

This FAQ explains how to check for mismatches in reference data used in many protocols, including for Salmon → FAQ: Extended Help for Differential Expression Analysis Tools
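A rough sketch of the kind of mismatch check the FAQ describes: compare the sequence identifiers in the transcriptome fasta with the `transcript_id` values in the GTF. The function names and the toy identifiers (`TX1`, `TX2`, `TX3`) are made up for illustration; real files would be read from disk rather than from strings.

```python
import re

def fasta_ids(fasta_text):
    """Collect the first whitespace-delimited token of every '>' header line."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids

def gtf_transcript_ids(gtf_text):
    """Collect transcript_id attribute values, skipping '#' header lines."""
    ids = set()
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        match = re.search(r'transcript_id "([^"]+)"', line)
        if match:
            ids.add(match.group(1))
    return ids

# Toy data (hypothetical identifiers, not taken from any real annotation).
fasta_text = ">TX1 some description\nACGT\n>TX2\nGGCC\n"
gtf_text = (
    "#header line that some tools dislike\n"
    'chr1\ttoy\texon\t1\t4\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";\n'
    'chr1\ttoy\texon\t9\t12\t.\t+\t.\tgene_id "G3"; transcript_id "TX3";\n'
)

in_fasta = fasta_ids(fasta_text)
in_gtf = gtf_transcript_ids(gtf_text)
shared = in_fasta & in_gtf
only_fasta = in_fasta - in_gtf
only_gtf = in_gtf - in_fasta
print(sorted(shared), sorted(only_fasta), sorted(only_gtf))
```

If `shared` is empty or very small, the two files probably come from different sources or use different identifier formats.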

Then one more reference, although I don’t think you’ll need it for this question → Reference genomes at public Galaxy servers: GRCh38/hg38 example

For the UCSC hg38 reference genome indexed in Galaxy, a reference annotation GTF and a reference transcriptome fasta can be sourced from at least these two places:

Gencode

https://www.gencodegenes.org/human/
Get the first in the list of GTFs and the first in the list of fasta files.
Double-check the formatting. You might need to standardize the fasta with “NormalizeFasta” (I can’t remember if this is needed), and I would remove the GTF headers too (some tools might have a problem with them). The FAQ above has instructions for both.

UCSC

These two are a match, and have standard human Gene Symbol and RefSeq transcript identifiers.

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/refMrna.fa.gz
The GTF will be ready to use after Upload, and the fasta will need to be uncompressed (under the pencil icon) then run through NormalizeFasta to strip out the extra characters.
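To illustrate what "strip out the extra characters" means for the fasta headers, here is a minimal sketch that keeps only the identifier on each `>` line. It is a stand-in for the header clean-up, not the NormalizeFasta tool's own implementation; the function name and the `extra_delims` parameter are hypothetical, and the toy headers are made up.

```python
import re

def simplify_fasta_headers(fasta_text, extra_delims=""):
    """Truncate each header at the first whitespace or extra delimiter.

    extra_delims is a hypothetical convenience for headers that pack
    several fields together with a character such as '|'.
    """
    splitter = re.compile("[\\s" + re.escape(extra_delims) + "]")
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = splitter.split(line[1:].strip(), maxsplit=1)[0]
            out.append(">" + name)
        else:
            out.append(line)  # sequence lines pass through unchanged
    return "\n".join(out) + "\n"

# Toy input with made-up identifiers.
raw = ">TX1 extra words here\nACGT\n>TX2|GENE2|more\nGGCC\n"
cleaned = simplify_fasta_headers(raw, extra_delims="|")
print(cleaned, end="")
```

The point is that the name left after the `>` should be the exact identifier used in the matching GTF.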

Technically, any of the reference annotation GTFs in their Downloads area are based on “Gene and Gene Predictions” tracks also represented in the Table browser (or main Browser). This means you can extract a reference transcriptome fasta from the Table browser.

Just note that extracting that way means the transcript footprints (genomic coordinates) are used to extract the transcript sequence based on the genomic sequence. The result might be slightly different from the actual reference transcriptome you might find from other sources! This potential difference probably doesn’t matter for RNA-seq DE analysis, but would if later on you want to call variants or something else sensitive to specific base calls. So, if you care about the bases, use Gencode or the RefSeq above.
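A small sketch of what "using the transcript footprint" amounts to: the exon intervals are cut out of the genomic sequence and joined, then reverse-complemented for minus-strand transcripts. The function name and toy sequence are made up, and coordinates are assumed to be 1-based inclusive, as in GTF.

```python
# Translation table for complementing bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def spliced_sequence(chrom_seq, exons, strand):
    """Join exon slices of chrom_seq; exons are (start, end), 1-based inclusive."""
    pieces = [chrom_seq[start - 1:end] for start, end in sorted(exons)]
    seq = "".join(pieces)
    if strand == "-":
        # Reverse complement for transcripts on the minus strand.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq

# Toy "chromosome" and a two-exon transcript footprint.
chrom = "AAACCCGGGTTT"
exons = [(1, 3), (7, 9)]
plus = spliced_sequence(chrom, exons, "+")
minus = spliced_sequence(chrom, exons, "-")
print(plus, minus)
```

Because every base comes from the genome assembly, any position where the curated transcript record differs from the genome will differ in the extracted sequence too.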

Hope this helps!

SRFlin · May 16, 2024, 2:12am

Thanks for your help!

I’ve now tried using both the Gencode and UCSC hg38 transcriptome fasta files you linked to, but unfortunately I’m still getting errors when it runs (this occurs both with the directly downloaded and unzipped files and with the files I ran through ‘NormalizeFasta’ first). I’ve also removed the GTF headers according to the FAQ link provided and tried running Salmon both with and without.

The error report it’s giving me back is similar but looks a little different now (see below).

Is it possible that the problem is with the RNA STAR BAM files I’m trying to put through salmon? They were aligned to the inbuilt HG38 reference genome but without a GTF. I’m only thinking this since on the error report it repeatedly says ‘warning transcript XXXXX appears in the reference but not the BAM’.

jennaj · May 16, 2024, 5:45pm

Hi @SRFlin

The tool is reporting that it found transcripts in the reference that were not found in the BAM. The identifiers are not simplified, and I’m not sure which GTF that is (Gencode for both?).
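One way to look into that warning is to compare the reference names recorded in the BAM header (the `@SQ` lines, which can be printed with `samtools view -H`) against the names in the transcriptome fasta. As far as I understand, Salmon's alignment-based mode expects reads aligned to the transcriptome, so a BAM aligned to the genome would list chromosome names there instead of transcript identifiers. The sketch below works on plain text; the function names and toy inputs are made up.

```python
def sam_reference_names(header_text):
    """Return the SN: values from the @SQ lines of a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names

def fasta_names(fasta_text):
    """Return the first token of every '>' header line as a set."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].split()
    }

# Toy header resembling a genome alignment, plus a toy transcriptome.
header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chr2\tLN:242193529\n"
)
transcriptome = ">TX1\nACGT\n>TX2\nGGCC\n"

bam_refs = set(sam_reference_names(header))
overlap = bam_refs & fasta_names(transcriptome)
print(sorted(bam_refs), len(overlap))
```

An empty overlap would suggest the mismatch is between the BAM and the transcriptome rather than between the fasta and the GTF.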

I have a history here where I re-downloaded the UCSC files since I always forget which are the paired files, plus wanted to make sure something didn’t change. The built-in hg38 index would be an appropriate choice to use with the tagged data in here, and it sounds like that is what you used.

You can just review, or import this and use the data. https://usegalaxy.org/u/jen-galaxyproject/h/ucsc-hg38-gtfs-mrna

I’d be curious about how that works out. It might be time to share your history if this doesn’t work – include the run that uses those files if possible since I know those should work.
