Correct reference transcriptome for Salmon quant on existing RNASTAR alignments

I’m trying to use Salmon quant to generate TPM count files on alignments produced using RNASTAR (human samples aligned to the HG38 genome build). I’m aware that I need a reference transcriptome for salmon and downloaded what I believed to be the correct one from the UCSC downloads page (Index of /goldenPath/hg38/bigZips). I have attempted using both the mrna.fa and refMrna.fa files but am still getting errors when I try to run the tool. I am unsure whether the problem is that I still have the wrong reference file or whether the issue is something else I’m not understanding.

The error looks like the enclosed screenshot


and this continues for numerous other transcript references.

For everything else I’ve done so far I’ve been following the ‘reference based RNA-seq data analysis’ tutorial on the galaxy training page (Hands-on: Reference-based RNA-Seq data analysis / Transcriptomics), but unfortunately this doesn’t deal with generating TPM count files or salmon and I can’t find a suitable tutorial which does.

Hi @SRFlin

This FAQ explains how to check for mismatches in reference data used in many protocols, including for Salmon → FAQ: Extended Help for Differential Expression Analysis Tools

Then one more reference, although I don’t think you’ll need it for this question → Reference genomes at public Galaxy servers: GRCh38/hg38 example


For the UCSC hg38 reference genome indexed in Galaxy, a reference annotation GTF and reference transcriptome fasta can be sourced from at least these two places:

Gencode

  • https://www.gencodegenes.org/human/
  • get the first in the list of GTFs, and the first in the list of Fasta
  • double check the formatting. You might need to standardize the fasta with “NormalizeFasta” (I can’t remember if this is needed) and I would remove GTF headers too (some tools might have a problem with them). The FAQ above has instructions for these.
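For anyone who wants to see what that formatting cleanup amounts to outside of Galaxy, here is a minimal Python sketch. It is not what the Galaxy tools do internally; the function names are made up for illustration, and splitting fasta titles on "|" is my assumption for Gencode-style title lines.

```python
def strip_gtf_headers(lines):
    """Drop header/comment lines (those starting with '#') from GTF text."""
    return [ln.rstrip("\n") for ln in lines if not ln.startswith("#")]


def simplify_fasta_headers(lines):
    """Keep only the first identifier on each '>' title line.

    Assumption: the identifier is the first token before any whitespace
    or '|' character; sequence lines are passed through unchanged.
    """
    out = []
    for ln in lines:
        ln = ln.rstrip("\n")
        if ln.startswith(">"):
            tokens = ln[1:].split()
            ident = tokens[0].split("|")[0] if tokens else ""
            out.append(">" + ident)
        else:
            out.append(ln)
    return out


if __name__ == "__main__":
    gtf = ["##description: example header", "chr1\tsrc\texon\t1\t10\t.\t+\t.\t"]
    fasta = [">ENST0000.1|ENSG0000.1|extra", "ACGT"]
    print(strip_gtf_headers(gtf))
    print(simplify_fasta_headers(fasta))
```

The point is only that the identifiers left in the fasta should be the same strings that appear as transcript IDs in the GTF.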

UCSC

These two are a match, and have standard human Gene Symbol and RefSeq transcript identifiers.

Technically, any of the reference annotation GTFs in their Downloads area are based on “Gene and Gene Predictions” tracks also represented in the Table browser (or main Browser). This means you can extract a reference transcriptome fasta from the Table browser.

Just note that extracting that way means the transcript footprints (genomic coordinates) are used to extract the transcript sequence based on the genomic sequence. The result might be slightly different from the actual reference transcriptome you might find from other sources! This potential difference probably doesn’t matter for RNA-seq DE analysis, but would if later on you want to call variants or something else sensitive to specific base calls. So, if you care about the bases, use Gencode or the RefSeq above.
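To make the "footprint" idea concrete, here is a rough sketch of how a transcript sequence can be rebuilt from genomic coordinates. This is only an illustration and not the Table browser's actual code; the function name and the 0-based, half-open exon coordinates (BED style) are assumptions.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def spliced_sequence(chrom_seq, exons, strand):
    """Join exon slices of a chromosome sequence into one transcript.

    exons: list of (start, end) pairs, 0-based and half-open (assumed).
    strand: '+' or '-'; minus-strand transcripts are reverse-complemented.
    """
    seq = "".join(chrom_seq[start:end] for start, end in sorted(exons))
    if strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


if __name__ == "__main__":
    chrom = "AAACCCGGGTTT"
    print(spliced_sequence(chrom, [(0, 3), (6, 9)], "+"))  # AAAGGG
    print(spliced_sequence(chrom, [(0, 3), (6, 9)], "-"))  # CCCTTT
```

Every base in the result comes from the genome assembly, so any place where the curated mRNA record differs from the assembly will not be reflected.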

Hope this helps! :slight_smile:

Thanks for your help!

I’ve now tried using both the Gencode and UCSC fasta hg38 transcriptomes you’ve provided links to but unfortunately I’m still getting errors when it runs (this is occurring with both the directly downloaded and unzipped files and the files I ran through ‘NormalizeFasta’ first). I’ve also removed the GTF headers according to the FAQ link provided and tried running Salmon both with and without.

The error report it’s giving me back is similar but looks a little different now (see below).

Is it possible that the problem is with the RNA STAR BAM files I’m trying to put through salmon? They were aligned to the inbuilt HG38 reference genome but without a GTF. I’m only thinking this since on the error report it repeatedly says ‘warning transcript XXXXX appears in the reference but not the BAM’.

Hi @SRFlin

The tool is reporting that it found transcripts in the reference that were not found in the BAM. The identifiers are not simplified, and I’m not sure which GTF that is (Gencode for both?).
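One way to check this kind of mismatch directly is to compare the sequence names in the BAM header with the identifiers in the transcriptome fasta. Below is a small sketch, assuming the header has been exported as text (for example with `samtools view -H`) and that the fasta identifier is the first whitespace-delimited token on each title line. The function names are hypothetical.

```python
def sam_reference_names(header_lines):
    """Collect the SN: values from @SQ lines of a SAM/BAM header."""
    names = set()
    for ln in header_lines:
        if ln.startswith("@SQ"):
            for field in ln.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def fasta_identifiers(fasta_lines):
    """Collect the first token of each '>' title line."""
    ids = set()
    for ln in fasta_lines:
        if ln.startswith(">"):
            tokens = ln[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def compare_names(header_lines, fasta_lines):
    """Report which names are shared and which appear on one side only."""
    bam = sam_reference_names(header_lines)
    fa = fasta_identifiers(fasta_lines)
    return {
        "shared": bam & fa,
        "only_in_bam": bam - fa,
        "only_in_fasta": fa - bam,
    }


if __name__ == "__main__":
    header = ["@HD\tVN:1.4", "@SQ\tSN:chr1\tLN:1000"]
    fasta = [">NM_000014.6 example title", "ACGT"]
    print(compare_names(header, fasta))
```

If the BAM names are chromosomes (chr1, chr2, ...) and the fasta names are transcript IDs, the "shared" set comes back empty, which would be consistent with a warning for every transcript in the reference.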

I have a history here where I re-downloaded the UCSC files since I always forget which are the paired files, plus wanted to make sure something didn’t change. The built-in hg38 index would be an appropriate choice to use with the tagged data in here, and it sounds like that is what you used.

You can just review, or import this and use the data. https://usegalaxy.org/u/jen-galaxyproject/h/ucsc-hg38-gtfs-mrna

I’d be curious about how that works out. It might be time to share your history if this doesn’t work – include the run that uses those files if possible since I know those should work.