Correct reference transcriptome for Salmon quant on existing RNASTAR alignments

Hi @SRFlin

This FAQ explains how to check for mismatches in reference data used in many protocols, including for Salmon → FAQ: Extended Help for Differential Expression Analysis Tools

Then one more reference, although I don’t think you’ll need it for this question → Reference genomes at public Galaxy servers: GRCh38/hg38 example


For the UCSC hg38 reference genome indexed in Galaxy, a reference annotation GTF and reference transcriptome fasta can be sourced from at least these two places:

Gencode

  • https://www.gencodegenes.org/human/
  • get the first in the list of GTFs, and the first in the list of Fasta
  • double check the formatting. You might need to standardize the fasta with “NormalizeFasta” (I can’t remember if this is needed) and I would remove GTF headers too (some tools might have a problem with them). The FAQ above has instructions for these.
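If you want to sanity check the formatting outside of Galaxy, here is a rough Python sketch of the idea: drop the `#` header lines from the GTF, then compare the transcript identifiers in the GTF against the ones on the fasta title lines. The function names are made up for illustration, and the fasta parsing assumes the identifier is the first token on the `>` line, cut at the first `|` or whitespace. That matches the Gencode-style titles I have seen, but check your own files.

```python
import re


def strip_gtf_headers(lines):
    """Drop header/comment lines (those starting with '#') from GTF lines."""
    return [line for line in lines if not line.startswith("#")]


def gtf_transcript_ids(lines):
    """Collect transcript_id values from the 9th (attributes) GTF column."""
    ids = set()
    for line in strip_gtf_headers(lines):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def fasta_transcript_ids(lines):
    """Collect identifiers from fasta title lines.

    Assumption: the identifier is the first token after '>',
    ending at the first '|' or whitespace character.
    """
    ids = set()
    for line in lines:
        if line.startswith(">"):
            token = re.split(r"[|\s]", line[1:].strip(), maxsplit=1)[0]
            if token:
                ids.add(token)
    return ids


def compare_ids(gtf_ids, fasta_ids):
    """Return (only_in_gtf, only_in_fasta) as sorted lists."""
    return sorted(gtf_ids - fasta_ids), sorted(fasta_ids - gtf_ids)
```

Read both files with something like `open(path).readlines()` and pass the lines in. If either list from `compare_ids` is large, the two files probably don't match (different release, or version suffixes present in one and missing in the other).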

UCSC

These two are a match, and have standard human Gene Symbol and RefSeq transcript identifiers.

Technically, any of the reference annotation GTFs in their Downloads area are based on “Genes and Gene Predictions” tracks also represented in the Table Browser (or main Browser). This means you can extract a reference transcriptome fasta from the Table Browser.

Just note that extracting that way means the transcript footprints (genomic coordinates) are used to extract the transcript sequence based on the genomic sequence. The result might be slightly different from the actual reference transcriptome you might find from other sources! This potential difference probably doesn’t matter for RNA-seq DE analysis, but would if later on you want to call variants or something else sensitive to specific base calls. So, if you care about the bases, use Gencode or the RefSeq above.
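To make the "footprint" idea concrete, here is a toy Python sketch of coordinate-based extraction. This is not how the Table Browser is implemented, just an illustration under simple assumptions: the genome is a plain string, exons are 0-based half-open `(start, end)` pairs, and `extract_transcript` is a made-up name.

```python
# Translation table for complementing bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_transcript(genome_seq, exons, strand):
    """Build a transcript sequence from exon coordinates on a genome string.

    exons: iterable of (start, end) pairs, 0-based and half-open (assumed).
    strand: '+' or '-'; minus-strand transcripts are reverse-complemented.
    """
    # Join exon slices in genomic order.
    seq = "".join(genome_seq[start:end] for start, end in sorted(exons))
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq
```

Every base in the result is copied from the genome, so any place where the curated transcript record differs from the genome assembly will not show up in a fasta made this way.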

Hope this helps! :slight_smile: