Hi @SRFlin
This FAQ explains how to check for mismatches in reference data used in many protocols, including for Salmon → FAQ: Extended Help for Differential Expression Analysis Tools
Then one more reference, although I don’t think you’ll need it for this question → Reference genomes at public Galaxy servers: GRCh38/hg38 example
For the UCSC hg38 reference genome indexed in Galaxy, a reference annotation GTF and reference transcriptome fasta can be sourced from at least these two places:
Gencode
- https://www.gencodegenes.org/human/
- get the first in the list of GTFs, and the first in the list of Fasta
- double check the formatting. You might need to standardize the fasta with “NormalizeFasta” (I can’t remember if this is needed) and I would remove GTF headers too (some tools might have a problem with them). The FAQ above has instructions for these.
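If you ever want to do the GTF header removal outside of Galaxy, it is a small text operation. Here is a minimal sketch in Python, assuming the header lines are the comment lines that begin with `#`. The function name and the sample lines are made up for illustration; this is not what the Galaxy tools run internally.

```python
def strip_gtf_headers(lines):
    """Return the GTF lines with comment/header lines (starting with '#') removed."""
    return [line for line in lines if not line.startswith("#")]


# Hypothetical input: two header lines followed by one annotation line.
example = [
    "##description: example header line\n",
    "##format: gtf\n",
    'chr1\tSRC\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]

cleaned = strip_gtf_headers(example)
print(len(cleaned), "annotation line(s) kept")
```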
UCSC
These two are a match, and have standard human Gene Symbol and RefSeq transcript identifiers.
- https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
- https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/refMrna.fa.gz
- The GTF will be ready to use after Upload, and the fasta will need to be uncompressed (under the pencil icon) then run through NormalizeFasta to strip out the extra characters.
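One way to confirm that a GTF and a transcriptome fasta really are a match is to compare the transcript identifiers on both sides. This is only a rough sketch in Python. It assumes the fasta identifier is the first whitespace-delimited word on each `>` line and that the GTF lines carry a `transcript_id "..."` attribute. The function names and the sample records are hypothetical.

```python
import re

# Matches the transcript_id attribute in the 9th GTF column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(gtf_lines):
    """Collect the transcript_id values from GTF lines, skipping '#' comments."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        match = TRANSCRIPT_ID.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def fasta_ids(fasta_lines):
    """Collect the first word of each '>' title line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def compare_ids(gtf_lines, fasta_lines):
    """Report identifiers shared by both inputs and those unique to each."""
    g = gtf_transcript_ids(gtf_lines)
    f = fasta_ids(fasta_lines)
    return {"shared": g & f, "gtf_only": g - f, "fasta_only": f - g}


# Hypothetical records, for illustration only.
gtf = [
    'chr1\tSRC\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tSRC\texon\t300\t400\t.\t+\t.\tgene_id "G2"; transcript_id "T2";\n',
]
fasta = [">T1 extra title text\n", "ACGT\n", ">T3\n", "GGCC\n"]

result = compare_ids(gtf, fasta)
print(result)
```

If `gtf_only` or `fasta_only` holds most of the identifiers, the two files probably come from different sources or still need their titles cleaned up.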
Technically, all of the reference annotation GTFs in their Downloads area are based on “Genes and Gene Predictions” tracks that are also available in the Table Browser (or main Browser). This means you can extract a reference transcriptome fasta from the Table Browser.
Just note that extracting that way means the transcript footprints (genomic coordinates) are used to extract the transcript sequence based on the genomic sequence. The result might be slightly different from the actual reference transcriptome you might find from other sources! This potential difference probably doesn’t matter for RNA-seq DE analysis, but would if later on you want to call variants or something else sensitive to specific base calls. So, if you care about the bases, use Gencode or the RefSeq above.
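To make the "footprint" idea concrete: the exon coordinates from the annotation are used to cut pieces out of the genome sequence, the pieces are joined, and minus-strand transcripts are reverse-complemented. Here is a toy sketch in Python, assuming 1-based inclusive coordinates as in GTF. The sequence, coordinates, and function names are invented for illustration, and this is not a description of how the Table Browser is implemented.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def transcript_from_exons(chrom_seq, exons, strand):
    """Join exon slices of chrom_seq; exons are (start, end), 1-based inclusive."""
    pieces = [chrom_seq[start - 1:end] for start, end in sorted(exons)]
    spliced = "".join(pieces)
    return reverse_complement(spliced) if strand == "-" else spliced


# Toy "chromosome" and two exons, purely hypothetical.
genome = "AAACCCGGGTTTACGT"
exons = [(1, 3), (7, 9)]

plus = transcript_from_exons(genome, exons, "+")
minus = transcript_from_exons(genome, exons, "-")
print(plus, minus)
```

Any position where the genome assembly differs from the curated mRNA record ends up in a sequence built this way, which is where the small differences can come from.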
Hope this helps!