Gene and transcript level quant using salmon

Hi all! I’m trying to use Salmon quant to get gene and transcript level quantification. Below are the settings I’m using:

  • Salmon quant (Galaxy Version 1.10.1+galaxy4)
  • “reads” mode
  • Using built-in hg38 transcriptome
  • paired end dataset collection
  • automatic infer for strandedness
  • disabled BAM file
  • hg38 annotation.gtf from Gencode for “File containing a mapping of transcripts to genes”

The resulting files do NOT have gene or transcript level quantification. Instead, they are divided by chromosome (first column is chr1, chr2, etc. and includes the non-standard chromosomes as well). I’ve tried running the same settings without the annotation.gtf file and get the same result. Please help! Thanks in advance.

Welcome @aperreault

Hopefully we can help!

I’m not sure where you are sourcing this but it is likely part of the problem with Salmon:

Instead, try using a reference transcriptome fasta and a reference annotation GTF – both based on the same genome assembly version.

However, remember that you will probably not want to include the actual reference genome fasta! My guess is that including it is why you are getting this result:

So – try using just the reads, along with the transcript fasta for the alignment portion and then the annotation to group those transcripts into genes. There will be two primary outputs: one quantified per transcript and one quantified per gene. These are what later downstream steps such as DESeq2 usually need.
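To make the "group those transcripts into genes" step concrete, here is a minimal Python sketch of the idea. It assumes a GTF whose attribute column carries `gene_id` and `transcript_id`, and a transcript-level table with the usual Salmon columns (`Name`, `TPM`, `NumReads`). The function names and the toy identifiers are made up for illustration, and simply summing per gene is a simplification, not necessarily how Salmon builds its own gene table.

```python
import csv
import re
from collections import defaultdict


def tx2gene_from_gtf(gtf_lines):
    """Map transcript_id -> gene_id using 'transcript' rows of a GTF."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        tx = re.search(r'transcript_id "([^"]+)"', fields[8])
        gene = re.search(r'gene_id "([^"]+)"', fields[8])
        if tx and gene:
            mapping[tx.group(1)] = gene.group(1)
    return mapping


def sum_by_gene(quant_lines, mapping):
    """Sum TPM and NumReads per gene from transcript-level rows."""
    totals = defaultdict(lambda: {"TPM": 0.0, "NumReads": 0.0})
    for row in csv.DictReader(quant_lines, delimiter="\t"):
        # Names missing from the mapping are kept under their own label.
        gene = mapping.get(row["Name"], row["Name"])
        totals[gene]["TPM"] += float(row["TPM"])
        totals[gene]["NumReads"] += float(row["NumReads"])
    return dict(totals)


if __name__ == "__main__":
    # Toy data with hypothetical identifiers (T1, G1, ...).
    gtf = [
        'chr1\tX\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
        'chr1\tX\ttranscript\t1\t90\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
        'chr2\tX\ttranscript\t5\t80\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
    ]
    quant = [
        "Name\tLength\tEffectiveLength\tTPM\tNumReads",
        "T1\t100\t80.0\t10.0\t4.0",
        "T2\t90\t70.0\t5.0\t2.0",
        "T3\t75\t55.0\t1.0\t1.0",
    ]
    print(sum_by_gene(quant, tx2gene_from_gtf(gtf)))
```

The grouping only works when the `Name` column of the transcript table uses the same identifiers as `transcript_id` in the GTF, which is why both files need to come from the same source and assembly.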

UCSC is a good choice for human!

Please give that a try and let us know if you would like any more help! See the banner topic for what to screenshot, or how to generate a history share link, when you want more specific feedback.

Let’s start there! :slight_smile:

Thanks for the quick response! A few follow up questions…

  • Why even offer the Galaxy built in transcriptome as an option if it doesn’t work for this tool?
  • Does your suggestion to use the reference transcriptome fasta mean I need to download and then use the “use one from history” option?

I don’t think I can run the tool without selecting one of the above options. Some further clarification would be helpful!

Hi @aperreault

Hopefully I can explain more. :slight_smile:

The built-in reference transcriptome fasta for Salmon can be used for some use cases. The most common is a quick assessment of abundance across known, expected baseline features. This is often used as a sort of quality assurance step – a scientific assessment of a read library at a higher level than the read quality itself. This is a simple transcriptome quantification type of analysis step.

The version hosted in Galaxy is even simpler than that! There is just one “feature” per chromosome! This includes the primary chromosomes plus all of the alt and haplotype sequences. Running the tool against the built-in index can answer the question “Does my library have reads that map to all components of this species’ reference genome?”

However, if you want to do something more sophisticated, such as grouping the transcripts by gene, then you’ll need to supply both files: the reference transcriptome fasta and the annotation GTF. This is what is done when generating a gene quantification type of analysis step.

What to do

There are many different versions of annotation for every species! Annotation differs per genome assembly version (the specific base pairs in an assembly) and per labeling scheme. The genome assembly version changes over longer time frames than the annotation does. Using an indexed genome is common, but you can also use a custom genome in Galaxy. The current annotation is usually what you want, and the files are not very large. Sourcing it yourself also gives you more control as the scientist over what you are using: who created the annotation and what it may contain.

You can test this. Try running Salmon with the built-in fasta and without any annotation. You’ll expose the labels used in the output. These can then be compared to your reference annotation sourced from Gencode. This is similar to what you already noticed. The transcripts from the built-in index are based on chromosome names (where the transcript name should be) and the annotation from Gencode uses distinct labels for the chromosome, transcript, and gene identifiers.
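If you download the quantification table and the GTF, the comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes a tab-separated table with a header row and the labels in the first column, plus a GTF with `transcript_id "..."` in the attribute column.

```python
import re


def gtf_ids(gtf_lines, key):
    """Collect distinct values of one GTF attribute, e.g. transcript_id."""
    pattern = re.compile(key + r' "([^"]+)"')
    found = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:
            match = pattern.search(fields[8])
            if match:
                found.add(match.group(1))
    return found


def gtf_chromosomes(gtf_lines):
    """Collect the sequence names from the first GTF column."""
    return {
        line.split("\t")[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }


def quant_names(quant_lines):
    """Return first-column labels of the table, skipping the header row."""
    rows = [line.split("\t")[0].strip() for line in quant_lines if line.strip()]
    return set(rows[1:])


def label_overlap(quant_lines, gtf_lines):
    """Count how many output labels look like transcripts vs chromosomes."""
    names = quant_names(quant_lines)
    return {
        "total": len(names),
        "match_transcript_id": len(names & gtf_ids(gtf_lines, "transcript_id")),
        "match_chromosome": len(names & gtf_chromosomes(gtf_lines)),
    }


if __name__ == "__main__":
    # Toy rows with hypothetical identifiers.
    gtf = ['chr1\tX\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";']
    quant = ["Name\tLength\tEffectiveLength\tTPM\tNumReads", "chr1\t1000\t900.0\t1.0\t1.0"]
    print(label_overlap(quant, gtf))
```

A result where nothing matches `transcript_id` but the labels match chromosome names would be consistent with what you saw: the index was built from chromosome-level sequences, so the annotation has nothing to group.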

Then for your other question here

Yes, you will be using the fasta and the GTF from the history. You can use the UCSC versions I linked above or you can get this fasta data from Gencode to go with the GTF you already have at GENCODE - Human Release 49.
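One thing worth checking when pairing a Gencode fasta with a Gencode GTF: the transcript fasta headers typically carry several pipe-separated fields after the transcript identifier, while the GTF `transcript_id` holds only the identifier. Whether the Galaxy tool handles that for you is not something covered in this thread, so treat the following as a hedged sketch of one way to make the labels line up. The function name and the example header are made up for illustration.

```python
def trim_fasta_headers(fasta_lines, sep="|"):
    """Yield FASTA lines with each header cut at the first separator."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the text before the first separator, e.g. the transcript ID.
            yield line.split(sep, 1)[0].strip()
        else:
            # Sequence lines pass through unchanged.
            yield line


if __name__ == "__main__":
    # Hypothetical header in a pipe-separated style.
    example = [">TX0001.2|GENE0001.1|extra|fields|\n", "ACGTACGT\n"]
    for out in trim_fasta_headers(example):
        print(out)
```

After trimming, the header labels can be compared against the GTF `transcript_id` values, including any version suffix, before running the tool.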

Hope this helps! Please let us know if you have more questions. :slight_smile:

This was super helpful! Thanks for giving such detailed background on Salmon and the different reference files. I really appreciated it! :slight_smile:
