Salmon quantification using Ensembl references

Hi all,

I am trying to update my RNA-seq training materials to use the salmon tool for quantification rather than mapping and then counting.

My example data are from mouse, so I have downloaded the latest reference files (cDNA FASTA and GTF) from Ensembl and uploaded them to Galaxy:

http://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/cdna/Mus_musculus.GRCm39.cdna.all.fa.gz
http://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.chr.gtf.gz

Salmon works if I specify the FASTA file from Ensembl as my reference FASTA. There is one row of output for each transcript:

Name Length EffectiveLength TPM NumReads
ENSMUST00000178537.2 12 2.002 0.000000 0.000
ENSMUST00000178862.2 14 2.071 0.000000 0.000
ENSMUST00000196221.2 9 1.818 0.000000

However, I would like to demonstrate how to obtain gene-level estimates by specifying a transcript-to-gene mapping. The tool suggests that a GTF file can be used, but when I use the GTF from Ensembl I still get the same number of rows in my “gene quantification” as there are transcripts (suggesting that it cannot map the transcripts to genes).

I have also created a tab-delimited file from BioMart:

Transcript stable ID version Gene stable ID
ENSMUST00000082387.1 ENSMUSG00000064336
ENSMUST00000082388.1 ENSMUSG00000064337
ENSMUST00000082389.1 ENSMUSG00000064338

But my salmon gene output still has all the transcripts.

Can anyone suggest where I am going wrong, or point me to some example fasta and transcript mapping files that work as expected?

Many thanks,

Mark

Hi @Mark_J_Dunning,
I inspected the GTF file, and I think its format is not compatible with Salmon: for example, the transcript ENSMUST00000082387.1 is encoded in the GTF file as transcript_id "ENSMUST00000082387"; transcript_version "1", instead of just transcript_id "ENSMUST00000082387.1". The IDs in the GTF therefore do not match the versioned IDs in the cDNA FASTA.
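One possible workaround for the ID mismatch described above is to build the transcript-to-gene table from the GTF yourself, joining transcript_id and transcript_version so the result matches the versioned IDs in the cDNA FASTA. The following is a minimal Python sketch, not something taken from the Galaxy tool itself. The function name `tx2gene_from_gtf_lines` is hypothetical, and the example GTF line (including its coordinates and source column) is made up for illustration; only the attribute layout follows what is quoted above.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def tx2gene_from_gtf_lines(lines):
    """Yield (versioned_transcript_id, gene_id) pairs from GTF lines.

    Only rows whose feature column is "transcript" are used. If a
    transcript_version attribute is present, it is appended to the
    transcript ID as ".<version>".
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip GTF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        tx = attrs.get("transcript_id")
        gene = attrs.get("gene_id")
        if tx is None or gene is None:
            continue
        version = attrs.get("transcript_version")
        if version:
            tx = f"{tx}.{version}"
        yield tx, gene


# Illustrative input only: coordinates and source are invented.
example = [
    "#!genome-build GRCm39\n",
    'MT\tinsdc\ttranscript\t1\t68\t.\t+\t.\t'
    'gene_id "ENSMUSG00000064336"; gene_version "1"; '
    'transcript_id "ENSMUST00000082387"; transcript_version "1";\n',
]

pairs = list(tx2gene_from_gtf_lines(example))
for tx, gene in pairs:
    print(f"{tx}\t{gene}")
```

For a real file, the same function could be fed the lines of the decompressed GTF (for instance via `gzip.open(path, "rt")`), with the printed two-column output uploaded to Galaxy as a tabular dataset.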

This is an example that uses a GTF file for mapping the transcripts to genes.

Could you share your history with me? I would like to check why the tabular transcript-to-gene file doesn’t work.

Regards

Hi @gallardoalba,

Thanks for looking at the issue. Yes, that makes sense about the gtf file. It’s a shame that the transcript names in the gtf and fasta do not follow the same format!

Yes I can share my history with you. I need your email address for that I think?

Mark

I think I might have fixed it. My first attempt at the transcript mapping file had column headings with spaces in the names: “Transcript stable ID version” and “Gene stable ID”. I wondered if this was confusing the tool into thinking there were more columns present in the data.

I changed the column headings to just “Transcript” and “Gene” and this seems to have worked. I guess skipping the column headings altogether might also work?
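For anyone wanting to script this clean-up, here is a minimal Python sketch that drops the header row from a BioMart export and writes a plain two-column, tab-separated file. It assumes that a headerless table of transcript ID and gene ID is acceptable to the tool, which is the untested guess above rather than something confirmed in this thread. The function name `clean_tx2gene` and the in-memory example are hypothetical.

```python
import csv
import io


def clean_tx2gene(src, dst, has_header=True):
    """Copy a tab-delimited transcript/gene table, dropping the header.

    Only the first two columns are kept. Rows with a missing transcript
    or gene ID are skipped. Returns the number of rows written.
    """
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    if has_header:
        next(reader, None)  # discard the BioMart column headings
    written = 0
    for row in reader:
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            continue
        writer.writerow([row[0].strip(), row[1].strip()])
        written += 1
    return written


# In-memory stand-in for the BioMart export shown earlier in the thread.
raw = io.StringIO(
    "Transcript stable ID version\tGene stable ID\n"
    "ENSMUST00000082387.1\tENSMUSG00000064336\n"
    "ENSMUST00000082388.1\tENSMUSG00000064337\n"
)
out = io.StringIO()
count = clean_tx2gene(raw, out)
print(out.getvalue(), end="")
```

With real files, `src` and `dst` would be file handles opened with `newline=""`, as the `csv` module documentation recommends.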

Mark

Hi @Mark_J_Dunning, did it work?