Salmon quantification using Ensembl references

Mark_J_Dunning · October 20, 2021, 12:28pm

Hi all,

I am trying to update my RNA-seq training materials to use the salmon tool for quantification rather than mapping and then counting.

My example data are from mouse, so I have downloaded the latest reference files from Ensembl and uploaded them to Galaxy:

http://ftp.ensembl.org/pub/release-104/fasta/mus_musculus/cdna/Mus_musculus.GRCm39.cdna.all.fa.gz
http://ftp.ensembl.org/pub/release-104/gtf/mus_musculus/Mus_musculus.GRCm39.104.chr.gtf.gz

salmon works if I specify the FASTA file from Ensembl as my reference FASTA. There is one row of output for each transcript:

Name	Length	EffectiveLength	NumReads
ENSMUST00000178537.2	12	2.002	0.000
ENSMUST00000178862.2	14	2.071	0.000
ENSMUST00000196221.2	9	1.818

However, I would like to demonstrate how to obtain gene-level estimates by specifying a transcript to gene mapping. The tool suggests that a gtf file can be used, but when I use the gtf from Ensembl I still get the same number of transcripts in my “gene quantification” (suggesting that it cannot map the transcripts).

I have also created a tab-delimited file from BioMart:

Transcript stable ID version	Gene stable ID
ENSMUST00000082387.1	ENSMUSG00000064336
ENSMUST00000082388.1	ENSMUSG00000064337
ENSMUST00000082389.1	ENSMUSG00000064338

But my salmon gene output still has all the transcripts.

Can anyone suggest where I am going wrong, or point me to some example fasta and transcript mapping files that work as expected?

Many thanks,

Mark

gallardoalba · October 20, 2021, 2:08pm

Hi @Mark_J_Dunning,
I inspected the GTF file, and I think its format is not compatible with Salmon: e.g. the transcript ENSMUST00000082387.1 is encoded in the GTF file as transcript_id "ENSMUST00000082387"; transcript_version "1", instead of just transcript_id "ENSMUST00000082387.1".
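
One possible workaround for this mismatch is to build the transcript-to-gene table from the GTF yourself, joining transcript_id and transcript_version so the IDs match the versioned names in the cDNA FASTA. The following is only a minimal sketch using the Python standard library: the function name tx2gene_from_gtf_lines is hypothetical, the example GTF line is hand-made (its coordinates and source column are invented for illustration), and it assumes the usual Ensembl attribute layout of key "value"; pairs.

```python
import re

# Matches Ensembl-style GTF attributes such as: transcript_id "ENSMUST00000082387";
ATTR = re.compile(r'(\S+) "([^"]*)"')

def tx2gene_from_gtf_lines(lines):
    """Yield (versioned_transcript_id, gene_id) pairs from GTF 'transcript' records.

    Hypothetical helper: assumes 9 tab-separated GTF columns with
    attributes in the last one.
    """
    seen = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only one record per transcript is needed
        attrs = dict(ATTR.findall(fields[8]))
        tx = attrs.get("transcript_id")
        gene = attrs.get("gene_id")
        if tx is None or gene is None:
            continue
        version = attrs.get("transcript_version")
        if version:
            # Rebuild the versioned ID used in the cDNA FASTA, e.g. ENSMUST00000082387.1
            tx = f"{tx}.{version}"
        if tx not in seen:
            seen.add(tx)
            yield tx, gene

# Hand-made example record; coordinates and source are illustrative only.
example = [
    "#!genome-build GRCm39\n",
    'MT\texample\ttranscript\t1\t100\t.\t+\t.\t'
    'gene_id "ENSMUSG00000064336"; gene_version "1"; '
    'transcript_id "ENSMUST00000082387"; transcript_version "1";\n',
]

pairs = list(tx2gene_from_gtf_lines(example))
for tx, gene in pairs:
    print(f"{tx}\t{gene}")
```

The printed two-column, tab-delimited output could then be uploaded as a tabular transcript-to-gene file instead of the GTF. To process the real file, the lines could be read with gzip.open(path, "rt") from the standard library.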

This is an example that uses a GTF file for mapping the transcripts to genes.

Could you share your history with me? I would like to check why the tabular transcript-to-gene file doesn’t work.

Regards

Mark_J_Dunning · October 20, 2021, 3:39pm

Hi @gallardoalba,

Thanks for looking at the issue. Yes, that makes sense about the gtf file. It’s a shame that the transcript names in the gtf and fasta do not follow the same format!

Yes I can share my history with you. I need your email address for that I think?

Mark

Mark_J_Dunning · October 20, 2021, 4:20pm

I think I might have fixed it. My first attempt at the transcript mapping file had column headings with spaces in the names: “Transcript stable ID version” and “Gene stable ID”. I wondered if this was confusing the tool into thinking there were more columns present in the data.

I changed the column headings to just “Transcript” and “Gene” and this seems to have worked. I guess skipping the column headings altogether might also work?
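
For anyone hitting the same problem, here is a minimal sketch of how the BioMart export could be cleaned up before uploading. It assumes the first line of the export is always the header row and that the first two columns are the transcript and gene IDs; the function name clean_tx2gene and the keep_header switch are hypothetical. Whether the tool accepts a file with no header row at all is untested here, so both variants are offered.

```python
import csv
import io

def clean_tx2gene(src, dst, keep_header=False):
    """Copy a two-column, tab-delimited mapping from src to dst.

    Hypothetical helper: drops the original header line (assumed to be
    the first line) and optionally writes space-free headings instead.
    """
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    next(reader, None)  # discard the BioMart header containing spaces
    if keep_header:
        writer.writerow(["Transcript", "Gene"])
    for row in reader:
        # Keep only complete rows, and only the first two columns
        if len(row) >= 2 and row[0] and row[1]:
            writer.writerow(row[:2])

# The BioMart rows quoted earlier in this thread
biomart = io.StringIO(
    "Transcript stable ID version\tGene stable ID\n"
    "ENSMUST00000082387.1\tENSMUSG00000064336\n"
    "ENSMUST00000082388.1\tENSMUSG00000064337\n"
    "ENSMUST00000082389.1\tENSMUSG00000064338\n"
)
out = io.StringIO()
clean_tx2gene(biomart, out, keep_header=True)
result = out.getvalue()
print(result, end="")
```

With real files, the two StringIO objects would be replaced by files opened with open(path, newline="").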

Mark

gallardoalba · October 21, 2021, 8:40am

Hi @Mark_J_Dunning, did it work?
