Hi, wanted to ask for some advice regarding RNAseq data analysis using Salmon.
I was going to use this tool to align my RNAseq data to human reference genome and then obtain the read counts for downstream DESeq2 analysis. I obtained human reference genome data from ensembl (cDNA) in fa.gz format. However, I see that for the alignment, I need to have alignment files in .bam format. I also seem to need GTF files to map transcripts to genes?
I am a little bit at a loss on how to proceed with the alignment. At the moment I have trimmed RNAseq data in fastsanger.gz format and that cDNA reference from ensembl, but I understand some intermediary step is needed?
Would appreciate any advice on how to proceed - most Salmon tutorials Iâve looked at were for count quantification or written for advanced users of bioconda etc., and the threads on this forum cover issues dowstream of the initial alignment steps. So at the moment Iâm struggling
Hi @Egle,
Salmon works on a reference transcriptome. You donât need alignments: Salmon takes FASTQ files. Because of alternative splicing, you need a file with transcript to genes map. I had less issues with a tabular transcript-to-genes map file. Here is my old history with Salmon: Galaxy | Australia | Published History | Transcript quantification with Salmon 2018/06/08
You may also consider a traditional approach: mapping with HiSAT2 and read counting with featureCounts. GTN provides several detailed tutorials for this approach.
Kind regards,
Igor
Thank you very much for your response and for sharing the examples of data files that you use for analysis - this is very helpful.
I would like to ask a little bit more about how you made up the genes to transcripts map. I understand you used Trinity tool to assemble your reads and then the same tool to extract tabular data on genes of interest to create the map? The reads - were they reference transcriptome or transcripts of interest?
Also, I understand you used the âreadâ mode of salmon? Regarding the orientation and strandedness of reads, do you always use âinwardâ and ânot strandedâ or does it depend on your data?
The reason Iâm wondering is that I ran the counting in âreadâ mode, but I am not sure if I did that correctly (probably not :/).
What I did was:
Normalised the ensembl reference cDNA to Fasta format.
I used the ensembl reference in Fasta format as a built-in index for reference.
I used my RNAseq data (paired-end) as input data with strandedness to be detected automatically. The RNAseq data was only modified for base quality and primers/adapters prior to this step.
Regarding the âFile containing a mapping of transcripts to genesâ, I downloaded the ensembl table on human gene stable ID versions, Transcript stable ID versions, gene names and descriptions using BioMart and used this table as such a file.
Ran the quantification.
The result was compatible with your examples except for Gene names or Transcript stable IDs.
However, I was somewhat concerned about the number of Transcript IDs that remained unchanged to gene names. I assume they were not mapped? And possibly because I missed the assembly of my RNAseq data prior to step 3.? Or maybe salmon did not like that I used gene ID or transcript ID versions rather than IDs alone?
Would appreciate your opinion.
As for HiSAT2 and featureCounts - yes, I considered trying out HiSAT2. Problem is, I have to finish my DEG analysis by some point in December, so I am not sure how much time I have to learn the specifics of each tool. We selected salmon for its speed and relative reliability compared to STAR (which I understand is a current gold standard?).
The mapping file needs to be just two columns: transcriptID (tab) geneID
What you decide to use as the transcriptID needs to exactly match the content of the transcriptomeâs fasta title line (everything on the > lines, no extra content or spaces).
The geneID column must have content. The Salmon count input will be organized/grouped by those IDs.
The other help in this post might help, too. An fatal error with DESeq2. I didnât see this prior conversation before that original reply, but I think it covers what youâll need to address.
Question: should a transcriptID ever be used as a geneID?
Answer: Probably not. If multiple transcripts are actually part of the same gene/locus, the reads will map to all of those transcripts you could end up with a scientific problem when counting (those âmulti-mappedâ reads excluded). You would also have problems mapping the entire âgeneIDsâ column of data to other annotation since some wouldnât actually be geneIDs.
@Egle:
one of Trinity outputs is a file with gene to transcript map. I swapped columns to get transcript to genes map: Salmon expects transcript ID in the fist column. Other tools use gene to transcript map, with genes in 1st column. The history was created as an example of Salmon use years ago. I used transcripts from Trinity. In your terminology these are âtranscripts of interestâ, but you can use a similar setup for reference transcriptome.
Other settings, such as strandness, depend on the data. If you library was created using a stranded protocol, use an appropriate setting. You should get information on strandness from your sequencing provider. If you have a reference genome and gene annotation, you can figure out if your data is stranded or not.
Identical number of genes and transcripts: check the transcript to gene map and make sure the transcript IDs are in 1st column and genes are in 2d. Check the annotation. Do you see alternatively spliced transcripts?
Personally I prefer HiSAT2 over STAR, but both are fine.
Kind regards,
Igor