Dear All,
Apologies for bothering you again with another Salmon quant question. While checking the output files, I noticed that the number of lines in the two outputs (“Gene quantification” and “Quantification”) was inconsistent. For instance:
I understand that the “Gene quantification” and “Quantification” output files show different things: one reports counts for genes and the other for transcripts. While that could explain different numbers of rows, I am puzzled that the “Gene quantification” file has more lines - I would expect more transcripts (and their versions) than genes.
Also, some files have the same number of lines: ~190,000.
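For reference, here is a minimal sketch of how the row counts could be compared outside of Galaxy. It assumes each output is a tab-separated file with one header line and the transcript or gene ID in the first column (as in Salmon's `quant.sf` layout); the function name `summarize_table` and the file names are only illustrative.

```python
import csv


def summarize_table(path):
    """Return (data_rows, unique_ids) for a tab-separated file with a header row.

    Assumption: the first column holds the transcript or gene identifier.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header line
        ids = [row[0] for row in reader if row]
    return len(ids), len(set(ids))


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python count_rows.py quant.sf quant.genes.sf
    for path in sys.argv[1:]:
        rows, unique = summarize_table(path)
        print(f"{path}\trows={rows}\tunique_ids={unique}")
```

If a "gene" file has about as many unique IDs as the transcript file, that would suggest the transcripts were not actually collapsed to genes.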
Second, when I was setting up the pipeline for my analysis and ran DESeq2, the PCA spaced the datasets a bit weirdly: I expected them to group by test condition, but they didn’t. Of course, the sample size was too small for PCA (only 4 datasets in total), but having noticed this difference in line numbers, I began to worry that I may have incorrectly or only partially mapped/aligned/referenced the datasets, and/or that they are only partially counted by Salmon or partially compared when running DESeq2.
A third concern: the transcript-to-gene table I tried when testing the pipeline earlier had ~270,000 lines (probably because of transcript versions). I dropped it when DESeq2 did not work and switched to the Ensembl cDNA reference in .gtf format (~3,200,000 lines).
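To illustrate what I mean by transcript versions possibly getting in the way, here is a rough sketch of a check for how many quantified transcript IDs find a gene in a transcript-to-gene mapping, with and without the trailing `.N` version suffix. The helper names `strip_version` and `mapping_coverage` are hypothetical, and the IDs in the usage comment are made-up placeholders, not real data.

```python
def strip_version(identifier):
    """Drop a trailing numeric '.N' version suffix, if there is one."""
    base, dot, suffix = identifier.rpartition(".")
    return base if dot and suffix.isdigit() else identifier


def mapping_coverage(transcript_ids, tx2gene):
    """Count transcript IDs that map to a gene, exactly and after stripping versions.

    transcript_ids: iterable of IDs taken from the quantification file.
    tx2gene: dict mapping transcript ID -> gene ID.
    """
    transcript_ids = list(transcript_ids)
    exact = sum(1 for t in transcript_ids if t in tx2gene)
    unversioned_map = {strip_version(t): g for t, g in tx2gene.items()}
    unversioned = sum(
        1 for t in transcript_ids if strip_version(t) in unversioned_map
    )
    return {
        "total": len(transcript_ids),
        "exact": exact,
        "unversioned": unversioned,
    }


# Example with placeholder IDs (illustrative only):
# mapping_coverage(["TX0001.2", "TX0002.1"], {"TX0001": "GENE_A", "TX0002": "GENE_B"})
```

A large gap between the "exact" and "unversioned" counts would point to a version mismatch between the reference used for quantification and the mapping table.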
The problem is that I did not see these differences in line numbers when running the pilots (they were all ~190,000 lines, irrespective of the dataset or output file). And I am sure I used the same parameters for all the datasets, both the pilots and the remaining ones.
I would appreciate your opinion on whether this is a problem and, if so, how you would recommend solving it.
Thanks!