Salmon output issue: differences in line numbers of Quantification and Gene Quantification output files

Dear All,

Apologies to bother you again. Another Salmon quant question. While checking the output files, I noticed that the number of lines in the output files (“Gene quantification” and “Quantification”) was inconsistent. For instance:

image

image

I understand that “Gene quantification” vs “Quantification” output files show different things: one the counts for genes while the other for transcripts. While that could possibly explain different numbers of rows, I am somewhat puzzled by higher numbers of lines in the “Gene quantification” file - I would expect more transcripts (and their versions) than genes.

Also, some files have the same number of lines: ~190,000.

Second thing, when I was trying to set-up the pipeline for my analysis and ran DESeq2, the PCA spaced the datasets a bit weirdly: I expected them to group by test condition but they didn’t. Of course, sample size was too small for PCA (only 4 datasets in total) but having noticed this difference in line numbers, I now began to worry that maybe I ended up incorrectly or partially mapping/aligning/referencing the datasets and/or maybe they are partially counted by Salmon and/or partially compared when running DESeq2.

Third concerning thing was that my transcript-to-gene table I tried when testing the pipeline earlier had ~270,000 lines (probably because of transcript versions). I dropped that when DESeq2 did not work and switched to ensembl cDNA reference in .gtf format (~3,200,000 lines).

Problem is, I did not notice this issue with differences in numbers of lines when running the pilots (they were all ~190,000 lines, irrespective of which dataset or output file it was). And I am sure I used the same parameters for all the datasets - pilot and the remaining ones.

Would appreciate your opinion on whether this is a problem and if so, how you would recommend to solve it.

Thanks!

Did a little bit of follow-up reading and found some threads that may be relevant:

https://github.com/COMBINE-lab/salmon/issues/800

https://github.com/COMBINE-lab/salmon/issues/214

Both suggest that Salmon discards duplicate transcripts. I checked the reports of the output files and there were some issues that seem to be similar:

The interesting thing is that the second thread also mentions some mismatches between the fasta and .gtf files. Is this something you have ever run into when doing your analyses?

I am thinking of perhaps trying different mapping/alignment and count tools :confused:

This is likely your problem, too. And many posts at this forum are about format/content issues with either fasta, GTF, or both when used together. Tools are literal – any identifier that is compared between files needs to exactly match between all inputs.

Short list of absolute requirements for Salmon → DESeq2

  1. Transcript fasta
  • did the file completely Upload to Galaxy? Truncated fasta is very hard to detect. So, just try to reproduce it by loading the file again and comparing
  • has correct formatting – try running it through the tool NormalizeFasta using the options to wrap at 80 bases and checking the box to remove “everything after the first whitespace”
  • does not contain any duplicated transcripts_id values – these are on the > title lines
  • if you get the fasta from a data provider, all in one file, it might have problems the first and second will address but you won’t need to worry about third
  1. GTF

    • should be from the same data provider where you sourced the fasta
    • make sure it is also completed Uploaded, and use “auto-detect” for the datatype. If Galaxy doesn’t guess gtf there is a problem to fix
    • the most common problem is header lines # at the top. Remove the headers, then “redetect” the datatype until Galaxy guesses gtf. If it won’t be produced, then you don’t have a GTF (maybe is GFF3 which will not work well?)
    • inspect the identifiers in that file – does the transcript_id attribute exactly match those in the fasta (minus the >)?
    • does the file contain gene_id attributes? are those different from the transcript_id attributes? if they are the same value, you need to find a GTF that has actual gene_id values.
  2. Execute all Salmon runs with the same fasta, along with the same GTF that you plan to use with DESeq2.

These FAQs cover the “how-to”. The UCSC hosted Ensembl genes track is probably what you want to use. Get the fasta from the Table Browser but get the GTF from their downloads area (only – not the TB).

I also added some tags to your post that will find related prior topics here.

1 Like

Dear @jennaj,

Thank you so much for taking time to look into this and for recommending information sources to check out - this is very helpful.

This is likely your problem, too. And many posts at this forum are about format/content issues with either fasta, GTF, or both when used together. Tools are literal – any identifier that is compared between files needs to exactly match between all inputs.

That makes sense. Although I am still somewhat confused in terms of how I can check whether my reference transcriptome and the annotations match exactly. Are there any tools that would allow me do that? Visual inspection of FASTA files suggests that they should match - e.g. both the reference and the annotation file show transcripts with versioned IDs. But that does not confirm whether they are fully compatible.

  • did the file completely Upload to Galaxy? Truncated fasta is very hard to detect. So, just try to reproduce it by loading the file again and comparing

Thank you, will try that!

  • does not contain any duplicated transcripts_id values – these are on the > title lines

As above - is there any tool that allows this specifically? Both for Fasta and .gtfs?

  • if you get the fasta from a data provider, all in one file, it might have problems the first and second will address but you won’t need to worry about third

Yes, I downloaded everything with from ensembl initially, so assumed they should be compatible. I suspect tools not liking something about the .gtf files I used, though: I also tried reference transcriptome and .gtf annotation file from Gencode as substitutes for the ensembl files. Which changed the situation with lines (62,263 lines for Gene Quantification and ~250,000 lines for the Quantification files). And genes and transcripts were labelled respectively in the output tables (experiment with ensembl files provided output as gene IDs in Quantification and transcript IDs in Gene Quantification). But when I ran DESeq2 with Salmon output using Gencode files, I was only able to use Quantification output but not the Gene Quantification output. So something is still not quite right?

  • make sure it is also completed Uploaded, and use “auto-detect” for the datatype. If Galaxy doesn’t guess gtf there is a problem to fix

Will try reuploading, although nothing suggested truncation yet. I used the ftp urls to upload those and it was .gtf or .gtf.gz.

  • the most common problem is header lines # at the top. Remove the headers, then “redetect” the datatype until Galaxy guesses gtf. If it won’t be produced, then you don’t have a GTF (maybe is GFF3 which will not work well?)

Yes! I tried removing headers. This worked but line titles remained. Salmon output, however, looked the same as prior to removing the headers. I don’t think I played around with datatypes much, will check what happened there.

  • does the file contain gene_id attributes? are those different from the transcript_id attributes? if they are the same value, you need to find a GTF that has actual gene_id values.

The gtf files I used contained both. In terms of difference - do you mean ENSG… vs ENST… IDs? Those were different.

The UCSC hosted Ensembl genes track is probably what you want to use.

Yes, I haven’t tried this one - will see if it helps.

@jennaj

I checked the datatypes of the gtf annotation files, they were all gtf. Tried the headers again but they did not help.

Checked the UCSC gene tracks but I understand most datasets are tailored to genome analyses rather than RNAseq. I found this dataset: analysis set. It contains the fasta file where data is broken down by chromosome rather than by gene or transcript and it does not seem to have a corresponding annotation file? Have you ever used this dataset for your analyses? Which annotation would you recommend to use it with to avoid possible mismatches?

I checked the recommendations for ensembl and gencode gtfs but the links go back to the main sites of these databases, so I understand I would need to use similar or the same datasets to the ones I’ve tried, including transcriptome references.

I have also tried investigating this bug of DESeq2 after having tried to run the Salmon Gene Quantification output files with the reference transcripts and annotation from GENCODE. The current bug report is the following:

Warning message:
In Sys.setlocale(“LC_MESSAGES”, “en_US.UTF-8”) :
OS reports request to set locale to “en_US.UTF-8” cannot be honored
Import genomic features from the file as a GRanges object … OK
Prepare the ‘metadata’ data frame … OK
Make the TxDb object … OK
‘select()’ returned 1:1 mapping between keys and columns
reading in files with read.delim (install ‘readr’ package for speed up)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
reading in files with read.delim (install ‘readr’ package for speed up)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
Error in .local(object, …) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.

Example IDs (file): [ENSG00000210194, ENSG00000210191, ENSG00000210176, …]

Example IDs (tx2gene): [ENST00000456328, ENST00000450305, ENST00000473358, …]

This can sometimes (not always) be fixed using ‘ignoreTxVersion’ or ‘ignoreAfterBar’.

Calls: get_deseq_dataset … tximport → summarizeToGene → summarizeToGene → .local
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The “phase” metadata column contains non-NA values for features of type
stop_codon. This information was ignored.

This bug occurred when using annotation files containing or not containing headers.

The report refers to this ‘ignoreTxVersion’ or ‘ignoreAfterBar’ element, and I also saw that some Bioconductor users recommend bypassing issues related to tximport warnings using ignoreTxversion function (https://support.bioconductor.org/p/81954/). I was wondering if there is an equivalent function/setting for this in DESeq2 wrapper in Galaxy? If so, perhaps it could help?

I also saw a tximport tool in the menu. I wonder if this may be worth trying to work around the bug somehow?

I also read that GENCODE .gtfs can contain duplicated transcripts. Could this interfere with matching and should I somehow filter the transcripts?

I was also recommended to try inputting Salmon counts without refering to the annotation and annotate my data later in the workstream. Would that be possible with the inbuilt version of DESeq2 (I was only able to choose between transcript-to-gene table or .gtf/.gff files)?

Would appreciate your thoughts :slightly_smiling_face:

Hi @Egle

Try with these two files for human as an example of a paired set of reference data.

These are direct URLs – can be copied and pasted into the Upload tool. Leave all the auto settings at default.

For “sanity checks” you can copy one of the transcript_id values and use a tool like “Select lines that match an expression”. Use that transcript_id value as a term for a text search against both of these files. Then check – is that transcript_id only in the fasta file once? Are all the lines in the GTF describing features for that same transcript_id (at a minimum, all the transcript_id lines should all be associated with the same gene_id).

It should definitely be possible to get this from other data providers. If the files at a particular site are confusing, you could write in to them and ask how others get this data. You want a transcript_id fasta and a GTF or two column file that describes the transcript_id(s) plus minimally gene_id(s).