Stringtie Returning Different Transcript Counts Using Same Reference Annotation

I ran stringtie on my 8 samples to identify novel transcripts, I then used stringtie merge to create a new reference annotation. I then used that same reference annotation to run on my individual 8 samples. When I obtain the transcript counts for each, they don’t contain the same list of transcripts, they are variable numbers. Has anyone experienced this or know how to fix it?

Hi @KMKLOHONATZ
Do you produce counts for genes? Maybe try other tools, such as featureCounts or htseq-count. QC the data on all steps: get mapping and counting stats and visualize the starts with MultiQC.
It is rather unusual situation for gene counts.
Are you using Galaxy Australia server?
Kind regards,
Igor

Hi @igor ,

Yes it produces counts for both genes and transcripts. I have run the QC as well and it all looks ok.

Yes I am using the Austrailia server because the usegalaxy.org server is still not working for Stringtie on these exact same samples and generates counts files that are empty.

1 Like

Hi @igor ,

As a followup I have also done Feature Counts and the same issue is occurring, different number of transcripts are being counted even though they are all using the same reference annotation. What is also strange is that it isn’t consistent as to which are not being counted and which are.

HI @igor ,

I have tried to use a different annotation file that has been used successfully with other projects and the same issue is occurring. This implies to me there is a server issue but I truthfully do not know.

Hi @KMKLOHONATZ
I am a support officer for Galaxy Australia. Recently a user reported a failed StringTie job with No space left on device message. Based on description from your posts on this forum, it might be your account. Are you comfortable with discussion of your data on this (open) forum or do you prefer communication via Galaxy Australia help desk? I don’t see completed QC jobs among undeleted datasets in the history with failed StringTie jobs. I have some reservations about quality of the data and the annotation file used for read counting. However, I expect identical number of genes in counting files.
Kind regards,
Igor

Hi @igor ,

I have no issues with this being discussed open forum. There is a multiQC file in my history, #218, The read qualities appear to be very good but will take any advice. The annotation file I tried on the last 2 stringtie runs was one previously used in a sequencing project and we had no issues.

I just don’t understand why with the reference annotation different numbers are coming back and it concerns me about the validity of the data all together.

Here is the URL:

Hi @KMKLOHONATZ
I was curious why different number of genes were produced by StringTie (in counting mode) and featureCounts, so I checked the data and completed several tests. The count files have ~388k genes or transcripts. This is a very big number. Generally, we see ~20k protein coding genes in Eukaryotic organisms. featureCounts produced the same number of genes in counting files for “HiSAT2” samples when used with the original Annotation file (dataset #2), while I also got different number of genes with annotation produced by StringTie Merge. I don’t have an explanation for the latter. featureCounts produced counts for 34k genes with the original annotation (dataset #2), which is a reasonable number for an eukariotic organism.
I checked quality of the trimmed reads with FastQC. The nucleotide composition is rather unusual. It changes along the read length. It is a not a good sign. Some samples have three peaks in GC content. GC content varies considerably between the samples. The read length distribution is unusual, with a peaks of very short and full-length reads. I completed samtools flagstats on HiSAT2 alignments. Number of properly mapped paired reads is very small in some samples. Samtools flagstats shows extreme over-representation of reads mapped to (some) Y contigs, consistent with high proportion of duplicates in FastQC reports. I counted reads using featureCounts and visualized summaries using MultiQC. Proportion of reads assigned to features (genes) varies between 0 and 24% between the samples. I don’t know anything about the experiment and samples, but many things look suspicious.
It seems the annotation produced by StringTie/StringTie Merge has an excessive number of transcripts and genes. Both featureCounts and StringTie in Counting mode cannot handle the StringTie Merge annotation and alignments properly returning different number of genes. As I said, you can use Annotation in dataset #2, but the counts varies a lot between files, and many samples have small number of reads assigned to genes.

For tests I used StringTie in Counting mode with Output gene abundance estimation file set to Yes (in Advanced options). It works fine on my test samples, but not with your data.

Kind regards,
Igor

HI @igor ,

Thank you for looking in to this and am a little shocked these samples are having so many issues. They are supposed to be 150 bp reads that I have already trimmed. Would you suggest trying to trimm them a bit more to clean them up or do you have any suggestions?

Hi @KMKLOHONATZ
I’d say these are potential issues. It depends on type of data: standard RNA-Seq, non-coding RNAs, small RNAs - all have different properties, with different QC plots. Maybe examine gene annotations and BAM files from several samples using IGV or JBrowse. Maybe try another sample, from a published study, as a comparison. Run more QC, for example, QualiMap.

Maybe consider adding additional Summary output for HiSAT2 jobs, followed by visualization with MultiQC.

As for trimming: modern aligners (can) do soft trimming, so adapters and low quality data at reads ends usually are not an issue. Some people trim reads, other do not.

Kind regards,
Igor