Try this order instead, and omit GffCompare. The original protocol creates the unified GTF only after the counts have already been generated, and that is likely the source of the problem.
1. Trimmomatic
2. RNA STAR (or HISAT2). If using HISAT2, be sure to use the output setting to create a BAM dataset compatible with Stringtie. This option is located under Advanced Options > Spliced alignment options > Transcriptome assembly reporting > select the radio button for “Report alignments tailored for transcript assemblers including StringTie”.
3. Filter BAM datasets on a variety of attributes – this is optional, and usually used to restrict the BAM contents. I’m assuming you are doing this on purpose and with a purpose specific to your analysis goals.
4. StringTie Merge – enter just your reference annotation GTF under the input “Reference annotation to include in the merging”. This “grooms” the GTF to a standardized format Stringtie can interpret.
5. Stringtie – using the groomed GTF from step 4.
6. StringTie Merge – enter the groomed GTF from step 4 plus all of the GTFs produced by Stringtie in step 5 (one per sample). This step is optional and depends on how you want to do the counting.
   - If you are only interested in known transcripts/genes, skip step 6.
   - If you are interested in both known transcripts/genes + novel transcripts/genes within your reads, run step 6 to create a unified GTF. Be aware that any novel transcripts/genes that do not merge with known transcripts/genes will be named by Stringtie.
7. FeatureCounts – run this with the GTF output of either step 4 (known only) or step 6 (known + novel), along with all BAM mapped read results.
8. Proceed to DEseq2 using the same exact GTF used in step 7 and the current error about count inputs having a different number of lines (genes) will resolve.
   - Note: The message about the same input count file being input twice is a common usage problem, so that warning is presented to help with troubleshooting whenever the tool fails. From your screenshot and description of the inputs to DEseq2, this doesn’t seem to be what is going on in your case.
   - Why you are removing extra header lines from GTFs created by Stringtie is unclear. The reference GTF for known transcripts/genes may contain extra comment lines, and those should be removed, but Stringtie GTFs should contain one header line (contains sample IDs). It looks like the sample ID header line is in your data – so maybe I misinterpreted what you are doing.
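For anyone who wants to see what "removing extra comment lines" from a reference GTF amounts to, here is a minimal Python sketch. It assumes the comment lines are the ones starting with `#`, which is the usual GTF convention. The function name and the toy lines are made up for illustration. It is meant for the reference annotation only, not for Stringtie GTFs, which should keep their header line.

```python
def strip_gtf_comments(lines):
    """Yield GTF lines, dropping blank lines and comment lines that start with '#'."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield line


# Hypothetical reference GTF content: two comment lines and one feature line.
example = [
    "#!genome-build hypothetical\n",
    "# another comment line\n",
    'chr1\tsource\tgene\t1\t100\t.\t+\t.\tgene_id "geneA";\n',
]

cleaned = list(strip_gtf_comments(example))
print(len(cleaned), "feature line(s) kept")
```

Within Galaxy itself, a text manipulation tool would do the same job; the sketch only illustrates the filtering logic.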
When you create a unified GTF (known only, or known + novel) and use it to generate counts for every sample, all counts are based on the same set of transcripts/genes, and the count files will have the same number of rows (one per gene included in the unified GTF). GffCompare is a useful tool but not for DEseq2 (or edgeR or limma).
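If you want to confirm that a set of count files really does share the same genes before running DEseq2, a quick check along these lines can help. This is only a sketch under assumptions: it treats each count file as a tab-separated table with the gene ID in the first column and an optional header row starting with `Geneid`, and the sample names and values are invented toy data.

```python
import csv
import io


def read_gene_ids(count_text):
    """Return the gene IDs (first column) of a tab-separated count table."""
    ids = []
    for row in csv.reader(io.StringIO(count_text), delimiter="\t"):
        # Skip blank lines, comment lines, and an assumed "Geneid" header row.
        if not row or row[0].startswith("#") or row[0] == "Geneid":
            continue
        ids.append(row[0])
    return ids


def compare_count_tables(tables):
    """Map each sample name to the gene IDs it lacks relative to all samples combined."""
    ids_per_sample = {name: set(read_gene_ids(text)) for name, text in tables.items()}
    all_ids = set().union(*ids_per_sample.values())
    return {name: sorted(all_ids - ids) for name, ids in ids_per_sample.items()}


# Hypothetical toy data: sample_b was counted against a different GTF.
sample_a = "Geneid\tsample_a\ngeneA\t10\ngeneB\t0\nMSTRG.1\t4\n"
sample_b = "Geneid\tsample_b\ngeneA\t7\ngeneB\t3\n"

missing = compare_count_tables({"sample_a": sample_a, "sample_b": sample_b})
for name, ids in missing.items():
    print(name, "is missing", len(ids), "gene(s):", ids)
```

Any sample that reports missing genes was most likely counted against a different GTF than the others.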
Prior Q&A: Galaxy Community Hub
The first few results from that search point to Q&A that describe the solution (on our prior forum):
- StringTie and StringTie merge - when to apply the Guide gff (reference annotation file)?
- Stringtie Merge
One issue ticket describes (in part) the problem with the usage protocol above. This is very technical and isn’t really needed – the same information is in this post and in the prior Q&A linked above: Stringtie 1.3.3 errors when the option to output Deseq2/EdgeR is used · Issue #1322 · galaxyproject/tools-iuc · GitHub
FAQ: Extended Help for Differential Expression Analysis Tools
For others reading (@carolyn_nielsen does have this data): If you do not have a known transcript/gene GTF, that can be omitted. But you should still Stringtie merge all of the Stringtie GTFs before running Featurecounts to generate counts. For this type of usage, all transcript/gene names will be assigned by Stringtie (random but unique identifiers – the MSTRG.N content in your data are those types of identifiers, and many would resolve/merge into Known Genes if the protocol above is followed).
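To get a rough idea of how many identifiers in a count table were assigned by Stringtie rather than taken from the reference annotation, you could split the IDs on the `MSTRG.N` pattern. This is a small sketch; the function name and example IDs are hypothetical, and it assumes Stringtie-assigned gene IDs start with `MSTRG.` followed by a number, as in the data discussed here.

```python
import re

# Stringtie-assigned identifiers as described above: "MSTRG." followed by a number.
MSTRG_PATTERN = re.compile(r"^MSTRG\.\d+")


def split_known_and_novel(gene_ids):
    """Return (known, novel) lists, where novel IDs match the MSTRG.N pattern."""
    known = [g for g in gene_ids if not MSTRG_PATTERN.match(g)]
    novel = [g for g in gene_ids if MSTRG_PATTERN.match(g)]
    return known, novel


# Hypothetical gene IDs for illustration only.
known, novel = split_known_and_novel(["geneA", "MSTRG.12", "geneB", "MSTRG.7"])
print(len(known), "known;", len(novel), "assigned by Stringtie")
```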
Tutorials: Galaxy Training!
Hope that helps!