Try this order instead, and omit `GffCompare`. The original protocol creates a unified GTF to base counts on after the counts have already been generated, and that is likely the source of the problem.

1. `Trimmomatic`
2. `RNA STAR` (or `HISAT2`). If using `HISAT2`, be sure to use the output setting that creates a `BAM` dataset compatible with `StringTie`. This option is located under Advanced Options > Spliced alignment options > Transcriptome assembly reporting > select the radio button for “Report alignments tailored for transcript assemblers including StringTie”.
3. `Filter BAM datasets on a variety of attributes`: this is optional, and usually used to restrict the `BAM` contents. I’m assuming you are doing this on purpose, for a reason specific to your analysis goals.
4. `StringTie Merge`: enter just your reference annotation `GTF` under the input `Reference annotation to include in the merging`. This “grooms” the `GTF` to a standardized format that `StringTie` can interpret.
5. `StringTie`: run using the groomed `GTF` from step 4.
6. `StringTie Merge`: enter the groomed `GTF` from step 4 plus all of the `GTFs` produced by `StringTie` in step 5 (one per sample). This step is optional and depends on how you want to do the counting.
   - If you are only interested in known transcripts/genes, skip step 6.
   - If you are interested in both known transcripts/genes + novel transcripts/genes within your reads, run step 6 to create a unified `GTF`. Be aware that any novel transcripts/genes that do not merge with known transcripts/genes will be named by `StringTie`.
7. `featureCounts`: run this with the `GTF` output of either step 4 (known only) or step 6 (known + novel), along with all `BAM` mapped read results.

- Proceed to `DESeq2` using the exact same `GTF` used in step 7, and the current error about count inputs having a different number of lines (genes) will resolve.
- Note: The message about the same input count file being input twice describes a common usage problem, so that warning is presented to help with troubleshooting whenever the tool fails. From your screenshot and description of the inputs to `DESeq2`, this doesn’t seem to be what is going on in your case.
- Why you are removing extra header lines from GTFs created by `StringTie` is unclear. The reference GTF for known transcripts/genes may contain extra comment lines, and those should be removed, but `StringTie` GTFs should contain one header line (contains sample IDs). It looks like the sample ID header line is in your data, so maybe I misinterpreted what you are doing.
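For the reference annotation GTF mentioned in the last note, the extra comment lines can be dropped with a few lines of Python. This is only a minimal sketch: it assumes comment lines start with `#`, the function name `strip_gtf_comments` is hypothetical, and the example records are made up for illustration.

```python
def strip_gtf_comments(lines):
    """Return GTF lines with comment lines (starting with '#') and blank lines removed."""
    return [
        line
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]


# Tiny made-up example; a real reference GTF would be read from a file.
example = [
    "#!genome-build EXAMPLE\n",
    "# another comment line\n",
    'chr1\tsource\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    'chr1\tsource\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";\n',
]

cleaned = strip_gtf_comments(example)
print(len(cleaned), "annotation lines kept")
```

This is meant for the reference annotation only. The GTFs produced by `StringTie` can be left as they are.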
Creating a unified GTF (known only, or known + novel) and using it to generate counts means that every count file is based on the same set of transcripts/genes, so the count files will all have the same number of rows (one per gene included in the unified GTF). `GffCompare` is a useful tool, but not for `DESeq2` (or `edgeR` or `limma`).
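A quick way to confirm that the count files really do share the same genes before running the differential expression step is to compare their first columns. The sketch below is an illustration, not part of any Galaxy tool: it assumes tab-separated count tables with the gene ID in the first column and an optional `Geneid` header row, and the function names and sample data are hypothetical.

```python
def read_gene_ids(count_text):
    """Collect the first column (gene ID) of a tab-separated count table.

    Blank lines, lines starting with '#', and a 'Geneid' header row are skipped.
    """
    ids = []
    for line in count_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        gene_id = line.split("\t")[0]
        if gene_id == "Geneid":
            continue
        ids.append(gene_id)
    return ids


def compare_count_tables(tables):
    """Per sample, report the number of unique gene IDs and which IDs are
    missing relative to the union of IDs across all samples."""
    id_sets = {name: set(read_gene_ids(text)) for name, text in tables.items()}
    all_ids = set().union(*id_sets.values())
    return {
        name: {"rows": len(ids), "missing": sorted(all_ids - ids)}
        for name, ids in id_sets.items()
    }


# Made-up tables: sample2 was counted against a different annotation.
tables = {
    "sample1": "Geneid\tsample1\nGENE_A\t10\nGENE_B\t0\nMSTRG.1\t5\n",
    "sample2": "Geneid\tsample2\nGENE_A\t7\nGENE_B\t3\n",
}

report = compare_count_tables(tables)
for name, info in report.items():
    print(name, info["rows"], info["missing"])
```

If every sample reports an empty `missing` list, the tables were generated against the same set of genes, which is what counting against one unified GTF should produce.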
Prior Q&A: Galaxy Community Hub
The first few results from that search point to Q&A that describe the solution (on our prior forum):
- StringTie and StringTie merge - when to apply the Guide gff (reference annotation file)?
- Stringtie Merge
One issue ticket that describes (in part) the problem with the usage protocol above is also included. It is very technical and isn’t really needed, since the same information is in this post and in the prior Q&A linked above: Stringtie 1.3.3 errors when the option to output Deseq2/EdgeR is used · Issue #1322 · galaxyproject/tools-iuc · GitHub
FAQ: Extended Help for Differential Expression Analysis Tools
For others reading (@carolyn_nielsen does have this data): If you do not have a known transcript/gene GTF, it can be omitted. But you should still run `StringTie merge` on all of the `StringTie` GTFs before running `featureCounts` to generate counts. With this type of usage, all transcript/gene names will be assigned by `StringTie` (random but unique identifiers: the MSTRG.N entries in your data are identifiers of this type, and many would resolve/merge into known genes if the protocol above is followed).
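To see how many identifiers in a count table were assigned by `StringTie` rather than taken from a reference annotation, the gene IDs can be split on the MSTRG.N pattern. This is a small sketch under the assumption that assigned identifiers always look like `MSTRG.` followed by a number. The function name and the example IDs are hypothetical.

```python
import re

# Assumption: StringTie-assigned identifiers start with "MSTRG." and a number.
MSTRG_PATTERN = re.compile(r"^MSTRG\.\d+")


def split_known_and_novel(gene_ids):
    """Split gene IDs into those from the annotation and those named by StringTie."""
    known = [g for g in gene_ids if not MSTRG_PATTERN.match(g)]
    novel = [g for g in gene_ids if MSTRG_PATTERN.match(g)]
    return known, novel


# Made-up identifiers for illustration.
ids = ["GENE_A", "MSTRG.1", "MSTRG.20", "GENE_B"]
known, novel = split_known_and_novel(ids)
print(len(known), "known;", len(novel), "named by StringTie")
```

A large share of MSTRG.N identifiers in data that should mostly match known genes can be a hint that the reference annotation was not included when merging.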
Tutorials: Galaxy Training!
Hope that helps!