DESeq2 error with different numbers of rows

Hi @carolyn_nielsen

Try this order instead, and omit GffCompare. The original protocol creates a unified GTF to base counts on only after the counts have already been generated, and that is likely the source of the problem.

  1. Trimmomatic
  2. RNAStar (or HISAT2). If using HISAT2, be sure to use the output setting that creates a BAM dataset compatible with Stringtie. This option is located under Advanced Options > Spliced alignment options > Transcriptome assembly reporting > select the radio button for “Report alignments tailored for transcript assemblers including StringTie”.
  3. Filter BAM datasets on a variety of attributes – this is optional, and usually used to restrict the BAM contents. I’m assuming you are doing this on purpose and with a purpose specific to your analysis goals.
  4. StringTie Merge – enter just your reference annotation GTF under the input Reference annotation to include in the merging. This “grooms” the GTF to a standardized format Stringtie can interpret.
  5. Stringtie – using the groomed GTF from step 4
  6. StringTie Merge – enter the groomed GTF from step 4 plus all of the GTFs produced by Stringtie in step 5 (one per sample). This step is optional and depends on how you want to do the counting.
    • If you are only interested in known transcripts/genes, skip step 6.
    • If you are interested in both known transcripts/genes + novel transcripts/genes within your reads, run step 6 to create a unified GTF. Be aware that any novel transcripts/genes that do not merge with known transcripts/genes will be named by Stringtie.
  7. FeatureCounts – run this with the GTF output of either step 4 (known only) or step 6 (known + novel), along with all BAM mapped read results.
  8. Proceed to DESeq2 using the exact same GTF used in step 7, and the current error about count inputs having a different number of lines (genes) will resolve.
  9. Note: The message about the same input count file being input twice is a common usage problem, so that warning is presented to help with troubleshooting whenever the tool fails. From your screenshot and description of the inputs to DEseq2, this doesn’t seem to be what is going on for your case.
  10. Why you are removing extra header lines from GTFs created by Stringtie is unclear. The reference GTF for known transcript/genes may contain extra comment lines, and those should be removed, but Stringtie GTFs should contain one header line (contains sample IDs). It looks like the sample ID header line is in your data – so maybe I misinterpreted what you are doing.
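To illustrate what "removing extra comment lines" from a reference GTF means, here is a minimal Python sketch. It assumes comment lines are the ones starting with `#`. The function name and the example records are hypothetical, and within Galaxy you would normally do this with a text-manipulation tool rather than a script.

```python
def strip_gtf_comments(lines):
    """Return the GTF lines with '#' comment lines and blank lines removed."""
    return [line for line in lines if line.strip() and not line.startswith("#")]


# Hypothetical reference GTF content: two comment lines, then two feature records.
example = [
    "##description: hypothetical reference annotation\n",
    "#!genome-build example\n",
    'chr1\tsource\tgene\t100\t900\t.\t+\t.\tgene_id "GENE1";\n',
    'chr1\tsource\texon\t100\t400\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";\n',
]

cleaned = strip_gtf_comments(example)
print(len(cleaned))  # number of feature lines kept
```

Only the tab-separated feature lines remain, which is what step 4 expects as the reference annotation input.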

When you create a unified GTF (known only, or known + novel) and use it to generate counts, every count file is based on the same set of transcripts/genes, so the count files will have the same number of rows (one per gene included in the unified GTF). GffCompare is a useful tool, but not for preparing inputs to DESeq2 (or edgeR or limma).
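If you want to confirm that your count files really do share the same genes before running the differential expression step, a small check like the one below can help. This is only a sketch: it assumes each count file is tab-separated with one header line and the gene ID in the first column, and the function names and sample data are made up for illustration.

```python
import csv
import io


def read_gene_ids(handle):
    """Collect gene IDs (first column) from a tab-separated count table, skipping the header line."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # assumed single header line
    return [row[0] for row in reader if row]


def missing_genes(tables):
    """For each table, return the gene IDs found in other tables but absent from this one."""
    union = set().union(*tables.values())
    return {name: union - set(ids) for name, ids in tables.items()}


# Hypothetical count tables: sample_b was counted against a GTF lacking MSTRG.1.
sample_a = "Geneid\tsample_a\nGENE1\t10\nGENE2\t0\nMSTRG.1\t5\n"
sample_b = "Geneid\tsample_b\nGENE1\t7\nGENE2\t3\n"

tables = {
    "sample_a": read_gene_ids(io.StringIO(sample_a)),
    "sample_b": read_gene_ids(io.StringIO(sample_b)),
}
missing = missing_genes(tables)
for name, ids in missing.items():
    print(name, len(tables[name]), "rows; missing:", sorted(ids))
```

If every sample reports an empty "missing" list and the same row count, the count files were generated against the same GTF.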

Prior Q&A: Galaxy Community Hub

The first few results from that search point to Q&A that describe the solution (on our prior forum):

One issue ticket describes the problem with the usage protocol above (in part). It is very technical and isn’t really needed, since the same information is in this post and in the prior Q&A linked above: Stringtie 1.3.3 errors when the option to output Deseq2/EdgeR is used · Issue #1322 · galaxyproject/tools-iuc · GitHub

FAQ: Extended Help for Differential Expression Analysis Tools

For others reading (@carolyn_nielsen does have this data): If you do not have a known transcript/gene GTF, it can be omitted, but you should still run StringTie Merge on all of the Stringtie GTFs before running FeatureCounts to generate counts. For this type of usage, all transcript/gene names will be assigned by Stringtie (random but unique identifiers – the MSTRG.N entries in your data are those types of identifiers, and many would resolve/merge into Known Genes if the protocol above is followed).
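To get a quick sense of how many rows in a count table carry Stringtie-assigned identifiers versus known gene names, something like this sketch can be used. It assumes the assigned IDs start with `MSTRG.` followed by a number, as in the data above. The function name and the example IDs are hypothetical.

```python
import re

# Stringtie-assigned identifiers are assumed to look like MSTRG.N (genes) or MSTRG.N.M (transcripts).
MSTRG_PATTERN = re.compile(r"^MSTRG\.\d+")


def split_known_and_assigned(gene_ids):
    """Split gene IDs into (known, stringtie_assigned) lists, preserving input order."""
    assigned = [g for g in gene_ids if MSTRG_PATTERN.match(g)]
    known = [g for g in gene_ids if not MSTRG_PATTERN.match(g)]
    return known, assigned


# Hypothetical gene IDs taken from the first column of a count table.
ids = ["GENE1", "MSTRG.1", "GENE2", "MSTRG.27", "MSTRG.27.1"]
known, assigned = split_known_and_assigned(ids)
print(len(known), "known;", len(assigned), "assigned by Stringtie")
```

A large share of `MSTRG` rows when a reference annotation exists for your genome usually suggests the reference GTF was not included in the merge step.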

Tutorials: Galaxy Training!

Hope that helps!
