DESeq2 error with different numbers of rows

I am trying to run DESeq2 with featureCounts files from a single-cell sequencing project. I have 8 different samples, so 8 collections of tabular files. Each file has a column with Geneid and a column with the count. When I try to run DESeq2 (4 samples from one group vs 4 samples in another group) I get an error message saying:

Error in data.frame(…, check.names = FALSE) :
arguments imply differing number of rows: 135280, 235928, 226602, 203217, 210838, 214950, 279139, 133575
Calls: get_deseq_dataset … eval → eval → eval → cbind → cbind → data.frame

The report also states that:

The tool was executed with one or more duplicate input datasets. This frequently results in tool errors due to problematic input choices.

How can I resolve this?

Many thanks!

Hi @carolyn_nielsen,
did you follow the same pipeline for generating the gene counts for each sample?

Dear carolyn_nielsen,
I assume that you may have picked the same samples twice or more for the same factor? Or you may have selected some input files that are not count files?

So please check your input file setting (factor and factor levels).

Cheers,
Florian

Hi @carolyn_nielsen

Both of the replies from @gallardoalba and @Flow point to potential issues. I also added a few tags to the topic that lead to prior Q&A about DESeq2/FeatureCounts/GTFs. Those topics include troubleshooting help and tutorials.

This FAQ may also help: Help for Differential Expression Analysis

Yes all the samples were processed using the same workflow:

Trimmomatic → RNAStar → Filter BAM datasets on a variety of attributes → StringTie → StringTie Merge → GffCompare → FeatureCounts

Hi there - sample input attached here. Each of the collections is the output of FeatureCounts after I removed the first duplicate row. Each collection is a different sample, containing a different number of counts files as samples didn’t all have the same number of cells.

Thanks will look at these!

Can you check if your count files have the same number of rows and order. Meaning, have you used the same annotation for all your files?
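
If it helps, this check can also be done outside Galaxy on downloaded count files. Below is a minimal sketch in Python, assuming two-column tabular files with a `Geneid` header as described above. The function names and the example gene IDs are hypothetical, made up for illustration only.

```python
import csv
import io

def read_gene_ids(handle):
    """Return the first column (gene IDs) of a tab-separated count file."""
    reader = csv.reader(handle, delimiter="\t")
    # Skip empty lines and comment lines starting with "#".
    rows = [row for row in reader if row and not row[0].startswith("#")]
    # Drop the header row if it is present.
    if rows and rows[0][0] == "Geneid":
        rows = rows[1:]
    return [row[0] for row in rows]

def compare_gene_ids(reference_ids, other_ids):
    """Summarise how other_ids differs from reference_ids."""
    ref_set, other_set = set(reference_ids), set(other_ids)
    return {
        "n_reference": len(reference_ids),
        "n_other": len(other_ids),
        "missing": sorted(ref_set - other_set),  # in reference only
        "extra": sorted(other_set - ref_set),    # in other only
        "same_order": reference_ids == other_ids,
    }

# Hypothetical in-memory data standing in for two count files.
sample_a = io.StringIO("Geneid\tsample_a\nENSG01\t5\nENSG02\t0\nMSTRG.1\t7\n")
sample_b = io.StringIO("Geneid\tsample_b\nENSG01\t3\nENSG02\t1\n")

report = compare_gene_ids(read_gene_ids(sample_a), read_gene_ids(sample_b))
print(report)
```

For real files, replace the `io.StringIO` objects with `open(path)` handles. If `same_order` is false or the `missing`/`extra` lists are not empty for any pair of files, the counts were probably not generated against the same annotation.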

It should have been the same annotation, as the same workflow and reference genome were used for each collection. But when I open each of the collections shown above, I can see that the count files they contain sometimes have different numbers of rows from those in the other collections, which I think relate (very loosely?) to the numbers in the error message above: 135280, 235928, 226602, 203217, 210838, 214950, 279139, 133575.

For example, the files in the first collection (n=49) all seem to have approx. 220,000 lines (I haven’t manually opened every single one to check). The approximate line counts per collection are:

  • First (n=49): approx. 220,000
  • Second (n=88): approx. 220,000
  • Third (n=87): approx. 130,000
  • Fourth (n=59): approx. 240,000
  • Fifth (n=88): approx. 230,000
  • Sixth (n=52): approx. 290,000
  • Seventh (n=81): approx. 210,000
  • Eighth (n=37): approx. 130,000

Any thoughts on why the line counts are different and how to resolve this?

Hi @carolyn_nielsen

Try this order instead, and omit GffCompare. The original protocol is creating a unified GTF to base counts on after the counts have been generated and is likely the source of the problem.

  1. Trimmomatic
  2. RNAStar (or HISAT2). If using HISAT2, be sure to use the output setting that creates a BAM dataset compatible with StringTie. This option is located under Advanced Options > Spliced alignment options > Transcriptome assembly reporting > select the radio button for “Report alignments tailored for transcript assemblers including StringTie”.
  3. Filter BAM datasets on a variety of attributes – this is optional, and usually used to restrict the BAM contents. I’m assuming you are doing this on purpose and with a purpose specific to your analysis goals.
  4. StringTie Merge – enter just your reference annotation GTF under the input Reference annotation to include in the merging. This “grooms” the GTF to a standardized format StringTie can interpret.
  5. StringTie – using the groomed GTF from step 4
  6. StringTie Merge – enter the groomed GTF from step 4 plus all of the GTFs produced by StringTie in step 5 (one per sample). This step is optional and depends on how you want to do the counting.
    • If you are only interested in known transcripts/genes, skip step 6.
    • If you are interested in both known transcripts/genes + novel transcripts/genes within your reads, run step 6 to create a unified GTF. Be aware that any novel transcripts/genes that do not merge with known transcripts/genes will be named by StringTie.
  7. FeatureCounts – run this with the GTF output of either step 4 (known only) or step 6 (known + novel), along with all BAM mapped read results.
  8. Proceed to DESeq2 using the exact same GTF used in step 7, and the current error about count inputs having a different number of lines (genes) will resolve.
  9. Note: The message about the same input count file being input twice is a common usage problem, so that warning is presented to help with troubleshooting whenever the tool fails. From your screenshot and description of the inputs to DESeq2, this doesn’t seem to be what is going on in your case.
  10. It is unclear why you are removing extra header lines from GTFs created by StringTie. The reference GTF for known transcripts/genes may contain extra comment lines, and those should be removed, but StringTie GTFs should contain one header line (contains sample IDs). It looks like the sample ID header line is in your data, so maybe I misinterpreted what you are doing.

Creating a unified GTF (known only, or known + novel) and using it to generate counts means every count file is based on the same set of transcripts/genes, so the count files will all have the same number of rows (one per gene included in the unified GTF). GffCompare is a useful tool but not for DESeq2 (or edgeR or limma).
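
To illustrate why the unified GTF matters, here is a rough Python sketch of combining per-sample count tables into one matrix. It is not the DESeq2 wrapper's actual code, only an analogy for the column-binding step that fails in the error above. The function name, sample names and gene IDs are all hypothetical.

```python
def build_count_matrix(samples):
    """Combine per-sample {gene_id: count} tables into one matrix.

    `samples` maps a sample name to a dict of gene_id -> count.
    Raises ValueError if the samples do not share the same gene IDs.
    """
    names = list(samples)
    if not names:
        return [], {}
    reference_genes = list(samples[names[0]])
    for name in names[1:]:
        if set(samples[name]) != set(reference_genes):
            sizes = {n: len(samples[n]) for n in names}
            raise ValueError(f"gene sets differ between samples: {sizes}")
    # One row per gene, one column per sample, in a fixed sample order.
    matrix = {gene: [samples[name][gene] for name in names]
              for gene in reference_genes}
    return names, matrix

# Counts generated against a different GTF per sample (made-up values).
per_sample_gtf = {
    "s1": {"ENSG01": 5, "MSTRG.1": 2},
    "s2": {"ENSG01": 3, "MSTRG.7": 4, "MSTRG.9": 1},
}
# Counts generated against one unified GTF (made-up values).
unified_gtf = {
    "s1": {"ENSG01": 5, "ENSG02": 0},
    "s2": {"ENSG01": 3, "ENSG02": 4},
}

names, matrix = build_count_matrix(unified_gtf)

try:
    build_count_matrix(per_sample_gtf)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print(mismatch_error)
```

With the unified GTF every sample contributes exactly one value per gene, so the matrix builds cleanly. With per-sample GTFs the gene sets differ and the combination fails, much like the "differing number of rows" message.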

Prior Q&A: Galaxy Community Hub

The first few results from that search point to Q&A that describe the solution (on our prior forum):

One issue ticket describes, in part, the problem with the usage protocol above. It is very technical and isn’t really needed, since the same information is in this post and in the prior Q&A linked above: Stringtie 1.3.3 errors when the option to output Deseq2/EdgeR is used · Issue #1322 · galaxyproject/tools-iuc · GitHub

FAQ: Extended Help for Differential Expression Analysis Tools

For others reading (@carolyn_nielsen does have this data): If you do not have a known transcript/gene GTF, that can be omitted. But you should still StringTie Merge all of the StringTie GTFs before running FeatureCounts to generate counts. For this type of usage, all transcript/gene names will be assigned by StringTie (random but unique identifiers; the MSTRG.N identifiers in your data are of this type, and many would resolve/merge into known genes if the protocol above is followed).
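
As a quick sanity check, you could count how many gene IDs in a count file carry StringTie-assigned names. This is a small sketch in Python, assuming those IDs start with the `MSTRG.` prefix seen in the data above. The function name and example IDs are hypothetical.

```python
from collections import Counter

def classify_gene_ids(gene_ids):
    """Count StringTie-assigned (MSTRG.N) versus all other gene IDs."""
    counts = Counter()
    for gene_id in gene_ids:
        # Assumption: StringTie-assigned IDs use the "MSTRG." prefix.
        kind = "stringtie_assigned" if gene_id.startswith("MSTRG.") else "other"
        counts[kind] += 1
    return dict(counts)

# Made-up gene IDs for illustration only.
example_ids = ["ENSG01", "MSTRG.1", "MSTRG.2", "ENSG02"]
summary = classify_gene_ids(example_ids)
print(summary)
```

A large share of `MSTRG.N` IDs that varies from sample to sample would suggest the counts were generated against per-sample assemblies rather than one shared GTF.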

Tutorials: Galaxy Training!

Hope that helps!

Ah, thank you, this sounds promising! I will edit and re-run these workflows from today and let you know how it goes.

Hi Jennaj - just to check, does this look correct? So StringTie not actually needed to get to FeatureCounts? This picks up from Step 3 above. The GTF file used is as before: Homo_sapiens.GRCh37.75.gtf

Previously this would have fed into StringTie Merge and then GffCompare.

@jennaj this worked, thank you!
