I am trying to run DESeq2 with featureCounts files from a single cell sequencing project. I have 8 different samples, so 8 collections of tabular files. Each file has a column with Geneid and a column with the count. When I try to run DEseq (4 samples from one group vs 4 samples in another group) I get an error message saying:
Dear carolyn_nielsen,
I assume that you may have picked samples twice or more for the same factor? Or you may have selected some input files that are nout count files?
So please check your input file setting (factor and factor levels).
Both of the replies from @gallardoalba and @Flow could be potential issues. I also added a few tags to the topic that lead to prior Q&A about DEseq2/Featurecounts/GTFs. Those topics include troubleshooting help and tutorials.
Hi there - sample input attached here. Each of the collections is the output of FeatureCounts after I removed the first duplicate row. Each collection is a different sample, containing a different number of counts files as samples didn’t all have the same number of cells.
It should have been the same annotation as the same workflow and reference genome was used for each collection. But, when I open each of the collections shown above I can see that the count files they contain sometimes have different numbers of rows to the other collections, which I think relate (very loosely?) to the numbers in the error message above: 135280, 235928, 226602, 203217, 210838, 214950, 279139, 133575.
For example, the files in the first collection (n=49) seem to all say approx 220,00 lines (I haven’t manually opened every single one to check):
Second (n=88) also approx. 220,000:
Third (n=87) is approx 130,000:
Fourth (n=59) is approx 240,000:
Fifth (n=88) approx 230,000:
Sixth (n=52) approx 290,000:
Seventh (n=81) approx 210,000:
Eighth (n=37) approx 130,000:
Any thoughts on why the lane #s are different and how to resolve?
Try this order instead, and omit GffCompare. The original protocol is creating a unified GTF to base counts on after the counts have been generated and is likely the source of the problem.
Trimmomatic
RNAStar (or HISAT2) If using HISAT2 be sure to use the output setting to create a BAM dataset compatible with Stringtie. This option is located under Advanced Options > Spliced alignment options > Transcriptome assembly reporting > select the radio button for “Report alignments tailored for transcript assemblers including StringTie”)
Filter BAM datasets on a variety of attributes – this is optional, and usually used to restrict the BAM contents. I’m assuming you are doing this on purpose and with a purpose specific to your analysis goals.
StringTie Merge – enter just your reference annotation GTF under the input Reference annotation to include in the merging. This “grooms” the GTF to a standardized format Stringtie can interpret.
Stringtie – using the groomed GTF from step 4
StringTie Merge – enter the groomed GTF from step 4 plus all of the GTFs produced by Stringtie in step 5 (one per sample). This step is optional and depends on how you want to do the counting.
If you are only interested in known transcripts/genes, skip step 6.
If you are interested in both known transcripts/genes + novel transcripts/genes within your reads, run step 6 to create a unified GTF. Be aware that any novel transcripts/genes that do not merge with known transcripts/genes will be named by Stringtie.
FeatureCounts – run this with the GTF output of either step 4 (known only) or step 6 (known + novel), along with all BAM mapped read results.
Proceed to DEseq2 using the same exact GTF used in step 7 and the current error about count inputs having a different number of lines (genes) will resolve.
Note: The message about the same input count file being input twice is a common usage problem, so that warning is presented to help with troubleshooting whenever the tool fails. From your screenshot and description of the inputs to DEseq2, this doesn’t seem to be what is going on for your case.
Why are are removing extra header lines from GTFs created by Stringtie is unclear. The reference GTF for known transcript/genes may contain extra comment lines, and those should be removed, but Stringtie GTFs should contain one header line (contains sample IDs). It looks like the sample ID header line is in your data – so maybe I misinterpreted what you are doing.
Creating a unified GTF (known only, or known + novel) and using that to generate counts will then be based on the same set of transcripts/genes, and the count files will have the same number of rows (one per gene included in the unified GTF). GffCompare is a useful tool but not for DEseq2 (or edger or limma).
For others reading (@carolyn_nielsen does have this data): If you do not have a known transcript/gene GTF, that can be omitted. But you should still Stringtie merge all of the Stringtie GTFs before running Featurecounts to generate counts. For this type of usage, all transcript/gene names will be assigned by Stringtie (random but unique identifiers – the MSTRG.N content in your data are those types of identifiers, and many would resolve/merge into Known Genes if the protocol above is followed).
Hi Jennaj - just to check, does this look correct? So StringTie not actually needed to get to FeatureCounts? This picks up from Step 3 above. The GTF file used is as before: Homo_sapiens.GRCh37.75.gtf