I hope you can help me solve this issue. I used featureCounts, and the summary shows about 69% assigned reads, but when I look inside each featureCounts output file I see the list of gene IDs and the count for almost all gene IDs is 0, rarely 1.
When I proceeded to create the count matrix, the table was full of zeros, with only the occasional 1, 2, 3 or 4.
How can I solve this issue?
This information may help: I used the built-in mm10 genome (mouse reference genome). For mouse, the list of built-in reference genomes in featureCounts contains only mm10 and mm9.
Thank you in advance
Maybe start by checking the reference annotation GTF used to see if it has content or format problems? Then check to make sure that the attribute used to summarize counts matches the attributes actually in your GTF. The default attribute for counting is “gene_id”.
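One way to do the check described above outside of Galaxy is a short script that tallies the feature types (column 3) and the attribute keys (column 9) in the GTF. This is only a sketch: the function name `summarize_gtf` and the sample lines are made up for illustration, and it assumes a standard 9-column, tab-separated GTF with quoted attribute values. As far as I know, featureCounts counts over "exon" features by default, so both "exon" and "gene_id" should show up with non-zero tallies.

```python
import re
from collections import Counter

# Matches attribute pairs written as: key "value";
# Unquoted values (e.g. exon_number 1) are not captured by this sketch.
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')

def summarize_gtf(lines):
    """Tally feature types (column 3) and attribute keys (column 9)."""
    features = Counter()
    attributes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) != 9:
            features["<malformed>"] += 1  # wrong column count: format problem
            continue
        features[fields[2]] += 1
        for key, _value in ATTR_RE.findall(fields[8]):
            attributes[key] += 1
    return features, attributes

# Hypothetical sample lines; replace with open("your_annotation.gtf")
sample = [
    "#comment line",
    'chr1\tknownGene\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tknownGene\tCDS\t120\t180\t.\t+\t0\tgene_id "G1"; transcript_id "T1";',
]
features, attributes = summarize_gtf(sample)
print("feature types:", dict(features))
print("attribute keys:", dict(attributes))
```

If "<malformed>" appears, or if "gene_id" is missing from the attribute keys, the GTF is the likely culprit.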
A good source for GRCm38/mm10 reference annotation is UCSC. The reference annotation data from this source is available for a few different gene tracks (your choice of which to use), has UCSC-formatted chromosome identifiers (exactly the same as the built-in mm10 reference genome mapped against), plus includes the annotation attributes this tool (and most other tools) can interpret.
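To see whether chromosome identifiers actually match, you could compare the names in column 1 of the GTF against the names used by the genome you mapped to. A minimal sketch follows; the helper names and the example identifiers are hypothetical, and the genome names would in practice come from your BAM header or the genome's chromosome list.

```python
def gtf_chromosomes(lines):
    """Collect the distinct sequence names from column 1 of GTF lines."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names

def compare_names(gtf_names, genome_names):
    """Report which GTF sequence names are shared with the genome."""
    gtf_names, genome_names = set(gtf_names), set(genome_names)
    return {
        "shared": sorted(gtf_names & genome_names),
        "gtf_only": sorted(gtf_names - genome_names),
    }

# Hypothetical example: UCSC-style genome names vs. a GTF without the "chr" prefix
genome_names = ["chr1", "chr2", "chrM"]
gtf_lines = [
    '1\tsource\texon\t100\t200\t.\t+\t.\tgene_id "G1";',
    '2\tsource\texon\t300\t400\t.\t-\t.\tgene_id "G2";',
    'MT\tsource\texon\t10\t90\t.\t+\t.\tgene_id "G3";',
]
report = compare_names(gtf_chromosomes(gtf_lines), genome_names)
print(report)
```

An empty "shared" list means no annotated feature can overlap any mapped read, so the counts cannot be trusted until the identifiers agree.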
Tip: Try to use the same reference annotation for all steps in the same analysis pathway. Meaning, if you incorporated some other annotation during upstream mapping, you should rerun those jobs using the new annotation to avoid content problems or unexpected/incorrect results due to technical conflicts.
- Common datatypes explained >> see GTF
- Mismatched Chromosome identifiers (and how to avoid them)
- Extended Help for Differential Expression Analysis Tools
Thank you for your reply. I will try to use the UCSC reference genome.
Concerning the tip, the HISAT2 mapping step offers only the built-in genome option (there is no option to use a genome uploaded to the history).
I will try with UCSC, hopefully it will work.
Thank you!
Great – this might resolve the counting problems. Much can go wrong if a GTF is a mismatch or has unexpected content.
Choosing a reference genome (fasta) to map against is one tool option for HISAT2 (required). Definitely use a built-in index for this option if the target reference genome is large (mouse is large!).
Including a reference annotation (GTF) is a distinct option for HISAT2 (optional). It can be included to filter by known splice sites. Where to enter the GTF is a bit nested on the HISAT2 tool form: find it under “Advanced Options > Spliced alignment options > GTF file with known splice sites”.
Maybe try a mapping run without a GTF and one with a GTF, and compare how the counts are reported/assigned in downstream steps. If you are not interested in discovery and are focusing on known genes/transcripts, incorporating annotation can simplify results. The Bioconductor support forum has much discussion about featureCounts and related topics if interested: https://support.bioconductor.org/
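For comparing the two runs, a small script can pull the assigned fraction out of each featureCounts summary. This sketch assumes the summary is a two-column, tab-separated table with a "Status" header row followed by rows such as "Assigned" and "Unassigned_...". The function name and all numbers below are invented for illustration, not real results.

```python
def assigned_fraction(summary_text):
    """Return Assigned / total reads from a two-column summary table."""
    counts = {}
    for line in summary_text.strip().splitlines():
        status, value = line.split("\t")[:2]
        if status == "Status":
            continue  # header row
        counts[status] = int(value)
    total = sum(counts.values())
    return counts.get("Assigned", 0) / total if total else 0.0

# Made-up summaries for a run mapped without a GTF and one mapped with a GTF
run_without_gtf = "Status\tsample\nAssigned\t690\nUnassigned_NoFeatures\t250\nUnassigned_Ambiguity\t60\n"
run_with_gtf = "Status\tsample\nAssigned\t720\nUnassigned_NoFeatures\t230\nUnassigned_Ambiguity\t50\n"

frac_without = assigned_fraction(run_without_gtf)
frac_with = assigned_fraction(run_with_gtf)
print(f"without GTF: {frac_without:.2%}, with GTF: {frac_with:.2%}")
```

It is also worth checking that the counts column of each output table adds up to roughly the "Assigned" number. A large gap between the two would suggest that the table and the summary are not describing the same run.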
Thank you very much, this is really helpful. I will do so.