So I’ve always used the built-in mm10 reference in HISAT2 → featureCounts for RNA-Seq analysis, which returns about 27,000 genes labeled with Entrez IDs. However, when I work with public RNA-seq datasets found in GEO, a lot of them use htseq-count, which returns >55,000 genes labeled with Ensembl IDs. I suppose this is because they were using a GRCm38 GFF/GTF for htseq-count. Am I missing data by using mm10? There is a program that I’m hoping to use called ImmuCC, which uses htseq-count and Ensembl-labeled gene counts directly while also merging several genes together. Could I substitute either the mm10 GTF or featureCounts for the GRCm38 GTF and htseq-count? Thanks for your insight!
mm10 and GRCm38 refer to the same reference genome assembly. They can be labeled differently (for example, in chromosome names).
Reference annotation based on that genome assembly can also differ by who created the annotation – both in content and in chromosome labeling.
There are about 30k protein-coding genes for mouse. The reference annotation built in for featureCounts represents those genes only (based on Entrez). Other reference annotation sources may contain other genomic features, including transcripts associated with genes.
Look up a few of the Ensembl IDs to learn if they represent genes, transcripts, and/or other features. I’m guessing transcripts from the count, but you should confirm that.
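One quick way to check is the ID prefix itself: mouse Ensembl stable IDs start with ENSMUS followed by a feature letter (G for gene, T for transcript, P for protein, E for exon). Here is a minimal Python sketch of that check. The function name and the sample IDs are only for illustration, so swap in the first column of your own count table.

```python
import re
from collections import Counter

# Mouse Ensembl stable IDs: "ENSMUS" + feature letter + 11 digits,
# optionally followed by a version suffix such as ".4".
MOUSE_ENSEMBL_ID = re.compile(r"^ENSMUS([GTPE])\d{11}(\.\d+)?$")

FEATURE_BY_LETTER = {
    "G": "gene",
    "T": "transcript",
    "P": "protein",
    "E": "exon",
}


def classify_ensembl_id(identifier):
    """Return the feature type encoded in a mouse Ensembl ID, or 'other'."""
    match = MOUSE_ENSEMBL_ID.match(identifier.strip())
    if match is None:
        return "other"  # e.g. an Entrez ID, a gene symbol, or another species
    return FEATURE_BY_LETTER[match.group(1)]


# Illustrative IDs only: replace with the IDs from your count table.
sample_ids = ["ENSMUSG00000000001", "ENSMUST00000000001.4", "14679"]
summary = Counter(classify_ensembl_id(i) for i in sample_ids)
print(summary)
```

If nearly everything comes back as "gene", the table is already gene-level and the larger row count comes from the annotation content rather than from transcripts.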
Whether you want counts by transcript or gene depends on what your analysis goals are. Your description of the tool that “merges several genes together” might actually be merging transcripts into genes… but that is another guess.
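If it does turn out that you have transcript-level counts and want gene-level ones, the idea can be sketched in Python as below. This is only an illustration of the concept, not how any particular tool does it: it assumes a GTF with "transcript" rows carrying gene_id and transcript_id attributes, and the IDs (GENE_A, TX_A1, and so on) and helper names are made up. Simple summing is also a simplification compared with dedicated summarization tools.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_to_gene(gtf_lines):
    """Build a transcript_id -> gene_id map from GTF 'transcript' rows."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        gene = GENE_ID.search(fields[8])
        transcript = TRANSCRIPT_ID.search(fields[8])
        if gene and transcript:
            mapping[transcript.group(1)] = gene.group(1)
    return mapping


def sum_by_gene(transcript_counts, mapping):
    """Add up transcript counts per gene; unmapped transcripts are dropped."""
    gene_counts = defaultdict(int)
    for transcript, count in transcript_counts.items():
        gene = mapping.get(transcript)
        if gene is not None:
            gene_counts[gene] += count
    return dict(gene_counts)


# Hypothetical annotation rows and counts, for illustration only.
example_gtf = [
    '1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";',
    '1\tsrc\ttranscript\t150\t900\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A2";',
    '2\tsrc\ttranscript\t500\t800\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_B1";',
]
example_counts = {"TX_A1": 10, "TX_A2": 5, "TX_B1": 7}

gene_level = sum_by_gene(example_counts, transcript_to_gene(example_gtf))
print(gene_level)
```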
All inputs to analysis should be based on the same reference genome assembly AND build (matching chromosome identifiers). The assemblies are already the same. You’ll need to check if chromosome identifiers are a match.
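A rough way to check the chromosome identifiers is to compare the sequence names in the annotation (first GTF column) against the names used by the genome you mapped to. The Python sketch below assumes you already have both sets of names on hand. The helper names and the tiny example inputs are made up for illustration: UCSC-style mm10 uses names like chr1 and chrM, while Ensembl-style GRCm38 uses 1 and MT.

```python
def gtf_seqnames(gtf_lines):
    """Collect the distinct sequence names from the first GTF column."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(genome_names, annotation_names):
    """Report which annotation sequence names the genome does not know."""
    genome_names = set(genome_names)
    annotation_names = set(annotation_names)
    only_in_annotation = annotation_names - genome_names
    # Heuristic only: names that would match after adding a "chr" prefix.
    fixable = {n for n in only_in_annotation if "chr" + n in genome_names}
    return {
        "shared": genome_names & annotation_names,
        "only_in_annotation": only_in_annotation,
        "fixable_with_chr_prefix": fixable,
    }


# Illustrative inputs: UCSC-style genome names vs Ensembl-style annotation.
genome = {"chr1", "chr2", "chrM"}
annotation = gtf_seqnames([
    "#!genome-build GRCm38",
    "1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id \"GENE_A\";",
    "2\tsrc\tgene\t500\t800\t.\t-\t.\tgene_id \"GENE_B\";",
    "MT\tsrc\tgene\t10\t90\t.\t+\t.\tgene_id \"GENE_C\";",
])
report = compare_seqnames(genome, annotation)
print(report)
```

An empty "shared" set usually means the counting step will report zero or near-zero assigned reads. Note that the mitochondrial name (MT vs chrM) is not fixed by a simple prefix and needs its own handling.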
You can use built-in annotation with featureCounts. And can use other annotation with
See these tutorials for more help:
I also added a few tags to your post that point to prior Q&A that cover common annotation sources, ways to convert IDs, plus methods to address format and content.