Difference Between using mm10 vs GRCm38 GFF/GTF

macmade · July 19, 2021, 1:52am

I’ve always used the built-in mm10 reference in HISAT2 → featureCounts for RNA-Seq analysis, which returns about 27,000 genes labeled with Entrez IDs. However, when I work with public RNA-seq datasets from GEO, many of them use htseq-count, which returns >55,000 genes labeled with Ensembl IDs. I suppose this is because they used a GRCm38 GFF/GTF for htseq-count. Am I missing data by using mm10? I’m hoping to use a program called ImmunCC, which takes htseq-count output with Ensembl-labeled gene counts directly and also merges several genes together. Could I substitute the mm10 GTF or featureCounts for the GRCm38 GTF and htseq-count? Thanks for your insight!

jennaj · July 19, 2021, 7:06pm

Hi @macmade

Both mm10 and GRCm38 refer to the same reference genome assembly. The chromosome names can be labeled differently depending on the source (for example, UCSC-style "chr1" versus Ensembl-style "1").

Reference annotation based on that genome assembly can also differ by who created the annotation – both in content and in chromosome labeling.

There are about 30k protein-coding genes for mouse. The built-in reference annotation for featureCounts represents only those genes (based on Entrez). Other reference annotation sources may contain other genomic features, including transcripts associated with genes.
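One way to see what a larger annotation contains is to tally its gene records by type. The sketch below is only an illustration: it assumes an Ensembl-style GTF where each "gene" record carries a gene_biotype attribute, the function name count_gene_biotypes is made up here, and the three records are toy examples rather than real annotation.

```python
import re
from collections import Counter


def count_gene_biotypes(gtf_lines):
    """Count 'gene' records per gene_biotype in GTF-formatted lines.

    Assumes Ensembl-style attributes (gene_biotype "..."); records
    without that attribute are tallied as "unknown".
    """
    counts = Counter()
    for line in gtf_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = re.search(r'gene_biotype "([^"]+)"', fields[8])
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Toy records for illustration only (IDs and coordinates are made up).
toy_gtf = [
    "#!genome-build GRCm38",
    '1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "ENSMUSG_TOY1"; gene_biotype "protein_coding";',
    '1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "ENSMUSG_TOY1"; gene_biotype "protein_coding";',
    '2\ttoy\tgene\t500\t700\t.\t-\t.\tgene_id "ENSMUSG_TOY2"; gene_biotype "lncRNA";',
    '3\ttoy\tgene\t50\t80\t.\t+\t.\tgene_id "ENSMUSG_TOY3";',
]

biotype_counts = count_gene_biotypes(toy_gtf)
for biotype, n in sorted(biotype_counts.items()):
    print(biotype, n)
```

With a real file you would pass an open file handle instead of the toy list. A large share of non-protein-coding biotypes would go some way toward explaining a higher feature count.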

Check the Ensembl IDs to learn if they represent genes, transcripts, and/or other features. I’m guessing transcripts from the count, but you should confirm that.
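A quick way to run that check is to look at the ID prefixes: mouse Ensembl gene IDs start with ENSMUSG and transcript IDs with ENSMUST. Here is a minimal sketch, assuming the IDs have already been read from the first column of the count table. The function names and the sample IDs are illustrative only.

```python
from collections import Counter

# Mouse Ensembl stable ID prefixes and the feature type each one denotes.
PREFIXES = {
    "ENSMUSG": "gene",
    "ENSMUST": "transcript",
    "ENSMUSP": "protein",
}


def classify_id(identifier):
    """Return the feature type implied by an ID's prefix, else 'other'."""
    for prefix, kind in PREFIXES.items():
        if identifier.startswith(prefix):
            return kind
    return "other"


def summarize_ids(identifiers):
    """Tally feature types across a collection of IDs."""
    return Counter(classify_id(i) for i in identifiers)


# Illustrative IDs: a gene, a versioned transcript, an htseq-count
# summary row, and an Entrez-style numeric ID.
sample_ids = [
    "ENSMUSG00000000001",
    "ENSMUST00000000001.4",
    "__no_feature",
    "12345",
]

summary = summarize_ids(sample_ids)
print(dict(summary))
```

If nearly everything comes back as "gene", the table is counted per gene and the larger total comes from the annotation content rather than from transcript-level counting.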

Whether you want counts by transcript or by gene depends on your analysis goals. Your description of the tool that “merges several genes together” might actually be merging transcripts into genes… but that is another guess.

All inputs to analysis should be based on the same reference genome assembly AND build (matching chromosome identifiers). The assemblies are already the same. You’ll need to check if chromosome identifiers are a match.
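The identifier check can be sketched in a few lines. This is a rough illustration under assumptions: the annotation names come from the first column of the GTF, and the alignment names are whatever your BAM header lists (for example, copied from the @SQ lines). The helper names gtf_seqnames and compare_seqnames are made up here.

```python
def gtf_seqnames(gtf_lines):
    """Collect the distinct sequence names from column 1 of GTF lines."""
    names = set()
    for line in gtf_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(annotation_names, alignment_names):
    """Report shared and unmatched names, plus a 'chr' prefix hint."""
    ann, aln = set(annotation_names), set(alignment_names)
    shared = ann & aln
    # Would adding "chr" to one side make the two sides overlap?
    ann_prefixed = {"chr" + n for n in ann}
    aln_prefixed = {"chr" + n for n in aln}
    looks_like_prefix_issue = not shared and bool(
        (ann_prefixed & aln) or (aln_prefixed & ann)
    )
    return {
        "shared": shared,
        "only_in_annotation": ann - aln,
        "only_in_alignment": aln - ann,
        "chr_prefix_mismatch": looks_like_prefix_issue,
    }


# Toy inputs: Ensembl-style annotation vs UCSC-style alignment names.
toy_gtf = [
    "#!genome-build GRCm38",
    '1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "TOY1";',
    '2\ttoy\tgene\t500\t700\t.\t-\t.\tgene_id "TOY2";',
]
report = compare_seqnames(gtf_seqnames(toy_gtf), {"chr1", "chr2", "chrM"})
print(report["chr_prefix_mismatch"], sorted(report["shared"]))
```

An empty "shared" set is the usual reason a counting tool returns all zeros. The prefix hint only catches the simple "chr" case. Other differences, such as the mitochondrial name (chrM versus MT), still need a manual look.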

You can use the built-in annotation with featureCounts, and you can use other annotation with either featureCounts or htseq-count.

See these tutorials for more help:

I also added a few tags to your post that point to prior Q&A that cover common annotation sources, ways to convert IDs, plus methods to address format and content.

Thanks!
