Using CHM13v2.0 T2T and mm39 gene annotations for RNASeq Analysis

jankat · November 6, 2025, 12:31pm

Hello everyone! I have been learning how to use Galaxy in the context of RNASeq analysis for just a few weeks now. I have been analyzing the sequencing data using the following workflow:

FASTQ files obtained from the sequencing company → Cutadapt → RNA STAR → Feature Counts → RUVSeq → DESeq2, which gives me the differentially expressed genes (up- or downregulated relative to the other group) between my 2 groups.

Both the RNA STAR and Feature Counts tools have hg38 and mm10 annotation data integrated into them, so using those has been okay thus far. I wanted to compare the analysis results if I ran the same pipeline using CHM13v2.0 and mm39. Both of them are integrated into RNA STAR as well, but not into the Feature Counts tool.

I know it is possible to upload your own annotation files in .gtf format, but when I tried to do just that using a few annotation files I found on several sites, I always got errors or 0 reads in the Feature Counts results. How should I move forward with this?

I am sorry if the question is a simple one. I could not figure it out after a few days of looking it up on the internet, and I have zero coding skills or previous bioinformatics knowledge.

Thanks a lot!

jennaj · November 6, 2025, 9:01pm

Welcome @jankat

For RNA-seq, the human hg38 genome is the “most current” genome assembly to use. The gene/transcript annotation for the T2T assembly is the hg38 annotation, with the data mapped over to the T2T coordinates. You will not need that, and the extra exposed regions are unlikely to help and can harm.

For the mouse, yes, mm39 is the more current version, and annotation is available for it.

How to use these

The annotation’s features are described by coordinates on the chromosome bases. Different assemblies have different bases! So try to avoid mixing up files between assembly versions. You will also want to use the same annotation throughout an analysis for all steps since the features are also labeled in specific ways.

If you later want to use a different assembly, any step you already did that uses that assembly’s bases or coordinate system will need to be rerun.

Same for annotation. If you want to use a different GTF, all steps that use any annotation source (whether supplied by you or built into a tool) will need to be rerun using it.
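A chromosome-name mismatch between the GTF and the genome the reads were mapped to is one common way to end up with 0 counts. A minimal Python sketch of that check follows, assuming you can list the chromosome names of your genome. The function names and the toy data are hypothetical, for illustration only, and are not part of any Galaxy tool.

```python
def gtf_chromosomes(gtf_lines):
    """Collect the chromosome names found in column 1 of a GTF."""
    names = set()
    for line in gtf_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_names(gtf_names, genome_names):
    """Return (shared, only_in_gtf) as sorted lists."""
    shared = sorted(gtf_names & genome_names)
    only_in_gtf = sorted(gtf_names - genome_names)
    return shared, only_in_gtf


# Hypothetical toy data: one line uses "chr1" (UCSC style), one uses "1".
example_gtf = [
    '#!genome-build example',
    'chr1\tsourceA\tgene\t100\t200\t.\t+\t.\tgene_id "G1";',
    '1\tsourceB\tgene\t300\t400\t.\t+\t.\tgene_id "G2";',
]
genome_names = {"chr1", "chr2", "chrX"}

shared, only_in_gtf = compare_names(gtf_chromosomes(example_gtf), genome_names)
print("shared:", shared)
print("only in GTF:", only_in_gtf)
```

If the "shared" list is empty, no feature can overlap any mapped read, so the annotation and the genome are probably from different sources or assemblies.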

Guides

Reference data assembly choices. → Reference genomes at public Galaxy servers: GRCh38/hg38 example
Getting the data organized for your target tools. → FAQ: Extended Help for Differential Expression Analysis Tools
With much more at this forum under the tags transcriptomics, reference-genome, and reference-annotation, plus search with tool names such as featurecounts

Sources

The genomes for hg38 and mm10/mm39 were all originally hosted by UCSC (in coordination with NCBI), and Gencode uses the same chromosome/coordinate scheme, so you could get the annotation from either place and these reference annotation GTFs will work with the reference assembly genomes natively indexed in Galaxy.

The example here is for hg38, but mm39 will work the same.

How to find the reference transcriptome for analysis tools: GRCh38/hg38 example

For the UCSC hg38 reference genome indexed in Galaxy, a reference annotation GTF and reference transcriptome fasta can be sourced from at least these two places:

Gencode

GENCODE - Human Release 49

get the first in the list of GTFs, and the first in the list of Fasta

double check the formatting. You might need to standardize the fasta with “NormalizeFasta” (I can’t remember if this is needed) and I would remove GTF headers too (some tools might have a problem with them). The FAQ above has instructions for these.
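To show what "removing GTF headers" means in practice, here is a minimal Python sketch. It assumes the header lines are the ones starting with "#", and the function name and toy lines are hypothetical. In Galaxy the same result is reached with the text manipulation tools described in the FAQ, so this is only to illustrate the idea.

```python
def strip_gtf_headers(lines):
    """Keep only the feature lines, dropping '#' header/comment lines."""
    return [line for line in lines if not line.startswith("#")]


# Hypothetical toy GTF content with two header lines.
toy_gtf = [
    "##description: example annotation",
    "##format: gtf",
    'chr1\tsourceA\tgene\t100\t200\t.\t+\t.\tgene_id "G1";',
    'chr1\tsourceA\texon\t100\t150\t.\t+\t.\tgene_id "G1";',
]

cleaned = strip_gtf_headers(toy_gtf)
print(len(toy_gtf), "lines in,", len(cleaned), "lines out")
```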

UCSC

These two are a match, and have standard human Gene Symbol and RefSeq transcript identifiers.

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/refMrna.fa.gz

The GTF will be ready to use after Upload, and the fasta will need to be uncompressed (under the pencil icon) then run through NormalizeFasta to strip out the extra characters.
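As a rough picture of what standardizing the fasta does, the Python sketch below cuts each title line at the first whitespace and rewraps the sequence to a fixed line width. This is not the NormalizeFasta implementation, only an illustration of the effect under those two assumptions. The function name, the line width, and the toy records are hypothetical.

```python
def simplify_fasta(lines, width=60):
    """Trim titles to the first word and rewrap sequences to `width` bases."""
    records = []
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Save the previous record before starting a new one.
            if name is not None:
                records.append((name, "".join(chunks)))
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))

    out = []
    for rec_name, sequence in records:
        out.append(">" + rec_name)
        for start in range(0, len(sequence), width):
            out.append(sequence[start:start + width])
    return out


# Hypothetical toy records with extra text after the identifier.
toy_fasta = [
    ">TX_0001 example description text",
    "ACGTAC",
    "GT",
    ">TX_0002 another description",
    "TTTT",
]

simplified = simplify_fasta(toy_fasta, width=4)
print("\n".join(simplified))
```

The point of the simpler titles is that the transcript identifiers in the fasta can then match the identifiers used in the GTF exactly.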

Technically, any of the reference annotation GTFs in their Downloads area are based on “Gene and Gene Predictions” tracks also represented in the Table browser (or main Browser). This means you can extract a reference transcriptome fasta from the Table browser.

This means for mouse at UCSC, this is where you’ll be looking:

And Gencode for mouse, will be here

Learning how to source and prepare your reference data is a very good skill to have. Your results will only be as good as the data they are built upon!

Please give this a try, and if you get stuck, please ask and we can pull out the exact URLs for the data. We’ll need to know which source, which assembly, which annotation track, and which database key you decide to use for each.

Let’s start there!

jankat · November 26, 2025, 2:50pm

Thanks so much for your help! I only recently got around to completing my analysis with mm39 and it works perfectly the way you described.

jennaj · December 3, 2025, 9:42pm

Great! I’m glad this all worked out!