Using CHM13v2.0 T2T and mm39 gene annotations for RNASeq Analysis

Hello everyone! I have been learning how to use Galaxy for RNA-seq analysis for just a few weeks now. I have been analyzing my sequencing data with the following workflow:

FASTQ files obtained from the sequencing company → Cutadapt → RNA STAR → Feature Counts → RUVSeq → DESeq2, which gives me the differentially expressed genes (up- or downregulated relative to the other group) between my 2 groups.

Both the RNA STAR and Feature Counts tools have hg38 and mm10 annotation data integrated into them, so using those has been okay thus far. I wanted to see how the results differ if I run the same pipeline using CHM13v2.0 and mm39. Both of them are integrated into RNA STAR as well, but not into the Feature Counts tool.

I know it is possible to upload your own annotation files in .gtf format, but when I tried that with a few annotation files I found on several sites, I always got errors or 0 reads in the Feature Counts results. How should I move forward with this?

I am sorry if the question is a simple one. I could not figure it out after a few days of searching online, and I have zero coding skills or previous bioinformatics knowledge.

Thanks a lot!

Welcome @jankat

For RNA-seq, the human hg38 genome is the “most current” assembly to use. The gene/transcript annotation for the T2T assembly is the hg38 annotation, with the data mapped over to the T2T coordinates. You will not need that, and the extra exposed regions are unlikely to help and can harm.

For the mouse, yes, mm39 is the more current version, and annotation is available for it.

How to use these

The annotation’s features are described by coordinates on the chromosome bases. Different assemblies have different bases! So try to avoid mixing up files between assembly versions. You will also want to use the same annotation throughout an analysis for all steps since the features are also labeled in specific ways.
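To make the "different assemblies have different bases" point concrete, here is a minimal Python sketch that compares the chromosome names in a GTF against the names in a reference genome. A mismatch (for example `1` versus `chr1`) is one common reason for counting steps to report 0 reads, though it may not be the cause in your case. The function names and the toy data are made up for illustration; in practice the genome names would come from your own reference.

```python
import io


def gtf_seqnames(handle):
    """Collect the distinct sequence names (column 1) from a GTF stream."""
    names = set()
    for line in handle:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_names(gtf_names, genome_names):
    """Return (names found in both, names found only in the GTF)."""
    return gtf_names & genome_names, gtf_names - genome_names


# Toy data (hypothetical): one row uses "1", the other uses "chr2".
toy_gtf = io.StringIO(
    "# toy header\n"
    '1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1";\n'
    'chr2\ttoy\texon\t300\t400\t.\t-\t.\tgene_id "g2";\n'
)
genome = {"chr1", "chr2"}  # assumed names from the reference genome

shared, only_gtf = compare_names(gtf_seqnames(toy_gtf), genome)
print("shared:", sorted(shared))
print("only in GTF:", sorted(only_gtf))
```

With the toy data, `chr2` is shared and `1` appears only in the GTF, which is the kind of mismatch to look for. Matching names are necessary but not sufficient: the coordinates still have to come from the same assembly version.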

If you later want to use a different assembly, any step you already did that uses that assembly’s bases or coordinate system will need to be rerun.

The same goes for annotation. If you want to use a different GTF, all steps that use any annotation source (whether supplied by you or built into a tool) will need to be rerun using it.
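Before rerunning steps with a new GTF, it can help to look at what the file actually contains. The sketch below is only an illustration under stated assumptions: it assumes the counting step groups `exon` rows by a `gene_id` attribute (the usual featureCounts defaults, but check the settings you actually use), and the function name and toy data are hypothetical.

```python
import io
import re
from collections import Counter

# Matches the gene_id attribute in GTF column 9, e.g. gene_id "g1";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def summarize_gtf(handle):
    """Count feature types and collect gene_ids that have exon rows."""
    feature_types = Counter()
    exon_genes = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            # A GTF row should have exactly 9 tab-separated columns.
            feature_types["<malformed>"] += 1
            continue
        feature_types[fields[2]] += 1
        match = GENE_ID.search(fields[8])
        if fields[2] == "exon" and match:
            exon_genes.add(match.group(1))
    return feature_types, exon_genes


# Toy data (hypothetical); the last exon row has no gene_id attribute.
toy_gtf = io.StringIO(
    'chr1\ttoy\tgene\t100\t500\t.\t+\t.\tgene_id "g1";\n'
    'chr1\ttoy\ttranscript\t100\t500\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    'chr1\ttoy\texon\t300\t500\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    'chr1\ttoy\texon\t900\t950\t.\t+\t.\ttranscript_id "t2";\n'
)

feature_types, exon_genes = summarize_gtf(toy_gtf)
print(dict(feature_types))
print(sorted(exon_genes))
```

If a file has no `exon` rows, or its exon rows carry no `gene_id`, a counting step configured this way would have nothing to assign reads to.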

Guides

Sources

The genomes for hg38 and mm10/mm39 were all originally hosted by UCSC (in coordination with NCBI), and Gencode uses the same chromosome/coordinate scheme. You can get the annotation from either place, and these reference annotation GTFs will work with the reference assembly genomes natively indexed in Galaxy.

The example here is for hg38, but mm39 will work the same.

This means for mouse at UCSC, this is where you’ll be looking:

And Gencode for mouse will be here:



Learning how to source and prepare your reference data is a very good skill to have. Your results will only be as good as the data they are built upon!

Please give this a try, and if you get stuck, please ask and we can pull out the exact URLs for the data. We’ll need to know which source, which assembly, which annotation track, and which database key you decide to use for each.

Let’s start there! :slight_smile: