MSTRG to Gene Name Conversion

Hello,

We are trying to analyze our files following this RNA Seq tutorial: https://galaxyproject.org/tutorials/nt_rnaseq/

Our DESeq2 outputs, however, give us MSTRG gene IDs, which are not very useful because the labels are only meaningful internally. While researching how to convert MSTRG IDs to gene names, we found that a better “reference file to guide assembly” should be supplied to StringTie so that gene names are output. However, the Galaxy StringTie job fails when we use the Homo_sapiens_NCBI_GrCh38.tar.gz reference file. We also tried decompressing it to .tar, and separately tried a DEXSeq annotation as the reference file, but StringTie failed in both cases. In addition, the option to use a built-in reference file is disabled. Currently, the only approach that successfully produces a DESeq2 output is running StringTie without a reference file, but we are unsure how to convert the resulting MSTRG gene IDs to gene names, or how to prevent MSTRG IDs in the first place.

Any help is greatly appreciated; thanks in advance!
-Ananya

Hi Ananya_P,

Were you able to resolve this issue? What worked for you? We are having the same issue currently.

Best,
Johnathon

Hi Johnathon.

My apologies for the late reply. Unfortunately, we have not been able to resolve this issue. Please let me know if your team has found any helpful solutions.

Thanks,
Ananya

Hello @Ananya_P @Johnathon_Anderson

Sorry the original question got missed.

This FAQ explains how/where to get a human reference annotation dataset that will work with these tools: https://galaxyproject.org/support/diff-expression/

I also added some tags to your post that link to other Q&A about this. In short, you need a GTF dataset that matches the UCSC version of the genome/build (if you are mapping against the built-in genome indexes). The correct format matters, or the tools will fail. The FAQ and the FAQs it links to explain this in full detail and list common sources.

GRCh38 is the same genome as UCSC’s hg38.
GRCh37 is the same genome as UCSC’s hg19.

But the “build” may differ between sources. Check your chromosome identifiers and make sure they match (for example, UCSC uses “chr1” where Ensembl uses “1”).
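
If it helps, here is a minimal Python sketch of that identifier check, for anyone who wants to run it outside Galaxy. It only assumes that the first tab-separated column of a GTF line is the chromosome name, which is standard for GTF. The function names and the toy data are made up for illustration; the genome names would come from wherever you can list them (for example, the headers of your FASTA file).

```python
def gtf_chromosomes(gtf_lines):
    """Collect the distinct sequence names from column 1 of GTF lines."""
    names = set()
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_chromosomes(gtf_names, genome_names):
    """Report which identifiers are shared and which appear on one side only."""
    gtf_names, genome_names = set(gtf_names), set(genome_names)
    return {
        "shared": sorted(gtf_names & genome_names),
        "only_in_gtf": sorted(gtf_names - genome_names),
        "only_in_genome": sorted(genome_names - gtf_names),
    }


# Toy data (hypothetical): an Ensembl-style GTF against UCSC-style genome names.
example_gtf = [
    "#!genome-build GRCh38",
    '1\tsource\tgene\t100\t200\t.\t+\t.\tgene_id "G1";',
    '2\tsource\tgene\t300\t400\t.\t-\t.\tgene_id "G2";',
]
example_genome = ["chr1", "chr2"]

report = compare_chromosomes(gtf_chromosomes(example_gtf), example_genome)
print(report)
```

An empty "shared" list, as in this toy case, is the kind of mismatch that makes the tools fail or ignore the annotation.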

Thanks!

I have been analyzing apple transcriptome data on the Galaxy platform. After completing all the steps, I end up with a heatmap labeled with MSTRG tags, even though the DESeq2 results file, after being processed through the ANNOTATE step, has MSTRG tags with corresponding gene IDs associated with them.
How can I get rid of the MSTRG tags?

Hi @susheel_raina

If you incorporated a reference annotation, but some transcripts/genes are still assigned the default identifiers applied by these tools, then those are features that are not present in the reference annotation. They may be truly novel, depending on how well that genome is currently annotated.

There is a lot of Q&A around these tools, and the GTN has example tutorials. Those tutorials involve model mammalian organisms with good annotation coverage, and only the known genes/transcripts are considered, so the outputs are all fully annotated. In real-life analysis, good annotation coverage is not always available, and novel features might be of interest.

I’m not sure I understand what this means. Do you mean that some transcripts/genes have annotation and some don’t? If so, that would be due to novel features in your data. If there is no known annotation for certain features, those won’t have a known attribute like a gene or transcript name.
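
For relabeling outside Galaxy, a rough Python sketch of the idea is below. It builds a lookup from MSTRG gene IDs to reference gene names using the transcript lines of an assembled or merged GTF, and leaves the MSTRG ID in place for novel features. The attribute names `ref_gene_name` and `gene_name` are an assumption about what your GTF contains, so please check a few lines of your own file first; the function names and example records are hypothetical.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the ninth GTF column into a {key: value} dictionary."""
    return dict(ATTR_RE.findall(field))


def mstrg_to_gene_names(gtf_lines):
    """Map each gene_id to the set of reference gene names seen on its transcripts."""
    mapping = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("gene_id")
        # Assumed attribute names; adjust to whatever your GTF actually uses.
        name = attrs.get("ref_gene_name") or attrs.get("gene_name")
        if gene_id and name:
            mapping.setdefault(gene_id, set()).add(name)
    return mapping


def label_for(gene_id, mapping):
    """Return the known name(s) for a gene ID, or the ID itself if it is novel."""
    names = mapping.get(gene_id)
    return "|".join(sorted(names)) if names else gene_id


# Hypothetical records: one locus overlapping a known gene, one novel locus.
example_gtf = [
    'chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; ref_gene_name "GENE_A";',
    'chr1\tStringTie\ttranscript\t2000\t2500\t.\t-\t.\t'
    'gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";',
]

mapping = mstrg_to_gene_names(example_gtf)
for gene_id in ["MSTRG.1", "MSTRG.2"]:
    print(gene_id, "->", label_for(gene_id, mapping))
```

A locus can overlap more than one reference gene, which is why the sketch keeps a set of names per MSTRG ID and joins them rather than picking one.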