I have a problem when I use the annotateMyIDs tool: I noticed misannotation in a few genes.
In detail: in the original file with all of my counts, the Ensembl annotation is correct; each row has a unique Ensembl ID. After I run the annotateMyIDs tool (to use the output in limma-voom),
some of these genes with unique IDs end up with duplicate Ensembl IDs.
In some cases, different isoforms of a gene are collapsed into one Ensembl ID, so the gene symbol is correct but the Ensembl IDs are wrong; in other cases, even different gene symbols
acquire the same Ensembl ID.
Any idea how this happened? I have the correct organism (mouse), and the tool version is: annotateMyIDs annotate a generic set of identifiers (Galaxy Version 3.16.0+galaxy1).
I even tried a previous version of the tool, but I still got the same result.
To see if there is something wrong with my data, I also ran the Annotate DESeq2/DEXSeq output tables tool (on the DESeq2 output), and everything is normal. Each gene has a unique ID and gene symbol.
Different annotation sources will create slightly different gene/transcript “footprints” on the reference genome. This can create one-to-many, many-to-one, and many-to-many situations, resulting in non-unique IDs in files after converting IDs between sources. Everyone would get this result; it is not a problem with your data or with the tool.
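To make the many-to-one case concrete, here is a minimal Python sketch. The identifiers and the mapping table are invented for illustration and are not taken from any real annotation. It converts IDs through a lookup table and shows that IDs that were unique before conversion can be duplicated afterwards:

```python
from collections import Counter

# Hypothetical mapping between two annotation sources (invented IDs).
# Two distinct source IDs point at the same target ID (many-to-one).
source_to_target = {
    "SRC_0001": "TGT_A",
    "SRC_0002": "TGT_A",  # different source feature, same target ID
    "SRC_0003": "TGT_B",
}

def convert_ids(source_ids, mapping):
    """Return the converted ID for each source ID (None when unmapped)."""
    return [mapping.get(sid) for sid in source_ids]

def duplicated(ids):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(i for i in ids if i is not None)
    return sorted(i for i, n in counts.items() if n > 1)

source_ids = ["SRC_0001", "SRC_0002", "SRC_0003"]
converted = convert_ids(source_ids, source_to_target)

print(duplicated(source_ids))  # no duplicates before conversion
print(duplicated(converted))   # the shared target ID shows up here
```

The input has three unique IDs, yet the converted list repeats one of them, which is the pattern described in the question.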
Maybe try running the analysis with annotation from Ensembl instead? That would be the GTF incorporated during counting. UCSC has this for some genomes ready to use, but other sources could be adjusted to work with these tools. Let us know if you need help with that.
If I misunderstood (and I probably did), please share more details, including why you are converting IDs and your broader technical goals. Applying different versions of the Ensembl annotation could present the same one-to-many/many-to-one problems too.
My goal was to use the limma tool to check for differentially expressed genes between my different treatment groups (based on the tutorial: 2: RNA-seq counts to genes).
I can upload an annotation file for the genes to get results that I can analyse. So I used the annotateMyIDs tool to get a file with the Ensembl ID (ENSMUSG000000XXXXX) and the corresponding gene name (e.g. Gapdh).
In this specific tool, I can’t upload any external GTF files (or at least I couldn’t find a way).
My problem is that limma gave me errors about duplicated row names, and I believed the output annotation file was the problem.
However, based on your answer, having these duplicate Ensembl IDs, even with different gene names, is expected, so this should not create any problem with a downstream tool like limma.
I am not sure if I can do anything else, or if there is any other tool that can give me the same annotation information to use in the limma tool.
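One way to confirm which rows trigger the duplicated row names error is to list the repeated IDs in the annotation file. Below is a minimal sketch, assuming a tab-separated file with a header line and the ID in the first column. The column layout, the function name `find_duplicate_ids`, and the example rows are all assumptions made for illustration:

```python
import csv
import io
from collections import defaultdict

def find_duplicate_ids(handle, id_column=0, delimiter="\t"):
    """Map each repeated ID to the list of rows that carry it."""
    rows_by_id = defaultdict(list)
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header line
    for row in reader:
        if row:  # ignore blank lines
            rows_by_id[row[id_column]].append(row)
    # Keep only the IDs that appear on more than one row.
    return {k: v for k, v in rows_by_id.items() if len(v) > 1}

# Made-up example table; a real file would be opened with
# open(path, newline="") and passed in the same way.
example = io.StringIO(
    "ID\tSYMBOL\n"
    "ID_1\tGeneA\n"
    "ID_1\tGeneB\n"
    "ID_2\tGeneC\n"
)
duplicates = find_duplicate_ids(example)
for gene_id, rows in duplicates.items():
    print(gene_id, [r[1] for r in rows])
```

Any ID reported here appears on more than one row of the annotation table, so those are the rows to inspect first.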
This is one reason why it is best to use the same exact annotation throughout a single analysis. If you started over and did the counting with an Ensembl GTF, then you wouldn’t need to do any gene transformations and these errors would go away.
I know that isn’t a nice answer, since it means rerunning prior work, but it is really the only way forward unless you want to hand-edit the files to curate which genes are assigned to features (not recommended!).
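If the counting is redone with an Ensembl GTF, the ID-to-name table can in principle be read from that same GTF, so no conversion between sources is involved. The sketch below is one way to do that in Python, assuming the GTF has `gene` feature lines whose attribute column carries `gene_id` and `gene_name`. The function name `gene_names_from_gtf` and the example lines are hypothetical, and a GTF from another source may be laid out differently:

```python
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def gene_names_from_gtf(lines):
    """Return {gene_id: gene_name} from GTF lines whose feature type is 'gene'."""
    names = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # skip malformed lines and non-gene features
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs:
            # Fall back to an empty name when gene_name is absent.
            names[attrs["gene_id"]] = attrs.get("gene_name", "")
    return names

# Invented example lines (placeholder IDs, not real genes).
example_gtf = [
    "#!example header\n",
    'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_0001"; gene_name "GeneA";\n',
    'chr1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE_0001"; transcript_id "TX_1";\n',
    'chr1\tsrc\tgene\t2000\t3000\t.\t-\t.\tgene_id "GENE_0002";\n',
]
id_to_name = gene_names_from_gtf(example_gtf)
print(id_to_name)
```

Because each gene line contributes exactly one key, the resulting table has one row per gene ID by construction.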