What can I do with genes that appear as NA but have statistics?

kizzy · September 27, 2021, 7:26pm

Hi, I’m doing an RNA-seq analysis and I’m having trouble identifying some of the genes in the DESeq2 output, because the genes are listed by Entrez ID. For context, the pipeline was fastp > HISAT2 > featureCounts > DESeq2. I used g:Profiler and annotateMyIDs to identify the genes from their Entrez IDs, but some still appear as NA in the list. I tried to identify those manually by searching for the Entrez ID in NCBI, taking the coordinates from there, and checking for a match in my samples with the UCSC and IGB browsers. In parallel, some of the genes appear in NCBI as updated and now have a different Entrez ID, and the gene with that new ID is already identified in my list. My question is: can I keep the gene that is already in the list and discard the obsolete one that was updated? And can I discard the ones that haven’t been updated and remain as NA?

jennaj · September 27, 2021, 7:49pm

Hi @kizzy

This prior Q&A is similar to your question: Should i change de EntrezID of my genes if it has changed in NCBI?

You could use UCSC’s GTF annotation instead (introduce it at the first step where you currently use the other annotation). Where to find it, plus some related tips: RNA STAR error on trimmomtaic files - #6 by jennaj

Whether to keep data that doesn’t have any known annotation yet (NA) is your choice. If the research goal is to discover novel features, then you would probably want to keep those. If you are not hunting for novel features, then you can restrict the analysis to known features upstream during mapping.
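If you want to look at the two groups side by side before deciding, here is a minimal sketch that splits a results table into annotated and NA rows. The column names (`GeneID`, `symbol`), the example values, and the function name are assumptions for illustration, not the actual output format of DESeq2 or annotateMyIDs.

```python
import csv
import io


def split_by_annotation(rows, symbol_field="symbol", missing=("", "NA")):
    """Split rows into (annotated, unannotated) based on the symbol column."""
    annotated, unannotated = [], []
    for row in rows:
        value = (row.get(symbol_field) or "").strip()
        if value in missing:
            unannotated.append(row)
        else:
            annotated.append(row)
    return annotated, unannotated


# Made-up example table; the second gene ID and all statistics are invented.
example = io.StringIO(
    "GeneID\tsymbol\tlog2FC\tpadj\n"
    "7157\tTP53\t1.8\t0.001\n"
    "100000001\tNA\t-2.1\t0.004\n"
)
rows = list(csv.DictReader(example, delimiter="\t"))
annotated, unannotated = split_by_annotation(rows)
print(len(annotated), "annotated;", len(unannotated), "unannotated")
```

Keeping the NA rows in a separate table, rather than deleting them, leaves the decision reversible.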

HISAT2 has an option to only report hits with known features: Advanced options >> Spliced alignment options

GTF file with known splice sites (input the UCSC annotation)
Transcriptome assembly reporting (set to: “Report only those alignments within known transcripts”)

Hope that helps!

kizzy · September 27, 2021, 10:14pm

Thank you @jennaj. So is it okay if I decide whether to keep or remove the data based on whether I can find a match in the browser? I didn’t use an external annotation; I used the built-in featureCounts option for the hg19 genome. But if I switch to this new annotation, my results are going to be affected, aren’t they?

jennaj · September 28, 2021, 12:27am

@kizzy

Yes, making direct changes to data post-analysis will probably impact your results. That could be a big change or a small change. Even if the only difference is a label (name), the analysis data and methods wouldn’t be reproducible unless you somehow track and document what you do. If this analysis is for publication, that can get a bit tricky since you will probably need to justify the “why” and/or prove that the change was small enough to not make a meaningful difference in any larger claims or conclusions.
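If you do end up editing the table after the analysis, one way to keep it traceable is to apply the changes with a script that writes a log entry for every row it touches. The sketch below is only an illustration: the gene IDs, the `replaced_by` mapping, and the function name are all made up, and whether dropping or relabeling a row is appropriate is still your call.

```python
from datetime import date


def reconcile_replaced_ids(results, replaced_by, today=None):
    """Relabel or drop rows whose gene ID was replaced, logging each change.

    results:     dict mapping gene ID -> statistics for that gene
    replaced_by: dict mapping obsolete gene ID -> current gene ID (hypothetical)
    """
    today = today or date.today().isoformat()
    kept, log = {}, []
    # First pass: keep every row whose ID is still current.
    for gene_id, stats in results.items():
        if gene_id not in replaced_by:
            kept[gene_id] = stats
    # Second pass: handle rows whose ID has been replaced.
    for gene_id, stats in results.items():
        if gene_id not in replaced_by:
            continue
        new_id = replaced_by[gene_id]
        if new_id in kept:
            action = "dropped"
            reason = f"replaced by {new_id}, which is already in the table"
        else:
            kept[new_id] = stats
            action = "relabeled"
            reason = f"replaced by {new_id}"
        log.append({"date": today, "gene_id": gene_id,
                    "action": action, "reason": reason})
    return kept, log


# Invented IDs and statistics, for illustration only.
results = {"111": {"padj": 0.01}, "222": {"padj": 0.20}, "333": {"padj": 0.03}}
replaced_by = {"222": "111", "333": "444"}
kept, log = reconcile_replaced_ids(results, replaced_by, today="2021-09-28")
for entry in log:
    print(entry)
```

The sketch does not follow chains of replacements (an ID replaced by another replaced ID). The log can be saved next to the results so the edits can be repeated and explained later.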

It would be best to use the annotation and the tool settings that will directly produce the final results. The same GTF can and should be used for all steps in the same analysis project in all research use cases that I can think of.

If you don’t want unknown features in your results (discovery is not a goal), then exclude all but known features at the mapping step. Selectively removing features in a way that is not applied to all of the data impacts results, and probably needs to be justified or at least explained.
If you want the most current gene/transcript features, use the annotation that represents those genes/transcripts. All the underlying data that define features also impact results (example: genomic coordinates). Coordinates are how features are attached to reads, so they really matter.
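To illustrate why coordinates matter, here is a toy sketch of assigning a read to features by interval overlap. It is a deliberate simplification (no chromosome, strand, or exon structure) and is not how featureCounts is actually implemented; the gene names and coordinates are invented.

```python
def overlapping_features(read_start, read_end, features):
    """Return names of features whose 1-based inclusive interval overlaps the read."""
    return [
        name
        for name, (start, end) in features.items()
        if read_start <= end and start <= read_end
    ]


# The same read, counted against two hypothetical versions of an annotation.
old_annotation = {"geneA": (1000, 2000)}
new_annotation = {"geneA": (1000, 1500), "geneB": (1600, 2400)}
read = (1700, 1800)

print(overlapping_features(*read, old_annotation))  # ['geneA']
print(overlapping_features(*read, new_annotation))  # ['geneB']
```

The read itself has not changed, but the gene it is counted toward has, which is why relabeling rows by hand after counting is not equivalent to rerunning with the newer annotation.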

The RefSeq Genes track at UCSC updates daily with additions, and any deletions or other changes are reconciled at RefSeq full releases (the source is NCBI, using their release schedule). If you know the source and the date that you obtained the annotation, that is the data version, and anyone else would be able to obtain that same data later on.

The built-in cached annotation from Subread is not updated daily. Its version is noted at the bottom of the featureCounts tool form, in the “Requirements” section. Right now, usegalaxy.* servers are using version “subread (Version 2.0.1)” dated 2020-05-13 and sourced from Subread - Browse Files at SourceForge.net.
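One way to capture "source plus date" for an annotation file you downloaded yourself is to store a small record with a checksum next to the results. This is a sketch under assumptions: the function name and record fields are my own, and the demo file is a throwaway stand-in for a real GTF.

```python
import hashlib
import json
import os
import tempfile


def annotation_record(path, source, retrieved):
    """Describe an annotation file by name, source, retrieval date and SHA-256."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large GTF files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            sha.update(chunk)
    return {
        "file": os.path.basename(path),
        "source": source,
        "retrieved": retrieved,
        "sha256": sha.hexdigest(),
    }


# Demo on a temporary one-line file; replace with the real annotation path.
with tempfile.NamedTemporaryFile(suffix=".gtf", delete=False) as tmp:
    tmp.write(b'chr1\texample\tgene\t1000\t2000\t.\t+\t.\tgene_id "geneA";\n')
    demo_path = tmp.name

record = annotation_record(
    demo_path,
    source="hypothetical download location",
    retrieved="2021-09-27",
)
os.remove(demo_path)
print(json.dumps(record, indent=2))
```

The checksum lets someone else confirm later that they are working with the same file, even if the source has been updated since.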

This is your decision, but reproducibility is usually very important.

kizzy · September 28, 2021, 12:41am

@jennaj thank you so much for your help
