Hi, I’m doing an RNA-seq analysis and I’m having trouble identifying some of the genes in the DESeq2 output, because the gene list is in Entrez IDs. For context, the pipeline was fastp > HISAT2 > featureCounts > DESeq2. I used g:Profiler and annotateMyIDs to identify the genes from their Entrez IDs, but some still appear as NA. I tried to identify those manually by searching the Entrez ID in NCBI, taking the coordinates from there, and checking for a match in my samples with the UCSC and IGB browsers. Along the way I found that some of these genes are marked as updated in NCBI and now have a different Entrez ID, and that new ID belongs to a gene that is already identified in my list. My question is: can I keep the gene that is already in the list and discard the obsolete entry that was updated? Also, can I discard the ones that haven’t been updated and remain as NA?
This prior Q&A is similar to your question: Should i change de EntrezID of my genes if it has changed in NCBI?
You could use UCSC’s GTF annotation instead (supply it at the first step where the current annotation is used). Where to find it, plus some related tips: RNA STAR error on trimmomtaic files - #6 by jennaj
Whether to keep data that doesn’t have any known annotation yet (NA) is your choice. If the research goal is to discover novel features, then you would probably want to keep those. If you are not hunting for novel features, then you can restrict the analysis to known features upstream during mapping.
HISAT2 has an option to only report hits with known features: Advanced options >> Spliced alignment options
- GTF file with known splice sites (input the UCSC annotation)
- Transcriptome assembly reporting (set to: “Report only those alignments within known transcripts”)
Hope that helps!
Thank you @jennaj. So is it okay if I decide to keep or remove the data based on whether I can find a match in the visualizer? I ask because I didn’t use an external annotation; I used the built-in featureCounts option for the hg19 genome. If I switch to this new annotation, my results are going to be affected, aren’t they?
Yes, making direct changes to data post-analysis will probably impact your results. That could be a big change or a small change. Even if the only difference is a label (name), the analysis data and methods wouldn’t be reproducible unless you somehow track and document what you do. If this analysis is for publication, that can get a bit tricky since you will probably need to justify the “why” and/or prove that the change was small enough to not make a meaningful difference in any larger claims or conclusions.
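If you do end up reconciling IDs after the analysis, one way to keep it trackable is to produce an audit table instead of editing the gene list by hand. The following is only a rough sketch: it assumes a tab-separated table laid out like NCBI's `gene_history` file (the column names are my understanding of that file, so check them against the real download), and the IDs and the `load_history` / `audit_ids` helpers are made up for illustration.

```python
import csv
import io

# Hypothetical excerpt in the layout of NCBI's gene_history file
# (tab-separated; "-" in GeneID means no replacement was assigned).
# All IDs and symbols below are invented for illustration.
GENE_HISTORY = """#tax_id\tGeneID\tDiscontinued_GeneID\tDiscontinued_Symbol\tDiscontinue_Date
9606\t1001\t9001\tOLDSYM1\t20200101
9606\t-\t9002\tOLDSYM2\t20210101
"""


def load_history(text):
    """Map discontinued Entrez ID -> replacement ID (None if there is none)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    history = {}
    for row in reader:
        new_id = row["GeneID"]
        history[row["Discontinued_GeneID"]] = None if new_id == "-" else new_id
    return history


def audit_ids(result_ids, history):
    """Return one audit record per ID; nothing is deleted or renamed here."""
    present = set(result_ids)
    records = []
    for gene_id in result_ids:
        replacement = history.get(gene_id)
        if gene_id not in history:
            status = "current"
        elif replacement is None:
            status = "discontinued_no_replacement"
        elif replacement in present:
            # The replacement gene is already a separate row in the results.
            status = "replaced_duplicate"
        else:
            status = "replaced"
        records.append(
            {"entrez_id": gene_id, "status": status, "replacement": replacement}
        )
    return records


history = load_history(GENE_HISTORY)
for record in audit_ids(["1001", "9001", "9002", "5555"], history):
    print(record)
```

The point of the design is that the table only labels each ID. Any decision to drop or merge rows stays a separate, documented step that you can describe in the methods.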
It would be best to use the annotation and the tool settings that will directly produce the final results. The same GTF can and should be used for all steps in the same analysis project in all research use cases that I can think of.
- If you don’t want unknown features in your results (discovery is not a goal), then exclude all but known features at the mapping step. Selectively removing features in a way that is not applied to all of the data impacts results, and probably needs to be justified or at least explained.
- If you want the most current gene/transcript features, use the annotation that represents those genes/transcripts. All the underlying data that define features also impact results (example: genomic coordinates). Coordinates are how features are attached to reads, so they really matter.
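To make the coordinate point concrete, here is a toy sketch of overlap-based assignment. It is a deliberate simplification and not how featureCounts is implemented (it ignores strand, exon structure, and multi-overlap rules), and the gene, coordinates, and `assign_read` helper are all hypothetical.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    gene_id: str
    chrom: str
    start: int  # 1-based, inclusive (GTF convention)
    end: int    # 1-based, inclusive


def assign_read(chrom: str, read_start: int, read_end: int,
                features: List[Feature]) -> List[str]:
    """Return gene IDs whose interval overlaps the read by at least one base."""
    return [
        f.gene_id
        for f in features
        if f.chrom == chrom and read_start <= f.end and f.start <= read_end
    ]


# Two hypothetical annotation versions for the same made-up gene:
# the newer one ends 200 bases earlier.
old_annotation = [Feature("geneA", "chr1", 1000, 2000)]
new_annotation = [Feature("geneA", "chr1", 1000, 1800)]

read = ("chr1", 1900, 1975)
print(assign_read(*read, old_annotation))  # ['geneA']
print(assign_read(*read, new_annotation))  # []
```

The same read is counted for the gene under one annotation and left unassigned under the other, which is why mixing annotations between steps changes counts and not only labels.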
The RefSeq Genes track updates daily with additions at UCSC, and any deletions or other changes are reconciled at RefSeq full releases (the source is NCBI, using their release schedule). If you know the source and the date on which you obtained the annotation, that is the data version, and anyone else would be able to obtain that same data later on.
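One simple way to capture that version is to save a small provenance record next to the GTF. This is just a sketch of the idea: the `annotation_record` and `to_json` helpers and the field names are my own invention, not a Galaxy or UCSC convention, and the checksum is an optional extra for confirming that two files are identical.

```python
import hashlib
import json
import os
from datetime import date


def annotation_record(path, source, retrieved):
    """Describe an annotation file by source, retrieval date, and checksum."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large GTF files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            sha.update(chunk)
    return {
        "file": os.path.basename(path),
        "source": source,
        "retrieved": retrieved.isoformat(),
        "sha256": sha.hexdigest(),
    }


def to_json(record):
    """Serialize the record in a stable, human-readable form."""
    return json.dumps(record, indent=2, sort_keys=True)
```

For example, `to_json(annotation_record("my_annotation.gtf", "UCSC Table Browser, hg19", date.today()))` would give text you could paste into your methods notes (the file name and source label here are placeholders).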
The built-in cached annotation from Subread is not updated daily, and the version is noted at the bottom of the featureCounts tool form in the “Requirements” section. Right now, usegalaxy.* servers are using version “subread (Version 2.0.1)” dated 2020-05-13 and sourced from Subread - Browse Files at SourceForge.net.
This is your decision but reproducibility is usually very important.
@jennaj thank you so much for your help