DESeq2 Returning Nucleotides As Gene ID

Iisa · October 26, 2022, 1:03am

Hi all.
After having used DESeq2, my results show up as nucleotides (per NCBI: Search NCBI databases - NLM) instead of genes.
Do you know what causes this and how I could find out the correct gene IDs?

jennaj · October 26, 2022, 5:19pm

Hi @Iisa

The URL you shared is a link to a search function with no content.

Please share more details about the inputs and result. The description so far seems very odd - so this will be best done by generating and posting a share link to the working history. You can post that back here publicly, or ask for a moderator to start up a direct message chat to post that in instead.

Let’s start there, thanks

Iisa · October 26, 2022, 5:22pm

Hi @jennaj,

right, I just wanted to show which site I had used to look for the IDs.
I would appreciate a direct message chat with a moderator.

Thank you.

jennaj · October 26, 2022, 7:35pm

Hi @Iisa

The content of the gene_id and transcript_id fields of the reference annotation GTF are what Featurecounts uses to generate the counts. Mapped hits at the “transcript_id” level (aka transcript coordinates) are summarized per “gene_id” in that output. Then, those gene_id level counts are interpreted by DESeq2 for differential expression.

For this data, the annotation was predicted using an automated annotation pipeline developed by NCBI. To review the general stats for the annotation, and the pipeline details, go to the assembly page here https://www.ncbi.nlm.nih.gov/assembly/GCF_021559855.1/#/qa and expand the section “Comment”

The annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Information about PGAP can be found here: NCBI Prokaryotic Genome Annotation Pipeline

Example line from the public GTF

NZ_CP058222.1 Protein Homology CDS 1 1401 . + 0 gene_id “HW370_RS00005”; transcript_id “unassigned_transcript_1”; gbkey “CDS”; gene “dnaA”; inference “COORDINATES: similar to AA sequence:RefSeq:NP_418157.1”; locus_tag “HW370_RS00005”; product “chromosomal replication initiator protein DnaA”; protein_id “WP_000059111.1”; transl_table “11”; exon_number “1”;

What those values represent

NZ_CP058222.1 == chromosome identifier, pairs with the genome fasta on the title line > value (everything before the first whitespace) and the third column in BAMs generated via mapping against the fasta. Those BAMs were not in the shared history but that’s OK – are the BAMs you input to Featurecounts (maybe in another history, or locally on your computer, or some other online tool)
gene “dnaA” == the gene the pipeline guessed as the closest valid match. This is what you could lookup, example: https://www.ncbi.nlm.nih.gov/gene/?term=dnaA
gene_id “HW370_RS00005” and locus_tag “HW370_RS00005” == both are the same value. This is a distinct name, presumably since more than one locus could be assigned to the same “gene” via the pipeline’s “inference” algorithm. Technically, “gene_id” must be unique in any single assembly’s annotation – most tools will error or produce weird results if there are any duplications.

You could transform the GTF to a tabular format with gffread then use Join two Datasets side by side on a specified field to map gene_id/locus_tag → gene in batch. Tutorial with these and more manipulations: Data Manipulation Olympics.

You can also search the tool panel with the keyword “gtf” to find more tools to manipulate or extract (or extract and simplify) any other attribute data you are interested in mapping the gene_id/locus_tag value to (the first column in the DESeq2 output table). Example: you want to map to the "inference “COORDINATES: similar to AA sequence:RefSeq:NP_418157.1"” you would need to strip all the extra content until just the NP value remains to use that key in a search to get the associated protein fasta.

Hope that helps!

Iisa · October 26, 2022, 8:37pm

That helped a lot. Thank you again, @jennaj .

Topic		Replies	Views
Annotate DESeq2/DEXSeq output usegalaxy.eu support transcriptomics , deseq2 , stringtie	1	294	July 2, 2023
Annotate DESeq2/DEXSeq output tables usegalaxy.org support tool-help , deg_annotate	5	54	November 8, 2024
How can i extract gene name from custom GTF file? transcriptomics	1	301	March 12, 2024
Problem with the annotatemyIDs tool usegalaxy.eu support limma_voom	3	557	July 25, 2023
Differential Expression from Salmon Quant outputs without GTF/GFF3 annotation file usegalaxy.org support salmon	3	677	November 16, 2022

DESeq2 Returning Nucleotides As Gene ID

Related topics