DESeq2 Returning Nucleotides As Gene ID

Hi all.
After having used DESeq2, my results show up as nucleotides (per NCBI: Search NCBI databases - NLM) instead of genes.
Do you know what causes this and how I could find out the correct gene IDs?

Hi @Iisa

The URL you shared is a link to a search function with no content.

Please share more details about the inputs and result. The description so far seems very odd - so this will be best done by generating and posting a share link to the working history. You can post that back here publicly, or ask for a moderator to start up a direct message chat to post that in instead.

Let’s start there, thanks :slight_smile:

Hi @jennaj,

right, I just wanted to show which site I had used to look for the IDs.
I would appreciate a direct message chat with a moderator.

Thank you. :slight_smile:

Hi @Iisa

The content of the gene_id and transcript_id fields of the reference annotation GTF are what Featurecounts uses to generate the counts. Mapped hits at the “transcript_id” level (aka transcript coordinates) are summarized per “gene_id” in that output. Then, those gene_id level counts are interpreted by DESeq2 for differential expression.

For this data, the annotation was predicted using an automated annotation pipeline developed by NCBI. To review the general stats for the annotation, and the pipeline details, go to the assembly page here https://www.ncbi.nlm.nih.gov/assembly/GCF_021559855.1/#/qa and expand the section “Comment”

The annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Information about PGAP can be found here: NCBI Prokaryotic Genome Annotation Pipeline

Example line from the public GTF

NZ_CP058222.1 Protein Homology CDS 1 1401 . + 0 gene_id “HW370_RS00005”; transcript_id “unassigned_transcript_1”; gbkey “CDS”; gene “dnaA”; inference “COORDINATES: similar to AA sequence:RefSeq:NP_418157.1”; locus_tag “HW370_RS00005”; product “chromosomal replication initiator protein DnaA”; protein_id “WP_000059111.1”; transl_table “11”; exon_number “1”;

What those values represent

  • NZ_CP058222.1 == chromosome identifier, pairs with the genome fasta on the title line > value (everything before the first whitespace) and the third column in BAMs generated via mapping against the fasta. Those BAMs were not in the shared history but that’s OK – are the BAMs you input to Featurecounts (maybe in another history, or locally on your computer, or some other online tool)
  • gene “dnaA” == the gene the pipeline guessed as the closest valid match. This is what you could lookup, example: https://www.ncbi.nlm.nih.gov/gene/?term=dnaA
  • gene_id “HW370_RS00005” and locus_tag “HW370_RS00005” == both are the same value. This is a distinct name, presumably since more than one locus could be assigned to the same “gene” via the pipeline’s “inference” algorithm. Technically, “gene_id” must be unique in any single assembly’s annotation – most tools will error or produce weird results if there are any duplications.

You could transform the GTF to a tabular format with gffread then use Join two Datasets side by side on a specified field to map gene_id/locus_tag → gene in batch. Tutorial with these and more manipulations: Data Manipulation Olympics.

You can also search the tool panel with the keyword “gtf” to find more tools to manipulate or extract (or extract and simplify) any other attribute data you are interested in mapping the gene_id/locus_tag value to (the first column in the DESeq2 output table). Example: you want to map to the "inference “COORDINATES: similar to AA sequence:RefSeq:NP_418157.1"” you would need to strip all the extra content until just the NP value remains to use that key in a search to get the associated protein fasta.

Hope that helps!

That helped a lot. Thank you again, @jennaj . :slight_smile:

1 Like