Troubleshooting: Kraken2, GTDB, NCBI, and taxa (taxid taxids)

Hello everyone!

I recently ran Kraken2 using GTDB v220 and noticed many inconsistencies in the NCBI taxids that Kraken2 is reporting. For example, Bacteria → taxid: 3 instead of 2, or Pseudomonadota → taxid: 175 instead of 1224. Has anyone else run into the same issue? Is it something to do with the database?

Welcome @yvserra

Yes, GTDB uses different taxon labels than NCBI. Many of these are a type of placeholder used for unresolved classifications. Think of these as index values you can track across GTDB releases, not as identifiers that map directly 1-to-1 onto NCBI taxids.
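To make that concrete, here is a minimal Python sketch of reading a Kraken2 report and keying the rows on taxon name and rank code instead of the taxid column. It assumes the standard six-column, tab-separated report layout (percent, clade reads, direct reads, rank code, taxid, name). The sample lines, their read counts, and the `read_report` helper are made up for illustration; only the taxids 3 and 175 come from the question above.

```python
import csv
import io

# Hypothetical two-line excerpt of a Kraken2 report from a GTDB index.
# Read counts and percentages are invented for illustration; the taxids
# (3 and 175) are the ones quoted in the question above.
SAMPLE_REPORT = (
    "95.00\t950\t10\tD\t3\tBacteria\n"
    "60.00\t600\t5\tP\t175\t  Pseudomonadota\n"
)


def read_report(handle):
    """Parse a standard six-column Kraken2 report into a list of dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        rows.append({
            "rank": fields[3].strip(),
            "index_taxid": int(fields[4]),  # index value, not an NCBI taxid
            "name": fields[5].strip(),      # names are indented by depth
        })
    return rows


rows = read_report(io.StringIO(SAMPLE_REPORT))

# Key on (rank code, name) instead of the taxid column.
by_name = {(row["rank"], row["name"]): row for row in rows}
print(by_name[("P", "Pseudomonadota")]["index_taxid"])
```

Keying on the name keeps downstream joins independent of whichever numbering a particular index build happens to use.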

In Galaxy, you can access the mapping between the two identifier schemes with:

  • NCBI-GTDB map: Map taxonomic classifications between the NCBI and GTDB taxonomies

The upstream tool could have been either:

  • GTDB-Tk Classify genomes
  • Kraken2 using the GTDB database

GTDB RS220 release

The tracking tool at GTDB will probably be really interesting for you! You can also try a query like this → GTDB - Taxon History (?from=R220&to=NCBI&query=p__Pseudomonadota)

I also see many questions like yours on their forum! This would be the best community for you to discuss scientific questions about classifications. They would also be able to investigate any truly odd situations. → https://forum.gtdb.ecogenomic.org/



I’m going to take this opportunity to explain a bit more about how all of this works together. It can be quite confusing! Where do you explore data? How do you interpret results? Where are the resources? Who can you ask if you need more context? If you are working in Galaxy, this forum is always a great place to start and we can point you to where to dig deeper.

Kraken indexes: The Big Picture

Below is how Galaxy sources Kraken indexes, and GTDB in particular.

Dataflow

  1. The GTDB team creates and publishes the base reference data.

    https://gtdb.ecogenomic.org/

  2. Then, the Langmead lab creates a compiled version of the GTDB releases (and others!) suitable for community use with Kraken/Kraken2, KrakenUniq, and Bracken.

    Index zone by BenLangmead

  3. The GalaxyProject sources the compiled indexes from the Langmead lab’s AWS repository. This is technical and not user-facing, but you can explore!

    data hosted at http://datacache.galaxyproject.org/
    group GitHub - galaxyproject/idc: Simon's Data Club - Reference data for Galaxy servers
    technical tutorial Hands-on: Reference Data with CVMFS / Reference Data with CVMFS / Galaxy Server administration

How to take a closer look

The important part I wanted to explain is that our copy should be an exact mirror of the original source data. This means you can explore and discuss the content everywhere that scientists are using GTDB. I would try the GTDB forum if you notice a problem with classifications – I see many topics where this has been discussed in the past.

Let’s start there and I hope this helps to explain how it all links together! :slight_smile:

Hello @jennaj! All this information is really helpful. I tried to use the tool available in Galaxy but it didn’t work precisely as I needed this time. So, I downloaded GTDB’s latest database (R226) to build a lookup table in R, then retrieved the full taxonomy based on my data. There were still some gaps, but far fewer than in the original output from Kraken2/Bracken. In any case, the explanation you provided helped a lot. :handshake:
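For anyone who wants to try the same approach, a lookup like this could be sketched roughly as follows. This is Python, not the R that was actually used, and it only shows the general idea. It assumes GTDB-style lineage strings with semicolon-separated ranks and `d__`/`p__`/.../`s__` prefixes. The two sample lineages are illustrative; in practice they would come from the taxonomy table downloaded from GTDB. The function names `build_lookup` and `full_taxonomy` are hypothetical.

```python
# Illustrative GTDB-style lineage strings. In practice these would be read
# from the taxonomy table downloaded from GTDB, not hard-coded.
SAMPLE_TAXONOMY = [
    "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;"
    "s__Escherichia coli",
    "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;"
    "s__Pseudomonas aeruginosa",
]


def build_lookup(lineages):
    """Map every prefixed taxon (e.g. 'g__Escherichia') to its lineage
    from the domain down to that taxon."""
    lookup = {}
    for lineage in lineages:
        taxa = lineage.strip().split(";")
        for depth, taxon in enumerate(taxa):
            # Keep the first lineage seen for each taxon.
            lookup.setdefault(taxon, ";".join(taxa[: depth + 1]))
    return lookup


def full_taxonomy(name, rank_prefix, lookup):
    """Return the lineage for a taxon name at a given rank, or None when
    the name is missing from the table (one of the remaining gaps)."""
    return lookup.get(f"{rank_prefix}__{name}")


lookup = build_lookup(SAMPLE_TAXONOMY)
print(full_taxonomy("Escherichia", "g", lookup))
print(full_taxonomy("Pseudomonadota", "p", lookup))
```

Names that return `None` are the ones still unmatched against the release that was loaded, so they are worth listing separately.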


Great, I’m glad this led to the right place, and thanks for letting us know what worked! The classifications resolve release over release, so finding more connections in the newer release makes sense! We’ll be updating Galaxy soon too. → Update GTDB-Tk indexes to R10-RS226 · Issue #68 · galaxyproject/idc · GitHub
