I recently ran Kraken2 using GTDB v220 and noticed many inconsistencies in the NCBI taxids that Kraken2 is reporting. For example, Bacteria → taxid: 3 instead of 2, or Pseudomonadota → taxid: 175 instead of 1224. Has anyone else run into the same issue? Is it something to do with the database?
Yes, GTDB applies different taxon labels than NCBI. Many of these are placeholders used for unresolved classifications. Think of these as index values you can track across GTDB releases rather than identifiers that map one-to-one to NCBI.
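To make that concrete, here is a minimal Python sketch that reads a Kraken2 report and keys each row by taxon name and rank code instead of trusting the taxid as an NCBI identifier. It assumes the standard tab-separated report layout, where the last three columns are rank code, taxid, and an indented name. The sample rows are made up for illustration (only the taxids 3 and 175 come from the question above), and `parse_kraken_report` is a hypothetical helper name, not part of Kraken2.

```python
import csv
import io


def parse_kraken_report(handle):
    """Yield (rank_code, taxid, name) for each row of a Kraken2 report.

    Assumes the last three tab-separated columns are rank code, taxid,
    and name, which holds for the standard report layout.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 6:
            continue  # skip blank or malformed lines
        rank_code, taxid, name = row[-3], row[-2], row[-1]
        # Names are indented with spaces to show depth; strip them.
        yield rank_code.strip(), int(taxid), name.strip()


# Illustrative report rows (read counts are invented for this example).
sample_report = (
    "100.00\t1000\t0\tR\t1\troot\n"
    " 99.00\t990\t10\tD\t3\t  Bacteria\n"
    " 50.00\t500\t5\tP\t175\t    Pseudomonadota\n"
)

# Treat the taxid as a database-internal index and carry the name along.
taxid_to_name = {
    taxid: (rank_code, name)
    for rank_code, taxid, name in parse_kraken_report(io.StringIO(sample_report))
}

for taxid, (rank_code, name) in taxid_to_name.items():
    print(taxid, rank_code, name)
```

Joining downstream tables on the name plus rank, rather than on the taxid, avoids accidentally matching these index values against unrelated NCBI taxids.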
In Galaxy, you can access the mapping between the two identifier schemes with:
NCBI-GTDB map: Map taxonomic classifications between the NCBI and GTDB taxonomies
The tracking tool at GTDB will probably be really interesting for you! You can also try a query like this → GTDB - Taxon History (?from=R220&to=NCBI&query=p__Pseudomonadota)
I also see many questions like yours at their forum! This would be the best community for you to discuss scientific questions about classifications. They would also be able to investigate any truly odd situations. → https://forum.gtdb.ecogenomic.org/
I’m going to take this opportunity to explain a bit more about how all of this works together. It can be quite confusing! Where do you explore data? How do you interpret results? Where are the resources? Who can you ask if you need more context? If you are working in Galaxy, this forum is always a great place to start and we can point you to where to dig deeper.
Kraken indexes: The Big Picture
Below is how Galaxy sources Kraken indexes, and GTDB in particular.
Dataflow
The GTDB team creates and publishes the base reference data.
Then, the Langmead lab creates a compiled version of the GTDB releases (and others!) suitable for community use with Kraken/Kraken2, KrakenUniq, and Bracken.
The important part I wanted to explain is that our copy should be an exact mirror of the original source data. This means you can explore and discuss the content everywhere that scientists are using GTDB. I would try the GTDB forum if you notice a problem with classifications – I see many topics where this has been discussed in the past.
Let’s start there and I hope this helps to explain how it all links together!
Hello @jennaj! All this information is really helpful. I tried to use the tool available in Galaxy but it didn’t work precisely as I needed this time. So, I downloaded GTDB’s latest database (R226) to build a lookup table in R, then retrieved the full taxonomy based on my data. There were still some gaps, but far fewer than in the original output from Kraken2/Bracken. In any case, the explanation you provided helped a lot.
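For anyone wanting to try the same kind of lookup, here is a rough Python sketch of the idea (the original was done in R, so this is an illustration, not the exact script). It assumes a two-column, tab-separated GTDB taxonomy file with an accession followed by a semicolon-separated lineage using rank prefixes such as `d__` and `p__`. It also assumes the Kraken2 report names lack those prefixes, which may not hold for every index. The accessions are placeholders, and `build_lineage_lookup` and `find_lineage` are hypothetical helper names.

```python
import io

# Kraken2 report rank codes mapped to GTDB rank prefixes.
RANK_CODE_TO_PREFIX = {
    "D": "d__", "P": "p__", "C": "c__", "O": "o__",
    "F": "f__", "G": "g__", "S": "s__",
}


def build_lineage_lookup(handle):
    """Map every taxon in the file to its lineage from domain down to itself."""
    lookup = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        _accession, lineage = line.split("\t", 1)
        taxa = lineage.split(";")
        for depth, taxon in enumerate(taxa, start=1):
            # Keep the first lineage seen for each prefixed taxon name.
            lookup.setdefault(taxon, ";".join(taxa[:depth]))
    return lookup


def find_lineage(lookup, name, rank_code):
    """Return the lineage for a reported name, or None if it is a gap."""
    prefix = RANK_CODE_TO_PREFIX.get(rank_code)
    if prefix is None:
        return None
    key = name if name.startswith(prefix) else prefix + name
    return lookup.get(key)


# Illustrative taxonomy rows; the accessions are placeholders, not real records.
sample_taxonomy = (
    "ACCESSION_1\td__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli\n"
    "ACCESSION_2\td__Bacteria;p__Bacillota;c__Bacilli;"
    "o__Bacillales;f__Bacillaceae;g__Bacillus;s__Bacillus subtilis\n"
)

lookup = build_lineage_lookup(io.StringIO(sample_taxonomy))
print(find_lineage(lookup, "Pseudomonadota", "P"))
print(find_lineage(lookup, "Escherichia", "G"))
```

Names that return `None` are the remaining gaps, which are worth checking against the GTDB taxon history before assuming an error.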
Great, I’m glad this led to the right place and thanks for letting us know what worked! The classifications resolve release over release, so finding more connections in the newer release makes sense! We’ll be updating Galaxy soon too. → Update GTDB-Tk indexes to R10-RS226 · Issue #68 · galaxyproject/idc · GitHub