I’m running a blastn search to identify best hits for a large batch of sequences. I would like to return the scientific name for the best hit in the tabular data. I’m using the following command:
Yes, you are correct! The taxonomy functions are not wrapped into the BLAST+ functions directly. Instead, you can cut out the hits and run one of these tools to pull in the extra data you are interested in, then join columns as wanted.
Fetch taxonomic representation (find this at UseGalaxy.eu for now)
NCBI Datasets Gene (download gene sequences and metadata)
Placing all steps into a simple workflow is what most do. Then, that workflow functions in practical use as a “single custom tool” when you execute it.
For how to do this exactly, I’m guessing you already know about our tabular data parsing tools, but in case not, or for anyone else reading later on: please try a search in the tool panel with common utility names, or see our tutorial here for a short tour.
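For anyone who wants to see the "join columns" step spelled out, here is a minimal Python sketch of the logic. It is not a Galaxy tool, and the function name `join_names`, the column position, and the sample rows are all made up for illustration. It assumes you already have a taxon ID to name lookup (for example, from the output of one of the tools above) and that a taxon ID field may hold several IDs separated by `;`.

```python
def join_names(blast_rows, taxid_to_name, taxid_col):
    """Append a scientific-name column to each tabular BLAST row.

    blast_rows    : list of rows, each a list of string fields
    taxid_to_name : dict mapping taxon ID (string) -> scientific name
    taxid_col     : 0-based index of the taxon ID column (an assumption;
                    adjust it to match your own output layout)
    """
    joined = []
    for row in blast_rows:
        # If several taxon IDs are listed, use the first one.
        taxid = row[taxid_col].split(";")[0].strip()
        # Rows without a matching name get "NA" so the column count stays fixed.
        joined.append(row + [taxid_to_name.get(taxid, "NA")])
    return joined


if __name__ == "__main__":
    # Illustrative rows only: query ID, subject ID, taxon ID.
    rows = [
        ["read_1", "subject_a", "562"],
        ["read_2", "subject_b", "9606;562"],
        ["read_3", "subject_c", "N/A"],
    ]
    names = {"562": "Escherichia coli", "9606": "Homo sapiens"}
    for row in join_names(rows, names, taxid_col=2):
        print("\t".join(row))
```

Inside Galaxy, the equivalent is a cut step followed by a join on the taxon ID column; the sketch only shows what that join is expected to produce.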
I’ve tried similar approaches with just the GI number (instead of GB number)…no success.
I’ve tried using the NCBI Datasets Gene tool instead…no success.
I have a list of taxon IDs (column 13 in the OP). I haven’t found a way to easily convert these to species/common names.
Background (in case it helps):
I have a large dataset of ~700K reads acquired from an ice-age bison bone using ONT sequencing. I’m trying to filter out bacterial contaminants and identify closest matches for whatever is left. I’m using a kraken2 filtering step to classify reads as bacterial or unclassified (non-bacterial). I’m then BLASTing the “unclassified” reads to identify closest matches. I can get the taxid from the BLAST search, but I need to find an automated way to get scientific names/common names for my hits (preferably through Galaxy so that I can build this all into a workflow). Any suggestions related to this last step (or others along the way) are welcome.
Update: I noticed the “Fetch Taxonomic Representation” tool you referenced in the edit. I tried this over at UseGalaxy.eu. It looks like exactly what I’m looking for…however, it’s still not working as expected.
These NCBI tools are a bit picky since they query NCBI remotely through its API. The 3rd field looks OK (unless any are empty or have NA values?) but I am wondering if the 1st has too many characters or if the dashes are leading to an unexpected column split somewhere.
So, confirm that the 3rd column has valid values, then I would suggest trying to simplify the query names in column 1. The first part of the IDs appears to be the same, so maybe just keep the end? Underscores should be OK but I would avoid other special characters.
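To make that concrete, here is a rough Python sketch of the two checks: shorten the query names and set aside rows whose 3rd column is not a plain numeric taxon ID. The function names, the choice to keep only the last dash-separated chunk, and the sample IDs are my own assumptions for illustration, not something the tool requires.

```python
import re


def simplify_id(query_id, keep_parts=1):
    """Keep the last dash-separated chunk(s) of an ID and replace any
    character other than letters, digits, or underscore with '_'."""
    tail = "_".join(query_id.split("-")[-keep_parts:])
    return re.sub(r"[^A-Za-z0-9_]", "_", tail)


def clean_rows(rows, id_col=0, taxid_col=2):
    """Split rows into (kept, rejected).

    A row is kept only if its taxon ID field is made up of digits; empty
    fields and values such as 'NA' are rejected. Kept rows get a
    simplified ID in id_col. Column indexes are 0-based.
    """
    kept, rejected = [], []
    for row in rows:
        taxid = row[taxid_col].strip() if len(row) > taxid_col else ""
        if not taxid.isdigit():
            rejected.append(row)
            continue
        new_row = list(row)
        new_row[id_col] = simplify_id(row[id_col])
        new_row[taxid_col] = taxid
        kept.append(new_row)
    return kept, rejected


if __name__ == "__main__":
    # Made-up IDs that share a common prefix, as described above.
    rows = [
        ["aaaa-bbbb-0001", "subject_a", "562"],
        ["aaaa-bbbb-0002", "subject_b", "NA"],
        ["aaaa-bbbb-0003", "subject_c", ""],
    ]
    kept, rejected = clean_rows(rows)
    print(kept)
    print(len(rejected), "rows set aside")
```

One caution: if the last chunk of the ID is not unique across reads, keep more of it (raise `keep_parts`), otherwise different reads would collapse onto the same name.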
In short, I think solving this is just fiddling with the formats a bit. Let us know if you can’t solve it and I’ll try to come up with an example that mimics yours for testing. Or, you can share yours back and I’ll experiment.