I can think of a few ways to do this.
For step 1, you input a taxID to get a fasta result, correct? That is your mapping baseline. You could extract the fasta identifiers for each taxID query and put that into a tabular file.
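If you go the first route, here is a minimal sketch of pulling the identifiers out of a fasta result and writing them next to the taxID you queried with. It assumes the identifier is the first whitespace-delimited token on each `>` line. The function names, the taxID `11111`, and the sequence IDs are made-up placeholders, not real records.

```python
import io


def fasta_ids_to_rows(fasta_handle, taxid):
    """Yield (taxid, sequence_id) for every header line in a FASTA stream."""
    for line in fasta_handle:
        if line.startswith(">"):
            # Assumption: the identifier is the first token after ">".
            fields = line[1:].split()
            if fields:
                yield taxid, fields[0]


def write_mapping(rows, out_handle):
    """Write two tab-separated columns: taxID, then sequence identifier."""
    for taxid, seq_id in rows:
        out_handle.write(f"{taxid}\t{seq_id}\n")


# Placeholder FASTA content standing in for one taxID query result.
example_fasta = io.StringIO(
    ">ACC_0001.1 example genome one\n"
    "ACGTACGTAC\n"
    ">ACC_0002.1 example genome two\n"
    "TTGACCATGA\n"
)

out = io.StringIO()
write_mapping(fasta_ids_to_rows(example_fasta, "11111"), out)
mapping_text = out.getvalue()
print(mapping_text, end="")
```

Running this once per taxID query and concatenating the outputs would give you a single two-column baseline table.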
Or, going forward, you could extract that same information (and supplemental data) in Galaxy using this tool.
- NCBI Datasets Genomes: download genome sequence, annotation and metadata (at ORG)
Once you have a tabular file that lists the genome accessions alongside the taxID and other metadata, you can convert the BAM to a tabular format and perform the join (Step 7 above).
I would run this in a workflow to batch process the work, and to better isolate the matches per taxID. You could always concatenate or group later, then cut out columns to apply summaries or custom graphs or reports.
The processing would be something like this:
- Put the taxIDs into a single column tabular file.
- Use Split file to dataset collection with a chunk size of 1 (one taxID per collection element). (at ORG)
- Process the collection folder through the NCBI Datasets Genomes tool
- Convert your BAM to a tabular format
- Join the NCBI Datasets output (step 3) with the tabular BAM (step 4), using the genome accession as the shared key. There are many join tools, so I’ll let you pick the one to use. We have examples in these tutorials → Data Manipulation Olympics. The first in the list uses the web tools that you can include in the workflow, or you can move into an interactive environment.
- Optional – summarize your results with counts before or after the join above
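To make the join and the optional count summary concrete, here is a rough sketch of the same logic outside Galaxy. It assumes the BAM has already been converted to SAM-style tab-separated text (for example with `samtools view`), where column 1 is the read name and column 3 is the reference sequence name, and that the mapping table has the taxID in column 1 and the accession in column 2. The function names, taxIDs, accessions, and reads are all made-up placeholders.

```python
from collections import Counter


def load_mapping(lines):
    """Build {accession: taxid} from tab-separated lines of taxid, accession."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        taxid, accession = line.split("\t")[:2]
        mapping[accession] = taxid
    return mapping


def join_alignments(sam_lines, mapping):
    """Yield (read_name, accession, taxid) for reads aligned to a known accession."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines if they are present
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        read_name, ref_name = fields[0], fields[2]
        if ref_name == "*":
            continue  # unmapped read, nothing to join on
        taxid = mapping.get(ref_name)
        if taxid is not None:  # inner join: keep matches only
            yield read_name, ref_name, taxid


# Placeholder inputs standing in for the two tabular files.
mapping_lines = ["11111\tACC_0001.1\n", "22222\tACC_0002.1\n"]
sam_lines = [
    "read1\t0\tACC_0001.1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n",
    "read2\t0\tACC_0002.1\t250\t60\t10M\t*\t0\t0\tTTGACCATGA\tFFFFFFFFFF\n",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCCAA\tFFFFFFFFFF\n",
    "read4\t16\tACC_0001.1\t400\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n",
]

mapping = load_mapping(mapping_lines)
joined = list(join_alignments(sam_lines, mapping))
counts = Counter(taxid for _, _, taxid in joined)

for taxid, n in sorted(counts.items()):
    print(f"{taxid}\t{n}")
```

This is an inner join, so reads aligned to an accession missing from the mapping table are dropped. If you want to see those too, keep them with a placeholder taxID instead of skipping them.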
If you are new to Galaxy and workflows, we have many examples! Everything is web based (or you can extract and edit the plain text for advanced use through the API). You can even try running a few samples (maybe 3 taxIDs) of the data through the target tools, tune the queries, extract what is working into a new workflow, then customize from there to create a reusable process that runs everything in a batch.
This is my favorite simple introduction to workflows
And full details for what is possible are here
Existing workflows with even more customization can be found here
- GTN Pan-Galactic Workflow Search – Vetted Workflows
Please let me know if I misunderstood anything or if you have any problems doing this! This was an interesting question and I hope I didn’t overload my reply. Let us know how this works out!