Assign Taxonomy to Bowtie2 Alignments to Custom Reference Database

I am looking to assign non-microbial taxonomy (Nematoda, Platyhelminthes, & Eutheria) to reads aligned to custom reference databases using Bowtie2. What is the best way to go about this? Is there a way to link the aligned reads to the taxonomic annotations of the reference sequences using tools available in Galaxy? I have not found a tutorial or workflow that accomplishes taxonomic assignment using Bowtie2 alignments.

Steps Completed:

  1. I used the NCBI Datasets command-line tools (CLI) to download reference genomes by Tax ID. For example: datasets download genome taxon 6231 --reference --filename nematoda_dataset.zip
  2. I then concatenated the individual .fna files into a single .fna file and gzipped the file before uploading into Galaxy.
  3. I used galaxy-upload to import my custom reference database to Galaxy and normalized it prior to use.
  4. I have successfully used Bowtie2 to align my sequences to the custom reference databases I uploaded and normalized.
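For reference, steps 1–2 looked roughly like this on the command line (the directory layout and filenames below are simulated placeholders so the snippet is self-contained; the real files come out of the `datasets` zip):

```shell
# After `datasets download genome taxon 6231 --reference --filename nematoda_dataset.zip`
# and unzipping, each genome sits in its own folder. Simulate that layout here
# with two tiny placeholder .fna files:
mkdir -p ncbi_dataset/data/GCF_A ncbi_dataset/data/GCF_B
printf '>seqA\nACGT\n' > ncbi_dataset/data/GCF_A/GCF_A.fna
printf '>seqB\nTTGG\n' > ncbi_dataset/data/GCF_B/GCF_B.fna

# Concatenate every .fna into a single reference and gzip it for upload
find ncbi_dataset -name '*.fna' -exec cat {} + > nematoda_combined.fna
gzip -f nematoda_combined.fna   # produces nematoda_combined.fna.gz
```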

Possible next steps:
  5. Extract the RNAME (reference sequence name) for each aligned read.
  6. Map reference IDs to taxonomy: create a mapping file, or use the JSON summary downloaded from NCBI with: datasets summary genome taxon 6231 --reference > nemameta.json
  7. Match each RNAME in the SAM/BAM file with its corresponding Taxonomy ID using the mapping file.
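As I understand it, steps 5–7 boil down to a column extraction plus a join. A minimal command-line sketch of the idea (the SAM lines and the accession-to-taxID mapping below are made-up placeholders; in practice the tabular alignments would come from `samtools view` or a Galaxy converter, and the mapping from the NCBI metadata):

```shell
# Fake a two-read SAM body (tab-separated; column 3 is RNAME) and a
# hypothetical accession -> taxID mapping file, for illustration only
printf 'read1\t0\tNC_003283.11\t100\t42\t50M\t*\t0\t0\tACGT\tFFFF\n' >  aligned.sam
printf 'read2\t0\tNW_019110.1\t200\t42\t50M\t*\t0\t0\tTTGG\tFFFF\n' >> aligned.sam
printf 'NC_003283.11\t6239\nNW_019110.1\t6231\n' > acc2taxid.tsv

# Step 5: pull read name + RNAME; steps 6-7: join RNAME against the mapping
cut -f1,3 aligned.sam | sort -k2,2 > read_rname.tsv
sort -k1,1 acc2taxid.tsv > acc2taxid.sorted.tsv
join -1 2 -2 1 -t "$(printf '\t')" read_rname.tsv acc2taxid.sorted.tsv > read_taxid.tsv
# read_taxid.tsv now holds: RNAME <tab> read <tab> taxID
```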

Thank you!
Laura


Hi @Laura_Peirson

I can think of a few ways to do this.

For step 1, you input a taxID to get a fasta result, correct? That is your mapping baseline. You could extract the fasta identifiers for each taxID query and put that into a tabular file.
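That extraction could look something like this (the fasta content below is a placeholder standing in for the result of one taxID query; the accession/taxID pairing is illustrative):

```shell
# Placeholder fasta standing in for the download returned by one taxID query (6231)
printf '>NC_003283.11 example chromosome\nACGT\n>NW_019110.1 example scaffold\nTTGG\n' > nematoda.fna

# Grab each sequence identifier and pair it with the query taxID
grep '^>' nematoda.fna | cut -d' ' -f1 | tr -d '>' \
  | awk -v taxid=6231 'BEGIN{OFS="\t"}{print $1, taxid}' > nematoda_acc2taxid.tsv
```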

Or, working from where you are now, you could extract that same information (and supplemental data) in Galaxy using this tool:

  • NCBI Datasets Genomes – download genome sequence, annotation and metadata

Once you have a tabular file with the genome accessions, their taxIDs, and other metadata, you can convert the BAM to a tabular format and perform the join (step 7 above).

I would run this in a workflow to batch process the work, and to better isolate the matches per taxID. You could always concatenate or group later, then cut out columns to apply summaries or custom graphs or reports.

The processing would be something like this:

  1. Put the taxIDs into a single column tabular file.
  2. Use Split file to dataset collection with a chunk size of 1 (one taxID per collection element).
  3. Process the collection through the NCBI Datasets Genomes tool.
  4. Convert your BAM to a tabular format
    • custom report → convert BAM to SAM to tabular with BAM-to-SAM and Convert SAM to interval
    • or, BED output → use the bedtools BAM to BED converter
  5. Join your data from step 3 with step 4. There are many join tools, so I’ll let you pick the one to use. We have examples in these tutorials → Data Manipulation Olympics. The first in the list uses web-based tools that you can include in the workflow, or you can move into an interactive environment.
  6. Optional – summarize your results with counts before or after the join above
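For the optional summary in step 6, a simple count of reads per taxID may be all you need. A sketch over a hypothetical joined table with taxID in column 3 (Galaxy's Group or Datamash tools can do the equivalent in a workflow):

```shell
# Hypothetical joined table: RNAME <tab> read <tab> taxID
printf 'NC_003283.11\tread1\t6239\nNW_019110.1\tread2\t6231\nNC_003283.11\tread3\t6239\n' > read_taxid.tsv

# Count aligned reads per taxID, most abundant first
cut -f3 read_taxid.tsv | sort | uniq -c | sort -rn > taxid_counts.txt
```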

If you are new to Galaxy and workflows, we have many examples! Everything is web based (or you can extract and edit the plain text for advanced use through the API). You can even try running a few samples (maybe 3 taxIDs) through the target tools, tune the queries, extract what works into a new workflow, then customize from there to create a reusable process that runs everything in a batch.

This is my favorite simple introduction to workflows.

Full details for what is possible are here.

Existing workflows, with even more customization, can be found here.

Please let me know if I misunderstood anything or if you have any problems doing this! This was an interesting question and I hope I didn’t overload my reply. Let us know how this works out! :)

This is great! Thank you. I will work on the process you suggested and report back for everyone.
