Is it possible to get the BOLDistilled COI available as a database for Kraken2?

prof.garrison.smc · April 29, 2026, 6:58pm

Hi,

We are starting to use Galaxy to analyze our eDNA data. Since we are using PCR for the COI locus, we can accelerate taxonomic assignment by focusing on the COI sequences in a database like the BOLDistilled datasets. Is it possible to have these available as a classification database for use with Kraken2?

Thanks!

Keith

jennaj · April 29, 2026, 7:45pm

Welcome @prof.garrison.smc

This is a good question! Would you like to share some more details about the data source? I don’t see Kraken2 indexes but might be looking in the wrong place. I also don’t see a “download index” option, just the API connection for batches of data.

BOLDistilled – BOLD

Then, I can share what we have already (most of the standard indexes). This just came up this morning! Popular tools! → Kraken2 databases question - #2 by jennaj

My initial thought is … would this group be interested in working with the Langmead lab, and would either be interested in generating new indexes for the wider Kraken/2 user community based on the new data stream? Then come questions like: is this even possible, or is it another curation project? If possible and resourced, then this index would have reproducible, open source data hosting, and could flow down to everyone, including Galaxy. But I might be missing something here, so I would be curious to hear more about what you think!

Two more options that would only need a fasta version of the index are BLAST and VSearch. They seem to have these already indexed, so maybe write to them and ask? Any fasta from the history can work with the Galaxy versions of these. If there was other metadata in mapping files, either csv or tsv could be used with data manipulation tools, too.

Let’s start there! Fewer silos is better for researchers, but sometimes the connections are complicated!

prof.garrison.smc · April 29, 2026, 9:09pm

Hi!

Sorry for not posting the direct link to the BOLDistilled downloads. I was blocked from adding the link from that host. It looks like the available formats are BLAST, SINTAX, and VSEARCH. The appeal of this site’s taxonomy files is that they are de-duplicated, curated, and limited to the COI locus. The database may have most utility for those of us doing metabarcoding from amplicons, though.

prof.garrison.smc · April 29, 2026, 11:34pm

I also got the sense that individual users might be able to download the BOLDistilled files and then use them to build a custom Kraken database, but I do not think the kraken2-build command is available to us.
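
For reference, here is a minimal sketch of the header preparation a custom Kraken2 build would need outside Galaxy. Kraken2 expects each reference sequence ID to be either an NCBI accession or to carry an explicit `kraken:taxid|<id>` tag. The function name, the sequence IDs, and the ID-to-taxid mapping below are all hypothetical; how BOLDistilled records would map onto the taxonomy loaded into the Kraken2 database is an open assumption here, not something the downloads are known to provide.

```python
def add_kraken_taxids(fasta_lines, taxid_by_seqid):
    """Yield FASTA lines with '|kraken:taxid|<id>' appended to each sequence ID.

    fasta_lines: iterable of FASTA lines (headers start with '>').
    taxid_by_seqid: hypothetical mapping from sequence ID to a taxonomy ID
    that exists in the taxonomy used for the Kraken2 database.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The sequence ID is the header text up to the first space.
            seq_id, _, description = line[1:].partition(" ")
            taxid = taxid_by_seqid.get(seq_id)
            if taxid is None:
                raise KeyError(f"no taxid for sequence {seq_id!r}")
            header = f">{seq_id}|kraken:taxid|{taxid}"
            yield f"{header} {description}" if description else header
        else:
            # Sequence lines pass through unchanged.
            yield line


# Illustrative input only: IDs and taxids are made up for the example.
example = [">seqA example description", "ACGTACGT", ">seqB", "TTGACCAA"]
tagged = list(add_kraken_taxids(example, {"seqA": 9606, "seqB": 7227}))
```

The tagged FASTA would then be the kind of file passed to `kraken2-build --add-to-library` before the build step, on a machine where that command is installed.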

wm75 · April 30, 2026, 10:37am

Hi @prof.garrison.smc,

I recently started working on Galaxy-based metabarcoding analysis workflows for animals and plants, so this is a very timely question.

I’m wondering though whether kraken2 is the best tool for taxonomy assignment in this case.

With targeted amplification of a single locus like COI, you would typically not assign raw reads, but rather condense them into a small number of amplicon sequence variants (e.g. with dada2), and kraken2 is not optimized for this few-query-sequences situation.

Wouldn’t BLAST, Sintax or Vsearch be better options, i.e. exactly the tools that they are offering their database for?

Another question: how does BOLDistilled compare to Midori2 and/or do you know other data sources besides those?

prof.garrison.smc · April 30, 2026, 7:46pm

Hi @wm75

Thanks for the suggestions about alternative workflows! I am being guided mainly by the workflow of the training project detailing microbiome analysis using Nanopore sequence data. Sorry for not posting a link. I keep getting an error that I cannot post a link from that host, but it seems to happen with any link I try to post.

I think we will likely be using Nanopore as our data source, because the per-sample analysis cost is much lower than Illumina for us at present. The other appeal of this workflow is that it ends in a Krona chart, which I think is a very user-friendly data visualization, especially for work with undergrads like I do.

I have also done the pipeline training series for dada2. If I remember correctly, the metabarcoding workflow ends in R. The phyloseq package also seems to produce good visualizations without too much need for sophisticated coding knowledge.

I have looked at Midori. Possibly after generating the ASV dataset, I could use BLAST to generate the taxonomic assignments from either a Midori or BOLDistilled reference database? I would need to get good at hybridizing data pipelines, but I am learning that this is half the battle : )

jennaj · April 30, 2026, 9:20pm

Hello @prof.garrison.smc

Thank you for letting us know the spam filters seem a bit aggressive! URLs to GTN tutorials and to a shared history at a UseGalaxy server should be possible, but I’ll double check that it is working correctly. Meanwhile, I adjusted your account to allow links.

These are the two primary ways to share:

FAQ: Sharing your History
Then for workflows, an invocation is best! Please see How to capture a share link to a Workflow Invocation

wm75 · May 1, 2026, 2:16pm

Yes, that would be the idea. You can either use BLAST directly or use “VSearch search”, which gives you more control over output filtering and formatting.
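
As a rough sketch of the filtering step after such a search: both BLAST (tabular output, `-outfmt 6`) and VSEARCH (`--blast6out`) can write a 12-column tab-separated table with query ID, subject ID, percent identity and bit score in columns 1, 2, 3 and 12. The function below keeps the best-scoring hit per ASV above an identity cutoff. The function name, the example rows and the 97% threshold are illustrative assumptions, not a recommendation for COI.

```python
import csv
import io


def best_hits(tabular_text, min_identity=97.0):
    """Return {query: (subject, identity, bitscore)} for the top hit per query.

    Assumes the default 12-column BLAST tabular layout. Hits below
    min_identity (an assumed, adjustable cutoff) are ignored.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query, subject = row[0], row[1]
        identity, bitscore = float(row[2]), float(row[11])
        if identity < min_identity:
            continue
        # Keep the hit with the highest bit score for each query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, identity, bitscore)
    return best


# Made-up rows for illustration; real input would come from the search tool.
example_table = (
    "ASV1\trefA\t99.5\t300\t1\t0\t1\t300\t1\t300\t1e-150\t540\n"
    "ASV1\trefB\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-140\t510\n"
    "ASV2\trefC\t90.0\t300\t30\t0\t1\t300\t1\t300\t1e-90\t350\n"
)
hits = best_hits(example_table)
```

In Galaxy the same filtering can be done with the data manipulation tools on the tabular output; the code only makes the logic explicit.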

As for using Nanopore data, that is not an issue per se and you can still use dada2. One limitation that you’ll face with ONT data, unless you’re sequencing with extra-low-error-rate protocols, is the number of species you can resolve reliably in the same sample. This is simply because if two COI sequences are closely related, the ONT sequencing error rate might blur the line between them entirely.
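
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes errors are independent with a uniform per-base rate, which is a simplification, and the amplicon length (650 bp) and error rate (1%) are purely illustrative values, not measurements of any particular ONT chemistry.

```python
def expected_errors(read_length, per_base_error):
    """Expected number of erroneous bases in one read."""
    return read_length * per_base_error


def prob_error_free(read_length, per_base_error):
    """Probability that a read contains no errors at all,
    assuming independent errors at a uniform per-base rate."""
    return (1.0 - per_base_error) ** read_length


# Illustrative assumptions: a 650 bp amplicon read at a 1% per-base error rate.
amplicon_length = 650
error_rate = 0.01

errors_per_read = expected_errors(amplicon_length, error_rate)
clean_fraction = prob_error_free(amplicon_length, error_rate)

print(f"expected errors per read: {errors_per_read:.1f}")
print(f"fraction of error-free reads: {clean_fraction:.4f}")
```

Under these assumptions a read carries several errors on average, so two COI variants that differ at only a handful of positions can be separated by fewer true differences than the noise in a single read.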

Next week, I will be in a workshop with one day dedicated to metabarcoding analyses. If you want I can afterwards share with you the Galaxy analysis workflow(s) that we’re using there, and you’ll know it has at least been tested once on some real data.

prof.garrison.smc · May 7, 2026, 10:13pm

Yes, it would be great to see the pipelines that you learn about in your workshop!

I agree with you about the Nanopore error rate being a potential problem, especially when trying to group sequences for taxonomic ID. At this point, the low per-sample cost, which makes it accessible for use in student labs, outweighs the potential issues with error rates. I really want to get NGS data into students’ hands from samples they collect, so they are motivated to get it through the entire analysis pipeline. It does argue for archiving samples for future analysis with a lower error rate technology.