Dear Galaxy Team,
I would like to run Kaiju using a reference database that includes bacteria, archaea, and eukaryotes, such as refseq_nr. Could you please help add this database? Thank you!
Welcome @chuanzhai
The Kaiju indexes do not appear to be undergoing updates anymore (as of 2024). This is likely why the tool is only hosted at UseGalaxy.eu and not the other UseGalaxy servers.
Instead, please have a look at Kraken2 and related tools. We have many tutorials that can guide you through using them, and the indexes are current and come from the same public source that others use when working outside of Galaxy.
Galaxy Training Network
Where we source the indexes → Kraken2 databases question - #2 by jennaj
We hope this explains the current situation and provides an alternative!
@chuanzhai while the DBs not receiving updates will increasingly be problematic, I triggered installation of all the existing genomic DBs now on Galaxy Europe. They should be available as DB choices in the tool from tomorrow on.
Cheers,
Wolfgang
Hopefully you can get this running on Galaxy. I have wanted this for a long time; I think it gives a more accurate classification than Kraken2.
Hi @chuanzhai and @Jon_Colman
The indexes appear to be in place! Please give the tool a try!
I didn’t go through and test each, so if either of you run into problems, please share back the full log messages and inputs/parameters and we can help to investigate!
Glad this could be done!
I tried running Kaiju twice, using different databases, both failed??
Ah, ok, the quick addition was worth a try!
I was able to reproduce your use case with tool test data and found another small issue as well. I’ve ticketed these here → Corrections for kaiju_kaiju 1.10.1+galaxy1 · Issue #8045 · galaxyproject/tools-iuc · GitHub.
@wm75 is out right now but he’ll see this when he returns. Maybe some part of the nr, nr_euk, and refseq indexes didn’t get replicated into the correct location for the working job directory to see it. The others are OK.
A warning: the other issue I found will need to be corrected before you can use the same options that you applied if you want to try a different index. In short, “Enable SEG low complexity filter” needs to be toggled to Yes or the job falls through to a different problem. Turning the filter off is technically supported by the underlying tool and I didn’t find a known issue, so the failure may be spurious and something else may be happening here.
Hope this helps and more next week!
Yeah, I suspected some small issues. I didn’t want to spend too much time, as it was slow processing.
Ah, yes. Not surprisingly, the tool has very different memory requirements depending on the DB. Our default is only good enough for the very small viral and plasmid ones.
We’ll need to configure per-DB memory requirements. Give us a day or two to get this set up.
All installed DBs on Galaxy Europe should now be working. Please report if there are remaining issues.
I will give it a try!!
I ran Kaiju with the refseq_nr database (2024-08-13) on Galaxy Europe, and did not get any assignments to viruses or microbial eukaryotes (all assigned reads were to cellular organisms, and within that, only bacteria and archaea).
I know from running Kaiju with the nr_euk database and Kraken2 with the core_nt database that there are viruses and eukaryotes in my samples. I am wondering if the refseq_nr database on Galaxy EU is incomplete?
Thanks for your help!
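For anyone wanting to verify this kind of observation, here is a minimal Python sketch that tallies classified reads by top-level group. It assumes Kaiju's default tab-separated output (classification flag `C`/`U`, read name, NCBI taxon ID) and a taxonomy file in NCBI `nodes.dmp` format. The function names are hypothetical helpers written for illustration; they are not part of Kaiju or Galaxy.

```python
from collections import Counter

# NCBI taxonomy IDs of the top-level groups discussed in this thread.
TOP_LEVEL = {2: "Bacteria", 2157: "Archaea", 2759: "Eukaryota", 10239: "Viruses"}


def load_parents(nodes_lines):
    """Build a child -> parent map from lines in NCBI nodes.dmp format."""
    parents = {}
    for line in nodes_lines:
        # nodes.dmp fields are separated by "|" with surrounding tabs.
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[0]:
            parents[int(fields[0])] = int(fields[1])
    return parents


def top_level_group(taxid, parents):
    """Walk up the taxonomy until a known top-level group or the root is reached."""
    seen = set()
    while taxid not in seen:
        if taxid in TOP_LEVEL:
            return TOP_LEVEL[taxid]
        seen.add(taxid)
        parent = parents.get(taxid)
        if parent is None:
            break  # taxon ID missing from the taxonomy file
        taxid = parent
    return "other/unresolved"


def tally_kaiju(kaiju_lines, parents):
    """Count classified reads per top-level group from Kaiju output lines."""
    counts = Counter()
    for line in kaiju_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3 or cols[0] != "C":
            continue  # skip unclassified ("U") or malformed rows
        counts[top_level_group(int(cols[2]), parents)] += 1
    return counts
```

Both functions accept any iterable of lines, so open file handles can be passed directly, for example `tally_kaiju(open("kaiju.out"), load_parents(open("nodes.dmp")))`, where both file names are placeholders. A result with no `Viruses` or `Eukaryota` keys would match the observation described above.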
@slghose I’m not sure this is a technical issue with that DB on Galaxy Europe.
The files we have for it look ok, at least superficially, and there are no complaints from the tool either.
On the other hand it is entirely possible to have hits in refseq, nr and nr_euk that you don’t get with refseq_nr. Not sure how exactly sequences are selected for inclusion in the latter, but from
RefSeq non-redundant proteins:
“Non-redundant RefSeq protein records are currently provided for archaeal and bacterial RefSeq genomes, with the exception of selected reference genomes, by the NCBI prokaryotic genome annotation pipeline.”
So this matches well with your observation, doesn’t it?
@wm75 Thanks for your response. I based my assumption that the refseq_nr database for Kaiju also included some viruses and microbial eukaryotes on this Kaiju documentation that describes the databases. There it says that refseq_nr from 2023-06-17 and 2024-08-13 should contain “Protein sequences from Archaea, bacteria, and microbial eukaryotes from NCBI RefSeq non-redundant protein collection, as well as viral protein sequences from NCBI RefSeq.” I used the refseq_nr version from 2024, so I think it should have some viruses/microbial eukaryotes in it.