Hi there, is there a way to make the core_nt database available for Kraken 2 on the galaxy.eu server?
I have multiple metagenome samples from nanopore sequencing. After QC with Porechop and fastp, Kraken 2 at confidence=0.5 leaves between 85% and 90% of the reads unclassified.
Of course, this fraction drops with a lower confidence threshold, but the classifications I then get are unexpected. This is a bioreactor consortium of environmental origin, so I am mostly expecting soil bacteria.
I have tried all the 8GB, full, 2022, and 2024 prebuilt RefSeq indexes but got similar results. It is my understanding that using the core_nt database would either help with this or show that the issue is in my samples. Running it natively is not an option at this time. Any recommendations?
The UseGalaxy.org server also hosts the EUPathDB index, if you want to try it and see what happens. The EU server will host it soon too, but meanwhile you can move data between servers to access the different indexes.
The core_nt database is likely much too large to host at a public Galaxy server but I’ve logged the request with our team anyway to see what others think. Maybe there is something special that can be done in the future at the public sites.
If this is something you want to try yourself, the limitation is not Galaxy itself but the attached cluster nodes that execute the public jobs. These nodes are substantial, but this database index is truly large. Running Galaxy yourself (maybe the Docker version) and attaching it to a cluster node that can handle the job (maybe cloud based) is one idea.
Updating this: I tried the same analysis with the Core_nt database versus the Standard PFP, and only 4% more of the reads were classified (unassigned went from 38% to 34%), with no significant changes in taxonomic assignment. So it may not be more accurate. For both, I used confidence 0.05 and minimum hit group 3 on pre-treated nanopore metagenome reads.
This might be coming from our samples. Does anyone have any other suggestions?
Then my last suggestion is to reach out to the Galaxy micro community scientists to see if they have more ideas. The link to their chat is at the very top of the tutorials above, and I’ve cross posted your question over there to get this started.
Dear @Lily_ofthepond;
The confidence score in Kraken 2 is applied per read: with a value of 0.05, a taxon assignment supported by less than 5% of the read's k-mers is dropped (the read is moved to a higher taxonomic level or left unclassified). This can remove many low-abundance taxa, and you might have many of those in your samples. You could try lowering the confidence score, or even setting it to 0. Although Kraken 2 has a high false positive rate, 0.02 is often a good trade-off. Here is a good discussion: Guidance on confidence score · Issue #265 · DerrickWood/kraken2 · GitHub
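In case it helps to see what the threshold does to a single read, here is a minimal Python sketch based on how the Kraken 2 manual describes confidence scoring: the score for a label is the number of k-mers hitting the clade rooted at that label, divided by the number of non-ambiguous k-mers in the read, and the label is moved up the tree until the score meets the threshold. The toy taxonomy (`PARENT`), the taxids, and the function names are made up for illustration; this is a simplification and not Kraken 2's actual code.

```python
# Toy taxonomy with hypothetical taxids: child -> parent. Taxid 1 is the root.
PARENT = {2: 1, 3: 2, 4: 2}


def in_clade(taxid, clade_root, parent):
    """Return True if taxid is clade_root or one of its descendants."""
    while True:
        if taxid == clade_root:
            return True
        if taxid not in parent:
            return False
        taxid = parent[taxid]


def confidence(kmer_hits, label, parent):
    """Fraction of a read's non-ambiguous k-mers that hit the clade of `label`.

    kmer_hits uses the layout of the k-mer column in Kraken 2's per-read
    output, e.g. "3:10 2:5 0:4 A:1" (taxid:count, 0 = no hit, A = ambiguous).
    """
    clade_kmers = 0
    total_kmers = 0
    for token in kmer_hits.split():
        if token == "|:|":  # separator between mates of a read pair
            continue
        taxid, count = token.split(":")
        if taxid == "A":  # ambiguous k-mers are not counted at all
            continue
        count = int(count)
        total_kmers += count
        if taxid != "0" and in_clade(int(taxid), label, parent):
            clade_kmers += count
    return clade_kmers / total_kmers if total_kmers else 0.0


def apply_threshold(kmer_hits, label, threshold, parent):
    """Move the label up the tree until its score meets the threshold.

    Returns 0 (unclassified) if no ancestor reaches the threshold.
    """
    while True:
        if confidence(kmer_hits, label, parent) >= threshold:
            return label
        if label not in parent:
            return 0
        label = parent[label]


hits = "3:10 2:5 0:4 A:1"
for threshold in (0.05, 0.6, 0.9):
    print(threshold, apply_threshold(hits, 3, threshold, PARENT))
```

For this made-up read, taxid 3 is supported by 10 of 19 counted k-mers (about 0.53), so it keeps its label at 0.05, is moved up to taxid 2 at 0.6 (15 of 19, about 0.79), and becomes unclassified at 0.9.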
You could also try MetaPhlAn, check whether it assigns more reads, and compare the results.