Kraken2 database

mycojon · February 16, 2024, 9:03pm

Is there any Galaxy site that has the Kraken2 nt database (719 GB) installed? I keep running into problems where the databases are missing some of the species in my samples.

Also, I see that Kraken2 can work similarly to Diamond, translating reads to protein and searching against a protein database. Are there any protein databases installed on Galaxy?

igor · February 19, 2024, 6:57am

Galaxy Europe has a number of Diamond databases.
Kind regards,
Igor

zw97 · November 8, 2024, 3:51am

Have you found a Galaxy site with the Kraken2 nt-database installed? I’m also experiencing issues with missing species in my samples, which is resulting in a very low annotation rate. If you have any information, I would greatly appreciate your help!

igor · November 8, 2024, 3:56am

Hi @zw97
Do you mean the BLAST nt database indexed for Kraken2? I am not aware of this option. I am not sure if it is suitable for Kraken2 because of redundancy, but I might be wrong.
Kind regards,
Igor

Jon_Colman · November 8, 2024, 5:24pm

Hi Igor,
If you look at the available Kraken2 databases that are hosted on the Amazon cloud service, it’s referred to as the “core_nt Database”. It is a very large collection, inclusive of GenBank, RefSeq, TPA and PDB. Its current release is 9/4/2024, with a size of 233.3 GB. The most inclusive one currently on Galaxy.eu is the PlusPFP from 9/2022, which is 142 GB and is KNOWN to be defective, as it was found to be unintentionally missing sequences (its current size is 188 GB).

I suspect the core_nt database may include host sequences, which would be EXTREMELY helpful in running Kraken2 for unknown stuff.

I would like to request this to be added to the Galaxy options if possible.

Jon

jennaj · November 8, 2024, 6:23pm

Hi @Jon_Colman

We just added EuPathDB to UseGalaxy.org for Kraken2 this week. Would you want to try that one?

EuPathDB

mycojon · November 8, 2024, 7:08pm

That might be an interesting database for me to try!! One issue that I’ve been having is that my samples have Plasmodium ovale, but NONE of the other databases include that species (actually 2 species, Plasmodium ovale wallikeri and Plasmodium ovale curtisi), which are among the pathogenic human malaria species.

I also previously avoided the usegalaxy.org kraken2 as I believe it didn’t have the option to extract the results by ID.

jennaj · November 8, 2024, 7:27pm

Hi @mycojon

You can search text files with tools that are similar to what is used on the command line. Are you familiar with that? Even if not, this is a powerful way to use Galaxy and not that different from spreadsheet/text editing programs.

This is a short cheat sheet and guide with examples. It is formatted like a tutorial, but you can also just use the parts you need. Maybe the Filtering section?

Hands-on: Data Manipulation Olympics / Data Manipulation Olympics / Introduction to Galaxy Analyses

To filter by simple terms that may not be isolated into a column by itself, you can use a tool like Select. If you need help constructing a search, you can share a line that you want to query, and your query term so far, and we can try to help with it.

To first get all the information on a single line, you can use a tool like Convert Kraken data to Galaxy taxonomy representation.

Maybe you already saw all of this … so consider this help for the next person who had a similar concern about data parsing abilities. If you have a specific example that is not working or you are not sure about tool choices, you can ask more questions too!
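For anyone who wants to see what such a filter does outside of Galaxy, here is a minimal sketch in Python. It assumes the standard Kraken2 per-read output (tab-separated: C/U flag, read ID, taxonomy ID, length, LCA mapping), where the third column may also look like "Name (taxid N)" if names were requested. The function name and the example lines are made up for illustration; this is not how the Galaxy tools are implemented.

```python
def filter_kraken_by_taxid(lines, wanted_taxids):
    """Yield Kraken2 per-read output lines whose taxid is in wanted_taxids.

    Hypothetical helper for illustration only.
    """
    wanted = {str(t) for t in wanted_taxids}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        taxid_field = fields[2]
        # With scientific names enabled the column reads "Name (taxid 9606)".
        if "(taxid " in taxid_field:
            taxid = taxid_field.rsplit("(taxid ", 1)[1].rstrip(")")
        else:
            taxid = taxid_field
        if taxid.strip() in wanted:
            yield line


# Made-up example lines; 9606 is the human taxid mentioned in this thread,
# 12345 is a placeholder.
example = [
    "C\tread1\t9606\t150\t9606:116\n",
    "U\tread2\t0\t150\t0:116\n",
    "C\tread3\tHomo sapiens (taxid 9606)\t150\t9606:116\n",
    "C\tread4\t12345\t150\t12345:116\n",
]
host_lines = list(filter_kraken_by_taxid(example, [9606]))
host_read_ids = [line.split("\t")[1] for line in host_lines]
```

The read IDs collected this way could then be used to pull the matching reads out of (or drop them from) the FASTQ file.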

mycojon · November 9, 2024, 7:07pm

Hi Jennifer,
The new EuPathDB definitely works better to identify the Plasmodium species, thanks!!!

I found that Galaxy EU has a new Kraken2 database, Mycobacterium V1. Do you have any information on this one? I would like to be able to see the list of the species that it includes. I knew that I had some Mycobacterium species in my samples, but it’s been painstaking to try to pull them out. I ran this one on my samples, and it identified a massive amount of Mycobacterium as well as other species. I’m not questioning that they are there, since when I do host removal there is a pretty sizeable non-host set of reads. I did notice in the output that there isn’t a human host (9606) in the database. Is this why it identifies many more species?

What I’m currently trying to do is just pull everything non-host out of my samples, except for viruses, fungi, and archaea, and do whatever QC and error correction I can. Then I will take these reads for submission to an online bioinformatics platform.

One confusion that I have is regarding mapping reads against a reference sequence, and how Kraken2 works.

For example, in the following scenario.

First remove adapters, and confirm via FastQC that the quality looks good overall and that adapters are removed.
Map with Bowtie2 default settings (I’m not sure if this is the best setting, as there are likely some errors in the reads) against a known species in my sample.
Results show significant amounts of mapped reads.
Now I take the unmapped reads and run them with Kraken2 on the Mycobacterium V1 database, and there is a massive amount of reads of the same species that Bowtie2 didn’t find, as well as numerous others.

Are there better Bowtie2 settings that I should be using? I don’t mind whether the species is exactly correct, as my main goal is to subtractively remove host reads, but not by mapping, as that has been shown to take along a high percentage of my microbial reads.

mycojon · November 9, 2024, 7:38pm

Hi Jennifer,
It seems the two Kraken2 databases that I need to use are on either Galaxy.org or Galaxy EU, and my files are quite large. I had heard that I can move histories to other Galaxy servers. Can I move them between the two Galaxy sites without having to download/upload?

mycojon · November 11, 2024, 4:07pm

Hi Jennifer,
I finally figured out how to transfer my files between servers. This was a HUGE help. It took me a while to figure out that I had to share the file first.

A question regarding the EuPathDB: I noticed that this database does not include a human reference. For this type of database, should I do host removal before, after, or not at all? I know that the Plasmodium species match the human reference very closely, so host removal would likely remove actual Plasmodium reads.

Thanks,

wm75 · November 11, 2024, 6:12pm

Look at the full description of that new database. It contains a link, which leads to a Zenodo record with exact details about how the db has been built. Specifically, the genomes used are those listed in the assembly_summary.txt file there.

mycojon · November 11, 2024, 7:00pm

Thanks Wolfgang, I found it!!! It looks like a great database, so many of those species aren’t represented in other databases.

mycojon · November 12, 2024, 7:12pm

Hi Jennifer,
Is Galaxy.org working properly??? It was working great, then it seemed like nearly all programs were running very slowly. It took hours even to move files between servers.

Just running de-interlace on a relatively small file didn’t even complete after running all night.

Thanks

jennaj · November 12, 2024, 9:47pm

Hi @mycojon It has been working for me all day today. I’m starting another test in a different account. Meanwhile, if you want to post back a share link to a history that seems stalled, I can take a look. Thanks!

Update: I just launched a workflow and a bunch of jobs ran nearly immediately, and now I have queued jobs. That is very normal and expected. That said, I am still willing to take a look if you still want to share.

mycojon · November 13, 2024, 12:55am

Is a share link the same link I would use to copy to a different server?

mycojon · November 13, 2024, 1:00am

Hi Jennifer,
Here is one of the histories. I tried to de-interlace this same file that ran last night, which I deleted and restarted. Same issue with other histories on galaxy.org as well.

https://usegalaxy.org/u/squidly/h/mandy

jennaj · November 13, 2024, 9:51pm

Hi @mycojon

The problem with this run is the format of the data. This job will never complete and will eventually die. You can just delete then purge the output of the processing job since it is not going to be useful, and will just prevent other jobs from running.

These reads have an odd sequence identifier on the @ lines. The first part looks like single end reads, but then these have the /1 and /2 notation at the end. Did you manipulate these yourself? Why? Maybe we can help to come up with a better analysis strategy.

What are you trying to do here? I see that the reads came from a prior mapping job. The input to that mapping job was set as single end data, but these had the odd notation on the sequence @ lines already. Do you have the original reads somewhere?

You can explain more if you want to, but I don’t think “de-interlacing” this data makes any sense, since this is not paired end data, it is single end. Getting the original reads again into Galaxy would be much better for a few reasons.

Update: Ah, mate pair data. Ok. Do you have the original reads still? Is this what you were discussing in this post? Mate-Pair Reads. Try to work with that data. The tools you are using to do the split for pairs are not going to know how to process it the way you are doing this now.
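As a rough way to check whether a file is really interleaved before trying to split it, here is a minimal sketch in Python. It assumes the old-style /1 and /2 suffixes on the read names and only looks at the @ header lines; the function name is hypothetical and this is not what the Galaxy de-interlacer does internally.

```python
def looks_interleaved(headers):
    """Return True if FASTQ headers alternate /1, /2 for the same read name.

    `headers` is the list of @ lines in file order (hypothetical helper).
    """
    if not headers or len(headers) % 2 != 0:
        return False  # an interleaved file needs an even number of records
    for first, second in zip(headers[0::2], headers[1::2]):
        # Only the first whitespace-separated token is the read name.
        name1 = first.split()[0]
        name2 = second.split()[0]
        if not (name1.endswith("/1") and name2.endswith("/2")):
            return False
        if name1[:-2] != name2[:-2]:
            return False  # mates must share the same base name
    return True
```

If this returns False for a file, splitting it into pairs is unlikely to give a meaningful result, and going back to the original reads is the safer route.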

Please let me know if I can help more.

mycojon · November 13, 2024, 10:45pm

Hi Jennifer,
I’m working on deleting these and starting from scratch again. I’m trying to clean, optimize and error correct my reads before submitting to an online bioinformatics platform, which ideally wants all good quality reads. I’m using Kraken2 to extract reads of interest, along with mapping sequences that are missing from Kraken2, ideally getting rid of viruses, fungi, archaea, and host reads.

I believe the /1 and /2 notation came primarily from the Extract Reads from Kraken2 step. So the input to the de-interlacer was the output of Extract Kraken reads, followed by mapping to a reference with BBMap. I was trying to de-interleave the unmatched reads.

The other source for the /1 and /2 comes from rescuing unpaired reads with the concept that Read 2 is the reverse complement of Read 1 so I’m technically not adding or removing actual information. This set didn’t have very many unpaired reads after trimming, but since I didn’t know if the unpaired reads were of interest, I rescued them by the following.

For standard paired end reads:
I would take the unpaired reads and run Trimmomatic MAXINFO with Min 35 for both forward and reverse.
I would reverse complement the /2 reads to put them in the same orientation as the /1 reads (not sure if this even makes sense).
I would concatenate the unpaired reads, with the result becoming the new /1 reads, and reverse complement them to create a new /2.
Using FASTQ to SAM, I would use the new /1 and /2 reads, with the option to strip the /1 designation.
From the BAM file I would convert to compressed FASTQ (interleaved reads).
Finally, I would de-interleave the reads to give me paired end reads that I could put back into my read set.
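To make the reverse complement step concrete, here is a minimal sketch in Python of what it does to a single FASTQ record, assuming plain A/C/G/T/N bases. The function name is made up for illustration, and it is not the code of any Galaxy tool.

```python
# Translation table for complementing bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_record(name, seq, qual):
    """Return (name, seq, qual) with the sequence reverse complemented.

    The quality string is reversed so each score stays with its base.
    Hypothetical helper for illustration only.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have the same length")
    return name, seq.translate(_COMPLEMENT)[::-1], qual[::-1]


# Made-up record.
new_name, new_seq, new_qual = reverse_complement_record("read1", "AACGT", "IIHG#")
```

A /2 read created this way carries exactly the same bases as its /1, so it adds no insert size or pairing information.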

For my Mate-Pair Reads (I have 3 sets), this was more problematic. They were 150x2 NovaSeq 6000 reads with massive poly-G tails. Removing the poly-G would often completely remove read 2, leaving me with over 100 MB of compressed R1 reads. It’s my understanding that reads shorter than 150 after adapter removal are most likely in FR orientation, and 150 bp reads are most likely RF reads. I really don’t know how to correct read orientation with Galaxy. I know that BBTools has SplitNextera to do this, but it’s not on the Galaxy site.
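As an illustration of why read 2 can vanish entirely, here is a minimal sketch in Python of poly-G tail trimming. It assumes a simple rule (strip a trailing run of G if it is at least `min_run` long); the function name and the threshold of 10 are assumptions, and real trimmers typically tolerate mismatches and use their own thresholds.

```python
def trim_poly_g_tail(seq, qual, min_run=10):
    """Strip a trailing run of G bases if it is at least min_run long.

    Hypothetical helper; min_run=10 is an assumed threshold.
    """
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    if run_length < min_run:
        return seq, qual  # tail too short to be treated as an artifact
    return stripped, qual[: len(stripped)]


# A read that is nothing but G is trimmed down to an empty sequence.
all_g = trim_poly_g_tail("G" * 150, "I" * 150)
partial = trim_poly_g_tail("ACGT" + "G" * 12, "I" * 16)
```

A read 2 that is poly-G from the first base ends up empty after trimming, which is consistent with read 2 being dropped while read 1 survives.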

Hopefully this makes sense?

jennaj · November 19, 2024, 12:06am

Hi @mycojon

Ok – some of these steps seem odd but you can try to get things processed.

My primary advice would be to:

Review what the recommended processing is at NCBI for similar types of submissions, or look for a guide at the forum where you are submitting, then follow that. They might have a workflow you could translate to a Galaxy workflow – and that would allow you to process everything in a batch! Change a parameter, rerun, generate some statistics, then decide what to do. You could even publish that workflow so other people could use it, including people who might later be using your data!
Consider not attempting to rescue reads that fail at certain steps. These usually are not that important. Why? Often those are just lower quality copies of whatever is already in the sample that did pass through. Maybe you can think of a way to confirm this in your own data, and make decisions from there.