Diamond troubleshooting

@mycojon Diamond translates short reads and maps them to a database. It is super fast. The tool is available on several Galaxy servers. Galaxy Europe has a reasonable collection of databases, and users can create custom databases.
Kind regards,
Igor

Hi Igor,

I will go ahead and try Diamond on the European site. I had noticed that the standard Galaxy doesn’t have any databases attached to the Diamond program.

If I understand correctly, Kaiju translates the submitted reads in all six reading frames and then classifies them against a protein database. I assume that Diamond works essentially the same way?
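For anyone unfamiliar with the idea, here is a minimal Python sketch of six-frame translation: three reading frames on the forward strand and three on the reverse complement, using the standard genetic code. It only illustrates the concept and is not how Kaiju or Diamond implement it internally; the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; codons with unknown bases become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(read):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six protein strings can then be compared against a protein database, which is why a translated search can find hits that a nucleotide search misses.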

I have noticed that when I use the CZID.org online classifier, which classifies reads against both the NT and NR databases, on the eukaryote “Plasmodium ovale wallikeri”, I get about 8,000 reads against the NR database and 0 reads against the NT database. Using Kaiju, which works similarly to Diamond, I get about 400,000 reads against the NR + Eukaryote database. I get similar results with the bacterium Klebsiella pneumoniae. Can you explain why the results are so different?

Jon

Hi Jon,
I don’t know how Diamond works. Consider checking the manual, the GitHub page, or the paper describing it.
I am not familiar with CZID.org, but it is not uncommon to see different results from different tools. Results depend on databases, search algorithms and settings. Maybe talk to people working in this field.
Kind regards,
Igor

I have been trying to figure out the Diamond program, but it’s been over my head so far.

Here is what I’m really looking for; maybe there is something already set up for it.

I have my shotgun DNA reads as FastqR1 and FastqR2. For most samples, I can get good results with Kraken2. But there are a few that have better hits using the NR database. Also, one is a eukaryote, “Plasmodium ovale”.

You mentioned the mapping tool, but I can’t find anything that uses the NR or Eukaryote databases to map against.

Jon

I “think” I figured it out. It looks like I can convert my reads to FASTA to use as input for Diamond against one of the databases.

Since it looks like Diamond doesn’t take paired reads, what is the best way to input both reads? Interleave?

These two are options

  1. Create a paired collection of your reads, and map the ends separately.
  2. Interleave the reads into a single file and map together.
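
For the second option, the conversion and interleaving can be done in one pass. Below is a minimal Python sketch, assuming plain four-line FASTQ records, mates in the same order in both files, and a `/1` and `/2` suffix convention for marking the two ends. The function names and file paths are hypothetical; on Galaxy, the existing FASTQ tools do the same job.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (read_id, sequence) pairs from a 4-line-per-record FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))
            if len(record) < 4:
                break
            # Drop the leading '@' and anything after the first whitespace.
            read_id = record[0].rstrip()[1:].split()[0]
            yield read_id, record[1].rstrip()


def strip_mate_suffix(read_id):
    """Remove a trailing /1 or /2 so both mates share one base name."""
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id


def interleave_to_fasta(r1_path, r2_path, out_path):
    """Write R1 and R2 as alternating FASTA records; return the pair count."""
    pairs = 0
    with open(out_path, "w") as out:
        for (id1, seq1), (_id2, seq2) in zip(read_fastq(r1_path), read_fastq(r2_path)):
            base = strip_mate_suffix(id1)
            out.write(f">{base}/1\n{seq1}\n>{base}/2\n{seq2}\n")
            pairs += 1
    return pairs
```

Keeping the `/1` and `/2` suffixes in the FASTA headers makes it possible to tell the two ends apart again in the tabular output.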

You’ll have some complications with both, mostly about how to handle the two ends of a pair having different mapping results, but the output is a tabular file that is easy to work with using Text Manipulation tools. See: GTN Materials Search
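
To make the "different ends, different results" issue concrete, here is a rough Python sketch that takes the best hit per read from a 12-column BLAST-style tabular file (query in column 1, subject in column 2, bit score in column 12) and labels each pair as concordant, discordant, or single. It assumes read names end in `/1` and `/2`; the function names are hypothetical, and the same comparison can be done on Galaxy with the Text Manipulation tools.

```python
import csv
from collections import defaultdict


def best_hits(tabular_path):
    """Return {query: (subject, bitscore)} keeping the top bit score per query."""
    best = {}
    with open(tabular_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 12:
                continue  # skip blank or malformed lines
            query, subject, bitscore = row[0], row[1], float(row[11])
            if query not in best or bitscore > best[query][1]:
                best[query] = (subject, bitscore)
    return best


def compare_mates(best):
    """Label each read pair by whether its two ends hit the same subject."""
    pairs = defaultdict(dict)
    for query, (subject, _score) in best.items():
        base, _sep, mate = query.rpartition("/")
        if mate in ("1", "2"):
            pairs[base][mate] = subject
    summary = {}
    for base, hits in pairs.items():
        if len(hits) < 2:
            summary[base] = "single"  # only one end had any hit
        elif hits["1"] == hits["2"]:
            summary[base] = "concordant"
        else:
            summary[base] = "discordant"
    return summary
```

How to treat the discordant and single pairs (keep, drop, or resolve by score) is a judgment call that depends on the analysis.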

I’m going to split this out into a new topic since it is different.

And, scientifically, I’m not convinced that this is the right tool for your data … but you are certainly free to explore!

Thanks Jennifer, I will try doing the paired collection. I’m currently waiting to see the Diamond results for two separate runs, both using forward reads only: one on the raw reads and one on the unidentified reads.

I have found that Galaxy Europe is much more usable than the standard Galaxy.

I am learning more about the variability of results from different sources. Since I have run my samples on several commercial platforms, I have an idea of what I’m looking for. The biggest problem is that the various databases are somewhat incomplete. One example I found was malaria, which in humans is caused by 5 species. Many of the reference databases only include 4 species, yet my sample shows Plasmodium ovale. So depending on where I run the sample, it’s a false negative. Same thing with Mycobacterium leprae.

The commercial programs have the same problem (czid.org, bugseq.com, One Codex, etc.). Kraken2 also seems to work well, but the results depend on which databases I use, and none of them have all of the relevant sequences.

Hi Jennifer,

I have another question regarding a mapping issue. I have some samples that include Plasmodium (malaria species) and Toxoplasma gondii. I tried using HISAT2 to map the human reads, but upon analyzing the mapped reads with Kraken2, it seems that HISAT2 is also mapping Eukaryotes to the hg38 human reference. Looking at my Kraken2 results, it appears to identify about 24,000 Eukaryotes, though it only identifies about 1,000. So I’m guessing there are about 23,000 Eukaryotes getting caught in the human mapping that Kraken2 doesn’t identify?

Can you recommend how to avoid losing these reads that are being cross-mapped to human?

Thank you,

Jon