Kraken2 database

mycojon · November 19, 2024, 9:14pm

Hi Jennifer,
I already suspected an issue with the sequencing from the start. I expected Mycobacterium to be present in the samples, and my research made it clear that sequencing Mycobacterium is problematic. #1 Illumina sequencing is known to be problematic with high-GC bacteria, especially Mycobacterium. #2 The extraction kit by Zymo Research also mentions some issues with sequencing high-GC bacteria and recommends additional steps for them.

So I went into the sequencing anticipating some sort of issues, though it was much more problematic than I had anticipated. Since Mycobacterium has a high lipid content in its cell wall, standard procedures don’t adequately compensate for this, causing sequencing issues. I have read many published studies and reports suggesting additional steps to adequately clean Mycobacterium DNA.

For standard samples, I completely agree that standard processing procedures are likely adequate. But when you anticipate problematic microbes in your samples, and it is hard to get the sequencing lab to adjust to your needs (without spending tons of money), rescuing reads makes sense, to me at least: if either a forward or a reverse read is of good quality, then it makes sense to keep it. Since the R2 is just the reverse complement of R1, it’s theoretically not adding to or subtracting from the dataset. If there were only a small number of these reads, it would make sense to ignore them, but the loss would have been massive, especially in the initial sequencing done on the NovaSeq 6000: 150 MB compressed files are many millions of reads.
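The rescue logic described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the cutoffs `min_len` and `min_mean_q`, the Phred+33 encoding, and the function names are hypothetical and are not taken from the thread. Trimming tools such as Trimmomatic in paired-end mode already write surviving single mates to separate unpaired outputs, so a sketch like this mainly shows which bin each pair would land in.

```python
from statistics import mean

def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality]

def passes(read, min_len=35, min_mean_q=20):
    """Hypothetical filter: read is long enough and its mean quality meets the cutoff."""
    seq, qual = read
    if len(seq) < min_len or not qual:
        return False
    return mean(phred_scores(qual)) >= min_mean_q

def sort_pair(r1, r2, min_len=35, min_mean_q=20):
    """Decide which output bin a read pair belongs in.

    Each read is a (sequence, quality) tuple. A pair stays paired only if
    both mates pass; a single surviving mate is kept as an orphan.
    """
    ok1 = passes(r1, min_len, min_mean_q)
    ok2 = passes(r2, min_len, min_mean_q)
    if ok1 and ok2:
        return "paired"
    if ok1:
        return "r1_only"
    if ok2:
        return "r2_only"
    return "dropped"
```

Keeping the orphans in their own bins, rather than mixing them back into the paired files, leaves it visible later which reads were rescued.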

Can you answer one question that I’m confused about? If I am trying to rescue both forward and reverse reads, is there any reason to reverse complement the R2 reads before concatenating them with the R1 reads?
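To make the question concrete, here is a minimal Python sketch of what reverse complementing a single FASTQ record involves. The function names are hypothetical. The one detail that is easy to miss is that the quality string has to be reversed along with the sequence so that each score stays attached to its base. Note also that in paired-end sequencing R2 is read from the opposite end of the fragment, so it only mirrors R1 where the two reads overlap. Whether the step is needed at all depends on the downstream tool; many aligners try both strands of an unpaired read by default.

```python
# Translation table for DNA bases; N stays N, case is preserved.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def reverse_complement_record(name, seq, qual):
    """Reverse complement one FASTQ record.

    The quality string is reversed (not complemented) so that every
    score remains aligned with the base it describes.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return name, reverse_complement(seq), qual[::-1]
```

Applying `reverse_complement_record` twice returns the original record, which is a quick sanity check on any implementation of this step.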

jennaj · November 19, 2024, 10:34pm

Do you mean to attempt to recreate a pair? You will need to disclose what you have done in the sample notes, and find out whether others will be okay with (or in some situations expect) data pre-processed that way. The people at that archive would be the best ones to advise you through this, especially if what happened is common for that species. At a minimum you would need to disclose what you did in the sample description, in the way other people usually disclose the same process, yes?

If this was just data for yourself, you could of course do whatever you want and see what happens, then explain in your publication of the results, but when publishing data to a sequence archive, I think there are larger considerations. Whoever is later using those reads needs to know where they came from.

mycojon · November 19, 2024, 11:10pm

Hi Jennifer,
This data is for personal use at this time, though from what I have found so far, the right people may have an interest in my findings. For example: 1. A person/animal with a malaria infection has a higher likelihood of coinfection with Mycobacterium as well as numerous other bacterial species. 2. Malaria is not considered contagious, yet I can show through the numerous samples that I’ve done that it is indeed contagious person-to-person as well as person-to-animal (canine anyway).

Patient #1 contracted an unknown infection from surgery at a hospital. Doctors refused to try to diagnose it. 16S sequencing shows a wide range of abnormal bacteria in blood and urine. Shotgun sequencing shows a massive amount of Plasmodium ovale (maybe smaller amounts of others), as well as massive amounts of atypical Mycobacterium and other bacteria, in whole blood and in CSF leaking from the nose of Patient #1. The same was found in the spouse and the now deceased dog of Patient #1. Repeat testing 6 months later showed the same infection in Patient #1, and now also in another dog belonging to Patient #1 and in both parents of Patient #1 (living separately). So I’m dealing with a highly contagious disease that doctors are completely ignoring.

As for recreating a pair from a single read, I have heard the question asked before, and the response was that it’s not possible. Though from what I have done so far, it does seem to work with the method that I used. For experiments that sequence a single species, it makes sense to discard anything of lower quality. In my exercise, I want to know EVERYTHING that’s in the sample, as well as possible.

mycojon · November 25, 2024, 10:27pm

Hi Jennifer, I have some confusion on some results that I’m getting.

So I already know that I have Mycobacterium in my samples, though I’m having a difficult time determining the exact species and counts. I don’t know exactly how Kraken2 works, but I’m assuming that even without host removal my Kraken2 output should not show much of anything?

This is without any prior host removal, but with adapters cleaned and Trimmomatic MAXINFO Min 35. When I run Kraken2 on the Mycobacterium V1 database, it comes up with quite a list of species. I’m assuming that when the Kraken2 report lists side-by-side counts that are the same, both reads match the reference (paired reads).
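For reference, a standard Kraken2 report has six tab-separated columns: percentage of reads in the clade, number of reads in the clade rooted at the taxon, number of reads assigned directly to the taxon, rank code, NCBI taxonomy ID, and an indented name. If the two side-by-side counts are columns 2 and 3, equal values would mean that no reads were assigned to child taxa, which does not by itself say anything about pairing. The Python sketch below parses such a report and lists the top species by clade reads. It assumes the plain six-column layout (no minimizer columns), and the function names are hypothetical.

```python
def parse_kraken2_report(lines):
    """Parse a standard six-column Kraken2 report into a list of dicts."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip blank lines or reports with a different layout
        pct, clade, direct, rank, taxid, name = fields
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),    # reads in the clade rooted here
            "direct_reads": int(direct),  # reads assigned exactly to this taxon
            "rank": rank,
            "taxid": taxid,
            "name": name.strip(),         # names are indented by tree depth
        })
    return rows

def top_species(rows, n=3):
    """Return the n species-level rows (rank code 'S') with the most clade reads."""
    species = [r for r in rows if r["rank"] == "S"]
    return sorted(species, key=lambda r: r["clade_reads"], reverse=True)[:n]
```

The difference `clade_reads - direct_reads` for a species row is the number of reads that went to taxa below it, such as strains.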

So I just take the output for the Mycobacterium V1 database and select the top 3 Mycobacterium species identified:

Species                    Kraken2 Output         Bowtie2 Default end-to-end
Mycobacterium 1100029.7    322,878 x 322,878      938,578 x 938,578
Mycobacterium 1554424.7    231,090 x 231,090      1,028,371 x 1,028,371
Mycobacterium Angelicum    121,732 x 121,732      1 x 1

Two of the highest counts are Mycobacterium 1100029.7 and Mycobacterium 1554424.7. Running Bowtie2 in default end-to-end mode gives me Mycobacterium 1554424.7 (1,028,371 paired reads) and Mycobacterium 1100029.7 (938,578 paired reads). I know that NCBI lists both of these species as contaminated, but I’m confused by this, since I know that some species have actual high similarity to human DNA. When I do host removal with Bowtie against Hg38, it removes a considerable number of these reads, and CHM13 T2T removes even more.

What is the correct answer on 1100029.7 and 1554424.7? Do I use the host-removed reads, or is the host removal removing actual Mycobacterium reads?
On Mycobacterium Angelicum, shouldn’t I be picking it up with Bowtie end-to-end, or should I look at a local alignment?
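One way to look into the host-removal question above is to check how many of the reads Kraken2 assigned to a given taxon are the same reads that host removal discards. The Python sketch below assumes the standard Kraken2 per-read output (tab-separated: C/U flag, read ID, taxonomy ID, length, LCA mapping) with plain numeric taxonomy IDs, and a list of host-aligned read IDs obtained separately. The function names and the way the host IDs are collected are assumptions, not something described in the thread.

```python
def reads_assigned_to(kraken_lines, taxids):
    """Collect IDs of classified reads whose assigned taxid is in `taxids`."""
    wanted = {str(t) for t in taxids}
    ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "C":
            continue  # skip malformed lines and unclassified reads
        if fields[2] in wanted:
            ids.add(fields[1])
    return ids

def host_overlap(assigned_ids, host_ids):
    """Return (count, fraction) of assigned reads that also aligned to the host."""
    shared = assigned_ids & set(host_ids)
    fraction = len(shared) / len(assigned_ids) if assigned_ids else 0.0
    return len(shared), fraction
```

A high fraction would suggest that most of the reads behind that taxon also look like host sequence. The sketch only measures the overlap; it does not decide which interpretation is correct.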

Thanks!

(admin redacted for privacy)

jennaj · November 26, 2024, 1:12am

Hi @mycojon

For scientific advice (and that is where you are now, determining how to interpret your results), the Galaxy resources we have are the people who are part of our larger community. To find those working with microbial data, use the link to the Matrix chat at the top of the tutorials here: Microbiome / Tutorial List

You can share the link to the information you posted here for extra context. Hope this works out!