Bowtie2 Mapping Issues

So I have a species that I’m trying to remove from my reads, in order to only do assembly with reads that map well with my species of interest.

My first step was just to get clean reads of adapters with minor trimming. Followed by Overlap Error Correction with BBMerge and Tadpole. This should give me error corrected reads, in order to just extract reads that match maybe 95% or better. These are 2x150 illumina reads, with many that are relatively short.

Now to get the reads for assembly, I first mapped with BWA-Aln as it’s a global aligner for very short reads so I can get all my reads in the 35-100bp, with high percentage match. This went fine! So now I want to make sure I have the longer reads that match well too, so for this I used Bowtie2 Default and was checking the results. I was checking reads from the unmapped reads and running a few on NCBI Blast, only just picking the first read (forward and reverse), blast reported 100% Match for Length and ID. Obviously this should have clearly gone into my mapped reads group.

So what went wrong??? Is there a better program or Bowtie2 setting in order to make sure that I’m only getting the high matching reads to the reference?? Please don’t suggest Bowtie2 Local, as I find those results kinda worthless as it will match if only a small portion of a 150bp read matches the reference.

Hi @Jon_Colman

The NGS aligners like BWA and Bowtie are doing more than finding the good matches. They are looking for unique and exact matches to the same species’ genome and for the most part can handle short reads without providing too many non-specific hits. If a short read is matching in too many places (“multi-mapper”), it won’t even capture a reported hit.

BLASTN is mostly just looking for a match. That could be in multiple places with some settings, and weaker, and can be used for cross-species comparisons too. The results can get explosively large with non-specific hits very quickly with short read queries. You are probably not seeing this because of the somewhat defensive build in parameters.

You can review and experiment with the full options with these tools to manipulate this behavior in small ways, but the different mapping tool approaches are fundamentally different. The indexes are different and the algorithms are different. This seems to explain your observations.

The full parameter list is down on the tool forms, and what each do are in the linked tool documentation. And for a good top level explanation of NGS aligners, the slides in this tutorial are one place to start. → Hands-on: Mapping / Mapping / Sequence analysis.

To explore where you reads are mapping, you could load up the BAM and any reference genome annotation that is available for the genome into a genome browser like IGV. This can provide some insight into what might be going on scientifically at the locations your reads are mapping (or not!).

Hope this helps and glad to see you are making some progress! :slight_smile:

Ok, I’ve been trying a different approach, after a few other attempts. I wanted a mapper that should find reads between 35bp-150bp. I didn’t like bowtie2 local, as these reads are already trimmed, and it appears to accept a very short region maybe 30bp in a 100bp-150bp, and call that a match?? Minimap2 appeared to have good reviews, and it’s said to work within all of my read lengths. So initially I’m just trying to map to the species suspected in my samples, and will do any host removal after on these reads. I thought I was good, but then it seemed like it was missing a lot of reads. So my new approach is initially with Minimap2, but then mapping again with the same reference set with BLM-aln. Given that BLM-aln is rated to give 97% matching reads (I believe), these should be closest matching my species. Looking at the reads, there are a LOT of reads mostly in the 35-50bp range, but there were also numerous reads in the 75-100bp range. I would think a 100bp read at 97% match per BWA-aln, that Minimap2 should have found it. They were relatively complex reads not full of repeats. Though I did notice that BWA-aln also picks up a lot of long repeat regions that Minimap2 didn’t find.
So my basic plan was to take my mapped reads and Error Correct them with BBmerge default settings, without actually merging. Then map them to HG38 with BBmap default with minid 97%, and then follow with trying to get them assembled.
Does anyone have any opinions on BBmerge/Tadpole error correction?? I tested it on some reads and it’s night and day difference. So with a set of reads mapped to one species, and assembled with Spades, most of my contigs were in the 300-600 range, with some maybe upto 2000. Then blasting the contigs, the match wasn’t that great, mostly between 94-96%. Then I took the same reads and running BBmerge and Tadpole Error Correction. I reassembled with Spades and I was getting contigs in the 4000-8000 range with a BLAST result usually 98%+ match. Does this make sense???

Yes, shorter reads combined into longer reads can help with assembly. Glad you found a solution!

Hi Jenna,
I have another mapping question, so for my samples all whole blood samples from 6 different but related subjects. I’m using BWA-aln & BWA-MEM and mapping for three separate species (Naegleria Fowleria strain Karachi NF001, Plasmodium Ovale Wallikeri, and Mycobacterium Tuberculosis 76 different Oman strains). What is the likelihood that for each separate species if I BLASTn with somewhat similar sequences (so I believe that’s local alignment), and use Naegleria Fowleri in the species reference to Blast against, that I get Naegleri Fowleri Karachi NF001 hits for all three species, and all 6 subjects??? Does it sound like I most likely have errors in my reads so they don’t map clean to Naegleria Fowleri, rather than having three extremely unusual/rare species from all 6 subjects?? I used three completely different sequencing labs, and two different types of blood collection tubes.
I plan to use the Genome Viewer to look at them later, I’m just trying to segregate by mapping first.

Hi @Jon_Colman

Differentiating between species among your samples is a complicated question! Even if sample variation was confirmed with higher quality calls and deep coverage across samples/collections, it wouldn’t necessarily indicate an entirely different species. You might just have different strains.