I have been having difficulty mapping some eukaryotic species. I hadn’t thought about this issue until now. The species have relatively high similarity to humans, with lots of ambiguous regions. I had assumed I was covered because I was mapping at 99–100% identity against my reference sequences and host reads, but I realized there were a ton of reads matching at 99–100% that weren’t being mapped.
I’m thinking that Bowtie2 or BBMap are my only mapping options, and BBMap is very slow, so it’s Bowtie2. I’m unsure of the correct settings for this. From what I’ve read, -a will get me most of the reads I was missing. I also set the penalty for ambiguous (N) positions to 0.
It seems these other settings should also be changed so that it catches the other reads I’m missing: 1. disable “no mixed”, 2. disable “no discordant”, 3. allow dovetailing. Does this sound correct, or are there other options to change?
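For reference, here is a minimal sketch of the Bowtie2 invocation these settings describe, built as an argument list. The index and file names (`species_index`, `reads_R1.fq`, etc.) are placeholders, not anything from this thread:

```python
# Sketch of the Bowtie2 command under discussion; index/file names are
# placeholders. Mixed and discordant pair alignment are ON by default in
# Bowtie2, so "disabling no-mixed / no-discordant" just means NOT passing
# --no-mixed / --no-discordant.
def bowtie2_args(index, r1, r2, out_sam):
    return [
        "bowtie2",
        "-x", index,          # reference index
        "-1", r1, "-2", r2,   # paired-end reads
        "-S", out_sam,        # SAM output
        "-a",                 # report all alignments, not just the best one
        "--np", "0",          # zero penalty at ambiguous (N) positions
        "--dovetail",         # allow mates to extend past each other
    ]

args = bowtie2_args("species_index", "reads_R1.fq", "reads_R2.fq", "out.sam")
```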
Do you mean that you are trying to get all of the mapping locations for reads that may map with similar identity/coverage to multiple regions of a reference genome?
If yes, then Bowtie2 is probably not the best choice. Bowtie2 is intended to map same-species reads back to a reference genome uniquely. Cross-species mapping is unlikely to work well, especially for non-unique hits. The indexes supporting the tool and the parameters are all intended for this goal. You can certainly play around with them, but it may never get you what you want!
Instead, is there a reason why you are not using BLASTN? We host this in Galaxy too (so you don’t need to go to the NCBI website and cut/paste). Custom genomes work fine, and they can be indexed to speed up the searches. The query and target can be reads or assemblies, genomic (DNA) or expression (RNA) reads, nucleotide or protein sequence, entire sequences or fragments. Basic BLAST is very simple and direct, and one of the original mapping tools: find common bits of sequence according to a match criterion.
Then, once you have the mapping results, the BLAST report itself can be parsed to refine them. Filter on criteria about the HSPs, overlaps with regions of interest, and similar.
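As a sketch of that parsing step, here is how a tabular BLASTN report (`-outfmt 6`, default 12 columns) could be filtered on identity and alignment length. The sample lines and thresholds are made up for illustration:

```python
# Minimal sketch of filtering a tabular BLASTN report (-outfmt 6).
# Assumed column layout (the default): qseqid sseqid pident length
# mismatch gapopen qstart qend sstart send evalue bitscore.
def filter_hits(lines, min_pident=99.0, min_len=100):
    keep = []
    for line in lines:
        f = line.rstrip("\n").split("\t")
        qseqid, sseqid = f[0], f[1]
        pident, length = float(f[2]), int(f[3])
        if pident >= min_pident and length >= min_len:
            keep.append((qseqid, sseqid, pident, length))
    return keep

# Toy example rows, not real data:
sample = [
    "read1\tchr1\t100.000\t150\t0\t0\t1\t150\t500\t649\t1e-70\t270",
    "read2\tchr2\t97.500\t150\t3\t1\t1\t150\t800\t949\t1e-60\t240",
]
hits = filter_hits(sample)
```

Only `read1` survives the 99% identity cutoff here; the same loop can be extended with coordinate checks for regions of interest.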
Let’s start there, but my first instinct here was: why limit yourself to NGS mapping tools intended for same-species use? Are you still trying to perform a contamination screen, to remove human reads from your pool of sample reads? Thanks for explaining more!
What I’m trying to do: several of the species I have already found present have too much similarity to human. So my plan is to initially do a very-sensitive local Bowtie2 run against the group of species, which I believe should capture all potentially aligning reads. At that point I can map the grouped reads that match 100% to my species FASTA, followed by mapping 100% matches to human (hg38 & T2T). Just this should whittle down a lot of reads. Then do the same at about 99.5%, then try to break the rest into roughly 40 Mb groups that I can merge with BBMerge, and remap the merged reads the same way to get at least most of the reads in the 99–100% range. The remaining reads then go through host removal.
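The cascade described above can be pictured as set arithmetic on read IDs. The sets below are toy stand-ins for “IDs of reads that mapped at a given identity”, purely for illustration:

```python
# Sketch of one round of the keep/remove cascade as set arithmetic.
# These IDs are invented placeholders, not real reads.
species_100 = {"r1", "r2", "r3", "r4"}   # mapped 100% to the species FASTA
human_100   = {"r3", "r5"}               # mapped 100% to hg38/T2T

keep_round1 = species_100 - human_100    # unambiguously species at 100%
ambiguous   = species_100 & human_100    # perfect match to both; defer
# ...then repeat at ~99.5%, merge/remap the remainder, finish with host removal
```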
What I was running into was runs with 60 Mb of reads that include ambiguous reads mapping to my species of interest as well as to human. The only other way to remove the rest of the human reads is with a local alignment, which also removes a high percentage of my species reads.
So, a cascade through mappings for keep/remove: some steps determine which reads to keep and others which to remove. If this is all just based on identity/coverage (homology), this is what BLAST is best for. You could try it for your final screen against human: capture the hits, then filter them to determine the final keep/remove. I think this is where you were having an issue before?
Typical workflows for contamination removal examples are at NCBI. Maybe the parameters are interesting? → Contamination in Sequence Databases
You could also review topics like this one → https://www.biostars.org/p/188362/. That forum is very good in general if you want to reach more scientists to discuss current methodologies.
Yes, essentially cascading through mappings. But as I said about the ambiguous mapping issue, I can do host removal, but only down to a complete end-to-end alignment; otherwise local alignment takes my species reads too. With de novo assembly the ambiguous reads can assemble nearly complete reference regions, but in other cases the assembler tries to combine them with the human reads.
On Biostars the recommendation is to use BBSplit or FastQ Screen to map multiple species competitively at the same time, but I don’t see these available on Galaxy.
I did try to decontaminate the reference sequences. I tested by breaking the reference down into 150-base fragments and mapping against human, looking for perfect matches: against hg38 about 0.5% aligns, against T2T about 1.5%. So with perfect alignment it’s not missing a lot.
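The fragmenting step in that test can be sketched as a simple tiling window over the reference sequence. The toy sequence and the function name here are illustrative only:

```python
# Sketch of cutting a reference into fixed-size tiles (e.g. 150 bases)
# that can then be mapped against human to find shared regions.
def fragment(seq, size=150, step=150):
    """Yield (offset, subsequence); step == size gives non-overlapping tiles."""
    for i in range(0, len(seq) - size + 1, step):
        yield i, seq[i:i + size]

ref = "ACGT" * 100            # 400 bp toy reference, not real data
tiles = list(fragment(ref))   # two full 150 bp tiles; the 100 bp tail is dropped
```

A smaller `step` than `size` would give overlapping tiles, which catches junction-spanning matches at the cost of more fragments.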
I did run the FCS adapter and foreign-contamination tool; there were retained adapters in my Naegleria fowleri, Plasmodium, and tuberculosis references, so I took those out. These also had human-similar regions, so that gave me an Excluded and a Clean FASTA. Even using HISAT2 on the Clean FASTA removed quite a lot of reads matching my species at 99%, and HISAT2 is known to be less aggressive. So I just opted to use the references with the adapters removed, since I’m trying to map at near 100% identity.
So back to my first question: I think Bowtie2 will work. It’s my understanding that the -a setting will map the ambiguous reads, given that I’m trying to map at near-exact matches. The other three (disable “no mixed”, disable “no discordant”, allow dovetailing) should give me the reads where only one mate matches and the other has more errors? Does this make sense?
As for how to use BLASTN here the way you are explaining: is there an example you can show me somewhere? I’m assuming from what I see that I can use BLASTN somehow to find which contigs match at 100%, etc.? But I don’t understand how I would use that for reads.
Yes, there are a few metrics in BLASTN reports. Percent identity is just one. The other that will likely be interesting for you is coverage. This will be similar to what Bowtie2 is doing for you. Each hit with a fragment alignment covers a portion of the read, and each read can have multiple HSPs. You could capture the coordinates, merge alignment blocks, and so on to get an idea of whether a read has complete coverage against the genome.
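The coordinate-merging idea above is standard interval merging. A minimal sketch, assuming 1-based inclusive query coordinates (qstart, qend) as BLAST reports them; the example blocks are invented:

```python
# Sketch: merge a read's HSP alignment blocks and compute the fraction
# of the read covered. Coordinates are 1-based inclusive (BLAST-style).
def merged_coverage(blocks, read_len):
    blocks = sorted(blocks)
    merged = []
    for start, end in blocks:
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous block
        else:
            merged.append([start, end])              # start a new block
    covered = sum(e - s + 1 for s, e in merged)
    return covered / read_len

# Toy HSPs on a 150 bp read: (1-80) and (60-120) merge; (130-150) stays apart.
cov = merged_coverage([(1, 80), (60, 120), (130, 150)], 150)
```

A read whose merged blocks span (nearly) its whole length is effectively end-to-end covered, which is the criterion being discussed.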
Or, you can try to experiment with the Bowtie2 hits. I’m not sure what you mean exactly by ambiguous. Do you mean a read that maps to multiple locations with the same statistics (multimapping)? Or one that has a higher edit distance (what I think you mean by adjusting -a)? You can try, but the limit is about 5 over 100 bases if I am remembering correctly. The other options describe how the read-pair ends are allowed to align, not so much the quality of the alignment (what is kept is still “the best unique”) but where: one end aligns and the other doesn’t (mixed – you’ll have these now), the ends align to different regions (discordant – you’ll also have these now), or they overlap past each other (dovetail – this is the only additive one). You could also check the BAM to see if any of those are present already, or post-filter them out (Filter BAM).
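If you do check the BAM, the mixed/discordant categories above can be read straight off the SAM FLAG bits. A sketch using the flag meanings from the SAM specification (0x1 paired, 0x2 proper pair, 0x4 unmapped, 0x8 mate unmapped); the example flag values are illustrative:

```python
# Sketch: classify an alignment record by its SAM FLAG bits into the
# pair categories discussed (mixed vs. discordant).
def classify(flag):
    paired     = flag & 0x1   # read is paired
    proper     = flag & 0x2   # aligned as a proper pair
    unmapped   = flag & 0x4   # this end is unmapped
    mate_unmap = flag & 0x8   # the mate is unmapped
    if paired and not unmapped and mate_unmap:
        return "mixed"        # this end aligned, its mate did not
    if paired and not unmapped and not mate_unmap and not proper:
        return "discordant"   # both ends aligned, but not as a proper pair
    return "other"

c1 = classify(73)   # 0x49: paired + mate unmapped (+ strand bit)
c2 = classify(65)   # 0x41: paired, both mapped, not a proper pair
```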
Oh yes, actually multimapping reads. My understanding of the -a setting is that -a will report all multimapped reads, and there are a lot in the Plasmodium/Naegleria fowleri genomes. Just from what I have identified so far, I have Naegleria fowleri Karachi NF001, Plasmodium ovale (both POW1 & POW2), and Mycobacterium tuberculosis BG24 (these are in all of my blood samples at different levels, but there are still more species; I also have several strains of Pseudomonas A.). I’ve confirmed these by de novo assembly. Just taking these four and mapping at 100% match, some of my runs have 50 Mb of 150x2 reads, so many millions. My thought was to eventually get all of the reads and assemble them, even the reads that align to human, which should give me a more complete assembly per genome. Then, to determine read counts per genome, map the reads at 99–100% to each genome, then subtract the human reads that map at 100%. This should give me total reads without bias from the human reads.
So I’m currently mapping with the settings to get all of the reads. It will probably run overnight before it’s done; I don’t mind taking extra time if it’s finally going to get me through this.