I am currently trying to align some short-read samples against a viral genome using the HISAT2 aligner. I was getting extremely low alignment rates, and it was recommended that I try altering the scoring settings to allow for more mismatches, to reflect the higher viral mutation rate. I chose some arbitrary values, ran the aligner a few times, and got anywhere from <.01% to ~24% alignment rate. After doing some literature review I found a suggested allowed mismatch of 2-3 bases per read for viruses, but HISAT2’s scoring options only allow tweaking the score value of a mismatch, not the number of mismatches allowed. Does anyone with more experience with HISAT2 have an idea of how to tweak the settings to accomplish this?
There are quite a few aspects to your question, and I’m not sure I can address all of them, but I’ll try:
"After doing some literature review I found a suggested allowed mismatch of 2-3 bases per read for viruses"
That seems like strange advice, since such a value would obviously depend on the sequencing technology used (the expected error rate of the sequencing platform and the length of the sequenced reads) and on the virus under study (how divergent different isolates of that virus are).
"managed to get anywhere from <.01% to ~24% alignment rate"
That raises the question of what the nature of your sample is. If it is, for example, a direct patient isolate from human body fluid or tissue, most of the reads in it would come from the host, and you might want to clean the input data (remove host reads) before proceeding. If the reads come from PCR-amplified material obtained using virus-specific primers (amplicon data), you would expect nearly 100% of your reads to be mappable to the viral genome. So to know what an expected outcome or a good strategy looks like, we need more info about your data and its origin.
The HISAT2 parameter that corresponds most closely to what you are asking for is likely --mp (“Maximum mismatch penalty” under “Scoring Options” in Galaxy), but I’d recommend you start tweaking aligner settings only after you have addressed the two points above (where the 2-3 mismatch recommendation comes from, and what kind of sample you have). Aligner default settings are usually chosen rather carefully to give good results across a wide range of input data, and chances are high that you’ll end up with settings worse than the defaults.
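To make the link between a per-mismatch penalty and a "mismatches per read" figure more concrete, here is a rough sketch of the arithmetic. It rests on my reading of the HISAT2 manual, so please verify it against your version: I am assuming the defaults are --mp 6,2 and a minimum-score function --score-min L,0,-0.2 (i.e. f(x) = 0 + -0.2 * x, with x the read length), that a perfect end-to-end alignment scores 0, and that every mismatch costs the maximum penalty (high base quality). Gaps, Ns and soft clipping are ignored. The function names are mine, not part of HISAT2.

```python
from fractions import Fraction


def min_score(read_len, const="0", coeff="-0.2"):
    """Assumed linear minimum-score function f(x) = const + coeff * x.

    x is the read length; const and coeff mirror the two numbers in
    an option string like 'L,0,-0.2'. Fractions avoid float rounding.
    """
    return Fraction(const) + Fraction(coeff) * read_len


def max_mismatches(read_len, mismatch_penalty=6, const="0", coeff="-0.2"):
    """Upper bound on mismatches a read can carry and still pass the threshold.

    Assumes a perfect alignment scores 0 and each mismatch subtracts
    `mismatch_penalty` (the MX value of --mp); nothing else is penalized.
    """
    budget = -min_score(read_len, const, coeff)  # total penalty the read can absorb
    if budget < 0:
        # Threshold above the best possible score: no mismatch is tolerated
        # (and under these assumptions not even a perfect read would pass).
        return 0
    return int(budget // mismatch_penalty)


if __name__ == "__main__":
    for length in (50, 100, 150):
        print(length, max_mismatches(length))
```

Under these assumptions a 100-base read tolerates 3 full-penalty mismatches with the defaults, which is already in the range of the 2-3 figure you found. Lowering the mismatch penalty or making the minimum-score function more permissive raises that number, at the cost of more spurious alignments.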
There are alternatives to HISAT2, like bwa-mem, bowtie2 and minimap2 (the first two should be good choices for Illumina-sequenced data, the last is preferable for long reads from Oxford Nanopore or PacBio sequencers). If you’re really interested in optimizing the mapping step, it would make sense to compare results across different mappers first, and only then think about tweaking the defaults of one particular program.
Regarding cleaning your data to get rid of host reads, there’s this tutorial on how to do this in Galaxy:
It uses SARS-CoV-2/human data as an example, but should be applicable to other virus/host combinations as well.