HISAT2 Virus Alignment Settings

A_Blanchard · October 27, 2021, 9:18pm

I am currently trying to align some short-read samples against a viral genome using the HISAT2 aligner. I was getting extremely low alignment rates, and it was recommended that I try altering the scoring settings to allow for more mismatches, to reflect the higher viral mutation rate. I chose some arbitrary values and ran the aligner a few different times and managed to get anywhere from <.01% to ~24% alignment rate. After doing some literature review I found a suggested allowed mismatch of 2-3 bases per read for viruses, but HISAT2’s scoring options only allow tweaking the score value of a mismatch, not the number of mismatches. Does anyone with more experience with HISAT2 have an idea of how to tweak the settings to accomplish this?

wm75 · October 28, 2021, 10:11am

Hi @A_Blanchard,
there are quite a few aspects to your question and I’m not sure I’m able to address all of them, but I’ll try:

1. “After doing some literature review I found a suggested allowed mismatch of 2-3 bases per read for viruses”

   That seems like strange advice, since such a value would obviously depend on the sequencing technology used (expected error rate of the sequencing platform, length of the sequenced reads) and on the virus under study (how divergent different isolates of that virus are).

2. “managed to get anywhere from <.01% to ~24% alignment rate”

   That raises the question of what the nature of your sample is. If it is, for example, a direct patient isolate from human body fluid/tissue, most of the reads in there would come from the host, and you might want to clean the input data (remove host reads) before proceeding. If the reads come from PCR-amplified material obtained using virus-specific primers (ampliconic data), you would expect nearly 100% of your reads to be mappable to the viral genome. So to know what an expected outcome or a good strategy looks like, we need more info about your data and its origin.

3. The HISAT2 parameter that corresponds most closely to what you are asking for in your question is likely --mp (“Maximum mismatch penalty” under “Scoring Options” in Galaxy), but I’d recommend you start tweaking aligner settings only after you have addressed points 1 & 2 above. Aligner default settings are usually rather carefully chosen to give good results across a rather wide range of input data, and chances are high you’ll change them to worse settings than the default.

4. There are alternatives to using HISAT2, like bwa-mem, bowtie2 and minimap2 (the first two should be good choices for Illumina-sequenced data, the last is preferable for long reads from Oxford Nanopore or PacBio sequencers). If you’re really interested in optimizing the mapping step, it would make sense to start by comparing results across different mappers, and only then tweak the defaults of one particular program.
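
To make the --mp point a bit more concrete: HISAT2 has no direct "allow N mismatches" switch. Instead, the number of mismatches a read can carry falls out of the per-mismatch penalty (--mp MX,MN) together with the minimum alignment score (--score-min). The Python sketch below is only a back-of-the-envelope estimate. It assumes the defaults are --mp 6,2 and --score-min L,0,-0.2 (that is my reading of the HISAT2 manual, so please check it against your version), that a perfect alignment scores 0, and that every mismatch costs the full MX. It ignores gaps, Ns, soft-clipping and the reduced penalty for low-quality bases. The function name mismatch_budget is just made up for this illustration.

```python
def mismatch_budget(read_len, mx=6, intercept=0.0, slope=-0.2):
    """Rough upper bound on full-penalty mismatches a read can carry.

    Assumptions (not taken from the thread, verify against the manual):
      * --score-min is a linear function L,<intercept>,<slope>
      * a perfect alignment has score 0
      * each mismatch costs the maximum penalty MX from --mp MX,MN
    """
    # Minimum score a valid alignment must reach (zero or negative).
    min_score = intercept + slope * read_len
    # Total penalty the alignment may accumulate before it is rejected.
    budget = max(0.0, -min_score)
    # Small epsilon guards against floating-point noise in slope * read_len.
    return int((budget + 1e-9) // mx)


if __name__ == "__main__":
    for length in (50, 100, 150):
        print(length, "bp read:", mismatch_budget(length), "mismatches")
```

Under those assumptions a 100 bp read has a penalty budget of 20, so it could carry 3 full-penalty mismatches, which would mean the defaults are already in the range of the "2-3 bases per read" suggestion for reads of that length. Lowering MX or making the --score-min slope more negative raises the budget, but as said above, I would look at the nature of the sample before touching either.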

wm75 · October 28, 2021, 10:15am

Regarding cleaning your data to get rid of host reads, there’s this tutorial on how to do this in Galaxy:

it uses SARS-CoV-2/human data as an example, but should be applicable to other virus/host combinations as well.
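
The principle behind host-read removal can be sketched in a few lines of Python: map all reads against the host genome first, then keep only those that did not align and use them for the viral mapping. The sketch below works on plain SAM text and relies only on the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function name non_host_read_names is hypothetical, and this is not how the tutorial implements the step. In practice a tool such as samtools view with a flag filter (-f 4, or -f 12 for pairs) would do the same job far more efficiently.

```python
PAIRED = 0x1         # SAM FLAG bit: read is part of a pair
UNMAPPED = 0x4       # SAM FLAG bit: this read did not align
MATE_UNMAPPED = 0x8  # SAM FLAG bit: the mate did not align


def non_host_read_names(sam_lines):
    """Return the names of reads (or pairs) that did not align to the host.

    `sam_lines` is an iterable of SAM-formatted text lines produced by
    mapping the sample against the host reference. For paired data a
    pair is kept only if both mates are unmapped.
    """
    keep = set()
    for line in sam_lines:
        # Skip empty lines and header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if not flag & UNMAPPED:
            continue  # read aligned to the host: discard
        if flag & PAIRED and not flag & MATE_UNMAPPED:
            continue  # mate aligned to the host: discard the pair
        keep.add(name)
    return keep
```

The returned names could then be used to pull the corresponding reads out of the original FASTQ files before mapping them to the viral genome.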
