Problem with HISAT2 detection of reads mapping to repeats

I am analyzing RNA-seq data from the Kaposi’s sarcoma herpesvirus (KSHV) and have done most of my work using TopHat2. I started using HISAT2 since that is now preferred in online Galaxy. The virus has repetitive regions that are in critical mRNAs, and I have been quantitating reads and working out novel spliced transcripts using TopHat2 and HTSeq. TopHat2 correctly identifies the highly expressed 50 bp reads (good mapping quality and Phred scores) aligning within the repetitive regions (visualized in IGV and counted by HTSeq); I only allow a read to map once in the genome. I have run HISAT2 on the same data sets and get zero reads mapping to the repetitive region, both visually in IGV and in the HTSeq read count. I have tried everything I can think of to detect these reads using HISAT2 but have not been successful. It is almost as if HISAT2 does not call a read valid if it matches exactly to multiple sequences in the repetitive region, even though the mapping quality (50) and Phred (37) scores are good. I don’t think this is an IGV problem, since no reads are counted by HTSeq either. Is there anything in the HISAT2 options that I can change to detect the reads in the repetitive region?

Welcome @tmrose536

I would suggest trying RNA-STAR to map instead.

HISAT2 will probably fail if you attempt to capture a large number of repetitive hits, but you can try. Some how-to advice, if interested:

HISAT2 Alignment tuning options

Try adjusting the -k setting. Find this setting on the HISAT2 tool form under Advanced Options > Reporting options > Primary alignments.
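
For anyone running HISAT2 on the command line rather than through the Galaxy tool form, here is a minimal sketch of how the same -k experiment could be scripted. It only builds the command lines; the index basename `kshv_index` and the file names are hypothetical placeholders, and the example assumes single-end reads. The options used (`-x`, `-U`, `-k`, `-S`) are documented in the HISAT2 manual.

```python
import shlex


def hisat2_command(index_prefix, reads_fastq, out_sam, k=5):
    """Build a single-end HISAT2 command line with an explicit -k value."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [
        "hisat2",
        "-x", index_prefix,  # basename of the HISAT2 index (hypothetical)
        "-U", reads_fastq,   # unpaired reads in FASTQ format (hypothetical)
        "-k", str(k),        # upper limit on distinct alignments searched per read
        "-S", out_sam,       # SAM output file
    ]


if __name__ == "__main__":
    # Print one command per -k value so the outputs can be compared side by side.
    for k in (1, 5, 1000):
        cmd = hisat2_command("kshv_index", "reads.fastq", f"hisat2_k{k}.sam", k=k)
        print(shlex.join(cmd))
```

Each printed command can be pasted into a shell, or the list passed to `subprocess.run`, on a machine where HISAT2 is installed. Large -k values may need far more memory and runtime than the defaults.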

Also, review Advanced Options > Alignment options.

HISAT2 manual https://ccb.jhu.edu/software/hisat2/manual.shtml

Most, but not all, options have been implemented on the tool form. Others are fixed values and cannot be changed. Jobs with those options modified would always fail given the resources at https://usegalaxy.org. Other public Galaxy servers may or may not use those same defaults. If you set up HISAT2 on your own Galaxy server and allocate enough resources, those defaults could be changed.

If you change any of these default settings, it can create a job that will run out of resources (memory or runtime) at a public server. You’ll need to experiment if you decide to stick with HISAT2.

Note: It is not possible to report secondary alignments in HISAT2 as wrapped by Galaxy, and some of your sequence hits may fall into that group.

Thanks!

Hi, the problem is that I am not trying to identify all the different places that a single read would align within the repetitive region in the KSHV genome, which is what changing the -k setting would appear to do. I am using a -k setting of “1” so that a read will only align once in the genome. The problem is that HISAT2 doesn’t detect a single instance of a read aligning to a repeat region, even though the read has all the attributes of a valid read. With -k set to “1” I should have no problems with resources, and all of the reads aligning to the repeat region should be mapped (once), which they aren’t. I tried -k of “5” and got the same problem. I think this is a serious issue with HISAT2, or at least a serious issue with the defaults used in the Galaxy instance. I can try RNA STAR, but that still leaves HISAT2 with a major functional problem. Since I always look at the read alignments using IGV, I see evidence of this problem that those not examining the specific read alignments would not see.

Ah ok, thanks for explaining. I thought that you were not getting enough hits, rather than none at all.

It sounds like your unmapped reads generate hits that are not considered as “primary” by the tool. It was designed to not report any hits at all if the number of exact matches for a particular read is very high (expected with repeats).

Try RNA-STAR instead.

Sorry, I am only getting errors running RNA STAR with the exact same bowtie data sets that work with TopHat2 and HISAT2. The error says that an error occurred setting metadata. I attempted to retry autodetection but had no luck. It is not clear how to set this manually, as indicated in the error message. Any help would be appreciated.

Hi @tmrose536,

Could you post a read pair or region that you think TopHat2 is aligning correctly and HISAT2 is not?
(You can extract a region into a BAM file in IGV: right click -> export alignment.)

Just to make sure, is your mapping quality threshold 0 in IGV? By default IGV does not show multi-mapped reads (MAPQ < 4), and it is possible TopHat2 just doesn’t assign proper MAPQ values.
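
One way to check the MAPQ question independently of IGV is to tally the MAPQ values of the alignments in the region straight from the SAM text. The sketch below is only an illustration: the reference name `KSHV` and the coordinates in the usage note are hypothetical, and it uses the leftmost alignment position (SAM column 4) as a simplification rather than the full aligned span. It may also be worth checking the counting step, since as far as I know htseq-count skips alignments below its own minimum alignment quality (`-a`, default 10).

```python
from collections import Counter


def mapq_histogram(sam_lines, chrom, start, end):
    """Tally MAPQ values of mapped reads whose leftmost position is in chrom:start-end (1-based)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):      # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:          # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x4:                # read is flagged as unmapped
            continue
        rname, pos, mapq = fields[2], int(fields[3]), int(fields[4])
        if rname == chrom and start <= pos <= end:
            counts[mapq] += 1
    return counts
```

For example, after converting the BAM to SAM, `mapq_histogram(open("tophat2.sam"), "KSHV", 24000, 25000)` and the same call on the HISAT2 output would show whether the region is empty or only populated by low-MAPQ alignments.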

Hi, yes mapping quality in IGV is 0.
Here is an image of IGV showing the TopHat2 alignment (top) and HISAT2 alignment (bottom) of the same data; the region of interest is indicated. TopHat2 is limited to show only 1 alignment. HISAT2 is set to -k 1. I have exported this region and will send the file by email, since I don’t see a way to attach the file in this reply box.

Hi, my email reply to your message with the bed file did not go through, as the return email address in your email was not accepted. I tried the upload link in this box, but it only accepts image files. If you send me an email with a distinct return address, I will send you the bed file.
Thanks

Hi, it is an old thread, but I came across it as I googled exactly the same problem. I observed that highly repetitive matches, even perfect ones, are not reported (using command line hisat2 with the default parameter -k 5). If I set -k to a high number, e.g. 1000, then the read is aligned. The manual says: "That is, if -k 2 is specified, HISAT2 will search for at most 2 distinct alignments." Maybe with -k 2, if there are more than 2 alignments, none will be reported? This behaviour is confusing. Any ideas?
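
To put numbers on this, one could run the same reads with a few -k values and summarize each SAM output: how many reads come back unaligned, and how many alignment records each aligned read gets. This is a rough sketch that assumes single-end reads and unique read names; the file names in the usage note are hypothetical.

```python
from collections import Counter


def alignment_summary(sam_lines):
    """Return (number of unaligned reads, Counter mapping alignments-per-read to read count)."""
    per_read = Counter()   # read name -> number of alignment records reported
    unaligned = set()      # names of reads flagged as unmapped
    for line in sam_lines:
        if line.startswith("@"):      # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:          # not a complete alignment record
            continue
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:
            unaligned.add(name)
        else:
            per_read[name] += 1
    return len(unaligned), Counter(per_read.values())
```

Calling `alignment_summary(open("hisat2_k5.sam"))` and `alignment_summary(open("hisat2_k1000.sam"))` and comparing the unaligned counts would show whether reads that are dropped at a low -k reappear with many alignments at a high -k.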