Hisat2 Error on Bowtie2 output

HISAT2 errors out when run on Bowtie2 output. Not sure what the issue is?

https://usegalaxy.org/u/squidly/h/phoebe-15

Hi @Jon_Colman

The job parameters seem to be creating an intermediate output that is exceeding the temporary data storage space on the cluster node processing the job (different from the storage in your account). I know that you are looking for weaker matches, but if the match criteria are overly non-specific, this can happen (with any mapping or comparison tool!).

As a guess, setting Ambiguous read penalty to 0 is at least contributing to the situation. Reference genomes have many gap N characters! Attempting to bridge over these in really permissive ways could certainly cause the hashing data to grow before the other filters have a chance to be applied.

You could try changing parameters or breaking up your query into smaller chunks to see if either of those helps. Maybe you can isolate the reads leading to the ramp-up and eliminate them while you investigate the others. You could also try at UseGalaxy.eu, since their clusters allocate/scale resources in a different way, but I think you'll still need to experiment with parameters/query size to find a balance.
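If you do go the chunking route outside Galaxy, here is a rough Python sketch of splitting a fastq into fixed-size pieces (the chunk size and filenames are placeholders, not specific recommendations):

def split_fastq(path, records_per_chunk=1_000_000):
    # Write every N four-line fastq records to a new numbered output file.
    with open(path) as fh:
        chunk = count = 0
        out = open(f"{path}.chunk{chunk:03d}", "w")
        while True:
            record = [fh.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            if count == records_per_chunk:
                out.close()
                chunk += 1
                count = 0
                out = open(f"{path}.chunk{chunk:03d}", "w")
            out.writelines(record)
            count += 1
        out.close()

split_fastq("query_reads.fastq")  # hypothetical filename

Each chunk can then be mapped separately and the results merged afterwards.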

Hope this helps! :slight_smile:

I think I'm getting a better understanding of the issues I've been having. Part of the problem is some of the species in my samples, but I suspect my biggest issue is with adapter removal. I ran a test using reads that mapped 100% to my reference, which should mean no adapters were present. I didn't test every tool, but fastp was still removing a significant part of those sequences. I'm currently trying to redo everything yet again! I last tried Cutadapt and it seemed better, and I'm now trying the AdapterRemoval tool on the EU site; so far it appears to be working.

What I was finding was that even after removing my sequences of interest with Bowtie2 and HISAT2 (both with soft trimming/local mode), plus my best effort to remove all human reads with the same settings, I would still generally have ~125 MB of forward/reverse reads. Running a de novo assembly on those would still assemble a significant amount of the reads into what I had previously tried to remove. I can only think this was due to residual adapters still being present.

My other issue is that I have two sample sets run on a NovaSeq 6000 that have a TON of poly-G contamination. I can remove it pretty well on the 3' end, but it also shows up on the 5' end. My only thought there is to use Cutadapt and assign GGGGGGGGGG as a 5' adapter.

I have no idea what could be wrong with these files. I was working on EU; I first mapped with Bowtie2 twice, against two different reference sets. Then I was going to quality trim the files with Trimmomatic and it throws an error with no explanation. Any ideas? I transferred the data here to try again, with the same issues.

Hi @Jon_Colman

There is a format issue in the fastq data. It seems like the data transfer didn’t complete as expected with the first file. Or, possibly, the upstream mapping job didn’t write out the unmapped reads correctly. These messages make me suspect the second but we can confirm.

Review

I usually start with these two tools. I might also try to compress or uncompress a file to see if it is intact at that higher level (compression status, not file format).

1. Fastq content (also reports about the format)

FastQC reported this. This tool will error when a format issue is detected, then place a message into the logs.

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/corral4/main/jobs/073/514/73514708/tmp -Xmx7373m -Xms256m
Failed to process file display_to_ext_fastqsanger_gz.gz
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Midline '@LH00213:51:22FY3FLT3:2:1113:11272:7617 1:N:0:GGCCAATATT+TATCGGACCG' didn't start with '+' at 4817467
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:179)
at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:129)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:77)
at java.base/java.lang.Thread.run(Thread.java:833)
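For reference, the structural check FastQC is applying here can be sketched outside Galaxy in a few lines of Python (this is not FastQC's own code; the filename is a placeholder). Every fastq record is four lines, and the third line of each record must start with '+':

import gzip

def first_bad_record(path):
    # Walk the file four lines at a time and report where the '+' separator
    # line is missing or out of place.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        line_number = 0
        while True:
            header = fh.readline()
            if not header:
                return None  # end of file, no structural problem found
            seq = fh.readline()   # read only to keep the four-line framing
            plus = fh.readline()
            qual = fh.readline()  # read only to keep the four-line framing
            line_number += 4
            if not header.startswith("@") or not plus.startswith("+"):
                # line_number - 1 is where the '+' separator was expected
                return line_number - 1, header.strip()

print(first_bad_record("unmapped_reads_R1.fastq.gz"))  # hypothetical filename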

2. Fastq format, and optionally paired status

And, Fastq Info reported this. This tool will not error when a format issue is detected, since the entire purpose is to report about the file format. If the input is paired data, it will also check for intact pairs and report the status.

fastq_utils 0.25.1
DEFAULT_HASHSIZE=39000001
Scanning and indexing all reads from /corral4/main/objects/f/4/9/dataset_f4978a79-f9fd-415e-945c-9eb2684c4f74.dat
CASAVA=1.8
100000200000300000400000500000600000700000800000900000100000011000001200000
ERROR: Error in file /corral4/main/objects/f/4/9/dataset_f4978a79-f9fd-415e-945c-9eb2684c4f74.dat: line 4817469: invalid character '+' (hex. code:'2b'), expected ACGTUacgtu0123nN.

3. Compression

Each file can be converted independently with the pencil icon → Datatype → compressed to uncompressed function (second in the list of choices – not the first!). But I usually auto-build a collection then run the operation as a batch, so I did that with yours. To make the collection and results easier to understand, I also renamed each file first.
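Outside Galaxy, the same compression-level check can be sketched in Python (the filename is a placeholder): if the gzip stream is truncated or corrupt, simply reading all the way through it raises an error.

import gzip

def gzip_is_intact(path):
    # Read through the whole compressed stream; any truncation or corruption
    # at the gzip level raises an exception before we reach the end.
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(1024 * 1024):  # 1 MB chunks
                pass
        return True
    except (OSError, EOFError) as err:
        print(f"{path}: {err}")
        return False

print(gzip_is_intact("unmapped_reads_R1.fastq.gz"))  # hypothetical filename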

Both files uncompressed fine. This mirrors what the other two tools reported: each could uncompress the data, then found something wrong with the format around the location where a quality (+) line should be, near these lines:

at 4817467
line 4817469

4. Isolate these lines in the original file

I first used Number Lines (padded to 9 digits), then used Select to pull out those lines and some around them. There are probably many ways to do this, so I'll leave the example in the shared history rather than write out hard instructions. Example: Select first lines then Select last lines would do about the same thing, without requiring a regular expression.
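The same extraction can be sketched outside Galaxy in Python (the filename, target line, and window size are placeholders):

target = 4817467   # the line number the tools reported
window = 12

with open("unmapped_reads_R1.fastq") as fh:  # hypothetical filename
    for lineno, line in enumerate(fh, start=1):
        if abs(lineno - target) <= window:
            # pad to 9 digits, like the Number Lines step
            print(f"{lineno:09d}\t{line.rstrip()}")
        elif lineno > target + window:
            break  # stop once the window has been passed

Either way, this is what came out around that region: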

004817460	IIIIIIII9IIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
004817461	@LH00213:51:22FY3FLT3:2:1113:7334:7617 1:N:0:GGCCAATATT+TATCGGACCG
004817462	GCAGTACAACTCCATGTGGAAGATATTATTACTCTCTAATTTTAAAGACTAGATGATCTATGTGTGTTTTTAGATTTCCTTATAG
004817463	+
004817464	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIII
004817465	@LH00213:51:22FY3FLT3:2:1113:11235:7617 1:N:0:GGCCAATATT+TATCGGACCG
004817466	+
004817467	@LH00213:51:22FY3FLT3:2:1113:11272:7617 1:N:0:GGCCAATATT+TATCGGACCG
004817468	TTTATTGGTTCATTCATTCATTCATTCATGTGCTGAGCACCTACTATGAGCCAGATGCACCAGGTATACTATGATGAACTAAATAGTTCCTTCCTAATGAGAGGAAAGCAGACATCAAATAAA
004817469	+
004817470	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9I9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
004817471	@LH00213:51:22FY3FLT3:2:1113:37374:15132 1:N:0:AGCCAATATT+TATCGGACCG
004817472	TGAGGGACCTGAAGCCCAGAGAGGACCTGGGCCTGGCCCAGAGTTGCCAAGTGGCAGAGCATGGTGGAGAGGGGCCTCCCTGGCCACAGAGCCCAGAAAACATGAGGCTGGAAAGCTGGCCAGCTTCTCCTTGAGCCAGGCAGCTGGA
004817473	+
004817474	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIII9IIIIIIII
004817475	@LH00213:51:22FY3FLT3:2:1113:6067:15132 1:N:0:GGCCAATATT+TATCGGACCG
004817476	CCACCTCAGCCTCCCAAAGAGCTGGGATTACAGGCATGAGCCACTGTGCCTGGCCCTACTTAGGATATTAAACAACAAAATGAAACAAGAAAATATATATATAAAATATAATATACATTATACTATGGGAAGCTAATGTATACATGTA
004817477	+
004817478	II9II9II-III9III9I-9IIII9III-III9III9IIIIIIIIIIIIII9IIIIII-IIIIII9I9I9IIII-IIIIIII9III9-IIIIIIIIIIIIIIIIIIIIIIIIIII-I99III9IIII-IIIIIIII-IIIIIIIII9I
004817479	@LH00213:51:22FY3FLT3:2:1113:6085:15132 1:N:0:GGCCAATATT+TATCGGACCG

My guess was close but not exact. There is a missing sequence! The labels are there but with no content – bases or quality scores.

004817465	@LH00213:51:22FY3FLT3:2:1113:11235:7617 1:N:0:GGCCAATATT+TATCGGACCG
004817466	+

What to do

This may have been a hiccup from the mapping tool when writing this read out but you could also check the original fastq input. Mapping tools are especially tolerant of format issues. Meaning, a sequence could be empty and the mapper may just happily ignore it. This is on purpose – mappers are used to filter out noise in high throughput environments.

Completely empty reads are tricky to remove with some trimming/manipulation tools but others can handle them. I have two tests running to see what happens. Manipulate Fastq (this specific read) and fastp (remove any reads shorter than 1 base).
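For reference, the specific breakage here (a header immediately followed by a bare "+", with no sequence or quality lines in between) can also be cleaned up outside Galaxy. Below is a rough Python sketch, assuming an uncompressed fastq and placeholder filenames; it only handles this particular flavor of truncation plus plain empty reads, so treat it as an illustration rather than a general repair tool.

def drop_broken_records(src, dst):
    kept = dropped = 0
    with open(src) as fin, open(dst, "w") as fout:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            second = fin.readline()
            if second.startswith("+") and not second.strip("+\n"):
                # Header immediately followed by a bare '+' line: the sequence
                # and quality lines are missing, as in the record found here.
                dropped += 1
                continue
            seq = second
            plus = fin.readline()
            qual = fin.readline()
            if seq.strip() and plus.startswith("+"):
                fout.writelines([header, seq, plus, qual])
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept}, dropped {dropped}")

drop_broken_records("unmapped_reads_R1.fastq", "unmapped_reads_R1.clean.fastq")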

fastp is likely the best choice! It should ignore empty reads (and discards them if a length filter is included). This tool can also replace Trimmomatic. I think you were asking about poly-G trimming but I can't remember how that ended up... please know this tool can do that for you if you want to merge your operations together (remove empty/short reads and trim at the same time). I just reread your post: you have this on the 5' end as well. If base-quality trimming does not remove these, then yes, Cutadapt can be used to get those.

Please give this a try and let us know what happens! I would be curious to learn if the original file had the content problem or if this was introduced by the mapping tool when reporting the unmatched reads (and under which conditions).

Hope this helps! Good to dig into this issue and resolve it (hopefully?). I remember you reporting it before but I may have missed the prior shared data that allowed the exploration. Let me know what you think about all of this or if you have followup questions. :slight_smile:

Yeah, I'm at a loss. I had two sets that were done by the same lab; I processed both the same way, yet the issue only shows up in this one.

I actually checked my trimmed reads (adapter and length trimming: trimmed to 150 and removed the first two bases). That set also has the same error.

If you find a way to fix this, please send me the corrected ones.

I'm so exhausted getting these done. I figured out my main issue: fastp, Trim Galore, and Cutadapt were all removing a significant amount of sequence, which left me with 300-500 MB of reads after host removal. I tried the AdapterRemoval tool, and only with that tool, using either the Nextera or Illumina primers, did I get clean reads. So far, this has massively increased the mapping against my reference genomes.

Actually, I have had numerous error issues on these two sets, probably something with the original files. Though the one set did appear to have its issue fixed.

As for the poly-G issue: I have read that this is a problem on NextSeq and NovaSeq 6000. I'm also finding them, though not a lot in my other reads. fastp is generally good, but it will delete the whole pair if one mate is too short. So after BBDuk poly-G removal, Cutadapt worked well, at least on the 3' end. I think the 5' end needs to be an anchored sequence or it will try to find the match elsewhere? So I used 13xG as my adapters.

Yes, that seems possible, especially if a read is empty.

Did you see the change I made in the shared history? Oh, I see now: fastp also noticed that missing read in the warnings but, importantly, it didn't write it to the output, and FastQC didn't find the problem again in the output file. So I think you should be able to use that new output as your forward reads now. You could run that same tool on any other read set you have with problems, as a way to filter out any sequences shorter than one base.

Later, when you run both ends together, the complete pair will be removed from the final outputs (only intact pairs are retained). This is usually important, but you can try to gather those singletons together and use them. Just keep in mind that completely unique content represented by only a single end of a single pair will be challenging to verify as meaningful.
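If you do later want to gather and re-sync reads yourself, here is a rough Python sketch of splitting one end into paired and singleton files by read ID (filenames are placeholders; it holds all R1 IDs in memory, so it only suits modestly sized files):

def read_ids(path):
    # Collect the read IDs (header text before the first space) from one end.
    ids = set()
    with open(path) as fh:
        for lineno, line in enumerate(fh):
            if lineno % 4 == 0:  # header lines
                ids.add(line.split()[0])
    return ids

r1_ids = read_ids("R1.fastq")  # hypothetical filenames throughout

with open("R2.fastq") as fin, \
     open("R2.paired.fastq", "w") as paired, \
     open("R2.singletons.fastq", "w") as single:
    while True:
        record = [fin.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        out = paired if record[0].split()[0] in r1_ids else single
        out.writelines(record)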

fastp can remove 3' poly-G but won't remove it from the 5' end. Cutadapt should be able to do this with a custom expression. And yes, this can be anchored to prevent internal sequence matches.
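What "anchored" means in practice can be sketched in Python (a concept demo only, not what Cutadapt actually runs; the 10-G threshold is a placeholder): only a run of G's sitting at the very start of the read is trimmed, never the same run appearing internally.

import re

five_prime_polyg = re.compile(r"^G{10,}")  # anchored: must begin the read

def trim_5prime_polyg(seq, qual):
    # Trim a leading poly-G run and the matching quality characters.
    m = five_prime_polyg.match(seq)
    if m:
        cut = m.end()
        return seq[cut:], qual[cut:]
    return seq, qual

print(trim_5prime_polyg("G" * 13 + "ACGTACGT", "I" * 21))        # trimmed
print(trim_5prime_polyg("ACGT" + "G" * 13 + "ACGT", "I" * 21))   # left alone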

Rescuing the sequences is tricky, I'm sure! But this part is solved, correct? I only checked the first history, but the second has the same problem. I would try running those files through fastp with the length filter to see if that is enough. If not, try doing what I did to isolate and inspect the lines directly.

I will check tomorrow whether that lets it work. So I'm assuming you didn't include any adapter removal with fastp? It's just curious that Trimmomatic didn't throw them out as unmatched reads.

Yes, it's been awkward trying to rescue reads. I think my plan is to map my genomes at a perfect match and at 99.5% match, in coordination with the same settings against the T2T reference.

Thanks a bunch Jenna, Have a Merry Christmas and Happy New Year!


Correct – I toggled everything off except the length filter, and reduced that down to one base. It was a test to make sure it worked how I expected it to work. Seems that it did but please double check it too.

Well, that read was in an "in-between" state: in the file, but without content. So Trimmomatic died and reported that it couldn't understand it. It seems fastp is better at detecting how to handle it (sort of: it is actually not even filtering, just ignoring the read when writing the output). But you'll need at least one action applied, and the "one base of length" filter was the least that could be done.

I’m sure you are noticing by now how different tools handle the different edge cases. These usually aren’t even documented – they are learned with experience and experimentation – and maybe discussed at forums like this one. :slight_smile:

Wishing you a great holiday too! Thank you :santa_claus:

Yes, I fully understand now. I pick things up pretty well, and I research everything; I was trying to go with what is tried and true. It makes a lot more sense now why, when I previously tried third-party classification, the results were never consistent. First, I think the host removal was too loose. Also, the more adapters a tool carries in its defaults, the more likely it is to remove reads; tie that to a big list of adapters, and it tries to remove all of them even when they are not present.

Appreciate your help!!

Since I didn't see the files that you said you fixed, I tried fastp with only a 1 bp minimum. That brought my read 1 file size from 2 GB down to 757 MB, which didn't seem right unless it got rid of all the reads with no bases and also discarded all the single-end reads.

I believe I found a better workaround for this. I used BBDuk with just the default settings, but had it trim to 1 base instead of discarding reads. This gave me two files with at least 1 bp per read, plus a file of singletons that probably had no bases on one mate. It ran fine, and Trimmomatic is working well now too, so I think this killed the error.

For Read R1

Original read count 41,170,750

Fastp min 1bp only 19,467,076

BBduk min 1bp only 40,894,023

I can't understand why fastp dropped half the reads, but you pointed me in the right direction for sure. I really didn't want to redo that from scratch.
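For anyone wanting to double-check counts like these independently, a fastq read count is simply the line count divided by four; a minimal Python sketch (filenames are placeholders):

import gzip

def count_reads(path):
    # Works for plain or gzip-compressed fastq; each read is four lines.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return sum(1 for _ in fh) // 4

for name in ["R1.original.fastq.gz", "R1.fastp.fastq.gz", "R1.bbduk.fastq.gz"]:
    print(name, count_reads(name))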

Thanks

Hi @Jon_Colman

I hope you are not actually going to read this today, but when you are back:

See my fastp job in that shared history – dataset 50 with the “length-only” tag.

I toggled off all of the other actions, including adapter trimming. This passed all of the reads through except the empty read. In the report (optional, but I always use it), the read counts ignored that read, so the in/out read counts were the same.

I would only use those settings for this specific action, as a sort of pre-cleaning step. It may not even work on pairs, only on single datasets; you'd need to test that. Then the actual QA trimming can use pairs and sync the ends back up again. We are using the tool outside its intended use here, so that we didn't have to cut and paste the file in other, more complex ways.

Hopefully this helps but we can follow up next week! :slight_smile:

Ok, I really have no idea how to get to the shared folder, as nothing shows on my account. But I got there with the link.

So here is my issue: the dataset started out at 1.3 GB with 31,791,306 reads, and after running your settings the output (dataset 50) is 48.7 MB with 1,204,366 reads. Unless I'm missing something?

When I ran BBDuk with the 1 bp minimum, it seemed to fix the issue; I think I read somewhere that BBDuk can fix some errors. Though I just remembered that I forgot to set a minimum size on that output. But it appears I retained all of my reads; I just need to filter by length.

Thanks @Jon_Colman for testing. Yes, I see what you mean. fastp has more default behaviors in play than my testing dataset revealed, and those could impact your reads.

I’m glad you found another way to do this!