BBTools: BBduk advanced options

Bottom line: is there a kmask=lc option for BBDuk on Galaxy? I'm trying to figure out why the matched output from a search for a specific 20-mer sequence with k=19 (no mask) contains 3 reads: one with a 19/20 match (1st base mismatch) and 2 reads that do not appear to have any match. (I would expect k=19 on a 20-mer search to find reads that differ only at the first or last base of the 20-mer, no?) These 2 other reads (151 bp) are identical for the first 88 bp, and in that region there is at best a 12/20 match. There are no reads exactly matching the 20-mer (with k=20 and the 20mer.fa file the output is empty, again with maskmiddle=f).

While searching for help identifying why these 2 reads were output, I found a BBDuk parameter that writes the match in lowercase, kmask=lc, but this does not appear to be an option in the Galaxy instance of BBDuk. Is this true? Thanks for any help.
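One way to sanity-check the k=19 expectation outside of BBDuk is a small exact k-mer lookup. This is a minimal Python sketch, not BBDuk's implementation: the function names are made up for illustration, and the 20-mer is the spike20 sequence quoted further down in this thread. It only tests whether a read shares at least one exact k-mer with the reference or its reverse complement.

```python
# Illustrative only: plain exact k-mer sharing, no masking, no Hamming distance.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all k-mers in seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_exact_kmer(read, ref, k):
    """True if read has a k-mer identical to one in ref or its reverse complement."""
    targets = kmers(ref, k) | kmers(reverse_complement(ref), k)
    return not targets.isdisjoint(kmers(read, k))


SPIKE20 = "CTCCAACCTGCTGCTGCAGT"                            # 20-mer from this thread
first_base_changed = "A" + SPIKE20[1:]                      # toy read, mismatch at base 1
middle_base_changed = SPIKE20[:10] + "A" + SPIKE20[11:]     # toy read, mismatch at base 11

print(shares_exact_kmer(first_base_changed, SPIKE20, 19))   # True
print(shares_exact_kmer(first_base_changed, SPIKE20, 20))   # False
print(shares_exact_kmer(middle_base_changed, SPIKE20, 19))  # False
```

With k=19 a 20-mer contributes only two 19-mers (bases 1-19 and 2-20), so under this simplified model a single mismatch is tolerated only at the first or last base.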

Welcome @joblingm

Yes, I think you are correct! The option doesn’t appear to be included in the current version of the tool wrappers for the BBMap suite.

Comparison

Source BBMap repository → Code search results · GitHub

The flag was added in version 34.70:
BBMap/docs/changelog.txt at a9ceda047a7c918dc090de0fdbf6f924292d4a1f · BioInfoTools/BBMap · GitHub

Galaxy BBTools wrapper repository → Code search results · GitHub

As hosted at UseGalaxy* servers:

BBTools: BBduk decontamination using kmers (Galaxy Version 39.08+galaxy3)

Requirements

  • bbmap (Version 39.08)

  • samtools (Version 1.20)



Since the conda release of BBMap includes the flag, it seems the tool wrapper could be updated! If you would like to request that this is included in a future version, the IUC would be where to create the issue ticket.

Then, for your situation, maybe explore a parameter like this one?

BBTools: BBMap short-read aligner → Mapping Options

Potential mapping sites must have at least this many consecutive exact matches *
0
Zero value ignores (kfilter)

I hope this helps! :slight_smile:

Galaxy Version 39.08+galaxy3

Related to my other question today: I'm looking to extract reads with a high match to a reference 20-mer. I ran BBDuk twice with the same 20-mer reference and am getting "matches" that don't meet the parameters:

k=20, hamdist=1: should find reads with only a single mismatch to the ref? Finds 10 reads in forward.

k=19, hamdist=0, maskmiddle=t: should find reads with a single mismatch at position 1 or 20? (And also a possible mismatch at the middle; this was probably not relevant.) Found only 3 reads (a subset of the above 10 reads).
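To make the maskmiddle expectation concrete, here is a rough Python sketch of middle-base masking for an odd k such as 19. It assumes, based on my reading of the BBDuk help text, that maskmiddle treats the middle base of each k-mer as a wildcard; the function names are hypothetical and this is not how BBDuk stores k-mers internally.

```python
# Illustrative model of maskmiddle=t for odd k: the middle base of every
# k-mer is replaced by "N" on both the reference and the read side.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def mask_middle(kmer):
    """Replace the middle base of a k-mer with a wildcard symbol."""
    mid = len(kmer) // 2
    return kmer[:mid] + "N" + kmer[mid + 1:]


def masked_kmers(seq, k):
    seq = seq.upper()
    return {mask_middle(seq[i:i + k]) for i in range(len(seq) - k + 1)}


def matches_with_maskmiddle(read, ref, k):
    """True if any masked k-mer of read equals a masked k-mer of ref (either strand)."""
    targets = masked_kmers(ref, k) | masked_kmers(reverse_complement(ref), k)
    return not targets.isdisjoint(masked_kmers(read, k))


SPIKE20 = "CTCCAACCTGCTGCTGCAGT"
mismatch_at_10 = SPIKE20[:9] + "A" + SPIKE20[10:]   # toy read, base 10 changed (G to A)
mismatch_at_6 = SPIKE20[:5] + "G" + SPIKE20[6:]     # toy read, base 6 changed (A to G)

print(matches_with_maskmiddle(mismatch_at_10, SPIKE20, 19))  # True
print(matches_with_maskmiddle(mismatch_at_6, SPIKE20, 19))   # False
```

Under this model the two 19-mers of the 20-mer mask bases 10 and 11 respectively, so k=19 with masking tolerates a single mismatch at base 1, 10, 11, or 20, and nothing else.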

So here are the details. A few matches seem OK but several have no homology at all. Can anyone figure out why they found their way into the output match files?

spike20.fa CTCCAACCTGCTGCTGCAGT 20 bases

(inverted seq, i.e. the reverse complement = ACTGCAGCAGCAGGTTGGAG)
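For anyone wanting to double-check the inverted sequence, a few lines of plain Python reproduce it. This is just a convenience sketch; the variable and function names are made up.

```python
# Reverse complement of the spike20 reference using only the standard library.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


spike20 = "CTCCAACCTGCTGCTGCAGT"
print(reverse_complement(spike20))  # ACTGCAGCAGCAGGTTGGAG
```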

Ran BBDuk to find matches, Galaxy defaults except:

k=20, hamdist=1 (1 mismatch?), maskmiddle=f: finds 10 reads (in each set; only forward analyzed here)

22320/1
CTCcAACCTGCTGCTGCAGT
GCTATGACATCTCAAAGAAGCACTTTGATGCAAGCTTCTTTTGGGCCTGTGACGCACCTCTGGGAGCTTCCCTGGGCACATCTGTGAGCCATAGAGATCTGGCTGTAATGGACGTAAATAAAGAAAGAAAA
19/20 match at 1-20; the lowercase c is the mismatch. I can see how this was matched; a duplicate identical read also matched.

25780/1
AAGTCGCGTATTTCGTCTTCTGGAAATCCAGCAGCCGGGGGGCGAACATCGCGGCGCTTCCATGGGAATCTGGCCCCGGGCTCAGAGCGCGGGTAGCTGGCAGAGCCTGGAGGGCGCGGCG
gCTGCAGCAGCAGGTTGGAG
GGCGCGCGGC
Also a 19/20 match to the inv reference near the end of the read, with the first-position mismatch in lowercase. This is the same read as the first read below with k=19, hamdist=0.
These seem to be in agreement with the parameters.

one read has 18/20 match to reference starting at 112 CTCCaAcCTGCTGCTGCAGT (mismatch at 5 and 7 - why was this found?)

another read has 19/20 to inverse of reference, 68-87 single mismatch at 83 - this matches parameters

Two duplicate reads have a 9/10 match to the end of the reference, gctggcaggaCtGCTGCAGT (mismatch at 2nd position), at 49-58; the previous 10 bases, 39-48, all mismatch.

Two other related reads, 7808 and 7822 (also found in the next search), have no homology to the reference 20-mer that I can see visually.

The tenth match, CTCCAACtTGCTGCTcCaat, has a 15/17 match to the start of the reference (mismatches in lowercase) at 30-47: close, but it does not meet the parameters!!
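Checking reads by eye is error-prone, so here is a small Python sketch that scans a read for its best ungapped placement of the 20-mer on either strand and reports the mismatch count. It is an independent brute-force check with hypothetical function names, not a description of how BBDuk searches. The example input is only the tail of read 25780/1 as quoted above.

```python
# Brute-force check: slide the 20-mer (and its reverse complement) along a
# read and keep the placement with the fewest mismatches.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_placement(read, ref):
    """Return (mismatches, 1-based start, strand) of the best ungapped placement,
    or None if the read is shorter than the reference."""
    read, ref = read.upper(), ref.upper()
    best = None
    for strand, target in (("+", ref), ("-", reverse_complement(ref))):
        for start in range(len(read) - len(target) + 1):
            d = hamming(read[start:start + len(target)], target)
            if best is None or d < best[0]:
                best = (d, start + 1, strand)
    return best


SPIKE20 = "CTCCAACCTGCTGCTGCAGT"
# Last 6 bases before the match, the 20-base match, and the final 10 bases of 25780/1.
tail_25780 = "GCGGCG" + "GCTGCAGCAGCAGGTTGGAG" + "GGCGCGCGGC"

print(best_placement(tail_25780, SPIKE20))  # (1, 7, '-')
```

Running the same function over the full 7808/1 and 7822/1 reads would show how far their best placement is from the k=20, hamdist=1 threshold.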

Repeated BBDuk on the same read-pair set with k=19, hamdist=0, maskmiddle=t: finds 3 reads in each set, all of them among the 10 reads from the first search above.

25780/1
AAGTCGCGTATTTCGTCTTCTGGAAATCCAGCAGCCGGGGGGCGAACATCGCGGCGCTTCCATGGGAATCTGGCCCCGGGCTCAGAGCGCGGGTAGCTGGCAGAGCCTGGAGGGCGCGGCG
gCTGCAGCAGCAGGTTGGAG
GGCGCGCGGC
The lowercase g at 122 marks the mismatch; the next 19 bp match the inverse of the spike20 seq.

The other two matches are related to each other: the first 88 bp of one are 100% identical to the first 88 bp of the next read (I assume the "match" is in this region).
91% identity overall; only 11 of the next 63 differ.

7808/1
ATGTACAAATGTGCCGAGTGCGGCAAGTCCTTCAAGGGCTCCTCCGGGCTGCGCTACCACCTGCGGGACCACACGGGCGAGCGGCCCT
ACCAGTGTGGCGAGTGCGGCAAGGCCTTCAAGCGCTCCTCCCTGCTGGCCATCCACCAGCGGG
Can't find any match to reference spike20 or inv.

7822/1
ATGTACAAATGTGCCGAGTGCGGCAAGTCCTTCAAGGGCTCCTCCGGGCTGCGCTACCACCTGCGGGACCACACGGGCGAGCGGCCCT
CCCACTGTCGCCCCTGCCTCAAGGCCTTCAAACACTCCTCCCCGCTGGTGATGCAGGCCCGGG
Can't find any match to spike20 or inv.

Any help would be appreciated!

Thanks for the prompt response. Not finding a perfect match answers my question for this data set, but I can't figure out why BBDuk "matches" reads that have poor or even no recognizable homology to a 20-mer with k=19 or 20 and hdist=0 or 1. I'm not sure BBMap would help me either!

Hi @joblingm

We can't see all of your BBDuk parameters, but there are more parameters that control how much each side of the match (ref versus reads) can deviate and how (mismatch versus indel). I assume by hamdist you mean hammingdistance, but this is with respect to the ref, not the read, so maybe control for the others too and see what you get?
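As a rough illustration of what a Hamming distance "with respect to the ref" can mean, the Python sketch below enumerates every sequence within one substitution of a reference k-mer. It assumes, from my reading of the BBDuk documentation (which describes memory use growing with (3*K)^hdist), that hdist works by also accepting such substituted variants of reference k-mers; treat that as an assumption to verify. The function name is made up.

```python
# Enumerate the reference k-mer plus all single-substitution variants.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def hamming_neighbors(kmer):
    """The k-mer itself plus every variant differing by exactly one substitution."""
    kmer = kmer.upper()
    variants = {kmer}
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                variants.add(kmer[:i] + alt + kmer[i + 1:])
    return variants


SPIKE20 = "CTCCAACCTGCTGCTGCAGT"
forward = hamming_neighbors(SPIKE20)
reverse = hamming_neighbors(reverse_complement(SPIKE20))

print(len(forward))                       # 61, i.e. 1 + 3 * 20
print("GCTGCAGCAGCAGGTTGGAG" in reverse)  # True: the 19/20 window from read 25780/1
```

A window with two substitutions relative to the 20-mer is in neither set, so under this model it would not be reported by k=20, hdist=1 on its own merits.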

When looking at the parameters, I found this topic at Biostars that seems to be relevant (the use case is near identical). The default for maskmiddle is likely the root reason for your results (this option is also “yes” by default in Galaxy). → Help with understanding BBduk's behavior

I’m sorry I couldn’t help more but maybe someone else will have some advice. I would also try at a forum where more people are using this tool specifically. How it works in Galaxy will be the same as anywhere else, so the advice about parameters should translate. But if you run into another missing parameter in the Galaxy wrapper, we can help to confirm. :slight_smile:

Think I've answered it myself now, but I will have to check. I took the 10-read matched output, spiked it with a full match, and reran on that single file only with the same parameters, and it only found 5 matches. The documentation says that when BBDuk is run on paired reads and finds a match in either read set, it also outputs the paired read. My current guess is that the reverse output's 10 reads will include some good matches, and likely also non-matching mates of the forward matches? Didn't check them; for some reason I had it in my head that they'd be the same, just reversed. Will now do so. First time using BBDuk, so I'm learning.

Appreciate folks' willingness to try and help!
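The paired-read behaviour described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the reads are invented, the match test is a plain exact-substring check rather than BBDuk's k-mer logic, and the `either` argument is a hypothetical stand-in for whatever setting controls this in BBDuk (I believe the help text calls it removeifeitherbad, but please confirm).

```python
# Toy model: a pair goes to the "matched" output if EITHER mate matches,
# so the forward matched file can contain reads with no match at all.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def contains_target(read, ref):
    """Simplified match test: exact occurrence of ref on either strand."""
    read = read.upper()
    return ref in read or reverse_complement(ref) in read


def split_pairs(pairs, ref, either=True):
    """Split (R1, R2) pairs into matched and unmatched lists."""
    matched, unmatched = [], []
    for r1, r2 in pairs:
        hits = (contains_target(r1, ref), contains_target(r2, ref))
        keep = any(hits) if either else all(hits)
        (matched if keep else unmatched).append((r1, r2))
    return matched, unmatched


SPIKE20 = "CTCCAACCTGCTGCTGCAGT"
pairs = [
    ("ACGTACGTACGTACGTACGTACGT", "GG" + SPIKE20 + "CC"),     # only R2 carries the 20-mer
    ("TTTTTTTTTTTTTTTTTTTTTTTT", "AAAAAAAAAAAAAAAAAAAAAAAA"),  # neither mate matches
]

matched, unmatched = split_pairs(pairs, SPIKE20)
forward_matched = [r1 for r1, _ in matched]
print(forward_matched)  # ['ACGTACGTACGTACGTACGTACGT']
```

The forward read in the first pair lands in the matched output even though it has no homology to the 20-mer, which is the pattern seen with reads 7808/1 and 7822/1 if their mates carry the hit.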
