Galaxy Version 39.08+galaxy3
This is related to my other question from today. I am trying to extract reads with a high-identity match to a 20-mer reference. I ran BBDuk twice with the same 20-mer reference and am getting “matches” that don't meet the parameters:
k=20, hamdist=1: should find reads with no more than a single mismatch to the ref? Finds 10 reads in the forward file.
k=19, hamdist=0, maskmiddle=t: should find reads with a single mismatch at position 1 or 20 (and possibly also a mismatch at the middle base, which is probably not relevant here)? Finds only 3 reads (a subset of the 10 reads above).
Details are below. A few matches seem OK, but several have no homology at all. Can anyone figure out why they found their way into the output match files?
spike20.fa CTCCAACCTGCTGCTGCAGT 20 bases
(inverted seq, i.e. reverse complement = ACTGCAGCAGCAGGTTGGAG)
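For anyone who wants to double-check the inverted sequence, here is a minimal Python sketch that computes the reverse complement of the 20-mer. The function name `reverse_complement` is just an illustrative helper, not part of BBDuk or Galaxy.

```python
# Translation table for the four DNA bases (upper case only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

spike20 = "CTCCAACCTGCTGCTGCAGT"
print(reverse_complement(spike20))  # ACTGCAGCAGCAGGTTGGAG
```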
Ran BBDuk to find matches, Galaxy defaults except:
k=20, hamdist=1 (1 mismatch?), maskmiddle=f: finds 10 reads (in each set; only the forward reads are analyzed here)
22320/1
CTCcAACCTGCTGCTGCAGT
GCTATGACATCTCAAAGAAGCACTTTGATGCAAGCTTCTTTTGGGCCTGTGACGCACCTCTGGGAGCTTCCCTGGGCACATCTGTGAGCCATAGAGATCTGGCTGTAATGGACGTAAATAAAGAAAGAAAA
19/20 match at 1-20; the lowercase c is the mismatch. I can see how this was matched; a duplicate identical read was also matched.
25780/1
AAGTCGCGTATTTCGTCTTCTGGAAATCCAGCAGCCGGGGGGCGAACATCGCGGCGCTTCCATGGGAATCTGGCCCCGGGCTCAGAGCGCGGGTAGCTGGCAGAGCCTGGAGGGCGCGGCG
gCTGCAGCAGCAGGTTGGAG
GGCGCGCGGC
Also a 19/20 match, to the inv reference near the end of the read; the first-position mismatch is in lowercase. This is the same read as the first read below with k=19, hamdist=0.
These seem to be in agreement with the parameters.
One read has an 18/20 match to the reference starting at 112: CTCCaAcCTGCTGCTGCAGT (mismatches at 5 and 7). Why was this found?
Another read has a 19/20 match to the inverse of the reference at 68-87, with a single mismatch at 83. This matches the parameters.
Two duplicate reads have a 9/10 match to the end of the reference, gctggcaggaCtGCTGCAGT (mismatch at the 2nd position), at 49-58; the previous 10 bases (39-48) are all mismatches.
Two other related reads, 7808 and 7822, which were also found in the next search, have no homology to the reference 20-mer that I can see visually.
The tenth match, CTCCAACtTGCTGCTcCaat, has a 15/17 match to the start of the reference (mismatches in lowercase) at 30-47. Close, but it does not meet the parameters!
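To take the visual comparison out of the picture, the best ungapped match in each read can be found with a short script. This is only a sketch of an independent check (a sliding-window Hamming distance against the reference and its reverse complement), not a description of how BBDuk matches k-mers internally; the helper names `hamming` and `best_window` are made up for illustration. Note that the reported start is 0-based, whereas the positions quoted in this post are 1-based.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_window(read: str, ref: str):
    """Return (distance, start, strand) for the closest ungapped window.

    start is 0-based; strand is '+' for the reference as given and '-'
    for its reverse complement. Returns None if the read is shorter
    than the reference.
    """
    read = read.upper()
    targets = (("+", ref.upper()), ("-", reverse_complement(ref)))
    best = None
    for start in range(len(read) - len(ref) + 1):
        window = read[start:start + len(ref)]
        for strand, target in targets:
            d = hamming(window, target)
            if best is None or d < best[0]:
                best = (d, start, strand)
    return best

SPIKE20 = "CTCCAACCTGCTGCTGCAGT"

# Last 30 bases of read 25780/1 as quoted in this post.
tail_25780 = "gCTGCAGCAGCAGGTTGGAG" + "GGCGCGCGGC"
print(best_window(tail_25780, SPIKE20))  # (1, 0, '-')
```

Running `best_window` on the full sequence of each of the 10 matched reads would show whether any of them really contains a window within one mismatch of the 20-mer on either strand.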
Repeated BBDuk on the same read pair set with k=19, hamdist=0, maskmiddle=t: finds 3 reads in each set, all of which are among the 10 reads from the first search above.
25780/1
AAGTCGCGTATTTCGTCTTCTGGAAATCCAGCAGCCGGGGGGCGAACATCGCGGCGCTTCCATGGGAATCTGGCCCCGGGCTCAGAGCGCGGGTAGCTGGCAGAGCCTGGAGGGCGCGGCG
gCTGCAGCAGCAGGTTGGAG
GGCGCGCGGC
The lowercase g at 122 marks the mismatch; the next 19 bp match the inverse of the spike20 sequence.
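My understanding of this second search can be written out as a small Python sketch. It assumes a simplified model in which every 19-mer of the reference (both strands) is compared exactly to every 19-mer of the read, with the middle base of each k-mer treated as a wildcard. That model is my assumption about what k=19, hamdist=0, maskmiddle=t means, not BBDuk's actual implementation, and the helper names (`mask_middle`, `masked_hits`) are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def kmers(seq: str, k: int):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def mask_middle(kmer: str) -> str:
    """Replace the middle base of a k-mer with N (wildcard)."""
    mid = len(kmer) // 2
    return kmer[:mid] + "N" + kmer[mid + 1:]

def masked_hits(read: str, ref: str, k: int = 19):
    """Return 0-based read positions whose middle-masked k-mer equals a
    middle-masked k-mer of the reference or its reverse complement."""
    ref = ref.upper()
    targets = {mask_middle(km) for km in kmers(ref, k)}
    targets |= {mask_middle(km) for km in kmers(reverse_complement(ref), k)}
    read = read.upper()
    return [i for i, km in enumerate(kmers(read, k)) if mask_middle(km) in targets]

SPIKE20 = "CTCCAACCTGCTGCTGCAGT"

# Last 30 bases of read 25780/1 as quoted in this post.
tail_25780 = "gCTGCAGCAGCAGGTTGGAG" + "GGCGCGCGGC"
print(masked_hits(tail_25780, SPIKE20))
```

Under this model the 19-mer that starts one base after the lowercase g is an exact hit, which agrees with this read being reported. A read that returns an empty list here has no 19-mer in common with the reference even with the middle base ignored.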
The other two matches are related to each other: the first 88 bp are 100% identical between the two reads, so I assume the “match” is in this region.
91% identity overall; only 11 of the next 63 bases differ.
7808/1
ATGTACAAATGTGCCGAGTGCGGCAAGTCCTTCAAGGGCTCCTCCGGGCTGCGCTACCACCTGCGGGACCACACGGGCGAGCGGCCCT
ACCAGTGTGGCGAGTGCGGCAAGGCCTTCAAGCGCTCCTCCCTGCTGGCCATCCACCAGCGGG
Can't find any match to the reference spike20 or inv.
7822/1
ATGTACAAATGTGCCGAGTGCGGCAAGTCCTTCAAGGGCTCCTCCGGGCTGCGCTACCACCTGCGGGACCACACGGGCGAGCGGCCCT
CCCACTGTCGCCCCTGCCTCAAGGCCTTCAAACACTCCTCCCCGCTGGTGATGCAGGCCCGGG
Can't find any match to spike20 or inv.
Any help would be appreciated!