I have been trying to remove PCR duplicates from my reads without much success. Since this is a metagenomic screening assay, I do not have a reference sequence to map the reads to before duplicate removal (as required by UMI-tools dedup and similar tools).
I have a UMI sequence in each read and can extract it to the FASTQ header using UMI-tools extract, but I can’t get past this point. All of the tools I have found either de-duplicate only after mapping or remove only EXACT matches. Is there a way to remove FASTQ duplicates based on the UMI sequence alone, allowing for one or two mismatches? I have been looking at pRESTO CollapseSeq, but can’t figure out what to use for the different parameters. Would this tool work if I could figure out how to use it?
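To make concrete what I am after, here is a minimal Python sketch of the behaviour I have in mind. It is not how UMI-tools or pRESTO work internally, just a simple greedy clustering under my own assumptions: the UMI is the text after the last underscore in the read name (which is how I get it out of the extract step), all UMIs have the same length, and reads are held in memory as (header, sequence, quality) tuples. The function names and the max_mismatches parameter are made up for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Count mismatching positions; unequal lengths are treated as far apart."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def umi_from_header(header):
    """ASSUMPTION: the UMI is whatever follows the last '_' in the read name."""
    read_name = header.split()[0]
    return read_name.rsplit("_", 1)[1]


def dedup_by_umi(records, max_mismatches=1):
    """Keep the first read seen for each cluster of similar UMIs.

    records: iterable of (header, sequence, quality) tuples.
    UMIs are visited from most to least abundant; each one joins the first
    existing cluster representative within max_mismatches, or else starts
    a new cluster.
    """
    records = list(records)
    counts = Counter(umi_from_header(header) for header, _, _ in records)

    representatives = []  # one UMI per cluster, in order of creation
    assigned = {}         # observed UMI -> cluster representative
    for umi, _ in counts.most_common():
        for rep in representatives:
            if hamming(umi, rep) <= max_mismatches:
                assigned[umi] = rep
                break
        else:
            representatives.append(umi)
            assigned[umi] = umi

    seen = set()
    kept = []
    for record in records:
        rep = assigned[umi_from_header(record[0])]
        if rep not in seen:
            seen.add(rep)
            kept.append(record)
    return kept


if __name__ == "__main__":
    # Made-up reads purely for illustration.
    reads = [
        ("@r1_AAAA", "ACGTACGT", "IIIIIIII"),
        ("@r2_AAAA", "ACGTACGT", "IIIIIIII"),
        ("@r3_AAAT", "ACGTACGT", "IIIIIIII"),
        ("@r4_CCCC", "TTGGCCAA", "IIIIIIII"),
    ]
    for header, seq, qual in dedup_by_umi(reads, max_mismatches=1):
        print(header, seq)
```

One caveat I am aware of: keying on the UMI alone collapses every read that shares a UMI, even if the reads come from different molecules, so in practice the cluster key would probably need to include part of the read sequence as well. The sketch also keeps the first read it encounters in each cluster rather than the best-quality one.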