I am having trouble using UMI-tools deduplicate to remove PCR duplicates from my data set. I have already successfully used UMI-tools extract to remove the UMI sequences from the reads and add them to the read names. The reads are in FASTQ at this point. Since UMI-tools deduplicate only works with BAM/SAM files, I mapped all of my reads to the host cell genome using Minimap2, which I needed to do anyway to remove host cell RNA. I then took the unmapped BAM file generated by Minimap2 and ran UMI-tools deduplicate on it with default settings. The resulting BAM file is much smaller than the original (~50 KB vs. ~108 MB), and when I convert the BAM to FASTA, there are no reads. I see an option in UMI-tools deduplicate to “use unmapped reads”, but it only works for paired-end reads and mine are single-end. Anyone have any ideas what could be causing this problem?
The reads you are interested in are still unmapped at this point, and the deduplicate tool needs reads that are mapped! The tool uses the mapping coordinates (together with the UMI) to decide which reads are “duplicates”.
So you need the reads in FASTQ format again (whatever upstream steps you applied), map them against a reference genome, and then give that resulting BAM (coordinate-sorted and indexed) to the UMI-tools deduplicate tool.
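Just to make the coordinate idea concrete: outside of Galaxy, the grouping the tool performs boils down to something roughly like the sketch below. This is only a toy illustration of the principle, not what UMI-tools actually does internally (the real tool also error-corrects UMIs and handles strand, soft-clipping, and so on). The file name is a placeholder, and it assumes UMI-tools extract appended the UMI to the end of the read name after an underscore (its default).

```
import pysam
from collections import defaultdict

# Toy illustration only: treat reads with the same mapping position and the
# same UMI as PCR duplicates, and keep one representative per group.
groups = defaultdict(list)
with pysam.AlignmentFile("mapped.sorted.bam", "rb") as bam:  # placeholder name
    for read in bam:
        if read.is_unmapped:
            continue  # no coordinates, so nothing to base a duplicate call on
        umi = read.query_name.split("_")[-1]  # UMI appended by UMI-tools extract
        groups[(read.reference_name, read.reference_start, umi)].append(read)

kept = [reads[0] for reads in groups.values()]
print(f"kept {len(kept)} of {sum(len(r) for r in groups.values())} mapped reads")
```

An unmapped BAM has no coordinates at all, which is why running the tool on it gives you an essentially empty result.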
Does this help? Are you following a tutorial or other written protocol? If not, maybe find one and try to replicate its steps?
Thanks for the reply! That makes sense about the unmapped BAM file not working in UMI-tools. I guess I have a dilemma then, since I don’t think using a mapped BAM file will work in my situation (but please correct me if I’m wrong)… What I am trying to do is develop a workflow to analyze metagenomic sequencing data in order to detect RNA viruses. My current workflow is this:
Extract UMIs into read names.
Remove the extra sample barcode sequences with biosed.
Remove host (human) sequences by mapping to the human reference genome with Minimap2, then splitting the BAM into mapped and unmapped files (a rough sketch of this split is at the end of this post).
Deduplicate the unmapped BAM file (doesn’t work, as you said)
Convert the unmapped BAM file to FASTA.
Run the reads through Kraken2 against the viral database.
Convert the Kraken2 tabular output to Galaxy taxonomy format.
Run Krona to see the percent abundances of the viruses present in the sample.
So, since we will not know beforehand what will be in the sample, I don’t think it makes sense to map the reads to any particular reference. Are there any options for deduplicating on the extracted UMIs directly from FASTA/FASTQ reads instead of a BAM file? Or is there another way I should be approaching this workflow?
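For reference, the host-removal split mentioned above is essentially doing the following outside of Galaxy (just a rough pysam sketch of the split step; the file names are placeholders):

```
import pysam

# Rough sketch of the "split BAM into mapped and unmapped" step; file names
# are placeholders for the Minimap2-vs-human output and the two split files.
with pysam.AlignmentFile("minimap2_vs_human.bam", "rb") as bam, \
     pysam.AlignmentFile("host_mapped.bam", "wb", template=bam) as host_out, \
     pysam.AlignmentFile("non_host.bam", "wb", template=bam) as keep_out:
    for read in bam:
        (keep_out if read.is_unmapped else host_out).write(read)
```

It is that unmapped (non-host) output that I want to deduplicate and then send on to Kraken2.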
Please keep in mind that we can’t help with developing the analysis strategy, but can help with identifying Galaxy-hosted tools for discrete parts of what you want to do. The scientific logic is beyond the scope of this particular forum. That’s why I suggested finding a public protocol (publication or otherwise) and going from there.
Not a great answer and others are welcome to comment more!
Ok, no worries… To answer your question, yes, I would like to get accurate abundance estimates, not just a pos/neg result. For that reason I have added UMIs to all of the reads to reduce PCR bias. I have already found and successfully used all of the other tools needed in this workflow; the only step I’m having problems with is removing the PCR duplicates. Is there a Galaxy tool that can deduplicate FASTA or FASTQ files using UMI sequences? It seems like that’s really all I need…
The problem is that you need something to base the deduplication on. Without a mapping result for a coordinate-based approach, you are left with directly comparing the sequences to each other, and comparing sequences to each other is a type of clustering.
This is part of what the Mothur tools do. You don’t have to use these tools, but maybe reviewing the logic of how that pipeline works helps to come up with something novel for your read type?
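If you do go the direct route, the simplest version is to group reads by the UMI that UMI-tools extract placed in the read name, optionally together with the start of the read sequence so that two different molecules that happen to share a UMI are less likely to be collapsed. A very rough sketch (assuming plain single-end FASTQ, the default underscore naming from UMI-tools extract, and placeholder file names):

```
# Very rough sketch: collapse single-end FASTQ reads that share a UMI
# (plus the first 20 bases of sequence, to be a little safer about UMI collisions).
# Assumes UMI-tools extract appended the UMI to the read name after an underscore.
seen = set()
with open("reads_umi_extracted.fastq") as fin, open("reads_dedup.fastq", "w") as fout:
    while True:
        header = fin.readline().rstrip("\n")
        if not header:
            break
        seq = fin.readline().rstrip("\n")
        plus = fin.readline().rstrip("\n")
        qual = fin.readline().rstrip("\n")

        umi = header.split()[0].split("_")[-1]  # UMI-tools default read-name format
        key = (umi, seq[:20])                   # UMI plus a short sequence prefix
        if key in seen:
            continue                            # treat as a PCR duplicate, drop it
        seen.add(key)
        fout.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
```

Note that this has no tolerance for sequencing errors in the UMI, which the coordinate-based UMI-tools approach does handle; a mismatch-tolerant grouping, or a clustering step like the ones in Mothur, would be the next refinement.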
From what I have read, I should be able to identify reads whose names carry identical (or similar) UMI sequences, group them, and collapse each group into a single representative read. Would that be sufficient to perform the deduplication? I will look into Mothur and clustering; thanks for the recommendation.