I am having trouble using UMI-tools deduplicate to remove PCR duplicates from my data set. I have already successfully used UMI-tools extract to remove the UMI sequences from the reads and add them to the read names. The reads are in FASTQ at this point. Since UMI-tools deduplicate only works with BAM/SAM files, I mapped all of my reads to the host cell genome using Minimap2, which I needed to do anyway to remove host cell RNA. I then took the unmapped BAM file generated by Minimap2 and ran UMI-tools deduplicate on it with default settings. The resulting BAM file is much smaller than the original (~50 KB vs. ~108 MB), and when I convert the BAM to FASTA, there are no reads. I see an option in UMI-tools deduplicate to “use unmapped reads”, but it only works for paired-end reads and mine are single-end. Anyone have any ideas what could be causing this problem?
The reads you are interested in are still unmapped at this point, and the deduplicate tool needs reads that are mapped! The tool uses the mapping coordinates (together with the UMI) to decide which reads are “duplicates”.
So you need the reads in FASTQ format again (whatever upstream steps you applied), map them against a reference genome, and then give that resulting BAM (coordinate-sorted and indexed) to the UMI-tools deduplicate tool.
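Just to make the coordinate idea concrete: outside of Galaxy, the grouping the tool performs boils down to something roughly like the sketch below. This is only a toy illustration of the principle, not what UMI-tools actually does internally (the real tool also error-corrects UMIs and handles strand, soft-clipping, and so on). The file name is a placeholder, and it assumes UMI-tools extract appended the UMI to the end of the read name after an underscore (its default).

```
import pysam
from collections import defaultdict

# Toy illustration only: treat reads with the same mapping position and the
# same UMI as PCR duplicates, and keep one representative per group.
groups = defaultdict(list)
with pysam.AlignmentFile("mapped.sorted.bam", "rb") as bam:  # placeholder name
    for read in bam:
        if read.is_unmapped:
            continue  # no coordinates, so nothing to base a duplicate call on
        umi = read.query_name.split("_")[-1]  # UMI appended by UMI-tools extract
        groups[(read.reference_name, read.reference_start, umi)].append(read)

kept = [reads[0] for reads in groups.values()]
print(f"kept {len(kept)} of {sum(len(r) for r in groups.values())} mapped reads")
```

An unmapped BAM has no coordinates at all, which is why running the tool on it gives you an essentially empty result.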
Does this help? Are you following a tutorial or other written protocol? If not, maybe find one and try to replicate its steps?
Thanks for the reply! That makes sense about the unmapped BAM file not working in UMI-tools. I guess I have a dilemma then, since I don’t think using a mapped BAM file will work in my situation (but please correct me if I’m wrong)… What I am trying to do is develop a workflow to analyze metagenomic sequencing data in order to detect RNA viruses. My current workflow is this:
Extract UMIs into read names.
Remove the extra sample barcode sequences with biosed.
Remove host (human) sequences by mapping to the human reference genome with Minimap2, then splitting the BAM into mapped and unmapped files (a rough sketch of this split is at the end of this post).
Deduplicate the unmapped BAM file (doesn’t work, as you said)
Convert the unmapped BAM file to FASTA.
Run the reads through Kraken2 against the viral database.
Convert the Kraken2 tabular output to Galaxy taxonomy format.
Run Krona to see the percent abundances of the viruses present in the sample.
So, since we will not know beforehand what will be in the sample, I don’t think it makes sense to map the reads to any particular reference. Are there any options for deduplicating on the extracted UMIs directly from FASTA/FASTQ reads instead of a BAM file? Or is there another way I should be approaching this workflow?
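For reference, the host-removal split mentioned above is essentially doing the following outside of Galaxy (just a rough pysam sketch of the split step; the file names are placeholders):

```
import pysam

# Rough sketch of the "split BAM into mapped and unmapped" step; file names
# are placeholders for the Minimap2-vs-human output and the two split files.
with pysam.AlignmentFile("minimap2_vs_human.bam", "rb") as bam, \
     pysam.AlignmentFile("host_mapped.bam", "wb", template=bam) as host_out, \
     pysam.AlignmentFile("non_host.bam", "wb", template=bam) as keep_out:
    for read in bam:
        (keep_out if read.is_unmapped else host_out).write(read)
```

It is that unmapped (non-host) output that I want to deduplicate and then send on to Kraken2.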
Please keep in mind that we can’t help with developing the analysis strategy, but can help with identifying Galaxy-hosted tools for discrete parts of what you want to do. The scientific logic is beyond the scope of this particular forum. That’s why I suggested finding a public protocol (publication or otherwise) and going from there.
Not a great answer and others are welcome to comment more!
Ok, no worries… To answer your question, yes, I would like to get accurate abundance estimates, not just a pos/neg result. For that reason I have added UMIs to all of the reads to reduce PCR bias. I have already found and successfully used all of the other tools needed in this workflow; the only step I’m having problems with is removing the PCR duplicates. Is there a Galaxy tool that can deduplicate FASTA or FASTQ files using UMI sequences? It seems like that’s really all I need…
The problem is that you need something to base the deduplication on. Without a mapping result for a coordinate-based approach, you are left with directly comparing the sequences to each other, and comparing sequences to each other is a type of clustering.
This is part of what the Mothur tools do. You don’t have to use these tools, but maybe reviewing the logic of how that pipeline works helps to come up with something novel for your read type?
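If you do go the direct route, the simplest version is to group reads by the UMI that UMI-tools extract placed in the read name, optionally together with the start of the read sequence so that two different molecules that happen to share a UMI are less likely to be collapsed. A very rough sketch (assuming plain single-end FASTQ, the default underscore naming from UMI-tools extract, and placeholder file names):

```
# Very rough sketch: collapse single-end FASTQ reads that share a UMI
# (plus the first 20 bases of sequence, to be a little safer about UMI collisions).
# Assumes UMI-tools extract appended the UMI to the read name after an underscore.
seen = set()
with open("reads_umi_extracted.fastq") as fin, open("reads_dedup.fastq", "w") as fout:
    while True:
        header = fin.readline().rstrip("\n")
        if not header:
            break
        seq = fin.readline().rstrip("\n")
        plus = fin.readline().rstrip("\n")
        qual = fin.readline().rstrip("\n")

        umi = header.split()[0].split("_")[-1]  # UMI-tools default read-name format
        key = (umi, seq[:20])                   # UMI plus a short sequence prefix
        if key in seen:
            continue                            # treat as a PCR duplicate, drop it
        seen.add(key)
        fout.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
```

Note that this has no tolerance for sequencing errors in the UMI, which the coordinate-based UMI-tools approach does handle; a mismatch-tolerant grouping, or a clustering step like the ones in Mothur, would be the next refinement.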
From what I have read, I should be able to identify reads whose names carry identical (or similar) UMI sequences, group them, and collapse each group into a single representative read. Would that be sufficient to perform the deduplication? I will look into Mothur and clustering; thanks for the recommendation.