Need help trimming sample barcodes and sequencing adapters

scca · September 6, 2024, 8:57pm

Hi,

I have ~2M nanopore sequencing reads that I would like to clean up before running Kraken2 and Krona. The reads look like this:

5’ - sequencing adapter - sample barcode - UMI - sequence - UMI - sample barcode - sequencing adapter - 3’

The UMI and sample barcode were added by random priming, so some reads contain multiple barcodes, and these are often scattered throughout the read rather than at a fixed position.

I used biosed to find the barcode sequence in all reads and replace it with an “N.” I then used Porechop to remove the Oxford Nanopore sequencing adapters. If there is a better way to do this, please let me know.

The problem I am having is removing the PCR duplicates. I have tried cd-hit-dup, VSearch dereplication, and Collapse Sequences. All of these tools remove only a few duplicates, when I expected many more. I have a feeling that removing the sample barcode and/or sequencing adapters has made many reads unique that otherwise would have been identical.

I know that my UMIs are always directly 5’ or 3’ of the sample barcode. Is there a way to first search for the barcode, then define the UMI as a range of x bases to the right or left of the barcode, and deduplicate based on the UMI (which could be at any position in the read)? Or are there any other ways to handle this?
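For illustration only, the idea in the question (find the barcode, take a fixed number of bases next to it as the UMI, keep one read per UMI) can be sketched in plain Python. This is a minimal sketch, not a replacement for a dedicated tool: `BARCODE` and `UMI_LEN` are placeholder values, `find_umi` and `dedup_by_umi` are hypothetical helper names, matching is exact (no sequencing errors tolerated), and deduplicating on the UMI alone assumes that two different molecules never share a UMI.

```python
BARCODE = "GATTACAGATTACA"  # placeholder barcode (assumption, not the real one)
UMI_LEN = 8                 # placeholder UMI length (assumption)


def find_umi(seq, barcode=BARCODE, umi_len=UMI_LEN, side="right"):
    """Return the UMI flanking the first exact barcode hit, or None."""
    pos = seq.find(barcode)
    if pos == -1:
        return None  # no barcode in this read
    if side == "right":
        start = pos + len(barcode)
        umi = seq[start:start + umi_len]
    else:
        start = pos - umi_len
        if start < 0:
            return None  # barcode too close to the read start
        umi = seq[start:pos]
    # Reject UMIs truncated by the end of the read.
    return umi if len(umi) == umi_len else None


def dedup_by_umi(reads, **kwargs):
    """Keep the first read seen per UMI; reads without a UMI are kept as-is."""
    seen = set()
    kept = []
    for name, seq in reads:
        umi = find_umi(seq, **kwargs)
        if umi is None or umi not in seen:
            kept.append((name, seq))
        if umi is not None:
            seen.add(umi)
    return kept
```

Here `reads` is any iterable of `(name, sequence)` pairs, so the same functions could be fed from a FASTQ parser. Nanopore error rates make exact barcode matching optimistic, so a real run would need some mismatch tolerance.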

jennaj · September 9, 2024, 8:23pm

Welcome, @scca

Let’s get some help from the people that work with single cell data every day!

They can be found at a Matrix chat, and I’ve cross posted your question over there. They may reply here or there, and feel free to join the chat!

We also have tutorials at the Galaxy Training Network (GTN) → Single Cell / Tutorial List

Let’s start there!

pavanvidem · September 9, 2024, 10:11pm

Welcome @scca,
It seems you have already tried a bunch of relevant tools, and your steps make sense to me. After replacing the barcodes with Ns and trimming adapters, try the UMI-tools extract tool. With the regex extraction method, define that your UMIs should start/end with Ns, which are your barcodes. There are some examples of writing regex in the tool help text. This step only adds the UMI to the read name. In later steps of your analysis, you will need a way to use this information for deduplication. UMI-tools deduplicate can do that too, on mapped data.

In theory, you can use UMI-tools to extract your barcodes and UMIs together, but it gets hard with a variable number of barcodes.

scca · September 10, 2024, 4:46pm

Hi @pavanvidem, thanks for the reply… UMI-tools extract looks promising, but the regex code is a bit hard for me to understand. I did some research but it’s still not clear to me how to search for my sample barcode and define the UMI and extract it to the read name. I understand that regex is defining the search parameters, but do I need to include the regex code for adding the UMI to the read name or does the UMI tools program do that part?

scca · September 10, 2024, 9:58pm

Just to update: I did find a regular expression that is close to what I want to do:

.*(?P<discard_1>CGTTATCGTTCCGTGAATAGC){s<=1}(?P<umi_1>.{17}).*

This finds the barcode sequence in the middle of the read and extracts the following 17 bases (the UMI) to the read name. However, it also deletes the UMI bases from the read. Any idea how to have it keep the UMI in the sequence and just copy it to the read name?

It also seems to only scan the read once, as there are still barcode sequences remaining in a lot of reads. Is it possible to have it do multiple passes and maybe only keep the last UMI encountered in each read?
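As a rough sketch of the behaviour being asked for here (this is not something UMI-tools is claimed to do), the Python below scans a read for every occurrence of the barcode with at most one substitution, takes the 17 bases after the last occurrence as the UMI, and copies it onto the read name while leaving the sequence untouched. The function names `hamming`, `barcode_hits`, and `tag_with_last_umi` are hypothetical, only substitutions are tolerated (no insertions or deletions), and the `_UMI` suffix is chosen to resemble the underscore-separated suffix that UMI-tools extract appends.

```python
BARCODE = "CGTTATCGTTCCGTGAATAGC"  # barcode from the pattern above
UMI_LEN = 17                       # UMI length from the pattern above


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def barcode_hits(seq, barcode=BARCODE, max_subs=1):
    """Start positions of every window within max_subs substitutions."""
    n = len(barcode)
    return [
        i for i in range(len(seq) - n + 1)
        if hamming(seq[i:i + n], barcode) <= max_subs
    ]


def tag_with_last_umi(name, seq, barcode=BARCODE, umi_len=UMI_LEN, max_subs=1):
    """Append the last complete UMI to the read name; sequence is unchanged."""
    umis = []
    for pos in barcode_hits(seq, barcode, max_subs):
        start = pos + len(barcode)
        umi = seq[start:start + umi_len]
        if len(umi) == umi_len:  # skip UMIs cut off by the read end
            umis.append(umi)
    if not umis:
        return name, seq  # no usable barcode/UMI pair found
    return f"{name}_{umis[-1]}", seq
```

The sliding-window scan is quadratic in the worst case but simple to check by eye; for ~2M long reads a fuzzy-matching library or a purpose-built demultiplexer would likely be faster.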
