Need help trimming sample barcodes and sequencing adapters

Hi,

I have ~2M nanopore sequencing reads that I would like to clean up before running Kraken2 and Krona. The reads look like this:

5’ - sequencing adapter - sample barcode - UMI - sequence - UMI - sample barcode - sequencing adapter - 3’

The UMI and sample barcode were added by random priming, so some reads contain multiple barcodes, and these are often scattered throughout the read rather than sitting at a fixed position in the transcript.

I used biosed to find the barcode sequence in all reads and replace it with an “N.” I then used Porechop to remove the Oxford Nanopore sequencing adapters. If there is a better way to do this, please let me know.

The problem I am having is removing the PCR duplicates. I have tried cd-hit-dup, VSearch dereplication, and Collapse Sequences. All of these tools remove only a few duplicates, when I expected many more. I suspect that removing the sample barcode and/or sequencing adapters has made many reads unique that would otherwise have been identical.

I know that my UMIs are always directly 5’ or 3’ of the sample barcode. Is there a way to first search for the barcode, then define the UMI as a range of x bases to the right or left of the barcode, and deduplicate based on the UMI (which could be at any position in the read)? Or are there any other ways to handle this?

Welcome, @scca

Let’s get some help from the people who work with single-cell data every day!

They can be found in a Matrix chat, and I’ve cross-posted your question over there. They may reply here or there, so feel free to join the chat! You're invited to talk on Matrix

We also have tutorials at the Galaxy Training Network (GTN) → Single Cell / Tutorial List

Let’s start there! :slight_smile:

Welcome @scca,
It seems you have already tried a bunch of relevant tools, and your steps make sense to me. After replacing the barcodes with Ns and trimming the adapters, try the UMI-tools extract tool. With the regex extraction method, define that your UMIs should start/end with the Ns that mark your barcodes. There are some examples of writing a regex in the tool help text. This step only adds the UMI to the read name. In the later steps of your analysis, you will need to find a way to use this information for deduplication. UMI-tools deduplicate can do that too, but only on mapped data.

In theory, you can use UMI-tools to extract your barcodes and UMIs together, but that gets hard with a variable number of barcodes.
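To make the regex idea a bit more concrete, here is a minimal plain-Python sketch of what a regex-based extraction does conceptually. It is not UMI-tools code. It assumes each barcode has already been replaced by a single "N" and that the UMI sits directly 3’ of it; `UMI_LEN` and the function name `extract_like` are made up for illustration. As far as I understand the UMI-tools behavior, bases in a `discard_` group are dropped, bases in a `umi_` group are removed from the read and appended to the read name after an underscore, and bases outside any named group are kept.

```python
import re

UMI_LEN = 8  # assumption: set this to the real UMI length

# "N" marks the masked barcode; the UMI is the UMI_LEN bases directly 3' of it.
# The group names discard_1 and umi_1 follow the UMI-tools naming convention;
# prefix and rest are only needed for this plain-Python illustration.
PATTERN = re.compile(
    r"^(?P<prefix>.*?)(?P<discard_1>N)(?P<umi_1>[ACGT]{%d})(?P<rest>.*)$" % UMI_LEN
)

def extract_like(name, seq):
    """Return (new_name, trimmed_seq), or None if the pattern is not found."""
    m = PATTERN.match(seq)
    if m is None:
        return None
    # Drop the barcode marker and the UMI, keep everything else.
    trimmed = m.group("prefix") + m.group("rest")
    # Append the UMI to the read name.
    return f"{name}_{m.group('umi_1')}", trimmed

print(extract_like("read1", "TTTTNACGTACGTGGGG"))
```

So you only write the pattern; moving the UMI into the read name is done by the tool, based on the group names.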


Hi @pavanvidem, thanks for the reply… UMI-tools extract looks promising, but the regex code is a bit hard for me to understand. I did some research but it’s still not clear to me how to search for my sample barcode and define the UMI and extract it to the read name. I understand that regex is defining the search parameters, but do I need to include the regex code for adding the UMI to the read name or does the UMI tools program do that part?

Just to update: I did find a regex that is close to what I want to do:

.*(?P<discard_1>CGTTATCGTTCCGTGAATAGC){s<=1}(?P<umi_1>.{17}).*

This finds the barcode sequence in the middle of the read and extracts the following 17 bases (the UMI) to the read name. However, it also deletes the UMI bases from the read. Any idea how to make it keep the UMI in the sequence and just copy it to the read name?

It also seems to only scan the read once, as there are still barcode sequences remaining in a lot of reads. Is it possible to have it do multiple passes and maybe only keep the last UMI encountered in each read?
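For reference, the behavior asked about here (scan the whole read, keep the UMI after the last barcode, and copy it to the read name without touching the sequence) can be sketched in plain Python outside of UMI-tools. This is only a sketch under some assumptions: it matches the barcode exactly (the standard `re` module has no fuzzy matching like `{s<=1}`; the third-party `regex` package does), it only looks at the forward strand, and it only takes the 17 bases 3’ of the barcode. The function name `tag_last_umi` is made up for illustration.

```python
import re

BARCODE = "CGTTATCGTTCCGTGAATAGC"  # barcode from the regex above
UMI_LEN = 17                       # UMI length from the regex above

# Barcode followed by a captured UMI of UMI_LEN bases (exact barcode match only).
PATTERN = re.compile(BARCODE + "(.{" + str(UMI_LEN) + "})")

def tag_last_umi(name, seq):
    """Append the UMI after the last barcode hit to the read name.

    The sequence is returned unchanged; reads without a barcode hit
    are returned with their original name.
    """
    umis = [m.group(1) for m in PATTERN.finditer(seq)]  # every hit, left to right
    if not umis:
        return name, seq
    return f"{name}_{umis[-1]}", seq

# Toy read with two barcode occurrences; the second UMI is the one kept.
read = "AAAA" + BARCODE + "A" * 17 + "CCCC" + BARCODE + "G" * 17 + "TT"
print(tag_last_umi("read1", read)[0])
```

The appended UMI could then serve as a grouping key in a later deduplication step.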