>90% of reads lost during demultiplexing with Barcode Splitter and Process Radtag tools


I have a 2x150 bp paired end, small RNA CLIP dataset from a previous student consisting of 12 multiplexed samples. The libraries were prepared from 20-50 nt RNA fragments and multiplexed using NEB’s small RNA kit. I am now trying to demultiplex these samples and I have tried the Barcode Splitter and Stacks2 process radtags tools. With both tools, ~96% of my reads (~480,000,000/~500,000,000) are not matched to a barcode.

I don’t have a great understanding of how this library was prepared. Moreover, since it’s a small RNA library, I don’t really understand why paired-end sequencing was performed (from what I see online, R1 alone should be sufficient for libraries of this size). Since there are no index reads in the raw data I have (just R1 and R2), I assume the barcode must be read inline. Looking at the NEB small RNA manual, the barcode seems to be added to the i7 end (so the 3’ end), and based on the read primer site, I assume I should only see the barcode in R1, but not R2. This is consistent with the FastQC report, which shows the index sequences over-represented in R1 only.

Does anyone know why so many of my reads are not matching a barcode? Or have any tips for demultiplexing this library? This library has been previously analyzed by a bioinformatics lab (unfortunately I don’t have the early steps in the analysis from them, only the CLIP peak calling file), so I assume the problem is my approach to demultiplexing and not the data quality.

Thanks in advance!

Welcome, @sarahkschultz

My first guess is that the problem is with the barcode file input. Are you sure that it is complete? Formatting instructions are at the bottom of each tool form. For the complete list of barcodes themselves, the manufacturer should have this available.

If that seems correct, then the next guess is some problem with the match parameters. Maybe these were set too stringently? Maybe run this with very permissive options to see what happens, then tune?
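One way to check the barcode file and the position setting before tuning the tool is to tally where each barcode actually occurs in a small sample of R1 reads. The sketch below is only an illustration in plain Python, not part of Barcode Splitter: the function name `barcode_offsets`, the sample name, and the toy sequences are all made up, and it only looks for exact matches.

```python
from collections import Counter


def barcode_offsets(reads, barcodes):
    """Tally the 0-based offset of the first exact hit of each barcode in each read.

    reads:    iterable of read sequences (strings)
    barcodes: dict mapping a sample name to its barcode sequence
    Returns (hits, unmatched), where hits[name] is a Counter of offsets and
    unmatched is the number of reads containing none of the barcodes.
    """
    hits = {name: Counter() for name in barcodes}
    unmatched = 0
    for seq in reads:
        found = False
        for name, bc in barcodes.items():
            pos = seq.find(bc)  # -1 when the barcode is absent
            if pos != -1:
                hits[name][pos] += 1
                found = True
        if not found:
            unmatched += 1
    return hits, unmatched


# Toy data (hypothetical sequences, not from the NEB kit)
reads = ["ACGTACGTTTGACCAAGGT", "GGGGTTGACCAAAA", "ACGTACGT"]
barcodes = {"sample1": "TTGACCA"}
hits, unmatched = barcode_offsets(reads, barcodes)
for name, offsets in hits.items():
    print(name, dict(offsets), "unmatched reads:", unmatched)
```

If most hits pile up at an internal offset rather than at the very start or end of the read, that would suggest the position setting (5’ vs 3’ end) is what is causing reads to go unmatched, rather than the barcode list itself.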

Let’s start there, thanks! :slight_smile:

Thanks @jennaj!
I think I have it figured out now, for R1 at least! These reads now seem to be demultiplexed perfectly.

However, I’m stuck with what to do with R2. R2 doesn’t have a barcode, so I just want them demultiplexed based on where their mate went. What is the best way to go about doing this? Barcode splitter doesn’t seem to treat the reads as pairs, and since these don’t have barcodes, of course these reads go unmatched.

Thanks for your help!

Hi @sarahkschultz

Try running the tool with a paired end collection.

When creating the collection, use the option for Build List of Dataset Pairs to process all 12 samples at the same time. You could instead use a simple paired collection (one sample) for each sample individually, but running them as a batch is certainly possible too. I would suggest using the sample names for the original collection identifiers. Use Collection Operations → Relabel identifiers if needed, before using the Barcode Splitter tool. Tags can also help sort out the data when using downstream tools.

I made an example here to make sure it still works → https://usegalaxy.org/u/jen-galaxyproject/h/test-barcode-splitter-on-paired-collection-1

Note that last collection – I input the result from the tool to the Collection Operations → Flatten Collection tool to capture the Barcode Splitter assigned element identifiers along with the forward/reverse organization to better reveal what is going on, but you can manipulate that result collection however you want to.

If you are not familiar with collections, know that these are powerful ways to organize your data and are worth learning about. Find the tools to “change the shape” of the collection folders, or to filter or tag them, in the tool panel group Collection Operations. Tutorials are linked at the bottom of each tool form for more help.

Hope this helps! :slight_smile:

Hi @jennaj

Hmmm, I actually did build a list of dataset pairs from my R1 and R2 files and input it as a dataset collection into Barcode Splitter, along with my barcode file. I selected that the barcodes are at the 3’ end of the sequence and allowed up to one mismatch. In the output, though, 99% of my R1 reads are appropriately separated into 12 new fastqsanger files; however, only 3% of the R2 reads go into a file associated with a barcode, and the rest go into the unmatched dataset.

Is there something I’m misunderstanding or missing? In the sample dataset, it looks like the forward and reverse reads are the same (unlike my data, where the barcode is only in the forward reads). So both reads contain the barcode the tool is looking for, and I think that’s why everything ends up in the right file???
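For reference, the behaviour I’m after for R2 can be sketched outside Galaxy like this: look up each R2 read’s ID in the already-demultiplexed R1 files and put it in the same bin as its mate. This is only a rough illustration in plain Python, not what Barcode Splitter does internally. The function names `read_fastq` and `assign_r2_by_mate` and the toy records are made up, and the ID-to-sample table is held in memory, so it would need a streaming or tool-based approach for ~500 million reads.

```python
def read_fastq(lines):
    """Yield (read_id, record) from an iterable of FASTQ lines (4 lines per record).

    read_id is the header up to the first whitespace, without the leading '@'
    and without a trailing /1 or /2 mate suffix.
    """
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip stray blank lines between records
        record = [header.rstrip("\n")] + [next(it).rstrip("\n") for _ in range(3)]
        read_id = record[0][1:].split()[0]
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]
        yield read_id, record


def assign_r2_by_mate(r1_bins, r2_lines):
    """Bin R2 records according to the sample their R1 mate was assigned to.

    r1_bins:  dict mapping sample name -> FASTQ lines of the demultiplexed R1 reads
    r2_lines: FASTQ lines of the original, undemultiplexed R2 file
    Returns a dict mapping sample name (or "unmatched") -> list of R2 records.
    """
    id_to_sample = {}
    for sample, lines in r1_bins.items():
        for read_id, _ in read_fastq(lines):
            id_to_sample[read_id] = sample

    r2_bins = {}
    for read_id, record in read_fastq(r2_lines):
        sample = id_to_sample.get(read_id, "unmatched")
        r2_bins.setdefault(sample, []).append(record)
    return r2_bins


# Toy data (hypothetical reads and sample names)
r1_bins = {
    "sample1": ["@read1/1", "ACGT", "+", "IIII"],
    "sample2": ["@read2/1", "TTTT", "+", "IIII"],
}
r2_lines = [
    "@read1/2", "GGCC", "+", "IIII",
    "@read2/2", "AAAA", "+", "IIII",
    "@read3/2", "CCCC", "+", "IIII",
]
r2_bins = assign_r2_by_mate(r1_bins, r2_lines)
for sample, records in r2_bins.items():
    print(sample, [rec[0] for rec in records])
```

Here read3 has no demultiplexed R1 mate, so it lands in "unmatched", while the other two R2 reads follow their mates.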

Thanks! :slight_smile: