I recently inherited a targeted sequencing project that uses inline demultiplexing to separate reads by gene. Currently, an in-house tool (written in Fortran) handles this, but it relies on hardcoded paths and a rigid folder structure, and the original developer has since left. To modernize the workflow, I’d like to migrate this step to our internal Galaxy server and update the pipeline.
However, I’ve encountered a challenge: the barcodes have variable start positions. For example:
A read belongs to Barcode 1 (ACGT) if the sequence appears at positions 2–6.
A read belongs to Barcode 2 (TCGA) if the sequence appears at positions 8–12.
I’ve tested several tools, including Je demultiplex, Cutadapt, Flexbar, and Stacks2, but none seem capable of handling this (or I haven’t configured them correctly). Most tools assume barcodes are at fixed positions, while Flexbar’s sliding-window approach risks false positives (e.g., matching “ACGT” at positions 3–7 instead of 2–6).
Question: Do you know of any Galaxy-compatible tools that can demultiplex reads based on variable barcode positions? If not, I’ll either:
Modify the existing Fortran code, or
Rewrite the logic in a modern language (e.g., Python/R).
I’d prefer to avoid reinventing the wheel if a suitable tool already exists. Any suggestions would be greatly appreciated!
What about using UMI-tools first? It supports regular expressions (if I am understanding correctly!), which I think you will need here if Cutadapt couldn’t handle the positional requirements and Flexbar was too greedy. An existing tool also makes the decision about “which barcode is best” when a read could match two or more based on your match thresholds, which means that you don’t have to! But you’d need to test to see if you like the decision…
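To make the regular-expression idea concrete, here is a rough sketch in plain Python (the standard `re` module, not UMI-tools itself, so the pattern syntax you give UMI-tools may need adjusting after checking its documentation). It assumes positions are 1-based and that each barcode must start exactly at the first position listed in the question (2 for ACGT, 8 for TCGA). The names `BARCODES`, `build_patterns`, and `classify` are made up for illustration.

```python
import re

# Hypothetical barcode table: name -> (barcode sequence, 1-based start position).
# The start positions are an assumption taken from the example in the question.
BARCODES = {
    "barcode1": ("ACGT", 2),
    "barcode2": ("TCGA", 8),
}

def build_patterns(barcodes):
    """Compile one anchored pattern per barcode: skip start-1 bases, then match."""
    return {
        name: re.compile(r"^.{%d}(?P<bc>%s)" % (start - 1, re.escape(seq)))
        for name, (seq, start) in barcodes.items()
    }

def classify(read, patterns):
    """Return the names of all barcodes found at their required position."""
    return [name for name, pat in patterns.items() if pat.match(read)]

PATTERNS = build_patterns(BARCODES)

print(classify("NACGTTTTTTTT", PATTERNS))  # ['barcode1']: ACGT starts at position 2
print(classify("NNACGTTTTTTT", PATTERNS))  # []: ACGT starts at position 3, so no match
```

Because each pattern is anchored with `^` and a fixed-length skip, a barcode shifted by one base is rejected rather than matched, which is the false positive described for the sliding-window approach. A read that returns more than one name is ambiguous and still needs a tie-break rule.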
As this would move the barcode into the sequence descriptions, those could then be parsed out to actually split the sequences into collection folders afterward. The number of intermediate tools doesn’t really matter, since all of this could be placed into a simple workflow. I’d suggest looking at the SeqKit tools to see if these will do what you need.
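For the parsing step, the logic is roughly the following. This is only a sketch of the idea, not what SeqKit or any Galaxy tool does internally. It assumes the extracted barcode ends up as the last underscore-separated piece of the read ID (for example `@r1_ACGT`); check the actual header layout your extraction step produces. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import islice

def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        yield tuple(record)

def barcode_from_header(header):
    """Assumed layout: '@READID_BARCODE optional comment'."""
    read_id = header[1:].split()[0]
    return read_id.rsplit("_", 1)[-1] if "_" in read_id else "unassigned"

def split_by_barcode(lines):
    """Group FASTQ records by the barcode tag found in each header."""
    groups = defaultdict(list)
    for record in read_fastq(lines):
        groups[barcode_from_header(record[0])].append(record)
    return dict(groups)

example = [
    "@r1_ACGT\n", "NACGTAAA\n", "+\n", "IIIIIIII\n",
    "@r2_TCGA extra\n", "NNNNNNNTCGA\n", "+\n", "IIIIIIIIIII\n",
    "@r3\n", "GGGG\n", "+\n", "IIII\n",
]
groups = split_by_barcode(example)
print(sorted(groups))  # ['ACGT', 'TCGA', 'unassigned']
```

In a workflow, each group would become one dataset in the output collection. Read IDs that already contain underscores for other reasons would be misread by this simple split, so the separator is worth confirming first.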
Another option I can think of is running Barcode splitter multiple times, one round per barcode, then adding some comparison logic to resolve false positives between the runs. This means you would be making the “which is best” decision yourself, adjusting it until the result matches your truth set. This seems tedious.
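The comparison logic between runs could be as small as this sketch. It assumes you can get, for each per-barcode run, the set of read IDs that run assigned to its barcode; `resolve_runs` and the example IDs are hypothetical.

```python
from collections import defaultdict

def resolve_runs(runs):
    """runs: dict of barcode name -> set of read IDs assigned in that run.

    Returns (unique, conflicts): reads claimed by exactly one barcode,
    and reads claimed by several barcodes that need a manual rule.
    """
    hits = defaultdict(list)
    for barcode, read_ids in runs.items():
        for read_id in read_ids:
            hits[read_id].append(barcode)
    unique = {rid: bcs[0] for rid, bcs in hits.items() if len(bcs) == 1}
    conflicts = {rid: sorted(bcs) for rid, bcs in hits.items() if len(bcs) > 1}
    return unique, conflicts

runs = {
    "barcode1": {"r1", "r2"},
    "barcode2": {"r2", "r3"},
}
unique, conflicts = resolve_runs(runs)
print(unique)     # r1 -> barcode1, r3 -> barcode2
print(conflicts)  # r2 was claimed by both runs
```

The reads in `conflicts` are where you would apply your own rule (for example, preferring the barcode whose position matches exactly) and then compare against the truth set.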
Finally, if you do decide to come up with your own hybrid tool (standalone, or based on one of the others?), you could wrap that into a Galaxy tool. This sounds like what you were planning anyway as a fallback. With one extra step, you only need to rewrite one tool and put the Galaxy wrapper around it; then you can leverage the remainder of Galaxy’s tools and workflow engine for the full protocol and ongoing work.
Writing new or customizing existing tools is pretty common! And it is part of why we have so many tools! Someone has a goal, pulls in the new components, then everyone can use them. The Biocontainer could be used by people outside of the Galaxy ecosystem too. We have a package to help with the wrapping step plus tutorials.