FASTQ Barcode Splitter

Hi Admin/Team,

Could you please install the FASTQ Barcode Splitter tool from the Galaxy Tool Shed?
I’m demultiplexing Illumina dual-index RNA-seq where the barcodes are only in the FASTQ header (e.g., ... 1:N:0:GTAGTTCTGA+CCACTTAACA). I specifically need the mode:

  • Barcode source: Barcode is in the read header

  • Paired-end data: Yes

Tool details

  • Tool Shed: toolshed.g2.bx.psu.edu

  • Repository owner: peterjc

  • Repository: fastq_barcode_splitter

  • Tool ID: fastq_barcode_splitter

This lets me demux R1+R2 while keeping mates synchronized (mismatches=0, output unmatched). The standard “Barcode Splitter”/Je-Demultiplex won’t work because my barcodes are not in the read sequence.

Nice-to-have (optional)

  • bbmap (for repair.sh / “Repair reads in FASTQ”) to enforce perfect pairing post-demux.

  • sortmerna (for rRNA depletion with RFAM/SILVA DBs).

Welcome @aaku7

Would this tool combination work for you as an alternative?

Then search the tool panel for the other tools – the EU server and most of the other UseGalaxy servers have these, plus the intermediate tools you’ll likely want.

Be sure to scroll down on tool forms to see a short Help guide and link-outs to how the original tools are used, plus :graduation_cap: GTN Tutorials https://training.galaxyproject.org/ (if that tool happens to be included in one – you can also search the site directly by Categories/Domain).

If you are currently working at a different Galaxy server – contacting the administrators directly is how to request tools. Please let us know the URL if you cannot find the contact and we can try to help.

And, a reminder that you can move data between servers if you need to – all by URL. Many have large self-serve extended quota options.

Hope this helps! :slight_smile:

Thanks for your reply!
In my run the dual indexes are only in the FASTQ header (e.g. … 1:N:0:GTAGTTCTGA+CCACTTAACA), not inline in the read sequences, and I don’t have I1/I2 index FASTQs. So extract_barcodes + split_libraries won’t apply.
Could you please add either FASTQ Barcode Splitter (ToolShed repo peterjc/fastq_barcode_splitter, header-mode with paired-end) or SeqKit grep (iuc/seqkit_grep) so I can demultiplex by header while keeping pairs synchronized?
I can share my history link if helpful.

Hi @aaku7

I couldn’t find the other tool you are referencing with this functionality in the Main ToolShed or the Test ToolShed. These are where the tools installed at the public UseGalaxy servers are hosted. Was it in another ToolShed?

We do have this other barcode splitting tool but it requires a separate barcode file and doesn’t read the sequence @ lines.

If you can provide more details, I can look for it again. Maybe we can get it published over to a production ToolShed. But I’d also like to get you a solution that will work sooner than a tool installation, a tool update, or a newly wrapped tool, since those all take variable amounts of time to complete.

This is why I suggested the Format Fastq sequences tool. It can parse out the sequence @ lines. I originally was guessing that you might have been mixing up the request. (But it seems I was wrong, so I’m glad you clarified!).

Big picture, text data is never trapped in any particular format. Another way to do this is with direct reformatting. This would probably start with a conversion from the fastq format to tabular, then some parsing after that to get the content put into other formats (any that you need for your choice of barcode splitting tool). The processing could go into a mini workflow that can be used sort of like a single tool (by hiding intermediate outputs).

Something like this:

Please give that a review, and if you want to share an example of what your file looks like now (the first few sequences of the pair are enough) and explain which tool you want to use next (so we can learn the expected converted file format), we can help with more specificity. If the data is in a shared history this will be easier to test – to make sure the recommendations work! FAQ: Sharing your History

Thanks! :slight_smile:

Thanks again for your reply!

My run has dual indexes only in the FASTQ header (the “@” line), e.g.:

@LH00504:305:233TJYLT3:4:1101:1065:1080 1:N:0:GTAGTTCTGA+CCACTTAACA
@LH00504:305:233TJYLT3:4:1101:1065:1080 2:N:0:GTAGTTCTGA+CCACTTAACA

There are no inline barcodes in the sequences and I don’t have I1/I2 index FASTQs, so the extract_barcodes / split_libraries tools won’t work for this case.
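For illustration, here is a minimal Python sketch of pulling the index pair out of a header line like the two above. It runs outside Galaxy and is not how any of the tools in this thread work internally. The regular expression assumes the CASAVA 1.8+ style layout shown in my example, and the name `parse_header` is made up for this sketch.

```python
import re

# Assumed header layout (taken from the example headers in this post):
#   @<read id> <mate>:<filter flag>:<control bits>:<index1>+<index2>
HEADER_RE = re.compile(
    r"^@(?P<read_id>\S+)\s+(?P<mate>[12]):(?P<filtered>[YN]):"
    r"(?P<control>\d+):(?P<barcode>\S+)$"
)


def parse_header(line):
    """Return (read_id, mate, barcode), or None if the line does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    return match.group("read_id"), int(match.group("mate")), match.group("barcode")


r1 = "@LH00504:305:233TJYLT3:4:1101:1065:1080 1:N:0:GTAGTTCTGA+CCACTTAACA"
r2 = "@LH00504:305:233TJYLT3:4:1101:1065:1080 2:N:0:GTAGTTCTGA+CCACTTAACA"

print(parse_header(r1))
print(parse_header(r2))
```

Both mates share the same read ID and the same index pair; only the mate number differs, which is what makes header-based demultiplexing of pairs possible.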

The only options are:

  1. FASTQ Barcode Splitter (supports “Barcode is in the read header” + paired-end)

    • ToolShed owner/repo: toolshed.g2.bx.psu.edu/repos/peterjc/fastq_barcode_splitter

    • Wrapper by Peter Cock.

or (equally fine for me)

  2. SeqKit grep (lets me filter by header, I’ll maintain pairing by extracting the same IDs from R1 and R2)

    • ToolShed owner/repo: toolshed.g2.bx.psu.edu/repos/iuc/seqkit_grep

I can also provide a 200k-read test subset of R1/R2 if helpful.

In the meantime I’m demultiplexing by building read-ID lists from R1 headers and using seqtk subseq on both R1 and R2 with the same list; that works but is slow as a single job. I can parallelize by chunking collections until a header-based demux tool is available.
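The ID-list step of that workaround can be sketched roughly as follows. This is only an illustration in plain Python, assuming strict 4-line FASTQ records held in a list and the barcode as the last colon-separated field of the header. The function name `ids_by_barcode`, the sample name `sample_A`, and the second record are invented for the example.

```python
from collections import defaultdict


def ids_by_barcode(fastq_lines, sample_barcodes):
    """Group read IDs by exact barcode match (no mismatches allowed).

    fastq_lines must be a list of lines from strict 4-line FASTQ records.
    Reads whose barcode is not in sample_barcodes go to "unmatched".
    """
    groups = defaultdict(list)
    # Take every 4th line so quality lines starting with "@" are never
    # mistaken for headers.
    for header in fastq_lines[0::4]:
        read_id, _, comment = header.rstrip("\n")[1:].partition(" ")
        barcode = comment.rsplit(":", 1)[-1]
        sample = sample_barcodes.get(barcode, "unmatched")
        groups[sample].append(read_id)
    return dict(groups)


r1_lines = [
    "@LH00504:305:233TJYLT3:4:1101:1065:1080 1:N:0:GTAGTTCTGA+CCACTTAACA",
    "ACGT", "+", "IIII",  # made-up bases and qualities
    "@LH00504:305:233TJYLT3:4:1101:1065:1081 1:N:0:AAAAAAAAAA+CCCCCCCCCC",  # invented record
    "TTGA", "+", "@III",  # quality line starting with "@" on purpose
]
samples = {"GTAGTTCTGA+CCACTTAACA": "sample_A"}  # hypothetical sample sheet

groups = ids_by_barcode(r1_lines, samples)
for sample, ids in groups.items():
    print(sample, ids)
```

Each list, written one ID per line, is what gets passed to seqtk subseq for both R1 and R2. Since the IDs are cut at the first space, the mate number is not part of them, so the same list selects both mates.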

Thanks a lot!

Hi @aaku7

Thanks for sharing the example!

I still think the tool I was recommending will do exactly what you want. The “barcode is the label” option means in the sequence @ lines, not where the remainder of the bases are in the actual sequence, which seems to be exactly what you are asking about. Please see the example here → Extracting Barcodes from fastq data for compatibility with split_libraries_fastq.py — Homepage

What you will need:

What the Format Fastq sequences tool parses:

So, while these have slightly different Illumina @ formatted lines, the tool allows for some customization of how many characters to parse. Worst case, converting to the earlier Illumina header format is possible, but I don’t think you’ll need it yet.

To be clear and where I think we may be talking about slightly different things: while your sequences do not currently have the barcode in the sequence directly yet, I think we can get your data manipulated into a format that can be split by getting the barcode out into a text file. All data would have the common sequence IDs for the remainder of the attributes, which I think is the place to start.
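As a rough sketch of that idea (plain Python, outside Galaxy, and not what the Format Fastq sequences tool does internally), the barcode can be separated into its own column while the shared sequence ID is kept on every row. This assumes strict 4-line FASTQ records and the barcode as the last colon-separated field of the @ line. The function name `fastq_to_rows` and the bases and qualities in the example are made up.

```python
def fastq_to_rows(lines):
    """Yield (read_id, barcode, sequence, quality) for each 4-line FASTQ record."""
    stripped = [line.rstrip("\n") for line in lines]
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, _plus, quality = stripped[i:i + 4]
        # The ID is everything after "@" up to the first space.
        read_id, _, comment = header[1:].partition(" ")
        # The barcode is assumed to be the last ":"-separated field.
        barcode = comment.rsplit(":", 1)[-1] if comment else ""
        yield read_id, barcode, sequence, quality


example = [
    "@LH00504:305:233TJYLT3:4:1101:1065:1080 1:N:0:GTAGTTCTGA+CCACTTAACA",
    "ACGTACGT",  # made-up bases
    "+",
    "IIIIIIII",  # made-up qualities
]

rows = list(fastq_to_rows(example))
for row in rows:
    print("\t".join(row))  # one tab-separated line per read
```

The first two columns alone give an ID-to-barcode text table, and the full rows can be turned back into FASTQ after splitting.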

And this would be great! I’ll experiment then share back the example of the parsing that I think will work for you using this tool.

And for this part, could you share the link to the repository in the ToolShed? I’m not able to locate it. The link will be on the top level card labeled as Link to this repository. I can check for tools that may be hidden or deprecated with that info. The base tool_id does not appear to be currently valid, but that is part of what I am confirming.

Thanks! :slight_smile:

Thanks again for your reply, it all makes sense now. I will give this a go and will ask again if I get stuck. I will share my history link via email. Thanks a lot!
