Single-cell barcode extraction for STRT-seq

This answer was adapted from the Gitter chat:

There are two things you need to know:

  1. What your barcode format is, so you can extract the barcodes out of the sequences
  2. Whether your protocol has a list of expected barcodes (a barcode file), to check your extracted barcodes are correct.

When you extract barcodes from your reads, you may get some false positives, which inflate the apparent number of cells in your sample. The barcode file handles this: extracted barcodes that are not in the list are either filtered out or clustered to the expected barcodes that should exist in your sample.
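
As a rough sketch of what that filtering/clustering step does, assuming a simple "exact match, or exactly one expected barcode within one mismatch" rule: this is not how any particular tool implements it, and the whitelist, the observed barcodes, and the function name `assign_to_whitelist` are all made up for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_to_whitelist(observed, whitelist, max_mismatches=1):
    """Count reads per expected barcode (hypothetical helper).

    Exact matches are kept, near matches are corrected to the single
    closest expected barcode, and everything else is discarded.
    """
    counts = Counter()
    for bc in observed:
        if bc in whitelist:
            counts[bc] += 1
            continue
        close = [
            w for w in whitelist
            if len(w) == len(bc) and hamming(bc, w) <= max_mismatches
        ]
        if len(close) == 1:  # unambiguous correction
            counts[close[0]] += 1
        # no match, or an ambiguous one: treated as a false positive

    return counts


# Made-up 8 bp barcodes, for illustration only.
whitelist = {"AAAAAAAA", "CCCCCCCC", "GGGGGGGG"}
observed = ["AAAAAAAA", "AAAAAAAT", "CCCCCCCC",
            "ACGTACGT", "GGGGGGGG", "GGGGGGGA"]

cell_counts = assign_to_whitelist(observed, whitelist)
print(dict(cell_counts))
```

Here six reads carry five distinct barcodes, but only three cells come out: two barcodes are one mismatch away from an expected barcode and get merged into it, and `ACGTACGT` is dropped.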

  • The barcode format is usually specific to the protocol. For STRT-seq it is given on page 15 of their paper (https://doi.org/10.1007/978-1-4939-9240-9_9). From what I can understand of Step 7, the format is:
    • AATGATACGGCGACCACCGATNNNNNNGGGXX..XXCTGTCTCTTATACACATCTGACGCXXXXXXXXTCGTATGCCGTCTTCTGCTTG
      • i.e. (21bp of sequence) + (6bp UMI) + (variable number of bp of sequence) + “GACGC” + (8bp cell barcode) + (variable number of bp of sequence)
    • To extract this, you would need to provide UMI-tools extract (in regex mode) with a regular expression pattern like:
      • (.{21})(?P<umi_1>.{6})(.*)(?P<discard_1>GACGC)(?P<cell_1>.{8})(.*) (notice that each parenthesised group in the pattern mirrors one of the parenthesised parts of the format above)
  • The barcode file, which contains the list of true barcodes, should be something you get from the facility that runs the STRT-seq protocol. I am not too familiar with this protocol, so it is possible that there isn’t a barcodes file, in which case you would need to filter out false positives with a tool such as DropletUtils (also in Galaxy).
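
If you want to sanity-check the pattern before running it on real data, you can try it on a mock read. The sketch below uses Python's built-in `re` module as a stand-in, so UMI-tools itself may differ in the details. The read is invented: the UMI (`ACGTAC`), the insert (`TTTTTTTTTT`) and the cell barcode (`AACCGGTT`) are placeholders dropped into the layout above.

```python
import re

# Same pattern as above; the named groups follow the UMI-tools
# convention (umi_*, cell_*, discard_*).
PATTERN = re.compile(
    r"(.{21})(?P<umi_1>.{6})(.*)(?P<discard_1>GACGC)(?P<cell_1>.{8})(.*)"
)

# Mock read built from the layout above (placeholder UMI, insert, barcode).
read = (
    "AATGATACGGCGACCACCGAT"        # 21 bp of fixed sequence
    "ACGTAC"                       # 6 bp UMI (made up)
    "GGG" "TTTTTTTTTT"             # GGG + insert (made up)
    "CTGTCTCTTATACACATCTGACGC"     # adapter ending in GACGC
    "AACCGGTT"                     # 8 bp cell barcode (made up)
    "TCGTATGCCGTCTTCTGCTTG"        # trailing fixed sequence
)

m = PATTERN.match(read)
if m is None:
    raise SystemExit("pattern did not match the mock read")

umi = m.group("umi_1")
cell = m.group("cell_1")
print("UMI:", umi)
print("Cell barcode:", cell)
```

One thing to keep in mind: the `(.*)` before `GACGC` is greedy, so if `GACGC` occurs more than once in a read, the match anchors on the last occurrence that still has at least 8 bases after it.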