Single-cell barcode extraction for STRT-seq

I have tried to run the single-cell pre-processing workflow, but it does not work for my data. I think my data needs a special kind of upload. Does anybody have experience with this issue?
How should we design the pattern for barcode extraction?

Hi @sara,
could you give me more information about your data?

Hi,
It is single-end data generated with the STRT-seq protocol. I have studied the Galaxy training material, but its data is completely different from mine. First, I uploaded my FASTQ data in the normal way, but I think it should be uploaded differently. Second, for barcode extraction I don't know how to make the barcode file or how to design the pattern for UMI-tools extract. Here is the GEO accession number of my data, if it helps: GSE76381
Thanks for your help.

This answer was adapted from the Gitter chat:

So there are two things you need to know:

  1. What your barcode format is, so you can extract the barcodes out of the sequences
  2. Whether your protocol has a list of expected barcodes (a barcode file), to check your extracted barcodes are correct.

When you extract the barcodes from your reads, you may get some false positives that inflate the number of cells in your sample. The barcode file handles that: extracted barcodes that are not on the list are either filtered out or clustered to the expected barcodes that should exist in your sample.
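
To make the filtering idea concrete, here is a minimal Python sketch of matching an extracted barcode against a list of expected barcodes. The whitelist entries, the function names and the one-mismatch tolerance are all made up for illustration, and this is not necessarily how UMI-tools or any Galaxy tool implements it.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(
    observed: str, whitelist: Iterable[str], max_mismatches: int = 1
) -> Optional[str]:
    """Return the expected barcode for `observed`, or None to discard it.

    Exact matches are kept. Otherwise the barcode is rescued only if exactly
    one whitelist entry lies within `max_mismatches` (an assumed tolerance).
    """
    expected = set(whitelist)
    if observed in expected:
        return observed
    candidates = [
        bc
        for bc in expected
        if len(bc) == len(observed) and hamming(observed, bc) <= max_mismatches
    ]
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical whitelist; a real one would hold the protocol's barcodes.
WHITELIST = {"ATCACGTT", "CGATGTTT", "TTAGGCAT"}

for barcode in ("ATCACGTT", "ATCACGTA", "GGGGGGGG"):
    print(barcode, "->", correct_barcode(barcode, WHITELIST))
```

Requiring a unique nearest match is a deliberately conservative choice here: an ambiguous barcode is dropped rather than assigned to an arbitrary cell.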

  • The barcode format is usually specific to the protocol. For STRT-seq it is given on page 15 of their paper (https://doi.org/10.1007/978-1-4939-9240-9_9). From what I can understand of Step 7, the format is:
    • AATGATACGGCGACCACCGATNNNNNNGGGXX..XXCTGTCTCTTATACACATCTGACGCXXXXXXXXTCGTATGCCGTCTTCTGCTTG
      • i.e. (21 bp of sequence) + (6 bp UMI) + (variable number of bp of sequence) + “GACGC” + (8 bp cell barcode) + (variable number of bp of sequence)
    • To extract this, you would need to provide UMI-tools extract with a regular expression pattern like:
      • (.{21})(?P<umi_1>.{6})(.*)(?P<discard_1>GACGC)(?P<cell_1>.{8})(.*) (notice that each parenthesized group here mirrors one of the groups in the format above)
  • The barcode file, which contains the list of true barcodes, should be something you get from the facility that runs the STRT-seq protocol. I am not too familiar with this protocol, so it is possible that there isn’t a barcode file, in which case you would need to filter out false positives with a filtering tool such as DropletUtils (also in Galaxy).
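
To see what the regular expression above captures, here is a small Python sketch that applies it to a synthetic read. It uses Python's built-in re module purely to illustrate the named groups. The example read, the UMI and the cell barcode are invented for the demonstration, and parse_read is a hypothetical helper, not part of UMI-tools.

```python
import re

# Pattern copied from the answer above; the named groups mark UMI and cell barcode.
PATTERN = re.compile(
    r"(.{21})(?P<umi_1>.{6})(.*)(?P<discard_1>GACGC)(?P<cell_1>.{8})(.*)"
)


def parse_read(sequence):
    """Return the captured UMI and cell barcode, or None if the read does not match."""
    match = PATTERN.match(sequence)
    if match is None:
        return None
    return {"umi": match.group("umi_1"), "cell": match.group("cell_1")}


# Synthetic read assembled from the construct described above (values are made up).
example_read = (
    "AATGATACGGCGACCACCGAT"         # 21 bp of leading sequence
    + "ACGTAC"                      # 6 bp UMI
    + "GGG" + "TTTTCCCCAAAA"        # variable stretch of sequence
    + "CTGTCTCTTATACACATCTGACGC"    # sequence ending in the GACGC anchor
    + "ATCACGTT"                    # 8 bp cell barcode
    + "TCGTATGCCGTCTTCTGCTTG"       # trailing sequence
)

print(parse_read(example_read))
print(parse_read("ACGT"))  # too short to match
```

Note that (.*) is greedy, so if GACGC appeared more than once in a read, the last occurrence that still has at least 8 bases after it would be treated as the anchor.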

Hi @mtekman, there is a list of barcodes in the supplementary material of that paper, but I’m not sure if they could be used for generating a barcode file.

Oh nice! This does indeed look like a barcodes file – how many are there?

Yes, it is. That was one of the best answers I have received lately. Thanks a lot.

Thank you so much for your clear explanation. I have understood the barcode format, but I still have a problem with the pattern. UMI-tools extract asks me to write the pattern using Ns and Xs. How should I write it for this example?

There are 96 barcodes.

Hi @sara,
you can try this workflow: https://usegalaxy.eu/u/gallardoalba/w/pre-processing-of-scrna-seq-strt-c1-data

There is a strange read-name issue between UMI-tools extract and RNA STAR: FASTQ read names such as @123456_c/1_AAAAA_TTTT are erroneously cut off after the / in the read names that RNA STAR reports.

Here is a workflow which tries to fix that:

https://usegalaxy.eu/u/mehmet-tekman/w/strt-seq-workflow-with-barcodes-rename-sequences
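
For anyone who wants to see the idea outside Galaxy, here is a rough Python sketch of one possible workaround: replacing the / in each FASTQ header so that the barcode and UMI suffix stays attached to the read name. This is only an assumption about how such a fix could look and is not necessarily what the workflow above does. The function name and the replacement character are hypothetical, and the sketch assumes plain four-line FASTQ records.

```python
from typing import Iterable, Iterator


def sanitize_fastq_headers(lines: Iterable[str], replacement: str = "-") -> Iterator[str]:
    """Replace '/' in FASTQ header lines only (every 4th line, starting at the first).

    Sequence, separator and quality lines are passed through untouched, since
    '/' is also a valid quality character.
    """
    for index, line in enumerate(lines):
        if index % 4 == 0:
            yield line.replace("/", replacement)
        else:
            yield line


# One made-up record using the read name from the post above.
record = ["@123456_c/1_AAAAA_TTTT", "ACGT", "+", "II/I"]
for line in sanitize_fastq_headers(record):
    print(line)
```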
