Matching NGS sequence reads to a DNA library database plus barcode counts


I have built a library of synthetic promoters with approximately 2000 different sequences, all of the same length, 200 bp. I now have paired-end reads, one file for Read 1 and another for Read 2, as fastq.gz files. The first 8 bp are of no interest, the following 8 bp are my barcode, and the rest is to be matched against a fasta file that contains the 2000 sequences of interest with their IDs in the defline. I want to obtain a table containing all 2000 sequence IDs from the fasta file, with the total number of reads matched to each sequence ID and a count of the distinct barcodes found among the reads matching each sequence ID. So it would look something like: SeqID01 - 10000 reads - 200 different barcodes. Clarification: I do not have a file with all the barcodes; they were added randomly.
How would I do this in Galaxy?


Hi @ytorres

There isn’t a single tool that I know of, but you could probably come up with a way to do this by chaining several tools together. This is how all bioinformatics was done originally, and yes, I definitely did it the hard way back then!

Your pairs are the complicating part. How are the “10000” reads to be counted? Either end? Only intact pairs? Will both ends of a pair have the same barcode or different ones?

SeqID01 - 10000 reads - 200 different barcodes.

Then, does something like this sound correct?

  1. Extract the barcodes from data1 (read1+read2, fastq)
  2. Extract the reads from data1 with all of the artifacts removed, including the “8 bp are of no interest” and the “following 8 bp are my barcode”. Are the barcodes always at the 5’ end of either read?
  3. Compare/map reads from data2 (fasta file) to reads in data1 (fastq, pairs or singles). In practice the mapping would probably go the other way around, but tables can be rearranged.
  4. Merge back in the barcodes for data1 reads into that table, where data2 is the primary key.
  5. Do some math: count up how many barcodes and reads per fasta.
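
Outside of Galaxy, the five steps above can be sketched in a few lines of Python. This is only a rough illustration, not a tested pipeline: the function name `count_reads_and_barcodes` and its parameters are made up for this example, the reads are assumed to be plain sequence strings already, and matching is done by exact string equality, which real data with sequencing errors would not satisfy (a mapper that tolerates mismatches would be needed instead).

```python
def count_reads_and_barcodes(reads, library, skip=8, barcode_len=8):
    """Count reads and distinct barcodes per library sequence.

    reads: iterable of read sequences (str).
    library: dict mapping sequence ID -> sequence (as read from the fasta).
    skip / barcode_len: layout from the question (8 bp ignored, 8 bp barcode).
    Assumption: the remainder of the read equals a library sequence exactly.
    """
    # Reverse index: library sequence -> ID, for exact lookup.
    seq_to_id = {seq.upper(): seq_id for seq_id, seq in library.items()}

    # Start every ID at zero so that all IDs appear in the output table.
    read_counts = {seq_id: 0 for seq_id in library}
    barcodes = {seq_id: set() for seq_id in library}

    for read in reads:
        read = read.upper()
        barcode = read[skip:skip + barcode_len]   # step 1: extract barcode
        insert = read[skip + barcode_len:]        # step 2: strip the prefix
        seq_id = seq_to_id.get(insert)            # step 3: match to the fasta
        if seq_id is None:
            continue                              # unmatched reads are dropped
        read_counts[seq_id] += 1                  # steps 4-5: tally per ID
        barcodes[seq_id].add(barcode)

    # One row per library ID: (ID, total reads, distinct barcodes).
    return [(seq_id, read_counts[seq_id], len(barcodes[seq_id]))
            for seq_id in library]
```

Each returned row corresponds to one line of the requested table, e.g. `("SeqID01", 10000, 200)` for "SeqID01 - 10000 reads - 200 different barcodes". Using a set per ID is what makes the barcode count a count of distinct barcodes rather than of reads.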

Let’s start there :slight_smile:

I have paired-end reads, each read is 150 bp, and the reads have 78 bp of overlap, therefore yielding a total sequence length of 222 bp when merged. The barcode is only present at the 5’ end of the sequence (only in read 1) and is located at positions 27-34. My sequence of interest is what comes after the barcode.
Should I merge the reads first, then proceed with the pipeline described above?
Thanks for the insightful response!

Hi @ytorres

Oh yes, that makes it so much simpler! Yes, I would merge first, then proceed.
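
Once the pairs are merged, pulling the barcode out of positions 27-34 is a simple slice. Here is a minimal Python sketch, assuming the merged reads are in a gzipped FASTQ file with standard four-line records and that the positions are 1-based and inclusive. The names `read_fastq_sequences` and `split_merged_read` are hypothetical helpers for illustration, not Galaxy tools.

```python
import gzip


def read_fastq_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record in a gzipped file."""
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of every record
                yield line.strip()


def split_merged_read(seq, barcode_start=27, barcode_end=34):
    """Split a merged read into (barcode, insert).

    barcode_start / barcode_end are 1-based inclusive positions, as given
    in the thread; everything after the barcode is returned as the insert.
    """
    barcode = seq[barcode_start - 1:barcode_end]
    insert = seq[barcode_end:]
    return barcode, insert
```

For a 222 bp merged read this gives an 8 bp barcode and the remaining 188 bp. Whether that remainder lines up end to end with a 200 bp library entry depends on the construct design, which is not specified here, so the matching step may need to allow partial or mismatch-tolerant alignment.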