Sequence identification from pool of NGS sequence reads


We have a pool of approximately 600,000 NGS reads, each about 150 bp long (give or take a few bases).

From this pool, we want to count how many times a given sequence occurs. Is there a way to do this using Galaxy?

I should add that we have approximately 12 reference sequences, and we want to measure how prevalent each one is in the pool.

Hi @harriet_lahiff,
one option is to use Seqkit locate.
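
To make the idea concrete, here is a minimal Python sketch of the kind of search involved: it counts how many reads contain each reference sequence as an exact substring on either strand. It is not how SeqKit works internally, and the function names, the reference names (`ref_A`, `ref_B`) and the toy reads are all invented for illustration. It also assumes exact matches only, so reads with sequencing errors inside the reference region would be missed.

```python
from collections import Counter


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(complement)[::-1]


def count_reads_containing(reads, references):
    """Count, per reference name, how many reads contain it on either strand.

    Each read is counted at most once per reference, even if the
    reference occurs several times within that read.
    """
    counts = Counter({name: 0 for name in references})
    for read in reads:
        read = read.upper()
        for name, ref in references.items():
            ref = ref.upper()
            if ref in read or reverse_complement(ref) in read:
                counts[name] += 1
    return counts


# Toy data, invented for illustration only (hypothetical names and sequences).
reads = ["TTACGTACGGA", "CCGTACGTAA", "GGGGGGGG"]
references = {"ref_A": "ACGTACG", "ref_B": "CCCC"}

counts = count_reads_containing(reads, references)
for name, n in counts.items():
    print(name, n)
```

With roughly 600,000 reads and 12 references, a plain loop like this should be workable, but a dedicated tool is likely the better choice if you need mismatch tolerance or match positions.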



Thanks for your response.

Our data files are CSV files in which each sequencing read is listed as a row. Do you have any suggestions for using this format, given that the tool requires FASTA.GZ files?

Many thanks

Hi @harriet_lahiff

The .csv datatype is a “comma-separated values” data file.

Try this:

  1. Convert the “comma-separated” file to “tab-separated” (tabular) format (some tools will do this directly at runtime)
  2. Convert the tabular data to FASTQ format
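
If you would rather prepare the file before uploading it, the same conversion can be sketched in Python. This is only a rough sketch under assumptions that may not match your file: it assumes the sequence sits in the first column (`seq_column=0`), that there is no header row, and it writes FASTA rather than FASTQ because FASTQ also needs a quality string for every read. The function name `csv_to_fasta` and the `read_N` identifiers are made up for this example.

```python
import csv
import io


def csv_to_fasta(csv_handle, fasta_handle, seq_column=0, has_header=False):
    """Write one FASTA record per CSV row and return the number written.

    seq_column and has_header are assumptions about the file layout;
    adjust them to match the real data.
    """
    reader = csv.reader(csv_handle)
    if has_header:
        next(reader, None)  # skip the header row if there is one
    written = 0
    for row in reader:
        if len(row) <= seq_column:
            continue  # row too short to hold a sequence
        seq = row[seq_column].strip().upper()
        if not seq:
            continue  # skip empty sequences
        written += 1
        fasta_handle.write(f">read_{written}\n{seq}\n")
    return written


# Small in-memory example; real use would pass open file handles instead.
example_csv = io.StringIO("acgtacgt,sample1\nTTGCA,sample1\n")
out = io.StringIO()
n_written = csv_to_fasta(example_csv, out)
fasta_text = out.getvalue()
print(fasta_text)
```

For real files, open the input with `open(path, newline="")` as the `csv` module documentation recommends, and compress the output afterwards if the downstream tool expects a .gz file.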

Tutorial → NGS data logistics