filtering a single-end fastq.gz collection by number of reads

Hi,

We have a single-end fastq collection with thousands of files and want to keep only the files that have more than 200 reads. Is there a way to do this with existing tools in Galaxy?

Thanks!
Saurabh

Hi @microfuge

There isn’t an exact tool for this, but you could string together multiple tools into a workflow to do it.

The process will be something like: count the number of lines per file, filter on the line-count values, capture the identifiers of the elements/files that pass the filter, then filter the original collection with those identifiers.
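Outside Galaxy, the same steps can be sketched in plain shell. This is only a hypothetical illustration of the logic (not a Galaxy tool): it assumes local access to the gzipped files and relies on the fact that one FASTQ record spans exactly 4 lines.

```shell
# Count reads in a gzipped FASTQ file: one record = 4 lines.
count_reads() {
  echo $(( $(zcat "$1" | wc -l) / 4 ))
}

# Print the names of files with more than 200 reads.
for f in *.fastq.gz; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  if [ "$(count_reads "$f")" -gt 200 ]; then
    echo "$f"
  fi
done
```

The printed names correspond to the "identifiers that pass the filter" step; in Galaxy the equivalent happens across collection elements rather than local files.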

The tools you will need are covered in these:


Thanks @jennaj

I used toolshed.g2.bx.psu.edu/repos/iuc/seqkit_stats/seqkit_stats/2.2.0+galaxy0, then “Collapse collection into a single dataset” to obtain a TSV file, kept the required entries with awk, and then used “Filter collection”.
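For reference, the awk step could look like the sketch below. It assumes the collapsed seqkit stats output is tab-separated with the element identifier in the first column and `num_seqs` in the fourth (the sample rows and file names here are invented for illustration; check your actual column layout):

```shell
# Build a small stand-in for the collapsed seqkit stats TSV
# (column layout assumed: file, format, type, num_seqs, ...).
printf 'file\tformat\ttype\tnum_seqs\tsum_len\tmin_len\tavg_len\tmax_len\n'  > stats.tsv
printf 'sampleA.fastq.gz\tFASTQ\tDNA\t350\t35000\t100\t100\t100\n'          >> stats.tsv
printf 'sampleB.fastq.gz\tFASTQ\tDNA\t120\t12000\t100\t100\t100\n'          >> stats.tsv

# Keep the identifiers of elements with more than 200 reads,
# skipping the header row.
awk -F'\t' 'NR > 1 && $4 > 200 { print $1 }' stats.tsv > keep_ids.txt
```

The resulting identifier list is what “Filter collection” consumes to subset the original collection.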
