filtering a single-end fastq.gz collection by number of reads

Hi,

We have a single-end fastq collection with thousands of files and want to keep only the files that have more than 200 reads. Is there a way to do this with existing tools in Galaxy?

Thanks!
Saurabh

Hi @microfuge

There isn’t an exact tool for this, but you could string together multiple tools into a workflow to do it.

The process will be something like: count the number of lines per file, filter on the line-count values, capture the identifiers of the elements/files that pass the filter, then filter the original collection with those identifiers.
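Outside Galaxy, the same steps can be sketched in plain shell. This is only a hypothetical illustration of the logic (not a Galaxy tool): it assumes local access to the gzipped files and relies on the fact that one FASTQ record spans exactly 4 lines.

```shell
# Count reads in a gzipped FASTQ file: one record = 4 lines.
count_reads() {
  echo $(( $(zcat "$1" | wc -l) / 4 ))
}

# Print the names of files with more than 200 reads.
for f in *.fastq.gz; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  if [ "$(count_reads "$f")" -gt 200 ]; then
    echo "$f"
  fi
done
```

The printed names correspond to the "identifiers that pass the filter" step; in Galaxy the equivalent happens across collection elements rather than local files.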

The tools you will need are covered in these:


Thanks @jennaj

I used toolshed.g2.bx.psu.edu/repos/iuc/seqkit_stats/seqkit_stats/2.2.0+galaxy0, then “Collapse collection into a single dataset” to obtain a TSV file, kept the required entries with awk, and then used “Filter collection”.
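For reference, the awk step could look like the sketch below. It assumes the collapsed seqkit stats output is tab-separated with the element identifier in the first column and `num_seqs` in the fourth (the sample rows and file names here are invented for illustration; check your actual column layout):

```shell
# Build a small stand-in for the collapsed seqkit stats TSV
# (column layout assumed: file, format, type, num_seqs, ...).
printf 'file\tformat\ttype\tnum_seqs\tsum_len\tmin_len\tavg_len\tmax_len\n'  > stats.tsv
printf 'sampleA.fastq.gz\tFASTQ\tDNA\t350\t35000\t100\t100\t100\n'          >> stats.tsv
printf 'sampleB.fastq.gz\tFASTQ\tDNA\t120\t12000\t100\t100\t100\n'          >> stats.tsv

# Keep the identifiers of elements with more than 200 reads,
# skipping the header row.
awk -F'\t' 'NR > 1 && $4 > 200 { print $1 }' stats.tsv > keep_ids.txt
```

The resulting identifier list is what “Filter collection” consumes to subset the original collection.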
