FASTQ Groomer, fastq.gz and fastqsanger

As the quality scores of the FASTQ files (.fastqsanger.gz) generated by the Illumina 1.9 pipeline are already scaled to Sanger Phred+33, I do not need to apply FASTQ Groomer.

Do I still need to manually assign the format/datatype of the files to fastqsanger before continuing with the data analysis in Galaxy? I tried to do this, but the process failed after running the FastQC tool.
So, how can I ensure that Galaxy recognizes the correct datatype for my files if I don't change their format to fastqsanger?

Hi @Tahleel,
Galaxy automatically detects and assigns a datatype when files with reads are uploaded. This works well for short reads, but not for some long reads, for example, when a wide range of Phred scores is used. You can upload the reads and check the assigned datatype. If a dataset got the “fastq.gz” datatype, change it to “fastqsanger.gz” via Edit Attributes (pencil icon) > Datatypes tab > in the Assign Datatype section, select “fastqsanger.gz” from the pull-down menu. Alternatively, you can specify “fastqsanger.gz” during upload if you have GZipped files.

I do not recommend FASTQ Groomer in this situation. It creates a copy of the file, while all you need is correct metadata (the datatype). Plus, the Groomer is very slow with default settings; you can speed it up by disabling the summary. FASTQ Groomer is needed for reads with old Illumina encoding.
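
If you want to check outside Galaxy whether reads use the old Illumina encoding, a rough heuristic is to look at the lowest ASCII code on the quality lines. The sketch below is only an illustration, not part of Galaxy: the function names are hypothetical, it assumes standard four-line FASTQ records (no wrapped sequence lines), and it can misjudge a Phred+33 file in which every base has a very high quality score.

```python
import gzip


def open_maybe_gzip(path):
    """Open a FASTQ file as text, whether it is plain or GZip-compressed."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    # GZip streams start with the two bytes 0x1f 0x8b.
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def guess_phred_offset(path, max_records=10000):
    """Guess the quality offset (33 or 64) from the lowest quality character.

    Heuristic only: characters below ';' (ASCII 59) cannot occur in the
    old +64 encodings, so seeing one points to Sanger Phred+33.
    """
    lowest = None
    with open_maybe_gzip(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number >= 4 * max_records:
                break
            # Assumption: every record is exactly 4 lines; the 4th is quality.
            if line_number % 4 == 3:
                quality = line.rstrip("\r\n")
                if quality:
                    low = min(map(ord, quality))
                    lowest = low if lowest is None else min(lowest, low)
    if lowest is None:
        raise ValueError("no quality lines found")
    return 33 if lowest < 59 else 64
```

A result of 33 suggests the reads are already Sanger-scaled and only need the right datatype; a result of 64 suggests old Illumina encoding, which is the case where FASTQ Groomer is useful.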

Kind regards,
Igor

Hi, @igor
Thank you for your reply. The files I have are fastq.gz, but Galaxy automatically recognizes them as fastqsanger.gz. Should I change the format to fastqsanger via the Edit Attributes (pencil icon), or can I simply continue with my analysis without making any changes?

Hi @Tahleel,

No, keep the datatype assigned by Galaxy.

Your reads are in FASTQ format compressed with GZip.

Galaxy assigns metadata called datatype to files, something like a label in a shop. Tools in Galaxy handle data according to datatypes. For example, the same text file can have “tabular” or “txt” or some other datatypes. With tabular datatype it will be treated as columns (think Excel), with txt datatype it will be treated as made of lines.

If you assign the fastqsanger datatype to GZipped FASTQ files, tools will expect plain text FASTQ data, and jobs will fail. GZipped files should have fastqsanger.gz (or fastq.gz). The opposite is also true. For example, if you assign the “fastqsanger.gz” datatype to a plain text FASTQ file in Galaxy, it will not make it compressed. The tools will expect GZipped data and will most likely fail.

Datatypes (labels) in Galaxy should match data format.
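
To see which label matches a file before uploading it, you can check whether the file is really GZip-compressed by reading its first two bytes. This is a minimal sketch with hypothetical function names; it assumes the reads are already Sanger-scaled (Phred+33) and only decides between the plain and the compressed datatype.

```python
GZIP_MAGIC = b"\x1f\x8b"  # every GZip stream starts with these two bytes


def is_gzipped(path):
    """Return True if the file content starts with the GZip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def suggested_datatype(path):
    """Pick the Galaxy datatype label that matches the actual file content.

    Assumption: quality scores are Sanger Phred+33, so only the
    compression state has to be decided.
    """
    return "fastqsanger.gz" if is_gzipped(path) else "fastqsanger"
```

The check looks at the content, not the file name, so a plain text file renamed to end in .gz is still reported as fastqsanger.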

I hope I have not confused you.

Kind regards,
Igor

It is very clear. Thank you very much for your help, @igor !