I’m attempting to analyze my ChIP-seq data in UseGalaxy. The data is in fastq.gz format and was generated by the NextSeq 2000. I’m aware that most tools in UseGalaxy require the fastqsanger format, so I’m trying to convert my data using the Fastq Groomer tool. Unfortunately, the process is taking a significant amount of time and hasn’t yielded any results thus far. I’m wondering if I’ve made a mistake with this method or if there are alternative ways to convert my data into the fastqsanger format.
Following is my current configuration:
Input FASTQ quality scores type: Sanger and Illumina 1.8+
Output FASTQ quality scores type: Sanger (gz compressed - recommended)
Force Quality Score encoding: ASCII
Summarize input data: Summarize Input
Thank you for your assistance.
When uploading, let Galaxy auto-detect the datatype. It will assign the fastqsanger format by default to any fastq data. This is almost always the best approach for standard bioinformatics formats (the exceptions are some exotic file types, e.g. composites). If Galaxy guesses wrong, that is an important early clue that there might be a problem with the data, and catching it at upload means you won’t waste time troubleshooting some odd tool error later on.
If the data is already in Galaxy, use the pencil icon and redetect the datatype per dataset. Or even better, put the files into a collection and do that same action on the whole batch just one time.
Collections are folders of similar files, and greatly simplify the working history even when not using a workflow. Any time you have more than one file of a particular type, use a collection as a default.
The Fastq Groomer tool was important early on, before quality scores were standardized, but it is rarely needed now. It will probably not work well on long reads either. Plus, it can really blow up your quota usage with near-duplicate data. Try the methods above instead.
Thanks for the instructions. The files I received are in fastq.gz format, so I simply set the file type as “fastq.gz.” Do you mean that if I allow UseGalaxy to automatically detect the format, it might assign them as fastqsanger format? If that’s the case, can I proceed with running the analysis using my current files without the need for further data conversion?
Yes. Galaxy will now assign fastqsanger to all fastq data.
Most data is in fastqsanger format unless it is really old (color space, etc). That means it has Sanger Phred+33 quality score scaling. That is what Illumina 1.8+ pipelines produce, and most other platforms use it now too. It has become a standard.
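If you want a quick sanity check on your own computer before uploading, the Phred+33 versus Phred+64 question can be roughly answered by looking at the ASCII range of the quality lines. The following is only a minimal sketch, not how Galaxy itself sniffs datatypes. The function name guess_phred_offset and the cutoffs (any character below ASCII 59 implies Phred+33; nothing below 59 plus characters above 74 suggests an old +64 encoding) are my own assumptions, and it expects plain 4-line FASTQ records with no wrapped sequence lines.

```python
import gzip
import itertools


def guess_phred_offset(path, max_reads=10000):
    """Heuristically guess the quality offset (33 or 64) of a FASTQ file.

    Returns 33, 64, or None when the observed range is ambiguous.
    Assumes 4-line records; the thresholds are a rule of thumb.
    """
    lowest, highest = 255, 0
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The quality string is the 4th line of every record (index 3, 7, ...).
        for line in itertools.islice(handle, 3, max_reads * 4, 4):
            codes = [ord(ch) for ch in line.rstrip("\r\n")]
            if codes:
                lowest = min(lowest, min(codes))
                highest = max(highest, max(codes))
    if highest == 0:
        raise ValueError("no quality lines found")
    if lowest < 59:
        # Characters below ';' cannot occur in the old +64 encodings.
        return 33
    if highest > 74:
        # Nothing low, but characters above 'J': likely an old +64 file.
        return 64
    return None


# Example call (the file name is a placeholder):
# print(guess_phred_offset("sample_R1.fastq.gz"))
```

A result of 33 is consistent with fastqsanger, so no grooming step would be needed. A result of None just means the sampled reads did not span enough of the quality range to tell.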
If you are ever not sure, or just want to double check, the tool FastQC could be run on one (or all) files to confirm that. The top of the report will include the scaling detected. Running some QA before using the primary analysis tools is a good idea anyway, so you are not left troubleshooting odd errors caused by a lack of data prep.
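If you have many files and don't want to open each report, the scaling can also be pulled out of the FastQC raw data output. This is a small sketch that assumes the raw data text is tab-separated and contains an "Encoding" row in its Basic Statistics section, which is how the reports I have seen are laid out. The function name fastqc_encoding and the sample text are made up for illustration.

```python
def fastqc_encoding(report_text):
    """Return the value of the 'Encoding' row from FastQC raw data text.

    Assumes tab-separated 'key<TAB>value' rows; returns None if no
    'Encoding' row is present.
    """
    for line in report_text.splitlines():
        key, sep, value = line.partition("\t")
        if sep and key.strip() == "Encoding":
            return value.strip()
    return None


# Illustrative input only, not copied from a real report.
example = (
    ">>Basic Statistics\tpass\n"
    "#Measure\tValue\n"
    "File type\tConventional base calls\n"
    "Encoding\tSanger / Illumina 1.9\n"
    ">>END_MODULE\n"
)
print(fastqc_encoding(example))
```

A value that mentions Sanger corresponds to the Phred+33 scaling that fastqsanger expects.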