I’m integrating the nf-core/demo pipeline into Galaxy as a Nextflow-based tool. One challenge I’ve run into is that Galaxy stores input datasets as .dat files without the original filenames or extensions, while nf-core pipelines require .fastq.gz filenames and absolute paths in the samplesheet.
What I want to achieve:
Accept multiple FASTQ inputs (forward/reverse reads for multiple samples) in Galaxy.
Dynamically generate a samplesheet (CSV) that references these inputs as .fastq.gz files.
Use ln -s to create symlinks from the .dat files to properly named .fastq.gz files, so the nf-core pipeline accepts them.
Ensure the samplesheet lists full paths that the pipeline can access (e.g. sample1, path/sample1_R1.fastq.gz, path/sample1_R2.fastq.gz).
For example, something like this in the tool’s command section:
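A rough sketch of what I have in mind (the param names read1/read2, the sample name, the samplesheet columns, and --outdir are placeholders, not a tested wrapper):

```xml
<command><![CDATA[
    ## placeholder sketch: symlink the .dat inputs to nf-core-friendly names,
    ## write a samplesheet with absolute paths, then launch the pipeline
    ln -s '$read1' 'sample1_R1.fastq.gz' &&
    ln -s '$read2' 'sample1_R2.fastq.gz' &&
    echo 'sample,fastq_1,fastq_2' > samplesheet.csv &&
    echo "sample1,\$(pwd)/sample1_R1.fastq.gz,\$(pwd)/sample1_R2.fastq.gz" >> samplesheet.csv &&
    nextflow run nf-core/demo --input samplesheet.csv --outdir results
]]></command>
```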
How do you normally map Galaxy’s .dat inputs to .fastq.gz filenames in a suitable way?
Are there known best practices or examples of handling multiple inputs and generating a samplesheet at runtime in Galaxy tool wrappers?
Do you rely on a particular environment variable (e.g. $_GALAXY_JOB_TMP_DIR) or another approach to ensure stable absolute paths?
Any insights, examples, or references would be greatly appreciated. I want a solution that can handle multiple samples efficiently and adheres to nf-core’s filename and path requirements.
For multi-sample data we usually use collections in Galaxy (paired collections for paired data). A good tutorial using this might be this one. Maybe this and this are also helpful.
As you noticed, all files in Galaxy are stored with a .dat extension and the filename is a unique ID. All other information, like the datatype and the name (as shown in the history … which could be the sample name if organized in collections), is stored in the Galaxy database.
In the Galaxy tools which generate the command line we can access some of this information from the database. See for instance here: the input accepts multiple datasets at once, e.g. via a collection, and we loop over the elements in a for loop. In the loop we access $i.element_identifier (and use re.sub to construct safe filenames). Analogously, $i.ext gives access to the datatype. One annoyance: FASTQ data is fastqsanger in Galaxy, so you will need a bit of extra logic here (fastq in Galaxy is a general datatype that summarizes all FASTQ variants).
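Untested sketch of what I mean (the parameter name $input and the extension mapping are just examples to adapt):

```xml
<command><![CDATA[
    #import re
    #for $i in $input
        ## build a filesystem-safe name from the element identifier
        #set $safe = re.sub('[^\w\-.]', '_', str($i.element_identifier))
        ## Galaxy datatypes are fastqsanger / fastqsanger.gz etc., so map them back
        #if str($i.ext).endswith('.gz')
            ln -s '$i' '${safe}.fastq.gz' &&
        #else
            ln -s '$i' '${safe}.fastq' &&
        #end if
    #end for
    ...
]]></command>
```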
For how jobs run: you can assume that each job runs in its own job working directory, where you can create directories, symlinks, and temporary files as needed.
I’m happy to say that, thanks to your advice, we’ve successfully integrated the nf-core/demo pipeline into Galaxy with multiple FASTQ inputs and dynamically generated samplesheets!
What Worked:
Using collections (in my case, a list:paired collection) was key. This allowed me to handle multiple samples at once, and each element’s element_identifier served as a meaningful sample name.
In the tool’s XML <command> block, I used a #for loop over the collection elements (see the sketch further below). Within that loop, I:
Applied re.sub to element_identifier to create safe filenames.
Created symbolic links (ln -s) from Galaxy’s .dat files to .fastq.gz filenames.
Appended lines to samplesheet.csv with absolute paths.
Running nextflow run nf-core/demo --input samplesheet.csv from the job’s temporary directory worked perfectly.
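For anyone wanting to do the same, here is a simplified sketch of the command block (the collection param name reads, the samplesheet columns, and the --outdir value are placeholders from my setup, not the verbatim production wrapper):

```xml
<command><![CDATA[
    #import re
    echo 'sample,fastq_1,fastq_2' > samplesheet.csv &&
    #for $pair in $reads
        ## element_identifier carries the sample name; sanitize it for file names
        #set $name = re.sub('[^\w\-]', '_', str($pair.element_identifier))
        ln -s '$pair.forward' '${name}_R1.fastq.gz' &&
        ln -s '$pair.reverse' '${name}_R2.fastq.gz' &&
        echo "${name},\$(pwd)/${name}_R1.fastq.gz,\$(pwd)/${name}_R2.fastq.gz" >> samplesheet.csv &&
    #end for
    nextflow run nf-core/demo --input samplesheet.csv --outdir results
]]></command>
```

The matching input is a param along the lines of `<param type="data_collection" collection_type="list:paired" format="fastqsanger.gz" ... />`.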
Special thanks to @jennaj for reaching out to the admins and providing initial pointers, and @bernt-matthias for the detailed explanation and links to examples, which were incredibly helpful. The suggestions to use collections, to carefully handle element identifiers, and to rely on $i.element_identifier were exactly what I needed. Your guidance on safely constructing filenames and understanding Galaxy’s data management approach helped me tie everything together.
This approach now runs smoothly, and the pipeline outputs are correctly recognized by Galaxy with from_work_dir. I hope this helps anyone else who finds themselves puzzling over similar challenges!
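For completeness, the outputs section looks roughly like this (the path under results/ is illustrative and has to match wherever the pipeline actually publishes the file):

```xml
<outputs>
    <!-- illustrative example: pick up a published report from the job working directory -->
    <data name="multiqc_report" format="html"
          from_work_dir="results/multiqc/multiqc_report.html"
          label="MultiQC report"/>
</outputs>
```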
Thanks again, everyone, for your time, patience, and expertise. This project is progressing well thanks to your support!
Excellent. Thanks for sharing the result, which looks great.
My only suggestion would be to remove the cd $_GALAXY_JOB_TMP_DIR and just create the files in the current working dir (the job’s working dir). In the current state I would have expected the outputs to be created in $_GALAXY_JOB_TMP_DIR, where Galaxy can’t collect outputs.
Wonderful! This is such a big deal, @gozdekb !! Getting these two systems to work together is so incredibly useful. I’m looking forward to seeing where this goes!
Thank you very much @bernt-matthias for your follow-up and suggestion regarding cd $_GALAXY_JOB_TMP_DIR. You’re absolutely right that Galaxy won’t collect outputs that remain in the job’s temporary folder by default.
I initially used cd $_GALAXY_JOB_TMP_DIR to have a consistent place for creating symlinks, but as you pointed out, Galaxy cannot see the outputs this way. The real key in my setup is publishDir, because Nextflow typically scatters outputs in dynamically generated subfolders under work/, and Galaxy cannot find them otherwise. By publishing them into the job’s working/ directory (or another specified path outside the tmp folder), Galaxy can detect and retrieve the final output files without extra copying or manual path adjustments.
Now, I’ve removed the cd $_GALAXY_JOB_TMP_DIR step and rely solely on publishDir to gather outputs into a Galaxy-accessible location. In both approaches, the tool writes outputs successfully to the history, but I’ll follow your suggestion to keep the workflow simpler and more transparent. Thank you again for your guidance!
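In case it helps others who hit the same issue, here is a generic sketch of what publishDir does (this is not taken from nf-core/demo; the process name and paths are illustrative), publishing into a relative folder inside the job working directory so Galaxy can collect the files:

```groovy
// illustrative stand-alone process, not from nf-core/demo:
// publishDir copies the declared outputs out of the scattered work/ subfolders
// into a predictable, relative location inside the job working directory
process FASTQC_EXAMPLE {
    publishDir 'results/fastqc', mode: 'copy'

    input:
    path reads

    output:
    path '*.html'

    script:
    """
    fastqc --outdir . $reads
    """
}
```

With a standard nf-core pipeline the same effect comes from passing a relative --outdir, since the pipeline’s publishDir directives are parameterized by it.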
Thank you, @jennaj! We also believe this integration adds a lot of power and versatility to our Galaxy workflows. Once the project is complete, we’d love to share our insights—maybe as a Galaxy blog post or even a white paper—so others can benefit from the lessons we learned. Thanks again for the support!