I’m integrating the nf-core/demo pipeline into Galaxy as a Nextflow-based tool. One challenge I’ve run into is that Galaxy stores input datasets as .dat files without the original filenames or extensions, while nf-core pipelines require .fastq.gz filenames and absolute paths in the samplesheet.
What I want to achieve:
Accept multiple FASTQ inputs (forward/reverse reads for multiple samples) in Galaxy.
Dynamically generate a samplesheet (CSV) that references these inputs as .fastq.gz files.
Use ln -s to create symlinks from the .dat files to properly named .fastq.gz files, so the nf-core pipeline accepts them.
Ensure the samplesheet lists full paths that the pipeline can access (e.g. sample1, path/sample1_R1.fastq.gz, path/sample1_R2.fastq.gz).
For example, something like this in the tool’s command section:
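A rough sketch of what I have in mind (the param names read1/read2, the sample name, the samplesheet columns, and --outdir are placeholders, not a tested wrapper):

```xml
<command><![CDATA[
    ## placeholder sketch: symlink the .dat inputs to nf-core-friendly names,
    ## write a samplesheet with absolute paths, then launch the pipeline
    ln -s '$read1' 'sample1_R1.fastq.gz' &&
    ln -s '$read2' 'sample1_R2.fastq.gz' &&
    echo 'sample,fastq_1,fastq_2' > samplesheet.csv &&
    echo "sample1,\$(pwd)/sample1_R1.fastq.gz,\$(pwd)/sample1_R2.fastq.gz" >> samplesheet.csv &&
    nextflow run nf-core/demo --input samplesheet.csv --outdir results
]]></command>
```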
How do you normally map Galaxy’s .dat inputs to .fastq.gz filenames in a suitable way?
Are there known best practices or examples of handling multiple inputs and generating a samplesheet at runtime in Galaxy tool wrappers?
Do you rely on a particular environment variable (e.g. $_GALAXY_JOB_TMP_DIR) or another approach to ensure stable absolute paths?
Any insights, examples, or references would be greatly appreciated. I want a solution that can handle multiple samples efficiently and adheres to nf-core’s filename and path requirements.
For multi-sample data we usually use collections in Galaxy (paired collections for paired data). A good tutorial using this might be this one. Maybe this and this are also helpful.
As you noticed, all files in Galaxy are stored with a .dat extension and the filename is a unique ID. All other information, like the datatype and the name (as shown in the history … which could be the sample name if organized in collections), is stored in the Galaxy database.
In the Galaxy tools which generate the command line we can access some of this information from the database. See for instance here: the input accepts multiple datasets at once, e.g. via a collection, and we loop over the elements in a for loop. In the loop we access $i.element_identifier (and use re.sub to construct safe filenames). Analogously, $i.ext gives access to the datatype. One annoyance: FASTQ data is fastqsanger in Galaxy, so you will need a bit of extra logic here (fastq in Galaxy is a general datatype that summarizes all FASTQ variants).
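Untested sketch of what I mean (the parameter name $input and the extension mapping are just examples to adapt):

```xml
<command><![CDATA[
    #import re
    #for $i in $input
        ## build a filesystem-safe name from the element identifier
        #set $safe = re.sub('[^\w\-.]', '_', str($i.element_identifier))
        ## Galaxy datatypes are fastqsanger / fastqsanger.gz etc., so map them back
        #if str($i.ext).endswith('.gz')
            ln -s '$i' '${safe}.fastq.gz' &&
        #else
            ln -s '$i' '${safe}.fastq' &&
        #end if
    #end for
    ...
]]></command>
```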
For how jobs run: you can assume that each job runs in its own job working directory, where you can create directories, symlinks, and temporary files as needed.
I’m happy to say that, thanks to your advice, we’ve successfully integrated the nf-core/demo pipeline into Galaxy with multiple FASTQ inputs and dynamically generated samplesheets!
What Worked:
Using collections (in my case, a list:paired collection) was key. This allowed me to handle multiple samples at once, and each element’s element_identifier served as a meaningful sample name.
In the tool’s XML <command> block, I used a #for loop over the collection elements (see the sketch further below). Within that loop, I:
Applied re.sub to element_identifier to create safe filenames.
Created symbolic links (ln -s) from Galaxy’s .dat files to .fastq.gz filenames.
Appended lines to samplesheet.csv with absolute paths.
Running nextflow run nf-core/demo --input samplesheet.csv from the job’s temporary directory worked perfectly.
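For anyone wanting to do the same, here is a simplified sketch of the command block (the collection param name reads, the samplesheet columns, and the --outdir value are placeholders from my setup, not the verbatim production wrapper):

```xml
<command><![CDATA[
    #import re
    echo 'sample,fastq_1,fastq_2' > samplesheet.csv &&
    #for $pair in $reads
        ## element_identifier carries the sample name; sanitize it for file names
        #set $name = re.sub('[^\w\-]', '_', str($pair.element_identifier))
        ln -s '$pair.forward' '${name}_R1.fastq.gz' &&
        ln -s '$pair.reverse' '${name}_R2.fastq.gz' &&
        echo "${name},\$(pwd)/${name}_R1.fastq.gz,\$(pwd)/${name}_R2.fastq.gz" >> samplesheet.csv &&
    #end for
    nextflow run nf-core/demo --input samplesheet.csv --outdir results
]]></command>
```

The matching input is a param along the lines of `<param type="data_collection" collection_type="list:paired" format="fastqsanger.gz" ... />`.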
Special thanks to @jennaj for reaching out to the admins and providing initial pointers, and @bernt-matthias for the detailed explanation and links to examples, which were incredibly helpful. The suggestions to use collections, to carefully handle element identifiers, and to rely on $i.element_identifier were exactly what I needed. Your guidance on safely constructing filenames and understanding Galaxy’s data management approach helped me tie everything together.
This approach now runs smoothly, and the pipeline outputs are correctly recognized by Galaxy with from_work_dir. I hope this helps anyone else who finds themselves puzzling over similar challenges!
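For completeness, the outputs section looks roughly like this (the path under results/ is illustrative and has to match wherever the pipeline actually publishes the file):

```xml
<outputs>
    <!-- illustrative example: pick up a published report from the job working directory -->
    <data name="multiqc_report" format="html"
          from_work_dir="results/multiqc/multiqc_report.html"
          label="MultiQC report"/>
</outputs>
```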
Thanks again, everyone, for your time, patience, and expertise. This project is progressing well thanks to your support!
Excellent. Thanks for sharing the result, which looks great.
My only suggestion would be to remove the cd $_GALAXY_JOB_TMP_DIR and just create the files in the current working dir (the job’s working dir). In the current state I would have expected the outputs to be created in $_GALAXY_JOB_TMP_DIR, where Galaxy can’t collect outputs.
Wonderful! This is such a big deal, @gozdekb !! Getting these two systems to work together is so incredibly useful. I’m looking forward to seeing where this goes!
Thank you very much @bernt-matthias for your follow-up and suggestion regarding cd $_GALAXY_JOB_TMP_DIR. You’re absolutely right that Galaxy won’t collect outputs that remain in the job’s temporary folder by default.
I initially used cd $_GALAXY_JOB_TMP_DIR to have a consistent place for creating symlinks, but as you pointed out, Galaxy cannot see the outputs this way. The real key in my setup is publishDir, because Nextflow typically scatters outputs in dynamically generated subfolders under work/, and Galaxy cannot find them otherwise. By publishing them into the job’s working/ directory (or another specified path outside the tmp folder), Galaxy can detect and retrieve the final output files without extra copying or manual path adjustments.
Now, I’ve removed the cd $_GALAXY_JOB_TMP_DIR step and rely solely on publishDir to gather outputs into a Galaxy-accessible location. In both approaches, the tool writes outputs successfully to the history, but I’ll follow your suggestion to keep the workflow simpler and more transparent. Thank you again for your guidance!
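In case it helps others who hit the same issue, here is a generic sketch of what publishDir does (this is not taken from nf-core/demo; the process name and paths are illustrative), publishing into a relative folder inside the job working directory so Galaxy can collect the files:

```groovy
// illustrative stand-alone process, not from nf-core/demo:
// publishDir copies the declared outputs out of the scattered work/ subfolders
// into a predictable, relative location inside the job working directory
process FASTQC_EXAMPLE {
    publishDir 'results/fastqc', mode: 'copy'

    input:
    path reads

    output:
    path '*.html'

    script:
    """
    fastqc --outdir . $reads
    """
}
```

With a standard nf-core pipeline the same effect comes from passing a relative --outdir, since the pipeline’s publishDir directives are parameterized by it.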
Thank you, @jennaj! We also believe this integration adds a lot of power and versatility to our Galaxy workflows. Once the project is complete, we’d love to share our insights—maybe as a Galaxy blog post or even a white paper—so others can benefit from the lessons we learned. Thanks again for the support!