Getting files from history download with proper sample names

Domenico_Simone · December 5, 2022, 4:30pm

Hello,

I would like to download a history containing the outputs of a workflow run on dataset collections. I would like the downloaded files (in the datasets folder) to carry the corresponding sample names, e.g. <sample>_scaffolds.fasta for an assembly, <sample>.gff for the related annotation, and so on. Despite all my attempts at relabeling and renaming outputs in the workflow settings, I can’t get what I want. Could you please give me a hint and/or a link to the relevant documentation?

Thank you for your kind support

Domenico Simone

igor · December 6, 2022, 4:00am

Hi @Domenico_Simone
just to confirm that I got it right: you run a workflow on a collection, the output collections have the same dataset names as the original input collection, but you want different extensions in the datasets’ names in the output collections.

Personally, I prefer renaming individual datasets, as collections are somewhat complicated to deal with, and I see an elevated error rate for some tools on collections. You need properly named reads, something like sample1.r1.fq.gz, so that a basename like sample1 can be used for subsequent outputs, such as sample1.bam. Workflows can be submitted in batches on multiple input datasets.

Please let me know if you find a good option for renaming of datasets in a collection using a Galaxy workflow.

This can probably be done using Relabel Identifiers from Collection Operations, assuming the outputs from all steps are present in the same order. I did it using Apply Rules from the same section. The rules: column A holds the original names in the collection; column B holds the extension (a fixed text string starting with ‘.’); column C is the concatenation of A and B. Set column C as the identifiers.
A bit of background. The input is a collection of SE reads. The workflow maps the reads to a reference genome and counts reads against annotated genes. The workflow also renames the output collections by propagating the original collection name plus extensions.
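
To make the Apply Rules logic above concrete, here is a minimal Python sketch of what the three columns do. It is not Galaxy code and does not call any Galaxy API; the function name `relabel` and the example identifiers are made up for illustration only.

```python
def relabel(identifiers, extension):
    """Mimic the three Apply Rules columns described above.

    Column A: the original element identifiers of the collection.
    Column B: a fixed text string starting with '.'.
    Column C: A and B concatenated; this becomes the new identifier.
    """
    if not extension.startswith("."):
        raise ValueError("the fixed text is expected to start with '.'")
    rows = [(name, extension) for name in identifiers]  # columns A and B
    return [a + b for a, b in rows]                      # column C


# Hypothetical identifiers, only to show the shape of the result.
print(relabel(["sample1", "sample2"], ".counts"))
```

In Galaxy itself the same result is obtained interactively in the Apply Rules editor; the sketch only shows the naming that should come out of it.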

History with completed workflow:
https://usegalaxy.org.au/u/igor/h/rename-datasets-in-a-collection
Workflow:
https://usegalaxy.org.au/u/igor/w/featurecountscollection

Hope this helps.
Kind regards,
Igor

igor · December 6, 2022, 5:21am

Hi @Domenico_Simone
I forgot to mention that Galaxy appends the extension (datatype) to file names when a collection is downloaded. For example, a collection of BAM alignments named sample1, sample2 is downloaded as sample1.bam, sample1.bam.bai, etc. This is handy for alignments, but some outputs only have the generic tabular datatype.
Kind regards,
Igor

Domenico_Simone · December 6, 2022, 8:18am

Hi @igor, many thanks for your detailed reply. Indeed, my point is about the names in the datasets folder of the tar archive that you get when you select the command “Export History to file”. I’ve checked the history you shared (thanks!) and when I export it to file, the datasets folder contains files named like “.alignments_15.bam” and “.counts_35.tabular”. My wish is to get e.g. “MCL-1DL.alignments.bam”, “MCL-1DL.counts.tabular” and so on.
I was thinking that, for each file, I might try to grep the sample name from the file itself and rename the downloaded file accordingly. However, this can work for some output formats (e.g. the counts file in your history), but for other outputs (e.g. scaffolds from assemblers) I am afraid this is not possible.
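
A rough Python sketch of the renaming idea, for the case where the sample name for each exported file is already known. It assumes the exported names follow the pattern seen above (a label, an underscore plus a number, then the extension) and that the mapping from exported file name to sample name is supplied by hand. The names `target_name` and `rename_all` and the folder name are hypothetical, and the sketch does not read any Galaxy metadata.

```python
import re
from pathlib import Path

# Assumed shape of an exported name: <label>_<number>.<extension>,
# e.g. ".alignments_15.bam" (taken from the example in this post).
PATTERN = re.compile(r"^(?P<label>.*)_(?P<num>\d+)\.(?P<ext>[^.]+)$")


def target_name(exported_name, sample):
    """Build '<sample><label>.<ext>' from an exported file name."""
    match = PATTERN.match(exported_name)
    if match is None:
        raise ValueError(f"unexpected file name: {exported_name!r}")
    return f"{sample}{match['label']}.{match['ext']}"


def rename_all(folder, samples, dry_run=True):
    """Rename files in `folder` using a {exported_name: sample} mapping.

    With dry_run=True nothing is touched; the planned
    (old name, new name) pairs are only returned for inspection.
    """
    planned = []
    for old_name, sample in samples.items():
        src = Path(folder) / old_name
        dst = src.with_name(target_name(old_name, sample))
        planned.append((src.name, dst.name))
        if not dry_run:
            src.rename(dst)
    return planned


# Dry run on the example names; the mapping itself is written by hand.
print(rename_all("datasets", {".alignments_15.bam": "MCL-1DL",
                              ".counts_35.tabular": "MCL-1DL"}))
```

This only moves the problem to building the mapping, which is exactly the part that is hard for outputs such as scaffolds that do not contain the sample name.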
I hope I have been clear and sorry for being verbose!

Domenico

igor · December 7, 2022, 12:38am

Hi @Domenico_Simone
I have issues with decompressing exported histories, so I cannot check.
Have you considered downloading the collections instead? After a collection is downloaded, the file names look like: sample_name.text_added_using_Apply_Rules.datatype.
Kind regards,
Igor