galaxy tool xml file fetching files into a collection

Hi All,

I am writing Galaxy wrappers for some command line scripts. I have come quite far, but I am now struggling to fetch the resulting files.

In general, the command line tools I write wrappers for work on an output folder with a wealth of different file types (xml, mat, txt, csv, svg) that belong together. A raw output folder represents a whole time series. It has no subfolders. Each time step in the time series has a set of files that belong together. The filename makes clear what belongs to one time step (e.g. output00000021.xml, output00000021_cells.mat, …).

The scripts I write wrappers for can either generate:

  • an output file for each time step (in this case the output files have the same filename structure common to the time step, but will at least differ in the file extension (e.g. output/output00000021_cell_maxabs.h5ad))
  • one file for the whole time series (in this case the file name usually starts with timeseries (e.g. output/timeseries_cell_maxabs.h5ad))
  • a subfolder with a whole set of files (output/cell_cell_type_z0.0 with files like output00000021_cell_type.jpeg)
    The folder and file names depend on the parameter settings the script is called with.
    All scripts return over stderr a string which lists all generated file names or the generated subfolder name.
    I was able to set the stdio exit_code level to log (so that Galaxy does not always think the scripts threw an error), and I can capture the stderr output and write it to a txt file, but it is not intuitive to me how I could work with this valuable information within the XML wrapper. I tried to write Cheetah scripts for that, but nothing really seems to work.

I tried to discover_datasets in all ways described here: Advanced Tool Development Topics — Planemo 0.75.31.dev0 documentation

  • pattern=“designation_and_ext
  • pattern=“designation” format=“h5ad”
  • pattern=" (?P<designation>.+).h5ad" format=“h5ad”
    (By the way, what is the difference between designation, filename, and name? It is nowhere clearly explained.)
    In any case, what I end up with is a collection containing all files in the output folder (h5ad, but also xml, mat, txt, csv, svg, …). How can I keep only, for example, the h5ad files?
    I tried to use collection filters as described here: Galaxy Tool XML File — Galaxy Project 24.2.3.dev0 documentation
    but unsuccessfully.

Briefly, I am stuck. I would appreciate some input.
Thank you, Elmar

Hi @bb-8

The pattern match parts here

These look close to the recommended patterns but are not exact. Would you like to share some example file names (maybe both the ones you want to find and a few that you don’t want to find, all in a particular directory maybe?) and then the exact code blocks that you have tried (copy/paste out of your code, or share the whole XML, long is fine)?

You could put both into a gist comment and share the links back if you think it will be too long.

You can also use the tools here to quote the text to keep any characters from being interpreted by the markdown editor.

The datatype h5ad exists in Galaxy, so you should be able to use that, but the other ways should work too. There is likely something minor going on. We can help but we’ll need to get really specific! So exact and too much info is better than vague and not enough. We want to help you! :slight_smile:


Then for

The terms are defined here → Galaxy Tool XML File — Galaxy Project 25.1.dev0 documentation

With supplemental here → Tools — Galaxy IUC Standards and Best Practices 0.1 documentation

On each of those guides, try using a browser keyword search (“find” function) to find all the places a term might be included to learn about how and when it is used.

If anything is still unclear, you can highlight or quote that part back (with a link to where you captured it) and we can try to help explain a bit more, and maybe share an example in the code base that includes it.
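As a starting point for the terminology, here is a minimal sketch of how the named groups are typically used in a discover_datasets pattern. The collection names and the "output" directory are placeholders, and the comments reflect a reading of the documentation rather than an authoritative definition.

```xml
<outputs>
    <collection name="per_timestep" type="list">
        <!-- "designation" captures the part of the file name that identifies
             one element; in a list collection it becomes the element identifier.
             For output00000021_cell_maxabs.h5ad this group captures
             "output00000021_cell_maxabs". The datatype is fixed by format. -->
        <discover_datasets pattern="(?P&lt;designation&gt;.+)\.h5ad" format="h5ad" directory="output" />
    </collection>
    <collection name="any_type" type="list">
        <!-- Adding an "ext" group lets the file extension choose the datatype
             instead of a fixed format attribute. The captured text has to be
             a datatype extension that the Galaxy server knows. -->
        <discover_datasets pattern="(?P&lt;designation&gt;.+)\.(?P&lt;ext&gt;[^\._]+)" directory="output" />
    </collection>
</outputs>
```

A "name" group can be used in the same position as "designation". As far as I understand, the distinction matters mostly for outputs that are not collections, where "name" sets the dataset label in the history directly while "designation" is combined with the label of the tool's primary output. The file name itself is only the text the pattern is matched against.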

Let’s start there, thanks! :hammer_and_wrench:

hi Jennaj,

Thank you for this prompt response!

I will look into the pattern again and share the filenames.
It seems that I was on the right track there, I just did not get it running.

About designation, I had actually searched the “Galaxy Tool XML File” document before.
In the case of designation, the “details” do not really describe what the “attribute” is, because the document explains the word with the word itself.
Here:

And what Merriam-Webster writes about designation was not much more informative either:

In the other document you pointed out, designation cannot be found at all.
I still don’t know what “designation” means in the Galaxy sense, or how it differs from filename and extension.

Thank you, Elmar

About fetching the output files:
This is the xml I am working on:

This is a listing of the “raw” output folder:

This is a listing of the output folder with all possible generated h5ad files:

There are two types of h5ad files:

  • a collapsed version, one file for the whole time series. The pattern for this filename is: timeseries_cell_<scale>.h5ad
    a possible regex pattern is: ^timeseries_cell_.+\.h5ad$

  • an uncollapsed version, one file per time step. The pattern for this filename is: output00000000_cell_<scale>.h5ad (each 0 can be any digit)
    a possible regex pattern is: ^output\d{8}_cell_.+\.h5ad$

It would be cool if we could use the information in the scale variable for an exact match!
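The two regexes above could be sketched as two separate collections, so that the per-timestep files and the collapsed file do not end up mixed together. This is only a sketch: the collection names are made up, and the "output_pc" directory is an assumption about where the script writes its files.

```xml
<outputs>
    <!-- one element per time step, e.g. output00000021_cell_maxabs.h5ad -->
    <collection name="anndata_timesteps" type="list">
        <discover_datasets pattern="(?P&lt;designation&gt;output\d{8}_cell_.+)\.h5ad" format="h5ad" directory="output_pc" visible="false" />
    </collection>
    <!-- the single collapsed file, e.g. timeseries_cell_maxabs.h5ad -->
    <collection name="anndata_timeseries" type="list">
        <discover_datasets pattern="(?P&lt;designation&gt;timeseries_cell_.+)\.h5ad" format="h5ad" directory="output_pc" visible="false" />
    </collection>
</outputs>
```

As far as I can tell, the pattern attribute is static text and is not filled in from tool parameters, so an exact match on the scale value would probably have to be arranged in the command section (for example by renaming the file to a fixed name) rather than in the pattern itself.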

Thank you, for helping me in this!

This is about what I was looking for:

    <outputs>                                                                   
        <collection name="anndata_h5ad" type="list">                            
            <discover_datasets pattern="(?P&lt;designation&gt;.+)\.h5ad" format="h5ad" directory="output_pc" visible="false" />
        </collection>                                                           
    </outputs>                                                                  

With the solution above, I could resolve many of the patterns needed for my scripts.
However, there are still two scripts I have problems with. These scripts generate plots.
Depending on the parameter settings, these can be jpg, png, or tif plots.

Question: How can I specify more than one file format in <discover_datasets pattern=... format=.. directory=... visible=...> ? Comma separation elsewhere does not seem to work.

What complicates things is that the plots are stored in a subfolder that is named according to the parameter settings. I get the path to this subfolder as standard output.
e.g. output_pc/conc_oxygen_z0.0
(conc_oxygen_z0.0 is the subfolder)

Question: How can I capture this output and use it as the setting for the directory parameter in <discover_datasets pattern=... format=.. directory=... visible=...>?

I would be glad, @jennaj, if you could point me in the right direction on how to solve this.
Thank you, Elmar

This is how I can get the path from the stderr output:

sed -En "s/(.*conc_.+)/\1/p"

But even when I have the path, I cannot figure out how to use it in discover_datasets.

Hi @bb-8

I think you’ll need to organize the outputs in a way that you can predict in advance, then look for the data there. This is how you’ll know which datatype to assign when publishing that result back to the history. It is also how you’ll be able to detect when the tool wasn’t able to complete the job as it was asked to (missing outputs) and add in error trapping for those situations.
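One way to make the output location predictable is to move the plots into a fixed folder at the end of the command section, so that discover_datasets never needs to know the parameter-dependent subfolder name. The sketch below is hypothetical: "my_plot_script", its option, the tool id, and the "plots" folder are placeholders, and it assumes the script writes exactly one subfolder below output_pc.

```xml
<tool id="plot_sketch" name="Plot sketch" version="0.1.0">
    <!-- inputs and other sections omitted for brevity -->
    <command><![CDATA[
        ## my_plot_script and its option stand in for the wrapped script
        my_plot_script --output output_pc &&
        ## gather the files from the generated subfolder into one fixed folder
        mkdir plots &&
        find output_pc -mindepth 2 -type f -exec mv {} plots/ ';'
    ]]></command>
    <outputs>
        <collection name="plots" type="list">
            <discover_datasets pattern="(?P&lt;designation&gt;.+)\.png" format="png" directory="plots" />
        </collection>
    </outputs>
</tool>
```

With this layout an empty "plots" folder after the run is also an easy signal that the script did not produce what was asked for.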

You could review a tool XML that has multiple optional inputs as an example. This tool has several optional outputs of different types. They are mutually exclusive, which sounds like how yours will work too?

  • toolshed.g2.bx.psu.edu/repos/iuc/newick_utils/newick_display/1.6+galaxy1
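For the plot case in this thread, a minimal sketch along the same lines is a single collection with one discover_datasets element per possible format. Since only one plot format is produced per run, only one of the patterns should match. The collection name is made up, the format values assume that the Galaxy server has jpg, png, and tiff datatypes registered, and recurse="true" is used here so that the parameter-dependent subfolder name does not have to be known (please check that this attribute is available in your Galaxy version).

```xml
<outputs>
    <collection name="plots" type="list">
        <!-- each element assigns its own datatype to the files it matches -->
        <discover_datasets pattern="(?P&lt;designation&gt;.+)\.jpe?g" format="jpg" directory="output_pc" recurse="true" />
        <discover_datasets pattern="(?P&lt;designation&gt;.+)\.png" format="png" directory="output_pc" recurse="true" />
        <discover_datasets pattern="(?P&lt;designation&gt;.+)\.tiff?" format="tiff" directory="output_pc" recurse="true" />
    </collection>
</outputs>
```

As far as I understand, with recurse the pattern is still matched against the file name alone unless match_relative_path is also set, so the subfolder name does not need to appear in the pattern.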

Hope this helps! :slight_smile: