Data collection not available as input

Hi! I am struggling with dataset collections in Galaxy 21.01. The error did not appear in 19.05 and 20.05.

My tool generates a collection according to
https://planemo.readthedocs.io/en/latest/writing_advanced.html#creating-collections
I follow section “3. Dynamic Element Count” to generate a “list”:
<collection name="output" type="list" label="Simulated immuneML dataset">
    <discover_datasets pattern="__designation__" directory="outputs/result" />
</collection>

This seems to work fine, i.e. the collection shows up in green in the Galaxy interface. I can even use it in Workflows. However, the forms of other tools do not recognise the collection as a valid input, i.e. it does not appear in the drop-down menus for selecting dataset collections.

Is this a known issue? Is anyone able to help out?

Can you share a screenshot of this dropdown menu? It may be that you need to switch one of the three buttons to the left of your dropdown box.

I guess, in this example, the tool specifies a collection-input:


It finds other lists, but not history item 526.

I can try one with the three buttons also.

With the three buttons:

Keep in mind that the datatype and the collection type also need to match what the tool expects.
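To illustrate the matching, here is a minimal sketch of a collection input that states both constraints explicitly. The parameter name, label, and the `collection_type` and `format` values are assumptions chosen for illustration, not the settings of any tool discussed here.

```xml
<!-- Sketch only: name, label, collection_type and format are assumed values. -->
<param name="collection_input" type="data_collection"
       collection_type="list" format="txt"
       label="Dataset as a collection"/>
```

With a declaration like this, a collection should only be offered in the dropdown if it is a `list` and its elements have the datatype `txt` or a subtype of it.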

How is the datatype determined? This collection contains a mix of file types.

Here is an example with code for the input field:

<param name="collection_input" type="data_collection" label="Dataset as a collection" optional="true" help="This field accepts datasets in collection format, as created by Create Dataset tool."/>

That is the problem, then. Usually, you have collections of one type. It’s probably better to share your tool and explain what you want to achieve.

This tool creates the collection. The immune-ml command creates various files, which a different tool must then use as inputs to another immune-ml call:

<tool id="immuneml_simulate_dataset" name="Simulate a synthetic immune receptor or repertoire dataset" version="@VERSION@.0">
  <description></description>
   <requirements>
      <requirement type="package" version="1.2.5">immuneML</requirement>
    </requirements>
  <command><![CDATA[

      cp "$yaml_input" yaml_copy &&
      immune-ml ./yaml_copy ${html_outfile.files_path}
      --tool DataSimulationTool &&

      mkdir outputs &&
      cp -r ${html_outfile.files_path}/result ./outputs/result && (mv ./outputs/result/repertoires/* ./outputs/result &>/dev/null || :) && rm -rf ./outputs/result/repertoires
      && mv ${html_outfile.files_path}/index.html ${html_outfile} && mv ${html_outfile.files_path}/immuneML_output.zip $archive
      ]]>
  </command>
  <inputs>
      <param name="yaml_input" type="data" format="txt" label="YAML specification" multiple="false"/>
  </inputs>
    <outputs>
        <data format="zip" name="archive" label="Archive: dataset simulation"/>
        <collection name="output" type="list" label="Simulated immuneML dataset">
            <discover_datasets pattern="__designation__" directory="outputs/result" />
        </collection>
        <data format="html" name="html_outfile" label="Summary: dataset simulation"/>
    </outputs>


  <help><![CDATA[

        This Galaxy tool allows you to quickly make a dummy dataset.
        The tool generates a SequenceDataset, ReceptorDataset or RepertoireDataset consisting of random CDR3 sequences, which could be used for benchmarking machine learning methods or encodings,
        or testing out other functionalities.
        The amino acids in the sequences are chosen from a uniform random distribution, and there is no underlying structure in the sequences.

        You can control:

        - The number of sequences in the dataset, and in the case of a RepertoireDataset, the number of repertoires

        - The length of the generated sequences

        - Labels, which can be used as a target when training ML models

        Note that since these labels are randomly assigned, they do not bear any meaning and it is not possible to train an ML model with high classification accuracy on this data.
        Meaningful labels can be added using the `Simulate immune events into existing repertoire/receptor dataset <https://galaxy.immuneml.uio.no/root?tool_id=immuneml_simulation>`_ Galaxy tool.

        For the exhaustive documentation of this tool and an example YAML specification, see the tutorial `How to simulate an AIRR dataset in Galaxy <https://docs.immuneml.uio.no/galaxy/galaxy_simulate_dataset.html>`_.

        **Tool output**

        This Galaxy tool will produce the following history elements:

        - Summary: dataset simulation: an HTML page describing general characteristics of the dataset, including the name of the dataset
          (this name should be specified when importing the dataset later in immuneML), the dataset type and size, and a link to download
          the raw data files.

        - Archive: dataset simulation: a .zip file containing the complete output folder as it was produced by immuneML. This folder
          contains the output of the DatasetExport instruction including raw data files.
          Furthermore, the folder contains the complete YAML specification file for the immuneML run, the HTML output and a log file.

        - Simulated immuneML dataset: Galaxy collection containing all relevant files for the new dataset.

    ]]>
  </help>

</tool>

This yaml file can be used as input for testing:

definitions:
  datasets:
    my_random_dataset:
      format: RandomRepertoireDataset
      params:
        labels:
          HLA:
            A: 0.6
            B: 0.4
        repertoire_count: 20
        result_path: simdataout/my_random_dataset
        sequence_count_probabilities:
          10000: 0.5
          12000: 0.5
        sequence_length_probabilities:
          12: 0.25
          13: 0.25
          14: 0.25
          15: 0.25
instructions:
  my_dataset_export_instruction:
    datasets:
    - my_random_dataset
    export_formats:
    - Pickle
    type: DatasetExport

—Failed attempt at sharing code —

Is there a better way to share code?

Maybe it’s better to share this repo, although it’s in a weird site:
https://galaxy-ntnu.bioinfo.no/toolshed_nels/repos/knutwa2/immuneml_tools

I tried several things. The following seems to work in 21.01:

  1. Tag the output datasets with format="binary":
    <collection name="output" type="list" label="Simulated immuneML dataset">
        <discover_datasets pattern="__designation__" directory="outputs/result" format="binary" />
    </collection>
  2. Tag with format="data" and add the extension .data to the file names.

Why? I don’t know. I read the documentation I could find, and it’s not that helpful.

Do you think one of the two is a decent solution? Is one better than the other? What about backwards compatibility?

There are both text, tabular text and binary files in there. The user should not have to worry about the individual files (I think), and only treat them as a collection.

A third option:
3. pattern="__name__" and format="data", not necessary to change file names:

 <collection name="output" type="list" label="Simulated immuneML dataset">
    <discover_datasets pattern="__name__"  directory="outputs/result" format="data"/>
 </collection>

I like that the GUI simply displays the full names with this option.

The tools in question are on github here:

To reproduce, use the yaml file I posted earlier as input to:
immuneml_simulate_dataset.xml

Any specific dataset collection should contain datasets all with the same datatype. That is how collections are constructed/used in Galaxy. There can be complex datatypes (formats) as well. If Galaxy doesn’t already include what you want, you could define/create a new one.

Or, if you need to handle more datatypes as inputs to a tool, and want to avoid creating a new datatype, break the data down into distinct collections per datatype, then prompt for those inputs individually on the tool form.
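As a rough sketch of the per-datatype split, a tool could declare one output collection per datatype and use a regular expression in `pattern` to pick up only the matching files. The collection names, labels, and the `.csv`/`.txt` file extensions below are assumptions for illustration; the actual files written by immune-ml may differ.

```xml
<outputs>
    <!-- Sketch only: names, labels and file extensions are assumed values. -->
    <!-- Note that < and > in the named group must be XML-escaped. -->
    <collection name="output_csv" type="list" label="Simulated dataset: CSV files">
        <discover_datasets pattern="(?P&lt;designation&gt;.+)\.csv"
                           format="csv" directory="outputs/result"/>
    </collection>
    <collection name="output_txt" type="list" label="Simulated dataset: text files">
        <discover_datasets pattern="(?P&lt;designation&gt;.+)\.txt"
                           format="txt" directory="outputs/result"/>
    </collection>
</outputs>
```

Each collection is then homogeneous, and a downstream tool can ask for them with two separate `data_collection` parameters, each with its own `format`. In this sketch the `designation` group captures the file name without its extension, so element names would not carry the full original file name.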

You can add examples to the tool form to explain the usage. Maybe review other tools to see how that is done?

Also, keep in mind that if there is some type of index associated with or required to interpret an input, that can be “from the history” (created by some intermediate tool) or a built-in index (created by an administrator, directly or with a data manager). See the BLAST tool suite as one example of complex tools that can do both. Mothur and Gemini are also good examples of complex inputs that are pre-organized, formatted, then interpreted within the tool suites – but in a different way from BLAST. Or, see a tool like Maker that breaks out the different inputs by datatype.

Hope that helps!

Thanks! I will look at those tool examples. Not sure what you mean by complex datatypes. I’m guessing it has to do with a “datatypes” class hierarchy in the Galaxy source code? Not sure about “index” either. The essence for us is that the original file names must remain intact.

Multiple collections would negatively affect user friendliness.

This compatibility issue appeared between Galaxy 20.05 and 20.09.

Regarding my suggested solution above, it seems to work in 19.05, 20.09 and 21.01. I’m concerned about future compatibility though. It is a mystery what Galaxy does with the file names and file format info, hence there could be issues I haven’t thought of. The only noticeable difference is that we have lost the ability to easily view .txt and .csv files via the GUI. That was a nice but non-crucial feature.

txt and csv files can be viewed as before. Please try on usegalaxy.eu; it works there. I’m not sure what mystery about datatypes you are referring to here. Galaxy defines datatypes in a very strict way and you are encouraged to stick to well-defined, standard datatypes. If you are working with collections, we can only recommend using collections with a homogeneous datatype. If your several outputs are only useful together and will never be used on their own, you can create your own “composite datatype”.

(The following reply has been shared to @knutwa-ext and the rest of his group through another platform, but it is now copied here for other readers.)

If the scenario is that you will always have a bunch of different types of files with specific names in a bundle (possibly with some of them optional), and that you will use a specific tool to create this bundle,
I think the correct solution would be to implement a “composite datatype” as already suggested by @bjoern.gruening (I believe, however you called it “complex datatype”…). See:

https://docs.galaxyproject.org/en/release_21.01/dev/data_types.html#creating-composite-datatypes

I suggest you subclass HTML for this. To put it simply, a Galaxy dataset is implemented as a main file that is shown to the user, connected to a hidden directory of whatever files you want. The main dataset file shown to the user would then be an HTML file containing a description of your full bundle and possibly links to the individual files. Those files should then be added as extra files to the dataset.
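On the tool side, this could look roughly like the sketch below. The tool id, the output name `bundle`, the `immuneml_dataset` extension and the `outputs/...` paths are all hypothetical; the sketch also assumes an earlier step has already written `outputs/result` and `outputs/index.html`, and that the extension has been registered on the server as an HTML subtype.

```xml
<tool id="example_bundle" name="Example bundle" version="0.1.0">
    <!-- Sketch only: id, extension and paths are hypothetical values. -->
    <command><![CDATA[
        ## Copy all bundle files into the dataset's hidden extra-files directory.
        mkdir -p '$bundle.files_path' &&
        cp -r outputs/result/. '$bundle.files_path' &&
        ## The HTML overview becomes the main file shown to the user.
        cp outputs/index.html '$bundle'
    ]]></command>
    <inputs/>
    <outputs>
        <data name="bundle" format="immuneml_dataset" label="Simulated immuneML dataset"/>
    </outputs>
</tool>
```

Because the files are copied as they are, their original names stay intact inside the extra-files directory.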

You can implement this without the need to implement a new datatype class in Galaxy. Instead, you can just define an HTML subtype directly in the datatypes_conf.xml file, which ensures that your tools do not accept just any HTML file in the history. A custom datatype subclass in the code would only be needed if you want the datatype to be automatically “sniffed”, or to display specific overview content in the Galaxy history box itself.
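A minimal sketch of such an entry is shown here. The extension name `immuneml_dataset` is a hypothetical choice for illustration; `galaxy.datatypes.text:Html` is the class behind Galaxy's built-in `html` datatype.

```xml
<?xml version="1.0"?>
<!-- Sketch only: "immuneml_dataset" is a hypothetical extension name. -->
<datatypes>
    <registration>
        <datatype extension="immuneml_dataset"
                  type="galaxy.datatypes.text:Html"
                  subclass="true"/>
    </registration>
</datatypes>
```

A downstream tool could then restrict its input to this subtype with `type="data" format="immuneml_dataset"`, so that plain `html` datasets are not offered as inputs.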

See: Advanced Tool Development Topics — Planemo 0.74.4 documentation

With such a solution, you don’t need to use data collections at all. Data collections could, however, be used if you need to represent a higher-level collection of such bundles.

BTW: I see that the Galaxy devs have exported the datatypes part of the code base into a standalone pypi package: galaxy-data · PyPI

See: Document: Tool datatypes_conf.xml · Issue #8565 · galaxyproject/galaxy · GitHub

So adding a custom datatype to that package would make it possible to reuse the same functionality from the command line, if needed. As a minimum, you should then provide a one-liner pull request to that package with the contents of the datatypes_conf.xml that you want to add. Then, automagically, all updated Galaxy servers will support your data type.
