Renaming files in a collection for BiG-SCAPE to avoid errors

Dear Galaxy community,

I have an output collection from a tool (antiSMASH) that contains files in GenBank format. The files have the extension “.genebank”. I want to supply these files as input to another tool (BiG-SCAPE). However, it accepts files with the “.gbk” extension only.

When provided with the collection with .genebank names, the tool (BiG-SCAPE) generates an output but does not recognize the files. I downloaded the collection, manually changed the extension to .gbk, and uploaded it again. This time the output from BiG-SCAPE was fine.
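For reference, the manual rename step could be scripted locally instead of editing each file by hand. This is only a minimal sketch: it assumes the collection has already been downloaded and unpacked into a local directory, and the function name `rename_to_gbk` and the default `.genebank` suffix are illustrative, not part of Galaxy or either tool.

```python
from pathlib import Path


def rename_to_gbk(directory: str, old_suffix: str = ".genebank") -> list[Path]:
    """Rename every file ending in old_suffix so that it ends in .gbk instead."""
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.endswith(old_suffix):
            # Cut off the old suffix and append .gbk; the content is untouched.
            target = path.with_name(path.name[: -len(old_suffix)] + ".gbk")
            path.rename(target)
            renamed.append(target)
    return renamed
```

Calling `rename_to_gbk("path/to/downloaded_collection")` returns the new paths, which can then be uploaded again. This still leaves the download and upload steps manual.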

Which tool can I use to automate renaming the extension for the individual files in a collection? The .gbk option is not available in the convert or datatype options for the collection. Thank you.

Hi @Shreyash_04

Instead, updating the datatype should be enough. But you can share more details if I am misunderstanding.

See the second FAQ here for the how-to for batch changing the datatype of a collection folder of dataset files → https://training.galaxyproject.org/training-material/faqs/galaxy/#collections

There isn’t a batch way to rename the dataset name since it isn’t supposed to matter. Tools interpret the datatype instead, and you have full control over that.

Walking through this for clarity.

One tool is expecting one of these →

And outputs these →

Then the next tool is expecting one of these →



If Galaxy guessed the datatype wrong because of a file extension (usually during Upload), you can directly update to the correct one, and you can do that in batch on a collection. Just make sure the data content is actually a match for the datatype.

A tool should definitely assign the datatype of outputs correctly, no matter what they are named, and if it doesn’t, that would be a bug to fix and we can confirm the problem and report it.

I know of only one tool that does interpret the actual dataset name (for reasons internal to the underlying tool that can’t be worked around).

If you determine that BiG-SCAPE is actually interpreting the dataset file name, or that antiSMASH is not assigning the correct datatype to outputs, I would be curious and could review an example in a shared history. We could ask the developers about it.

Please review the above and see if it solves the problem, or share more details. Thanks! :slight_smile:

Hi Jennifer,

Thank you for the detailed reply.

Unfortunately, the batch datatype change does not offer an option to convert to the .gbk extension. Hence, I cannot use that.

The output from antiSMASH is in GenBank format but with the extension “.genebank”. The BiG-SCAPE tool needs GenBank files as input, but with the .gbk extension. It does take the .genebank files and produces an output, but the output is blank.

The antiSMASH output.

The BIG-SCAPE output with the .genebank input.


The number of genomes detected is 0, that means it did not recognize the input itself.

Then the collection 310 was downloaded and renamed manually to have the .gbk extension and uploaded back to galaxy.

This time BiG-SCAPE recognized the .gbk files and produced an output.

The problem could be solved in two ways. 1) Either the antiSMASH output is directly in .gbk format (Only the naming convention, the content is ok). 2) BIG-SCAPE should be able to process .genebank files as well as .gbk files.

Please let me know should you need more details.

Kind regards
Shreyash

Hi @Shreyash_04

Thanks for posting those details! Very helpful.

I’ve started up a very simple history here to test to see if this can be reproduced.
https://usegalaxy.eu/u/jenj/h/test-bigscape

  1. First input has both the genbank datatype and a file name that ends with .gbk
  2. Second input has the genbank datatype but a file name that doesn’t end with .gbk

Let’s see what happens. If you want to check I have that set up right, that would be great too. A small example is helpful to the developers if this needs a change.

Update

I could reproduce the problem. Yes, the tool requires inputs like

filename.gbk

and fails with

filename

even when the correct datatype (genbank) is assigned.

The tool form does use the datatype to detect inputs in select lists. So, both the correct filename.gbk and datatype genbank are needed for a successful run.
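To illustrate the behavior seen in the test, here is a minimal sketch of how a name-based input filter would produce exactly this result. It is a hypothetical illustration, not BiG-SCAPE's or the Galaxy wrapper's actual code; the function name `would_be_picked_up` and its parameter are made up for the example.

```python
from pathlib import PurePath


def would_be_picked_up(dataset_name: str, required_suffix: str = ".gbk") -> bool:
    """Hypothetical check: accept a file only if its name ends with the suffix."""
    # Only the name is inspected; the content and the Galaxy datatype play no role.
    return PurePath(dataset_name).suffix == required_suffix


for name in ["filename.gbk", "filename", "filename.genebank"]:
    print(name, would_be_picked_up(name))
```

Under this assumption only `filename.gbk` passes, which would explain a run that finishes but reports 0 genomes when the names lack the extension.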

I’ve opened a ticket at the IUC to see what can be done. Please feel free to add more details or comments. Enhancement: allow BiG-SCAPE to process inputs without a required .gbk extension in dataset filename · Issue #6015 · galaxyproject/tools-iuc · GitHub

Thanks for all the followup!! :slight_smile:

Hi Jennifer,

Thank you for creating a test history and confirming the problem. I also appreciate your effort in raising the issue on GitHub.

I will follow the updates on GitHub.


Hi @Shreyash_04

Since the change might take a while to implement, I wanted to suggest a way to handle this inside Galaxy.

General path

  1. Put an “inputs” step, antiSMASH, and BiG-SCAPE into a workflow
  2. Use the function to rename the antiSMASH output dataset, and adjust for the required extension

You could even just put antiSMASH (along with an “inputs” – required for all workflows) into a workflow by itself to do this… but might as well stream the two together.

I should have mentioned this to start with. Sorry, was focusing on clarifying the root issue and direct usage, not workflow usage. In short, the whole download/rename/upload process could be skipped when using a workflow.

We have many workflow tutorials if this is new to you. Maybe start here → Hands-on: Creating, Editing and Importing Galaxy Workflows / Using Galaxy and Managing your Data

Screenshot of the function. Click on the target tool, then see the side panel. All of the options on the regular tool form will be there, plus a few workflow-specific options.

Hope this helps!

Hi @jennaj

Thanks a lot for the advice.

I attempted to perform the dummy analysis using a workflow with antiSMASH and BiG-SCAPE.
I attempted to change the datatype to .gbk in the configure output section for the GenBank files of antiSMASH. However, .gbk is not available as an option.


The rename function also did not work out. The main output collection was renamed, while the files inside the collection remained unchanged.

The analysis in the workflow did not work either. The BiG-SCAPE job failed.

One of the following may help me.

  1. Is there a function or tool to rename files in a collection using a formula (where only the extension is changed)?
  2. Can the .gbk filetype be made available in the convert datatype section for collections or files?

Thank you for taking the time to help me.

Best regards
Shreyash


There are a few parts here, so I’m going to number them.

  1. Datatype

For this part

I should have been more clear here

Name the file

filename.gbk

and assign (or leave) the datatype as genbank.

  2. “Filename” of the inputs

Correct, sorry.

The “filename” in a collection is a different attribute. Those are called the Element Identifiers, and they work sort of like a file name. I just tested to see if modifying those was enough to get the tool to accept the renaming, and it was.

Instead of typing out how to do this, I’m going to share some Galaxy artifacts.

Then I created a small workflow with example manipulation steps, along with a rerun, and what resulted. This is what you can add to your workflow, or you can adapt mine.

Many more text manipulation tools are in these tutorials https://training.galaxyproject.org/training-material/search2?query=olympics and the workflow manipulations are in the workflow tutorials, and you can see the function used in other contexts in at least two other tutorials (see the bottom of the Relabel identifiers tool form).
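As a rough illustration of the text manipulation involved, the sketch below turns a list of element identifiers into old/new pairs where every new identifier ends in .gbk. It is not the shared workflow itself. The helper names (`to_gbk`, `mapping_table`) and the list of old suffixes are assumptions for the example, and whether Relabel identifiers accepts a two-column table in this exact layout should be checked on the tool form.

```python
import re

# Assumed suffixes to replace; adjust to match the identifiers in your collection.
OLD_SUFFIXES = re.compile(r"\.(genebank|genbank|gb)$")


def to_gbk(identifier: str) -> str:
    """Return the identifier with a .gbk ending, replacing a known old suffix."""
    if identifier.endswith(".gbk"):
        return identifier
    return OLD_SUFFIXES.sub("", identifier) + ".gbk"


def mapping_table(identifiers: list[str]) -> str:
    """Build tab-separated lines: old identifier, then new identifier."""
    return "".join(f"{old}\t{to_gbk(old)}\n" for old in identifiers)
```

For example, `mapping_table(["region001.genebank", "region002"])` yields one line per element with the original identifier in the first column and the .gbk version in the second. The same replace-or-append logic can be expressed with a regex find-and-replace tool inside Galaxy.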

Hope this is the solution but please review and try it out :slight_smile:

Hi @jennaj ,

Thanks a lot for the solution. It worked perfectly. I tested it with the small dataset as well as with the larger files. It works well and has produced satisfactory results.

Thanks a lot. :smiley:


:rocket: Very happy to hear that! Happy science!
