I’m running some massive workflows on a private Galaxy instance, with up to thousands of individual jobs. Occasionally some jobs crash - I now suspect this is due to brief losses of communication with the cluster, for whatever reason. Anyhow, I’m struggling to figure out how to re-submit just these crashed jobs rather than having to run the whole time-consuming workflow again.
I assume it’s possible, since some jobs have now reached a ‘paused’ state - waiting for their dependencies to finish first.
Does anyone know? Presumably it involves the Run or Re-run buttons (and perhaps some subsequent settings), but I don’t understand the difference between them.
Yes, you can rerun just the jobs that failed, place the new results back into the original set of outputs, and then resume the downstream tools that use those outputs as inputs. This is a per-dataset action rather than a workflow-level one.
Why? There isn’t a good way to fetch only the prior failures and rerun them from a Workflow Invocation, since there are some (advanced!) complications in how a workflow invocation is attached to individual previously scheduled or completed jobs. We are still thinking about solutions. For now, you can supply all of the original inputs again and rerun the whole batch, but I’m guessing that is not what you want right now.
There are two distinct sets of controls:

- Dataset controls (at the history level)
- Workflow controls (at the invocation level)
The icons in your screenshot are the workflow-level controls, not controls for individual jobs (or the datasets those jobs read from or write to).
The workflow Run button will load up the original workflow form again (where you can select the inputs based on the current history).
The workflow Rerun button will run that entire invocation over again, exactly as before. If needed, you’ll see a pop-up to switch back to the original history (so the workflow can automatically select the same inputs and output data/settings again).
So, for now, to pick up only the current failures without rerunning the entire processing again: click into the output history and rerun the individual failed datasets, enabling the optional settings to replace the failed outputs and resume the dependent jobs.
If you are the server administrator, you can also start jobs up directly via the API. This involves scanning for failed jobs over a particular time period and rerunning them. I can share more details if you want to try this instead and are not sure where to start.
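As a starting point, here is a minimal sketch of that approach using the BioBlend Python client. The server URL, API key, and date range are placeholders, and the exact job fields and supported parameters can vary by Galaxy/BioBlend version, so treat this as an outline rather than a ready-made script:

```python
# Sketch only: assumes BioBlend is installed, the key is an admin-level API key,
# and your Galaxy version supports remapping job outputs on rerun.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://your.galaxy.server", key="ADMIN_API_KEY")

# Find jobs that ended in the error state within a given time window
# (placeholder dates - adjust to the period when the cluster hiccup happened).
failed_jobs = gi.jobs.get_jobs(
    state="error",
    date_range_min="2024-01-01",
    date_range_max="2024-01-07",
)

for job in failed_jobs:
    # remap=True asks Galaxy to map the new outputs back onto the failed ones,
    # so paused downstream jobs can resume instead of starting from scratch.
    gi.jobs.rerun_job(job["id"], remap=True)
    print(f"Resubmitted job {job['id']} ({job.get('tool_id')})")
```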
In addition to all of the above: if, in a high-throughput scenario, you have a lot of failed jobs that can in principle be resumed, but it’s just very tedious to do this job by job through the UI, the planemo command line tool can automate this process for you, as explained here:
Thanks for the amazingly detailed answer. I’m fine with manually re-running the failed jobs; it’s good to know that the dependencies will resume from there. I’ll try it tomorrow!