I’m processing 16S rRNA gut microbiome data (202 paired-end samples, with R1 and R2 forward/reverse fastq files) using the DADA2 workflow on usegalaxy.eu.
My job “dada2: removeBimeraDenovo on data 5104 (makeSequenceTable output)” has been stuck
in a queued/running state since Monday, Oct 6, 2025 (> 24 hours).
Previous runs of the same step stayed queued for more than two weeks before I deleted them and reran the job.
I also tested running removeBimeraDenovo + assignTaxonomy as mapped collections (202 jobs each); those completed and look OK, but I need the combined ASV table from the single run.
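For context, outside Galaxy the “combined” step I need corresponds roughly to the following in DADA2’s R interface (a sketch only; object and file names are placeholders):

```r
# Sketch of what I mean by the "combined" run (plain DADA2 in R; names are
# placeholders). The per-sample collection jobs each see one sample, but the
# ASV table I need is built once across all 202 samples:
library(dada2)

dada_list <- readRDS("dada_per_sample.rds")  # placeholder: list of per-sample dada results
seqtab    <- makeSequenceTable(dada_list)    # one table: 202 samples x all ASVs
# ...then removeBimeraDenovo is run once on this combined table (the stuck job).
```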
How do I resolve this, and how can I check whether the job is truly running?
We have many forum topics about how to interpret how work moves through the job queues – see queued-gray-datasets. If you delete and rerun a job, the new job enters at the back of the queue! If you cycle through delete/rerun quickly enough, a job may never get a chance to move up and actually run.
Those examples also apply to running jobs! Letting the work process is how you reach the end state where the job comes back from the cluster and the results are sorted into your history. Logs will be produced for any job, and hopefully results for successful jobs.
It sounds like your very large job has been dispatched to a cluster node and is processing. The best advice we have is to let this complete. The same work would take the same amount of actual processing time (orange dataset) no matter where it is run, given the same inputs, parameters, and computing infrastructure. The administrators can’t really glean anything more about the job until it finishes – mostly just that it is running – the same as what the dataset status indicates.
Many UseGalaxy servers will automatically kill a job after a couple of days (this varies by tool and available resources → FAQ: Understanding walltime error messages), but the UseGalaxy.eu server is special in that it allows a job to run for an indefinite amount of time. This makes it well suited to longer-running processes. The administrators will eventually kill runaway jobs, but that is usually on a time scale of weeks, not days.
So – I would strongly recommend allowing this to run! You’ve already confirmed that the inputs are appropriate for the tool with the split-data runs (a very good idea!). Whether this tool can actually process all the data together is the next test. Aborting this and starting over will only restart the entire process, further extending the wait time for the results.
I hope this helps. We can follow up more in a couple of days if you see the history not reflecting what can be found in the server statistics – but please know that situation would be really rare! The administrators have processes to catch unexpected technical communication problems between the clusters and the base server, and would probably just restart the job for you, as long as it is not already deleted. In short, if you do want to ask about status, please be sure to leave the job’s datasets active in the history, so as not to interrupt what may turn out to be a successful run.
Let’s start there, and below is some bonus information that may be helpful, or you can ignore it!
BONUS!
And, because I was curious what I could find to supplement my general scientific observations and what is documented at DADA2: Fast and accurate sample inference from amplicon data with single-nucleotide resolution (in particular, the Tutorial section “Remove chimeras”), I decided to ask ChatGPT to frame the benchmark timings observed for the underlying DADA2 tool across scientific forums.
I think this is relevant for you! 202 samples is already large. You may want to review other characteristics of your data – in particular, how many ASVs – and compare against what others have noticed (a sketch for checking this follows below). One scientist reported 17 days of run time, but they were working on a laptop. Your job in Galaxy is running on a distributed, computationally significant cluster resource (which is quite different!). I’m going to check the current threading (this is knowable) and the memory allocation (the latter is tricky because scaling can be dynamic, and actual use for your inputs is only reported in the job logs – which do not arrive until the end).
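If you want to gauge the ASV count yourself before the big job finishes, here is a minimal R sketch – assuming you can load the makeSequenceTable output into R (e.g., as an RDS export; the file name is a placeholder):

```r
# Minimal sketch: inspect a DADA2 sequence table to gauge removeBimeraDenovo cost.
# "seqtab.rds" is a placeholder for the makeSequenceTable output loaded into R.
library(dada2)

seqtab <- readRDS("seqtab.rds")

nrow(seqtab)    # number of samples (202 expected here)
ncol(seqtab)    # number of ASVs -- the main driver of chimera-removal runtime
sum(seqtab)     # total reads across all samples

# ASV length distribution, as in the DADA2 tutorial
table(nchar(getSequences(seqtab)))
```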
These are more good reasons to allow this to fully process until it completes or dies. We can then review the logs to see if Galaxy should or could optimize how the tool runs in our container environment, based on your real, full-sized data. The information in the GitHub issue links may also help you prepare the data better for this step if it needs a rerun later on – I checked, and all links are valid discussions with the original tool authors (not hallucinations). I agree with what it reported, and hopefully it helps to frame how this tool works. The developers state, repeatedly, that this step is the “big one” in the DADA2 pipeline.
I used GPT-4o (personal preference!), but you could also try GPT-5 to see what it says!
Prompt
New question: can you find any benchmarking metrics for how long the processing time should be for the underlying tool wrapped here in Galaxy? It doesn’t have to be Galaxy-specific, and try to avoid using information from the Galaxy Help forum. I’m most interested in topics at external forums or blogs.
GPT-4o Results
I looked around, and I didn’t find any well‑documented benchmarks (outside Galaxy‑forum-style notes) that report run times specifically for removeBimeraDenovo (the DADA2 chimera removal step) in the context of the Galaxy wrapper version you linked.
However, I found a few relevant external observations and constraints from user reports and DADA2 discussions that can help you estimate or interpret expected runtimes and performance bottlenecks.
Observations from DADA2 / removeBimeraDenovo usage
1. Chimera removal can be a major computational bottleneck
A user reported that running removeBimeraDenovo on a dataset with ~330,000 ASVs (across 23 samples) on a Windows PC with 16 GB RAM had run for ~70 hours (and was still going) before being interrupted. (GitHub)
Another user reported that for 68 samples with read depths 400–900k per sample, removeBimeraDenovo’s runtime estimate was > 18 days using a 12‑core setup, causing them to cancel. (GitHub)
These anecdotal cases show that for large, deeply sequenced datasets, this step can dominate runtime.
2. How DADA2 scales and mitigates computational cost
The DADA2 “Big Data” guide emphasizes that sample inference (dada) is parallelizable and that DADA2 is designed so that increasing the number of samples increases compute time linearly, while memory use stays relatively flat. (benjjneb.github.io)
Nonetheless, chimera removal doesn’t always parallelize as cleanly and may depend more heavily on the number of ASVs and the complexity of comparisons.
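To make the “linear in samples, flat memory” point concrete, the Big Data workflow processes samples one at a time and only combines them at the sequence-table stage. A rough sketch (forward reads only; paths and names are placeholders, not from the post above):

```r
# Rough sketch of the DADA2 "Big Data" pattern: per-sample inference keeps
# memory flat, so total time grows roughly linearly with the number of samples.
# Paths/names are placeholders; paired-end merging is omitted for brevity.
library(dada2)

filtFs <- list.files("filtered/", pattern = "_R1.fastq.gz", full.names = TRUE)
names(filtFs) <- sapply(strsplit(basename(filtFs), "_"), `[`, 1)

err <- learnErrors(filtFs, nbases = 1e8, multithread = TRUE)

dds <- vector("list", length(filtFs))
names(dds) <- names(filtFs)
for (sam in names(filtFs)) {
  derep <- derepFastq(filtFs[[sam]])               # one sample in memory at a time
  dds[[sam]] <- dada(derep, err = err, multithread = TRUE)
}

seqtab <- makeSequenceTable(dds)                   # combined table: samples x ASVs
# Chimera removal then runs once on this combined table -- the step in question,
# whose cost depends mainly on the number of ASV columns, not the sample count.
```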
Interpretation for the Galaxy wrapper you cited
Given the above, here’s what one might reasonably expect (or caution) when using the Galaxy dada2_removeBimeraDenovo tool:
For small to moderate-size datasets (e.g. tens of thousands of reads, moderate number of ASVs), chimera removal should finish in minutes to a few hours, especially if multithreading is enabled.
For large datasets (hundreds of thousands to millions of reads, large ASV tables), the runtime can scale into tens of hours or more.
Memory constraints are a real risk – if the job exceeds physical RAM, swapping will slow things down drastically.
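For reference, in the underlying R package the multithreading mentioned above is just an argument to the call, and the standard sanity check afterwards is the fraction of reads retained – a sketch with placeholder names (in Galaxy, threading is set server-side rather than by the user):

```r
# Sketch: chimera removal with multithreading, plus the standard post-check
# from the DADA2 tutorial. "seqtab.rds" is a placeholder for the combined table.
library(dada2)

seqtab <- readRDS("seqtab.rds")

seqtab.nochim <- removeBimeraDenovo(seqtab,
                                    method      = "consensus",
                                    multithread = TRUE,
                                    verbose     = TRUE)

dim(seqtab.nochim)                 # samples x non-chimeric ASVs
sum(seqtab.nochim) / sum(seqtab)   # most *reads* should survive, even if many ASVs are flagged
```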