Busco not working for a particular dataset

I am using BUSCO to assess my assemblies, and the BUSCO tool on your EU server fails on one particular assembly dataset. The job takes about 24 h to finish (much longer than usual), and when it finishes, all output files are green but contain no data (0 lines).

2023-08-23 10:17:48 ERROR: Impossible to read the lengths in /data/jwd02f/main/062/017/62017166/working/busco_downloads/lineages/aves_odb10/lengths_cutoff
2023-08-23 10:17:48 ERROR: BUSCO analysis failed!
2023-08-23 10:17:48 ERROR: Check the logs, read the user guide (User guide BUSCO v5.5.0), and check the BUSCO issue board on Issues · ezlab / busco · GitLab
Tool Exit Code 0
Job API ID 11ac94870d0bb33a2dcec8fcd4125252

Job Metrics

cgroup

CPU Time 364 hours and 44 minutes
Failed to allocate memory count 0E-7
Memory limit on cgroup (MEM) 80.0 GB
Max memory usage (MEM) 42.8 GB
Memory limit on cgroup (MEM+SWP) 8.0 EB
Max memory usage (MEM+SWP) 42.8 GB
OOM Control enabled No
Was OOM Killer active? No
Memory softlimit on cgroup 0 bytes

core

Cores Allocated 16
Memory Allocated (MB) 81920
Job Start Time 2023-08-22 09:58:38
Job End Time 2023-08-23 10:19:09
Job Runtime (Wall Clock) 24 hours and 20 minutes

hostname

hostname vgcnbwc-worker-c120m225-4042.novalocal

This is the link from the job:
https://usegalaxy.eu/api/datasets/4838ba20a6d867658ad3b9c9f3fc49b1/display?to_ext=txt

Thank you in advance,
Best Regards,
Pavle Eric

Hi @musaqa

Is there anything special about the assembly?

Long or short contigs? Numerous? Any quality assessment or filtering done post assembly?

The assembly tutorials cover that type of review, see https://training.galaxyproject.org/ and search with the keyword “assembly”.

BTW – you’ll need to set the history to a shared state for others to be able to see the job information view. But the tips above should help without that, along with the author/publication links at the bottom of the tool form. There are not any known issues with the tool, and the EU server hosts some of the largest public clusters. So this is almost certainly a data problem to solve, not a technical tool problem.

Hi jennaj,

Thanks for the quick reply,

Well, the assembly is more or less the same as the others that I have run through the BUSCO tool numerous times in the past few days. It is the same organism, assembled from the same reads with a higher k-mer value.

This is the one that failed two times in a row. It is currently running again and has been for some 20 hours, so I think it will fail again. It had k-mer=169:
abyss-fac ACH01_abyss_k169-scaffolds.fa |tee ACH01_abyss_k169-stats.tab
n n:500 L50 min N75 N50 N25 E-size max sum name
157707 54412 6082 500 30634 58381 101250 74192 479769 1.209e9 ACH01_abyss_k169-scaffolds.fa

And this is the one that worked without a problem a few days ago. It had k-mer=159:
abyss-fac ACH01_abyss_k159-scaffolds.fa |tee ACH01_abyss_k159-stats.tab
n n:500 L50 min N75 N50 N25 E-size max sum name
170004 46589 4772 500 38474 73636 130224 93908 555427 1.206e9 ACH01_abyss_k159-scaffolds.fa

This is the only part that has some information on the error:
2023-08-23 10:17:48 ERROR: Impossible to read the lengths in /data/jwd02f/main/062/017/62017166/working/busco_downloads/lineages/aves_odb10/lengths_cutoff

Are you sure the problem is with my data, or could it be with the aves lineage dataset?

Best regards,
Pavle

Maybe I misunderstood. Did the other successful runs use that same lineage?

The k-mer and the busco min length are probably part of the problem.

Yes, all previous runs used the aves lineage.

OK, then the index is fine and the problem is with the other inputs or, more likely, the settings.

You’ll need to do some detective work. Review the BUSCO publications and author resources, generate some stats on your assembly (the lengths of the contigs seem relevant), and check whether the tool can handle the total number of contigs.

Filtering down the assembly to remove anything “short” might help, even if only as a diagnostic. Example: remove contigs that are actually just one of the original reads that didn’t assemble. 50 bases for an assembled contig is really, really short, and is unlikely to be the result of two or more assembled reads… That could even be contamination or poor quality data. You could investigate what those are.
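A minimal sketch of the length filter suggested above, in plain Python with no external libraries. The function names and the cutoff passed in are illustrative assumptions, not BUSCO or Galaxy settings; pick a cutoff that makes sense for your assembly.

```python
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_length: int
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose sequence has at least min_length bases."""
    for header, seq in records:
        if len(seq) >= min_length:
            yield header, seq


def write_fasta(
    records: Iterable[Tuple[str, str]], handle: TextIO, width: int = 60
) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
```

To use it, open the assembly for reading and a new file for writing, then call `write_fasta(filter_by_length(read_fasta(src), 500), dst)`. The records are streamed one at a time, so the whole assembly never has to fit in memory.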

This doesn’t look like a failure due to lack of resources server side but BUSCO itself will have data expectations. You could also just take the first N sequences and try running the tool on that as a test - might give some more clues or produce a more meaningful error message.
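Taking the first N sequences can be done without parsing the sequences at all. The sketch below is one possible way to do it, assuming a standard FASTA file where every record starts with a `>` line; `first_n_records` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def first_n_records(lines: Iterable[str], n: int) -> Iterator[str]:
    """Yield the raw lines of the first n FASTA records, then stop reading."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break  # the (n+1)-th header marks the end of the subset
        if seen:  # ignore any stray lines before the first header
            yield line
```

For example, `dst.writelines(first_n_records(src, 1000))` with two open file handles would write a 1000-sequence test file. Keep in mind that the first N records of an assembly are not a random sample, so BUSCO scores from such a subset are only useful for checking whether the tool runs at all.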

Hi @musaqa ,

Yeah, I had the same issue some time ago and I can’t remember what it was, but the first thing I always like to do is re-run some old jobs (if I have them) with the same tool, just to rule out a Galaxy issue. If that works, then I try to figure out whether the issue is in my input data.
As for the long time for the job to finish: the same happened to me a couple of times, but in the end I got the results.
Maybe try a little filtering of your assembly prior to the BUSCO analysis, just to exclude the length issue? At the moment I only have experience with RNA-seq, and I like to filter out contigs shorter than 300 bp from my assemblies, since those would probably not result in any proteins in the end. Just an idea, but I am not a very experienced bioinformatician…

Cheers and good luck,

Lada

Different tools have different algorithms, so this is expected for the “yellow” executing stage.

How long jobs queue – the “gray” stage – is distinct and depends on a few factors. Tool choice, how busy the server is… there are many topics at this forum that cover those details.

You could try more filtering. I gave some ideas in a prior reply above. Maybe just input the longest sequences. Or, just the top N lines of the file. The tool might actually run and report more informative error logs.
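If you want to input only the longest sequences, one option is to keep the K longest records. This is a rough sketch using Python's standard `heapq.nlargest`; `read_fasta` and `longest_records` are illustrative names, and K is whatever subset size you want to test with.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_records(
    records: Iterable[Tuple[str, str]], k: int
) -> List[Tuple[str, str]]:
    """Return the k longest records, longest first."""
    return heapq.nlargest(k, records, key=lambda rec: len(rec[1]))
```

Because `heapq.nlargest` only holds K records at a time, this works on a streamed file as well as on a list.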

In the end, the problem could be the data itself of course but that’s what I would try, along with generating some more statistics. Get the lengths, cluster those to see if you can find any patterns, compare to the author resources about what the tool itself can handle, etc. There isn’t a single solution.
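As a starting point for those statistics, here is a small sketch that collects sequence lengths from a FASTA file and summarizes them. The function names and the cutoff values are assumptions chosen for illustration. Note that this N50 is computed over all sequences passed in, so it may differ from a value that another tool computed after applying its own minimum-length threshold.

```python
from typing import Dict, Iterable, List, Sequence


def sequence_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # stays None until the first header is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: Sequence[int]) -> int:
    """Length at which the longest sequences reach half of the total bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # empty input


def summarize(
    lengths: Sequence[int], cutoffs: Sequence[int] = (300, 500, 1000)
) -> Dict[str, int]:
    """Basic counts, plus how many sequences fall below each cutoff."""
    summary = {
        "count": len(lengths),
        "total": sum(lengths),
        "min": min(lengths, default=0),
        "max": max(lengths, default=0),
        "n50": n50(lengths),
    }
    for cutoff in cutoffs:
        summary[f"shorter_than_{cutoff}"] = sum(1 for n in lengths if n < cutoff)
    return summary
```

Running `summarize(sequence_lengths(open_file))` on both the k=159 and the k=169 assembly and comparing the `shorter_than_*` counts would show whether the failing assembly carries many more very short sequences.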

What you have now doesn’t look like a technical bioinformatics problem exactly – instead, this is a scientific problem to solve. Meaning, you need to give the tool something that it knows how to process.

This particular tool is trying to cluster and annotate your data by comparing it against a reference, then generating some stats about what it found. It might simply be overwhelmed by the number of input sequences, or the data doesn’t have any matches, or too many non-specific matches, or maybe something else like too many short reads it cannot process. You’ll need to investigate.

From what you’ve explained and tested so far, all of this would happen even outside of Galaxy, probably, unless the computer had infinite resources and the tool itself could actually keep track of the intermediate data.

Assembly and annotation processes are super sensitive to data content. So, you could try mapping the assembly against a reference (same species, or different) and reviewing in a browser like IGV, along with any known annotation (GTFs) and explore that way too. You could even just BLAT or BLAST+ a few sequences or any suspect sequences at UCSC or NCBI to see what happens. None of that is exact, it is exploratory.

Hope this works out for you, and feel free to post back what you find here! Might help others who are reading later on or following along now.