BUSCO error: refseq db not found

I ran BUSCO to evaluate the draft of a bacterial genome and got this error:
FileNotFoundError: [Errno 2] No such file or directory: '/cvmfs/data.galaxyproject.org/byhand/busco/v5/lineages/bacilli_odb10/refseq_db.faa.gz'

Thanks for your help in advance.

Best regards

Hi @H1889

Yes, this can happen with certain combinations of settings. We have some troubleshooting in these similar topics → #busco

This is a fresh testing history. I’m guessing that your options were set up like the first example run that errored.

So, instead, try one of the other combinations. Some can be unexpected. But if the error message says a database “isn’t available”, that means the underlying tool cannot process the data that way.

:graduation_cap: GTN Tutorials for Genome Annotation / Tutorial List

Thanks again for your kind help.

This behavior seems strange to me because 8 months ago I used the same combination of options and everything went well.

Anyway, I will follow your recommendations.

Greetings

Hmm, if you can show an example of where that worked before, I’d be willing to take a look.

Now, the prior versions with the simplified form all required a lineage selection (most like the first example in datasets 8-9 above), so offhand that may have been it. But now, with the compound form, I think it is all working as expected.

Hello again, in the post you mentioned above, there is something that doesn’t make sense.
Prokaryotic data is used, but the ortholog database used is from plants (liliopsida).
BUSCO works well if you use version 5.8.0+galaxy0.

Here is my command line:

busco --in '/jetstream2/scratch/main/jobs/71098230/inputs/dataset_fea5ce6a-c055-42f0-8e6d-0ddcaa7e5762.dat' --mode 'geno' --out busco_galaxy --cpu ${GALAXY_SLOTS:-4} --evalue 0.001 --limit 3 --contig_break 10   --lineage_dataset 'bacilli_odb10'  --miniprot  && mkdir BUSCO_summaries && ls -l busco_galaxy/run_*/ && cp busco_galaxy/short_summary.*.txt BUSCO_summaries/ && generate_plot.py -wd BUSCO_summaries -rt specific

Therefore, the error does not stem from the absence of orthologous protein databases for each lineage, but rather from the version of the program used. It is clear that the latest version does not work.

Best wishes

Hi @H1889

I was just setting the default combinations at the top level to show what would be produced – my example isn’t a scientific example, more about the settings. I should have worded this better! I’ll try again.

You have bacterial data, correct? Using Metaeuk is the only requirement; the remainder you can set as needed. The lineage can be “auto”, as I used, but you can also choose a lineage. In short, using Miniprot won’t work with the “auto” choice because it only functions with a lineage selection.

Some more details are here. → https://help.galaxyproject.org/t/busco-for-bacterial-genome-assemblies/14475

Does this now help? :slight_smile:

Hi @jennaj, thank you very much for the clarification.

Just to add some context from my side: I am also working with a prokaryotic (bacterial) genome in genome mode, and I have tested both auto-lineage and manual lineage selection (bacteria_odb10). In both cases, BUSCO fails with the same error:

FileNotFoundError:
.../busco/v5/lineages/bacteria_odb10/refseq_db.faa.gz

I previously observed the same type of error for archaea_odb10 during auto-lineage as well. This makes me suspect that, on this Galaxy instance (Galaxy AU), the BUSCO lineage reference files themselves may be missing or incomplete, rather than the issue being related only to the MetaEuk/Miniprot or auto/manual lineage combination. I understand your point about the UI combinations and predictors, but in this case the error consistently points to the absence of the refseq_db.faa.gz file for prokaryotic lineages.

Just sharing this in case it helps diagnose a possible server-side database issue affecting prokaryotic BUSCO runs.
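For anyone running BUSCO locally rather than on a public Galaxy server, the suspicion above can be checked directly. This is only a rough sketch: `has_refseq_db` is a hypothetical helper, and `LINEAGES_DIR` is an assumed location for downloaded lineages that you would need to point at your own folder. It only reports whether the `refseq_db.faa.gz` file named in the error exists in each lineage directory.

```shell
# Hypothetical helper for a local BUSCO install (not usable on a public Galaxy server).
# Prints "present" or "missing" depending on whether the lineage directory
# contains the refseq_db.faa.gz file named in the FileNotFoundError.
has_refseq_db() {
    lineage_dir="$1"
    if [ -f "$lineage_dir/refseq_db.faa.gz" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# LINEAGES_DIR is an assumed location; point it at your own download folder.
LINEAGES_DIR="${LINEAGES_DIR:-./busco_downloads/lineages}"
for lineage in bacteria_odb10 archaea_odb10 bacilli_odb10; do
    echo "$lineage: $(has_refseq_db "$LINEAGES_DIR/$lineage")"
done
```

A "missing" result here would not by itself prove a broken install, since a lineage may simply not ship that file; it only confirms what the error message reports.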

Welcome @sophiaescobar !!

I just checked the data at UseGalaxy.org and everything was working as expected. The UseGalaxy.org.au server should be using this same data through a shared CVMFS repository.

Shared history: https://usegalaxy.org/u/jen-galaxyproject/h/test-busco-2025-01

Screenshot of some of the test jobs with parameters tagged (you can explore the shared history for exact details).

Important key: please notice that Select a gene predictor must be set to metaeuk to allow the selection of the bacterial lineage. All prokaryotic lineages will require this same gene predictor.

For eukaryotic, you can use either of the predictors.


[1]

This is a good question so I’m glad you asked again, and I understand your point about suspecting that the gene predictor setting is unexpectedly specific!

But this is known and intentional for now: at the UseGalaxy public servers, the computed indexes for prokaryotic genomes are only available for use with the metaeuk option. This leads to Prodigal being used (technically!). Bacterial lineage indexes for miniprot are not available at this time.
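For a command-line run outside Galaxy, the same idea might look like the command posted earlier in this thread with `--miniprot` left out, so that BUSCO can fall back to its default predictor for the lineage. This is a dry-run sketch only: `build_busco_cmd` is a hypothetical helper that prints the command instead of running it, `genome.fasta` is a placeholder input name, and the CPU count is fixed at 4 for illustration. How the Galaxy wrapper builds its own command is not shown here.

```shell
# Dry-run sketch: print a BUSCO command for a prokaryotic lineage
# without the --miniprot flag. Nothing is executed.
# "genome.fasta" below is a placeholder, not a file from this thread.
build_busco_cmd() {
    input="$1"
    lineage="$2"
    printf '%s\n' "busco --in $input --mode geno --out busco_galaxy --cpu 4 --lineage_dataset $lineage"
}

build_busco_cmd genome.fasta bacilli_odb10
```

The other options from the original command (`--evalue`, `--limit`, `--contig_break`) are omitted only to keep the sketch short.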


[2]

The tool is a bit complicated with all of the options and the comprehensive indexes! Hopefully this explains what is going on, but does this actually help?


  1. Screenshot from the shared history, with the history panel’s datasets displayed, and the rerun form for dataset 14 shown. ↩︎

  2. Screenshot from the Busco 5.8.0+galaxy1 tool form showing option Select a gene predictor. Tool tip: In the case of a prokaryotic genome, Prodigal is the default gene predictor. ↩︎

Thanks for your reply @jennaj. It worked when I selected Augustus, but yesterday I tried selecting metaeuk and miniprot and got the error again. Thanks for your explanation of the subject. I hope it is helpful for other users in the future. Cheers
