Forming orthogroups using BUSCO and access to busco_sequences output

BShelb · June 13, 2024, 11:39am

Hi everyone,

I would like to access the sequences from BUSCO to form single-copy orthogroups after running it on several inputs, but I cannot see any options to obtain them. It seems that gff files can be collected but not the fasta files.

I considered using funannotate predict and then functional to get the protein sequences, but the only BUSCO references are buried deep in the GenBank file. I have seen that Proteinortho can be used to find orthogroups, but this would also consider genes outside the BUSCO single-copy ones.

Thanks in advance!

igor · June 14, 2024, 4:21am

Hi @BShelb
Do you mean sequences for assemblies used as input for BUSCO jobs? Maybe try ExtractFastaBed or getfasta or any other similar tool.
Hope that helps.
Kind regards,
Igor

BShelb · June 14, 2024, 11:05am

Hello @igor

Thanks for your reply.

It is not the input, which in this case is genome sequences. It is part of the output produced by BUSCO: these CDSs can be found in the busco_sequences folder in the output directory. This directory is not accessible through Galaxy, though, and there are no options to obtain this output.

I tried to get the gff files but it only returns a single empty gff file.

Thanks again.

igor · June 14, 2024, 11:05pm

Hi @BShelb
I am sorry for the unclear reply. BUSCO deals with a database and user data, such as assembly contigs. I referred to the latter as “input”. If you are interested in CDSs from your input file, not from the BUSCO database, consider converting the coordinates in the BUSCO output into interval/BED format and extracting the sequences with getfastabed or similar software.
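A minimal sketch of that conversion, in case it helps. It assumes the BUSCO full table is tab-separated with the columns BUSCO ID, status, sequence, gene start, gene end and strand in that order, that comment lines start with `#`, and that the coordinates are 1-based; please check these against your own table. The function name `busco_table_to_bed` is made up for illustration.

```python
import csv
import io


def busco_table_to_bed(table_text, keep_status=("Complete",)):
    """Turn rows of a BUSCO full table into BED6 lines.

    Assumed column order (verify against your own file):
    busco_id, status, sequence, gene_start, gene_end, strand, ...
    """
    bed_lines = []
    for row in csv.reader(io.StringIO(table_text), delimiter="\t"):
        # Skip blank lines and '#' comment/header lines.
        if not row or row[0].startswith("#"):
            continue
        # Rows without coordinates (e.g. missing BUSCOs) are too short to use.
        if len(row) < 6 or row[1] not in keep_status:
            continue
        busco_id, _status, seq, start, end, strand = row[:6]
        start, end = sorted((int(start), int(end)))
        # BED uses a 0-based start and an exclusive end;
        # the input is assumed to be 1-based and inclusive.
        bed_lines.append(f"{seq}\t{start - 1}\t{end}\t{busco_id}\t0\t{strand}")
    return bed_lines
```

The resulting lines can be saved as a BED dataset and passed to a FASTA-extraction tool. Note that this yields gene intervals including introns, not spliced CDSs.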
Kind regards,
Igor

BShelb · June 15, 2024, 3:51am

Thanks again @igor .

Sorry for the confusion but I am definitely talking about an output from BUSCO. It seems like a lot of work to instead use the table that is output, extract the DNA sequences, and then remove the introns (and convert them to proteins) when this is already done in BUSCO.

In the documentation for BUSCO it says:

“The BUSCO output folder name is BUSCO_<input_filename> by default, but this can be changed by using the -o or --out option.”

“The busco_sequences/ subdirectory contains protein sequence files in FASTA format (*.faa) and GFF files (*.gff) for each BUSCO gene identified… the busco_sequences/ subdirectory also contains nucleotide coding sequence files in FASTA format (*.fna)”
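For reference, when BUSCO is run outside Galaxy and the output folders are available on disk, the orthogroup step could be sketched roughly as below. This assumes each run keeps its single-copy hits in a `single_copy_busco_sequences` folder somewhere under `busco_sequences/`, with one FASTA file per BUSCO ID holding a single record. The function name `collect_single_copy` and the sample labels are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def collect_single_copy(run_dirs, out_dir, suffix=".faa"):
    """Write one multi-FASTA per BUSCO ID that is single-copy in every run.

    run_dirs maps a sample label to that sample's BUSCO output folder.
    Assumes files named <busco_id><suffix> inside a folder called
    single_copy_busco_sequences, with one sequence record per file.
    """
    per_busco = defaultdict(dict)  # busco_id -> {sample: fasta path}
    for sample, run_dir in run_dirs.items():
        pattern = f"**/single_copy_busco_sequences/*{suffix}"
        for fasta in Path(run_dir).glob(pattern):
            # If a folder holds several lineage runs, the last match wins.
            per_busco[fasta.stem][sample] = fasta

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for busco_id, by_sample in sorted(per_busco.items()):
        # Keep only BUSCOs found as single-copy in all samples.
        if len(by_sample) != len(run_dirs):
            continue
        with open(out / f"{busco_id}{suffix}", "w") as handle:
            for sample in sorted(by_sample):
                lines = by_sample[sample].read_text().splitlines()
                seq = [ln for ln in lines if ln and not ln.startswith(">")]
                # Rename each record after its sample for downstream alignment.
                handle.write(f">{sample}\n" + "\n".join(seq) + "\n")
        written.append(busco_id)
    return written
```

Each output file then holds one sequence per sample for a given BUSCO ID, which is the usual starting point for per-gene alignments.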

bernt-matthias · March 4, 2025, 8:26am

Hi @BShelb, I will try to include this output in the next version of the tool; see the pull request “busco DM: consider odb version” by bernt-matthias (Pull Request #6808, galaxyproject/tools-iuc, GitHub). Better late than never.
