After running BUSCO on several inputs, I would like to access the sequences it produces in order to build single-copy orthogroups, but I cannot see any option to obtain them. It seems that the GFF files can be collected but not the FASTA files.
I considered using funannotate predict and then functional to get the protein sequences, but the only BUSCO references are buried deep in the GenBank file. I have seen that Proteinortho can be used to find orthogroups, but this would also consider genes outside the BUSCO single-copy ones.
Hi @BShelb
Do you mean sequences for assemblies used as input for BUSCO jobs? Maybe try ExtractFastaBed or getfasta or any other similar tool.
Hope that helps.
Kind regards,
Igor
It is not the input, which in this case is genome sequences. It is part of the output produced by BUSCO: these CDSs can be found in the busco_sequences folder in the output directory. This directory is not accessible through Galaxy, though, and there are no options to obtain this output.
I tried to get the GFF files, but the tool only returns a single empty GFF file.
Hi @BShelb
I am sorry for the unclear reply. BUSCO deals with a database and user data, such as assembly contigs. I referred to the latter as “input”. If you are interested in CDSs from your input file, not from the BUSCO database, consider converting the coordinates from the BUSCO output into interval/BED format and extracting the sequences using getfastabed or similar software.
Kind regards,
Igor
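The coordinate-based approach suggested above could be sketched roughly as follows. This is only an illustration under several assumptions that are not confirmed in this thread: that the BUSCO full table is tab-separated with the columns Busco id, Status, Sequence, Gene Start, Gene End, Strand in that order (the layout differs between BUSCO versions), and that its coordinates are 1-based and inclusive. The function name `busco_table_to_bed` is made up for this example. Note that the resulting intervals span whole genes, so introns would still have to be removed afterwards.

```python
import csv
import io


def busco_table_to_bed(table_text, keep_status=("Complete",)):
    """Convert rows of a BUSCO full table into BED lines.

    Assumed column order (check it against your own table header):
    Busco id, Status, Sequence, Gene Start, Gene End, Strand, ...
    """
    bed_lines = []
    for row in csv.reader(io.StringIO(table_text), delimiter="\t"):
        # Skip blank lines and the commented header lines.
        if not row or row[0].startswith("#"):
            continue
        # Missing BUSCOs carry no coordinates, so short rows are skipped,
        # as are rows whose status was not requested.
        if len(row) < 6 or row[1] not in keep_status:
            continue
        busco_id, _status, seq, start, end, strand = row[:6]
        # Assumption: table coordinates are 1-based and inclusive,
        # whereas BED is 0-based and half-open.
        lo, hi = sorted((int(start), int(end)))
        bed_lines.append(
            "\t".join([seq, str(lo - 1), str(hi), busco_id, "0", strand])
        )
    return bed_lines
```

The returned lines can be written to a BED file and passed to a getfasta-style tool together with the assembly.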
Sorry for the confusion but I am definitely talking about an output from BUSCO. It seems like a lot of work to instead use the table that is output, extract the DNA sequences, and then remove the introns (and convert them to proteins) when this is already done in BUSCO.
In the documentation for BUSCO it says:
“The BUSCO output folder name is BUSCO_<input_filename> by default, but this can be changed by using the -o or --out option.”
“The busco_sequences/ subdirectory contains protein sequence files in FASTA format (*.faa) and GFF files (*.gff) for each BUSCO gene identified… the busco_sequences/ subdirectory also contains nucleotide coding sequence files in FASTA format (*.fna)”
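If the BUSCO output directories are available on disk (for example from a command-line run, since this is exactly what Galaxy does not seem to expose), the per-gene FASTA files could be merged into single-copy orthogroups with a short script. The sketch below is only a guess at a workable approach: it assumes each run contains a busco_sequences/single_copy_busco_sequences/ folder with one file per BUSCO ID, named after that ID, and that each run directory has a unique name that can serve as the sample label. The function name `collect_single_copy` and the header format are invented for this example.

```python
from collections import defaultdict
from pathlib import Path


def collect_single_copy(run_dirs, out_dir, suffix=".faa"):
    """Write one FASTA per BUSCO ID that is single-copy in every run.

    Assumes <run_dir>/**/single_copy_busco_sequences/<busco_id><suffix>
    exists for each single-copy BUSCO (layout not confirmed here).
    """
    run_dirs = [Path(d) for d in run_dirs]
    by_busco = defaultdict(dict)  # busco_id -> {sample name: FASTA path}
    for run_dir in run_dirs:
        for fasta in run_dir.glob(f"**/single_copy_busco_sequences/*{suffix}"):
            by_busco[fasta.stem][run_dir.name] = fasta

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for busco_id, per_sample in sorted(by_busco.items()):
        # Keep only BUSCOs found as single-copy in all samples.
        if len(per_sample) != len(run_dirs):
            continue
        with open(out / f"{busco_id}{suffix}", "w") as handle:
            for sample, fasta in sorted(per_sample.items()):
                for line in fasta.read_text().splitlines():
                    if not line:
                        continue
                    # Replace the original header with the sample label.
                    if line.startswith(">"):
                        line = f">{sample}|{busco_id}"
                    handle.write(line + "\n")
        written.append(busco_id)
    return written
```

Passing `suffix=".fna"` would collect the nucleotide coding sequences instead of the proteins. Each output file then holds one sequence per sample and can go straight into an aligner.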