Allowed makeblastdb file size (1GB): cut the genome into smaller pieces?

Dear GalaxyHelp,

I have uploaded a species genome to Galaxy. It is a conifer genome, which is quite large (for example, the Picea abies assembly: https://www.ncbi.nlm.nih.gov/Traces/wgs/CBVK01?display=contigs). When I try to create a BLAST database with the NCBI BLAST+ makeblastdb tool, I receive an error message about duplicated sequence IDs in the file. However, the problem is not that; there is certainly no duplication. The problem is the file size. I tried with a smaller genome and everything works fine. Galaxy says: Maximum file size: 1000000000B.

Given this, I was thinking of cutting the genome somehow and then creating two or three BLAST databases. Is that a good approach? Is it possible to achieve this in Galaxy (is there a tool for cutting)?

How can I create such a large BLAST database?
I would greatly appreciate any help! :slight_smile:

Best regards,
Thend

No one responds :frowning:

Hi @Thend

Apologies for the delay.

Correct, the problem is almost certainly the size. It is a scaffold-level assembly and highly fragmented. The error was probably a fallback error in the code that pops up when data processing gets interrupted, wherever your job was cut off (exceeded resources). It is not easy to make a tool that can trap every possible interruption and report a meaningful error, especially when different parts of the job are split up across cluster node cores. But all that is technical detail. I also doubt there are duplicated IDs, given the source data provider (NCBI). You could always double-check, but it is probably not worth it yet. The same goes for checking for a truncated fasta input created during Upload, meaning you didn’t get the entire genome loaded into Galaxy for some reason.

Options I can think of if the data is actually intact and fully uploaded:

  1. Consider filtering the genome instead of splitting it up, e.g. remove the shorter fasta records. Tool: Filter sequences by length. Review the assembly statistics at NCBI (they include length stats) for the original file, or run a tool like Compute sequence length followed by Group. The first code sketch after this list shows the same idea.

    • This may take some experimentation/testing to see if you can retain enough of the genome to make it still worth using while reducing the size enough to use it in Galaxy. Your issue isn’t that some of the “chromosomes” themselves are too long but rather that there are too many “chromosomes” to start with.
    • You could lose meaningful data (or not) doing this with data in such an early assembly state, so use your best judgment; the assembly stats at NCBI can be useful to review. If you’re not sure how to interpret them, search around, as discussion of these statistics is all over general bioinformatics forums. NCBI also has plenty of help, plus (depending on the genome) links to publications related to the assembly, from the authors or from others who have reviewed/used it.
  2. Split up the genome, build a BLAST database from each part, run BLAST against each of those, then merge the results. Tools for the splitting are in the group “Text Manipulation” (count lines, select/remove first/last lines, etc.). After the searches, concatenate the results; tabular output is probably best for this. However you do it, be careful about mismatched content or extra headers when merging the results together. The second code sketch after this list shows one way to split and merge.

    • You’ll need to convert fasta-to-tabular as the first step, then manipulate, then convert tabular-to-fasta once done with the manipulation. If you split the database this way, the metrics between the merged runs won’t be comparable (E-values depend on database size), except maybe at the highest level (has a hit or not).
    • Then you could go back and filter the original genome to include just the sequences that produced hits, or others like them (removing records that are too short, have quality issues, or whatever else prevents a hit with your given query content plus tool parameters). Those sequences could be considered “noise”, depending on your analysis goals, and doing this will also alter the statistics, of course, just in a different way than the other two methods (outright filtering by length, or breaking up the genome and merging results).
    • Again, a judgment call you’ll need to make.
  3. Move to your own Galaxy server and allocate enough resources. The underlying tool can handle very large data on the command line if the machine running it has sufficient computational resources. When running any third-party tool that is wrapped in Galaxy, the same amount of resources is needed as for command-line execution. I don’t know what this genome requires, but you could test that out or do some detective work (publications, forums). The last code sketch after this list shows what the underlying call looks like.

    • This tool is already given the maximum job resource allocation on the public server, so that cannot be increased there, but you could set up resources any way you want in your own Galaxy.
    • Galaxy Choices (summary): https://galaxyproject.org/choices/
    • Using Galaxy (all choices with more details): https://galaxyproject.org/use/
    • Teaching resources, which include some info about the choices (especially the cloud options, which involve far less administrative work) and are written in a way that can be easier for scientific readers to interpret: https://galaxyproject.org/teach/
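
As referenced in option 1, here is a minimal sketch of the length-filter idea outside Galaxy, using only the Python standard library. The file names and the 10 kb cutoff are hypothetical; the Filter sequences by length tool in Galaxy does the same job, and the cutoff should come from the assembly’s length statistics.

```python
# Minimal sketch of option 1: drop fasta records shorter than a cutoff.
# "genome.fasta", "genome.filtered.fasta", and MIN_LEN are hypothetical;
# tune MIN_LEN against the NCBI length statistics for the assembly.

MIN_LEN = 10_000  # hypothetical cutoff in bp

def read_fasta(path):
    """Yield (header, sequence) pairs from a fasta file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            elif line:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

kept = dropped = 0
with open("genome.filtered.fasta", "w") as out:
    for header, seq in read_fasta("genome.fasta"):
        if len(seq) >= MIN_LEN:
            out.write(header + "\n" + seq + "\n")
            kept += 1
        else:
            dropped += 1

print(f"kept {kept} records, dropped {dropped} shorter than {MIN_LEN} bp")
```

Re-running this with a few different cutoffs and checking how much total sequence survives is exactly the experimentation mentioned in option 1.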
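For option 2, here is a sketch of the split-and-merge idea under the same assumptions (all file names and the chunk count are made up). The merge step assumes plain BLAST tabular output with no header lines, so simple concatenation is safe; with headers you would need to skip the duplicates.

```python
# Minimal sketch of option 2: split a fasta into N parts, then (after
# building a database from each part and running BLAST against each)
# concatenate the tabular results. All file names are hypothetical.

N_CHUNKS = 3  # e.g. the two or three databases from the question

# Split: send records to the N output files round-robin, so the parts
# hold roughly equal numbers of records.
outputs = [open(f"genome.part{i + 1}.fasta", "w") for i in range(N_CHUNKS)]
rec = -1
with open("genome.fasta") as fh:
    for line in fh:
        if line.startswith(">"):
            rec += 1  # a new record starts at each header line
        outputs[rec % N_CHUNKS].write(line)
for out in outputs:
    out.close()

# Merge: concatenate the tabular BLAST results from the separate runs.
# Remember that E-values from the different parts are not directly
# comparable, since E-values scale with database size.
with open("blast.merged.tsv", "w") as merged:
    for i in range(N_CHUNKS):
        with open(f"blast.part{i + 1}.tsv") as part:  # hypothetical result files
            merged.write(part.read())
```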
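Finally, for option 3, this is roughly what the underlying call looks like on your own server, assuming BLAST+ is installed and on PATH; the input and database names are made up.

```python
# Minimal sketch of option 3: run makeblastdb directly, given enough
# memory/disk on your own machine. Input/output names are hypothetical.

import subprocess

subprocess.run(
    [
        "makeblastdb",
        "-in", "genome.fasta",  # the full conifer genome
        "-dbtype", "nucl",      # nucleotide database
        "-out", "picea_db",     # hypothetical database name
        "-parse_seqids",        # keep sequence IDs queryable by ID
    ],
    check=True,  # raise if the tool exits non-zero (e.g. killed for resources)
)
```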

Hope that helps!