Errors with makeblastdb & blastn on very large file sets

I am trying to follow a procedure in the following link - detailed instructions are in the supplementary material. I believe I have everything working up to the step where I have to run BLAST+ makeblastdb on a very large file (roughly 100 GB, although I haven't been able to successfully download the file the procedure uses - it's supposed to be all bacterial nucleotide sequences). Instead, I have downloaded the BLAST pre-formatted nucleotide database (https://www.ncbi.nlm.nih.gov/public/?blast/db/FASTA/nt.gz) over FTP. I have a local Galaxy set up to handle the large file sizes and am running it from an external hard drive. I thought these databases were supposed to be pre-formatted, but:

  1. When I try to run NCBI BLAST+ blastn on my assembled sequence against the database (after uncompressing it via Galaxy), I get an error:
    Fatal error: Exit code 137 ()
    /Volumes/MBLab/galaxy/database/jobs_directory/000/26/tool_script.sh: line 25: 87409 Killed: 9 blastn -query '/Volumes/MBLab/galaxy/database/objects/7/6/c/dataset_76cf539a-0e17-416d-b488-604ddd55b8ea.dat' -subject '/Volumes/MBLab/galaxy/database/objects/1/8/9/dataset_189ca2b3-5d9d-4629-bb24-2811979970ef.dat' -task 'megablast' -evalue '0.001' -out '/Volumes/MBLab/galaxy/database/objects/b/5/a/dataset_b5a49b6e-b95a-40b7-9f72-cf421e820fe5.dat' -outfmt '6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles' -num_threads "${GALAXY_SLOTS:-8}"

  2. When I try to run NCBI BLAST+ makeblastdb on the uncompressed nt file (uncompressed via Galaxy -> Edit dataset attributes -> Convert -> Convert compressed file to uncompressed) I get an error:
    Fatal error: Exit code 1 ()
    BLAST Database creation error: Error: Duplicate seq_ids are found:
    GNL|BL_ORD_ID:553373
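(An aside on the first error, for anyone hitting the same thing: exit code 137 is the shell's convention for a process killed with signal 9, SIGKILL, reported as 128 + the signal number - which is also what "Killed: 9" in the log means. On most systems this indicates the operating system killed blastn for using too much memory, which fits a megablast run against a ~100 GB subject file. A quick demonstration of the convention:)

```shell
# Exit code 137 = 128 + 9: the shell reports a process killed by
# signal 9 (SIGKILL) this way. In practice this usually means the OS
# reclaimed memory from an oversized job.
sh -c 'kill -9 $$'
echo "exit status: $?"    # prints: exit status: 137
```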
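(And on the second error: GNL|BL_ORD_ID identifiers are the internal ordinal IDs BLAST assigns when a database is built without parsed sequence IDs, so a FASTA dump can contain header IDs that collide when makeblastdb tries to parse them. A minimal sketch, assuming the uncompressed FASTA is named nt.fasta - a hypothetical name - for listing duplicated header IDs before building the database:)

```shell
# List FASTA header IDs (the first whitespace-delimited token of each
# ">" line) that occur more than once -- these are what triggers
# "Duplicate seq_ids are found" in makeblastdb.
grep '^>' nt.fasta | awk '{print $1}' | sort | uniq -d
```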

I am currently downloading a RefSeq bacterial database (from here) because RefSeq is non-redundant. I am going to try NCBI BLAST+ makeblastdb on those files once the download finishes to see if that helps, although the file is massive (300+ GB). Any advice is very welcome - I struggle with this and have no experience here. I will post an update on how the RefSeq approach goes.


Hello @aroebuck

The NCBI BLAST+ tools had some corrections today at UseGalaxy.org. Full testing is still in progress, but results so far have been good. Even if your combination of tools/inputs is not marked as "pass" yet in the tracking ticket's test matrix, consider a rerun anyway.

For any reruns, be sure to use the most current version of all tools in the BLAST+ tool suite (2.10.1+galaxy0), including NCBI BLAST+ makeblastdb "Make BLAST database" (Galaxy Version 2.10.1+galaxy0).

For very large data, any public Galaxy server can be a problematic choice. 300 GB of raw data would exceed your quota (storage) space. And even if that were increased, disk space for data storage is unrelated to the memory available to execute tools - the example you describe would certainly fail for lack of resources.

I added some tags to your post that link to prior Q&A about the GVL and other cloud options.

You also may want to consider joining our upcoming webinar:

Thanks!

Thank you,

I ended up re-trying after reading something elsewhere. I tried with the parsing parameter set to "Yes" (the instructions say to leave it set to "No"). That seemed to do the trick and the database was generated! Hopefully this was not a problematic thing to do.
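(For anyone finding this thread later: a minimal command-line sketch of what that Galaxy setting corresponds to, assuming the uncompressed FASTA is named nt.fasta and the output database nt_local - both hypothetical names. Setting the parsing parameter to "Yes" maps to makeblastdb's -parse_seqids flag.)

```shell
# Sketch only: build a nucleotide database from a FASTA dump.
# -parse_seqids corresponds to the parsing option set to "Yes" in Galaxy;
# nt.fasta and nt_local are hypothetical names.
makeblastdb -in nt.fasta -dbtype nucl -parse_seqids -title "nt local" -out nt_local
```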