Errors with makeblastdb & blastn on very large file sets

aroebuck · November 20, 2020, 2:21am

I am trying to follow a procedure from the following link; the detailed instructions are in its supplementary material. I believe I have everything working up to the step where I have to run NCBI BLAST+ makeblastdb on a very large file (about 100 GB, I believe, although I haven’t been able to download what the procedure uses; it’s supposed to be all bacterial nucleotide sequences). Instead, I downloaded the BLAST pre-formatted nucleotide database (NCBI/NLM/NIH :: Public FTP) over FTP. I have a local Galaxy set up to handle the large file sizes and am running it from an external hard drive. I thought these databases were supposed to be pre-formatted, but:

When I try to run NCBI BLAST+ blastn on my assembled sequence against the database (after I uncompressed it via Galaxy) I get an error:
Fatal error: Exit code 137 ()
/Volumes/MBLab/galaxy/database/jobs_directory/000/26/tool_script.sh: line 25: 87409 Killed: 9 blastn -query '/Volumes/MBLab/galaxy/database/objects/7/6/c/dataset_76cf539a-0e17-416d-b488-604ddd55b8ea.dat' -subject '/Volumes/MBLab/galaxy/database/objects/1/8/9/dataset_189ca2b3-5d9d-4629-bb24-2811979970ef.dat' -task 'megablast' -evalue '0.001' -out '/Volumes/MBLab/galaxy/database/objects/b/5/a/dataset_b5a49b6e-b95a-40b7-9f72-cf421e820fe5.dat' -outfmt '6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles' -num_threads "${GALAXY_SLOTS:-8}"
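For readers unfamiliar with exit code 137: shells conventionally report 128 plus the signal number when a process is killed by a signal, so 137 corresponds to signal 9 (SIGKILL), which matches the "Killed: 9" in the log. This is commonly seen when a job runs out of memory, although the log itself does not confirm the cause. A minimal Python sketch of that decoding, where `describe_exit_code` is a hypothetical helper and not part of Galaxy or BLAST+:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell exit status; codes above 128 conventionally mean 128 + signal number."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with error status {code}"


print(describe_exit_code(137))  # the blastn job above
print(describe_exit_code(1))    # an ordinary tool error
```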
When I try to run NCBI BLAST+ makeblastdb on the uncompressed nt file (uncompressed via Galaxy → Edit dataset attributes → Convert → Convert compressed file to uncompressed) I get an error:
Fatal error: Exit code 1 ()
BLAST Database creation error: Error: Duplicate seq_ids are found:
GNL|BL_ORD_ID:553373
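One way to investigate a duplicate-ID complaint is to scan the input for repeated sequence identifiers before building the database. The sketch below assumes the input is plain FASTA and treats the first whitespace-delimited token after `>` as the identifier; `duplicate_fasta_ids` is a hypothetical helper, and it may not reproduce the exact cause of the `BL_ORD_ID` error reported here.

```python
from collections import Counter
from typing import Dict, Iterable


def duplicate_fasta_ids(lines: Iterable[str]) -> Dict[str, int]:
    """Return {identifier: count} for every FASTA identifier seen more than once."""
    counts = Counter(
        line[1:].split(None, 1)[0]        # first token after '>'
        for line in lines
        if line.startswith(">") and line[1:].strip()  # skip empty headers
    )
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Small in-memory example; the records are made up for illustration.
sample = [">seq1 first copy", "ACGT", ">seq2", "GGCC", ">seq1 second copy", "TTAA"]
print(duplicate_fasta_ids(sample))
```

For a real file, the same function can be given an open file handle, for example `with open(path) as fh: duplicate_fasta_ids(fh)`, which reads line by line rather than loading the whole file at once.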

I am currently downloading a RefSeq bacterial database (from here) because RefSeq is non-redundant. I am going to try NCBI BLAST+ makeblastdb on those files once I have them, to see if that helps, although the download is massive (300+ GB). Any advice is very welcome; I struggle with this and have no experience here. I will post an update on how the RefSeq approach goes.

jennaj · December 8, 2020, 12:00am

Hello @aroebuck

NCBI BLAST+ tools had some corrections today at UseGalaxy.org. Full testing is still in progress, but the results so far have been good. Even if your combination of tools/inputs is not marked as “pass” yet in the tracking ticket’s test matrix, consider a rerun anyway.

Issue tracking ticket: BLAST failures at usegalaxy.org · Issue #318 · galaxyproject/usegalaxy-playbook · GitHub

For any reruns, be sure to use the most current version of all tools in the BLAST+ tool suite (2.10.1+galaxy0), including NCBI BLAST+ makeblastdb Make BLAST database (Galaxy Version 2.10.1+galaxy0).

For very large data, any public Galaxy server can be a problematic choice. 300 GB of raw data would exceed your storage quota. Even if that quota were increased, disk space for data storage is unrelated to the memory used to execute tools, and the example you describe would certainly fail for lack of resources.

Ways to use Galaxy: Galaxy Platform Directory: Servers, Clouds, and Deployable Resources - Galaxy Community Hub
Choices matrix: Galaxy Choices - Galaxy Community Hub
The GVL version of Cloudman is one choice and AWS offers grants. AWS Programs for Research and Education

I added some tags to your post that link to prior Q&A about the GVL and other cloud options.

You also may want to consider joining our upcoming webinar:

Thanks!

aroebuck · December 14, 2020, 7:10pm

Thank you,

I ended up re-trying after reading something elsewhere. I tried with the parsing parameter set to “Yes” (the instructions say to leave it set to “No”). That seemed to do the trick and the database was generated! Hopefully this was not a problematic thing to do.
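Outside Galaxy, the parsing option described above presumably corresponds to the `-parse_seqids` flag of the `makeblastdb` command-line tool; the thread does not show the underlying command, so that mapping is an assumption. A minimal sketch that only builds the command line, with `makeblastdb_command` as a hypothetical helper and the file and database names as placeholders:

```python
import shlex
from typing import List


def makeblastdb_command(fasta_path: str, db_name: str, parse_seqids: bool = True) -> List[str]:
    """Build a makeblastdb argument list for a nucleotide database."""
    cmd = ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_name]
    if parse_seqids:
        # Ask makeblastdb to parse identifiers from the FASTA headers.
        cmd.append("-parse_seqids")
    return cmd


# Placeholder names, not the files from this thread.
cmd = makeblastdb_command("bacteria.fasta", "bacteria_db")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed.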
