I am trying to follow the procedure at the following link (detailed instructions are in the supplementary material). I believe I have everything working up to the step where I have to run NCBI BLAST+ makeblastdb on a very large file (about 100 GB, I believe, although I haven't been able to download the file the procedure uses; it is supposed to contain all bacterial nucleotide sequences). Instead, I downloaded the pre-formatted BLAST nucleotide database over FTP (NCBI/NLM/NIH :: Public FTP). I have a local Galaxy instance set up to handle the large file sizes and am running it from an external hard drive. I thought these databases were supposed to be pre-formatted, but:
- When I try to run NCBI BLAST+ blastn on my assembled sequence against the database (after I uncompressed it via Galaxy), I get an error:
Fatal error: Exit code 137 ()
/Volumes/MBLab/galaxy/database/jobs_directory/000/26/tool_script.sh: line 25: 87409 Killed: 9 blastn -query '/Volumes/MBLab/galaxy/database/objects/7/6/c/dataset_76cf539a-0e17-416d-b488-604ddd55b8ea.dat' -subject '/Volumes/MBLab/galaxy/database/objects/1/8/9/dataset_189ca2b3-5d9d-4629-bb24-2811979970ef.dat' -task 'megablast' -evalue '0.001' -out '/Volumes/MBLab/galaxy/database/objects/b/5/a/dataset_b5a49b6e-b95a-40b7-9f72-cf421e820fe5.dat' -outfmt '6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles' -num_threads "${GALAXY_SLOTS:-8}"
- When I try to run NCBI BLAST+ makeblastdb on the uncompressed nt file (uncompressed via Galaxy → Edit dataset attributes → Convert → Convert compressed file to uncompressed), I get an error:
Fatal error: Exit code 1 ()
BLAST Database creation error: Error: Duplicate seq_ids are found:
GNL|BL_ORD_ID:553373
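For reading the two errors above, a small sketch that may help narrow things down. It rests on two general facts rather than anything specific to Galaxy: a shell reports a process killed by a signal as exit code 128 plus the signal number (so 137 would correspond to signal 9, SIGKILL), and FASTA text, gzip files, and tar archives can be told apart by their first bytes. The function names `signal_from_exit_code` and `sniff_format` are made up for illustration, and the sniffing only looks at the first 512 bytes, so this is a rough check and not a diagnosis.

```python
def signal_from_exit_code(code):
    """Return the signal number implied by a shell-style exit code, or None.

    Assumption: the code comes from a POSIX shell, which reports
    "killed by signal N" as 128 + N.
    """
    return code - 128 if code > 128 else None


def sniff_format(path):
    """Guess whether a file is gzip, tar, FASTA text, or something else.

    Only the first 512 bytes are inspected, so this is a heuristic.
    """
    with open(path, "rb") as handle:
        head = handle.read(512)
    if head[:2] == b"\x1f\x8b":        # gzip magic bytes
        return "gzip"
    if head[257:262] == b"ustar":      # tar header magic at offset 257
        return "tar"
    if head.lstrip()[:1] == b">":      # FASTA records start with '>'
        return "fasta"
    return "unknown"


if __name__ == "__main__":
    # Exit code taken from the blastn error above.
    print(signal_from_exit_code(137))
```

If `sniff_format` reports something other than `fasta` for the dataset handed to makeblastdb, the input format may be worth checking before anything else.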
I am currently downloading a RefSeq bacterial database (from here) because RefSeq is nonredundant. I am going to try NCBI BLAST+ makeblastdb on those files once I have them up to see if that helps, although the download is massive (300+ GB). Any advice is very welcome - I struggle with this and have no experience here. I will post an update on how the RefSeq approach goes.
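Before running makeblastdb on the RefSeq files, a check along these lines could flag repeated identifiers, which is what the "Duplicate seq_ids" error complains about. This is only a sketch under stated assumptions: the input is plain, uncompressed FASTA, the identifier is taken to be the first whitespace-delimited token of each header line, and `duplicate_fasta_ids` is a hypothetical helper name. It may not explain the error above, since that message names an automatically numbered ID (BL_ORD_ID).

```python
from collections import Counter


def duplicate_fasta_ids(path):
    """Return {identifier: count} for identifiers seen more than once.

    Assumption: the identifier is the first whitespace-delimited token
    after '>' on each FASTA header line.
    """
    counts = Counter()
    with open(path, "r") as handle:
        for line in handle:          # streams the file line by line
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:           # skip headers that are just '>'
                    counts[fields[0]] += 1
    return {seq_id: n for seq_id, n in counts.items() if n > 1}
```

The file is read one line at a time, but every identifier is kept in memory, so on a file of several hundred GB the counter itself could become large.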