Troubleshooting BWA-MEM2 resources under Docker Galaxy

Hi Jenna,
I ran into this same error.
With fastp-cleaned data, I can use bwa-mem2 to align a dataset against 1 or 2 genes of the hg38 reference. I then tried running the 48 pairs in the collection against a single gene, and that works. But when I scaled up to the complete genome, I got the same error (below). The difference is that I’m running in Docker Desktop, using the latest bgruening/galaxy-stable Docker image on Ubuntu 20.04 LTS, with 32 GB RAM, a 4 TB SSD, and two 1 TB hard drives. My system says I’m using half the RAM and 10 of the cores.

Details

Execution resulted in the following messages:

Fatal error: Exit code 1 ()

Tool generated the following standard error:

Looking to launch executable "/export/tool_deps/_conda/envs/mulled-v1-88bfe9d3fb5d8ab3673a5b08b613f2c0d466656f329fd172728c59fa3917261d/bin/bwa-mem2.avx2", simd = .avx2
Launching executable "/export/tool_deps/_conda/envs/mulled-v1-88bfe9d3fb5d8ab3673a5b08b613f2c0d466656f329fd172728c59fa3917261d/bin/bwa-mem2.avx2"
[bwa_index] Pack FASTA... 14.26 sec
* Entering FMI_search
init ticks = 262953660145
ref seq len = 6418597448
binary seq ticks = 147693711887
Allocation of 47.82 GB for suffix_array failed.
Current Allocation = 53.80 GB

Regards,
Ann


Hi @Ann_Holtz-Morris

What is your reference genome? Or is it an exome? If there is a public link to the fasta data, please share it for context. That may also help us suggest a workaround.

And, you are currently using the Custom genome function, correct? A fasta file from the history?

Pre-indexing the fasta is probably the solution. Indexing can be computationally expensive, more so than the alignment itself, and we had some trouble with BWA-MEM2 indexing too (close to being resolved). A pre-built index would also avoid spending compute time recreating the index each time you align against that reference.

This looks very similar to the errors we had when we first attempted to index the human genome. The root problem was a lack of memory on the cluster node where the job ran. I don’t recall the details, but we can find those if needed, e.g. how the memory scales for resource estimates.
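For a rough sense of scale, the numbers in the log above can be checked by hand. The sketch below assumes the suffix array stores one 8-byte integer per reference position and that the log's "GB" means GiB. Both are inferences from the arithmetic lining up with the reported 47.82, not something taken from the BWA-MEM2 documentation. The "ref seq len" value is also roughly twice the size of a human genome, which would be consistent with both strands being indexed, but that is a guess as well.

```python
# Back-of-the-envelope check of the failed allocation reported in the log.
REF_SEQ_LEN = 6_418_597_448  # "ref seq len" value copied from the log
BYTES_PER_ENTRY = 8          # assumption: one 64-bit integer per position
GIB = 1024 ** 3              # assumption: the log's "GB" is really GiB
HOST_RAM_GIB = 32            # RAM on the machine described in the post


def suffix_array_gib(ref_seq_len, bytes_per_entry=BYTES_PER_ENTRY):
    """Estimated suffix array size in GiB for a given reference length."""
    return ref_seq_len * bytes_per_entry / GIB


needed = suffix_array_gib(REF_SEQ_LEN)
print(f"suffix array estimate: {needed:.2f} GiB")  # 47.82, matching the log
print(f"fits in {HOST_RAM_GIB} GiB of RAM: {needed <= HOST_RAM_GIB}")  # False
```

Under those assumptions the suffix array alone needs more memory than the 32 GB host has, before counting the 53.80 GB the log says is already allocated. That would also explain why aligning against one or two genes works: the estimate grows linearly with the reference length.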



Data Managers: how to index local data and how to incorporate pre-computed indexes hosted at public servers: https://training.galaxyproject.org/training-material/search2?query=cvmfs

Working group’s repository for indexing data (new!): galaxyproject/idc on GitHub (Simon's Data Club, reference data for Galaxy servers)