BWA MEM2 too slow

Hi all,

We have been running a BWA-MEM2 workflow with Galaxy on a Sun Grid Engine cluster. On an 8-core/16 GB RAM machine, the execution took about 55 minutes with a really small FASTQ file. To speed it up, we moved the job to a 64-core/128 GB RAM machine, but the execution time is the same. We also created a script that exports GALAXY_SLOTS with a value of 64 before the execution starts, because that appears to be the variable that makes the workflow run in multithreaded mode, but it has no effect on the execution time.

We verified that the GALAXY_SLOTS value is correct by running this debug tool: "Debug Galaxy tool for PBS/TORQUE GALAXY_SLOTS" (GitHub). It returns 64 as expected.

Also, the BWA-MEM2 command line that Galaxy shows for the job uses the GALAXY_SLOTS variable:

set -o | grep -q pipefail && set -o pipefail; ln -s '/shared/Galaxy/database/objects/d/6/6/dataset_d66ecca3-445c-4fff-9315-4d45b8e51ac3.dat' 'localref.fa' && bwa-mem2 index 'localref.fa' && bwa-mem2 mem -t "${GALAXY_SLOTS:-1}" -v 1 -R '@RG\tID:sample1-r.fq.gz.gz\tSM:sampleNameTest\tPL:ILLUMINA\tLB:sample1-r.fq.gz.gz\tPU:run' 'localref.fa' '/shared/Galaxy/database/objects/7/e/7/dataset_7e78b97c-cf0f-48fb-bee8-b4a91ba3fa1a.dat' | samtools sort -@${GALAXY_SLOTS:-2} -T "${TMPDIR:-.}" -O bam -o '/shared/Galaxy/database/objects/7/0/5/dataset_705ab064-6728-4023-b8f5-6c0c6082edeb.dat'
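
As a minimal sketch of how the `"${GALAXY_SLOTS:-1}"` expansion in the command line above resolves: the shell substitutes the default only when the variable is unset or empty. The helper name `resolve_threads` is hypothetical and is not part of Galaxy or BWA-MEM2.

```shell
#!/usr/bin/env bash
# resolve_threads is a hypothetical helper used only for illustration.
resolve_threads() {
    # Same expansion as in `bwa-mem2 mem -t "${GALAXY_SLOTS:-1}"`:
    # falls back to 1 when GALAXY_SLOTS is unset or empty.
    echo "${GALAXY_SLOTS:-1}"
}

unset GALAXY_SLOTS
echo "GALAXY_SLOTS unset -> threads: $(resolve_threads)"

GALAXY_SLOTS=64
echo "GALAXY_SLOTS=64    -> threads: $(resolve_threads)"
```

If the job environment already carries GALAXY_SLOTS=64, as the debug tool reported, the thread option should receive 64 without further changes.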

What could I be missing to optimize BWA MEM2 executions?

Thanks in advance

I am also attaching the Galaxy console output:

Looking to launch executable "/shared/Galaxy/database/dependencies/_conda/envs/mulled-v1-88bfe9d3fb5d8ab3673a5b08b613f2c0d466656f329fd172728c59fa3917261d/bin/bwa-mem2.avx512bw", simd = .avx512bw
Launching executable "/shared/Galaxy/database/dependencies/_conda/envs/mulled-v1-88bfe9d3fb5d8ab3673a5b08b613f2c0d466656f329fd172728c59fa3917261d/bin/bwa-mem2.avx512bw"
[bwa_index] Pack FASTA... 15.00 sec
* Entering FMI_search
init ticks = 290784968664
ref seq len = 6418572210
binary seq ticks = 154991346064
build suffix-array ticks = 6763910118833
pos: 802321527, ref_seq_len__: 802321526
build fm-index ticks = 1228787853170
Total time taken: 3274.8116
Looking to launch executable "/shared/Galaxy/database/dependencies/_conda/envs/mulled-v1-88bfe9d3fb5d8ab3673a5b08b613f2c0d466656f329fd172728c59fa3917261d/bin/bwa-mem2.avx512bw", simd = .avx512bw
Launching executable "/shared/Galaxy/database/dependencies/_conda/envs/mulled-v1-88bfe9d3fb5d8ab3673a5b08b613f2c0d466656f329fd172728c59fa3917261d/bin/bwa-mem2.avx512bw"
-----------------------------
Executing in AVX512 mode!!
-----------------------------
* SA compression enabled with xfactor: 8
* Ref file: localref.fa
* Entering FMI_search
* Index file found. Loading index from localref.fa.bwt.2bit.64
* Reference seq len for bi-index = 6418572211
* sentinel-index: 2729492284
* Count:
0,	1
1,	1879238230
2,	3209286106
3,	4539333982
4,	6418572211

* Reading other elements of the index from files localref.fa
* Index prefix: localref.fa
* Read 0 ALT contigs
* Done reading Index!!
* Reading reference genome..
* Binary seq file = localref.fa.0123
* Reference genome size: 6418572210 bp
* Done reading reference genome !!

------------------------------------------
1. Memory pre-allocation for Chaining: 8917.6357 MB
2. Memory pre-allocation for BSW: 15335.4895 MB
3. Memory pre-allocation for BWT: 4948.1073 MB
------------------------------------------
* Threads used (compute): 64
* No. of pipeline threads: 2

[0000] read_chunk: 640000000, work_chunk_size: 6425706, nseq: 27606
	[0000][ M::kt_pipeline] read 27606 sequences (6425706 bp)...
[0000] Calling mem_process_seqs.., task: 0
[0000] 1. Calling kt_for - worker_bwt
[0000] read_chunk: 640000000, work_chunk_size: 0, nseq: 0
[0000] 2. Calling kt_for - worker_aln
[0000] 3. Calling kt_for - worker_sam
	[0000][ M::mem_process_seqs] Processed 27606 reads in 18.046 CPU sec, 1.192 real sec
[0000] read_chunk: 640000000, work_chunk_size: 0, nseq: 0
[0000] Computation ends..
No. of OMP threads: 64
Processor is running @2594.380597 MHz
Runtime profile:

	Time taken for main_mem function: 77.82 sec

	IO times (sec) :
	Reading IO time (reads) avg: 0.11, (0.11, 0.11)
	Writing IO time (SAM) avg: 0.03, (0.03, 0.03)
	Reading IO time (Reference Genome) avg: 30.15, (30.15, 30.15)
	Index read time avg: 45.81, (45.81, 45.81)

	Overall time (sec) (Excluding Index reading time):
	PROCESS() (Total compute time + (read + SAM) IO time) : 1.36
	MEM_PROCESS_SEQ() (Total compute time (Kernel + SAM)), avg: 1.19, (1.19, 1.19)

	 SAM Processing time (sec):
	--WORKER_SAM avg: 0.19, (0.19, 0.19)

	Kernels' compute time (sec):
	Total kernel (smem+sal+bsw) time avg: 1.00, (1.00, 1.00)
		SMEM compute avg: 0.33, (0.69, 0.00)
		SAL compute avg: 0.08, (0.24, 0.00)
				MEM_SA avg: 0.05, (0.17, 0.00)

		BSW time, avg: 0.13, (0.27, 0.00)

Important parameter settings: 
	BATCH_SIZE: 512
	MAX_SEQ_LEN_REF: 256
	MAX_SEQ_LEN_QER: 128
	MAX_SEQ_LEN8: 128
	SEEDS_PER_READ: 500
	SIMD_WIDTH8 X: 64
	SIMD_WIDTH16 X: 32
	AVG_SEEDS_PER_READ: 64
[bam_sort_core] merging from 0 files and 64 in-memory blocks...

It looks like the algorithm is using all 64 cores, but there is no improvement in execution time.

Hi @fcasnun, you are using a custom reference genome, so BWA-MEM2 builds a genome index on every run before it starts mapping, and that takes time. You can add a built-in index instead. You need a data manager for this. With a built-in indexed genome, BWA-MEM2 starts directly with mapping.
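
The console output above supports this. Here is a rough breakdown, a sketch that uses only the figures logged there and assumes the total is just the indexing step plus the alignment step (the samtools sort time is not logged and is ignored). The function name `runtime_breakdown` is hypothetical.

```shell
#!/usr/bin/env bash
# Figures copied from the console output above (seconds).
index_build=3274.8116   # "Total time taken" reported by the bwa-mem2 index step
main_mem=77.82          # "Time taken for main_mem function"
process=1.36            # "PROCESS() (Total compute time + (read + SAM) IO time)"

# runtime_breakdown is a hypothetical helper: index seconds, alignment seconds,
# mapping compute seconds.
runtime_breakdown() {
    awk -v idx="$1" -v mem="$2" -v proc="$3" 'BEGIN {
        total = idx + mem   # assumption: only these two logged steps are counted
        printf "index build: %.1f min (%.1f%% of total)\n", idx / 60, 100 * idx / total
        printf "alignment step: %.2f s, of which mapping compute: %.2f s\n", mem, proc
    }'
}

runtime_breakdown "$index_build" "$main_mem" "$process"
```

Nearly all of the wall time is spent in `bwa-mem2 index`, which the command line invokes without any thread option, so raising GALAXY_SLOTS only affects the short mapping step.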

Hope that helps.

Kind regards,
Igor
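
Outside Galaxy, the same idea can be sketched on the command line: build the index once and reuse it on later runs. This is only an illustration under stated assumptions. The helper names `has_bwa_mem2_index` and `ensure_index` are hypothetical, and the check looks only for the two index files named in the console output above (`.bwt.2bit.64` and `.0123`); a complete index contains further files.

```shell
#!/usr/bin/env bash
# has_bwa_mem2_index: hypothetical helper that returns success when the two
# index files seen in the log above exist and are non-empty for this prefix.
has_bwa_mem2_index() {
    local prefix="$1"
    [ -s "${prefix}.bwt.2bit.64" ] && [ -s "${prefix}.0123" ]
}

# ensure_index: hypothetical helper that builds the index only if it is missing.
ensure_index() {
    local ref="$1"
    if has_bwa_mem2_index "$ref"; then
        echo "index found for $ref, skipping bwa-mem2 index"
    else
        echo "no index for $ref, building it once"
        bwa-mem2 index "$ref"
    fi
}
```

Calling `ensure_index /path/to/ref.fa` before `bwa-mem2 mem` would then pay the indexing cost only on the first run. In Galaxy itself, the built-in index created through a data manager plays this role.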

Hi @igor, thank you so much for the response. I didn't know that the first part of BWA-MEM2 could be “skipped” by using built-in genome indexes. I am now trying to create them in our local Galaxy. We have installed the BWA-MEM2 index builder tool, but no FASTA file is shown in its selector even though we have already imported two FASTA files (hg19 and hg38). We suspect this could be because we are using an SQLite database, so we will migrate to PostgreSQL and test again.

Thanks again

Hi @fcasnun,
It is a multi-step procedure. First, create a new dbkey and fetch the reference genome using Create DBKey and Reference Genome. Next, do the FASTA indexing. The data manager should see the new dbkey. After that, add the twoBit indexing, and finally create the aligner index.
There should be instructions on the web, but I cannot find them.
Kind regards,
Igor

Hi @igor

It seems that there is also an issue running BWA-MEM2 on Galaxy Europe. It was working perfectly fine last week, but now I’m encountering an error: “Fatal error: Exit code 127 ()”. Additionally, when running the Galaxy AI check, I receive an error related to samtools:

/data/jwd02f/main/081/194/81194794/tool_script.sh: line 27: samtools: command not found

Could you please look into this?
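
For context, exit status 127 is what a POSIX shell returns when a command cannot be found, which matches the `samtools: command not found` message. A minimal sketch of such a check follows; `require_tool` is a hypothetical helper name and not part of Galaxy.

```shell
#!/usr/bin/env bash
# require_tool: hypothetical helper that reports whether a command is on PATH
# and mirrors the shell's own "command not found" status (127) when it is not.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: command not found" >&2
        return 127
    fi
}
```

On a machine where you control the environment, `require_tool samtools` would show whether the tool is available. On a public server such as Galaxy Europe, the tool environment is managed by the administrators, so the fix has to come from their side.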

Hi @zschong,
yes, it seems there are issues with BWA-MEM2 in Europe. I reported the error to the server support.
Just in case you are not familiar with error reports: click on any output from a failed job, click the Error icon (the one that looks like a ladybird beetle), write any information about the failed job in the middle window (for example, if you are doing a tutorial, provide a URL), and hit the Report button.

In the meantime, try another aligner, for example BWA-MEM. It is functional in Europe.

Kind regards,
Igor

Hi @zschong.
BWA-MEM2 has been updated to Galaxy Version 2.2.1+galaxy3 in Europe. It works now.

Please, in the future start a new thread/post for any question unrelated to the original topic: it is easier for forum admins and more useful for other users.

Kind regards,
Igor

Hi @igor,

Thank you so much for the help. With the built-in index, execution times have improved drastically, from 55 minutes to 1 minute.

Kind regards,

Paco

Hi @fcasnun
Thank you for the update. I am glad it works for you.
Have a great day,
Igor