I am also attaching the Galaxy console output:
Looking to launch executable "/shared/Galaxy/database/dependencies/_conda/envs/mulled-v1-88bfe9d3fb5d8ab3673a5b08b613f2c0d466656f329fd172728c59fa3917261d/bin/bwa-mem2.avx512bw", simd = .avx512bw
Launching executable "/shared/Galaxy/database/dependencies/_conda/envs/mulled-v1-88bfe9d3fb5d8ab3673a5b08b613f2c0d466656f329fd172728c59fa3917261d/bin/bwa-mem2.avx512bw"
[bwa_index] Pack FASTA... 15.00 sec
* Entering FMI_search
init ticks = 290784968664
ref seq len = 6418572210
binary seq ticks = 154991346064
build suffix-array ticks = 6763910118833
pos: 802321527, ref_seq_len__: 802321526
build fm-index ticks = 1228787853170
Total time taken: 3274.8116
Looking to launch executable "/shared/Galaxy/database/dependencies/_conda/envs/mulled-v1-88bfe9d3fb5d8ab3673a5b08b613f2c0d466656f329fd172728c59fa3917261d/bin/bwa-mem2.avx512bw", simd = .avx512bw
Launching executable "/shared/Galaxy/database/dependencies/_conda/envs/mulled-v1-88bfe9d3fb5d8ab3673a5b08b613f2c0d466656f329fd172728c59fa3917261d/bin/bwa-mem2.avx512bw"
-----------------------------
Executing in AVX512 mode!!
-----------------------------
* SA compression enabled with xfactor: 8
* Ref file: localref.fa
* Entering FMI_search
* Index file found. Loading index from localref.fa.bwt.2bit.64
* Reference seq len for bi-index = 6418572211
* sentinel-index: 2729492284
* Count:
0, 1
1, 1879238230
2, 3209286106
3, 4539333982
4, 6418572211
* Reading other elements of the index from files localref.fa
* Index prefix: localref.fa
* Read 0 ALT contigs
* Done reading Index!!
* Reading reference genome..
* Binary seq file = localref.fa.0123
* Reference genome size: 6418572210 bp
* Done reading reference genome !!
------------------------------------------
1. Memory pre-allocation for Chaining: 8917.6357 MB
2. Memory pre-allocation for BSW: 15335.4895 MB
3. Memory pre-allocation for BWT: 4948.1073 MB
------------------------------------------
* Threads used (compute): 64
* No. of pipeline threads: 2
[0000] read_chunk: 640000000, work_chunk_size: 6425706, nseq: 27606
[0000][ M::kt_pipeline] read 27606 sequences (6425706 bp)...
[0000] Calling mem_process_seqs.., task: 0
[0000] 1. Calling kt_for - worker_bwt
[0000] read_chunk: 640000000, work_chunk_size: 0, nseq: 0
[0000] 2. Calling kt_for - worker_aln
[0000] 3. Calling kt_for - worker_sam
[0000][ M::mem_process_seqs] Processed 27606 reads in 18.046 CPU sec, 1.192 real sec
[0000] read_chunk: 640000000, work_chunk_size: 0, nseq: 0
[0000] Computation ends..
No. of OMP threads: 64
Processor is running @2594.380597 MHz
Runtime profile:
Time taken for main_mem function: 77.82 sec
IO times (sec) :
Reading IO time (reads) avg: 0.11, (0.11, 0.11)
Writing IO time (SAM) avg: 0.03, (0.03, 0.03)
Reading IO time (Reference Genome) avg: 30.15, (30.15, 30.15)
Index read time avg: 45.81, (45.81, 45.81)
Overall time (sec) (Excluding Index reading time):
PROCESS() (Total compute time + (read + SAM) IO time) : 1.36
MEM_PROCESS_SEQ() (Total compute time (Kernel + SAM)), avg: 1.19, (1.19, 1.19)
SAM Processing time (sec):
--WORKER_SAM avg: 0.19, (0.19, 0.19)
Kernels' compute time (sec):
Total kernel (smem+sal+bsw) time avg: 1.00, (1.00, 1.00)
SMEM compute avg: 0.33, (0.69, 0.00)
SAL compute avg: 0.08, (0.24, 0.00)
MEM_SA avg: 0.05, (0.17, 0.00)
BSW time, avg: 0.13, (0.27, 0.00)
Important parameter settings:
BATCH_SIZE: 512
MAX_SEQ_LEN_REF: 256
MAX_SEQ_LEN_QER: 128
MAX_SEQ_LEN8: 128
SEEDS_PER_READ: 500
SIMD_WIDTH8 X: 64
SIMD_WIDTH16 X: 32
AVG_SEEDS_PER_READ: 64
[bam_sort_core] merging from 0 files and 64 in-memory blocks...
It looks like the algorithm is using all 64 cores, but the run shows no improvement.
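To put the runtime profile in the log into proportion, here is a minimal Python sketch that splits the reported wall time into loading and compute. All numbers are copied from the log above. The variable names are hypothetical, and the reading of the CPU/real ratio as "average busy cores" is an assumption rather than something bwa-mem2 reports.

```python
# Values copied from the console output above (all in seconds).
total_main_mem = 77.82   # "Time taken for main_mem function"
index_read = 45.81       # "Index read time avg"
reference_read = 30.15   # "Reading IO time (Reference Genome) avg"
compute = 1.19           # "MEM_PROCESS_SEQ() ... avg"
cpu_sec = 18.046         # "Processed 27606 reads in 18.046 CPU sec"
real_sec = 1.192         # "... 1.192 real sec"

# Time spent loading the index and the reference before any read is aligned.
loading = index_read + reference_read
loading_share = loading / total_main_mem
compute_share = compute / total_main_mem

# Assumption: CPU time divided by wall time approximates how many cores
# were busy on average during the alignment step.
effective_parallelism = cpu_sec / real_sec

print(f"loading: {loading:.2f} s ({loading_share:.1%} of main_mem)")
print(f"compute: {compute:.2f} s ({compute_share:.1%} of main_mem)")
print(f"average busy cores during compute: {effective_parallelism:.1f}")
```

With these figures, loading accounts for roughly 98% of the run and the alignment itself for under 2%, so for this small input (27606 reads) adding cores can only shorten a very small part of the total. Whether the loading steps themselves scale with the thread count is not something the log shows.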