I managed to launch GVL with a SLURM cluster, and it seemed to be working, since I was able to run sinfo to list my master node.
However, when I add a worker node through CloudMan, I get the following error messages in the CloudMan console:
- 05:27:45 - Adding 1 on-demand instance(s)
- 05:30:05 - Instance ‘i-09db541d174b1bed6; 126.96.36.199; w2’ reported alive
- 05:30:30 - —> PROBLEM, running command ‘/usr/bin/scontrol reconfigure’ returned code ‘1’, the following stderr: 'scontrol: error: slurm_receive_msg: Zero Bytes were transmitted or received slurm_reconfigure error: Zero Bytes were transmitted or received ’ and stdout: ‘’
- 05:30:30 - Could not get a handle on job manager service to add node ‘i-09db541d174b1bed6; 188.8.131.52; w2’
- 05:30:30 - Waiting on worker instance ‘i-09db541d174b1bed6; 184.108.40.206; w2’ to configure itself.
- 05:30:35 - Slurm error: slurmctld not running; setting service state to Error
- 05:30:41 - Instance ‘i-09db541d174b1bed6; 220.127.116.11; w2’ ready
Back on the master node, sinfo now shows two nodes, although the master is listed in the drain state:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main*     up    infinite      1 drain master
main*     up    infinite      1 idle  w2
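In case it helps anyone reproduce the check, here is a minimal sketch of how the sinfo listing above could be scanned for nodes that are not schedulable. It only parses the text I pasted; `parse_sinfo` and `unschedulable` are hypothetical helper names, and the set of "healthy" states is my assumption, not something taken from CloudMan or the GVL.

```python
# Pasted sinfo output from the master node (whitespace-separated columns).
SINFO_OUTPUT = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main*     up    infinite      1 drain master
main*     up    infinite      1 idle  w2
"""


def parse_sinfo(text):
    """Return one dict per sinfo row, keyed by the header column names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def unschedulable(rows):
    """Return the NODELIST entries whose state is not idle, mix, or alloc.

    Assumption: these three short state codes are treated as healthy;
    trailing flag characters such as '*' are stripped before comparing.
    """
    healthy = {"idle", "mix", "alloc"}
    return [r["NODELIST"] for r in rows
            if r["STATE"].rstrip("*~#") not in healthy]


rows = parse_sinfo(SINFO_OUTPUT)
print(unschedulable(rows))
```

For the listing above this flags only the master, which matches the drain state shown by sinfo.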
However, my srun and sbatch jobs are executed on the master node instead of on the worker nodes. They run as regular bash tasks, and neither squeue nor smap shows any running tasks.
Does anybody know what is going on and how to fix this?