Hi everyone,
I'm trying to use Slurm for our local Galaxy, but all jobs fail immediately when I submit them through Slurm. When I switch back to the Singularity (local runner) destination, everything works again. I can also run jobs on Slurm directly with sbatch, but when I start a job in the Galaxy UI it turns yellow and then red with this error: "This job failed for reasons that could not be determined." I'll attach all logs that might be useful; I'd be very grateful if someone could have a look. My suspicion is that the problem comes from running Slurm on the same machine as Galaxy.
Is Slurm needed or recommended if I want jobs to run with more than one core? I assumed Slurm would be the most efficient option because I could combine it with TPV, so I don't have to work out myself how many resources each tool needs.
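For reference, this is roughly what I mean by jobs working when I submit them directly, outside of Galaxy (just a trivial test; the partition name is the one from my slurm.conf below):
#trivial test submitted with sbatch - jobs like this complete fine
sbatch --partition=debug --wrap "hostname; sleep 5"
squeue
scontrol show job <jobid>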
------------slurm.log--------------------------
#log while running job
[2023-07-04T16:46:05.716] Launching batch job 24 for UID 999
[2023-07-04T16:46:10.257] [24.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[2023-07-04T16:46:10.259] [24.batch] done with job
------------slurmctld.log--------------------------
#Log from job run
[2023-07-04T16:46:05.504] _slurm_rpc_submit_batch_job: JobId=24 InitPrio=4294901754 usec=545
[2023-07-04T16:46:05.713] sched: Allocate JobId=24 NodeList=galaxyworkstation #CPUs=1 Partition=debug
[2023-07-04T16:46:10.258] _job_complete: JobId=24 WEXITSTATUS 1
[2023-07-04T16:46:10.258] _job_complete: JobId=24 done
#Log when starting slurm
[2023-07-04T17:04:57.311] error: Node galaxyworkstation appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2023-07-04T17:04:57.698] Terminate signal (SIGINT or SIGTERM) received
[2023-07-04T17:04:57.795] Saving all slurm state
[2023-07-04T17:04:58.153] error: Configured MailProg is invalid
[2023-07-04T17:04:58.154] slurmctld version 21.08.5 started on cluster cluster
[2023-07-04T17:04:58.157] No memory enforcing mechanism configured.
[2023-07-04T17:04:58.185] Recovered state of 1 nodes
[2023-07-04T17:04:58.185] Recovered information about 0 jobs
[2023-07-04T17:04:58.185] select/cons_res: part_data_create_array: select/cons_res: preparing for 1 partitions
[2023-07-04T17:04:58.186] Recovered state of 0 reservations
[2023-07-04T17:04:58.186] read_slurm_conf: backup_controller not specified
[2023-07-04T17:04:58.186] select/cons_res: select_p_reconfigure: select/cons_res: reconfigure
[2023-07-04T17:04:58.186] select/cons_res: part_data_create_array: select/cons_res: preparing for 1 partitions
[2023-07-04T17:04:58.186] Running as primary controller
[2023-07-04T17:04:58.186] No parameter for mcs plugin, default values set
[2023-07-04T17:04:58.186] mcs: MCSParameters = (null). ondemand set.
[2023-07-04T17:05:03.191] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
------------slurm.conf--------------------------
# This file is maintained by Ansible - ALL MODIFICATIONS WILL BE REVERTED
# Default, define SlurmctldHost or ControlMachine to override
ControlMachine=localhost
# Configuration options
AuthType=auth/munge
ClusterName=cluster
CryptoType=crypto/munge
ProctrackType=proctrack/pgid
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdParameters=config_overrides
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmctld
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
# Nodes
NodeName=galaxyworkstation CPUs=32
# Partitions
PartitionName=debug Default=YES Nodes=galaxyworkstation State=UP
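About the "appears to have a different slurm.conf" error in slurmctld.log: everything runs on this one machine, so I assume slurmd and slurmctld should be reading the same file. This is how I would check and reload it (the /etc/slurm path is my assumption; it may be /etc/slurm-llnl or similar depending on the install):
#confirm there is only one slurm.conf and that both daemons picked up the latest version
ls -l /etc/slurm*/slurm.conf
md5sum /etc/slurm*/slurm.conf
sudo systemctl restart slurmctld slurmd
sudo scontrol reconfigure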
------------scontrol show job--------------------------
#scontrol show job 18
JobId=18 JobName=g67_secure_hash_message_digest_niklas_petzold_tum_de
UserId=galaxy_ans(999) GroupId=galaxy_ans(999) MCS_label=N/A
Priority=4294901759 Nice=0 Account=(null) QOS=(null)
JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
RunTime=00:00:04 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-06-30T16:07:07 EligibleTime=2023-06-30T16:07:07
AccrueTime=2023-06-30T16:07:07
StartTime=2023-06-30T16:07:08 EndTime=2023-06-30T16:07:12 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-30T16:07:08 Scheduler=Main
Partition=debug AllocNode:Sid=galaxyworkstation:22799
ReqNodeList=(null) ExcNodeList=(null)
NodeList=galaxyworkstation
BatchHost=galaxyworkstation
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=1,mem=1M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:: CoreSpec=*
MinCPUsNode=1 MinMemoryNode=1M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/media/Galaxy2022/data/jobs/000/67
StdErr=/media/Galaxy2022/data/jobs/000/67/galaxy_67.e
StdIn=/dev/null
StdOut=/media/Galaxy2022/data/jobs/000/67/galaxy_67.o
Power=
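Since Galaxy only reports "reasons that could not be determined", I assume the actual error should end up in the files listed above. This is what I would look at next (the StdErr/StdOut paths are from the scontrol output; the .ec exit-code file name is my guess, based on the galaxy_71.ec file visible in the journalctl section further down):
#stderr of the Slurm batch script itself
cat /media/Galaxy2022/data/jobs/000/67/galaxy_67.e
#stderr of the tool running inside the container, and the recorded exit code
cat /media/Galaxy2022/data/jobs/000/67/outputs/tool_stderr
cat /media/Galaxy2022/data/jobs/000/67/galaxy_67.ec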
-------------------groupvars-------------------
#Job_config
# Galaxy Job Configuration
galaxy_job_config:
  runners:
    local_runner:
      load: galaxy.jobs.runners.local:LocalJobRunner
      workers: 4
    slurm:
      load: galaxy.jobs.runners.slurm:SlurmJobRunner
      drmaa_library_path: /usr/lib/slurm-drmaa/lib/libdrmaa.so.1
  handling:
    assign: ['db-skip-locked']
  execution:
    #default: local_env
    #default: singularity
    default: slurm
    environments:
      local_env:
        runner: local_runner
        tmp_dir: true
      slurm:
        runner: slurm
        singularity_enabled: true
        env:
        - name: LC_ALL
          value: C
        - name: SINGULARITY_CACHEDIR
          value: /tmp/singularity
        - name: APPTAINER_TMPDIR
          value: /tmp
      singularity:
        runner: local_runner
        singularity_enabled: true
        env:
        # Ensuring a consistent collation environment is good for reproducibility.
        - name: LC_ALL
          value: C
        # The cache directory holds the docker containers that get converted
        - name: APPTAINER_CACHEDIR
          value: /tmp/singularity
        # Apptainer uses a temporary directory to build the squashfs filesystem
        - name: APPTAINER_TMPDIR
          value: /tmp
  tools:
  - class: local # these special tools that aren't parameterized for remote execution - expression tools, upload, etc
    environment: local_env
#Slurm configuration
# Slurm
slurm_roles: ['controller', 'exec'] # Which roles should the machine play? exec are execution hosts.
slurm_nodes:
- name: galaxyworkstation # Name of our host
  CPUs: 32 # Here you would need to figure out how many cores your machine has. For this training we will use 2 but in real life, look at htop or similar.
  Sockets: 2
  CoresPerSocket: 8
  ThreadsPerCore: 2
slurm_config:
  SlurmdParameters: config_overrides # Ignore errors if the host actually has cores != 2
  SelectType: select/cons_res
  SelectTypeParameters: CR_CPU_Memory # Allocate individual cores/memory instead of entire node
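As a sanity check of the paths this configuration points at, run as the galaxy_ans user that owns the jobs (UID 999 in the logs) - just a sketch of what I would verify, not something from the tutorial:
#the DRMAA library the slurm runner loads
ls -l /usr/lib/slurm-drmaa/lib/libdrmaa.so.1
#cache and tmp directories used by the slurm/singularity destinations
ls -ld /tmp/singularity /tmp
#can the job user run singularity at all?
sudo -u galaxy_ans singularity --version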
------------journalctl while running job--------------------------
#Journalctl when running job
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools INFO 2023-07-04 16:46:04,383 [pN:main.2,p:13025,tN:WSGI_0] Validated and populated state for tool request (15.737 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools.actions INFO 2023-07-04 16:46:04,398 [pN:main.2,p:13025,tN:WSGI_0] Handled output named out_file1 for tool secure_hash_message_digest (2.066 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools.actions INFO 2023-07-04 16:46:04,414 [pN:main.2,p:13025,tN:WSGI_0] Added output datasets to history (15.295 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools.actions INFO 2023-07-04 16:46:04,417 [pN:main.2,p:13025,tN:WSGI_0] Setup for job Job[unflushed,tool_id=secure_hash_message_digest] complete, ready to be enqueued (2.917 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools.execute DEBUG 2023-07-04 16:46:04,417 [pN:main.2,p:13025,tN:WSGI_0] Tool secure_hash_message_digest created job None (29.101 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.web_stack.handlers INFO 2023-07-04 16:46:04,464 [pN:main.2,p:13025,tN:WSGI_0] (Job[id=71,tool_id=secure_hash_message_digest]) Handler 'default' assigned using 'db-skip-locked' assignment method
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools.execute DEBUG 2023-07-04 16:46:04,481 [pN:main.2,p:13025,tN:WSGI_0] Created 1 job(s) for tool secure_hash_message_digest request (97.784 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:04,499 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “POST /api/tools HTTP/1.0” 200
Jul 04 16:46:04 galaxyworkstation galaxyctl[13017]: uvicorn.access INFO 2023-07-04 16:46:04,539 [pN:main.1,p:13017,tN:MainThread] 141.39.152.83:0 - “GET /history/current_history_json?since=2023-07-04T14:45:27.511849 HTTP/1.0” 200
Jul 04 16:46:04 galaxyworkstation galaxyctl[13017]: uvicorn.access INFO 2023-07-04 16:46:04,607 [pN:main.1,p:13017,tN:MainThread] 141.39.152.83:0 - “GET /api/histories/7ccd42b23e9772e4/contents?v=dev&limit=1000&q=update_time-ge&qv=2023-07-04T14:45:30.441Z&details=ac07c226041e4298,efb7f17bc8d9ab0f,d166293c8b6f10d0 HTTP/1.0” 200
Jul 04 16:46:04 galaxyworkstation galaxyctl[13017]: uvicorn.access INFO 2023-07-04 16:46:04,674 [pN:main.1,p:13017,tN:MainThread] 141.39.152.83:0 - “GET /api/users/dba0b16b7fc2d9dd HTTP/1.0” 200
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:04,741 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /api/histories/7ccd42b23e9772e4/contents?v=dev&order=hid&offset=0&limit=100&q=deleted&qv=false&q=visible&qv=true HTTP/1.0” 200
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.handler DEBUG 2023-07-04 16:46:05,052 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] Grabbed Job(s): 71
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.mapper DEBUG 2023-07-04 16:46:05,130 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] (71) Mapped job to destination id: slurm
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.handler DEBUG 2023-07-04 16:46:05,154 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] (71) Dispatching to slurm runner
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs DEBUG 2023-07-04 16:46:05,218 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] (71) Persisting job destination (destination id: slurm)
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs DEBUG 2023-07-04 16:46:05,227 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] (71) Working directory for job is: /media/Galaxy2022/data/jobs/000/71
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners DEBUG 2023-07-04 16:46:05,260 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] Job [71] queued (104.998 ms)
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.handler INFO 2023-07-04 16:46:05,266 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] (71) Job dispatched
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs DEBUG 2023-07-04 16:46:05,392 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Job wrapper for Job [71] prepared (107.937 ms)
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.tool_util.deps.containers INFO 2023-07-04 16:46:05,393 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Checking with container resolver [ExplicitContainerResolver] found description [None]
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.tool_util.deps.containers INFO 2023-07-04 16:46:05,393 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Checking with container resolver [ExplicitSingularityContainerResolver] found description [None]
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.tool_util.deps.container_resolvers.mulled DEBUG 2023-07-04 16:46:05,393 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Image name for tool secure_hash_message_digest: python:3.7
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.tool_util.deps.containers INFO 2023-07-04 16:46:05,395 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Checking with container resolver [CachedMulledSingularityContainerResolver[cache_directory=/srv/galaxy_ans/var/container_cache/singularity/mulled]] found description [ContainerDescription[identifier=/srv/galaxy_ans/var/container_cache/singularity/mulled/python:3.7--1,type=singularity]]
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.command_factory INFO 2023-07-04 16:46:05,427 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Built script [/media/Galaxy2022/data/jobs/000/71/tool_script.sh] for tool command [python '/srv/galaxy_ans/server/tools/filters/secure_hash_message_digest.py' --input '/media/Galaxy2022/data/datasets/f/7/f/dataset_f7fbfb00-7fd5-49dd-985a-c246018f4e3f.dat' --output '/media/Galaxy2022/data/jobs/000/71/outputs/galaxy_dataset_0f3d2303-4e23-4ec1-a255-9b77d77ecee1.dat' --algorithm "md5"]
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners DEBUG 2023-07-04 16:46:05,493 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] (71) command is: mkdir -p working outputs configs
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: if [ -d _working ]; then
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: rm -rf working/ outputs/ configs/; cp -R _working working; cp -R _outputs outputs; cp -R _configs configs
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: else
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: cp -R working _working; cp -R outputs _outputs; cp -R configs _configs
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: fi
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: cd working; SINGULARITYENV_GALAXY_SLOTS=$GALAXY_SLOTS SINGULARITYENV_GALAXY_MEMORY_MB=$GALAXY_MEMORY_MB SINGULARITYENV_GALAXY_MEMORY_MB_PER_SLOT=$GALAXY_MEMORY_MB_PER_SLOT SINGULARITYENV_HOME=$HOME SINGULARITYENV__GALAXY_JOB_HOME_DIR=$_GALAXY_JOB_HOME_DIR SINGULARITYENV__GALAXY_JOB_TMP_DIR=$_GALAXY_JOB_TMP_DIR SINGULARITYENV_TMPDIR=$TMPDIR SINGULARITYENV_TMP=$TMP SINGULARITYENV_TEMP=$TEMP singularity -s exec --cleanenv -B /srv/galaxy_ans/server:/srv/galaxy_ans/server -B /srv/galaxy_ans/server/tools/filters:/srv/galaxy_ans/server/tools/filters -B /media/Galaxy2022/data/jobs/000/71:/media/Galaxy2022/data/jobs/000/71 -B /media/Galaxy2022/data/jobs/000/71/outputs:/media/Galaxy2022/data/jobs/000/71/outputs -B /media/Galaxy2022/data/jobs/000/71/configs:/media/Galaxy2022/data/jobs/000/71/configs -B "$_GALAXY_JOB_TMP_DIR:$_GALAXY_JOB_TMP_DIR" -B "$TMPDIR:$TMPDIR" -B "$TMP:$TMP" -B "$TEMP:$TEMP" -B "$_GALAXY_JOB_HOME_DIR:$_GALAXY_JOB_HOME_DIR" -B /media/Galaxy2022/data/jobs/000/71/working:/media/Galaxy2022/data/jobs/000/71/working -B /media/Galaxy2022/data/datasets:/media/Galaxy2022/data/datasets -B /srv/galaxy_ans/var/tool_data:/srv/galaxy_ans/var/tool_data -B /srv/galaxy_ans/var/tool_data:/srv/galaxy_ans/var/tool_data --home $HOME:$HOME /srv/galaxy_ans/var/container_cache/singularity/mulled/python:3.7--1 /bin/bash /media/Galaxy2022/data/jobs/000/71/tool_script.sh > '../outputs/tool_stdout' 2> '../outputs/tool_stderr'; return_code=$?; echo $return_code > /media/Galaxy2022/data/jobs/000/71/galaxy_71.ec; cd '/media/Galaxy2022/data/jobs/000/71';
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: [ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True; python metadata/set.py; sh -c "exit $return_code"
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners.drmaa DEBUG 2023-07-04 16:46:05,501 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] (71) submitting file /media/Galaxy2022/data/jobs/000/71/galaxy_71.sh
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners.drmaa INFO 2023-07-04 16:46:05,505 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] (71) queued as 24
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners.drmaa DEBUG 2023-07-04 16:46:05,801 [pN:handler_0,p:11617,tN:SlurmRunner.monitor_thread] (71/24) state change: job is running
Jul 04 16:46:07 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:07,796 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /history/current_history_json?since=2023-07-04T14:46:04.427855 HTTP/1.0” 200
Jul 04 16:46:07 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:07,846 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /api/histories/7ccd42b23e9772e4/contents?v=dev&limit=1000&q=update_time-ge&qv=2023-07-04T14:46:04.555Z&details=ac07c226041e4298,efb7f17bc8d9ab0f,d166293c8b6f10d0 HTTP/1.0” 200
Jul 04 16:46:07 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:07,908 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /api/users/dba0b16b7fc2d9dd HTTP/1.0” 200
Jul 04 16:46:07 galaxyworkstation galaxyctl[13017]: uvicorn.access INFO 2023-07-04 16:46:07,979 [pN:main.1,p:13017,tN:MainThread] 141.39.152.83:0 - “GET /api/histories/7ccd42b23e9772e4/contents?v=dev&order=hid&offset=0&limit=100&q=deleted&qv=false&q=visible&qv=true HTTP/1.0” 200
Jul 04 16:46:08 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:08,353 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /api/entry_points?running=true HTTP/1.0” 200
Jul 04 16:46:10 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:10,938 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /history/current_history_json?since=2023-07-04T14:46:05.984307 HTTP/1.0” 200
Jul 04 16:46:11 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners.drmaa DEBUG 2023-07-04 16:46:11,110 [pN:handler_0,p:11617,tN:SlurmRunner.monitor_thread] (71/24) state change: job finished, but failed
Jul 04 16:46:11 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners.slurm WARNING 2023-07-04 16:46:11,125 [pN:handler_0,p:11617,tN:SlurmRunner.monitor_thread] (71/24) Job failed due to unknown reasons, job state in SLURM was: FAILED
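If it helps, my next idea was to resubmit the exact script Galaxy generated for job 71 outside of Galaxy to hopefully see the real error (the script path is taken from the drmaa log line above; the sbatch options are my own guess):
#resubmit the script Galaxy built for job 71 and capture its output
cd /media/Galaxy2022/data/jobs/000/71
sudo -u galaxy_ans sbatch --partition=debug --output=manual.out --error=manual.err galaxy_71.sh
#then inspect manual.err (and galaxy_71.e) once it finishes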