Hi everyone,
I'm trying to use Slurm for our local Galaxy, but all jobs fail immediately when I submit them through Slurm. When I switch back to the Singularity (local runner) destination, everything works again. I can also run jobs on Slurm directly with sbatch, but when I start a job in the Galaxy UI it turns yellow and then red with this error: "This job failed for reasons that could not be determined." I'll attach all logs that might be useful; I'd be very grateful if someone could have a look. My suspicion is that the problem comes from running Slurm on the same machine as Galaxy.
Is Slurm needed or recommended if I want jobs to run with more than one core? I assumed Slurm would be the most efficient option because I could combine it with TPV, so I don't have to work out myself how many resources each tool needs.
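For reference, this is roughly what I mean by jobs working when I submit them directly, outside of Galaxy (just a trivial test; the partition name is the one from my slurm.conf below):
#trivial test submitted with sbatch - jobs like this complete fine
sbatch --partition=debug --wrap "hostname; sleep 5"
squeue
scontrol show job <jobid>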
------------slurm.log--------------------------
#log while running job
[2023-07-04T16:46:05.716] Launching batch job 24 for UID 999
[2023-07-04T16:46:10.257] [24.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[2023-07-04T16:46:10.259] [24.batch] done with job
------------slurmctld.log--------------------------
#Log from job run
[2023-07-04T16:46:05.504] _slurm_rpc_submit_batch_job: JobId=24 InitPrio=4294901754 usec=545
[2023-07-04T16:46:05.713] sched: Allocate JobId=24 NodeList=galaxyworkstation #CPUs=1 Partition=debug
[2023-07-04T16:46:10.258] _job_complete: JobId=24 WEXITSTATUS 1
[2023-07-04T16:46:10.258] _job_complete: JobId=24 done
#Log when starting slurm
[2023-07-04T17:04:57.311] error: Node galaxyworkstation appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2023-07-04T17:04:57.698] Terminate signal (SIGINT or SIGTERM) received
[2023-07-04T17:04:57.795] Saving all slurm state
[2023-07-04T17:04:58.153] error: Configured MailProg is invalid
[2023-07-04T17:04:58.154] slurmctld version 21.08.5 started on cluster cluster
[2023-07-04T17:04:58.157] No memory enforcing mechanism configured.
[2023-07-04T17:04:58.185] Recovered state of 1 nodes
[2023-07-04T17:04:58.185] Recovered information about 0 jobs
[2023-07-04T17:04:58.185] select/cons_res: part_data_create_array: select/cons_res: preparing for 1 partitions
[2023-07-04T17:04:58.186] Recovered state of 0 reservations
[2023-07-04T17:04:58.186] read_slurm_conf: backup_controller not specified
[2023-07-04T17:04:58.186] select/cons_res: select_p_reconfigure: select/cons_res: reconfigure
[2023-07-04T17:04:58.186] select/cons_res: part_data_create_array: select/cons_res: preparing for 1 partitions
[2023-07-04T17:04:58.186] Running as primary controller
[2023-07-04T17:04:58.186] No parameter for mcs plugin, default values set
[2023-07-04T17:04:58.186] mcs: MCSParameters = (null). ondemand set.
[2023-07-04T17:05:03.191] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
------------slurm.conf--------------------------
# This file is maintained by Ansible - ALL MODIFICATIONS WILL BE REVERTED
# Default, define SlurmctldHost or ControlMachine to override
ControlMachine=localhost
# Configuration options
AuthType=auth/munge
ClusterName=cluster
CryptoType=crypto/munge
ProctrackType=proctrack/pgid
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdParameters=config_overrides
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmctld
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
# Nodes
NodeName=galaxyworkstation CPUs=32
# Partitions
PartitionName=debug Default=YES Nodes=galaxyworkstation State=UP
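About the "appears to have a different slurm.conf" error in slurmctld.log: everything runs on this one machine, so I assume slurmd and slurmctld should be reading the same file. This is how I would check and reload it (the /etc/slurm path is my assumption; it may be /etc/slurm-llnl or similar depending on the install):
#confirm there is only one slurm.conf and that both daemons picked up the latest version
ls -l /etc/slurm*/slurm.conf
md5sum /etc/slurm*/slurm.conf
sudo systemctl restart slurmctld slurmd
sudo scontrol reconfigure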
------------scontrol show job--------------------------
#scontrol show job 18
JobId=18 JobName=g67_secure_hash_message_digest_niklas_petzold_tum_de
UserId=galaxy_ans(999) GroupId=galaxy_ans(999) MCS_label=N/A
Priority=4294901759 Nice=0 Account=(null) QOS=(null)
JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
RunTime=00:00:04 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-06-30T16:07:07 EligibleTime=2023-06-30T16:07:07
AccrueTime=2023-06-30T16:07:07
StartTime=2023-06-30T16:07:08 EndTime=2023-06-30T16:07:12 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-30T16:07:08 Scheduler=Main
Partition=debug AllocNode:Sid=galaxyworkstation:22799
ReqNodeList=(null) ExcNodeList=(null)
NodeList=galaxyworkstation
BatchHost=galaxyworkstation
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=1,mem=1M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:: CoreSpec=*
MinCPUsNode=1 MinMemoryNode=1M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/media/Galaxy2022/data/jobs/000/67
StdErr=/media/Galaxy2022/data/jobs/000/67/galaxy_67.e
StdIn=/dev/null
StdOut=/media/Galaxy2022/data/jobs/000/67/galaxy_67.o
Power=
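Since Galaxy only reports "reasons that could not be determined", I assume the actual error should end up in the files listed above. This is what I would look at next (the StdErr/StdOut paths are from the scontrol output; the .ec exit-code file name is my guess, based on the galaxy_71.ec file visible in the journalctl section further down):
#stderr of the Slurm batch script itself
cat /media/Galaxy2022/data/jobs/000/67/galaxy_67.e
#stderr of the tool running inside the container, and the recorded exit code
cat /media/Galaxy2022/data/jobs/000/67/outputs/tool_stderr
cat /media/Galaxy2022/data/jobs/000/67/galaxy_67.ec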
-------------------groupvars-------------------
#Job_config
# Galaxy Job Configuration
galaxy_job_config:
  runners:
    local_runner:
      load: galaxy.jobs.runners.local:LocalJobRunner
      workers: 4
    slurm:
      load: galaxy.jobs.runners.slurm:SlurmJobRunner
      drmaa_library_path: /usr/lib/slurm-drmaa/lib/libdrmaa.so.1
  handling:
    assign: ['db-skip-locked']
  execution:
    #default: local_env
    #default: singularity
    default: slurm
    environments:
      local_env:
        runner: local_runner
        tmp_dir: true
      slurm:
        runner: slurm
        singularity_enabled: true
        env:
        - name: LC_ALL
          value: C
        - name: SINGULARITY_CACHEDIR
          value: /tmp/singularity
        - name: APPTAINER_TMPDIR
          value: /tmp
      singularity:
        runner: local_runner
        singularity_enabled: true
        env:
        # Ensuring a consistent collation environment is good for reproducibility.
        - name: LC_ALL
          value: C
        # The cache directory holds the docker containers that get converted
        - name: APPTAINER_CACHEDIR
          value: /tmp/singularity
        # Apptainer uses a temporary directory to build the squashfs filesystem
        - name: APPTAINER_TMPDIR
          value: /tmp
  tools:
  - class: local # these special tools that aren't parameterized for remote execution - expression tools, upload, etc
    environment: local_env
#Slurm configuration
# Slurm
slurm_roles: ['controller', 'exec'] # Which roles should the machine play? exec are execution hosts.
slurm_nodes:
- name: galaxyworkstation # Name of our host
  CPUs: 32 # Here you would need to figure out how many cores your machine has. For this training we will use 2 but in real life, look at htop or similar.
  Sockets: 2
  CoresPerSocket: 8
  ThreadsPerCore: 2
slurm_config:
  SlurmdParameters: config_overrides # Ignore errors if the host actually has cores != 2
  SelectType: select/cons_res
  SelectTypeParameters: CR_CPU_Memory # Allocate individual cores/memory instead of entire node
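As a sanity check of the paths this configuration points at, run as the galaxy_ans user that owns the jobs (UID 999 in the logs) - just a sketch of what I would verify, not something from the tutorial:
#the DRMAA library the slurm runner loads
ls -l /usr/lib/slurm-drmaa/lib/libdrmaa.so.1
#cache and tmp directories used by the slurm/singularity destinations
ls -ld /tmp/singularity /tmp
#can the job user run singularity at all?
sudo -u galaxy_ans singularity --version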
------------journalctl while running job--------------------------
#Journalctl when running job
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools INFO 2023-07-04 16:46:04,383 [pN:main.2,p:13025,tN:WSGI_0] Validated and populated state for tool request (15.737 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools.actions INFO 2023-07-04 16:46:04,398 [pN:main.2,p:13025,tN:WSGI_0] Handled output named out_file1 for tool secure_hash_message_digest (2.066 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools.actions INFO 2023-07-04 16:46:04,414 [pN:main.2,p:13025,tN:WSGI_0] Added output datasets to history (15.295 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools.actions INFO 2023-07-04 16:46:04,417 [pN:main.2,p:13025,tN:WSGI_0] Setup for job Job[unflushed,tool_id=secure_hash_message_digest] complete, ready to be enqueued (2.917 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools.execute DEBUG 2023-07-04 16:46:04,417 [pN:main.2,p:13025,tN:WSGI_0] Tool secure_hash_message_digest created job None (29.101 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.web_stack.handlers INFO 2023-07-04 16:46:04,464 [pN:main.2,p:13025,tN:WSGI_0] (Job[id=71,tool_id=secure_hash_message_digest]) Handler 'default' assigned using 'db-skip-locked' assignment method
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: galaxy.tools.execute DEBUG 2023-07-04 16:46:04,481 [pN:main.2,p:13025,tN:WSGI_0] Created 1 job(s) for tool secure_hash_message_digest request (97.784 ms)
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:04,499 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “POST /api/tools HTTP/1.0” 200
Jul 04 16:46:04 galaxyworkstation galaxyctl[13017]: uvicorn.access INFO 2023-07-04 16:46:04,539 [pN:main.1,p:13017,tN:MainThread] 141.39.152.83:0 - “GET /history/current_history_json?since=2023-07-04T14:45:27.511849 HTTP/1.0” 200
Jul 04 16:46:04 galaxyworkstation galaxyctl[13017]: uvicorn.access INFO 2023-07-04 16:46:04,607 [pN:main.1,p:13017,tN:MainThread] 141.39.152.83:0 - “GET /api/histories/7ccd42b23e9772e4/contents?v=dev&limit=1000&q=update_time-ge&qv=2023-07-04T14:45:30.441Z&details=ac07c226041e4298,efb7f17bc8d9ab0f,d166293c8b6f10d0 HTTP/1.0” 200
Jul 04 16:46:04 galaxyworkstation galaxyctl[13017]: uvicorn.access INFO 2023-07-04 16:46:04,674 [pN:main.1,p:13017,tN:MainThread] 141.39.152.83:0 - “GET /api/users/dba0b16b7fc2d9dd HTTP/1.0” 200
Jul 04 16:46:04 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:04,741 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /api/histories/7ccd42b23e9772e4/contents?v=dev&order=hid&offset=0&limit=100&q=deleted&qv=false&q=visible&qv=true HTTP/1.0” 200
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.handler DEBUG 2023-07-04 16:46:05,052 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] Grabbed Job(s): 71
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.mapper DEBUG 2023-07-04 16:46:05,130 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] (71) Mapped job to destination id: slurm
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.handler DEBUG 2023-07-04 16:46:05,154 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] (71) Dispatching to slurm runner
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs DEBUG 2023-07-04 16:46:05,218 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] (71) Persisting job destination (destination id: slurm)
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs DEBUG 2023-07-04 16:46:05,227 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] (71) Working directory for job is: /media/Galaxy2022/data/jobs/000/71
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners DEBUG 2023-07-04 16:46:05,260 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] Job [71] queued (104.998 ms)
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.handler INFO 2023-07-04 16:46:05,266 [pN:handler_0,p:11617,tN:JobHandlerQueue.monitor_thread] (71) Job dispatched
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs DEBUG 2023-07-04 16:46:05,392 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Job wrapper for Job [71] prepared (107.937 ms)
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.tool_util.deps.containers INFO 2023-07-04 16:46:05,393 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Checking with container resolver [ExplicitContainerResolver] found description [None]
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.tool_util.deps.containers INFO 2023-07-04 16:46:05,393 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Checking with container resolver [ExplicitSingularityContainerResolver] found description [None]
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.tool_util.deps.container_resolvers.mulled DEBUG 2023-07-04 16:46:05,393 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Image name for tool secure_hash_message_digest: python:3.7
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.tool_util.deps.containers INFO 2023-07-04 16:46:05,395 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Checking with container resolver [CachedMulledSingularityContainerResolver[cache_directory=/srv/galaxy_ans/var/container_cache/singularity/mulled]] found description [ContainerDescription[identifier=/srv/galaxy_ans/var/container_cache/singularity/mulled/python:3.7--1,type=singularity]]
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.command_factory INFO 2023-07-04 16:46:05,427 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] Built script [/media/Galaxy2022/data/jobs/000/71/tool_script.sh] for tool command [python '/srv/galaxy_ans/server/tools/filters/secure_hash_message_digest.py' --input '/media/Galaxy2022/data/datasets/f/7/f/dataset_f7fbfb00-7fd5-49dd-985a-c246018f4e3f.dat' --output '/media/Galaxy2022/data/jobs/000/71/outputs/galaxy_dataset_0f3d2303-4e23-4ec1-a255-9b77d77ecee1.dat' --algorithm "md5"]
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners DEBUG 2023-07-04 16:46:05,493 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] (71) command is: mkdir -p working outputs configs
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: if [ -d _working ]; then
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: rm -rf working/ outputs/ configs/; cp -R _working working; cp -R _outputs outputs; cp -R _configs configs
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: else
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: cp -R working _working; cp -R outputs _outputs; cp -R configs _configs
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: fi
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: cd working; SINGULARITYENV_GALAXY_SLOTS=$GALAXY_SLOTS SINGULARITYENV_GALAXY_MEMORY_MB=$GALAXY_MEMORY_MB SINGULARITYENV_GALAXY_MEMORY_MB_PER_SLOT=$GALAXY_MEMORY_MB_PER_SLOT SINGULARITYENV_HOME=$HOME SINGULARITYENV__GALAXY_JOB_HOME_DIR=$_GALAXY_JOB_HOME_DIR SINGULARITYENV__GALAXY_JOB_TMP_DIR=$_GALAXY_JOB_TMP_DIR SINGULARITYENV_TMPDIR=$TMPDIR SINGULARITYENV_TMP=$TMP SINGULARITYENV_TEMP=$TEMP singularity -s exec --cleanenv -B /srv/galaxy_ans/server:/srv/galaxy_ans/server -B /srv/galaxy_ans/server/tools/filters:/srv/galaxy_ans/server/tools/filters -B /media/Galaxy2022/data/jobs/000/71:/media/Galaxy2022/data/jobs/000/71 -B /media/Galaxy2022/data/jobs/000/71/outputs:/media/Galaxy2022/data/jobs/000/71/outputs -B /media/Galaxy2022/data/jobs/000/71/configs:/media/Galaxy2022/data/jobs/000/71/configs -B "$_GALAXY_JOB_TMP_DIR:$_GALAXY_JOB_TMP_DIR" -B "$TMPDIR:$TMPDIR" -B "$TMP:$TMP" -B "$TEMP:$TEMP" -B "$_GALAXY_JOB_HOME_DIR:$_GALAXY_JOB_HOME_DIR" -B /media/Galaxy2022/data/jobs/000/71/working:/media/Galaxy2022/data/jobs/000/71/working -B /media/Galaxy2022/data/datasets:/media/Galaxy2022/data/datasets -B /srv/galaxy_ans/var/tool_data:/srv/galaxy_ans/var/tool_data -B /srv/galaxy_ans/var/tool_data:/srv/galaxy_ans/var/tool_data --home $HOME:$HOME /srv/galaxy_ans/var/container_cache/singularity/mulled/python:3.7--1 /bin/bash /media/Galaxy2022/data/jobs/000/71/tool_script.sh > '../outputs/tool_stdout' 2> '../outputs/tool_stderr'; return_code=$?; echo $return_code > /media/Galaxy2022/data/jobs/000/71/galaxy_71.ec; cd '/media/Galaxy2022/data/jobs/000/71';
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: [ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True; python metadata/set.py; sh -c "exit $return_code"
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners.drmaa DEBUG 2023-07-04 16:46:05,501 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] (71) submitting file /media/Galaxy2022/data/jobs/000/71/galaxy_71.sh
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners.drmaa INFO 2023-07-04 16:46:05,505 [pN:handler_0,p:11617,tN:SlurmRunner.work_thread-2] (71) queued as 24
Jul 04 16:46:05 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners.drmaa DEBUG 2023-07-04 16:46:05,801 [pN:handler_0,p:11617,tN:SlurmRunner.monitor_thread] (71/24) state change: job is running
Jul 04 16:46:07 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:07,796 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /history/current_history_json?since=2023-07-04T14:46:04.427855 HTTP/1.0” 200
Jul 04 16:46:07 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:07,846 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /api/histories/7ccd42b23e9772e4/contents?v=dev&limit=1000&q=update_time-ge&qv=2023-07-04T14:46:04.555Z&details=ac07c226041e4298,efb7f17bc8d9ab0f,d166293c8b6f10d0 HTTP/1.0” 200
Jul 04 16:46:07 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:07,908 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /api/users/dba0b16b7fc2d9dd HTTP/1.0” 200
Jul 04 16:46:07 galaxyworkstation galaxyctl[13017]: uvicorn.access INFO 2023-07-04 16:46:07,979 [pN:main.1,p:13017,tN:MainThread] 141.39.152.83:0 - “GET /api/histories/7ccd42b23e9772e4/contents?v=dev&order=hid&offset=0&limit=100&q=deleted&qv=false&q=visible&qv=true HTTP/1.0” 200
Jul 04 16:46:08 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:08,353 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /api/entry_points?running=true HTTP/1.0” 200
Jul 04 16:46:10 galaxyworkstation galaxyctl[13025]: uvicorn.access INFO 2023-07-04 16:46:10,938 [pN:main.2,p:13025,tN:MainThread] 141.39.152.83:0 - “GET /history/current_history_json?since=2023-07-04T14:46:05.984307 HTTP/1.0” 200
Jul 04 16:46:11 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners.drmaa DEBUG 2023-07-04 16:46:11,110 [pN:handler_0,p:11617,tN:SlurmRunner.monitor_thread] (71/24) state change: job finished, but failed
Jul 04 16:46:11 galaxyworkstation galaxyctl[11617]: galaxy.jobs.runners.slurm WARNING 2023-07-04 16:46:11,125 [pN:handler_0,p:11617,tN:SlurmRunner.monitor_thread] (71/24) Job failed due to unknown reasons, job state in SLURM was: FAILED
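If it helps, my next idea was to resubmit the exact script Galaxy generated for job 71 outside of Galaxy to hopefully see the real error (the script path is taken from the drmaa log line above; the sbatch options are my own guess):
#resubmit the script Galaxy built for job 71 and capture its output
cd /media/Galaxy2022/data/jobs/000/71
sudo -u galaxy_ans sbatch --partition=debug --output=manual.out --error=manual.err galaxy_71.sh
#then inspect manual.err (and galaxy_71.e) once it finishes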