Working on a new job-runner (ARC middleware runner) - using dev-branch - job fails upon finishing

(Note: my previous post seems to have been deleted, so I am reposting.)

I am developing a new job runner on a test Galaxy server on AlmaLinux 9 with the dev branch.

The job runner works as expected in release_23 of Galaxy. I can submit a job via a tool (currently a custom tool) from Galaxy, the job runs as expected on the remote site, and once the job is done the output files are uploaded into the correct Galaxy folder.

The same is true for the dev version. However, when the job finishes there is an error, the job is marked as failed in Galaxy, and the uploaded output files are not accessible (even though they are present in the job folder in Galaxy):

[root@galaxy-arc-test arc_galaxy_runner]# ls -lhrt /storage/galaxy/data/jobs/000/2/
total 8.0K
drwxr-xr-x 3 galaxy galaxy  26 Aug 11 11:09 working
drwxr-xr-x 2 galaxy galaxy  44 Aug 11 11:09 outputs
-rw-r--r-- 1 galaxy galaxy 118 Aug 11 11:09 galaxy_2.o
-rw-r--r-- 1 galaxy galaxy  59 Aug 11 11:09 galaxy_2.e
[root@galaxy-arc-test arc_galaxy_runner]# ls -lhrt /storage/galaxy/data/jobs/000/2/outputs/
total 0
-rw-r--r-- 1 galaxy galaxy 0 Aug 11 11:09 tool_stdout
-rw-r--r-- 1 galaxy galaxy 0 Aug 11 11:09 tool_stderr
[root@galaxy-arc-test arc_galaxy_runner]# ls -lhrt /storage/galaxy/data/jobs/000/2/working/fa619b53ba31/
total 12K
-rw-r--r-- 1 galaxy galaxy   0 Aug 11 11:09 arc.err
-rw-r--r-- 1 galaxy galaxy   0 Aug 11 11:09 arc.out
-rw-r--r-- 1 galaxy galaxy  36 Aug 11 11:09 arcout1.txt
-rw-r--r-- 1 galaxy galaxy 158 Aug 11 11:09 runhello.sh
-rw-r--r-- 1 galaxy galaxy  36 Aug 11 11:09 arcout2.txt

Excerpt from the Galaxy log:

Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]: null DEBUG 2023-08-11 11:09:12,832 [pN:handler_1,p:28846,tN:ThreadPoolExecutor-2_2] ====== MAIKEN ====== method: GET url: https://arctestcluster-slurm-el8-arc7-ce1.cern-test.uiocloud.no/arex/rest/1.1/jobs/fa619b53ba31/session/arcout2.txt headers: {}
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]: galaxy.jobs ERROR 2023-08-11 11:09:12,832 [pN:handler_1,p:28846,tN:ArcJobRunner.work_thread-0] Parent instance <Job at 0x7f74d453f100> is not bound to a Session; lazy load operation of attribute 'output_datasets' cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]: Traceback (most recent call last):
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/server/lib/galaxy/jobs/runners/__init__.py", line 631, in _finish_or_resubmit_job
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     job_wrapper.finish(
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/server/lib/galaxy/jobs/__init__.py", line 1888, in finish
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     for dataset_path in self.job_io.get_output_fnames():
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/server/lib/galaxy/job_execution/setup.py", line 219, in get_output_fnames
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     return self.output_paths
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/server/lib/galaxy/job_execution/setup.py", line 169, in output_paths
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     self.compute_outputs()
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/server/lib/galaxy/job_execution/setup.py", line 245, in compute_outputs
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     for da in job.output_datasets + job.output_library_datasets:
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/venv/lib64/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 487, in __get__
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     return self.impl.get(state, dict_)
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/venv/lib64/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 959, in get
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     value = self._fire_loader_callables(state, key, passive)
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/venv/lib64/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 995, in _fire_loader_callables
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     return self.callable_(state, passive)
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/venv/lib64/python3.9/site-packages/sqlalchemy/orm/strategies.py", line 863, in _load_for_state
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     raise orm_exc.DetachedInstanceError(
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]: sqlalchemy.orm.exc.DetachedInstanceError: Parent instance <Job at 0x7f74d453f100> is not bound to a Session; lazy load operation of attribute 'output_datasets' cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]: During handling of the above exception, another exception occurred:
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]: Traceback (most recent call last):
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/server/lib/galaxy/jobs/__init__.py", line 1417, in fail
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     for dataset_path in self.job_io.get_output_fnames():
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/server/lib/galaxy/job_execution/setup.py", line 219, in get_output_fnames
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     return self.output_paths
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/server/lib/galaxy/job_execution/setup.py", line 169, in output_paths
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     self.compute_outputs()
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/server/lib/galaxy/job_execution/setup.py", line 245, in compute_outputs
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     for da in job.output_datasets + job.output_library_datasets:
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/venv/lib64/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 487, in __get__
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     return self.impl.get(state, dict_)
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/venv/lib64/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 959, in get
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     value = self._fire_loader_callables(state, key, passive)
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/venv/lib64/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 995, in _fire_loader_callables
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     return self.callable_(state, passive)
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:   File "/storage/srv/galaxy/venv/lib64/python3.9/site-packages/sqlalchemy/orm/strategies.py", line 863, in _load_for_state
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]:     raise orm_exc.DetachedInstanceError(
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]: sqlalchemy.orm.exc.DetachedInstanceError: Parent instance <Job at 0x7f74d453f100> is not bound to a Session; lazy load operation of attribute 'output_datasets' cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]: null DEBUG 2023-08-11 11:09:12,832 [pN:handler_1,p:28846,tN:ThreadPoolExecutor-2_2] ====== MAIKEN ====== GET https://arctestcluster-slurm-el8-arc7-ce1.cern-test.uiocloud.no/arex/rest/1.1/jobs/fa619b53ba31/session/arcout2.txt headers={}
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]: null DEBUG 2023-08-11 11:09:12,840 [pN:handler_1,p:28846,tN:ThreadPoolExecutor-2_1] Download https://arctestcluster-slurm-el8-arc7-ce1.cern-test.uiocloud.no/arex/rest/1.1/jobs/fa619b53ba31/session/runhello.sh to /storage/galaxy/data/jobs/000/2/working/fa619b53ba31/runhello.sh for job fa619b53ba31 successful
Aug 11 11:09:12 galaxy-arc-test.itf.uiocloud.no galaxyctl[28846]: null DEBUG 2023-08-11 11:09:12,840 [pN:handler_1,p:28846,tN:ThreadPoolExecutor-2_2] Download https://arctestcluster-slurm-el8-arc7-ce1.cern-test.uiocloud.no/arex/rest/1.1/jobs/fa619b53ba31/session/arcout2.txt to /storage/galaxy/data/jobs/000/2/working/fa619b53ba31/arcout2.txt for job fa619b53ba31 successful

Fuller galaxy log: galaxy_log - Google Drive

The very crude demo-version of the ARC runner can be found here: arc_galaxy_runner.tgz - Google Drive

Any tips as to what is going on and how to fix it?
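
For reference, the SQLAlchemy error class in the traceback can be reproduced outside Galaxy. The sketch below is not Galaxy code and does not claim to show the cause in my runner: `Job` and `Output` are hypothetical stand-in models, and the in-memory SQLite database is only there to make it self-contained. It shows an object loaded in one session, that session being closed, and a lazy-loaded relationship then being accessed on the detached object, which raises `DetachedInstanceError`. Fetching the object again through a live session makes the same attribute load normally.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship
from sqlalchemy.orm.exc import DetachedInstanceError

Base = declarative_base()


class Job(Base):  # hypothetical stand-in, NOT Galaxy's Job model
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    # Relationships are lazy-loaded by default: the query runs on first access.
    output_datasets = relationship("Output")


class Output(Base):  # hypothetical stand-in for an output association
    __tablename__ = "output"
    id = Column(Integer, primary_key=True)
    job_id = Column(Integer, ForeignKey("job.id"))


engine = create_engine("sqlite://")  # in-memory database, assumption for the demo
Base.metadata.create_all(engine)

# Store one job with one output.
with Session(engine) as session:
    session.add(Job(id=1, output_datasets=[Output(id=1)]))
    session.commit()

# Load the job in one session, then let that session close.
with Session(engine) as session:
    job = session.get(Job, 1)
# `job` is now detached, and `output_datasets` was never loaded.

try:
    job.output_datasets  # lazy load needs a session, so this raises
    detached_error = None
except DetachedInstanceError as exc:
    detached_error = exc

# Fetching the object again through a live session lets the lazy load proceed.
with Session(engine) as session:
    fresh_job = session.get(Job, 1)
    n_outputs = len(fresh_job.output_datasets)

print(type(detached_error).__name__, n_outputs)
```

If this is what is happening in the dev branch, the question would be which session the `Job` object held by the job wrapper belongs to when my runner's work thread calls `finish`, but that is a guess on my part.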

Hi @maikenp1

This will need input from the developers, so maybe ask in a chat? That could be the Systems or Backend working group. Try one of them and they will redirect you if needed. You can include a link to this topic for context. See → Galaxy Working Groups and Projects - Galaxy Community Hub

Ps: The original post was probably auto-removed due to the code or “typing” too fast :slight_smile:
