Connect local Galaxy instance to Grid Engine

Hi,

I am trying to install a local Galaxy instance and connect it to a Grid Engine cluster to run jobs. However, following the guidelines at Connecting to a Cluster — Galaxy Project 24.1.dev0 documentation for creating the configuration files, I cannot get the cluster to execute anything, and nothing shows up in the logs. If I submit a job to the Grid Engine cluster directly, it runs fine. Is there any extra documentation about Galaxy integration with Grid Engine?
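Before digging into Galaxy itself, a minimal DRMAA session test run outside Galaxy can rule out the library and the native flags. This is only a sketch: it assumes the drmaa Python package is installed in the virtualenv and that DRMAA_LIBRARY_PATH is exported to the libdrmaa.so path from the job_conf.xml below.

# Minimal DRMAA smoke test, run outside Galaxy. Assumes:
#   export DRMAA_LIBRARY_PATH=/sched/sge/sge-2011.11/lib/linux-x64/libdrmaa.so
import drmaa

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/bin/hostname"                 # trivial test job
    jt.nativeSpecification = "-w n -l slot_type=htc"   # same flags as job_conf.xml
    job_id = s.runJob(jt)
    print("submitted:", job_id)
    info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print("exit status:", info.exitStatus)
    s.deleteJobTemplate(jt)

If this works but Galaxy still submits nothing, the problem is more likely in the Galaxy-side configuration than in the DRMAA/SGE setup.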

These are the configuration files:

job_conf.xml

<?xml version="1.0"?>
<!-- A sample job config that describes all available options -->
<job_conf>
    <plugins>
        <!-- "workers" is the number of threads for the runner's work queue.
             The default from <plugins> is used if not defined for a <plugin>.
             For all asynchronous runners (i.e. everything other than
             LocalJobRunner), this is the number of threads available for
             starting and finishing jobs. For the LocalJobRunner, this is the
             number of concurrent jobs that Galaxy will run.
          -->
        <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner">
            <!-- Override the $DRMAA_LIBRARY_PATH environment variable -->
            <param id="drmaa_library_path">/sched/sge/sge-2011.11/lib/linux-x64/libdrmaa.so</param>
        </plugin>
    </plugins>
    <destinations default="htc">
        <!-- Destinations define details about remote resources and how jobs
             should be executed on those remote resources.
         -->
        <destination id="htc" runner="drmaa">
            <!-- SGE: skip submit-time validation (-w n) and request the htc slot type -->
            <param id="nativeSpecification">-w n -l slot_type=htc</param>
            <env file="/shared/Galaxy/.venv/bin/activate" />
        </destination>
        <destination id="mpi" runner="drmaa">
            <!-- SGE: skip submit-time validation and request 116 slots in the mpi parallel environment -->
            <param id="nativeSpecification">-w n -pe mpi 116</param>
            <env file="/shared/Galaxy/.venv/bin/activate" />
        </destination>
    </destinations>
</job_conf>
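For clarity: the nativeSpecification values above are just the flags that would follow qsub on the command line; -w n sets the job validation level to "none" at submission, and -l slot_type=htc / -pe mpi 116 request the corresponding resources.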

galaxy.yml


uwsgi:

  # Create a shared socket on port 80.
  shared-socket: :80

  # Serve HTTP on shared socket 0.
  http: =0

  # Drop privileges to this user/group after binding port 80.
  uid: azureuser
  gid: azureuser

  # Internal buffer size for request data; the default (4096) can be too
  # small for large request headers.
  buffer-size: 16384

  # Number of web worker processes.
  processes: 2

  # Number of threads per web worker.
  threads: 4

  # Number of threads for offloading work such as serving static content.
  offload-threads: 2

  # Mapping to serve static content.
  static-map: /static=static

  # Mapping to serve the favicon.
  static-map: /favicon.ico=static/favicon.ico

  # Allow serving certain assets out of `client`.  Most modern Galaxy
  # interfaces bundle all of this, but some older pages still serve
  # these via symlink, requiring this rule.
  static-safe: client/src/assets

  # Enable the master process manager. Disabled by default for maximum
  # compatibility with CTRL+C, but should be enabled for use with
  # --daemon and/or production deployments.
  master: true

  # Path to the application's Python virtual environment. If using Conda
  # for Galaxy's framework dependencies (not tools!), do not set this.
  virtualenv: /shared/Galaxy/.venv

  # Path to the application's Python library.
  pythonpath: lib

  # The entry point which returns the web application (e.g. Galaxy,
  # Reports, etc.) that you are loading.
  module: galaxy.webapps.galaxy.buildapp:uwsgi_app()

  # Mount the web application (e.g. Galaxy, Reports, etc.) at the given
  # URL prefix. Cannot be used together with 'module:' above.
  #mount: /galaxy=galaxy.webapps.galaxy.buildapp:uwsgi_app()

  # Make uWSGI rewrite PATH_INFO and SCRIPT_NAME according to
  # mount-points. Set this to true if a URL prefix is used.
  manage-script-name: false

  # It is usually a good idea to set this to ``true`` if processes is
  # greater than 1.
  thunder-lock: true

  # Cause uWSGI to respect the traditional behavior of dying on SIGTERM
  # (its default is to brutally reload workers)
  die-on-term: true

  # Cause uWSGI to gracefully reload workers and mules upon receipt of
  # SIGINT (its default is to brutally kill workers)
  hook-master-start: unix_signal:2 gracefully_kill_them_all

  # Cause uWSGI to gracefully reload workers and mules upon receipt of
  # SIGTERM (its default is to brutally kill workers)
  hook-master-start: unix_signal:15 gracefully_kill_them_all

  # Whether to call Python's post-fork hook in worker processes.
  py-call-osafterfork: false

  # Ensure application threads will run if `threads` is unset.
  enable-threads: true

  # File mode creation mask for the server's processes.
  umask: 022

galaxy:

  # Directory for temporary files.
  new_file_path: tmp

  # Directory where per-job files for cluster runners are stored; a
  # relative path is resolved against Galaxy's data directory.
  cluster_files_directory: sge

  # Comma-separated list of users granted admin rights (by Galaxy login).
  admin_users: azureuser

  # Script invoked to submit jobs to the DRM as the real user.
  drmaa_external_runjob_script: sudo -E /shared/Galaxy/galaxy_env/bin/python3 /shared/Galaxy/galaxy-app/scripts/drmaa_external_runner.py

  # Script invoked to change ownership of job directories to the real user.
  external_chown_script: sudo -E /shared/Galaxy/galaxy_env/bin/python3 /shared/Galaxy/galaxy-app/scripts/external_chown_script.py

  # How Galaxy derives the system user when running jobs as the real
  # user: user_email, username, or a common system user.
  real_system_username: username

Thanks and greetings

Hi @fcasnun

If you have solved this already, a post explaining what you did would be great!

And if not, I’ve cross-posted over to the Admin chat to see if they have any advice. Please feel free to join the chat 🙂 You're invited to talk on Matrix

I was trying different connection configurations based on Connecting to a Cluster — Galaxy Project 24.1.dev0 documentation, especially around the real_system_username variable, based on this paragraph extracted from the link above:

Limitations: The DRMAA runner does not work if Galaxy is configured to run jobs as the real user, because in this setting jobs are submitted with an external script, i.e. in an extra DRMAA session, and the session-based (Python) DRMAA library can only query jobs within the session in which it started them. Furthermore, the DRMAA job runner only distinguishes successful and failed jobs and ignores information about possible failure sources, e.g. runtime / memory violation, which could be used for job resubmission. Specialized job runners are available that are not affected by these limitations, e.g. the univa and slurm runners.

The real_system_username variable has the following options according to the documentation:

# When running DRMAA jobs as the Galaxy user
  # (https://docs.galaxyproject.org/en/master/admin/cluster.html#submitting-jobs-as-the-real-user)
  # Galaxy can extract the user name from the email address (actually
  # the local-part before the @) or the username which are both stored
  # in the Galaxy data base. The latter option is particularly useful
  # for installations that get the authentication from LDAP. Also,
  # Galaxy can accept the name of a common system user (eg.
  # galaxy_worker) who can run every job being submitted. This user
  # should not be the same user running the galaxy system. Possible
  # values are user_email (default), username or <common_system_user>
  #real_system_username: user_email

Reviewing the Galaxy logs, I found the following error with the username value in real_system_username:

galaxy.jobs.runners.drmaa DEBUG 2024-04-17 11:47:18,374 [pN:main.web.2,p:11516,w:2,m:0,tN:DRMAARunner.work_thread-2] (15) native specification is: -w n -l slot_type=htc
galaxy.model ERROR 2024-04-17 11:47:18,375 [pN:main.web.2,p:11516,w:2,m:0,tN:DRMAARunner.work_thread-2] Error getting the password database entry for user fcasnun
Traceback (most recent call last):
  File "lib/galaxy/model/__init__.py", line 585, in system_user_pwent
    return pwd.getpwnam(username)
KeyError: "getpwnam(): name not found: 'fcasnun'"
galaxy.jobs.runners ERROR 2024-04-17 11:47:18,375 [pN:main.web.2,p:11516,w:2,m:0,tN:DRMAARunner.work_thread-2] (15) Unhandled exception calling queue_job
Traceback (most recent call last):
  File "lib/galaxy/jobs/runners/__init__.py", line 142, in run_next
    method(arg)
  File "lib/galaxy/jobs/runners/drmaa.py", line 202, in queue_job
    job_wrapper.change_ownership_for_run()
  File "lib/galaxy/jobs/__init__.py", line 2273, in change_ownership_for_run
    ret = external_chown(self.working_directory, self.user_system_pwent,
  File "lib/galaxy/jobs/__init__.py", line 2289, in user_system_pwent
    self.__user_system_pwent = job.user.system_user_pwent(self.app.config.real_system_username)
  File "lib/galaxy/model/__init__.py", line 585, in system_user_pwent
    return pwd.getpwnam(username)
KeyError: "getpwnam(): name not found: 'fcasnun'"
galaxy.jobs.runners.drmaa ERROR 2024-04-17 11:47:18,376 [pN:main.web.2,p:11516,w:2,m:0,tN:DRMAARunner.work_thread-2] (15/None) User killed running job, but error encountered removing from DRM queue
Traceback (most recent call last):
  File "lib/galaxy/jobs/runners/drmaa.py", line 359, in stop_job
    assert ext_id not in (None, 'None'), 'External job id is None'
AssertionError: External job id is None
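That KeyError is raised by pwd.getpwnam() in system_user_pwent() (see the traceback), so with real_system_username: username the Galaxy user's name must also exist as a system account on the host running Galaxy. A quick check along these lines (a sketch; the account names are taken from the logs):

import pwd

# Check whether the OS can resolve these accounts the same way Galaxy's
# system_user_pwent() does.
for candidate in ("fcasnun", "azureuser"):
    try:
        ent = pwd.getpwnam(candidate)
        print(f"{candidate}: uid={ent.pw_uid}, home={ent.pw_dir}")
    except KeyError:
        print(f"{candidate}: no passwd entry on this host")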

After that, I tried a username that satisfies the <common_system_user> option, but it doesn't work either. However, the error log changed to:

galaxy.jobs.runners.drmaa DEBUG 2024-04-18 10:29:40,660 [pN:main.web.2,p:28175,w:2,m:0,tN:DRMAARunner.work_thread-0] (26) submitting file /shared/Galaxy/database/jobs_directory/000/26/galaxy_26.sh
galaxy.jobs.runners.drmaa DEBUG 2024-04-18 10:29:40,661 [pN:main.web.2,p:28175,w:2,m:0,tN:DRMAARunner.work_thread-0] (26) native specification is: -w n -l slot_type=htc
galaxy.jobs.runners.drmaa DEBUG 2024-04-18 10:29:40,661 [pN:main.web.2,p:28175,w:2,m:0,tN:DRMAARunner.work_thread-0] (26) submitting with credentials: azureuser [uid: 20001]
galaxy.jobs.runners ERROR 2024-04-18 10:29:40,662 [pN:main.web.2,p:28175,w:2,m:0,tN:DRMAARunner.work_thread-0] (26) Unhandled exception calling queue_job
Traceback (most recent call last):
  File "lib/galaxy/jobs/runners/__init__.py", line 142, in run_next
    method(arg)
  File "lib/galaxy/jobs/runners/drmaa.py", line 213, in queue_job
    filename = self.store_jobtemplate(job_wrapper, jt)
  File "lib/galaxy/jobs/runners/drmaa.py", line 402, in store_jobtemplate
    with open(filename, 'w+') as fp:
FileNotFoundError: [Errno 2] No such file or directory: '/shared/Galaxy/database/sge/26.jt_json'
galaxy.jobs.runners.drmaa ERROR 2024-04-18 10:29:40,664 [pN:main.web.2,p:28175,w:2,m:0,tN:DRMAARunner.work_thread-0] (26/None) User killed running job, but error encountered removing from DRM queue
Traceback (most recent call last):
  File "lib/galaxy/jobs/runners/drmaa.py", line 359, in stop_job
    assert ext_id not in (None, 'None'), 'External job id is None'
AssertionError: External job id is None
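The FileNotFoundError suggests the directory behind cluster_files_directory: sge (resolved here to /shared/Galaxy/database/sge) does not exist; store_jobtemplate() writes the job template there but apparently does not create the directory itself. A sketch of a pre-flight fix, assuming that resolved path is correct:

import os

# The DRMAA runner writes <job_id>.jt_json files into
# cluster_files_directory; creating it up front avoids the error above.
os.makedirs("/shared/Galaxy/database/sge", exist_ok=True)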