drmaa library threads and MUNGE: Invalid credential format

Submitting a job to an SGE 8.1.9 cluster from an application called Galaxy logged the following error:

galaxy.jobs.runners.drmaa WARNING 2019-02-13 13:20:43,111 (427) drmaa.Session.runJob() failed, will retry: code 17: MUNGE authentication failed: Invalid credential format

I have verified that UIDs and GIDs match across the host and the cluster, that permissions on the munge directories and files match the install docs, and that munge.key is identical on the cluster and the host.
I used this Python script outside of Galaxy to submit a job to the cluster successfully:

import drmaa
from multiprocessing.pool import ThreadPool
import tempfile
import os
import stat


session = drmaa.Session()
session.initialize()
def main():
    smt = "ls . > test.out"
    script_file = tempfile.NamedTemporaryFile(mode="w", dir=os.getcwd(), delete=False)
    script_file.write(smt)
    script_file.close()
    print "Job is in file %s" % script_file.name
    os.chmod(script_file.name, stat.S_IRWXG | stat.S_IRWXU)
    jt = session.createJobTemplate()
    print "jt created"
    jt.jobEnvironment = {'BASH_ENV': '~/.bashrc'}
    print "environment set"
    jt.remoteCommand = os.path.join(os.getcwd(),script_file.name)
    print "remote command set"
    jobid = session.runJob(jt)
    print "Job submitted with id: %s, waiting ..." % jobid
    retval = session.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)

if __name__=='__main__':
    main()

When I try this same script with Python multithreading, I get an error. The script and error are:

import drmaa
from multiprocessing.pool import ThreadPool
import tempfile
import os
import stat

pool = ThreadPool(1)

session = drmaa.Session()
session.initialize()

def pTask(n):
    smt = "ls . > test.out"
    script_file = tempfile.NamedTemporaryFile(mode="w", dir=os.getcwd(), delete=False)
    script_file.write(smt)
    script_file.close()
    print "Job is in file %s" % script_file.name
    os.chmod(script_file.name, stat.S_IRWXG | stat.S_IRWXU)
    jt = session.createJobTemplate()
    print "jt created"
    jt.jobEnvironment = {'BASH_ENV': '~/.bashrc'}
    print "environment set"
    jt.remoteCommand = os.path.join(os.getcwd(),script_file.name)
    print "remote command set"
    jobid = session.runJob(jt)
    print "Job submitted with id: %s, waiting ..." % jobid
    retval = session.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)

pool.map(pTask, (1,))
Result is:
Job is in file /home/svc-clingalprod/tmpu3A6Rk
jt created
environment set
error: getting configuration: MUNGE authentication failed: Invalid credential format
remote command set
Traceback (most recent call last):
  File "remote_mthread.py", line 29, in <module>
    pool.map(pTask, (1,))
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
drmaa.errors.DeniedByDrmException: code 17: MUNGE authentication failed: Invalid credential format

Where do I go from here in isolating the cause of the "Invalid credential format" error?


So this is interesting and goes a bit further than https://github.com/pygridtools/drmaa-python/issues/44, as this is failing even with just a single thread. Maybe we can come up with a patch for drmaa-python, which would probably be the best solution. Can you check whether this works with a ProcessPool instead of a ThreadPool?
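
For reference, a minimal untested sketch of what I mean: the same module-level session setup, with multiprocessing.Pool swapped in for the ThreadPool, and a trivial /bin/true job standing in for the tempfile boilerplate (that command is just a placeholder):

import drmaa
from multiprocessing import Pool  # process-based pool instead of ThreadPool

session = drmaa.Session()
session.initialize()

def pTask(n):
    # Same submission path as the failing script; /bin/true is a placeholder job.
    jt = session.createJobTemplate()
    jt.remoteCommand = '/bin/true'
    jobid = session.runJob(jt)
    print "Job submitted with id: %s, waiting ..." % jobid
    session.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    session.deleteJobTemplate(jt)

if __name__ == '__main__':
    pool = Pool(1)
    pool.map(pTask, (1,))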

Also, does it work if you initialize the session within the thread:

import drmaa
from multiprocessing.pool import ThreadPool
import tempfile
import os
import stat

pool = ThreadPool(1)

def pTask(n):
    session = drmaa.Session()
    session.initialize()
    smt = "ls . > test.out"
    script_file = tempfile.NamedTemporaryFile(mode="w", dir=os.getcwd(), delete=False)
    script_file.write(smt)
    script_file.close()
    print "Job is in file %s" % script_file.name
    os.chmod(script_file.name, stat.S_IRWXG | stat.S_IRWXU)
    jt = session.createJobTemplate()
    print "jt created"
    jt.jobEnvironment = {'BASH_ENV': '~/.bashrc'}
    print "environment set"
    jt.remoteCommand = os.path.join(os.getcwd(),script_file.name)
    print "remote command set"
    jobid = session.runJob(jt)
    print "Job submitted with id: %s, waiting ..." % jobid
    retval = session.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)

pool.map(pTask, (1,))

That will take me a little while to work through. Parallel processes and threads are new to me. I will let you know. In the meantime, is Pulsar a viable option for my situation? I need to provide some options for solutions.


Yes, but that's more complicated than necessary. If you can use qsub on the host where Galaxy runs, your best bet is probably the command line interface (CLI) runner with the PBS backend. I'm using this in production (through ssh), so it should be a viable option.



FYI: drmaa-python no longer has an active package maintainer to go to for issues. Where can I get some documentation that covers the CLI job runner? I need to figure out how to translate the Torque information to SGE. Using the DRMAA queue with Pulsar suffers from the same MUNGE auth issue as Galaxy.

@mvdbeek Initializing the session from within the thread pool does work. Does this have any bearing on how Galaxy submits jobs using drmaa-python?

Yes, we probably need to move the session initialization around to accommodate this. Alternatively it might be necessary to make the drmaa session global. I'll try to find some time to look into this and to set up a test case. What did you try for the CLI runner? It might just work with the Torque or PBS defaults.
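
Roughly, the idea would be something like this hypothetical helper (an untested sketch of the approach, not Galaxy's actual code): keep a single global drmaa session, but initialize it lazily from the thread that does the submitting, which is what worked in your test.

import drmaa
import threading

_session = None
_session_lock = threading.Lock()

def get_session():
    # Hypothetical helper: lazily create and initialize one global drmaa session
    # the first time a submitting thread asks for it, so initialization happens
    # in the worker thread rather than in the main thread at startup.
    global _session
    with _session_lock:
        if _session is None:
            s = drmaa.Session()
            s.initialize()
            _session = s
    return _session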

I had hoped it would work with the LocalShell plugin in conjunction with the Torque job plugin. I have submitted a request to get ssh or rsh access for my default Galaxy user configured on the head node for submissions so I can test that.

I appreciate you looking into it.

Yes, they use the same interface. If you can submit jobs on the Galaxy node you don't need to use the remote shell; you can use the local shell interface.


@mvdbeek Running a basic job using the local shell threw an error. It is the same error as in "LocalShell runner fails on 'No such file or directory'" (galaxyproject/galaxy issue #7269 on GitHub).

job conf:

<destination id="LocalShell" runner="cli">
    <param id="shell_plugin">LocalShell</param>
    <param id="job_plugin">Torque</param>
    <param id="job_destination">IIHG</param>
</destination>

<tool id="testing" destination="LocalShell"/>

We just merged a fix for this, can you update and try again?


Nope…
I ran git pull to grab the merge:

[svc-clingalprod@clinical-galaxy clingalaxyprod]$ git pull
Updating 9c9c8a5..a12d9e3
Fast-forward
 config/galaxy.yml.sample                         | 30 ++++++++++++++++++------------
 config/reports.yml.sample                        | 31 ++++++++++++++++---------------
 config/tool_shed.yml.sample                      | 27 +++++++++++++++++----------
 doc/source/admin/galaxy_options.rst              | 30 +++++++++++-------------------
 doc/source/admin/reports_options.rst             | 31 +++++++++----------------------
 lib/galaxy/jobs/runners/slurm.py                 |  4 ++--
 lib/galaxy/jobs/runners/util/cli/shell/local.py  |  7 ++++++-
 lib/galaxy/managers/workflows.py                 |  2 +-
 lib/galaxy/webapps/config_manage.py              | 10 ++++++++++
 lib/galaxy/webapps/galaxy/config_schema.yml      | 33 +++++++++------------------------
 lib/galaxy/webapps/galaxy/controllers/dataset.py | 56 +++++++++++++++++++++++++++++++-------------------------
 lib/galaxy/webapps/reports/config_schema.yml     | 27 +++++++++------------------
 lib/galaxy/webapps/tool_shed/config_schema.yml   | 18 +++++++++---------
 13 files changed, 148 insertions(+), 158 deletions(-)
[svc-clingalprod@clinical-galaxy clingalaxyprod]$ git log
commit a12d9e367cae5eb19eec905385d6cb9e5ad0826c
Merge: 68e3568 e8f5a13
Author: John Chilton <jmchilton@gmail.com>
Date:   Fri Mar 1 11:40:35 2019 -0500

    Merge pull request #7438 from mvdbeek/fix_shell_runner

    [18.09] Fix LocalShellRunner

commit e8f5a13f3b4d0f57819db18bb6cde086b92eb6c8
Author: mvdbeek <m.vandenbeek@gmail.com>
Date:   Fri Mar 1 11:22:50 2019 +0100

    Fix LocalShellRunner

    All the `cmd` coming from the plugins are strings, so if cmd is
    a string we use `shell=True`, brings this on par
    with the Remote Shell plugins (which consume strings).
    Should fix https://github.com/galaxyproject/galaxy/issues/7269.

commit 68e3568bb25928a86283e3d50411d64acd523c52
Merge: 13c175a 9a26de1
Author: Nicola Soranzo <nicola.soranzo@gmail.com>
Date:   Fri Mar 1 00:03:03 2019 +0000

    Merge pull request #7429 from natefoo/slurm-cgroup-message-fix

I logged into Galaxy to run the job; handler error output follows:

galaxy.jobs.mapper DEBUG 2019-03-01 13:38:38,666 (493) Mapped job to destination id: LocalShell
galaxy.jobs.handler DEBUG 2019-03-01 13:38:38,678 (493) Dispatching to cli runner
galaxy.jobs DEBUG 2019-03-01 13:38:38,686 (493) Persisting job destination (destination id: LocalShell)
galaxy.jobs DEBUG 2019-03-01 13:38:38,687 (493) Working directory for job is: /Dedicated/clingalproddata/database/jobs_directory/000/493
galaxy.jobs.runners DEBUG 2019-03-01 13:38:38,699 Job [493] queued (20.767 ms)
galaxy.jobs.handler INFO 2019-03-01 13:38:38,709 (493) Job dispatched
galaxy.jobs.command_factory INFO 2019-03-01 13:38:38,799 Built script [/Dedicated/clingalproddata/database/jobs_directory/000/493/tool_script.sh] for tool command [echo "Running with '${GALAXY_SLOTS:-1}' threads" > "/Dedicated/clingalproddata/database/files/000/dataset_570.dat"]
galaxy.jobs.runners DEBUG 2019-03-01 13:38:38,843 (493) command is: rm -rf working; mkdir -p working; cd working; /Dedicated/clingalproddata/database/jobs_directory/000/493/tool_script.sh; return_code=$?; cd '/Dedicated/clingalproddata/database/jobs_directory/000/493';
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True
python "/Dedicated/clingalproddata/database/jobs_directory/000/493/set_metadata_tfxren.py" "/Dedicated/clingalproddata/database/jobs_directory/000/493/registry.xml" "/Dedicated/clingalproddata/database/jobs_directory/000/493/working/galaxy.json" "/Dedicated/clingalproddata/database/jobs_directory/000/493/metadata_in_HistoryDatasetAssociation_705_9OtoI8,/Dedicated/clingalproddata/database/jobs_directory/000/493/metadata_kwds_HistoryDatasetAssociation_705_JhY9R_,/Dedicated/clingalproddata/database/jobs_directory/000/493/metadata_out_HistoryDatasetAssociation_705_SUNI5o,/Dedicated/clingalproddata/database/jobs_directory/000/493/metadata_results_HistoryDatasetAssociation_705_Sa9Qvq,/Dedicated/clingalproddata/database/files/000/dataset_570.dat,/Dedicated/clingalproddata/database/jobs_directory/000/493/metadata_override_HistoryDatasetAssociation_705_K4Lslz" 5242880; sh -c "exit $return_code"
galaxy.jobs.runners.cli DEBUG 2019-03-01 13:38:38,874 (493) submitting file: /Dedicated/clingalproddata/database/jobs_directory/000/493/galaxy_493.sh
galaxy.jobs.runners ERROR 2019-03-01 13:38:38,899 (493) Unhandled exception calling queue_job
Traceback (most recent call last):
  File "/Dedicated/clingalaxyprod/lib/galaxy/jobs/runners/__init__.py", line 113, in run_next
    method(arg)
  File "/Dedicated/clingalaxyprod/lib/galaxy/jobs/runners/cli.py", line 98, in queue_job
    returncode, stdout = self.submit(shell, job_interface, ajs.job_file, galaxy_id_tag, retry=MAX_SUBMIT_RETRY)
  File "/Dedicated/clingalaxyprod/lib/galaxy/jobs/runners/cli.py", line 130, in submit
    cmd_out = shell.execute(job_interface.submit(job_file))
  File "/Dedicated/clingalaxyprod/lib/galaxy/jobs/runners/util/cli/shell/local.py", line 47, in execute
    pass
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Something is weird in your traceback. This:

  File "/Dedicated/clingalaxyprod/lib/galaxy/jobs/runners/util/cli/shell/local.py", line 47, in execute
    pass

doesn't make any sense. You may need to clean up any *.pyc files and restart Galaxy. If anything it should fail at line 52.
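
If it helps, a small sketch of the cleanup (it assumes you run it from the Galaxy root; find . -name '*.pyc' -delete does the same thing):

import os

# Walk the checkout and remove stale compiled bytecode so traceback line
# numbers match the current source again.
for root, dirs, files in os.walk('.'):
    for name in files:
        if name.endswith('.pyc'):
            os.remove(os.path.join(root, name))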

That makes sense, as it's the correct path according to the Galaxy build. I restarted Galaxy and it cleared the error… sort of:

galaxy.jobs.runners.cli DEBUG 2019-03-01 14:44:06,175 (498) submitting file: /Dedicated/clingalproddata/database/jobs_directory/000/498/galaxy_498.sh
galaxy.jobs.runners.cli DEBUG 2019-03-01 14:44:06,284 (498) submission failed (stdout): , retrying in 10 seconds
galaxy.jobs.runners.cli DEBUG 2019-03-01 14:44:06,284 (498) submission failed (stderr): /bin/sh: qsub: command not found

I can literally run qstat and qsub from a terminal as the Galaxy user in /bin/sh… Am I missing something in job_conf.xml, or possibly not passing my environment to the CLI runner? I will do some digging; however, it looks like your merge has fixed the documented issue.

I think Popen inherits the starting process's environment; if you're using supervisor to start Galaxy you may need to add the PATH to the environment line. If, however, qsub is in a standard location (/usr/bin, likely /usr/local/bin as well), that shouldn't be necessary and it should work out of the box.
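
As a quick illustration: Popen (and anything built on it) uses the parent's os.environ unless env= is passed, so a diagnostic like this, run from the same process that spawns qsub, shows exactly what that shell sees (a hypothetical check, not Galaxy code):

import subprocess

# The spawned shell inherits this process's environment because no env= is passed,
# so this prints the PATH the CLI runner's qsub call would actually get.
print subprocess.check_output(['sh', '-c', 'echo "PATH=$PATH"; command -v qsub || echo "qsub not found"'])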

Meh… The PATH to qsub is in the environment for the handlers but not on everything in the conf file. Could that be the issue? I am also wondering if running standard uWSGI rather than what Galaxy provides in the .venv is contributing to my complications. Zerglings… I inherited the system that way…
See below:

[program:zergpool]
command         = uwsgi --plugin zergpool --master --zerg-pool /var/tmp/zergpool.sock:127.0.0.1:4001 --logto /var/log/galaxy/zergpool.log
directory       = /Dedicated/clingalaxyprod
priority        = 899
umask           = 022
autostart       = true
autorestart     = true
startsecs       = 5
user            = svc-clingalprod
environment     = HOME="/Dedicated/clingalaxyprod",VIRTUAL_ENV="/Dedicated/clingalaxyprod/.venv",PATH="/Dedicated/clingalaxyprod/.venv/bin:%(ENV_PATH)s"
numprocs        = 1
stopsignal      = INT

[program:zergling0]
command         = uwsgi --plugin python --virtualenv /Dedicated/clingalaxyprod/.venv --ini-paste /Dedicated/clingalaxyprod/config/galaxy.ini --stats 127.0.0.1:9190 --logto /var/log/galaxy/zergling0.log
directory       = /Dedicated/clingalaxyprod
priority        = 999
umask           = 022
autostart       = true
autorestart     = unexpected
startsecs       = 15
user            = svc-clingalprod
environment     = HOME="/Dedicated/clingalaxyprod",VIRTUAL_ENV="/Dedicated/clingalaxyprod/.venv",PATH="/Dedicated/clingalaxyprod/.venv/bin:%(ENV_PATH)s"
stopsignal      = INT

[program:zergling1]
command         = uwsgi --plugin python --virtualenv /Dedicated/clingalaxyprod/.venv --ini-paste /Dedicated/clingalaxyprod/config/galaxy.ini --stats 127.0.0.1:9191 --logto /var/log/galaxy/zergling1.log
directory       = /Dedicated/clingalaxyprod
priority        = 999
umask           = 022
autostart       = false
autorestart     = unexpected
startsecs       = 15
user            = svc-clingalprod
environment     = HOME="/Dedicated/clingalaxyprod",VIRTUAL_ENV="/Dedicated/clingalaxyprod/.venv",PATH="/Dedicated/clingalaxyprod/.venv/bin:%(ENV_PATH)s"
stopsignal      = INT

[program:handler]
command         = python ./scripts/galaxy-main -c /Dedicated/clingalaxyprod/config/galaxy.ini --server-name=handler%(process_num)s --log-file /var/log/galaxy/handler%(process_num)s.log
directory       = /Dedicated/clingalaxyprod
process_name    = handler%(process_num)s
numprocs        = 2
umask           = 022
autostart       = true
autorestart     = true
startsecs       = 10
user            = svc-clingalprod
environment     = VIRTUAL_ENV="/Dedicated/clingalaxyprod/.venv",PATH="/Dedicated/clingalaxyprod/.venv/bin:%(ENV_PATH)s:/opt/sge/bin",DRMAA_LIBRARY_PATH="/opt/sge/lib/lx-amd64/libdrmaa.so",SGE_ROOT="/opt/sge",SGE_ARCH="lx-amd64",DEFAULTMANPATH="/opt/sge/man",MANTYPE="man",SGE_CELL="default",SGE_CLUSTER_NAME="argon",SGE_QMASTER_PORT="6444",SGE_EXECD_PORT="6445",shlib_path_name="LD_LIBRARY_PATH"

[group:gx]
programs = zergpool,zergling0,zergling1,handler

I'm not sure what the zerglings are doing, just serving web requests?
In that case I'd assume they wouldn't need qsub in PATH, but why not give it a try and see if that changes anything? If you do change stuff there, don't forget to run supervisorctl update; that has bitten me many times in the past.

It was a missing PATH entry for qsub… That's a good thing. qsub works now, but… maybe there's a slight nuance in job handling between the Torque CLI plugin and SGE?

galaxy.jobs.runners ERROR 2019-03-07 14:38:00,402 [p:21198,w:0,m:2] [ShellRunner.monitor_thread] Unhandled exception checking active jobs
Traceback (most recent call last):
  File "/Dedicated/clingalaxyprod/lib/galaxy/jobs/runners/__init__.py", line 594, in monitor
    self.check_watched_items()
  File "/Dedicated/clingalaxyprod/lib/galaxy/jobs/runners/cli.py", line 152, in check_watched_items
    job_states = self.__get_job_states()
  File "/Dedicated/clingalaxyprod/lib/galaxy/jobs/runners/cli.py", line 206, in __get_job_states
    assert cmd_out.returncode == 0, cmd_out.stderr
AssertionError: SGE 8.1.9
usage: qstat [options]