Jobs in a workflow sometimes run, sometimes don't, on local Galaxy 22.01 using a Slurm cluster

Hello, Galaxy! :wave: :sparkles:

I am running a local instance of Galaxy 22.01 on my institution’s Slurm cluster via DRMAA. I am configuring it to send jobs to our partition, where each job is given one node and all of the CPUs on that node (in job_conf.xml: <param id="nativeSpecification">--nodes=1 --ntasks=1 --cpus-per-task=32 --partition=vgl</param>). For more background on this cluster: when I submit jobs via sbatch and need a tool from a conda environment, I have to either a) activate the conda env on the interactive head node before submitting the job, or b) have the job script explicitly initialize conda, source ~/.bashrc, and then activate the required env. Otherwise, the job can fail because it cannot find the tool.
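For reference, option b) looks roughly like the sketch below (the conda install path, environment name, and input files are placeholders for illustration, not my actual setup):

#!/bin/bash
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=32 --partition=vgl

# Make conda usable in the non-interactive batch shell
# (hypothetical install path; adjust to wherever conda lives)
source ~/miniconda3/etc/profile.d/conda.sh
source ~/.bashrc
conda activate bwa_env    # hypothetical environment name

# Run the tool with the CPUs Slurm allocated to the job
bwa mem -t "$SLURM_CPUS_PER_TASK" ref.fa reads_R1.fq reads_R2.fq > aln.sam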

Onto the local Galaxy install!
Sometimes when I run a workflow, a tool works once but not another time, like in this screenshot:

Other times, a tool will not work when I invoke a workflow that uses it. These two BWA MEM jobs failed for different reasons in the same workflow, but when I invoked the workflow again, they both ran fine.


Another instance of BWA MEM working once but not again within the same invocation:

I have also been seeing this with BUSCO and QUAST. Here is an example where the workflow-invoked BUSCO failed but then ran fine on the same dataset when I clicked “rerun job”:

Sometimes for BWA MEM I get a database locked error (might be unrelated to the previous ones…):

Traceback (most recent call last):
  File "/lustre/fs5/vgl/scratch/labueg/galaxy_22.01/galaxy/.venv/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1803, in _execute_context
    cursor, statement, parameters, context
  File "/lustre/fs5/vgl/scratch/labueg/galaxy_22.01/galaxy/.venv/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: database is locked

The above database-locked error came up while I was running the BWA MEM workflow a couple of times to try to debug it, so maybe it comes from invoking the workflow twice at once? I only started seeing this error recently, as of last night when I was rerunning the workflow.

As for tools where I have not run into this issue: meryl, merqury, and hifiasm have all been running without a hitch; I have not seen any of the above errors with them.

The workflows I am using are the HiC workflow (with BWA MEM) and the Hifiasm-HiC workflow (with BUSCO/QUAST) (iwc/Galaxy-Workflow-Long_read_assembly_with_Hifiasm_and_HiC_data.ga at VGP · Delphine-L/iwc · GitHub). Here is the rest of my job_conf.xml in case it is helpful:

<?xml version="1.0"?>
<!-- A sample job config that explicitly configures job running the way it is
     configured by default (if there is no explicit config). -->
<job_conf>
    <plugins>
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="4"/>
        <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner">
            <param id="drmaa_library_path">/vggpfs/fs3/vgl/store/labueg/programs/slurm-drmaa/slurm-drmaa-1.1.3/lib/libdrmaa.so</param>
        </plugin>
    </plugins>
    <destinations default="slurm-vgl">
        <destination id="local" runner="local"/>
        <destination id="slurm-vgl" runner="slurm">
            <param id="nativeSpecification">--nodes=1 --ntasks=1 --cpus-per-task=32 --partition=vgl</param>
        </destination>
        <destination id="slurm-bigmem" runner="slurm">
            <param id="nativeSpecification">--nodes=1 --ntasks=1 --cpus-per-task=64 --partition=vgl_bigmem</param>
        </destination>
    </destinations>
</job_conf>

Any help is appreciated, and please let me know if any more details would help! Thank you for your time! :smile:


Hi @abueg

I’ve cross-posted this to the admin chat. They may reply here or there, and feel free to join the chat.

Matrix: You're invited to talk on Matrix

Gitter: galaxyproject/admins - Gitter

One extra bit of information you might add to help others offer advice… have you implemented any of these? Which and how? Production Environments — Galaxy Project 22.01.1.dev0 documentation


@abueg if the file-not-found error only appears randomly, could it be that a few nodes in your cluster do not have access to your conda environments? Every cluster node that receives jobs needs to have access to the conda environments. Alternatively, you can try to use Docker or Singularity containers.
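If you want to try Singularity, a destination can be extended along these lines. This is a minimal sketch assuming Singularity is installed on the compute nodes; the destination id is made up, and your site may need additional container-related settings:

<destination id="slurm-vgl-singularity" runner="slurm">
    <param id="nativeSpecification">--nodes=1 --ntasks=1 --cpus-per-task=32 --partition=vgl</param>
    <param id="singularity_enabled">true</param>
</destination>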

The “database is locked” error is due to the SQLite database that you are using. SQLite is not meant for production or heavy load. Normally you would use a proper database like PostgreSQL instead.
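For example, switching to PostgreSQL is a single setting in galaxy.yml once the database and user exist. A minimal sketch, with placeholder credentials and host:

galaxy:
  # hypothetical user/password/host; create them in PostgreSQL first
  database_connection: postgresql://galaxy_user:secret@localhost:5432/galaxy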

Hope that helps,
Bjoern


It seems to be random, so I think that might be it. Any suggestions on how I can test or work around this? I was thinking of checking which nodes were used when a job failed and then seeing whether activating the environments on those nodes works, with something like the commands below, but I am open to other ideas.
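(A rough sketch of what I mean; the job ID and node name are placeholders, with the external Slurm job ID taken from the job’s details in Galaxy:)

# Which node ran the failed job, and how did it exit?
sacct -j 1234567 --format=JobID,JobName,NodeList,State,ExitCode

# Then, on the reported node, check that the environment resolves:
ssh node042 'source ~/.bashrc && conda env list'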

re: SQLite - ohh, okay, thank you! Yeah, once the web server is up I will hopefully be using a proper database; our web server isn’t available yet, though, so I’ve just been testing the workflows on a local instance for the time being. Thank you!


Click on the “view details” (circle with an i) icon for one of the red failed jobs and review the Job Metrics section at the bottom of the Dataset Details form. You could compare that across other failed jobs to find technical patterns.

It is possible that locking in the default SQLite database could lead to this problem, too (as far as I know). Upgrading your database is likely the best first step.
