Long waiting times until metadata is set

casio · May 26, 2026, 12:13pm

Hi everyone,

I am reaching out because I have hit a major performance wall with my Galaxy instance and could use some advice from the community.

I am currently managing a 160 TB instance where the entire database and data storage reside on an IT-provided NFS share. My setup involves a 10 Gbit/s head node and 4 compute nodes on 1 Gbit/s, all connected via Slurm. Recently, after running a heavy load of roughly 900 workflows with 3,000 to 5,000 steps each, the system performance has slowed down.

The main issue occurs during the setting metadata phase. Even a simple operation like the cut tool on a tiny file takes about 3 minutes to complete; it used to take maybe 20 seconds. When I check the compute nodes while Slurm is processing, I can see python metadata/set.py sitting in D-state, which clearly indicates it is waiting on NFS I/O.
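
For anyone who wants to repeat the D-state check on a compute node without eyeballing ps output, here is a minimal sketch that reads /proc directly. It assumes a Linux host; the function names `parse_stat` and `d_state_processes` are made up for this example and are not part of Galaxy.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc, nothing to report
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile or is unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

Running it in a loop while a job sits in the metadata phase shows whether the same processes stay in D state or only pass through it briefly.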

I ran diskus to get a breakdown of the storage usage and found the following numbers:

Total NFS: 159 TB
database/objects: 142 TB
database/jobs_directory: 15 TB
database/tmp: 3.5 TB

The 15 TB size for the jobs directory seems extremely high. I already have cleanup_job set to onsuccess in my configuration, but it does not seem to be keeping the directory clean. I have tried running the maintenance.sh script and even tried moving some jobs to the head node to bypass Slurm, but the metadata delay persists. Additionally, my IT department has refused to increase the NFS rsize and wsize settings, so I am stuck with the defaults.

I have a few specific questions for other admins:

Is a 15 TB jobs directory normal for an instance of this scale, or is my cleanup process failing? I suspect the massive number of subdirectories and files in there is causing a metadata storm that is killing NFS performance.
Is it safe for me to manually purge old folders in the jobs_directory? If I use find or a similar tool to delete anything older than 30 days, will I cause any critical failures in the UI beyond losing access to old job logs and stderr/stdout?
Are there specific NFS mount options like noatime or actimeo that you recommend to help mitigate these D-state hangs when dealing with such a large directory tree? I tried a few, but did not really see any improvement.

I would really appreciate any insights or experiences you can share. The system is technically functional, but the latency has made it almost impossible to use for our researchers.

Thanks!

jennaj · May 26, 2026, 6:53pm

Hi @casio

I’ve reached out to some of our administrators to see if they have feedback and suggestions. I asked over at Matrix, but let’s keep the conversation here please!

Meanwhile, I have a few suggestions that will likely be of interest to you.

For the cleanup, yes, looking into this is probably where to start, since you’ll want to make sure this is tuned correctly anyway. → Hands-on: Server Maintenance: Cleanup, Backup, and Restoration / Galaxy Server administration
How to investigate where the underlying bottlenecks may be. → Hands-on: Galaxy Monitoring with Telegraf and Grafana / Galaxy Server administration
Alternative ways to store data. → Hands-on: Distributed Object Storage / Galaxy Server administration
This one is general and more about the front end than the backend, but seems worth including! → Scaling and Load Balancing — Galaxy Project 26.0.1.dev0 documentation

Finally, if you would like to share back which version of Galaxy you are running, and clarify some of the other configurations you’ve made, both will likely be helpful.

Let’s start there, thanks!

marten · May 27, 2026, 8:34am

@casio I am not sure I understand correctly - are you saying that everything slows down under heavy load or that Galaxy has entered some broken state and now everything is slow constantly?

If you run 5 million jobs with job directories and tmp on NFS I would expect a significant slowdown – did you have similar workloads before without an issue?

One thing to consider could be a test with a Pulsar that stages data in and out with job working dir and tmp being local to the compute node. Or maybe some NFS performance measurements during the Galaxy load.
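
A rough way to get such a measurement is to time small-file operations on the share while Galaxy is busy. The sketch below is only an illustration: the function name `probe_metadata_latency` is hypothetical, and the path you pass in is up to you (for example a scratch folder on the same NFS mount).

```python
import os
import tempfile
import time


def probe_metadata_latency(directory, n=200):
    """Time create, stat and unlink of n tiny files; return mean seconds per op."""
    workdir = tempfile.mkdtemp(prefix="latency_probe_", dir=directory)
    paths = [os.path.join(workdir, f"f{i}") for i in range(n)]
    timings = {}
    try:
        start = time.perf_counter()
        for path in paths:
            with open(path, "w") as fh:
                fh.write("x")
        timings["create"] = (time.perf_counter() - start) / n

        # Note: on NFS a stat right after a create may be answered from the
        # client's attribute cache, so this number can look optimistic.
        start = time.perf_counter()
        for path in paths:
            os.stat(path)
        timings["stat"] = (time.perf_counter() - start) / n

        start = time.perf_counter()
        for path in paths:
            os.unlink(path)
        timings["unlink"] = (time.perf_counter() - start) / n
    finally:
        # Remove anything left behind if one of the phases failed.
        for path in paths:
            if os.path.exists(path):
                os.unlink(path)
        os.rmdir(workdir)
    return timings


if __name__ == "__main__":
    for op, seconds in probe_metadata_latency(tempfile.gettempdir()).items():
        print(f"{op}: {seconds * 1000:.3f} ms per file")
```

Comparing the numbers from a compute node and from the head node, and from a local disk versus the share, gives a first hint whether the delay is in NFS metadata operations or somewhere else.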

Working job directories are generally safe to delete after their job is done. It would be interesting to see what comprises those 15TB – maybe failed jobs with disabled cleanup?
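
To see what makes up the space, one option is to bucket the files under the jobs directory by age. This is a sketch under assumptions: `usage_by_age` is a made-up helper, the bucket edges are arbitrary, and it makes no assumption about how Galaxy lays out the directory. Be aware that walking a tree of this size is itself a heavy metadata workload on NFS, so it is better run off-peak.

```python
import os
import time


def usage_by_age(root, edges_days=(7, 30, 90), now=None):
    """Count files and bytes under root, grouped by modification age in days."""
    now = time.time() if now is None else now
    labels = [f"<{d}d" for d in edges_days] + [f">={edges_days[-1]}d"]
    report = {label: {"files": 0, "bytes": 0} for label in labels}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished while we were walking
            age_days = (now - st.st_mtime) / 86400
            # Pick the first bucket whose upper edge is above the file's age.
            for edge, label in zip(edges_days, labels):
                if age_days < edge:
                    break
            else:
                label = labels[-1]
            report[label]["files"] += 1
            report[label]["bytes"] += st.st_size
    return report


if __name__ == "__main__":
    import sys
    for label, stats in usage_by_age(sys.argv[1] if len(sys.argv) > 1 else ".").items():
        print(label, stats["files"], "files", stats["bytes"], "bytes")
```

If most of the bytes sit in the oldest bucket, leftovers from failed jobs are a likely explanation.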

bernt-matthias · May 27, 2026, 10:24am

Job directories can safely be removed after the job has finished (I think this is the default for production). This can be done automatically by Galaxy.

Which setting do you use:

github.com/galaxyproject/galaxy · lib/galaxy/config/sample/galaxy.yml.sample @ 59b15b531

          # for many tools). Set this to legacy_and_local to preserve the
          # environment for legacy tools and locally managed tools (this might
          # be useful for instance if you are installing software into Galaxy's
          # virtualenv for tool development).
          #preserve_python_environment: legacy_only
          
          # Clean up various bits of jobs left on the filesystem after
          # completion.  These bits include the job working directory, external
          # metadata temporary files, and DRM stdout and stderr files (if using
          # a DRM).  Possible values are: always, onsuccess, never
          #cleanup_job: always
          
          # When running DRMAA jobs as the Galaxy user
          # (https://docs.galaxyproject.org/en/master/admin/cluster.html#submitting-jobs-as-the-real-user)
          # this script is used to run the job script Galaxy generates for a
          # tool execution.
          # Example value 'sudo -E scripts/drmaa_external_runner.py
          # --assign_all_groups'
          #drmaa_external_runjob_script: null
          
          # When running DRMAA jobs as the Galaxy user

On my instance I use onsuccess (can be useful for debugging) and just remove datasets older than 30 or 60 days (I would need to look up which).
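
An age-based cleanup of leftover job working directories could be sketched as below. This is not Galaxy's own cleanup code: `stale_dirs` and `purge` are hypothetical helpers, the 30-day default is only an example, and which directory level corresponds to one job depends on your layout. A directory's mtime changes only when entries are added or removed directly inside it, so treat the age check as a heuristic and make sure no running job still uses the paths.

```python
import os
import shutil
import time


def stale_dirs(root, max_age_days=30, now=None):
    """Return immediate subdirectories of root not modified for max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.path)
    return sorted(stale)


def purge(paths, dry_run=True):
    """Remove the given directories; with dry_run=True only print them."""
    for path in paths:
        if dry_run:
            print("would remove", path)
        else:
            shutil.rmtree(path)
    return len(paths)


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        # Dry run by default: prints candidates and deletes nothing.
        purge(stale_dirs(sys.argv[1]), dry_run=True)
```

Starting with the dry run and checking the printed list against a few known job IDs is the safer order before switching `dry_run` off.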

Are there specific datatypes involved? There might be performance bottlenecks ..

casio · May 27, 2026, 6:03pm

Thanks for the replies.

I run the current 26.0, but I already had these issues on the previous 25.1.

@marten

I am not sure I understand correctly - are you saying that everything slows down under heavy load or that Galaxy has entered some broken state and now everything is slow constantly?

The latter: it entered some ‘broken’ state and everything is slow. It was fine under heavy load until I started my many workflows, and yes, many of them also failed while running. That could explain the large job_directory folder. Anyhow, now even running a simple cut on a toy file takes 2-3 minutes.

If you run 5 million jobs with job directories and tmp on NFS I would expect a significant slow down – did you have similar workloads before without an issue?

When starting the first few hundred workflows, it was fine. I think that with more and more large workflows, this accumulated slowly to the point where I started to really notice it.

I did ask the IT department whether the performance on their side is alright; they evaluated it with their support company and did not notice any performance issues.

@bernt-matthias

I have:

cleanup_job: onsuccess

After you both confirmed it is safe to delete the job_directory, I deleted it and created a new empty one; same for tmp. It did not improve anything; the time for a cut is now 1:30 min. Let’s see how it looks tomorrow: our IT department always has some backups as hidden files, and those usually disappear overnight.

I will set cleanup_job: always and do onsuccess only for debugging.

// edit: This morning the same cut job started faster in Slurm, which is some improvement, but the compute time is still almost three minutes (2:43). I think these variations in run time are also caused by how busy the network is.