Long waiting times until metadata is set

Hi everyone,

I am reaching out because I have hit a major performance wall with my Galaxy instance and could use some advice from the community.

I am currently managing a 160 TB instance where the entire database and data storage reside on an IT-provided NFS share. My setup involves a 10 Gbit/s head node and 4 compute nodes on 1 Gbit/s, all connected via Slurm. Recently, after running a heavy load of roughly 900 workflows with 3,000 to 5,000 steps each, the system performance has slowed down.

The main issue occurs during the metadata-setting phase. Even a simple operation like the cut tool on a tiny file takes about 3 minutes to complete; it used to take maybe 20 seconds. When I check the compute nodes while Slurm is processing, I can see python metadata/set.py sitting in D-state, which strongly suggests it is waiting on NFS I/O.
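For anyone wanting to reproduce the D-state check without watching top or ps by hand, here is a minimal sketch that reads /proc directly. It assumes a Linux compute node; the function names parse_proc_stat and list_d_state are made up for this example. D only means uninterruptible sleep, so it points at blocked I/O in general rather than proving NFS is the cause.

```python
import os


def parse_proc_stat(stat_line):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left].strip())
    command = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, command, state


def list_d_state(proc_root="/proc"):
    """List (pid, command) for processes in uninterruptible sleep (state D)."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux system, nothing to inspect
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_proc_stat(handle.read())
        except (OSError, ValueError):
            continue  # process exited or is unreadable, skip it
        if state == "D":
            blocked.append((pid, command))
    return blocked
```

Calling list_d_state() a few times in a row on a compute node shows whether the same PIDs stay blocked. The command field in /proc/<pid>/stat is only the short executable name (for example python), so the full command line has to be read from /proc/<pid>/cmdline if needed.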

I ran diskus to get a breakdown of the storage usage and found the following numbers:

Total NFS: 159 TB
database/objects: 142 TB
database/jobs_directory: 15 TB
database/tmp: 3.5 TB

The 15 TB size for the jobs directory seems extremely high. I already have cleanup_job set to onsuccess in my configuration, but it does not seem to be keeping the directory clean. I have tried running the maintenance.sh script and even tried moving some jobs to the head node to bypass Slurm, but the metadata delay persists. Additionally, my IT department has refused to increase the NFS rsize and wsize settings, so I am stuck with the defaults.

I have a few specific questions for other admins:

  1. Is a 15 TB jobs directory normal for an instance of this scale, or is my cleanup process failing? I suspect the massive number of subdirectories and files in there is causing a metadata storm that is killing NFS performance.

  2. Is it safe for me to manually purge old folders in the jobs_directory? If I use find or a similar tool to delete anything older than 30 days, will I cause any critical failures in the UI beyond losing access to old job logs and stderr/stdout?

  3. Are there specific NFS mount options like noatime or actimeo that you recommend to help mitigate these D-state hangs when dealing with such a large directory tree? I tried a few, but did not really see any improvement.
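Regarding question 3, it may help to first confirm which options are actually in effect on each node, since a changed fstab entry does not always mean the live mount picked it up. A minimal sketch, assuming Linux and the usual /proc/mounts format; the function name nfs_mount_options is made up for this example:

```python
def nfs_mount_options(mounts_text):
    """Map each NFS mount point to its option dict, given /proc/mounts text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for item in fields[3].split(","):
            key, _, value = item.partition("=")
            # Flags such as "noatime" carry no value; store True for them.
            options[key] = value if value else True
        result[fields[1]] = options
    return result
```

On a node this could be called as nfs_mount_options(open("/proc/mounts").read()), then the rsize, wsize, and attribute-cache entries can be compared between the head node and the compute nodes.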

I would really appreciate any insights or experiences you can share. The system is technically functional, but the latency has made it almost impossible to use for our researchers.

Thanks!

Hi @casio

I’ve reached out to some of our administrators to see if they have feedback and suggestions. I asked over at Matrix, but let’s keep the conversation here please!

Meanwhile, I have a few suggestions that will likely be of interest to you.

Finally, if you would like to share back which version of Galaxy you are running, and clarify some of the other configurations you’ve made, both will likely be helpful.

Let’s start there, thanks! :slight_smile:

@casio I am not sure I understand correctly - are you saying that everything slows down under heavy load or that Galaxy has entered some broken state and now everything is slow constantly?

If you run 5 million jobs with job directories and tmp on NFS I would expect a significant slow down – did you have similar workloads before without an issue?

One thing to consider could be a test with a Pulsar that stages data in and out with job working dir and tmp being local to the compute node. Or maybe some NFS performance measurements during the Galaxy load.
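As a rough idea of what such an NFS measurement could look like, here is a minimal sketch that times many tiny create, stat, and delete operations, which is the access pattern job directories produce. It is not a substitute for a proper benchmark tool; the function name time_small_file_ops and the probe file names are made up for this example.

```python
import os
import tempfile
import time


def time_small_file_ops(directory, count=200):
    """Time create, stat, and delete of `count` tiny files inside `directory`.

    Returns a dict with the total seconds spent per operation type.
    """
    timings = {}
    workdir = tempfile.mkdtemp(prefix="nfs_probe_", dir=directory)
    paths = [os.path.join(workdir, f"probe_{i}.txt") for i in range(count)]
    try:
        start = time.perf_counter()
        for path in paths:
            with open(path, "w") as handle:
                handle.write("x")
        timings["create"] = time.perf_counter() - start

        start = time.perf_counter()
        for path in paths:
            os.stat(path)
        timings["stat"] = time.perf_counter() - start

        start = time.perf_counter()
        for path in paths:
            os.remove(path)
        timings["delete"] = time.perf_counter() - start
    finally:
        # Remove any leftovers if a step failed, then the probe directory.
        for path in paths:
            if os.path.exists(path):
                os.remove(path)
        os.rmdir(workdir)
    return timings
```

Running it once against the NFS-backed job directory and once against a local path such as /tmp on the same compute node, while Galaxy is busy and while it is idle, gives comparable numbers. The stat timings may look optimistic because the NFS client can answer them from its attribute cache right after the files were created.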

Working job directories are generally safe to delete after their job is done. It would be interesting to see what makes up those 15 TB; maybe failed jobs with disabled cleanup?
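To see where the space and the file counts sit, something along these lines could be used. It is a minimal sketch that makes no assumption about how Galaxy nests job directories; it only totals files and bytes under each top-level entry of whatever path it is given. The function name summarize_tree is made up for this example.

```python
import os


def summarize_tree(root):
    """Return {top_level_entry: (file_count, total_bytes)} for `root`."""
    summary = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                # Plain file or symlink directly under root.
                size = entry.stat(follow_symlinks=False).st_size
                summary[entry.name] = (1, size)
                continue
            files = 0
            size = 0
            # os.walk does not descend into symlinked directories by default.
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        size += os.lstat(os.path.join(dirpath, name)).st_size
                        files += 1
                    except OSError:
                        pass  # file vanished while walking
            summary[entry.name] = (files, size)
    return summary
```

Sorting the result by file count rather than by bytes is probably the more telling view here, since many small files hurt NFS more than a few large ones. Note that the walk itself generates a lot of NFS metadata traffic, so it is better run at a quiet time.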

Job directories can safely be removed after the job has finished (I think this is the default for production). This can be done automatically by Galaxy.

Which setting do you use for cleanup_job?

On my instance I use onsuccess (can be useful for debugging) and just remove datasets older than 30/60 days (I would need to look up which).
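An age-based removal like this could be sketched as follows. It is only an illustration, not what Galaxy does internally: the function names find_old_dirs and purge are made up, the depth parameter is an assumption that has to match how the directories are nested on your instance, and the default is a dry run that only prints.

```python
import os
import shutil
import time


def find_old_dirs(root, max_age_days, depth=1, now=None):
    """Return directories exactly `depth` levels below `root` whose
    modification time is older than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    level = [root]
    for _ in range(depth):
        next_level = []
        for parent in level:
            with os.scandir(parent) as entries:
                next_level.extend(
                    e.path for e in entries if e.is_dir(follow_symlinks=False)
                )
        level = next_level
    return sorted(p for p in level if os.lstat(p).st_mtime < cutoff)


def purge(paths, dry_run=True):
    """Print what would be removed, or remove it when dry_run is False."""
    for path in paths:
        if dry_run:
            print("would remove", path)
        else:
            shutil.rmtree(path, ignore_errors=True)
```

A directory's modification time only changes when entries are added or removed directly inside it, so a long-running job that keeps writing to existing files can look old. It is safer to check the dry-run output against the list of running jobs before calling purge(paths, dry_run=False).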

Are there specific datatypes involved? There might be performance bottlenecks.

Thanks for the replies.

I run the current 26.0, but I already had these issues in the previous 25.1.

@marten

I am not sure I understand correctly - are you saying that everything slows down under heavy load or that Galaxy has entered some broken state and now everything is slow constantly?

The latter: it entered some ‘broken’ state and everything is slow. It was fine under heavy load until I started my many workflows, and yes, many of them also failed while running. That could explain the large jobs_directory folder. Anyhow, now even running a simple cut on a toy file takes 2-3 minutes.

If you run 5 million jobs with job directories and tmp on NFS I would expect a significant slow down – did you have similar workloads before without an issue?

When I started the first few hundred workflows, it was fine. I think that with more and more large workflows this accumulated slowly, to the point where I really started to notice it.

I did ask the IT department whether performance is alright on their side; they evaluated it with their support company and did not notice any performance issues.

@bernt-matthias

I have:

cleanup_job: onsuccess

After you both confirmed it is safe to delete the jobs_directory, I deleted it and created a new empty one; same for tmp. It did not improve much: the time for a cut is now 1:30 min. Let’s see how it looks tomorrow; our IT department always keeps some backups as hidden files, and those usually disappear overnight.

I will set cleanup_job: always and use onsuccess only for debugging.

// edit: This morning the same cut job starts faster in Slurm, which is some improvement, but the compute time is still almost three minutes (2:43 min). I think these variations in run time are also caused by how busy the network is.