Custom local instance with cluster - various questions

Dear Galaxy admins/users,

Overall context:

  • In a university context, across multiple clusters, teams, and fields
  • We are experimenting with a local Galaxy instance that we want to adapt to a cluster setting.
  • Galaxy will be hosted on a virtual machine.
  • Galaxy will submit jobs to several (non-local) clusters using REST requests with JWT authorisation.
  • We don’t have root on the clusters.
  • We need all data to remain exclusively on the clusters.
  • New tools would be created regularly
  • Reasons we are considering choosing Galaxy instead of another workflow management platform:
    • Excellent GUI for non-technical users
    • Able to precisely see how a specific datum was generated, including the inheritance chain
    • Sharing of workflows + data
    • RBAC
    • Cluster integration, including dynamic destination methods
    • Collections
    • Conditional workflow steps

Here are our specific questions, any answer to any sub-part is appreciated:

  1. Data
    1. Context:
      • We need the files in galaxy_install_dir/database/objects to be references containing the cluster name + filepath (within that cluster), instead of being the actual file.
      • To be clear, we don’t want to duplicate the data, but just refer to the data on the cluster.
      • Real-time access to the results from the Galaxy GUI
        • still needs to be possible
        • needs to respect the cluster’s filesystem’s permissions/file visibility
    2. Question
      • Are there any recommended ways of achieving the above goals?
        Is there a Galaxy config option we missed? Or something more complex but still better than forking Galaxy?
    3. Solutions we thought of
      • We can fork Galaxy and modify the behaviour in the engine’s code if need be
      • We successfully used Data Libraries for input data referenced by symlinks, but
        • that is assuming the data is on the same machine as Galaxy (which won’t be the case)
        • and it only works for input data, not output data
      • We could configure a File Source that points towards the given cluster, and an “Export data” tool to write the data to it. This custom “Export data” tool would be appended to every tool of every workflow.
        • This would add an extra tool invocation for every existing tool, which is very verbose,
          so we would still need to modify the engine to make this implicit.
        • I think the data would be duplicated, stored in galaxy_install_dir/database/objects AND on the external server instead of only on the latter, which is counter-productive.
          Although we could flag the data to be cleared after the workflow finishes, even then it is still inefficient to store it in the first place, especially given the potential file sizes (multi-TB).
      • Maybe use some of the model_operations tools, as seen in https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tools/actions/__init__.py , https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tools/actions/model_operations.py and https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tools/duplicate_file_to_collection.xml , which allow the creation of objects without actually increasing storage use.
      • For real-time access to the results, we could use a REST request to the batch scheduler (for OAR, the /media endpoint with JWT authorisation); see the sketch below.
        • Our Galaxy usernames would correspond to the cluster usernames, and we would require cluster-equivalent SSO authentication for sign-in.
        • Since the /media endpoint enforces file permissions, the correct permissions would be enforced for each user.
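      • For illustration, here is a minimal sketch of what that preview request could look like. The base URL, token, file path, and the exact shape of the /media endpoint are assumptions to be checked against the actual OAR deployment, not a confirmed API.

```python
import requests

# Assumed values: replace with the real OAR API base URL, the user's JWT
# obtained via SSO, and the path of the dataset on the cluster filesystem.
OAR_API = "https://cluster.example.org/oarapi"             # hypothetical base URL
USER_JWT = "eyJhbGciOi..."                                 # per-user token
CLUSTER_PATH = "/home/alice/project/results/output.tsv"    # hypothetical path


def preview_cluster_file(path: str, jwt: str, max_bytes: int = 65536) -> bytes:
    """Fetch (part of) a file through the scheduler's /media endpoint.

    Because the JWT identifies the real cluster user, the scheduler applies
    that user's filesystem permissions; an unauthorised request fails here
    instead of the data being exposed through Galaxy.
    """
    resp = requests.get(
        f"{OAR_API}/media/{path.lstrip('/')}",
        headers={
            "Authorization": f"Bearer {jwt}",
            # A byte range keeps previews of multi-TB files cheap (if supported).
            "Range": f"bytes=0-{max_bytes - 1}",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content


if __name__ == "__main__":
    print(preview_cluster_file(CLUSTER_PATH, USER_JWT)[:200])
```
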
  2. Cluster
    1. Context
      1. We would like to preserve file and job permissions/ownership as the user submitting the job, not as the “galaxy” user.
      2. We see that the recommended way to submit jobs as the real user (per https://docs.galaxyproject.org/en/latest/admin/cluster.html#submitting-jobs-as-the-real-user ) is “by executing a site-customizable script via sudo”, but we don’t have root on the clusters we will be using.
    2. Questions
      • Does anybody have any general feedback on their integration with a cluster?
      • Is there any existing OAR (a batch scheduler) integration? I don’t see any in https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/jobs/runners
      • Does anybody have feedback on restarting the Galaxy instance while a workflow invocation with cluster jobs is running? Is it easy to recover the state of the invocation/jobs?
      • Does anybody have any feedback on getting logs and performance metrics per job, in a cluster context?
    3. Solution we thought of:
      • using REST job submission commands with JWT authorisation
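      • For illustration, a minimal sketch of such a submission, assuming an OAR-style REST endpoint; the URL, endpoint path, and JSON field names are assumptions to verify against the target cluster’s API.

```python
import requests

OAR_API = "https://cluster.example.org/oarapi"  # hypothetical base URL
USER_JWT = "eyJhbGciOi..."                      # per-user token from the SSO


def submit_job(command: str, resources: str, jwt: str) -> dict:
    """Submit a batch job on behalf of the authenticated cluster user.

    The field names ("command", "resource") follow the general shape of OAR's
    job-submission API, but treat them as assumptions to double-check.
    """
    resp = requests.post(
        f"{OAR_API}/jobs",
        headers={"Authorization": f"Bearer {jwt}"},
        json={"command": command, "resource": resources},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # typically contains the external job id


# Example: one node for two hours running a Galaxy-generated job script.
# job = submit_job("bash /path/to/job_script.sh", "/nodes=1,walltime=2:00:00", USER_JWT)
```
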

Thank you,

-Vlad

Hi @vlad.visan

I cross-posted your question over to the Admin matrix chat to bring people in. https://matrix.to/#/!rfLDbcWEWZapZrujix:gitter.im/$S09q5CT-L8YU-gXtyBgq1tX3RTngQbbUNdXJi3Jk-Zo?via=gitter.im&via=matrix.org

They may reply here or there, and for your specific use case, I’d strongly recommend that you both join the chat and review the recent posts. There is a new working group dedicated to running Galaxy instances just like yours! The first meeting is this coming Thursday. I’ll repost the WG link here too → https://galaxyproject.org/events/2024-01-small-scale/

Thank you for the cross-post!

A few random thoughts from my side.

We need all data to remain exclusively on the clusters.

At least for viewing, data needs to be transferred to Galaxy.

Having the data in a different location (with user permissions) kind of screams for S3 storage (or maybe another object store from lib/galaxy/objectstore/). As far as I know, per-user S3 storage might be possible (but maybe a nightmare to configure and maintain). It’s also not widely used yet.

One important point is that a Galaxy instance configured that way probably won’t be able to use Galaxy’s data-sharing capabilities (at least this might be super hard to implement). Also, debugging tool errors will not be easily possible (since Galaxy admins can’t access the data).

File sources as implemented in https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/files/sources probably also work for input…

sudo: On my system the admins only granted sudo permissions for the 2-3 scripts needed to run jobs as the real user, and made sure that nobody can modify them. Without this it won’t be possible to run jobs as a different user.

OAR: not that I know of. As an intermediate workaround you could maybe run Pulsar on your cluster and then schedule jobs locally there.

Jobs and workflow invocations should not be harmed by a restart. But to be on the safe side, you could stop job processing a while before the restart.

Thank you, Matthias.

Data

  1. Use S3 storage: Thank you for the idea. I see it is supported as a file source; I need to experiment.
  2. Data-sharing limitations:
    • True, we are intentionally restricting it, which does seem to go against the general Galaxy philosophy.
    • Our idea to keep it working is:
      • that anybody can see the path of an output of a job
      • but when a user tries to preview it, if we cannot authenticate/authorise as them to the S3 (or similar) service, then the underlying REST file request will fail.
    • As for tool errors, we are assuming the tool writers/users should debug their own errors.
      • But if the error is due to an admin configuration error, then it’s true that it will be harder for the admin to debug; I had not thought of that.

Cluster

  1. Sudo: Good point; maybe having just a few scripts would be doable, and I will have to think about this. Although in our case we will likely use JWT authorisation + REST submission.
  2. OAR: Thanks for the Pulsar suggestion; we will keep it as a back-up. We will likely add an “oar.py” runner, inspired by the “slurm.py” runner in https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/jobs/runners (see the skeleton sketch at the end of this post).
  3. Stopping jobs before a restart: If restarts are rare, then this is indeed a solution whose drawback (lost computing time for huge jobs) becomes amortised over time.
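
To give a feel for what that runner might involve, here is a heavily simplified skeleton. The class and method names follow the pattern of the existing runners in lib/galaxy/jobs/runners, but the destination parameter names, the OAR endpoint, and the response fields are assumptions rather than a working implementation.

```python
# Hypothetical lib/galaxy/jobs/runners/oar.py -- a stripped-down sketch only.
import requests

from galaxy.jobs.runners import AsynchronousJobRunner, AsynchronousJobState

__all__ = ("OARJobRunner",)


class OARJobRunner(AsynchronousJobRunner):
    """Submit and monitor Galaxy jobs through an OAR REST API (sketch)."""

    runner_name = "OARRunner"

    def queue_job(self, job_wrapper):
        # Prepare the job (command line, working directory) like other runners do.
        if not self.prepare_job(job_wrapper):
            return
        destination = job_wrapper.job_destination
        api = destination.params["oar_api_url"]   # assumed destination parameter
        jwt = destination.params["oar_jwt"]       # assumed destination parameter
        resp = requests.post(
            f"{api}/jobs",
            headers={"Authorization": f"Bearer {jwt}"},
            json={
                "command": job_wrapper.runner_command_line,
                "resource": destination.params.get("oar_resources", "/nodes=1"),
            },
            timeout=30,
        )
        resp.raise_for_status()
        external_id = str(resp.json()["id"])      # assumed response field
        job_wrapper.set_external_id(external_id)
        job_state = AsynchronousJobState(
            files_dir=job_wrapper.working_directory,
            job_wrapper=job_wrapper,
            job_id=external_id,
            job_destination=destination,
        )
        self.monitor_queue.put(job_state)

    def check_watched_item(self, job_state):
        # Poll OAR for the job's state and translate it to Galaxy job states.
        raise NotImplementedError

    def stop_job(self, job_wrapper):
        # Cancel the external job (e.g. a DELETE request to the OAR API).
        raise NotImplementedError

    def recover(self, job, job_wrapper):
        # Re-attach to a still-running external job after a Galaxy restart.
        raise NotImplementedError
```

Most of the real work would presumably be in check_watched_item (mapping OAR job states onto Galaxy’s) and in wiring each user’s JWT into the destination parameters, for example through a dynamic destination rule.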