Dear Galaxy admins/users,
Overall context:
- In a university context, across multiple clusters, teams, and fields
- We are experimenting with a local Galaxy instance that we want to adapt to a cluster setting.
- Galaxy will be hosted on a virtual machine.
- Galaxy will submit jobs to several (non-local) clusters using REST requests with JWT authorisation.
- We don’t have root on the clusters.
- We need all data to remain exclusively on the clusters.
- New tools would be created regularly
- Reasons we are considering choosing Galaxy instead of another workflow management platform:
- Excellent GUI for non-technical users
- Able to precisely see how a specific datum was generated, including the inheritance chain
- Sharing of workflows + data
- RBAC
- Cluster integration, including dynamic destination methods
- Collections
- Conditional workflow steps
Here are our specific questions, any answer to any sub-part is appreciated:
- Data
- Context:
- We need the files in galaxy_install_dir/database/objects to be references containing the cluster name + filepath (within that cluster), instead of being the actual files.
- To be clear, we don’t want to duplicate the data, but just refer to the data on the cluster.
- Real-time access to the results from the Galaxy GUI
- still needs to be possible
- needs to respect the cluster’s filesystem’s permissions/file visibility
- Question
- Are there any recommended ways of achieving the above goals?
A Galaxy config option we missed? Or something more complex but still better than forking Galaxy?
- Solutions we thought of
- We can fork Galaxy and modify the behaviour in the engine’s code if need be
- We successfully used Data Libraries for input data referenced by symlinks, but
- that is assuming the data is on the same machine as Galaxy (which won’t be the case)
- and it only works for input data, not output data
- We could configure a File Source that points towards the given cluster, and an “Export data” tool to write the data to it. This custom “Export data” tool would be appended to every tool of every workflow.
- This would add an extra tool for every existing tool, which is very verbose, so we would still need to modify the engine to make the export implicit.
- The data would also be duplicated, in galaxy_install_dir/database/objects AND on the external server, instead of existing only on the latter, which is counter-productive. We could flag the data to be cleared after the workflow finishes, but even then it is inefficient to store it in the first place, especially given the potential file sizes (multi-TB).
- Maybe use some model_operations tools as seen in https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tools/actions/__init__.py , https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tools/actions/model_operations.py and https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tools/duplicate_file_to_collection.xml which allow the creation of objects without actually increasing storage
- For real-time access to the results, we could use a REST request to the batch scheduler (/media endpoint with JWT authorisation for OAR)
- Also, our Galaxy usernames would correspond to the cluster usernames, and we would require cluster-equivalent SSO authentication for sign-in.
- Since the /media endpoint enforces file permissions, the correct filesystem permissions would be enforced for each user.
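To make the real-time-access idea above concrete, here is a minimal sketch of building a per-user authorised request to the scheduler's /media endpoint. The API base URL is hypothetical, and the exact endpoint path/semantics are assumptions modelled on OAR's REST API rather than verified against a specific version:

```python
# Sketch of permission-respecting, per-user file access via the scheduler's REST API.
# The API base URL and /media path shape are assumptions, not a verified OAR spec.
import urllib.parse
import urllib.request


def build_media_request(api_base: str, cluster_path: str, user_jwt: str) -> urllib.request.Request:
    """Build (but do not send) an authorised GET for a file on the cluster.

    Because the JWT belongs to the signed-in user, the scheduler's /media
    endpoint can enforce that user's filesystem permissions on the cluster side.
    """
    url = f"{api_base}/media/{urllib.parse.quote(cluster_path.lstrip('/'))}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {user_jwt}"})


req = build_media_request(
    "https://cluster.example.org/oarapi",  # hypothetical API base URL
    "/home/alice/results/output.vcf",
    "eyJhbGciOi...",                       # the signed-in user's JWT (truncated)
)
```

Galaxy-side, such a request could back a custom File Source or display hook, so the GUI never needs a local copy of the data.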
- Cluster
- Context
- We would like file and job permissions/ownership to belong to the user submitting the job, not to the “galaxy” user.
- The recommended way to submit jobs as the real user (https://docs.galaxyproject.org/en/latest/admin/cluster.html#submitting-jobs-as-the-real-user) is “by executing a site-customizable script via sudo”, but we don’t have root on the clusters we will be using.
- Questions
- Does anybody have any general feedback on their integration with a cluster?
- Is there any existing integration for OAR (a batch scheduler)? I don’t see any in https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/jobs/runners
- Does anybody have feedback on restarting the Galaxy instance while a workflow invocation with cluster jobs is running? Is it easy to recover the state of the invocation/jobs?
- Does anybody have any feedback on getting logs and performance metrics per job, in a cluster context?
- Solution we thought of:
- using REST job submission commands with JWT authorisation
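As an illustration of that idea, here is a minimal sketch of building such a submission request. The /jobs endpoint and the "resource"/"command" payload fields are assumptions loosely modelled on the OAR REST API and may differ across versions; the URL and token are placeholders:

```python
# Sketch of job submission via the scheduler's REST API using the user's own JWT,
# so the job runs under that user's account rather than a shared "galaxy" user.
# Endpoint path and payload field names are assumptions, not a verified OAR spec.
import json
import urllib.request


def build_submit_request(api_base: str, command: str, resources: str, user_jwt: str) -> urllib.request.Request:
    """Build (but do not send) a POST that submits a cluster job as the real user."""
    payload = json.dumps({"resource": resources, "command": command}).encode()
    return urllib.request.Request(
        f"{api_base}/jobs",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {user_jwt}",  # per-user token, not a service account
            "Content-Type": "application/json",
        },
    )


req = build_submit_request(
    "https://cluster.example.org/oarapi",   # hypothetical API base URL
    "bash tool_script.sh",
    "/nodes=1/core=4,walltime=02:00:00",
    "eyJhbGciOi...",                        # the submitting user's JWT (truncated)
)
```

A custom Galaxy job runner could wrap exactly this kind of request in its `queue_job`/`check_watched_items` lifecycle, which would also sidestep the sudo-based real-user mechanism we cannot use.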
Thank you,
-Vlad