Custom set of reference indexes for transcriptomics processing

We’re developing a transcriptomics processing pipeline which involves multiple steps (tools) which we want to keep together in a single Galaxy tool, rather than parting them out as individual tools in Galaxy.

As part of this we want to tightly control the reference indexes used at each step (e.g. HG38 indexed for STAR, HISAT2, and the associated transcripts for Salmon).

Ideally we’d like a way of bundling all needed reference indexes in one set or collection that would be used by our pipeline when it’s a Galaxy tool. To be clear, unlike other tools, we’d be hard coding the reference(s) ahead of time so the user wouldn’t have a choice.

The reason for that is to ensure that any run of the Galaxy version of our pipeline is in lock step with the offline version we’re running elsewhere.

I’ve read a little about Data Managers being used to get custom references into a Galaxy server. However, it wasn’t clear whether they would be appropriate for a set of indexes bundled together, rather than just one.

Any suggestions on how to do this (or a pointer to documentation that I missed) would be helpful.

Thanks,
Chris


Hi @ChristopherWilks

The indexes created by Data Managers are grouped by “dbkey”, which is the primary key in the data tables. The “dbkey” is set when you first install the baseline genome with the “Fetch genome” tool. All downstream indexes can then reuse that same “dbkey” (and the underlying installed fasta for the genome).
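To make the grouping concrete, here is a minimal Python sketch of how entries from several tool data tables line up on a shared “dbkey”. It is an illustration only, not Galaxy code: the tab-separated column order (value, dbkey, name, path), the sample rows, the paths, and the helper names `parse_loc` and `group_by_dbkey` are all assumptions, and real .loc layouts vary by tool.

```python
from collections import defaultdict

# Hypothetical .loc-style content. Assumed columns: value, dbkey, name, path.
# Real data table layouts differ from tool to tool.
STAR_LOC = "hg38_star\thg38\tHuman (hg38) STAR\t/data/indexes/hg38/star\n"
HISAT2_LOC = (
    "# comment lines start with '#'\n"
    "hg38_hisat2\thg38\tHuman (hg38) HISAT2\t/data/indexes/hg38/hisat2\n"
    "mm10_hisat2\tmm10\tMouse (mm10) HISAT2\t/data/indexes/mm10/hisat2\n"
)


def parse_loc(text):
    """Yield (dbkey, path) pairs, skipping blank and comment lines."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        _value, dbkey, _name, path = line.split("\t")
        yield dbkey, path


def group_by_dbkey(tables):
    """Map each dbkey to a {tool: index_path} dict across all tables."""
    grouped = defaultdict(dict)
    for tool, text in tables.items():
        for dbkey, path in parse_loc(text):
            grouped[dbkey][tool] = path
    return dict(grouped)


indexes = group_by_dbkey({"star": STAR_LOC, "hisat2": HISAT2_LOC})
print(indexes["hg38"])
```

In this sketch, everything indexed for the same genome ends up under one key, which is the relationship the data tables already encode.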

The simplest way to restrict users to a single built-in genome is to index only that one genome on your server. That way, end users have no other built-in genome to choose at runtime when using a Workflow. They could still upload a Custom genome and attempt to use that, but you could disable that functionality by modifying the tool wrappers.

Hard-coding the reference genome is also probably possible, but that means modifying the wrapper for every tool in your pipeline to point to your “reference genome” explicitly. Be aware that there are no Data Managers that index “reference annotation” specifically (except for a few tools: RNA-Star is one example, Featurecounts is another). So, you’ll need to either provide the reference annotation (GTF) someplace where everyone has access to it (Data Library, Shared History) or also hard-code that into the tool wrappers and the data tables they access.

Changing the tool wrappers will mean that your pipeline won’t be available to others who are working on a different server, unless you bundle it all up into a Docker image or similar, but maybe that is not a concern.

All that said, creating a “meta” data table structure that bundles together all the indexes based on the same “dbkey” is an interesting idea. You could suggest that as an enhancement to the development team here: https://github.com/galaxyproject/galaxy/issues
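As a rough idea of what such a bundle lookup might do, here is a short Python sketch. It is purely hypothetical and not an existing Galaxy feature or API: the `resolve_bundle` function, the `REQUIRED_TOOLS` tuple, the tool names, and the paths are all invented for illustration. The point is that a pinned “dbkey” either resolves to a complete set of indexes or fails loudly.

```python
# Hypothetical: the tools whose indexes must all exist for a bundle to be valid.
REQUIRED_TOOLS = ("star", "hisat2", "salmon")


def resolve_bundle(dbkey, indexes_by_tool, required=REQUIRED_TOOLS):
    """Return {tool: path} for dbkey, or raise if any required index is missing."""
    missing = [tool for tool in required
               if dbkey not in indexes_by_tool.get(tool, {})]
    if missing:
        raise LookupError(
            f"dbkey {dbkey!r} has no index for: {', '.join(missing)}"
        )
    return {tool: indexes_by_tool[tool][dbkey] for tool in required}


# Made-up example of what is installed on a server, keyed by tool then dbkey.
installed = {
    "star": {"hg38": "/data/indexes/hg38/star"},
    "hisat2": {
        "hg38": "/data/indexes/hg38/hisat2",
        "mm10": "/data/indexes/mm10/hisat2",
    },
    "salmon": {"hg38": "/data/indexes/hg38/salmon"},
}

bundle = resolve_bundle("hg38", installed)
print(bundle)
```

With this made-up data, asking for “mm10” would raise a `LookupError` because only HISAT2 has an index for it, which is the kind of all-or-nothing check a bundled reference set would need.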

Thanks!

Thanks for the quick and detailed response, Jennifer!

I’ve started down the Data Manager path and will continue with that.
It seems like that’s probably the best way to do what we want for now.

We may also file a suggestion for an enhancement with the dev team as you said, as ultimately we’d like to have this pipeline and the references as a bundle in the main, public Galaxy servers. But we still need to work out the Galaxy integration on our own server before we start that process.

Thanks again,
Chris