usegalaxy.eu cufflinks no cached reference data

lldelisle · January 22, 2025, 5:04pm

Dear usegalaxy.eu admins,
It seems that the table of cached reference data is not well defined.
When I use a BAM file with no database assigned as input:
On usegalaxy.eu, I have only a few options:

On usegalaxy.org, I have plenty:

usegalaxy.eu is missing mm39, among others. Would it be possible to add it, please?

Best,

Lucille

lldelisle · January 22, 2025, 5:05pm

I don’t know if this is related to:

jennaj · January 22, 2025, 8:34pm

Hi @lldelisle

Yes, this is probably partially related, but I also see many of the new genomes related to the BRC project (these have recently increased what is included in CVMFS), and I also spot some of the genomes we indexed in the pre-Data Manager times (“byhand”). That data flows out to all the public servers, and to anyone else pulling in the datacache archive, but how that content loops into some tool forms (older styles) might not be automatic.

One more wrinkle since you pointed this one out and it is a good example of how complex this is: the GRCm39/mm39 index was made available at ORG as a sort of proof-of-method test using a new process, so it is a bit special and new under the hood. The EU server has that genome available in certain other tools that ORG does not. These have identical scientific content but are organized differently (paths, metadata, etc.).

The point is the UseGalaxy servers might have “more or less” data than others, along with overlap, and that can vary by where one is looking. Syncing all this up with a long-term data management strategy is the primary goal for the IDC.

Lucille, if there is something you want to use right now, making that available is usually possible for simple indexes (for others, the recommendation might be to move to the server that has your target data, or to load it into your history since that works everywhere). Is it just mm39 that you want to use at EU with Cufflinks, or are there other genomes? If this is to support a tutorial of course that would get more attention, too!

Let’s also ask the EU admins. Ping @wm75 what do you think?

If you want to post back what you need, too, that might also help. Or, you can create an issue at the IDC repository and describe what you are looking for, link to a file, etc.

Let’s start there, thanks!

lldelisle · January 22, 2025, 10:13pm

I just told my users that they can use the IWC workflows on usegalaxy.eu, and when they ran them with mm39 they got an error, so I was a bit disappointed… In the end I changed the workflow to use a genome ‘from history’, but I think this may concern other users… As I know where the problem comes from, I wanted to write a topic so that anyone else who gets the error has something to refer to.

jennaj · January 24, 2025, 12:27am

Hi @lldelisle

Ok, I understand now. Thanks for explaining!

Since that pipeline with Cufflinks will eventually involve reference annotation (unless it is a purely predictive run?), getting the right annotation could be a wrinkle. Maybe better to have the scientist pull in the genome and annotation, from the same source, at the same time, then run it. Even nicer if the workflow did some standardizing data prep on the reference data, since these tools are so picky if I am remembering correctly. Example: GTF headers were a problem before. The workflow could always pre-strip those out since so many data providers are including them now, and if a file doesn’t have a header, no harm is done beyond duplicating a smaller file.
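To make the GTF header idea concrete, here is a minimal Python sketch of the kind of pre-strip step described above. It is an illustration only, not how any IWC workflow actually does it: the function name `strip_gtf_header` and the sample lines are hypothetical, and it drops every line starting with `#` rather than only the lines at the top of the file.

```python
import io


def strip_gtf_header(lines):
    """Yield GTF lines, skipping comment/header lines that start with '#'.

    Assumption: any line starting with '#' is a header or comment and is
    safe to drop. A file without such lines passes through unchanged.
    """
    for line in lines:
        if line.startswith("#"):
            continue
        yield line


# Illustrative sample data (made up for this sketch).
sample_gtf = (
    "#!genome-build GRCm39\n"
    "#!genome-version GRCm39\n"
    'chr1\texample\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\n'
)

cleaned = list(strip_gtf_header(io.StringIO(sample_gtf)))
print("".join(cleaned), end="")
```

In a workflow the same effect is usually achieved with a text-filtering step on the annotation dataset before it reaches the picky tool.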

Big picture, I think that all IWC workflows are better published with data from the history since that makes the workflow more “Galaxy server and species/assembly agnostic”. Even if the same genome is available, server administrators may not have labeled it the same way (exact same dbkey). Even then, there is massive confusion between UCSC identifiers and Ensembl identifiers on genomes, how to get the matching annotation for a genome you can’t “see” yet, how to get that data cleaned up enough format-wise that all tools across the different development packages used can interpret them, et cetera. Maybe a third of all questions at this forum are addressing that confusion at least in part. But it used to be 80%, so we are making progress!
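As an illustration of the UCSC versus Ensembl identifier problem, the sketch below compares the sequence names in a genome FASTA with those in the first column of a GTF and reports annotation names that have no match in the genome (for example `1` in the annotation against `chr1` in the genome). The function names and the sample data are hypothetical, and this is only a quick sanity check under those assumptions, not a Galaxy tool.

```python
def fasta_seq_names(fasta_lines):
    """Collect sequence names from FASTA header lines ('>name description')."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def gtf_seq_names(gtf_lines):
    """Collect the values of the first (seqname) column of a GTF."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/header lines and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_gtf_names(fasta_lines, gtf_lines):
    """Return sorted GTF sequence names that are absent from the FASTA."""
    return sorted(gtf_seq_names(gtf_lines) - fasta_seq_names(fasta_lines))


# Illustrative sample data (made up for this sketch): the genome uses
# UCSC-style names, while one annotation line uses an Ensembl-style name.
fasta = [">chr1\n", "ACGT\n", ">chr2\n", "ACGT\n"]
gtf = [
    '1\texample\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n',
    'chr2\texample\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n',
]

mismatches = unmatched_gtf_names(fasta, gtf)
print(mismatches)
```

An empty result does not prove the genome and annotation come from the same assembly; it only rules out the most common naming mismatch.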

We’ve also had people who were using an IWC workflow and didn’t know how to use it with their own custom genome for an organism unlikely to be indexed widely across public servers. Example → Input Custom Reference Genome into Workflow. They were so close with a custom build and everything! Maybe a future enhancement can make choosing genomes all ways possible as options at the top of the launch form as a sort of meta function but that seems a while off.

I can let you know that the BRC project has decided to get around all of those problems (using novel genomes and properly pairing up reference data) by creating a sort of website portal that hosts workflows alongside other resources. These are presented as a list of genomes as the starting place (this makes more sense to bench scientists, right?). Those genomes are specific: organism and assembly version. That means the workflow form can be auto-populated with URLs to public resources that are a correct fit. The workflow then does all the format normalization internally, since those steps rarely (ever?) cause problems: they can only fix things, or change nothing, when a data provider adds in extra comment lines or similar. The user of that workflow then doesn’t need to think about the data details at all beyond what species and what analysis to investigate. I think we’ll probably see more of this strategy since it works, and creating indexes on the fly is a bit easier than attempting to pre-index for every tool, then keeping it all updated across every possible genome assembly across every public Galaxy site, and then referencing those internally in workflows (hard-coded dbkeys). I think there is a plan to save back and capture prior on-the-fly indexes as a sort of mini “reuse prior job run data” function, but I don’t know if that is still the current thinking.

I wrote too much, but I think this is all worth discussing, and you can point people here so they can understand how complicated this is to do in practice. These are things our project knows about, and you and I know about, but someone newer to computational biology may not. Tools need very specific inputs or the outputs are not good (even if the tool doesn’t fail), and that is easier now but not yet easy.

Thanks!

mvdbeek · January 24, 2025, 9:45am

Absolutely. As for the remaining discussion, BRC targets .org for the first iteration, and eventually we’ll want to rework reference data access, but I don’t think this is the right venue to discuss that.

jennaj · January 24, 2025, 7:19pm

Thanks Marius, I agree! And Lucille, maybe this helps? However this works out is fine with me but I think some clarity would be helpful for everyone. Suggestion: Source for input reference data defaults to "from the history"? · Issue #648 · galaxyproject/iwc · GitHub
