Dear usegalaxy.eu admins,
It seems that the table with the cached reference data is not well defined:
If I use a BAM file with no database as input:
On usegalaxy.eu, I am offered only a few options:
Yes, this is probably partially related, but I also see many of the new genomes related to the BRC project (these have recently increased what is included in CVMFS), and I also spot some of the genomes we indexed "by hand" in the pre-Data Manager times. That data flows out to all the public servers, and to anyone else pulling in the data cache archive, but how that content loops into some tool forms (older styles) might not be automatic.
One more wrinkle, since you pointed this one out and it is a good example of how complex this is: the GRCm39/mm39 index was made available at ORG as a sort of proof-of-method test using a new process, so it is a bit special and new under the hood. The EU server has that genome available in certain other tools that ORG does not. These have identical scientific content but are organized differently (paths, metadata, etc.).
The point is that the UseGalaxy servers might each have "more or less" data than the others, along with overlap, and that can vary by where one is looking. Syncing all this up with a long-term data management strategy is the primary goal for the IDC.
Lucille, if there is something you want to use right now, making that available is usually possible for simple indexes (for others, the recommendation might be to move to the server that has your target data, or to load it into your history since that works everywhere). Is it just mm39 that you want to use at EU with Cufflinks, or are there other genomes? If this is to support a tutorial of course that would get more attention, too!
Let's also ask the EU admins. Ping @wm75, what do you think?
If you want to post back what you need, too, that might also help. Or, you can create an issue at the IDC repository and describe what you are looking for, link to a file, etc.
I just told my users that they can use the IWC workflows on usegalaxy.eu, and when they ran them with mm39 they got an error, so I was a bit disappointed… In the end I changed the workflow to use a genome "from history", but I think this may concern other users… Since I know where the problem comes from, I wanted to write a topic so that anyone else who gets this error can refer to it.
Since that pipeline with Cufflinks will eventually involve reference annotation (unless it is a purely predictive run?), getting the right annotation could be a wrinkle. It may be better to have the scientist pull in the genome and annotation, from the same source, at the same time, then run it. Even nicer if the workflow did some standardizing data prep on the reference data, since these tools are so picky, if I am remembering correctly. Example: GTF headers were a problem before. The workflow could always pre-strip those out since so many data providers are including them now, and if a file doesn't have a header, no harm is done beyond duplicating a smaller file.
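To make the "pre-strip the GTF headers" idea concrete, here is a minimal sketch in Python. It assumes the headers in question are comment lines that start with `#`; the function name `strip_gtf_header` and the sample rows are made up for illustration and are not taken from any IWC workflow. In a real workflow this would more likely be a text-manipulation tool step than a script.

```python
import io


def strip_gtf_header(lines):
    """Yield GTF lines, skipping comment/header lines that start with '#'.

    Assumption: headers are '#'-prefixed lines; everything else is kept as-is.
    """
    for line in lines:
        if line.startswith("#"):
            continue
        yield line


# Illustrative input only: two header lines and one made-up feature row.
example = io.StringIO(
    "#!genome-build GRCm39\n"
    "#!genome-version GRCm39\n"
    'chr1\texample\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\n'
)

cleaned = list(strip_gtf_header(example))
print("".join(cleaned), end="")
```

A file without any header passes through unchanged, which matches the point above: the only cost is a duplicated file.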
Big picture, I think that all IWC workflows are better published with data from the history, since that makes the workflow more "Galaxy server and species/assembly agnostic". Even if the same genome is available, server administrators may not have labeled it the same way (exact same dbkey), and even then there is massive confusion between UCSC identifiers and Ensembl identifiers on genomes, how to get the matching annotation for a genome you can't "see" yet, how to get that data cleaned up enough format-wise that all tools across the different development packages used can interpret it, etcetera. Maybe a third of all questions at this forum are addressing that confusion at least in part. But it used to be 80%, so we are making progress!
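As a small illustration of the identifier confusion, the sketch below compares the sequence names in a genome FASTA with those in a GTF annotation and reports annotation names that the genome does not contain. UCSC-style names such as `chr1`/`chrM` versus Ensembl-style `1`/`MT` are the classic mismatch. The function names and the tiny inline data are hypothetical, for illustration only.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines ('>name optional text')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_names(lines):
    """Collect the first column (sequence name) of non-comment GTF lines."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_names(fasta_lines, gtf_lines):
    """Return annotation sequence names that are absent from the genome."""
    return sorted(gtf_names(gtf_lines) - fasta_names(fasta_lines))


# Made-up example: UCSC-style genome names, Ensembl-style annotation names.
genome = [">chr1\n", "ACGT\n", ">chrM\n", "ACGT\n"]
annotation = [
    '1\texample\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n',
    'MT\texample\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n',
]

print(unmatched_names(genome, annotation))  # prints ['1', 'MT']
```

An empty list would mean every annotated sequence name exists in the genome; any leftover names are a hint that the genome and annotation came from different sources.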
We've also had people who were using an IWC workflow and didn't know how to use it with their own custom genome for an organism unlikely to be indexed widely across public servers. Example: Input Custom Reference Genome into Workflow. They were so close, with a custom build and everything! Maybe a future enhancement can offer all the possible ways of choosing a genome as options at the top of the launch form, as a sort of meta function, but that seems a while off.
I can let you know that the BRC project has decided to get around all of those problems (using novel genomes and properly pairing up reference data) by creating a sort of website portal that hosts workflows alongside other resources. These are presented with a list of genomes as the starting place (this makes more sense to bench scientists, right?). Those genomes are specific: organism and assembly version. That means the workflow form can be auto-populated with URLs to public resources that are a correct fit. The workflow then does all the format normalization internally (those steps rarely, if ever, cause problems: they either fix something or change nothing, depending on whether a data provider added extra comment lines or similar), so the user of that workflow doesn't need to think about the data details at all beyond what species and what analysis to investigate. I think we'll probably see more of this strategy since it works, and creating indexes on the fly is a bit easier than attempting to pre-index for every tool, then keeping it all updated across every possible genome assembly across every public Galaxy site, and then matching those inside workflows (hard-coded dbkeys). I think there is a plan to save back and capture prior on-the-fly indexes as a sort of mini "reuse prior job run data" function, but I don't know if that is still the current thinking.
I wrote too much, but I think this is all worth discussing, and you can point people here so they can understand how complicated this is to actually do in practical ways. These are things our project knows about, and you and I know about, but someone newer to computational biology may not. Tools need very specific inputs or the outputs are not good, even if the tool doesn't fail, and that is easier now but not yet easy.
Absolutely. As for the remaining discussion, BRC targets .org for the first iteration, and eventually we'll want to rework reference data access, but I don't think this is the right avenue to discuss that.