Hi @lldelisle
Ok, I understand now. Thanks for explaining!
Since that pipeline with Cufflinks will eventually involve reference annotation (unless it is a purely predictive run?), getting the right annotation could be a wrinkle. It may be better to have the scientist pull in the genome and annotation from the same source, at the same time, then run the workflow. Even nicer would be if the workflow did some standardizing data prep on the reference data, since these tools are so picky, if I am remembering correctly. Example: GTF headers were a problem before. The workflow could always pre-strip those out, since so many data providers include them now, and if a file doesn't have a header, no harm is done beyond creating a duplicate of the file. A minimal sketch of that pre-strip step is below.
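To make that concrete, here is a minimal sketch of the pre-strip step in Python. The file paths and function name are just illustrative, not from any actual IWC workflow. GTF comment/header lines start with `#` and feature lines never do, so dropping them is safe whether or not the file actually has a header:

```python
def strip_gtf_header(in_path: str, out_path: str) -> int:
    """Copy a GTF, dropping '#' comment/header lines; return how many were removed."""
    removed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            # Header/comment lines start with '#'; GTF feature lines never do.
            if line.startswith("#"):
                removed += 1
                continue
            dst.write(line)
    return removed

# Example (hypothetical file names):
# strip_gtf_header("annotation.gtf", "annotation.clean.gtf")
```

If the input has no header, the output is just a copy of the input, which matches the "no harm done" case above.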
Big picture, I think that all IWC workflows are better published with data from the history, since that makes the workflow more "Galaxy server and species/assembly agnostic". Even if the same genome were available, server administrators may not have labeled it the same way (the exact same dbkey). And even then, there is massive confusion between UCSC identifiers and Ensembl identifiers on genomes, how to get the matching annotation for a genome you can't "see" yet, how to get that data cleaned up enough format-wise that all tools across the different development packages can interpret it, etc. Maybe a third of all questions at this forum address that confusion at least in part. But it used to be 80%, so we are making progress!
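For anyone reading along who hasn't hit the UCSC vs Ensembl issue yet, here is a toy sketch of the most common mismatch. This only covers the `chr` prefix difference; real assemblies also differ on mitochondria (chrM vs MT) and on unplaced scaffolds, so in practice a lookup table built from the assembly report is safer:

```python
def to_ucsc(chrom: str) -> str:
    """Ensembl-style '1' -> UCSC-style 'chr1' (prefix-only toy rule)."""
    return chrom if chrom.startswith("chr") else f"chr{chrom}"

def to_ensembl(chrom: str) -> str:
    """UCSC-style 'chr1' -> Ensembl-style '1' (prefix-only toy rule)."""
    return chrom[3:] if chrom.startswith("chr") else chrom

# A BAM mapped against 'chr1' plus a GTF annotated on '1' will quietly
# produce empty or wrong counts; the tool often won't even fail.
```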
We’ve also had people who were using an IWC workflow and didn’t know how to use it with their own custom genome for an organism unlikely to be indexed widely across public servers. Example → Input Custom Reference Genome into Workflow. They were so close, with a custom build and everything! Maybe a future enhancement could offer every way of choosing a genome as options at the top of the launch form, as a sort of meta function, but that seems a while off.
I can let you know that the BRC project has decided to get around all of those problems (using novel genomes and properly pairing up reference data) by creating a sort of website portal that hosts workflows alongside other resources. These are presented as a list of genomes as the starting place (this makes more sense to bench scientists, right?). Those genomes are specific: organism plus assembly version. That means the workflow form can be auto-populated with URLs to public resources that are a correct fit, the workflow then does all the format normalization internally (those steps rarely, if ever, cause problems; they can only fix, or not fix, extra comment lines or similar that a data provider added in), and the user of that workflow doesn’t need to think about the data details at all beyond which species and which analysis to investigate.

I think we’ll probably see more of this strategy since it works, and creating indexes on the fly is a bit easier than attempting to pre-index for every tool, keep it all updated across every possible genome assembly on every public Galaxy site, and then keep the hard-coded dbkeys inside workflows in sync too. I think there is a plan to save back and capture prior on-the-fly indexes as a sort of mini “reuse prior job run data” function, but I don’t know if that is still the current thinking.
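Just to illustrate the pairing idea (none of this is the portal's actual code; the registry contents, URLs, and field names are all hypothetical), the key move is that organism plus assembly version selects both files at once, so they can never be mismatched:

```python
# Hypothetical registry: one key yields a genome AND its matching annotation.
REGISTRY = {
    ("Homo sapiens", "GRCh38"): {
        "genome_url": "https://example.org/GRCh38/genome.fa.gz",          # hypothetical URL
        "annotation_url": "https://example.org/GRCh38/annotation.gtf.gz",  # hypothetical URL
    },
}

def populate_workflow_inputs(organism: str, assembly: str) -> dict:
    """Return the paired URLs a launch form could be pre-filled with."""
    entry = REGISTRY[(organism, assembly)]
    return {"genome": entry["genome_url"], "annotation": entry["annotation_url"]}

# Example: populate_workflow_inputs("Homo sapiens", "GRCh38")
```

The user only ever picks the genome from the list; the correct-fit URLs and the format normalization happen behind the form.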
I wrote too much, but I think this is all worth discussing, and you can point people here so they can understand how complicated this is to actually do in practical ways. These are things our project knows about, and you and I know about, but someone newer to computational biology may not. Tools need very specific inputs or the outputs are not good (even if the tool doesn’t fail), and that is easier now, but not yet easy.
Thanks!