Indexing reference genomes with Data Managers: Resources, tutorials, troubleshooting

Hi,

The mm10 reference genome/build is sourced from UCSC.

You’ll need to index the genome on your server with Data Manager tools.

Install the genome with Data Managers (DMs). These come from the ToolShed, where they are grouped in their own category. Install DMs like any other tool, using the Admin functions.

You’ll need these DMs at a minimum. Run them first, in this order:

  • Fasta fetcher – has an option to pick UCSC as the data source.
  • SAM indexer
  • Picard indexer
  • 2bit (twoBit) indexer

Then get the DMs that create indexes for the tools you want to use. For the best results, run these only after the DMs listed above have completed.

The above was copied over and slightly updated from this prior Galaxy Biostars Q&A. It has more details (worth reviewing):

https://biostar.usegalaxy.org/p/19371/

A search here with the keyword string “biostars data manager fetch sam picard” will find more prior Q&A that covers many different use cases: https://galaxyproject.org/search/

More details
  • Be very careful not to run an indexing tool twice on the same genome – the duplicated lines this leaves in the “loc” files will cause problems and are a hassle to correct.
  • If you use a “dbkey” that is already available, you must make sure that your genome is an exact match for that content. Genomes that are already indexed on one of the public Galaxy servers (and associated with an existing dbkey) can be found here: http://datacache.galaxyproject.org/
  • Data Libraries are great for storing commonly used datasets (copies do not consume any additional disk space). You could put a genome fasta in one and use it with tools as a Custom Genome. Precomputed tool indexes cannot be used from a Data Library or from a history, so don’t bother to load them.
  • Custom genome help is in this FAQ and a related FAQ. If you are indexing a genome from an uploaded fasta file (for example, because it is not available directly from a known source), the same fasta formatting rules apply.
  • But IF you are the admin of a server, it would be better to install the genome directly and index it for tools on the server. Jobs will use fewer resources and run faster this way, and will be much less likely to fail for memory reasons.
  • Data Manager tools are installed/used under the “Admin” masthead, top section. The “loc” files/table content is there as well.
  • The jobs that are run by Data Manager tools will be sent to your active history. It is strongly recommended to set up a distinct history for each genome, or batch of genomes, and date it in the name, to better keep track of what was done over time.
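
If you suspect a Data Manager was run twice, the duplicated lines mentioned above can be spotted with a small script before they cause trouble. This is only a minimal sketch, not part of Galaxy: the function name `find_duplicate_loc_lines` is made up for illustration, and it assumes a “loc” file is plain text with one entry per line and that comment lines start with “#”. Pass it the path to whichever “loc” file your server uses.

```python
from collections import Counter
from pathlib import Path
import sys


def find_duplicate_loc_lines(loc_path):
    """Return {line: count} for data lines that appear more than once.

    Hypothetical helper for illustration only; it is not a Galaxy API.
    """
    counts = Counter()
    for raw in Path(loc_path).read_text().splitlines():
        # Ignore trailing whitespace so otherwise identical lines compare equal.
        line = raw.rstrip()
        # Skip blank lines and comment lines (assumed to start with "#").
        if not line or line.startswith("#"):
            continue
        counts[line] += 1
    return {line: n for line, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Usage: python check_loc.py path/to/some_file.loc [more files ...]
    for path in sys.argv[1:]:
        for line, n in find_duplicate_loc_lines(path).items():
            print(f"{path}: {n} copies of: {line}")
```

The script only reports duplicates and does not edit anything. Back up the “loc” file before correcting it by hand.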