Indexing reference genomes with Data Managers: Resources, tutorials, troubleshooting

jennaj · May 1, 2019, 5:04pm

Hi,

The mm10 reference genome/build is sourced from UCSC.

You’ll need to index the genome on your server with Data Manager tools.

Install the genome with Data Managers (sourced from the ToolShed, all are in a separate category there). Install DMs like any other tool using the Admin functions.

You’ll need these DMs at a minimum. Execute them in this order first:

Fasta fetcher – has an option to pick UCSC as the data source.
SAM indexer
Picard indexer
2bit (twoBit) indexer

Then get the DMs that create indexes for the tools you want to use. Run these after the ones above have completed for the best results.

The above was copied over and slightly updated from this prior Galaxy Biostars Q&A. It has more details (worth reviewing):

https://biostar.usegalaxy.org/p/19371/

A search here with the keyword string “biostars data manager fetch sam picard” will find more prior Q&A that covers many different use cases: https://galaxyproject.org/search/

More details

Be very careful not to run an indexing tool twice – duplicated lines will cause problems and are a hassle to correct.
If you use a “dbkey” that is already available, you must make sure that your genome is an exact match for that content. Genomes that are already indexed on one of the public Galaxy servers (and associated with an existing dbkey) can be found here: http://datacache.galaxyproject.org/
Data Libraries are great for storing commonly used datasets (copies do not consume any extra disk space). You could put a genome fasta in one and use it with tools as a Custom Genome. Precomputed tool indexes cannot be used from a Data Library or a history, so don’t bother to load them there.
Custom genome help is in the two FAQs below. If you are indexing a genome from an uploaded fasta file (for example, because it is not available directly from a known source), the same fasta formatting rules apply:
- Preparing and using a Custom Reference Genome or Build
- Mismatched Chromosome identifiers (and how to avoid them)
But if you are the admin of a server, it is better to install the genome directly and index it for tools on the server. Jobs will use fewer resources and run faster this way, and are much less likely to fail for memory reasons.
Data Manager tools are installed/used under the “Admin” masthead, top section. The “loc” files/table content is there as well.
The jobs that are run by Data Manager tools will be sent to your active history. It is strongly recommended to set up a distinct history for each genome, or batch of genomes, and date it in the name, to better keep track of what was done over time.
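
To make the duplicate-line warning above concrete, here is a minimal Python sketch for spotting repeated entries in a “loc” file before they cause trouble. It assumes the usual tab-separated layout with “#” comment lines. The function name, the example rows, and the paths are made up for illustration and are not taken from any real server.

```python
from collections import defaultdict


def find_duplicate_loc_lines(lines):
    """Return {entry: [line numbers]} for entries that appear more than once.

    Blank lines and lines starting with '#' (comments) are ignored.
    Line numbers are 1-based and count every line, including skipped ones.
    """
    seen = defaultdict(list)
    for number, raw in enumerate(lines, start=1):
        entry = raw.rstrip("\r\n")
        if not entry.strip() or entry.lstrip().startswith("#"):
            continue
        seen[entry].append(number)
    return {entry: numbers for entry, numbers in seen.items() if len(numbers) > 1}


# Hypothetical loc content: the mm10 row was written twice.
example = [
    "# <unique_build_id>\t<dbkey>\t<display_name>\t<file_path>",
    "mm10\tmm10\tMouse (mm10)\t/data/mm10/seq/mm10.fa",
    "hg38\thg38\tHuman (hg38)\t/data/hg38/seq/hg38.fa",
    "mm10\tmm10\tMouse (mm10)\t/data/mm10/seq/mm10.fa",
]

duplicates = find_duplicate_loc_lines(example)
for entry, numbers in duplicates.items():
    print(f"lines {numbers}: {entry}")
```

To check a real file, pass an open file handle instead of the example list. The sketch only reports exact repeats, so rows that differ in any column (for example a different path for the same dbkey) would still need a manual look.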
