Indexing reference genomes with Data Managers: Resources, tutorials, troubleshooting

Hi Jennaj,
I think I've figured out where the issue is: these indexers require some dependencies that are missing on my Galaxy server. The most common one is samtools. I searched online but found no clear guide on how to install samtools. Could you help me here? Thanks :slight_smile:

Hi Jennaj,
Crazy, I have figured out how to install the missing samtools by going through the Galaxy documentation on Conda for tool dependencies, which states that if you set “conda_auto_install” to “true”, Galaxy will look for and install Conda packages for missing tool dependencies before running a job.

I did exactly what it says and ran the indexer again, and it all works like magic :slight_smile: Thanks a lot for your nice guide.
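
For anyone following along, the setting described above can be sketched as a fragment of Galaxy's main configuration file. This assumes a release that reads `config/galaxy.yml` with options under a top-level `galaxy:` key; older releases use an INI file instead, so check which one your server has.

```yaml
# config/galaxy.yml (fragment; the file location is an assumption about your install)
galaxy:
  # Let Galaxy look for and install Conda packages for missing
  # tool dependencies (such as samtools) before running a job.
  conda_auto_install: true
```

A restart of Galaxy is usually needed before a change like this takes effect.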


@YONG_JIA Super, glad you found our docs ( https://docs.galaxyproject.org ) and the help solved your problem :sunny:

Hi Jennaj,
Sorry to bother you again. Regarding reference genome preparation, you have suggested doing the four basic indexing steps first and then the tool-specific indexing. What is the reason for doing this? Can I just go straight to the tool-specific indexing, such as the RNA STAR and HISAT indexers?

I wonder whether you could help with another problem. I got a "java.lang.OutOfMemoryError" error during the Picard indexing. I found that the exact same issue was posted earlier by someone else on GitHub:


An answer was posted there, but I couldn’t really understand how to apply it. I followed up on the question two days ago but haven’t received a response yet. Thanks a lot :sunny:


Hi @YONG_JIA

If you haven’t created the SAMtools, Picard, and 2bit indexes first, tools that rely on them can run into problems.

The post you reference covers that error when running Picard tools on the command line. For indexing with a Data Manager in Galaxy, the same error likely means that your local Galaxy does not have enough memory to index the genome. If you are trying to index wheat from most public sources, there will never be enough memory (the genome is simply too large). Using the PLAZA version can help reduce the amount of memory needed because of the way it was reorganized, but the requirement will still be substantial.
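
If the server does have spare physical memory, one option sometimes used is to raise the Java heap limit for jobs through the job configuration. The fragment below is only a sketch under several assumptions: a Galaxy release that reads a YAML job configuration (`config/job_conf.yml`), a single local environment named `local`, and an illustrative `-Xmx16G` value that must stay below the machine's actual RAM. It will not help when a genome needs more memory than the machine has, as in the wheat case mentioned above.

```yaml
# config/job_conf.yml (fragment; names and values are illustrative assumptions)
runners:
  local:
    load: galaxy.jobs.runners.local:LocalJobRunner
execution:
  default: local
  environments:
    local:
      runner: local
      env:
        # _JAVA_OPTIONS is read by the JVM at startup; -Xmx sets the maximum heap size.
        - name: _JAVA_OPTIONS
          value: -Xmx16G
```

Whether the Picard Data Manager honors this variable on a given install is worth confirming in the job's standard error output before relying on it.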
