Indexing reference genomes with Data Managers: Resources, tutorials, troubleshooting

You’ll need to contact your admins again. This error means there is a Samtools dependency problem. They might need to uninstall/reinstall HISAT2 using the “manage dependencies” option. Also ask them to make sure they are running the latest version of Galaxy, including checking for point-updates since the original release (19.01). Installing the most current version of HISAT2 would also be important (2.1.0+galaxy4).

If those admins need help, they can reach the developers at this Gitter chat: https://gitter.im/galaxyproject/dev. In some cases, getting dependencies sorted out correctly takes a bit more effort. I see this kind of error reported across many different tools/platforms (from a Google search) and it seems to be linked to a laundry list of factors: OpenSSL, bioconda, conda, conda-forge, et cetera. The “fix” details are not specific for all, but people are addressing it successfully. So, please start by getting Galaxy and the tool updated, see if that resolves the problems, and if not, have them report it at Gitter for help. HISAT2 is working correctly on the public servers.

Update: I decided to start up the Gitter question just to see if someone already recognizes the problem/knows the solution. They may write back here, or in the chat: https://gitter.im/galaxyproject/dev?at=5cd5b632bdc3b64fcf2389aa

Having a recent version of conda is also important, see https://docs.galaxyproject.org/en/latest/admin/conda_faq.html#how-can-i-upgrade-conda

Hi Jennaj,
I am excited to learn Galaxy and just installed it on my personal PC. I’m keen to install a reference genome (I am working on wheat) for my RNAseq analysis.

Forgive me if my question sounds stupid to you, but I am still not quite clear on the exact steps needed to download and build the reference genome using a Data Manager. Before, I thought a DM was just a single tool that I could find in the Tool Shed; now it seems it is a set of tools required for data preparation. Do you have any plans to do a step-by-step tutorial on reference genome preparation?

BTW, I am not quite sure what kind of wheat genome data is required for a reference genome. Thanks in advance for your response :slight_smile:

Another issue I encountered: after installing the data_manager_fetch_genome_dbkeys_all_fasta tool from the Tool Shed, I cannot find it when I search my tool list. Does anyone know why? All the other tools I installed appear in my tool list, just not this one. So weird.

Review here. It has many details, more than in my original post, including a video: Galaxy Community Hub

But I still strongly recommend running the DMs in the order I suggest. Let that first set finish completely, one by one. Then you can run other indexes in batch, but only if you have enough resources allocated on your Galaxy to do that (capacity to run multiple high-memory jobs, plus space to store the results). Some indexes can be imported pre-computed from CVMFS; the DM form will note that if available.

Ephemeris has a “data manager” mode for batch work, though it is a bit more work to set up. If interested, see: Welcome to the Ephemeris documentation! (Ephemeris 0.10.3 documentation)

The tools are under the “Admin” top masthead link. See the first top section: it lists the tools along with the associated location (“loc”) files created by them and other related data. I like to create one new history for data manager runs and make that active before running any of them. That way I keep all the runs together someplace I can refer back to them. I tend to do this in batches and name/date the histories so they are easy to find.

You will have trouble with the wheat genome as-is, no matter where you are working. The PLAZA resource is where you should get the data (genome + matched up reference annotation). We have not added this to CVMFS yet (the core data repository) but this ticket explains what we want to do and includes info and links from the data organizers: Add PLAZA (plant) genomes to test, main, and cvmfs · Issue #187 · galaxyproject/usegalaxy-playbook · GitHub

Hi Jennaj,
Thanks for the answers. I have figured out where the data managers are.
For the PLAZA plant genomes, do you have any plans to update them? The data for both wheat and barley are already outdated.

I will try to download the updated genome fasta and gff files from the specific databases and upload them to Galaxy to see whether that works. From your perspective, is there anything that I should pay special attention to while I am doing this? Thanks.

Hi,
I know this may not be the right place to ask this question, but is there a way to convert a .zip file to .gz format in Galaxy?

Unfortunately the wheat genome fasta file is in zip format and is 8 GB. It’s not realistic for me to download the zip file and convert it locally. Thanks.

Just found a step-by-step tutorial on how to create your own built-in reference genome:
https://bioinformatics.ucdavis.edu/research-computing/documentation/using-datamanagers-to-create-your-own-built-in-genomes/

Thanks, that is a nice tutorial. Most of it will be the same in the current Galaxy version. You’ll be able to tell what is now modeled a bit differently.

I still recommend running the indexes in the order above. And using a fasta from the history is fine.

For the .zip file, you’ll need to uncompress that data somewhere before loading the data to Galaxy. Probably. Single file zip archives will sometimes uncompress correctly, sometimes they won’t. Multi-file zip archives are the same (some work, some do not) but only the first file in the archive will load.
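
If the archive does need converting first, here is a minimal Python sketch of the idea, assuming the zip holds a single FASTA file and that the script runs wherever the archive was downloaded (for example, the machine hosting your Galaxy). The function name `zip_member_to_gz` is just an illustration, not a Galaxy tool. It streams the data in chunks, so the uncompressed genome is never held in memory or written out as an intermediate file.

```python
import gzip
import shutil
import zipfile


def zip_member_to_gz(zip_path, gz_path, member=None):
    """Stream one file from a zip archive into a gzip file; return the member name."""
    with zipfile.ZipFile(zip_path) as archive:
        # Ignore directory entries; only real files can be recompressed.
        names = [n for n in archive.namelist() if not n.endswith("/")]
        if member is None:
            if len(names) != 1:
                raise ValueError(
                    f"expected one file in archive, found {len(names)}: {names}"
                )
            member = names[0]
        with archive.open(member) as src, gzip.open(gz_path, "wb") as dst:
            # Copy in 1 MiB chunks so a multi-gigabyte FASTA never sits in memory.
            shutil.copyfileobj(src, dst, length=1024 * 1024)
    return member
```

For a multi-file archive, pass the name of the file you want as `member`; otherwise the function raises an error rather than guessing.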

For PLAZA genomes, open an issue ticket against the repository that hosts them and request an updated genome version (wheat or others). These may be a work in progress, but in any case you’ll get more answers.

Hi Jennaj,
You are right, the zip file doesn’t seem to be an issue.

I followed your suggestion on the order: download genome fasta file, and use the “Create DBKey and Reference Genome” to build the genome with no problem. However, errors arise when it comes to the indexing step. I have tried SAM FASTA index, rnastar index2 and BWA-MEM index, none of them work. The index step should be very straightforward from my understanding. I don’t understand why it can not go through.

I have also tested a single chromosome fasta file to make sure it is not due to dataset size. It seems that I may have missed something for the indexing. Do you have any idea? Thanks.
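
For anyone who wants to run the same kind of small test, this is a rough Python sketch of pulling the first sequence out of a multi-sequence FASTA file. It assumes an uncompressed, plain-text FASTA; the function name `first_fasta_record` is made up for illustration.

```python
def first_fasta_record(fasta_path, out_path):
    """Copy only the first sequence of a FASTA file to out_path; return its header line."""
    header = None
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                if header is not None:
                    break  # a second record starts here, so stop copying
                header = line.rstrip("\n")
            if header is not None:
                # Only write once the first header has been seen.
                dst.write(line)
    return header
```

If indexing fails on a one-sequence file like this too, the genome size is probably not the cause.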

Hi Jennaj,
I think I figured out where the issue is: these indexers require some dependencies that are missing on my Galaxy server. The most common one is samtools. I searched online and found no clear guide on how to install samtools. Could you help me here? Thanks :slight_smile:

Hi Jennaj,
Crazy, I have figured out how to install the missing samtools by going through the Galaxy documentation on Conda for tool dependencies. It states that if you set “conda_auto_install” to “true”, Galaxy will look for and install Conda packages for missing tool dependencies before running a job.

I did exactly what it says and ran the indexer again, and then it all worked like magic :slight_smile: Thanks a lot for your nice guide.
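
For later readers, a small Python sketch of what that setting looks like and a way to check that it is really active. The sample text assumes the option sits under the `galaxy:` section of `galaxy.yml`; the helper `read_setting` is hypothetical and only handles simple `key: value` lines, so it is not a real YAML parser. A line that is still commented out with `#` does not count as set.

```python
# Illustrative config text; an assumption about how the option appears in galaxy.yml.
SAMPLE_CONFIG = """\
galaxy:
  # Install missing Conda packages automatically before a job runs.
  conda_auto_install: true
"""


def read_setting(config_text, key):
    """Return the raw value of `key` from simple `key: value` lines, or None."""
    for line in config_text.splitlines():
        stripped = line.strip()
        # Skip comments (including commented-out options) and non-option lines.
        if stripped.startswith("#") or ":" not in stripped:
            continue
        name, _, value = stripped.partition(":")
        if name.strip() == key:
            # Drop any trailing inline comment.
            return value.split("#", 1)[0].strip()
    return None


print(read_setting(SAMPLE_CONFIG, "conda_auto_install"))
```

To check a real install, read the contents of your own config file into a string and pass that in place of `SAMPLE_CONFIG`.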

@YONG_JIA Super, glad you found our docs ( https://docs.galaxyproject.org ) and the help solved your problem :sunny:

Hi Jennaj,
Sorry to bother you again. Regarding the reference genome preparation, you have been suggesting running the four basic indexes first and then the tool-specific indexing. What is the reason for doing this? Can I just go straight to the tool-specific indexing, such as the RNA-star and HISAT indexers?

I wonder whether you would be able to help with another problem. I got a “java.lang.OutOfMemoryError” error during the Picard indexing. I found that the exact same issue was posted earlier by someone else on GitHub:


There was an answer posted but I couldn’t really understand how to do it. I have chased the question two days ago but haven’t got any response yet. Thanks a lot :sunny:

Hi @YONG_JIA

If you haven’t created the SAMtools, Picard, and 2bit indexes, problems with tools can come up.

The error you reference comes from using Picard tools on the command line. For your indexing with a Data Manager in Galaxy, this same error likely means that your local Galaxy does not have enough memory to index the genome. If you are trying to index wheat from most public sources, there will never be enough memory (the genome is simply too large). Using the PLAZA version can help reduce the amount of memory needed due to the way it was reorganized, but it will still be substantial.

Related Q&A

I had this all working nicely on my local install, then when something failed I ran them again and the data managers are now broken. Down in the details I see “never run them twice or there will be chaos and it is tricky to fix” essentially.
OK so having brought this apocalypse upon my server, is there a way to fix it; presumably deleting a bunch of files and starting again? Can you help at all please!
Richard

Seemed to get around this by rummaging around until I found the duplicate lines and deleted them. The index still says failed, but it seems to have worked so long as you manually type the location into the file.

Yes, duplicate entries will lead to problems. As will missing entries.

Sometimes starting over is the easiest way, but correcting the data directly is also possible (just complicated!).

More details are in these FAQs if you want to attempt a manual fix: Galaxy Administration

Stopping the server, making changes, then restarting seems to always work best. Pay special attention to spaces, tabs, and the like. Using a command-line text editor that reveals whitespace characters is essential, imho.
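
As a rough aid for that kind of manual check, here is a minimal Python sketch that flags exact duplicate entries and suspicious whitespace. It assumes the usual loc file layout of tab-separated columns with `#` comment lines; the function name `check_loc_lines` and the file name in the usage comment are placeholders. It only reports problems and changes nothing.

```python
from collections import Counter


def check_loc_lines(lines):
    """Return (duplicates, warnings) for the lines of a tab-separated .loc file."""
    entries = []
    warnings = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if "\t" not in line:
            warnings.append(f"line {number}: no tab found (spaces used as separators?)")
        if line != line.rstrip():
            warnings.append(f"line {number}: trailing whitespace")
        entries.append(line.rstrip())
    # An entry that appears more than once is reported a single time.
    duplicates = [entry for entry, count in Counter(entries).items() if count > 1]
    return duplicates, warnings


# Usage (placeholder file name):
# with open("all_fasta.loc") as handle:
#     duplicates, warnings = check_loc_lines(handle)
```

Stop the server before editing anything the check turns up, then restart.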

Thanks, all sorted now. Again (my third installation of the server)!!
Kind Regards
