Indexing reference genomes with Data Managers: Resources, tutorials, troubleshooting

Hi,

I’m attempting to run HISAT2 on paired RNAseq data. I previously ran it successfully on the main server using the built-in mm10 reference genome. However, I am now using a local server, and the built-in reference genomes were apparently not included in the set-up. I’m hoping for some help with obtaining and installing the right reference genome file for mm10 or, even better, with updating the local server so that the same built-in reference genome is available as on the main server.

Thank you,
Asta

Hi,

The mm10 reference genome/build is sourced from UCSC.

You’ll need to index the genome on your server with Data Manager tools.

Install the genome with Data Managers (sourced from the ToolShed, all are in a separate category there). Install DMs like any other tool using the Admin functions.

You’ll need these DMs at a minimum. Execute them in this order first:

  • Fasta fetcher – has an option to pick UCSC as the data source.
  • SAM indexer
  • Picard indexer
  • 2bit (twoBit) indexer

Then get the DMs that create indexes for the tools you want to use. Run these after the ones above have completed for the best results.
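
As a minimal sketch of the ordering described above: the four base Data Managers run first in a fixed order, and any tool-specific indexers follow. The labels are the descriptive names used in this post, not ToolShed tool IDs, and `run_order` is a hypothetical helper, not part of Galaxy.

```python
# Labels below are descriptive names from the post, not ToolShed tool IDs.
BASE_DATA_MANAGERS = [
    "Fasta fetcher",
    "SAM indexer",
    "Picard indexer",
    "2bit (twoBit) indexer",
]


def run_order(tool_index_managers):
    """Return the base Data Managers in their fixed order, followed by the
    tool-specific indexers (for example a HISAT2 indexer), without repeats."""
    order = list(BASE_DATA_MANAGERS)
    for name in tool_index_managers:
        if name not in order:
            order.append(name)
    return order


# Print a numbered checklist to follow by hand in the Admin interface.
for step, name in enumerate(run_order(["HISAT2 indexer"]), start=1):
    print(f"{step}. {name}")
```

Keeping the plan as a plain list also makes it harder to queue the same indexer twice by accident.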

The above was copied over and slightly updated from this prior Galaxy Biostars Q&A. It has more details (worth reviewing):

https://biostar.usegalaxy.org/p/19371/

A search here with the keyword string “biostars data manager fetch sam picard” will find more prior Q&A that covers many different use cases: https://galaxyproject.org/search/

More details
  • Be very careful about not running an indexing tool twice – duplicated lines will cause problems and are a hassle to correct.
  • If you use a “dbkey” that is already available, you must make sure that your genome is an exact match for that content. Genomes that are already indexed on one of the public Galaxy servers (and associated with an existing dbkey) can be found here: http://datacache.galaxyproject.org/
  • Data Libraries are great for storing commonly used datasets (copies do not consume any additional disk space). You could put a genome fasta in one and use it with tools as a Custom Genome. Precomputed tool indexes cannot be used from a Data Library or a history, so don’t bother loading them.
  • Custom genome help is in this FAQ and a related FAQ. If you are indexing a genome from an uploaded fasta file for some reason (for example, it is not available directly from a known source), the same fasta formatting rules apply.
  • But if you are the admin of a server, it is better to install the genome directly and index it for tools on the server. Jobs will use fewer resources and run faster this way, and they will be much less likely to fail for memory reasons.
  • Data Manager tools are installed/used under the “Admin” masthead, top section. The “loc” files/table content is there as well.
  • The jobs that are run by Data Manager tools will be sent to your active history. It is strongly recommended to set up a distinct history for each genome, or batch of genomes, and date it in the name, to better keep track of what was done over time.
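
To illustrate the warning above about duplicated lines, here is a minimal sketch that reports repeated data lines in a “loc” file. It assumes the usual layout of tab-separated rows with `#` comment lines. The function name and the example rows (including the path) are made up for illustration.

```python
from collections import Counter


def duplicated_loc_lines(lines):
    """Return data lines that occur more than once, ignoring blanks and '#' comments."""
    data = [
        line.rstrip("\r\n")
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
    counts = Counter(data)
    return [line for line, count in counts.items() if count > 1]


# Made-up example content, for illustration only.
example = [
    "# value\tdbkey\tname\tpath\n",
    "mm10\tmm10\tMouse (mm10)\t/hypothetical/path/mm10.fa\n",
    "mm10\tmm10\tMouse (mm10)\t/hypothetical/path/mm10.fa\n",
]
print(duplicated_loc_lines(example))
```

On a real server, the same function could be given an open file handle for the “loc” file in question. Which file to check depends on the Data Manager that was run twice.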

Hi Jenna,

This was really helpful thank you! I have a couple of questions:
I don’t seem to be able to pick any options with the Fasta fetcher tool? Is it correct that the tool is named data_manager_fetch_genome_dbkeys_all_fasta in the ToolShed? It is the only fasta fetcher tool I can find.
Additionally, when you write ‘Execute’ do you simply mean installing the tools to my local server?

Thank you again,
Asta

Hi Jenna,

I have now succeeded in executing the four listed DMs according to https://github.com/galaxyproject/dagobah-training/blob/2017-montpellier/sessions/05-reference-genomes/ex1-reference-genomes.md
However, I subsequently ran the DM for the HISAT2 index. It ran for a few hours, then failed and gave me the following message:
“Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:18:13
Allocating rank array
Ranking v-sort output
Fatal error: Exit code 247 ()
Settings:
Output files: “mm10.*.ht2”
Line rat”

I’m unsure what this error code means. I hope you can clarify.

Asta

It looks like the tool is running out of memory. Are you running this in a local Galaxy on a personal computer? It might not have the resources you need.
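
One possible reading of exit code 247, offered as an assumption and not something the log confirms: some job runners report a process killed by a signal as a negative number, and -9 (SIGKILL, the signal the Linux out-of-memory killer sends) becomes 247 when stored as an unsigned byte. The sketch below only does that arithmetic. `as_wrapped_signal` is a hypothetical helper.

```python
import signal


def as_wrapped_signal(exit_code):
    """Interpret an 8-bit exit code as a wrapped negative signal number.

    Returns the signal name if (256 - exit_code) is a known signal on this
    platform, otherwise None. This is a heuristic, not a diagnosis.
    """
    candidate = 256 - exit_code  # e.g. 247 -> 9
    try:
        return signal.Signals(candidate).name
    except ValueError:
        return None


print(as_wrapped_signal(247))
```

The server logs remain the authoritative source for why the job was stopped.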

The mouse genome is pretty large. If indexing (or later mapping) would not work when running HISAT2 on the command line because the jobs exceed the available resources (disk space or memory), then it will not work in Galaxy either. See here for common index sizes: https://ccb.jhu.edu/software/hisat2/index.shtml
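
A quick, hedged way to compare a machine's physical memory against what a job needs is sketched below. It assumes a Linux-like system where `os.sysconf` exposes page counts. The required amount is something you supply yourself (the helper names are hypothetical), and the memory Galaxy actually grants a job may be lower than the machine total, depending on the job configuration.

```python
import os

GIB = 1024 ** 3


def total_memory_bytes():
    """Total physical RAM on Linux-like systems, or None if it cannot be read."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        return None


def fits_in_memory(required_gib, total_bytes):
    """True only if the total is known and at least the required amount."""
    return total_bytes is not None and total_bytes >= required_gib * GIB


total = total_memory_bytes()
if total is None:
    print("Could not determine physical memory on this platform.")
else:
    print(f"Physical memory: {total / GIB:.1f} GiB")
```

Call `fits_in_memory(required_gib, total)` with the figure from the tool's own documentation to get a simple yes/no answer.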

There are Cloud-based Galaxy options. Check to see if any of the academic clouds are available to you/your institution. AWS also offers grants to cover research projects for students, researchers, etc (a simple online form, usually turns around quickly). Galaxy itself is free – and the cloud version is designed to be easy to administer and has many indexes pre-computed – but you’ll need to connect a resource for the database-data storage and computational work. Many scientists/teachers use a cloud option every day.

For Galaxy platform choices, please see:

I have been running Galaxy on a university-hosted local server. However, I have been accessing the server from my personal computer. I thought this shouldn’t make a difference, since the server itself is an academic server and no storage limit appears to be listed.
I have also succeeded in uploading RNAseq fastqsanger files and running FastQC on them without any issues.

Does this mean accessing some internal Galaxy server that is hosted by your University? Other people use it, there is an administrator, and you are also an administrator?

If this means just opening a browser window for the same server above, but from a different computer, then you are still using the same Galaxy account/server for work.

If the above is true, contact the admins that are running the technical side of the server. They can check the server logs. It is very likely that more memory needs to be allocated for this job on whatever cluster they attached.

If instead you are running your own Galaxy (whether on a university server and/or your own computer), please explain the source. Is it a GitHub install from https://getgalaxy.org, and is it current with version 19.01? Or is it some Docker version (which URL did you source it from? There are a few, including training versions)? The job is almost certainly running out of resources (most likely memory, which is different from the amount of disk space you may have available). We can point you to server administration docs/tutorials.

Thanks!

Thank you again. It is an internal Galaxy server hosted by the university. I have access with personal login details, but others have access to the internal server too. I’m also an administrator for the server with my login details.
I have contacted the admins now, and hopefully they can allocate more memory for the job. Thank you again so much!

So I managed to import the HISAT2 reference genome and now I’m getting the following error:
Fatal error: Exit code 127 ()
samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
(ERR): hisat2-align died with signal 13 (PIPE)

You’ll need to contact your admins again. This error means there is a Samtools dependency problem. They might need to uninstall/reinstall HISAT2 using the “manage dependencies” option. Also ask them to make sure they are running the latest version of Galaxy, including checking for point-updates since the original release (19.01). Installing the most current version of HISAT2 would also be important (2.1.0+galaxy4).
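
For admins who want a quick check before reinstalling, the sketch below asks the dynamic loader whether a shared library can be opened by name. It is only a rough probe: the result reflects the environment the script runs in, so it would need to run in the same environment the tool's Samtools uses (an assumption about your setup) to say anything about the failing job. `can_load` is a hypothetical helper.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


# Library name taken from the error message above.
print(can_load("libcrypto.so.1.0.0"))
```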

If those admins need help, they can reach the developers at this Gitter chat https://gitter.im/galaxyproject/dev. In some cases, getting dependencies sorted out correctly takes a bit more effort. I see this kind of error reported across many different tools/platforms (from google search) and it seems to be linked to a laundry list of factors: OpenSSL, bioconda, conda, conda-forge, et cetera. The “fix” details are not specific for all but people are addressing it successfully. So, please start by getting Galaxy and the tool updated, see if that resolves the problems, and if not have them report it at Gitter for help. HISAT2 is working correctly on the public servers.

Update: I decided to start up the Gitter question just to see if someone already recognizes the problem/knows the solution. They may write back here, or in the chat: https://gitter.im/galaxyproject/dev?at=5cd5b632bdc3b64fcf2389aa

Having a recent version of conda is also important, see https://docs.galaxyproject.org/en/latest/admin/conda_faq.html#how-can-i-upgrade-conda
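
A minimal sketch for reading the installed conda version follows. It assumes a `conda` executable is reachable by name. Galaxy may use its own conda install at a different path, in which case that path would need to be passed in. Both function names are hypothetical.

```python
import shutil
import subprocess


def parse_conda_version(text):
    """Turn output such as 'conda 4.6.14' into a tuple like (4, 6, 14)."""
    words = text.split()
    if not words:
        return None
    numbers = []
    for part in words[-1].split("."):
        if not part.isdigit():
            break
        numbers.append(int(part))
    return tuple(numbers) or None


def installed_conda_version(executable="conda"):
    """Run '<executable> --version' and parse it, or return None if not found."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some older releases printed the version on stderr, so check both streams.
    return parse_conda_version(result.stdout or result.stderr)


print(installed_conda_version())
```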

Hi Jennaj,
I am excited to learn Galaxy and have just installed it on my personal PC. I’m keen to install a reference genome (I am working on wheat) for my RNAseq analysis.

Forgive me if my question sounds stupid, but I am still not quite clear on the exact steps needed to download and build the reference genome using Data Managers. Previously I thought DM was a single tool that I could find in the ToolShed; now it seems it is a set of tools required for data preparation. Do you have any plans for a step-by-step tutorial on reference genome preparation?

BTW, I am not quite sure what kind of wheat genome data is required for a reference genome. Thanks in advance for your response.

Another issue I encountered: after installing the data_manager_fetch_genome_dbkeys_all_fasta tool from the ToolShed, I cannot find it in my tool list. Does anyone know why? All the other tools I installed appear in my tool list, just not this one, which is weird.

Review this page. It has many details, more than in my original post, including a video: https://galaxyproject.org/admin/tools/data-managers/

But I still strongly recommend running the DMs in the order I suggested. Let that first set finish completely, one by one. Then you can run other indexes in batch, but only if you have enough resources allocated on your Galaxy to do that (enough to run multiple high-memory jobs, plus space to store the results). Some pre-computed indexes can be imported from CVMFS; the DM form will note that if available.

Ephemeris has a “data manager” mode for batch work, though it takes a bit more work to set up. If interested, see: https://ephemeris.readthedocs.io/en/latest/

The tools are under the “Admin” link in the top masthead. See the first section at the top: it holds the tools, along with the associated location (“loc”) files created by them and other related data. I like to create one new history for data manager runs and make it active before running any of them. That way I keep all the runs together in one place that I can refer back to. I tend to do this in batches and name/date the histories so they are easy to find.

You will have trouble with the wheat genome as-is, no matter where you are working. The PLAZA resource is where you should get the data (genome + matched up reference annotation). We have not added this to CVMFS yet (the core data repository) but this ticket explains what we want to do and includes info and links from the data organizers: https://github.com/galaxyproject/usegalaxy-playbook/issues/187

Hi Jennaj,
Thanks for the answers. I have figured out where the data managers are.
For the PLAZA plant genomes, do you have any plans to update them? The data for both wheat and barley are already outdated.

I will try to download the updated genome fasta and gff files from a specific database and upload them to Galaxy to see whether that works. From your perspective, is there anything that I should pay special attention to while doing this? Thanks.

Hi,
I know this may not be the right place to ask this question, but is there a way to convert a .zip file to .gz format in Galaxy?

Unfortunately, the wheat genome fasta file is in zip format and 8Gb large. It’s not realistic for me to download the zip file and convert it locally. Thanks.

Just found a step-by-step tutorial on how to create your own built-in reference genome:
https://bioinformatics.ucdavis.edu/research-computing/documentation/using-datamanagers-to-create-your-own-built-in-genomes/

Thanks, that is a nice tutorial. Most of it will be the same in the current Galaxy version. You’ll be able to tell what is modeled a bit differently now.

I still recommend running the indexes in the order above. And using a fasta from the history is fine.

For the .zip file, you’ll probably need to uncompress that data somewhere before loading it into Galaxy. Single-file zip archives will sometimes uncompress correctly and sometimes won’t. Multi-file zip archives behave the same way (some work, some do not), but only the first file in the archive will load.
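
If the conversion has to happen outside Galaxy, a minimal sketch along these lines streams one member of a zip archive into a .gz file without unpacking it to disk first. It uses only the Python standard library. `zip_member_to_gz` and any file names passed to it are hypothetical, and it would need to run on a machine that can hold both files.

```python
import gzip
import shutil
import zipfile


def zip_member_to_gz(zip_path, gz_path, member=None):
    """Stream one file from a zip archive into a gzip file.

    If no member is named, the archive must contain exactly one file.
    Returns the name of the member that was converted.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries, which end with a slash.
        names = [name for name in archive.namelist() if not name.endswith("/")]
        if member is None:
            if len(names) != 1:
                raise ValueError(f"expected one file in the archive, found {len(names)}")
            member = names[0]
        with archive.open(member) as source, gzip.open(gz_path, "wb") as target:
            shutil.copyfileobj(source, target)
    return member
```

A call such as `zip_member_to_gz("genome.zip", "genome.fa.gz")` would then produce a file that can be uploaded as compressed fasta.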

For PLAZA genomes, open an issue ticket against the repository that hosts them and request an updated genome version (wheat or others). These may be a work in progress, but in any case you’ll get more answers.

Hi Jennaj,
You are right, the zip file doesn’t seem to be an issue.

I followed your suggestion on the order: I downloaded the genome fasta file and used “Create DBKey and Reference Genome” to build the genome with no problem. However, errors arise at the indexing step. I have tried SAM FASTA index, rnastar index2 and BWA-MEM index, and none of them work. From my understanding, the indexing step should be very straightforward, so I don’t understand why it does not go through.

I have also tested a single chromosome fasta file to make sure it is not due to dataset size. It seems that I may have missed something for the indexing. Do you have any idea? Thanks.