Mapping to an uploaded (large) reference genome dataset -- Platform Choices

jennaj · May 17, 2021, 8:29pm

Larger genomes tend to exceed processing resources when used as a custom genome. Why? The genome has to be loaded and then indexed before the tool can run against it, and the index is discarded after each run, so every job repeats that work.

When the query is another entire genome, potentially fragmented, that also adds to the memory and time the job requires.

The rn7 genome is new (officially released May 4, 2021 by UCSC; see UCSC Genome Browser: News Archives) and isn’t indexed for tools at usegalaxy.* servers yet.

UseGalaxy.org is undergoing testing for the upcoming release, so adding a new genome would be problematic right now.

That said, UseGalaxy.eu might be able to index the genome sooner, since it has started to be requested more often. Ping @bjoern.gruening @gallardoalba

We plan to unify the reference data across usegalaxy.* servers, but that project is still a work in progress. So, if the EU server can add it, you can work there.

What you might be able to do meanwhile:

1. Make sure that your query doesn’t include very short sequences that represent a single read. Meaning, you might need to clean up/filter the assembly first.
2. Load up the genome (use the UCSC version from Index of /goldenPath/rn7/bigZips and choose the rn7.fa.gz file) and try running at UseGalaxy.eu instead. They have a few clusters that scale for even large jobs. Just be aware that the job may queue, run, then auto-rerun if the first attempt fails for resources.
3. Download the same rn7 fasta, uncompress it, and reduce the file so that only the primary chromosomes remain (remove all the haplotype/alt/unmapped sequences). Then load that up to either/both servers and try to map against that. This reduces the size of the genome, and primary chromosome hits are usually what matter most anyway.
4. Or, pick the version of the genome from NCBI that only includes primary chromosomes: https://www.ncbi.nlm.nih.gov/genome/73
5. You could also wait for the genome to be indexed. The EU team will respond with their plans, likely including projected timing. It may take a day or so (time zone differences, summer vacation schedules, focus on the Galaxy release testing, etc.).
6. Consider setting up and using your own Galaxy server. You’ll be able to add/index any genome you want natively when you are the administrator. This is good for large work in general: the free public servers have significant resources, but not as much as you can set up yourself (for practical reasons). Galaxy scales for very large work when you run your own and allocate appropriate resources.
   - Webinar: Use Galaxy on the web, the cloud, and your laptop too
   - Indexing reference genomes with Data Managers: Resources, tutorials, troubleshooting
   - Ways to use Galaxy: https://galaxyproject.org >> Use. Scroll down a bit to review the platform matrix to help with decisions.
   - The GVL version of Cloudman is one choice, and AWS offers grants (see AWS Programs for Research and Education). Use a high-memory base server, or you might have to start over. Cluster nodes are based on the same configuration choices as the server that hosts Galaxy.
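
If you prefer to do the two clean-up steps from the list above (dropping very short query sequences, and keeping only the primary chromosomes of the reference) on your own computer before uploading, here is a minimal Python sketch. It is only an illustration, not a Galaxy tool: the function names are made up for this example, the chromosome pattern assumes UCSC-style names (chr1, chr2, ... chrX, chrY, chrM) with every other sequence treated as alt/random/unplaced, and any minimum length you pick is your own choice. Check the names in your fasta before relying on the pattern.

```python
import re
from typing import Callable, Iterator, TextIO, Tuple

# Assumption: primary chromosomes use UCSC-style names (chr1..chrN, chrX, chrY, chrM).
# Names with suffixes such as "_random" or prefixes such as "chrUn_" will not match.
PRIMARY_NAME = re.compile(r"^chr(\d+|X|Y|M)$")


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def is_primary(header: str, seq: str) -> bool:
    """True when the first word of the header looks like a primary chromosome."""
    fields = header.split()
    return bool(fields) and PRIMARY_NAME.match(fields[0]) is not None


def longer_than(min_len: int) -> Callable[[str, str], bool]:
    """Build a filter that keeps sequences of at least min_len bases."""
    return lambda header, seq: len(seq) >= min_len


def filter_fasta(src: TextIO, dst: TextIO,
                 keep: Callable[[str, str], bool], width: int = 60) -> int:
    """Copy records for which keep(header, seq) is true; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(src):
        if keep(header, seq):
            dst.write(f">{header}\n")
            # Re-wrap the sequence to a fixed line width.
            for start in range(0, len(seq), width):
                dst.write(seq[start:start + width] + "\n")
            kept += 1
    return kept
```

For the reference, you could open the download with `gzip.open("rn7.fa.gz", "rt")` as the source and a plain text file as the destination, then call `filter_fasta(src, dst, is_primary)`. For the query assembly, pass `longer_than(...)` with a cutoff that suits your data instead. Only one sequence is held in memory at a time, so this should be workable for a genome of this size on a typical laptop.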

I also added a few tags to your post that may help with finding prior Q&A that is useful.

Thanks!
