Mapping to an uploaded (large) reference genome dataset -- Platform Choices

Hi @daikez

Larger genomes tend to exceed processing resources when used as a custom genome. Why? The genome has to be loaded and indexed before the tool can run against it, and the index is discarded after each run, so every job repeats that work.

When the query is another entire genome, potentially fragmented, that also adds to the memory and time the job requires.

The rn7 genome is new (officially released May 4, 2021 by UCSC, see UCSC Genome Browser: News Archives) and isn’t indexed for tools at usegalaxy.* servers yet.

UseGalaxy.org is undergoing testing for the upcoming release, so adding a new genome would be problematic right now.

That said, UseGalaxy.eu might be able to index the genome sooner, since it has started to be requested more often. Ping @bjoern.gruening @gallardoalba

Our plans are to unify the reference data across usegalaxy.* servers but that project is still a work-in-progress. So, if the EU server can add it, you can work there.

What you might be able to do meanwhile:

  1. Make sure that your query doesn’t include very short sequences that represent a single read. Meaning, you might need to clean up/filter the assembly first.
  2. Load up the genome (use the UCSC version from the "Index of /goldenPath/rn7/bigZips" download page and choose the rn7.fa.gz file) and try running at UseGalaxy.eu instead. They have a few clusters that scale for even large jobs. Just be aware that the job may queue, run, then auto-rerun if the first attempt fails for resources.
  3. Download the same rn7 fasta, uncompress, reduce the file so that only the primary chromosomes are remaining (remove all the haplotype/alt/unmapped sequences). Then load that up to either/both servers and try to map against that. This reduces the size of the genome – and primary chromosome hits are usually what are most important anyway.
  4. Or, pick the version of the genome from NCBI that only includes primary chromosomes: https://www.ncbi.nlm.nih.gov/genome/73
  5. You could also wait for the genome to be indexed – the EU team will respond with their plans, likely including projected timing. It may take a day or so (time zone differences/summer vacation schedules/focusing on the Galaxy release testing, etc).
  6. Consider setting up and using your own Galaxy server. You’ll be able to add/index any genome you want natively when you are the administrator. This is good for large work in general: the free public servers have significant resources, but not as much as you can set up yourself (for practical reasons). Galaxy scales for very large work when you run your own and allocate appropriate resources.
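
If you would rather script the clean-up in items 1 and 3 than do it by hand, here is a minimal Python sketch. It is not a Galaxy tool, just one way to do it locally. The function names are made up for illustration, and the pattern for "primary chromosome" assumes UCSC-style names (chr1, chr2, ..., chrX, chrY, chrM), so check the headers in your fasta and adjust before trusting the result.

```python
import re
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)

# Assumption: primary chromosomes are named chr1, chr2, ..., chrX, chrY, chrM.
# Names such as chr1_xxx_random or chrUn_xxx will not match and are dropped.
PRIMARY_NAME = re.compile(r"^chr(\d+|X|Y|M)$")


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_primary(records: Iterable[Record]) -> Iterator[Record]:
    """Keep records whose name (first word of the header) looks primary."""
    for header, seq in records:
        words = header.split()
        if words and PRIMARY_NAME.match(words[0]):
            yield header, seq


def drop_short(records: Iterable[Record], min_length: int) -> Iterator[Record]:
    """Keep records with at least min_length bases."""
    for header, seq in records:
        if len(seq) >= min_length:
            yield header, seq


def write_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Format records as FASTA text with wrapped sequence lines."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
    return "".join(line + "\n" for line in out)
```

To use it on the downloaded file, you could open it with `gzip.open("rn7.fa.gz", "rt")`, pass the file handle to `read_fasta`, run the records through `keep_primary` (for the reference) or `drop_short` (for the query assembly, with a minimum length you pick), and write the text from `write_fasta` to a new file before uploading. Each chromosome is held in memory one at a time, so this needs a machine with a reasonable amount of RAM.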

I also added a few tags to your post that may help with finding prior Q&A that is useful.

Thanks!
