Mapping to an uploaded (large) reference genome dataset -- Platform Choices

daikez · May 12, 2021, 9:49am

I am trying to run minimap2 to map my own rat genome assembly against the reference rat genome. It worked when I mapped against rn6, which is on the reference list. But when I tried to map against rn7, which is not on the list and had to be uploaded as a separate dataset, it failed every time, whatever parameters I changed. The error message reported out of memory. Is there any solution for this situation? Or may I ask whether rat rn7 could be added to the reference list on the server?

jennaj · May 17, 2021, 8:29pm

Hi @daikez

Larger genomes tend to exceed processing resources when used as a custom genome. Why? The genome has to be loaded then indexed first. After that is done the tool is run against it and the index is discarded after each run.

When the query is another entire genome, potentially fragmented, that also adds to the memory and time the job requires.

The rn7 genome is new (officially released May 4, 2021 by UCSC; see UCSC Genome Browser: News Archives) and isn’t indexed for tools at usegalaxy.* servers yet.

UseGalaxy.org is undergoing testing for the upcoming release, so adding a new genome would be problematic right now.

That said, UseGalaxy.eu might be able to index the genome sooner, since it has started to be requested more often. Ping @bjoern.gruening @gallardoalba

Our plans are to unify the reference data across usegalaxy.* servers but that project is still a work-in-progress. So, if the EU server can add it, you can work there.

What you might be able to do meanwhile:

Make sure that your query doesn’t include very short sequences that represent a single read. Meaning, you might need to clean up/filter the assembly first.
Load up the genome (use the UCSC version Index of /goldenPath/rn7/bigZips – choose the rn7.fa.gz file) and try running at UseGalaxy.eu instead. They have a few clusters that scale for even large jobs. Just be aware that the job may queue, run, then auto-rerun if the first attempt fails for resources.
Download the same rn7 fasta, uncompress, reduce the file so that only the primary chromosomes remain (remove all the haplotype/alt/unmapped sequences). Then load that up to either/both servers and try to map against that. This reduces the size of the genome – and primary chromosome hits are usually what is most important anyway.
Or, pick the version of the genome from NCBI that only includes primary chromosomes: https://www.ncbi.nlm.nih.gov/genome/73
You could also wait for the genome to be indexed – the EU team will respond with their plans, likely including projected timing. It may take a day or so (time zone differences/summer vacation schedules/focusing on the Galaxy release testing, etc).
Consider setting up and using your own Galaxy server. You’ll be able to add/index any genome you want natively when you are the administrator. This is good for large work in general – the free public servers have significant resources, but not as much as you can set up yourself (for practical reasons). Galaxy scales for very large work when you run your own and allocate appropriate resources.
- Webinar: Use Galaxy on the web, the cloud, and your laptop too
- Indexing reference genomes with Data Managers: Resources, tutorials, troubleshooting
- Ways to use Galaxy: https://galaxyproject.org >> Use. Scroll down a bit to review the platform matrix to help with decisions.
- The GVL version of Cloudman is one choice and AWS offers grants. Use a high memory base server, or you might have to start over. Cluster nodes are based on the same configuration choices as the server that hosts Galaxy. AWS Programs for Research and Education
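
The two clean-up suggestions above (dropping very short query sequences, and reducing the reference to primary chromosomes) can be sketched in Python. This is only a rough illustration, not a Galaxy tool: the function names are hypothetical, and the pattern for "primary chromosome" assumes UCSC-style names (chr1, chr2, ..., chrX, chrY, chrM) with everything else (chrUn_*, *_random, *_alt) treated as non-primary. Check the headers in your own file before relying on it.

```python
import re
from typing import Callable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Assumption: primary chromosomes are named chr<number>, chrX, chrY or chrM.
PRIMARY = re.compile(r"^chr(\d+|X|Y|M)$")


def is_primary(header: str, seq: str) -> bool:
    """Keep a record only if its name (first word of the header) looks primary."""
    return PRIMARY.match(header.split()[0]) is not None


def longer_than(min_len: int) -> Callable[[str, str], bool]:
    """Build a filter that keeps records with at least min_len bases."""
    def keep(header: str, seq: str) -> bool:
        return len(seq) >= min_len
    return keep


def filter_fasta(src: TextIO, dst: TextIO,
                 keep: Callable[[str, str], bool], width: int = 60) -> int:
    """Copy records passing `keep` from src to dst; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(src):
        if keep(header, seq):
            dst.write(f">{header}\n")
            for i in range(0, len(seq), width):
                dst.write(seq[i:i + width] + "\n")
            kept += 1
    return kept


# Example use (file names are placeholders):
#   import gzip
#   with gzip.open("rn7.fa.gz", "rt") as src, open("rn7.primary.fa", "w") as dst:
#       filter_fasta(src, dst, is_primary)
#   with open("assembly.fa") as src, open("assembly.min1kb.fa", "w") as dst:
#       filter_fasta(src, dst, longer_than(1000))
```

Each sequence is held in memory one record at a time, so a whole chromosome is loaded at once. That is fine on a laptop for a rat-sized genome, but it is a simplification rather than a streaming implementation.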

I also added a few tags to your post that may help with finding prior Q&A that is useful.

Thanks!

daikez · May 18, 2021, 6:59am

Hi, @jennaj

Thank you very much for your detailed suggestions.

I will check them out closely.

daikez · May 18, 2021, 8:58am

May I ask when the indexed rat rn7.2 might be available on Galaxy? Thanks.

gallardoalba · May 18, 2021, 7:09pm

Hi @daikez,
it has been requested; I’ll inform you as soon as I have some information about it.

Regards

gallardoalba · May 19, 2021, 3:26pm

Hi @daikez,
the rn72 indexed genome will be available on useGalaxy.eu in a few hours.

Regards.

daikez · May 20, 2021, 8:39am

Thank you very much for the fast solution! I have already run a mapping against rn72 and it went well. But I just noticed one thing: unlike the mapping result against rn6, there is no option to display the result at the UCSC genome browser. Is there any way to do that? Thanks again.

bjoern.gruening · May 20, 2021, 9:47pm

@daikez make sure your dataset is annotated as rn7. That is, the database should be set to rn7 rather than ?.

daikez · May 21, 2021, 6:09am

Yes, I did. Please look at the attached screenshots and see the differences between the two mapping results.

daikez · May 21, 2021, 7:27am

Another screenshot using the same dataset mapped against rn6 and now against rn7.2. Notice that the file generated with rn7.2 is only half the size of the one generated with rn6.0.

jennaj · May 24, 2021, 7:42pm

The resulting bam size difference might be due to the differences in the assembly produced by the two different groups: UCSC Genome Browser: Acknowledgments

The UseGalaxy.eu server has the genome in the list of known databases:

And UCSC has a live browser for rn7. That said, it is not the default browser for Rat when going to UCSC (rn6 is), but it can be navigated to. The rn7 genome is still very new and still having more tracks added as far as I know:

Related news:

I’m wondering if a specific conflict is coming up @daikez

Did you add the genome yourself as a Custom Build within Galaxy, before it was added built-in? If you used the same “dbkey” aka the short name for the genome rn7, that could cause a conflict. Both would show up in the list of databases (custom + built-in). If so, delete the custom build, refresh Galaxy (maybe log out/log in again to clear the browser cache), and see if that resolves the problem.

If that doesn’t work, @bjoern.gruening or @gallardoalba can help more

daikez · May 25, 2021, 5:51am

Thanks for the suggestion! I didn’t assign my genome any of the dbkeys on the list, as none of them fit. Is anything wrong with the attributes?

daikez · May 27, 2021, 10:27am

Any chance to display this mapping result on UCSC genome browser (against rn7.2)?

jennaj · June 7, 2021, 10:51pm

Hi @daikez

I can confirm the UCSC linkout is not present (in an independent test). I’ll leave this test history intact for a while: Galaxy | Europe | Accessible History | test rn7 UCSC link out (bam). You can just view it at that URL, no need to import/copy – datasets in a shared history link can be expanded and reviewed directly. Warning: no QA/QC was done, which is NOT what you’d want to do for a real analysis, but it was good enough for this test.

So, why is the link missing? It may be because of some setting at UCSC (the genome is not the “default” rat genome, as I noted before). Or, it may be something Galaxy can address via some configuration update/change. That said, I’m guessing the first is the root issue. UCSC probably isn’t yet offering the rn7 genome browser to other applications since the annotation tracks are not complete (the genome is very new).

There are other ways to view data. The options already in the links, plus UCSC accepts other file types (that you can convert to within Galaxy) as custom tracks. Convert to a supported format then copy the dataset URL and input that at the UCSC web site. They don’t accept bam data directly in a custom track since it can be so large. Genome Browser Custom Tracks

ping @bjoern.gruening @gallardoalba

daikez · June 9, 2021, 9:32pm

Hi, @jennaj

Thanks again for your tips!

I have tried the other display links on the dataset. The data was loaded as empty tracks in IGV and IGB, and both programs have the same problem, as rn7.2 is not on their reference lists. So although I can see how well the assembly maps to rn7.2, no correct annotations (which differ greatly between rn6 and rn7.2) are shown.

I have also tried to input my mapping data to UCSC custom tracks, but it still doesn’t work. I always get an error message such as “first 5 lines of chromosome id not correct, it’s case sensitive, etc.”
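
For anyone hitting a similar chromosome-name error, here is a minimal Python sketch of one way to narrow it down. It assumes the track is a plain-text file (for example BED) whose first column is the sequence name, and that you have a two-column name/size listing of the sequences the browser expects (UCSC publishes files of this shape, commonly called chrom.sizes). The function names are hypothetical, and the check is an exact, case-sensitive string comparison, so "Chr1" or "1" would be reported when the reference uses "chr1".

```python
from typing import Iterable, List, Set


def load_chrom_names(chrom_sizes_lines: Iterable[str]) -> Set[str]:
    """Collect sequence names from 'name<TAB>size' lines."""
    names = set()
    for line in chrom_sizes_lines:
        line = line.strip()
        if line:
            names.add(line.split("\t")[0])
    return names


def unknown_chroms(track_lines: Iterable[str], known: Set[str],
                   limit: int = 5) -> List[str]:
    """Return up to `limit` distinct first-column names not found in `known`.

    The comparison is exact and case-sensitive.
    """
    bad: List[str] = []
    for line in track_lines:
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        name = line.split()[0]
        if name not in known and name not in bad:
            bad.append(name)
            if len(bad) >= limit:
                break
    return bad


# Example use (file names are placeholders):
#   with open("rn7.chrom.sizes") as sizes, open("my_track.bed") as track:
#       print(unknown_chroms(track, load_chrom_names(sizes)))
```

An empty result means every name in the track matches the reference list exactly. Anything returned is a name to rename or filter out before uploading.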

jennaj · December 19, 2023, 8:16pm

Resolved, please try again now.