Hosting to UCSC with Custom Builds

I love Galaxy!

I’m working with the white-tailed deer genome build, which is curated by UCSC in its most recent form as Ovbor_1.2 Oct. 2024 white-tailed deer (Illinois 20LAN1187 2024) (GCF_023699985.2). As near as I can tell, this genome build is not included in the native options on Galaxy.

I was able to send the genome fasta file from the UCSC Table Browser to Galaxy, normalize the fasta to 80 nt per line, and assign it as a custom build using the notation from UCSC (GCF_023699985.2). I’m pretty sure I’ve done this correctly, as I’m able to map RNA-seq reads to the custom build within Galaxy. Awesome.

I’d like to set up a track hub on UCSC by hosting the bigWigs on Galaxy, and I’m well under the data limit. I’m able to do this easily with T2Tv2 data by following the directions UCSC has for hosting data on Galaxy but viewing on UCSC (Track Hubs). The little “graph icon” in the history pulls up various options, one of which is viewing the data on IGV (local) and the other is viewing the data on UCSC. So I have this working with hs1.

However, when I do this with any of my custom genome data, I’m not given the option to view the data on UCSC. Is there some kind of attribute I need to assign to either my custom genome build or my individual data files so that Galaxy knows which UCSC genome build to render the data against? Can anyone advise on how to use the storage provided by Galaxy to host bigWig data files that can be visualized on UCSC for genomes that UCSC has but that Galaxy does not currently have as a native build?

Thanks!

Welcome @GalaxyFrog

Thanks for sharing all of the details! Glad you have this working for the human build!

For this question:

Yes, this can be possible as long as UCSC is hosting the same reference genome (same build/assembly version). You already have the Custom Build key created, so my next guess is that the data you want to visualize does not have the database/dbkey assigned yet. Or have you done that already? This is the last optional step in the FAQ here:


With all that said, there might be an extra wrinkle given that the genome is not one of the primary assemblies, and is instead part of the assembly track hub project. This was an issue before and I’ll check on the status!

With a real example we can troubleshoot this and get it adjusted if needed. I’ll also give this a quick test today to see what happens. If you want to share a history with 1) the reference genome fasta you are using, and 2) some simple/smaller files (BED, BAM, bigWig) with content that matches the genome, that will help my review go quicker. This could also help flush out any genome-specific issues if there are any. Please leave the current custom database key assigned to the datasets (even though I’ll be recreating it in my testing account!).

Please let us know about the database assignment and if that works or not! To create the shared history, you can copy datasets (not move) into a new history. Copies of data in your own account do not consume extra quota – these are just clones. You can try to subset or I can do that for the testing, plus create the other file variants as needed, but including the genome fasta and current database key will be important please.

Thanks! :slight_smile:

Thanks so much!!!

I think I had done most of that. I created a custom genome build database within the “user → custom build” options. I gave it the name I got from UCSC and assigned it a key based on the version build of that genome like this:

Name               Key                   Number of chroms/contigs
GCF_023699985.2    WhiteTailDeer_v1.2    718

I was able to assign this build database to my data files, which now all report their database is “WhiteTailDeer_v1.2”. But again, when I click on the little “graph icon” for each data file, I’m still only given the IGV option not the UCSC option.

Hi @GalaxyFrog

Thanks for explaining more and I think I spot the problem!

For the “database” key (in Galaxy), you will want to use the value UCSC terms the dbkey.

For this genome assembly, that term will be: GCF_023699985.2

Please try going into the custom genome creation page and recreate the Key this way, then assign it to your datasets. In short, try swapping the values you have now when you recreate it. You can delete the one you have now (to avoid mixups later).

Let us know what happens! :scientist:

More details

The UCSC dbkey terms are similar to a unique index identifier, and all of the other artifacts associated with that specific assembly are connected to it, including external connections like a Galaxy link. The other labels can be descriptive or used as supplementary keys for other purposes.

In this screenshot, the dbkey is the last (value) in the genome title at the top. The URL for the view also includes the dbkey as the baseline identifier. All assemblies at UCSC will have the same structure/organization.

Hi @GalaxyFrog

We discussed this directly but let’s share with the community and the ChatGXY help indexes here too.

Hub genomes - data hosting from a public Galaxy server

These are special and not currently supported for direct data hosting from a public Galaxy server. We have been discussing this for some time! I’ve opened a new issue ticket with all of the details for the use case (hosting bigWig files) but also for visualization purposes. For now, IGV can be used for the visualization with any fasta file (see igv).

Hub genomes - data hosting workaround in a private Galaxy server

The ticket above also includes the current workaround for custom data hosting into a Hub assembly. In short, you can host a small personal Galaxy instance and connect it to UCSC display. The Docker version of Galaxy has most of this pre-configured, which means the start up and data saving parameters are different from a full server built from scratch. Please be sure to see the dedicated README.

Workaround for end users

  1. Learn how to host your own Galaxy server here (the Docker version would be recommended for “less” technical use cases!) → Private Galaxy Servers

  2. Learn how to host data from your own Galaxy server in a Hub here → goeckslab/hub-archive-creator on GitHub (a Galaxy tool that prepares your files for Assembly Hub visualization)

Fasta preparation

The chromosome lengths will be computed on demand from a fasta file, however, that fasta should be very clean. For your case, this means removing the description content from the fasta title lines.
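To make it clearer why the title lines matter, here is a minimal Python sketch of how chromosome lengths can be derived from a fasta file. This is only an illustration and not Galaxy's actual implementation: the function name `chrom_lengths` and the demo sequence names are made up, and it assumes the sequence name is whatever follows ">" up to the first whitespace.

```python
import io


def chrom_lengths(handle):
    """Return {sequence_name: length} for a FASTA text stream.

    Illustrative only: assumes the name is the text after '>' up to
    the first whitespace, and that sequence lines hold only bases.
    """
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Hypothetical two-record FASTA; the first title carries a description.
demo = io.StringIO(
    ">NC_000001.1 hypothetical description text\n"
    "ACGTACGT\n"
    "ACG\n"
    ">NW_000002.1\n"
    "GGCC\n"
)
demo_lengths = chrom_lengths(demo)
for seq_name, seq_len in demo_lengths.items():
    print(f"{seq_name}\t{seq_len}")
```

If a tool instead kept the whole title line as the name, the first record above would no longer match the identifiers used in your BAM or bigWig files, which is the kind of mismatch that clean title lines avoid.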

The NormalizeFasta tool can do this. Please see the extra option to “split title line at first whitespace”. You will want to use this, along with wrapping to a very standard width. Wrapping at 80 bases is the original specification and tends to work best with any tool! If you were working on the command line, both would be needed. In Galaxy this can be optional, but if you loathe odd random errors as much as I do, standardizing reference data at the start is a worthwhile time investment.
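If you are curious what those two steps amount to outside of Galaxy, here is a rough Python sketch. It is not the NormalizeFasta tool's code, just an approximation of the same idea under my own assumptions: the function name `normalize_fasta` is hypothetical, titles are cut at the first whitespace, and each sequence is re-wrapped to a fixed width (80 by default).

```python
def normalize_fasta(lines, width=80):
    """Yield FASTA lines with trimmed titles and fixed-width sequence.

    Sketch only: buffers one full sequence in memory before re-wrapping,
    which is fine for a demo but heavy for very large chromosomes.
    """
    seq_parts = []

    def flush():
        # Emit the buffered sequence in chunks of `width` characters.
        seq = "".join(seq_parts)
        seq_parts.clear()
        for start in range(0, len(seq), width):
            yield seq[start:start + width]

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            yield from flush()
            parts = line[1:].split()
            # Keep only the identifier, dropping any description text.
            yield ">" + (parts[0] if parts else "")
        else:
            seq_parts.append(line)
    yield from flush()


# Tiny made-up example, wrapped at 5 bases so the effect is visible.
messy = [">chr1 some description", "ACGTACGTAC", "GT", ">chr2", "GGCC"]
cleaned = list(normalize_fasta(messy, width=5))
print("\n".join(cleaned))
```

For a real genome you would read from and write to files and keep the default width of 80. Inside Galaxy, the NormalizeFasta tool remains the simpler route.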

Not sure how? Please see → FAQ: How to use Custom Reference Genomes?



Thanks for all the follow-up! The developers will review the ticket next Tuesday and we’ll hear their thoughts on it directly.

Ok! Thanks so much for your help. I made the updates to my fasta file and re-assigned that build (with the correct key name) to my datasets. I guess for now I’m limited to viewing the data on Galaxy with IGV/IGB, with the understanding that the developers are aware the genomes on the Hub project are not currently supported, and that it sounds like this issue has been known for a decade. Hopefully there is a reconsideration of the Hub project.