send files from Galaxy to UCSC's EU mirror

Hi,
genome.ucsc.edu is down due to a disk array failure last night. It’ll come back soon, but in the meantime, can Galaxy users open their files on genome-euro.ucsc.edu? Is there any workaround for them to get the URL to the file on Galaxy and paste it into genome-euro.ucsc.edu?

Thanks!
Max

A UCSC user wrote to our support desk:

I have tried using the eu site, but I need to be able to add custom tracks from Galaxy. When opening the tracks via Galaxy, the default is to open them in the main version of UCSC, therefore I cannot get them to open.

I have tried adding the tracks manually, but one is a BAM file and the other is a VCF, and when I try to upload these they both either come out blank or display an error. Do you have any advice on how I can get galaxy data to be displayed in the eu version rather than the main browser, as I think that would fix my issue?

Does this mean that Galaxy blocks requests based on IP address? Could you unblock genome-euro.ucsc.edu and genome-asia.ucsc.edu ?

Hi @Maximilian_Haeussler

What to try:

  1. Have the user set the Galaxy history to a shared state
  2. Attempt to copy/paste the dataset URL
  3. IP address wouldn’t be a factor as far as I know
  4. The type of data would matter. I don’t think any UCSC sites support a BAM upload, and not sure about VCF. Or maybe that changed on your end? These data are usually hosted at Galaxy, then the direct link to UCSC transfers slices of data and not the entire file at once.

Please let me know what you think. :slight_smile:

Yes, I understand that UCSC does not store data, it’s loaded via URL from Galaxy.

OK, the outage is resolved now, but this will come up again: what is the easiest way for a user to open a datafile on genome-euro.ucsc.edu and not genome.ucsc.edu ?

I suspect that the “open in UCSC” button in Galaxy has some special sharing to allow the file to be opened by genome.ucsc.edu only? Opening a file in UCSC doesn’t require the user to put it into a shared state, right?

Hi @Maximilian_Haeussler

This is how I also understood it. The direct link-out hosts the data in slices. Uploading directly by URL fails.

From what I could guess by testing loading data to UCSC (custom track view) by URL:

  1. VCF worked fine for a very small file (header + one data line). I didn’t test a larger file. Maybe there is a data size limit and that is what the UCSC user was attempting?
  2. BAM failed each attempt, even with a small file. It seems the function is looking for a myfile.bam.bai index file to upload along with the myfile.bam. If I added both URLs into the same data loading submission, that also failed. It seems that the function might be looking in the same place as the BAM URL location for the associated index, but that is just a guess.

There is a global server setting to point to a “UCSC server” for all functions. Right now, that is a single hardcoded value. Some ideas that came up yesterday are below. I’m not stuck on a specific solution :slight_smile: and maybe we can discuss with stakeholders + developers.

  1. Make a change on the Galaxy side to store multiple UCSC targets.
  • Have those all accessible to the admins for quick updates. This seems like it would be difficult to communicate to all server admins within a timeframe that would be worth it and would take some work to do (backend + UI/UX). It expects admins to be available for those “quick updates” and users would have to wait.
  • Add in a user preference (per account) that the user can set, to control the target UCSC server for both link-outs to UCSC and getting data from the Table Browser. I like this idea the best but it would take some work to implement.
  2. Make a change on the UCSC side to auto-redirect to the preferred UCSC server.
  • The redirect would be controlled by UCSC admins in real time and would involve some work.
  • Might be complicated to tune the handshake, on both sides.
  • Users would not have control over this, but it also has the least friction versus how it works now.
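
To make the per-user preference idea concrete, here is a minimal sketch of how a link-out URL could be built for a selectable UCSC server. It is not Galaxy's actual implementation: the `UCSC_MIRRORS` labels and the `ucsc_linkout` function are hypothetical names, and the sketch assumes the target server accepts a custom track URL through the `hgt.customText` parameter of `hgTracks`.

```python
from urllib.parse import urlencode

# Hostnames are the ones mentioned in this thread; the keys are hypothetical labels
# that a user preference could store.
UCSC_MIRRORS = {
    "main": "genome.ucsc.edu",
    "euro": "genome-euro.ucsc.edu",
    "asia": "genome-asia.ucsc.edu",
}


def ucsc_linkout(track_url, db, mirror="main"):
    """Build an hgTracks URL on the chosen mirror that loads a custom track by URL.

    Assumption: the mirror accepts the track URL via the hgt.customText parameter.
    Raises KeyError if the mirror label is unknown.
    """
    host = UCSC_MIRRORS[mirror]
    # urlencode percent-encodes the track URL so its own "?" and "&" survive.
    query = urlencode({"db": db, "hgt.customText": track_url})
    return f"https://{host}/cgi-bin/hgTracks?{query}"
```

With a stored preference of `"euro"`, the same dataset would open on genome-euro.ucsc.edu instead of the main site, and an admin-level default could be expressed by changing the default value of `mirror`.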

The open in UCSC link-outs involve a handshake between the servers. The links will only show up in Galaxy if the datatype and database/dbkey for that data are appropriate for UCSC. Sharing state is not a factor for this type of transfer. Instead, what matters is whether the data is in a Galaxy account that is currently logged in and whether the target UCSC server is available. If yes, the data is sent.

Transferring data by URL is a bit different. Those URLs are valid for anywhere. If you know the link, the data can be transferred and read at the destination. The reason the URL data transfer might need to have the history sharing state set is because of the administrative “GDPR-mode” some servers apply. Default for GDPR is any data retrieval by URL is restricted/private unless specifically granted by setting the history (or account) permissions.

Users can specify the data sharing state (by URL) as an account level user preference – or per history. I wasn’t sure where that user was working so suggested setting the history to shared in case permissions were the problem on the Galaxy side. Sharing state might have been an issue for the user’s VCF (or it was too large?) but it wasn’t for my test file. Their BAM would have failed since what UCSC is trying to read and the single URL for the data don’t quite match up for what UCSC needs. Solving the data/URL send/read for the BAM file type is another thing that could be tuned up, but maybe not as the primary solution since the file sizes will be a limit. Addressing the link-out function seems more important but all this can be discussed :slight_smile:

  1. VCF worked fine for a very small file (header + one data line). I didn’t test a larger file. Maybe there is a data size limit and that is what the UCSC user was attempting?

I don’t think that Galaxy uses that. Yes, you can upload small VCF files, but not big ones. We are not Galaxy, we cannot store big files for users for too long. We’ll add something like this, but don’t have the feature yet.

  1. BAM failed each attempt, even with a small file. It seems the function is looking for a myfile.bam.bai index file to upload along with the myfile.bam. If I added both URLs into the same data loading submission, that also failed. It seems that the function might be looking in the same place as the BAM URL location for the associated index, but that is just a guess.

You cannot upload BAM files at all into UCSC. You can only paste a URL to a BAM file into the custom track box, and yes, it expects the .bai next to it, but you can use the setting bigIndexUrl= to point to the bai file, if it’s not stored next to the bam.

A line like this should work:

track=test type=bam bigDataUrl=URL-to-BAM bigIndexUrl=Url-to-BAI
This is documented here, but should probably be more widely highlighted. I added notes to various doc pages now. Thanks for bringing this up.
https://genome.ucsc.edu/goldenPath/help/trackDb/trackDbHub.html#bigDataIndex

Add in a user preference (per account) that the user can set, to control the target UCSC server for both link-outs to UCSC and getting data from the Table Browser. I like this idea the best but it would take some work to implement.

This seems technically the easiest route to me, as a Galaxy-naive programmer. It has the added advantage that local server admins can point to the EU server, e.g. if data protection is an issue, and that users could point to their “own” on-site UCSC mirror when they work with Galaxy.

(It’s a lot easier now than in the past to set up a UCSC mirror site. We have a download+click VM image now and a docker container. You can have your own mirror soon with a single “docker run” command.)

  1. Make a change on the UCSC side to auto-redirect to the preferred UCSC server.

We cannot redirect when the UCSC server is down.

The open in UCSC link-outs involve a handshake between the servers. The links will only show up in Galaxy if the datatype and database/dbkey for that data are appropriate for UCSC. Sharing state is not a factor for this type of transfer. Instead, what matters is whether the data is in a Galaxy account that is currently logged in and whether the target UCSC server is available. If yes, the data is sent.

Please note that we now have thousands more genomes available than before, including a lot of GCA_ and GCF_ genomes. We have many, many more genomes than shown on the tree on hgGateway: https://hgdownload.soe.ucsc.edu/hubs/. This should probably be a different ticket: support NCBI Assemblies for UCSC link-outs. We have a text file with these accessions (see the URL above) and Galaxy could pull the list once per night. But yes, different ticket.
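
As a rough illustration of the nightly pull, the sketch below extracts GCA_/GCF_ assembly accessions from a block of text. The layout of the UCSC accession file is not described in this thread, so the code only assumes that accessions appear somewhere in the text in the standard NCBI form (prefix, nine digits, a dot, a version number). `extract_accessions` is a hypothetical helper, and downloading the file is left out on purpose.

```python
import re

# NCBI assembly accessions: GCA_ (GenBank) or GCF_ (RefSeq), nine digits,
# a dot, then a version number, e.g. GCF_000001405.40.
ACCESSION_RE = re.compile(r"\bGC[AF]_\d{9}\.\d+")


def extract_accessions(text):
    """Return the unique assembly accessions found in text, in first-seen order."""
    seen = {}
    for match in ACCESSION_RE.finditer(text):
        # dict keys keep insertion order, which de-duplicates without reordering
        seen.setdefault(match.group(0), None)
    return list(seen)
```

A nightly job could compare the returned list with the dbkeys Galaxy already knows and enable link-outs only for the matches.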

Transferring data by URL is a bit different. Those URLs are valid for anywhere. If you know the link, the data can be transferred and read at the destination. The reason the URL data transfer might need to have the history sharing state set is because of the administrative “GDPR-mode” some servers apply. Default for GDPR is any data retrieval by URL is restricted/private unless specifically granted by setting the history (or account) permissions.

This makes sense. I didn’t know about GDPR mode. Thanks!

Users can specify the data sharing state (by URL) as an account level user preference – or per history. I wasn’t sure where that user was working so suggested setting the history to shared in case permissions were the problem on the Galaxy side. Sharing state might have been an issue for the user’s VCF (or it was too large?) but it wasn’t for my test file. Their BAM would have failed since what UCSC is trying to read and the single URL for the data don’t quite match up for what UCSC needs. Solving the data/URL send/read for the BAM file type is another thing that could be tuned up, but maybe not as the primary solution since the file sizes will be a limit. Addressing the link-out function seems more important but all this can be discussed :slight_smile:

I don’t fully understand, but it sounds as if BAM loading onto UCSC is fundamentally broken, which it shouldn’t be. If this is really true (I have trouble believing it, I’m relatively sure that I displayed a BAM file from Galaxy on UCSC years ago…), then this sounds like another ticket, to add the bigDataIndex=xxx to the custom track line.

Great! I didn’t know about that option.

The submission was successful for a smaller pair of BAM + BAI files. Both URLs are available in the Galaxy UI. But there was an error message. Do you recognize what is going wrong? Or, are you able to load a BAM from a Galaxy history by URLs? A working example would be helpful.

Test history: Galaxy

What I entered in the second box of the custom track form:

track=test type=bam bigDataUrl=https://usegalaxy.org/api/datasets/f9cad7b01a472135b6ce5d10a7ecd18b/display?to_ext=bam bigIndexUrl=https://usegalaxy.org/api/datasets/f9cad7b01a472135b6ce5d10a7ecd18b/metadata_file?metadata_file=bam_index


To check if that particular BAM was a problem, I tried with another. The form will not submit. Maybe I triggered some kind of block with too many failures.

This is the string for that data (dataset 5 in the test history). The URLs are captured from the disk icon.

track=testbam type=bam bigDataUrl=https://usegalaxy.org/api/datasets/f9cad7b01a47213532ebe265768e88e2/display?to_ext=bam bigIndexUrl=https://usegalaxy.org/api/datasets/f9cad7b01a47213532ebe265768e88e2/metadata_file?metadata_file=bam_index

And, I tried going from another server. The submission form went through, but the data didn’t show up on the “Manage Custom Tracks” view. So, three different results :crazy_face:. I’m probably doing something wrong.

History: Galaxy

String used at UCSC:

track=test type=bam bigDataUrl=https://usegalaxy.eu/api/datasets/4838ba20a6d86765d1ffc92a1647911d/display?to_ext=bam bigIndexUrl=https://usegalaxy.eu/api/datasets/4838ba20a6d86765d1ffc92a1647911d/metadata_file?metadata_file=bam_index

Hi Jennifer,

I typed too quickly and should have tested the custom track line. My colleagues looked at it, and this one is correct:

track type=bam bigDataUrl=https://usegalaxy.org/api/datasets/f9cad7b01a472135b6ce5d10a7ecd18b/display?to_ext=bam bigDataIndex=https://usegalaxy.org/api/datasets/f9cad7b01a472135b6ce5d10a7ecd18b/metadata_file?metadata_file=bam_index

It’s called “bigDataIndex” (as I linked in the docs, but I mistyped the tag in the example).
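
For reference, the corrected line can be assembled from a Galaxy server URL and a dataset ID as in the sketch below. The two URL patterns are copied from the examples in this thread and are an assumption about how a given Galaxy server exposes BAM data and its index. `galaxy_bam_track_line` is a hypothetical helper, not part of Galaxy or the Genome Browser, and a well-formed line does not guarantee that UCSC can fetch both URLs.

```python
def galaxy_bam_track_line(server, dataset_id):
    """Build a UCSC custom track line for a BAM dataset hosted on a Galaxy server.

    Assumes the dataset and its index are reachable at the API paths
    used in this thread (display?to_ext=bam and metadata_file?...=bam_index).
    """
    base = f"{server.rstrip('/')}/api/datasets/{dataset_id}"
    bam_url = f"{base}/display?to_ext=bam"
    bai_url = f"{base}/metadata_file?metadata_file=bam_index"
    # bigDataIndex (not bigIndexUrl) is the tag that points at the .bai file
    return f"track type=bam bigDataUrl={bam_url} bigDataIndex={bai_url}"
```

Calling it with `"https://usegalaxy.org"` and the dataset ID from the test history reproduces the line quoted above, ready to paste into the custom track box.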

But it seems that what prevents this from working is the part after the “?”. QA tried downloading the files into a local webserver and that works, so it’s not the files. While we’re looking into it, does Galaxy have a way to access files without the “?” query parameters?

Not that I know of. Let’s bring in others to confirm. Chat link: You're invited to talk on Matrix

Hello,

Thank you for using the UCSC Genome Browser and reporting your issues.

An engineer ran the following tests and no issues were reported:

  • Fetch test on the Galaxy BAM file
  • Fetch test via a Genome Browser HTTP library on the Galaxy BAM file
  • Fetch test on the Galaxy BAM index file

However, a fetch test via a Genome Browser HTTP library on the Galaxy BAM index file failed. It appears that the metadata endpoint does not support the HEAD method, which can be seen with “curl -I”:

curl -I "https://usegalaxy.eu/api/datasets/4838ba20a6d86765d1ffc92a1647911d/metadata_file?metadata_file=bam_index"
HTTP/1.1 404 Not Found

The HEAD method does work on the main Galaxy BAM file:

curl -I "https://usegalaxy.eu/api/datasets/4838ba20a6d86765d1ffc92a1647911d/display?to_ext=bam"
HTTP/1.1 200 OK

We recommend adding support for the HEAD method for metadata.
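
The same check can be scripted. The following is a minimal sketch using only the Python standard library. `head_status` is a hypothetical helper name, and the optional `opener` argument is an assumption added so the function can be pointed at a custom `urllib` opener (for example one without proxies).

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10.0, opener=None):
    """Send a HEAD request to url and return the HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    open_fn = opener.open if opener is not None else urllib.request.urlopen
    try:
        with open_fn(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report their status code
        return err.code
```

Run against the two usegalaxy.eu URLs above, the results reported in this post would correspond to 200 for the BAM file and 404 for the BAM index. An endpoint is usable by the Genome Browser only when the HEAD request succeeds for both.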

I hope this is helpful. If you have any further questions, please reply to genome@soe.ucsc.edu.
All messages sent to that address are archived on a publicly accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

Jairo Navarro
UCSC Genome Browser


Great, thank you @jnavarr5 for sorting out the exact issue!

Issue ticket: Enhancement: Add functionality for direct BAM index retrieval via URL from Galaxy to UCSC · Issue #16074 · galaxyproject/galaxy · GitHub

The HEAD request for metadata files has been implemented in “Allow HEAD request for requesting metadata files” by martenson (Pull Request #16113, galaxyproject/galaxy on GitHub) and will be available in the next release.
