Extracting MAF blocks from regions of the human GRCh38/hg38 assembly

Hi, when I tried to extract MAF blocks from a set of regions of the human 2013 GRCh38/hg38 assembly, the message “No options available” appeared under ‘Choose species’. The same operation works on the 2009 assembly GRCh37/hg19. However, if I send MAF alignments from the UCSC Table Browser using the 2013 GRCh38/hg38 assembly, I can join them and choose species. But these are whole alignments, not blocks, so they don’t help with getting MAF blocks for specific regions, i.e., exons.
It would be great if the MAF blocks from the GRCh38/hg38 assembly could be extracted directly from genome coordinates of that assembly. I hope this can be fixed soon.
Thanks!

Hi @ccasola

True, the hg38 MAF data is not pre-indexed at Galaxy Main https://usegalaxy.org. Only hg19 is. This might be updated in the future – but it won’t be soon and may not happen at all. It is part of a larger project, and there are some challenging technical/development items to address first.

The Galaxy EU https://usegalaxy.eu server does not have either assembly indexed. They may be able to add hg38 (they use a different data organization structure – for now – soon-ish all usegalaxy.* servers will have data synced up). But some of the same issues may come up at EU (specifically, the new Data Manager might be required before they can add this index). Ping @hxr @bjoern.gruening

What you can do: Load the MAF data from UCSC into your account at either server, or wherever you are working. It will be very large and will come in chunks (per chromosome) that need to be combined. It should be sourced NOT from the UCSC Table Browser but from their Downloads area (Conservation). Warning: you might run into quota space problems due to the size of the data. Give it a try first, and if you run into space problems, explain where you are working in a reply and we can follow up. Both of the servers above can temporarily grant extra quota for special purposes like this – IF and when actually needed. The process is different for each. You must be an academic for extra quota at usegalaxy.org. I’m not certain if that is required for usegalaxy.eu but we can sort that out.
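If you end up combining the per-chromosome chunks outside of Galaxy before uploading, something like the following might work. This is only a minimal sketch, not an official Galaxy or UCSC procedure: the function names (`open_maybe_gzip`, `combine_maf_files`) are made up for illustration, and it assumes each chunk is a standard MAF text file (optionally gzip-compressed) whose header and comment lines start with `#`. It keeps the header from the first file only and drops comment lines that appear after the first alignment block.

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path):
    """Open a plain-text or gzip-compressed file for reading as text."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def combine_maf_files(paths, out_path):
    """Concatenate per-chromosome MAF files into a single MAF file.

    Leading '#' lines are copied from the first file only, so the output
    carries one header. All other '#' lines are dropped.
    Returns the number of alignment blocks written.
    """
    blocks = 0
    with open(out_path, "wt") as out:
        for index, path in enumerate(paths):
            seen_block = False
            last_line = "\n"
            with open_maybe_gzip(path) as handle:
                for line in handle:
                    if line.startswith("#"):
                        # Keep only the header of the first file.
                        if index == 0 and not seen_block:
                            out.write(line)
                        continue
                    # An alignment block starts with a line whose first token is 'a'.
                    if line.split(maxsplit=1)[:1] == ["a"]:
                        seen_block = True
                        blocks += 1
                    if not line.endswith("\n"):
                        line += "\n"
                    out.write(line)
                    last_line = line
            # Make sure the last block of each file is closed by a blank line.
            if last_line.strip():
                out.write("\n")
    return blocks
```

Calling `combine_maf_files(["chr1.maf.gz", "chr2.maf.gz"], "combined.maf")` (hypothetical file names) would write one uncompressed MAF file. The output is not compressed, so check your available disk space and quota before running it on whole-genome data.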

Other public servers, in particular Galaxy AU https://usegalaxy.org.au, already have a larger default quota allocation, so you could decide to do your work there. If you are an Australian national, the default quota is larger and there are other resources available. I also didn’t check whether they have hg38 MAFs pre-indexed for these tools already, but you could: create an account if you don’t have one, load any BED dataset (test size/a few lines is enough), assign it to the genome hg38 “database”, and see what options show up for the tools.

Alternatives include setting up your own Galaxy with sufficient processing and space resources. Indexing for these particular tools is non-trivial command-line work, and a Data Manager for these tools does not exist yet. Unless you are willing to try older scripts, troubleshoot, etc. – just load the MAF data into your server and use it from a history. It could be put into a Data Library so that copies in histories do not consume extra space each time you want to work with it. Data cannot be used in analysis directly from a Data Library – it needs to be in a history or pre-indexed correctly on the server first.

We realize this is much more complicated than it should be, but those are the current options. Ask if you have questions about any of this - I didn’t add in a bunch of links for options that you do not intend to try (yet).

Thanks!

I forgot to address this. In order to get “blocks” (putatively exons), a BED12 dataset is needed, or BED6 data that already represents exons. These can be extracted from existing tracks with the UCSC Table Browser. If you just have a BED dataset that you created yourself, and it represents an entire transcript, try the tool **Stitch MAF blocks** given a set of genomic intervals (Galaxy Version 1.0.1). Some experimenting with the tools in this group will be needed since we don’t know what data you have now.
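To make the BED12 versus per-exon BED6 distinction concrete, here is a rough sketch of how the block columns of one BED12 line expand into one interval per exon. It assumes the standard BED12 column layout (`blockCount`, `blockSizes`, and `blockStarts` in columns 10 to 12, with block starts relative to `chromStart`). The function name `bed12_to_exons` and the `_exonN` naming are illustrative only and are not what any Galaxy tool does internally.

```python
def bed12_to_exons(line):
    """Expand one tab-separated BED12 line into BED6-style exon tuples.

    Returns a list of (chrom, start, end, name, score, strand) tuples,
    one per block, using 0-based half-open coordinates as in BED.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated BED fields")

    chrom = fields[0]
    chrom_start = int(fields[1])
    name, score, strand = fields[3], fields[4], fields[5]
    block_count = int(fields[9])
    # Block lists are comma-separated and may end with a trailing comma.
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    if len(sizes) != block_count or len(offsets) != block_count:
        raise ValueError("blockCount does not match blockSizes/blockStarts")

    exons = []
    for number, (size, offset) in enumerate(zip(sizes, offsets), start=1):
        start = chrom_start + offset
        # Exons are numbered in genomic order, regardless of strand.
        exons.append((chrom, start, start + size, f"{name}_exon{number}", score, strand))
    return exons


# A made-up transcript with three blocks, for illustration only.
example = "chr1\t1000\t5000\ttx1\t0\t+\t1000\t5000\t0\t3\t100,200,300,\t0,1500,3700,"
for exon in bed12_to_exons(example):
    print("\t".join(str(value) for value in exon))
```

Each printed line is a separate interval that could be used as a genomic region for the MAF tools, whereas the original BED12 line describes the whole transcript span plus its internal blocks.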

How the UCSC Table Browser handles data/blocks is different from how the MAF tools in Galaxy handle them. The options are not the same, which is why the Galaxy tools exist. You could contact UCSC at their Google forum and see what options they have. Warning: this will probably include command-line work, and you will not have the analysis tracking/histories/reproducibility that Galaxy provides. But how to do this is your choice.