Download gnomAD vcf.bgz.tbi dataset into Galaxy

I am failing to download the gnomAD dataset into Galaxy. Galaxy cannot recognize the vcf.bgz.tbi extension, I think.
Would you suggest how can I download this dataset?


Hello @matty5

The data is sourced from here, correct? https://gnomad.broadinstitute.org/downloads

Choices:

  1. Download the data using the methods described by the data provider. Upload just the .vcf dataset to Galaxy. The resulting vcf will work with more of the tools wrapped for Galaxy. This is your best option.
  2. Copy/paste the URL into the Upload tool. You may need to assign the datatype vcf_bgzip during upload. These data can be very large and may not load via URL (a minimal upload-by-URL sketch follows this list).
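If you want to script option 2 instead of using the web Upload form, here is a minimal sketch using BioBlend (the Galaxy API client). The server URL, API key, history name, and gnomAD file URL below are placeholders, and the keyword arguments follow BioBlend's upload conventions, which can differ between releases, so treat this as a rough starting point rather than the definitive method.

```python
# Rough sketch (not the official method): hand a gnomAD URL to Galaxy's Upload
# tool via BioBlend, pinning the datatype instead of relying on autodetection.
# Server URL, API key, history name, and the file URL are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.eu", key="YOUR_API_KEY")
history = gi.histories.create_history(name="gnomAD upload test")

gnomad_url = "https://example.org/path/to/gnomad.sites.vcf.bgz"  # placeholder

gi.tools.put_url(
    gnomad_url,
    history_id=history["id"],
    file_type="vcf_bgzip",  # assign the datatype up front
)

# Very large files may still hit server-side timeouts, as described above.
```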

I’m running tests with the Upload tool that use “autodetect” for the datatype (test 1) and directly assign the datatype (test 2), to see which works AND whether the connection persists long enough to load the complete dataset by URL. It has been running for a few hours now… and both may fail due to a timeout. It depends on the limits the data provider sets for URL downloads. They want you to use their own download tools, to manage resources.

vcf_bgzip is a datatype that very few tools in Galaxy will accept at this time. Also, the genome (database) those data are based on may not be indexed for tools. Do not assign vcf or vcf_bgzip to data that is not really in that format, or expect problems. GATK’s GRCh37 data will match up with the genome hg_g1k_v37 in Galaxy (GATK’s version of GRCh37/hg19, sourced directly from GATK). This genome is NOT indexed for all tools at public servers. The GRCh38 data will not match up with any native indexes unless you modify the chromosome identifiers to match hg38 with a tool like “Replace column by values which are defined in a convert file”. The GATK identifiers are similar in format to those from Ensembl (a subset of the Ensembl chromosomes).

Unless you are running your own Galaxy and the genome indexes have been created from the genome fasta sourced from Broad/gnomAD, that version of the human genome may not be indexed at the public Galaxy server you are working at. Identifiers will need to be modified to match UCSC’s hg19 or hg38 genome build version. Do not mix up data from GRCh37/hg19 and GRCh38/hg38, or tools will fail or you’ll get unexpected results – these are different genome assemblies. There are also some data sorting differences…

A few more complications: GATK genome builds contain a subset of the chromosomes included in the Ensembl/UCSC genome builds (and Ensembl contains more than UCSC). GATK includes the autosomes, the X chromosome, and the mitochondrial chromosome. Ensembl and UCSC also include the Y chromosome plus haplotypes, unmapped contigs, patches, etc. UCSC includes the original release; Ensembl includes the original plus updates. But if you start with the GATK data, only the GATK chromosomes will be in the original file. Those were in the original release that both Ensembl and UCSC contain, and they will convert over using an Ensembl-to-UCSC mapping file (see the tool I named above; this “convert file” is available).
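If you prefer to do the identifier conversion outside Galaxy, below is a minimal Python sketch of the same idea as the “Replace column by values which are defined in a convert file” tool: it rewrites Ensembl/GATK-style chromosome names (1, 2, …, X, MT) in the CHROM column to UCSC-style names (chr1, chr2, …, chrX, chrM). The file names and the small mapping are illustrative; a real conversion should use a complete mapping file for your exact genome build.

```python
# Illustrative sketch only: rename Ensembl/GATK-style chromosome identifiers to
# UCSC-style in a VCF. File names and the mapping are placeholders; use a complete
# Ensembl-to-UCSC mapping for your exact assembly in real analyses.
mapping = {str(i): f"chr{i}" for i in range(1, 23)}
mapping.update({"X": "chrX", "Y": "chrY", "MT": "chrM"})

with open("input.ensembl_names.vcf") as src, open("output.ucsc_names.vcf", "w") as dst:
    for line in src:
        if line.startswith("#"):
            # Header lines pass through unchanged here; strictly speaking,
            # the ##contig IDs should be renamed as well.
            dst.write(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] in mapping:          # column 1 of a VCF record is CHROM
            fields[0] = mapping[fields[0]]
            dst.write("\t".join(fields) + "\n")
        # Records on contigs with no UCSC equivalent are dropped.
```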

Summary:

  1. Broad-released data may be in a format (datatype) that is not accepted by tools in Galaxy written by other developers, including other third-party developers.
  2. The gnomAD data can be very large. You may need to download it using the tools the data source provides.
  3. Upload data in a format that the tools in Galaxy can interpret. There may be some information loss. In most cases, only the uncompressed vcf should be loaded to Galaxy since that is what other tools written by other developers work with (mostly, and for now).
  4. Broad (GATK/gnomAD) released data have content that is similar to Ensembl’s: specifically, the chromosome identifier names.
  5. Broad/Ensembl both use a different chromosome identifier naming format than UCSC.
  6. If you plan to download results from Galaxy, you’ll have the reverse problem when using the Broad tools. Know what you plan to do and prepare your data to meet the formatting requirements tools written by different developers may have. Or, you can set up your own Galaxy and index whatever external genome you want to.
  7. Data sorting problems may come up. If they do, try sorting your vcf after the chromosome naming/genome (database) changes have been done.
  8. There is usually some way to manipulate the data format/chromosome naming to make data usable across tools. This may include using tools supplied by the original data provider, alone or in combination with tools in Galaxy, or using your own methods, before using the data with tools from other developers that you may find in Galaxy.

Thanks!

Update:

Data from https://gnomad.broadinstitute.org/ will have a few issues once loaded by URL into Galaxy. The data does load with the Upload tool. I tested at Galaxy EU https://usegalaxy.eu and ran the Upload tool twice – both were successful.

  1. The data will be in vcf_bgzip format (autodetected or assigned). There wasn’t a good way to extract just the vcf from the compressed vcf_bgzip format, or to have tools recognize and use the compressed format “as is” and uncompress to vcf during job runtime. Tools expect uncompressed vcf as an input, including Gemini. We are addressing the vcf_bgzip and vcf issues right now (a minimal local decompression sketch follows this list).
  2. The full dataset for v3 is very large, even compressed (235.7 GB). Data this large will not only consume a great deal of account quota but will be too large to manipulate with tools at any public Galaxy server. Meaning, increasing your account quota will not help. You could, however, set up your own Galaxy and allocate sufficient resources.
  3. The v2 version of the data has two version groups. One is mapped against GRCh37/hg19 and the other is “lifted” (liftOver) to what appears to be the Ensembl version of the GRCh37 genome. The UCSC-sourced hg19 reference genome is indexed at most public Galaxy servers for most tools, including Gemini at Galaxy EU https://usegalaxy.eu. Gemini is not currently available at Galaxy Main https://usegalaxy.org.
  4. The v3 version of the data has one version group (as of now), and it just states that it is mapped against GRCh38. They do not specify whether that is UCSC-sourced hg38 or not. From reviewing the content, it appears to be “mostly” hg38 but with extra content. Where to download the version of the genome they used is not clear.
  5. The “ExAC” data – it is not clear which release or genome it is based on.
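As a stopgap for item 1, the plain vcf can be extracted outside Galaxy before uploading. BGZF (.bgz) is a blocked variant of gzip, so Python's standard library can decompress it; the file names below are placeholders, and for gnomAD-sized files expect a very large output.

```python
# Minimal sketch: extract an uncompressed .vcf from a bgzip-compressed .vcf.bgz.
# BGZF is gzip-compatible, so the standard gzip module can read it.
# File names are placeholders; stream in chunks because the files are huge.
import gzip
import shutil

with gzip.open("gnomad.sites.vcf.bgz", "rb") as compressed, \
        open("gnomad.sites.vcf", "wb") as plain:
    shutil.copyfileobj(compressed, plain, length=16 * 1024 * 1024)
```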

Details:

https://gnomad.broadinstitute.org/faq

Summary:

  • The complete v2 or v3 dataset cannot be used at public Galaxy servers due to the size. It is simply too large, even after the changes that let tools accept the data as input(s) are applied.
  • Per-chromosome datasets for v2 might be possible to use after the updates for compressed/uncompressed vcf are completed and applied to servers (not done yet). Try assigning the database to “hg19”. If jobs fail because of a genome mismatch, then the data cannot be used at a public Galaxy server. Even if you obtained the genome in fasta format, it would be too large to use as a Custom Genome/Build.
  • Per-chromosome datasets for v3 will probably not work due to the custom GRCh38 genome used. It is close to UCSC-sourced hg38 but not an exact match. Expect genome mismatch problems if hg38 is assigned (the quick identifier check sketched below can help you judge how close the naming is).
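Before assigning “hg19” or “hg38”, it can help to check which chromosome identifiers a per-chromosome VCF actually uses. The sketch below simply tallies the CHROM column so you can compare the naming against UCSC-style (chr-prefixed) identifiers; the file name is a placeholder.

```python
# Minimal sketch: list the distinct chromosome identifiers in a bgzipped VCF.
# UCSC builds (hg19/hg38) use "chr"-prefixed names; Ensembl/GATK-style builds do not.
# The file name is a placeholder.
import gzip
from collections import Counter

counts = Counter()
with gzip.open("gnomad.chr21.vcf.bgz", "rt") as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue
        counts[line.split("\t", 1)[0]] += 1

for chrom, n in sorted(counts.items()):
    print(chrom, n, "records")
```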

Questions for clarification about any of the above could be sent to the data authors, e.g.: What is the exact reference genome the variants were called against? Do they provide that genome fasta, or is it from a different public source (and where, exactly)? If that genome is available, you could index it at your own Galaxy server for tools.

Sorry, we couldn’t help more. I’m closing out your other question as a duplicate. Next time, please keep follow-up questions in the same post, unless the subject changes. These appear to be the same, just with some added context.


Hi @jennaj,
Thanks a lot for your answers.

In the meantime I tried to upload “gnomad.genomes.r2.1.1.exome_calling_intervals.sites.vcf.bgz” with the Galaxy EU Upload tool. It worked, apparently, but it automatically assigned the vcf_bgzip datatype (~10 GB). It is not recognized by the GEMINI annotate tool (as you also said), even when I tried to change the format with the pencil icon.

So, I tried the upload again, assigning the type vcf in the upload window. It uploaded a vcf file of ~75 GB. I ran GEMINI annotate (with my GEMINI load output and the gnomAD database). It worked. I did GEMINI inheritance and added additional columns with the Join tool, using additional databases, to customize the analysis. Apparently everything looked fine.

Beyond this particular request to upload gnomAD, I have a few general questions:

  1. To decompress vcf files: is it right to upload them with the vcf type set in the upload window, or am I introducing errors?
  2. I did this analysis on 1 patient, apparently successfully, and I would like to perform the same analysis on multiple patients (~40), maybe using the multiple datasets option in gemini. In this case, does the “Sample and family information in PED format” field in gemini load require a separate file for each patient vcf?
  3. I didn’t find the snpeff_eff tool to annotate variants on Galaxy EU, so I did it on usegalaxy.org. Is it maybe under another name?
  4. Is there any gene expression database that I could add easily, or that is already present in Galaxy?
  5. Please tell me more about storage space (or suggest a link I can read about that), since I have already used 44%… and also point me to resources about SQL syntax for advanced queries, for filtering purposes.

Thanks a lot for your answers. I am a molecular geneticist with no programming experience and new to Galaxy 🙂 🙂

Matty


@jennaj, @matty5
oops, I was thinking that both gemini load and gemini annotate were already accepting vcf_bgzip as input, but that’s not the case.
I just added the necessary logic to these tools, which means, @matty5, that you only need to be patient until these changes make it into the Galaxy toolshed and from there onto usegalaxy.eu. Expect a couple of days for that. In the meantime, uploading the data as vcf (decompressing it during the process) can serve as a workaround as you already found out (but it consumes quota unnecessarily).

Now for your new questions:

  1. This is okay. As I said, it’s just less ideal because the data uses more space than necessary in uncompressed form.
  2. If you really want to run separate analyses for each sample, then your best option is to turn the one analysis you completed into a workflow, and run this workflow 40 times (providing a different VCF and a different PED dataset at each run).
  3. Yeah, we are aware of this, and it’s rather annoying. snpeff_eff is there, but isn’t found when you search for it (just like a couple of other snpeff tools). Just browse through all tools in the section “Variant Calling” (without any search term entered in the search box) and you will find it way down in the list. Sorry for that inconvenience.
  4. I’m not sure what kind of database you are asking for.
  5. a) It’s all on the home page of usegalaxy.eu. There’s also a link if you want to ask for more space (you need to request it and this request will be reviewed).
    b) For GEMINI advanced queries, start with the gemini docs here: https://gemini.readthedocs.io/en/latest/content/querying.html and https://gemini.readthedocs.io/en/latest/content/database_schema.html (the latter for a list of default columns to query); see also the sketch just below this list.
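To make 5b a bit more concrete: a GEMINI database is an SQLite file, so outside of Galaxy you can also explore a downloaded database with Python's built-in sqlite3 module while learning the query syntax. The file name below is a placeholder; the columns used (chrom, start, end, ref, alt, gene, impact_severity) are among the default columns documented in the schema link above.

```python
# Minimal sketch: query a downloaded GEMINI database (an SQLite file) directly.
# The file name is a placeholder; column names follow GEMINI's documented schema.
import sqlite3

conn = sqlite3.connect("my_study.gemini.db")
query = """
    SELECT chrom, start, end, ref, alt, gene
    FROM variants
    WHERE impact_severity = 'HIGH'
    LIMIT 20
"""
for row in conn.execute(query):
    print(row)
conn.close()
```

The same style of SELECT statement is what gemini query (covered in the docs linked above) expects.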

Best, Wolfgang


Everything went fine with the tool update on usegalaxy.eu. The latest version of gemini annotate can now use vcf_bgzip input data directly. We’ve also updated gemini load to let you build the initial database from vcf_bgzip if you should ever want to do this.

Thanks for bringing up this issue, @matty5!


Thanks @wm75 for replying and for updating gemini annotate.

Regarding the answer below: actually I would prefer, if I can, to run one analysis for all my samples, batching them into one initial database in gemini load. Is there any way to batch the VCFs while keeping track of patient info/IDs?
Thanks

Gemini can currently only build one database per VCF, so to analyze several samples together you need to produce a multisample VCF first.

As I tried to explain in “merge multiple VCF files - variant analysis and sample organization”, whether that makes sense strongly depends on what you’re analyzing:

As a rule of thumb:

  • if you’re trying to answer one question using several samples, you should probably use a variant caller like freebayes to produce one single VCF from multiple BAM files representing your samples.
    An example would be if you’re trying to find causative variants in families with a common genetic disease.
  • if samples don’t relate to a common question, then keep them separate throughout your analysis and build separate gemini databases for them. If, e.g., you’re analyzing a family trio to study one genetic disease, and two other family trios for a different disease, do multisample variant calling for the first trio, separate variant calling for the other two, and keep the analysis of the resulting VCFs separate.

In other words, if you have many samples representing, e.g., different patients with lots of different diseases, tumors, etc. analyze these cases individually using a workflow.
If you have one big group of patients to study the same disease, use joint variant calling to produce one big VCF of all of their mutations, then feed it to gemini.
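Outside Galaxy, the joint-calling step described above would look roughly like the sketch below (inside Galaxy you would instead select multiple BAM datasets in the FreeBayes tool form). It assumes freebayes is installed and on your PATH; the reference and BAM file names are placeholders, and each BAM needs proper read groups so that per-sample genotypes end up in the multisample VCF.

```python
# Rough sketch: joint variant calling across several samples with freebayes,
# producing one multisample VCF that can then be loaded into GEMINI.
# Assumes freebayes is installed; file names are placeholders.
import subprocess

bams = ["patient01.bam", "patient02.bam", "patient03.bam"]  # ... up to ~40

with open("cohort.joint.vcf", "w") as out_vcf:
    subprocess.run(
        ["freebayes", "-f", "hg19.fa", *bams],
        stdout=out_vcf,
        check=True,
    )
```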

Ok! thank you!