merge multiple VCF files - variant analysis and sample organization

I have multiple VCF files corresponding to 40 different patients. I want to run a batch annotation and GEMINI analysis on them. Can I merge/concatenate these files and run the analysis on a single VCF file while keeping patient ID information?


While you may be able to use the tool VCFcombine for that, please note that this is normally not what you would do because:

  1. Starting with separate VCFs for each patient, you typically have no information about a variant site’s status in “unaffected” patients other than that the variant wasn’t called, i.e., if a site was judged homozygous reference in a patient, it does not appear in that patient’s output and, thus, there are no stats about it.

  2. Because of this missing information, you generally cannot rely on the INFO column after combining the files. GEMINI, however, relies on INFO column fields for many of the queries you can perform with it.
    So unless you know exactly what you’re doing, you may get very wrong answers from such queries.

  3. Assuming that you expect your 40 patients (or subsets of them) to have something in common, joint variant calling, which assesses the data of all samples simultaneously, can increase sensitivity, in particular at low-coverage sites.

For all of these reasons, I would recommend calling variants for all samples (or, at least, the ones that logically should be grouped) together with a tool like freebayes. You can then directly use the resulting multi-sample VCF dataset with GEMINI.
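For illustration only, here is what a joint-calling invocation conceptually looks like on the command line; the reference and BAM file names below are placeholders, and on Galaxy you would instead select the reference genome and all 40 BAM datasets in the FreeBayes tool form rather than typing a command:

```python
# Sketch: assemble a joint variant-calling command for freebayes.
# All file names are placeholders for this demo; freebayes accepts
# multiple BAM files in one invocation and emits ONE multi-sample VCF.
bams = [f"patient{i:02d}.bam" for i in range(1, 41)]  # one aligned BAM per patient
command = ["freebayes", "-f", "hg38.fa", *bams]       # single run, all samples together

# Equivalent shell command (writes the multi-sample VCF to stdout):
#   freebayes -f hg38.fa patient01.bam ... patient40.bam > joint.vcf
print(" ".join(command[:5]), "...")
```

The key point is that there is one caller run over all samples, not 40 separate runs merged afterwards, so every sample gets a genotype call (including homozygous reference) at every variant site.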

If all this sounds confusing, you may want to have a look at this tutorial, which illustrates joint variant analysis for a family trio.


Thank you for your answer!
I followed the tutorial you suggested and also the “Identification of somatic and germline variants from tumor and normal sample pairs” tutorial (https://galaxyproject.github.io/training-material/topics/variant-analysis/tutorials/somatic-variants/tutorial.html#variant-annotation-and-reporting), successfully.

1) I am interested in learning more about GEMINI annotate, GEMINI query, and Join files to customize my analysis. I am trying to download the gnomAD database, but the download fails; Galaxy cannot recognize the vcf.bgz.tbi extension, I think.
Could you suggest how I can download this dataset?

2) During the tutorials I realized that some steps run on usegalaxy.org and others on Galaxy Europe. Is there any way to share a history between the two websites?

3) Lastly, for annotate/query, is learning the SQL language always needed? Or can I use other tools like Filter and Sort (to remove frequent variants, for example)?

Thanks in advance for your help.

  1. You don’t want to upload any .tbi files. These are index files that the GEMINI annotate tool knows how to generate transparently and on the fly for you. What you need to upload is just the (compressed) VCF.
  2. No, there is no simple sharing between the websites. For following these tutorials there shouldn’t be any missing tools on EU, though, since that’s where these particular tutorials were developed and tested.
  3. As long as you want to stick to GEMINI, you have to use SQL for phrasing more complex queries (after running GEMINI load, your data is stored in an SQLite database, which is not a plain-text file). Note, however, that many GEMINI tools exist only for the purpose of avoiding explicit SQL syntax in common usage scenarios (e.g., GEMINI inheritance, GEMINI query in basic mode).
    What you can always do is formulate a partial query (one that doesn’t do all the filtering, etc. you’d want), then use regular text-processing tools to work on the tabular output.
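As a toy illustration of what such a query looks like, the sketch below builds a tiny SQLite database mimicking the idea of a variants table and filters out frequent variants. The table layout and column names are invented for this demo (GEMINI’s real `variants` table is much wider), but the WHERE clause is the same kind of SQL you would type into GEMINI query:

```python
import sqlite3

# Build a tiny in-memory database resembling a GEMINI-style variants
# table. Column names here are made up for the demo.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE variants (chrom TEXT, pos INTEGER, gene TEXT, aaf REAL)")
con.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?)",
    [
        ("chr1", 12345, "GENE_A", 0.40),    # common variant
        ("chr2", 67890, "GENE_B", 0.002),   # rare variant
        ("chr3", 11111, "GENE_C", 0.0005),  # rare variant
    ],
)

# "Remove frequent variants": keep only sites with alternate allele
# frequency below 1%.
rare = con.execute(
    "SELECT chrom, pos, gene FROM variants WHERE aaf < 0.01 ORDER BY chrom"
).fetchall()
print(rare)  # [('chr2', 67890, 'GENE_B'), ('chr3', 11111, 'GENE_C')]
```

Filtering on one numeric column like this could equally be done with Galaxy’s Filter tool on tabular output; SQL becomes necessary once you combine several conditions across genotype and annotation fields.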

@wm75 Please see the other post and correct as needed.

In summary, @mvdbeek is making some changes so that compressed VCF can be recognized and used by tools. The “VCF” gnomAD datasets load with a vcf_bgzip datatype and are not currently recognized by tools (including GEMINI).

The complete gnomAD datasets are also very large (too large, and will probably fail for resource limits… or is that not correct at usegalaxy.eu?). Per-chromosome data are smaller.

Both use custom versions of hg19/hg38, so assigning a database will be troublesome (yes?). If the actual human genome FASTA used to call the variants is identified and loaded, it will be too large to use as a custom genome/build (at least at usegalaxy.org).

(For vcf.gz data without an index, from other sources, the files are uncompressed to vcf during Upload, and tools recognize the data as valid VCF input.)