merge multiple VCF files - variant analysis and sample organization

While you may be able to use the VCFcombine tool for that, please note that this is normally not what you would want to do, because:

  1. Starting with separate VCFs for each patient, you typically have no information about any variant site’s status in “unaffected” patients other than that the variant wasn’t called there. That is, if a site was judged homozygous reference in a patient, it does not appear in that patient's output and, thus, there are no statistics about it.

  2. Because you’re lacking this information, you generally cannot rely on the INFO column after combining the files. GEMINI, however, relies on INFO column fields for many of the queries you can perform with it.
    So unless you know exactly what you’re doing, you may get very wrong answers from such queries.

  3. Assuming that you expect your 40 patients (or subsets of them) to have something in common, joint variant calling, which assesses the data of all samples simultaneously, can increase sensitivity, in particular at low-coverage sites.
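
To make the first two points concrete, here is a minimal Python sketch of what a naive merge of per-patient call sets does. It is a toy model, not how VCFcombine or GEMINI work internally: the two call sets, the site coordinates, and the helper names `naive_merge` and `allele_counts` are all made up for illustration.

```python
# Toy per-patient call sets (hypothetical data): only sites where a variant
# was called appear. Keys are (chrom, pos, ref, alt); values are genotypes.
patient_a = {("chr1", 1000, "A", "G"): "0/1", ("chr1", 2000, "C", "T"): "1/1"}
patient_b = {("chr1", 1000, "A", "G"): "0/1"}


def naive_merge(samples):
    """Combine per-sample calls; a site absent from a sample becomes './.'."""
    sites = sorted(set().union(*(calls.keys() for calls in samples.values())))
    return {
        site: {name: calls.get(site, "./.") for name, calls in samples.items()}
        for site in sites
    }


def allele_counts(genotypes):
    """Return (alt allele count, total called alleles); './.' adds nothing."""
    alleles = [a for gt in genotypes.values() if gt != "./." for a in gt.split("/")]
    return alleles.count("1"), len(alleles)


merged = naive_merge({"patient_a": patient_a, "patient_b": patient_b})
for site, genotypes in merged.items():
    print(site, genotypes, allele_counts(genotypes))
```

At the second site, patient_b ends up with a missing genotype, so the site is counted as 2 alternate alleles out of 2 called alleles. Had patient_b been genotyped there and found homozygous reference, it would be 2 out of 4. Site-level statistics derived after such a merge inherit this ambiguity.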

For all of these reasons, I would recommend calling variants for all samples (or, at least, the ones that logically should be grouped) together with a tool like freebayes. You can then directly use the resulting multi-sample VCF dataset with GEMINI.
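
If you work on the command line rather than in Galaxy, joint calling with freebayes comes down to passing all BAM files to a single run. The sketch below only builds and prints such a command. The file names are hypothetical, and it assumes each BAM carries read groups with a distinct sample name, which freebayes uses to tell the samples apart.

```python
import shlex


def freebayes_joint_command(reference_fasta, bam_files):
    """Build a freebayes call that genotypes all given BAM files together."""
    if len(bam_files) < 2:
        raise ValueError("joint calling needs at least two BAM files")
    # -f names the reference FASTA; the BAM files follow as positional arguments.
    return ["freebayes", "-f", reference_fasta, *bam_files]


# Hypothetical file names, for illustration only.
cmd = freebayes_joint_command(
    "reference.fa", ["patient01.bam", "patient02.bam", "patient03.bam"]
)
print(shlex.join(cmd), "> joint_calls.vcf")
```

The resulting multi-sample VCF has one genotype column per sample at every reported site, including explicit reference genotypes where the data support them.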

If all this sounds confusing, you may want to have a look at this tutorial:


which illustrates joint variant analysis for a family trio.
