Calculating variant allele frequency from FreeBayes VCF

Hi everyone,

This is my first time using FreeBayes on Galaxy. Does anyone know how to calculate the variant allele frequency (some call it alternate allele frequency) from the VCF? Some headers that I think would apply are:

##FORMAT=<ID=DP,Number=1,Type=Integer,Description=“Read Depth”>
##FORMAT=<ID=AD,Number=R,Type=Integer,Description=“Number of observation for each allele”>
##FORMAT=<ID=RO,Number=1,Type=Integer,Description=“Reference allele observation count”>
##FORMAT=<ID=AO,Number=A,Type=Integer,Description=“Alternate allele observation count”>

I was thinking maybe AO/DP?

Another question I had was, there is an option for pre-calculated variant allele frequency (see below for “AF”), but the VAF is always either 1.0 or 0.5. This seems suspicious because VAF are never that clean in the real world. Anyone have ideas why it does this?

##INFO=<ID=AF,Number=A,Type=Float,Description=“Estimated allele frequency in the range (0,1]”>

Thanks!

1 Like

Hi there,
AO/DP is just an approximation since AO+RO <= DP in general. This is because DP counts all reads covering the variant site, but AO and RO only include reads deemed good enough for an allele call.
A better calculation would be AO/(AO+RO), but pay close attention to this line:
##FORMAT=<ID=AO,Number=A,Type=Integer,Description=“Alternate allele observation count”>
and the Number=A in it, which means that AO will be a comma-separated list of as many integers as there are alternate alleles. If you don’t have a haploid genome or have multiple samples that’s an issue to consider.

So assuming you define VAF as the frequency of the most observed variant allele you would have to:

  • split AO on comma
  • find the maximum in the resulting list of values and use it as the numerator for your ratio
  • sum up all values in the list and add the RO value => use the result as the denominator

Equivalently, and maybe easier: use the AD field to get all counts including reference at once.

That makes me wonder if maybe you have only one diploid sample in your VCF. AF is an INFO field so it refers to all samples at once. It’s called estimated because FreeBayes looks at all your samples and tries to estimate the frequency at which the allele(s) (again, observe the Number=A in the definition) occur in the population. Now if you only have one sample and you’re doing diploid calling, then if FreeBayes calls a heterozygous genotype for that sample, its best possible estimate for AF based on the limited data is 0.5. If the sample is homozygous variant, then that would be 1. If it was homozygous reference, you’d get 0, but then with a single sample, of course, you wouldn’t have a variant record in the first place.

Hope that helps,
Wolfgang

1 Like

To illustrate the logic for a more complex case with three samples, here’s a truncated output line:

T C 8474.07 . AF=0.333333 GT:DP:AD:RO:QR:AO:QA:GL 0/1:389:187,201:187:6069:201:7786:-583.228,0,-429.318 0/1:249:130,118:130:4255:118:4330:-315.01,0,-308.479 0/0:253:252,1:252:8248:1:39:0,-72.264,-738.511

Here you have three diploid samples, so six chromosomes. Two out of the six carry the variant allele, so AF = 1/3

1 Like

Thank you very much!

1 Like