bcftools norm realignment

Hi everyone,

I have a question regarding bcftools norm ( bcftools norm Left-align and normalize indels; check if REF alleles match the reference; split multiallelic sites into multiple rows; recover multiallelics from multiple rows (Galaxy Version 1.9+galaxy1)).
Using the tool 6 variants in my vcf got realigned. However, for one of them it seems like the realignment is not the most left position according to the following definition:
„A VCF entry is left aligned if and only if its base position is smallest among all potential VCF entries having the same allele length and representing the same variant“ (Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. (2015) Unified Representation of Genetic Variants. Bioinformatics. https://academic.oup.com/bioinformatics/article/31/13/2202/196142 )

original:
chr1: 115254052 C-CTC

realigned:
chr1: 115254051 T -TCT

Here is a screenshot of the sequence (GRCh37) as presented in the Genome Data Viewer by NCBI.
36

Shouldn’t be the correct realigned position:
chr1: 115254050 G GTC

I am glad for any suggestions or explanations.

Thanks a lot!
Rose

1 Like

The way you’re presenting it I would say you’re right.
One thing you could do is to run the variant through vt normalize and see if it gives a different answer. bcftools norm is supposed to use vt normalize’s algorithm under the hood, but who knows.
If you find a difference it may be worth a bug report against bcftools.

2 Likes

Hi @wm75,

I ran the tool vt normalize ( VT normalize normalizes variants in a VCF file (Galaxy Version 0.1.0)) with the following tool parameters:

Choose the source for the reference list cached
Using reference genome hg19
File containing list of intervals
Window size for local sorting of variants 10000
Choose the output format VCF

I compared the too vcf files and for all 6 variants for which the bcftools norm did a realignment, the variants got realigned also. All realignments were the same, except for the one variant I thought to be realigned incorrectly.

original:
chr1: 115254052 C-CTC

realigned with bcftools norm:
chr1: 115254051 T -TCT

realigned with vt normalize:
chr1: 115254050 G - GTC

Would you recommend using vt normalize instead of bcftools norm for post processing after variant calling? If yes, does vt normalize also splits multiallelic sites into beallelic records (I didn’t find a tool, such as vt decompose_blocksub or vt decompose) ? And could it be problematic, that there is no option to ignore the problem, when any REF allele does not match the reference genome base? Is the default window size for local sorting of variants is what I want to use when having about 52.000 lines in the vcf (about 4.500 variants) (ordered within chromosome, but not ordered by the chromosomes)?

Or should I rather keep on using bcftools norm regardless of the bug?

Thanks for your help!

All the best,
Rose

PS: I also realised that for another type of my data, bcftools norm rounds the variant allele frequency. Is this supposed to be like this?

after bcftools norm | before bcftools norm
0/1:181,187:0.508152 | 0/1:181,187:0.5081522
0/1:156,105:0.402299 | 0/1:156,105:0.4022989
0/1:388,256:0.397516 | 0/1:388,256:0.3975155

However it does not round the variant frequency correctly for all variants:
0/1:133,146:0.523297 | 0/1:133,146:0.5232975

Thanks a lot!

Last question first: my guess would be that it doesn’t round the value it finds, but rather recalculates it, then rounds. Generally, bcftools norm tries to make your vcf conform to the vcf specs in various (unfortunately, undocumented) ways. That’s why you can run the tool without any selected options. I guess validation of the variant allele frequency is one of them. If you want to you could check by editing your vcf to contain a wrong variant allele frequency and see if the tool fixes it.

Now for the first question: yes, I don’t think vcf decompose is currently available on eu or main and, no, vt normalize will not split multiallelic sites. If you want to, you can, however, run vt normalize followed by bcftools norm with the option Left-align and normalize indels turned off and that should do what you want. Your settings for vt normalize seem just fine,

I will look at bcftools norm, the command line tool, next week to find out why it doesn’t do full realignment for some variants and why its results are not fully equivalent to vt normalize when they should.

Hi @wm75,

Thanks for your fast response. I reported the bugg as you adviced and the reponse was as followed:

„That’s a normal behavior, the round trip between text representation of floats to internal binary and text again introduces these inaccuracies. For all practical purposes, a change in 6th decimal digit of variant allele frequency should not influence the analysis.“
Regarding the differing left-alligend:
„ This looks like a bug, thank you for reporting it.“

Regarding the Varaint frequency bcftools I guess, as this is normal behaviour and is only affecting the 6th decimal position, this is something one can accept ? :wink:

However, in regard to the bugg in left-alignment, I would be glad, if you could let me know, if you find something out this week? Otherwise I will do the workaround (using bcftools Norm without the left allignment option and then VT normalize afterwards) as you suggested.

Thanks for your help! I am looking forward to what you find out.

All the best,
Rose

PS: Someone from bcftools checked the variant and told me the following:

„I just checked with 1.9 and the latest development version and both produce the correct result:

input

1 115254052 . C CTC

output

1 115254050 . G GTC“

When someone from bcftools is trying to reproduce the output with the same vcf, thr error doesnt occur. Is there a chance to upgrade the tool version available on Galaxy?

„Thanks for the testing data. However, I am unable to reproduce the behavior, I get the correctly realigned variant:

chr1   115254050   .   G   .   100 PASS    DP=1176
chr1   115254050   .   G   GTC 98  LowAQ   DP=1151;ANT=ENST00000369535|intron_variant|YES|ENST00000369535.4:c.291-1703_291-1702dupGA||NRAS||

Please try to upgrade bcftools, maybe you are using an old version by accident. (Try bcftools --version )“