Variant calling from VCF files


I have three gzipped VCF files (.vcf.gz: proband, mother, and father) from a sequencing lab, which also provided the matching .vcf.gz.tbi index files.

In the Galaxy variant analysis tutorial, the FreeBayes step has an option to merge the three .bam files, which produces a single multisample .vcf file that is then annotated. Since I am starting from VCF files, I am trying to merge them with bcftools merge instead, but when I then run bcftools norm with the same parameters as in the tutorial, I get this error:

[E::faidx_adjust_position] The sequence "1" was not found
faidx_fetch_seq failed at 1:69270

Do you know how I can solve this?

Hi @Adrian

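There is probably a chromosome identifier mismatch: the genome assigned as the database is not an exact match for the reference genome that was used for variant calling. The error says that the reference given to bcftools norm has no sequence named "1", which is the chromosome name your VCF uses at position 69270.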

This FAQ explains how to confirm: Mismatched Chromosome identifiers and how to avoid them
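For anyone who prefers to check this outside Galaxy, here is a minimal Python sketch of the same comparison. It assumes you have a samtools faidx index (.fai) for the reference you want to use; the file names in the usage comment are hypothetical placeholders, and the function names are made up for this example.

```python
import gzip


def contigs_in_vcf_lines(lines):
    """Return (declared, used): names from ##contig headers and from records."""
    declared, used, seen = [], [], set()
    for line in lines:
        if line.startswith("##contig=<"):
            # Header form: ##contig=<ID=1,length=249250621>
            body = line.strip()[len("##contig=<"):].rstrip(">")
            for field in body.split(","):
                if field.startswith("ID="):
                    declared.append(field[3:])
        elif line.strip() and not line.startswith("#"):
            # Data record: the first tab-separated column is the chromosome.
            name = line.split("\t", 1)[0]
            if name not in seen:
                seen.add(name)
                used.append(name)
    return declared, used


def names_in_fai_lines(lines):
    """Return sequence names (column 1) from a .fai index."""
    return [line.split("\t", 1)[0] for line in lines if line.strip()]


def missing_from_reference(vcf_names, reference_names):
    """Return VCF chromosome names that the reference does not contain."""
    reference = set(reference_names)
    return [name for name in vcf_names if name not in reference]


def check_files(vcf_path, fai_path):
    """Compare a (possibly gzipped) VCF against a reference .fai index."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as vcf, open(fai_path) as fai:
        _declared, used = contigs_in_vcf_lines(vcf)
        reference = names_in_fai_lines(fai)
    return missing_from_reference(used, reference)


# Hypothetical usage (file names are placeholders):
# print(check_files("proband.vcf.gz", "hg19.fa.fai"))
```

An empty result means every chromosome name used in the VCF exists in the reference index. Matching names alone do not prove the sequences are identical, so treat this as a first check only.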

Thank you! I have now confirmed that the error came from the reference genome. The reference used for variant calling was apparently b37, and I would like to annotate against hg19. Do you know if there is any way to convert the coordinates?

Hello @Adrian

The FreeBayes tool might already have your genome available as an index.

I ran a quick alignment against that genome index to generate some headers. These list the chromosome identifiers, which you can compare with your VCF headers. History share link Galaxy

I think that will work for you given what you have shared already.

For others reading: if your original genome is not indexed on the server you are using, here is what to try:

Converting coordinates from one genome build to another is probably not a good idea for variant calling protocols. Remapping the reads against a supported genome is the recommended approach instead.

If you just need a database key, you can upload the reference genome FASTA that matches your data and use it as a custom genome. Promoting it to a custom build creates a database metadata key that you can assign to datasets, which avoids conflicts. Keep in mind that any other data you incorporate (annotation, etc.) must also be based on that exact reference genome.
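On the b37 versus hg19 question specifically: the main chromosomes (1 to 22, X, Y) are generally understood to share coordinates between the two and to differ in naming only ("1" versus "chr1"), so relabeling may be enough there instead of a coordinate conversion. The sketch below builds a rename table in the two-column "old new" layout that bcftools annotate --rename-chrs reads. It is an assumption-laden example, not a verified recipe. The function names are hypothetical. MT is left out on purpose because the hg19 chrM is commonly reported to be a different mitochondrial sequence from the one in b37. Unplaced contigs are not handled either.

```python
def b37_to_hg19_rename_map():
    """Pair b37-style names with hg19-style names for 1-22, X and Y only.

    Assumption: these chromosomes differ only by the "chr" prefix.
    MT and unplaced contigs are deliberately excluded.
    """
    names = [str(number) for number in range(1, 23)] + ["X", "Y"]
    return [(name, "chr" + name) for name in names]


def format_rename_table(pairs):
    """Format pairs as whitespace-separated 'old new' lines, one per row."""
    return "".join(f"{old}\t{new}\n" for old, new in pairs)


if __name__ == "__main__":
    # Write the table to a file name of your choice (placeholder here).
    with open("b37_to_hg19.txt", "w") as handle:
        handle.write(format_rename_table(b37_to_hg19_rename_map()))
```

After renaming, it is worth rerunning the chromosome identifier comparison against the hg19 reference before calling bcftools norm, and checking separately whether any mitochondrial or unplaced-contig variants were left behind.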