Filter sites with missing genotypes in multi-sample VCF

Hello,
I am new to bioinformatics and Galaxy, so this may be a naive question.

I have generated a multi-sample VCF and would like to remove the sites where data are missing for any of the samples, so that I only compare variants for which I have data for all individuals.

This should be possible with vcftools and --max-missing-count
(see https://vcftools.github.io/man_latest.html#GENOTYPE%20FILTERING%20OPTIONS)
However, I can’t figure out which exact tool I should be using on usegalaxy.org. This doesn’t seem to work for me with VCF filter or VCFtools annotate.

Could anyone please point me in the right direction to the tool I should use?

Thank you!

Hi - Try a query like this:

  • tool: VCFfilter
  • Select the filter type: Genotype filter (-g)
  • Specify filterting value: GT = N
    • where N might be any one of these ". ./. .|." in your data
  • Filter entire records, not just alleles: Yes

If that finds the records with at least one sample that doesn’t have calls, then use either of these to reverse the filter:

  • Specify filterting value: !(GT = ./.)
  • Inverts the filter, e.g. grep -v: Yes

If that doesn’t work, share a few lines from your VCF file that contain data you want to remove and note what part you want to filter on.
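For anyone who wants to check the result outside Galaxy, the same idea can be sketched in a few lines of plain Python. This is only an illustrative sketch, not what VCFfilter does internally: the function names are made up for this example, it assumes an uncompressed, tab-delimited multi-sample VCF with a FORMAT column, and it treats a partially missing call such as "0/." as missing.

```python
def has_missing_genotype(record_line: str) -> bool:
    """Return True if any sample on a VCF data line has a missing GT.

    Assumes a well-formed, tab-delimited multi-sample record
    (8 fixed columns, FORMAT, then one column per sample).
    """
    fields = record_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return True  # no genotype field at all: treated as missing here
    gt_index = format_keys.index("GT")
    for sample in fields[9:]:
        values = sample.split(":")
        gt = values[gt_index] if gt_index < len(values) else "."
        # Covers ".", "./." and ".|." as well as partial calls like "0/."
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            return True
    return False


def filter_complete_sites(lines):
    """Yield header lines plus records where every sample has a called GT."""
    for line in lines:
        if line.startswith("#") or not has_missing_genotype(line):
            yield line
```

It could be used by opening the VCF as a text file, passing the file object to filter_complete_sites, and writing the yielded lines to a new file.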

Thanks!

Thanks for the help, Jen. That actually didn’t work for me: it still output all the lines of my data but somehow removed the sample information.

However, I did realize that the number of samples with data (NS) was already one of my INFO fields, so I was able to filter on NS = total number of samples. And this worked!
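The NS approach can be sketched in Python as well, in case it helps someone reproduce it outside Galaxy. This is a rough sketch under some assumptions: the helper names are invented for this example, the file is an uncompressed tab-delimited VCF, and NS in the INFO column really does hold the number of samples with data at that site (which depends on the variant caller that wrote the file).

```python
def sample_count(header_line: str) -> int:
    """Count sample columns in the #CHROM header line (those after FORMAT)."""
    return max(len(header_line.rstrip("\n").split("\t")) - 9, 0)


def ns_value(record_line: str):
    """Return the integer NS value from the INFO column, or None if absent."""
    info = record_line.split("\t")[7]
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "NS" and value.isdigit():
            return int(value)
    return None


def filter_by_ns(lines):
    """Yield header lines plus records whose NS equals the number of samples."""
    n_samples = None
    for line in lines:
        if line.startswith("##"):
            yield line
        elif line.startswith("#CHROM"):
            n_samples = sample_count(line)
            yield line
        elif n_samples is not None and ns_value(line) == n_samples:
            yield line
```

Records without an NS entry are dropped by this sketch, so it is only useful when every record carries that field.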

Thank you!

Super, glad that you worked out a query that fits your data :smiley: