Filter sites with missing genotypes in multi-sample VCF

vcf
vcf-filter
#1

Hello,
I am new to bioinformatics and Galaxy, so this may be a naive question.

I have generated a multi-sample VCF and would like to remove the sites where data is missing for any of the samples, so that I only compare variants where I have data for all individuals.

This should be possible with vcftools and --max-missing-count
(see https://vcftools.github.io/man_latest.html#GENOTYPE%20FILTERING%20OPTIONS)
However, I can’t figure out which exact tool I should be using on usegalaxy.org. This doesn’t seem to work for me with VCF filter or VCFtools annotate.

Could anyone please point me in the right direction to the tool I should use?

Thank you!

#2

Hi - Try a query like this:

  • tool: VCFfilter
  • Select the filter type: Genotype filter (-g)
  • Specify filterting value: GT = N
    • where N is whichever missing-genotype code appears in your data: ".", "./.", or ".|."
  • Filter entire records, not just alleles: Yes

If that finds the records with at least one sample that doesn’t have calls, then use either of these to reverse the filter:

  • Specify filterting value: !(GT = ./.)
  • Inverts the filter, e.g. grep -v: Yes

If that doesn’t work, share a few lines from your VCF file that contain data you want to remove and note what part you want to filter on.
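For anyone who wants to check the same logic outside Galaxy, here is a minimal Python sketch of the idea: drop every site where at least one sample has a missing genotype. It is not what VCFfilter does internally. The function names and the example records are made up for illustration, and it assumes a tab-delimited multi-sample VCF with a FORMAT column, treating a partially missing call such as "./1" as missing too.

```python
def has_missing_genotype(vcf_line):
    """Return True if any sample on a VCF data line lacks a full genotype call."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return True  # assumption: no GT key at all counts as missing
    gt_index = format_keys.index("GT")
    for sample in fields[9:]:
        parts = sample.split(":")
        genotype = parts[gt_index] if gt_index < len(parts) else "."
        # Phased (|) and unphased (/) calls are handled the same way here.
        alleles = genotype.replace("|", "/").split("/")
        if "." in alleles:
            return True
    return False


def drop_sites_with_missing(lines):
    """Yield header lines plus data lines where every sample has a call."""
    for line in lines:
        if line.startswith("#") or not has_missing_genotype(line):
            yield line


# Hypothetical two-sample example: the site at position 200 has a missing call.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\t.\tA\tG\t50\tPASS\tNS=2\tGT:DP\t0/1:12\t1/1:9\n",
    "1\t200\t.\tC\tT\t50\tPASS\tNS=1\tGT:DP\t0/1:7\t./.:.\n",
]
kept_lines = list(drop_sites_with_missing(example_vcf))
```

With this example input, only the two header lines and the site at position 100 are kept.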

Thanks!

#3

Thanks for the help, Jen. That didn’t work for me: the output still contained every line of my data, but the sample information had somehow been removed.

However, I realized that the number of samples with data (NS) was already one of my INFO fields, so I was able to filter on NS = total number of samples. And this worked!
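In case it helps someone else, the NS approach can be sketched in Python roughly like this. It is only an illustration, not the Galaxy tool itself: the function names and example records are hypothetical, the sample count is taken from the #CHROM header line, and it assumes the variant caller wrote an accurate NS value into the INFO column of every record.

```python
def ns_value(info):
    """Return the integer NS value from an INFO string, or None if absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "NS" and value.isdigit():
            return int(value)
    return None


def keep_fully_genotyped(lines):
    """Yield header lines plus data lines where NS equals the sample count."""
    n_samples = None
    for line in lines:
        if line.startswith("##"):
            yield line
        elif line.startswith("#CHROM"):
            # The first 9 columns are fixed; every later column is a sample.
            n_samples = len(line.rstrip("\n").split("\t")) - 9
            yield line
        else:
            info = line.split("\t")[7]
            if n_samples is not None and ns_value(info) == n_samples:
                yield line


# Hypothetical two-sample example: only the first site has NS=2.
ns_example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\t.\tA\tG\t50\tPASS\tNS=2;DP=21\tGT:DP\t0/1:12\t1/1:9\n",
    "1\t200\t.\tC\tT\t50\tPASS\tNS=1;DP=7\tGT:DP\t0/1:7\t./.:.\n",
]
ns_kept = list(keep_fully_genotyped(ns_example))
```

Records without an NS entry are dropped by this sketch, which may or may not be what you want for your data.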

Thank you!

#4

Super, glad that you worked out a query that fits your data :smiley: