Subsetting VCF by regions in BED

I am having trouble subsetting a VCF by desired regions. I imported the following VCF file from NCBI directly into Galaxy (vcf.gz):
ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/NISTv4.2.1/GRCh38/SupplementaryFiles/HG001_GRCh38_1_22_v4.2.1_all.vcf.gz

I added format = vcf_bgzip and db=hg38

A sample section of this VCF, skipping all # comment lines, looks like this:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	INTEGRATION
chr1	10108	.	C	CT	5	allfilteredbutagree	platforms=0;platformnames=;datasets=0;datasetnames=;callsets=0;callsetnames=;filt=CS_CCS15kb_20kbDV_filt;difficultregion=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set_REF_N_slop_15kb,GRCh38_AllTandemRepeats_201to10000bp_slop5,HG001.hg38.300x.bam.bilkentuniv.010920.dups,hg38.contigs.1-22.lt_500kb,hg38.segdups_sorted_merged,lowmappabilityall	GT:PS:DP:ADALL:AD:GQ	:.:0:0,0:0,0:0
chr1	10146	.	AC	A	5	allfilteredbutagree	platforms=0;platformnames=;datasets=0;datasetnames=;callsets=0;callsetnames=;filt=CS_CCS15kb_20kbDV_filt,CS_10XLRGATK_filt;difficultregion=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set_REF_N_slop_15kb,GRCh38_AllTandemRepeats_201to10000bp_slop5,HG001.hg38.300x.bam.bilkentuniv.010920.dups,hg38.contigs.1-22.lt_500kb,hg38.segdups_sorted_merged,lowmappabilityall	GT:PS:DP:ADALL:AD:GQ	:.:0:0,0:0,0:0
chr1	10157	.	TA	T	5	allfilteredbutagree	platforms=0;platformnames=;datasets=0;datasetnames=;callsets=0;callsetnames=;filt=CS_CCS15kb_20kbDV_filt,CS_10XLRGATK_filt;difficultregion=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set_REF_N_slop_15kb,GRCh38_AllTandemRepeats_201to10000bp_slop5,HG001.hg38.300x.bam.bilkentuniv.010920.dups,hg38.contigs.1-22.lt_500kb,hg38.segdups_sorted_merged	GT:PS:DP:ADALL:AD:GQ	:.:0:0,0:0,0:0

Then, I uploaded a list of regions of interest that I want to subset from that VCF as a bed file, for example:

chr1	11790551	11790551
chr1	11790553	11790553
chr1	11790559	11790559
chr1	11790560	11790560
chr1	11790562	11790562
chr1	11790563	11790563
chr1	11790565	11790565

I used the tool VCF-BEDintersect: to intersect the original HG001_GRCh38_1_22_v4.2.1_all.vcf.gz with the bed file containing regions of interest (around 200k). However, I could only retrieve 4 regions interesecting my bed file.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	INTEGRATION
chr2	108259008	.	AAAACTTGGTCAGGTGATGTTAT	A	50	PASS	platforms=4;platformnames=Illumina,PacBio,CG,10X;datasets=4;datasetnames=HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR;callsets=6;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,10XLRGATK;datasetsmissingcall=IonExome,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable	GT:PS:DP:ADALL:AD:GQ	1/1:.:646:0,263:82,417:362
chr11	640056	.	TCCCCGGGGTCCCTGCGGCCCCGACTGTGCGCCCGCCGCGCCCAGCCTC	T	5	allfilteredanddisagree	platforms=2;platformnames=Illumina,PacBio;datasets=2;datasetnames=HiSeqPE300x,CCS15kb_20kb;callsets=3;callsetnames=,CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xGATK;callable=CS_CCS15kb_20kbDV_callable;filt=CS_HiSeqPE300xGATK_filt,CS_CCS15kb_20kbDV_filt,CS_10XLRGATK_filt,CS_HiSeqPE300xfreebayes_filt;difficultregion=GRCh38_AllTandemRepeats_201to10000bp_slop5	GT:PS:DP:ADALL:AD:GQ	0|1:640056_TCCCCGGGGTCCCTGCGGCCCCGACTGTGCGCCCGCCGCGCCCAGCCTC_T:303:107,91:0,0:106
chr15	40407805	.	GTT	G	50	PASS	platforms=3;platformnames=PacBio,Illumina,10X;datasets=3;datasetnames=CCS15kb_20kb,HiSeqPE300x,10XChromiumLR;callsets=5;callsetnames=CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xfreebayes,HiSeqPE300xGATK,10XLRGATK;datasetsmissingcall=CGnormal,IonExome,SolidSE75bp;callable=CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_HiSeqPE300xfreebayes_callable	GT:PS:DP:ADALL:AD:GQ	1/1:.:677:0,283:1,63:215
chr22	42126617	.	AG

I highly doubt that there are only 4 regions of interest out of the 200k as bed I originally uploaded in the input VCF, so I suspect that there is probably something else going on but I cannot figure it. I appreciate any suggestions that you might have.

Hi,

BED files are 0 based for the start and non-inclusive for the end position. The wiki page about it is quite nice BED (file format) - Wikipedia

So I could imagine that you need to adjust your BED file.
Ciao,
Bjoern

1 Like