I am having trouble subsetting a VCF by desired regions. I imported the following VCF file from NCBI directly into Galaxy (vcf.gz):
ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/NISTv4.2.1/GRCh38/SupplementaryFiles/HG001_GRCh38_1_22_v4.2.1_all.vcf.gz
I added format = vcf_bgzip
and db=hg38
A sample section of this VCF, skipping all # comment lines, looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT INTEGRATION
chr1 10108 . C CT 5 allfilteredbutagree platforms=0;platformnames=;datasets=0;datasetnames=;callsets=0;callsetnames=;filt=CS_CCS15kb_20kbDV_filt;difficultregion=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set_REF_N_slop_15kb,GRCh38_AllTandemRepeats_201to10000bp_slop5,HG001.hg38.300x.bam.bilkentuniv.010920.dups,hg38.contigs.1-22.lt_500kb,hg38.segdups_sorted_merged,lowmappabilityall GT:PS:DP:ADALL:AD:GQ :.:0:0,0:0,0:0
chr1 10146 . AC A 5 allfilteredbutagree platforms=0;platformnames=;datasets=0;datasetnames=;callsets=0;callsetnames=;filt=CS_CCS15kb_20kbDV_filt,CS_10XLRGATK_filt;difficultregion=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set_REF_N_slop_15kb,GRCh38_AllTandemRepeats_201to10000bp_slop5,HG001.hg38.300x.bam.bilkentuniv.010920.dups,hg38.contigs.1-22.lt_500kb,hg38.segdups_sorted_merged,lowmappabilityall GT:PS:DP:ADALL:AD:GQ :.:0:0,0:0,0:0
chr1 10157 . TA T 5 allfilteredbutagree platforms=0;platformnames=;datasets=0;datasetnames=;callsets=0;callsetnames=;filt=CS_CCS15kb_20kbDV_filt,CS_10XLRGATK_filt;difficultregion=GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set_REF_N_slop_15kb,GRCh38_AllTandemRepeats_201to10000bp_slop5,HG001.hg38.300x.bam.bilkentuniv.010920.dups,hg38.contigs.1-22.lt_500kb,hg38.segdups_sorted_merged GT:PS:DP:ADALL:AD:GQ :.:0:0,0:0,0:0
Then, I uploaded a list of regions of interest that I want to subset from that VCF as a bed file, for example:
chr1 11790551 11790551
chr1 11790553 11790553
chr1 11790559 11790559
chr1 11790560 11790560
chr1 11790562 11790562
chr1 11790563 11790563
chr1 11790565 11790565
I used the tool VCF-BEDintersect: to intersect the original HG001_GRCh38_1_22_v4.2.1_all.vcf.gz with the bed file containing regions of interest (around 200k). However, I could only retrieve 4 regions interesecting my bed file.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT INTEGRATION
chr2 108259008 . AAAACTTGGTCAGGTGATGTTAT A 50 PASS platforms=4;platformnames=Illumina,PacBio,CG,10X;datasets=4;datasetnames=HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR;callsets=6;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,10XLRGATK;datasetsmissingcall=IonExome,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ 1/1:.:646:0,263:82,417:362
chr11 640056 . TCCCCGGGGTCCCTGCGGCCCCGACTGTGCGCCCGCCGCGCCCAGCCTC T 5 allfilteredanddisagree platforms=2;platformnames=Illumina,PacBio;datasets=2;datasetnames=HiSeqPE300x,CCS15kb_20kb;callsets=3;callsetnames=,CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xGATK;callable=CS_CCS15kb_20kbDV_callable;filt=CS_HiSeqPE300xGATK_filt,CS_CCS15kb_20kbDV_filt,CS_10XLRGATK_filt,CS_HiSeqPE300xfreebayes_filt;difficultregion=GRCh38_AllTandemRepeats_201to10000bp_slop5 GT:PS:DP:ADALL:AD:GQ 0|1:640056_TCCCCGGGGTCCCTGCGGCCCCGACTGTGCGCCCGCCGCGCCCAGCCTC_T:303:107,91:0,0:106
chr15 40407805 . GTT G 50 PASS platforms=3;platformnames=PacBio,Illumina,10X;datasets=3;datasetnames=CCS15kb_20kb,HiSeqPE300x,10XChromiumLR;callsets=5;callsetnames=CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xfreebayes,HiSeqPE300xGATK,10XLRGATK;datasetsmissingcall=CGnormal,IonExome,SolidSE75bp;callable=CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ 1/1:.:677:0,283:1,63:215
chr22 42126617 . AG
I highly doubt that there are only 4 regions of interest out of the 200k as bed I originally uploaded in the input VCF, so I suspect that there is probably something else going on but I cannot figure it. I appreciate any suggestions that you might have.