I am working to identify variants in WGS data from an intron-rich, GC-rich, haploid eukaryotic alga, for which pipeline parameters derived from human or bacterial data may not be appropriate. To maximise the robustness of my variant calls, I have been trying to develop a benchmark dataset of accurate variant calls from my population of UV mutants, initially processed with FreeBayes, to validate and optimise my variant-calling pipeline.
I see that in your tutorials you recommend INFO filters for strand bias and placement bias to distinguish variants supported by correctly mapped reads, in particular SAP (strand balance probability for the alternate allele) and EPP (end placement probability). Both are encoded as Phred-scaled estimates of the probability of the observed deviation from the expected ratio of 0.5, with a suggested cutoff of >20. However, my control, R16, has a well-supported, sequenced mutation that is eliminated by these filters: it scores 3.3935 on both measures.
Can you help me understand why this might be the case? I have copied the VCF entry below and provided a link to the relevant history, which includes many trial analyses. Thanks for your advice.
QUAL: 2587.33 . INFO: AB=0;ABP=0;AC=1;AF=1;AN=1;AO=51;CIGAR=1M1D3M;DP=51;DPB=40.8;DPRA=0;EPP=3.3935;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=595.754;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=3052;QR=0;RO=0;RPL=25;RPP=3.05288;RPPR=0;RPR=26;RUN=1;SAF=24;SAP=3.3935;SAR=27;SRF=0;SRP=0;SRR=0;TYPE=del;technology.ILLUMINA=1 GT:DP:AD:RO:QR:AO:QA:GL 1:51:0,51:0:0:51:3052:-261.674,0
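
For reference, here is a minimal sketch of how I understand these scores to be derived. It is an assumption on my part, not the confirmed FreeBayes implementation: I take the Hoeffding upper bound on the probability of the observed imbalance, 0.5 * exp(-2 * (n * 0.5 - k)^2 / n), where k is the larger of the two counts and n is their sum, and Phred-scale it. The function name `phred_balance` is my own. With the counts from the entry above, it appears to reproduce the reported SAP and RPP values.

```python
import math


def phred_balance(count_a: int, count_b: int, expected: float = 0.5) -> float:
    """Phred-scaled Hoeffding upper bound on the probability of the
    observed imbalance between two counts (assumed formula, see above)."""
    trials = count_a + count_b
    if trials == 0:
        # No observations: the entry above reports 0 in this case (e.g. SRP=0)
        return 0.0
    successes = max(count_a, count_b)
    # Natural log of the Hoeffding bound: ln(0.5) - 2 * (n*p - k)^2 / n
    ln_p = math.log(0.5) - 2.0 * (trials * expected - successes) ** 2 / trials
    # Convert a natural-log probability to the Phred scale: -10 * log10(p)
    return -10.0 * ln_p / math.log(10)


sap = phred_balance(24, 27)  # SAF=24, SAR=27 from the entry above
rpp = phred_balance(25, 26)  # RPL=25, RPR=26 from the entry above
print(f"SAP ~ {sap:.4f}, RPP ~ {rpp:.5f}")
```

Under this assumed formula, perfectly balanced counts give the minimum score of 10 * log10(2), about 3.01, and the score grows as the counts become more lopsided.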