Remove some vcf header lines (not all)

roselucia · November 14, 2019, 12:19am

Hi everyone,

I am using SnpSift rmInfo on Galaxy EU and annotate the variants with SnpEff eff annotate afterwards. Before I use vcftotab-delimited to parse my data (the great EU Team made this tool work for my data on Galaxy EU, MANY THANKS!), I would like to get rid of some VCF header lines: those declaring INFO tags from the original annotation for which no data exists in the records anymore. Otherwise I end up with various empty columns after vcftotab-delimited. I tried a couple of things, and one possibility is to use the tool Cut columns from a table (cut) (Galaxy Version 1.1.0) AFTER vcftotab-delimited. However, since my workflow annotates the variants with lots of further information, I would prefer to keep the VCF clean and the number of header lines small. So I would rather find a tool that removes specific header lines from the VCF itself BEFORE I run vcftotab.

Below is a list of the headers I would like to remove!

Many thanks!
All the best,
Rose

##SnpSiftVersion="SnpSift 4.2 (build 2015-12-05), by Pablo Cingolani"
##SnpSiftCmd="SnpSift annotate -id /srv/qgen/data/annotation/common_all_20160601.vcf.gz 5BC_S15.temp0.vcf"
##INFO=<ID=CDA,Number=0,Type=Flag,Description="Variation is interrogated in a clinical diagnostic assay">
##INFO=<ID=OTH,Number=0,Type=Flag,Description="Has other variant with exactly the same set of mapped positions on NCBI refernce assembly.">
##INFO=<ID=S3D,Number=0,Type=Flag,Description="Has 3D structure - SNP3D table">
##INFO=<ID=WTD,Number=0,Type=Flag,Description="Is Withdrawn by submitter If one member ss is withdrawn by submitter, then this bit is set. If all member ss' are withdrawn, then the rs is deleted to SNPHistory">
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SLO,Number=0,Type=Flag,Description="Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out">
##INFO=<ID=NSF,Number=0,Type=Flag,Description="Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44">
##INFO=<ID=R3,Number=0,Type=Flag,Description="In 3' gene region FxnCode = 13">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=NSN,Number=0,Type=Flag,Description="Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41">
##INFO=<ID=NSM,Number=0,Type=Flag,Description="Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42">
##INFO=<ID=G5A,Number=0,Type=Flag,Description=">5% minor allele frequency in each and all populations">
##INFO=<ID=COMMON,Number=1,Type=Integer,Description="RS is a common SNP. A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which 2 or more founders contribute to that minor allele frequency.">
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=RV,Number=0,Type=Flag,Description="RS orientation is reversed">
##INFO=<ID=TPA,Number=0,Type=Flag,Description="Provisional Third Party Annotation(TPA) (currently rs from PHARMGKB who will give phenotype data)">
##INFO=<ID=CFL,Number=0,Type=Flag,Description="Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies.">
##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available. The variant has individual genotype (in SubInd table).">
##INFO=<ID=VLD,Number=0,Type=Flag,Description="Is Validated. This bit is set if the variant has 2+ minor allele count based on frequency or genotype data.">
##INFO=<ID=ASP,Number=0,Type=Flag,Description="Is Assembly specific. This is set if the variant only maps to one assembly">
##INFO=<ID=ASS,Number=0,Type=Flag,Description="In acceptor splice site FxnCode = 73">
##INFO=<ID=REF,Number=0,Type=Flag,Description="Has reference A coding region variation where one allele in the set is identical to the reference sequence. FxnCode = 8">
##INFO=<ID=U3,Number=0,Type=Flag,Description="In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53">
##INFO=<ID=U5,Number=0,Type=Flag,Description="In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55">
##INFO=<ID=WGT,Number=1,Type=Integer,Description="Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 or more">
##INFO=<ID=MTP,Number=0,Type=Flag,Description="Microattribution/third-party annotation(TPA:GWAS,PAGE)">
##INFO=<ID=LSD,Number=0,Type=Flag,Description="Submitted from a locus-specific database">
##INFO=<ID=NOC,Number=0,Type=Flag,Description="Contig allele not present in variant allele list. The reference sequence allele at the mapped position is not present in the variant allele list, adjusted for orientation.">
##INFO=<ID=DSS,Number=0,Type=Flag,Description="In donor splice-site FxnCode = 75">
##INFO=<ID=SYN,Number=0,Type=Flag,Description="Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3">
##INFO=<ID=KGPhase3,Number=0,Type=Flag,Description="1000 Genome phase 3">
##INFO=<ID=CAF,Number=.,Type=String,Description="An ordered, comma delimited list of allele frequencies based on 1000Genomes, starting with the reference allele followed by alternate alleles as ordered in the ALT column. Where a 1000Genomes alternate allele is not in the dbSNPs alternate allele set, the allele is added to the ALT column. The minor allele is the second largest value in the list, and was previuosly reported in VCF as the GMAF. This is the GMAF reported on the RefSNP and EntrezSNP pages and VariationReporter">
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=MUT,Number=0,Type=Flag,Description="Is mutation (journal citation, explicit fact): a low frequency variation that is cited in journal and other reputable sources">
##INFO=<ID=KGPhase1,Number=0,Type=Flag,Description="1000 Genome phase 1 (incl. June Interim phase 1)">
##INFO=<ID=NOV,Number=0,Type=Flag,Description="Rs cluster has non-overlapping allele sets. True when rs set has more than 2 alleles from different submissions and these sets share no alleles in common.">
##INFO=<ID=VP,Number=1,Type=String,Description="Variation Property. Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=INT,Number=0,Type=Flag,Description="In Intron FxnCode = 6">
##INFO=<ID=G5,Number=0,Type=Flag,Description=">5% minor allele frequency in 1+ populations">
##INFO=<ID=OM,Number=0,Type=Flag,Description="Has OMIM/OMIA">
##INFO=<ID=PMC,Number=0,Type=Flag,Description="Links exist to PubMed Central article">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other">
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">
##INFO=<ID=HD,Number=0,Type=Flag,Description="Marker is on high density genotyping kit (50K density or greater). The variant may have phenotype associations present in dbGaP.">
##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant is Precious(Clinical,Pubmed Cited)">
##SnpSiftCmd="SnpSift annotate -id /srv/qgen/data/annotation/CosmicAllMuts_v69_20140602.vcf.gz 5BC_S15.temp1.vcf"
##INFO=<ID=CDS,Number=1,Type=String,Description="CDS annotation">
##INFO=<ID=AA,Number=1,Type=String,Description="Peptide annotation">
##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">
##INFO=<ID=CNT,Number=1,Type=Integer,Description="How many samples have this mutation">
##INFO=<ID=STRAND,Number=1,Type=String,Description="Gene strand">
##SnpSiftCmd="SnpSift annotate -id /srv/qgen/data/annotation/clinvar_20160531.vcf.gz 5BC_S15.temp0.vcf"
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Likely pathogenic, 5 - Pathogenic, 6 - drug response, 7 - histocompatibility, 255 - other">
##INFO=<ID=CLNALLE,Number=.,Type=Integer,Description="Variant alleles from REF or ALT columns. 0 is REF, 1 is the first ALT allele, etc. This is used to match alleles with other corresponding clinical (CLN) INFO tags. A value of -1 indicates that no allele was found to match a corresponding HGVS allele name.">
##INFO=<ID=CLNORIGIN,Number=.,Type=String,Description="Allele Origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other">
##INFO=<ID=CLNSRC,Number=.,Type=String,Description="Variant Clinical Chanels">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="no_assertion - No assertion provided, no_criteria - No assertion criteria provided, single - Criteria provided single submitter, mult - Criteria provided multiple submitters no conflicts, conf - Criteria provided conflicting interpretations, exp - Reviewed by expert panel, guideline - Practice guideline">
##INFO=<ID=CLNDSDB,Number=.,Type=String,Description="Variant disease database name">
##INFO=<ID=CLNACC,Number=.,Type=String,Description="Variant Accession and Versions">
##INFO=<ID=CLNDBN,Number=.,Type=String,Description="Variant disease name">
##INFO=<ID=CLNDSDBID,Number=.,Type=String,Description="Variant disease database ID">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Variant names from HGVS. The order of these variants corresponds to the order of the info in the other clinical INFO tags.">
##INFO=<ID=CLNSRCID,Number=.,Type=String,Description="Variant Clinical Channel IDs">

wm75 · November 14, 2019, 8:00am

Hi Rose,
it shouldn’t surprise you at this point that the answer lies in yet another regexp and the Select lines tool https://usegalaxy.eu/root?tool_id=Grep1

You want to keep all lines NOT Matching a pattern like:
^##INFO=<ID=((ANN)|(DP)|(WHATEVER)|(AND_SO_ON))
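
For anyone who wants to try such a pattern outside Galaxy first, the "keep everything that does NOT match" idea can be sketched in a few lines of Python. This is only an illustration and not what the Select tool runs internally. The function name `keep_not_matching`, the three-tag pattern and the example lines (including the DP header and its description) are made up for the demo, and it assumes Python's `re` treats this simple pattern the same way the tool does.

```python
import re

# Hypothetical pattern: drop the INFO header lines for CDA, OTH and S3D only.
DROP = re.compile(r"^##INFO=<ID=((CDA)|(OTH)|(S3D))")


def keep_not_matching(lines, pattern=DROP):
    """Return only the lines that do NOT match the pattern."""
    return [line for line in lines if not pattern.search(line)]


# Made-up miniature VCF, just enough to show the effect.
example = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=CDA,Number=0,Type=Flag,Description="Variation is interrogated in a clinical diagnostic assay">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
    'chr1\t100\t.\tA\tG\t50\tPASS\tDP=10',
]

filtered = keep_not_matching(example)
for line in filtered:
    print(line)
```

Because the pattern is anchored with `^##INFO=<ID=`, the column header line and the variant records are never touched; only the listed INFO declarations disappear.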

roselucia · November 15, 2019, 7:43am

Hi @wm75,
I tried out the following pattern with the “NOT Matching” option and it worked very well:

^(##INFO=<ID=((CDA)|(OTH)|(S3D)|(WTD)|(dbSNPBuildID)|(SLO)|(NSF)|(R3)|(R5)|(NSN)|(NSM)|(G5A)|(COMMON)|(RS)|(RV)|(TPA)|(CFL)|(GNO)|(VLD)|(ASP)|(ASS)|(REF)|(U3)|(U5)|(WGT)|(MTP)|(LSD)|(NOC)|(DSS)|(SYN)|(KGPhase3)|(CAF)|(VC)|(MUT)|(KGPhase1)|(NOV)|(VP)|(SAO)|(GENEINFO)|(INT)|(G5)|(OM)|(PMC)|(SSR)|(RSPOS)|(HD)|(PM)|(CDS)|(AA)|(GENE)|(CNT)|(STRAND)|(CLNSIG)|(CLNALLE)|(CLNORIGIN)|(CLNSRC)|(CLNREVSTAT)|(CLNDSDB)|(CLNACC)|(CLNDBN)|(CLNDSDBID)|(CLNHGVS)|(CLNSRCID)|(ANN)|(LOF)|(NMD)))|(##SnpEffVersion)|(##SnpSiftVersion)|(##SnpEffCmd)|(##SnpSiftCmd="SnpSift annotate -id /srv/qgen/data/annotation/((common_all_20160601.vcf.gz)|(CosmicAllMuts_v69_20140602.vcf.gz)|(clinvar_20160531.vcf.gz)))

Thanks a lot!
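
A pattern this long is easy to mistype, so here is a hedged sketch of how it could be assembled from a plain list of tag IDs in Python. The helper `build_header_filter` and its parameters are hypothetical names, not part of Galaxy, SnpSift or SnpEff. One deliberate difference from the pattern above: the sketch requires a comma right after the ID, so that `G5` does not also match `G5A`, and `RS` does not also match `RSPOS`. In the list above both members of each pair are meant to be removed anyway, so the overlap is harmless there; the comma only matters if you want to keep one of them.

```python
import re


def build_header_filter(info_ids, header_keys=()):
    """Compile a regex matching the header lines that should be removed.

    info_ids    -- INFO tag IDs whose ##INFO declarations should go
    header_keys -- other header keys to drop, e.g. "##SnpSiftVersion"
    """
    ids = "|".join(re.escape(tag) for tag in info_ids)
    # The trailing comma forces an exact ID match (G5 but not G5A).
    parts = ["##INFO=<ID=(?:" + ids + "),"]
    # The "=" after the key does the same for the other header lines.
    parts += [re.escape(key) + "=" for key in header_keys]
    return re.compile("^(?:" + "|".join(parts) + ")")


# Small demo with two tags and one extra header key.
pattern = build_header_filter(["G5", "RS"], ["##SnpSiftVersion"])

lines = [
    '##INFO=<ID=G5,Number=0,Type=Flag,Description=">5% minor allele frequency in 1+ populations">',
    '##INFO=<ID=G5A,Number=0,Type=Flag,Description=">5% minor allele frequency in each and all populations">',
    '##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">',
    '##SnpSiftVersion="SnpSift 4.2 (build 2015-12-05), by Pablo Cingolani"',
]
kept = [line for line in lines if not pattern.search(line)]
```

To paste the result into the Select tool, `pattern.pattern` gives the regex as a string. Whether the tool's regex flavour accepts the non-capturing `(?:...)` groups is an assumption I have not checked; plain `(...)` groups as in the pattern above would work the same way here.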