Help with vcf annotation

HoWI · January 4, 2024, 6:08pm

Hi everyone,
I intend to add rsIDs to my Dante Labs VCF and later merge it with the 1240k dataset via PLINK. I have done this before on my laptop with an older dbSNP build (138), but the SNP overlap with the 1240k dataset was poor. I wanted to try again with the latest dbSNP build (156), but the uncompressed file is a whopping 165 GB, so I cannot use my laptop. I am unfamiliar with usegalaxy, but I tried to annotate my VCF with bcftools there anyway, and the resulting file had no rsIDs.

Can someone please instruct me regarding this?

I am also getting this error message:

“INFO/RS value encountered and set to missing at NC_000001.10:6319593”.

SnpSift appears to be tailor-made for this, but I get this error message with it:

“Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3308)
at org.snpsift.annotate.VcfIndexDataChromo.grow(VcfIndexDataChromo.java:103)
at org.snpsift.annotate.VcfIndexDataChromo.add(VcfIndexDataChromo.java:46)
at org.snpsift.annotate.VcfIndex.add(VcfIndex.java:67)
at org.snpsift.annotate.VcfIndex.loadIntervals(VcfIndex.java:245)
at org.snpsift.annotate.VcfIndex.index(VcfIndex.java:183)
at org.snpsift.annotate.DbVcfSorted.open(DbVcfSorted.java:55)
at org.snpsift.annotate.AnnotateVcfDb.open(AnnotateVcfDb.java:395)
at org.snpsift.SnpSiftCmdAnnotate.annotateInit(SnpSiftCmdAnnotate.java:190)
at org.snpsift.SnpSiftCmdAnnotate.annotate(SnpSiftCmdAnnotate.java:70)
at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:410)
at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:397)
at org.snpsift.SnpSift.run(SnpSift.java:588)
at org.snpsift.SnpSift.main(SnpSift.java:76)”

jennaj · January 9, 2024, 9:03pm

Hi @HoWI

Your error message above indicates that there is some mismatch between your files. Double check that all data is based on the same genome assembly. This means that your VCF and any reference files must all use the same chromosome identifier naming scheme, not just the same underlying assembly coordinates (although that is really important too, especially for variants!). Tools need both to match before they can link in the annotation: the same chromosome name and overlapping coordinates.
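A quick way to check for this kind of mismatch is to compare the chromosome names in the first column of each file. The following is only a minimal Python sketch: the function name `chromosome_names` and the two tiny in-memory "files" are made up for illustration, and the `chr1` style shown for the sample VCF is an assumption (the thread does not say which notation the Dante Labs file uses). The `NC_000001.10` name is the one from the error message above.

```python
def chromosome_names(lines):
    """Return the set of CHROM values found in the data lines of a VCF."""
    names = set()
    for line in lines:
        # Skip header lines ("##..." and "#CHROM...") and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        # CHROM is the first tab-separated column of every record.
        names.add(line.split("\t", 1)[0])
    return names


# Made-up placeholder records, only to show two naming styles side by side.
sample_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t6319593\t.\tA\tG\t.\t.\t.",  # assumed "chr1" style
]
dbsnp_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_000001.10\t6319593\t.\tA\tG\t.\t.\t.",  # name seen in the error message
]

shared = chromosome_names(sample_vcf) & chromosome_names(dbsnp_vcf)
print("shared chromosome names:", sorted(shared))
```

If the shared set comes back empty, as it does here, no record in one file can ever match a record in the other, whatever the coordinates are. For real files, the same function should work on an open text-mode handle, for example one returned by `gzip.open(path, "rt")`.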

Mismatch issues can look exactly like memory issues – so eliminate that problem first before deciding the work is too large for the public sites. The EU clusters scale quite large, so I would be surprised if this job is actually running out of resources.

If you want to share your work for a closer review, search this forum for “sharing your history” to find the how-to. But check for and address any data mismatches first, since that seems to be what is going on. This FAQ covers chromosome identifier mismatches: https://training.galaxyproject.org/training-material/faqs/galaxy/datasets_chromosome_identifiers.html

Let’s start there.

HoWI · January 10, 2024, 4:54am

Hi jennaj,
Thanks for your reply. Yes, you are right: it turns out that my Dante Labs VCF and the dbSNP VCF use different chromosome notations. I was able to annotate my VCF successfully after changing its chromosome names to match the dbSNP file.
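For anyone landing here with the same problem, the renaming step can be sketched roughly as follows. This is a hedged Python illustration, not the exact procedure used above: `rename_chromosomes` and `CHROM_MAP` are hypothetical names, the `chr1` source notation is an assumption, and the map holds only the one dbSNP name that appears in this thread (`NC_000001.10`). A real map would need one entry per chromosome, taken from the assembly report for the genome build in use.

```python
import re

# Hypothetical, deliberately incomplete map: old name -> name used by the dbSNP file.
# Only NC_000001.10 comes from this thread; fill in the rest for your assembly.
CHROM_MAP = {"chr1": "NC_000001.10"}

CONTIG_ID = re.compile(r"^(##contig=<ID=)([^,>]+)")


def rename_chromosomes(lines, mapping):
    """Yield VCF lines with chromosome names translated through `mapping`.

    Names that are not in the mapping are passed through unchanged.
    """
    for line in lines:
        if line.startswith("##contig="):
            # Keep the header contig IDs consistent with the renamed records.
            yield CONTIG_ID.sub(
                lambda m: m.group(1) + mapping.get(m.group(2), m.group(2)), line
            )
        elif line.startswith("#"):
            # Every other header line is left untouched.
            yield line
        else:
            # Data line: CHROM is the first tab-separated column.
            chrom, sep, rest = line.partition("\t")
            yield mapping.get(chrom, chrom) + sep + rest


example = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t6319593\t.\tA\tG\t.\t.\t.",  # made-up placeholder record
]
renamed = list(rename_chromosomes(example, CHROM_MAP))
print("\n".join(renamed))
```

bcftools can also do this without custom code: `bcftools annotate` has a `--rename-chrs` option that takes a file of "old_name new_name" pairs, which is probably the more practical route for a whole-genome VCF.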
