Help with VCF annotation

Hi everyone,
I intend to add rsIDs to my Dante Labs VCF and later merge it with the 1240k dataset via PLINK. I have done this before on my laptop with an older dbSNP build (138), but the SNP overlap with the 1240k dataset was poor. I wanted to try again with the latest dbSNP build (156), but the uncompressed file is a whopping 165 GB, so I cannot process it on my laptop. I am unfamiliar with usegalaxy, but I tried annotating my VCF with bcftools there anyway, and the resulting file had no rsIDs.

Can someone please advise me on this?

Also, I am getting this error message:

“INFO/RS value encountered and set to missing at NC_000001.10:6319593”.

SnpSift appears to be tailor-made for this, but I get this error message with it:

“Exception in thread “main” java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3308)
at org.snpsift.annotate.VcfIndexDataChromo.grow(VcfIndexDataChromo.java:103)
at org.snpsift.annotate.VcfIndexDataChromo.add(VcfIndexDataChromo.java:46)
at org.snpsift.annotate.VcfIndex.add(VcfIndex.java:67)
at org.snpsift.annotate.VcfIndex.loadIntervals(VcfIndex.java:245)
at org.snpsift.annotate.VcfIndex.index(VcfIndex.java:183)
at org.snpsift.annotate.DbVcfSorted.open(DbVcfSorted.java:55)
at org.snpsift.annotate.AnnotateVcfDb.open(AnnotateVcfDb.java:395)
at org.snpsift.SnpSiftCmdAnnotate.annotateInit(SnpSiftCmdAnnotate.java:190)
at org.snpsift.SnpSiftCmdAnnotate.annotate(SnpSiftCmdAnnotate.java:70)
at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:410)
at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:397)
at org.snpsift.SnpSift.run(SnpSift.java:588)
at org.snpsift.SnpSift.main(SnpSift.java:76)”

Hi @HoWI

Your error message above indicates that there is some mismatch between your files. Double-check that all data is based on the same genome assembly. This means that your VCF and any reference files must use the same chromosome identifier naming scheme, in addition to the same underlying assembly coordinates (which matter just as much, especially for variants!). Tools need both to match in order to link in the annotation: a common chromosome name plus overlapping coordinates.
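One quick way to check for a naming mismatch before rerunning the annotation is to compare the chromosome identifiers in the two files. The following is only a minimal Python sketch: the function name `chrom_names`, the `max_records` cap, and the toy lines are assumptions for illustration. The `NC_000001.10` identifier comes from the error message above, while `chr1` is a guess at what the sample VCF might use.

```python
import re

# Matches header lines such as "##contig=<ID=NC_000001.10,...>"
CONTIG_RE = re.compile(r"^##contig=<ID=([^,>]+)")

def chrom_names(lines, max_records=100000):
    """Collect chromosome names from ##contig headers and the CHROM column.

    Stops after max_records data lines so that a very large file is not
    read in full; a sorted file may then only reveal its first chromosomes.
    """
    names = set()
    seen = 0
    for line in lines:
        match = CONTIG_RE.match(line)
        if match:
            names.add(match.group(1))
        elif not line.startswith("#") and line.strip():
            names.add(line.split("\t", 1)[0])  # CHROM is the first column
            seen += 1
            if seen >= max_records:
                break
    return names

# Toy lines standing in for the two real files (hypothetical content).
sample_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t6319593\t.\tA\tG\n",
]
dbsnp_vcf = [
    "##contig=<ID=NC_000001.10>\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "NC_000001.10\t6319593\t.\tA\tG\n",
]

shared = chrom_names(sample_vcf) & chrom_names(dbsnp_vcf)
print(shared or "no shared chromosome names")
```

With real files, the same function can be given an open file handle (for compressed files, one opened with `gzip.open(path, "rt")`). An empty intersection suggests the identifiers need to be harmonized before any tool can link the annotation.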

Mismatch issues can look exactly like memory issues – so eliminate that problem first before deciding the work is too large for the public sites. The EU clusters scale quite large, so I would be surprised if this job is actually running out of resources.

If you want to share your work for a closer review, search this forum with “sharing your history” for the how-to. But check for and address any data mismatches first, since that seems to be what is going on. https://training.galaxyproject.org/training-material/faqs/galaxy/datasets_chromosome_identifiers.html

Let’s start there 🙂


Hi jennaj,
Thanks for your reply. Yes, you are right: it turns out that my Dante Labs VCF and the dbSNP VCF use different chromosome notations. I was able to annotate my VCF successfully after changing its chromosome notation to match the dbSNP file.
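For anyone landing here with the same problem, the renaming step could be sketched in Python roughly as follows. This is not necessarily how it was done in this thread: the function `rename_chroms` and the one-entry mapping are illustrative assumptions, and a real mapping would need one entry per chromosome, taken from the assembly report for the build in use. bcftools also offers `annotate --rename-chrs` for the same purpose.

```python
import re

# Captures the ID inside "##contig=<ID=...>" header lines.
CONTIG_ID = re.compile(r"^(##contig=<ID=)([^,>]+)")

def rename_chroms(lines, mapping):
    """Yield VCF lines with chromosome names translated through mapping.

    Names absent from the mapping are passed through unchanged.
    """
    for line in lines:
        if line.startswith("##contig="):
            # Keep the header consistent with the renamed records.
            yield CONTIG_ID.sub(
                lambda m: m.group(1) + mapping.get(m.group(2), m.group(2)),
                line,
            )
        elif line.startswith("#"):
            yield line  # other header lines are left untouched
        else:
            chrom, sep, rest = line.partition("\t")
            yield mapping.get(chrom, chrom) + sep + rest

# Hypothetical one-entry mapping; "chr1" is an assumed source name.
to_refseq = {"chr1": "NC_000001.10"}

toy_lines = [
    "##contig=<ID=chr1>\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t6319593\t.\tA\tG\n",
]
renamed = list(rename_chroms(toy_lines, to_refseq))
print("".join(renamed), end="")
```

Renaming only changes labels. It does not convert coordinates between assemblies, so both files still need to be on the same genome build.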
