Snpeff database run errors

Hello everyone,

I’m trying to use the SnpEff tool to annotate a VCF file containing variants called with FreeBayes on Bos taurus (cow). However, I cannot get it to generate a proper output file.

I have tried to create my own SnpEff database with a GTF file from the latest annotation for this species (annotation features (GTF) from https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCF_002263795.3/download?include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GFF&include_annotation_type=RNA_FASTA&include_annotation_type=CDS_FASTA&include_annotation_type=PROT_FASTA&include_annotation_type=SEQUENCE_REPORT&hydrated=FULLY_HYDRATED), but I always get an error saying that the name I’m choosing (ARS-UCD2.0) isn’t found. I have tried older versions of snpeff build (4.3+T.galaxy2/3/5/6) and of snpeff eff (5.2+galaxy0, 4.3+T.galaxy2), and I get the same errors.

I also tried the datasets already available in Galaxy, but that doesn’t work either.
The only time I got an output was when I chose “download on demand” as the genome source in snpeff eff, with ARS-UCD1.2.105 (see picture below), but not a single variant was correctly annotated (and the ARS-UCD1.2 reference genome is contaminated).

Can you please help me? I’m sure there’s a way to troubleshoot this.

Thanks a lot in advance :slight_smile:


Welcome @gauthap

Yes, building a database with SnpEff can be a bit tricky since the tools are very particular about the formatting of identifiers (chromosomes, genes, other features). In short, try to locate files that are all based on the exact same assembly build/version, then use the simplest file formats possible (applying “cleanup” steps after getting the data from a provider might be necessary).

We have prior troubleshooting for this tool in these topics: snpeff and snpeff_build_gb plus any with reference-genome or reference-annotation tags.

For a short review of the human genome as an example, which has many different assemblies that are not directly compatible (but can be manipulated to be), please see →

And for a recent post where this type of data was reformatted, please see the following (they are doing something a bit different, but it may be helpful anyway) →

Then for this part

Those indexes come directly from the tool authors at Home - SnpEff & SnpSift. You could report the issue to them but there might not be a lot they can do since it is all automatic, and relies on public data. If that is flawed, anything created from it will carry the problems forward, as you noticed!



So, all of that is a lot to read through! If you get stuck and would like to share back a history with just your reference data and the failed runs, we can probably help to diagnose what might be going wrong and fix it up.

Right now, it seems like you have mismatched chromosome identifiers. Meaning, the reference genome fasta and the reference annotation do not seem to be “matching up” for some reason. That could be a file format issue (simplifying the format is where to start, maybe with gffread and NormalizeFasta), or an actual difference in the data itself. In the second case, you’ll need to either standardize the identifiers across files (with the Replace column tool, if a mapping exists) or locate different reference data.
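A minimal sketch of how that identifier comparison could be done outside Galaxy, assuming plain-text (uncompressed) files. The function names and the toy identifiers (`NC_TOY1.1`, `chr2`) are made up for illustration and are not part of SnpEff or Galaxy. The same column-1 function should also work for a VCF, since both GTF/GFF and VCF keep the sequence name in the first tab-separated column.

```python
import io


def fasta_ids(handle):
    """Sequence names from FASTA headers: the text after '>' up to the first whitespace."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def first_column_ids(handle):
    """First tab-separated column of a GTF/GFF or VCF, skipping '#' header lines."""
    ids = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.split("\t", 1)[0].strip())
    return ids


def compare_ids(fasta, other):
    """Report which sequence names are shared and which appear in only one file."""
    return {
        "shared": sorted(fasta & other),
        "only_fasta": sorted(fasta - other),
        "only_other": sorted(other - fasta),
    }


# Toy data with made-up identifiers: the second GTF record uses a different naming scheme.
toy_fasta = io.StringIO(
    ">NC_TOY1.1 made-up chromosome 1\nACGT\n"
    ">NC_TOY2.1 made-up chromosome 2\nGGCC\n"
)
toy_gtf = io.StringIO(
    "#!toy header\n"
    'NC_TOY1.1\tToySource\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n'
    'chr2\tToySource\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n'
)

report = compare_ids(fasta_ids(toy_fasta), first_column_ids(toy_gtf))
for key, names in report.items():
    print(key, names)
```

With real data, the `io.StringIO` objects would be replaced by `open(...)` on your own fasta, GTF, and VCF files. Anything listed under `only_other` is a name the annotation (or VCF) uses that the genome does not have.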

Hope this helps and we can follow up! :slight_smile:

Thank you for your reply!

I somehow managed to create an output file with the latest version of snpeff (5.2) and the latest reference genome. The only thing is that I have 0% known variants even though I’m using the reference genome and its annotation file from the same GCF_002263795.3 (see picture below).

So for some reason snpeff build now creates a custom database with the latest reference genome, and snpeff eff then works as well, even though I literally changed nothing; I just figured I would try one more time. But I still have the problem that 0% of the variants are known.

I manually checked the chromosome IDs in my snpeff eff output file, my GTF, and my input FreeBayes file (variant calls). The chromosome IDs are the same (with associated variant values) across those files. However, when I look into my VCF from snpeff eff, there is information about the annotation, e.g.

AB=0;ABP=0;AC=2;AF=1;AN=2;AO=2;CIGAR=1X;DP=2;DPB=2;DPRA=0;EPP=3.0103;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=7.37776;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=74;QR=0;RO=0;RPL=1;RPP=3.0103;RPPR=0;RPR=1;RUN=1;SAF=0;SAP=7.35324;SAR=2;SRF=0;SRP=0;SRR=0;TYPE=snp;ANN=C|intron_variant|MODIFIER|SMIM11|SMIM11|transcript|NM_001114516.2|protein_coding|2/3|c.13-196A>G||||||

or

AB=0.8;ABP=6.91895;AC=1;AF=0.5;AN=2;AO=4;CIGAR=1X;DP=5;DPB=5;DPRA=0;EPP=3.0103;EPPR=5.18177;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=3.49208;PAIRED=0.75;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=148;QR=37;RO=1;RPL=2;RPP=3.0103;RPPR=5.18177;RPR=2;RUN=1;SAF=4;SAP=11.6962;SAR=0;SRF=1;SRP=5.18177;SRR=0;TYPE=snp;ANN=G|intron_variant|MODIFIER|JAM2|JAM2|transcript|NM_001083736.1|protein_coding|6/9|c.697+141A>C||||||

These examples were selected randomly from my snpeff eff output VCF file. What should I do now?

Also, what does “MODIFIER” mean in the “Number of effects by impact” table of the snpeff eff report?

Thanks :slight_smile:

For this one, you could examine your SNPs in a genome browser. UCSC hosts your genome: go to https://genome.ucsc.edu/ → Genomes, load your data, and turn on the pre-loaded tracks. This is exploratory, so I don’t have exact instructions.
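One more thing that may be worth ruling out, as far as I understand the SnpEff summary report: “known variants” appears to count records whose ID column (the third VCF column) is non-empty, for example a dbSNP rsID. FreeBayes normally writes `.` there, so 0% known could simply mean that no IDs were ever added, rather than that the annotation failed. A minimal sketch to check this on a plain-text VCF; the function name and the toy records (including the ID `rs0000001`) are made up for illustration.

```python
import io


def count_known(vcf_handle):
    """Return (records_with_id, total_records) for a plain-text VCF."""
    known = 0
    total = 0
    for line in vcf_handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete VCF record
        total += 1
        # Column 3 is ID; '.' is the VCF placeholder for "missing".
        if fields[2] not in (".", ""):
            known += 1
    return known, total


# Toy VCF with made-up records: only the second one carries an ID.
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "NC_TOY1.1\t100\t.\tA\tG\t50\t.\tTYPE=snp\n"
    "NC_TOY1.1\t200\trs0000001\tC\tT\t60\t.\tTYPE=snp\n"
    "NC_TOY1.1\t300\t.\tG\tA\t40\t.\tTYPE=snp\n"
)

known, total = count_known(toy_vcf)
print(f"{known} of {total} records have an ID")
```

If every record in your own file has `.` in that column, the 0% figure would be expected under this assumption.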

That table ranks the predicted impact of each SNP. This is the best resource → Output summary files - SnpEff & SnpSift

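In short, the SNPs in that category mostly fall outside protein-coding sequence (introns, intergenic regions, UTRs, and the like), so their impact is hard to predict or there is no evidence of one. Both of your examples are intron variants, which is why they are labeled MODIFIER.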
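To see where the impact value sits inside the VCF, here is a minimal sketch that splits the `ANN` entry of an INFO string into named subfields and tallies the impact column. The dictionary keys are my own labels following the order of the ANN subfields (16, pipe-separated), not names defined by SnpEff. The two INFO strings are shortened versions of the examples posted above, keeping only `TYPE` and `ANN`.

```python
from collections import Counter

# Labels chosen for this sketch, in the order the ANN subfields appear.
ANN_FIELDS = [
    "allele", "annotation", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank",
    "hgvs_c", "hgvs_p", "cdna_pos", "cds_pos", "aa_pos",
    "distance", "messages",
]


def parse_ann(info):
    """Return one dict per annotation found in the ANN entry of an INFO string."""
    for item in info.split(";"):
        if item.startswith("ANN="):
            # Multiple annotations for one variant are comma-separated.
            return [
                dict(zip(ANN_FIELDS, entry.split("|")))
                for entry in item[len("ANN="):].split(",")
            ]
    return []  # no ANN entry present


examples = [
    "TYPE=snp;ANN=C|intron_variant|MODIFIER|SMIM11|SMIM11|transcript|"
    "NM_001114516.2|protein_coding|2/3|c.13-196A>G||||||",
    "TYPE=snp;ANN=G|intron_variant|MODIFIER|JAM2|JAM2|transcript|"
    "NM_001083736.1|protein_coding|6/9|c.697+141A>C||||||",
]

impacts = Counter()
for info in examples:
    for ann in parse_ann(info):
        impacts[ann["impact"]] += 1
        print(ann["gene_name"], ann["annotation"], ann["impact"])

print(dict(impacts))
```

Running the same tally over a whole output file should roughly reproduce the “Number of effects by impact” table, which can be a useful sanity check.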

From there, you could look more closely at your logs (find these in the i-info icon view). You could also review your parameters: load runs with different parameter sets into the genome browser and compare. Can you notice anything relevant?

Hope this helps! :slight_smile: