SnpEff build: Create Vaccinium meridionale database

Hi,

I am attempting to identify whether certain SNPs obtained through GBS are synonymous or non-synonymous mutations and to determine their effects on the sequences. To achieve this, I am trying to create a database for Vaccinium meridionale, using a .fa genome file and a .gff annotation file as input.

However, the process fails after approximately two hours of execution. The error message generated is as follows:

*** Total: 1079584 markers added.  
00:00:10 Create exons from CDS (if needed):  
00:00:10 Exons created for 0 transcripts.  
00:00:10 Deleting redundant exons (if needed):  
00:00:11     Total transcripts with deleted exons: 0  
00:00:11 Collapsing zero length introns (if needed):  
00:00:12     Total collapsed transcripts: 0  
00:00:12     Reading sequences   :  
WARNING_CHROMOSOME_NOT_FOUND: Ignoring sequences for 'null'. Cannot find chromosome. File '/data/jwd05e/main/076/580/76580216/working/snpeff_output/Vaccinium_meridionale/genes.gff' line 1733010   '##FASTA'  
00:00:12     Total: 0 sequences added, 0 sequences ignored.  

It appears that the process is unable to locate chromosomes for the sequences specified in the GFF file.

Additional Context

  • The genome file consists of scaffolds rather than assembled chromosomes.
  • The GFF file is formatted according to the GFF3 standard, with features such as gene , mRNA , and CDS properly defined.

Could you help me understand the root cause of this issue? Specifically:

  1. Is the problem related to the use of scaffolds instead of chromosomes?
  2. Are there additional steps or adjustments needed to ensure compatibility with SnpEff?

I appreciate any guidance or recommendations you can provide to resolve this issue.

Thank you very much for your help!

Welcome, @Ginna_Patricia_Velas

Thank you for sharing so many details! Based on your description, the problem is likely with the way the data is formatted rather than with the content itself, but we can confirm that. In short, you will want the sequence identifiers to be as simple as possible and consistent across all files. The tool tries to “match” those identifiers to each other when it creates the index. Later, when you use the index, the identifiers in the query VCFs are matched against it in the same way. When the identifiers agree everywhere, the tool will work how you want it to.

If you would post back a few more details, we can help to solve the mismatch. Screenshots might be enough for this.

  1. reference assembly

    Go to your history, and click on the fasta dataset to expand it. This reveals some metadata that will matter. Next, click on the eye icon to bring up the first lines in the file into the center panel. Capture that entire view in a screenshot so that the expanded dataset and the top of the contents are shown. The > title lines of the fasta file are the most important here.

  2. reference annotation

    Then, do the same for your reference annotation. If you have a lot of headers and no data lines are shown yet, scroll down and capture a second screenshot that shows some of the data lines.

  3. query variant data

    You also have at least one VCF file, correct? Results from mapping then calling variants with upstream tools? That data will need to be based on that same “reference assembly”, too. If you want to troubleshoot that file along with these, you can include the same kind of screenshot for this as well.

All of these files need to work together once your SnpEff database is created and in use. If you have multiple VCFs, whatever we confirm and resolve with this example file is something you can later check for and apply to the others.
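If you want to check the identifiers yourself outside of Galaxy, here is a minimal Python sketch of the comparison. It assumes plain-text (uncompressed) files, and the function names and the tiny in-memory example are made up for illustration. It only looks at the first word of each FASTA title line and the first column of the GFF3 or VCF data lines.

```python
def fasta_ids(lines):
    """Sequence IDs from FASTA title lines: the text after '>' up to the first whitespace."""
    return {line[1:].split()[0] for line in lines
            if line.startswith(">") and line[1:].strip()}


def tabular_ids(lines):
    """First-column IDs from GFF3 or VCF data lines (lines starting with '#' are skipped)."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # in a GFF3, everything after this directive is sequence, not annotation
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.split("\t", 1)[0].strip())
    return ids


def ids_missing_from_reference(reference_ids, other_ids):
    """IDs used in the annotation or VCF that have no matching FASTA title line."""
    return sorted(other_ids - reference_ids)


# Tiny made-up example; with real files, pass open("genome.fa") and so on instead.
fasta = [">scaffold_1 length=100", "ACGT", ">scaffold_2", "GGCC"]
gff = [
    "##gff-version 3",
    "scaffold_1\tmaker\tgene\t1\t50\t.\t+\t.\tID=g1",
    "Scaffold_3\tmaker\tgene\t1\t50\t.\t+\t.\tID=g2",
]
print(ids_missing_from_reference(fasta_ids(fasta), tabular_ids(gff)))  # prints ['Scaffold_3']
```

An empty list means every identifier in the second file also exists in the FASTA. The comparison is case-sensitive on purpose, since the tools treat the names as exact strings.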

Those screenshots can be posted back here. You could also generate a history share link since that will include everything. Or both, your choice!

I’ve included some guides below if you want to try on your own first. Let us know if you solve it that way instead! :slight_smile:



Related FAQs and help

Thank you very much for responding to my request so quickly.
I have attached the requested screenshots:

  1. reference assembly

  2. reference annotation

  3. query variant data

Ok, great, thanks @Ginna_Patricia_Velas for sharing.

Your identifiers seem fine to me and those are consistent across files.

The problem is likely with the output from Maker. Running the annotation through the gffread tool can sometimes help to “prepare” the data for use with other tools. The issues can be with the order of the attributes (in the 9th column) but also with the order of the lines in the GFF3 (exons should be nested under transcripts, and transcripts under genes).
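To get a feel for whether line order is an issue in your file, a rough Python sketch like the one below can flag features that name a Parent before that parent has appeared. This is not something SnpEff or gffread does for you, and the function names are invented for this example. Also note the GFF3 format itself does not strictly require this order; it is simply the layout that downstream tools tend to handle best.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute column such as 'ID=t1;Parent=g1' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def children_before_parents(lines):
    """Return (line_number, message) for lines that look out of order or malformed."""
    seen_ids = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("##FASTA"):
            break  # stop at embedded sequence
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            problems.append((number, "expected 9 tab-separated columns"))
            continue
        attrs = parse_attributes(columns[8])
        # A feature may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent and parent not in seen_ids:
                problems.append((number, f"Parent '{parent}' not defined earlier in the file"))
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
    return problems
```

With a real file you could call `children_before_parents(open("genes.gff"))` (the file name here is only a placeholder) and look at the first few reported lines.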

You can find examples of the files from Genbank or Ensembl to see the usual format, and compare against those, since that is what this tool is expecting to work with, too. The gffread tool should be able to do those conversions.

Keep in mind that any data that isn’t about a transcript footprint and the protein translation isn’t going to answer your questions (correct?). So dropping any excess data should be okay for this purpose (if you need to).
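If you do decide to drop the excess data, the idea can be sketched in a few lines of Python. The set of feature types to keep is my assumption about what matters for transcript structure and translation, so adjust it to whatever names appear in column 3 of your own file. The function name is made up for the example.

```python
# Assumed set of feature types worth keeping; edit to match your annotation.
KEEP_TYPES = {"gene", "mRNA", "exon", "CDS", "five_prime_UTR", "three_prime_UTR"}


def keep_transcript_features(lines, keep_types=KEEP_TYPES):
    """Yield header lines plus data lines whose feature type (column 3) is in keep_types."""
    for line in lines:
        if line.startswith("##FASTA"):
            return  # do not copy any embedded sequence
        if line.startswith("#"):
            yield line  # keep directives and comments as they are
            continue
        columns = line.split("\t")
        if len(columns) >= 3 and columns[2] in keep_types:
            yield line


# Example with placeholder file names:
# with open("genes.gff") as src, open("genes.filtered.gff", "w") as out:
#     out.writelines(keep_transcript_features(src))
```

This only removes lines. It does not reorder or repair anything, so running the result through gffread afterwards would still make sense.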

As a test and example, I had an old history here. I ran the Ensembl GFF3 file through gffread, outputting both a GFF and a GTF file. The GTF is what I would use with SnpEff since it is the most likely to “work”, but I also tested with a cleaned-up GFF3. Some jobs are still running, but maybe this helps?

Maybe you can compare to your entire files, try some manipulations, see if you can notice what is different, etcetera. Your content seems fine, so all of this is just about getting the format standardized enough for SnpEff to understand it correctly.
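One difference that is visible from the log in the first post: genes.gff contains a `##FASTA` line (at line 1733010), which means sequence is embedded at the end of the annotation. I can't say whether that is what trips up the build, but as an experiment you could separate the two parts and give SnpEff an annotation-only file. A minimal Python sketch, with an invented function name and placeholder file names:

```python
def split_embedded_fasta(lines):
    """Split GFF3 lines into (annotation_lines, sequence_lines) at the ##FASTA directive."""
    annotation, sequence = [], []
    target = annotation
    for line in lines:
        if line.startswith("##FASTA"):
            target = sequence  # everything after the directive is FASTA
            continue           # the directive itself is not copied to either part
        target.append(line)
    return annotation, sequence


# Example with placeholder file names:
# with open("genes.gff") as src:
#     annotation, sequence = split_embedded_fasta(src)
# with open("genes.annotation_only.gff", "w") as out:
#     out.writelines(annotation)
```

This holds the whole file in memory, which is usually fine for an annotation of this size. The sequence part is returned as well so you can compare its title lines against your separate .fa file.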

Your datasets were not expanded, so I can’t see the assigned datatypes or how large they are, but maybe you can review and compare that part too. If the genome is “too fragmented”, the tool can also have trouble. It wants each gene localized in one placement (meaning on one assembly fragment), but perhaps you have that already given the annotation tools you have used.

Hope this helps and let us know how this worked out! :scientist: