SnpEff errors = number of variants processed

Hi,
I wonder if anyone can help me understand this:

Thank you
Sanjay

This log doesn’t show much unfortunately. You need to show the specific errors. One common error I have encountered with snpEff in the past is the chromosome name in the VCF file not matching what snpEff expects. Is this a job you are running on one of the usegalaxy servers?

Ok, I have a user (pvanheus) on usegalaxy.org - if you want you can share your history with me so I can see the error in more detail.

What specific error are you getting? Does it show in the VCF?

Can you share your “Galaxy user email”? It’s not accepting the username.

I have done so using a private message.

Thank you. Your error is, indeed, the “Chromosome name not found” error, but it is masked by the size of your variant file. You have 1429 actual variants in the VCF; the other 4300000-odd positions are non-variant sites. If I filter out the non-variant sites using snpSift with the (!ANN='.') clause included in its filters, I can run the result through snpEff and clearly see the error messages. The problem is that you called variants against a reference named AL123456.3, whereas the snpEff database you downloaded expects the reference to be called Chromosome. (The database is modelled on the H37Rv sequence in Ensembl Bacteria, which is the same sequence as AL123456.3 and NC_000962.3 but with a different name.)
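For anyone who wants to see how many records in a VCF are real variants before annotating, here is a minimal Python sketch. It is not the snpSift filter mentioned above: it assumes an uncompressed VCF in which non-variant sites carry `.` in the ALT column, which may not hold for every caller. The function name `split_sites` is made up for this illustration.

```python
def split_sites(lines):
    """Split VCF lines into (header, variants, non_variants).

    Assumption: a non-variant site is a record whose ALT column is ".".
    """
    header, variants, non_variants = [], [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        alt = fields[4]  # ALT is the fifth VCF column
        if alt == ".":
            non_variants.append(line)
        else:
            variants.append(line)
    return header, variants, non_variants
```

Passing an open file handle as `lines` works, and writing `header + variants` back out gives a much smaller file in which SnpEff's per-variant error messages are easier to spot.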

To remedy this, I ran your VCF through the “Text transformation with sed” tool with the SED Program /^[^#]/s/^AL123456.3/Chromosome/ and then through snpEff, and it worked.
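If you are working outside Galaxy, the same rename can be sketched in Python. This is only an illustration of the idea behind the sed program, not a replacement for it: the function name `rename_chrom` is made up, the input is assumed to be an uncompressed VCF, and the match is on the exact CHROM column value rather than a regular expression.

```python
def rename_chrom(lines, old="AL123456.3", new="Chromosome"):
    """Yield VCF lines with the CHROM column renamed on data lines only.

    Header lines (starting with "#") pass through unchanged, which mirrors
    the /^[^#]/ address in the sed program.
    """
    for line in lines:
        if not line.startswith("#"):
            chrom, sep, rest = line.partition("\t")
            if chrom == old:
                line = new + sep + rest
        yield line
```

Like the sed program, this leaves any `##contig=<ID=AL123456.3,...>` header line untouched. If a downstream tool complains about that, the header would need renaming as well.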

I took the workflow you used, with the modification described above, and turned it into a Galaxy workflow that is accessible at https://usegalaxy.org/u/pvanheus/w/mtb-map-and-variant-call .

Honestly though, I’d just use snippy for variant calling. :slight_smile: You can find it on usegalaxy.eu.

Peter

BTW we (at the South African National Bioinformatics Institute - SANBI) have been working extensively on M. tuberculosis bioinformatics using Galaxy - perhaps this is something we can discuss via email.

Hi Peter,
Thank you for the extensive help. I am new to Galaxy.
Could you please tell me which files I need to pick for the “select at runtime” inputs in Step 8: bcftools call (the sample file under “restrict to”, and the ploidy file and sample file under “Select Predefined Ploidy”)? And likewise for Step 11: SnpEff eff (“Use custom interval file for annotation” and “Only use the transcripts in this file”)?
I would love to connect with you via email.

Thank you
Sanjay

The default for bcftools is to treat everything as haploid, which is correct for a bacterium. And then you don’t need to restrict to particular regions in the bcftools or snpEff steps. If you want to filter out things like PE/PPE genes, consider this script: https://github.com/combat-tb/tb_variant_filter - which is not on any of the main usegalaxy servers but is available via bioconda.

Hi Peter,
I would like to annotate my called variants (from Snippy) using SnpEff to look at the potential effect on the protein. I tried to follow the workflow you listed; it runs, but I do not get the gene information. In the INFO field I get this instead: QR=0;RO=0;DP=1775;AB=0;AO=402;QA=14790;TYPE=snp;EFF=(MODIFIER||||||||||T|ERROR_CHROMOSOME_NOT_FOUND)
Do you have any idea what I can do?

Sorry I didn’t look at this message board for a long time. If you get the “ERROR_CHROMOSOME_NOT_FOUND” error, it is because your reference chromosome name does not match the one SnpEff uses. Depending on the database you’re using, SnpEff uses different names, e.g. Chromosome or NC_000962. I often use a SED or AWK script (in Galaxy) to make my VCF match the expected reference name. One option besides using SnpEff outside of Snippy is to use the Genbank format of your reference genome (if you have one) - that is the approach described in this tutorial: M. tuberculosis Variant Analysis
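To check for this kind of name mismatch before running SnpEff, a small Python sketch like the following can list the chromosome names a VCF actually uses. The function names are made up for illustration, the input is assumed to be an uncompressed VCF, and the set of expected names is something you have to look up for your own SnpEff database (the `{"Chromosome"}` in the usage note is just the example from this thread).

```python
def chrom_names(lines):
    """Return the distinct CHROM values of a VCF, in order of first appearance."""
    names = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom = line.split("\t", 1)[0]
        if chrom not in names:
            names.append(chrom)
    return names


def unexpected_chroms(lines, expected):
    """Return CHROM values that are not in the expected set of names."""
    return [name for name in chrom_names(lines) if name not in expected]
```

For example, `unexpected_chroms(open("variants.vcf"), {"Chromosome"})` (with a hypothetical file name) returning a non-empty list suggests the VCF needs renaming before annotation.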