I annotated my variants with SnpEff using SnpEff eff: annotate variants (Galaxy Version 4.3+T.galaxy1) on usegalaxy.eu.
When checking the annotated information for two example variants and comparing it to the information available on the Ensemble webpage (Variant Effect Predictor, web version), I found some different transcript information.
For the transcript ENST00000595514 the SnpEff annotation did not include " WARNING_TRANSCRIPT_NO_START_CODON", even though this information is supplied by Ensemble. For all other incomplete transcripts, the warning was correctly annotated.
Furthermore I did not find a way so far to double check the HGVSc annotation by SnpEff for downstream/upstream_gene_varaints, as this is not supplied by e.g. the Ensemble VEP. Is there another easy way to double check the annotated information for downstream/upstream_gene_variants. Or is this not needed, as I can be sure that the annotation process worked correctly, when the rest of the SnpEff annotation (everything except HGVSc for upstream/downstream_gene_variants) for this variant has been proofed correct?
I would be glad for any suggestions!
All the best,
With the tool you are using, the available built-in indexes can vary across Galaxy servers. When built-in indexes are used, it is important to know these are pre-computed and based on a specific version of the annotation sourced from the SnpEff website: https://sourceforge.net/projects/snpeff/files/databases/v4_3/.
Pre-computed SnpEff indexes can become outdated over time due to additions, changes, or fixes. The annotation reviewed directly at the Ensembl website is always the most current available.
The most current SnpEff annotation can be downloaded or created directly with these tools:
As far as double-checking annotation results, the most important item is to avoid genome/build chromosome naming conflicts between inputs. The overall VCF format should also be confirmed to be correct, especially if you needed to modify that VCF in any way (adjust the chromosome names).
Tools that can help with QC or renaming chromosomes (if needed) are here. Usage instructions are on the tool forms.
SnpEff chromosome-inf: get chromosome names in a built-in SnpEff database
SnpSift Extract Fields: get chromosome names in a VCF dataset
Replace chromosome names: change chromosome names in a tabular/VCF file. Use as needed for this and other reasons/tools/protocols. Many chromosome mapping files are available between genome builds/versions or you can create your own mapping file.
SnpSift vcfCheck: basic VCF format validation. It can be run first. And definitely should be run after making any changes.
So I am still confused, why I have a slight difference in the annotated information for one of my example variants compared to the output of the variant effect predictor. (Missing information ( WARNING_TRANSCRIPT_NO_START_CODON) for ENST00000595514). Do you have another idea?
I think my second question was misleading. When checking the annotated information with VEP (Variant Effect Predictor by Ensemble) I had to find out that VEP does not supply the HGVSc for upstream and downstream variants. So I was wondering if there is another option to check the HGVSc for upstream and downstream variants? Or if I am going to far and don’t have to do this as I can be sure that using the tool snpeff eff annotate with the downloaded snpeff database on usegalaxy.eu without an error, that the annotation worked fine?
Thanks for the information regarding the QC and renaming chromosome. As I ran the bcftools norm after variant calling and before annotation, I thought this would be sufficient to make sure the vcf is compatible for annotation afterwards (as recommended in the “somatic variant tutorial”). Am I wrong?
The version of the SnpEff hosted pre-built indexes and the Galaxy tool that retrieves those indexes have some cross-dependencies. This is due to how database URLs change over time at SnpEff. It is not always obvious which SnpEff indexes can be retrieved with any particular SnpEff download tool version. If the most current tool version fails for a particular pre-built index, then try earlier tool versions.
That said, the current version of the tool SnpEff Download Download a new database (Galaxy Version 4.3r.1) will extract the GRCh37.75 genome annotation at https://usegalaxy.eu. The GRCh37.87 version of the data is more recent (2018) but will not download, even with this version of the Galaxy tool. And GRCh37.98 is the most current, and not pre-indexed at the SnpEff site. That’s Ok, you can make your own based on Ensembl 98, or whatever the most current is at any time.
Ensembl annotation is at version 98 now. Version 75 is older.
You don’t have to use pre-compiled indexes. Try making your own, based on the most current annotation (98), with the tool SnpEff build, if you want it to match what is at the Ensembl website. The tool form explains the different “build” options.
If you included upstream/downstream variants in a VCF file, the HGVSc would be reported (when those variants are in a coding region). Whether to do this or not is up to you. The annotation is a match for the specific variant in your VCF, so there is no reason to think that something went wrong. Unless the chromosome names are not quite matching up…
The same base reference genome build/version needs to be used/input for all steps in an analysis pathway. GRCh37 (has Ensembl chromosome names) is the same underlying genome (nucleotides) but a different build version (identifiers) versus hg19 (has UCSC chromosome names). Trying to use both together can lead to “green” dataset content problems, not always “red” dataset errors. That is why I suggested the other tools to check the chromosome naming identifiers in your particular VCF data, the indexes used (built-in, downloaded, or created), and to make adjustments as needed.
Reference data/indexes used in tutorials becomes outdated over time. The training purpose is to show how to use the tools/workflow covered. Tutorials also use simplified inputs to keep the processing small in size and faster to process. When running an analysis with real full-size data, use the most current annotation indexes for the most current annotation results.
The majority of the data I checked using the VEP, which only works with the current version of Ensemble. Thanks to your good advice, I know looked at the Transcript in the old Ensemble Version (http://feb2014.archive.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000074181;r=19:15272436-15280961;t=ENST00000595514). However, I did not see what you have told me, which is that on the Ensemble Archive Website (version 37.75) there is no annotation regarding the warning “WARNING_TRANSCRIPT_NO_START_CODON”, but I actually do find this information in the transcript table online. Or does only the text below the text oft the “Summary”-Saction for this Transcript is, what gets annotated using this Ensemble version?
In regard to your recommendation to be careful when using GRCh37 and hg19 in the same analysis workflow ( “GRCh37 (has Ensembl chromosome names) is the same underlying genome (nucleotides) but a different build version (identifiers) versus hg19 (has UCSC chromosome names)”.): Both variant callers (for my two different data types) use hg19 (UCSC) as the reference genome. Using bcftools norm the Reference genome I selected is “Human (Homo sapiens): hg19”. However, when I annotated my variants with snpeff eff I use GRCh 37.75. Is that a problem, I did not know of?
The reason I used GRCh is because SnpEff does not recommend to use the hg19 from UCSC anymore. ( WARNING: Usage of hg19 genome is deprecated and discouraged, you should use GRChXX.YY instead (e.g. the latest version at the time of writing is GRCh37.70). Reference sequence and annotations are made for an organism version and sub-version. For examples human genome, version 37, sub-version 70 would be called (GRCh37.70). UCSC doesn’t specify sub-version. They just say hg19. This annoying sub-version problem appeared often and, having reproducibility of results in mind, I dropped UCSC annotations in favor of ENSEMBL ones (they have clear versioning)."http://snpeff.sourceforge.net/SnpEff_manual.html ).
As the vcf had been annotated with GCRh37 before by Illumina and Quiagen Software for the two sample times respectively (which for consistency of data analysis I removed and now want to annotate both sample types with the same annotation engine), I thought it would be fine to use GCRh 37.75.
What do you think?
I see the difference. Thanks a lot.
Thanks again for your help! I am very glad for your support!
Ok, that information helps. Let’s back up a bit – it is not clear now which exact genome version/build was used for the original mapping and variant calling steps.
Would you please post back the entire header and first 10 or so data lines from the two original vcf files? The versions before you started to manipulate them in Galaxy. Or, you can send that in a direct message if you want to keep it private for some reason. Copy/paste is fine – we are checking content, not format.
What the header will clarify:
There are other content differences between hg19 (UCSC) and GRCh37 (NCBI, Ensembl) besides updated/additional assembly and annotation content and the chrom identifier nomenclature differences. The version of the mito assembly incorporated is different in the v37 releases across data providers, which complicates it even more. UCSC used the human v36/hg18 mito for v37/hg19 while NCBI/Ensembl used an updated mito assembly for v37/CHRh37. The short-ish “why” is that which mito assembly to incorporate changed during the data staging steps for the v37 release. UCSC had already started annotating the genome using the older mito and decided to not re-do those steps (computationally expensive steps) since the v38 release was expected to be released quickly after … then it wasn’t. Full details about the hg19 assembly content, including this part, is at the UCSC website: http://genome.ucsc.edu/ > Genomes > hg19 > Assembly details.
I’m also curious if the genome mapped against was stripped down to include just the 1) primary autosomes (1-23/chr1-chr23), 2) both X/chrX and Y/chrY or just X/chrX, and 3) mito (MT/chrM) – and which version of mito. A common choice when calling variants is to use just autosomes, X, and mito (excluding Y to avoid PAR mapping issues). Or, it could have included Y plus possibly all the haplotypes, unmapped, ect in the original release.
hg19 vs GRCh 37.75
I double checked a couple thinks to ensure that snpeff is working fine:
i)There are no errors in summary stats reported.
ii)For all chromosomes (where there is a variant detected), there is an annotation in the vcf visible. My QIAGEN example-sample does not contain a variant chr. 18,21,23,Y, so that I cannot check those chromosomes. However, for all other variants there is an annotation available.
iii)I checked if there is an error annotated. For both example variants there is no error annotated. (ERROR_CHROMOSOME_NOT_FOUND, http://snpeff.sourceforge.net/SnpEff_faq.html)
Maybe that might be enough to indicate that the annotation works probably?
Unfortunately the Ensemble help desk disagrees, and says that the transcript warning is definelty in the database version 37.75. They underlined though, that what other annotation programs (snpeff via galaxy) and what they “make of the database” is not of their knowledge. Well I honestly do not exactly know what to do with this information. What do you think? How can we be so sure about that the “missing” transcript warning in my snpeff annotation is not really missing, but just not available in the GRCh 37.75 snpeff database, even though it is visible on the ensemble archive website for GRCh37.75. Do you have another idea how I could double check this (maybe view the snpeff database somehow?)?
So I finally have a guess why this happens.
This specific transcript, ENST00000595514, has its biotype assigned as Nonsense mediated decay both in Ensemble 37.75 and in the snpeff genome database of that same version.
This turns out to be key here since, by default, SnpEff does NOT report its four transcript consistency warnings:
for anything but transcripts with biotype Protein coding.
I didn’t find this documented anywhere, but it’s in SnpEff’s source code, and it makes kind of sense though you might disagree.
Just for completeness, because the behavior surprised me: WARNING_TRANSCRIPT_INCOMPLETE does NOT mean that either the start or the stop codon is missing, but that the number of bases in the transcript’s cds is not a multiple of 3. This one IS documented though at http://snpeff.sourceforge.net/SnpEff_manual.html.
I hope this answers your question?