Snpeff eff annotate Info field ends in the next vcf line

roselucia · November 14, 2019, 12:38am

Hi everyone,

I am using SnpEff eff annotate on Galaxy EU to annotate my variants from targeted sequencing. For a few variants in some samples, I found that the SnpEff annotation (INFO tag “ANN”) continues on the next line, so that the values of the FORMAT column are no longer on the same line (please see picture below). The two affected variants seem to have a tremendously long SnpEff annotation.
I am wondering if there is an explanation for this and if it can be easily fixed (e.g. by restricting the length of the SnpEff annotation). As this affects only a few variants per sample, it is not too difficult to identify them and make sure the data does not get lost. However, it is sometimes a little tricky to handle in downstream analysis. So I would be delighted if there were a solution, and I am open to any ideas.

Thanks a lot!
All the best,
Rose

wm75 · November 14, 2019, 8:09am

Hmm, all I’m seeing in that screenshot is two variant records imported into excel(?), so first of all, have you verified that the line break is actually part of the original data?
Maybe the spreadsheet tool just decided to wrap the content somehow?
Assuming you see this in Galaxy, I would additionally verify this by downloading the file and opening it in a plain text editor (something like Wordpad, Notepad++ or gedit, depending on your platform).
If it is really SnpEff annotate that is responsible for this, you should also see an increase in the number of lines (not comment lines) reported by Galaxy for the dataset (compared to the input dataset).
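
The line-count check described above can also be done outside Galaxy. Here is a minimal sketch in Python, assuming the datasets have been downloaded as plain-text VCF files; the function names and the file name in the usage note are hypothetical. It counts the non-comment lines and additionally reports lines with fewer than the eight mandatory VCF columns, which is what a record broken across two lines would look like.

```python
def count_records(lines):
    """Count lines that are neither empty nor header/comment lines ('#...')."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def short_records(lines, min_columns=8):
    """Return 1-based line numbers of records with fewer than min_columns
    tab-separated columns (a VCF record has at least 8 fixed columns)."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        if len(line.rstrip("\n").split("\t")) < min_columns:
            bad.append(number)
    return bad
```

Both functions accept any iterable of lines, so an open file handle works, e.g. `with open("annotated.vcf") as fh: print(count_records(fh))`. If the count matches the input dataset and `short_records` returns an empty list, the line break is most likely not in the data itself.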

roselucia · November 14, 2019, 9:53am

Hi Wolfgang,

I have to apologise. Looking at the data in TextEdit or counting the number of lines in the terminal (wc) suggests that the original data is correct. I saw this “parsing error” when I opened the tabular file generated by vcf2tsv in Excel (which I eventually have to do for downstream analysis). I did not consider that, even after using vcf2tsv, the number of characters in the ANN column might still exceed the maximum an Excel cell can hold for the two variants (which of course it does). So I guess I should “manipulate” the original tabular file to limit the length of the information in the ANN column, to make sure that if I have to present some of those variants in e.g. Excel, they won’t exceed the maximum character count of a cell. What do you think?

Thanks a lot!
All the best,
Rose
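
To find out in advance which cells would overflow in Excel, something like the following could be run on the vcf2tsv output. It is only a sketch: the function name is made up, and it relies on Excel's documented limit of 32,767 characters per cell.

```python
EXCEL_CELL_LIMIT = 32_767  # maximum number of characters Excel stores in one cell


def oversized_cells(lines, limit=EXCEL_CELL_LIMIT):
    """Return (row, column, length) for every tab-separated cell longer than limit.

    Rows and columns are numbered from 1, matching what a spreadsheet shows.
    """
    hits = []
    for row_number, line in enumerate(lines, start=1):
        cells = line.rstrip("\r\n").split("\t")
        for column_number, value in enumerate(cells, start=1):
            if len(value) > limit:
                hits.append((row_number, column_number, len(value)))
    return hits
```

Called on an open file handle, it lists exactly the variants whose ANN value would be cut off or spilled into the next row on import.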

roselucia · November 17, 2019, 11:31am

Hi again,

Earlier, I tried to restrict the amount of annotation written by SnpEff (e.g. to the first 10 transcripts) using the tool Replace Text in entire line (Galaxy Version 1.1.2). Somehow my chosen pattern is not working. I tried two different settings in order to keep only the first 10 transcripts of the SnpEff annotation (ANN):

Find pattern: ([^\t]+\t {7}) (.+;)* (ANN=[^,\t;]+) (,[^,;\t]+{9})* (,[^,;\t]+)* (.*)
Replace with: \1\2\3\4\6

Find pattern: (.+\t.+\t.+\t.+\t.+\t.+\t.+\t) (.+;)* (ANN=[^,\t;]+) (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^\t;]+)* (.*)
Replace with: \1\2\3\4\5\6\7\8\9\10\11\12\14

However, I just tried to restrict the number of characters directly in the final .txt file with the tool Replace Text in a specific column (Galaxy Version 1.1.3), limiting the ANN column to 30000 characters with the following settings:
Find pattern: ^(.{30000})(.*)
Replace with: \1

This worked fine. So I guess this would be a possible way to prevent the ANN column from exceeding the maximum character count (when using e.g. Excel).
If someone knows why my first two options did not work out, I would be glad for suggestions. (As sed is so powerful, it is always good to learn more about it ;-)).

All the best,
Rose
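
On the two patterns that did not work: as far as I can tell, a quantifier such as {7} or {9} applies only to the element directly before it, not to the whole group, so "\t {7}" and "[^,;\t]+{9}" do not mean "seven columns" or "nine entries". In the second pattern each starred group is greedy, so the first one can already consume every entry, and sed only knows the back-references \1 to \9. A regex-free alternative is to split the line instead. The following is a rough sketch under the assumption that the INFO field is the eighth tab-separated column and that ANN entries are separated by commas; the function name is hypothetical.

```python
def limit_ann(line, max_entries=10):
    """Keep only the first max_entries comma-separated ANN entries of a VCF record.

    Header lines and lines with fewer than 8 columns are returned unchanged.
    """
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 8:
        return line
    info = columns[7].split(";")  # INFO items such as DP=20;ANN=...;AF=0.5
    for index, item in enumerate(info):
        if item.startswith("ANN="):
            entries = item[len("ANN="):].split(",")
            info[index] = "ANN=" + ",".join(entries[:max_entries])
    columns[7] = ";".join(info)
    return "\t".join(columns) + newline
```

Applied line by line to the annotated VCF (before vcf2tsv), this keeps whole annotation entries, whereas cutting at 30000 characters can stop in the middle of one.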
