SnpEff eff annotate: INFO field continues on the next VCF line

Hi everyone,

I am using SnpEff eff annotate on Galaxy EU to annotate my variants from targeted sequencing. For a few variants in some samples, I noticed that the SnpEff annotation (INFO tag “ANN”) continues onto another line, so that the values of the FORMAT column are no longer on the same line (please see picture below). The two affected variants seem to have a tremendously long SnpEff annotation.
I am wondering whether there is an explanation for this and whether it can be easily fixed (e.g. by restricting the length of the SnpEff annotation). As this affects only a few variants per sample, it is not too difficult to identify them and make sure the data does not get lost. However, it is sometimes a little tricky to handle in downstream analysis. So I would be delighted if there were a solution, and I am open to any ideas.

Thanks a lot!
All the best,
Rose

Hmm, all I’m seeing in that screenshot is two variant records imported into Excel(?), so first of all, have you verified that the line break is actually part of the original data?
Maybe the spreadsheet tool just decided to wrap the content somehow?
Assuming you see this in Galaxy, I would additionally verify this by downloading the file and opening it in a plain text editor (something like Wordpad, Notepad++ or gedit, depending on your platform).
If SnpEff annotate really is responsible for this, you should also see an increase in the number of lines (not counting comment lines) that Galaxy reports for the dataset, compared to the input dataset.
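
In case it helps, the same check can be scripted. This is only a minimal sketch in Python, not a Galaxy tool: the function name `check_vcf_records` and the tiny example data are made up for illustration. It counts the non-comment lines and flags every data line whose number of tab-separated columns differs from the `#CHROM` header line, which is what a record broken across two lines would look like.

```python
import io


def check_vcf_records(lines):
    """Count data lines and collect the 1-based numbers of lines whose
    column count differs from the #CHROM header line."""
    expected = None      # column count taken from the #CHROM line
    n_records = 0        # number of non-comment, non-empty lines
    suspicious = []      # line numbers that look like broken records
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue
        n_cols = len(line.split("\t"))
        if line.startswith("#CHROM"):
            expected = n_cols
            continue
        n_records += 1
        if expected is not None and n_cols != expected:
            suspicious.append(lineno)
    return n_records, suspicious


# Hypothetical example: the second record is split across two lines.
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\tANN=G|missense_variant\tGT\t0/1\n"
    "chr1\t200\t.\tC\tT\t50\tPASS\tANN=T|synonymous_variant\n"
    "GT\t0/1\n"
)
n_records, suspicious = check_vcf_records(example)
print(n_records, suspicious)
```

To run it on a downloaded dataset, pass an open file handle instead of the `io.StringIO` object. An empty list of suspicious lines means every record has the same number of columns as the header.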

Hi Wolfgang,

I have to apologise. Looking at the data in TextEdit or counting the number of lines in the terminal (wc) suggests that the original data is correct. I saw this “parsing error” when I opened the tab-separated file generated by vcf2tsv in Excel (which I eventually have to do for downstream analysis). It did not occur to me that, even after using vcf2tsv, the number of characters in the ANN column might still exceed the maximum that Excel allows per cell for the two variants (which of course it does). So I guess I should “manipulate” the original tab-separated file to limit the length of the information in the ANN column, so that if I have to present some of those variants in e.g. Excel, they won’t exceed the maximum character count of a cell. What do you think?

Thanks a lot!
All the best,
Rose

Hi again,

I earlier tried to restrict the number of annotations reported by SnpEff (e.g. to the first 10 transcripts) using the tool Replace Text in entire line (Galaxy Version 1.1.2). Somehow my chosen pattern is not working. I tried two different settings in order to keep only the first 10 transcripts of the SnpEff annotation (ANN):

Find pattern: ([^\t]+\t {7}) (.+;)* (ANN=[^,\t;]+) (,[^,;\t]+{9})* (,[^,;\t]+)* (.*)
Replace with: \1\2\3\4\6

Find pattern: (.+\t.+\t.+\t.+\t.+\t.+\t.+\t) (.+;)* (ANN=[^,\t;]+) (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^,\t;]+)* (,[^\t;]+)* (.*)
Replace with:1\2\3\4\5\6\7\8\9\10\11\12\14
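
One likely obstacle, assuming the tool hands the pattern to sed: sed back-references only go from \1 to \9, so \10 and above are not read as "group ten". An alternative that avoids counting groups altogether is to split the INFO column instead of matching it. The following is a minimal Python sketch, not a Galaxy tool: the function name `limit_ann_entries` is made up, and it assumes that INFO is the eighth column, that INFO items are separated by `;`, and that the individual ANN entries are separated by `,` with no unescaped commas inside an entry.

```python
def limit_ann_entries(vcf_line, max_entries=10):
    """Keep only the first max_entries comma-separated entries of ANN=...
    in the INFO column (8th column) of a single VCF line."""
    if vcf_line.startswith("#"):
        return vcf_line                      # header lines stay untouched
    newline = "\n" if vcf_line.endswith("\n") else ""
    fields = vcf_line.rstrip("\r\n").split("\t")
    if len(fields) < 8:
        return vcf_line                      # not a complete VCF record
    info_items = fields[7].split(";")
    for i, item in enumerate(info_items):
        if item.startswith("ANN="):
            entries = item[len("ANN="):].split(",")
            info_items[i] = "ANN=" + ",".join(entries[:max_entries])
    fields[7] = ";".join(info_items)
    return "\t".join(fields) + newline


# Hypothetical record with three ANN entries, cut down to two.
record = "chr1\t100\t.\tA\tG\t50\tPASS\tDP=20;ANN=G|a,G|b,G|c;AF=0.5\tGT\t0/1\n"
print(limit_ann_entries(record, max_entries=2), end="")
```

Cutting at entry boundaries keeps every remaining annotation complete, whereas cutting at a fixed character count can end in the middle of an entry.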

However, I just tried restricting the number of characters directly in the final .txt file with the tool Replace Text in a specific column (Galaxy Version 1.1.3), limiting the ANN column to 30000 characters with the following settings:
Find pattern: ^(.{30000})(.*)
Replace with: \1

This worked fine. So I guess this would be a possible way to prevent the ANN column from exceeding the maximum character count (when using e.g. Excel).
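
For anyone who prefers to do this outside Galaxy, here is a rough Python equivalent of the pattern above, as a sketch only. The function names and the column index in the example are made up; the 30000 default simply mirrors the value used above, which stays below Excel's documented limit of 32,767 characters per cell.

```python
import re

MAX_CHARS = 30000  # same limit as in the Galaxy step above


def truncate_cell(value, max_chars=MAX_CHARS):
    """Keep the first max_chars characters of a cell and drop the rest.
    Shorter values do not match the pattern and are returned unchanged."""
    return re.sub(r"^(.{%d})(.*)" % max_chars, r"\1", value)


def truncate_column(tsv_line, column, max_chars=MAX_CHARS):
    """Apply truncate_cell to one 0-based column of a tab-separated line."""
    cells = tsv_line.rstrip("\r\n").split("\t")
    if column < len(cells):
        cells[column] = truncate_cell(cells[column], max_chars)
    return "\t".join(cells)


# Hypothetical example with a tiny limit so the effect is visible.
print(truncate_column("chr1\tabcdef\t0/1\n", column=1, max_chars=3))
```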
If someone knows why my first two options did not work out, I would be glad for suggestions. (As sed is so powerful, it is always good to learn more about it ;-)).

All the best,
Rose