Delete lines from vcf with ALT="."

roselucia · November 1, 2019, 5:04pm

Hi everyone,

I would like to delete all lines from my VCF where ALT="." so that only the variants remain in the file. I thought I could use the genotype filter in the tool vcffilter, but I actually want to keep those lines where an ALT allele is reported, even if GT="./.".
I would be glad for help.
Thanks a lot!

Rose

wm75 · November 3, 2019, 11:20pm

Hi Rose,
I cannot think of a dedicated tool for this job right now, but it’s always good to remember that VCF is just a specialized (albeit rather complicated) form of tabular data. As such, you can apply to it many of the general Text Manipulation and Filter tools that most Galaxy instances offer.
In particular, you could use https://usegalaxy.org/root?tool_id=Filter1 as a simple means to achieve your goal, though it will require you to specify the number of header rows in your VCF under Number of header lines to skip (which, somewhat confusingly, means “number of header lines to exempt from filtering”).
If you can’t do that (for example, because you need to perform this step as part of a workflow on input datasets with a variable number of header lines), you could use the more flexible, but also harder to configure, https://usegalaxy.org/root?tool_id=Grep1 tool with a Matching regular expression pattern of ^((.+\t){4}[^.]\t)|#.+ (which should retain all lines where either the 5th tab-separated column is not a literal . or that start with a comment # symbol).
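For readers who want to see the intended result outside Galaxy, the column-based idea behind both suggestions can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name `drop_no_alt_records` and the sample records are made up, and it is not what either Galaxy tool runs internally. It keeps every line that starts with # and every record whose 5th tab-separated column is not a literal dot.

```python
def drop_no_alt_records(lines):
    """Keep header lines and records whose ALT (5th column) is not '.'.

    Hypothetical helper for illustration; not part of Galaxy or any VCF library.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # header/comment lines pass through untouched
            continue
        fields = line.rstrip("\n").split("\t")
        # ALT is the 5th tab-separated column (index 4).
        if len(fields) > 4 and fields[4] != ".":
            kept.append(line)
    return kept


# Made-up sample data: two header lines and three records.
sample = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tAT\t50\tPASS\tDP=10\n",
    "chr1\t200\t.\tG\tT\t50\tPASS\tDP=12\n",
    "chr1\t300\t.\tC\t.\t50\tPASS\tDP=8\n",
]

filtered = drop_no_alt_records(sample)
print("".join(filtered), end="")
```

Because header lines are recognised by their leading #, this variant does not need to know how many header lines the file has.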

roselucia · November 5, 2019, 1:58pm

Hi Wolfgang,

thanks for the two options.

The first option (the Filter tool) works very well. I checked all 21 VCFs and they all have the same number of header lines, so I guess it would even work as part of a workflow.

The second option (the Grep1 tool), for some reason, did not work for me with the expression ^((.+\t){4}[^.]\t)|#.+
Instead of 4800 lines with variants (where ALT isn’t “.”) and 60 header lines, I am left with only 4425 variant lines and 60 header lines.

Thanks for the fast help! I am glad one of the options worked on my data!
All the best,
Rose

wm75 · November 5, 2019, 4:33pm

Glad to hear the first solution is working for you. Just to have this documented properly, here’s the correct version of my second suggestion (the first version wasn’t tested carefully, sorry for that).
This regex pattern should work: ^(([^\t]+\t){4}[^.])|#.+
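To see how the two patterns differ, here is a small check using Python's re module as a stand-in for the tool's regex engine (an assumption; the sample records are invented). The first pattern needs a one-character, non-dot column followed directly by a TAB, so a record with a multi-character ALT such as AT does not match. That is one plausible reason for the lower line count reported earlier in the thread, though the thread itself does not confirm it.

```python
import re

first_pattern = re.compile(r"^((.+\t){4}[^.]\t)|#.+")
revised_pattern = re.compile(r"^(([^\t]+\t){4}[^.])|#.+")

# Invented example lines, keyed by a descriptive label.
records = {
    "header": "##fileformat=VCFv4.2",
    "snv": "chr1\t200\t.\tG\tT\t50\tPASS\tDP=12",
    "insertion": "chr1\t100\t.\tA\tAT\t50\tPASS\tDP=10",
    "no_alt": "chr1\t300\t.\tC\t.\t50\tPASS\tDP=8",
}

# A line is "kept" when the pattern matches somewhere in it.
first_kept = {name for name, line in records.items() if first_pattern.search(line)}
revised_kept = {name for name, line in records.items() if revised_pattern.search(line)}

print("first pattern keeps:  ", sorted(first_kept))
print("revised pattern keeps:", sorted(revised_kept))
```

One detail to be aware of: in both patterns the ^ anchor applies only to the first alternative, so the #.+ branch matches a # anywhere on a line. For typical records this makes no difference, but writing that branch as ^# would be stricter.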

roselucia · November 6, 2019, 8:16am

Hi @wm75 ,

thanks for the correction. I tried the new pattern and now it is working. I worked through the pattern in order to understand it correctly and was wondering why I could not use the following instead: ^((.+\t.+\t.+\t.+\t)([.]+\t)(.*))|#.+
“My” pattern left me with 0 lines.
May I ask you for an explanation?

Thanks
All the best,
Rose

PS: I was using the pattern below in the tool “Replace text in entire line” in order to keep only the first annotation in the ANN INFO tag. This worked fine. With this knowledge in mind I came up with the pattern for the “Select lines that match an expression” tool.
Tool: https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/1.1.2 on usegalaxy.eu

Find pattern: (.+\t.+\t.+\t.+\t.+\t.+\t.+\t)(.+;)(ANN=[^,\t;]+)(,[^;\t]+)(.*)

Replace with: \1\2\3\5
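As a rough illustration of what this find/replace pair does, the same pattern can be tried with Python's re.sub (assuming Python's regex engine behaves like the tool's for this pattern; the sample record and its annotation strings are invented). Group 4 captures everything from the first comma up to the next ; or TAB, and leaving \4 out of the replacement drops all annotations after the first.

```python
import re

find = r"(.+\t.+\t.+\t.+\t.+\t.+\t.+\t)(.+;)(ANN=[^,\t;]+)(,[^;\t]+)(.*)"
replace = r"\1\2\3\5"

# Invented record with two comma-separated annotations in the ANN tag.
line = (
    "chr1\t200\t.\tG\tT\t50\tPASS\t"
    "DP=12;ANN=T|missense_variant|GENE1,T|upstream_gene_variant|GENE2;AF=0.5"
    "\tGT\t0/1"
)

trimmed = re.sub(find, replace, line)
print(trimmed)
```

Note that the (.+;) group requires at least one other INFO entry before ANN=, so a record where ANN is the first INFO key would be left unchanged by this pattern.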

wm75 · November 6, 2019, 10:30am

The main problem with your pattern is that .+\t acts greedily, i.e. it tries to match as much of each line as possible while still enabling a match to the rest of the pattern. That is, the first .+\t may actually already span several columns.
In your previous pattern with ANN=, you prevent this by anchoring the match at that string (provided that ANN= occurs only once per line in the INFO column). In your current pattern, however, the [.]+\t is too ambiguous since you may have other columns after ALT ending in a . (my guess would be the FILTER column).
This is why my revised suggestion defines a column explicitly as at least one character, which is not a TAB and which is followed by a TAB. The complete pattern then requires four such columns followed by a character that is not a . (or, alternatively, a # as the first character on the line) followed by additional characters.
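The greedy behaviour can be made visible with a short Python sketch (invented sample record; Python's re is used only as a convenient stand-in for the tool's regex engine). It shows that a single .+\t can swallow several columns, that [^\t]+\t always stops at the first TAB, and that the [.]+\t part of the pattern in question can land on a later column than ALT.

```python
import re

# Invented record: ALT is "T", but FILTER (7th column) is ".".
line = "chr1\t200\t.\tG\tT\t50\t.\tDP=12"

# Greedy: ".+" also matches TABs, so it runs to the last TAB on the line.
greedy = re.match(r".+\t", line).group(0)
# Explicit column: "[^\t]+" cannot cross a TAB, so it stops after column 1.
one_column = re.match(r"[^\t]+\t", line).group(0)

print(greedy.count("\t"), "TABs consumed by the greedy unit")
print(one_column.count("\t"), "TAB consumed by the explicit column unit")

# The pattern asked about in the thread: its "[.]+\t" part is free to land
# on any later "."-only column, here FILTER rather than ALT.
asked = re.search(r"^((.+\t.+\t.+\t.+\t)([.]+\t)(.*))|#.+", line)
print(asked.group(2).count("\t"), "TABs inside the four-unit prefix")
```

In this example the four .+\t units together cover six columns, so the line matches even though its ALT column is not a dot.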
If you prefer, you could also use the following NOT Matching pattern, which is slightly shorter: ^([^\t]+\t){4}\. It simply says: skip all lines that start with four tab-separated columns followed by a . at the start of the 5th.
In general, you can test your patterns locally in a Linux terminal with:
grep -P 'your_pattern_in_single_quotes' input_file # for a Matching pattern
grep -vP 'your_pattern_in_single_quotes' input_file # for a NOT Matching pattern
Similarly use grep -cP or grep -cvP if you just want to count the number of lines you’d keep.
Cheers,
Wolfgang
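For anyone without a Linux terminal at hand, the same kind of check can be approximated in Python. This is a minimal sketch, assuming Python's re treats the NOT Matching pattern the same way as grep -P does; the sample lines are invented. The two list comprehensions play the roles of grep -P and grep -vP, and len() gives the counts that -c would report.

```python
import re

# NOT Matching pattern from the post: four TAB-terminated columns, then a "."
skip = re.compile(r"^([^\t]+\t){4}\.")

# Invented sample: two header lines and three records (ALT = AT, T and ".").
lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tAT\t50\tPASS\tDP=10",
    "chr1\t200\t.\tG\tT\t50\tPASS\tDP=12",
    "chr1\t300\t.\tC\t.\t50\tPASS\tDP=8",
]

dropped = [line for line in lines if skip.search(line)]   # like grep -P
kept = [line for line in lines if not skip.search(line)]  # like grep -vP

print(len(dropped), "line(s) match the pattern and would be removed")
print(len(kept), "line(s) do not match and would be kept")
```

Header lines are kept without a separate # branch because they never have a dot at the start of a 5th column.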
