Extra line of information generated by VCFtoTab-delimited

Hi Galaxy community,

I have a question regarding the tool “VCFtoTab-delimited”.
The output tabular file has twice as many lines as the input VCF file.
Looking further into the data, each line of the VCF file becomes two lines in the tabular file.
That is to say, each position has two lines of information, named “Sample1” and “1.vcf.sorted_Sample1” in the SAMPLE column. The variant calling metrics, such as ABQ, AD and FREQ, also differ between the two lines.
Can anyone explain how this can happen, and which line I should take as the correct one? I am looking for variants in specific genes as well as their VAF.

Thanks for your support.

Susan


Hi Susan,
it is the job of the “VCFtoTab-delimited” tool to expand the nested information from a VCF dataset into a flat tabular structure. For the INFO column of a VCF this means that it parses the ;-separated subfields in that column and turns them into separate tab-separated columns. That’s the easy part.

If you have a multisample VCF dataset, however, each single variant record holds the information about the variant metrics of each individual sample (in your VCF input dataset scroll to the right past the FORMAT column to see the sample-specific information). Since there is no easy way to flatten this information into a single tab-separated line, the tool breaks multisample variant records into one line per sample. In the process it duplicates the general variant info (the VCF CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO columns), then on each line, appends the flattened info about one specific sample.
From your sample names, “Sample1” and “1.vcf.sorted_Sample1”, it appears that maybe you didn’t intend to have multiple samples in your input? Maybe that’s why you got confused?

Differences between the two lines should really only occur for sample-specific info, as explained above. If you’re seeing differences between fields that are taken from the INFO column of your VCF, that would indeed be strange.
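
To make the expansion described above concrete, here is a minimal Python sketch of the idea. It is not the tool's actual code: the function name `flatten_record` and the example values are made up for illustration, and the real tool may name or order its output columns differently.

```python
def flatten_record(header_line, record_line):
    """Expand one VCF record into one flat dict per sample (illustrative only)."""
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    fields = record_line.rstrip("\n").split("\t")

    # Fixed columns shared by all samples: CHROM, POS, ID, REF, ALT, QUAL, FILTER
    shared = dict(zip(columns[:7], fields[:7]))

    # INFO is a ';'-separated list of key=value pairs (flags carry no value)
    if fields[7] != ".":
        for entry in fields[7].split(";"):
            key, _, value = entry.partition("=")
            shared[key] = value if value else "True"

    # FORMAT names the ':'-separated subfields found in every sample column
    format_keys = fields[8].split(":")

    rows = []
    for sample_name, sample_field in zip(columns[9:], fields[9:]):
        row = dict(shared)              # duplicate the general variant info
        row["SAMPLE"] = sample_name     # one output line per sample
        # Note: a FORMAT key with the same name as an INFO key would
        # overwrite it in this simplified sketch.
        row.update(zip(format_keys, sample_field.split(":")))
        rows.append(row)
    return rows


# Invented example record with two sample columns
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\t1.vcf.sorted_Sample1"
record = "chr1\t100\t.\tA\tG\t.\tPASS\tADP=50;WT=0\tGT:AD:FREQ\t0/1:2:4%\t0/1:5:10%"

rows = flatten_record(header, record)
for row in rows:
    print(row["SAMPLE"], row["POS"], row["ADP"], row["FREQ"])
```

With this input the sketch yields two rows that share the INFO-derived value (ADP) but differ in the sample-specific FREQ, which is the pattern described in the question.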

Hi wm75,

Thanks for your quick and helpful reply, as always.
I would like to give you more detailed information, so that we have more clues to work with.

Attached are 4 pictures showing the VCF (pic 1&2) and the tabular file converted from it (pic 3&4). The line I highlighted is one of our regions of interest. As you can see, the FREQ differs between the two lines in the tabular file. (In addition, I examined the BAM file at this position in IGV, and the FREQ shown there is 4%.)

Hope the extra information helps.

[4 screenshots: pic 1&2 show the VCF, pic 3&4 show the converted tabular file]

Well, it’s very hard to read the text in your screenshots, but at least I gathered that the ABQ, AD and FREQ values you are mentioning are indeed sample-specific fields.
The output from the “VCFtoTab-delimited” tool indicates that you have two samples analyzed: one named “Sample1”, the other one “1.vcf.sorted_Sample1”. So for every variant you will be getting two lines of output, one describing the variant metrics found in the first sample, the other describing them for the second sample. Of course, the allele frequencies and counts observed in the first and the second sample will generally be different.
Your VCF screenshots show only one sample, however, at least as far as I can recognize anything. Did you maybe not scroll all the way to the right when you took them?
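
Instead of scrolling, one quick way to check how many samples a VCF contains is to read the `#CHROM` header line and list everything after the FORMAT column. Here is a minimal Python sketch; the helper name `sample_names` and the example lines are made up for illustration.

```python
def sample_names(vcf_lines):
    """Return the sample column names from the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            # Columns 1-8 are fixed, column 9 is FORMAT, the rest are samples
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


# Invented example; for a real file you could pass an open file handle instead
example_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\t1.vcf.sorted_Sample1\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\tADP=50\tGT:FREQ\t0/1:4%\t0/1:10%\n",
]

names = sample_names(example_vcf)
print(len(names), names)
```

If this reports more than one name for a dataset that should hold a single sample, an upstream step has added a sample column.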

Hi wm75

Sorry for the low-resolution screenshots. I am glad that you could still make out the tiny blurry text.
I re-examined my VarScan files and confirmed that there is no extra column after the sample column.
What I found is that the extra column appears after I merge the VarScan files!
For each sample, I performed two VarScan runs, one for SNPs and one for indels, because I could only choose one of the two when running VarScan.

My idea was to run VarScan twice to detect the different types of variants, then annotate the results with SnpSift Annotate, and then merge the annotated VarScan files.
I think this is how the two samples per line were generated.

Is it wrong to do so? Do you have any advice?


Merging the individual files in that case is probably not bad, but obviously, you do not want to create a new (pseudo)sample when really the data comes from one and the same sample. You forgot to mention which tool you used to merge the files, but one that should (in theory, untested) do the right thing (i.e. combine variants from one sample into a single file) is https://usegalaxy.org/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/bcftools_concat/bcftools_concat/1.4.0.1.
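
To illustrate the difference between combining variants of one sample and creating a second (pseudo)sample, here is a minimal Python sketch. It is not how bcftools concat is implemented; the function name `concat_same_sample` and the example records are made up, and the sorting is simplified (chromosome names are compared as plain strings).

```python
def concat_same_sample(snp_records, indel_records):
    """Combine records from two call sets of the same sample.

    Each record is a list of VCF fields. Records are stacked row-wise and
    sorted by (CHROM, POS), so the number of columns, and with it the
    number of samples, stays unchanged.
    """
    combined = snp_records + indel_records
    return sorted(combined, key=lambda rec: (rec[0], int(rec[1])))


# Invented example records, each with a single sample column
snps = [
    ["chr1", "100", ".", "A", "G", ".", "PASS", "ADP=50", "GT:FREQ", "0/1:4%"],
    ["chr1", "300", ".", "C", "T", ".", "PASS", "ADP=60", "GT:FREQ", "0/1:8%"],
]
indels = [
    ["chr1", "200", ".", "AT", "A", ".", "PASS", "ADP=40", "GT:FREQ", "0/1:12%"],
]

combined = concat_same_sample(snps, indels)
for rec in combined:
    print(rec[0], rec[1], rec[3], rec[4], len(rec))
```

A sample-wise merge would instead place the two call sets side by side as separate sample columns, which is what leads to two output lines per variant after conversion to tabular.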

Beyond that (hopefully) immediate solution: is there any specific reason why you are using VarScan? And why do you have only one sample?
If, in fact, you have several samples, then calling variants on them jointly with Freebayes is almost certainly going to give better results than running them individually through VarScan.


Thanks wm75.
I do have more than one sample, and it was a good idea to use Freebayes for variant calling. The problem is that the default quality filter in Freebayes is Phred 33, and I could not find a way to change that to 28, which is my desired Phred threshold. Hence, I lose some output lines with Freebayes.

I found that in the VarScan tool I can actually choose to call the “consensus genotype”, which includes both SNPs and indels. So I no longer need to merge any VCF files downstream, and the double lines are no longer a problem.

It would be interesting to know how VarScan and Freebayes differ and how comparable they are, since their parameters are so different that it is difficult to compare the output of the two tools.

Susan