I’m seeking help with my variant analysis. I have four sequencing datasets that are biological replicates, meaning the samples originate from the same individual. I ran variant calling with Snippy, and the output lists multiple SNPs that differ between the datasets (see the table at the bottom). Now I want to construct a consensus sequence that takes the evidence from all the variants into account and, of course, identify the variants that occur with the highest frequency.
Can anyone guide me on how to accomplish this? Your assistance would be greatly appreciated!
You’ll need to provide the SNPs in VCF format. Then you can use this tool: bcftools consensus. Link at ORG
Scroll down on the tool form to review the Help section for quick help and links to related resources. This one includes GTN tutorials, too! Most tutorials have an associated workflow template that you can import and adapt.
Thank you for your reply!
When I apply your suggestion, I get a full consensus sequence for the dataset I provided. What I was looking for is a way to output only the alternative allele with the highest frequency at each position (as an additional column or list, if that is possible).
Maybe I am misunderstanding, but doesn’t Snippy output a Ref/Alt summary that considers all samples? I’m not sure of the exact output file name or how you ran the tool, but check the associated tutorial, since it lists the outputs.
I do know that some of the optional outputs depend on which optional reference annotation was provided, and the tutorial covers this.
If I am not understanding correctly, could you explain more? I thought you were looking to generate consensus sequences based on the SNPs. If you just want the consensus for the SNPs themselves, you probably already have that data or could rerun Snippy to create it.
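If the per-sample SNP calls are already available as tables, the "most frequent alternative allele per position" list could be tallied with a few lines of Python. This is only a minimal sketch: the function name `most_frequent_alt`, the input layout (one list of `(chrom, pos, ref, alt)` tuples per sample), and the example coordinates are all assumptions made for illustration, not Snippy's actual output format. It counts how many samples report each ALT, not read-level allele frequency.

```python
from collections import Counter, defaultdict


def most_frequent_alt(calls_by_sample):
    """Return one row per variant site with the ALT allele seen in most samples.

    calls_by_sample: {sample_name: [(chrom, pos, ref, alt), ...]}
    (hypothetical layout; adapt it to however the SNP tables are parsed).
    """
    counts = defaultdict(Counter)
    for calls in calls_by_sample.values():
        for chrom, pos, ref, alt in calls:
            counts[(chrom, pos, ref)][alt] += 1

    n_samples = len(calls_by_sample)
    rows = []
    for key in sorted(counts):
        chrom, pos, ref = key
        # On a tie, most_common() returns the allele that was counted first.
        alt, n = counts[key].most_common(1)[0]
        rows.append((chrom, pos, ref, alt, n, n_samples))
    return rows


# Made-up calls for four runs of one individual (illustrative values only).
example_calls = {
    "G1": [("chr1", 101, "A", "G"), ("chr1", 250, "C", "T")],
    "G2": [("chr1", 101, "A", "G")],
    "G3": [("chr1", 101, "A", "T"), ("chr1", 250, "C", "T")],
    "G4": [("chr1", 101, "A", "G")],
}

for row in most_frequent_alt(example_calls):
    print(*row, sep="\t")
```

The last two columns give the number of samples supporting the chosen ALT and the total number of samples, so sites supported by only one run are easy to spot and filter.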
Yes, the result you are showing is also what I get with Snippy. I am looking for a way to combine the different variants (in your case, those of a, b and c) into one single consensus sequence. In your example this would look something like this:
abc consensus
TCCACAA… and so on
Reference
TCCACAA…
So what I am trying to do is to combine the results of a set of data (a, b and c) into one (abc consensus). In my case G1, G2, G3 and G4 represent four sequencing runs for one single patient (patient G) and I need to create one single representative variant consensus sequence for patient G out of the information of the four sequencing runs.
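One way to picture the combined consensus is a majority vote: a SNP is written into the reference only if it is called in more than a chosen fraction of the runs. The sketch below assumes SNP-only calls (no indels), 1-based positions on a single reference sequence, and a hypothetical input of one `{position: alt}` dictionary per run; the function name `majority_consensus`, the `min_support` threshold, and the example calls are illustrative assumptions, not part of Snippy or bcftools.

```python
from collections import Counter


def majority_consensus(reference, snps_by_run, min_support=0.5):
    """Apply SNPs called in more than `min_support` of the runs to the reference.

    reference:   reference sequence as a string
    snps_by_run: {run_name: {pos_1_based: alt_base}} (hypothetical layout)
    """
    n_runs = len(snps_by_run)
    votes = {}
    for snps in snps_by_run.values():
        for pos, alt in snps.items():
            votes.setdefault(pos, Counter())[alt] += 1

    seq = list(reference)
    for pos, alt_counts in votes.items():
        alt, n = alt_counts.most_common(1)[0]
        # Keep the reference base unless enough runs agree on the same ALT.
        if n / n_runs > min_support:
            seq[pos - 1] = alt
    return "".join(seq)


# Toy example: four runs of patient G against a 7-base reference (made-up calls).
runs = {
    "G1": {3: "T"},
    "G2": {3: "T"},
    "G3": {3: "T", 5: "G"},
    "G4": {},
}
print(majority_consensus("TCCACAA", runs))
```

In this toy case the SNP at position 3 is supported by three of four runs and is applied, while the one at position 5 is seen only once and is ignored. In a real workflow, the same filtering idea could be used to produce a single VCF of well-supported variants, which could then be passed to bcftools consensus as suggested earlier in the thread.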