Assistance with Variant Analysis

Hello forum members,

I’m seeking help with my variant analysis. I have four sequencing datasets that are biological replicates, meaning the samples originate from the same individual. I ran variant calling with Snippy, and the output contains multiple SNPs that differ between the datasets (see the table at the bottom). I now want to build a consensus sequence that takes the evidence from all four datasets into account and, in particular, identify the variants that occur with the highest frequency.
Can anyone guide me on how to accomplish this? Your assistance would be greatly appreciated!

Hello @ge96dah

You’ll need to provide the SNPs in VCF format, then you can use this tool: bcftools consensus. Link at ORG

Scroll down on the tool form to review the Help section for quick help and links to related resources. This one includes GTN tutorials, too! Most tutorials have an associated workflow template that you can import and adapt.

Hope that helps!

Hello @jennaj !

Thank you for your reply!
When I apply your suggestion, I get a full consensus sequence for the dataset I provided. What I am looking for is a way to output only the alternate allele with the highest frequency at each position (as an additional column or list, if that is possible).

Hi @ge96dah

Maybe I am misunderstanding, but doesn’t Snippy output a Ref/Alt summary that considers all samples? I’m not sure of the exact output file name or how you used the tool, but check the associated tutorial since it lists the outputs.

I do know that some of the optional outputs depend on which optional reference annotation was provided, and the tutorial covers this.

If I am not understanding correctly, could you explain more? I thought you were looking to generate consensus sequences based on the SNPs. If you just want the consensus for the SNPs themselves, you probably already have that data or could rerun Snippy to create it.

Hi @jennaj,

snippy outputs this:

G1_2306_dh_004_pass_fastq_gz
GTTTTGTTCTTAGCAATAGCTGCTTAATAAGCCCTGACATATTGCCACGTCTGTTATTATTATA
G2_2306_dh_005_pass_fastq_gz
GTTCCACCCTTAGCAATAGCTGCGTAACGAGTCCTGACACATTATTGCATCTGTTATTGTTATA
G3_2306_dh_006_pass_fastq_gz
GTTTCACCCTTAACAATGATTGCGTAATAAGTCACAGTGTGCCGCTGCATCTGCTATTATTATA
G4_2306_dh_007_pass_fastq_gz
GTTCTACCCTTGGCAATAGCCACTTAACGAGTCCTGGTGTATTGCTGCATCTGTTATTGTTATA
Reference
ACCTTGTTTCCGATGGCAGCCATTAGGTAGACTCTGACACATTGCCATACTCACCGCCACCGCC

And I would like to have something like this:

Sample G consensus
GTTCCACCCTTA
Reference
ACCTTGTTTCCGATGGCAGCCATTAGGTAGACTCTGACACATTGCCATACTCACCGCCACCGCC

where the bold bases represent the ones with the highest frequency within the 4 datasets.
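
For what it's worth, the selection described here can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the sequences are already aligned column by column (as in the Snippy output above), "highest frequency" means the most frequent base that differs from the reference, and ties are broken by sample order. The function name `top_alt_consensus` is made up for illustration and is not part of Snippy or Galaxy. Only the first 12 columns of the posted alignment are used.

```python
from collections import Counter


def top_alt_consensus(reference, samples, ignore="-N"):
    """Per column, pick the most frequent base that differs from the reference.

    Falls back to the reference base when no sample carries an alternate
    allele. Ties are broken by first appearance in sample order, which is
    an arbitrary choice made for this sketch.
    """
    if any(len(seq) != len(reference) for seq in samples):
        raise ValueError("all sequences must have the same aligned length")
    consensus = []
    for pos, ref_base in enumerate(reference):
        column = [seq[pos].upper() for seq in samples]
        # Count only bases that are real calls and differ from the reference.
        alts = Counter(
            b for b in column if b != ref_base.upper() and b not in ignore
        )
        consensus.append(alts.most_common(1)[0][0] if alts else ref_base)
    return "".join(consensus)


# First 12 columns of the alignment posted above.
reference = "ACCTTGTTTCCG"
samples = {
    "G1": "GTTTTGTTCTTA",
    "G2": "GTTCCACCCTTA",
    "G3": "GTTTCACCCTTA",
    "G4": "GTTCTACCCTTG",
}

consensus = top_alt_consensus(reference, list(samples.values()))
print("Sample G consensus")
print(consensus)
```

Note that columns 4 and 5 are split 2:2 between the reference base and the alternate base, so a plain majority vote would be ambiguous there; counting only non-reference bases is what resolves them in this sketch.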

Hi @ge96dah

snippy-core will produce a similar output. I just ran a test to make sure it is working as expected. Shared history: Galaxy | Europe

This seems to be what you are looking for but please correct me if I’m still wrong. :slight_smile:

Sample output1:

>Reference
AA
>a
AA
>b
TA
>c
AC

Sample output2:

>Reference
TCCACAAGCCATTGTGTGTAATTAACCACTAATTGTGTATAAGTTTAAACTAATTGAAAA
GGTTATCCACAATAAAAAGGCGTTATTCAGGAGTTATCCACACTTTCTAGGAAAGGATTT
CATTGCGCCAATGTGTTAAACTATTTACCGAATACGAAAAAAAGACAAATAAATGAGGTT
GTGAAAAATGATATTTCAACGGCTTTTGAAAACTAGAGATACAGAGTTTTATCGAGTTAT
ACAAAACAGGAATATTGACGACGTATTTGGATACTTATTAATTCACGATAAACGGGAACC
AGCAGAAATTGACGATTTTAAGGTATTTGCAAAAAGTAATATAAATAAAGAAGCTTTTTC
AGTGAATATCAAAAAAAATCATATTTACACGATGTTTTTCCACTTTACTGATTTAGAGGA
AGAACAGGAAATTCCAAAATTTACTAAAGTTATTCGTTTTATAGAAGGACTTTTATCTTT
TCAGCCAGAAACAAGCCATTACGTTGATAACTATTTAATAAAGGAAAAACTAATTTTTGA
ATATCCTGCTGAATTTGAGAAAATCGGGGAGTTTGCTAAATATTTAGTAAAGCTTTCGGG
TCGTAAAATTACTATTCCAGACACAACGAGAGAAAAATATATCTATTTAACGCAATAATT
TTCGAAAAATGGTTTTTCTCTCTATAAAAATATGATATGA
>a
TCCACAAGCCATTGTGTGTAATTAACCACTAATTGTGTATAAGTTTAAACT---------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
----------------------------------------
>b
TCCtCAAGCCATTGTGTGTAATTAACCACTAATTGTGTATAAGTTTAAACT---------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
----------------------------------------
>c
TCCACAAGCCATTGTGTGTAATTAACCACTAATTGTGTATAAGTTTAcACT---------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
----------------------------------------

Hello @jennaj,

Yes, the result you are showing is also what I get with Snippy. I am looking for a way to combine the different variants (in your case those of a, b and c) into one single consensus sequence. In your example, this would look something like this:

abc consensus
TCCACAA… and so on

Reference
TCCACAA…

So what I am trying to do is combine the results of a set of data (a, b and c) into one (abc consensus). In my case, G1, G2, G3 and G4 are four sequencing runs for a single patient (patient G), and I need to create a single representative variant consensus sequence for patient G from the information in the four sequencing runs.
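
One possible way to do this outside of the tools mentioned so far is a small Python script that reads the aligned FASTA output and takes a majority vote per column. The sketch below is only an illustration under some assumptions: the input is a FASTA alignment like "Sample output1" above, gap characters (`-`) are ignored when counting, and ties go to the base seen first in sample order. The helper names `read_fasta` and `majority_table` are hypothetical. Whether a simple majority vote is the right model for four runs from one patient is a separate question.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def majority_table(records, group, ref_name="Reference"):
    """Return one row per column: (position, ref, top base, count, n_samples)."""
    rows = []
    for pos, ref_base in enumerate(records[ref_name], start=1):
        counts = Counter(records[s][pos - 1].upper() for s in group)
        counts.pop("-", None)  # ignore unaligned positions
        base, n = counts.most_common(1)[0] if counts else ("-", 0)
        rows.append((pos, ref_base, base, n, len(group)))
    return rows


# "Sample output1" from the post above.
alignment = """>Reference
AA
>a
AA
>b
TA
>c
AC
"""

records = read_fasta(alignment)
rows = majority_table(records, ["a", "b", "c"])
consensus = "".join(row[2] for row in rows)

print("pos\tref\ttop\tcount\tsamples")
for row in rows:
    print(*row, sep="\t")
print(">abc_consensus")
print(consensus)
```

The per-column rows double as the "additional column or list" asked for earlier in the thread, since each row carries the winning base together with how many of the samples support it.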