Using Snippy to find SNPs in bacterial DNA sequences

Dear all,

I am trying to use the Snippy tool in Galaxy. However, I can’t find a workflow for using my own WT sequencing reads (R1 and R2) as the reference to compare against the reads of my target bacterial strain. What is the workflow for this case?
Do I need to run a genome assembly first on my reference reads (R1 and R2) using SPAdes? With SPAdes I get scaffolds, but not a GTF or GFF3 file like the usual reference genome files. Can someone share the workflow for this? Thanks.

Welcome, @ymcr

Is this the tutorial you are trying to replicate with your own data? Microbial Variant Calling

The files we will be using are:

* `mutant_R1.fastq` & `mutant_R2.fastq` - the read files in fastq format.
* `wildtype.fna` - The sequence of the reference strain in fasta format.
* `wildtype.gbk` - The reference strain with gene and other annotations in genbank format.
* `wildtype.gff` - The reference strain with gene and other annotations in gff3 format.

Or, maybe this tutorial? M. tuberculosis Variant Analysis

This one doesn’t involve a reference assembly as a distinct file (fasta aka fna), but it does need a GFF3 and a GBK input. If you decide to create these instead of sourcing from a public repository, then you’ll also need the fasta genome assembly as part of those data prep steps.

If the reference strain is not available, then yes, you could create those three inputs. See the Genome Assembly and Genome Annotation sections of the tutorials. Each includes a workflow that you can import and customize.

Use SPAdes or Unicycler for the assembly. Annotating the genome is a distinct step; Prokka is one choice. The tutorials and tutorial workflows cover how to use these tools, and this forum has a lot of Q&A about troubleshooting. If an error seems novel, you can also ask new questions 🙂
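For anyone who wants to see the same three steps outside Galaxy, here is a minimal sketch that only builds the command lines and does not run anything. It assumes the command-line versions of SPAdes, Prokka and Snippy are installed; every file and directory name is a hypothetical placeholder, and the function name `build_commands` is made up for this illustration. In Galaxy the same choices are made in the tool forms.

```python
import shlex


def build_commands(wt_r1, wt_r2, mut_r1, mut_r2, workdir="analysis"):
    """Return the assembly, annotation and variant-calling commands, in order.

    All paths are hypothetical placeholders; nothing is executed here.
    """
    assembly_dir = f"{workdir}/spades"
    annotation_dir = f"{workdir}/prokka"
    snippy_dir = f"{workdir}/snippy"

    # 1. Assemble the wild-type reads into scaffolds (sequence only, no annotation).
    assemble = ["spades.py", "-1", wt_r1, "-2", wt_r2, "-o", assembly_dir]

    # 2. Annotate the scaffolds; with --prefix wildtype, Prokka names its
    #    outputs wildtype.gbk, wildtype.gff, wildtype.fna and so on.
    annotate = [
        "prokka", "--outdir", annotation_dir, "--prefix", "wildtype",
        f"{assembly_dir}/scaffolds.fasta",
    ]

    # 3. Call variants in the mutant reads against the annotated GenBank file.
    call = [
        "snippy", "--outdir", snippy_dir,
        "--ref", f"{annotation_dir}/wildtype.gbk",
        "--R1", mut_r1, "--R2", mut_r2,
    ]

    return [assemble, annotate, call]


if __name__ == "__main__":
    for cmd in build_commands("wt_R1.fastq", "wt_R2.fastq",
                              "mutant_R1.fastq", "mutant_R2.fastq"):
        print(shlex.join(cmd))
```

The point of the sketch is the order of the steps: the assembly produces the sequence, the annotation adds genes on top of it, and the variant caller is then given the annotated file as its reference.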

Thank you. Yes, I can try to follow the same workflow as in example 1. In that case, I will get the SNPs of both my test sample and my WT reference sample, each compared with the reference genome in GenBank or GFF3 format. Good. Let me try, and I will get back to you if I have any issues. One question: do I need to run FastQC and trim adapters before running Snippy?

For method 2, where I use my own reference sample's sequencing reads (FASTQ, R1 and R2), I understand that I need to assemble the genome first. Where can I find a workflow for these steps? I am aware that I can use SPAdes to get scaffolds, but I am not sure what to do after that. Can you let me know the next steps in Galaxy? Thanks.

Hi @ymcr

Those are big questions with many technical details. The GTN tutorials would be a good place to start.

Click into the link for the other tutorials and do one of these:

  1. Search with tool names or datatypes
  2. Navigate by tutorial topic domains

Domains/keywords for your use-case would probably include:

  • assembly
  • genome annotation
  • quality
  • single-cell (possibly? you can review – the quality tutorials explain how to learn the read type)

And, the training event we just hosted (over now) has suggested pathways for general and specific purposes. The materials are always available. This means you can work through these at your own pace anytime. Don’t skip the collections and workflows portions – these will make analysis life MUCH more fun. Smörgåsbord 2023

Hi Jennaj,

Thank you. I followed your suggestions and completed the assembly and genome annotation for my WT DNA-seq FASTQ files (R1 and R2).
The SNPs table shows data in columns 1 to 6, but columns 7 to 14 (FTYPE, STRAND, NT_POS, AA_POS, EFFECT, LOCUS_TAG, GENE, PRODUCT) are empty. Why is that? When I use a reference genome from GenBank (gbff), all columns from 1 to 14 are filled. Could you please let me know what I should do to get the SNP data in columns 7 to 14 when I use my own WT reference genome to find SNPs in the target sequence? Thanks.

Oh, I tried again with the Prokka gbk output, and now I can see data in columns 7 to 14. It seems that I can’t get data for columns 7 to 14 if I use the Prokka gff or fna output. Is this true?

Yes. Tools can only parse and manipulate the data that you provide, or that they generate themselves. Maybe take a closer look at these different file formats to better understand what each contains?
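As a rough way to explore that difference, here is a small Python sketch that guesses a reference file's format from its content and reports whether it carries feature annotations, sequence, or both. The function name `describe_reference` and the format checks are simplified assumptions for illustration, not a full parser: a FASTA file holds only sequence, a GFF3 file holds features (with the sequence optionally appended after a `##FASTA` line), and a GenBank file holds both. That is consistent with what was observed in this thread, where the annotation columns were filled only with the GenBank reference.

```python
def describe_reference(text):
    """Guess the format of a reference file and report what it contains.

    Simplified checks for illustration only; real files can be messier.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    first = lines[0] if lines else ""

    if first.startswith("LOCUS"):
        fmt = "genbank"
        has_features = any(ln.startswith("FEATURES") for ln in lines)
        has_sequence = any(ln.startswith("ORIGIN") for ln in lines)
    elif first.startswith("##gff-version"):
        fmt = "gff3"
        # A GFF3 feature line has nine tab-separated columns (eight tabs).
        has_features = any(
            not ln.startswith("#") and ln.count("\t") == 8 for ln in lines
        )
        # The sequence is optional and, if present, follows a ##FASTA line.
        has_sequence = any(ln.startswith("##FASTA") for ln in lines)
    elif first.startswith(">"):
        fmt = "fasta"
        has_features = False
        has_sequence = len(lines) > 1
    else:
        fmt = "unknown"
        has_features = False
        has_sequence = False

    return {"format": fmt, "has_features": has_features, "has_sequence": has_sequence}


if __name__ == "__main__":
    # Made-up one-record example, only to show the call.
    print(describe_reference(">contig1\nATGAAATAA\n"))
```

Running this on the fna, gff and gbk outputs of an annotation step is a quick way to see which file gives a downstream tool both the sequence and the gene annotations.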