FreeBayes variant calling and Snpeff variant annotation

Hi,

I am very new to Galaxy and bioinformatics overall (I have never had any education in it), however I am using Galaxy for my internship project, which involves analyzing NGS data from eight different yeast strains to identify differences between them and design yeast-specific primers. I already have completed several steps like quality checking, trimming, and mapping the data to their respective reference genomes. Additionally, I’ve isolated the specific region from my alignments in which I am interested in.

However, I’ve hit a roadblock. I am trying to pinpoint the differences in this region between two of the eight yeasts. Initially, I attempted to use FreeBayes for this by creating a RefSeq for one yeast strain and comparing it with the sequence of the other strain. However I keep getting errors like: 'unable to find FASTA index entry for ‘NC_001141.2’, that I don’t know to resolve. I addition my FASTA file looks like this:
-----------------------------------------------------------------------------------------------ATACGTGTTATACAGAAAATCTATTGGAGGATACAATTTGCTTGTTTTTGAGGTATTTTCAATAATTTTAAGTGCAGCTTTATCAATAAAATTCAAAGGTCCAATATCTCCGTTTGTTAAGTTAAGTTCTC-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So I understand that this is not a good FASTA file to use as RefSeq, I’m unsure how to proceed from here. I’m open to suggestions on how to create a suitable RefSeq or if there are alternative tools or methods to identify differences between the two yeast strains. Any help would be greatly appreciated. Thanks in advance!

Hi @Isa

Fasta files have a specific format. This is an older FAQ but still has useful details. There are many other references for the format online too.

Datatypes - Galaxy Community Hub

To use a fasta file as a reference target, you can use it directly from the history just like the other inputs. This is named a “custom genome” in Galaxy (but could also be a transcriptome or exome). If you need a database metadata key, you can create one for your custom genome, and that is called a “custom build”. You’ll be able to assign that to datasets, and some tools will interpret it.

FAQ: How to use Custom Reference Genomes?

For this part

The fasta you posted doesn’t seem to have a >identifier title line at the start. But maybe that was excluded from the copy/paste? Be sure to review those lines when you are confirming the file format against the guide above.

For this error

the tool is probably expecting a database assigned to the input dataset. If this was from Freebayes: that tool does require an assigned database. Your input either didn’t have a database assigned, or had the wrong database (that didn’t include a sequence with that identifier). When you create a custom build, that is what creates a fasta index for your custom genome. Then the assignment of that custom build is what links the fasta index to your dataset.

Maybe I am overexplaining or maybe I missed something… but if you have followup questions, please ask :slight_smile: We’d also be interested in knowing if you solve this!

Hi @jennaj,
Thanks for your response, it already gave me a push in the right direction. Regarding the identifier line you are right that I did not include that in the copy/paste. This is what the header looks like:

A00597:359:H53WCDSXC:3:1178:6958:29074


But I’m still encountering sequences with numerous unknown bases, represented as 'N’s. How is this possible? This is a FASTA file created from a specific region of my BWA-MEM output file. I used the tool convert BAM to make this. I also used the tool samtools fastx which gives me the following:

A00597:359:H53WCDSXC:3:1101:1018:23046/1
CTCTTTCAAAAGACCTTGGTAATTTATCTATCATGTACTTTTCGAGTTAACTTAAGTGGAAGACGCTGGTTTTTCTTAAAGTTATAAGCACTATGTGGGTAATTTTTTATTGTCAGTCTGTTTGTAATCTT
A00597:359:H53WCDSXC:3:1101:1018:23046/2
TCCAGAGTAGAACTTTTGATGAATTGTTGGATAAGATTAGAAACAGACTGACAATAAAAAATTACCCACATAGTGCTTATAACTTTAAGAAAATCCAGCGTCTTCCACTTAAGTTAACTCGAAAAGTACAT

What is the difference between these two, and why does one have so many unknown bases and the other one not?

Regarding the error with the custom build, I appreciate the thorough explanation as it helps me follow along. Despite my attempts to create a custom reference database and standardize headers, I’m still facing the same error. I tried reformatting the files to ensure consistent headers, but the issue persists. Upon examining the BAM file, I realized the error relates to the RNAME field, not the header. When I compare the RefSeq of S. cerevisiae online to my custom-built file, I notice identifiers present in the online version’s headers are missing in mine. This difference seems to be causing the error. How can I resolve this? It seems crucial to include these identifiers in the headers to avoid errors. Could you help me with this?
Thanks in advance!

Hi @Isa

That sequence you are posting back is a fastq read – specifically the first part of it.

Using fastq reads that have been transformed to fasta will not work as a target (reference sequence). And, you’ll want to use the original fastq reads as the query when mapping.

The second paste has two ends from a paired end fastq dataset but minus the original quality scores.

This FAQ explains what fastq read formats look like → Hands-on: NGS data logistics / Introduction to Galaxy Analyses

And these tutorials have help for doing QA on reads, then some simple mapping examples. You don’t need to do these right now, but maybe take a look so you better understand what I’m describing here. Start with the “Quality Control” then look at “Mapping”. Sequence analysis / Tutorial List

Your error with the “NC_001141.2” sequence is the name of a nucleotide reference sequence. To learn what it is, you can search NCBI like this (just search “all databases” instead of picking a category if you are looking something up and not sure where it will be sorted, or just want to see everything then drill down) → Search: NC_001141.2 - NLM

That is a yeast chromosome, part of a full genome assembly with several chromosomes. Those together would be a “reference genome” target, and is what would be listed out in the header of a BAM dataset. The data lines in a BAM dataset are the reads (query) mapped against that target, and will not be listed out in the header.

Review this so far, then we can follow up more. I think you are getting closer and just a few things are mixed up still. Those three tutorials should help, so start next with those and you’ll better understand the files you have now, and what you want to do next.

Hi @jennaj
I do understand the tutorials, and I’ve completed them. I can see now that the files I have now created are fastQ files. However, despite using tools in Galaxy that said to convert my BAM file to FASTA file, I ended up with FASTQ files. How can I create a correct FASTA format using Galaxy?

I understand that NC_001144.2 is a yeast chromosome, I myself am interested in NC_001144.5 of S. cerevisiae. Although I successfully isolated NC_001144.5 from my assemblies, my attempts at performing variance analyses of these chromosomes between 5 different yeast are thwarted by errors arising from missing identifiers in the resulting FASTA file. The approach I am getting at now is to make one of the yeasts into a RefSeq, which I am struggling with now due to the error I keep getting that there is no FASTA index entry for all the chromosomes present in S. cerevisiae. Thus I interpreted from this that none of these identifiers are present in the FASTA, and that is where I mainly get stuck, because I need these identifiers in my FASTA but I just don’t understand how I can get this done.

So in my opinion I have a very clear vision of what I want to do next: I want to make a FASTA file from my BAM file that contains the identifiers for the yeast chromosomes. So that in turn I can use the genome of one yeast as a RefSeq to compare the NC_001144.5 chromosome of the other four yeasts to this yeast to find variances in the sequence. But I get stuck on creating de right FASTA format because: I don’t know how to get the identifiers in the FASTA file and apparently the convert bam to FASTA tools in galaxy don’t give me the right FASTA format. so how do I do that?

If there are other suggestion on how to do this analysis: comparing NC_001144.5 of five different yeasts to find variances between the sequence I am also happy to take these suggestions.

I can see that this might sound a bit confusing, however I really am stuck and don’t know how to proceed. It also doesn’t work that I have zero experience with galaxy and bioinformatics what makes it very hard for me to understand this all, so I really hope you can help me in the final steps.

Hi @Isa

I think we are getting closer.

What is that BAM? Did you create it in Galaxy?

You would need that reference genome fasta to do the original mapping that created the BAM. Do you still have that fasta file?

Reads: fastq – this would be the query in a mapping job that creates a BAM
Reference: fasta – this would be the target in a mapping job that creates a BAM

Notes: The BAM itself does not contain the reference genome sequence – just some metrics about the reference at the top in the header (names and length), then the coordinates of the reads mapping to the genome are captured in the data lines. The sequences that you see in the data lines are for the reads that were mapped to the reference genome.

And, I am confused about this part:

Which tutorial had help for creating a fasta file from a BAM file? This would be very unusual. You can create fastq files from a BAM but those are the query reads, not the target fasta reference genome.

That is why you are having trouble with this. A BAM does not contain the sequence for a reference genome – just some metadata about the reference genome (the sequence identifiers, the length of those sequences, but not the nucleotide bases for those sequences – so a fasta cannot be generated). This is true everywhere, not just in Galaxy. Instead, you need to identify and find the fasta file that was used as the mapping target fasta when the BAM was created.

This tutorial explains more: Hands-on: Mapping / Sequence analysis

A BAM file will also usually include lines in the header that capture manipulations, including how the BAM was created (the command string for the mapping job). Do you want to share the header of your BAM? We can help you to find where the target fasta file name is listed (if it is listed). This would just identify the file name, not the file content itself, but might still help. You can copy/paste those BAM header lines back.

And you can share more if you want to.

We can sort this out!

hi @jennaj,

Thanks again for your quick response, I really appreciate it.
Yes, I have created the BAM file using galaxy. I used the BWA-MEM2 tool, aligning my NGS reads against the RefSeq of S. cerevisiae obtained from NCBI. Thus, I still have this FASTA file.
No tutorial had help for creating a FASTA file from a BAM file, that is why I tried somethings myself. I found the tools “convert BAM to FASTA multiple sequence alignment sequence” and “samtools fastx”.

I do understand that the BAM file does not contain the genome information of the reference genome that is used for the mapping. But that is also not what reference genome I want to extract from the BAM file. I don’t want to compare the genomes of my yeasts against a reference genome but among each other. So I thought that I could only use this Reference genome of S. cerevisiae for the mapping of my NGS data and then proceed with the files that I generated in this mapping for variance calling.
Because the tool that I want to use for variance calling (FreeBayes) needs a RefSeq in FASTA format (which I want to be the data of my own yeasts and not the RefSeq I got from the internet).

So to clarify, I want to compare the genome sequence of yeast A with the genome sequence of yeast B to search for variances between the two. Which I initially wanted to do by making a RefSeq format of one of the yeasts so I could use that.
I do however begin to think that this is not the right way of approaching this problem, because that might not be possible because I can’t make a FASTA format RefSeq of my yeast A from a BAM file?
Are there other ways I could approach this problem?

I did this tutorial (multiple times), and I understand all the theory in this tutorial I just have a hard time applying that on my own data because I feel like it does not tell me what I need to know, about the problem I described above.

Below you can se the copy/pasted BAM header of my BAM file:
@HD VN:1.6 SO:coordinate
@SQ SN:NC_001133.9 LN:230218
@SQ SN:NC_001134.8 LN:813184
@SQ SN:NC_001135.5 LN:316620
@SQ SN:NC_001136.10 LN:1531933
@SQ SN:NC_001137.3 LN:576874
@SQ SN:NC_001138.5 LN:270161
@SQ SN:NC_001139.9 LN:1090940
@SQ SN:NC_001140.6 LN:562643
@SQ SN:NC_001141.2 LN:439888
@SQ SN:NC_001142.9 LN:745751
@SQ SN:NC_001143.9 LN:666816
@SQ SN:NC_001144.5 LN:1078177
@SQ SN:NC_001145.3 LN:924431
@SQ SN:NC_001146.8 LN:784333
@SQ SN:NC_001147.6 LN:1091291
@SQ SN:NC_001148.4 LN:948066
@SQ SN:NC_001224.1 LN:85779

I hope I clarified a bit better this way, and that you still can help me with my problem. Thanks in advance!

Hi @Isa

Thanks for this! It helps me to understand better!

There are a few ways to do this.

Call variants for both yeast A and yeast B against the same reference yeast genome (the one you obtained from NCBI), then compare those calls to known annotation.

This would treat your novel data for yeast A and yeast B as samples. The variant call positions could be directly compared because both will be based on the same genomic backbone (same coordinate scheme).

This is the most common way to do this as far as I know. And, it sounds like you already have some of that done – you’ve mapped your yeast A and yeast B samples to the yeast reference. From here you can call the variants, annotate them, and explore the differences between A and B, and the reference.

This tutorial has a simple example with the basic steps. Maybe review and let us know if this will do what you are interested in? → Hands-on: Calling variants in diploid systems / Variant Analysis

If you have more questions, if you could clarify the read type (WGS?) and more about your overall goals, that would be helpful.

Thanks!

hi @jennaj,

I appreciate the swift response once again! I’m will dive into the tutorial, hoping it sets me on the right path. I’ll keep you posted on whether it aligns with what I’m aiming for and if I manage to figure things out. If any further questions/problems pop up, I’ll be sure to reach out. Thanks!

1 Like

Hi @jennaj,

I followed the provided tutorial and attempted to follow these steps with my own data. However, I encountered a roadblock with the GEMINI load tool, as it only supports 37 (hg19) of the human genome, while my data originates from S. cerevisiae.
The data I received was obtained by WGS using Illumina. The primary objective of the project is to identify differences among the five distinctive S. cerevisiae yeast strains to design yeast specific primers for east strain to differentiate them from each other. I’m currently focusing the rDNA of S. cerevisiae because that is expected to be the most variable region between the different strains.

Since GEMINI isn’t available for further analysis due to its lack of support for the S. cerevisiae genome. I’m again getting stuck on how to proceed. What are other tools/approaches I can do to reach the goal of my investigation? I hope you have any suggestions, because I am really unsure how to proceed after doing FreeBayes on the yeasts against the S. cerevisiae RefSeq I got from the internet.

I already performed SnpEff eff to annotate the variants, but I also don’t know what I could do with this information. To do this I got a snpeffdb file of S. cereviseae by using SnpEff download and download a file from: https://sourceforge.net/projects/snpeff/files/databases/v4_3/. But this resulted in a VCF file that I don’t now what is different from the VCF file of FreeBayes, and in the HTML file none of the graphs are shown and it gave very little results overall.

So in conclusion:
What tools/approaches can I perform on the data I have to find differences between the five yeasts, and design yeast specific primers for differentiation? Thanks in advance for your response.

Hi @Isa

Oh, I forgot that Gemini only supports that single human genome! Galaxy has many VCF manipulations tools that do what Genimi does but in different ways (no database). See the tool panel under VCF/BCF for these.

What you have now are annotated variants. Meaning, where those are with respect to the genes and other features that you created the database from. Did you have trouble creating the database? Do you want to troubleshooting the usage? If yes, you can share your data in a shared history link.

And, SNPeff should have a pre-computed database for yeast. That would mean you would need to first map and call variants against the same genome. If there are any mismatches, you’ll get incorrect scientific results.

For the rest, thanks for explaining! Unfortunately, I don’t have a tutorial for this specifically. So, if you are not sure what to do next, I would suggest finding a publication (or several) that perform this analysis as a starting place. Then if you can’t find suitable/similar tools in Galaxy, you can list out the steps in the protocol and we can make suggestions about how to translate.

Hi @jennaj

Thank you for your explanation about annotated variants. This makes it already a bit more clear for me. I did indeed have a lot of trouble with creating a database. In this history: Galaxy you can see what in the end I landed on. Is this a good database for further analysis?
I doubt if this is good. I also am unsure of why the SnpEff build that I wanted to do with the gff and gtf files from NCBI gave me an error.
Due to confidential regulations i cannot share my data, but this is what part of the output of the SnpEff looks like:



I don’t think is is supposed to, because in the tutorial there also were graphs but i dont have those, but I don’t know why or where it goes wrong.

I am also not sure what you mean with the above? In my galaxy there are only four pre computed databases (all human).

I hope you can agains help me, thanks in advance!

Hi @Isa

I’ll clarify – there are two ways to do this. The second is easier. Why? Because the first will only incorporate the annotation that you provide when creating the index. See the help on each tool form referenced for usage details – it is too much to write out here – but we can troubleshoot the usage if you get stuck if you can share the work (should be all public data at this stage).

Create a new index

Using a genbank record, or a GFF/GTF annotation and genome fasta, create a new SnpEff database as a special dataset in your history.

If you run this multiple times, be sure to give each run a distinct database naming label. Each run you are creating a SQLite database and the special dataset in the history is a reference to that. You can name these whatever you want, with the species name or anything else. But keep it simple: all oneWord with no special characters or spaces. To make this easier, I tend to number repeated runs like: myLabel1, myLabel2, … myLabelN.

You could share just this part since the data would all be from a public source. Put this into a separate history with just the inputs and outputs. Run this at one of the UseGalaxy servers since we know those are working technically, and we can focus on the data content and parameters. Once that is working, you can then try to replicate the work at whatever server you are working at (private Galaxy?).

Get a pre-built index

The other option is to see if a pre-built index for your genome is already available.

There are two tools for this: one is a way to query for indexes to learn the exact name (SnpEff databases), then the other uses that name to put the index into your history as the special type of dataset (SnpEff download).

Screenshot of the suite of tools:

snpeff-tools

Hi @jennaj,

Alright, i tried to do these steps, which you can see in the shared history below:

I first tried to do the pre-build index by first looking for the databases and them downloading them. I think i succeeded in this, but now I am unsure about what the differences are between these two indexes?

I also tried to do the approach of using a GFF/GTF file with the SnpEff build tool, however when I do that, I keep getting an error (see history), that I don’t know why I get this error and how to resolve this error.

I would like to get your insights in this, if I did it right, thanks in advance!

Hi @Isa

Both your data and the index appear to be Yeast R64 with the accession GCF_000146045.2.

See this in the metadata for all the files, including Snpeff’'s and in yours. These appear to all be based on the same genome. So, you could try out both ways, and decide which works best.

For the error, I think the SnpEff tool itself has a problem on the UseGalaxy.org server!! I follow up about that, but you don’t need to wait. Update: found my outstanding ticket about this. The fix is still pending Resolved! Snpeff build galaxy6 failures at Jetstream2 ORG · Issue #731 · galaxyproject/usegalaxy-tools · GitHub

The earlier version works fine.
If you rerun your original jobs, but switch to using 4.3+T.galaxy5, both runs will be successful. FAQ: Changing the tool version

SnpEff build: database from Genbank or GFF record (Galaxy Version 4.3+T.galaxy5)

snpeff-versions

Please also give that a try and let us know if it works!

hi @jennaj,

Thanks for the advice, when i used the other version of the tool of the SnpEff build tool the error was solved and I got a database.
I ran the SnpEff eff; annotate variants tool with this database and my own genome, and I got this file as an output.
https://usegalaxy.org/api/datasets/f9cad7b01a472135a01c7a84f901d848/display?to_ext=html
In the tutorial the output also gave all sorts of graphs, but in my output these are not there, why is that?

Kind Regards,
Isa

Hi @Isa

I see graphs in your output file.

Which graph did you want to replicate? A link to the tutorial where that is generated, then a note about which specifically would help to clarify this.

Thanks and very glad you have come so far in getting this working for you! :slight_smile:

hi @jennaj

Nevermind, I interpreted it wrong, I have the data that I want to have.
I just have two last small question I hope you can answer:

  1. Why is my error count this high? It gives me 55.000 errors, why can that be?
  2. What do all the red and green boxes say? What does this mean?
    I hope you can answer my last two questions.
    I really want to thank you for all your help and quick responses. It has really helped me further, i appreciate it. Thank you

Kind regards,
Isa

Oh good! I thought you had this too :slight_smile:

At the top of some graphs is a legend. The others are covered in the manual, with some in our tutorials if relevant to that lesson. Example

Codon changes

How to read this table:

  • Rows are reference codons and columns are changed codons. E.g. Row ‘AAA’ column ‘TAA’ indicates how many ‘AAA’ codons have been replaced by ‘TAA’ codons.
  • Red background colors indicate that more changes happened (heat-map).
  • Diagonals are indicated using grey background color
  • WARNING: This table may include different translation codon tables (e.g. mamalian DNA and mitochondrial DNA).

The rest is the scientific part. Suggestions for resources

  1. Review the tool form Help section (scroll down) since we include links to the original author documentation. This includes publication links that are the definitive source for how the tool works. I don’t want to replicate too much… but this is part of it

    To learn more about snpEff read its manual at http://snpeff.sourceforge.net/SnpEff_manual.html

    Citations
    Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., Land, S. J., Lu, X., & Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly, 6(2), 80–92. https://doi.org/10.4161/fly.19695

  2. Some discussion is also include in the GTN tutorials. You’ll also find these linked in the Help section of tool forms

    Tutorials

    There are 5 tutorials available which use this tool. These tutorials include training for the current version of the tool. View all tutorials referencing this tool.

  3. Visit scientific forums

  • Ask scientific questions where the people who do this kind of research are most likely to be
  • Your results in Galaxy are the same as would be produced anywhere else
  • Set the history to a shared state, and use the link from the “eye” icon (instead of the “link” icon) so people can see a nicer view.
  • Or screenshot – you are more likely to get feedback if people can see immediately and not think there is some technical issue with Galaxy (they will probably bounce you back to us if you mention Galaxy at the start!)
  • Galaxy is just part of your method, and what we can help with most here (and glad we sorted that part out!) – the tools, parameters and your data are what matter now. Find the full command line on the “i” icon view for any job if you need it.
  1. Extract a workflow from your history so you can rerun quickly and compare the final summaries across parameter sets. Using all defaults is unlikely to produce the optimal results.

I know this is not what you wanted but still hope it helps. Others are welcome to add more comments.