My algal data has been sequenced in the haploid phase, and I have been trying to input my Trimmomatic paired read files into Snippy. These have been successfully mapped with BWA MEM2 previously, so I don't think there is a problem with them, but my genome is larger than the average microbial genome. The tutorial says the pipeline uses BWA MEM followed by Freebayes with custom criteria, but doesn't detail what these are. The analysis fails on memory for all individuals tried: here is a link to a history - Galaxy
It’s not clear what the issue might be
If I can't use Snippy, would it be possible to know the processing criteria it uses, so I can set up the mapping, post-processing, variant calling, and filtering steps and run them separately? Thanks
Hi @cgreig
Thanks for sharing the history, very helpful!
Everything here looks to be set up correctly, so the job does seem to be simply running out of resources.
You could try a different public server; UseGalaxy.eu would offer the most resources. You'll need to recreate the SnpEff database there, but everything else can be copied over.
Or, you can try using different tools. This will involve some detective work, but it might also uncover what Snippy had a problem with.
Below is a good tutorial to try since it covers SnpEff, along with some realignment steps that can help when extracting meaningful SNPs versus spurious detections due to quality and related issues. The settings are mostly defaults, with a bit of tuning to output more information into the VCF that the later tools can interpret.
- Hands-on: Calling variants in diploid systems / Variant Analysis
- All GTN tutorials including FreeBayes: bayesian genetic variant detector
Since you have a haploid genome, start with a preset that fits: option 3 or 4, e.g. "Frequency-based pooled calling with filtering and coverage".
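For orientation, here is roughly what the haploid equivalent looks like on the command line. This is only a sketch: the flags are standard FreeBayes options (the Galaxy tool exposes the same settings), but the file names and threshold values are placeholders to tune for your own data.

```python
import subprocess

# Hedged sketch of a haploid FreeBayes call; file names and thresholds are
# placeholders, not recommendations.
with open("sample.raw.vcf", "w") as vcf_out:
    subprocess.run(
        [
            "freebayes",
            "-f", "reference.fasta",              # placeholder reference
            "--ploidy", "1",                      # haploid genome
            "--min-mapping-quality", "20",
            "--min-base-quality", "20",
            "--min-alternate-count", "2",
            "--min-alternate-fraction", "0.75",   # expect near-fixed alleles in a single strain
            "sample.filtered.bam",                # BAM after the post-mapping filter below
        ],
        stdout=vcf_out,                           # FreeBayes writes VCF to stdout
        check=True,
    )
```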
For the mapping, you can follow any of the other tutorials. The important part is to filter the BAM after mapping. This reduces the size and complexity of the data and lightens the load on the variant calling tool. Remove unmapped reads, then filter for proper pairs, primary alignments, and some minimum mapQ value. This is also where you might learn what Snippy had a problem with, if you find large chunks of data that do not pass some of these criteria.
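As a concrete illustration of that filter (a sketch only, assuming samtools is available; the Galaxy BAM filtering tools expose equivalent options):

```python
import subprocess

# Post-mapping BAM filter sketch; file names and the MAPQ cutoff are placeholders.
# -f 0x2   : keep only reads mapped in a proper pair
# -F 0x904 : drop unmapped (0x4), secondary (0x100) and supplementary (0x800) alignments
# -q 20    : minimum mapping quality
subprocess.run(
    [
        "samtools", "view", "-b",
        "-f", "0x2",
        "-F", "0x904",
        "-q", "20",
        "-o", "sample.filtered.bam",
        "sample.markdup.bam",
    ],
    check=True,
)
# Index the filtered BAM for downstream tools
subprocess.run(["samtools", "index", "sample.filtered.bam"], check=True)
```

Comparing read counts before and after each criterion (e.g. with samtools flagstat) is the quickest way to spot where large chunks of data are being lost.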
Other than that, the Freebayes docs and scientific discussions online are additional resources. This forum is mostly focused on usage in Galaxy, but Seqanswers and Biostars should have discussions about haploid variant-calling strategies.
Hope this helps!
Thanks Jennifer.
-
I'm not able to download the data at the moment - can I transfer it directly to galaxy.eu? I have a lot of individual files to analyse… they are set up on the .org server, but that took a very long time…
-
I'd like to try the analysis in separate steps, but the tutorial is for mitochondrial genes analysed as part of a family group, and the FreeBayes settings for this are not the right ones for my algal genome. I have individual strains, not related populations, and the indels would also be quite important. It would be good to have some advice about how to set this up to minimise noise in my samples (which is highest in the indels), and to know how best to filter after variant calling (as well as how exactly to filter the BAM files).
I had included a left-align step and a minimum quality and DP filter after variant calling in previous analyses (roughly the steps sketched below). I understand that variants in very high DP regions are likely spurious too, but I'm not sure how to set "very high". I have used SnpEff before with no problems,
so having a workable series of steps to go from the mapped, duplicate-marked BAM of a haploid genome to the called and filtered VCF to use with SnpEff would be very useful. I haven't been able to find any clear instructions. If there are file-specific parameters I can work them out with a little direction; there may be other people working on algal/fungal/bacterial genomes who would find the suggested settings and filtering criteria useful and informative.
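For reference, my previous post-calling steps looked roughly like this (a sketch with placeholder file names and thresholds, using bcftools equivalents of the Galaxy tools; the upper DP cutoff is the part I am unsure about):

```python
import subprocess

# 1) Left-align and normalise indels against the reference
#    (placeholder file names; thresholds below are illustrative, not exact values)
subprocess.run(
    ["bcftools", "norm", "-f", "reference.fasta",
     "-o", "sample.norm.vcf", "sample.raw.vcf"],
    check=True,
)

# 2) Drop low-quality / low-depth calls. A high-DP cutoff could be added to the
#    expression, but choosing that threshold is the open question.
subprocess.run(
    ["bcftools", "filter",
     "-e", "QUAL<30 || INFO/DP<10",   # placeholder thresholds
     "-o", "sample.qfiltered.vcf", "sample.norm.vcf"],
    check=True,
)
```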
Thank you for your help- it is appreciated
Carolyn
One more thought: I have been able to use freebayes on BAM files mapped from these trimmed datasets with BWA MEM2 (using the default freebayes diploid settings, which need to be changed), without any filtering - so why is the pipeline not working with these files? They are not too large for freebayes…
Also, does the pipeline contain post-mapping filters, post-variant-calling filters, or both?
I hope you can help me understand
Dear Jennifer
I have been able to produce a filtered BAM file, but I am not sure whether this is as you suggested, or how to proceed further… https://usegalaxy.org/u/cgreig/h/r16-snippy
Hi @cgreig
Thanks for sharing the history.
I’ll try to answer all of these
The filter looks good! You could also add in a filter for proper pairs.
Snippy and BWA/Freebayes are different tools with different memory requirements. The EU server can allocate more runtime memory which is why I suggested trying there.
Yes, I only suggested that tutorial as a path through the tool usually used for variant calling. For parameter settings (and later tuning), you'll need to consult a publication or one of the scientific forums I suggested. Scientists who work within your domain will be the best resource for these details.
Yes, you can copy the files individually by URL or an entire history by URL. If you want to, copy the important files into a new history, and transfer that. Maybe just those needed for this single tool to start with?
See the FAQ I linked here:
This is the direct link: FAQ: Transfer entire histories from one Galaxy server to another
From here I would suggest trying the EU server, because if you can get Snippy to run on your data, that seems like it will resolve your immediate uncertainty about Freebayes.
Big picture, this forum is not the right place to get scientific guidance for working out novel parameters, and the GTN tutorials are only a small slice of what is possible. But those tutorials are mostly pulled from publications and can show how the translation between published work and Galaxy can be done. If you later find a publication that does something similar to what you want to do, and get stuck finding analogous tools, that is something we can help with here. And resolving errors, of course. But not the data interpretation steps.
Dear J
Thank you for your help, it is always useful to get your advice. It's clear it's important to get the settings right to use freebayes for a haploid genome; however, the exact information is not clearly given in many papers, and many of these look at populations/communities rather than the individual mutant strains I have, where different criteria would be needed. I am experimenting with some reported settings and hope to find the best way to use freebayes for my samples. I guess a simple haploid preset for freebayes would be a useful option!
I have been checking the resulting ts/tv ratio as a quality metric - what would you recommend for comparing the quality of VCF files?
I think the issues with using the Snippy pipeline on my samples may be because, although haploid, these genomes are much larger than a bacterial genome:
- TB (haploid): 4.4 Mb
- CR (haploid): ~120 Mb (17 chromosomes, 60 scaffolds total)
I've had a couple of further Galaxy issues; one I hope you might be able to help me with. I have been using BCFtools stats and was hoping to compile the results with MultiQC, but whatever I try, I am getting averaged results for my files rather than a comparison. Is it possible to do this?
I have set up an EU account and been able to transfer one of my samples and run Snippy on it, but I had trouble with exporting a history (even a very small one) for several days. I also had trouble with some tools that failed sporadically (principally VCF-VCF intersect). I'm not sure there is still a problem, but I thought I'd better report it. I have just managed to export a history, so that is good news!
Thanks again
Carolyn
Hi @cgreig
Let's go through these, and if you have solved some already, please let us know!
These tools seem like good choices:
- VCFcommonSamples: Output records belonging to samples common between two datasets
- You could also run one of the VCF report tools, maybe on just a sliced region, and inspect that in a genome browser like IGV. Known regions, such as a particular gene or class of genes, could be interesting.
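On the ts/tv metric specifically: since you are already running BCFtools stats, one quick way to compare files is to pull the ratio out of its TSTV section. A minimal sketch (assuming bcftools is available locally, with hypothetical file names; in Galaxy the same numbers are in the tool's text output):

```python
import subprocess

def tstv(vcf_path: str) -> float:
    """Return the ts/tv ratio reported by bcftools stats for one VCF."""
    stats = subprocess.run(
        ["bcftools", "stats", vcf_path],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in stats.splitlines():
        if line.startswith("TSTV"):
            # TSTV rows: TSTV, id, ts, tv, ts/tv, ts (1st ALT), tv (1st ALT), ts/tv (1st ALT)
            return float(line.split("\t")[4])
    raise ValueError(f"No TSTV line in bcftools stats output for {vcf_path}")

# Hypothetical file names for illustration
for vcf in ["strainA.filtered.vcf.gz", "strainB.filtered.vcf.gz"]:
    print(vcf, tstv(vcf))
```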
The "sample merging/ignoring" issue happens with MultiQC anywhere, not just in Galaxy, and comes up occasionally here at this forum. A recent example from last week involving a custom report: MultiQC not displaying all elements in a table - #2 by jennaj, and another that used a different kind of standard report: multQC issue and guidance?
MultiQC expects distinct sample identifiers per report. Changing the file name before running the upstream statistics tool is usually enough. If the data is in a collection, that would mean adjusting the element identifiers. I can share instructions for this, so let me know if you need help with it. Seeing the data in a history might help.
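Outside Galaxy, the same idea looks roughly like this (a sketch with hypothetical file names, assuming bcftools and MultiQC are installed): give each VCF a distinct name before running the stats tool, since MultiQC usually takes the sample label from the upstream report.

```python
import subprocess
from pathlib import Path

# Hypothetical sample-to-VCF mapping with distinct input names
samples = {
    "strainA": "strainA.filtered.vcf.gz",
    "strainB": "strainB.filtered.vcf.gz",
}

# One stats report per distinctly named input file
for name, vcf in samples.items():
    stats = subprocess.run(
        ["bcftools", "stats", vcf],
        capture_output=True, text=True, check=True,
    ).stdout
    Path(f"{name}.bcftools_stats.txt").write_text(stats)

# MultiQC should then report one entry per sample instead of a merged summary
subprocess.run(["multiqc", "."], check=True)
```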
There were some issues at the ORG server last week that are now resolved. I would try all of that again as a first-pass solution, and we can troubleshoot more if anything persists. Server issue UseGalaxy.org March 17 2025: RESOLVED
Yes, this is unfortunately the “state of reproducibility” for computational biology experiments. Wet lab work will have exact details, but the bioinformatics, not so much!
The best resources are the tool manual itself and direct experimentation, plus publications (where the latter are only as good as the previously published work, as you noticed). You could also search the scientific forums for advice. This forum is mostly about getting things to work in Galaxy, not scientific advice, so the other experts working in your field will not see the discussions here.
Hope this helps anyway!