When does in-file metadata matter, and what tools on Galaxy can help me add it?

fcy · March 10, 2019, 2:24pm

Hello. I’m completely new to bioinformatics and I need to analyze some C. elegans paired-end whole-genome sequencing data from an Illumina NovaSeq 6000. I’ve been reading and comparing multiple manuals for beginners, and I’ve come across MiModD’s manual “Mapping by sequencing: identification of a phenotype-causing mutation in a nematode genome”. It uses MiModD tools and stresses that some tools (including the MiModD alignment tools) might fail to work without proper in-file metadata.

Since I’ve read that the best tools for nematode WGS analysis are Bowtie2 or BWA-MEM for alignment (BBMap is reportedly even better, but it is not on the servers) and FreeBayes for variant calling, I’m not particularly worried about compatibility with the MiModD alignment tools. However, I do not think my files come with in-file metadata, and I do not know whether that will affect Bowtie2/BWA-MEM and FreeBayes. Can someone please shed some light on this for me? Some advice for this newbie would also be appreciated.

wm75 · March 10, 2019, 5:50pm

Hi and welcome, fcy!

You have done a good bit of reading, obviously, before asking here, but it looks like it is necessary to put a few things into perspective still.
It is true that Bowtie2 and BWA-MEM are excellent aligners, just as FreeBayes is an excellent variant caller - not just for C. elegans data, but in general. MiModD, on the other hand, is a complete suite of tools for identifying phenotype-causing mutations from genetic screens in model organisms. As part of its functionality, MiModD comes with samtools/bcftools, which it uses, among other things, for variant calling. For the alignment step, MiModD allows you to use any modern aligner, including Bowtie2 or BWA-MEM. For use on local machines (desktops and notebooks), MiModD offers the SNAP aligner as a built-in choice because this is the most performant solution on many typical single machines. On usegalaxy.org, the recommended aligner would, in fact, be Bowtie2 (see https://sourceforge.net/p/mimodd/wiki/MiModD%20on%20public%20Galaxy%20servers/).
Now what MiModD offers on top of variant calling is causative variant mapping, which is usually a key component of your analysis workflow if you have a mutant line of, say, worms obtained from a random mutagenesis screen. If you really want to go for the combination Bowtie2/FreeBayes to produce a list of variants found in the mutant line, the resulting output is unlikely to be directly compatible with downstream MiModD tools (you could certainly make it compatible with a bit of command-line effort, but then you said you were new to bioinformatics).
OTOH, if you go with the combination Bowtie2/bcftools (accessible via the MiModD tool suite), there is no problem at all. It is true that several papers demonstrate that FreeBayes is a bit better than bcftools mpileup/call in certain situations, but be assured that the relatively small difference between these tools will almost certainly not affect you (unless I’m mistaken about your use case, and you are going to do some professional genome curation effort).

Finally, to answer your concrete question: the in-file metadata you are referring to is just an ID for your sequencing data and a sample name that the MiModD variant caller expects for each of your samples. You have to make sure the aligner adds this information to its BAM format output, or you would have to add it yourself later. The Galaxy wrapper for Bowtie2 has text boxes for entering these two pieces of info (and more) and that’s all.
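To make the point about the ID and sample name more concrete: in the SAM/BAM format this kind of information lives in `@RG` (read group) header lines, with the `ID` and `SM` tags holding the read-group ID and the sample name. The following is only a minimal sketch for checking a header that you have already exported as text (on a command line, `samtools view -H` prints it). The function names and the example header are made up for illustration and are not part of MiModD or Galaxy.

```python
def read_groups(sam_header_text):
    """Return one dict per @RG header line, mapping tag -> value."""
    groups = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Fields after the record type look like "ID:run1" or "SM:mutant".
        fields = line.split("\t")[1:]
        groups.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return groups


def missing_metadata(sam_header_text):
    """Return a list of problems; an empty list means every read group
    carries both an ID and a sample name (SM)."""
    groups = read_groups(sam_header_text)
    if not groups:
        return ["no @RG line found"]
    problems = []
    for number, group in enumerate(groups, 1):
        for tag in ("ID", "SM"):
            if not group.get(tag):
                problems.append(f"read group {number} lacks {tag}")
    return problems


# Hypothetical header, invented for this example.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chrI\tLN:1000\n"
    "@RG\tID:run1\tSM:mutant\n"
)
print(missing_metadata(example_header))  # prints []
```

If the list comes back non-empty, the usual fix is to rerun the aligner with the read group fields filled in, or to add the read group to the existing BAM afterwards.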

Summary:

Go for Bowtie2/FreeBayes if you think you can interpret and handle the resulting list of variants yourself.
Go for Bowtie2/MiModD for a complete solution. In that case, the best starting point for use on usegalaxy.org (or .eu) would be https://galaxyproject.github.io/training-material/topics/variant-analysis/tutorials/mapping-by-sequencing/tutorial.html, which also has a good explanation of why you need to map your causative variant.

fcy · March 14, 2019, 8:28am

Thank you wm75!

I’m looking for a homozygous phenotype-causing mutation in my genomic DNA, induced by EMS and backcrossed to filter out background variation, so it’s a fairly simple and straightforward setup.

You’re probably right about the small gap between the MiModD workflow and Bowtie2/FreeBayes, but I’d like to maximize my chances of finding my mutation. My current workflow is: FastQC - Trimmomatic - Bowtie2 - FreeBayes. I’ve run into two issues:

I read about post-processing FreeBayes calls with the VCFAllelicPrimitives tool in Galaxy Main. What does this tool do, and is it necessary? This tool is not available in Galaxy Main at the moment; what other tools can I use in its place?
I have data from three strains for variant subtraction. Which tools are available in Galaxy Main to do this, and how should I compare these tools?

Thank you.

jennaj · March 14, 2019, 3:34pm

The VCFAllelicPrimitives tool is available at Galaxy Main https://usegalaxy.org. Search the tool panel with the full tool name to find it. The tool form describes what it does.

wm75 · March 14, 2019, 8:18pm

You don’t say whether you used FreeBayes to call the variants from all your samples together (the recommended approach) and now have one VCF file, or whether you ran things separately and now have three files.

fcy · March 15, 2019, 1:11am

I ran all of them together.

fcy · March 15, 2019, 1:12am

Thank you Jen, I’ll look it up!

wm75 · March 15, 2019, 11:40am

I see. You should be able to use MiModD VCF Filter for simple variant subtraction then. You won’t be able to make use of all possible settings (because some VCF keys expected by MiModD will not exist in the freebayes-generated VCF), but filtering for genotypes (combined with depth of coverage) should work fine I’d guess.
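For readers who want to see what such a genotype-plus-coverage subtraction amounts to, here is a rough sketch in plain Python. It is not how MiModD VCF Filter is implemented; it only illustrates the idea on a multi-sample VCF given as text. It assumes the FORMAT column contains `GT` and `DP`, treats only `1/1` as homozygous mutant (so multi-allelic sites are ignored), and uses invented sample names and an arbitrary depth threshold.

```python
def parse_gt_dp(format_field, sample_field):
    """Extract genotype and read depth from one sample column."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    genotype = values.get("GT", "./.").replace("|", "/")
    depth_raw = values.get("DP", ".")
    depth = int(depth_raw) if depth_raw.isdigit() else 0
    return genotype, depth


def subtract_variants(vcf_text, mutant, min_depth=10):
    """Keep sites that are homozygous variant (1/1) in `mutant` and
    homozygous reference (0/0) in every other sample, all at >= min_depth."""
    kept = []
    samples = None
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample names follow the FORMAT column
            continue
        if samples is None:
            raise ValueError("no #CHROM header line before the first record")
        calls = {name: parse_gt_dp(cols[8], field)
                 for name, field in zip(samples, cols[9:])}
        genotype, depth = calls[mutant]
        if genotype != "1/1" or depth < min_depth:
            continue
        if all(gt == "0/0" and dp >= min_depth
               for name, (gt, dp) in calls.items() if name != mutant):
            kept.append((cols[0], int(cols[1]), cols[3], cols[4]))
    return kept


# Made-up records; the sample names are hypothetical.
rows = [
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
     "mutant", "parent", "sibling"],
    ["chrI", "100", ".", "G", "A", "50", ".", ".", "GT:DP", "1/1:20", "0/0:18", "0/0:25"],
    ["chrI", "200", ".", "C", "T", "50", ".", ".", "GT:DP", "1/1:22", "1/1:19", "0/0:30"],
    ["chrI", "300", ".", "G", "A", "50", ".", ".", "GT:DP", "0/1:15", "0/0:20", "0/0:21"],
    ["chrI", "400", ".", "C", "T", "50", ".", ".", "GT:DP", "1/1:4", "0/0:20", "0/0:20"],
]
example_vcf = "##fileformat=VCFv4.2\n" + "\n".join("\t".join(r) for r in rows)
print(subtract_variants(example_vcf, "mutant"))  # prints [('chrI', 100, 'G', 'A')]
```

In this toy input, only the first site survives: the second is shared with the parent (background), the third is heterozygous in the mutant, and the fourth has too little coverage in the mutant to be trusted.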

You do not need to care about variant normalization with VCFAllelicPrimitives.