I have R1 and R2 reads from Illumina paired-end sequencing of 18 samples, and I want to create a de novo genome assembly (a reference genome). Is this possible in Galaxy? I have read the genome assembly tutorial, but I could not find the step where the R1/R2 reads are put into collections or concatenated, or how to merge assemblies. Could somebody explain how to do this step by step and which tutorial I should follow? I am a beginner in genome assembly.
Hi @mungyon
Just to confirm: you have 18 biological samples and you want a reference genome? If so, I am not sure about this approach. Years ago a similar approach was used to assemble a fish genome, and it resulted in numerous artificial duplications because of natural variability between samples. Maybe select one sample with high-quality data, assemble its genome, and call it “the reference genome”. Depending on your task, you can then assemble all the genomes and compare the resulting assemblies, or use the reference genome for variant calling in the other samples, so you’ll get VCF files with variants (sites that differ from the reference sequence).
Kind regards,
Igor
Yes, the task is to create a “reference genome” from 18 biological samples. I will try your suggestion: select one sample to assemble a “reference genome” and then compare it to the assemblies of the other samples. This way, all the “variants” present in this population will show up. (Is there a training for comparing different genomes?)
However, if a “variant” appears in multiple genomes, it can be regarded as the reference at that site, am I right? Now I am wondering how I can swap these sites, replacing bases in the initial reference genome with the more common bases of these variants. It may be a bit complicated to do this site by site, but it would be nice to somehow assemble a reference genome that by itself represents all 18 biological samples. (The question is similar to how hg19 and hg38 were created…) Is there a tool in Galaxy for this?
It depends on your project, your aims and the species you work with. For example, if you work with a bacterial species and are interested in SNPs, snippy followed by snippy-core might be a good option. If you are after genes, maybe look at the Prokka > Roary approach. If a reference genome is already available for the species, consider using it.
Sites in a reference genome can be modified (replaced, inserted) with variants from a VCF file using bcftools consensus. Again, it depends on your goal. Both the hg19 and hg38 assemblies contain positions with minor alleles, where the majority of the screened population has the alternative allele. Work on the human genome assembly started a long time ago, and now we have high-quality short and long reads, Hi-C, extremely long reads for repeat resolution, etc.
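To make the idea of modifying reference sites concrete, here is a small Python sketch of what applying variants to a reference means. It is only a toy illustration, not how bcftools consensus is implemented: it handles single-base substitutions only (no insertions or deletions), and the function name `apply_snps` and the example sequence are made up for this post.

```python
def apply_snps(reference, snps):
    """Return a copy of `reference` with single-base substitutions applied.

    `snps` is an iterable of (position, ref_base, alt_base) tuples with
    1-based positions, the same convention VCF uses.
    Hypothetical helper for illustration; substitutions only.
    """
    seq = list(reference)
    for pos, ref_base, alt_base in snps:
        idx = pos - 1  # convert the 1-based position to a 0-based index
        if seq[idx] != ref_base:
            # The variant was called against a different base: wrong reference?
            raise ValueError(
                f"Position {pos}: expected {ref_base}, found {seq[idx]}"
            )
        seq[idx] = alt_base
    return "".join(seq)


# Toy reference and two made-up variants
reference = "ACGTACGTAC"
snps = [(2, "C", "T"), (9, "A", "G")]
consensus = apply_snps(reference, snps)
print(consensus)  # ATGTACGTGC
```

The check against `ref_base` matters in practice too: a VCF only makes sense together with the exact reference it was called against.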
I guess you can create a reference genome with the most common alleles from the 18 samples, but if you sequence another sample, you will most likely need a new reference because of changes in variant frequencies, while “reference” implies some kind of stability. Again, it depends on your goals. Describing a sample by a reference plus a list of variants has a major advantage: the reference is shared, and the list of variants is rather small compared to a genome. For example, on average, humans have 1 difference per 1 kb, so 99.9% of sites are identical to the reference genome, and it is easier to deal with 0.1% of the data (I am sorry for the non-scientific language).
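A rough Python sketch of the stability problem with a "most common allele" reference: it keeps the variants seen in more than half of the samples and shows that the selected set changes as samples are added. The sample data and the name `majority_variants` are invented for illustration, and the sketch ignores zygosity, indels and missing calls, which a real analysis would need to handle.

```python
from collections import Counter


def majority_variants(per_sample_variants):
    """Return variants present in more than half of the samples.

    `per_sample_variants` is a list with one entry per sample; each entry
    is a list of (position, ref_base, alt_base) tuples.
    Hypothetical helper for illustration only.
    """
    n_samples = len(per_sample_variants)
    counts = Counter()
    for variants in per_sample_variants:
        counts.update(set(variants))  # count each variant once per sample
    return sorted(v for v, c in counts.items() if c > n_samples / 2)


# Made-up variant calls against the same initial reference
snp_a = (2, "C", "T")
snp_b = (9, "A", "G")
samples = [[snp_a, snp_b], [snp_b], [snp_a], [snp_a]]

print(majority_variants(samples[:2]))  # [(9, 'A', 'G')]
print(majority_variants(samples[:3]))  # [(2, 'C', 'T'), (9, 'A', 'G')]
print(majority_variants(samples))      # [(2, 'C', 'T')]
```

With two, three and four samples the "majority" set is different each time, which is the reason a fixed reference plus per-sample variant lists is usually easier to work with.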