I used the Galaxy Europe server and the tool "Raven De novo assembly of Oxford Nanopore Technologies data" (Galaxy version 1.8.3+Galaxy0) on a concatenated FASTQ file of ONT long reads. We have sequenced the fungus Osmozyma mogii (formerly Candida mogii) and are assembling the genome de novo. I initially used the default settings and then varied each setting one at a time to reduce the number of contigs.
--kmer-len optimized to 20
--window-len optimized to 10
--frequency optimized to 0.002
--identity optimized to 0.1
kMaxNumOverlaps optimized to 29
These optimizations reduced the number of contigs in the assembly from 27 (before optimization) to 16 (after combining all optimizations).
Next, I began varying --min-unitig-size
(see attached Excel sheet “BatchQuast”)
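For anyone who wants to reproduce the sweep outside Galaxy, here is a minimal sketch of how the runs could be scripted. It only builds the command lines and runs nothing. The option names are copied from my list above; the "--" prefix on kMaxNumOverlaps, the `raven` executable name and the file name `reads.fastq` are assumptions, so please check them against `raven --help` for your installed version.

```python
# Tuned values from the list above. The "--" prefix on kMaxNumOverlaps
# is an assumption; verify it against your Raven version.
TUNED = {
    "--kmer-len": 20,
    "--window-len": 10,
    "--frequency": 0.002,
    "--identity": 0.1,
    "--kMaxNumOverlaps": 29,
}


def raven_command(reads_path, min_unitig_size):
    """Build the argument list for one Raven run (nothing is executed)."""
    cmd = ["raven"]
    for flag, value in TUNED.items():
        cmd += [flag, str(value)]
    cmd += ["--min-unitig-size", str(min_unitig_size), reads_path]
    return cmd


def sweep(reads_path, sizes):
    """Return one command per min-unitig-size value, keyed by that value."""
    return {size: raven_command(reads_path, size) for size in sizes}


if __name__ == "__main__":
    # "reads.fastq" is a hypothetical file name.
    for size, cmd in sweep("reads.fastq", [77500, 95000, 150000]).items():
        print(size, " ".join(cmd))
```

Keeping the tuned values in one dictionary means only min-unitig-size changes between runs, which is what I tried to do by hand in the Galaxy form.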
I analyzed the resulting assemblies using Quast with the following settings:
Assembly mode: Individual assembly
Use customized names for the input files? No, use dataset names
Contigs/scaffolds file: appropriate Raven-generated fasta file
Reads option: Nanopore reads
Type of assembly: Genome
Use a reference genome?: No
Estimated reference genome size: 15000000
Type of organism: Fungus
Other settings: default
The N50 values were similar (964832-964910) except when the min-unitig-size reached 800000
As expected, increasing min-unitig-size gradually reduced the number of contigs
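As a cross-check on the Quast table, N50 and the contig count can be recomputed directly from each Raven FASTA file. The sketch below uses the standard N50 definition. The lengths in the example are toy numbers, not my data. They show why dropping only the smallest contigs barely moves N50, which would fit the near-constant values I saw.

```python
def contig_lengths(fasta_text):
    """Return sequence lengths from FASTA text, in file order."""
    lengths = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # start a new contig
        elif line and lengths:
            lengths[-1] += len(line)   # sequence may span several lines
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half
    of the total assembly length. Returns 0 for an empty list."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


if __name__ == "__main__":
    toy = [50, 40, 30, 20, 10]         # toy lengths, not real contigs
    print(len(toy), n50(toy))
    filtered = [x for x in toy if x >= 20]
    print(len(filtered), n50(filtered))
```

In the toy case the contig count drops from 5 to 4 while N50 stays the same, because the removed contig contributes little to the total length.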
However, there were anomalies with #total reads, Mapped (%) and Avg. coverage depth
Setting min-unitig-size to 77500 or less results in Mapped (%) of 166% (does this imply multi-mapping of the same reads to different places?), #total reads of 74502 (a nice big number) and Avg. coverage depth of 60 (good coverage).
Setting min-unitig-size to between 95000 and 140000 results in #total reads equal to the number of contigs (16-18), Mapped (%) of 100% (seems perfect) and Avg. coverage depth of 1 (again, this suggests that only 16-18 reads were used to assemble the contigs, with each read being treated as a separate contig).
Setting min-unitig-size to 150000 or above once again returns #total reads of 74502 (very good), Mapped (%) below 100% (92.44% for min-unitig-size of 140000) and Avg. coverage depth of 37 (33 for min-unitig-size of 800000). Again, not a bad coverage depth.
There was a strange inversion in results between min-unitig-sizes of 80000 and 90000
I have read that we should aim for a Mapped (%) of over 95%, but this would mean choosing assemblies with a coverage depth of only 1.
The alternatives would be to choose a high coverage depth with either a Mapped (%) over 100% or a Mapped (%) a little below 95%.
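To make the three patterns easier to spot across the whole BatchQuast sheet, the checks I applied by eye could be written down as below. This is only a sketch of my own rules of thumb, not anything Quast does itself. The function name and warning texts are hypothetical, and the 95% cut-off is the guideline I mentioned above. The example rows use the numbers quoted in this post; the contig count is only passed where I stated it.

```python
def flag_row(total_reads, mapped_pct, avg_depth, n_contigs=None):
    """Return a list of warnings for one Quast row (empty if none)."""
    flags = []
    if n_contigs is not None and total_reads <= n_contigs:
        flags.append("total reads not larger than contig count")
    if mapped_pct > 100:
        flags.append("mapped % above 100")
    elif mapped_pct < 95:
        flags.append("mapped % below 95")
    if avg_depth <= 1:
        flags.append("coverage depth of 1 or less")
    return flags


if __name__ == "__main__":
    rows = {
        "min-unitig-size <= 77500": dict(
            total_reads=74502, mapped_pct=166, avg_depth=60),
        "min-unitig-size 95000-140000": dict(
            total_reads=16, mapped_pct=100, avg_depth=1, n_contigs=16),
        "min-unitig-size >= 150000": dict(
            total_reads=74502, mapped_pct=92.44, avg_depth=37),
    }
    for label, row in rows.items():
        print(label, flag_row(**row))
```

Each of the three ranges trips a different warning, which is the dilemma: none of them passes all the checks at once.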
I have repeated assemblies with particular settings to ensure that I did not set one of the parameters incorrectly by accident. There were no changes in results.
I would like advice on the best way forward. I did try increasing Racon polishing rounds with little effect but have not changed any parameters after that. I don’t think there is a parameter that blocks reads from being used more than once in an assembly but maybe this is incorrect. My gut feeling is that I should accept the lower Mapped (%) to achieve good coverage depth, a high number of total reads and an apparent absence of multi-mapping. What do you think? I would be grateful for any advice.
By the way, in the Quast settings I set the estimated genome size to 15000000 because this is around the size of the C. albicans genome, and many of the genes detected in the default-setting assembly have their closest ortholog in C. albicans. Based on the current Quast output, I would say a more accurate estimate would be 13500000, which is halfway between the C. albicans and S. cerevisiae values.
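The arithmetic behind that estimate, plus a rough sanity check on the depth figures, is sketched below. The S. cerevisiae value of 12000000 is an assumption: it is simply the number that makes 13500000 the midpoint with 15000000. The depth relation (depth ≈ total read bases / genome size) is a back-of-envelope approximation and not how Quast computes its column.

```python
C_ALBICANS_BP = 15_000_000    # value used in my Quast settings
S_CEREVISIAE_BP = 12_000_000  # assumption: implied by the 13.5 Mb midpoint


def midpoint(a, b):
    """Integer midpoint of two genome sizes in base pairs."""
    return (a + b) // 2


def implied_mean_read_length(genome_bp, depth, n_reads):
    """Mean read length needed for n_reads to give `depth` over genome_bp,
    assuming depth is roughly total read bases divided by genome size."""
    return genome_bp * depth / n_reads


if __name__ == "__main__":
    genome = midpoint(C_ALBICANS_BP, S_CEREVISIAE_BP)
    print(genome)
    # 74502 reads at depth 60: a plausible long-read length
    print(round(implied_mean_read_length(genome, 60, 74502)))
    # 16 "reads" at depth 1: each would have to be contig-sized
    print(round(implied_mean_read_length(genome, 1, 16)))
```

Under these assumptions, 16 reads at a depth of 1 would each need to be several hundred kilobases long, so the "reads" counted in those runs look more like the contigs themselves than like my ONT reads.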
