Low coverage depth, >100% read mapping and low total read number with Raven assembly

I used the Galaxy Europe server and the tool Raven De novo assembly of Oxford Nanopore Technologies data (Galaxy version 1.8.3+Galaxy0) on a concatenated FASTQ file of ONT long reads. We have sequenced the fungus Osmozyma mogii (formerly Candida mogii) and are assembling the genome de novo. I initially used the default settings and then varied each setting one at a time to reduce the number of contigs.

--kmer-len optimized to 20

--window-len optimized to 10

--frequency optimized to 0.002

--identity optimized to 0.1

kMaxNumOverlaps optimized to 29

These optimizations reduced the number of contigs in the assembly from 27 (before optimization) to 16 (after combining all optimizations).

Next I began to play with the --min-unitig-size

(see attached Excel sheet “BatchQuast”)
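For anyone wanting to repeat this sweep outside Galaxy, here is a minimal sketch of how the runs could be scripted. It only builds and prints the command lines; nothing is executed. The long-flag spelling `--kMaxNumOverlaps`, the thread count and the file names are assumptions and should be checked against `raven --help` for the installed version. The three sizes are just a few of the values discussed in this post.

```python
import shlex

# Tuned values quoted in the post. The flag spellings are assumptions to be
# verified against `raven --help`, especially --kMaxNumOverlaps.
TUNED = {
    "--kmer-len": 20,
    "--window-len": 10,
    "--frequency": 0.002,
    "--identity": 0.1,
    "--kMaxNumOverlaps": 29,
}


def raven_command(reads, min_unitig_size, threads=8):
    """Return the argv list for one Raven run (nothing is executed here)."""
    cmd = ["raven", "--threads", str(threads)]
    for flag, value in TUNED.items():
        cmd += [flag, str(value)]
    cmd += ["--min-unitig-size", str(min_unitig_size), reads]
    return cmd


# "reads.fastq" and the output names are hypothetical placeholders.
for size in (77500, 95000, 150000):
    print(shlex.join(raven_command("reads.fastq", size)), ">", f"raven_u{size}.fasta")
```

Keeping the tuned values in one dictionary means only min-unitig-size changes between runs, which makes it easier to rule out an accidental parameter change.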

I analyzed the resulting assemblies using Quast and the settings:

Assembly mode: Individual assembly

Use customized names for the input files? No, use dataset names

Contigs/scaffolds file: appropriate Raven-generated fasta file

Reads option: Nanopore reads

Type of assembly: Genome

Use a reference genome?: No

Estimated reference genome size: 15000000

Type of organism: Fungus

Other settings: default
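As a rough command-line equivalent of the Galaxy form above, the settings could be expressed as below. This is a sketch, assuming a local QUAST 5 install where `--nanopore`, `--est-ref-size` and `--fungus` are available; the file names and output directory are hypothetical.

```python
def quast_command(assembly, reads, est_ref_size=15_000_000, out_dir="quast_out"):
    """Build a QUAST argv list mirroring the Galaxy settings (not executed)."""
    return [
        "quast.py", assembly,
        "--nanopore", reads,                   # Reads option: Nanopore reads
        "--est-ref-size", str(est_ref_size),   # no reference genome, size estimate only
        "--fungus",                            # Type of organism: Fungus
        "-o", out_dir,
    ]


# Hypothetical file names for illustration only.
print(" ".join(quast_command("raven_u95000.fasta", "reads.fastq")))
```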

The N50 values were similar (964832-964910) except when the min-unitig-size reached 800000

As expected, increasing min-unitig-size gradually reduced the number of contigs

However, there were anomalies with #total reads, Mapped (%) and Avg. coverage depth

Setting min-unitig-size to 77500 or less results in Mapped (%) of 166% (does this imply multi-mapping of the same reads to different places?), #total reads of 74502 (a nice big number) and an average coverage depth of 60 (good coverage).
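One way a mapped percentage can exceed 100% is if alignment records, rather than distinct reads, are counted in the numerator, so that secondary and supplementary alignments of the same read are each counted. Whether QUAST computes its figure this way is an assumption here, not something I have confirmed. The toy data below are invented purely to show the arithmetic.

```python
# Hypothetical alignment records: (read_id, alignment_type).
records = [
    ("r1", "primary"),
    ("r1", "supplementary"),  # same read, split across two locations
    ("r2", "primary"),
    ("r2", "secondary"),      # same read, alternative location
    ("r3", "primary"),
]
n_reads = 3  # number of distinct input reads in this toy example

# Counting every alignment record can push the percentage above 100.
records_pct = 100 * len(records) / n_reads

# Counting each read once can never exceed 100.
unique_pct = 100 * len({read_id for read_id, _ in records}) / n_reads

print(f"per-record: {records_pct:.1f}%  per-read: {unique_pct:.1f}%")
```

If this is what is happening, a value above 100% would point to reads with several alignment records rather than to an error in the assembly itself.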

Setting min-unitig-size to 95000 to 140000 results in #total reads that equal the number of contigs (16-18), Mapped (%) of 100% (seems perfect) and Avg. coverage depth of 1 (again, suggests that only 16-18 reads were used to assemble the contigs, with each read being treated as a separate contig)
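A quick sanity check on these numbers, using the approximation depth ≈ (number of reads × mean aligned read length) / assembly size. This is a sketch under stated assumptions: the assembly size is my rough estimate of 13.5 Mb rather than a measured value, and every read is assumed to align in full.

```python
def mean_read_length_needed(depth, assembly_size, n_reads):
    """Mean aligned read length (bp) needed for n_reads to give the stated depth."""
    return depth * assembly_size / n_reads


ASSEMBLY_SIZE = 13_500_000  # rough estimate, not a measured value

# 74502 reads at depth 60: roughly 10.9 kb per read, plausible for ONT data.
print(round(mean_read_length_needed(60, ASSEMBLY_SIZE, 74502)))

# 16 "reads" at depth 1: each would have to be about 844 kb, i.e. contig-sized.
print(round(mean_read_length_needed(1, ASSEMBLY_SIZE, 16)))
```

So the 16-18 "reads" in these runs would each have to be as long as a contig, which fits the idea that in those runs the reads counted are not the original ONT reads.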

Setting min-unitig-size to 150000 or above once again returns #total reads of 74502 (very good), Mapped (%) below 100% (92.44% for min-unitig-size of 140000) and Avg. coverage depth of 37 (or 33 for min-unitig-size of 800000). Again, not a bad coverage depth.

There was a strange inversion in results between min-unitig-sizes of 80000 and 90000

I have read that we should aim for Mapped(%) of over 95% but this would mean choosing assemblies with a coverage depth of only 1

The alternatives would be to choose high coverage depth but either Mapped (%) over 100% or Mapped(%) of a little below 95 %.

I have repeated assemblies with particular settings to ensure that I did not set one of the parameters incorrectly by accident. There were no changes in results.

I would like advice on the best way forward. I did try increasing Racon polishing rounds with little effect but have not changed any parameters after that. I don’t think there is a parameter that blocks reads from being used more than once in an assembly but maybe this is incorrect. My gut feeling is that I should accept the lower Mapped (%) to achieve good coverage depth, a high number of total reads and an apparent absence of multi-mapping. What do you think? I would be grateful for any advice.

By the way, in the Quast settings I set the estimated genome size to 15000000 because this is around the size of the C. albicans genome, and many of the genes detected in the default-setting assembly have their closest ortholog in the C. albicans genome. Based on the current Quast output, I would say a more accurate estimate would be 13500000, which is halfway between the C. albicans and S. cerevisiae values.

Hi @Derek_Wilkinson

Great that you are able to use Galaxy to process all this work! :rocket:

For fungal assemblies, this forum isn’t the best place to reach scientists who work with similar species and can advise on parameter choices and related threshold decisions. That said, multi-mapped reads are not necessarily “bad” from what I know, but I’m also more of a transcriptomics person than a genome assembly one. :scientist:

From here, I would suggest trying a forum like Biostars.org for the scientific questions. This is one example topic and I can see more in the related questions that seem to be addressing decisions like yours.

They reference the VGP project, and parts of the Galaxy team are involved in the computational work. As you probably already know, we host several workflows and tutorials for everyone to use as a reference and to potentially adopt, so I’ll link those here for anyone reading this later on.

I hope this helps! :slight_smile: