Suggestions for tools for de novo assembly from mixed amplicon samples

I’m doing a project where we are identifying mosquito bloodmeal host species. We amplify a 700 bp portion of the COI (barcode) gene for 48-96 samples, barcode and pool the PCR product, and sequence it using nanopore sequencing. I’ve been able to get one consensus per barcode using the medaka consensus pipeline. We use American Robin as the reference sequence, but I can’t collect the unmapped reads.

Does anyone know of tools that would let me do a de novo assembly of the data without a reference? I want to identify every host species in samples where the mosquito has fed on more than one kind of animal, so I would like contigs instead of a single consensus. I think I’ve tried all of the options that come up when I search “de novo” in the tool list, but none have worked. Thanks.

Hi @lkothera

Several nanopore assembly tutorials are available here → GTN Materials Search

If you want to explore what didn’t work, you can post back details about those jobs. I’m not clear if they didn’t work because of scientific or technical reasons. We can certainly help with the latter here.

Hi Jennifer,
This is a history where I’ve processed one sample that I suspect is mixed (mosquito bloodmeal from two different birds) via processing the reads in Geneious.

When I use Raven, I get no sequences for the fasta output, and one sequence under “Graphical Fragment Assembly” which blasts to one of the species I get hits for in Geneious.

When I use SPAdes, I get 1 contig that’s about the right size and hits to the same species as Raven, then there are 99 more contigs, the largest of which is about 300 bp. The first few don’t map to anything meaningful.
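One way to triage assembler output like this is to keep only the contigs close to the expected amplicon size before BLASTing them. The following is a minimal sketch, not a Galaxy tool: the function names and the 600-800 bp window are assumptions chosen for illustration around the 700 bp amplicon, and the toy records stand in for a real contigs FASTA.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=600, max_len=800):
    """Keep records whose sequence length is within [min_len, max_len].

    The default window is a hypothetical choice around a 700 bp amplicon.
    """
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


# Toy input standing in for an assembler's contigs FASTA.
toy_fasta = [
    ">contig_1", "A" * 350, "A" * 350,   # 700 bp, split over two lines
    ">contig_2", "C" * 300,              # too short
    ">contig_3", "G" * 120,              # too short
]

kept = filter_by_length(read_fasta(toy_fasta))
for header, seq in kept:
    print(header, len(seq))
```

Short contigs are not necessarily junk in a mixed sample, so it may be worth looking at what gets dropped before relying on a cutoff.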

When I did a de novo assembly in Geneious, two of the first 5 contig consensus sequences returned different species for this sample.

Are there settings I can adjust, or can you suggest what to try? I was using the medaka consensus pipeline, but couldn’t figure out how to save unmapped reads and didn’t know if mixed samples would shake out from that tool. Sorry for the lengthy post.
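On the unmapped-reads part: in the SAM format, bit 0x4 of the FLAG field marks a read as unmapped, so those reads can be pulled out of any alignment file that retains them (`samtools view -f 4` does the equivalent on BAM). The sketch below assumes the alignments are available as SAM text and that the mapper kept the unmapped records; whether the medaka pipeline output does so is not confirmed here. The function names and toy records are made up for illustration.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def unmapped_reads(sam_lines):
    """Yield (name, sequence, quality) for records with the unmapped bit set."""
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # SAM has 11 mandatory fields
            continue
        flag = int(fields[1])
        if flag & UNMAPPED:
            yield fields[0], fields[9], fields[10]


def to_fastq(records):
    """Format (name, sequence, quality) tuples as FASTQ text."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in records)


# Toy SAM content: one mapped read, one unmapped read.
toy_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\trobin_COI\t1\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGGCCAA\tIIIIIIII",
]

print(to_fastq(unmapped_reads(toy_sam)), end="")
```

The resulting FASTQ could then be assembled or BLASTed separately from the reads that mapped to the robin reference.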

What tool were you using, or is that masked from the user? That is just another platform, correct?

I used their proprietary assembler. Right, Geneious is a canned program for bioinformatics that has a bunch of good tools, but it costs $, so I’m looking for a publicly available way to do the same thing. Thanks.

Do you think it’s a problem with the relatively short length (700bp) of my amplicon?

Hi @lkothera

Do you only have the Nanopore long reads? Or short reads as well?

Do you have reference genomes for the avians?

My guess is that the Geneious “tool” is actually a workflow, and there is a sorting step either before or in between assembly rounds. In Galaxy, you would make those decisions yourself. The VGP workflows would be one place to start if you want to explore those as a potential template.

My questions were to see if a specific alternative tool might work better for the assembly, alone or in combination, and whether the sorting part seems possible or not. Re-reading your original question, it seems the reference is not available or you don’t want to use it (?).

You could also do the reverse and use the Mosquito assembly.

Canu is another assembler, but I haven’t used it much and am not sure how performant it is. And, with any of these, tuning the assembly parameters seems important, as the defaults are probably not going to work well.

This tutorial focuses on Unicycler, but that tool wraps around SPAdes in a conceptually similar way to how I suspect Geneious wraps around some other assembler (or is actually SPAdes again). Maybe it helps? Even if just for concepts like k-mers and reviewing assembly success/fail? Hands-on: Unicycler Assembly / Assembly
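For the k-mer concept, here is a toy sketch of what assemblers work with under the hood: every read is broken into overlapping substrings of length k, and reads from the same template share most of them. This is only an illustration, not how Unicycler, SPAdes, or Geneious are implemented; the function names, the short toy reads, and the choice of k are all assumptions.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across a set of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def jaccard(a, b, k):
    """Fraction of distinct k-mers shared between two sequences."""
    set_a, set_b = set(kmers(a, k)), set(kmers(b, k))
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Toy reads: the first two overlap, the third is unrelated.
reads = ["ACGTACGGA", "CGTACGGAT", "TTTTGGGGCC"]

print(kmer_counts(reads, 4).most_common(3))
print(jaccard(reads[0], reads[1], 4))
print(jaccard(reads[0], reads[2], 4))
```

In a mixed bloodmeal sample, reads from two host species would be expected to share fewer k-mers with each other than reads from the same host, which is part of why parameter choices such as k matter for short, similar amplicons.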

Sorry for the delay. Thanks for this info, I will check out those resources.

This data set is exclusively shorter reads. We have 700bp amplicons from a mosquito bloodmeal ID PCR that uses the barcode gene (COI). We have a set of 96 barcoded samples that have been multiplexed for sequencing, then computationally demultiplexed into fastq reads (I can export BAM files too).

I’d like to have a way to make de novo assemblies on these in Galaxy because the Geneious de novo workflow found a lot of mixed samples (= more than one species).

I don’t know about the feasibility of having reference genome(s). The possible hits include mostly birds, but other vertebrates too - it would be a big list. I’ll check out the VGP workflows though and circle back around if I can’t get anywhere. Thanks.