Metaspades Single Reads

When using interlaced reads for Metaspades, can I add Single unpaired reads to the end of the interlaced reads???

Hi @Jon_Colman

You can include multiple samples?

But I’m guessing that you were able to interlace some of the reads, and these are the singletons that didn’t overlap, all from the same library.

Is there a compelling reason to pre-assemble? Were you getting a memory error with the full dataset when running as paired-end?

Some ideas

  1. run as paired-end and consider subsampling – you might be able to notice what is scaling up the assembly complexity
  2. subsample the whole dataset into batches, run, and compare those to learn if you are really missing anything important with any particular slice
  3. run more QA to confirm you don’t have contamination/artifact, or read ends that have low quality. both can significantly impact per sequence pairing/overlap and the primary assembly
  4. maybe rethink how you are interlacing reads (Fastq interlacer is just one choice, PEAR is another I can think of)
  5. consider strategies to slice the data horizontally (to get rid of unnecessary deep coverage (if any), and to preserve the vertical novel content along the assembly fragments)

For creative ways to use Metaspades, I think you’ll need to get advice from out in the wild about what others are doing. Technically, you can try with mixed sequence types to see what happens. If every sequence has just one “read”, that is “single-end” for the format but the tool might detect the data content isn’t fitting what it was expecting.

I just added a tag that might attract others working on metagenomics. They are certainly welcome to add in more! :scientist: I think you already have an assembly by now but your question is interesting so it seemed worth posting some comments, even if late.

Here is the issue I was running into. I’m sure for most instances, after standard qc and trimming, then host removal, then de novo assembly. The problem is that Naegleria Fowleri is reported to have about 30% similarity to humans. I think normally, a small amount of human dna present does not cause an issue, but I think due to similarity issues metaspades was trying to assemble human and Naegleria Fowleri together.

So I’m currently just trying to separate Naegleria Fowleri from the rest of the reads, starting with reads that map to Naegleria Fowleri at 100% match, followed by host removal at 100% match (T2T & hg38), then to 99%, then to 98%. This should give me good Naegleria Fowleri reads with very little human contamination.?

This is good insight, and I agree. This can happen between all sort of species. Mouse reads in your human data will do the same!

You can try with a 100% but pay attention to the coverage length too, and maybe how many hits, not just the percent identity, if you are doing this with BLAST+ (blastn?). 100% identify over 8 bases isn’t very specific, especially if that occurs in 10+ HSPs scattered across a genome in different locations. Blast is great with longer reads and requires some care with shorter reads.

Then, NGS aligners like Bowtie/HISAT2 will sort consider both identify and coverage together, while only paying attention to queries with unique matches against the target, and ignoring any multi-mapping to entirely different locations. These parse the alignment metrics into the BAM file and have complex filtering options. Designed to handle shorter reads, some can now can handle longer reads but the complex metrics/filtering were retained.

In short: The “edit distance” in a BAM file (count of mismatched/inserted/deleted bases in the query versus the target) is closest in concept to what blast will report as “percent identity”. An edit distance of 2 over a 100 bases of alignment ~= 98% identify over 100 bases of coverage.

I think what I’m doing makes sense, let me know what you think.

  1. I can confirm the species is in my sample, so just mapping accurately to reference shouldn’t cause much interference with assembly, but will allow more complete assembly?
  2. As for using blastn, I only use it for two purposes a. Spot check reads to make sure I’m only getting my reference species at near 100%, and b.) To classify the contigs.
  3. Mapping - Since this species is “similar” to human, the more errors allowed, the more Naegleria Fowleri reads will be classified as human. So I set Bowtie2 up for perfect matching with no errors, so even one error in a read will prevent it from matching.
  4. My reference Naegleria Fowleri Karachi NF001 - I have broken it up with CONCOCT into 125 base sequences with 10 base overlap, which gives me 215,891 sequences. Mapping with Bowtie2 Perfect gives me the following matches to human T2T- 1,272 sequences which assembled into 11 contigs 300-550. Hg38 - 822 sequences which assembled into 4 contigs 376-460. Hg19 - 383 sequences with only 1 contig over 300.
  5. So the way I look at it, as long as I map only reads that map perfectly to the reference without errors, that the inclusion of a very small number reads that also map to human should be insignificant??
  6. So maybe the Nagleria Fowleri reads map to something else??? All of the spot checking I have done, the only species similar is always human, with a few reads matching Tuberculosis and Malaria.

Yes. If the reads are mapping the same way to two (or more) different references, then the reads themselves are indistinguishable from reads that really are from one species or the other(s). In a sense, they belong to all! Why? The reads have bases and a quality score assessing how accurate the base call is. Then reference assembly(ies) have bases. Using tools to compare the reads to the reference(s) can interpret base homology and “unique mapper” characteristics but not much more. So if a read maps “to both” then it belongs to both based on this technology.

I guess there could be tiny amounts of contamination. But in theory, other genomic elements in common could also lead to this. The first case is known and might still be present in the current assemblies (the others, it would be unexpected in hg38/t2t human) and I don’t know enough about this set of species to comment more about the later situation eg status/known cases.