Classify Contigs to reference sequences

What is the best tool to classify my contigs. I have at least two species not represented in the databases for which I have the reference sequences, I would like to use the reference sequences to classify the contigs, ideally showing the % match to the reference.

Then after I have the contigs that matched the reference, what is the best tool to map the reads to the contigs to get a good estimate of how many reads align to the contigs, ideally getting a Reads Per Million count.

Hi @Jon_Colman

Do you still need help with this or was it addressed in the other questions already?

Thanks! :slight_smile:

I’m still having issues. The problem is that the reference (Naegleria fowleri)is similar, but different than than human. Using T2T worked much better than hg38. So they may align full length, but I think I need to map to 98% or something like that. Essentially, I have contigs that show 99% ID of Naegleri Fowleri NF001, and say 98% ID of Human by NCBI Blast. I don’t know if it’s possible to separate these or not.

Thanks @Jon_Colman for reminding me. Let’s group all of your discussions together here.

I know those weren’t really resolved yet but I think the prior advice from Peter is good (he also works in public health). Your question is about the scientific analysis strategy, and the “standardized” bioinformatics methods I can help with here will not be enough. Your analysis will require in depth detective work – the baseline alignment generation is just the very first step! I think you will need find someone private to collaborate with.

We can leave this one open to attract potential partners, but casting a wider net is almost certainly needed. Academic or hospital labs working in infectious disease areas are probably your best options. I know this isn’t helpful but I do hope this works out.

Ok, I think I’m working through the issue.

Another question regarding human reference. For a generic question, I take my reads (assumed all high quality), and do host removal against T2T reference. This should remove human reads against the most current complete assembly.

I assemble with Metaspades and run the longest contig on NCBI Blast, assuming that all have 100% coverage.

  1. contig maps to human 99%, I check the sequence it aligns to and it’s a human reference from 2002.
  2. contig maps to human 99%, I check the sequence it aligns to and it’s a human reference from 2012.
  3. contig maps to human 99%, I check the sequence it aligns to and it’s a human reference from 2016.
  4. contig maps to Klebsiella 99% (something other than human)

My assumption that 1 & 2 are very old references that were likely included in hg38 or T2T, and alignment to these is irrelevant. 3, since the assembly was from 2016 it should be included in T2T, and alignment is irrelevant. 4 should be the valid identification on the assumption that other parameters are good???

Since T2T utilizes superior sequencing technology and bioinformatics, considered complete, my assumption that alignment to some old human sequence likely isn’t valid anymore???

These are close but I can clarify

The status of older human assemblies is that they were less accurate assemblies than CHRh38/hg38, but the overall total content is still about the same. The major difference is that more of the unplaced scaffolds were integrated into the primary chromosomes and a few misassembled regions were resolved (this was in a higher volume between the very early releases, then tapered off, and hg38 is considered “done” by most).

People still use CHRh37/hg19 (the prior major human release) for analysis.

Then, CHRh38/hg38 is about the same as HUMAN CHM13 2.0 (T2T Consortium). The 2T2 release has repeats exposed, and the telomeres sequenced and exposed (scroll down to Q&A). The latter is the “new” part. UCSC maps most annotation from hg38 onto the 2T2 assembly, instead of recreating it from scratch.

So, given that context, the question here

would not be exactly true. Reads mapping to any of these human assemblies are probably baseline “human”, or at least present in the humans sampled for these assemblies.

For more context, you could explore the exact differences. The changes between assemblies are all logged here → Genome - NCBI - NLM. UCSC also has a summary from a top level perspective here → https://genome.ucsc.edu/cgi-bin/hgGateway (go to a genome, then scroll down to review the Assembly Summary).

I don’t understand the details about your suspect species well enough to discuss why these reads are mapping to both Klebsiella and older Human assemblies. As a guess, very short hits? contamination? artifact? All seem possible but this would be more likely to be in the older assemblies (and the detailed assembly updates would note if found/removed and which coordinates).

In short, I wouldn’t expect contamination events to be present in the CHRh38/hg38 or T2T assemblies. Using these for screening seems like a good idea. :slight_smile:

I guess I should have used an exact example, and not trying to make a generic question. So this is related to the Naegleria Fowleri that appears to be in my samples. From what I have read, this species has about 30% similarity to human dna.

So if I take my trimmed reads Q20 with adapters removed 2x150, I use HISAT2 with default settings using the T2T reference. I’m assuming this “should” suffice for my host removal process. Then I map my reads to Naegleria Fowleri reference, and assemble with Metaspades.

I take some of the contigs and paste them into the Web NCBI Blast, if I just look at Naegleria Fowleri, I’m generally 100% alignment with 98+% match, yet often it will match a human reference mostly about 25 years old (so before Hg19, Hg38, T2T) that show a slightly higher score.

I have noticed that when blasting contigs, my assembled contigs are very close to the accession length showed in the Blast results, sometimes slightly longer. Whereas the human matches are just very short stretches of a much longer accession length.

I agree that it was probably contamination in the earlier assemblies!

All that sounds like you were able to remove the contamination with evidence of correctness. The short stretches are probably that 30% reported.

I’m really glad you are making progress with this! All of the basic logic seems sound to me.