Classify Contigs to reference sequences

These are close, but I can clarify.

Older human assemblies were less accurate than GRCh38/hg38, but the overall total content is still about the same. The major difference is that more of the unplaced scaffolds were integrated into the primary chromosomes and a few misassembled regions were resolved. These changes came in higher volume between the very early releases and then tapered off, and hg38 is considered “done” by most.

People still use GRCh37/hg19 (the prior major human release) for analysis.

Then, GRCh38/hg38 is about the same as the human T2T-CHM13v2.0 assembly (T2T Consortium). The T2T release has the repeats exposed, and the telomeres sequenced and exposed (scroll down to Q&A). The latter is the “new” part. UCSC maps most annotation from hg38 onto the T2T assembly, instead of recreating it from scratch.

So, given that context, the question here

would not be exactly true. Reads mapping to any of these human assemblies are probably baseline “human”, or at least present in the humans sampled for these assemblies.

For more context, you could explore the exact differences. The changes between assemblies are all logged here → Genome - NCBI - NLM. UCSC also has a top-level summary here → https://genome.ucsc.edu/cgi-bin/hgGateway (go to a genome, then scroll down to review the Assembly Summary).

I don’t understand the details about your suspect species well enough to discuss why these reads are mapping to both Klebsiella and older human assemblies. As a guess, very short hits? Contamination? Artifact? All seem possible, but contamination or an artifact would be more likely in the older assemblies (and the detailed assembly updates would note whether it was found or removed, and at which coordinates).
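If you want to check the “very short hits” guess yourself, here is a minimal sketch. It assumes your alignments are in PAF format (the 12-column tabular output that minimap2 writes by default), and the function names and the 0.5 / 0.9 thresholds are hypothetical choices for illustration, not recommended cutoffs. It only separates hits that cover most of a contig at high identity from short or weak ones.

```python
def parse_paf_line(line):
    """Parse the 12 mandatory PAF columns from one tab-separated line."""
    f = line.rstrip("\n").split("\t")
    return {
        "query": f[0],
        "query_len": int(f[1]),
        "query_start": int(f[2]),
        "query_end": int(f[3]),
        "target": f[5],
        "matches": int(f[9]),   # number of matching residues
        "aln_len": int(f[10]),  # alignment block length
        "mapq": int(f[11]),
    }


def is_substantial_hit(hit, min_cover=0.5, min_identity=0.9):
    """True when the hit spans enough of the contig at high enough identity.

    min_cover and min_identity are illustrative thresholds (assumptions).
    """
    if hit["query_len"] == 0 or hit["aln_len"] == 0:
        return False
    cover = (hit["query_end"] - hit["query_start"]) / hit["query_len"]
    identity = hit["matches"] / hit["aln_len"]
    return cover >= min_cover and identity >= min_identity


# Two made-up example lines: a near full-length hit and a 40 bp hit.
long_hit = "contig_1\t1000\t0\t950\t+\tchr1\t248956422\t100\t1050\t940\t950\t60"
short_hit = "contig_2\t1000\t400\t440\t+\tchr1\t248956422\t100\t140\t40\t40\t0"

for paf_line in (long_hit, short_hit):
    hit = parse_paf_line(paf_line)
    print(hit["query"], hit["target"], is_substantial_hit(hit))
```

Contigs that only ever produce short or low-identity hits against a human assembly are the ones I would look at more closely before calling them “human”.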

In short, I wouldn’t expect contamination events to be present in the GRCh38/hg38 or T2T assemblies. Using these for screening seems like a good idea. :slight_smile: