Preliminary authentication of ancient Yersinia pestis-like signal

Hi! I’m currently building an ancient pathogen screening workflow in Galaxy Europe for archaeological genomic datasets, so I would appreciate methodological feedback.

The workflow has been preliminarily tested on ancient genomic datasets with previously confirmed Yersinia pestis presence and appears to reproduce the expected signal.

My current workflow is:

FASTQ
→ Kraken2 (Standard-16)
→ competitive mapping against concatenated Yersinia references
(Y. pestis CO92, Y. pestis Microtus 91001, Y. pseudotuberculosis IP32953, Y. similis, Y. enterocolitica)
→ MarkDuplicates
→ Samtools view (MAPQ filtering)
→ Samtools idxstats
→ Samtools flagstat
→ mapDamage
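
To make the idxstats step concrete, here is a minimal Python sketch of how per-reference counts could be summarized after competitive mapping. It assumes the standard tab-separated idxstats layout (reference name, length, mapped reads, unmapped reads). The contig names, the contig-to-genome mapping, and the read counts are made up for illustration and do not come from a real run.

```python
from collections import defaultdict


def summarize_idxstats(text, contig_to_genome):
    """Sum mapped reads per genome from samtools idxstats text.

    Returns {genome: (mapped_reads, fraction_of_all_mapped_reads)}.
    """
    totals = defaultdict(int)
    for line in text.strip().splitlines():
        if not line:
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":  # final row holds reads with no reference assigned
            continue
        # Contigs missing from the mapping are pooled under "other".
        genome = contig_to_genome.get(name, "other")
        totals[genome] += int(mapped)
    grand_total = sum(totals.values())
    return {
        genome: (count, count / grand_total if grand_total else 0.0)
        for genome, count in totals.items()
    }


# Hypothetical contig names and counts, for illustration only.
example = (
    "pestis_chr\t4600000\t900\t0\n"
    "pestis_plasmid\t9600\t100\t0\n"
    "pseudotb_chr\t4700000\t250\t0\n"
    "*\t0\t0\t5000\n"
)
mapping = {
    "pestis_chr": "Y. pestis",
    "pestis_plasmid": "Y. pestis",
    "pseudotb_chr": "Y. pseudotuberculosis",
}
for genome, (count, fraction) in summarize_idxstats(example, mapping).items():
    print(f"{genome}: {count} reads ({fraction:.1%})")
```

Grouping chromosome and plasmid contigs under one genome label keeps the comparison between species readable. The counts only reflect the filtered BAM if idxstats runs after the MAPQ filter.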

My goal is preliminary authentication of ancient Yersinia pestis candidate signal in archaeological shotgun datasets.

I can share the workflow itself if that would make feedback easier.

Thanks!

Welcome @Milos

We tend to avoid offering scientific guidance on this forum since field specialists are usually not available. We can, however, help with technical logic and workflow design, so I’ll focus on that.

The Galaxy Workflow Library is a resource for HTP production workflows and can give some ideas about how processing steps are usually combined.

Classification steps

Please see the Microbiome category in the library above for taxonomic classification example workflows.

For upstream QC steps, visiting the Galaxy Training Network (GTN) might be helpful, so I’ll link that too! FastQC assesses read quality but does not do any trimming (if your read type needs that). The usual path is FastQC → [some trimming tool] → FastQC → MultiQC. You may want to break that out into a distinct workflow, or move it into a sub-workflow.

mapDamage

In addition to the tool form help, the original tool guide has some guidance about data content expectations. You could check your BAM to make sure your data fits these (example: FixMateInformation).

Visualization/Reports

MultiQC supports all (most?) of the Samtools package, so you could put all of these together too.

Then, for customizing your report at the end, we have a tutorial that explains the basic functionality here.


I hope this helps and others are welcome to comment more!

Let’s also ping one of the developers who helped wrap this last tool for Galaxy to see if they have any suggestions! Hi @bernt-matthias would you like to comment?

I’ve also cross posted this topic over to the MicroGalaxy special interest group to see if anyone there has more to add! You can also join here!

@Milos very interesting!

@jennaj covered nearly all relevant points.

Let me just add that there’s currently an open pull request against the Galaxy Workflow Library to bring the nf-core eager pipeline to Galaxy. That pipeline has the steps you’re listing above plus a few more, so it could be a very useful reference to check your workflow against.

You can import the workflow directly from the pull request like this:

  1. Go to the pull request "Add nf-core/eager style ancient DNA (aDNA) analysis workflow" by mertydn (Pull Request #1234, galaxyproject/iwc, on GitHub)
  2. Find the diff for the actual workflow file adna-analysis.ga
  3. At the top of the diff go to … -> View file
  4. At the top of the file view, copy the link behind the Raw button
  5. Use that link in the workflow import dialog in Galaxy

Remember that the result will be a static snapshot of the pull request state so when the PR gets updates you may want to repeat the above to get the latest version.
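
If you end up repeating the import often, the link rewriting can be scripted. The sketch below is only an illustration: it assumes the file view URL has the usual `github.com/<owner>/<repo>/blob/<ref>/<path>` shape and that the Raw button resolves to the matching `raw.githubusercontent.com` address. The owner, repository, ref, and path in the example are placeholders, not the real pull request.

```python
from urllib.parse import urlparse


def github_blob_to_raw(blob_url):
    """Convert a GitHub file-view ("blob") URL into a raw-content URL."""
    parts = urlparse(blob_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL")
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<ref>/<path...>
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError("expected <owner>/<repo>/blob/<ref>/<path>")
    owner, repo, _blob, *ref_and_path = segments
    return "https://raw.githubusercontent.com/" + "/".join(
        [owner, repo, *ref_and_path]
    )


# Placeholder URL, for illustration only.
print(
    github_blob_to_raw(
        "https://github.com/example-user/example-repo/blob/abc123/workflows/example.ga"
    )
)
# prints https://raw.githubusercontent.com/example-user/example-repo/abc123/workflows/example.ga
```

A file view opened from a pull request diff usually points at a specific commit, which is why the imported workflow is a snapshot.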

Now my very limited scientific advice:

Assuming that your samples come from human remains, I would probably first map against the human reference genome and keep only the reads that didn’t map before mapping against the combined Yersinia genomes. Alternatively, you could add the human genome to the combined reference and do one-step competitive mapping against everything. Either way should offer some protection against spectacular but false-positive results. At the very least, I’d run this kind of analysis in parallel to your current one and inspect the differences carefully.
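
As a rough illustration of the "keep only reads that didn’t map" step, the sketch below picks unmapped reads out of SAM-formatted text by checking the 0x4 bit of the FLAG field, which the SAM format defines as "segment unmapped". It assumes single-end or merged reads. Paired reads would need mate-aware handling. The read names and alignments are invented. In practice a dedicated tool would do this (with samtools, `samtools view -f 4` selects on the same flag bit).

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit: segment unmapped


def unmapped_read_names(sam_text):
    """Return names of reads whose FLAG marks them as unmapped."""
    names = []
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):  # skip blanks and header lines
            continue
        fields = line.split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & UNMAPPED_FLAG:
            names.append(qname)
    return names


# Invented SAM records against a hypothetical human reference.
sam_example = "\n".join(
    [
        "@SQ\tSN:chr1\tLN:1000",
        "read1\t0\tchr1\t100\t37\t5M\t*\t0\t0\tACGTA\tIIIII",
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII",
        "read3\t16\tchr1\t200\t30\t5M\t*\t0\t0\tGGCAT\tIIIII",
    ]
)
print(unmapped_read_names(sam_example))
```

Only the reads returned here would go on to the Yersinia mapping in the two-step variant.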

Maybe this can be generalized into the key advice for this kind of research: be extremely cautious about any interesting findings you’re getting as it’s very easy to produce results that are actually purely technical artefacts. Use every kind of inspection tool available and try to understand anything that looks suspicious about any result (skewed coverage of a seemingly detected genome, imperfect read matches, etc.).
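
One simple way to put a number on "skewed coverage" is to compare the observed breadth of coverage with the breadth expected if the reads were spread randomly along the genome. The sketch below assumes a Poisson model, under which the expected breadth is 1 - exp(-mean depth). It also assumes the input has one depth value per reference position, zeros included (for example from `samtools depth -a`). The depth values are toy numbers, and no threshold is implied.

```python
import math


def coverage_summary(depths):
    """Summarize per-position depths for one reference sequence.

    depths must contain one value per reference position, zeros included.
    """
    n = len(depths)
    if n == 0:
        raise ValueError("no positions given")
    mean_depth = sum(depths) / n
    breadth = sum(1 for d in depths if d > 0) / n
    # Expected breadth if reads were placed uniformly at random (Poisson).
    expected_breadth = 1.0 - math.exp(-mean_depth)
    ratio = breadth / expected_breadth if expected_breadth > 0 else 0.0
    return {
        "mean_depth": mean_depth,
        "breadth": breadth,
        "expected_breadth": expected_breadth,
        "observed_to_expected": ratio,
    }


# Toy example: all reads piled onto 10% of the positions.
stacked = [0] * 90 + [10] * 10
print(coverage_summary(stacked))
```

A ratio far below 1 means the reads pile up in a few regions, for instance conserved or repetitive stretches shared with other bacteria, rather than covering the genome evenly. That pattern is worth inspecting before trusting a detection.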

Yes, highly recommended. I will most likely not be available for the next two weeks, but maybe others can weigh in in the meantime, and I’ll definitely check that channel later.

Thank you very much for the suggestions!

Since I can’t share the direct link on the forum, I made the workflow public (aDNA_Yersinia_screening_v5) so you can review its structure and see exactly what I’m doing.

As positive reference datasets, I used the ancient genomes RISE509 and RISE505, which are known to contain Yersinia pestis (Rasmussen et al. 2015, Early Divergent Strains of Yersinia pestis in Eurasia 5,000 Years Ago). These genomes are publicly available through the European Nucleotide Archive.

I also retested the workflow on a human ancient genome that we assumed to be negative for pathogenic bacterial DNA, and in that case the workflow behaved as expected, returning no meaningful signal for Yersinia.

So, at least at the level of a basic screening pipeline, it appears to be functioning with both positive and negative controls.

However, I’m an archaeologist, not a geneticist. While I have tried to build this workflow carefully by following published approaches and available Galaxy tools, I am concerned that I may be overlooking methodological issues or introducing biases that would be obvious to someone with deeper bioinformatics or genomics expertise.

My main question is therefore not only whether the workflow technically runs, but whether the logic of the screening approach itself is sound for ancient pathogen detection.

Any feedback would be greatly appreciated.