Bacteria community and Antibiotic Resistance Gene

Hello everyone,

I am working on metagenomic data from human samples (oral swabs) and I would appreciate guidance on how to properly analyze and integrate two outputs generated in Galaxy.

I have:

1. Bacterial taxonomy output (CAT-like annotation)

  • Each record corresponds to contigs annotated with bacterial taxonomy

  • Format includes:

    • sample ID (e.g. MROV24_9S, AMBA24_8S, etc.)

    • contig ID (e.g. MROV24_9S_S179_final_contigs_10063_lt_793_1)

    • taxonomic assignment (phylum, species, genus, etc.)

2. ARG (antimicrobial resistance gene) output (AMRFinder-like results)

  • Each record corresponds to contigs carrying ARGs

  • Includes:

    • sample ID (e.g. MROV24_9S, AMBA24_8S, etc.)

    • contig ID (similar format but slightly different suffix structure, e.g. MROV24_9S_S179_final_contigs_10063_lt_793)

    • ARG annotation (e.g. blaTEM-1, tet(A), sul2, etc.)

I want to: 1) analyze bacterial diversity; 2) analyze ARG diversity; and 3) determine which bacteria carry ARGs.

However, I am unsure about the best way to handle the data integration step because:

  • Contig IDs have slightly different formats between the two outputs (e.g. presence/absence of suffixes like _1 or _cir_)

  • I am not sure whether I should match:

    • full contig IDs

    • or cleaned contig identifiers

    • or sample + contig combinations

My question is: What is the recommended way in Galaxy to harmonize contig identifiers between taxonomy and ARG outputs?

Any advice or recommended workflows would be greatly appreciated.

Thank you very much!

Hi @santatra

If you are concerned about a tool not being able to match up sample and contig identifiers between files, you can adjust the identifiers in one or both files. With computational tools, shorter is usually better anyway! Put all of the mappings into a tabular file you can reference to keep track of this.

Or, sometimes you can use a collection and the collection identifiers hold the sample name, and tools use that instead.

You can also look at our workflows for this process, either to use directly or to review how the sample/data labels are handled.

More about each below!

Text Manipulation

Modify the data files, but save the original mappings into a tabular file. Example:

SampleID | ShorterID | LongerID in files from step N | LongerID in files from step Z

You could also add any results that are encoded in the file names to your tabular master sample list, like:

SampleID | ShorterID | LongerIDN (used in files from step N) | ResultN (presence/absence notation) | LongerIDZ (used in files from step Z) | ResultZ (if you want to track anything from these too)
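
If it helps to see the idea concretely, here is a minimal Python sketch of how the basic four-column list (without the Result columns) could be generated instead of typed by hand. It is only an illustration under stated assumptions: the regular expression guesses that your IDs end in `_lt_<number>` optionally followed by `_cir` and/or a numeric suffix like `_1`, and the names `shorter_id`, `build_mapping`, and `to_tsv` are made up for this example. Check the pattern against your real IDs before trusting it.

```python
import csv
import io
import re
from itertools import product

# ASSUMPTION: IDs look like <prefix>_lt_<number>, optionally followed by
# "_cir" and/or a numeric suffix such as "_1". Adjust to your real data.
ID_PATTERN = re.compile(r"^(?P<core>.+_lt_\d+)(?:_cir_?)?(?:_\d+)?$")
FIELDS = ["SampleID", "ShorterID", "LongerIDN", "LongerIDZ"]


def shorter_id(long_id):
    """Return the shared core of an ID, or the ID unchanged if it does not fit the pattern."""
    long_id = long_id.strip()
    match = ID_PATTERN.match(long_id)
    return match.group("core") if match else long_id


def build_mapping(step_n, step_z):
    """Build mapping rows from two lists of (sample_id, long_id) pairs."""
    grouped = {}
    for column, pairs in (("LongerIDN", step_n), ("LongerIDZ", step_z)):
        for sample, long_id in pairs:
            key = (sample, shorter_id(long_id))
            ids = grouped.setdefault(key, {"LongerIDN": [], "LongerIDZ": []})
            if long_id not in ids[column]:
                ids[column].append(long_id)
    rows = []
    for sample, short in sorted(grouped):
        ids = grouped[(sample, short)]
        # One row per pair of long IDs; an empty cell marks an ID seen in only one file.
        for long_n, long_z in product(ids["LongerIDN"] or [""], ids["LongerIDZ"] or [""]):
            rows.append({"SampleID": sample, "ShorterID": short,
                         "LongerIDN": long_n, "LongerIDZ": long_z})
    return rows


def to_tsv(rows):
    """Render the mapping rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    step_n = [("MROV24_9S", "MROV24_9S_S179_final_contigs_10063_lt_793_1")]
    step_z = [("MROV24_9S", "MROV24_9S_S179_final_contigs_10063_lt_793")]
    print(to_tsv(build_mapping(step_n, step_z)))
```

Keying on the sample ID together with the shortened ID keeps contigs from different samples apart even if their shortened IDs happen to collide.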

Almost any text manipulation you want is likely possible. We have tutorials here that go through some examples. Or, search the tool panel with keywords: the tool names will usually contain the common command-line utility name (if you are used to doing it that way). If you will have batches of data, putting your custom manipulations into a mini-workflow will make this quicker next time, and the workflow can be included with your reproducibility methods if this is for a publication.
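
To show where the mapping table pays off, here is a small Python sketch of the "which bacteria carry ARGs" lookup. It is a prototype of the logic only, not a Galaxy tool or workflow: the function name `taxa_carrying_args`, the input shapes, and the toy values (`TaxonA`, `c1`) are all assumptions made for illustration, with one long ID per cell in the mapping rows.

```python
from collections import defaultdict


def taxa_carrying_args(mapping_rows, taxonomy, arg_records):
    """Link ARG records to taxa through a shared ShorterID.

    mapping_rows: dicts with SampleID, ShorterID, LongerIDN, LongerIDZ
    taxonomy:     {(sample_id, LongerIDN): taxon}
    arg_records:  list of (sample_id, LongerIDZ, arg_name)
    Returns ({taxon: sorted ARG names}, list of ARG records with no taxon).
    """
    z_to_short = {}
    short_to_taxon = {}
    for row in mapping_rows:
        sample, short = row["SampleID"], row["ShorterID"]
        z_to_short[(sample, row["LongerIDZ"])] = short
        taxon = taxonomy.get((sample, row["LongerIDN"]))
        if taxon is not None:
            short_to_taxon[(sample, short)] = taxon

    carried = defaultdict(set)
    unmatched = []
    for sample, long_z, arg_name in arg_records:
        short = z_to_short.get((sample, long_z))
        taxon = short_to_taxon.get((sample, short))
        if taxon is None:
            # Keep unmatched records visible instead of silently dropping them.
            unmatched.append((sample, long_z, arg_name))
        else:
            carried[taxon].add(arg_name)
    return {taxon: sorted(names) for taxon, names in carried.items()}, unmatched


if __name__ == "__main__":
    # Toy values only; these are not real results.
    mapping = [{"SampleID": "MROV24_9S", "ShorterID": "c1",
                "LongerIDN": "c1_1", "LongerIDZ": "c1"}]
    taxonomy = {("MROV24_9S", "c1_1"): "TaxonA"}
    arg_records = [("MROV24_9S", "c1", "blaTEM-1"), ("MROV24_9S", "c9", "sul2")]
    print(taxa_carrying_args(mapping, taxonomy, arg_records))
```

Returning the unmatched records separately is a deliberate choice: a long unmatched list is usually the first sign that the identifiers were not harmonized the way you expected.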

Collections

If you are using dataset collections, the “name” given to data inside the sample files sometimes doesn’t matter. Instead, the collection identifier can hold the SampleID and that is enough. But this depends on the tool. Which tool do you plan to use next?

Workflows

Are you following a tutorial, or are you currently using (or planning to use) an :gear: IWC Workflow? Do you want to? I wasn’t sure if you were looking for a workflow for these steps or for the data manipulations. Arbitrary data manipulations are unlikely to be in a stand-alone community workflow (overly custom!). Instead, they are usually included as intermediate steps for a specific purpose, to help tools chain together, or they are not needed at all (collections are enough).

For the full pathway, this is one suggestion. It starts from reads and produces data similar to what you have generated, so it could be a useful comparison!

For training, you could also review here at the :graduation_cap: Galaxy Training Network (GTN).



I hope this helps to frame what we have! Not all tools are included in a tutorial, but they can be found in the tool panel with the simple and advanced filters. Most have help and references on the tool form (scroll down!), and you can ask here with any questions. Follow-up questions are welcome. :slight_smile: