Hello everyone,
I am working on metagenomic data from human samples (oral swabs) and I would appreciate guidance on how to properly analyze and integrate two outputs generated in Galaxy.
I have:
1. Bacterial taxonomy output (CAT-like annotation)
- Each record corresponds to contigs annotated with bacterial taxonomy
- Format includes:
  - sample ID (e.g. MROV24_9S, AMBA24_8S, etc.)
  - contig ID (e.g. MROV24_9S_S179_final_contigs_10063_lt_793_1)
  - taxonomic assignment (phylum, genus, species, etc.)
2. ARG (antimicrobial resistance gene) output (AMRFinder-like results)
- Each record corresponds to contigs carrying ARGs
- Includes:
  - sample ID (e.g. MROV24_9S, AMBA24_8S, etc.)
  - contig ID (similar format but slightly different suffix structure, e.g. MROV24_9S_S179_final_contigs_10063_lt_793)
  - ARG annotation (e.g. blaTEM-1, tet(A), sul2, etc.)
I want to perform: 1/ bacterial diversity analysis; 2/ ARG diversity analysis; and 3/ an analysis of which bacteria carry ARGs.
However, I am unsure about the best way to handle the data integration step because:
- Contig IDs have slightly different formats between the two outputs (e.g. presence/absence of suffixes like _1 or _cir_)
- I am not sure whether I should match:
  - full contig IDs
  - or cleaned contig identifiers
  - or sample + contig combinations
My question is: What is the recommended way in Galaxy to harmonize contig identifiers between taxonomy and ARG outputs?
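To make the "cleaned contig identifiers" and "sample + contig" options concrete, here is a minimal Python sketch (outside Galaxy) of what I have in mind. It assumes that every contig ID shares a core of the form `<prefix>_contigs_<number>_lt_<number>` and that any tool-specific suffix (such as `_1`) comes after that core; I have not verified this for all of my IDs. The function names, the regular expression, and the placeholder label `taxon_A` are only illustrative and are not output from either tool.

```python
import re

# Assumption: the shared core of every contig ID ends with "_contigs_<n>_lt_<n>";
# anything after that (e.g. "_1") is treated as a tool-specific suffix.
CORE_ID = re.compile(r"^(.+_contigs_\d+_lt_\d+)")


def clean_contig_id(contig_id):
    """Return the shared core of a contig ID, or the stripped ID if the pattern does not match."""
    contig_id = contig_id.strip()
    match = CORE_ID.match(contig_id)
    return match.group(1) if match else contig_id


def link_args_to_taxa(taxonomy_rows, arg_rows):
    """Join ARG rows to taxonomy rows on (sample ID, cleaned contig ID)."""
    taxa_by_key = {
        (sample, clean_contig_id(contig)): taxon
        for sample, contig, taxon in taxonomy_rows
    }
    linked = []
    for sample, contig, gene in arg_rows:
        key = (sample, clean_contig_id(contig))
        # ARG contigs without a taxonomy record are kept and flagged, not dropped.
        linked.append((sample, key[1], gene, taxa_by_key.get(key, "unclassified")))
    return linked


# Hypothetical rows: the IDs and gene name come from my files, "taxon_A" is a placeholder.
taxonomy_rows = [
    ("MROV24_9S", "MROV24_9S_S179_final_contigs_10063_lt_793_1", "taxon_A"),
]
arg_rows = [
    ("MROV24_9S", "MROV24_9S_S179_final_contigs_10063_lt_793", "blaTEM-1"),
]

for row in link_args_to_taxa(taxonomy_rows, arg_rows):
    print(row)
```

Keying on the sample ID together with the cleaned contig ID is meant to avoid accidental matches if two samples ever produce the same cleaned contig name, but I do not know whether this is the recommended approach.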
Any advice or recommended workflows would be greatly appreciated.
Thank you very much!