How to extract flanking genomic sequences for a transgene?

fshapt · September 23, 2025, 10:46pm

Hi, I have a task to identify the insertion sites of my transgene in the genome of cultured CHO cells. I have performed whole-genome sequencing on my cells. I wonder whether there is any tool in Galaxy that would be useful for me? I hope to identify the reads that map to my transgene as a reference and then assemble those reads into contigs, which will contain the transgene and its flanking sequences.

I appreciate any input.

thanks.

jennaj · September 24, 2025, 9:05pm

Hello @fshapt

Yes, all of this seems possible to do in Galaxy!

Are you following a published protocol? We don’t have a tutorial for exactly this, so it helps to have an outline of the basic steps and an idea of the final outputs you want. Galaxy will have the exact or analogous tools, and we can help you to identify them.

For practical first steps, you’ll want to load up your reads, reference genome, transgene sequence, baseline reference annotation, maybe create a custom annotation record for your transgene, and then optionally prepare a SnpEff reference.

FAQ: How to use Custom Reference Genomes? (covers both the genome fasta + annotation)
Quality Control Start Here! multQC issue and guidance? (see the workflow!)
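
As a rough illustration of the reference preparation described above, one option (an assumption here, not something prescribed in this thread) is to append the transgene to the genome FASTA as an extra record and describe it with a one-line GFF3 annotation. The Python sketch below works on in-memory strings. The function names, the contig name `transgene`, the `custom` source, and the `region` feature type are all hypothetical choices for illustration, and a real multi-gigabase genome would need to be streamed from disk rather than held in a string.

```python
def append_transgene(genome_fasta: str, transgene_seq: str,
                     name: str = "transgene", width: int = 60) -> str:
    """Return FASTA text with the transgene added as one extra record.

    `name` is a hypothetical contig name; it must not clash with an
    existing record in the genome.
    """
    existing = set()
    for line in genome_fasta.splitlines():
        if line.startswith(">") and line[1:].split():
            existing.add(line[1:].split()[0])
    if name in existing:
        raise ValueError(f"record name already in genome: {name}")

    # Drop whitespace/newlines from the pasted sequence and normalise case.
    seq = "".join(transgene_seq.split()).upper()
    wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return genome_fasta.rstrip("\n") + f"\n>{name}\n{wrapped}\n"


def transgene_gff3(name: str, length: int,
                   feature_id: str = "transgene_1") -> str:
    """Return a minimal GFF3 document with one feature spanning the transgene.

    The source ("custom") and type ("region") are illustrative assumptions.
    """
    columns = [name, "custom", "region", "1", str(length),
               ".", "+", ".", f"ID={feature_id};Name={name}"]
    return "##gff-version 3\n" + "\t".join(columns) + "\n"
```

Both outputs could then be uploaded to Galaxy as a custom reference genome and its annotation. Keeping the transgene as its own record means any read aligned to it is easy to pick out by contig name later.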

UCSC hosts a version of your reference genome, or you can get the data from NCBI, or you can use what you may already have. Try to prepare all of your baseline reference data at the very start!

UCSC Genome Browser Downloads (chinese_hamster)
Index of /goldenPath/criGriChoV2/bigZips
FAQ: NCBI reference data

If you are completely new to Galaxy, I would strongly suggest taking an hour or so to go through at least one tutorial! Galaxy hosts the common bioinformatics tools many use for projects like yours, plus utilities for intermediate/custom data parsing, and a robust GUI-based workflow design and execution engine.

Good places to start

Data manipulation, workflows.
Hands-on: Galaxy Basics for everyone / Galaxy Basics for everyone / Introduction to Galaxy Analyses
QA, mapping
Sequence analysis / Tutorial List
And while this tutorial focuses on variant calling, the early steps are similar across protocols.
Hands-on: Exome sequencing data analysis for diagnosing a genetic disease / Exome sequencing data analysis for diagnosing a genetic disease / Variant Analysis
Then, full assembly is covered here. You might not need it, and I wouldn’t attempt the full genome – just the region of interest + junctions – since the remainder can be derived directly or characterized with variants.
Assembly / Tutorial List
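
To make the "region of interest + junctions" idea a bit more concrete, here is a minimal Python sketch of one way to spot candidate junction reads before any assembly. It assumes the reads were aligned with a local aligner that soft-clips unaligned read ends, that the output is plain SAM text, and that the transgene is a reference record named `transgene`; none of this is specified in the thread. The function names and the `min_clip` threshold are hypothetical. A read aligned to the transgene with a long soft clip may extend into the host genome, and the clipped bases are the candidate flanking sequence.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar):
    """Return (left, right) soft-clip lengths from a CIGAR string."""
    # Hard clips (H) sit outside soft clips and are absent from SEQ.
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MDN=X")


def junction_candidates(sam_lines, transgene="transgene", min_clip=20):
    """Yield soft-clipped reads aligned to the transgene record.

    Each hit reports the 1-based transgene position at the clip boundary
    and the clipped bases, which may be host flanking sequence.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        f = line.rstrip("\n").split("\t")
        if len(f) < 11:
            continue  # not a complete alignment record
        qname, flag, rname, pos, cigar, seq = (
            f[0], int(f[1]), f[2], int(f[3]), f[5], f[9])
        if rname != transgene or flag & 0x4 or cigar == "*" or seq == "*":
            continue
        left, right = soft_clips(cigar)
        if left >= min_clip:
            yield {"read": qname, "side": "left",
                   "transgene_pos": pos, "flank": seq[:left]}
        if right >= min_clip:
            end = pos + reference_span(cigar) - 1
            yield {"read": qname, "side": "right",
                   "transgene_pos": end, "flank": seq[-right:]}
```

The clipped `flank` sequences could then be aligned against the host genome, or the reads carrying them could be used as the input for a small local assembly. Soft clips also arise from adapters and low-quality ends, so hits like these are candidates to verify, not confirmed insertion sites.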

That’s a lot of information! Please review and let us know if you have any questions!