How to extract flanking genomic sequences for a transgene?

Hi, I need to identify the insertion sites of my transgene in the genome of cultured CHO cells. I have performed whole-genome sequencing on my cells. Is there any tool in Galaxy that would be useful for this? I hope to identify the reads that map to my transgene as a reference and then assemble those reads into contigs, which will contain the transgene and its flanking sequences.

I appreciate any input.

thanks.

Hello @fshapt

Yes, all of this seems possible to do in Galaxy! :scientist:

Are you following a published protocol? We don’t have a tutorial for exactly this, so having some outline of the basic steps and an idea of what final outputs you want is a good idea. Galaxy will have the exact or analogous tools, and we can help you to identify them.

For practical first steps, you’ll want to load up your reads, reference genome, transgene sequence, and baseline reference annotation, maybe create a custom annotation record for your transgene, and then optionally prepare a SnpEff reference.
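If you do create a custom annotation record, one common choice is a single GFF3 feature line covering the whole transgene. Here is a minimal sketch in plain Python (outside Galaxy). The function name, the `custom` source label, the `region` feature type, and the example sequence ID and length are all placeholders I made up for illustration; adjust them to match the sequence name in your transgene FASTA.

```python
def transgene_gff3(seq_id, length, name, source="custom", feature_type="region"):
    """Return a minimal GFF3 document: the version header plus one feature line.

    GFF3 coordinates are 1-based and inclusive, so a feature covering the
    whole sequence runs from 1 to `length`. All names here are hypothetical.
    """
    columns = [
        seq_id,            # column 1: sequence ID (must match the FASTA header)
        source,            # column 2: source label (free text)
        feature_type,      # column 3: feature type
        "1",               # column 4: start
        str(length),       # column 5: end
        ".",               # column 6: score (undefined)
        "+",               # column 7: strand
        ".",               # column 8: phase (only meaningful for CDS features)
        f"ID={name};Name={name}",  # column 9: attributes
    ]
    return "##gff-version 3\n" + "\t".join(columns) + "\n"


if __name__ == "__main__":
    # Hypothetical transgene of 5400 bp named "my_transgene".
    print(transgene_gff3("my_transgene", 5400, "my_transgene"), end="")
```

You can upload the resulting text file to Galaxy and set its datatype to `gff3`.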

UCSC hosts a version of your reference genome, or you can get the data from NCBI, or you can use what you may already have. Try to prepare all of your baseline reference data at the very start!

If you are completely new to Galaxy, I would strongly suggest taking an hour or so to go through at least one tutorial! Galaxy hosts the common bioinformatics tools that many people use for projects like yours, plus utilities for intermediate/custom data parsing, and a robust GUI-based workflow design and execution engine.

:graduation_cap: Good places to start

  1. Data manipulation, workflows.
    Hands-on: Galaxy Basics for everyone / Introduction to Galaxy Analyses
  2. QA, mapping
    Sequence analysis / Tutorial List
  3. And while this tutorial focuses on variant calling, the early steps are similar across protocols.
    Hands-on: Exome sequencing data analysis for diagnosing a genetic disease / Variant Analysis
  4. Then, full assembly is covered here. You might not need it, and I wouldn’t attempt the full genome – just the region of interest + junctions – since the remainder can be derived directly or characterized with variants.
    Assembly / Tutorial List
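
To make the "region of interest + junctions" idea a bit more concrete: when reads are mapped against the transgene alone with a local aligner (BWA-MEM, for example), reads that span an insertion junction usually align only partially, and the unaligned genomic part shows up as a soft clip (`S`) in the CIGAR string. The sketch below is plain Python run outside Galaxy, not a Galaxy tool, and it assumes SAM text as input. The function name `clipped_flanks` and the `min_clip` threshold are my own placeholders. It only lists candidate flank sequences; it does not assemble them.

```python
import re

# Matches one CIGAR operation, e.g. "35S" or "116M".
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_flanks(sam_lines, min_clip=20):
    """Yield (read_name, side, ref_pos, clipped_seq) for soft-clipped reads.

    side    -- "left" or "right": which end of the alignment is clipped
    ref_pos -- 1-based reference position of the aligned base next to the clip
    """
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        name, flag, pos = fields[0], int(fields[1]), int(fields[3])
        cigar, seq = fields[5], fields[9]
        if flag & 4 or cigar == "*" or seq == "*":   # unmapped or no sequence
            continue
        # Hard-clipped bases are absent from SEQ, so drop H operations.
        ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
        if not ops:
            continue
        # Number of reference bases covered by the alignment.
        ref_len = sum(n for n, op in ops if op in "MDN=X")
        first_n, first_op = ops[0]
        last_n, last_op = ops[-1]
        if first_op == "S" and first_n >= min_clip:
            yield name, "left", pos, seq[:first_n]
        if len(ops) > 1 and last_op == "S" and last_n >= min_clip:
            yield name, "right", pos + ref_len - 1, seq[-last_n:]


if __name__ == "__main__":
    # Two made-up records aligned to a hypothetical reference named "transgene".
    demo = [
        "@SQ\tSN:transgene\tLN:100",
        "r1\t0\ttransgene\t1\t60\t5S10M\t*\t0\t0\tAAAAACCCCCCCCCC\t*",
        "r2\t0\ttransgene\t91\t60\t10M4S\t*\t0\t0\tGGGGGGGGGGTTTT\t*",
    ]
    for hit in clipped_flanks(demo, min_clip=4):
        print(hit)
```

Clips piling up at the first or last bases of the transgene are the interesting ones. The clipped sequences can then be mapped against the CHO reference to locate the insertion site.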

That’s a lot of information! Please review and let us know if you have any questions! :slight_smile: