Making a new reference genome

Malpotras · July 28, 2025, 8:06pm

Hi
I have RNA-seq data from a human cell line with GFP integrated into the genome. I want to map the reads to the human genome as well as to the GFP sequence (destabilized Copepod-GFP). Can you suggest a complete pipeline for that?

igor · August 4, 2025, 11:59pm

Hi @Malpotras,

Probably the best option is to follow an established protocol. Search Google Scholar for relevant papers. You can also search the published workflows available on public Galaxy servers.

The description is somewhat unclear. Do you know the insertion site or not? Do you want to map the RNA-Seq reads or find the insertion site? Do you expect to perform any downstream analysis of the mapped RNA-Seq data?

For example, you can merge (concatenate) the human genome assembly and the insertion sequence. In this scenario you can map RNA-Seq reads to the concatenated file and use gene annotation for read counting, but reads spanning the insertion site might be labelled as multimapped (these reads will probably be mapped partly to a human contig and partly to the insertion contig). Alternatively, you can insert the GFP sequence into the genome. In this scenario, reads spanning the insertion site will (most likely) map to a single location, but gene annotations downstream of the insertion site on that chromosome will be incorrect (they require coordinate adjustment). It depends on what you want.
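To make the two options a bit more concrete, here is a minimal Python sketch, not a tested pipeline. `concatenate_fasta` appends the GFP sequence to the genome as an extra contig, and `shift_interval` shows the coordinate adjustment an annotation interval would need if GFP were inserted into a chromosome instead. The function names, the file paths and the convention that the insert is placed right after the 1-based position `insert_after` are assumptions made for illustration.

```python
def fasta_names(path):
    """Return the sequence names (first word after '>') found in a FASTA file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


def concatenate_fasta(genome_fa, transgene_fa, combined_fa):
    """Option 1: write the genome followed by the transgene as an extra contig."""
    clash = set(fasta_names(genome_fa)) & set(fasta_names(transgene_fa))
    if clash:
        raise ValueError(f"Contig names used in both files: {sorted(clash)}")
    with open(combined_fa, "w") as out:
        for path in (genome_fa, transgene_fa):
            ends_with_newline = True
            with open(path) as handle:
                for line in handle:
                    out.write(line)
                    ends_with_newline = line.endswith("\n")
            # Make sure the next header starts on its own line.
            if not ends_with_newline:
                out.write("\n")


def shift_interval(start, end, insert_after, insert_len):
    """Option 2: adjust a 1-based inclusive interval on the edited chromosome.

    Assumes the insert is placed directly after position `insert_after`.
    Coordinates beyond that position move by `insert_len`; an interval that
    spans the insertion point keeps its start and only its end moves.
    """
    new_start = start + insert_len if start > insert_after else start
    new_end = end + insert_len if end > insert_after else end
    return new_start, new_end
```

With the first option the human annotation can be used unchanged, which is probably the simpler route when the insertion site is not known.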

Kind regards,

Igor

Malpotras · August 5, 2025, 3:12pm

Hi
Thanks for your response.
We used lentiviral transduction, so the target cassette is integrated at random sites in the genome. I basically want to see differential expression of GFP in my samples, as they are FACS-sorted into GFP-high and GFP-low subpopulations. Since this is RNA-seq data, I cannot get the copy number variation of the target cassette integrated in the genome. I just want to confirm that the GFP-high cell population indeed has high GFP expression and that the GFP-low cell population has low GFP expression.
I’m a beginner on the Galaxy platform, so if you can provide me with a stepwise protocol for the pipeline you suggest, that would be great.

Thanks
Best
Shivani

igor · August 5, 2025, 11:38pm

Hi Shivani,

The best option is an existing protocol used in similar situations.

As a rough outline, you can probably concatenate the human genome and GFP, map the RNA-Seq reads using HISAT2, and get the number of reads mapped to every contig using one of the Samtools statistics tools, such as samtools idxstats. After that you can estimate the proportion of reads mapped to the GFP contig, for example, how many reads per million are mapped to GFP. Alternatively, you can add an annotation for GFP to the annotation of human genes, estimate read counts for human genes and GFP using featureCounts, and use the read counts for differential gene expression analysis. In the latter case you will get statistical support for the observed difference in GFP expression.
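Both steps can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a validated protocol. `reads_per_million` expects text in the `samtools idxstats` layout (tab-separated contig name, contig length, mapped count, unmapped count) and uses the total mapped count as the denominator, which is my choice here. `gfp_gtf_lines` builds GTF records that cover the whole GFP contig so they can be appended to the human GTF; featureCounts counts `exon` features grouped by `gene_id` by default. The contig name `GFP`, the source label `custom`, the `+` strand and the function names are hypothetical.

```python
def reads_per_million(idxstats_text, contig):
    """Reads mapped to `contig` per million mapped reads, from idxstats-style text."""
    mapped = {}
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, _length, n_mapped, _n_unmapped = line.split("\t")[:4]
        mapped[name] = int(n_mapped)
    if contig not in mapped:
        raise KeyError(f"Contig {contig!r} not found in idxstats text")
    total = sum(mapped.values())
    if total == 0:
        raise ValueError("No mapped reads in idxstats text")
    return mapped[contig] * 1_000_000 / total


def gfp_gtf_lines(contig, length, gene_id="GFP"):
    """GTF gene, transcript and exon records spanning the whole contig.

    Assumes the contig holds only the coding sequence in sense orientation.
    """
    gene_attrs = f'gene_id "{gene_id}"; gene_name "{gene_id}";'
    tx_attrs = (
        f'gene_id "{gene_id}"; transcript_id "{gene_id}_tx"; '
        f'gene_name "{gene_id}";'
    )
    rows = [("gene", gene_attrs), ("transcript", tx_attrs), ("exon", tx_attrs)]
    return [
        "\t".join([contig, "custom", feature, "1", str(length), ".", "+", ".", attrs])
        for feature, attrs in rows
    ]
```

The per-contig counts are alignment counts and may include secondary alignments, so treat the reads-per-million value as a quick check. For the comparison between GFP-high and GFP-low samples, the featureCounts route followed by a differential expression tool is the one that gives statistical support.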

Kind regards,

Igor