Unique insertion sites Calculation from Himar1 C9 based TnSeq

ATIKAKO_KOSSIVI · October 15, 2025, 12:34pm

Hello,

I am using bioinformatic tools to analyze my TnSeq data. I followed the “Essential genes detection with Transposon insertion sequencing” pipeline on the Galaxy platform and obtained the list of my essential genes. My TnSeq library is Himar1 C9 based, so the transposon inserts at TA sites.

I wanted to calculate the number of unique insertion sites, but I don’t know how to do that.

I tried this: I converted the BAM file to a BED file, then extracted the 5’ ends of the reads, which represent the transposon insertion site coordinates, with awk: awk 'BEGIN{OFS="\t"}{if($6=="+") print $1,$2,$2+1; else if ($6=="-") print $1,$3-1,$3}'. The output was then filtered to keep only unique occurrences, and I took the result as my unique insertion sites (UIS). However, I got around 200,000 UIS, which is far more than the 78,000 TA sites available in my genome. So I guess my strategy is not working.

Please help, I don’t know how to do this.

jennaj · October 18, 2025, 1:20am

Welcome @ATIKAKO_KOSSIVI

As a reference for anyone else reading this later on, this is the tutorial.

Hands-on: Essential genes detection with Transposon insertion sequencing / Genome Annotation

Now, for your question here.

It sounds like you are attempting to use the mapped sequencing reads from the BAM (your reads versus the genome) to determine the potential insertion locations. I don’t think that will work. Remember that the BAM lists all of the sequences, both mapped and unmapped, and it also carries the parts of mapped reads that fell outside the aligned region. The query also doesn’t appear to screen for logical proximity of the alignments, which will be important. Maybe I am reading your AWK command incorrectly, but given the exploded number of “sites” you found, trying something else will likely help.

I think you were making a query against the reads themselves, and not the mapped part of the reads, and without respect to how both ends aligned, and without respect to the genome’s bases in particular.
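One way to check the last point is to test how many of the extracted 5’ end positions actually land on a TA dinucleotide. The sketch below is only an illustration, not part of the tutorial: the function names, the toy genome and the toy BED rows are all made up, and it assumes 0-based BED coordinates with the same strand logic as the awk command above. It also shows how a single TA can be counted twice if forward reads start on the T and reverse reads end on the A, which may or may not match how your library was prepared.

```python
def five_prime_ends(bed_rows):
    """Collect unique 0-based 5' end positions, mirroring the awk logic."""
    ends = set()
    for chrom, start, end, strand in bed_rows:
        if strand == "+":
            ends.add((chrom, start))      # first base of a forward read
        elif strand == "-":
            ends.add((chrom, end - 1))    # last base of a reverse read
    return ends


def on_ta_site(seq, pos):
    """True if 0-based pos is the T or the A of a TA dinucleotide."""
    seq = seq.upper()
    is_t = seq[pos:pos + 2] == "TA"
    is_a = pos > 0 and seq[pos - 1:pos + 1] == "TA"
    return is_t or is_a


def split_by_ta(ends, genome):
    """Split positions into those on a TA site and those that are not."""
    on = {(c, p) for c, p in ends if on_ta_site(genome[c], p)}
    return on, ends - on


# Toy data (hypothetical): TA dinucleotides start at positions 2 and 7.
genome = {"chr1": "GGTACCGTAG"}
rows = [
    ("chr1", 2, 8, "+"),   # forward read starting on the T of the first TA
    ("chr1", 0, 4, "-"),   # reverse read ending on the A of the same TA
    ("chr1", 5, 9, "+"),   # read whose 5' end is not on a TA
    ("chr1", 2, 6, "+"),   # duplicate 5' end, collapsed by the set
]
ends = five_prime_ends(rows)
on_ta, off_ta = split_by_ta(ends, genome)
print(len(ends), sorted(on_ta), sorted(off_ta))
```

In this toy case the first TA shows up as two separate “unique” positions (2 and 3), and one position is not on a TA at all. If a large share of your 200,000 positions behave like either of these, that would explain a count above the number of TA sites.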

Consider starting here instead → Hands-on: Essential genes detection with Transposon insertion sequencing / Genome Annotation (compute-coverage-of-the-genome)

Then proceed to the downstream steps. If these seem tedious, you can run through them once and extract a workflow to rerun this faster next time, or try the workflow included in the tutorial (import, remove the steps you don’t want, add more you do want, this kind of thing). It is all in the web version, even the editing part. Later on, if you want to work on the command line, you could explore using the API to run your workflow for batch work (Galaxy will function like any other workflow engine, the “web” part is for accessibility reasons and isn’t required).

Then, you can also try to replicate what the utilities are doing with other tools, but maybe understand these baseline manipulations first, then move into environments like Jupyter notebooks or even RStudio after?

In short, you will be generating some coverage about where the reads are aligning, then screening the genome for potential sites (TA bases), then cross-referencing those against each other, generating some more statistics, and making decisions about which locations are associated with known genes. All to learn which gene loci, when disrupted, don’t matter for baseline organism survival, and then which ones do (the latter are then designated as “essential”).
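To make the cross-referencing step concrete, here is a rough Python sketch of the idea. It is not what the Galaxy tools run internally: the function names, the `min_reads` parameter and the toy data are invented for illustration, and it assumes you already have a table of read counts per 0-based 5’ end position for one strand-merged replicon. Counts on the T and on the A of a TA are pooled, so each TA site is counted once.

```python
def find_ta_sites(seq):
    """Return the 0-based start position of every TA dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "TA"]


def unique_insertion_sites(ta_sites, coverage, min_reads=1):
    """Map each TA site with enough reads to its pooled read count.

    coverage: dict of 0-based position -> number of read 5' ends there.
    Reads on the T (site) and on the A (site + 1) are pooled per site;
    positions that are not part of a TA are ignored.
    """
    hits = {}
    for site in ta_sites:
        count = coverage.get(site, 0) + coverage.get(site + 1, 0)
        if count >= min_reads:
            hits[site] = count
    return hits


# Toy data (hypothetical): TA sites start at positions 2, 7 and 10.
sequence = "GGTACCGTAGTA"
coverage = {2: 5, 3: 2, 5: 9, 10: 1}   # position 5 is not on a TA

ta_sites = find_ta_sites(sequence)
hits = unique_insertion_sites(ta_sites, coverage)
saturation = len(hits) / len(ta_sites)
print(ta_sites, hits, round(saturation, 2))
```

Because the count is taken over TA sites rather than over read positions, the number of unique insertion sites can never exceed the number of TA sites in the genome, and the ratio of the two gives a simple saturation estimate.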

If you are having trouble with the tutorial protocol, one good thing to try is to create a sort of “reference history” with the tutorial data + tutorial workflow. Output this to a new history. You can use it for comparison purposes – and even for getting around tedious manipulations – copy datasets between histories then “rerun” jobs but select your inputs instead!

I hope this helps but please let us know if it actually does! Issues with the tutorial that you can’t solve can have your working history shared back here and we can work through it together to solve the technical issues. See → How to get faster help with your question
