Welcome @ATIKAKO_KOSSIVI
As a reference for anyone else reading this later on, this is the tutorial.
Now, for your question here.
It sounds like you are attempting to use the mapped sequencing reads from the BAM (your reads versus the genome) to determine the potential locations. I don’t think that will work. Remember that a BAM can list all of the sequences, both mapped and unmapped. Each record also holds the full read sequence, including any bases that were not part of the aligned region (soft-clipped ends, for example)! And the query doesn’t appear to be screening for logical proximity of the alignments, which will be important. Maybe I am reading your AWK command incorrectly, but given the exploded number of “sites” you found, trying something else will likely help.
I think your query ran against the read sequences themselves rather than the aligned portion of each read, and it did not take into account how both ends aligned or which genome bases sit at the aligned positions.
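To make the difference concrete, here is a minimal Python sketch of reading the alignment itself rather than the read sequence. It assumes you have SAM text lines (for example from `samtools view`), and the helper name `aligned_span` and the example records are made up for illustration. It skips unmapped reads using the FLAG field and uses the CIGAR string to work out which reference bases the read actually covers, so soft-clipped bases are ignored.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume reference bases (soft/hard clips and insertions do not).
REF_CONSUMING = set("MDN=X")


def aligned_span(sam_line):
    """Return (ref_name, start, end, strand) for a mapped read, else None.

    start and end are 1-based inclusive reference coordinates.
    Hypothetical helper for illustration, not part of any Galaxy tool.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    cigar = fields[5]
    if flag & 0x4 or cigar == "*":  # 0x4 = read is unmapped
        return None
    pos = int(fields[3])
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    strand = "-" if flag & 0x10 else "+"  # 0x10 = read maps to the reverse strand
    return fields[2], pos, pos + ref_len - 1, strand


# Made-up records: one soft-clipped mapped read and one unmapped read.
mapped = "read1\t0\tchr1\t100\t60\t5S20M\t*\t0\t0\t" + "A" * 25 + "\t" + "I" * 25
unmapped = "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"

for record in (mapped, unmapped):
    print(aligned_span(record))
```

A text search over the SEQ column would count the unmapped read and the 5 clipped bases as well, which is one way the number of apparent “sites” can explode.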
Consider starting here instead → Hands-on: Essential genes detection with Transposon insertion sequencing / Genome Annotation (compute-coverage-of-the-genome)
Then proceed to the downstream steps. If these seem tedious, you can run through them once and extract a workflow to rerun this faster next time, or try the workflow included in the tutorial (import, remove the steps you don’t want, add more you do want, this kind of thing). It is all in the web version, even the editing part. Later on, if you want to work on the command line, you could explore using the API to run your workflow for batch work (Galaxy will function like any other workflow engine, the “web” part is for accessibility reasons and isn’t required).
Then, you can also try to replicate what the utilities are doing with other tools, but maybe understand these baseline manipulations first, then move into environments like Jupyter notebooks or even RStudio after?
In short, you will be generating coverage data showing where the reads align, then screening the genome for potential sites (TA bases), then cross-referencing those against each other, generating some more statistics, and making decisions about which locations are associated with known genes. All to learn which gene loci, when disrupted, don’t matter for baseline organism survival, and then which ones do (the latter are then designated as “essential”).
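If it helps to see the logic outside of Galaxy, here is a rough Python sketch of the cross-referencing idea only. Everything in it is an assumption for illustration: the function names, the toy genome, the insertion positions, and the gene coordinates are all made up, and it ignores strand, read-end offsets, and the statistics the tutorial tools apply.

```python
from collections import Counter


def find_ta_sites(genome):
    """Return the 0-based start position of every 'TA' dinucleotide."""
    genome = genome.upper()
    return [i for i in range(len(genome) - 1) if genome[i:i + 2] == "TA"]


def count_insertions_at_sites(ta_sites, insertion_positions):
    """Count insertions per TA site; insertions elsewhere are dropped."""
    counts = Counter(insertion_positions)
    return {site: counts.get(site, 0) for site in ta_sites}


def summarize_genes(genes, site_counts):
    """For each gene (0-based, half-open interval), return
    (number of TA sites, number of TA sites with at least one insertion)."""
    summary = {}
    for name, (start, end) in genes.items():
        sites = [s for s in site_counts if start <= s < end]
        hit = sum(1 for s in sites if site_counts[s] > 0)
        summary[name] = (len(sites), hit)
    return summary


# Toy inputs, all hypothetical.
genome = "GGTACCTAGGTTAACC"
insertions = [2, 2, 6, 9]  # position 9 is not a TA site, so it is ignored
genes = {"geneA": (0, 8), "geneB": (8, 16)}

site_counts = count_insertions_at_sites(find_ta_sites(genome), insertions)
print(site_counts)
print(summarize_genes(genes, site_counts))
```

In this toy case geneB has a TA site with no insertions, which is the kind of pattern the downstream statistics evaluate before calling a gene “essential”. A single empty site is nowhere near enough evidence on its own.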
If you are having trouble with the tutorial protocol, one good thing to try is to create a sort of “reference history” by running the tutorial workflow on the tutorial data and sending the output to a new history. You can use it for comparison purposes, and even for getting around tedious manipulations: copy datasets between histories, then “rerun” jobs but select your own inputs instead!
I hope this helps but please let us know if it actually does!
If you run into issues with the tutorial that you can’t solve, share your working history back here and we can work through the technical issues together. See → How to get faster help with your question