So I have some sequences done on the NovaSeq 6000, where a large portion of the R2 reads are all poly-G, or I lose one of R1 or R2 after trimming. I am wondering about the possibility of taking the unpaired reads and using their reverse complement to make them paired reads?
Essentially: taking the “unpaired” GOOD reads, reverse complementing them, using FASTQ to SAM to set the forward and reverse reads, and finally using SAM to FASTQ to put them back into an R1/R2 orientation that I can concatenate onto the rest of my reads.
In my head this makes sense, as I recall Trimmomatic does something similar for overlapping pairs?
The problem is that I have a massive amount of reads, say 100 MB compressed, that aren't paired due to poly-G tails. Cloud-based platforms want either paired or single reads, not paired reads mixed with unpaired singles.
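The reverse-complement idea above can be sketched in a few lines of Python. This is only a minimal illustration, not a Galaxy tool: the function names (`reverse_complement`, `read_fastq`, `write_pairs`) are made up for this example, and it assumes plain four-line FASTQ records. Note that the synthetic mate carries exactly the same information as the original read, so this only satisfies the "paired" file format; it does not add evidence.

```python
import io

# Complement table for DNA bases (N stays N); an assumption of this sketch
# is that reads contain only these characters.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def write_pairs(singles, r1_out, r2_out):
    """Write each unpaired read to r1_out and a synthetic mate to r2_out.

    The mate is the reverse complement of the read, with the quality
    string reversed so each score stays attached to its base.
    Returns the number of pairs written.
    """
    count = 0
    for header, seq, qual in read_fastq(singles):
        r1_out.write(f"{header}\n{seq}\n+\n{qual}\n")
        r2_out.write(f"{header}\n{reverse_complement(seq)}\n+\n{qual[::-1]}\n")
        count += 1
    return count


if __name__ == "__main__":
    demo = io.StringIO("@read1\nAACG\n+\nIIH#\n")
    out1, out2 = io.StringIO(), io.StringIO()
    write_pairs(demo, out1, out2)
    print(out2.getvalue(), end="")
```

The header is copied unchanged to both files. Real Illumina headers carry a read number (for example ` 1:N:0:...` or `/1`) that some tools check, so that part would need adjusting for actual data.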
I have a mapping question, Igor. I have had issues with host removal, in that it is removing a LOT of reads from the species I'm looking for. One of these is Plasmodium (malaria), and I know that at least part of its genome matches human at 95% or more. Is there a Galaxy tool and/or settings that I can try so that only reads with a 100% match to human are removed (I assume the newer T2T reference is probably best)? I don't mind if I miss some of the human reads, I just don't want to lose microbial reads.
I am surprised that sequences have such strong similarity, 95%, between the human and Plasmodium genomes. If many reads are being filtered, the matching sequence(s) must be of a reasonable length. I am not saying it is impossible, but I recall multiple stories about foreign sequences present in genome assemblies in the early days of genomics.
You can increase the mismatch and gap penalties in the advanced settings of the aligner used for read mapping, but it is a double-edged sword: many human reads will then be left unmapped.
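A different route to the same goal, offered only as a sketch, is to map with normal settings and then treat a read as host only when its alignment is perfect. The Python below assumes the alignments are available as SAM text and that the aligner writes the standard `NM:i:` edit-distance tag (Bowtie2 and BWA both do). The function names are hypothetical, and this is post-filtering, not a change to the aligner's penalties.

```python
import re

# CIGAR made of a single full-length match block, e.g. "150M" or "150=".
# Soft clips, insertions and deletions all fail this pattern.
_FULL_MATCH = re.compile(r"^\d+[M=]$")


def is_perfect_host_hit(sam_line):
    """True if the SAM record is mapped end to end with zero edits."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if flag & 0x4:  # 0x4 = read unmapped
        return False
    if not _FULL_MATCH.match(cigar):
        return False
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:]) == 0
    return False  # no NM tag: cannot confirm a perfect match


def host_read_names(sam_lines):
    """Collect names of reads that match the host reference perfectly."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        if is_perfect_host_hit(line):
            names.add(line.split("\t", 1)[0])
    return names
```

The resulting set of names could then be used to drop those reads from the FASTQ files while keeping everything else, including reads that hit human with mismatches. For paired data both mates share a name, so a decision would still be needed on whether one perfect mate is enough to discard the pair.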
The problem with my files is that, from what I can see, it's the microbes that have the bad reads, whereas the host reads look good. I have changed my sequencing methods to get better quality, but I am still trying to salvage these.