Convert single-end reads to paired-end?

I have some samples sequenced on the NovaSeq 6000 where a large portion of the R2 reads are entirely poly-G, or where I lose either R1 or R2 after trimming. I am wondering whether it is possible to take the unpaired reads and use their reverse complement to turn them into paired reads.

Essentially: take the “unpaired” GOOD reads, reverse complement them, use FASTQ to SAM to set the forward and reverse reads, and finally use SAM to FASTQ to put them back into an R1/R2 orientation that I can concatenate onto the rest of my reads.
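The reverse-complement step in this idea can be sketched in plain Python. This is a minimal illustration, not a Galaxy tool: `reverse_complement` and `synthetic_mate` are hypothetical helper names, and the record is assumed to be already parsed into name, sequence, and quality strings. Note that a mate built this way is an exact copy of the original read, so it adds no independent evidence.

```python
# Hypothetical helpers for illustration only; not part of any specific tool.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def synthetic_mate(name, seq, qual):
    """Build an artificial mate for one FASTQ record.

    The bases are reverse complemented and the quality string is reversed
    so that each quality score stays attached to its base.
    """
    return name, reverse_complement(seq), qual[::-1]


if __name__ == "__main__":
    print(synthetic_mate("read1", "AAGC", "IIF#"))
```

The quality string is reversed but not otherwise changed, because quality scores describe positions in the read, not the bases themselves.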

In my head this makes sense, as I recall Trimmomatic does something similar for overlapping pairs?

Hi @Jon_Colman,

Technically, you can produce reverse-complement sequences, but I do not recommend it. A more appropriate approach would be to use both the good PE data and the SE data.

I am not aware of this feature in Trimmomatic.

Kind regards,
Igor

The problem is that I have a massive number of reads, say 100 MB compressed, that are unpaired due to poly-G tails. Cloud-based platforms want either paired or single reads, not paired reads mixed with unpaired singles.

I have a mapping question, Igor. I have had issues with host removal, in that it removes a LOT of reads from the species I’m looking for. One of these is Plasmodium (malaria); I know that at least part of its genome is a 95%+ match to human. Is there a Galaxy tool and/or settings I can try so that only reads with a 100% match to human are removed (I assume the newer T2T reference is probably best)? I don’t mind if I miss some of the human reads; I just don’t want to lose microbial reads.

Jon

Some protocols can tolerate PE and SE data in a single alignment. You can map PE and SE separately and merge the BAM files.
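The "map separately, then merge" suggestion can be sketched as a small command builder. The thread does not name an aligner, so `bwa mem` and `samtools` are assumptions here, as are the function name `mapping_commands` and the intermediate file names. The reference is assumed to be indexed already (for example with `bwa index`). The function only builds the command strings; it does not run them.

```python
import shlex


def mapping_commands(ref, r1, r2, singles, out_bam):
    """Return shell commands that map PE and SE reads separately, then merge.

    Assumes bwa and samtools are installed and `ref` is already indexed.
    Intermediate file names are placeholders chosen for this sketch.
    """
    q = shlex.quote
    pe_bam = "pe.sorted.bam"
    se_bam = "se.sorted.bam"
    return [
        # Paired-end reads: both mates are given to the aligner together.
        f"bwa mem {q(ref)} {q(r1)} {q(r2)} | samtools sort -o {pe_bam} -",
        # Unpaired survivors: mapped as ordinary single-end reads.
        f"bwa mem {q(ref)} {q(singles)} | samtools sort -o {se_bam} -",
        # Combine both sorted BAM files into one.
        f"samtools merge {q(out_bam)} {pe_bam} {se_bam}",
    ]


if __name__ == "__main__":
    for cmd in mapping_commands("ref.fa", "R1.fq", "R2.fq", "singles.fq", "all.bam"):
        print(cmd)
```

In Galaxy the same three steps would be separate tool runs in a history or workflow rather than shell commands.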

100 MB: it depends on context. If you have 2 x 4 GB PE files, 0.1 TB is rather small.

Generally, people get an excessive amount of data these days, so losing some may not be a big deal, but it depends on the individual situation.

0.1 GB, sorry for the typo.

I am surprised by sequences with such strong similarity, 95%, between the human and Plasmodium genomes. If many reads are filtered, the sequence(s) must be of a reasonable length. I am not saying it is impossible, but I recall multiple stories about foreign sequences present in genome assemblies in the early days of genomics.

You can increase the cost of mismatches and gaps in the advanced settings of the aligner used for read mapping, but it is a double-edged sword: many human reads will then be left unmapped.
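To make the trade-off concrete, here is a sketch of what "only a 100% match to human" could mean at the alignment level. It assumes each alignment is available as a SAM CIGAR string plus the standard `NM` (edit distance) tag; the function name `is_perfect_host_hit` is hypothetical, and this is one possible definition rather than a setting of any particular Galaxy tool.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_perfect_host_hit(cigar, nm, read_len):
    """Return True only for an end-to-end alignment with zero edits.

    cigar    -- SAM CIGAR string ("*" means unmapped / unavailable)
    nm       -- value of the NM tag (edit distance to the reference)
    read_len -- length of the read sequence
    """
    if cigar == "*":
        return False
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Any clipping, insertion, deletion, skip, or explicit mismatch disqualifies.
    if any(op not in "M=" for _, op in ops):
        return False
    aligned = sum(n for n, _ in ops)
    return nm == 0 and aligned == read_len


if __name__ == "__main__":
    print(is_perfect_host_hit("150M", 0, 150))     # exact full-length match
    print(is_perfect_host_hit("150M", 1, 150))     # one mismatch
    print(is_perfect_host_hit("140M10S", 0, 150))  # soft-clipped end
```

Under this definition a human read with a single sequencing error is no longer treated as host, which is the other edge of the sword: fewer microbial reads are removed, but more human reads remain.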

Kind regards,
Igor

The problem with my files is that, from what I can see, it is the microbial reads that are bad, whereas the host reads look good. I have changed my sequencing methods to get better quality, but I am still trying to salvage these.