Filtering out non-target organism reads before assembly

Hello,
I sequenced a virus grown in eggs on an Illumina platform. I want to assemble the viral genome, but the vast majority of the reads in the FASTQ file are not viral. I estimate that fewer than 1% of the reads belong to the virus.
What tool should I use to filter all of the non-target DNA? I assume I can’t start the assembly before that.


Hi @omer

Please see this GTN tutorial for an example. That particular protocol uses Bowtie2 to remove reads that align to the non-target genome, but other mapping tools are also discussed.
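For anyone who wants to see the logic behind this kind of filtering, here is a minimal sketch in plain Python. It is not the tutorial's workflow and it does not call Bowtie2. It only assumes the mapper's output is available as SAM text, and it keeps the read pairs where neither mate aligned to the non-target genome (FLAG bits 0x4 and 0x8 from the SAM specification). The name `keep_unmapped_pairs` is made up for illustration.

```python
UNMAPPED = 0x4        # SAM FLAG bit: this read did not align
MATE_UNMAPPED = 0x8   # SAM FLAG bit: the mate of this read did not align


def keep_unmapped_pairs(sam_lines):
    """Yield SAM alignment lines where both the read and its mate are unmapped.

    Hypothetical helper: header lines (starting with '@') are skipped.
    """
    wanted = UNMAPPED | MATE_UNMAPPED
    for line in sam_lines:
        if line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if (flag & wanted) == wanted:
            yield line
```

On the command line, the same selection is usually done with `samtools view -f 12`, since 12 is 0x4 plus 0x8.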


Hey,

I have a few additional questions. I am also dealing with a mixed sample (virus and bacteria) and want to filter out the reads that map to the bacterial genome. My situation is slightly different as I want to filter out bacterial reads before RNAseq analysis.

I started with 12 FASTQ files (6 timepoints x 2 paired-end files each), so I created two datasets: one with the 6 R2 files and one with the 6 R1 files. I ran BWA-MEM against the bacterial genome and selected the paired-end option, which let me supply these datasets separately as forward (R2) and reverse (R1) reads. The output is just 6 BAM files, all of which have the label R2. Can I assume the R1 reads have been merged with the corresponding R2 files at this point?
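As far as the SAM/BAM format goes, one BAM per sample is what paired-end mapping normally produces: both mates are stored as separate records in the same file and are told apart by FLAG bits 0x40 (first in pair) and 0x80 (second in pair). The single R2 label is probably just inherited from one of the inputs, though that is an assumption about how the datasets were named. A small sketch for checking that both mates are present, assuming the BAM has been converted to SAM text (`count_mates` is a hypothetical helper):

```python
from collections import Counter

FIRST_IN_PAIR = 0x40   # SAM FLAG bit: read was the first mate given to the mapper
SECOND_IN_PAIR = 0x80  # SAM FLAG bit: read was the second mate given to the mapper


def count_mates(sam_lines):
    """Count records flagged as first or second in pair in SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        flag = int(line.split("\t")[1])
        if flag & FIRST_IN_PAIR:
            counts["read1"] += 1
        elif flag & SECOND_IN_PAIR:
            counts["read2"] += 1
        else:
            counts["unpaired"] += 1
    return counts
```

If the two counts come out roughly equal, both files of each pair made it into the BAM.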

Assuming they were merged, I moved forward and ran samtools view to filter out mapped reads as per the tutorial, then fastx to convert back to FASTQ files (I selected both READ1 and READ2 as outputs). I now have 2 separate datasets, but I’m unsure which one contains the R1 reads and which contains the R2 reads: all the file names in both datasets carry the R2 label and none carry the R1 label. Are my R1 and R2 reads actually separated again (as they were in the beginning), with only the file names messed up? If so, how would I know which is which?
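One way to settle which output is which, independent of the labels, is to compare a few reads from each output against the original files. Note that the first/second-in-pair flags follow the order in which the files were given to the mapper, not the file names, so with R2 supplied as the forward input the READ1 output may well hold the original R2 reads; treat that as something to verify rather than a given. The sketch below assumes plain 4-line FASTQ records and untrimmed reads, and `guess_mate` is a hypothetical helper, not part of any tool:

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence) from a FASTQ handle with 4 lines per record."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        read_id = header[1:].split()[0]        # drop '@' and any comment
        if read_id.endswith(("/1", "/2")):     # drop a mate suffix if present
            read_id = read_id[:-2]
        yield read_id, seq


def count_matches(wanted, original_handle):
    """Count reads in the original file whose ID and sequence match `wanted`."""
    hits = 0
    for read_id, seq in read_fastq(original_handle):
        if wanted.get(read_id) == seq:
            hits += 1
    return hits


def guess_mate(output_handle, r1_handle, r2_handle, sample_size=100):
    """Guess whether an output FASTQ holds the original R1 or R2 reads."""
    wanted = {}
    for read_id, seq in read_fastq(output_handle):
        wanted[read_id] = seq
        if len(wanted) >= sample_size:
            break
    r1_hits = count_matches(wanted, r1_handle)
    r2_hits = count_matches(wanted, r2_handle)
    if r1_hits > r2_hits:
        return "R1"
    if r2_hits > r1_hits:
        return "R2"
    return "undetermined"


# Tiny in-memory demo; real use would pass open file handles instead.
demo_r1 = "@a 1:N:0:1\nACGT\n+\nIIII\n"
demo_r2 = "@a 2:N:0:1\nTTTT\n+\nIIII\n"
demo_out = "@a\nTTTT\n+\nIIII\n"
demo_result = guess_mate(io.StringIO(demo_out), io.StringIO(demo_r1), io.StringIO(demo_r2))
```

If reads were trimmed anywhere between the original files and the outputs, the sequences will no longer match exactly and the comparison would need to be loosened.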

Thank you again for your help, and looking forward to hearing back!