Removal of host sequences without reference genome

emiliomstriani · August 23, 2021, 2:25am

Dear all, suppose I have a collection of viral reads from NGS (Illumina) in FASTQ format. After the usual pre-processing step (done with fastp), I need to remove the host sequences (contaminants) without having the host reference genome, so I cannot map the reads with bowtie2 and samtools. I have read about some approaches, but I am still not sure which to use. The goal of my project is to identify the correct taxonomy of the viral reads. We always know the host of each sample (for example bat, rodent, human, or mosquito), even when its reference genome is not available. Can someone please suggest an appropriate strategy or starting point? Thanks for your support.

gbbio · August 24, 2021, 9:51am

Could you use BLAST? Without a reference it will be difficult… In some cases, if you used specific primers, you can filter on those, or on the length of the reads. Did you do some kind of amplicon sequencing?
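
To make the primer/length idea concrete, here is a rough Python sketch, not a tested pipeline. It assumes plain 4-line FASTQ records and exact primer matching at the start of the read (no mismatches allowed). The function names, the primer `ACGT`, and the length bounds are all made up for illustration; you would substitute your own primer and expected amplicon size.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").rstrip("\n")
        next(it, "")  # '+' separator line, ignored
        qual = next(it, "").rstrip("\n")
        yield header, seq, qual


def keep_read(seq: str, primer: str = "", min_len: int = 0,
              max_len: Optional[int] = None) -> bool:
    """True if the read starts with the primer and its length is in range."""
    if len(seq) < min_len:
        return False
    if max_len is not None and len(seq) > max_len:
        return False
    return seq.startswith(primer)  # exact match only (assumption)


def filter_reads(records: Iterable[Record], primer: str = "",
                 min_len: int = 0,
                 max_len: Optional[int] = None) -> List[Record]:
    """Keep only the records that pass keep_read."""
    return [r for r in records if keep_read(r[1], primer, min_len, max_len)]


# Toy data: only read1 starts with the (hypothetical) primer AND is long enough.
example = [
    "@read1", "ACGTACGTAA", "+", "IIIIIIIIII",
    "@read2", "TTTTACGT", "+", "IIIIIIII",
    "@read3", "ACGTAC", "+", "IIIIII",
]
kept = filter_reads(read_fastq(example), primer="ACGT", min_len=8, max_len=12)
print([header for header, _, _ in kept])
```

This only helps if the viral reads really do come from amplicons with known primers or a predictable size; for shotgun data it will not separate host from virus.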