Filtering out host genomic sequences from Illumina paired-end reads

Hi Galaxians. I have obtained Illumina paired-end reads of microbiome metagenomes from a human sample and would now like to remove human (contaminant) sequences from them. I tried installing Bowtie2 with Anaconda but didn’t get very far, as I am new to both Anaconda and Bowtie2. I found some instructions at this link for doing this but sadly, I do not know how to implement them. Is it possible to add this function to Bowtie2 in Galaxy? Or is there a tool in Galaxy that can perform a similar function?

Thank you in advance for your help.

There is Removal of human reads from SARS-CoV-2 sequencing data. It demonstrates the approach with SARS-CoV-2 sequencing reads, but try to work through it and it should be fairly obvious how to apply this to your data, I hope.

Thank you @wm75. I am exploring that now. My issue is that I only have one set of paired-end reads but the example used 2 sets and grouped the data into collections. I will try to figure it out. :handshake:

For those who are interested, I found out that Bowtie2 can perform that function in Galaxy. In the Bowtie2 window, select Yes for “Write unaligned reads (in fastq format) to separate file(s)”. All those reads that do not map to hg38 (or any other reference genome of your choice) will be written to those files.
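For anyone who prefers the command line, here is a rough sketch of what I understand to be the equivalent Bowtie2 call, built as an argument list in Python. The index prefix `hg38` and all file names are placeholders I made up, and I am assuming the Galaxy option corresponds to Bowtie2's `--un-conc` family of flags for paired-end data. Note that `--un-conc-gz` writes pairs that fail to align concordantly, and the `%` in the output path is replaced by 1 or 2 for the two mates.

```python
import shlex


def bowtie2_host_filter_cmd(index_prefix, reads_1, reads_2, unaligned_pattern, threads=4):
    """Build a bowtie2 command that writes non-concordant pairs to separate files.

    All paths passed in are hypothetical placeholders; nothing is executed here.
    """
    return [
        "bowtie2",
        "-p", str(threads),                  # number of alignment threads
        "-x", index_prefix,                  # prefix of a prebuilt bowtie2 index
        "-1", reads_1,                       # forward reads
        "-2", reads_2,                       # reverse reads
        "--un-conc-gz", unaligned_pattern,   # gzipped pairs that did not align concordantly
        "-S", "/dev/null",                   # discard the SAM output, only the FASTQ files are wanted
    ]


cmd = bowtie2_host_filter_cmd(
    "hg38",                          # placeholder index prefix
    "sample_R1.fastq.gz",            # placeholder input file names
    "sample_R2.fastq.gz",
    "sample_nonhuman_R%.fastq.gz",   # '%' becomes 1 or 2 in the output names
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell where Bowtie2 and a matching index are available, or the list can be passed to `subprocess.run(cmd, check=True)`.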

I tried the method detailed in Removal of human reads from SARS-CoV-2 sequencing data but still find that the majority of scaffolds assembled from the filtered reads are human sequences. One of them is a 13 kbp human mitochondrial genome :sweat_smile: There are also sequences like “Homo sapiens contig freeze2_XXXX genomic sequence” and “homo sapiens chromosome 5 clone RP11-455D3 complete sequence”, which I do not understand. I thought the filtering process should get rid of all human reads. Could this be because hg38 is an incomplete draft of the human genome? I read that the human genome has been fully completed recently. Will we see hg39 anytime soon?
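For reference, a strict version of the filtering criterion is to keep a pair only when both mates are unmapped. The sketch below shows how that can be read off the SAM FLAG field (bit 0x4 = read unmapped, bit 0x8 = mate unmapped, as defined in the SAM specification). It is only an illustration on made-up records, not the tutorial's actual implementation, and it does not explain why human sequences remain in the assembly; the function names are hypothetical.

```python
READ_UNMAPPED = 0x4   # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8   # SAM flag bit: the mate of this read is unmapped


def both_mates_unmapped(flag):
    """Return True when the record and its mate are both flagged as unmapped."""
    return bool(flag & READ_UNMAPPED) and bool(flag & MATE_UNMAPPED)


def unmapped_pair_names(sam_lines):
    """Collect read names of pairs in which neither mate aligned to the host."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if both_mates_unmapped(int(fields[1])):   # column 2 is the FLAG
            names.add(fields[0])                  # column 1 is the read name
    return names


# Made-up example records (not real data):
example_sam = [
    "@HD\tVN:1.6",
    "pairA\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",          # both mates unmapped
    "pairA\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "pairB\t73\tchrM\t100\t42\t4M\t=\t100\t0\tACGT\tIIII",  # this mate mapped
    "pairB\t133\tchrM\t100\t0\t*\t=\t100\t0\tACGT\tIIII",   # its partner did not
    "pairC\t99\tchr5\t500\t42\t4M\t=\t600\t104\tACGT\tIIII",  # properly paired
    "pairC\t147\tchr5\t600\t42\t4M\t=\t500\t-104\tACGT\tIIII",
]

kept = unmapped_pair_names(example_sam)
print(sorted(kept))
```

In this toy input only `pairA` is kept: `pairB` has one mate aligned to the host, so the whole pair is discarded under this criterion, and `pairC` aligned as a proper pair.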