Filtering reads in a BAM dataset

cjain · December 10, 2021, 7:47am

Hello:

I would like to filter a set of reads that were aligned using Bowtie in a few different ways: (1) filter reads that map to a certain class of genes (e.g., CDS or tRNAs); (2) filter reads that contain a single mismatch; (3) filter reads that contain a specific mismatch (e.g., T to C). What tools would be suitable to perform these operations?

Thank you!

Flow · January 5, 2022, 10:05am

Dear @cjain,
(1) You can use bedtools intersect (the Galaxy tool "bedtools Intersect intervals", which finds overlapping intervals in various ways). It works with BAM files. You would need to provide an annotation for your CDS or tRNAs.

(2) You can use the "BAM filter" tool, which removes reads from a BAM file based on criteria.

(3) This is actually not that straightforward (to my knowledge) and would need samtools (see the post "Extracting Reads Containing A Specific Variant From A Bam File"). The question that immediately comes to mind is why you want to filter reads with a specific mismatch. If you want to investigate specific variants and count how many reads support each variant, then this goes into variant calling (see Galaxy Training!).
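For point (2), the underlying idea can be sketched in a few lines of Python working on SAM text records (for example, the output of `samtools view`). This is only an illustration, not what the BAM filter tool does internally. It assumes each record carries an `MD:Z` tag (`samtools calmd` can add one if it is missing), and the function names are made up for this example. Counting from MD rather than NM means that insertions and deletions are not counted as mismatches.

```python
import re

# One MD token is a run of matching bases (digits), a deletion ("^" followed
# by the deleted reference bases), or a single mismatched reference base.
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")


def md_tag(sam_line):
    """Return the MD:Z value of a SAM record, or None if the tag is absent."""
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("MD:Z:"):
            return field[5:]
    return None


def count_mismatches(md):
    """Count substitutions in an MD string; deleted bases are not counted."""
    return sum(1 for token in MD_TOKEN.finditer(md) if token.group(3))


def has_single_mismatch(sam_line):
    """True if the record has an MD tag describing exactly one substitution."""
    md = md_tag(sam_line)
    return md is not None and count_mismatches(md) == 1


# Hypothetical record, used only to show the call.
record = "r1\t0\tchr1\t100\t255\t16M\t*\t0\t0\tACGTACGTACGTACGT\t" + "I" * 16 + "\tNM:i:1\tMD:Z:10A5"
print(has_single_mismatch(record))
```

Records without an MD tag are treated as not matching, so unaligned reads are dropped rather than raising an error.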

Best wishes,
Florian

cjain · January 6, 2022, 2:05am

Thank you again, Florian. I will try your suggestions.

As for the third point, it concerns the use of PAR-CLIP data, where one expects to see diagnostic T>C mutations at crosslinked positions. I have noticed that there are a lot of reads that do not have any mutation as well as some that have other mutations. I would like to identify protein-binding motifs, and so I think that removing this background will be better for motif identification.

By the way, I saw that you have a recent paper on RNase E. I published a few papers on E. coli RNase E myself, so I will plan to read your paper soon.

Regards,

Chaitanya Jain

Flow · January 7, 2022, 12:40pm

Dear @cjain,
I would not necessarily remove the PAR-CLIP background. Some peak calling algorithms use the T->C point mutation at the crosslink site for the peak calling. For example, you can use the peak calling tool PARalyzer (a method to map interaction sites between RNA-binding proteins and their targets), which was specifically designed for PAR-CLIP data [1].

Yes, indeed, I was involved in the RNase E project by Ute Hoffmann et al. I also did some other work on RNA-protein interactions (CLIP and Ribo-Seq).

Best wishes,
Florian

cjain · January 13, 2022, 6:19pm

Hi Florian,

I finally got to reading the paper on RNase E. The amount of bioinformatics work on the paper is very impressive! I didn’t understand a few things though and it would be nice if one could discuss them.

Regarding our PAR-CLIP paper, we did initially get the Hafner lab (who invented PAR-CLIP) to analyze the data, but they found few peaks. On the other hand, when I analyzed the data myself, I found T>C diagnostic changes in many genes, and I later showed through experimentation that several of these changes are biologically relevant. So, I am not sure whether analyzing the data through peak identification is the best way to go.

However, I did also try to use PARalyzer on my data but got errors. I am not sure whether the data was properly formatted for PARalyzer. Do you have any suggestions?

Also, as I mentioned earlier, I would ideally like to use only reads that contain T>C mutations for motif identification and remove background reads. So if it is possible using Galaxy tools, that would be great. I also tried to use MEME on unfiltered reads but got an error. Perhaps I don’t know how to format the BAM file for MEME. Any advice would be appreciated.

Finally, it is possible that for future projects we may need someone who has experience with bioinformatics. Would you be open to collaboration or know anyone who might be?

Regards,

Chaitanya

Flow · January 14, 2022, 11:36am

Dear @cjain,

> However, I did also try to use PARalyzer on my data but got errors. I am not sure whether the data was properly formatted for PARalyzer.

Do you have the error message you got from Galaxy?

> Also, as I mentioned earlier, I would ideally like to use only reads that contain T>C mutations for motif identification and remove background reads.

I think this is really a bit of a problem in Galaxy, or at least it needs a bit more trickery. To my knowledge, there is no stand-alone tool for it. This post points in the direction of what you want to do, but it might not be enough. It is probably important to check whether you use soft- or hard-clipping and to make sure that insertions and deletions are handled correctly.
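To make the clipping and indel concern concrete, here is a minimal Python sketch of how T>C reads could be identified outside Galaxy from the FLAG, SEQ, CIGAR and MD fields of a SAM record. It is not an existing Galaxy tool, and the function names are hypothetical. It assumes an `MD:Z` tag is present and that the reads are sense to the transcript, so that a T>C conversion on a reverse-strand alignment shows up as A>G relative to the reference.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# MD tokens: matching run, deletion (reference bases after "^"), or mismatch.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def aligned_read_bases(seq, cigar):
    """Return the read bases that sit on reference positions (M, = and X)."""
    bases, pos = [], 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":
            bases.append(seq[pos:pos + n])
            pos += n
        elif op in "IS":
            # Insertions and soft clips are part of SEQ but not of the
            # reference, so they are skipped here.
            pos += n
        # D, N, H and P do not consume any bases of SEQ.
    return "".join(bases)


def substitutions(seq, cigar, md):
    """Yield (reference_base, read_base) for every mismatch in the alignment."""
    aligned = aligned_read_bases(seq, cigar)
    i = 0
    for run, deletion, ref_base in MD_TOKEN.findall(md):
        if run:
            i += int(run)
        elif ref_base:
            yield ref_base.upper(), aligned[i].upper()
            i += 1
        # Deleted reference bases have no counterpart in the read.


def has_t_to_c(flag, seq, cigar, md):
    """True if the read carries a T>C change on its own strand."""
    wanted = ("A", "G") if flag & 16 else ("T", "C")
    return wanted in substitutions(seq, cigar, md)


# Hypothetical forward-strand read with two soft-clipped bases.
print(has_t_to_c(0, "NNACGCACGT", "2S8M", "3T4"))
```

Hard-clipped bases are absent from SEQ and soft-clipped bases are present, which is why the two are handled differently when walking the CIGAR string.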

> I also tried to use MEME on unfiltered reads but got an error.

I would need the error report here.

> Finally, it is possible that for future projects we may need someone who has experience with bioinformatics. Would you be open to collaboration or know anyone who might be?

Yes, sure. Send me an email and we can connect. Generally, our group is always open to collaborations.

Cheers,
Florian Heyl
