How to perform samtools view tagged.bam | grep -c "miRNAname" >> genome-basedcounts.tsv

Hello, I need to do the following in Galaxy: run the command ‘samtools view tagged.bam | grep -c "miRNAname" >> genome-basedcounts.tsv’, then apply ‘sort -k1 | uniq’ to the counts file to retain only the unique miRNA counts.
Could you please tell me which tool I need and give me some instructions?

Hi @lotus

It sounds like you want to find all of the query miRNA sequences that only had one hit in the BAM file, correct? Unique mappers? You can use the tool Filter BAM for this.

Any query with a mapQ value over 20 is unlikely to have other meaningful hits, but you can use 30, or an even higher value like 60, if you want more confidence or stringency. Maybe give that a try to see how it works for you?
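
To make the threshold idea concrete, here is a minimal Python sketch of what a mapQ filter does on SAM-format text (for example, the output of `samtools view`). It assumes standard tab-separated SAM lines, where the fifth column is MAPQ. The function name `filter_by_mapq` and the sample records are made up for illustration; in Galaxy, Filter BAM does the equivalent for you.

```python
def filter_by_mapq(sam_lines, min_mapq=20):
    """Return alignment lines whose MAPQ (SAM column 5) is >= min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:
            kept.append(line)
    return kept


# Hypothetical records: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
example = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t60\t22M\t*\t0\t0\t*\t*",
    "read2\t0\tchr1\t200\t3\t22M\t*\t0\t0\t*\t*",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",  # unmapped: FLAG 4, MAPQ 0
]

print([line.split("\t")[0] for line in filter_by_mapq(example)])  # ['read1']
```

Raising `min_mapq` makes the filter stricter. What a given value means still depends on the mapping tool that produced the file.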

The method explained with samtools would involve converting the BAM to SAM format, then counting up the number of rows in the SAM file (per unique query name), then filtering based on the count values (number of reported lines). You can do it this way too in Galaxy, maybe with a mini-workflow, and compare to the filter above.

Remember that each query will always be present at least once, including lines that represent “no hit” for that query sequence. So you would need to add another filter to distinguish between lines reporting a single valid hit (a single primary hit, usually above some threshold based on the CIGAR or another field) and lines reporting no hit at all. Most of this is what a mapQ metric captures.
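
The counting approach above can be sketched in Python roughly like this. It is only an illustration under some assumptions: input is SAM-format text, the query name is column 1, and "no hit" lines are recognized by the unmapped bit (0x4) in the FLAG column. The function name `single_hit_queries` and the sample records are hypothetical, and your mapper may need extra checks (secondary or supplementary flags, a mapQ cutoff).

```python
from collections import Counter

UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def single_hit_queries(sam_lines):
    """Return query names that have exactly one mapped alignment line."""
    mapped_counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & UNMAPPED:
            continue  # a "no hit" line: present in the file, but not a hit
        mapped_counts[qname] += 1
    return sorted(q for q, n in mapped_counts.items() if n == 1)


# Hypothetical records: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
example = [
    "readA\t0\tchr1\t100\t60\t22M\t*\t0\t0\t*\t*",    # one hit
    "readB\t0\tchr1\t200\t1\t22M\t*\t0\t0\t*\t*",     # primary hit
    "readB\t256\tchr2\t300\t1\t22M\t*\t0\t0\t*\t*",   # secondary hit
    "readC\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",            # no hit
]

print(single_hit_queries(example))  # ['readA']
```

Here readB is dropped because it has two reported hits, and readC is dropped because its only line is a "no hit" line.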

This is a nice summary at the Biostars forum → Is there a way to do read filtering (MAPQ> certain value) on a BAM instead of SAM file? Just keep in mind that different mapping tools may have different “magic numbers” to designate certain mapping conditions. An internet search will tend to find these. Please ask if you get stuck. We’ll need the full name/version of the tool you used in Galaxy to help look it up!

Xref → Hands-on: Data Manipulation Olympics / Data Manipulation Olympics / Introduction to Galaxy Analyses

With an example of using mapQ for filtering hits before variant calling in here → Hands-on: NGS data logistics / NGS data logistics / Introduction to Galaxy Analyses

Hope this helps!

Hi jennai
Thank you very much! I am new to Galaxy, and I learned a lot from your answer. Now I have a new question. I am trying to create a new column with miRNA names from an existing column whose values start with XQ:Z:. For example, I want the line to show hsa-miR-200b-5p instead of XQ:Z:hsa-miR-200b-5p. I used Replace Text in a specific column (Galaxy Version 9.5+galaxy2) with the following settings:
Find pattern - optional : ^XQ:Z:(hsa-miR-\d+[a-z]?(-[1-9])?(-3p|-5p))$
Replace with - optional: hsa-miR-\d+[a-z]?(-[1-9])?(-3p|-5p)
But the output column is still XQ:Z:hsa-miR-200b-5p.

Would you assist me in getting the right tool and method?

Your help will be greatly appreciated!
lotus

Hi @lotus

I’m wondering what the entire line looks like. My first guess is that the trailing $ is one problem in the Find pattern. I’m not sure whether it should be left out entirely, so I would suggest trying that next. While doing that, I would also strongly suggest capturing the part of the Find pattern you want to report in a variable (a capture group), then calling that variable in the Replace field. How to do this is shown in the examples on the tool form. Repeating the pattern itself in the Replace field, as you are doing now, won’t work (as far as I know).
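
As an illustration of the capture-and-reuse idea, here is a small sketch using Python's `re` module, not the Galaxy tool itself. The parentheses in the Find pattern capture the name, and `\1` in the replacement refers back to that first group. The regex flavor used by the Galaxy tool may differ from Python's: this is an assumption to check against the examples on the form, for instance `\d` may need to be written as `[0-9]`, and the group reference syntax may not be exactly `\1`. The function name `strip_tag_prefix` is made up.

```python
import re

# Group 1 (the outermost parentheses) captures the miRNA name after the tag prefix.
PATTERN = re.compile(r"^XQ:Z:(hsa-miR-\d+[a-z]?(-[1-9])?(-3p|-5p))$")


def strip_tag_prefix(value):
    """Replace the whole value with captured group 1; non-matching values are unchanged."""
    return PATTERN.sub(r"\1", value)


print(strip_tag_prefix("XQ:Z:hsa-miR-200b-5p"))  # hsa-miR-200b-5p
```

Values that do not fit the pattern, such as a let-7 family name, pass through unchanged. If every value in the column carries the prefix, a looser Find pattern like `^XQ:Z:(.*)$` may be simpler.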

I’ll also sometimes try one of the other replace tools, since they all work a bit differently. I personally like the tools that work on entire lines best. But let’s try to get this one working for you first, then switch if needed. Focusing on one column should be possible.

Or maybe you solved this already? Let us know, and hope this helps. It’s great that you are learning this: data manipulation is a huge part of analytics.