I have an aligned bacterial RNA-Seq file and was trying to get read counts for each of the four bases at each covered genome coordinate to identify both low-frequency SNPs and read depth. I tried both Samtools mpileup and Naïve Variant Caller. However, when I compared the results with read tracks on IGV, I noticed that each of these tools is not reporting between 2-5% of the covered positions, generally where there is low density. Is there any tool that will allow me to get more accurate results?
Naive Variant Caller does sound like a good choice. Are you using the latest version and are you sure you haven’t activated any filters? If not, then that’s probably as good as it gets.
Samtools mpileup also has lots of filter options, many of which are not deactivated by default.
On the IGV side, it is possible to display soft-clipped bases there, but I don’t think either samtools mpileup nor Naive Variant Caller would consider those.
No other things I can think of right now, sorry.
Thank you for your reply, which turned out to be very helpful. I used the Variant Annotator tool on the Naive Variant Caller output to view the data in Excel and did not notice that the default has a filter setting on. When I turned it off, it worked perfectly fine!
On another note, Naive Variant caller>Variant Annotator yields a read count =2 for each read that is paired. How can I create a tabular list where each paired end read is counted only once (as is the case with Samtools mpileup)? I am also losing the reference base with Variant Annotator, which makes it difficult to tell whether the variant frequency is 0% or 100%.
I’m not sure there’s anything that can be done about the double counting of overlapping read mates in PE data for the Naive Variant Caller (if anything you could look for a tool that lets you merge overlapping mate alignents into one at the BAM level before passing that to Naive Variant Caller, but I’m not sure there is such a tool).
Regarding Variant Annotator, I would probably go with SnpSift Extract Fields to create a tabular report out of your VCF instead. This is untested and will work only if the VCF written by the Naive Variant Caller conforms to the VCF standard, but I really like the tool because of its flexibility. You can then continue with any of Galaxy’s Text Manipulation tools to tailor the report exactly to your needs.
Thanks for your interest and suggestions. It seems that there are some tools out there to merge mate pairs but Galaxy does not seem to support them. Possibly dada2: mergePairs in Galaxy could work but I am not sure how to use it. Meanwhile, I will try to see whether I can tweak some of the Samtools mpileup options so that it will give the correct output.