Downsampling all regions to specified maximum number of reads

I see that Samtools View can downsample by specifying a fraction of the total reads to keep or by specifying a target number of total reads. However, I’m interested in randomly downsampling by specifying the maximum number of reads that would be consistent across all regions. This would make it easier to set minimum read threshold for variant calls (specificity and sensitivity). Is this there a tool/app for this?

Hi @gco,
I’m not entirely sure if I understand your question correctly. Are you saying you would like to cap read coverage at every site at a certain maximum?
If so, then BAM filtering is not the right answer because reads, of course, always span multiple sites and starting to throw away entire reads just to satisfy max coverage constraints at a few sites would waste a lot of information.
That’s why, typically, this kind of capping is done during pileup generation/variant calling though not all tools in these categories offer such an option. Which tool(s) are you planning to use for pileup generation/variant calling?

Yes, I would like to cap at a specified number of reads (eg 25x, 50x, 100x) at all locations and discard the rest to see which known variants are called and not called at that level. Let me know if you have any suggestions for pileup generation/variant calling that can help otherwise I can just play with Samtools view to get my reads down and do manual counts. Thank you.

Well, you still haven’t mentioned, which variant caller you’d like to use, but e.g. the latest version of freebayes - (not yet available on it seems) - lets you set a depth cap under Read coverage.

1 Like

I’m getting my secondary analysis from either Illumina BaseSpace and ThermoFisher Torrent Suite depending on the application. Unfortunately, they don’t provide the needed downsampling. I’ll look into freebayes as you suggested. Thank you.

1 Like

Hi, I have a similar issue, what did you find was the best solution in the end?

Hi stauntok,

If I remember correctly, I was not also able to find anything in freebayes. What I ended up doing was using IGV to manually obtain the stats needed at 25x, 40x and 50x for each variants I was investigating for validation. IGV will randomly display all the aligned reads and I would just take the first 25, 40 or 50 reads for my data. Let me know if you find another way. Best.