Downsampling all regions to specified maximum number of reads

gco · January 13, 2020, 5:22am

Hello,
I see that Samtools View can downsample by specifying a fraction of the total reads to keep or by specifying a target number of total reads. However, I’m interested in randomly downsampling by specifying the maximum number of reads that would be consistent across all regions. This would make it easier to set minimum read threshold for variant calls (specificity and sensitivity). Is this there a tool/app for this?

wm75 · January 13, 2020, 7:56am

Hi @gco,
I’m not entirely sure if I understand your question correctly. Are you saying you would like to cap read coverage at every site at a certain maximum?
If so, then BAM filtering is not the right answer because reads, of course, always span multiple sites and starting to throw away entire reads just to satisfy max coverage constraints at a few sites would waste a lot of information.
That’s why, typically, this kind of capping is done during pileup generation/variant calling though not all tools in these categories offer such an option. Which tool(s) are you planning to use for pileup generation/variant calling?

gco · January 13, 2020, 4:09pm

Yes, I would like to cap at a specified number of reads (eg 25x, 50x, 100x) at all locations and discard the rest to see which known variants are called and not called at that level. Let me know if you have any suggestions for pileup generation/variant calling that can help otherwise I can just play with Samtools view to get my reads down and do manual counts. Thank you.

wm75 · January 13, 2020, 4:24pm

Well, you still haven’t mentioned, which variant caller you’d like to use, but e.g. the latest version of freebayes - https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes/1.3.1 (not yet available on usegalaxy.org it seems) - lets you set a depth cap under Read coverage.

gco · January 13, 2020, 5:43pm

I’m getting my secondary analysis from either Illumina BaseSpace and ThermoFisher Torrent Suite depending on the application. Unfortunately, they don’t provide the needed downsampling. I’ll look into freebayes as you suggested. Thank you.

stauntok · October 23, 2020, 12:27pm

Hi, I have a similar issue, what did you find was the best solution in the end?

gco · October 23, 2020, 5:11pm

Hi stauntok,

If I remember correctly, I was not also able to find anything in freebayes. What I ended up doing was using IGV to manually obtain the stats needed at 25x, 40x and 50x for each variants I was investigating for validation. IGV will randomly display all the aligned reads and I would just take the first 25, 40 or 50 reads for my data. Let me know if you find another way. Best.

Topic		Replies	Views
Change pileup to vcf format allowing for great depth (100.000 reads)??? usegalaxy.org support freebayes , variant-analysis	1	1335	December 6, 2018
downsampling many bam files in parallel workflow , filter-bam	2	119	July 8, 2024
Naive Variant Caller, Samtools mpileup inaccuracies variant-analysis , vcf	4	408	April 7, 2022
RepeatExplorer sampling less and less each run public-galaxy-server	1	554	March 19, 2020
Samtools - output read number similar for each sample usegalaxy.org support workflow , tool-dev , samtools	0	381	October 22, 2020

Downsampling all regions to specified maximum number of reads

Related topics