Downsampling many BAM files in parallel

I have multiple BAM files in a collection with different read depths, and I would like to build a workflow that automates the downsampling, ideally without handling each file individually, so that all BAM files in the resulting collection end up with the same read depth. The lowest read depth among the BAM files in the collection would be the depth to which all the others are downsampled. I'm wondering whether this could be achieved by applying rules to the downsampling tool (Downsample SAM/BAM, Galaxy Version 3.1.1.0) so that it retrieves the correct downsampling factor from a text file, and then connecting this to the downsampling step so that each BAM file in the collection is downsampled with its own scaling factor. Does anyone know how to do this?

Hi @2-tetrad,
It looks like you want the same number of mapped reads in each BAM file. If you are happy with downsampling to a target number of reads, samtools view can do that. I'm not sure how it handles unmapped reads, though; you may need to filter those out first.
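If it helps, here is a rough sketch of that idea with samtools (a minimal sketch; input.bam, the seed 42, and the 25% fraction are placeholders, not values from your data):

```bash
# Count mapped reads only: -c counts, -F 4 excludes unmapped reads.
samtools view -c -F 4 input.bam

# Keep roughly 25% of the reads: -s takes SEED.FRACTION, so 42.25
# means random seed 42 and fraction 0.25. Mates of a pair are kept
# or dropped together, based on read-name hashing.
samtools view -b -s 42.25 -o downsampled.bam input.bam
```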
Kind regards,
Igor

Hi @2-tetrad

For the scaling-factor part, yes, it seems you could create a workflow for this. Determine the scaling factor (after a first round of filtering), extract that number into a text file, then use the “simple inputs” workflow function to pass it to the downsampling step.
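Outside of Galaxy, the arithmetic for each file would look something like this (a hypothetical sketch; min_reads.txt, sample.bam, and fraction.txt are illustrative names, not files any tool creates for you):

```bash
# Smallest mapped-read count across the collection, computed earlier.
MIN_READS=$(cat min_reads.txt)

# Mapped-read count of the current file (-c counts, -F 4 drops unmapped).
THIS_READS=$(samtools view -c -F 4 sample.bam)

# Fraction of reads to keep for this file, e.g. 0.3125 when
# MIN_READS=5000000 and THIS_READS=16000000.
FRACTION=$(awk -v a="$MIN_READS" -v b="$THIS_READS" 'BEGIN {printf "%.4f", a/b}')

# This is the number you would extract into a text file and feed in
# as the workflow parameter.
echo "$FRACTION" > fraction.txt
```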

This tutorial explains how to use a value stored inside a text file as a workflow parameter → Hands-on: Using Workflow Parameters (in Using Galaxy and Managing your Data)

And @igor 's point is important: think about the logic of how these different tools work. Removing unmapped reads seems important, and you might also want to filter for mapping quality (mapQ), proper pairs, etc. before determining the scaling factor; then be sure to run those same post-filtered BAMs through the downsampling steps, too. A samtools-style filter might look like the sketch below.
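(The exact flags depend on your experiment; this is only an illustration.)

```bash
# Pre-filter before computing the scaling factor:
#   -F 0x904  drops unmapped (0x4), secondary (0x100), and
#             supplementary (0x800) alignments
#   -q 20     requires mapping quality >= 20
#   -f 2      keeps properly paired reads only
samtools view -b -F 0x904 -q 20 -f 2 -o filtered.bam input.bam
```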

Hope this works out! :slight_smile: