Subsetting collections and doing operations on an arbitrary number of groups

I’m trying to create a workflow for mutation calling that can handle a variable number of mutant strains in parallel. I’d like to be able to make a single collection of files as input and use rules to group my sequencing data according to the mutant strain, then do the alignment and mutation calling group-wise according to those groups. It is this last bit that I can’t figure out. There does not seem to be anything in the collection operations section that would permit me to do this, but I’m new to Galaxy, so I’m probably missing something.

For example, if I had input data files like this:

A123_0001_S1_R1_L001.fq.gz Group1
A123_0001_S1_R2_L001.fq.gz Group1
A123_0001_S1_R1_L002.fq.gz Group1
A123_0001_S1_R2_L002.fq.gz Group1
A123_0002_S2_R1_L001.fq.gz Group2
A123_0002_S2_R2_L001.fq.gz Group2
A123_0002_S2_R1_L002.fq.gz Group2
A123_0002_S2_R2_L002.fq.gz Group2

I’d like to create a collection with pair information and with groups for each sample (S) for an arbitrary number of different sample groups, then align and do mutation calling on each group in parallel, ultimately returning a collection of VCF files with one file per group.

Is this possible? If so, I’d appreciate some pointers on how to do it.


Hi Richard and welcome,
what exactly would work best for you depends a bit on the exact tools you will be using (not every tool handles collections the same way).
One very simple suggestion first (let us know if you are not happy with it for whatever reason):
it looks as if the individual PE datasets within a group all belong to one sample that was simply sequenced over multiple lanes? In that case you could have a preprocessing step that concatenates all R1 files (and all R2 files) from the same sample, then continue with a less complex data structure?


Hi Wolfgang,
That would be fine, though it is my understanding that it is a good idea to run a QC step first to detect lane-related issues before concatenating the fastqs. This is why I was going in with the more complex structure: so I could just give the whole collection to fastp and get QC info on each file.
My objective is to make the pipeline accept as flexible a set of inputs as possible for ease of use by my colleagues. So if a subsequent mutation experiment dataset had reads from one lane, or more like my example, I’d like the pipeline to be able to handle this gracefully.

Thank you,


Hi Richard,
thanks for the clarification, which sounds completely reasonable.
Well then, it seems you need to go advanced and use Apply Rule to Collection.
This is an extremely powerful, but hard-to-explain, tool. It’s best if you start out with the dedicated tutorial, I guess, then experiment a bit with it before asking remaining questions here.
This tool will almost certainly be capable of doing what you need, and, though it’s even harder there, you can use it in a workflow, too.
My recommendation would be to get it working manually in a history, then copy the rule json representation and paste it into the same step in the workflow editor.
Once you have something working, you can also share your WF and ask here for review again, if you want to.
Good luck,

Hi Wolfgang,

I thought that the ‘apply rule to collection’ tool would likely be what I needed to solve the issue. I managed to get the groups defined in the build collection from rules tool. I created ‘paired indicator’ and ‘group tags’ columns for my samples using a regex on the sample names. Where I got stuck was how to get e.g. bowtie2 to map the reads by group and not just the whole collection.
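For anyone following along, the kind of regex involved here can be sketched in plain Python. The exact pattern and the `classify` helper are assumptions reconstructed from the example filenames above, not the actual rule used in the workflow:

```python
import re

# Assumed naming scheme, based on the example files above:
# <run>_<id>_S<sample>_R<read>_L<lane>.fq.gz
PATTERN = re.compile(
    r"(?P<run>[^_]+)_(?P<id>\d+)_S(?P<sample>\d+)_R(?P<read>[12])_L(?P<lane>\d+)\.fq\.gz"
)

def classify(filename):
    """Return (group tag, paired indicator) for one fastq filename."""
    m = PATTERN.match(filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename}")
    group = f"Group{int(m.group('sample'))}"  # S1 -> Group1, S2 -> Group2
    pair = "forward" if m.group("read") == "1" else "reverse"
    return group, pair

print(classify("A123_0001_S1_R1_L001.fq.gz"))  # ('Group1', 'forward')
```

In the rule builder itself the same idea is expressed as regex columns rather than code, but the mapping from filename to group tag and pair indicator is the same.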

The tutorial covers creating collections very nicely - it’s the interacting with them afterwards bit I’m getting stuck on.

Thanks again,

Ok, so my rough idea was something like:

  1. Start with two flat lists - one with all the forward reads, one with all reverse reads (if you want to have it nice and clean for users, include a step 0 that unzips a corresponding list:paired structure)
  2. Feed this to fastp, which will return the same structure back
  3. Use the rule builder to restructure each of the two collections into nested lists organized by samples (the contained datasets would represent the lanes for each sample).
  4. Use Collapse Collection to collapse each inner list (representing one sample) into one dataset (i.e., concatenate the fastqs from each sample). At this point you should have a simple list of fastqs (one per sample), and you’ll have two of them - one with fw, one with rv reads.
  5. Pass the two simple lists to bowtie2, which should return one simple list of bams

From then on, things should be straightforward.
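To make the shape changes in steps 3 and 4 concrete, here is a minimal Python sketch of the same restructuring. The dataset names are purely illustrative; in Galaxy the rule builder and Collapse Collection perform these operations on collections for you:

```python
from itertools import groupby

# Step 1/2 output: a flat list of forward-read lanes (names illustrative)
flat_r1 = ["S1_L001.fq", "S1_L002.fq", "S2_L001.fq", "S2_L002.fq"]

# Step 3: restructure into a nested list organized by sample;
# the inner lists hold the lanes for each sample
sample_of = lambda name: name.split("_")[0]
nested = {
    sample: list(lanes)
    for sample, lanes in groupby(sorted(flat_r1), key=sample_of)
}

# Step 4: collapse each inner list into one dataset per sample
# (here, "+" stands in for concatenating the fastq contents)
collapsed = {sample: "+".join(lanes) for sample, lanes in nested.items()}
# Back to a simple list: one concatenated dataset per sample
```

The same would be done for the reverse reads, yielding the two simple lists that step 5 passes to bowtie2.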

Does that help?


OK, that sounds like it should work - I’ll try implementing it and post back here once I get something working or get stuck again.
Thanks for the help!

OK, so I ended up with a slightly different approach. My input ended up being a ‘list of dataset pairs’; fastp seemed to like this input format better. I had a little trouble with bowtie2’s input preferences: I had to run the paired collection through unzip collection, splitting it into forward and reverse lists at this point, and then use fastq groomer to make the files into fastqsanger format for bowtie2. (I’m unclear on the difference between fastq and fastqsanger if the phred format is specified.) This did all work though!

Here is the workflow I ended up with:


Really nice, thanks for sharing!

A few comments/answers from me:

  1. fastqsanger is a Galaxy datatype that is meant to avoid issues with old sequencing data, for which base qualities might not be Phred+33 encoded (several different subformats of fastq have been in use historically).
    Essentially, by specifying the format of a fastq dataset as fastqsanger in Galaxy, you are “promising” that this data uses Phred+33 encoding and that you know what you are doing. Without this declaration, many tools, including bowtie2, will refuse to work with fastq datasets because their results would depend on the interpretation of the base quality string.
    If you have non-Phred+33 data, you can use the fastq groomer to convert the base qualities. You can also use this tool to check the encoding of fastq datasets.
    On the other hand, if you know that this is recent data and that it’s using Phred+33 encoding, you can simply set the dataset’s format to fastqsanger at upload time and you’ll be fine. So I suspect that the fastq groomer step in your workflow isn’t really necessary?
  2. The samtools sort step you’re doing after bowtie2 should also be unnecessary since the Galaxy wrapper of bowtie2 outputs sorted BAM by default. Instead I would consider running samtools view to remove unmapped reads or maybe also read pairs for which only one mate got mapped?
  3. You could think about adding a bamleftalign step after the MarkDuplicates step, which might help for improved indel calling. I’m also typically running bcftools norm on the VCF dataset (produced by the MiModD extract variants tool) with Left-align and normalize indels? turned on as the only action to perform to harmonize the representation of called indels. Both steps will usually not matter much for users, but won’t do any harm either.
  4. I would strongly encourage you to run SnpEff before SnpSift so that your users can also see the predicted effects that the reported variants would have on genes and proteins. Typically that would be one of their main criteria for choosing variants to follow up on in the wet lab.
    If you use snpeff eff version 4.3t (careful here: Galaxy doesn’t recognize this as the latest version though it is), you can use the SnpEff genome file WBcel235.86, which is coordinate-compatible with ce11.
    Use the SnpEff download tool to get this genome file into Galaxy as part of the workflow.
    For compatibility with other MiModD tools, which users may want to use after the workflow, make sure you choose Use 'EFF' field compatible with older versions (instead of 'ANN') in the SnpEff eff Annotation options.
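On the encoding question in point 1 above, the check that tools like the fastq groomer perform boils down to inspecting the ASCII range of the quality characters. This is a simplified sketch of that heuristic, not the groomer’s actual implementation: Phred+33 quality strings start at `'!'` (ASCII 33), while Phred+64 and the old Solexa encodings never use characters below `';'` (ASCII 59), so any character below 59 proves Phred+33 (the converse is not conclusive, since a uniformly high-quality Phred+33 read also stays in the high range):

```python
def looks_like_phred33(quality_strings):
    """Heuristic: return True if any quality character falls below
    ASCII 59, which is impossible in Phred+64/Solexa encodings and
    therefore implies Phred+33 ('fastqsanger')."""
    return any(ord(c) < 59 for q in quality_strings for c in q)

# Low ASCII characters like '!' and '%' only occur in Phred+33 data:
print(looks_like_phred33(["!''*((((***+))%%%++)(%%%%).1***-+*''"]))  # True
# A string confined to the high range is inconclusive (could be either):
print(looks_like_phred33(["hhhhhhhhhhhhhhhhhh"]))  # False
```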

I hope some of this is useful for you and best wishes,


Hi Wolfgang,

Thank you for the suggestions, that’s very helpful.

I’ve set the output format for my unzip stage to fastqsanger; that way I can skip the fastq groomer step but don’t have to rely on users setting the input format to fastqsanger, or on fastp correctly outputting fastqsanger when it receives fastqsanger as input.

Good point about the unnecessary sort - I didn’t notice Galaxy did that by default. I also added a samtools view step to retain only the properly paired reads.

bamleftalign and bcftools norm are excellent suggestions for keeping the output consistent, I’ve added those too.

I was trying to get SnpEff into the pipeline before, but for some bizarre reason, if you search ‘snpeff’ or ‘snpeff eff’, you get a COVID-specific SnpEff eff and the SnpEff build in the search results, but not plain SnpEff eff - for that you have to manually expand the ‘Variant Calling’ section and scroll to it.

I’m testing my updated version in a history now.

