Choosing datafiles for DESeq2

Hopefully a simple question. I have been extracting the individual gene count files from a collection containing 3 replicates so that I can choose them individually in the factor levels (I’ve been following the Reference-based RNA-seq analysis tutorial, which has them as separate files). However, I noticed that I can instead choose the dataset itself (which contains the individual gene count files from the 3 replicates). Will DESeq2 run in the same way?

Hi @kate2

DESeq2 expects each sample to have its own count file (rather than a single count matrix, which Limma accepts). You can input those files one by one, group them in a collection with group tags, or split them into multiple collections.

Collections are very powerful! You can choose any of these ways to input the data and get the same result:

  • multiple-file select → individual datasets listed in the history. One dataset per sample.

  • collection select → those same datasets grouped into a Flat List dataset collection folder. Still one dataset per sample. The collections should be organized in a way that lets the datasets be split out into each Factor level.

If you processed all the samples together through the upstream steps in a collection (avoids a lot of clicking!), then you can either split out the count files into multiple collections (per factor level) or even better, apply some group tags and use those on the form instead.

This is our exact example here for DESeq2. → Hands-on: Group tags for complex experimental designs / Using Galaxy and Managing your Data
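To make the group tag idea concrete, here is a minimal Python sketch of the sorting that group tags allow: one collection in, one list of samples per factor level out. This is only an illustration and not how Galaxy or the DESeq2 wrapper is implemented. The element names, the tag values, and the `split_by_group_tag` helper are all hypothetical; the only thing taken from Galaxy is the convention that group tags start with `group:`.

```python
from collections import defaultdict

# Hypothetical collection: element name -> tags attached to that dataset.
# The names and tag values are made up for illustration.
collection = {
    "sample1.counts": ["group:treated"],
    "sample2.counts": ["group:treated"],
    "sample3.counts": ["group:untreated"],
    "sample4.counts": ["group:untreated", "group:paired"],
}

GROUP_PREFIX = "group:"


def split_by_group_tag(elements):
    """Return a mapping of group tag value -> element names carrying that tag."""
    levels = defaultdict(list)
    for name, tags in elements.items():
        for tag in tags:
            if tag.startswith(GROUP_PREFIX):
                levels[tag[len(GROUP_PREFIX):]].append(name)
    return dict(levels)


factor_levels = split_by_group_tag(collection)
for level, names in factor_levels.items():
    print(level, names)
```

Each sample stays a single dataset inside the same collection; the tags only decide which factor level it is assigned to, and a dataset can carry more than one group tag.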

For completeness, you can also use a count matrix with some tools, like Limma, and provide a Factor file to split out conditions/factors. DESeq2 won’t accept a matrix, but it seems worth mentioning that those same original counts can be used with both tools and are easy to transform. → Hands-on: 2: RNA-seq counts to genes / Transcriptomics
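As a rough picture of what "easy to transform" means, the sketch below joins per-sample count tables into one gene-by-sample matrix using only the Python standard library. It is an illustration of the idea, not what any Galaxy tool does internally. The sample names, gene IDs, counts, and the `build_count_matrix` helper are all made up, and the two-column tab-separated layout (gene ID, count) is an assumption about the count files.

```python
import csv
import io

# Hypothetical per-sample count tables (gene ID, count) as tab-separated text.
# In practice these would be the individual count files in the collection.
sample_counts = {
    "treated_1": "geneA\t10\ngeneB\t0\n",
    "treated_2": "geneA\t12\ngeneB\t3\n",
    "untreated_1": "geneA\t5\ngeneB\t7\n",
}


def build_count_matrix(samples):
    """Join per-sample (gene, count) tables into one gene-by-sample matrix."""
    names = list(samples)
    matrix = {}
    for col, name in enumerate(names):
        reader = csv.reader(io.StringIO(samples[name]), delimiter="\t")
        for gene, count in reader:
            # Genes missing from a sample keep a count of 0 in that column.
            matrix.setdefault(gene, [0] * len(names))[col] = int(count)
    header = ["gene"] + names
    rows = [[gene] + counts for gene, counts in sorted(matrix.items())]
    return header, rows


header, rows = build_count_matrix(sample_counts)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

The per-sample files and the matrix hold the same numbers; only the layout differs, which is why the same counts can feed either style of tool.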

In short, once you have your counts all together in a collection, there is a lot you can do going forward to split the data out for the DE! The tutorial you are following is showing the simplest way to process the data, but it is great that you are exploring the others since that is how you’ll probably be using the tools later on with larger batches of work.

How to confirm?

Review the job Details view (using the i-icon) for the different jobs. The top summary table of inputs will list out what was originally selected and used, and the job stdout log will include the data matrix constructed from those inputs (the R data structure). We had a discussion about this last week with some screenshots showing exactly what/where to review. → Clarification on DESeq2 Factor Level Direction in Galaxy - #2 by jennaj

Hopefully this helps, but let us know if it actually does! If I misunderstood, would you please explain a bit more? Maybe with screenshots? Thanks! :slight_smile:

Hi @jennaj. Thank you for your comprehensive response which is really helpful - good to know I don’t need to be extracting the individual datasets from the collections! I can see using tags will be really useful.
Just one further question: if I were using tags, could I specify more than one collection from which they can be chosen? I think probably yes, given the optional plural in the field label: “Count file(s) collection”.

The tool is expecting all of the files to be inside the same single collection.

But if you currently have two or more collections, it is easy to rearrange the data, and exact copies are just clones (they do not consume any extra quota space). See the Collection Operations tools. You would likely be using Merge Collections. What each tool does is described in the Help section at the bottom of its tool form, and we have an overview with some ideas about how to use these manipulations together here. → Hands-on: Using dataset collections / Using Galaxy and Managing your Data

In short, your data files each exist on the file system, and a “collection” is a group of references to those files. You can organize these references to optimize how data streams through tools (example: discard failures in large batches and keep going) or how it is organized for different parameters (example: only the factor groups you care about).
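If it helps to picture "a group of references", here is a small Python analogy. It is not Galaxy code: the `Dataset` class, the file names, and the sizes are invented for illustration. Merging two lists creates a new grouping, but the items inside are the very same objects, so nothing is stored twice.

```python
class Dataset:
    """Stand-in for one count file on disk (hypothetical name and size)."""

    def __init__(self, name, size_bytes):
        self.name = name
        self.size_bytes = size_bytes


d1 = Dataset("treated_1.counts", 1000)
d2 = Dataset("treated_2.counts", 1200)
d3 = Dataset("untreated_1.counts", 900)

# Two "collections": each is only a list of references to existing datasets.
treated = [d1, d2]
untreated = [d3]

# "Merging" builds a new list that points at the same objects; nothing is copied.
merged = treated + untreated

# Count storage once per distinct dataset, however many collections reference it.
distinct = {id(d): d for d in treated + untreated + merged}
total_bytes = sum(d.size_bytes for d in distinct.values())

print(len(merged), "elements in merged collection")
print(total_bytes, "bytes stored in total")
```

The same reasoning is why rearranging collections is cheap: you are reshuffling references, not duplicating the underlying files.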

Glad all this is helping and hope this bit does too! :slight_smile:
