Clarification on DESeq2 Factor Level Direction in Galaxy

Dear Galaxy Support Team,

I am writing to request clarification about how DESeq2 handles factor levels in the Galaxy interface. I recently ran a differential expression analysis in Galaxy using DESeq2, and I want to verify the direction of the log2 fold change values.

Specifically, I would like to confirm:

  1. Whether Galaxy always computes log2 fold change as Factor Level 1 – Factor Level 2.

  2. Whether the order in which samples are selected for Level 1 and Level 2 determines the direction of the fold change.

  3. How I can verify from the job parameters (or tool metadata) which samples were assigned to Level 1 and Level 2 in my completed DESeq2 job.

I am checking this because one of my known marker genes (per2, ENSDARG00000034503) shows higher VST expression in KO samples but the DESeq2 log2FC is negative, suggesting that the direction may be reversed (i.e., DESeq2 may have been run as WT – KO instead of KO – WT).

Could you please confirm how Galaxy determines the factor level assignment and how I can confirm the direction for my specific analysis?

Thank you very much for your help.

Hello @aaku7

These are good questions! The statistics are reported for the primary factor (the first factor entered on the form), with the first factor level entered compared against the second. You can think of this as a “top-down” organization: the counts for the group you want the fold changes reported for are input first.

Help

This is from the Help section on the tool form. It can be easy to miss!

Count Files

DESeq2 takes count tables generated from featureCounts, HTSeq-count or StringTie as input. Count tables must be generated for each sample individually. One header row is assumed, but files with no header (e.g. from HTSeq) can be input with the Files have header? option set to No. DESeq2 is capable of handling multiple factors that affect your experiment. The first factor you input is considered as the primary factor that affects gene expressions. Optionally, you can input one or more secondary factors that might influence your experiment. But the final output will be changes in genes due to primary factor in presence of secondary factors. Each factor has two levels/states. You need to select appropriate count table from your history for each factor level.

The following table gives some examples of factors and their levels:

Factor | Factor level 1 | Factor level 2
Treatment | Treated | Untreated
Condition | Knockdown | Wildtype
TimePoint | Day4 | Day1
SeqType | SingleEnd | PairedEnd
Gender | Female | Male

Note: Output log2 fold changes are based on primary factor level 1 vs. factor level 2. Here the order of factor levels is important. For example, for the factor ‘Treatment’ given in above table, DESeq2 computes fold changes of ‘Treated’ samples against ‘Untreated’, i.e. the values correspond to up or down regulations of genes in Treated samples.
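
To make the sign convention in that note concrete, here is a minimal Python sketch. It uses a naive ratio of group means with a pseudocount, which is an assumption for illustration only and not how DESeq2 estimates fold changes (DESeq2 fits a model to the counts). The function name `naive_log2fc` and all of the numbers are made up. The only point is the sign: whichever group is entered as factor level 1 goes in the numerator.

```python
import math


def naive_log2fc(level1_counts, level2_counts, pseudocount=1.0):
    """Naive log2(mean(level 1) / mean(level 2)).

    Illustration only: this is NOT DESeq2's model-based estimate.
    """
    mean1 = sum(level1_counts) / len(level1_counts)
    mean2 = sum(level2_counts) / len(level2_counts)
    return math.log2((mean1 + pseudocount) / (mean2 + pseudocount))


# Hypothetical normalized counts for a single gene (made-up numbers).
ko = [400.0, 420.0, 380.0]
wt = [100.0, 90.0, 110.0]

ko_vs_wt = naive_log2fc(ko, wt)  # KO entered as factor level 1
wt_vs_ko = naive_log2fc(wt, ko)  # WT entered as factor level 1

print(f"KO as level 1: {ko_vs_wt:+.2f}")
print(f"WT as level 1: {wt_vs_ko:+.2f}")
```

With these made-up numbers the gene is higher in KO, so the value is positive when KO is level 1 and negative, with the same magnitude, when WT is level 1. A gene that looks higher in KO but has a negative log2FC is what you would expect to see if WT had been entered as factor level 1.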

You can review the command line used in Galaxy by going into the summary on the Details page (using the i-icon). You’ll find a table summary of the inputs and parameters, then the command line string submitted to the underlying tool.

The command abstractions can be a bit confusing, but inside of the Tool Standard Output you’ll see some of the intermediate tables constructed from the form inputs. You can output some of the other intermediate files under Optional Outputs for use in RStudio or other direct queries, too.

Example

From one of my histories, this was the data input on the form:

Tool Parameters

Input Parameter | Value
how | datasets_per_level
Specify a factor name, e.g. effects_drug_x or cancer_markers | My1WORDFacterName
Specify a factor level, typical values could be ‘tumor’, ‘normal’, ‘treated’ or ‘control’ | MyFirstFactor
Counts file(s) |
Specify a factor level, typical values could be ‘tumor’, ‘normal’, ‘treated’ or ‘control’ | MySecondFacter
Counts file(s) |

And this is that same data organized into a count table during runtime and reported into the stdout logs.

DESeq2 run information

sample table:
My1WORDFacterName
GSM461177 MySecondFacter
GSM461178 MySecondFacter
GSM461181 MyFirstFactor
GSM461180 MyFirstFactor

design formula:
~My1WORDFacterName
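
If it helps, a sample table like the one above can be grouped by factor level with a few lines of Python once you have copied it out of the job's stdout. This is only a sketch: it assumes a single factor and the whitespace-separated layout shown in my example, and the helper name `group_samples_by_level` is hypothetical. Your own stdout may be laid out differently, for example if you have secondary factors.

```python
from collections import defaultdict

# Sample table text pasted from the job's stdout (here, the example above).
stdout_table = """\
My1WORDFacterName
GSM461177 MySecondFacter
GSM461178 MySecondFacter
GSM461181 MyFirstFactor
GSM461180 MyFirstFactor
"""


def group_samples_by_level(table_text):
    """Return (factor_name, {level: [samples]}) from a pasted sample table.

    Assumes one header line holding the factor name, followed by
    'sample level' pairs separated by whitespace.
    """
    lines = [ln.strip() for ln in table_text.splitlines() if ln.strip()]
    factor_name = lines[0]
    groups = defaultdict(list)
    for line in lines[1:]:
        sample, level = line.split()
        groups[level].append(sample)
    return factor_name, dict(groups)


factor, groups = group_samples_by_level(stdout_table)
print("Factor:", factor)
for level, samples in groups.items():
    print(level, "->", ", ".join(samples))
```

You can then compare the samples listed under each level with the datasets you selected for the first and second factor level on the tool form.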


This seems to line up with what you observed in your data, but does it actually help? As a place to start, try reviewing the stdout logs for your job to see if you can find the sample table. Please let us know how it goes, or if you need more help! :slight_smile: