I am trying to perform differential expression analysis with DESeq2. My tabular input looks like this:
As soon as I start the DESeq2 run, I get the following error:
Error in data.frame(sample = basename(filenames_in), filename = filenames_in, : duplicate row.names: /corral4/main/objects/6/6/a/dataset_66a66256-93c6-4d3f-9c06-6dd77297b3d7.dat
For now I don’t know whether my table is wrong or whether the parameters I am using are incorrect. I have set the following:
1: Factor level= Cont
2: Factor level= LPS
Files have headers= YES
Advanced setting= all default EXCEPT Pre-filtering= Enabled; Pre-filter value= 1
Can anyone provide me with some advice on this one?
This kind of error comes directly from the underlying tool, and indicates that the tool had trouble parsing the inputs.
You can search for these errors at the Bioconductor support forum to find prior Q&A answered directly by the authors. Example: https://support.bioconductor.org/post/search/?query=deseq2+duplicate+row.names. Some will have Q&A at this forum as well if it happened to come up before. Example: Search results for 'deseq2+duplicate+row.names' - Galaxy Community Help
The screenshot is a bit blurry, and is just one part of the input.
I would suggest starting with these items first:
- Does each sample have a unique label?
- Does each line in the file have the same number of columns?
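Both checks above can also be run on a downloaded copy of the table. Here is a minimal Python sketch, assuming a tab-delimited file whose first line holds the sample labels. The function name `check_table` and the example data are made up for illustration; they are not part of Galaxy or DESeq2.

```python
import csv
import io
from collections import Counter


def check_table(handle, delimiter="\t"):
    """Return (duplicate_labels, ragged_lines) for a delimited table.

    duplicate_labels: header labels that occur more than once.
    ragged_lines: 1-based line numbers whose column count differs
    from the header's column count.
    """
    rows = list(csv.reader(handle, delimiter=delimiter))
    if not rows:
        return [], []
    header = rows[0]
    counts = Counter(header)
    duplicates = sorted(label for label, n in counts.items() if n > 1)
    ragged = [i for i, row in enumerate(rows, start=1) if len(row) != len(header)]
    return duplicates, ragged


# Made-up example: "Cont_1" appears twice and the last line lacks a column.
example = "GeneID\tCont_1\tCont_1\tLPS_1\ngeneA\t5\t7\t9\ngeneB\t0\t1\n"
duplicates, ragged = check_table(io.StringIO(example))
print("duplicate labels:", duplicates)
print("lines with a different column count:", ragged)
```

For a real file, pass `open("counts.tabular", newline="")` instead of the `io.StringIO` object (the file name is a placeholder).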
You could also share more details about the entire job. The “inputs” include the files chosen on the form plus any labels entered directly on the form. It is easier for others to offer actionable help when all of those details are in context.
For the how-to, please see Troubleshooting errors → Sharing your History
You could also compare to examples on the tool form (scroll down to the help section), or in the “end to end” tutorials here: Galaxy Training!
Let’s start there
Here is a better figure of my input data:
Yes, each sample has a unique label and each column has the same number of rows.
This is the link to my history: Galaxy
The error we are discussing comes from jobs 191-193; the input table is 190.
Thanks for your help, appreciate this!
Thanks for sharing the history. I see the problem, and should have seen it originally!
DESeq2 expects one distinct count file per sample. The sample name is read in from the files, and the factor label is input on the form.
Both limma and edgeR have an option to accept a count matrix as well. Your data would be correctly formatted for that. Then you can either supply a file with the factor organization, or use the form to capture the same. Examples for this use case are on those tool forms (scroll down to the help section).
Small warning: either of those two latter tools may not be happy with lines where the counts are all 0. You could remove those lines (entirely) from your matrix if that happens. If you are not sure how to do that in Galaxy, this tutorial maps common unix utilities/functions to tools: Data Manipulation Olympics
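If you would rather filter a downloaded copy of the matrix, the removal of all-zero lines can be sketched in Python as follows. This assumes a tab-delimited matrix with a header line, gene identifiers in the first column, and numeric counts in every other column; `drop_all_zero_rows` is a hypothetical helper name, not a Galaxy tool.

```python
import csv
import io


def drop_all_zero_rows(handle, delimiter="\t"):
    """Return the header plus every row that has at least one non-zero count.

    Assumes column 1 holds the gene identifier and all remaining columns
    hold numeric counts; a non-numeric value raises ValueError.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    kept = [header]
    for row in reader:
        if any(float(value) != 0 for value in row[1:]):
            kept.append(row)
    return kept


# Made-up example: geneB has zero counts in every sample and is dropped.
matrix = "GeneID\tCont_1\tLPS_1\ngeneA\t5\t9\ngeneB\t0\t0\n"
for row in drop_all_zero_rows(io.StringIO(matrix)):
    print("\t".join(row))
```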
Thanks for this update. So if I want to find differentially expressed genes with DESeq2 for 4x LPS samples vs 4x controls, I have to supply 8 tables, one per sample?
Factor label is just what I write on the form and sample name is picked up at the top of the table. For one sample the table will look like this:
Is this correct? If not, could you send me an example of what it should look like?
Thanks again for your help!
Yes, all of that is true.
Four of your samples would go under one factor condition, and four would go under another, with each supplied as an individual file.
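Going from one count matrix to individual per-sample files can be sketched in Python as below. This is only an illustration under assumptions: the matrix is tab-delimited with gene identifiers in the first column and one column per sample, and each per-sample table is written as two columns (gene identifier, count) with the sample name in the header line. `split_matrix` and the example data are made up.

```python
import csv
import io


def split_matrix(handle, delimiter="\t"):
    """Split a genes-by-samples count matrix into one table per sample.

    Returns a dict mapping each sample name to the text of a two-column
    table: gene identifier and that sample's count, with a header line.
    Sample names must be unique, otherwise later columns overwrite earlier ones.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    id_column, samples = header[0], header[1:]
    tables = {sample: [[id_column, sample]] for sample in samples}
    for row in reader:
        gene, counts = row[0], row[1:]
        for sample, count in zip(samples, counts):
            tables[sample].append([gene, count])
    return {
        sample: "".join(delimiter.join(r) + "\n" for r in rows)
        for sample, rows in tables.items()
    }


# Made-up example with one control and one LPS sample.
matrix = "GeneID\tCont_1\tLPS_1\ngeneA\t5\t9\ngeneB\t0\t3\n"
per_sample = split_matrix(io.StringIO(matrix))
print(per_sample["LPS_1"])
```

Each value in `per_sample` could then be saved to its own file and uploaded, with the four control files placed under one factor level and the four LPS files under the other.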
Also, you might want to adjust those gene identifiers. Certain characters, like dots and pipes, have caused problems for others. Using R-friendly terms tends to work best with all Bioconductor tools: terms are oneWord with no spaces, do not start with a number, and contain only alphanumeric characters plus underscores. The Galaxy wrapper tries to handle alternative naming but can’t catch everything. And this might not even show up until a downstream tool is used.
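One way to apply the naming rules above to a downloaded table is a small renaming function. The Python sketch below is an assumption-laden illustration: `r_friendly`, the `g_` prefix, and the example identifiers are made up, and two different identifiers can end up with the same cleaned name, so check for duplicates afterwards.

```python
import re


def r_friendly(name):
    """Rewrite an identifier to use only letters, digits, and underscores.

    Any other character (dot, pipe, space, dash, ...) becomes an underscore.
    If the result does not start with a letter, the prefix "g_" is added.
    """
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if not re.match(r"[A-Za-z]", cleaned):
        cleaned = "g_" + cleaned
    return cleaned


# Made-up identifiers containing a dot, a pipe, and a leading digit.
for original in ["ENSG00000141510.17|TP53", "1-Mar"]:
    print(original, "->", r_friendly(original))
```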