How to solve problems in edgeR?

Hey, I created a file with three replicates of RNA counts of a total fraction and a cytoplasmic fraction (so 2x 3 replicates). It is a tab-text file, uploaded as tabular.
I used this as the single count matrix file and wrote a metadata file for the count matrix, with the sample/replicate names in the first column and the condition (total or cytoplasm) in the second column. I don't have an annotation file, and for Contrast of Interest I entered
C-T for Cytoplasm and Total. I also used T and C as the names of the conditions, and T1, T2, T3 and C1, C2, C3 as the names of the replicates. But when I ran the job I got the following warning and error:
Warning message:
In Sys.setlocale(“LC_MESSAGES”, “en_US.UTF-8”) :
OS reports request to set locale to “en_US.UTF-8” cannot be honored
Error in makeContrasts(contrasts = contrast_data, levels = design) :
The levels must by syntactically valid names in R, see help(make.names). Non-valid names:

I tried different names for the samples and conditions, but it still doesn't work. What did I do wrong? How can I fix this?

Here are the things I tried.

I'm grateful for any ideas :slight_smile:


Welcome @AnneBollschu1

Thanks for explaining and for sharing your history! Very helpful.

These forms can be a bit complicated to set up since they are attempting to model a sample sheet for you from the input files and the content entered on the form.

Below are three of your jobs that had problems. I’m going to explain what I see that went wrong, and maybe that helps you to solve the problem. I’m looking at the job details pages (i-info icon) and the rerun views (rerun icon). These show all of the parameters and file contents in a summary, and are always the first places to review when something goes wrong.

One of your DESeq2 jobs.

There are a few things going on here.

  1. DESeq2 expects each count file to represent a single sample, not multiple samples. This is called out down in the Help section but is easily missed!
  2. The batch factor sample sheet will have two columns.
    • The first column has a sample identifier that is also included in the header of a count file (the second column’s header; the first is always “Genes” or similar, and the important part is that it is the same for all count files).
    • The second column is the factor level. This should match the customized factor level you need to enter on the tool form. Right now you have these as the defaults, and because these are both “the same” the tool didn’t get far enough along to report about the other conflicts.
  3. The way the genes are named is not R compliant. All of your examples have this problem. I’ll explain more at the end of this post.
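
To make item 2 a bit more concrete, here is a rough Python sketch of the kind of consistency check I am describing. It is not part of the DESeq2 tool: the function name `check_sample_sheet` is made up, the sample names are the ones from your question, and I am assuming each count file has exactly two columns (“Genes” plus one sample).

```python
def check_sample_sheet(sheet_rows, count_file_headers):
    """Compare a two-column sample sheet against the headers of the count files.

    sheet_rows: list of (sample_id, factor_level) tuples.
    count_file_headers: one header per count file, as a list of column names.
    """
    # The sample identifier is assumed to be the second column of each count file.
    header_samples = {columns[1] for columns in count_file_headers}
    missing = [sample for sample, _ in sheet_rows if sample not in header_samples]
    levels = sorted({level for _, level in sheet_rows})
    return missing, levels


# Hypothetical inputs: six samples in the sheet, but only five count files.
sheet_rows = [("T1", "T"), ("T2", "T"), ("T3", "T"),
              ("C1", "C"), ("C2", "C"), ("C3", "C")]
count_file_headers = [["Genes", name] for name in ["T1", "T2", "T3", "C1", "C2"]]

missing, levels = check_sample_sheet(sheet_rows, count_file_headers)
print("Samples without a count file:", missing)
print("Factor levels found:", levels)
```

If `missing` is not empty, or `levels` has fewer than two entries, the sample sheet and the count files do not line up yet.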

Then, one of your EdgeR jobs

This one is simpler to solve.

  1. Notice how the help specifies that you list a classification for each sample (column) in your count file.
  2. You have six columns representing six samples. Something like this would work: c1,c1,c1,c2,c2,c2
  3. Correct the gene names.
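
If it helps, the list from item 2 can be derived from the header line of the count matrix. This is only a sketch under some assumptions: the first column holds the gene identifiers, the replicate names end in a number (T1, T2, ... as in your question), and `groups_from_header` is a name I made up for illustration.

```python
import re


def groups_from_header(header_line, sep="\t"):
    """Return one group label per sample column by dropping the trailing replicate number."""
    columns = header_line.rstrip("\n").split(sep)
    samples = columns[1:]  # the first column holds the gene identifiers
    return [re.sub(r"\d+$", "", name) for name in samples]


# Hypothetical header line using the replicate names from the question.
header = "Genes\tT1\tT2\tT3\tC1\tC2\tC3"
factor_list = ",".join(groups_from_header(header))
print(factor_list)
```

The printed string has one entry per sample column, in the same order as the columns in the file, which is what the form is asking for.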

The last EdgeR job (the one you are posting about)

This one looks great with the setup! The problem is the gene identifiers. I see the tool reported a problem with these, but guessed wrong about which specific data point had a format issue.

The error messages in the logs (i-info, then scroll down to stdout) come straight from the underlying tool. Sometimes a search at the Bioconductor forum can turn up the odd ways a problem can be caught, but not reported precisely, by these tools. So, try a search, but then come back here with questions when working in Galaxy. → https://support.bioconductor.org/post/search/?query=The+levels+must+by+syntactically+valid+names+in+R

Correcting data keys to be R compliant

Any data key used in R should fit the format of:

alphanumeric characters, with optional underscores, and not starting with a number.

The Galaxy tool form has some tips about this:

NOTE: Please only use letters, numbers or underscores (case sensitive), and the first character of each sample, factor and group must be a letter
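
A quick way to test sample, factor and group names against that note before submitting a job is a small regular expression check. This is just a sketch of the rule as the tool form words it (letters, numbers or underscores, starting with a letter), not the exact check that R or the wrapper performs, and the example names are made up.

```python
import re

# Letters, digits or underscores only, and the first character must be a letter.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_form_compliant(name):
    """Return True if the name follows the rule quoted from the tool form."""
    return NAME_PATTERN.fullmatch(name) is not None


# Hypothetical sample, factor and group names.
names = ["Total", "Cytoplasm", "T1", "1T", "total fraction", "cyto+"]
invalid = [name for name in names if not is_form_compliant(name)]
print(invalid)
```

Anything that ends up in `invalid` is worth renaming in both the count matrix header and the metadata file so the two stay in sync.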

To complicate it, “gene” identifiers are a bit special in that they can be all number characters!

The Galaxy tool wrapper itself will make an attempt to sanitize the values, but this doesn’t always resolve things fully. If you get an error, fixing your important data keys directly can not only solve the issue, but also make sure your data is passed to the underlying tool as expected (and wanted!).

I tried to sanitize your gene values with a simple sed command

But the rerun then had a very clear message about duplicated gene identifiers. Maybe substitute + with plus and - with negative instead to see if that is enough?
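
Here is a rough Python sketch of that idea, so you can see why the duplicates show up. It is not the command I ran: the gene identifiers are invented, and the function names are only for illustration. Spelling out + and - first keeps those identifiers distinct, while replacing every other symbol with an underscore can still collapse two identifiers into one, so a duplicate check afterwards is useful.

```python
import re
from collections import Counter


def sanitize_gene_id(gene_id):
    """Spell out + and -, then replace any other disallowed character with an underscore."""
    cleaned = gene_id.replace("+", "plus").replace("-", "negative")
    return re.sub(r"[^A-Za-z0-9_]", "_", cleaned)


def find_duplicates(ids):
    """Return the identifiers that occur more than once, sorted alphabetically."""
    counts = Counter(ids)
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up identifiers, only to show how collisions can appear after sanitizing.
raw_ids = ["GeneA+", "GeneA-", "GeneB.1", "GeneB_1"]
clean_ids = [sanitize_gene_id(g) for g in raw_ids]
duplicates = find_duplicates(clean_ids)
print(clean_ids)
print(duplicates)
```

If `duplicates` is not empty, the replacement rules need another pass before the matrix goes back into the tool.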

Galaxy also has less complicated find/replace tools, and you can enter interactive environments for full control. We have tutorials to get you started.

If you are merging this data later with other files, remember to change all of your common keys to be the same, so that your joins will later work out.

Hope this helps and we can follow up more! You had some good examples to clarify so I’m using this as a bit of an example for others who later might have similar problems. Thanks! :slight_smile:

Thanks a lot, I will try to fix this :slight_smile:
