A fatal error with DESeq2

Hi,

I tried running DESeq2 using output data from upstream Salmon analysis. The samples I’m investigating are RNA-seq, paired-end. The run failed and the following error message came up when investigating the bug reports:

I saw that other people have encountered similar issues before, but with different errors (error message from DESeq2/EdgeR - #4 by jennaj). One recommendation for that other type of error was to try rerunning the datasets with the option “Files have a header” turned off.

I tried that (just in case), but still had the same error message (only the line number changed). So maybe I did something wrong upstream.

Also, not sure if this is of relevance here, but the upstream Salmon report mentioned that there are newer versions of Salmon, with bug fixes, etc. Could this be some sort of bug?

If anybody has any suggestions on how to sort this out, that would be much appreciated!

Hi @Egle

Double check that these settings are correct:

  1. Choice of Input data → TPM values (e.g. from kallisto, sailfish or salmon)
  2. Program used to generate TPMs → Salmon
  3. Gene mapping format → input the same annotation (GTF or tabular) as input to Salmon.

If you don’t have item 3, you’ll need to back up and rerun Salmon with that annotation to produce the proper inputs for DESeq2.

The option on the Salmon form is: File containing a mapping of transcripts to genes

If this file is provided, Salmon will output both quant.sf and quant.genes.sf files, where the latter contains aggregated gene-level abundance estimates. The transcript-to-gene mapping should be provided either as a GTF file or in a simple tab-delimited format where each line contains the name of a transcript and the gene to which it belongs, separated by a tab.

The quant.genes.sf files are the Salmon output that is input to DESeq2, along with the annotation providing transcript-to-gene information.



Please be aware that DESeq2 expects all sample counts to be in individual dataset inputs. This error might come up if a matrix of counts was input. The “8 elements” is not expected – that number of columns does not match the column count of any of the expected inputs:

  1. quant.genes.sf files have 5 columns, and whether they have a single header line or not is specified on the tool form
  2. transcript to gene has 2 columns and should not contain any header lines (# or others)
  3. GTF has 9 columns and should not contain any header lines (# lines).
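If it helps, the column counts listed above can be checked outside Galaxy before rerunning. This is only a minimal sketch: the helper names (`inspect_table`, `looks_like`) and the miniature inputs are made up for illustration and are not part of DESeq2, Salmon, or Galaxy.

```python
import io

def inspect_table(handle):
    """Summarise a tab-delimited file: which column counts occur, and how many
    '#' comment lines it contains. Blank lines are ignored."""
    column_counts = set()
    comment_lines = 0
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            comment_lines += 1
            continue
        column_counts.add(len(line.split("\t")))
    return {"column_counts": column_counts, "comment_lines": comment_lines}

def looks_like(handle, expected_columns):
    """True if every data line has the expected column count and there are no '#' lines."""
    summary = inspect_table(handle)
    return summary["column_counts"] == {expected_columns} and summary["comment_lines"] == 0

# Hypothetical miniature inputs, for illustration only.
quant_genes = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "G1\t1000\t800.0\t12.5\t40\n"
)
tx2gene = io.StringIO("T1\tG1\nT2\tG1\n")
count_matrix = io.StringIO("gene\ts1\ts2\ts3\ts4\ts5\ts6\ts7\nG1\t1\t2\t3\t4\t5\t6\t7\n")

quant_ok = looks_like(quant_genes, 5)     # a per-sample Salmon gene table
tx2gene_ok = looks_like(tx2gene, 2)       # a transcript-to-gene table
matrix_ok = looks_like(count_matrix, 5)   # an 8-column matrix does not match
```

With real data, the same functions could be given `open("your_file.tabular")` instead of the `io.StringIO` examples. A file that reports more than one column count, or a count other than 5, 2, or 9, would be the first thing to look at.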

Try this:

  1. Check that all of the inputs match the help above
  2. Rerun as needed, and be sure to use the most current Galaxy tool-wrapper version of all tools. “Version” can be found at the upper right corner of any tool form. Messages about tool versions in the stderr/stdout logs come from the underlying tool and usually are not relevant when running a tool wrapper in Galaxy.
  3. If that still fails, please send in a bug report from the error(s) to the server admins
  4. At the same time, you can follow up here by posting back a #sharing-your-history link. That link can be posted back publicly, or you can ask a moderator to start a private chat. We’ll need all inputs and outputs to be undeleted, and for you to note which dataset numbers are involved.

Let’s start there :slight_smile:


Dear @jennaj,

Thank you very much for your advice. I have now tried running both Salmon and DESeq2 a few more times, but the issue remains.

Double check that these settings are correct:

  1. Choice of Input data → TPM values (e.g. from kallisto, sailfish or salmon)
  2. Program used to generate TPMs → Salmon
  3. Gene mapping format → input the same annotation (GTF or tabular) as input to Salmon.

If you don’t have item 3, you’ll need to back up and rerun Salmon with that annotation to produce the proper inputs for DESeq2.

Yes, the first run was wrong. I have now corrected 1 and 2. Unfortunately, for 3 I only have tabular data from BioMart. I reran everything several times, but the issue remained. When I removed the title row from the transcript/gene ID table, the error message changed to the following:

Fatal error: An undefined error occurred, please check your input carefully and contact your administrator.

Warning message:
In Sys.setlocale("LC_MESSAGES", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
reading in files with read.delim (install 'readr' package for speed up)
1 2 3 4
Error in `$<-.data.frame`(`*tmp*`, "TXNAME", value = character(0)) :
replacement has 0 rows, data has 274080
Calls: get_deseq_dataset -> $<- -> $<-.data.frame

Please be aware that DESeq2 expects all sample counts to be in individual dataset inputs. This error might come up if a matrix of counts was input. The “8 elements” is not expected – that number of columns does not match the column count of any of the expected inputs:

  1. quant.genes.sf files have 5 columns, and whether they have a single header line or not is specified on the tool form
  2. transcript to gene has 2 columns and should not contain any header lines (# or others)
  3. GTF has 9 columns and should not contain any header lines (# lines).

I think my input was individual datasets (nothing was merged, concatenated, etc.). Also, I don’t have any GTF data, but the formats of other datasets seem to be correct and the number of columns seems to match.

As advised, I sent the bug report.

These are the datasets I used for the most recent round of runs (the ones that resulted in altered error message):

Salmon:
Transcripts fasta file: 9 (Fasta normalised reference transcriptome).
Input datasets: paired data, 1-8 (trimmed reads; 1,3,5,7 at Mate pair 1; 2,4,6,8 at Mate pair 2).
File containing a mapping of transcripts to genes: 10 (transcript ID-gene ID table from BioMart. I think I changed it to tab-delimited. I also removed the title row).
Outputs were 11-18.

DESeq2:

Comparison group 1: datasets 12 and 16.
Comparison group 2: datasets 14 and 18.

Gene mapping format - tabular.
Tabular file with Transcript-ID to Gene-ID mapping: dataset 10 (the BioMart table, with transcript and gene ID versions).

If full history is required, I would prefer to share it privately.

An update. I spent some time playing around with “transcript to gene” table formats, swapping “transcript/gene stable ID versions” with “transcript/gene IDs” in various combinations, different ways of importing the table (TSV instead of CSV), converting the table to tab-delimited format, adjusting the titles, removing the titles etc.
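For anyone repeating the table clean-up outside Galaxy, the conversion could look roughly like the sketch below. It assumes a comma-separated BioMart export with one title row and the gene ID in the first column; the function name `to_tx2gene`, the column positions, and the IDs are hypothetical and should be adjusted to the actual export.

```python
import csv
import io

def to_tx2gene(handle, delimiter=",", has_header=True, transcript_col=1, gene_col=0):
    """Convert a delimited export into headerless, tab-delimited
    'transcript<TAB>gene' lines. Rows with a missing ID are skipped."""
    reader = csv.reader(handle, delimiter=delimiter)
    if has_header:
        next(reader, None)  # drop the title row
    out = []
    for row in reader:
        if len(row) <= max(transcript_col, gene_col):
            continue
        transcript, gene = row[transcript_col].strip(), row[gene_col].strip()
        if transcript and gene:
            out.append(f"{transcript}\t{gene}\n")
    return "".join(out)

# Hypothetical export with made-up IDs, for illustration only.
export = io.StringIO(
    "Gene stable ID version,Transcript stable ID version\n"
    "GENE0001.3,TX0001.2\n"
    "GENE0001.3,TX0002.1\n"
)
tx2gene_text = to_tx2gene(export)
```

The result puts the transcript ID first and the gene ID second on each line, with no title row, which is the layout the tabular option describes.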

Unfortunately, none of these worked for DESeq2, even though the upstream Salmon quantification ran fine. I suspect there was a mismatch between the Ensembl transcriptome FASTA (maybe I messed something up when normalising it, or used the wrong reference dataset) and the BioMart table (I tried various versions).
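One way to test that suspicion would be to compare the transcript IDs in a Salmon quant.sf output with the first column of the transcript-to-gene table. The sketch below is only an illustration under assumptions: the helper names and the IDs are made up, and it assumes that a version suffix, if present, is whatever follows the first dot in the ID.

```python
import io

def strip_version(identifier):
    """Drop a trailing '.N' style version suffix (assumed to follow the first dot)."""
    return identifier.split(".", 1)[0]

def read_first_column(handle, skip_header):
    """Collect the first tab-separated field of every non-blank line."""
    ids = set()
    for i, line in enumerate(handle):
        if skip_header and i == 0:
            continue
        line = line.rstrip("\r\n")
        if line:
            ids.add(line.split("\t")[0])
    return ids

def id_overlap(quant_ids, map_ids):
    """Count IDs shared exactly, and shared once version suffixes are removed."""
    exact = quant_ids & map_ids
    unversioned = {strip_version(i) for i in quant_ids} & {strip_version(i) for i in map_ids}
    return {
        "quant": len(quant_ids),
        "map": len(map_ids),
        "exact": len(exact),
        "unversioned": len(unversioned),
    }

# Hypothetical miniature files with made-up IDs.
quant_sf = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "TX0001.2\t1000\t800.0\t12.5\t40\n"
    "TX0002.1\t500\t300.0\t3.0\t5\n"
)
tx2gene = io.StringIO("TX0001\tGENE0001\nTX0002\tGENE0001\nTX0003\tGENE0002\n")

report = id_overlap(
    read_first_column(quant_sf, skip_header=True),
    read_first_column(tx2gene, skip_header=False),
)
```

In this made-up example no IDs match exactly, but two match once the suffix is removed, which would point to a versioned/unversioned mismatch between the two files. If both counts are zero, the two files probably come from different references or use different ID types.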

What worked was downloading the Ensembl reference transcriptome in GTF format. I used it unmodified in Salmon and in the downstream DESeq2 run, instead of the transcript-to-gene table. The result was this:

I guess my next question to @jennaj and other experts is the following: can I somehow validate if DESeq2 worked correctly? Of course, if something did not work, that will come out during downstream analyses. But maybe there is a tool or type of analysis that allows flagging up issues early?

Thank you!


Great! It looks like you have avoided technical issues.

This specific tutorial includes summary and graphing steps that can help to detect scientific issues. If you go up one level, there are more.


Hi @Egle
I tested DESeq2 using Salmon outputs and a transcript-to-gene map in tabular format, and I don’t see any issue with the tool. The results are in the history shared with you earlier. I renamed the transcript count tables for convenience.
Kind regards,
Igor


Dear @igor,

Thank you so much for investigating this! I think your analysis confirms there is something wrong with matching my “transcript-to-gene” table with the reference transcriptome.

Somebody had a similar issue using Kallisto output for DESeq2, so I should probably dig there.