DESeq2 Analysis with tximport Output - Matrix Dimension Error

I tried to run DESeq2 but encountered the following error message:

Error in DESeqDataSetFromMatrix(countData = tbl, colData = subset(sample_table,  :
ncol(countData) == nrow(colData) is not TRUE
Calls: get_deseq_dataset → DESeqDataSetFromMatrix → stopifnot

I think this error message means: the number of columns in the countData table does not equal the number of rows in the colData table. That is, the number of samples in my gene count data table does not match the number of rows in my sample information table.
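A quick way to see which side is off is to compare the sample columns of the count table with the rows of the sample table before running the tool. The following is only a minimal sketch: the two small tables, the sample names, and the helper names `read_tsv` and `check_dimensions` are made up for illustration, and it assumes the first column of the count table holds gene IDs and the sample table has a header row.

```python
import csv
import io

# Hypothetical, made-up inputs for illustration only.
COUNTS_TSV = "gene_id\tdrought_1\tdrought_2\tcontrol_1\ngeneA\t10\t12\t7\n"
SAMPLES_TSV = "sample\tcondition\ndrought_1\tdrought\ncontrol_1\tcontrol\n"


def read_tsv(text):
    """Parse tab-separated text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def check_dimensions(counts_text, samples_text):
    """Compare count-table sample columns with sample-table rows."""
    counts = read_tsv(counts_text)
    samples = read_tsv(samples_text)
    count_samples = counts[0][1:]                    # skip the gene ID column
    table_samples = [row[0] for row in samples[1:]]  # skip the header row
    return {
        "n_count_columns": len(count_samples),
        "n_sample_rows": len(table_samples),
        "only_in_counts": sorted(set(count_samples) - set(table_samples)),
        "only_in_sample_table": sorted(set(table_samples) - set(count_samples)),
    }


report = check_dimensions(COUNTS_TSV, SAMPLES_TSV)
print(report)
```

If the two counts differ, the `only_in_counts` and `only_in_sample_table` lists show which sample names have no partner.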
I tried to use DESeq2 with the Salmon output according to the relevant tutorial, but it didn’t work, so I created the attached files using tximport. However, I am still getting the above error.
The counts were created from the NumofReads columns of the TXIMPORT file.

Sample info file: [image]

Drought counts file: [image]

Control file: [image]

The inputs were fed into DESeq2 as shown below:

[image]

Hi @f_kurt

Try converting all of your data to tabular format, since it appears to be CSV right now. See FAQ: Converting the file format
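For files prepared outside Galaxy, the same CSV-to-tabular conversion can be sketched in a few lines. This is an illustration only, not what Galaxy runs internally; the function name `csv_to_tabular` and the example rows are hypothetical.

```python
import csv
import io


def csv_to_tabular(csv_text):
    """Convert comma-separated text to tab-separated text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Made-up example content.
example = "gene_id,drought_1,control_1\ngeneA,10,7\n"
tabular = csv_to_tabular(example)
print(tabular)
```

Using the `csv` module rather than a plain text replace keeps quoted fields that contain commas intact.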

Next time, you can just share the Galaxy history that contains the error, which would include all of the details. See: How to get faster help with your question.

Reviewing the data in other applications adds another layer of complexity, and can sometimes even introduce data content errors. If you are moving data in and out of a spreadsheet program, maybe explain why and we can help you to view your data in Galaxy without that extra step. Excel in particular doesn’t “play well” with the plain text files that R works with (the underlying programming language that all Bioconductor tools use).

I would strongly suggest using the exact files that were produced in Galaxy, without any extra steps through external applications, at least the first time, until you have some working baseline data. Then you’ll have a format/content reference for when you want to try using the tools another way.

  • The outputs from Salmon executed in Galaxy can be directly interpreted by DESeq2, so I’m not sure where the other csv files were sourced. You could run Salmon in Galaxy on one sample to see how the data are formatted if you need to reformat files created somewhere else. Examples are also on the tool form (scroll down to the Help section).
  • And, you don’t necessarily need to create a sample sheet at all – just use the Factor level options on the form – meaning, directly type in the Control/Drought information.

So, give this a try, and hope the advice helps! We can follow up if you get stuck, so sharing your history is still an option. :slight_smile:

Based on the tutorial [Hands-on: Whole transcriptome analysis of Arabidopsis thaliana / Transcriptomics], Salmon shouldn’t create any problems, but it does! If you look at the first DESeq2 analysis result, I got errors whether I used a GTF file or a TranscriptIDtoGeneID file. [I deleted all of the analysis and reran it so that you can easily see what is going on in the history.]

DESeq2 outputs 677 and 678 were created using Salmon gene quantification outputs for w040drought and w040control with TPM values from Salmon, and a GTF file.

The error message:

Import genomic features from the file as a GRanges object … OK
Prepare the ‘metadata’ data frame … OK
Make the TxDb object … OK
‘select()’ returned 1:1 mapping between keys and columns
reading in files with read.delim (install ‘readr’ package for speed up)
1 2 3 4 5 6 7 8 9 10 11 12
reading in files with read.delim (install ‘readr’ package for speed up)
1 2 3 4 5 6 7 8 9 10 11 12
Error in .local(object, …) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.

Example IDs (file): [SORBI_3K010100, SORBI_3K025800, SORBI_3K044406, …]

Example IDs (tx2gene): [EER90453, OQU90574, EER90454, …]

This can sometimes (not always) be fixed using ‘ignoreTxVersion’ or ‘ignoreAfterBar’.

Calls: get_deseq_dataset … tximport → summarizeToGene → summarizeToGene → .local
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The “phase” metadata column contains non-NA values for features of type
stop_codon. This information was ignored.

What I understand from this error message is that the transcript IDs in the quantification files are not present in the first column of the tx2gene file. In other words, the annotation file I am using does not match the transcript IDs in my quantification files. This is kind of weird since I used the same GTF file. The file is below, and as you instructed previously to someone on the forum, it contains no headers and is in a tab-delimited format:
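The mismatch can be checked directly by intersecting the two ID sets. Below is a minimal sketch that uses only the example IDs printed in the error message above; the function name `id_overlap` is hypothetical, and in practice the lists would come from the first column of a quantification file and the first column of the mapping.

```python
def id_overlap(quant_ids, tx2gene_ids):
    """Count how many quantification IDs appear in the mapping's first column."""
    quant = set(quant_ids)
    mapping = set(tx2gene_ids)
    shared = quant & mapping
    return {
        "n_quant": len(quant),
        "n_tx2gene": len(mapping),
        "n_shared": len(shared),
        "examples_unmatched": sorted(quant - mapping)[:3],
    }


# Example IDs copied from the error message.
quant_ids = ["SORBI_3K010100", "SORBI_3K025800", "SORBI_3K044406"]
tx2gene_ids = ["EER90453", "OQU90574", "EER90454"]

result = id_overlap(quant_ids, tx2gene_ids)
print(result)
```

With these IDs nothing is shared, which suggests the two files use different identifier types rather than different versions of the same identifier.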

image

Later on, I thought that perhaps if I used the TranscriptID to GeneID file, I could overcome the error! To do this, I prepared the TranscriptID to GeneID file using Gffread and Cut tools, and fed it into DESeq2 by changing the option to “Gene mapping format: Transcript-ID to Gene-ID mapping file.” I ran the analysis:
DESeq2 outputs 680 and 679 were created using Salmon gene quantification outputs for w040drought and w040control with TPM values from Salmon, and the TranscriptID to GeneID file.
Error message:
reading in files with read.delim (install ‘readr’ package for speed up)
1 2 3 4 5 6 7 8 9 10 11 12
Error in $<-.data.frame(*tmp*, “TXNAME”, value = character(0)) :
replacement has 0 rows, data has 48558
Calls: get_deseq_dataset → $<- → $<-.data.frame

I searched the error message and found that it indicates a mismatch in row counts when trying to add a new column to a data.frame in R. Specifically, it means that the new column (TXNAME) I am trying to add has 0 rows, while the existing data frame has 48,558 rows. This row count mismatch is causing the error.
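The same kind of length check can be imitated outside R to make the message easier to read. This sketch is an analogy only, not the DESeq2 wrapper's code; the table layout (a dict of equal-length columns) and the name `add_column` are assumptions for illustration.

```python
def add_column(table, name, values):
    """Add a column to a dict-of-lists table, refusing mismatched lengths."""
    n_rows = len(next(iter(table.values()))) if table else 0
    if len(values) != n_rows:
        raise ValueError(
            f"replacement has {len(values)} rows, data has {n_rows}"
        )
    table[name] = list(values)
    return table


# Made-up table with three rows; the new column is empty, as in the error.
table = {"TXID": ["tx1", "tx2", "tx3"]}
try:
    add_column(table, "TXNAME", [])
    message = ""
except ValueError as err:
    message = str(err)

print(message)
```

An empty replacement usually means an earlier lookup step returned nothing, so the input feeding that step is the place to look.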

Then I thought it would be better to create count files from Salmon transcript quant files. To do this, I used the tximport tool. After obtaining collective count files, I prepared them by splitting into two files for my DESeq2 analysis. As you suggested in your previous answers, I converted the files into tabular format.

DESeq2 outputs 682 and 681 were created using 666 Drought_counts_w040, 665 Drought_counts_w040, and 667 sample info.

Error message:

Error in DESeqDataSetFromMatrix(countData = tbl, colData = subset(sample_table, :
ncol(countData) == nrow(colData) is not TRUE
Calls: get_deseq_dataset → DESeqDataSetFromMatrix → stopifnot

Now, I am open to suggestions to resolve this problem. Here is the link to my history:

Note: You might suggest that I do this analysis with other tools. I know I can, but I like Salmon. Skipping the trimming procedure and avoiding reproducibility issues are nice…

Hi @f_kurt

Thanks for sharing the history.

You were very close with this observation. The issue was that the gene quantification inputs don’t contain transcript IDs, only gene IDs, so nothing in them can match the transcript column of the mapping.

Try this: input the transcript quantification files when using DESeq2. I ran a quick test in your history and that was enough.
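Why the transcript-level files are needed can be seen in a conceptual sketch of transcript-to-gene summarization. This is not tximport's actual implementation; the IDs, counts, and the function name `summarize_to_gene` are made up, and the mapping is assumed to go from transcript ID to gene ID.

```python
from collections import defaultdict


def summarize_to_gene(transcript_counts, tx2gene):
    """Sum per-transcript counts into per-gene counts via a tx -> gene map."""
    gene_counts = defaultdict(float)
    unmatched = []
    for tx_id, count in transcript_counts.items():
        gene_id = tx2gene.get(tx_id)
        if gene_id is None:
            unmatched.append(tx_id)  # ID not found in the mapping
            continue
        gene_counts[gene_id] += count
    return dict(gene_counts), unmatched


# Hypothetical mapping and inputs.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
transcript_level = {"tx1": 5.0, "tx2": 3.0, "tx3": 2.0}
gene_level = {"geneA": 8.0, "geneB": 2.0}

# Transcript-level input: every ID is found and summed per gene.
summed, missing = summarize_to_gene(transcript_level, tx2gene)

# Gene-level input used by mistake: no ID matches the transcript keys.
summed_wrong, missing_wrong = summarize_to_gene(gene_level, tx2gene)

print(summed, missing)
print(summed_wrong, missing_wrong)
```

The second call mirrors the situation in this thread: gene IDs are looked up where transcript IDs are expected, so every row is unmatched.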

Let me know once you have seen the history, grabbed a copy, or run this yourself and confirmed that it works, and I’ll unshare my copy of the history and purge it. :slight_smile:

First of all, I sincerely thank you for your help and quick support. However, I feel that there is another issue here (btw I ran the analysis with transcript quantification files without sample info file and it worked as you indicated). Normally, the forward and reverse counts of each experiment are expressed in a single column and under the code of the experiment in the DESeq2 results. Although the result file obtained using transcript files seems correct, we see that the forward and reverse columns are not merged when we look at the normalized values. In this case, the normalized values cannot be used for further analyses in their current state. I had mentioned above that the steps I took to overcome this did not yield results [due to DESeq2 tool error].

In this case, I will try to give all the data to Salmon again as forward and reverse files and obtain the Salmon values for each column. I hope that in this way, the forward and reverse counts will be combined into a single column, which may lead to smoother functioning of DESeq2. I will let you know the result! Thanks again!
