Couldn't annotate any gene Name using Annotate DESeq2/DEXSeq output tables

Corry_Mao · July 20, 2021, 8:58am

I followed the Reference-based RNA-Seq data analysis tutorial. However, in the "Extraction and annotation of differentially expressed genes" step, the gene name column is all NA in the results from Annotate DESeq2/DEXSeq output tables. The reference annotation in GFF/GTF format I used is DanRer 10 and 11, as the data is from zebrafish.

jennaj · July 20, 2021, 6:59pm

Hi @Corry_Mao

Are you following this tutorial? Reference-based RNA-Seq data analysis

And are you using your own data?

The NA means that the reference annotation either doesn’t have any appropriate attributes to add – or there could be some mismatch problem. Expand the Advanced Settings section and compare the attributes the tool is attempting to match up with your GTF dataset. If they don’t match, it needs to be addressed (maybe with a different annotation source).
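One way to do that comparison outside of Galaxy is to list which attribute keys actually occur in the GTF and check them against what the tool is asked to add. This is only a minimal sketch: it assumes a standard nine-column, tab-separated GTF, and the function names are made up for illustration (they are not part of Galaxy or the tool).

```python
import re
from collections import Counter

# Matches both quoted (key "value";) and unquoted (key value;) attribute pairs.
ATTR_RE = re.compile(r'(\S+)\s+"?([^";]+)"?\s*;')


def attribute_counts(gtf_lines, feature_type="exon"):
    """Count how often each attribute key occurs on rows of one feature type."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue  # assumption: 9 columns, attributes in the last one
        counts.update(key for key, _ in ATTR_RE.findall(fields[8]))
    return counts


def missing_attributes(gtf_lines, requested, feature_type="exon"):
    """Return the requested attribute names that never occur in the GTF."""
    counts = attribute_counts(gtf_lines, feature_type)
    return [name for name in requested if counts[name] == 0]
```

For example, `missing_attributes(open("annotation.gtf"), ["gene_biotype", "gene_name"])` (the file name is a placeholder) returns the attributes that would come back as NA because the GTF does not carry them at all.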

If you used the same GTF as in the upstream steps, any technical format problems would probably have shown up already, but this FAQ describes the expected format and may be helpful: Datatypes - Galaxy Community Hub

And this FAQ covers general troubleshooting for differential expression tool inputs: Help for Differential Expression Analysis - Galaxy Community Hub

Quote from the tool form that may also help:

What it does

This tool appends the output table of DESeq2/edgeR/limma/DEXSeq with gene symbols, biotypes, positions etc. The information you want to add is configurable. This information should be present in the input GTF/GFF file as attributes of the feature you choose. The DEXSeq-Count tool is used to prepare the DEXSeq-compatible annotation (flattened GTF file) from the input GTF/GFF. In this process, the exons that appear multiple times, once for each transcript, are collapsed to so-called exon counting bins. Counting bins for parts of exons arise when an exonic region appears with different boundaries in different transcripts. The resulting flattened GTF file contains pseudo exon ids per gene instead of per transcript. This tool maps the DEXSeq counting bins back to the original exon ids. This mapping is only possible if the input GTF/GFF file contains a transcript identifier attribute for the chosen feature type.

Inputs

Differential gene expression tables

At the moment, this tool supports DESeq2 and DEXSeq tool outputs.

Annotation

Annotation file in GTF or GFF3 format that was used for counting.

Outputs

Input tabular file with the chosen attributes appended as additional columns.

This may be the root of the problem or at least a contributing factor. If the inputs are not all a match (same genome version/build), a tool will not necessarily fail but may produce incorrect results (even without an “NA” or “not found” for the linked annotation).

This part isn’t clear. Are you mixing different genome builds in the same analysis? That will always be a problem (technical + scientific). You should be using the same reference annotation throughout the same analysis. It will be either danRer10 or danRer11 for all inputs – and those should be based on whichever genome version you originally mapped against. The associated reference annotation GTF used for counting and all other steps should be based on that same genome assembly (10 or 11). Your original fastq reads don’t belong to any particular genome assembly – just a specific species. Once you start processing the reads (mapping, etc), that is when using a particular genome assembly/build for all inputs matters.

Please review, then we can follow up. Post back the input parameters (capture this from the Job details page – click on the “i” icon with the result dataset to find this) along with the first few lines of your GTF for troubleshooting. Also, note where/how the annotation was sourced, please.

Corry_Mao · July 22, 2021, 12:37pm

Thank you for your kind reply, it’s been very helpful! I think the reason might be that the annotation file I used (DanRer11 from UCSC) doesn’t contain correct gene ID information (the gene ID is the same as the transcript ID). Where can I find an annotation file with the correct gene IDs? Or perhaps the transcript IDs in the DESeq2 output can’t be matched to anything in the annotation file? I used this annotation file in the upstream analysis and it looked all right.

The first few lines of the DanRer 11.gtf file from UCSC (TableBrowser) :

Seqname	Source	Feature	Start	End	Strand	Frame	Attributes
chr1	danRer11_ncbiRefSeq	start_codon	8304904	8304906	+	.	gene_id XM_021475941.1; transcript_id XM_021475941.1;
chr1	danRer11_ncbiRefSeq	CDS	8304904	8304996	+	0	gene_id XM_021475941.1; transcript_id XM_021475941.1;
chr1	danRer11_ncbiRefSeq	exon	8304862	8304996	+	.	gene_id XM_021475941.1; transcript_id XM_021475941.1;
chr1	danRer11_ncbiRefSeq	CDS	8309655	8309788	+	0	gene_id XM_021475941.1; transcript_id XM_021475941.1;
chr1	danRer11_ncbiRefSeq	exon	8309655	8309788	+	.	gene_id XM_021475941.1; transcript_id XM_021475941.1;
chr1	danRer11_ncbiRefSeq	CDS	8312162	8312259	+	1	gene_id XM_021475941.1; transcript_id XM_021475941.1;

The input parameters of “Annotate DESeq2/DEXSeq output tables” are listed as follows:

Tool Parameters

Input Parameter	Value
Tabular output of DESeq2/edgeR/limma/DEXSeq	* 124: Filter on data 123
Input file type	DESeq2/edgeR/limma
Reference annotation in GFF/GTF format	* 32: danRer11.gtf
advanced_parameters
GFF feature type	exon
GFF feature identifier	gene_id
GFF transcript identifier	transcript_id
GFF attributes to include	gene_biotype, gene_name

jennaj · July 22, 2021, 4:49pm

Yes, this is a known content issue with GTF datasets extracted from the UCSC Table Browser.

Update: It looks like they are now generating GTFs in the downloads area automatically. Capture the link and paste it into the Upload tool. danRer11 has a few choices, and one of them is a match for what you used already: danRer11.ncbiRefSeq.gtf.gz

https://hgdownload.soe.ucsc.edu/goldenPath/danRer11/bigZips/genes/

How to find these:

UCSC Genome Browser Downloads
navigate to the genome build
click into Genome sequence files and select annotations (2bit, GTF, GC-content, etc), then into the folder named genes

This is a better way to get the GTF for all use cases. UCSC does not recommend extracting GTF data from the Table Browser for a few reasons: the gene_id = transcript_id attribute content AND the limit of about 100k lines of output (can result in a truncated output).
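A quick check for both symptoms can be sketched as follows. It is illustrative only: it assumes a standard nine-column GTF, and `collapsed_gene_ids` is a made-up helper name. It reports how many data rows the file has (to judge whether it sits suspiciously close to the Table Browser output limit) and what fraction of them carry a gene_id identical to the transcript_id.

```python
import re


def get_attr(attr_field, key):
    """Extract one attribute value (quoted or unquoted) from a GTF attributes column."""
    pattern = r'(?:^|;)\s*' + re.escape(key) + r'\s+"?([^";]+)"?\s*;'
    match = re.search(pattern, attr_field)
    return match.group(1).strip() if match else None


def collapsed_gene_ids(gtf_lines):
    """Return (rows_checked, fraction of rows where gene_id == transcript_id)."""
    total = 0
    same = 0
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # assumption: attributes are in column 9
        gene_id = get_attr(fields[8], "gene_id")
        transcript_id = get_attr(fields[8], "transcript_id")
        if gene_id is None or transcript_id is None:
            continue
        total += 1
        if gene_id == transcript_id:
            same += 1
    fraction = same / total if total else 0.0
    return total, fraction
```

A fraction at or near 1.0 suggests the gene_id column is just a copy of the transcript_id, which is the pattern shown in the GTF excerpt posted above.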

For you, if the original GTF was not truncated, it is probably OK to just swap the annotation at this point. But do double check: inspect/compare a few lines and count up the total number of lines. All should match. If they don’t, you might need to reprocess the upstream steps using the corrected annotation.
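The double check could look roughly like this. It is a hedged sketch, not a Galaxy tool: `compare_gtfs` is a made-up name, both inputs are assumed to be standard nine-column GTFs, and only one feature type (exon by default) is compared, since the two sources may not emit exactly the same set of feature types.

```python
import re

TRANSCRIPT_RE = re.compile(r'transcript_id\s+"?([^";\s]+)')


def feature_keys(gtf_lines, feature_type="exon"):
    """Return (row_count, set of coordinate keys) for one feature type."""
    count = 0
    keys = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        count += 1
        match = TRANSCRIPT_RE.search(fields[8])
        transcript_id = match.group(1) if match else None
        # chromosome, start, end, strand, transcript: independent of gene_id
        keys.add((fields[0], fields[3], fields[4], fields[6], transcript_id))
    return count, keys


def compare_gtfs(old_lines, new_lines, feature_type="exon"):
    """Summarise row counts and features present in only one of the two files."""
    old_count, old_keys = feature_keys(old_lines, feature_type)
    new_count, new_keys = feature_keys(new_lines, feature_type)
    return {
        "old_rows": old_count,
        "new_rows": new_count,
        "only_in_old": len(old_keys - new_keys),
        "only_in_new": len(new_keys - old_keys),
    }
```

If `only_in_new` is large while `only_in_old` is zero, the original file was probably truncated, and reprocessing the upstream steps is the safer choice.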

Corry_Mao · July 28, 2021, 12:30pm

Here is what I got after the annotation for Differentially expressed genes:

GeneID	Base mean	log2(FC)	StdErr	Wald-Stats	P-value	P-adj	Chromosome	Start	End	Strand	Feature	Gene name
XM_683666.8	498.88515410306	2.4273845458708	0.15416043212282	15.745833820295	7.33577602702814e-56	1.67173036784951e-54	chr2	23790784	23805139	+	NA	NA
XM_001922608.6	200.837563940712	2.59668599485521	0.201900187135723	12.8612361964263	7.43775692383719e-38	1.20042800027617e-36	chr7	20239921	20241386	-	NA	NA
NM_212914.1	336.556168633185	2.20798978407476	0.177908177351559	12.4108392146114	2.28247763802162e-35	3.44876154239905e-34	chr11	43114107	43116250	+	NA	NA
NM_199950.1	695.138343159532	1.70861836869917	0.139132940363923	12.2804733676297	1.15321523729856e-34	1.71577218753937e-33	chr3	57423136	57425959	-	NA	NA

It seems like the GeneID is actually the transcript ID. How can I convert it to the official gene symbol?

Corry_Mao · July 31, 2021, 8:14am

I used the tool from g:Profiler, a web server for functional enrichment analysis and conversions of gene lists. It converted the zebrafish transcript IDs pretty well!
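As an alternative to a web conversion, the same lookup can be sketched locally from a GTF whose gene-level attributes differ from the transcript ID. This is only an illustration under assumptions: the GTF is assumed to carry a gene_name attribute (with gene_id as a fallback), the function names are made up, and "geneA" in the usage note is a placeholder rather than a real zebrafish symbol.

```python
import re

# Matches both quoted (key "value";) and unquoted (key value;) attribute pairs.
ATTR_RE = re.compile(r'(\S+)\s+"?([^";]+)"?\s*;')


def transcript_to_gene(gtf_lines, gene_attr="gene_name"):
    """Build a transcript_id -> gene label lookup from a GTF."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # assumption: attributes are in column 9
        attrs = {k: v.strip() for k, v in ATTR_RE.findall(fields[8])}
        transcript_id = attrs.get("transcript_id")
        gene = attrs.get(gene_attr) or attrs.get("gene_id")
        if transcript_id and gene and transcript_id not in mapping:
            mapping[transcript_id] = gene
    return mapping


def annotate(table_lines, mapping, id_column=0, header=True):
    """Append a gene symbol column to a tab-separated DESeq2-style table."""
    annotated = []
    for index, line in enumerate(table_lines):
        columns = line.rstrip("\n").split("\t")
        if header and index == 0:
            columns.append("Gene symbol")
        else:
            # IDs without a match are marked NA, as the Galaxy tool does
            columns.append(mapping.get(columns[id_column], "NA"))
        annotated.append("\t".join(columns))
    return annotated
```

With a GTF row such as `gene_id "geneA"; transcript_id "XM_683666.8"; gene_name "geneA";`, the table row starting with XM_683666.8 would gain "geneA" as its last column, and rows whose IDs are not in the GTF would get NA.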
