Couldn't annotate any gene Name using Annotate DESeq2/DEXSeq output tables

I followed the Reference-based RNA-Seq data analysis tutorial. However, in the results of the "Extraction and annotation of differentially expressed genes" step, which uses Annotate DESeq2/DEXSeq output tables, the gene name column is all NA. The reference annotation in GFF/GTF format I used is DanRer 10 and 11, as the data is from zebrafish.


Hi @Corry_Mao

Are you following this tutorial? Reference-based RNA-Seq data analysis

And are you using your own data?

The NA means that the reference annotation either doesn’t have any appropriate attributes to add – or there could be some mismatch problem. Expand the Advanced Settings section and compare the attributes the tool is attempting to match up with your GTF dataset. If they don’t match, it needs to be addressed (maybe with a different annotation source).
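If it helps, the same comparison can be done outside the tool form. Below is a minimal sketch, assuming a tab-separated GTF whose ninth column holds `key "value";` pairs (the unquoted `key value;` form is also accepted). The function names and the sample row are illustrative only and are not part of any Galaxy tool.

```python
import re
from collections import Counter

# Matches `key "value";` and the unquoted `key value;` form seen in some exports.
ATTR_RE = re.compile(r'(\S+)\s+"?([^";]+)"?\s*;')

def parse_attributes(field):
    """Return the GTF attribute column as a dict of key -> value."""
    return {key: value.strip() for key, value in ATTR_RE.findall(field)}

def attribute_keys(gtf_lines, feature_type="exon"):
    """Count how often each attribute key occurs on rows of one feature type."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        counts.update(parse_attributes(cols[8]).keys())
    return counts

# Illustrative exon row in the layout of a UCSC Table Browser export.
sample = [
    "chr1\tdanRer11_ncbiRefSeq\texon\t8304862\t8304996\t0.000000\t+\t.\t"
    'gene_id "XM_021475941.1"; transcript_id "XM_021475941.1";\n',
]

present = attribute_keys(sample)
requested = ["gene_biotype", "gene_name"]  # the attributes the tool was asked to add
missing = [name for name in requested if name not in present]
print("attributes present:", dict(present))
print("requested but missing:", missing)
```

Any attribute listed as missing cannot be appended by the tool, which would be consistent with an NA column.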

If you used the same GTF as in the upstream steps, technical format problems would probably have shown up already, but this FAQ describes the expected format and may be helpful: Datatypes

And this FAQ covers general troubleshooting for differential expression tool inputs: Help for Differential Expression Analysis

Quote from the tool form that may also help:

What it does

This tool appends the output table of DESeq2/edgeR/limma/DEXSeq with gene symbols, biotypes, positions etc. The information you want to add is configurable. This information should be present in the input GTF/GFF file as attributes of the feature you choose. DEXSeq-Count tool is used to prepare the DEXSeq compatible annotation (flattened GTF file) from input GTF/GFF. In this process, the exons that appear multiple times, once for each transcript, are collapsed to so-called exon counting bins. Counting bins for parts of exons arise when an exonic region appears with different boundaries in different transcripts. The resulting flattened GTF file contains pseudo exon ids per gene instead of per transcript. This tool maps the DEXSeq counting bins back to the original exon ids. This mapping is only possible if the input GTF/GFF file contains a transcript identifier attribute for the chosen feature type.

Inputs

Differential gene expression tables

At the moment, this tool supports DESeq2 and DEXSeq tool outputs.

Annotation

Annotation file in GTF or GFF3 format that was used for counting.

Outputs

Input tabular file with the chosen attributes appended as additional columns.


This may be the root of the problem, or at least a contributing factor. If the inputs do not all match (same genome version/build), a tool will not necessarily fail but may still produce incorrect results (even when the linked annotation is not reported as “NA” or “not found”).

This part isn’t clear. Are you mixing different genome builds in the same analysis? That will always be a problem (technical + scientific). You should be using the same reference annotation throughout the same analysis. It will be either danRer10 or danRer11 for all inputs – and those should be based on whichever genome version you originally mapped against. The associated reference annotation GTF used for counting and all other steps should be based on that same genome assembly (10 or 11). Your original fastq reads don’t belong to any particular genome assembly – just a specific species. Once you start processing the reads (mapping, etc), that is when using a particular genome assembly/build for all inputs matters.

Please review, then we can follow up. Post back the input parameters (capture this from the Job details page – click on the “i” icon with the result dataset to find this) along with the first few lines of your GTF for troubleshooting. Also, note where/how the annotation was sourced, please.

Thank you for your kind reply, it’s been very helpful! I think the reason might be that the annotation file I used (DanRer11 from UCSC) doesn’t contain correct gene ID information (the gene ID is the same as the transcript ID). Where can I find an annotation file with the correct gene ID? Or perhaps the transcript ID from the DESeq2 output can’t be matched to anything in the annotation file? I used this annotation file in the upstream analysis and it looked all right.

The first few lines of the DanRer 11.gtf file from UCSC (TableBrowser) :

Seqname Source Feature Start End Score Strand Frame Attributes
chr1 danRer11_ncbiRefSeq start_codon 8304904 8304906 0.000000 + . gene_id XM_021475941.1; transcript_id XM_021475941.1;
chr1 danRer11_ncbiRefSeq CDS 8304904 8304996 0.000000 + 0 gene_id XM_021475941.1; transcript_id XM_021475941.1;
chr1 danRer11_ncbiRefSeq exon 8304862 8304996 0.000000 + . gene_id XM_021475941.1; transcript_id XM_021475941.1;
chr1 danRer11_ncbiRefSeq CDS 8309655 8309788 0.000000 + 0 gene_id XM_021475941.1; transcript_id XM_021475941.1;
chr1 danRer11_ncbiRefSeq exon 8309655 8309788 0.000000 + . gene_id XM_021475941.1; transcript_id XM_021475941.1;
chr1 danRer11_ncbiRefSeq CDS 8312162 8312259 0.000000 + 1 gene_id XM_021475941.1; transcript_id XM_021475941.1;

The input parameters of “Annotate DESeq2/DEXSeq output tables” are listed as follows:

Tool Parameters

Input Parameter Value
Tabular output of DESeq2/edgeR/limma/DEXSeq * 124: Filter on data 123
Input file type DESeq2/edgeR/limma
Reference annotation in GFF/GTF format * 32: danRer11.gtf
advanced_parameters
GFF feature type exon
GFF feature identifier gene_id
GFF transcript identifier transcript_id
GFF attributes to include gene_biotype, gene_name

Yes, this is a known content issue with GTF datasets extracted from the UCSC Table Browser.

Update: It looks like they are now generating GTFs in the downloads area automatically. Capture the link and paste it into the Upload tool. danRer11 has a few choices, and one of them is a match for what you used already: danRer11.ncbiRefSeq.gtf.gz

https://hgdownload.soe.ucsc.edu/goldenPath/danRer11/bigZips/genes/

How to find these:

  1. UCSC Genome Browser Downloads
  2. navigate to the genome build
  3. click into Genome sequence files and select annotations (2bit, GTF, GC-content, etc), then into the folder named genes

This is a better way to get the GTF for all use cases. UCSC does not recommend extracting GTF data from the Table Browser for a few reasons: the gene_id = transcript_id attribute content AND the limit of about 100k lines of output (can result in a truncated output).
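To see whether a given GTF has the gene_id = transcript_id content issue, a quick check is to measure how many rows carry identical values for the two attributes. This is only a sketch under the assumption of a tab-separated GTF with `key "value";` attributes; the second sample row uses a made-up gene_id (`placeholder_gene`) purely for illustration.

```python
import re

# Matches `key "value";` and the unquoted `key value;` form.
ATTR_RE = re.compile(r'(\S+)\s+"?([^";]+)"?\s*;')

def fraction_gene_equals_transcript(gtf_lines):
    """Fraction of rows whose gene_id is identical to their transcript_id."""
    total = 0
    same = 0
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue
        attrs = {key: value.strip() for key, value in ATTR_RE.findall(cols[8])}
        if "gene_id" in attrs and "transcript_id" in attrs:
            total += 1
            if attrs["gene_id"] == attrs["transcript_id"]:
                same += 1
    return same / total if total else 0.0

sample = [
    # Table Browser style row: gene_id repeats the transcript accession.
    "chr1\tdanRer11_ncbiRefSeq\texon\t8304862\t8304996\t0.000000\t+\t.\t"
    'gene_id "XM_021475941.1"; transcript_id "XM_021475941.1";\n',
    # Hypothetical row where gene_id differs (placeholder value, not real data).
    "chr1\tsource\texon\t8309655\t8309788\t.\t+\t.\t"
    'gene_id "placeholder_gene"; transcript_id "XM_021475941.1";\n',
]

fraction = fraction_gene_equals_transcript(sample)
print(f"{fraction:.0%} of rows have gene_id == transcript_id")
```

A value close to 100% on a real file suggests the gene-level identifiers are really transcript accessions.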

For you, if the original GTF was not truncated, it is probably OK to just swap the annotation at this point. But do double check – inspect/compare a few lines plus count up the total number of lines – all should match. If they don’t, you might need to reprocess the upstream steps using the corrected annotation.
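One way to do that double check is sketched below. It assumes both files are tab-separated GTFs and compares only exon rows by coordinates, since the two sources may not contain the same set of other feature types; that restriction is my assumption, not something the tool requires. The function names and sample rows are illustrative.

```python
def exon_intervals(gtf_lines):
    """Collect (seqname, start, end, strand) for every exon row."""
    intervals = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9 and cols[2] == "exon":
            intervals.add((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return intervals

def compare_gtfs(old_lines, new_lines):
    """Summarise how the exon coordinates of two GTFs overlap."""
    old = exon_intervals(old_lines)
    new = exon_intervals(new_lines)
    return {
        "old_exons": len(old),
        "new_exons": len(new),
        "only_in_old": len(old - new),
        "only_in_new": len(new - old),
    }

# Illustrative rows; attribute values in `new_gtf` are placeholders.
old_gtf = [
    "chr1\tdanRer11_ncbiRefSeq\texon\t8304862\t8304996\t0.000000\t+\t.\t"
    'gene_id "XM_021475941.1"; transcript_id "XM_021475941.1";\n',
    "chr1\tdanRer11_ncbiRefSeq\texon\t8309655\t8309788\t0.000000\t+\t.\t"
    'gene_id "XM_021475941.1"; transcript_id "XM_021475941.1";\n',
]
new_gtf = [
    "chr1\tsource\texon\t8304862\t8304996\t.\t+\t.\t"
    'gene_id "placeholder_gene"; transcript_id "XM_021475941.1";\n',
    "chr1\tsource\texon\t8309655\t8309788\t.\t+\t.\t"
    'gene_id "placeholder_gene"; transcript_id "XM_021475941.1";\n',
]

report = compare_gtfs(old_gtf, new_gtf)
print(report)
```

For real files, pass open file handles instead of lists, for example `open("old.gtf")` and `gzip.open("new.gtf.gz", "rt")`. Non-zero `only_in_old` or `only_in_new` counts would be a reason to rerun the upstream counting with the new annotation.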

Here is what I got after the annotation for Differentially expressed genes:

GeneID Base mean log2(FC) StdErr Wald-Stats P-value P-adj Chromosome Start End Strand Feature Gene name
XM_683666.8 498.88515410306 2.4273845458708 0.15416043212282 15.745833820295 7.33577602702814e-56 1.67173036784951e-54 chr2 23790784 23805139 + NA NA
XM_001922608.6 200.837563940712 2.59668599485521 0.201900187135723 12.8612361964263 7.43775692383719e-38 1.20042800027617e-36 chr7 20239921 20241386 - NA NA
NM_212914.1 336.556168633185 2.20798978407476 0.177908177351559 12.4108392146114 2.28247763802162e-35 3.44876154239905e-34 chr11 43114107 43116250 + NA NA
NM_199950.1 695.138343159532 1.70861836869917 0.139132940363923 12.2804733676297 1.15321523729856e-34 1.71577218753937e-33 chr3 57423136 57425959 - NA NA

It seems the GeneID is actually the transcript ID, so how can I convert it to the official gene symbol?
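One possible approach, sketched here rather than offered as the tool's own method: build a transcript_id to gene symbol lookup from a GTF that carries both, then fill in the table. This assumes the replacement GTF has a `gene_name` attribute on its rows, which should be verified on the actual file. The symbol `placeholder_symbol` below is invented for illustration and is not the real symbol for that accession.

```python
import re

# Matches `key "value";` and the unquoted `key value;` form.
ATTR_RE = re.compile(r'(\S+)\s+"?([^";]+)"?\s*;')

def transcript_to_symbol(gtf_lines, name_key="gene_name"):
    """Map each transcript_id to the first gene symbol seen for it."""
    mapping = {}
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue
        attrs = {key: value.strip() for key, value in ATTR_RE.findall(cols[8])}
        transcript = attrs.get("transcript_id")
        symbol = attrs.get(name_key)
        if transcript and symbol:
            mapping.setdefault(transcript, symbol)
    return mapping

def add_symbols(rows, mapping, id_column=0):
    """Append a gene symbol (or NA) to each row of a differential expression table."""
    return [row + [mapping.get(row[id_column], "NA")] for row in rows]

# Hypothetical GTF row; the gene_name value is a placeholder.
gtf = [
    "chr2\tsource\texon\t23790784\t23805139\t.\t+\t.\t"
    'gene_id "placeholder_symbol"; transcript_id "XM_683666.8"; '
    'gene_name "placeholder_symbol";\n',
]
# First column holds the transcript accession, as in the DESeq2 table above.
table = [
    ["XM_683666.8", "498.88515410306"],
    ["NM_212914.1", "336.556168633185"],
]

annotated = add_symbols(table, transcript_to_symbol(gtf))
for row in annotated:
    print("\t".join(row))
```

Accessions that are absent from the GTF stay NA, so a high NA count after this step would point to a version mismatch between the table and the annotation.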