I followed the Reference-based RNA-Seq data analysis tutorial; however, the gene name column is NA in the results of the step “Extraction and annotation of differentially expressed genes” using Annotate DESeq2/DEXSeq output tables. The reference annotation in GFF/GTF format that I used is danRer10 and danRer11, as the data is from zebrafish.
Hi @Corry_Mao
Are you following this tutorial? Reference-based RNA-Seq data analysis
And are you using your own data?
The NA means that the reference annotation either doesn’t have the attributes the tool was asked to add, or there is a mismatch between the tool settings and the file. Expand the Advanced Settings section and compare the attributes the tool is attempting to match with those in your GTF dataset. If they don’t match, that needs to be addressed (maybe with a different annotation source).
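One way to make that comparison outside Galaxy is to list which attribute keys the GTF actually carries for the feature type the tool is set to use. The following is a minimal Python sketch, not part of the tool: the function names, the sample record, and the requested attribute names are illustrative assumptions, and the parser assumes plain `key "value";` pairs with no semicolons inside values.

```python
from collections import Counter, defaultdict


def parse_attributes(column):
    """Parse a GTF attribute column into a dict of key -> value."""
    attrs = {}
    for field in column.strip().split(";"):
        key, _, value = field.strip().partition(" ")
        if key:
            attrs[key] = value.strip().strip('"')
    return attrs


def attribute_keys_by_feature(lines):
    """Count how often each attribute key occurs per feature type (column 3)."""
    counts = defaultdict(Counter)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        counts[cols[2]].update(parse_attributes(cols[8]).keys())
    return counts


# Illustrative record only; for a real file, pass an open file handle instead.
sample = [
    'chr1\texample_source\texon\t100\t200\t.\t+\t.\t'
    'gene_id "T1"; transcript_id "T1";\n',
]
counts = attribute_keys_by_feature(sample)
requested = ["gene_biotype", "gene_name"]  # example attributes to include
missing = [name for name in requested if name not in counts["exon"]]
print(dict(counts["exon"]))
print("missing:", missing)
```

Any name that ends up in `missing` is an attribute the tool cannot fill in from that file, which would show up as NA in the annotated table.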
If you used the same GTF as in the upstream steps, technical format problems would probably have shown up already, but this FAQ describes the expected format and may be helpful: Datatypes - Galaxy Community Hub
And this FAQ covers general troubleshooting for differential expression tool inputs: Help for Differential Expression Analysis - Galaxy Community Hub
Quote from the tool form that may also help:
What it does
This tool appends the output table of DESeq2/edgeR/limma/DEXSeq with gene symbols, biotypes, positions etc. The information you want to add is configurable. This information should be present in the input GTF/GFF file as attributes of the feature you choose. DEXSeq-Count tool is used to prepare the DEXSeq compatible annotation (flattened GTF file) from input GTF/GFF. In this process, the exons that appear multiple times, once for each transcript, are collapsed to so-called exon counting bins. Counting bins for parts of exons arise when an exonic region appears with different boundaries in different transcripts. The resulting flattened GTF file contains pseudo exon ids per gene instead of per transcript. This tool maps the DEXSeq counting bins back to the original exon ids. This mapping is only possible if the input GTF/GFF file contains a transcript identifier attribute for the chosen feature type.
Inputs
Differential gene expression tables
At the moment, this tool supports DESeq2 and DEXSeq tool outputs.
Annotation
Annotation file in GTF or GFF3 format that was used for counting.
Outputs
Input tabular file with chosen attributes appended as additional columns.
Using annotation from both danRer10 and danRer11 may be the root of the problem, or at least a contributing factor. If the inputs are not all based on the same genome version/build, a tool will not necessarily fail but may produce incorrect results (even without an “NA”/“not found” for linked annotation).
This part isn’t clear. Are you mixing different genome builds in the same analysis? That will always be a problem (technical and scientific). You should use the same reference annotation throughout an analysis: either danRer10 or danRer11 for all inputs, chosen to match whichever genome version you originally mapped against. The reference annotation GTF used for counting and all other steps should be based on that same genome assembly (10 or 11). Your original fastq reads don’t belong to any particular genome assembly, just to a specific species. Once you start processing the reads (mapping, etc.), using one particular genome assembly/build for all inputs matters.
Please review, then we can follow up. Post back the input parameters (capture this from the Job details page – click on the “i” icon with the result dataset to find this) along with the first few lines of your GTF for troubleshooting. Also, note where/how the annotation was sourced, please.
Thank you for your kind reply, it’s been very helpful! I found that the reason might be that the annotation file I used (danRer11 from UCSC) doesn’t contain correct gene ID information (the gene ID is the same as the transcript ID). Where can I find an annotation file with the correct gene ID? Or perhaps the transcript ID from the DESeq2 output can’t be matched to anything in the annotation file? I used this annotation file in the upstream analysis and it looked all right.
The first few lines of the danRer11.gtf file from UCSC (Table Browser):
Seqname | Source | Feature | Start | End | Score | Strand | Frame | Attributes |
---|---|---|---|---|---|---|---|---|
chr1 | danRer11_ncbiRefSeq | start_codon | 8304904 | 8304906 | 0.000000 | + | . | gene_id XM_021475941.1; transcript_id XM_021475941.1; |
chr1 | danRer11_ncbiRefSeq | CDS | 8304904 | 8304996 | 0.000000 | + | 0 | gene_id XM_021475941.1; transcript_id XM_021475941.1; |
chr1 | danRer11_ncbiRefSeq | exon | 8304862 | 8304996 | 0.000000 | + | . | gene_id XM_021475941.1; transcript_id XM_021475941.1; |
chr1 | danRer11_ncbiRefSeq | CDS | 8309655 | 8309788 | 0.000000 | + | 0 | gene_id XM_021475941.1; transcript_id XM_021475941.1; |
chr1 | danRer11_ncbiRefSeq | exon | 8309655 | 8309788 | 0.000000 | + | . | gene_id XM_021475941.1; transcript_id XM_021475941.1; |
chr1 | danRer11_ncbiRefSeq | CDS | 8312162 | 8312259 | 0.000000 | + | 1 | gene_id XM_021475941.1; transcript_id XM_021475941.1; |
The input parameters of “Annotate DESeq2/DEXSeq output tables” are listed as follows:
Tool Parameters
Input Parameter | Value |
---|---|
Tabular output of DESeq2/edgeR/limma/DEXSeq | * 124: Filter on data 123 |
Input file type | DESeq2/edgeR/limma |
Reference annotation in GFF/GTF format | * 32: danRer11.gtf |
advanced_parameters | |
GFF feature type | exon |
GFF feature identifier | gene_id |
GFF transcript identifier | transcript_id |
GFF attributes to include | gene_biotype, gene_name |
Yes, this is a known content issue with GTF datasets extracted from the UCSC Table Browser.
Update: It looks like they are now generating GTFs in the downloads area automatically. Capture the link and paste it into the Upload tool. danRer11 has a few choices, and one of them is a match for what you used already: danRer11.ncbiRefSeq.gtf.gz
https://hgdownload.soe.ucsc.edu/goldenPath/danRer11/bigZips/genes/
How to find these:
- UCSC Genome Browser Downloads
- navigate to the genome build
- click into Genome sequence files and select annotations (2bit, GTF, GC-content, etc), then into the folder named genes
This is a better way to get the GTF for all use cases. UCSC does not recommend extracting GTF data from the Table Browser for a few reasons: the gene_id = transcript_id attribute content, and the limit of about 100k lines of output (which can result in truncated output).
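Both symptoms can be checked directly on a GTF file. The Python sketch below counts the data records and how many of them have identical gene_id and transcript_id values. The function name and the sample records are made up for illustration, and the parser assumes simple `key "value";` attribute pairs. Since the limit is only described as "about 100k lines", a record count near that figure is a hint of truncation, not proof.

```python
def gtf_id_summary(lines):
    """Return (records, records_with_both_ids, records_where_ids_are_equal)."""
    records = with_both = equal = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        records += 1
        attrs = {}
        for field in cols[8].split(";"):
            key, _, value = field.strip().partition(" ")
            if key:
                attrs[key] = value.strip().strip('"')
        if "gene_id" in attrs and "transcript_id" in attrs:
            with_both += 1
            if attrs["gene_id"] == attrs["transcript_id"]:
                equal += 1
    return records, with_both, equal


# Made-up records in the style of a Table Browser export.
table_browser_style = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\t'
    'gene_id "NM_000001.1"; transcript_id "NM_000001.1";\n',
    'chr1\tsrc\tCDS\t120\t200\t.\t+\t0\t'
    'gene_id "NM_000001.1"; transcript_id "NM_000001.1";\n',
]
records, with_both, equal = gtf_id_summary(table_browser_style)
print(f"{records} records; gene_id == transcript_id in {equal} of {with_both}")
```

For a real dataset, download the GTF and pass an open file handle, for example `gtf_id_summary(open("danRer11.gtf"))`.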
For you, if the original GTF was not truncated, it is probably OK to just swap the annotation at this point. But do double check: inspect/compare a few lines and count the total number of lines. All should match. If they don’t, you might need to reprocess the upstream steps using the corrected annotation.
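That comparison can be sketched in Python as follows. Each GTF is reduced to a set of (chromosome, start, end, strand, transcript_id) keys for one feature type, and the sketch reports what is unique to either file. Restricting the check to `exon` records is an assumption: it is the feature type used for annotation in this thread, and the downloaded file may include record types that the Table Browser export lacks. The function names, sample records, and the gene symbol `geneA` are placeholders.

```python
def feature_keys(lines, feature="exon"):
    """Collect (seqname, start, end, strand, transcript_id) for one feature type."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue  # incomplete record or a different feature type
        transcript_id = None
        for field in cols[8].split(";"):
            key, _, value = field.strip().partition(" ")
            if key == "transcript_id":
                transcript_id = value.strip().strip('"')
        keys.add((cols[0], cols[3], cols[4], cols[6], transcript_id))
    return keys


def compare_gtfs(old_lines, new_lines, feature="exon"):
    """Return (only_in_old, only_in_new) for the chosen feature type."""
    old_keys = feature_keys(old_lines, feature)
    new_keys = feature_keys(new_lines, feature)
    return old_keys - new_keys, new_keys - old_keys


# Made-up records: same exon, different gene_id content.
old_gtf = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\t'
    'gene_id "NM_000001.1"; transcript_id "NM_000001.1";\n',
]
new_gtf = [
    'chr1\tsrc\ttranscript\t100\t200\t.\t+\t.\t'
    'gene_id "geneA"; transcript_id "NM_000001.1";\n',
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\t'
    'gene_id "geneA"; transcript_id "NM_000001.1"; gene_name "geneA";\n',
]
only_old, only_new = compare_gtfs(old_gtf, new_gtf)
print(len(only_old), "exons only in old;", len(only_new), "exons only in new")
```

If both sets are empty, the exon records agree and only the attribute content differs, which is consistent with simply swapping the annotation. Non-empty sets point to truncation or a different annotation release, in which case rerunning the upstream steps is the safer choice.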
Here is what I got after the annotation for Differentially expressed genes:
GeneID | Base mean | log2(FC) | StdErr | Wald-Stats | P-value | P-adj | Chromosome | Start | End | Strand | Feature | Gene name |
---|---|---|---|---|---|---|---|---|---|---|---|---|
XM_683666.8 | 498.88515410306 | 2.4273845458708 | 0.15416043212282 | 15.745833820295 | 7.33577602702814e-56 | 1.67173036784951e-54 | chr2 | 23790784 | 23805139 | + | NA | NA |
XM_001922608.6 | 200.837563940712 | 2.59668599485521 | 0.201900187135723 | 12.8612361964263 | 7.43775692383719e-38 | 1.20042800027617e-36 | chr7 | 20239921 | 20241386 | - | NA | NA |
NM_212914.1 | 336.556168633185 | 2.20798978407476 | 0.177908177351559 | 12.4108392146114 | 2.28247763802162e-35 | 3.44876154239905e-34 | chr11 | 43114107 | 43116250 | + | NA | NA |
NM_199950.1 | 695.138343159532 | 1.70861836869917 | 0.139132940363923 | 12.2804733676297 | 1.15321523729856e-34 | 1.71577218753937e-33 | chr3 | 57423136 | 57425959 | - | NA | NA |
It seems that the GeneID column actually holds transcript IDs, so how can I convert them to official gene symbols?
I used the tool from g:Profiler – a web server for functional enrichment analysis and conversions of gene lists. It converted the transcript Id of zebrafish pretty well!
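A local alternative, if the annotation file carries both transcript IDs and gene names, is to build a lookup table from the GTF and append a symbol column to the DESeq2 table. The Python sketch below assumes the GTF has `transcript_id` and `gene_name` attributes (check your file first) and that the first column of the table holds the transcript ID. The function names are made up, the GTF record uses fake coordinates, and `geneA` is a placeholder, not the real symbol for that accession.

```python
def transcript_to_gene_name(gtf_lines):
    """Build a transcript_id -> gene_name lookup from GTF records."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = {}
        for field in cols[8].split(";"):
            key, _, value = field.strip().partition(" ")
            if key:
                attrs[key] = value.strip().strip('"')
        tid, name = attrs.get("transcript_id"), attrs.get("gene_name")
        if tid and name:
            mapping.setdefault(tid, name)  # keep the first name seen
    return mapping


def add_gene_names(rows, mapping, id_column=0, missing="NA"):
    """Append a gene-name column to each row; unknown IDs get `missing`."""
    return [row + [mapping.get(row[id_column], missing)] for row in rows]


# Made-up GTF record and a two-row excerpt of a results table.
gtf = [
    'chr2\tsrc\texon\t100\t200\t.\t+\t.\t'
    'gene_id "geneA"; transcript_id "XM_683666.8"; gene_name "geneA";\n',
]
rows = [["XM_683666.8", "2.43"], ["NM_212914.1", "2.21"]]
annotated = add_gene_names(rows, transcript_to_gene_name(gtf))
for row in annotated:
    print("\t".join(row))
```

IDs that are absent from the GTF stay as NA, so the number of NA rows after this step is a quick measure of how well the table and the annotation agree.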