ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is ‘exon’
An example of attributes included in your GTF annotation is ‘ID=exon-XR_003111846.3-1;Parent=rna-XR_003111846.3;Dbxref=GeneID:112587351,RFAM:RF00026,Genbank:XR_003111846.3;gbkey=ncRNA;gene=LOC112587351;inference=COORDINATES: profile:INFERNAL:1.1.1;product=U6 spliceosomal RNA;transcript_id=XR_003111846.3’
Can somebody explain the solution in an easy way? I am running RNA-seq data on domestic water buffalo. The genome and gene annotation files come from the same NCBI database. With thanks in anticipation.
Hi @rkd
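Your data seems to be in GFF3 format, not GTF: the attributes in your example are written as key=value pairs (ID=...;Parent=...), while GTF writes them as key "value"; pairs. You can get the GTF from NCBI.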
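To check which format a file is in, a quick look at the 9th column is usually enough. Here is a minimal sketch of that check. The function name and the GTF example string are made up for illustration, and the test is a rough heuristic, not a full parser for either format.

```python
import re


def guess_annotation_format(attributes: str) -> str:
    """Guess GTF vs GFF3 from one 9th-column attribute string (heuristic)."""
    text = attributes.strip()
    # GTF style: key "value"; pairs, for example gene_id "LOC112587351";
    if re.search(r'\b\w+ "[^"]*"', text):
        return "GTF"
    # GFF3 style: key=value pairs separated by semicolons
    if re.search(r'(^|;)\s*\w+=', text):
        return "GFF3"
    return "unknown"


# Shortened version of the attribute string from the error message
gff3_example = "ID=exon-XR_003111846.3-1;Parent=rna-XR_003111846.3;gene=LOC112587351"
# Hypothetical GTF-style line built from the same identifiers
gtf_example = 'gene_id "LOC112587351"; transcript_id "XR_003111846.3";'

print(guess_annotation_format(gff3_example))
print(guess_annotation_format(gtf_example))
```

If the first data line of your annotation is reported as GFF3, a tool set to read a GTF attribute will not find it, which matches the error above.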
Context links
- Original Q&A → Built-in reference genome of domestic water buffalo
- Species link from my post → Bubalus bubalis genome assembly NDDB_SH_1 - NCBI - NLM
- Top level for the first reference genome (the one with the green star at NCBI) → Bubalus bubalis genome assembly NDDB_SH_1 - NCBI - NLM
- FTP link from that page points to here → Index of /genomes/all/GCF/019/923/935/GCF_019923935.1_NDDB_SH_1
Then you have a choice of reference files. I would choose these (if I guessed the species you are interested in correctly?):
- genome (nucleotide fasta) → https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/923/935/GCF_019923935.1_NDDB_SH_1/GCF_019923935.1_NDDB_SH_1_genomic.fna.gz
- annotation (GTF format) → https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/923/935/GCF_019923935.1_NDDB_SH_1/GCF_019923935.1_NDDB_SH_1_genomic.gtf.gz
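If you want to do the same for another assembly, the two URLs above follow a pattern that can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the directory layout is inferred from the links in this post, so please confirm the result against the FTP link on the NCBI assembly page.

```python
def ncbi_genome_ftp_dir(accession: str, assembly_name: str) -> str:
    """Build the FTP directory URL for an assembly (layout inferred from the links above)."""
    prefix, rest = accession.split("_", 1)   # "GCF", "019923935.1"
    digits = rest.split(".", 1)[0]           # "019923935"
    # The digits are split into groups of three to form the nested folders
    parts = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    base = "https://ftp.ncbi.nlm.nih.gov/genomes/all"
    return f"{base}/{prefix}/{'/'.join(parts)}/{accession}_{assembly_name}"


def reference_urls(accession: str, assembly_name: str) -> dict:
    """Return the genome fasta and GTF annotation URLs for one assembly."""
    directory = ncbi_genome_ftp_dir(accession, assembly_name)
    stem = f"{accession}_{assembly_name}"
    return {
        "genome": f"{directory}/{stem}_genomic.fna.gz",
        "annotation": f"{directory}/{stem}_genomic.gtf.gz",
    }


urls = reference_urls("GCF_019923935.1", "NDDB_SH_1")
print(urls["genome"])
print(urls["annotation"])
```

Either URL can be pasted into the Upload tool as-is.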
You should be able to import those two files by URL into a Galaxy history using the Upload tool with all default settings. Then, run some data cleanup to get the format into a very simple, basic specification. An example where I did this for another genome is here →
I did two things:
- Ran NormalizeFasta on the reference genome fasta to remove the description content from the > title lines. This isolates the chromosome identifiers in a way STAR and many other tools will expect.
- Ran Select to remove the # header lines from the reference annotation. Data providers include header lines for provenance reasons, but many (most?) tools expect a stricter format that does not include any headers. So, remove them to avoid errors. You can keep a copy of the original file for your records if you need to check or cite any of that header information later.
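For anyone curious about what those two cleanup steps do to the text, here is a rough Python sketch of the same idea. It is not the code behind NormalizeFasta or Select. The function names and the toy records are made up, and it only covers the two changes described above.

```python
def simplify_fasta_titles(lines):
    """Keep only the identifier (text before the first whitespace) on '>' title lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line.split()[0]
        else:
            yield line


def drop_header_lines(lines):
    """Remove '#' header/comment lines from an annotation file."""
    for line in lines:
        if not line.startswith("#"):
            yield line.rstrip("\n")


# Toy inputs (made-up identifiers, only to show the effect)
fasta = [">seq1 some long description text\n", "ACGTACGT\n"]
annotation = [
    "#gtf-version 2.2\n",
    'seq1\ttoy\texon\t1\t8\t.\t+\t.\tgene_id "G1";\n',
]

clean_fasta = list(simplify_fasta_titles(fasta))
clean_annotation = list(drop_header_lines(annotation))
print(clean_fasta)
print(clean_annotation)
```

After both steps, the identifier on the fasta title line (seq1 here) is exactly what appears in the first column of the annotation, which is what mapping and counting tools need.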
More details about what I am suggesting → FAQ: Extended Help for Differential Expression Analysis Tools
And, if you really want to use GFF3 data instead, that is possible, but has scientific considerations since different data points will be used for the summaries. If interested, this is one topic where that is explored. → Featurecounts error using a gene annotation from a gff3 file - #2 by jennaj.
Please give that a try.
Thank you very much. featureCounts worked. One more question, please: I used the domestic water buffalo GTF for featureCounts, but now, when I proceed to annotate my IDs, the Organism option has no built-in entry for buffalo, so I would have to select Bos taurus, which is a different species. Is this going to affect my analysis? If yes, please explain the solution in a simple way. With thanks in anticipation.
Hi @rkd
Glad you have featureCounts working!
Then for AnnotateMyIDs, if your exact species assembly is not supported with a native index, then it will not work for you.
However, you could use this tool instead, using your DESeq2 output and annotation file →
- Annotate DESeq2/DEXSeq output tables Append annotation from GTF to differential expression tool outputs
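The idea behind that tool can be sketched like this: read the gene attributes from the GTF, then append them to each row of the differential expression table by gene ID. This is only an illustration, not the tool's own code. The function names, the toy rows, and the choice of `gene` as the name attribute are assumptions, so check which attributes your own GTF carries.

```python
import re

# Matches GTF-style attribute pairs such as: gene_id "G1";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def gene_names_from_gtf(gtf_lines, id_attr="gene_id", name_attr="gene"):
    """Map each gene ID to a gene name taken from the 9th column of a GTF."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get(id_attr)
        if gene_id is not None and gene_id not in mapping:
            mapping[gene_id] = attrs.get(name_attr, "")
    return mapping


def annotate_rows(de_rows, mapping):
    """Append the gene name to each row; the first column must be the gene ID."""
    return [row + [mapping.get(row[0], "NA")] for row in de_rows]


# Toy data with made-up values
gtf = [
    "#toy header line",
    'chrA\ttoy\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene "alpha";',
]
de_table = [["G1", "10.5", "1.2"], ["G2", "3.0", "-0.7"]]

annotated = annotate_rows(de_table, gene_names_from_gtf(gtf))
print(annotated)
```

Because the names come from the same GTF used for counting, no organism-specific built-in index is needed.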
Please give that a try!
Thank you, but the first link is broken. Is there a tutorial on downloading fungal RNA-seq data, masking the repeats, and preparing everything for the annotation workflow?
Hi @rkd
Which link is broken? We can fix that!
The RNA-seq tutorials can be found here. Since you have a reference genome, either the introduction or end-to-end series are good starting places.
Hope this helps!
I wish we could have a data repository with extra data files for every tutorial, different from the ones used in the tutorial itself, something that could be used as an assignment.
Thank you Jen.
I meant that I want to download another fungal genome fasta file and its related RNA-seq data, and repeat the steps with Funannotate on a new dataset. How can I find a BioProject that has both data types, so that I can run new data through the same pipeline?
Hi @Shaadi_Mehr
Great, thanks for clarifying more.
Multiple reference datasets for tutorials is an interesting idea. You could propose this to the GTN as a new idea. Or, let me know if you want help doing that and I’ll get it started and link back. More data sharing across instructor groups has been discussed for a while. I think some of the regional communities do this, but I don’t know the details, since learning them means checking in with each community individually, and the arrangements change over time. A ticket about the idea might be enough to learn more.
For immediate use, any RNA-seq dataset could be appropriate. Maybe try a literature search to narrow it down? Or start with a search at SRA and find data that way?
Thanks a lot. I can do the SRA search, but I would like to take the lead, create a group, and build an inventory of reference datasets and already-prepared experimental-condition data that teachers can use.
Please email me to include me in a working group for instructors.
Best
Great, @Shaadi_Mehr your help would be very much appreciated! I’m going to direct message you here about this a bit more.