Getting NCBI Reference genome indexed for tools: custom genome, reference genome, reference annotation

rkd · January 8, 2025, 2:04pm

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is ‘exon’
An example of attributes included in your GTF annotation is ‘ID=exon-XR_003111846.3-1;Parent=rna-XR_003111846.3;Dbxref=GeneID:112587351,RFAM:RF00026,Genbank:XR_003111846.3;gbkey=ncRNA;gene=LOC112587351;inference=COORDINATES: profile:INFERNAL:1.1.1;product=U6 spliceosomal RNA;transcript_id=XR_003111846.3’
Can somebody explain the solution in an easy way? I am running RNA-seq data on domestic water buffalo. The genome and gene annotation file are from the same database at NCBI. With thanks in anticipation.

jennaj · January 8, 2025, 7:36pm

Hi @rkd

Your data content seems to be in GFF3 format, not GTF. You can get the GTF from NCBI.
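For anyone who wants to check a file before running a tool, here is a minimal Python sketch that guesses the format from the 9th column. It assumes GTF attributes are written as `key "value";` pairs and GFF3 attributes as `key=value` pairs, which matches the attribute string in the error message above. The function names are made up for this illustration, and the GTF-style line in the usage note is an illustrative rewrite of the same record, not copied from a real file.

```python
import re

# GTF-style attribute: a key, a space, then a double-quoted value and a semicolon.
GTF_ATTR = re.compile(r'^\s*\w+ "[^"]*"\s*;')
# GFF3-style attribute: key=value, ended by a semicolon or the end of the text.
GFF3_ATTR = re.compile(r'^\s*\w+=[^;]*(;|$)')


def ninth_column(line: str) -> str:
    """Return the attribute column of one tab-separated annotation line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected at least 9 tab-separated columns")
    return fields[8]


def guess_annotation_format(attributes: str) -> str:
    """Guess 'GTF', 'GFF3', or 'unknown' from the text of the 9th column."""
    if GTF_ATTR.match(attributes):
        return "GTF"
    if GFF3_ATTR.match(attributes):
        return "GFF3"
    return "unknown"
```

Calling `guess_annotation_format("ID=exon-XR_003111846.3-1;Parent=rna-XR_003111846.3")` returns `"GFF3"`, while a line written as `gene_id "LOC112587351"; transcript_id "XR_003111846.3";` returns `"GTF"`.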

Context links

Original Q&A → Built-in reference genome of domestic water buffalo
Species link from my post → Bubalus bubalis genome assembly NDDB_SH_1 - NCBI - NLM
Top level for the first reference genome (the one with the green star at NCBI) → Bubalus bubalis genome assembly NDDB_SH_1 - NCBI - NLM
FTP link from that page points to here → Index of /genomes/all/GCF/019/923/935/GCF_019923935.1_NDDB_SH_1

Then you have a choice of reference files. I would choose these (if I guessed the species you are interested in correctly?):

genome (nucleotide fasta) → https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/923/935/GCF_019923935.1_NDDB_SH_1/GCF_019923935.1_NDDB_SH_1_genomic.fna.gz
annotation (GTF format) → https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/923/935/GCF_019923935.1_NDDB_SH_1/GCF_019923935.1_NDDB_SH_1_genomic.gtf.gz
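The two URLs above follow a visible pattern: the nine digits of the accession are split into three directories, and the file names start with the accession plus the assembly name. A small Python sketch of that pattern is below. It is inferred only from the links in this post, so treat the layout as an assumption for other assemblies, and note that `ncbi_reference_urls` is a hypothetical helper name.

```python
BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def ncbi_reference_urls(accession: str, assembly_name: str) -> dict:
    """Build the genome FASTA and GTF URLs for one assembly.

    Assumes the directory layout seen in the links above:
    <BASE>/GCF/019/923/935/<accession>_<assembly_name>/
    """
    prefix, rest = accession.split("_", 1)   # e.g. "GCF" and "019923935.1"
    digits = rest.split(".", 1)[0]           # e.g. "019923935"
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError("expected nine digits in accession: " + accession)
    # Split the nine digits into three directory names of three digits each.
    subdirs = "/".join(digits[i:i + 3] for i in range(0, 9, 3))
    stem = accession + "_" + assembly_name
    folder = BASE + "/" + prefix + "/" + subdirs + "/" + stem
    return {
        "genome": folder + "/" + stem + "_genomic.fna.gz",
        "annotation": folder + "/" + stem + "_genomic.gtf.gz",
    }
```

With `ncbi_reference_urls("GCF_019923935.1", "NDDB_SH_1")` the result reproduces the two links listed above, which can then be pasted into the Upload tool.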

You should be able to import those two files by URL into a Galaxy history using the Upload tool with all default settings. Then, run some data cleanup to get the format into a very simple, basic specification. An example where I did this for another genome is here →

I did two things:

Ran NormalizeFasta on the reference genome fasta to remove the description content from the > title lines. This isolates the chromosome identifiers in a way STAR and many other tools will expect.
Ran Select to remove the # header lines from the reference annotation. Data providers include header lines for provenance reasons but many (most?) tools expect a stricter format that does not include any headers. So, remove them to avoid errors. You can keep a copy of the original file for your records if you need to check or cite any of that header information later.
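The two cleanup steps above can be sketched in Python for readers working outside Galaxy. This is not how NormalizeFasta or Select are implemented; it only mimics the two effects described here (trimming each > title line to its first word, and dropping lines that start with #). The function names and the example header are hypothetical.

```python
from typing import Iterable, Iterator


def simplify_fasta_headers(lines: Iterable[str]) -> Iterator[str]:
    """Keep only the identifier (first word) on each '>' title line."""
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            # Everything after the first whitespace is description text.
            yield ">" + (tokens[0] if tokens else "") + "\n"
        else:
            yield line


def drop_comment_lines(lines: Iterable[str]) -> Iterator[str]:
    """Skip header/comment lines that start with '#'."""
    for line in lines:
        if not line.startswith("#"):
            yield line
```

For example, a title line such as `>seq1 some description text` becomes `>seq1`, and sequence lines pass through unchanged. Both functions accept any iterable of lines, so an open file handle works as input.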

More details about what I am suggesting → FAQ: Extended Help for Differential Expression Analysis Tools

And, if you really want to use GFF3 data instead, that is possible, but has scientific considerations since different data points will be used for the summaries. If interested, this is one topic where that is explored. → Featurecounts error using a gene annotation from a gff3 file - #2 by jennaj.

Please give that a try.

rkd · January 9, 2025, 1:46pm

Thank you very much. Feature count worked. But one more query, please. I used the domestic water buffalo GTF for feature count, but now that I am proceeding to annotate my IDs, the Organism column has no built-in option for buffalo, and I have to select Bos taurus, which is a different species from buffalo. Is this going to affect my analysis? If yes, please explain the solution in a simple way. With thanks in anticipation.

jennaj · January 9, 2025, 8:45pm

Hi @rkd

Glad you have Featurecounts working!

Then for AnnotateMyIDs, if your exact species assembly is not supported with a native index, then it will not work for you.

However, you could use this tool instead, using your DESeq2 output and annotation file →

Annotate DESeq2/DEXSeq output tables: Append annotation from GTF to differential expression tool outputs
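To show the idea behind appending GTF annotation to a results table, here is a rough Python sketch. It is not the Galaxy tool's code. It assumes the first column of the results table holds the same gene identifier as the `gene_id` attribute in the GTF, and that gene rows carry `gene` and `gene_biotype` attributes; check your own GTF, since attribute names vary between providers. All function names are invented for this example.

```python
import re

# Matches GTF attribute pairs written as: key "value"
ATTR = re.compile(r'(\w+) "([^"]*)"')


def gene_info_from_gtf(gtf_lines, feature="gene", keys=("gene", "gene_biotype")):
    """Map gene_id to the chosen attribute values, using rows of one feature type."""
    info = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        attrs = dict(ATTR.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id:
            info[gene_id] = [attrs.get(k, "NA") for k in keys]
    return info


def annotate_rows(rows, info, n_extra):
    """Append annotation columns to each row; unmatched IDs get 'NA'."""
    for row in rows:
        yield list(row) + info.get(row[0], ["NA"] * n_extra)
```

Because the lookup uses the same GTF that was given to the counting step, no built-in organism database is needed, which is why this approach suits species without a native index.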

Please give that a try!

Shaadi_Mehr · March 13, 2025, 8:45pm

Thank you, but the first link is broken. Is there a tutorial on downloading fungal RNA-seq data, preparing it, masking the repeats, and getting it ready for the annotation workflow?

jennaj · March 13, 2025, 10:09pm

Hi @Shaadi_Mehr

Which link is broken? We can fix that!

The RNA-seq tutorials can be found here. Since you have a reference genome, either the introduction or end-to-end series are good starting places.

Transcriptomics / Tutorial List

Hope this helps!

Shaadi_Mehr · March 13, 2025, 11:09pm

I wish we could have a data repository with extra data files for every tutorial, different from the ones used in the tutorial itself. Something to be used as an assignment.

Shaadi_Mehr · March 13, 2025, 11:59pm

Thank you Jen.

I meant that I want to download another fungal genome fasta file and its related RNA-seq data, and repeat the steps with Funannotate on a new dataset. How can I find a BioProject that has both data types, so I can run the same pipeline on new data?

jennaj · March 14, 2025, 2:37am

Hi @Shaadi_Mehr

Great, thanks for clarifying more.

Multiple reference datasets for tutorials is an interesting idea. You could propose this to the GTN as a new idea. Or, let me know if you want help doing that and I’ll get it started and link back. More data sharing across instructor groups has been discussed for a while. I think some of the regional communities do this but I don’t know the details, since they involve checking in with each individually, and change over time. A ticket about the idea might be enough to learn more.

For immediate use, any RNA-seq dataset could be appropriate. Maybe try a literature search to narrow it down? Or start with a search at SRA and find data that way?

Shaadi_Mehr · March 14, 2025, 8:15am

Thanks a lot. I can do the SRA search, but I would like to take the lead, create a group, and build an inventory of reference datasets and already prepared experimental-condition data that teachers can use.

Please email me to include me in a working group for instructors.

Best

jennaj · March 14, 2025, 4:02pm

Great, @Shaadi_Mehr your help would be very much appreciated! I’m going to direct message you here about this a bit more.
