Formatting annotated genomes for submission to NCBI

This topic may be a little outside of the ones normally on Galaxy Help – sorry, but I have been scouring the web and searched BioStars with no luck. I’ve just spent a number of years assembling, annotating, and analyzing genomes. Now I’m battling table2asn to produce sqn files fit for submission to NCBI. Of course, there are errors, but no clear way of dealing with the errors, so any ideas will be more than welcomed:

  • Does Galaxy Europe have a converter tool for producing sqn files, either table2asn (converting gff3 + fasta to sequin) or GB2sequin (converting genbank to sequin)? I’ve tried looking around without any luck.
  • My gff3 files have “Ontology_term” and the sqn files want “product”, so all my carefully added ontology goes out the window and all the thousands of genes are called “hypothetical protein”. I thought maybe funannotate could help fix a gff3 that’s already annotated but I don’t yet see if this is possible. Also, my TE annotations have “classification”.
  • There are errors such as SEQ_FEAT.Range, etc. You can find a list of what they mean here, but the point is that there should be a tool that fixes them. The -c f flag in table2asn does not seem to, as far as I can tell. Is there a tool to help with this?

Thanks for any ideas!

Hi @jaredbernard

Yes, the default submission is created by the standardized processes. This is different from curation activities (which it sounds like you have been working on!).

Galaxy doesn’t have a replacement for the recommendations from NCBI here → Prokaryotic and Eukaryotic Genomes Submission Guide. Or, at least not that I know of! I shared your question with some others to see what they think, so you may get another reply!

If you want to share back whether you are working on prokaryotes or eukaryotes that might be helpful. And, I think you can get help with prokaryote annotation from NCBI but please double check me at the guide above!

You could also explore a tool like this one. It is experimental only (so purely for exploratory reasons!), but perhaps comparing the content between these output files (run on a small sample you have) helps to understand the expected file content for a submission?

  • NCBI EGAPx annotates eukaryotic genomes

Let’s start there, thanks! :slight_smile:



My genomes are for eukaryotes, but they are already annotated. I don’t want to start over on annotation and I don’t want to submit my genomes without annotation after all that work. The pipeline I used was standard: Maker, EDTA, Blast2GO. Yet it seems that my gff3s are not formatted in a way that works with table2asn. As I said, my gff3s have Ontology_terms whereas the NCBI sqn files require “products”. So I need tools to help with formatting, but haven’t found anything yet. Thanks for giving it some thought.

Hi @jaredbernard

Thanks for explaining more! From here, I would suggest writing to NCBI for clarification on how they would prefer the curated ontology annotation to be handled.

Your question will be about how to supply additional GFF3 attributes and to confirm which base attributes are required, or dependencies between attributes, or what the minimum set is.

  • You have Ontology_term=NNNN.
  • But you probably also need: gene and CDS, and possibly mRNA and exon.
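
Before writing to NCBI, it may help to take stock of which feature types and attributes the GFF3 actually contains. Here is a minimal Python sketch of that check. The function names and the example records are made up for illustration (only the GO identifier is borrowed from the readme excerpt further down), and this is not part of table2asn or any NCBI tool:

```python
from collections import Counter
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 percent-encodes reserved characters in values.
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def summarize_gff3(lines):
    """Count feature types and the attribute keys seen on each type."""
    type_counts = Counter()
    keys_by_type = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip anything that is not a 9-column feature line
        ftype = cols[2]
        type_counts[ftype] += 1
        keys = parse_attributes(cols[8]).keys()
        keys_by_type.setdefault(ftype, Counter()).update(keys)
    return type_counts, keys_by_type


# Hypothetical records, only to show the shape of the report.
example = [
    "##gff-version 3\n",
    "chr1\tmaker\tgene\t240\t4084\t.\t+\t.\tID=g1;Name=SDE3\n",
    "chr1\tmaker\tmRNA\t240\t4084\t.\t+\t.\tID=g1.t1;Parent=g1;Ontology_term=GO:0005783\n",
    "chr1\tmaker\tCDS\t579\t1361\t.\t+\t0\tID=g1.t1.cds;Parent=g1.t1\n",
]
counts, keys = summarize_gff3(example)
for ftype in counts:
    print(ftype, counts[ftype], sorted(keys[ftype]))
```

For a real file, passing `open("your_annotation.gff3")` instead of `example` should work, since the function only iterates over lines. If no mRNA or CDS rows list a product key, that would be consistent with every protein ending up as "hypothetical protein".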

The term “product” is defined here under FEATURE TABLE FORMAT

The feature table specifies the location and type of biological features. table2asn will process the feature intervals and translate coding region (CDS) features into proteins.

The first and second columns are the start and stop locations of the feature, respectively, the third column is the type of feature (the feature key, e.g., gene, mRNA, CDS), the fourth column is a qualifier name (e.g., "product"), and the fifth is a qualifier value (e.g., the name of the protein or gene). A simple example is:

  >Feature sde3n
  240     4084    gene
                          gene       SDE3
  240     1361    mRNA
  1450    1641
  1730    3184
  3275    4084
                          product    RNA helicase SDE3
  579     1361    CDS
  1450    1641
  1730    3184
  3275    3880
                          product    RNA helicase SDE3

A simple example of a gene that is on the minus strand and is partial at its 3' end is:

  >Feature eno3
  1018    >1      gene
                          gene       ENO3
  1018    >1      CDS
                          product    beta-enolase
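
To make the five-column layout concrete, here is a small Python sketch that writes a table in this shape from a list of features. The `feature_table` function and its input structure are assumptions made for illustration; the only thing taken from the readme is the column layout (start, stop, feature key, then qualifier name and value on their own tab-indented lines):

```python
def feature_table(seq_id, features):
    """Render features as a tab-delimited, five-column feature table."""
    lines = [f">Feature {seq_id}"]
    for feat in features:
        intervals = feat["intervals"]
        first_start, first_stop = intervals[0]
        # The first interval carries the feature key in column 3.
        lines.append(f"{first_start}\t{first_stop}\t{feat['type']}")
        # Remaining intervals (e.g. further exons) only have start and stop.
        for start, stop in intervals[1:]:
            lines.append(f"{start}\t{stop}")
        # Qualifiers sit in columns 4 and 5, after three empty columns.
        for name, value in feat.get("qualifiers", []):
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


# The SDE3 example from the readme, expressed as input data.
sde3 = [
    {"type": "gene", "intervals": [(240, 4084)],
     "qualifiers": [("gene", "SDE3")]},
    {"type": "mRNA",
     "intervals": [(240, 1361), (1450, 1641), (1730, 3184), (3275, 4084)],
     "qualifiers": [("product", "RNA helicase SDE3")]},
    {"type": "CDS",
     "intervals": [(579, 1361), (1450, 1641), (1730, 3184), (3275, 3880)],
     "qualifiers": [("product", "RNA helicase SDE3")]},
]
table_text = feature_table("sde3n", sde3)
print(table_text)
```

Coordinates are formatted as given, so a partial end can be passed as a string such as ">1", as in the ENO3 example.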

Be sure to review how ontology terms are used. The readme above calls them out as shown below, with their own qualifier names. These are different from the GFF3 Ontology_term attribute, so if Ontology_term ended up as the qualifier name (4th column) in the tabular input, that might explain your error. Or, maybe you need to add in the product line first (name of the protein)?

Gene Ontology (GO) terms can be indicated with the following qualifiers:

                          go_component        endoplasmic reticulum|0005783
                          go_process          glycolysis and gluconeogenesis|57|89197757|ACT,TEM
                          go_function         excision repair|93||IPD
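
Here is a rough Python sketch of turning a GO annotation into one of these qualifier lines. The field order (term text, GO number, PubMed ID, evidence codes, separated by "|") is inferred from the three examples above rather than from a formal spec, so please check it against the readme. The `go_qualifier` function and the aspect mapping are hypothetical helpers. A GFF3 Ontology_term value only holds the GO ID, so the term text and aspect would have to be looked up elsewhere, for example in the GO ontology file.

```python
# Assumed mapping from GO aspect names to the qualifier names shown above.
ASPECT_TO_QUALIFIER = {
    "cellular_component": "go_component",
    "biological_process": "go_process",
    "molecular_function": "go_function",
}


def go_qualifier(aspect, term_text, go_id, pubmed_id="", evidence=()):
    """Build a (qualifier name, qualifier value) pair for one GO term."""
    qualifier = ASPECT_TO_QUALIFIER[aspect]
    # The examples above show the GO number without a "GO:" prefix.
    number = go_id[3:] if go_id.upper().startswith("GO:") else go_id
    fields = [term_text, number, str(pubmed_id), ",".join(evidence)]
    # Drop trailing empty fields but keep empty ones in the middle.
    while fields and fields[-1] == "":
        fields.pop()
    return qualifier, "|".join(fields)


name, value = go_qualifier(
    "cellular_component", "endoplasmic reticulum", "GO:0005783"
)
print(f"\t\t\t{name}\t{value}")
```

The returned pair can go straight into columns 4 and 5 of the feature table, under the CDS it belongs to.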

I see your same question over at Biostars. I’m going to link it in case you do get a reply, and someone else searches here in the future about the same question. → https://www.biostars.org/p/9619529/

It is hard to guess more but I hope this helps! :slight_smile:

Thank you for these helpful tips, @jennaj! I also got some potentially good advice on Biostars. I will see if I can get these steps to work. I did try corresponding with NCBI as you suggested and didn’t get a reply, but I will let you know how this goes. In the meantime, if any other tools are useful, I’m still very interested.

Great, it looks like they noticed similar items and offered some help to solve them! Hope this works out! :slight_smile: