Genome annotation statistics error

Hi again, I have detected a new error at usegalaxy.org. This time it is with Genome annotation statistics.

Please import the updated history SVQ_ERROR. I used the GFF3 output from funannotate and the masked reference genome output from RepeatMasker.

Thanks for your help.

Best

Hi @H1889

Thanks for sharing the history again.

I see the error, and that usually indicates a data problem. When I run the annotation through gffread first (this can standardize gff formats), then run the tool again, I get a clearer log message. The tool is then complaining about the number of exon features versus gene features being inconsistent.
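For anyone who wants to check this kind of mismatch outside the tool, a quick tally of the feature types in column 3 is often enough to spot it. This is only a minimal sketch in plain Python, not what the Galaxy tool runs. The function name and the sample rows are made up for illustration.

```python
from collections import Counter


def count_feature_types(lines):
    """Tally column 3 (feature type) of GFF3 rows, skipping comments and blanks."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts


# Made-up rows: one protein-coding gene and one tRNA gene.
SAMPLE = [
    "##gff-version 3\n",
    "ctg1\tdemo\tgene\t1\t100\t.\t+\t.\tID=g1;\n",
    "ctg1\tdemo\tmRNA\t1\t100\t.\t+\t.\tID=g1-T1;Parent=g1;\n",
    "ctg1\tdemo\texon\t1\t100\t.\t+\t.\tID=g1-T1.exon1;Parent=g1-T1;\n",
    "ctg1\tdemo\tgene\t200\t270\t.\t-\t.\tID=g2;\n",
    "ctg1\tdemo\ttRNA\t200\t270\t.\t-\t.\tID=g2-T1;Parent=g2;\n",
    "ctg1\tdemo\texon\t200\t270\t.\t-\t.\tID=g2-T1.exon1;Parent=g2-T1;\n",
]

for feature_type, n in sorted(count_feature_types(SAMPLE).items()):
    print(feature_type, n)
```

To run it on a real file, pass an open file handle instead of `SAMPLE`. More gene rows than mRNA rows is a hint that some genes carry other transcript types.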

Are you sure this file is complete? Was it generated in Galaxy, or on the command line and then uploaded? If you want to share the complete tool progression path, I might be able to better see where the data issue was introduced.

Hope this helps! :slight_smile:

Thank you for your help.

The gff3 file is the output of galaxy’s funannotate predict.

Best regards

I’m wondering if the problem isn’t with the file or the program that uses it, but rather a general problem with the server. I have opened another post (“Large file downloads are truncated”) where I indicate that large files cannot be downloaded completely; after about 40 MB, the download stops, and perhaps that is the problem: for some reason, files of a certain size cannot be handled correctly by the server. Perhaps Genome Annotate Statistics cannot read the GFF3 file completely either. Could that be the cause of the error?

Greetings

Hi @H1889

This issue was unrelated to the problems during the Sept 27-29th time frame.

As a cross test last week, I attempted to run your data against the clusters hosted at UseGalaxy.eu. That job also fails, and for the same reason. However, if I run Helixer at EU on the original genome fasta, that Genome Annotation job is successful.

I just took a closer look at your gff3 data, and think I isolated the specific immediate problem. More may be going on, but getting over this issue is where to start. I’ll walk through the details here.

Galaxy command line (find this on the job’s Details view using the i-icon)

ln -s '/corral4/main/objects/f/d/f/dataset_fdfa69e1-e5b7-4069-a24f-d58c2b621872.dat' 'input.gff' && python -m jcvi.annotation.stats genestats 'input.gff' > '/corral4/main/jobs/070/955/70955718/outputs/dataset_7390b3aa-e935-44a9-807a-fd53bd79f86b.dat' && python -m jcvi.annotation.stats summary 'input.gff' '/corral4/main/objects/1/1/6/dataset_11639a65-cd34-4cf9-bd7d-b8cf60bc2d03.dat' 2>&1 | tail -n +3 >> '/corral4/main/jobs/070/955/70955718/outputs/dataset_7390b3aa-e935-44a9-807a-fd53bd79f86b.dat' && python -m jcvi.annotation.stats stats 'input.gff' 2>&1 | grep Mean >> '/corral4/main/jobs/070/955/70955718/outputs/dataset_7390b3aa-e935-44a9-807a-fd53bd79f86b.dat' && python -m jcvi.annotation.stats histogram 'input.gff' && pdfunite *.input.pdf '/corral4/main/jobs/070/955/70955718/outputs/dataset_54e17c40-773c-433f-a65e-61a0765944da.dat'

The base tool is hosted here → GitHub - tanghaibao/jcvi: Python library to facilitate genome assembly, annotation, and comparative genomics

The first module, jcvi.annotation.stats genestats, is triggering the error → jcvi/src/jcvi/annotation/stats.py at 1d66cac3a43a5042ccd8d7998a21131cadcb427e · tanghaibao/jcvi · GitHub

09:05:02 [gff] Indexing input.gff
09:05:22 [base] Load file transcript.sizes
09:05:22 [base] Imported 11653 records from transcript.sizes.
09:05:22 [base] Load file transcript.sizes
09:05:22 [base] Imported 11653 records from transcript.sizes.
09:05:22 [stats] A total of 11653 transcripts populated.
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/site-packages/jcvi/annotation/stats.py", line 355, in <module>
    main()
  File "/usr/local/lib/python2.7/site-packages/jcvi/annotation/stats.py", line 56, in main
    p.dispatch(globals())
  File "/usr/local/lib/python2.7/site-packages/jcvi/apps/base.py", line 96, in dispatch
    globals[action](argv[2:])
  File "/usr/local/lib/python2.7/site-packages/jcvi/annotation/stats.py", line 176, in genestats
    conf_class = conf_classes[transcripts[0]]
IndexError: list index out of range

What is happening:

  1. The tool first counts the mRNA features and generates some statistics (lengths).
  2. Next, it reviews the gene features and exon features, reconciles them against the mRNA features, and generates a few more statistics.
  3. However, your gff3 file contains gene feature blocks like this, where the only child of the gene is a tRNA rather than an mRNA:
Seqid Source Type Start End Score Strand Phase Attributes
contig_1 funannotate gene 426694 426765 . - . ID=ASPNIG_000108;
contig_1 funannotate tRNA 426694 426765 . - . ID=ASPNIG_000108-T1;Parent=ASPNIG_000108;product=tRNA-Ala;
contig_1 funannotate exon 426694 426765 . - . ID=ASPNIG_000108-T1.exon1;Parent=ASPNIG_000108-T1;
  4. This is confusing the tool, and it is failing. The IndexError on transcripts[0] in the traceback is consistent with the tool finding an empty transcript list for a gene like this. It would fail anywhere with this input.
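To see how many genes in a file are affected, something like the sketch below could list every gene that has no mRNA child. It is an illustration under assumptions, not part of jcvi: the helper names are invented, the columns are assumed to be tab-separated as in standard GFF3, and the protein-coding gene `EXAMPLE_000001` in the sample is hypothetical (only the tRNA rows come from the file above).

```python
def parse_attributes(field):
    """Turn a GFF3 column 9 string like 'ID=x;Parent=y;' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def genes_without_mrna(lines):
    """Return IDs of gene features that no mRNA feature names as its Parent."""
    gene_ids = []          # gene IDs in file order
    genes_with_mrna = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene" and "ID" in attrs:
            gene_ids.append(attrs["ID"])
        elif cols[2] == "mRNA":
            genes_with_mrna.update(p for p in attrs.get("Parent", "").split(",") if p)
    return [gene_id for gene_id in gene_ids if gene_id not in genes_with_mrna]


SAMPLE = [
    # Hypothetical protein-coding gene, added only for contrast.
    "contig_1\tfunannotate\tgene\t1000\t2000\t.\t+\t.\tID=EXAMPLE_000001;\n",
    "contig_1\tfunannotate\tmRNA\t1000\t2000\t.\t+\t.\tID=EXAMPLE_000001-T1;Parent=EXAMPLE_000001;\n",
    # tRNA gene block as it appears in the shared file.
    "contig_1\tfunannotate\tgene\t426694\t426765\t.\t-\t.\tID=ASPNIG_000108;\n",
    "contig_1\tfunannotate\ttRNA\t426694\t426765\t.\t-\t.\tID=ASPNIG_000108-T1;Parent=ASPNIG_000108;product=tRNA-Ala;\n",
    "contig_1\tfunannotate\texon\t426694\t426765\t.\t-\t.\tID=ASPNIG_000108-T1.exon1;Parent=ASPNIG_000108-T1;\n",
]

print(genes_without_mrna(SAMPLE))
```

On the sample, only the tRNA gene is reported. On a real file, read the lines from an open file handle instead of `SAMPLE`.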


What to do

Removing these tRNA lines (all associated features – gene, tRNA, exon) will avoid the immediate problem.
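If you would rather script the removal than use text manipulation tools, a rough sketch in plain Python might look like the following. It assumes tab-separated GFF3 with `ID` and `Parent` attributes as in the rows above, and it drops any gene that has a tRNA child along with everything under that tRNA. The function names are invented for illustration, and the `EXAMPLE_000001` rows are hypothetical.

```python
def parse_row(line):
    """Return (feature_type, attributes) for a feature row, or None otherwise."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return None
    attrs = {}
    for item in cols[8].split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return cols[2], attrs


def drop_trna_models(lines):
    """Remove tRNA features, their parent genes, and their child exons."""
    lines = list(lines)
    drop_ids = set()
    # Pass 1: collect the IDs of every tRNA feature and of its parent gene.
    for line in lines:
        row = parse_row(line)
        if row and row[0] == "tRNA":
            attrs = row[1]
            if "ID" in attrs:
                drop_ids.add(attrs["ID"])
            drop_ids.update(p for p in attrs.get("Parent", "").split(",") if p)
    # Pass 2: keep comments and any row not tied to a collected ID.
    kept = []
    for line in lines:
        row = parse_row(line)
        if row:
            attrs = row[1]
            parents = attrs.get("Parent", "").split(",")
            if attrs.get("ID") in drop_ids or any(p in drop_ids for p in parents):
                continue
        kept.append(line)
    return kept


SAMPLE = [
    "##gff-version 3\n",
    # Hypothetical protein-coding gene that should survive the filter.
    "contig_1\tfunannotate\tgene\t1000\t2000\t.\t+\t.\tID=EXAMPLE_000001;\n",
    "contig_1\tfunannotate\tmRNA\t1000\t2000\t.\t+\t.\tID=EXAMPLE_000001-T1;Parent=EXAMPLE_000001;\n",
    "contig_1\tfunannotate\texon\t1000\t2000\t.\t+\t.\tID=EXAMPLE_000001-T1.exon1;Parent=EXAMPLE_000001-T1;\n",
    # tRNA gene block from the shared file, which should be removed.
    "contig_1\tfunannotate\tgene\t426694\t426765\t.\t-\t.\tID=ASPNIG_000108;\n",
    "contig_1\tfunannotate\ttRNA\t426694\t426765\t.\t-\t.\tID=ASPNIG_000108-T1;Parent=ASPNIG_000108;product=tRNA-Ala;\n",
    "contig_1\tfunannotate\texon\t426694\t426765\t.\t-\t.\tID=ASPNIG_000108-T1.exon1;Parent=ASPNIG_000108-T1;\n",
]

print("".join(drop_trna_models(SAMPLE)), end="")
```

The two passes mean the order of rows in the file does not matter. Write the returned lines to a new file and upload that to the history as gff3, keeping the original dataset untouched.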

You could also go into a Jupyter Notebook, load the package, and run these tools directly (all modules). Moving data out of and back into a Galaxy history is part of this. :graduation_cap: GTN tutorials for Jupyter Notebook.

This is the reformat module. It doesn’t have a Funannotate-specific conversion, but it may still be useful for seeing what format is expected → jcvi/src/jcvi/annotation/reformat.py at 1d66cac3a43a5042ccd8d7998a21131cadcb427e · tanghaibao/jcvi · GitHub

I made a request to see if it could be wrapped for Galaxy since it does more than just the tRNA reformatting, although to make it useful for your specific data, you may want to try the tRNAscan module instead! → Request: wrap jcvi_gff_stats reformat.py as a standalone tool + add as a preprocessing option to jcvi_gff_stats (*) · Issue #7317 · galaxyproject/tools-iuc · GitHub

There is also another tool package that uses the reformat.py script (all these utilities are nested!) that you may find interesting. See in the tool panel at EU → Fix tRNA model. It parses the output of tRNA prediction (tRNAscan) and tRNA and tmRNA prediction (Aragorn). All of these tools use ever so slightly different gff3 formats but hopefully explain more about what to look for if an error comes up again.

So, try reformatting with other text manipulation tools and consider comparing with Jupyter and the expanded package directly to learn what these tools are expecting.

I hope this helps you to understand what is going on! Please let us know if you have any questions! :slight_smile:

Thank you very much for your detailed help.

I will try all your recommendations.

Greetings
