Train Augustus doesn't work

I am trying to use the Train Augustus tool, but it reported the following error: Error: training set file /mnt/scratch/job_working_directory/008/978/8978590/working/genome.gff3 has neither Genbank nor GFF nor FASTA format!
Could anyone help me? Thank you very much!

Chuan

Welcome, @chuan_zhai

It sounds like the input for the reference data has a format problem. You’ll need to investigate why, then fix it, then try a rerun until resolved.

This guide has links in the “references” section to examples of what these data should contain → Reference genomes at public Galaxy servers: GRCh38/hg38 example. Even if you are not working with human, these can help.

Then, if you cannot solve the problem and want help, you can post back some of your data or the shared history and we can offer advice. How to share the parts of your data that will allow others to give specific feedback is below. We like details! :slight_smile:

I also added some tags to your post that contain what worked to solve this kind of reference data problem for others.

Let’s start there!

Hi jennaj, thanks for your help. However, the problem still exists. I have converted the GFF to GTF, but it still didn’t work for Augustus training. Here is my history: Galaxy | Australia. Please feel free to have a look and see where the problem comes from.

Cheers,

Chuan

Hi @chuan_zhai

Thanks for sharing the history, super helpful!

It seems that some term groups are missing the final ;
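
If it helps, here is a minimal Python sketch for spotting such lines outside Galaxy. It assumes a standard 9-column, tab-separated GTF where the last column holds the attributes. The function name and the example lines are made up for illustration and are not taken from the shared history.

```python
def lines_missing_final_semicolon(gtf_text):
    """Return (line_number, feature_type) for GTF feature lines whose
    attribute column (9th column) does not end with ';'."""
    problems = []
    for number, line in enumerate(gtf_text.splitlines(), start=1):
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        # Lines without exactly 9 tab-separated columns are reported separately.
        if len(columns) != 9:
            problems.append((number, "not 9 columns"))
            continue
        if not columns[8].rstrip().endswith(";"):
            problems.append((number, columns[2]))
    return problems


# Hypothetical two-line GTF: the transcript line lacks the final ';'.
example = (
    'chr1\tmaker\ttranscript\t100\t900\t.\t+\t.\ttranscript_id "t1"; gene_id "g1"\n'
    'chr1\tmaker\texon\t100\t400\t.\t+\t.\ttranscript_id "t1"; gene_id "g1";\n'
)
print(lines_missing_final_semicolon(example))  # [(1, 'transcript')]
```

The feature type in each result shows whether the problem is limited to one kind of line (for example only transcript lines).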

This is one format specification you can compare against → Genome Browser FAQ

And these tutorials have example data manipulations → Hands-on: Data Manipulation Olympics / Data Manipulation Olympics / Foundations of Data Science

Hope this helps!

Hi @chuan_zhai,

As @jennaj said, it sounds like the input reference data has a format problem. I could not figure out what caused this issue, but it can be fixed using the gffread tool in Galaxy. gffread is a GFF/GTF converter. Use it on the Final annotation from Maker and specify the output format as GFF. Make sure you do not modify or filter the data. The output file from gffread should have the GFF3 datatype. You’ll see some obvious changes, such as the appearance of a couple of comment lines and the absence of empty comment lines. My test Train Augustus jobs with the “fixed” GFF3 file have been running for twenty minutes now, while with the original GFF3 file the jobs failed in seconds.
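
To compare the original and the gffread output outside Galaxy, a rough Python sketch like the one below can count the line types in each file. It is only an illustration: the function name and the example text are hypothetical, and "empty comment" is assumed here to mean a line containing nothing but a single `#`.

```python
def summarize_gff(text):
    """Count line types in GFF text: comments, empty comments ('#' only),
    blank lines, 9-column feature lines, and anything else (malformed)."""
    counts = {"comment": 0, "empty_comment": 0, "blank": 0, "feature": 0, "malformed": 0}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped == "#":
            # Assumption: an "empty comment line" is a lone '#'.
            counts["empty_comment"] += 1
        elif stripped.startswith("#"):
            counts["comment"] += 1
        elif len(line.split("\t")) == 9:
            counts["feature"] += 1
        else:
            counts["malformed"] += 1
    return counts


# Hypothetical snippet with a header, two empty comment lines, and one feature.
original = "##gff-version 3\n#\nchr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n#\n"
print(summarize_gff(original))
```

Running it on both files and comparing the two dictionaries gives a quick view of what changed besides the feature lines themselves.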

@jennaj, we see similar errors occasionally with Train Augustus, but it is a mystery to me why it happens on some assemblies. Maybe a note about the gffread fix can be added to the Genome annotation with Maker tutorial. I don’t know if the fix is universal or not, and my test jobs are not finished yet, so, technically, I cannot call it a “fix” at this point, but it does address the “not GFF format” issue.

Kind regards,
Igor

This is a good idea! I’ve seen a few GTFs that are missing that final ; and it seems to affect only the “transcript” feature (3rd column) lines with a “transcript_id” attribute, but I haven’t looked or tested in enough detail yet to figure out where that is coming from. Maybe a wrapper bug… or something that can be adjusted in the wrapper if that is from Maker directly. More soon and thanks for the workaround!!

Update: the Train Augustus jobs with “fixed by gffread” annotation file were successful.
Kind regards,
Igor

Hi @igor

So confusing!

I found that when converting gff3 > gtf with gffread, the transcript features in the output are also missing the final ;
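
Until that is sorted out, one possible stopgap for a GTF with this problem is to append the missing `;` yourself. The Python sketch below is hypothetical and untested against Train Augustus. It assumes 9-column, tab-separated feature lines and leaves comments, blank lines, and anything with a different column count untouched. The confirmed workaround in this thread is still gffread with GFF output.

```python
def add_final_semicolons(gtf_text):
    """Append ';' to the attribute column of GTF feature lines that lack it."""
    fixed = []
    for line in gtf_text.splitlines():
        columns = line.split("\t")
        is_feature = bool(line.strip()) and not line.startswith("#") and len(columns) == 9
        if is_feature and not columns[8].rstrip().endswith(";"):
            columns[8] = columns[8].rstrip() + ";"
            line = "\t".join(columns)
        fixed.append(line)
    # Keep the trailing newline only if the input had one.
    return "\n".join(fixed) + ("\n" if gtf_text.endswith("\n") else "")


# Hypothetical transcript line without the final ';'.
broken = 'chr1\tmaker\ttranscript\t100\t900\t.\t+\t.\ttranscript_id "t1"; gene_id "g1"\n'
print(add_final_semicolons(broken), end="")
```

Lines that already end with `;` are returned unchanged, so running it twice gives the same result as running it once.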

Last week I created a ticket for that correction here: Bug with gffread: converting gff3 to gtf results in transcript lines missing the final ; · Issue #6310 · galaxyproject/tools-iuc · GitHub

Hopefully the IUC can help us to figure out exactly where things are going wrong! Seems like the format problem is both introduced and fixed by gffread with different uses!