incomplete annotation with maker

joyhenart · July 4, 2025, 2:23pm

Hi
I’ve run Maker with a fasta genome (581 scaffolds, 5 chromosomes and 576 unplaced).
Maker successfully annotated my genome with ncbi ESTs and swissprot proteins.
However, I was not able to train Augustus and, among other things, I re-ran Maker with a modified fasta file in which I shortened the names (removed everything after the first whitespace), set lines to 100 bp (normalized genome) and removed lowercase letters (awk text reformatting).
Now I only get a single chromosome annotated.
Can you tell me what I’m doing wrong?
Thank you

jennaj · July 11, 2025, 11:46pm

Welcome @joyhenart

Very odd that only the first chromosome was annotated.

I wonder if something went wrong with modifying the fasta. Next time, you can use a tool like NormalizeFasta to clip off description content.
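
For reference, here is a rough Python sketch of what that kind of normalization does. It is not the NormalizeFasta code; the function name `normalize_fasta` and the `width` parameter are made up for illustration. One thing to keep in mind: uppercasing the sequence also throws away any soft-masking that was encoded as lowercase.

```python
def normalize_fasta(text, width=100):
    """Return FASTA text with first-word IDs, uppercase sequence, fixed line width.

    Hypothetical helper for illustration only.
    """
    records = []  # list of (identifier, [sequence pieces])
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the title line.
            fields = line[1:].split()
            records.append((fields[0] if fields else "", []))
        elif records:
            # Uppercasing removes lowercase soft-masking.
            records[-1][1].append(line.upper())

    out = []
    for ident, pieces in records:
        seq = "".join(pieces)
        out.append(">" + ident)
        # Re-wrap the sequence at the requested line width.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

Comparing the number of `>` lines before and after this step is a quick way to confirm that no sequences were lost.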

Or, maybe that is what you did, but all of the remaining parts of the > title lines are now the same? Each sequence will need to have a unique identifier for tools to recognize them as distinct content.
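
A minimal sketch of that uniqueness check, assuming the identifier is whatever comes before the first whitespace on each title line. The function name `duplicate_ids` and the example identifiers are hypothetical, not taken from your file.

```python
from collections import Counter


def duplicate_ids(title_lines):
    """Return {identifier: count} for identifiers that occur more than once.

    The identifier is taken as the first word after '>' (an assumption
    about how the downstream tools read the title line).
    """
    ids = []
    for line in title_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if fields:
            ids.append(fields[0])
    counts = Counter(ids)
    return {ident: n for ident, n in counts.items() if n > 1}


# Hypothetical title lines: two records collapse to the same first word.
example = [">seqA chromosome 1", ">seqA chromosome 2", ">seqB scaffold"]
print(duplicate_ids(example))
```

An empty result means every sequence still has a distinct identifier after clipping.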

There are ways to do this that we can help with, but we would need to see the original sequence identifiers first to come up with a strategy. You can extract these from the fasta file with the tool/options here, paste them back, and we can try to help.

Select tool
choice: Matching
regular expression: ^>
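
If you happen to have the file outside of Galaxy, the same selection can be sketched in a few lines of Python. This only mirrors the idea of matching lines against `^>`; the function name `title_lines` is made up for this example.

```python
import re

TITLE = re.compile(r"^>")  # same pattern as in the Select tool settings above


def title_lines(fasta_text):
    """Return only the lines of a FASTA text that start with '>'."""
    return [line for line in fasta_text.splitlines() if TITLE.search(line)]
```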

Please let us know if you solved this already or need more help. I think others had trouble helping. Next time, screenshots and/or sharing your history is the fastest way to share the context of the problem with others in a way that they can advise without guessing too much.

joyhenart · July 12, 2025, 11:09am

Dear Jennifer
Thank you so much for the answer
I’m sorry, I shared the history and as the days passed I kept deleting and purging every failed run.
I was trying to see the expression of some Tritrichomonas foetus genes by using a recently published genome (GCA_041260205.1_FLI_TFoe_1.1_genomic.fna) which is shown as annotated at ncbi. The files recovered at ncbi are gff3,rna,cds,protein,genome,seq-report but the gff3 annotation has just references to gaps.
I’ve been trying to annotate this genome so I started by normalization, then making this sequence my custom genome and using it for maker annotation just by following the galaxy tutorial.
I soon found a problem with Augustus. So I stepped back to normalization: I used 100 letters per line instead of 80, kept only the first word by cutting at the first whitespace (the first word is different for the first 5 sequences, the 5 chromosomes, than for the others, which are scaffolds), and I also made another fasta which only has the first 5 sequences (fasta to tabular, select the first five rows, and tabular to fasta). Every time I made a change I set the new normalized and modified genome as my custom build (can this be the problem?).
I’ve also made changes in my maker annotation by ignoring repeats and by using the curated pfam data (that I presume does not include protist repeats). With this I have a decent first annotation.
My maker annotation of a single chromosome came after keeping only 5 sequences in the fasta reference file. It has 5 sequences with 106 single complete busco genes in 115071593 bp while the former has 581 sequences with 131 complete single buscos in 148700953.
I’ve mentioned I had a single chromosome annotated but in a previous step I had almost all the first chromosome annotated plus a segment of the second one.
At this moment I was using the 5 chromosomes set and I was not using the rnas and proteins files accompanying the tritrichomonas foetus genome at ncbi (datasets download genome accession GCA_041260205.1 --include gff3,rna,cds,protein,genome,seq-report) but multifastas with sequences downloaded from ncbi nucleotide database and proteins from the swissprot.
The ncbi cds and swissprot sequences were fewer than those in the ncbi record accompanying the genome, so I’ve added sequences from Trichomonas vaginalis (a related organism) too. This week I’ve changed these files to get new sequences because I totally failed to run a proper maker. I’ve also run RepeatModeler to have a file with some repeats, but most of the time I got a “0 lines, 0 columns and 2 comments” result. I’ve changed the cds and proteins, and sometimes I had error results because of Ns in the ncbi and/or swissprot sequences, which I could not fix by removing sequences from the fasta files.
I’m sorry, you are totally right if you think this is a mess, because this is a mess.
It feels like Maker is keeping track of my mistakes and making me pay for them.
Thank you very much

jennaj · July 14, 2025, 7:56pm

Funny! This is not a mess, you are doing exactly what anyone else would do to troubleshoot the problem.

For context to remind myself, this is the reference genome assembly and (most) of the original source files, and the tutorial you are following.

From there,

I agree that working with just the primary assembly (ignoring unplaced scaffolds) seems like a good place to start!
Creating custom genomes should be fine. Give each a unique name to avoid content conflicts/mixups. You can always tidy that up at the end. genome1, genome2, genomeN then fall back to just genome once done (viable workflow run, sharing)
Consider starting with just one chromosome, develop your workflow, then try on a few more, tune, then consider the whole thing. These jobs will potentially get very very large, and maybe exceed the computational resources! There are strategies around that but to start off with, getting compute resources and technical issues mixed up will be frustrating.
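
To make the "one chromosome first" idea concrete, here is a small Python sketch that pulls selected records out of a multi-fasta by identifier. It assumes the identifier is the first word of the title line; the function name `extract_records` and the `chr1`/`chr2` style names in the usage note are illustrative.

```python
def extract_records(fasta_text, wanted_ids):
    """Return FASTA text containing only the records whose ID is in wanted_ids.

    Hypothetical helper; the ID is taken as the first word after '>'.
    """
    wanted = set(wanted_ids)
    keep = False
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            # Switch copying on or off at each new record.
            keep = bool(fields) and fields[0] in wanted
        if keep:
            out.append(line)
    return "\n".join(out) + "\n" if out else ""
```

Calling it once per identifier, for example `extract_records(text, ["chr1"])`, gives one small input per chromosome, each of which can then be registered as its own distinctly named custom genome.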

Backing up, do you know what this problem was? Was it just a format issue, and you have that resolved now?

Maybe. Try giving each a distinct name. You might find this useful anyway (example: if you need to process per-chromosome)!

I’m not sure I understand this choice (technically): are you not masking repeats at all? Have you explored repeats in other ways? Oh, I see you did this later, never mind!

This could be a computational issue, or, repeats are getting in your way. This is why I am suggesting to process one chromosome at a time and to use repeat masking. You might learn something interesting about the assembly itself this way, too.

Related cross species is a good idea!

I’m not sure about this, but we could look at it and get clarification from VGP people if needed. Are you sure it is a format issue? Or, do you mean IUPAC characters in the protein sequences are leading to problems? Or, stop codons (*)?
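
One way to narrow that down is to scan the protein fasta and report, per sequence, any character that is not one of the 20 standard amino acids. The sketch below is only a diagnostic under my own assumptions: the names `scan_proteins`, `STANDARD` and `AMBIGUOUS` are invented here, and whether a given tool rejects ambiguity codes or `*` is something to confirm separately.

```python
from collections import Counter

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
AMBIGUOUS = set("BJOUXZ")               # IUPAC ambiguity codes plus U and O


def scan_proteins(fasta_text):
    """Return {sequence_id: {kind: count}} for non-standard characters.

    kind is 'stop' for '*', 'ambiguous' for B/J/O/U/X/Z, 'other' otherwise.
    Sequences made only of standard residues are left out of the report.
    """
    report = {}
    ident = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            ident = fields[0] if fields else ""
            continue
        if ident is None:
            continue  # sequence text before any title line is ignored
        for ch in line.upper():
            if ch in STANDARD:
                continue
            if ch == "*":
                kind = "stop"
            elif ch in AMBIGUOUS:
                kind = "ambiguous"
            else:
                kind = "other"
            report.setdefault(ident, Counter())[kind] += 1
    return {ident: dict(kinds) for ident, kinds in report.items()}
```

An empty report would suggest the characters themselves are fine and the problem lies elsewhere (format, size, or resources).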

Haha. You are doing great!

I don’t want to get in your way so please keep working the problem. But if you get a confusing error and are not sure if it is resources or data, please ask, ok? Right now it seems you need to resolve your strategy for repeats, consider working with smaller chunks of data at a time, use distinct custom genome labels (or, skip this entirely for now), maybe tune up your EST/protein choices, and triple check file formats. I would also suggest working from vgp.usegalaxy.org to help with how the jobs are routing to our clusters.

Hope this helps!

joyhenart · July 21, 2025, 11:21pm

Dear Jennifer
I’ve done a few runs of maker with single chromosomes (chr1 or chr2) and with the whole genome.
With a single-chromosome fasta I got annotations when using RNA files (ESTs; I tested many different sets), both when ignoring repeats (I know it is not advised) and with a RepeatModeler file (containing ~300 consensus sequences). With DFAM repeats I failed to get a single annotated gene (0 lines).
Ignoring repeats I had ~1300 lines with chromosome 1 and ~800 with chromosome 2. By using the RepeatModeler consensus sequences the annotation decreases to ~80 lines and 46 lines (far fewer genes).
I gave up on protein multifasta files: I checked for non-IUPAC characters and stop codons and did not find any. Fixing the Ns gave me “0 lines” too. Subsamples of these protein files also gave me errors. I had errors with proteins aligned with miniprot2 too.
With the chromosome 1 annotation without soft masking I was able to go on to ab-initio predictions and into a second run of Maker. However, I could not get through the second Augustus prediction. The error was that the annotations contained fewer than 100 genes.
I retested the EST files, comparing single chromosomes against the whole genome. Of course, I was not able to get a run with annotations by using protein files, aligned sequences and soft masking with dfam repeats.
I am again left with a single annotated chromosome (out of 5).
The error log states this after starting with chromosome 2:

--Next Contig--

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: chr2
Length: 26153395
#---------------------------------------------------------------------

setting up GFF3 output and fasta chunks
preparing ab-inits
couldn't close /tmp/maker_6YkEod/chr2.abinit_nomask.0
No space left on device at /usr/local/bin/../lib/FastaFile.pm line 60.
--> rank=NA, hostname=galaxy-main-set03-1.novalocal
ERROR: Failed while preparing ab-inits
ERROR: Chunk failed at level:0, tier_type:2
FAILED CONTIG:chr2

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:chr2

examining contents of the fasta file and run log

--Next Contig--

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: chr3
Length: 5999298
#---------------------------------------------------------------------

setting up GFF3 output and fasta chunks
preparing ab-inits
couldn't close /tmp/maker_6YkEod/chr3.abinit_nomask.0
No space left on device at /usr/local/bin/../lib/FastaFile.pm line 60.
--> rank=NA, hostname=galaxy-main-set03-1.novalocal
ERROR: Failed while preparing ab-inits
ERROR: Chunk failed at level:0, tier_type:2
FAILED CONTIG:chr3

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:chr3

examining contents of the fasta file and run log

--Next Contig--

Processing run.log file...
#---------------------------------------------------------------------
Now retrying the contig!!
SeqID: chr2
Length: 26153395
Tries: 2!!
#---------------------------------------------------------------------

setting up GFF3 output and fasta chunks
preparing ab-inits
preparing ab-inits
couldn't close /tmp/maker_6YkEod/chr2.abinit_nomask.1
No space left on device at /usr/local/bin/../lib/FastaFile.pm line 60.
--> rank=NA, hostname=galaxy-main-set03-1.novalocal
ERROR: Failed while preparing ab-inits
ERROR: Chunk failed at level:0, tier_type:2
FAILED CONTIG:chr2

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:chr2

examining contents of the fasta file and run log

--Next Contig--

Processing run.log file...
#---------------------------------------------------------------------
Now retrying the contig!!
SeqID: chr3
Length: 5999298
Tries: 2!!
#---------------------------------------------------------------------

setting up GFF3 output and fasta chunks
preparing ab-inits
gathering ab-init output files
doing blastn of ESTs
couldn't close /tmp/maker_6YkEod/0/chr3.0
No space left on device at /usr/local/bin/../lib/FastaFile.pm line 60.
--> rank=NA, hostname=galaxy-main-set03-1.novalocal
ERROR: Failed while doing blastn of ESTs
ERROR: Chunk failed at level:0, tier_type:3
FAILED CONTIG:chr3

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:chr3

examining contents of the fasta file and run log

--Next Contig--

Processing run.log file...
examining contents of the fasta file and run log

--Next Contig--

Processing run.log file...
MAKER WARNING: The file dataset_96b8abc0-9094-4325-8fbb-2f75a3a1aa5a.maker.output/dataset_96b8abc0-9094-4325-8fbb-2f75a3a1aa5a_datastore/50/43/chr3//theVoid.chr3/0/chr3.0.dataset_4cc3a66f-e8cb-420f-9a45-bc8af61675ef%2Edat.blastn
did not finish on the last run and must be erased

Maker is now finished!!!

Can you help me?
Thank you
Jorge

jennaj · July 22, 2025, 7:02pm

Hi @joyhenart

The job is overwhelming the cluster node. Working at VGP or EU might help, or you will need to adjust the analysis.
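
If it helps to see this at a glance, a log like the one you pasted can be summarized with a short Python sketch. It relies only on the line patterns visible in your log ("ERROR: Failed while ...", "FAILED CONTIG:..." and "No space left on device"); the function name `summarize_failures` is made up, and other Maker versions may word these lines differently.

```python
import re
from collections import defaultdict


def summarize_failures(log_text):
    """Return ({contig: [stages that failed]}, disk_full_flag) from a Maker log."""
    failed = defaultdict(set)
    disk_full = False
    stage = None  # most recent "Failed while ..." message not yet assigned
    for line in log_text.splitlines():
        line = line.strip()
        if "No space left on device" in line:
            disk_full = True
        m = re.match(r"ERROR: Failed while (.+)", line)
        if m:
            stage = m.group(1)
            continue
        m = re.match(r"FAILED CONTIG:(\S+)", line)
        if m and stage is not None:
            # Attach the stage to the first FAILED CONTIG line that follows it.
            failed[m.group(1)].add(stage)
            stage = None
    return {contig: sorted(stages) for contig, stages in failed.items()}, disk_full
```

For your log this kind of summary would show chr2 and chr3 failing at the ab-init preparation and EST blastn steps with the disk-full flag set, which points at temporary storage on the node rather than at the input data.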

Do you want to share this job, and the inputs? I can help to determine what is going wrong.

It sounds like this is not going to be enough (not enough genes called), and potentially has a lot of noise (why the files are too large to work with), and getting the protein alignment to work will help.
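
To check the gene count before attempting another training round, a GFF3 can be tallied with a few lines of Python. This is a sketch that assumes a standard nine-column, tab-separated GFF3 with the feature type in column 3; the function name `count_features` is hypothetical, and the figure of 100 genes comes from the error you reported.

```python
def count_features(gff3_text, feature_type="gene"):
    """Return {sequence_id: number of features of the given type}."""
    counts = {}
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # embedded sequence section; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a well-formed feature line
        if cols[2] == feature_type:
            counts[cols[0]] = counts.get(cols[0], 0) + 1
    return counts
```

Summing the values gives the total number of gene features, which can be compared against the threshold named in the training error.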

In short, you will want to annotate with ESTs and proteins first, then flow down into downstream steps. Maker will be run several times, layering in more annotation based on the prior annotation at each round. The computation is too much all at once.