I am encountering an issue while running eggNOG-mapper on Galaxy.
Some protein-coding genes present in my input FASTA file (protein sequences) are missing from the final annotation output file. I have checked carefully and confirmed that these sequences are indeed present in the input FASTA, with proper headers and no obvious formatting problems.
However, these gene IDs do not appear at all in the eggNOG-mapper annotation output (TSV file). It is not a case of missing functional annotation — the genes are completely absent from the output file.
Could you please help me understand why this might happen?
Let’s break down how the tool is performing the analysis to narrow down what may be going wrong.
There are three components to this tool’s protocol. The first one is:
eggNOG Mapper search phase
Is this where your protein sequences fail to get hits? Or did they pass this step, and the problem appears in one of the later steps?
Have you confirmed that those proteins are included in the hosted annotation? Or did you expect your genes to be included, and they simply aren’t capturing hits?
To start, I would suggest checking against the other online versions of the annotation. This would help confirm whether or not your queries are missing from the underlying database.
Keep in mind that you may need to adjust the parameters to capture a hit! The wiki above would be the best resource for investigating the best parameters for your samples.
From there, it is difficult to guess more! We can help to ensure that there isn’t a technical problem with the Galaxy implementation at the server you are working at if you would like to share back your history!
To clarify my issue: the problem is that some protein IDs present in my input FASTA are completely absent from the final annotation TSV output. I would like to determine whether they are already lost during the search phase, or only later during the annotation step.
At the moment, I have only checked the final annotation file, and I confirmed that:
the missing protein sequences are present in the input FASTA
their headers appear correctly formatted
no error message was reported during the Galaxy run
I have not yet verified separately whether these proteins obtained hits in the eggNOG-mapper search phase itself. I will check that point next.
My expectation was that all query proteins would at least appear in the output, even if some had no functional annotation assigned. What I observe instead is that some IDs are completely missing from the final TSV. I will also compare these sequences against the online eggNOG resource and review the search parameters, as you suggested.
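One way to pin down exactly which IDs are absent is to compare the FASTA headers with the first column of the annotation TSV. The following is a minimal Python sketch, not part of eggNOG-mapper or Galaxy. It assumes that the query ID is the first whitespace-delimited token of each FASTA header and that comment and header lines in the TSV start with `#`; the function names and the file names in the usage note are hypothetical.

```python
def fasta_ids(fasta_lines):
    """Collect sequence IDs (first token after '>') from FASTA lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def tsv_query_ids(tsv_lines):
    """Collect first-column values from a tab-separated table.

    Lines starting with '#' are treated as comments or headers (assumption).
    """
    ids = set()
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        ids.add(line.split("\t")[0])
    return ids


def missing_ids(fasta_lines, tsv_lines):
    """Return the sorted IDs present in the FASTA but absent from the table."""
    return sorted(fasta_ids(fasta_lines) - tsv_query_ids(tsv_lines))
```

Both functions accept any iterable of lines, so open file handles work directly, for example `with open("input.fasta") as fa, open("annotations.tsv") as tsv: print(missing_ids(fa, tsv))`.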
If useful, I can also share the Galaxy history or provide a small subset of the affected sequences so we can check whether this could be related to the Galaxy implementation.
It sounds like using this Output Option might be worth exploring? Running these tools a few times in a series with different parameters is one of the recommended usage paths. I would suggest breaking it down if you want more details about what is happening at each phase. Be sure to notice the extra output toggles for the reporting options.
In short: the search phase generates statistics for all input sequences, and the mapping to orthologs is where sequences with no hits technically fall out. You could review those initial stats to dig in more.
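To see at which phase a given ID disappears, the sets of query IDs from each stage can be compared. This is a rough sketch under the assumption that you have already collected three sets of IDs: those in the input FASTA, those in the search phase output, and those in the final annotation table. The function name and the example IDs are made up for illustration.

```python
def classify_dropouts(input_ids, seed_ids, annotated_ids):
    """Group input IDs by the stage at which they stop appearing.

    input_ids:     IDs in the input FASTA
    seed_ids:      IDs with a row in the search phase output
    annotated_ids: IDs with a row in the final annotation table
    """
    input_ids = set(input_ids)
    seed_ids = set(seed_ids)
    annotated_ids = set(annotated_ids)
    return {
        # In the input, but no row in the search phase output.
        "no_search_hit": sorted(input_ids - seed_ids),
        # Had a search hit, but no row in the final annotation table.
        "lost_after_search": sorted((input_ids & seed_ids) - annotated_ids),
        # Present all the way through.
        "annotated": sorted(input_ids & annotated_ids),
    }


# Hypothetical example IDs.
report = classify_dropouts(
    input_ids={"p1", "p2", "p3", "p4"},
    seed_ids={"p1", "p3", "p4"},
    annotated_ids={"p1", "p4"},
)
print(report)
```

If every missing ID lands in `no_search_hit`, the sequences were already lost in the search phase, and the search parameters are the place to look.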
Examples:
Run the eggNOG Mapper search phase tool, then run:
eggNOG Mapper annotation phase
Help → Outputs → Output sequences without annotation
Produce an additional FASTA file with the sequences of queries for which an existing annotation was not found using cache mode. This file can be used as input of another eggNOG-mapper run without using the cache, trying to annotate the sequences.
You can also run the primary tool twice: first to generate the md5 hash, then to use it during the next round. The Help on this form explains more, and usage will be the same as described at their wiki.
eggNOG Mapper functional sequence annotation by orthology
Output Options
Add md5 hash of each query to annotations
(--md5)
Help → Outputs → sequences without annotation
This output is created if cached annotations are used as input. It is a FASTA file containing all sequences that are not found in the cached annotations. These sequences can then be used as input for another run of the EggNOG mapper computing seed orthologs with diamond, etc.
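If the run did not use cached annotations, so that this extra FASTA is not produced, a comparable subset can be built by hand from the original input and a list of missing IDs. The sketch below is only an illustration and is not how the tool itself writes that file. It assumes the ID is the first whitespace-delimited token of each header, and the function name is hypothetical.

```python
def extract_records(fasta_lines, wanted_ids):
    """Return the FASTA lines (header plus sequence) for the wanted IDs."""
    wanted = set(wanted_ids)
    keep = False
    kept = []
    for line in fasta_lines:
        if line.startswith(">"):
            # A new record starts: decide whether to keep it.
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in wanted
        if keep:
            kept.append(line)
    return kept
```

The returned lines can be written to a new file with `writelines` and used as the input for another run with different search parameters.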
It sounds like your data is set up correctly (valid FASTA data), and exploring the options and workflows the authors host would be a good place to start. The Galaxy tutorials only use the tool with one cycle, but there is much more you can try.