Welcome @mars142
Thanks for sharing your history, super helpful!
The problem is with the Ensembl gene identifiers – these carry a trailing version suffix (.N),
and the goseq tool doesn’t know how to parse those.
If you scroll down on the tool form, you’ll see the expected format, and the examples there happen to use Ensembl IDs too.
I’m guessing that your coworkers were either using a different annotation source, or they removed the versioning.
This recent topic here has more details (a different Bioconductor tool, but all are in R and work the same at a technical level). →
What to do
- Add a step to the workflow that strips off the .N content in the tabular files before sending the data into goseq.
- Find an annotation file that already has the version stripped.
- More about genome data sources. → Reference genomes at public Galaxy servers: GRCh38/hg38 example
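If it helps to see what the first option does to the data, here is a minimal Python sketch of the stripping step. It assumes the gene ID sits in the first column of a tab-separated file and that the version is always a dot followed by digits at the very end of the ID. The function names are made up for illustration and are not part of goseq or Galaxy.

```python
import re

# Assumption: the version is a dot plus digits at the very end of the ID.
VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(gene_id: str) -> str:
    """Return the gene ID without a trailing .N version, if one is present."""
    return VERSION_SUFFIX.sub("", gene_id)


def strip_versions_in_table(lines, column=0, sep="\t"):
    """Strip the version from one column of each tab-separated line.

    Hypothetical helper: `column` is the zero-based index of the gene ID column.
    Lines that are too short to contain that column are passed through unchanged.
    """
    cleaned = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) > column:
            fields[column] = strip_version(fields[column])
        cleaned.append(sep.join(fields))
    return cleaned
```

Inside a Galaxy workflow, the same pattern (`\.\d+$` replaced with nothing) should work in a regex find-and-replace text tool applied to the gene ID column, so no scripting is strictly needed.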
Hope this helps!