Using GOSEQ with a custom category input

What an interesting thread to follow up on. Thanks @jennaj for your guidance. My question is: can I use Trinity gene IDs in goseq, or do they have to be Ensembl IDs? I am not using reference-based methods, so I also need help with the category file format.

That tool is using a mapping from gene IDs to GO terms. If your genome is not supported directly, but you can find and provide that information, that will work too. Might take some experimentation on your part. Scroll down into the tool form help for the instructions.

Gene categories file

This tool can get GO and KEGG categories for some genomes. The three GO categories are GO:MF (Molecular Function - molecular activities of gene products), GO:CC (Cellular Component - where gene products are active), GO:BP (Biological Process - pathways and larger processes made up of the activities of multiple gene products). If your genome is not available, you will also need a file describing the membership of genes in categories. The category file should have two columns with an optional header row, with Gene ID in the first column and category identifier in the second column. As the mapping between categories and genes is usually many-to-many, this table will usually have multiple rows with the same Gene ID and multiple rows with the same category identifier.


ENSG00000162526 GO:0000003
ENSG00000198648 GO:0000278
ENSG00000112312 GO:0000278
ENSG00000174442 GO:0000278
ENSG00000108953 GO:0000278
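As a rough illustration of the two-column layout described above, here is a minimal Python sketch that reads such rows into a gene-to-categories mapping. The function name `parse_categories` and the whitespace splitting are assumptions made for this example (the real file is most likely tab-separated); it is not part of goseq or Galaxy.

```python
from collections import defaultdict


def parse_categories(lines, has_header=False):
    """Map each gene ID to the set of category IDs listed for it.

    Hypothetical helper: splits on any whitespace, so it accepts both
    tab-separated and space-separated rows.
    """
    gene_to_cats = defaultdict(set)
    rows = iter(lines)
    if has_header:
        next(rows, None)  # skip the optional header row
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError(f"expected 2 columns, got {len(fields)}: {line!r}")
        gene_id, category = fields
        gene_to_cats[gene_id].add(category)
    return dict(gene_to_cats)


example = [
    "ENSG00000162526 GO:0000003",
    "ENSG00000198648 GO:0000278",
    "ENSG00000112312 GO:0000278",
]
mapping = parse_categories(example)
```

Because the mapping is many-to-many, a gene can end up with several categories in its set, and the same category can appear under several genes.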

I don’t see this tutorial posted to this specific thread yet, so I am including it as a reference: Data Manipulation Olympics

Thank you @jennaj. I provided a category file in a similar format; the only difference is the gene ID, as in the example below:
TRINITY_DN20_c0_g1_i3 GO:0016787
TRINITY_DN20_c0_g2_i3 GO:0016787
TRINITY_DN20_c0_g1_i2 GO:0004185
But it did not work, even though the gene IDs are the same as those in the gene length and differential expression files.

Hi @Buhari_Lawan_Muhamma

Did the tool produce any error message? Click into the “i” icon, scroll down into the job details, and review all of the logs. You should find an R error message.

What you are checking is that the gene ID format is consistent between all of the inputs, and that the other files are complete. The IDs need to be in the same order, and all must be in both files. The second file, with the lengths, can have extra lines, but everything in the first file has to be in the second.
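The check described above can be sketched in Python before running the tool. This is only an illustration: `check_ids` is a hypothetical helper, and it assumes the gene IDs have already been pulled out of the first column of the differential expression file and the length file.

```python
def check_ids(de_ids, length_ids):
    """Compare gene IDs from the differential expression file (de_ids)
    with those from the gene length file (length_ids).

    Returns (missing, same_order):
      missing    - IDs in the first file that are absent from the second
      same_order - True if nothing is missing and the shared IDs appear
                   in the same order in both files (extra IDs in the
                   length file are ignored; duplicated IDs in the first
                   file make this False)
    """
    length_set = set(length_ids)
    missing = [g for g in de_ids if g not in length_set]

    de_set = set(de_ids)
    shared_in_length_order = [g for g in length_ids if g in de_set]
    same_order = (not missing) and shared_in_length_order == list(de_ids)
    return missing, same_order


de = ["TRINITY_DN20_c0_g1_i3", "TRINITY_DN20_c0_g2_i3"]
lengths = [
    "TRINITY_DN20_c0_g1_i3",
    "TRINITY_DN20_c0_g1_i2",  # extra line in the length file
    "TRINITY_DN20_c0_g2_i3",
]
missing, same_order = check_ids(de, lengths)
```

If `missing` is non-empty or `same_order` is false, the inputs probably need to be cleaned up or re-sorted before goseq will accept them.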

If you can’t find the problem, please share that job details view – set your history to a shared state, then share the link from the “i” icon (or the entire history and note which dataset to look at). Or, you can expand all the datasets and logs on that view, and copy/paste the content back here (not as complete but maybe has some clues).

Hello @jennaj, Thank you very much. I detected the problem with the “i” icon. It seems I had duplicate Gene IDs in the row names. I removed all the duplicate IDs from all three files and tried again, and it worked.
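For anyone hitting the same duplicate row name problem, a small Python sketch of finding and dropping duplicate IDs might look like the following. The helper names and the length values are made up for illustration. Keeping the first occurrence is an assumption about what is wanted. For the category file, where a gene ID can legitimately repeat across categories, the key would presumably be the whole row rather than the ID alone.

```python
from collections import Counter


def find_duplicates(ids):
    """Return the IDs that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [gene_id for gene_id, n in counts.items() if n > 1]


def drop_duplicate_rows(rows, key=lambda row: row[0]):
    """Keep only the first row for each key (by default, the first column)."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k in seen:
            continue  # a later row with the same key is discarded
        seen.add(k)
        kept.append(row)
    return kept


# Placeholder length table: (gene ID, length); the numbers are invented.
length_rows = [
    ("TRINITY_DN20_c0_g1_i3", 1200),
    ("TRINITY_DN20_c0_g1_i3", 1200),
    ("TRINITY_DN20_c0_g2_i3", 800),
]
dups = find_duplicates([row[0] for row in length_rows])
deduped = drop_duplicate_rows(length_rows)
```

Passing `key=lambda row: row` instead would remove only rows that are identical in every column.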

However, how can I get the GO terms for multiple samples, with each sample represented on the dot plot? I hope the question is clear. Thank you once again.

Hi @Buhari_Lawan_Muhamma

Glad you found the original problem and were able to address it.

The goseq tool produces a graph for the current sample, but there are several optional outputs available on the tool form. Those could be used with general graphing tools. Some choices are in the tool panel, and more are found under the Visualize masthead menu.
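One way to prepare per-sample results for a general graphing tool is to stack them into a single long table with a sample column. The Python sketch below is only an assumption about how that could be done outside the tool: `stack_results`, the column names `category` and `p_adjust`, and all the values are hypothetical placeholders, not actual goseq output.

```python
def stack_results(per_sample):
    """Combine per-sample result rows into one long list of rows,
    adding a 'sample' field so a plotting tool can group by it.

    per_sample: dict mapping a sample name to a list of row dicts.
    """
    combined = []
    for sample, rows in per_sample.items():
        for row in rows:
            combined.append({"sample": sample, **row})
    return combined


# Invented placeholder values, for illustration only.
per_sample = {
    "sample_A": [
        {"category": "GO:0016787", "p_adjust": 0.01},
        {"category": "GO:0004185", "p_adjust": 0.20},
    ],
    "sample_B": [
        {"category": "GO:0016787", "p_adjust": 0.03},
    ],
}
combined = stack_results(per_sample)
```

The resulting table has one row per sample and category pair, which is the shape most dot plot tools expect (sample on one axis, category on the other).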

Ok, thank you so much @jennaj
