What an interesting thread to follow up on. Thanks @jennaj for your guidance. My question is: can I use Trinity gene IDs in goseq, or do they have to be Ensembl IDs? I am not using reference-based methods, so I also need help with the category file format.
That tool is using a mapping from gene IDs to GO terms. If your genome is not supported directly, but you can find and provide that information, that will work too. Might take some experimentation on your part. Scroll down into the tool form help for the instructions.
Gene categories file
This tool can get GO and KEGG categories for some genomes. The three GO categories are GO:MF (Molecular Function - molecular activities of gene products), GO:CC (Cellular Component - where gene products are active), GO:BP (Biological Process - pathways and larger processes made up of the activities of multiple gene products). If your genome is not available, you will also need a file describing the membership of genes in categories. The category file should have two columns with an optional header row, with Gene ID in the first column and category identifier in the second column. As the mapping between categories and genes is usually many-to-many, this table will usually have multiple rows with the same Gene ID and multiple rows with the same category identifier.
Example:
ENSG00000162526 GO:0000003
ENSG00000198648 GO:0000278
ENSG00000112312 GO:0000278
ENSG00000174442 GO:0000278
ENSG00000108953 GO:0000278
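For anyone preparing their own category file, here is a minimal Python sketch of reading the two-column format into a gene-to-categories mapping, which makes the many-to-many structure easy to inspect. It assumes the columns are separated by tabs or spaces; the function name `read_categories` is only illustrative and is not part of goseq or Galaxy.

```python
from collections import defaultdict

def read_categories(lines, has_header=False):
    """Map each gene ID to the set of category IDs listed for it."""
    gene_to_cats = defaultdict(set)
    for n, line in enumerate(lines):
        if has_header and n == 0:
            continue  # skip the optional header row
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {n + 1}: expected 2 columns, got {len(fields)}")
        gene, category = fields
        gene_to_cats[gene].add(category)
    return dict(gene_to_cats)

# Rows taken from the example in the tool help
example = [
    "ENSG00000162526 GO:0000003",
    "ENSG00000198648 GO:0000278",
    "ENSG00000112312 GO:0000278",
]
mapping = read_categories(example)
print(mapping["ENSG00000198648"])
```

A line that does not split into exactly two fields raises an error, which can help catch stray columns before the file is uploaded.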
I don't see this tutorial posted to this specific thread yet, so I am including it as a reference: Data Manipulation Olympics
Thank you @jennaj. I provided a similar format for the category file; the only difference is the gene ID, as in the example below:
TRINITY_DN20_c0_g1_i3 GO:0016787
TRINITY_DN20_c0_g2_i3 GO:0016787
TRINITY_DN20_c0_g1_i2 GO:0004185
But it did not work, even though the gene IDs are the same as in the gene length and differential expression files.
Did the tool produce any error message? Click into the "i" icon, scroll down into the job details, and review all of the logs. You should find an R error message.
What you are checking is that the gene ID format is consistent between all of the inputs, and that the other files are complete. The IDs need to be in the same order, and every ID in the first file must also be in the second. The second file, with the lengths, can have extra lines.
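If it helps, those checks can be done outside Galaxy before uploading. This is a rough Python sketch, assuming both files have the gene ID in the first column and are tab- or space-separated. The helper names `first_column` and `check_ids` are made up for illustration, and the flags and lengths in the example rows are invented.

```python
def first_column(lines, has_header=False):
    """Return the first whitespace-separated field of each non-blank line."""
    ids = [line.split()[0] for line in lines if line.strip()]
    return ids[1:] if has_header else ids

def check_ids(de_ids, length_ids):
    """Return (IDs missing from the length file, whether shared IDs are in the same order)."""
    length_set = set(length_ids)
    de_set = set(de_ids)
    missing = [g for g in de_ids if g not in length_set]
    # Compare the order of the IDs that both files have in common
    shared_de_order = [g for g in de_ids if g in length_set]
    shared_length_order = [g for g in length_ids if g in de_set]
    return missing, shared_de_order == shared_length_order

# Invented example rows: differential expression flags and gene lengths
de_lines = ["TRINITY_DN20_c0_g1_i3 TRUE", "TRINITY_DN20_c0_g2_i3 FALSE"]
length_lines = [
    "TRINITY_DN20_c0_g1_i3 1200",
    "TRINITY_DN20_c0_g2_i3 850",
    "TRINITY_DN20_c0_g1_i2 430",  # extra line, only in the length file
]
missing, same_order = check_ids(first_column(de_lines), first_column(length_lines))
print("missing from length file:", missing)
print("same order:", same_order)
```

An empty `missing` list and `same_order` being true is what you want to see. Note that repeated IDs in either file can make the order comparison fail, which is itself a hint that something needs cleaning up.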
If you can't find the problem, please share that job details view: set your history to a shared state, then share the link from the "i" icon (or the entire history and note which dataset to look at). Or, you can expand all the datasets and logs on that view, and copy/paste the content back here (not as complete but maybe has some clues).
Hello @jennaj, Thank you very much. I detected the problem with the "i" icon. It seems I had duplicate Gene IDs in the row names. I removed all the duplicate IDs from all three files and tried again, and it worked.
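For anyone who hits the same duplicate row name error, here is a small Python sketch of how duplicated IDs could be listed and removed from a file with the gene ID in the first column. It is one possible approach, not what the tool does internally; `duplicated_ids` and `drop_duplicates` are illustrative names, and the values in the example rows are invented.

```python
from collections import Counter

def duplicated_ids(lines):
    """List the IDs that appear in the first column more than once."""
    ids = [line.split()[0] for line in lines if line.strip()]
    counts = Counter(ids)
    return [gene for gene, count in counts.items() if count > 1]

def drop_duplicates(lines):
    """Keep only the first row seen for each ID, preserving the original order."""
    seen = set()
    kept = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        gene = line.split()[0]
        if gene not in seen:
            seen.add(gene)
            kept.append(line)
    return kept

# Invented gene length rows with one repeated ID
rows = [
    "TRINITY_DN20_c0_g1_i3 1200",
    "TRINITY_DN20_c0_g2_i3 850",
    "TRINITY_DN20_c0_g1_i3 1200",
]
print(duplicated_ids(rows))
print(drop_duplicates(rows))
```

This fits the differential expression and gene length files, where each ID should appear once. In the category file, the same gene ID on several rows is expected (one row per category), so there only rows that are identical in both columns would count as duplicates.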
However, how can I get the GO terms for multiple samples, each represented on the dot plot? I hope the question is clear. Thank you once again.
Glad you found the original problem and were able to address it.
The goseq tool produces a graph for the current sample, but there are several optional outputs available on the tool form. Those could be used with general graphing tools. Some choices are in the tool panel, and more are found under the Visualize masthead menu.
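One way to prepare those outputs for a multi-sample plot is to stack the per-sample result tables into a single long table with an added sample column, which most general graphing tools can then group by. The Python sketch below assumes each table is tab-separated with a header row and that all tables share the same columns. The column names `category` and `p_value`, the sample names, and the values are placeholders, so substitute the headers from your actual tool output.

```python
import csv
import io

def combine_tables(tables):
    """Stack tab-separated tables (sample name -> text) and tag each row with its sample."""
    combined = []
    for sample, text in tables.items():
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for row in reader:
            row = dict(row)
            row["sample"] = sample  # added column identifying the source sample
            combined.append(row)
    return combined

def to_tsv(rows):
    """Write the combined rows back out as tab-separated text."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(rows[0].keys()), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Placeholder tables standing in for two per-sample result files
tables = {
    "sample_A": "category\tp_value\nGO:0016787\t0.01\n",
    "sample_B": "category\tp_value\nGO:0016787\t0.20\n",
}
long_table = combine_tables(tables)
print(to_tsv(long_table))
```

With the data in this long form, a dot plot can put the category on one axis and the sample on the other.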