Difficulty in using OrthoFinder

I was trying to use OrthoFinder to find orthologous protein/gene groups across four different plant species. I don’t have the BLAST results, so I was running OrthoFinder from scratch with the protein fasta files. I tried with both the protein and the CDS fasta files, but every time the program ended in an error stating that the list contained zero datasets. I am unable to understand where I am going wrong. Could anybody please help me find a solution to this issue? Also, if I want to use the “from blast results” option instead of the “from fasta files” option, how should I proceed? How do I obtain the BLAST results before running OrthoFinder? Any help, tutorial, or suggestion is welcome. I have been stuck on this problem for more than a week.

Hi @Sutrisha_Kundu

I’m curious about which server you are using the tool on, and what your files look like. Would you like to share your history so we can help you troubleshoot the usage? If possible, please leave the errors or odd results that you have so far undeleted in the history. The link can be posted back in a reply, and you can note which datasets represent the different ways you have tried this, along with some explanation of what the goals are.

How to share your history is in the banner at this forum, also here. → How to get faster help with your question.

Let’s start there, thanks! :slight_smile:

This is my history. Initially, I was not sure whether I should use protein fasta files or genomic DNA fasta files to run the all-against-all BLAST in OrthoFinder, so I used both. Jobs 1552 to 1565 were run using the genomic DNA files of the four plant species, while jobs 1594 to 1607 were run using the protein fasta files of the same four species. In the second case, no gene trees were obtained (job no. 1596). I obtained results in both cases. After running this BLAST step, I tried to feed these results into OrthoFinder again using the “from blast results” option (job ids 3547 to 3561, 3562 to 3576, and 3577 to 3591) with different settings, but all of these jobs failed. My main aim is to obtain a Venn diagram for these four species based on the shared and unique orthogroups, as seen in published manuscripts. I do not know how to obtain such diagrams and tables.

Hi @Sutrisha_Kundu

Thanks for sharing the history, this helps.

Some feedback from the first error I looked at:

  1. Click into the i-icon for the errors, and review the messages. Most of these are about the content of the inputs being a mismatch for what the tool is expecting.

    Example: Data 1658.

    The message explains in detail why the tool thinks the annotation is not in the expected format. Then, when reviewing your inputs, I can see that some of the annotation files are not in GFF3 format but in GTF or a hybrid type of GFF.

    All must be GFF3. You can sometimes use the tool gffread to convert between these formats, but if the data source has a GFF3, get it directly from them.

  2. Organize the data into collections. If you are supplying genome and annotation pairs, those need to be read in by the tool in the same order. Two collections with the same order would help.
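For point 1, a quick way to see which annotation files are not GFF3 is to look at column 9 of the first feature line. This is only a rough sketch, not what the Galaxy tool does internally: it assumes GFF3 attributes start with `key=value` and GTF attributes start with `key "value";`, and the function name `guess_annotation_format` is made up for illustration.

```python
import re

def guess_annotation_format(lines):
    """Return 'gff3', 'gtf', or 'unknown' based on the first feature line."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/pragma lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            return "unknown"
        attributes = fields[8].strip()
        # GFF3 attributes look like: ID=gene1;Name=abc
        if re.match(r"\w+=", attributes):
            return "gff3"
        # GTF attributes look like: gene_id "g1"; transcript_id "t1";
        if re.match(r'\w+ "', attributes):
            return "gtf"
        return "unknown"
    return "unknown"


if __name__ == "__main__":
    example = ['chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n']
    print(guess_annotation_format(example))
```

A file reported as `gtf` or `unknown` is a candidate for conversion or for re-downloading as GFF3 from the data provider.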

More feedback soon. I am still running a separate test on one of these tools to confirm a different problem you had.

Update! The tests completed. I ran some smaller genome files through the two tools you were using to make sure the technical parts were Ok. Everything worked. I’ve shared the history here. I didn’t use all parameter combinations, so if you get an error with this data and specific combinations, you can share that back for more feedback.

Meanwhile, you should explore the user guide for the tool linked from the Help section. This should have example data you can compare to. :slight_smile:

Thank you for your help. I am trying to rectify the gff files before running OrthoFinder and Proteinortho again. When I ran OrthoFinder using the protein files, I received all the other output files but no gene trees (job no. 1596). Is that because of issues in the gff files, or something else?

I saw that you provided some Proteinortho results as well, which I could not obtain. Is that because of the same issue in the gff files? For Proteinortho, I didn’t get a single output.

Apart from this, since you are helping so much, do you know how I can find out what are the single copy and multi copy genes from these four plant genomes? I wish to segregate the single copy, double copy, triple copy, and multi copy genes. Can you please help on this aspect as well? :pray:

Hi @Sutrisha_Kundu

Yes, the format and labeling of the input data is likely contributing to the problems.

That is what I meant with these comments:

What to do

Summary:

  1. You should create or retrieve the correct type of GFF3 annotation, then put those files into a collection folder.
  2. Simplify the format of the fasta files, then put those into a collection folder.
  3. The order of files in those two different collection folders should be the same: data for genome A is listed first in both, genome B second, … etc.
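Point 3 can be checked outside Galaxy before building the collections. The sketch below assumes a hypothetical naming convention in which each file name starts with the species label followed by a dot (for example `speciesA.fasta` and `speciesA.gff3`); adjust `species_key` to whatever convention your files actually use.

```python
from pathlib import PurePath

def species_key(name):
    """Hypothetical convention: the species label is the file name up to the first dot."""
    return PurePath(name).name.split(".", 1)[0]

def check_same_order(fasta_names, gff_names):
    """Return (position, fasta, gff) for every mismatched pair; an empty list means the order agrees."""
    if len(fasta_names) != len(gff_names):
        raise ValueError("the two collections have different numbers of files")
    problems = []
    for position, (fasta, gff) in enumerate(zip(fasta_names, gff_names)):
        if species_key(fasta) != species_key(gff):
            problems.append((position, fasta, gff))
    return problems


if __name__ == "__main__":
    fastas = ["speciesA.fasta", "speciesB.fasta"]
    gffs = ["speciesB.gff3", "speciesA.gff3"]
    for position, fasta, gff in check_same_order(fastas, gffs):
        print(f"position {position}: {fasta} is paired with {gff}")
```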

Formats:

  • GFF3 from the data provider should be already in the correct format. If you only have GTF, you can try to convert it to GFF3 with gffread. See → Datatypes - Galaxy Community Hub

  • fasta for the exomes needs to be very simple: just have the identifier on the > title lines of the fasta, and remove everything else. This usually means removing everything after the first whitespace on the > lines, and the tool NormalizeFasta can be used. See → Datatypes - Galaxy Community Hub

  • Finally, double check that all of the identifiers in the fasta are now actually inside of the GFF3 files. If not, then you do not have the correct data for one of these and you still need to find or create the paired files.

    • The tool is trying to “match identifiers up” between the files. It uses the protein sequence to do alignments, then it uses the identifier from the fasta to look up annotation details in the GFF3 to do some clustering.
    • You will want to make this very clear and exact – not just to have the tool run without failing but to get accurate technical outcomes in the results. You can’t make any scientific interpretations about the annotation without this.
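To make the fasta cleanup concrete, here is a minimal sketch of "remove everything after the first whitespace on the > lines". It is a stand-in written for illustration, not the NormalizeFasta tool itself, and the example identifier is invented.

```python
def simplify_fasta_headers(lines):
    """Yield FASTA lines with each '>' title line cut at the first whitespace."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            yield ">" + identifier + "\n"
        else:
            # Sequence lines pass through unchanged (a newline is added if missing).
            yield line if line.endswith("\n") else line + "\n"


if __name__ == "__main__":
    raw = [">prot0001.1 | example description | chr1\n", "MEDQVGF\n"]
    for cleaned in simplify_fasta_headers(raw):
        print(cleaned, end="")
```

Because the function works on any iterable of lines, an open file handle can be passed in directly and the result written out with `writelines`.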

I noticed that your runs without the annotation seemed to work Ok, but that is because the tool was only using the protein fasta sequences against each other, not really using the fasta identifiers for anything but labels. But when bringing in more data files to consider, you will need to make sure the tool can first make associations between both files per exome, then layer in the clustering between the species’ exomes.
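The "are all fasta identifiers actually inside the GFF3" check can also be scripted. This is a sketch under assumptions: which GFF3 attribute carries the protein identifier depends on the data provider, so the `keys` tuple below (`ID`, `Name`, `protein_id`) is a guess to adapt, and the function names are hypothetical.

```python
def fasta_ids(lines):
    """Collect the identifier (text up to the first whitespace) from each '>' line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids

def gff3_ids(lines, keys=("ID", "Name", "protein_id")):
    """Collect values of the chosen attribute keys from column 9 of a GFF3 file."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        for pair in fields[8].split(";"):
            key, sep, value = pair.partition("=")
            if sep and key.strip() in keys:
                # An attribute may hold several comma-separated values.
                ids.update(v.strip() for v in value.split(",") if v.strip())
    return ids

def missing_from_annotation(fasta_lines, gff_lines, keys=("ID", "Name", "protein_id")):
    """Return the sorted fasta identifiers that never appear in the GFF3 attributes."""
    return sorted(fasta_ids(fasta_lines) - gff3_ids(gff_lines, keys))


if __name__ == "__main__":
    fasta = [">p1 some description\n", "MAA\n", ">p2\n", "MBB\n"]
    gff = ["##gff-version 3\n", "chr1\t.\tmRNA\t1\t9\t.\t+\t.\tID=p1;Parent=g1\n"]
    print(missing_from_annotation(fasta, gff))
```

An empty result for every species suggests the pairs are consistent; a long list usually means the fasta and the annotation come from different releases or use different identifier schemes.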

Exact suggested steps

  1. Copy all of the fasta and GFF/GTF/GFF3 files you have into a brand new history.
  2. Then start going through each species’ files: clean up the fasta file, then decide if the annotation is actually GFF3 or not, and fix that up.
  3. Once you have done this for all, create a list collection for the fasta’s, then one for the annotation.
  4. Finally, I would copy just those final two collections into another new history and try a rerun using this very clean data.

If you are not sure how to manipulate collections, this tutorial has examples.

Once you have the files cleaned up and if you need help with getting these into collections, you can share back that history and we can try to help more. :slight_smile:

I tried to rectify the protein fasta files and the gff files. I used NormalizeFasta to convert the fasta files and then ran OrthoFinder, but I still did not obtain any gene trees. I also downloaded the original gff3 files. However, I still don’t get any output from Proteinortho. I am sharing the 4 fasta and gff files here in this history. Please help me obtain the results. Also, please mention how I can obtain a clustervenn diagram for the orthogroups and how I can obtain results for single-copy and multi-copy genes. What tools should I use? Please help.

Hi @Sutrisha_Kundu

Great, this is looking so much better!

Each of the GFF3 files needs to have a line like this one at the very top. This is what tells tools that the file really is GFF3; otherwise it will be read in as a generic GFF file. You can remove the other comment lines in each file, or just add this one.

##gff-version 3

Please adjust that in your files, then try to run the OrthoFinder tool in this same history so we can see how it is working and what the new log messages state.
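If editing the files by hand is awkward, the header can be added with a few lines of Python. This is a minimal sketch, and the function names are made up for illustration. It only fixes the label: it refuses to relabel a file that declares a different GFF version, because such a file should be converted rather than renamed.

```python
GFF3_HEADER = "##gff-version 3"

def declared_version(line):
    """Return the version string from a '##gff-version' line, or '' if none is given."""
    parts = line.split()
    return parts[1] if len(parts) > 1 else ""

def ensure_gff3_header(text):
    """Return the text with '##gff-version 3' as its first line."""
    lines = text.splitlines()
    pragmas = [ln for ln in lines if ln.startswith("##gff-version")]
    for pragma in pragmas:
        if not declared_version(pragma).startswith("3"):
            raise ValueError("file declares a non-GFF3 version; convert it instead of relabeling")
    # Drop any existing pragma so it is not duplicated lower in the file.
    body = [ln for ln in lines if not ln.startswith("##gff-version")]
    return "\n".join([GFF3_HEADER] + body) + "\n"


if __name__ == "__main__":
    sample = "chr1\t.\tgene\t1\t9\t.\t+\t.\tID=g1\n"
    print(ensure_gff3_header(sample), end="")
```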

You are getting much closer to solving this!

I re-uploaded the gff3 annotation files with only the header you mentioned. Then I ran Proteinortho, which did not give any result. I also ran OrthoFinder with 3 different parameter settings, which again did not produce any gene trees. As before, the other outputs were produced but not the gene trees, and Proteinortho produced no results at all.

Hi @Sutrisha_Kundu

You can share back those examples if you want help troubleshooting. If this is a problem on the server, we can share that with the administrators to fix what might be going wrong. If you could label what each example represents, or share the “before” and “after” that shows the problem, that will make it easier to see where things may need a correction. Thanks! :slight_smile:

This is the history where I uploaded 4 fasta files and 4 gff files after rectification. I did not get any gene trees after running OrthoFinder with three different parameter settings. Also, I got no output from Proteinortho.