CD-HIT: are the representative sequences in the same order as the clusters?

Hi!

I ran CD-HIT on nucleotide sequences with the options “Sort cluster by size” and “sort FASTA by cluster size” selected. My question is: does representative sequence 1 correspond to cluster 0?

Hi @Ines_Leao

Yes, when computers list things out, the count usually starts at 0 for the first item. Any time you see the first item counted as 1 instead, it is because the tool or program translated it for you! :slight_smile:

I have a list of 6 things.
The position of the first item in my list is at 0.
The position of the last is at 5.
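As a quick illustration of the same idea, here is a tiny Python sketch. The list contents and variable names are made up for the example.

```python
# A list of 6 things (placeholder values).
items = ["a", "b", "c", "d", "e", "f"]

first_position = 0               # the first item sits at position 0
last_position = len(items) - 1   # 6 items, so the last position is 5

print(items[first_position])     # the first item
print(last_position)             # 5
```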

This is a nice explanation from engineers out in the wild. → https://cseducators.stackexchange.com/questions/5023/why-do-we-count-starting-from-zero

I also put a very simple example from the cd-hit tool test in this tiny shared history you can import and inspect to see how the two output files relate to each other.

  • https://usegalaxy.eu/u/jenj/h/ghelp-cd-hit

  • The first cluster is Cluster 0.

  • The first sequence included in it, which is our representative sequence for Cluster 0, is itself counted at 0 since it is the first item in a list of sequences.

  • Other sequences that are clustered with that representative sequence will be listed after, starting with 1 (if there are any!), because they will always be the second or greater item in a cluster’s list of sequences.
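To make the numbering concrete, here is a minimal Python sketch that reads text laid out like a CD-HIT cluster file and prints each cluster number with its members. The sample text and the names `SAMPLE_CLSTR` and `read_clusters` are made up for illustration; they are not taken from the shared history.

```python
# Made-up sample in the usual CD-HIT cluster-file layout (illustration only).
SAMPLE_CLSTR = (
    ">Cluster 0\n"
    "0\t120nt, >seqA... *\n"
    "1\t118nt, >seqB... at +/98.31%\n"
    ">Cluster 1\n"
    "0\t90nt, >seqC... *\n"
)


def read_clusters(text):
    """Return a list of (cluster_number, [member_line, ...]) tuples."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header lines look like ">Cluster 0"; numbering starts at 0.
            clusters.append((int(line.split()[1]), []))
        elif clusters:
            clusters[-1][1].append(line)
    return clusters


for number, members in read_clusters(SAMPLE_CLSTR):
    print(f"Cluster {number}: {len(members)} sequence(s)")
    for member in members:
        # The first field of a member line is its 0-based position.
        position = int(member.split()[0])
        print(f"  position {position}: {member}")
```

Both the cluster numbers and the positions inside each cluster start at 0 in this sample.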

Please give this a review and let us know if this is clear! You are welcome to screenshot or paste back the first few lines of your output and we can discuss the example lines closer.

Hi Jenna,

Thank you for your answer. However, representative sequence 1, which should correspond to cluster 0, does not have the same sequence size as shown in cluster 0.

Hi @Ines_Leao

Would you like to share your history back, so I can take a closer look and try to help more specifically with your use case? You can post the link back here.

We can help to confirm there isn’t some problem with the tool, or that specific version of the tool, plus explain how to match up the files if that is needed.

Thanks! :slight_smile:

Hi Jenna,

Here follows the link: Galaxy

Thank you so much!

My understanding is that each cluster gets its index assigned during clustering, and that this index has no relationship to the order of the FASTA file. If you want to know which cluster sequence >1 belongs to, you need to search for that sequence in the cluster file. As far as I understand, sequence >1 belongs to >Cluster 5867, see:

image

The asterisk (*) marks the representative sequence of a cluster.
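A rough Python sketch of that lookup is below. It assumes the usual cluster-file layout (position, length, `>ID...`, then either `*` or a similarity note) and uses made-up IDs, lengths, and cluster numbers. The names `MEMBER` and `index_clusters` are hypothetical helpers, not part of CD-HIT or Galaxy.

```python
import re

# Made-up cluster-file text; IDs, lengths, and cluster numbers are hypothetical.
SAMPLE_CLSTR = (
    ">Cluster 0\n"
    "0\t300nt, >7... *\n"
    "1\t295nt, >12... at +/97.29%\n"
    ">Cluster 1\n"
    "0\t280nt, >1... *\n"
    "1\t250nt, >3... at +/99.20%\n"
)

# Captures: length, sequence ID, and the trailing "*" or "at ...%" note.
MEMBER = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s*(.*)$")


def index_clusters(text):
    """Map each sequence ID to (cluster_number, length, is_representative)."""
    lookup = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            continue
        match = MEMBER.match(line)
        if match and current is not None:
            length, seq_id, note = match.groups()
            # The asterisk, not the position, decides who is representative.
            lookup[seq_id] = (current, int(length), note == "*")
    return lookup


lookup = index_clusters(SAMPLE_CLSTR)
cluster, length, is_rep = lookup["1"]
print(f"sequence 1 -> Cluster {cluster}, {length} nt, representative: {is_rep}")
```

The length stored with each entry can then be compared against the length of the matching record in the representative FASTA file.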

I think normally if you set the sorting by size you get the correct order:

But after a quick check this seems to be broken:

On Galaxy I also see this tool, which might do what you want:

Or you need to script something yourself.
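If you go the scripting route, something along these lines could be a starting point. It is only a sketch with made-up data: it assumes the sequence IDs in the cluster file match the first word of the FASTA headers (no truncation), and the function names are hypothetical.

```python
import re

# Made-up inputs; IDs, lengths, and sequences are hypothetical.
SAMPLE_FASTA = ">1\nACGTAC\n>2\nACGTACGTAC\n>3\nACGT\n"
SAMPLE_CLSTR = (
    ">Cluster 0\n"
    "0\t4nt, >3... *\n"
    ">Cluster 1\n"
    "0\t10nt, >2... *\n"
    "1\t9nt, >5... at +/100.00%\n"
    ">Cluster 2\n"
    "0\t6nt, >1... *\n"
)

# Matches only member lines that end with the representative marker "*".
REPRESENTATIVE = re.compile(r">(.+?)\.\.\.\s*\*$")


def read_fasta(text):
    """Return {sequence_id: sequence}, using the first word of each header."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = ""
        elif line and current is not None:
            records[current] += line
    return records


def representatives(text):
    """Return one dict per cluster: its number, size, and representative ID."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">Cluster"):
            clusters.append({"number": int(line.split()[1]), "size": 0, "rep": None})
        elif line and clusters:
            clusters[-1]["size"] += 1
            match = REPRESENTATIVE.search(line)
            if match:
                clusters[-1]["rep"] = match.group(1)
    return clusters


def fasta_in_cluster_order(fasta_text, clstr_text):
    """Write representatives largest cluster first, tagged with their cluster."""
    records = read_fasta(fasta_text)
    # sorted() is stable, so ties keep their order from the cluster file.
    ordered = sorted(representatives(clstr_text), key=lambda c: -c["size"])
    lines = []
    for c in ordered:
        lines.append(f">{c['rep']} cluster={c['number']} size={c['size']}")
        lines.append(records[c["rep"]])  # KeyError if the IDs do not match
    return "\n".join(lines) + "\n"


print(fasta_in_cluster_order(SAMPLE_FASTA, SAMPLE_CLSTR), end="")
```

Because each header is tagged with its cluster number, the FASTA and cluster files can be matched up afterwards regardless of the order either file was written in.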


Very helpful information @gbbio – thank you! :slight_smile: It seems like this is a known wrinkle with the underlying tool. Galaxy always hosts the original tool, so the correction would need to happen upstream first, then it would flow down to Galaxy and everywhere else.

Please give the Format cd-hit outputs tool a try @Ines_Leao and let us know how that works out for you! Once the extra annotation is added into the data, you could also explore sorting the data with one of the regular text sorting tools to achieve the output you want. Maybe put all of these into a mini-workflow for reuse?

Thanks!