Cd-hit doubts if the representative sequences are in the same order as clusters

Ines_Leao · October 30, 2025, 11:51am

Hi!

I ran cd-hit on nucleotide sequences with the options “Sort cluster by size” and “sort FASTA by cluster size” selected. My question is: does representative sequence 1 correspond to cluster 0?

jennaj · October 30, 2025, 10:22pm

Hi @Ines_Leao

Yes, when computers list things out, the range usually starts with a 0 for the first item. Any time you see a first item counted as 1 instead, it is because the tool or program translated for you!

I have a list of 6 things.
The position of the first item in my list is at 0.
The position of the last is at 5.
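
As a small illustration of the counting above, here is a minimal Python sketch. The list contents and variable names are made up for the example; only the positions matter.

```python
# A made-up list of 6 things; only the positions are of interest.
items = ["a", "b", "c", "d", "e", "f"]

first_position = 0               # the first item sits at position 0
last_position = len(items) - 1   # 6 items, so the last one sits at position 5

print(first_position, last_position)  # 0 5

# enumerate() also counts from 0 unless told otherwise.
for position, item in enumerate(items):
    print(position, item)
```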

This is a nice explanation from engineers out in the wild. → https://cseducators.stackexchange.com/questions/5023/why-do-we-count-starting-from-zero

I also put a very simple example from the cd-hit tool test in this tiny shared history you can import and inspect to see how the two output files relate to each other.

https://usegalaxy.eu/u/jenj/h/ghelp-cd-hit

The first cluster is Cluster 0.
The first sequence included in it, which is our representative sequence for Cluster 0, is itself counted at 0 since it is the first item in a list of sequences.
Other sequences that are clustered with that representative sequence will be listed after, starting with 1 (if there are any!), because they will always be the second or greater item in a cluster’s list of sequences.
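
To make the layout described above easier to inspect, here is a rough Python sketch that lists the representative sequence of each cluster. It assumes the usual cd-hit .clstr layout as I understand it: a ">Cluster N" header, then one line per member with its position, its length, its ID followed by "...", and a trailing "*" on the representative. The sample text, the IDs, and the function name are made up for illustration.

```python
import re

# Hypothetical sample in the .clstr layout (IDs and lengths are invented).
SAMPLE_CLSTR = """\
>Cluster 0
0\t150nt, >seqA... *
1\t148nt, >seqB... at +/99.32%
>Cluster 1
0\t140nt, >seqC... *
1\t120nt, >seqD... at +/98.50%
"""

# position, length (nt or aa), sequence ID, then "*" or the identity note
MEMBER = re.compile(r"^(\d+)\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s+(.*)$")


def representatives(clstr_text):
    """Return {cluster_number: (sequence_id, length)} for lines marked '*'."""
    reps = {}
    cluster = None
    for line in clstr_text.splitlines():
        if line.startswith(">Cluster"):
            cluster = int(line.split()[1])
            continue
        match = MEMBER.match(line)
        if match and match.group(4).strip() == "*":
            reps[cluster] = (match.group(3), int(match.group(2)))
    return reps


print(representatives(SAMPLE_CLSTR))
```

The lengths returned here can be compared with the sequence lengths in the representative FASTA file to check that a given cluster and FASTA record really belong together.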

Please give this a review and let us know if this is clear! You are welcome to screenshot or paste back the first few lines of your output and we can discuss the example lines more closely.

Ines_Leao · October 31, 2025, 10:10am

Hi Jenna,

Thank you for your answer. However, representative sequence 1, which should correspond to cluster 0, does not have the same sequence length as shown in cluster 0.

jennaj · November 3, 2025, 10:45pm

Hi @Ines_Leao

Would you like to share your history back, so I can take a closer look and try to help more specifically with your use case? You can post the link back here.

How to get faster help with your question

We can help to confirm there isn’t some problem with the tool, or that specific version of the tool, plus explain how to match up the files if that is needed.

Thanks!

Ines_Leao · December 26, 2025, 12:00pm

Hi Jenna,

Here follows the link: Galaxy

Thank you so much!

gbbio · December 30, 2025, 10:10am

As I understand it, each cluster gets its index assigned during clustering, and that index has no relationship to the order of the FASTA file. If you want to know which cluster sequence >1 belongs to, you need to search for that sequence in the cluster file. As far as I can tell, the cluster that sequence >1 belongs to is >Cluster 5867, see:

The asterisk (*) marks the representative sequence for a cluster.
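
Searching the cluster file by hand gets tedious, so here is a minimal Python sketch of that lookup. It assumes each member line in the .clstr file contains the sequence ID as ">ID..." and that the representative line ends with "*". The sample data and the function name find_cluster are made up; note also that IDs in the .clstr file can be truncated depending on the description-length setting, in which case an exact match like this would miss them.

```python
# Hypothetical sample in the .clstr layout (IDs and lengths are invented).
SAMPLE_CLSTR = """\
>Cluster 0
0\t200nt, >7... *
1\t198nt, >11... at +/99.00%
>Cluster 1
0\t180nt, >3... *
>Cluster 2
0\t150nt, >1... *
"""


def find_cluster(clstr_lines, seq_id):
    """Return (cluster_name, is_representative) for seq_id, or None."""
    current = None
    needle = ">" + seq_id + "..."   # ">1..." does not match ">11..."
    for line in clstr_lines:
        line = line.rstrip()
        if line.startswith(">Cluster"):
            current = line[1:]      # drop the leading ">"
        elif needle in line:
            return current, line.endswith("*")
    return None


print(find_cluster(SAMPLE_CLSTR.splitlines(), "1"))
```

With a real file, the lines could come from open("your_file.clstr") instead of the sample string; that file name is a placeholder.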

I think normally if you set the sorting by size you get the correct order:

But after a quick check this seems to be broken:

github.com/weizhongli/cdhit

Functionality missing for -sf option in cd-hit-est

opened 02:17PM - 02 Sep 21 UTC

tcr0fts

I ran the following command for cd-hit-est and got the expected two output files…, so it ran successfully

./cd-hit-est -i /home/tsc7044/cd-hit-v4.8.1-2019-0228/soil_ntc04_processed.fasta -o soil_ntc04_cdhit99 -c 0.99 -n 11 -g 1 -d 0 -T 8 -M 1600 -sc 1 -sf 1

Note -sc 1 and -sf 1, meaning both the cluster file and the representative sequences fasta file should be ordered in decreasing cluster size. This checks out for the clusters file (largest cluster to singletons) but the representative reads fasta file is not ordered and seems to be ordered by ascending read number instead (not correlated with cluster size).

Re-running the same command with -sf 0 (turn off fasta sorting) gave the exact same output (for first several dozen lines at least). Is lack of -sf functionality a known issue or is something else wrong on my end?

I want to be able to reference the clusters file to find the top 'n' most common reads and then pull those from the representative reads fasta file. I have a work around but it looks like -sf should be able to make this much easier if it worked for me. Thanks

On Galaxy I also see this tool, which might do what you want:

Or you need to script something yourself.
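
If you do end up scripting it, something like the following Python sketch could be a starting point. It is not part of cd-hit or Galaxy. It assumes the .clstr file is already sorted by cluster size (the issue above reports that this part works), that representatives are the lines ending in "*", and that the ID before "..." in the .clstr file equals the first word of the FASTA header. All names and sample data are made up.

```python
# Hypothetical inputs: FASTA in its original order, .clstr sorted by size.
SAMPLE_FASTA = ">seq1\nACGT\n>seq2\nGGCCGG\n>seq3\nTTAATTAA\n"
SAMPLE_CLSTR = """\
>Cluster 0
0\t8nt, >seq3... *
1\t7nt, >seq4... at +/99.00%
2\t7nt, >seq5... at +/99.00%
>Cluster 1
0\t6nt, >seq2... *
1\t5nt, >seq6... at +/99.00%
>Cluster 2
0\t4nt, >seq1... *
"""


def rep_ids_in_cluster_order(clstr_text):
    """IDs of the '*' lines, in the order the clusters appear."""
    ids = []
    for line in clstr_text.splitlines():
        if ">" in line and line.rstrip().endswith("*"):
            ids.append(line.split(">", 1)[1].split("...", 1)[0])
    return ids


def read_fasta(fasta_text):
    """Return {sequence_id: sequence}, using the first word of each header."""
    records = {}
    name = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None and line.strip():
            records[name].append(line.strip())
    return {key: "".join(parts) for key, parts in records.items()}


def reorder(fasta_text, clstr_text):
    """Rewrite the FASTA so its records follow the cluster order."""
    records = read_fasta(fasta_text)
    out = []
    for seq_id in rep_ids_in_cluster_order(clstr_text):
        if seq_id in records:   # IDs missing from the FASTA are skipped
            out.append(f">{seq_id}\n{records[seq_id]}")
    return "\n".join(out) + "\n"


print(reorder(SAMPLE_FASTA, SAMPLE_CLSTR), end="")
```

This sketch keeps only the ID in each header and writes each sequence on one line, so adjust it if you need the full descriptions or wrapped lines.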

jennaj · January 5, 2026, 9:46pm

Very helpful information @gbbio – thank you! It seems like this is a known wrinkle with the underlying tool. Galaxy always hosts the original tool, so the correction would need to happen upstream first, then it would flow down to Galaxy and everywhere else.

Please give the Format cd-hit outputs tool a try @Ines_Leao and let us know how that works out for you! Once the extra annotation is added into the data, you could also explore sorting the data with one of the regular text sorting tools to achieve the output you want. Maybe put all of these into a mini-workflow for reuse?

Thanks!