usearch manual
cluster quality and sequence identity
 
Since version 6, USEARCH uses the BLAST definition of sequence identity by default. Through version 5, USEARCH used the CD-HIT definition by default.

For a given alignment, BLAST identity <= CD-HIT identity. This is because BLAST counts gaps as differences, but CD-HIT sometimes does not. Insertions and deletions are generally less probable than substitutions. Therefore, gaps should count at least as much as substitutions in a measure of evolutionary distance, and the BLAST definition is more biologically realistic.
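
To make the inequality concrete, here is a minimal sketch that computes both values from a pairwise alignment. It assumes the commonly stated definitions: BLAST identity is the number of identical columns divided by the total number of alignment columns, and CD-HIT identity is the number of identical columns divided by the length of the shorter sequence. The function names and the example alignment are hypothetical, and the sketch counts every column it is given, so it does not try to reproduce how any particular program treats terminal gaps.

```python
def count_matches(row_a, row_b):
    """Count alignment columns where both rows carry the same letter."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")


def blast_identity(row_a, row_b):
    """Assumed BLAST-style identity: matches / number of alignment columns."""
    return count_matches(row_a, row_b) / len(row_a)


def cdhit_identity(row_a, row_b):
    """Assumed CD-HIT-style identity: matches / length of the shorter sequence."""
    len_a = len(row_a.replace("-", ""))  # ungapped length of sequence A
    len_b = len(row_b.replace("-", ""))  # ungapped length of sequence B
    return count_matches(row_a, row_b) / min(len_a, len_b)


# Hypothetical alignment: 12 columns, 10 identical columns, 2 gap columns.
row_a = "ACGTACGT-ACG"
row_b = "ACGTAC-TTACG"

print(f"BLAST identity:  {blast_identity(row_a, row_b):.3f}")
print(f"CD-HIT identity: {cdhit_identity(row_a, row_b):.3f}")
```

Both sequences here have 11 letters, so the CD-HIT-style value is 10/11 while the BLAST-style value is 10/12. The number of alignment columns can never be smaller than the length of the shorter sequence, which is why the BLAST-style value cannot exceed the CD-HIT-style value.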

Increased number of clusters
One effect of this change has been to increase the number of clusters (and so reduce their average size) in versions 6 and later compared to version 5 at a given identity threshold, especially at high identities. This is due to the tendency for %id to be reduced by the new definition, so fewer sequences match a given centroid. In some applications, notably OTU picking for SSU rRNA genes by clustering at 97% id, the number of clusters is sometimes used as a measure of cluster quality, and the increased number of clusters might then be interpreted as a reduction in cluster quality. In fact, I believe that the clusters produced by the new definition are better because the BLAST definition of identity is a better estimate of evolutionary distance.
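
The effect on cluster counts can be illustrated with a small worked example at a 97% threshold. This is a sketch only: the column counts are made up, the helper name is hypothetical, and the two formulas are the commonly stated definitions (matches over alignment columns for BLAST, matches over shorter sequence length for CD-HIT), not code taken from either program.

```python
def identities(matches, mismatches, gap_columns, shorter_length):
    """Return (BLAST-style, CD-HIT-style) identity from column counts."""
    columns = matches + mismatches + gap_columns
    return matches / columns, matches / shorter_length


THRESHOLD = 0.97  # 97% id, as in OTU picking

# Hypothetical query of 100 letters aligned to a centroid of 102 letters:
# 98 identical columns, 2 substitutions, 2 gap columns in the query.
blast_id, cdhit_id = identities(
    matches=98, mismatches=2, gap_columns=2, shorter_length=100
)

joins_under_cdhit = cdhit_id >= THRESHOLD  # 98/100 = 98.0%
joins_under_blast = blast_id >= THRESHOLD  # 98/102 is about 96.1%

print(f"CD-HIT id {cdhit_id:.1%}: joins centroid = {joins_under_cdhit}")
print(f"BLAST id  {blast_id:.1%}: joins centroid = {joins_under_blast}")
```

Under the CD-HIT-style value the query is assigned to the existing centroid, while under the BLAST-style value it falls below the threshold and would have to start a cluster of its own. Repeated over many sequences, this is how the same threshold yields more, smaller clusters.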