usearch manual
-id option
Specifies the identity threshold, expressed as a fraction between 0.0 and 1.0. See also accept options and alignment parameters.

To calculate an identity, an alignment is required. In usearch, the alignment is almost always a pair-wise alignment between a query sequence and a target sequence in a database. In the case of clustering commands, the target sequence is a cluster centroid.

The BLAST definition of identity is used, which is the number of identities divided by the number of alignment columns (see below for discussion). In the case of a global alignment, columns containing terminal gaps are discarded, but internal gaps do count as differences.
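The calculation described above can be sketched as follows. This is only an illustration, not usearch's own code: it assumes the alignment is supplied as two equal-length rows with "-" as the gap character, and the function names and example rows are hypothetical.

```python
def trim_terminal_gaps(query_row, target_row, gap="-"):
    """Drop leading and trailing columns in which either row has a gap."""
    assert len(query_row) == len(target_row)
    columns = list(zip(query_row, target_row))
    start = 0
    while start < len(columns) and gap in columns[start]:
        start += 1
    end = len(columns)
    while end > start and gap in columns[end - 1]:
        end -= 1
    return columns[start:end]


def blast_identity(query_row, target_row, gap="-"):
    """Identities divided by alignment columns, terminal gaps discarded."""
    columns = trim_terminal_gaps(query_row, target_row, gap)
    if not columns:
        return 0.0
    # Internal gap columns stay in the denominator, so they count as differences.
    identities = sum(1 for q, t in columns if q == t and q != gap)
    return identities / len(columns)


# Hypothetical global alignment with terminal gaps at both ends.
query = "--ACGTTGCA-CGT"
target = "GGACGT-GCATCG-"
print(round(blast_identity(query, target), 3))  # 9 identities / 11 columns
```

With an -id threshold of 0.9, this pair (about 0.818) would not qualify as a hit under this definition.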

Pair-wise identity varies by method
The pair-wise identity between two sequences depends on the alignment and the definition of identity. Alignments vary due to the use of different parameters such as gap penalties and substitution scores. Definitions of identity vary depending on the treatment of gaps.

Some popular definitions of %id
Terminal gaps are ignored; identity is calculated from the remaining columns. An "identity" is a column with two identical letters; a "mismatch" is a column with two different letters. An "indel" is a consecutive series of gaps in one sequence, so two or more consecutive gaps count as one indel.

GAST is an SSU taxonomy assignment method that counts one indel, rather than one gap column, as one difference. USEARCH supports several definitions (--iddef option); in versions 5 and earlier the default was the CD-HIT definition. The MBL definition in USEARCH is the same as GAST's.

BLAST definition
Identities / Columns

GAST definition
(Columns - Mismatches - Indels) / Columns

CD-HIT definition
Identities / (Length of shorter sequence)
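To make the three formulas concrete, here is a small sketch that evaluates all of them on one alignment. It assumes the alignment is given as two equal-length rows with "-" for gaps; the function name, the dictionary keys and the example rows are illustrative choices, not part of usearch, CD-HIT or GAST.

```python
from itertools import groupby


def identity_definitions(row_a, row_b, gap="-"):
    """Return the BLAST, GAST and CD-HIT identities for one pair-wise alignment."""
    assert len(row_a) == len(row_b)
    cols = list(zip(row_a, row_b))
    # Terminal gaps are ignored: trim gap columns from both ends.
    while cols and gap in cols[0]:
        cols.pop(0)
    while cols and gap in cols[-1]:
        cols.pop()
    n_cols = len(cols)
    identities = sum(1 for a, b in cols if a == b and a != gap)
    mismatches = sum(1 for a, b in cols if a != b and gap not in (a, b))
    # An indel is a run of consecutive gaps in one sequence.
    states = ["a" if a == gap else "b" if b == gap else "" for a, b in cols]
    indels = sum(1 for state, _ in groupby(states) if state)
    # CD-HIT divides by the ungapped length of the shorter sequence.
    shorter = min(len(row_a.replace(gap, "")), len(row_b.replace(gap, "")))
    return {
        "blast": identities / n_cols,
        "gast": (n_cols - mismatches - indels) / n_cols,
        "cdhit": identities / shorter,
    }


# Hypothetical alignment: 12 columns, 9 identities, 1 mismatch, one 2-column indel.
ids = identity_definitions("ACGTACGTACGT", "ACG--CGTTCGT")
print({name: round(value, 3) for name, value in ids.items()})
```

For this alignment the BLAST identity is 9/12, the GAST identity is 10/12 (the two-column gap counts as a single difference), and the CD-HIT identity is 9/10.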

Problems with the CD-HIT definition
For historical reasons, versions 5 and earlier of usearch used the CD-HIT definition of identity. In versions 6 and later, the BLAST definition is used. I made the change because I felt the CD-HIT definition had several important weaknesses.

The CD-HIT definition is not symmetrical between the longer and shorter sequence: gaps in the longer sequence reduce %id, but gaps in the shorter sequence do not. Gappier alignments therefore tend to have higher identities according to CD-HIT than according to other methods, and the CD-HIT %id correlates less well with evolutionary distance. A measure of %id that counts gaps as differences is more robust against the choice of alignment parameters (gap penalties and substitution matrices). For these reasons, I prefer the BLAST definition for most purposes, and it is the definition used from USEARCH v6 onward.
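The sensitivity to alignment parameters can be illustrated with a toy case: the same two sequences aligned once without gaps and once with a pair of gaps, as might happen under lower gap penalties. The sequences, the alignments and the helper function are made up for illustration, and the sketch assumes alignments with no terminal gaps.

```python
def blast_and_cdhit(row_a, row_b, gap="-"):
    """Return (BLAST id, CD-HIT id) for an alignment without terminal gaps."""
    assert len(row_a) == len(row_b)
    identities = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    columns = len(row_a)
    shorter = min(len(row_a.replace(gap, "")), len(row_b.replace(gap, "")))
    return identities / columns, identities / shorter


# Same pair of 10-letter sequences under two hypothetical alignments.
ungapped = blast_and_cdhit("ACGTCATGCA", "ACGTACTGCA")
gapped = blast_and_cdhit("ACGTCA-TGCA", "ACGT-ACTGCA")

for label, (blast_id, cdhit_id) in (("ungapped", ungapped), ("gapped", gapped)):
    print(f"{label}: BLAST {blast_id:.3f}, CD-HIT {cdhit_id:.3f}")
```

Adding the two gaps gains one identity. The CD-HIT value rises from 8/10 to 9/10 because the gap columns never enter its denominator, while the BLAST value moves only from 8/10 to 9/11.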

Example where CD-HIT id is 97% and USEARCH id is 86%

Example where CD-HIT id is 97% and USEARCH id is 95%