USEARCH manual
Sequence identity threshold
 
The -id option specifies the minimum identity between a query sequence and a database (target) sequence. In the clustering commands (cluster_fast and cluster_smallmem), the target sequence is a centroid (see UCLUST algorithm). Identity is given as a value in the range 0.0 to 1.0, sometimes called a fractional identity (as opposed to a percent identity in the range 0% to 100%); for example, 97% identity is written as 0.97. Identity is always calculated from an alignment, so in general it depends on the alignment parameters as well as on the definition of identity. This means that different programs may report different identities for the same pair of sequences, even if they use the same definition. The concept of sequence identity is therefore not biologically well-defined, and a reported value should not be taken too seriously.

Definition of identity
Different definitions of sequence identity are used by different programs (see here for examples). They differ mainly in their treatment of gaps. In version 6, USEARCH uses the same definition as BLAST, which is:

  identity = (number of identities) / (number of columns)

An "identity" is an alignment column containing two identical letters. With global alignments, terminal gaps are discarded and do not count towards the number of columns; internal gaps are included. With local alignments, terminal gaps never occur because a higher score can always be achieved by deleting them.
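As a rough illustration of the definition above, the following Python sketch computes the identity ratio from an alignment supplied as two gapped rows of equal length. It is not USEARCH code: the function name `blast_identity`, the `-` gap character and the `trim_terminal_gaps` switch are assumptions made for this example.

```python
def blast_identity(row_a: str, row_b: str, gap: str = "-",
                   trim_terminal_gaps: bool = True) -> float:
    """Return (number of identities) / (number of columns) for one alignment.

    row_a and row_b are the two rows of a pairwise alignment (hypothetical
    input format: equal-length strings, gaps written as `gap`).
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")

    columns = list(zip(row_a, row_b))

    if trim_terminal_gaps:
        # Global alignment case: discard leading and trailing columns that
        # contain a gap, so terminal gaps do not count as columns.
        start, end = 0, len(columns)
        while start < end and gap in columns[start]:
            start += 1
        while end > start and gap in columns[end - 1]:
            end -= 1
        columns = columns[start:end]

    if not columns:
        return 0.0  # nothing left to compare

    # An identity is a column with two identical letters (never two gaps).
    identities = sum(1 for a, b in columns if a == b and a != gap)

    # Internal gap columns remain in the denominator.
    return identities / len(columns)


if __name__ == "__main__":
    a = "--ACGTAC-GT--"
    b = "TTACGAACGGTAA"
    print(round(blast_identity(a, b), 3))                            # 0.778
    print(round(blast_identity(a, b, trim_terminal_gaps=False), 3))  # 0.538
```

In this example the four terminal gap columns are dropped, leaving 9 columns, of which 7 are identities (one mismatch and one internal gap), so the identity is 7/9. Keeping the terminal gaps in the denominator would give 7/13 instead.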

The BLAST definition was chosen because it is the most widely used, is intuitively appealing, is robust against changes in alignment parameters, avoids anomalies that occur with other definitions, and is the best estimate of evolutionary distance that is possible with an identity ratio.

Previous versions of USEARCH used the CD-HIT definition.

For a given alignment, BLAST identity <= CD-HIT identity. This is because BLAST counts gaps as differences, but CD-HIT does not. Insertions and deletions are generally less probable than substitutions. Therefore, gaps should count at least as much as substitutions in a measure of evolutionary distance, and the BLAST definition is more biologically realistic.
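The effect of counting gaps can be seen on a small example. The Python sketch below scores one alignment twice: once with gap columns in the denominator, and once with gap columns left out. The second function is only a stand-in for "a definition that does not count gaps as differences"; it is an assumption made for illustration and is not claimed to be CD-HIT's exact formula. All names are hypothetical, and the rows are assumed to have had their terminal gaps removed already.

```python
def count_columns(row_a: str, row_b: str, gap: str = "-"):
    """Return (identities, gap_columns, letter_columns) for one alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    identities = gap_columns = letter_columns = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            gap_columns += 1          # column holding a gap in either row
        else:
            letter_columns += 1       # column holding two letters
            if a == b:
                identities += 1
    return identities, gap_columns, letter_columns


def identity_gaps_counted(row_a: str, row_b: str) -> float:
    """Gap columns stay in the denominator, so each gap acts as a difference."""
    identities, gap_columns, letter_columns = count_columns(row_a, row_b)
    total = gap_columns + letter_columns
    return identities / total if total else 0.0


def identity_gaps_ignored(row_a: str, row_b: str) -> float:
    """Illustrative stand-in: gap columns are left out of the denominator."""
    identities, _, letter_columns = count_columns(row_a, row_b)
    return identities / letter_columns if letter_columns else 0.0


if __name__ == "__main__":
    a = "ACGTAC-GT"
    b = "ACGAACGGT"
    print(round(identity_gaps_counted(a, b), 3))  # 0.778
    print(round(identity_gaps_ignored(a, b), 3))  # 0.875
```

Both functions share the same numerator, and removing gap columns can only shrink the denominator, so the gap-counting value can never exceed the gap-ignoring value for the same alignment.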

See also
Identity and clustering.