Pair-wise identity varies by method
The pair-wise identity between two sequences depends on the
alignment and the definition of identity. Alignments vary due to the use
of different parameters such as gap penalties and substitution scores.
Definitions of identity vary depending on the treatment of gaps. Link to
definitions used here.
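
As a rough illustration of how the treatment of gaps changes the reported %id, the Python sketch below scores one fixed alignment under two definitions. Both definitions are simplified assumptions made for this example: matches divided by all alignment columns, and matches divided by the number of letters in the shorter sequence. They are not claimed to be the exact formulas used by any particular program. The function name `identity_metrics` and the toy sequences are hypothetical.

```python
def identity_metrics(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Score one pairwise alignment under two illustrative %id definitions."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")

    cols = list(zip(row_a, row_b))

    # Drop terminal gap columns (leading/trailing columns containing a gap).
    start, end = 0, len(cols)
    while start < end and gap in cols[start]:
        start += 1
    while end > start and gap in cols[end - 1]:
        end -= 1
    cols = cols[start:end]
    if not cols:
        raise ValueError("no aligned columns left after trimming")

    matches = sum(1 for a, b in cols if a == b)
    # Number of letters (non-gap symbols) each sequence has in the aligned region.
    len_a = sum(1 for a, _ in cols if a != gap)
    len_b = sum(1 for _, b in cols if b != gap)

    return {
        "matches": matches,
        "columns": len(cols),
        # Assumed definition 1: internal gap columns count against identity.
        "id_all_columns": 100.0 * matches / len(cols),
        # Assumed definition 2: gaps placed in the shorter sequence are ignored.
        "id_shorter_seq": 100.0 * matches / min(len_a, len_b),
    }


# Toy alignment with a two-column internal gap in the second (shorter) row.
result = identity_metrics("ACGTACGTAC", "ACG--CGTAC")
print(result)
```

For this toy alignment the same 8 matching columns give 80% when all 10 columns are counted and 100% when only the 8 letters of the shorter sequence are counted.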
CD-HIT tends to produce gappy alignments because it uses low gap
penalties and mismatch scores, combined with a definition of %id that
does not account for gaps in the shorter sequence. Compared to USEARCH,
CD-HIT reports systematically higher %ids, which means that CD-HIT
clusters are not directly comparable to USEARCH clusters at a given
identity threshold.
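
The effect of gap penalties on gappiness can be seen with a toy global aligner. The sketch below is a plain Needleman-Wunsch implementation with a linear gap penalty, written for illustration only. It is not the alignment algorithm of CD-HIT or USEARCH, and the scores, the function names `global_align` and `summarize`, and the two short sequences are assumptions chosen for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right cell, preferring the diagonal move.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


def summarize(row_a, row_b):
    """Count columns, matching columns and gap columns of an alignment."""
    pairs = list(zip(row_a, row_b))
    return {
        "columns": len(pairs),
        "matches": sum(1 for x, y in pairs if x == y),
        "gap_columns": sum(1 for x, y in pairs if "-" in (x, y)),
    }


seq_a = "ACGTTGCA"
seq_b = "ACGTGCAA"
low = summarize(*global_align(seq_a, seq_b, gap=-1))   # cheap gaps
high = summarize(*global_align(seq_a, seq_b, gap=-3))  # expensive gaps
print("low gap penalty: ", low)
print("high gap penalty:", high)
```

With the cheaper gap penalty the aligner opens two gap columns and finds 7 matching columns; with the more expensive penalty it returns the ungapped alignment with 5 matches. The same pair of sequences therefore yields different identities depending only on the scoring parameters. The toy says nothing about which of the two alignments is biologically correct.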
Below are alignments by CD-HIT, USEARCH and CLUSTALW of a pair of 16S
rRNA reads (in
FASTA format at bottom of this page). Terminal gaps are not shown. The pair has identity 97%
according to CD-HIT and 86% according to USEARCH. See here for instructions on how
to view CD-HIT alignments.
The following table summarizes the %ids
assigned to this pair of reads by different methods. The CD-HIT value
stands out as anomalously high: 97%, compared with the
next-largest value of 93.5% (CLUSTALW alignment with CD-HIT definition).
Results from the MUSCLE alignment are also reported (alignment not shown).
How close are the reads taxonomically?
The RDP Naive Bayesian
Classifier assigns these reads to different families in the same order.
Since the RDP classifier uses an alignment-free method, we can assume
that it is independent of alignment biases. In this example, the
divergence reported by USEARCH is closer to the expected taxonomic