UCLUST algorithm

The UCLUST algorithm divides a set of sequences into clusters. The cluster_fast and cluster_smallmem commands are based on UCLUST.

A cluster is defined by one sequence, known as the centroid or representative sequence. Every sequence in the cluster must have similarity above a given identity threshold with the centroid, as shown in the figure below. In previous versions centroids were called seed sequences; this term is no longer used, to avoid confusion with alignment seeds (matching words) in algorithms such as BLAST and UBLAST. The identity threshold (T) can be viewed as the radius of a cluster.
The clustering criteria are:
(1) All centroids have similarity < T to each other, and
(2) All member sequences have similarity >= T to their centroid.

With default parameters, the algorithm is heuristic and condition (1) is not guaranteed to hold, though in practice false negatives (two centroids with similarity >= T) are rare.

Note that in general, many different clusterings will satisfy these criteria. For example, a sequence may match two different centroids with identity >= T. Ideally, it will be assigned to the closest centroid, but there may be two or more at the same distance, in which case the best cluster assignment is ambiguous and an arbitrary choice must be made.

Identities are computed using a global alignment. Clustering based on local alignments could easily be implemented in the USEARCH software, but I believe local clustering is fundamentally flawed. If you have an application that really needs it, I'll add support for local clustering.
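The criteria and the tie ambiguity described above can be illustrated with a small Python sketch. This is not USEARCH code: the function names (toy_identity, check_clustering, closest_centroids) are hypothetical, and toy_identity is a deliberately simple stand-in that counts matching positions in equal-length sequences instead of computing a real global alignment.

```python
from typing import Callable, Dict, List


def toy_identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumption: a stand-in for the
    identity of a global alignment; only valid for equal-length,
    non-empty sequences with no gaps)."""
    if len(a) != len(b) or not a:
        raise ValueError("toy_identity needs non-empty, equal-length sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def check_clustering(
    clusters: Dict[str, List[str]],
    t: float,
    identity: Callable[[str, str], float] = toy_identity,
) -> List[str]:
    """Report violations of the criteria for a clustering given as
    {centroid: [member, ...]} and identity threshold t."""
    problems = []
    centroids = list(clusters)
    # Criterion (1): every pair of centroids has identity < T.
    for i, c1 in enumerate(centroids):
        for c2 in centroids[i + 1:]:
            if identity(c1, c2) >= t:
                problems.append(f"(1) centroids {c1} and {c2} have identity >= T")
    # Criterion (2): every member has identity >= T to its centroid.
    for centroid, members in clusters.items():
        for member in members:
            if identity(member, centroid) < t:
                problems.append(f"(2) member {member} is below T for centroid {centroid}")
    return problems


def closest_centroids(
    seq: str,
    centroids: List[str],
    t: float,
    identity: Callable[[str, str], float] = toy_identity,
) -> List[str]:
    """Return every centroid tied for the highest identity >= T with seq.
    More than one result means the assignment is ambiguous."""
    hits = [(identity(seq, c), c) for c in centroids]
    hits = [(score, c) for score, c in hits if score >= t]
    if not hits:
        return []
    best = max(score for score, _ in hits)
    return [c for score, c in hits if score == best]


if __name__ == "__main__":
    T = 0.8
    clusters = {
        "AAAAAAAAAA": ["AAAAAAAAAC"],
        "AAAAAAACCC": ["AAAAAAACCG"],
    }
    print(check_clustering(clusters, T))
    # This query is equally close to both centroids, so the choice is arbitrary.
    print(closest_centroids("AAAAAAACGA", list(clusters), T))
```

In this toy data the two centroids have identity 0.7 to each other, so both criteria hold at T = 0.8, while the query sequence matches each centroid at exactly 0.8 and could be placed in either cluster.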
Bioinformatics 26(19), 2460-2461. doi: 10.1093/bioinformatics/btq461