UCLUST algorithm

The UCLUST algorithm divides a set of sequences into clusters. The cluster_fast and cluster_smallmem commands are based on UCLUST. A cluster is defined by one sequence, known as the centroid or representative sequence. Every sequence in the cluster must have similarity above a given identity threshold with the centroid, as shown in the figure below. In previous versions centroids were called seed sequences; this term is no longer used to avoid confusion with alignment seeds (matching words) in algorithms such as BLAST and UBLAST. The identity threshold (T) can be viewed as the radius of a cluster. Clustering commands include cluster_fast and cluister_smallmem.

Clustering and identity
The definition of identity changed in USEARCH version 6. This sometimes results in an increased number of clusters at a given identity threshold, but this should not be interpreted as reduced sensitivity or worse cluster quality.

Recommended identity ranges
UCLUST is effective at identities of ~50% and above for proteins and ~75% and above for nucleotides. At lower identities, this type of method is questionable because (i) alignment quality degrades and (ii) homology cannot be reliably determined from an alignment.

Clustering criteria
UCLUST is designed to find a set of clusters such that:

(1) All centroids have similarity < T to each other, and
(2) All member sequences have similarity >= T to a centroid.

With default parameters, the algorithm is heuristic and condition (1) is not guaranteed to hold, though in practice false negatives (two centroids with similarity >= T) are rare. Note that in general, many different clusterings will satisfy these criteria. For example, a sequence may match two different centroids with identity > T. Ideally, it will be assigned to the closest centroid, but there may be two or more at same distance, in which case the best cluster assignment is ambiguous and an arbitrary choice must be made.

Identities are computed using a global alignment. Clustering based on local alignments could easily be implemented in the USEARCH software, but I believe local clustering is fundamentally flawed. If you have a application that really needs it, I'll add support for local clustering.

Input sequence order
UCLUST is a greedy algorithm, and the order of the input sequences is important. In the cluister_smallmem command, sequences are processed in the order they appear in the input file. If the next sequence matches an existing centroid, it is assigned to that cluster, otherwise it becomes the centroid of a new cluster. This means that sequences should be ordered so that the most appropriate centroids tend to appear earlier in the file. The best order depends on the application; see UCLUST sort order for further discussion. The cluster_fast command always presorts by length. This cannot be changed, so if you need a different order you must use cluster_smallmem.

Searching the centroid database
A fundamental step in UCLUST is to compare an input sequence with the centroids identified so far. This is done using the USEARCH algorithm. Most USEARCH parameters, including indexing options, can be used with clustering commands in order to control sensitivity, memory use and speed.

Clustering fragments
If both full-length sequences and fragments are present in the input, then sorting sequences by decreasing length is effective (see sort order, sortbylength). Fragments usually make poor centroids because two full-length sequences may be similar to a fragment over one segment while being more diverged over the rest of their length. If the fragment is a centroid, this could result in highly divergent full-length sequences being assigned to the same cluster. The cluster_fast command always sorts by decreasing length.

OTU clustering
Clustering sequences into Operational Taxonomic Units (OTUs) is a specialized problem. A length sort is not appropriate as this tends to choose centroids that are biological outliers, splitting species and genera into different clusters and inflating alpha diversity. See OTU clustering for a recommended procedure for creating OTUs with USEARCH.

Reference
Edgar, R.C. (2010) Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19), 2460-2461.
doi: 10.1093/bioinformatics/btq461