USEARCH quick start for CD-HIT users

Clustering commands
In CD-HIT, different programs are used depending on the sequence type (protein or nucleotide), with specialized variants for sequencing reads. In USEARCH, there is only one program, which provides two clustering commands: cluster_fast and cluster_smallmem. Like most USEARCH commands, both support proteins and nucleotides; the sequence type is detected automatically. Accept options provide a rich set of criteria for sequence matching.

CD-HIT program	Seq. type	USEARCH equivalent
cd-hit	protein	Protein clustering. Use cluster_fast or cluster_smallmem.
cd-hit-est	nucl.	Nucleotide clustering. Use cluster_fast or cluster_smallmem.
cd-hit-2d	protein	Database search. The -db2 option of cd-hit-2d is the query sequence and -db1 is the database. The equivalent in USEARCH is usearch_global. The UC output file can be used to identify query sequences that did not match the database (reported as N records).
cd-hit-454	454 reads	cd-hit-454 adds two clustering criteria: (1) the sequences must start at the same position, i.e. terminal gaps are not allowed at the left end of the alignment, and (2) gaps longer than one position are not allowed. Criterion (1) can be implemented using the ‑leftjust or ‑idprefix accept options (‑idprefix is preferred because it is faster). Criterion (2) can be implemented by disallowing internal gap extensions (see alignment options). However, I believe (2) makes little difference in practice and is not worth the trouble.
cd-hit-est-2d	nucl.	See comments for cd-hit-2d.
cd-hit-otu	16S reads	Script for clustering 16S reads into OTUs. Otupipe is a USEARCH-based script for making OTUs, but I generally recommend using USEARCH-based scripts in the QIIME package for analyzing 16S reads.
cd-hit-dup	nucl.	Dereplication. Use derep_fulllength or derep_prefix.
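
The table notes that query sequences with no database match are reported as N records in the UC output of usearch_global. As a minimal sketch, the following Python collects those query labels. It assumes the usual tab-separated UC layout with the record type in field 1 and the query label in field 9; the function name and the example records are hypothetical, so check them against a real UC file before relying on this.

```python
import io


def unmatched_queries(uc_lines):
    """Return the query labels of N (no hit) records in UC-formatted lines.

    Assumes tab-separated fields with the record type in field 1 and the
    query label in field 9.
    """
    labels = []
    for line in uc_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if fields[0] == "N" and len(fields) >= 9:
            labels.append(fields[8])
    return labels


# Hypothetical UC content: one hit (H) record and two no-hit (N) records.
example_uc = io.StringIO(
    "H\t0\t250\t98.4\t+\t0\t0\t250M\tread1\tref7\n"
    "N\t*\t*\t*\t.\t*\t*\t*\tread2\t*\n"
    "N\t*\t*\t*\t.\t*\t*\t*\tread3\t*\n"
)

print(unmatched_queries(example_uc))
```

For a real run, pass an open file handle for the UC file instead of the in-memory example.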

Clustering threshold
The -c option of CD-HIT is roughly equivalent to the ‑id option of USEARCH, but there are two important differences. First, USEARCH and CD-HIT use different definitions of identity: USEARCH counts gaps as differences, but CD-HIT does not, so CD-HIT assigns systematically higher identities to alignments containing gaps. Second, CD-HIT has lower gap and mismatch penalties than other programs, so it tends to produce "gappier" alignments with more match columns, which also raises identities systematically. The net result is that the CD-HIT clustering threshold is not directly comparable with the USEARCH threshold. For example, I would estimate that -c 0.95 is roughly comparable to ‑id 0.97 in USEARCH, but the differences cannot be compensated for simply by re-scaling the identity.
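
To illustrate why counting gaps matters, here is a small sketch that scores one gapped alignment in two ways: once with gap columns counted as differences, and once with gap columns left out of the denominator. The two formulas are simplified assumptions made for illustration; they are not taken from either program's source and may differ from the exact definitions used by USEARCH and CD-HIT. The function name and the sequences are hypothetical.

```python
def identities(row_a, row_b):
    """Return (identity with gaps as differences, identity ignoring gap columns).

    row_a and row_b are the two rows of a pairwise alignment, with '-'
    marking gaps. Both formulas are illustrative simplifications.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    matches = 0
    gap_columns = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_columns += 1
        elif a == b:
            matches += 1
    columns = len(row_a)
    ungapped = columns - gap_columns
    with_gaps = matches / columns if columns else 0.0
    without_gaps = matches / ungapped if ungapped else 0.0
    return with_gaps, without_gaps


# Hypothetical alignment: 10 columns, 1 gap column, 1 mismatch, 8 matches.
with_gaps, without_gaps = identities("ACGTACGTAC", "ACG-ACGTTC")
print(f"gaps counted as differences: {with_gaps:.3f}")
print(f"gap columns ignored:         {without_gaps:.3f}")
```

The same alignment scores 8/10 under the first definition and 8/9 under the second, so a fixed threshold admits different sets of alignments depending on which definition is in use.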

Alignment heuristics and banding
Both CD-HIT and USEARCH use fast heuristics to compute global alignments, but there are important differences. Both programs use a technique called "banding" to limit the region of the dynamic programming matrix that is filled in. The strategies are quite different, however, and the ‑band option of USEARCH is not equivalent to the -b option of CD-HIT. In CD-HIT, the band spans the entire alignment. USEARCH starts by finding HSPs using an X-drop algorithm similar to BLAST; banding is used only for the regions between HSPs, and the band width is set dynamically according to the length difference of the aligned regions. The ‑band option of USEARCH sets a minimum radius of the band (width = radius x 2 + 1), while the CD-HIT -b option sets the maximum width of the band. The net result is that CD-HIT alignments are much more prone to banding artifacts.
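
The relation width = radius x 2 + 1 and the saving from banding can be sketched as follows. This is a generic fixed band around the main diagonal, written only to show the arithmetic; it is not how either USEARCH or CD-HIT implements banding, and the function names are hypothetical.

```python
def band_width(radius):
    """Width of a band with the given radius: the diagonal plus radius cells on each side."""
    return 2 * radius + 1


def cells_in_band(len_a, len_b, radius):
    """Count dynamic programming cells (i, j) with |i - j| <= radius."""
    count = 0
    for i in range(len_a):
        low = max(0, i - radius)
        high = min(len_b - 1, i + radius)
        if high >= low:
            count += high - low + 1
    return count


radius = 16
len_a, len_b = 400, 400
full = len_a * len_b
banded = cells_in_band(len_a, len_b, radius)
print(f"band width: {band_width(radius)}")
print(f"cells filled: {banded} of {full}")
```

A narrow band leaves most of the matrix unfilled, which is where the speed comes from. It also means an alignment path that needs to stray further from the diagonal than the radius allows cannot be found, which is the kind of banding artifact described above.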