CD-HIT version 4
Version 4 of CD-HIT is a major rewrite of the version 3 source code.
The v4 code is more readable, though still difficult to follow. The
algorithm has been improved in some respects; in particular,
multi-threading is now supported via OpenMP parallelization.
This multi-threading strategy is effective when the input data has high
redundancy. When redundancy is low (average cluster size close to 1),
wall-clock time is not reduced significantly and may even increase.
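One rough way to judge which regime a data set falls into is to compute the average cluster size from an existing clustering. The sketch below is illustrative only. It assumes the usual CD-HIT .clstr text layout, in which each cluster starts with a line beginning ">Cluster" and every following non-blank line is one member sequence. The function name average_cluster_size and the sample records are hypothetical, not part of CD-HIT or USEARCH.

```python
def average_cluster_size(clstr_lines):
    """Return (clusters, sequences, mean cluster size) for .clstr-style lines.

    Assumption: a line starting with ">Cluster" opens a new cluster and
    every other non-blank line describes exactly one member sequence.
    """
    clusters = 0
    members = 0
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">Cluster"):
            clusters += 1
        else:
            members += 1
    mean = members / clusters if clusters else 0.0
    return clusters, members, mean


# Hand-written sample in the assumed layout (not real program output).
example = """>Cluster 0
0\t120aa, >seqA... *
1\t118aa, >seqB... at 97.46%
>Cluster 1
0\t95aa, >seqC... *
""".splitlines()

clusters, members, mean = average_cluster_size(example)
print(clusters, members, mean)
```

A mean close to 1 indicates low redundancy, which is the case where multi-threading gave little or no reduction in wall-clock time.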
Version 3 of CD-HIT was often orders of magnitude slower than the UCLUST
algorithm in USEARCH. Version 4 is much improved though still usually
significantly slower than UCLUST, even when multi-threading is used.
USEARCH does not support multi-threading for clustering, at least
through version. A multi-threaded version of UCLUST is in development
and may be included in USEARCH v6.
CD-HIT versions and
At the time of writing (Feb 2012), there are two
download pages for CD-HIT which give contradictory information about
which version is considered stable and recommended for production use.
The CD-HIT home page at UCSD
has a Downloads tab which refers to one site (http://bioinformatics.org/cd-hit/)
for "tested and stable" releases, and to another (http://code.google.com/p/cdhit/)
for "developers" and "new features". However, the Google
Code download page describes six of the download files, from cd-hit-v4.5.3-2011-02-09.tgz
as "Stable", while the latest version on the bioinformatics.org
page is cd-hit-v4.5.4-2011-03-07.tgz.
It is therefore not clear whether 4.5.4 or 4.5.7 is the recommended
stable release. In the tests reported here, I used these two downloads:
and 4.5.7 (cd-hit-v4.5.7-2011-12-16.tgz).