CD-HIT versions

  
  
CD-HIT version 4
The source code of CD-HIT version 4 is a major rewrite of version 3. The v4 code is more readable, though still difficult to follow. The algorithm has been improved in some respects; in particular, multi-threading is supported via OpenMP parallelization. This multi-threading strategy is effective when the input data has high redundancy. When redundancy is low (average cluster size close to 1), wall-clock time is not reduced significantly and may even increase. Version 3 of CD-HIT was often orders of magnitude slower than the UCLUST algorithm in USEARCH. Version 4 is much improved, though it is still usually significantly slower than UCLUST, even when multi-threading is used. USEARCH does not support multi-threading for clustering, at least through version. A multi-threaded version of UCLUST is in development and may be included in USEARCH v6.
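
The redundancy measure mentioned above (average cluster size) can be estimated from an existing clustering of the data. The following is a minimal sketch, not part of CD-HIT or USEARCH. It assumes the usual CD-HIT .clstr layout, in which each cluster begins with a line starting with ">Cluster" and is followed by one line per member sequence. The function name and the example records are hypothetical.

```python
def average_cluster_size(clstr_lines):
    """Return (n_sequences, n_clusters, mean_size) for lines of a .clstr file.

    Assumes each cluster starts with a '>Cluster' header line and that
    every other non-blank line describes exactly one member sequence.
    """
    n_clusters = 0
    n_sequences = 0
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">Cluster"):
            n_clusters += 1
        else:
            n_sequences += 1
    mean_size = n_sequences / n_clusters if n_clusters else 0.0
    return n_sequences, n_clusters, mean_size


# Hypothetical records, for illustration only.
example = """\
>Cluster 0
0	120aa, >seqA... *
1	118aa, >seqB... at 95.00%
>Cluster 1
0	90aa, >seqC... *
""".splitlines()

n_seq, n_clu, mean = average_cluster_size(example)
print(f"{n_seq} sequences in {n_clu} clusters, mean size {mean:.2f}")
```

A mean size close to 1 corresponds to the low-redundancy case, in which multi-threading is unlikely to reduce wall-clock time. What counts as "close to 1" is a judgment call that the text above does not quantify.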

CD-HIT versions and downloads
At the time of writing (Feb 2012), there are two download pages for CD-HIT, and they give contradictory information about which version is considered stable and recommended for production use. The CD-HIT home page at UCSD has a Downloads tab which refers to one site (http://bioinformatics.org/cd-hit/) for "tested and stable" releases, and to another (http://code.google.com/p/cdhit/) for "developers" and "new features". However, the Google Code download page describes six of the download files, from cd-hit-v4.5.3-2011-02-09.tgz to cd-hit-v4.5.7-2011-12-16.tgz, as "Stable", while the latest version on the bioinformatics.org page is cd-hit-v4.5.4-2011-03-07.tgz. It is therefore not clear whether 4.5.4 or 4.5.7 is the recommended stable release. In the tests reported here, I used these two downloads: v3.1.2 (cd-hit-2009-0427.tar.gz) and v4.5.7 (cd-hit-v4.5.7-2011-12-16.tgz).