CD-HIT and USEARCH results on a 16S rRNA clustering test

 
Results
The table below is an updated version of Table 2 from the USEARCH paper (Edgar 2010), which compared CD-HIT v3 with the UCLUST algorithm as implemented in USEARCH v1.1.570. It shows that CD-HIT v4 produces substantially different results from CD-HIT v3: at every identity threshold from 80% to 97% it is faster and reports fewer clusters. The input is a set of 1.1M 16S rRNA reads from Costello et al., 2009. The tests were run on a quad-core CPU, so performance could be expected to improve with up to four threads. CD-HIT v3 and USEARCH v5 do not support multi-threading for clustering, so their results are for a single thread.

These results are misleading
Please note that I now consider this methodology to be highly misleading, for reasons discussed here. I am providing these results only for comparison with the paper to indicate the improvements in speed in CD-HIT v4.

Greedy clustering is not recommended for 16S OTUs
I do not recommend using the UCLUST algorithm or CD-HIT for generating OTUs, especially if a decreasing length sort is used with USEARCH (CD-HIT-EST always does a length sort). There are several problems with this approach. For example, the longest sequence in a cluster tends to be an outlier relative to an abundant biological sequence, so it is not appropriate as a representative sequence, and the approach tends to greatly overestimate the number of OTUs. I recommend using otupipe for OTU clustering.
 
Min %id            USEARCH v5    CD-HIT v3     CD-HIT v4     CD-HIT v4     CD-HIT v4
                                               defaults      2 threads     4 threads

70%      Time      115 s         (CD-HIT cannot cluster < 80%)
                   1m 55s
         Clusters  258

75%      Time      116 s
                   1m 54s
         Clusters  543

80%      Time      102 s         37,801 s      1,078 s       585 s         305 s
                   1m 42s        10h 30m 1s    17m 58s       9m 45s        5m 5s
         Clusters  1,143         1,987         679           679           679

90%      Time      88 s          3,231 s       1,152 s       585 s         345 s
                   1m 28s        53m 51s       19m 11s       9m 45s        5m 45s
         Clusters  5,398         6,366         4,325         4,325         4,325

95%      Time      121 s         1,729 s       1,102 s       670 s         239 s
                   2m 1s         28m 49s       18m 22s       11m 10s       3m 59s
         Clusters  16,289        16,304        13,257        13,257        13,257

97%      Time      167 s         2,151 s       1,794 s       649 s         346 s
                   2m 47s        35m 51s       29m 54s       10m 49s       5m 46s
         Clusters  29,586        28,446        24,899        24,899        24,899
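
As a rough way to read the timings, the speed ratios can be computed directly from the seconds in the table. The sketch below is only illustrative: the dictionary and function names are made up for this example, and it assumes that the "CD-HIT v4 defaults" column is a single-threaded run.

```python
# Wall-clock times in seconds, copied from the table above.
# Column order: USEARCH v5, CD-HIT v3, CD-HIT v4 defaults,
# CD-HIT v4 2 threads, CD-HIT v4 4 threads.
TIMES = {
    80: (102, 37801, 1078, 585, 305),
    90: (88, 3231, 1152, 585, 345),
    95: (121, 1729, 1102, 670, 239),
    97: (167, 2151, 1794, 649, 346),
}


def speed_ratios(row):
    """Return speed ratios for one identity threshold (hypothetical helper)."""
    usearch, v3, v4_default, v4_two, v4_four = row
    return {
        # How many times faster CD-HIT v4 (defaults) is than CD-HIT v3.
        "v4_vs_v3": v3 / v4_default,
        # Scaling from the default run to 4 threads
        # (assumes the default run uses one thread).
        "v4_4threads_vs_default": v4_default / v4_four,
        # How many times longer CD-HIT v4 with 4 threads takes than USEARCH v5.
        "v4_4threads_vs_usearch": v4_four / usearch,
    }


for pct, row in sorted(TIMES.items()):
    r = speed_ratios(row)
    print(
        f"{pct}%: v4/v3 speedup {r['v4_vs_v3']:.1f}x, "
        f"4-thread scaling {r['v4_4threads_vs_default']:.1f}x, "
        f"v4 (4 threads) takes {r['v4_4threads_vs_usearch']:.1f}x USEARCH time"
    )
```

These ratios describe run time only; they say nothing about the quality of the clusters.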

Software and hardware versions
CD-HIT-EST v3.1.2 and v4.5.7.
USEARCH v5.2.13 (32-bit Linux i86).
CPU: Quad-core Xeon X5450 3.0GHz.

Command-lines
cd-hit-est -i costello.fasta -o costello -c 0.97 -M 0 [-t 4]
usearch -cluster costello.fasta -uc costello.uc -id 0.97
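
The command lines above show only the 97% threshold. A minimal sketch of how the equivalent commands for the other thresholds might be generated is given below. It only prints the commands and does not run them. The per-threshold output names (costello_80 and so on) and the helper name are assumptions for illustration, not the names used in the original tests, and the optional thread option is left out.

```python
import shlex


def build_commands(identity, fasta="costello.fasta"):
    """Build cd-hit-est and usearch argument lists for one identity threshold.

    Only the options shown in the command lines above are used. The output
    names are hypothetical per-threshold names chosen for this sketch.
    """
    tag = f"costello_{round(identity * 100)}"
    ident = f"{identity:.2f}"
    cdhit = ["cd-hit-est", "-i", fasta, "-o", tag, "-c", ident, "-M", "0"]
    usearch = ["usearch", "-cluster", fasta, "-uc", f"{tag}.uc", "-id", ident]
    return cdhit, usearch


# Thresholds at which both programs were run in the table.
for identity in (0.80, 0.90, 0.95, 0.97):
    for cmd in build_commands(identity):
        # shlex.join quotes arguments safely for pasting into a shell.
        print(shlex.join(cmd))
```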

References
Edgar, R.C. (2010), Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19), 2460-2461. doi: 10.1093/bioinformatics/btq461
Costello, E.K. et al. (2009), Bacterial community variation in human body habitats across space and time, Science 326, 1694-97.