
Greengenes clustering benchmark


Methods
Input was the Greengenes 16S rRNA sequence database, version 13.5. This is a 1.7 GB FASTA file containing 1,262,986 (about 1.3M) sequences.
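The sequence count can be checked before running the benchmark. The sketch below counts FASTA header lines (lines beginning with `>`), assuming one header per record. The function name `count_fasta_records` is hypothetical and not part of USEARCH or CD-HIT.

```shell
# Hypothetical helper: count the records in a FASTA file.
# Each record starts with one header line beginning with '>',
# so counting those lines gives the number of sequences.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, `count_fasta_records gg.fa` should report 1262986 if `gg.fa` is the input file described above.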

USEARCH clustering was tested using the derep_fulllength, derep_prefix, cluster_fast and cluster_smallmem commands in USEARCH v7.0.1090, 64-bit build for Linux.

CD-HIT clustering was tested using cd-hit-est v4.6. At 100% identity, the cd-hit-est run had still not finished after several days and was canceled, so complete results for that run are not available.

See also: hardware configuration.

Command lines

usearch -derep_fulllength gg.fa -threads 6 -output full.fa

usearch -derep_prefix gg.fa -threads 6 -output prefix.fa

usearch -cluster_fast gg.fa -id 0.97 -threads 6 -centroids fast97.fa

usearch -cluster_smallmem gg.fa -id 0.97 -centroids small97.fa

cd-hit-est -i gg.fa -n 5 -M 0 -T 6 -c 0.8 -o gg80
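The page does not say how run times were recorded. As a minimal sketch of one possible approach, the wrapper below runs any command and appends its label, wall-clock seconds, and exit status to a tab-separated log. The name `run_timed` and the log layout are assumptions made for illustration, and this does not measure memory use.

```shell
# Hypothetical helper (not part of USEARCH or CD-HIT):
# usage: run_timed LABEL LOGFILE COMMAND [ARGS...]
# Runs COMMAND, then appends "LABEL<TAB>seconds<TAB>exit-status" to LOGFILE.
run_timed() {
    label=$1
    log=$2
    shift 2
    start=$(date +%s)          # wall-clock start, in whole seconds
    "$@"                       # run the command exactly as given
    status=$?
    end=$(date +%s)
    printf '%s\t%s\t%s\n' "$label" "$((end - start))" "$status" >> "$log"
    return "$status"
}
```

For example: `run_timed fast97 timings.tsv usearch -cluster_fast gg.fa -id 0.97 -threads 6 -centroids fast97.fa`. Resolution is one second, which is coarse for short runs but adequate for jobs that take minutes or longer.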