usearch manual
FAQ: Which taxonomy database should I use?

See also
Taxonomy database downloads

Use a small database with authoritative classifications
I recommend using a database of authoritatively classified sequences, e.g. for 16S that could be the RDP training set or the SILVA LTP subset.

Taxonomies in large databases are unreliable predictions
The taxonomy annotations in the large 16S databases (SILVA, Greengenes, or the full RDP database) are mostly computational predictions from 16S sequences. To the best of my knowledge, the methods used for SILVA and Greengenes are not available, and the algorithms are not fully explained. In the case of Greengenes, and perhaps also SILVA, the annotations are manually curated, but note that the curation is based on the sequences and the predicted phylogenetic tree, not on phenotype, which is not available for the vast majority of sequences.

The predicted phylogenetic trees surely have errors, and there are many subtle issues in interpreting the annotations. For example, if there is only one type strain for a given genus, then sequence identity is the only way to determine if another sequence is in the same genus, and identity therefore defines the genus in such cases, rather than observed phenotypes. About half of genera have only a single type strain, so this situation is common. Also, does a blank name mean that the sequence belongs to a novel taxon, or does it mean that it was predicted to belong to a known taxon but with a confidence that is too low to be sure, say 0.75?
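To make the blank-name ambiguity concrete, here is a minimal Python sketch of how a confidence cutoff can turn a low-confidence prediction into a blank name. It assumes a SINTAX-style prediction string of the form rank:Name(confidence); the taxon names, confidence values and function names are invented for illustration, and this is not the code used by any of the databases or classifiers discussed here.

```python
import re

# Hypothetical SINTAX-style prediction; the names and confidences are made up.
PREDICTION = (
    "d:Bacteria(1.00),p:Firmicutes(0.99),c:Bacilli(0.97),"
    "o:Lactobacillales(0.92),f:Streptococcaceae(0.81),g:Streptococcus(0.75)"
)

# One field looks like "g:Streptococcus(0.75)": rank letter, name, confidence.
FIELD_RE = re.compile(r"^([a-z]):(.+)\(([0-9.]+)\)$")


def parse_prediction(text):
    """Split 'r:Name(conf),...' into a list of (rank, name, confidence).

    Assumes names contain no commas.
    """
    ranks = []
    for field in text.split(","):
        match = FIELD_RE.match(field.strip())
        if match is None:
            raise ValueError(f"unrecognized field: {field!r}")
        ranks.append((match.group(1), match.group(2), float(match.group(3))))
    return ranks


def apply_cutoff(ranks, cutoff):
    """Keep ranks from the top down, stopping at the first one below cutoff."""
    kept = []
    for rank, name, conf in ranks:
        if conf < cutoff:
            break
        kept.append((rank, name))
    return kept


ranks = parse_prediction(PREDICTION)
at_080 = apply_cutoff(ranks, 0.80)  # genus (0.75) falls below the cutoff
at_070 = apply_cutoff(ranks, 0.70)  # genus is retained

print(",".join(f"{r}:{n}" for r, n in at_080))
print(",".join(f"{r}:{n}" for r, n in at_070))
```

With a cutoff of 0.80 the genus is left blank even though a known genus was predicted with confidence 0.75. Someone who sees only the truncated annotation cannot distinguish this case from a sequence that belongs to a novel genus.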

Thus, using annotations in large databases adds errors and uncertainties on top of the errors introduced by using a prediction algorithm such as SINTAX or the Naive Bayesian Classifier.

The full RDP database is definitely not a good choice because the taxonomies were predicted by the RDP Classifier, which has a high rate of over-classification errors on full-length sequences (see the SINTAX paper). If you want predictions using Bergey's nomenclature, then I would recommend using the RDP training set with SINTAX.

With these considerations in mind, I believe it is best to use a database of type strain sequences rather than a large database such as Greengenes or SILVA.