FAQ: Which taxonomy database should I use?

See also
 
Taxonomy database downloads
  Edgar 2018, "Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences" (link to paper).
    Shows that V4 accuracy is ~50%, that species prediction is not possible, and that it is better to use a small, authoritative reference.
  Edgar 2018, "Taxonomy annotations and guide tree errors in 16S rRNA databases" (link to paper).
    Shows that the annotation error rate of SILVA and Greengenes is ~17%.

Use a small database with authoritative classifications
I recommend using a small set of authoritatively classified sequences; for 16S, that could be the RDP training set or the SILVA LTP subset.

Taxonomies in large databases are unreliable predictions
The taxonomy annotations in the large 16S databases (SILVA, Greengenes, or the full RDP database) are mostly computational predictions from 16S sequences. To the best of my knowledge, the methods used for SILVA and Greengenes are not available, and the algorithms are not fully explained. In the case of Greengenes, and perhaps also SILVA, the annotations are manually curated, but note that the curation is based on the sequences and the predicted phylogenetic tree, not on phenotype, which is not available for the vast majority of sequences.

The predicted phylogenetic trees surely have errors, and there are many subtle issues in interpreting the annotations. For example, if there is only one type strain for a given genus, then sequence identity is the only way to determine if another sequence is in the same genus, and identity therefore defines the genus in such cases, rather than observed phenotypes. About half of genera have only a single type strain, so this situation is common. Also, does a blank name mean that the sequence belongs to a novel taxon, or does it mean that it was predicted to belong to a known taxon but with a confidence that is too low to be sure, say 0.75?
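
The blank-name ambiguity can be illustrated with a minimal Python sketch. It assumes a hypothetical annotation string in which each rank carries a confidence value in parentheses; the string, the names in it, the confidence values, and the function truncate_at_cutoff are all made up for illustration and do not describe how any of the databases above actually produce their annotations.

```python
import re

# Hypothetical annotation: rank:name(confidence), comma-separated.
# Names and confidence values are invented for this example.
ANNOTATION = (
    "d:Bacteria(1.00),p:Firmicutes(0.98),c:Bacilli(0.95),"
    "o:Lactobacillales(0.90),f:Streptococcaceae(0.82),g:Streptococcus(0.75)"
)

# Captures the rank letter, the taxon name, and the confidence value.
RANK_PATTERN = re.compile(r"([a-z]):([^(,]+)\(([0-9.]+)\)")


def truncate_at_cutoff(annotation, cutoff):
    """Keep ranks from the top down until confidence drops below cutoff."""
    kept = []
    for rank, name, conf in RANK_PATTERN.findall(annotation):
        if float(conf) < cutoff:
            break  # this rank and all lower ranks are left blank
        kept.append(f"{rank}:{name}")
    return ",".join(kept)


# With a cutoff of 0.8 the genus (0.75) is dropped, so the published
# annotation would stop at family with a blank genus.
print(truncate_at_cutoff(ANNOTATION, 0.8))

# With a cutoff of 0.7 the same sequence keeps its genus name.
print(truncate_at_cutoff(ANNOTATION, 0.7))
```

In this sketch, a reader who sees only the truncated annotation cannot tell whether the genus is blank because the sequence belongs to a novel genus or because a known genus was predicted with confidence below the cutoff.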

Thus, using annotations in large databases adds errors or uncertainties on top of the errors introduced by using a prediction algorithm such as SINTAX or the Naive Bayesian Classifier.

With these considerations in mind, I believe it is best to use a database of type strain sequences rather than a large database such as Greengenes or SILVA.