FAQ: Which taxonomy database should I use?
Edgar 2018, "Accuracy of taxonomy prediction
for 16S rRNA and fungal ITS sequences".
Shows that V4 accuracy is ~50%, that species
prediction is not possible, and that it is better to use a small, authoritative reference.
Edgar 2018, "Taxonomy annotations and guide tree
errors in 16S rRNA databases".
Shows that the annotation error rate of
SILVA and Greengenes is ~17%.
Use a small database with authoritative classifications
I recommend using a small database of
authoritatively classified sequences, e.g. for 16S that could
be the RDP training set or the SILVA LTP subset.
Taxonomies in large databases are
The taxonomy annotations in the
large 16S databases (SILVA, Greengenes, or the full
RDP database) are mostly computational predictions from 16S sequences. For
SILVA and Greengenes, the methods are not available to the best of my
knowledge, and the algorithms are not fully explained. In the case of
Greengenes, and perhaps also SILVA, the annotations are manually curated,
but note that the curation is based on the sequences and the
predicted phylogenetic tree, not on phenotype, which is not available
for the vast majority of sequences.
The predicted phylogenetic trees
surely have errors, and there are many subtle issues in interpreting the
annotations. For example, if there is only one type strain for a given
genus, then sequence identity is the only way to determine if another
sequence is in the same genus, and identity therefore defines the
genus in such cases, rather than observed phenotypes. About half of genera
have only a single type strain, so this situation is common. Also, does a
blank name mean that the sequence belongs to a novel taxon, or does it mean
that it was predicted to belong to a known taxon but with a confidence that
is too low to be sure, say 0.75?
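
To make the blank-name ambiguity concrete, here is a minimal Python sketch. It assumes predictions are written as comma-separated rank:name(confidence) fields in the style of SINTAX output. The function names, the example string and the 0.8 cutoff are hypothetical choices for illustration; they are not taken from any database or tool.

```python
import re

# One "rank:name(confidence)" field, e.g. g:Bacillus(0.75).
# The field layout is an assumption modeled on SINTAX-style output.
FIELD = re.compile(r'^([a-z]):"?([^"(]+)"?\(([0-9.]+)\)$')

def parse_prediction(pred):
    """Split 'd:Bacteria(1.00),p:Firmicutes(0.98)' into (rank, name, conf) tuples."""
    fields = []
    for item in pred.split(","):
        m = FIELD.match(item.strip())
        if m is None:
            raise ValueError(f"unrecognized field: {item!r}")
        fields.append((m.group(1), m.group(2), float(m.group(3))))
    return fields

def truncate_at_cutoff(fields, cutoff=0.8):
    """Keep ranks from the top down, stopping at the first one below the cutoff."""
    kept = []
    for rank, name, conf in fields:
        if conf < cutoff:
            break
        kept.append((rank, name))
    return kept

# Hypothetical prediction: the genus is predicted, but only with confidence 0.75.
example = "d:Bacteria(1.00),p:Firmicutes(0.98),f:Bacillaceae(0.91),g:Bacillus(0.75)"
kept = truncate_at_cutoff(parse_prediction(example), cutoff=0.8)
print(kept)  # the genus is dropped, so the published annotation has a blank genus
```

Once the confidence values are discarded, a reader of the truncated annotation cannot tell this case (a known genus predicted at 0.75) from a sequence that belongs to a novel genus.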
Thus, using annotations in large databases adds errors
or uncertainties on top of the errors introduced by using a prediction
algorithm such as SINTAX or the Naive Bayesian Classifier.
With these considerations in mind, I believe it is best to use a database of
type strain sequences rather than a large database such as Greengenes or