FAQ: Which taxonomy database should I use?
Use a small database with authoritative classifications
I recommend using a small database of
authoritatively classified sequences, e.g. for 16S that could
be the RDP training set or the SILVA LTP subset.
Taxonomies in large databases are mostly computational predictions
The taxonomy annotations in the
large 16S databases (SILVA, Greengenes, or the full
RDP database) are mostly computational predictions from 16S sequences. For
SILVA and Greengenes, the methods are not available, to the best of my
knowledge, and the algorithms are not fully explained. In the case of
Greengenes, and perhaps also SILVA, the annotations are manually curated,
but note that the curation is based on the sequences and the
predicted phylogenetic tree, not on phenotype, which is not available
for the vast majority of sequences.
The predicted phylogenetic trees
surely have errors, and there are many subtle issues in interpreting the
annotations. For example, if there is only one type strain for a given
genus, then sequence identity is the only way to determine if another
sequence is in the same genus, and identity therefore defines the
genus in such cases, rather than observed phenotypes. About half of genera
have only a single type strain, so this situation is common. Also, does a
blank name mean that the sequence belongs to a novel taxon, or does it mean
that it was predicted to belong to a known taxon but with a confidence that
is too low to be sure, say 0.75?
Thus, using annotations in large databases adds errors
or uncertainties on top of the errors introduced by using a prediction
algorithm such as sintax or the Naive Bayesian Classifier.
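The blank-name ambiguity mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical annotation format of comma-separated rank:name(confidence) fields and a hypothetical helper, apply_cutoff, neither of which is taken from any of the databases discussed here. The point is only that once names below a cutoff are blanked, a moderately supported prediction (0.75) and a very poorly supported one (0.10) look identical to the user.

```python
import re

# Hypothetical annotation format, assumed for illustration only:
# comma-separated fields of the form rank:name(confidence).
FIELD = re.compile(r"^([a-z]):(.+)\(([0-9.]+)\)$")


def apply_cutoff(annotation, cutoff):
    """Blank the name at the first rank below the cutoff and at all lower ranks."""
    result = []
    below = False
    for field in annotation.split(","):
        match = FIELD.match(field.strip())
        if match is None:
            raise ValueError(f"unrecognized field: {field!r}")
        rank, name, conf = match.group(1), match.group(2), float(match.group(3))
        if conf < cutoff:
            below = True  # once one rank fails, deeper ranks are blanked too
        result.append((rank, "" if below else name))
    return result


# Two made-up predictions: one moderately supported genus, one barely supported.
moderate = apply_cutoff("d:Bacteria(1.00),p:Firmicutes(0.98),g:Bacillus(0.75)", 0.8)
weak = apply_cutoff("d:Bacteria(1.00),p:Firmicutes(0.97),g:Bacillus(0.10)", 0.8)

print(moderate)
print(weak)
print("indistinguishable after blanking:", moderate == weak)
```

After blanking, both records carry an empty genus name, so a downstream user cannot tell whether the blank means "probably a novel genus" or "probably a known genus, but below the cutoff".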
The full RDP database is definitely not a good choice because
the taxonomies were predicted by the RDP Classifier, which has a high rate
of over-classification errors on full-length sequences (see
the SINTAX paper). If you want predictions using
Bergey's nomenclature, then I would recommend using the RDP training set.
With these considerations in mind, I believe it is best to use a database of
type strain sequences rather than a large database such as Greengenes or