FAQ: Why not use a large reference database for UTAX, like Greengenes or SILVA?
The RDP training set for 16S is small, only about 10k
sequences at the time of writing. So why not use a much larger database with
taxonomy annotations such as Greengenes (~1M sequences) or SILVA (~1.8M)?
The only way to get sequences with trusted taxonomy is
from named isolate strains. There are only a few thousand of these, hence
the small size of the RDP training set. Taxonomies for the vast majority of
sequences in the large 16S databases were generated by prediction algorithms
starting from a much smaller set of trusted sequences. I don't know what the
trusted set was / is for any of the databases; as far as I know, this is not
published or documented.
In the 16S world these databases do not attempt to
define new clades based on computational analysis of sequence information,
rather they try to extend existing classifications to new sequences, e.g. to
predict that a given sequence belongs to a novel class in a known phylum.
With ITS, things are a bit different because
UNITE introduces new groups based on computationally-defined "species
hypotheses". There are many details here I do not understand despite working
in this area for some time, but I am skeptical that the computationally
predicted 16S taxonomies are reliable, because existing algorithms have very
high error rates on novel taxa. It is simply not possible to reliably predict,
for example, whether a given sequence belongs to a known class or a novel
class.
Which algorithms were used to annotate these
databases? In the case of Greengenes, this was done by constructing a NAST
multiple alignment, predicting the phylogenetic tree using FastTree2, and
using a custom algorithm called tax2tree to propagate clade names to
internal nodes of the tree, followed by manual curation based on review
criteria which are not clear to me. This process is described in
Here, one of the authors' stated objectives was to be consistent with
earlier releases of Greengenes, so the "trusted" set by their definition
includes earlier predictions by earlier methods. NAST
introduces alignment errors by design, which does not inspire a lot of
confidence. Phylogenetic tree algorithms also do not reliably get the
branching order correct, especially with very large alignments and when the
input contains low-quality sequences, chimeras, etc.
I don't know how the SILVA taxonomies were predicted.
If you know, I'd like to learn; please let me know.
Taxonomies in the large databases are surely not
always correct, and we do not know the error rate, which could be
large. If we use, say, Greengenes as the reference, then errors in Greengenes
will propagate to UTAX predictions, because UTAX assumes that the database is
correct and will overestimate the P-value when the reference sequence is
incorrectly annotated.
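To make the propagation argument concrete, here is a toy calculation. It is a sketch under simplifying assumptions, not how UTAX computes its estimate: the function name and both parameters are hypothetical, a prediction copied from a wrongly annotated reference is counted as wrong, and classifier mistakes are assumed independent of annotation mistakes.

```python
# Toy model (an illustrative assumption, NOT the UTAX algorithm):
# a prediction is right only if the classifier picked the right reference
# AND that reference carries a correct taxonomy annotation.

def effective_confidence(reported_confidence: float,
                         reference_error_rate: float) -> float:
    """Return the chance a prediction is right once annotation errors count.

    reported_confidence: confidence computed as if the database were correct.
    reference_error_rate: assumed fraction of mis-annotated reference sequences.
    """
    for name, value in (("reported_confidence", reported_confidence),
                        ("reference_error_rate", reference_error_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return reported_confidence * (1.0 - reference_error_rate)


if __name__ == "__main__":
    # Hypothetical error rates; the real rates for the large databases are unknown.
    for error_rate in (0.0, 0.05, 0.2):
        adjusted = effective_confidence(0.9, error_rate)
        print(f"reference error rate {error_rate:.2f}: "
              f"reported 0.90, effective {adjusted:.3f}")
```

The point of the sketch is only that the gap between the reported and the effective value grows with the reference error rate, and that gap cannot be computed when the error rate is unknown.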
The only arguments I can think of for using taxonomies in the larger databases
are that (1) taxonomy prediction is so much more reliable with full-length
sequences that we can neglect the errors that are introduced, and (2)
prediction of taxonomies from shorter sequences is better when we have more
full-length sequences to compare with. This could conceivably be true,
though I doubt it. I don't know how to measure the error rates of the large
databases, so I don't know how to check assumptions 1 and 2. If we do
believe 1 and 2, then we should use the most accurate taxonomy prediction
algorithm to re-generate the predictions for the majority of sequences. The
best current algorithm is UTAX, so I think the burden of proof is to show
that the annotations in those large databases are at least as accurate as
UTAX predictions from named isolates.
Bottom line: since I don't understand the error rates in the large databases
and have reason to believe they are large, I recommend using a reference set
I trust (the RDP training set) with an algorithm (UTAX) that I trust to give
a good error estimate.