USEARCH manual

FAQ: Why not use a large ref db for UTAX like Greengenes or SILVA?

The RDP training set for 16S is small, only about 10k sequences at the time of writing. So why not use a much larger database with taxonomy annotations such as Greengenes (~1M sequences) or SILVA (~1.8M)?

The only way to get sequences with trusted taxonomy is from named isolate strains. There are only a few thousand of these, hence the small size of the RDP training set. Taxonomies for the vast majority of sequences in the large 16S databases were generated by prediction algorithms starting from a much smaller set of trusted sequences. I don't know what the trusted set was or is for any of the databases; as far as I know, this is not published or documented.

In the 16S world these databases do not attempt to define new clades based on computational analysis of sequence information; rather, they try to extend existing classifications to new sequences, e.g. to predict that a given sequence belongs to a novel class in a known phylum. With ITS, things are a bit different because UNITE introduces new groups based on computationally defined "species hypotheses". There are many details here I do not understand despite working in this area for some time, but I am skeptical that the computationally predicted 16S taxonomies are reliable because of the very high error rates of existing algorithms on novel taxa. It is simply not possible to reliably predict, for example, whether a given sequence belongs to a known class or a novel class.

Which algorithms were used to annotate these databases? In the case of Greengenes, this was done in three steps: constructing a NAST multiple alignment, predicting the phylogenetic tree using FastTree2, and using a custom algorithm called tax2tree to propagate clade names to internal nodes of the tree. This was followed by manual curation based on review criteria which are not clear to me. The process is described in McDonald 2012. One of the authors' stated objectives was to be consistent with earlier releases of Greengenes, so the "trusted" set by their definition includes earlier predictions by earlier methods. NAST introduces alignment errors by design, which does not inspire a lot of confidence, and phylogenetic tree algorithms do not reliably recover the correct branching order, especially with very large alignments and when there are low-quality sequences, chimeras, etc.

I don't know how the SILVA taxonomies were predicted. If you know, I'd like to learn -- please email me.

Surely, taxonomies in the large databases are not always correct, and we have no clue about the error rate, which could be large. If we use, say, Greengenes as the reference, then errors in Greengenes will propagate to UTAX predictions, because UTAX assumes that the database is correct and will therefore overestimate the P-value when the reference sequence is mis-annotated.
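
To make the propagation argument concrete, here is a toy sketch in Python. It is not how UTAX computes its estimate, and the rates used are made-up numbers, not measured error rates for any database. The assumptions are: the classifier matches a query to a reference of the right taxon with probability equal to its reported confidence, reference annotations are wrong independently with a fixed probability, and a mis-annotated reference always gives a wrong prediction. The function names are hypothetical.

```python
import random


def expected_accuracy(reported_conf, ref_error_rate):
    """True accuracy under the toy model: the match and the label must both be right."""
    return reported_conf * (1.0 - ref_error_rate)


def simulate_accuracy(n_queries, reported_conf, ref_error_rate, seed=0):
    """Monte Carlo check of expected_accuracy (toy model, hypothetical rates)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_queries):
        # The classifier picks a reference from the right taxon with
        # probability reported_conf (it assumes the database is correct).
        match_ok = rng.random() < reported_conf
        # The annotation on that reference is right with
        # probability 1 - ref_error_rate.
        label_ok = rng.random() >= ref_error_rate
        if match_ok and label_ok:
            correct += 1
    return correct / n_queries


if __name__ == "__main__":
    reported = 0.9  # hypothetical confidence reported by the classifier
    for err in (0.0, 0.05, 0.2):  # hypothetical reference error rates
        true_acc = simulate_accuracy(100_000, reported, err)
        print(f"ref error {err:.2f}: reported {reported:.2f}, "
              f"simulated accuracy {true_acc:.3f}, "
              f"expected {expected_accuracy(reported, err):.3f}")
```

Under these assumptions the reported confidence stays the same no matter how many reference annotations are wrong, while the true accuracy falls in proportion to the reference error rate. Since that rate is unknown for the large databases, the size of the overestimate is unknown too.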

The only argument I can think of for using the taxonomies in the larger databases rests on two assumptions: (1) taxonomy prediction is so much more reliable with full-length sequences that we can neglect the errors that are introduced, and (2) predictions of taxonomies from shorter sequences are better when we have more full-length sequences to compare with. This could conceivably be true, though I doubt it. I don't know how to measure the error rates of the large databases, so I don't know how to check assumptions 1 and 2. If we do believe 1 and 2, then we should use the most accurate taxonomy prediction algorithm to re-generate the predictions for the majority of sequences. The best current algorithm is UTAX, so I think the burden of proof is to show that the annotations in those large databases are at least as accurate as UTAX predictions from named isolates.

Bottom line: since I don't understand the error rates in the large databases and have reason to believe they are large, I recommend using a reference set I trust (the RDP training set) with an algorithm (UTAX) that I trust to give a good error estimate.