usearch manual
Low coverage of microbial taxonomy reference data

See also
 
  UTAX algorithm
  utax command
  Taxonomy benchmark home
  Taxonomy overclassification results
  Why not use a large reference database like Greengenes or SILVA?

A fundamental challenge for microbial taxonomy prediction algorithms is the sparse coverage of sequence databases with authoritative classifications. For example, the total number of extant prokaryotic species has been estimated to be of the order of ten million (Curtis et al. 2002) to a billion (Dykhuizen, 2011), but at the time of writing the RDP 16S rRNA training database (RDP14) has 10,678 unique sequences covering 2,799 different taxa, of which 1,133 (40%) have only one representative sequence. (Larger reference databases such as Greengenes have predicted taxonomies that cover a similar number of named taxa, so their classifications cannot be considered authoritative.)

Optimistically, a single training sequence might be sufficient for a classifier to recognize novel members of a given taxon, but a large majority of taxa are surely missing. Assuming an order of magnitude fewer genera than species, there are one million to one hundred million genera, while RDP14 contains only 2,126 genera. By these estimates, if a species is picked at random, the probability that its genus is present in RDP14 is then between 0.2% and 0.002%.
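
The arithmetic behind these percentages can be checked with a short Python sketch. It uses only the figures quoted above (2,126 genera in RDP14, ten million to a billion species, and the assumed ratio of ten species per genus). The function and constant names are illustrative and not part of usearch.

```python
# Figures quoted in the text; the names are illustrative only.
RDP14_GENERA = 2_126
SPECIES_LOW = 10_000_000        # lower species estimate (ten million)
SPECIES_HIGH = 1_000_000_000    # upper species estimate (one billion)
SPECIES_PER_GENUS = 10          # assumption from the text: ten times fewer genera


def genus_coverage(total_species: int, species_per_genus: int = SPECIES_PER_GENUS) -> float:
    """Fraction of the estimated number of genera that is present in RDP14."""
    total_genera = total_species / species_per_genus
    return RDP14_GENERA / total_genera


best_case = genus_coverage(SPECIES_LOW)     # one million genera
worst_case = genus_coverage(SPECIES_HIGH)   # one hundred million genera

print(f"best case:  {best_case:.3%}")
print(f"worst case: {worst_case:.4%}")
```

Treating this fraction as the probability for a randomly picked species implicitly assumes that species are spread evenly over genera, which is a simplification.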

To the best of my knowledge, the only published method for quantifying taxonomy prediction performance is the RDP Classifier "leave-one-out" strategy, in which one sequence (the query) is removed from the trusted set and classified using the remaining sequences as a reference [4]. Accuracy at each rank is defined as the fraction of queries for which the taxon at that rank is correctly identified. With RDP14 as the reference, the probability that the genus of a random query is still present after removing the query is 91%, far greater than the most optimistic estimate of 0.2% for a random species, and a mean of 4.2 training sequences remain for its genus. The probability that the family is still present is 99.5%, with a mean of 27 training sequences. Leave-one-out thus models a highly unrealistic scenario in which the reference database has several training examples at every rank for most query sequences.
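
The presence statistics quoted for leave-one-out can be computed from nothing more than the taxon label of each training sequence. The Python sketch below shows one way to do this. The function name and the toy labels are hypothetical and are not RDP14 data. It also assumes that the mean number of remaining sequences is averaged over all queries, including those whose taxon disappears, which the text does not specify.

```python
from collections import Counter


def leave_one_out_coverage(labels):
    """Return (fraction of queries whose taxon is still present after removal,
    mean number of remaining training sequences for the query's taxon).

    labels holds one taxon name per training sequence at a single rank.
    Assumption: the mean is taken over all queries.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    # Removing the query leaves (count - 1) sequences for its taxon.
    remaining = [counts[taxon] - 1 for taxon in labels]
    present = sum(1 for r in remaining if r > 0) / len(labels)
    mean_remaining = sum(remaining) / len(labels)
    return present, mean_remaining


# Hypothetical toy genus labels, for illustration only.
toy_genera = ["Bacillus", "Bacillus", "Bacillus",
              "Escherichia", "Escherichia", "Vibrio"]
present, mean_remaining = leave_one_out_coverage(toy_genera)
print(f"genus still present: {present:.1%}, mean remaining: {mean_remaining:.2f}")
```

In this toy set only the single Vibrio query loses its genus, so five of six queries still find their genus in the reference. This illustrates why leave-one-out presence is high whenever most sequences belong to taxa with more than one representative.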