USEARCH manual

Defining "accuracy" of taxonomy classifiers

I measure sensitivity and error rate by dividing a gold standard database into several query and reference subsets by splitting nodes at different taxonomic levels, as described in Validating Taxonomy Classifiers.

Reference data is sparse
In practice, microbial reference databases with known taxonomy cover only a small fraction of extant species, so incomplete reference data is a severe problem for taxonomy classifiers. A realistic validation of classifier performance therefore needs a balanced mix of "possible" and "impossible" cases at each level.

Leave-one-out validation is unrealistic
The RDP "leave-one-out" validation approach is not realistic because a large majority of taxa are "possible": the training data will usually contain several examples of the query sequence's genus, which will often not be the case with real data.

Possible and Impossible taxa
For a given query-reference pair, some taxa will be present in both the query and reference ("possible" taxa), some will be present only in the query set ("impossible" taxa), and some taxa will be present only in the reference set ("decoy" taxa -- if these appear in a prediction they are always false positives).
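The three categories above are plain set operations on the taxon names in the query and reference subsets. The sketch below is only an illustration: the function name partition_taxa and the example genus lists are hypothetical and are not part of USEARCH.

```python
def partition_taxa(query_taxa, reference_taxa):
    """Split taxon names into possible, impossible and decoy sets.

    Hypothetical helper for illustration; not a USEARCH function.
    """
    query = set(query_taxa)
    reference = set(reference_taxa)
    return {
        "possible": query & reference,    # present in both subsets
        "impossible": query - reference,  # present only in the query subset
        "decoy": reference - query,       # present only in the reference subset
    }


# Made-up example names, chosen only to show one taxon of each kind.
query_genera = ["Escherichia", "Bacillus", "Vibrio"]
reference_genera = ["Escherichia", "Bacillus", "Streptomyces"]
parts = partition_taxa(query_genera, reference_genera)
```

Here Vibrio is impossible because no reference sequence carries that name, and Streptomyces is a decoy because it can only ever appear in a prediction as a false positive.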

Defining sensitivity and error rates
I define sensitivity to be the fraction of "possible" taxa that are correctly predicted by the classifier at a given value of a confidence score, averaged over all query-reference pairs. I define the error rate to be the fraction of predictions that are incorrect at that score (false positives and false negatives), again averaged over all pairs. For classifiers that do not report a confidence score, all predictions are included. See taxonomy classification errors.
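As a rough illustration of these definitions, the sketch below computes both measures for a single query-reference pair. It rests on assumptions that are not stated on this page: each query is a tuple (true_taxon, predicted_taxon, score, is_possible), a missing call on a possible taxon counts as a false negative, any call on an impossible taxon counts as a false positive, and the error rate is taken over all queries in the pair. The function name and the example records are hypothetical, and the averaging over pairs is left out.

```python
def sensitivity_and_error_rate(records, cutoff=0.0):
    """Sensitivity and error rate for one query-reference pair.

    records: iterable of (true_taxon, predicted_taxon, score, is_possible).
    A score of None means the classifier reports no confidence, so the
    prediction is always included. Illustrative sketch only.
    """
    total = possible = correct = errors = 0
    for true_taxon, predicted, score, is_possible in records:
        total += 1
        made_call = predicted is not None and (score is None or score >= cutoff)
        if is_possible:
            possible += 1
            if made_call and predicted == true_taxon:
                correct += 1
            else:
                errors += 1  # wrong name, or no call (false negative)
        elif made_call:
            errors += 1      # any call on an impossible taxon is a false positive
    sensitivity = correct / possible if possible else 0.0
    error_rate = errors / total if total else 0.0
    return sensitivity, error_rate


# Made-up records: three possible taxa (A, B, D) and two impossible ones (E, F).
records = [
    ("A", "A", 0.9, True),    # correct prediction
    ("B", "C", 0.8, True),    # wrong name
    ("D", None, None, True),  # no prediction
    ("E", "A", 0.6, False),   # call on an impossible taxon
    ("F", None, None, False), # correctly left unclassified
]
sens_low, err_low = sensitivity_and_error_rate(records, cutoff=0.5)
sens_high, err_high = sensitivity_and_error_rate(records, cutoff=0.85)
```

Raising the cutoff from 0.5 to 0.85 removes the false positive on E, so under these assumptions the error rate falls while sensitivity is unchanged in this small example.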

Down-weighting over-represented taxa
Averages are weighted so that each query-reference pair has an equal number of possible and impossible taxa, and so that each taxon name has the same weight. This corrects for highly over-represented taxa such as the genus Streptomyces, which accounts for 513 (5%) of the RDP training sequences.
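One way to realize this weighting is sketched below. It is an assumption about the scheme, not the exact procedure: each sequence gets a weight such that every taxon name has the same total weight within its class, and the possible and impossible classes each receive half of the total weight. The function name sequence_weights and the example data are hypothetical.

```python
from collections import Counter


def sequence_weights(queries):
    """Per-sequence weights for one query-reference pair.

    queries: list of (taxon_name, is_possible), one entry per query sequence.
    Each name is assumed to be either possible or impossible, never both.
    Illustrative sketch only.
    """
    # Number of query sequences carrying each taxon name.
    seqs_per_name = Counter(name for name, _ in queries)
    # Number of distinct names in the possible and impossible classes.
    names_per_class = Counter(is_possible for _, is_possible in set(queries))
    weights = []
    for name, is_possible in queries:
        # Half of the total weight goes to each class, split evenly over the
        # names in that class and then over the sequences sharing a name.
        weights.append(0.5 / (names_per_class[is_possible] * seqs_per_name[name]))
    return weights


# Made-up data: one over-represented possible genus, one other possible
# genus, and one impossible genus.
queries = [("Streptomyces", True)] * 4 + [("Bacillus", True), ("Vibrio", False)]
weights = sequence_weights(queries)
```

In this example the four Streptomyces sequences together carry the same weight as the single Bacillus sequence, and the one impossible genus carries as much weight as the two possible genera combined.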