USEARCH manual

Defining "accuracy" of taxonomy classifiers

I measure sensitivity and error rate by dividing a gold standard database into several query and reference subsets by splitting nodes at different taxonomic levels, as described in Validating Taxonomy Classifiers.

Reference data is sparse
In practice, only a small fraction of extant species have been studied and named by taxonomists. Incomplete reference data is therefore a severe problem for taxonomy classifiers. To get a realistic validation of classifier performance, it is important to have a balanced mix of "predictable" (i.e., named) and "unpredictable" (not named) cases at each rank.

Predictable and unpredictable taxa
For a given query dataset and reference database, some taxa will be present in both the query and reference sets ("predictable" taxa), some will be present only in the query set ("unpredictable" taxa), and some will be present only in the reference set ("decoy" taxa; if these appear in a prediction they are always false positives).
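The three categories above are simple set relations between the taxon names in the query set and those in the reference set. The following is a minimal sketch of that partition. The function name `partition_taxa` and the example genus lists are hypothetical and are not part of USEARCH.

```python
def partition_taxa(query_taxa, reference_taxa):
    """Split taxon names into predictable, unpredictable and decoy sets.

    Hypothetical helper for illustration only; not a USEARCH function.
    """
    query = set(query_taxa)
    reference = set(reference_taxa)
    return {
        # Present in both query and reference: a correct prediction is possible.
        "predictable": query & reference,
        # Present only in the query: no correct name exists in the reference.
        "unpredictable": query - reference,
        # Present only in the reference: predicting one is always a false positive.
        "decoy": reference - query,
    }


# Made-up example data.
groups = partition_taxa(
    ["Bacillus", "Listeria", "Vibrio"],
    ["Bacillus", "Listeria", "Streptomyces"],
)
```

Here Vibrio is unpredictable because it has no reference sequences, and Streptomyces is a decoy because no query belongs to it.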

Leave-one-out validation is unrealistic
The RDP leave-one-out validation approach is not realistic because a large majority of taxa can be predicted: there will usually be several examples of the query genus in the training data even after the query sequence has been removed, which will often not be the case with real data.

Defining sensitivity and error rates
I define sensitivity to be the fraction of "predictable" taxa that are correctly predicted by the classifier at a given value of a confidence score, averaged over all query-reference pairs. I define the error rate to be the fraction of predictions that are incorrect at that score (false positives and false negatives), again averaged over all pairs. For classifiers that do not report a confidence score, all predictions are included. See taxonomy classification errors.
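As a rough illustration of these definitions, the sketch below computes both measures for one query-reference pair at a confidence cutoff and then averages over pairs. It rests on several assumptions that are not stated in the text: results are per query sequence rather than per taxon, each result is a hypothetical `(true_name, predicted_name, score)` tuple, a prediction counts as made when its score is at least the cutoff, and only wrong names among the predictions made are counted as errors. The full classification of error types is given on the taxonomy classification errors page and is not reproduced here.

```python
def sensitivity_and_error_rate(results, reference_taxa, cutoff=0.0):
    """Return (sensitivity, error_rate) for one query-reference pair.

    results: list of (true_name, predicted_name, score) tuples (assumed layout).
    For classifiers without a confidence score, give every result the same
    score and leave cutoff at 0.0 so that all predictions are included.
    """
    reference = set(reference_taxa)
    # Queries whose true name exists in the reference are "predictable".
    predictable = [r for r in results if r[0] in reference]
    # Predictions that pass the confidence cutoff.
    made = [r for r in results if r[1] is not None and r[2] >= cutoff]

    correct = sum(1 for t, p, s in predictable if p == t and s >= cutoff)
    wrong = sum(1 for t, p, s in made if p != t)

    sensitivity = correct / len(predictable) if predictable else 0.0
    error_rate = wrong / len(made) if made else 0.0
    return sensitivity, error_rate


def average_over_pairs(pairs, cutoff=0.0):
    """Unweighted mean of both measures over (results, reference_taxa) pairs."""
    values = [sensitivity_and_error_rate(res, ref, cutoff) for res, ref in pairs]
    n = len(values)
    return (sum(v[0] for v in values) / n, sum(v[1] for v in values) / n)


# Made-up example: Vibrio is absent from the reference, so it is unpredictable.
example_results = [
    ("Bacillus", "Bacillus", 0.9),
    ("Listeria", "Bacillus", 0.8),
    ("Vibrio", "Listeria", 0.6),
    ("Listeria", "Listeria", 0.4),
]
example_reference = ["Bacillus", "Listeria", "Streptomyces"]
sens, err = sensitivity_and_error_rate(example_results, example_reference, cutoff=0.5)
```

Raising the cutoff typically lowers both numbers, since fewer predictions are made, so the two measures are best compared at the same score.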

Down-weighting over-represented taxa
Averages are weighted so that predictable and unpredictable taxa contribute equally in each query-reference pair, and so that each taxon name has the same weight. This corrects for highly over-represented taxa such as the genus Streptomyces, which accounts for 513 (5%) of the RDP training sequences.
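One way such a weighting could be realized is sketched below. This is an assumption about the scheme, not the exact procedure used for the published results: each query sequence gets a weight such that all sequences of one taxon name together carry the same weight as any other taxon in the same group, and the predictable and unpredictable groups each carry half of the total. The function name `sequence_weights` and the example data are hypothetical.

```python
from collections import Counter


def sequence_weights(query_taxa, reference_taxa):
    """Return one weight per query sequence (assumed weighting scheme).

    query_taxa: the taxon name of each query sequence, one entry per sequence.
    """
    reference = set(reference_taxa)
    counts = Counter(query_taxa)  # sequences per taxon name
    n_predictable = sum(1 for name in counts if name in reference)
    n_unpredictable = len(counts) - n_predictable

    weights = []
    for name in query_taxa:
        group_size = n_predictable if name in reference else n_unpredictable
        # Half of the total weight per group, shared equally among the taxa in
        # the group, then shared equally among the sequences of each taxon.
        weights.append(0.5 / (group_size * counts[name]))
    return weights


# Made-up example: Streptomyces is over-represented with three sequences.
example_query = ["Streptomyces", "Streptomyces", "Streptomyces", "Bacillus", "Vibrio"]
example_weights = sequence_weights(example_query, ["Streptomyces", "Bacillus"])
```

With these weights the three Streptomyces sequences together count the same as the single Bacillus sequence, and the one unpredictable taxon (Vibrio) counts the same as both predictable taxa combined.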