Defining "accuracy" of taxonomy classifiers
I measure sensitivity and error rate by dividing a gold standard database
into several query and reference subsets, splitting nodes at different
taxonomic levels, as described in Validating Taxonomy Classifiers.
Reference data is sparse
In practice, only a small fraction of extant species have been studied and
named by taxonomists, so incomplete reference data is a severe problem for
taxonomy classifiers. A realistic validation of classifier performance
therefore needs a balanced mix of "predictable" (i.e., named) and
"unpredictable" (not named) cases at each rank.
Predictable and unpredictable taxa
In practice, for a given pair of query dataset and reference database, some
taxa will be present in both the query and the reference ("predictable" taxa),
some will be present only in the query set ("unpredictable" taxa), and some
will be present only in the reference set ("decoy" taxa; if these appear in a
prediction they are always false positives).
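The three categories follow directly from set operations on the taxon names in the two datasets. The sketch below is illustrative only: the function name `partition_taxa` and the example genus names are assumptions made for this example, not part of the benchmark code.

```python
def partition_taxa(query_taxa, reference_taxa):
    """Split taxon names into predictable, unpredictable, and decoy sets."""
    query = set(query_taxa)
    reference = set(reference_taxa)
    return {
        "predictable": query & reference,    # in both query and reference
        "unpredictable": query - reference,  # in the query set only
        "decoy": reference - query,          # in the reference set only
    }


# Hypothetical genus names, chosen only to show the three outcomes.
parts = partition_taxa(
    ["Bacillus", "Escherichia", "Vibrio"],
    ["Bacillus", "Escherichia", "Streptomyces"],
)
```

Here `Vibrio` is unpredictable because it never appears in the reference, and `Streptomyces` is a decoy because no query belongs to it.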
Leave-one-out validation is unrealistic
The RDP leave-one-out validation approach is not realistic because a large
majority of taxa can be predicted: there will usually be several examples of
the query genus in the training data even after the query sequence has been
removed, which will often not be the case with real data.
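To make this concrete, the following sketch counts how many sequences would still find their own genus in the training data after being left out. It is a toy illustration, not the RDP procedure itself; the function name `loo_predictable_fraction` and the genus labels are assumptions.

```python
from collections import Counter


def loo_predictable_fraction(genera):
    """Fraction of sequences whose genus survives leave-one-out removal.

    `genera` holds one genus label per training sequence. A sequence stays
    "predictable" if at least one other sequence shares its genus.
    """
    counts = Counter(genera)
    still_present = sum(1 for genus in genera if counts[genus] > 1)
    return still_present / len(genera)


# Hypothetical training labels: only the single Vibrio sequence loses its genus.
labels = ["Streptomyces"] * 3 + ["Bacillus", "Bacillus", "Vibrio"]
fraction = loo_predictable_fraction(labels)
```

Only genera represented by a single sequence become unpredictable under leave-one-out, so the test contains very few of the unpredictable cases that dominate real data.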
Defining sensitivity and error rates
I define sensitivity to be the fraction of "predictable" taxa that are
correctly predicted by the classifier at a given value of a confidence score,
averaged over all query-reference pairs. I define the error rate to be the
fraction of predictions that are incorrect at that score (false positives and
false negatives), again averaged over all pairs. For classifiers that do not
report a confidence score, all predictions are included. See
taxonomy classification errors.
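A minimal, unweighted sketch of these two measures at a single rank is shown below. It reflects one reading of the definitions, with several assumptions: each query is a hypothetical `(true_taxon, predicted_taxon, confidence)` tuple, a prediction below the cutoff counts as no prediction, and the error rate is divided by the number of queries so that false negatives have a place in the denominator. The function names are also illustrative.

```python
def score_pair(queries, reference_taxa, cutoff=0.0):
    """Return (sensitivity, error_rate) for one query-reference pair.

    queries: list of (true_taxon, predicted_taxon_or_None, confidence).
    """
    reference = set(reference_taxa)
    predictable = correct = errors = 0
    for true_taxon, predicted, confidence in queries:
        # Assumption: predictions below the cutoff are treated as "no prediction".
        if predicted is not None and confidence < cutoff:
            predicted = None
        if true_taxon in reference:
            predictable += 1
            if predicted == true_taxon:
                correct += 1
            else:
                errors += 1  # wrong name (false positive) or none (false negative)
        elif predicted is not None:
            errors += 1  # any name given to an unpredictable taxon is a false positive
    sensitivity = correct / predictable if predictable else 0.0
    # Assumption: errors are expressed as a fraction of all queries.
    error_rate = errors / len(queries) if queries else 0.0
    return sensitivity, error_rate


def average_over_pairs(pair_scores):
    """Average (sensitivity, error_rate) tuples over all query-reference pairs."""
    n = len(pair_scores)
    return (sum(s for s, _ in pair_scores) / n,
            sum(e for _, e in pair_scores) / n)


# Hypothetical queries for one pair.
example_queries = [
    ("Bacillus", "Bacillus", 0.9),   # predictable, correct
    ("Escherichia", "Vibrio", 0.8),  # predictable, wrong name
    ("Vibrio", "Bacillus", 0.4),     # unpredictable, low-confidence prediction
    ("Vibrio", None, 0.0),           # unpredictable, no prediction
]
example_reference = ["Bacillus", "Escherichia", "Streptomyces"]
strict = score_pair(example_queries, example_reference, cutoff=0.5)
lenient = score_pair(example_queries, example_reference, cutoff=0.0)
```

Raising the cutoff from 0.0 to 0.5 suppresses the low-confidence prediction for the unpredictable `Vibrio` query, which lowers the error rate without changing sensitivity in this example.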
Down-weighting over-represented taxa
Averages are weighted so that predictable and unpredictable taxa count
equally in each query-reference pair, and so that each taxon name has the same
weight. This corrects for highly over-represented taxa such as the genus
Streptomyces, which is found in 513 (5%) of the RDP training sequences.
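One way such a weighting could be implemented is sketched below. This is an assumption about the mechanics, not the benchmark's actual code: the predictable and unpredictable classes each receive half of the total weight, each taxon name within a class receives an equal share, and sequences of the same taxon split that share. The function name `query_weights` is hypothetical.

```python
from collections import Counter


def query_weights(query_taxa, reference_taxa):
    """Return one weight per query sequence.

    query_taxa: taxon name of each query sequence (repeats allowed).
    """
    reference = set(reference_taxa)
    seqs_per_taxon = Counter(query_taxa)
    n_predictable = sum(1 for taxon in seqs_per_taxon if taxon in reference)
    n_unpredictable = len(seqs_per_taxon) - n_predictable
    weights = []
    for taxon in query_taxa:
        n_in_class = n_predictable if taxon in reference else n_unpredictable
        # Half the total weight per class, split evenly over taxon names,
        # then evenly over the sequences carrying that name.
        weights.append(0.5 / (n_in_class * seqs_per_taxon[taxon]))
    return weights


# Hypothetical query set with one over-represented genus.
taxa = ["Streptomyces"] * 3 + ["Bacillus", "Vibrio"]
weights = query_weights(taxa, ["Streptomyces", "Bacillus"])
```

In this example the three Streptomyces sequences together carry the same weight as the single Bacillus sequence, so the over-represented genus no longer dominates the average. If one class is empty, the weights sum to 0.5 rather than 1 and would need renormalizing.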