Taxonomy confidence measures
The UTAX algorithm generates
taxonomy predictions with confidence estimates, expressed as values in the
range 0.0 to 1.0.
The definition and interpretation of a taxonomy
prediction confidence estimate is not as simple as it might appear. Ideally, the
error rate of predictions with confidence 0.9 should be approximately 10%, but
in practice the error rate depends on the query dataset and on unknown
characteristics of the reference dataset. It would be nice to calculate a p-value,
but this is tricky because we need two statistical models: one for the hypothesis
we are testing plus a null model in which the hypothesis is false and the
observation occurs by chance. I don't have a clue how to do this for taxonomy predictions.
Most taxonomy prediction algorithms don't provide a
confidence estimate, including
GAST, the default QIIME method (assign_taxonomy.py
-m uclust) and the mothur
Classify_seqs command with method=knn. A notable exception is the
RDP Naive Bayesian
Classifier (RDP), which reports a confidence value obtained by bootstrapping.
This was an important improvement over previous methods and is a good reason why
RDP is currently the most widely-used algorithm for 16S taxonomy prediction.
However, everyone agrees that the RDP bootstrap value should not be interpreted
as indicating the probability that the prediction is correct (which would be
100% minus the estimated error probability). The authors claim that for 16S
sequences shorter than 250nt, a bootstrap threshold of 50% gives accurate
results to genus level, claiming accuracies from 79% to 100% depending on the V
region (see discussion and table under "Confidence threshold" at
https://rdp.cme.msu.edu/classifier/class_help.jsp). If this result is
valid, the error rate at 50% bootstrap is presumably much less than 50%.
However, I believe their "leave-one-out" validation seriously underestimates error rates on real data (for discussion, see
validating taxonomy classifiers). In my tests, I find a 33% error rate for
genus predictions by RDP on the V3-V5 segment (~530nt) at 50% bootstrap cutoff.
At 100% bootstrap confidence, the error rate is 8%.
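The bootstrap idea itself is easy to sketch. The following Python toy is not RDP's code: classify_words() is a hypothetical stand-in for the underlying classifier, and the number of trials and the subsample fraction are assumed parameters. The confidence is the fraction of resampled trials that agree with the full-query prediction:

```python
import random

# Toy bootstrap confidence in the spirit of the RDP classifier.
# classify_words() is a hypothetical stand-in for the underlying
# classifier; trials=100 and frac=1/8 are assumed parameters.
def bootstrap_confidence(words, classify_words, trials=100, frac=0.125, seed=1):
    rng = random.Random(seed)           # fixed seed for reproducibility
    full_prediction = classify_words(words)
    k = max(1, int(len(words) * frac))  # words drawn per trial
    agree = sum(
        classify_words([rng.choice(words) for _ in range(k)]) == full_prediction
        for _ in range(trials))
    return full_prediction, agree / trials

# Degenerate demo classifier: predict the most common word.
majority = lambda ws: max(set(ws), key=ws.count)
label, conf = bootstrap_confidence(["AAA"] * 40, majority)
print(label, conf)  # AAA 1.0
```

Note that the bootstrap fraction measures the stability of the prediction under resampling, not the probability that it is correct, which is exactly why the value should not be read as an accuracy estimate.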
So, how to define a confidence value that is a
reasonable prediction of error rates on real data? The UTAX algorithm does this
by a training procedure which splits a reference
dataset at different taxonomic levels into query and reference subsets to create
a balanced combination of scenarios where taxa at each level (genus, family...) are
present / not present. This is designed to measure the effects of novel taxa
which are missing from the database, which is a
serious problem with
microbial marker gene reference databases. The confidence value is defined
to be the probability that a prediction with a given score is correct,
summing over all predictions from all splits.
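As an illustration of that definition (this is a sketch, not the UTAX implementation; binning the raw scores is my assumption), the training output can be reduced to (score, correct) pairs pooled over all splits, with confidence read off as the empirical fraction correct at each score:

```python
# Illustration of confidence = P(prediction correct | score), estimated
# by pooling (raw_score, was_correct) pairs over all training splits.
# Binning raw scores is an assumption made for this sketch.
def confidence_by_score(pooled, bin_width=5.0):
    totals, correct = {}, {}
    for score, ok in pooled:
        b = int(score // bin_width)
        totals[b] = totals.get(b, 0) + 1
        correct[b] = correct.get(b, 0) + (1 if ok else 0)
    return {b * bin_width: correct[b] / totals[b] for b in totals}

# Invented pooled results from several splits.
pooled = [(10.0, True), (11.0, True), (12.0, False), (42.0, True)]
print(confidence_by_score(pooled))
```

Because the splits deliberately include cases where the correct taxon is absent from the reference subset, the resulting confidences account for novel taxa rather than assuming the database is complete.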
In practice, predictions with confidence values are
typically used to set a threshold. This adds another complication -- we're not
just concerned with the error rate exactly at the threshold, we would like to
know the total rate for all predictions at or above the chosen value. Even
in the ideal situation where 0.9 means 10% error, setting a threshold of 0.9
allows confidence values of 0.9, 0.95, 0.99 etc. If most of the predictions are
around 0.90 to 0.92 then the error rate will be worse than if most of them are
0.99 with only a few in the low 0.9s. So, even if confidence=0.9 does give an
error rate of ~10% on a given query set, then confidence≥0.9
should give an error rate less than 10%, perhaps much less, so one minus the
threshold can be interpreted as a pessimistic upper bound on the cumulative
error rate of the predictions that are kept.
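This threshold effect can be checked numerically. In the Python toy below (invented data, perfectly calibrated by construction), the error rate over all predictions with confidence ≥ 0.9 comes out well below the 10% implied by the threshold itself:

```python
# Cumulative error rate of the predictions kept at a confidence threshold.
def cumulative_error(preds, threshold):
    kept = [ok for conf, ok in preds if conf >= threshold]
    return sum(1 for ok in kept if not ok) / len(kept)

# Invented, perfectly calibrated data: 10% errors at 0.9, 1% at 0.99.
preds = ([(0.9, True)] * 9 + [(0.9, False)]
         + [(0.99, True)] * 99 + [(0.99, False)])
print(cumulative_error(preds, 0.9))  # 2/110 ~= 0.018, well below 0.1
```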
To provide guidance on setting a threshold, UTAX reports the sensitivity and error rates obtained
during training at a
series of cutoffs 0.95, 0.9 etc. To get this report, use the -report option of
makeudb_utax (example below).
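A hedged sketch of the command line (file names are placeholders; apart from -makeudb_utax and -report, which are named above, the -output flag is assumed to follow the usual usearch convention):

```shell
usearch -makeudb_utax refs.fa -output refs.udb -report utax_report.txt
```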