UTAX is an algorithm for taxonomy assignment which is
implemented in the utax command. The
cluster_otus_utax command generates
OTUs based on taxa predicted by UTAX.
The advantages of UTAX over previous classifiers such as the RDP Naive Bayesian
Classifier (RDP) are very high speed, informative confidence values and flexible
options for training on user-supplied data.
The algorithm has not yet been published. See
Validating Taxonomy Classifiers for the
method I used to validate its accuracy against other algorithms.
At a high level, UTAX is a k-mer-based method that looks
for words in common between the query sequence and reference sequences with
known taxonomy. A score calculated from word counts is used to estimate a
confidence value for each taxonomic level. The confidence values are trained
to give a realistic estimate of error rates, in contrast to the bootstrap
values reported by RDP, which are poor predictors of error rates in practice.
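
To make the idea of "words in common" concrete, here is a minimal Python sketch of word counting between a query and a set of reference sequences. It is not the UTAX algorithm, which is unpublished: the function names (kmers, shared_word_fraction, best_reference), the word length k and the score (fraction of the query's distinct words found in the reference) are all assumptions chosen for illustration, and the step that converts a score into a trained confidence value is not shown.

```python
def kmers(seq, k=8):
    """Return the set of distinct words of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_word_fraction(query, reference, k=8):
    """Fraction of the query's distinct words that also occur in reference.

    Illustrative score only; the score used by UTAX is not published.
    """
    query_words = kmers(query, k)
    if not query_words:
        # Query shorter than k: no words to compare.
        return 0.0
    return len(query_words & kmers(reference, k)) / len(query_words)


def best_reference(query, references, k=8):
    """Return (label, score) for the reference sharing the most words.

    references maps a taxonomy label (hypothetical here) to a sequence.
    """
    scored = ((label, shared_word_fraction(query, seq, k))
              for label, seq in references.items())
    return max(scored, key=lambda pair: pair[1])
```

For example, best_reference(query, {"genus_A": seq_a, "genus_B": seq_b}, k=8) returns the label whose reference sequence shares the largest fraction of the query's words, together with that fraction. A real classifier would use an index over the reference words rather than comparing the query with every reference in turn.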