Naive Bayesian Classifier algorithm

The RDP Naive Bayesian Classifier (NBC) algorithm is described in Wang et al. 2007. To the best of my knowledge, it was the first published method for automated rRNA taxonomy prediction. Many other algorithms have been published since then, including my own SINTAX, but none has achieved a clear improvement in accuracy over NBC.

Having worked intensively on this problem over several years, I believe that NBC and SINTAX work equally well for all practical purposes. I doubt it is possible for any algorithm to achieve meaningfully better accuracy because taxonomy correlates only approximately with the sequence of any single gene.

The NBC algorithm is unnecessarily complicated. The Bayesian formulation gives a misleading impression that there is informative genus-specific conservation of some 8-mers, which I do not believe is the case. One way to show this is to note that roughly half of the genera in the RDP training sets are singletons, and obviously you can't learn which 8-mers are conserved if you only have one training sequence for the genus.
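The singleton fraction is easy to check for any training set. The following is a minimal sketch that assumes the genus label of each training sequence has already been parsed into a list; the function name and the toy labels are hypothetical and are not taken from any RDP release.

```python
from collections import Counter

def singleton_fraction(genus_labels):
    """Fraction of genera represented by exactly one training sequence."""
    counts = Counter(genus_labels)
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)

# Toy labels, made up for illustration only.
labels = ["Escherichia", "Escherichia", "Bacillus",
          "Vibrio", "Vibrio", "Listeria"]
print(singleton_fraction(labels))
```

In this toy list, two of the four genera have a single sequence, so no conservation pattern within those genera can be observed from the training data.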

If the posterior probabilities were predictive, then they could be used directly, but in practice it is necessary to abandon the Bayesian approach and use bootstrapping instead to obtain a measure of confidence. I view the Bayesian formulation as an indirect way of obtaining the top 8-mer hit. This is demonstrated in practice by SINTAX, which achieves equally good results by simple 8-mer counting. While it does not achieve a systematic improvement in accuracy, I prefer SINTAX because it is faster, uses less memory, and is conceptually simpler.
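The idea of 8-mer counting combined with bootstrapping can be sketched in a few lines. This is an illustrative sketch only, not the SINTAX or NBC implementation: the function names, the 100 iterations, the subsample of 32 words, and the tie-breaking rule (first reference seen wins) are all assumptions made for the example, and a real implementation would use a k-mer index rather than a linear scan.

```python
import random
from collections import Counter

K = 8  # word length

def kmers(seq, k=K):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def top_hit(words, ref_kmers):
    """Name of the reference sharing the most of the given words."""
    best_name, best_count = None, -1
    for name, ref_words in ref_kmers.items():
        count = sum(1 for w in words if w in ref_words)
        if count > best_count:  # ties: first reference seen wins (assumption)
            best_name, best_count = name, count
    return best_name

def classify(query, refs, taxonomy, iterations=100, subsample=32, seed=0):
    """Return (genus, bootstrap confidence) for a query sequence.

    refs maps reference name -> sequence; taxonomy maps name -> genus.
    """
    rng = random.Random(seed)
    ref_kmers = {name: kmers(seq) for name, seq in refs.items()}
    query_words = sorted(kmers(query))
    if not query_words:
        return None, 0.0
    votes = Counter()
    for _ in range(iterations):
        # Resample the query's words with replacement, then take the top hit.
        sample = rng.choices(query_words, k=subsample)
        votes[taxonomy[top_hit(sample, ref_kmers)]] += 1
    genus, n = votes.most_common(1)[0]
    return genus, n / iterations
```

The confidence is simply the fraction of bootstrap iterations in which the top hit carries the winning genus name; no posterior probability is involved.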

Different trade-offs between sensitivity and error rate are achieved by tuning the bootstrap cutoff, for both NBC and SINTAX. Since false negatives and false positives can both be important, there is no single "best" bootstrap threshold; users should understand this issue when choosing a threshold. The default value, or the value recommended by the authors, is not necessarily optimal for any given study; this depends on which biological questions are being addressed.
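The trade-off can be made concrete with a small sketch. It assumes a benchmark in which every query has a known correct name, and it uses two simple definitions chosen for this example: sensitivity is the fraction of all queries that receive a correct prediction at or above the cutoff, and error rate is the fraction of reported predictions that are wrong. The function name and the toy predictions are hypothetical; the numbers are made up and are not results from any benchmark.

```python
def tradeoff(predictions, cutoff):
    """Return (sensitivity, error_rate) at a given bootstrap cutoff.

    predictions is a list of (bootstrap_confidence, is_correct) pairs,
    one per query with a known correct name.
    """
    reported = [(conf, ok) for conf, ok in predictions if conf >= cutoff]
    true_pos = sum(1 for _, ok in reported if ok)
    false_pos = len(reported) - true_pos
    sensitivity = true_pos / len(predictions)
    error_rate = false_pos / len(reported) if reported else 0.0
    return sensitivity, error_rate

# Toy predictions, made up for illustration only.
preds = [(0.99, True), (0.95, True), (0.85, True), (0.80, False),
         (0.60, True), (0.55, False), (0.40, False), (0.30, True)]

for cutoff in (0.5, 0.8, 0.9):
    print(cutoff, tradeoff(preds, cutoff))
```

Raising the cutoff lowers the error rate but also discards correct predictions, so the right value depends on whether missed names or wrong names are more costly for the study.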

I believe that the leave-one-out validation strategy used by Wang et al. (and also subsequent papers from the RDP group) reports misleadingly high accuracy. Cross-validation by identity is more realistic.
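The difference between the two strategies can be illustrated with a simplified sketch. With leave-one-out, the training set for a test sequence is everything else, so a near-identical sequence is often still present. One simplified reading of cross-validation by identity is to also remove training sequences that are too similar to the test set. The code below assumes pre-aligned sequences of equal length and a naive identity measure; the names split_by_identity and max_identity are hypothetical, and this is not the exact procedure used in any particular paper.

```python
def identity(a, b):
    """Fraction of matching positions; assumes pre-aligned, equal-length sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def split_by_identity(seqs, test_names, max_identity):
    """Split seqs into (train, test) so that no training sequence has
    identity above max_identity to any test sequence."""
    test = {name: seqs[name] for name in test_names}
    train = {
        name: seq
        for name, seq in seqs.items()
        if name not in test
        and all(identity(seq, t) <= max_identity for t in test.values())
    }
    return train, test

# Toy sequences, made up for illustration only.
seqs = {
    "a": "AAAAAAAAAA",
    "b": "AAAAAAAAAC",  # 90% identical to "a"
    "c": "AAAAACCCCC",  # 50% identical to "a"
}
train, test = split_by_identity(seqs, ["a"], max_identity=0.8)
print(sorted(train), sorted(test))
```

In the toy data, leave-one-out would keep "b" in the training set when testing "a", which makes the prediction easy; the identity-based split removes it, so the classifier is tested on a query whose closest reference is more distant.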

The NBC algorithm has been successfully re-implemented in mothur (the classify.seqs command with method=wang) and the rRDP package in Bioconductor, as well as in USEARCH, showing that all of the required details are clearly specified in the paper. However, I do not understand how the equation for genus-specific conditional probability was obtained; if anyone can explain this to me I would be grateful (an email to the authors on this point was not answered).
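For readers who want to see the estimate in question, here is a minimal sketch of the two formulas as I read them in Wang et al. (2007): a word-specific prior P_i = (n(w_i) + 0.5) / (N + 1) over all N training sequences, and a genus-specific conditional probability P(w_i | G) = (m(w_i) + P_i) / (M + 1) for a genus with M sequences. The function and argument names are hypothetical, and the formulas should be checked against the paper before relying on them.

```python
def word_prior(n_w, N):
    """Prior for a word: n_w of the N training sequences contain it."""
    return (n_w + 0.5) / (N + 1)

def genus_conditional(m_w, M, prior):
    """Conditional probability of a word given a genus:
    m_w of the genus's M sequences contain the word."""
    return (m_w + prior) / (M + 1)

# Toy counts, made up for illustration only.
prior = word_prior(n_w=2, N=9)
# A singleton genus (M = 1) can only give two possible values per word.
print(genus_conditional(m_w=1, M=1, prior=prior))
print(genus_conditional(m_w=0, M=1, prior=prior))
```

For a singleton genus the estimate depends only on whether the single training sequence contains the word and on the global prior, which is consistent with the point that nothing genus-specific can be learned from one sequence.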

Reference
Wang, Q. et al. (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73, 5261-5267.