The UNBIAS algorithm
attempts to adjust an OTU table to correct for the
two sources of abundance bias I believe to be
most important in practice: 16S copy number and primer mismatches. This requires
predicting the copy number and mismatch number for each OTU sequence, then
adjusting the read counts accordingly.
Abundance bias distorts diversity estimates
The diversity in a single sample is commonly
measured using alpha diversity metrics such as
the Shannon index and the Chao estimator, while the variation between pairs of
samples is measured using a beta diversity metric
such as the Jaccard distance or Bray-Curtis dissimilarity. Many such metrics,
including Shannon, Chao, Jaccard, and Bray-Curtis, are calculated from estimated
species frequencies. The correlation between read abundance and species
abundance is very low, so species frequencies
cannot be reliably estimated from marker gene reads, and traditional
diversity estimates based on species frequencies are therefore invalid or
difficult to interpret.
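To make the dependence on frequencies concrete, here is a minimal sketch of two of the metrics named above, the Shannon index and Bray-Curtis dissimilarity, computed from counts. The function names and the example counts are hypothetical and chosen only for illustration; they do not come from any real data set.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over the non-zero frequencies."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples.

    Counts are first converted to relative frequencies, so each sample
    sums to 1 and the dissimilarity reduces to 1 - sum(min(a_i, b_i)).
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    freqs_a = [c / total_a for c in counts_a]
    freqs_b = [c / total_b for c in counts_b]
    return 1.0 - sum(min(a, b) for a, b in zip(freqs_a, freqs_b))


# Hypothetical example: four species that are equally abundant in the
# sample, but whose read counts are skewed by amplification bias.
true_abundance = [100, 100, 100, 100]
read_counts = [1000, 100, 10, 1]

print(shannon_index(true_abundance))  # ln(4), about 1.386
print(shannon_index(read_counts))     # lower, although the community is even
print(bray_curtis(true_abundance, read_counts))  # non-zero for the same community
```

Both functions see only the frequencies they are given, so any distortion in the read counts carries directly into the metric.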
Predicting copy number and primer mismatches
Prediction of copy number and primer mismatches is done
by the SINAPS algorithm. SINAPS is based on
essentially the same algorithm as SINTAX. The top
hit in a reference database is identified using k-mer similarity. Confidence is
estimated by bootstrapping. In each bootstrap iteration, a subset of k-mers is
selected and used to find the top hit, and the trait of interest (here, copy
number or primer mismatches) is taken from the annotation of that reference
sequence. The trait with the highest bootstrap frequency is reported as the
prediction, and the frequency with which it occurred is reported as the
bootstrap confidence. UNBIAS requires a prediction for every OTU, so the
bootstrap confidence is ignored. In
this case, SINAPS is effectively equivalent to finding the top database hit
using the USEARCH algorithm.
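The bootstrap procedure described above can be sketched roughly as follows. This is an illustration only, not the SINAPS implementation: the function names, the k-mer length, the subset size, the number of iterations, sampling without replacement, and scoring a hit by the number of shared k-mers are all assumptions made for the example.

```python
import random
from collections import Counter


def kmers(seq, k=8):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def predict_trait(query, reference, k=8, iterations=100, subset_size=32, seed=0):
    """Predict a trait for `query` by bootstrapped k-mer top-hit search.

    `reference` is a non-empty list of (sequence, trait) pairs, where the
    trait is the annotation of interest, e.g. copy number or number of
    primer mismatches. Returns (predicted_trait, bootstrap_confidence).
    """
    rng = random.Random(seed)
    query_kmers = sorted(kmers(query, k))
    ref_kmers = [(kmers(seq, k), trait) for seq, trait in reference]
    votes = Counter()
    for _ in range(iterations):
        # Assumption: draw a random subset of the query k-mers.
        n = min(subset_size, len(query_kmers))
        subset = set(rng.sample(query_kmers, n))
        # Assumption: the top hit is the reference sharing the most k-mers.
        _, trait = max(ref_kmers, key=lambda entry: len(subset & entry[0]))
        votes[trait] += 1
    trait, count = votes.most_common(1)[0]
    return trait, count / iterations


# Hypothetical two-entry reference annotated with 16S copy number.
reference = [("ACGT" * 10, 4), ("TTGGCCAA" * 5, 7)]
trait, confidence = predict_trait("ACGT" * 10, reference)
print(trait, confidence)
```

A caller that needs a prediction for every OTU, as UNBIAS does, would use the returned trait and disregard the confidence value.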
Copy number correction
If the predicted 16S copy number is C,
the read count is multiplied by 4/C because the
mean 16S copy count is approximately four.
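As a small worked example of this rule, the sketch below scales a read count by 4/C. The function and constant names are hypothetical, and the read counts are made up for illustration.

```python
MEAN_COPY_NUMBER = 4.0  # approximate mean 16S copy number, as stated in the text


def correct_copy_number(read_count, copy_number):
    """Scale a read count by 4/C, where C is the predicted 16S copy number."""
    if copy_number <= 0:
        raise ValueError("copy number must be positive")
    return read_count * MEAN_COPY_NUMBER / copy_number


# An OTU with 8 copies is over-represented, so its count is halved.
print(correct_copy_number(1000, 8))  # 500.0
# An OTU with a single copy is under-represented, so its count is quadrupled.
print(correct_copy_number(100, 1))   # 400.0
```

An OTU whose predicted copy number equals the mean is left unchanged.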
Primer mismatch correction
If the predicted number of
primer mismatches is m, the read count is multiplied by 10^m;
i.e., an order of magnitude loss in efficiency is assumed for each mismatch.
Using 10 as a base is a rather arbitrary choice that probably does not work very
well in practice because the true efficiency loss depends on several factors
which are unknown or hard to predict. For example, the loss will depend on the
mismatch position in the oligonucleotide (mismatches close to the 3' end give
higher losses) and will tend to be greater if more rounds of PCR are used.
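The mismatch adjustment, a factor of 10 per mismatch (10^m for m mismatches), can be sketched as below. The function name and the `base` parameter are hypothetical; the parameter is exposed only to show how strongly the corrected count depends on the assumed per-mismatch loss, and the alternative value 3 is an arbitrary illustration rather than a recommendation.

```python
def correct_primer_mismatches(read_count, mismatches, base=10.0):
    """Scale a read count by base**m, where m is the predicted mismatch count.

    With the default base of 10, each mismatch is assumed to cost one order
    of magnitude in amplification efficiency.
    """
    if mismatches < 0:
        raise ValueError("number of mismatches cannot be negative")
    return read_count * base ** mismatches


# 50 reads from an OTU with two predicted primer mismatches.
print(correct_primer_mismatches(50, 2))            # 5000.0
# The same OTU under a smaller assumed per-mismatch loss.
print(correct_primer_mismatches(50, 2, base=3.0))  # 450.0
# No mismatches: the count is unchanged.
print(correct_primer_mismatches(50, 0))            # 50.0
```

The more than tenfold gap between the first two results shows why the choice of base matters so much for OTUs with several mismatches.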
Accuracy in practice
UNBIAS achieves a substantial
improvement on mock community tests when known values for copy numbers and
primer mismatches are used. This confirms that these biases are significant in
practice. However, UNBIAS is less successful when reference sequences have 97%
identity or less to the OTU sequences, as will often be the case. Thus, UNBIAS
is not a full solution to the problem of abundance bias.