Interpreting counts and frequencies in OTU tables
Estimating microbial diversity
The diversity in a single
sample (alpha diversity) is commonly measured
using metrics such as the Shannon index and the Chao1 estimator, while the
variation between pairs of samples (beta diversity)
is measured using metrics such as the Jaccard distance or Bray-Curtis
dissimilarity. Many such metrics, including Shannon, Chao1, Jaccard and
Bray-Curtis, are calculated from OTU frequencies. Other metrics, e.g.
unweighted UniFrac (called unifrac_binary in usearch), use presence / absence
only, effectively treating any non-zero count as one.
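As a rough illustration of the difference between frequency-based and presence / absence metrics, here is a minimal Python sketch using textbook definitions of the Shannon index, Bray-Curtis dissimilarity and binary Jaccard distance. The function names and the two count vectors are invented for this example, and the formulas are not necessarily identical to the variants implemented in usearch.

```python
import math

def shannon(counts):
    """Shannon index (natural log) from the OTU counts of one sample."""
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in freqs)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between the count vectors of two samples."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den

def binary_jaccard(a, b):
    """Jaccard distance on presence / absence: any non-zero count is treated as one."""
    shared = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    union = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    return 1 - shared / union

# Hypothetical counts for four OTUs in two samples.
sample_a = [50, 30, 20, 0]
sample_b = [10, 0, 20, 70]

print(shannon(sample_a))                   # uses frequencies
print(bray_curtis(sample_a, sample_b))     # uses frequencies
print(binary_jaccard(sample_a, sample_b))  # uses presence / absence only
```

The first two functions change whenever the counts change, while the binary metric only changes when a count crosses between zero and non-zero.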
OTU frequency does not correlate with species frequency
In fact, OTU frequencies have low
correlation with species frequencies. This means, for example, that the
most abundant OTU usually does not contain the most abundant species.
Cross-talk degrades presence / absence
Some metrics use OTU presence / absence rather than frequencies. In usearch,
such metrics are called "binary" because the count is considered to be zero or
one. With amplicon reads, presence / absence cannot be reliably measured if
samples are multiplexed, because cross-talk causes reads to be incorrectly
assigned to a sample where the OTU is in fact absent. This problem is
particularly severe if samples from different environments (e.g., human gut
and mouse gut) are multiplexed into a single run.
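To make the effect concrete, the following sketch compares hypothetical true counts for one sample with the counts observed after a few reads have been misassigned by cross-talk. The numbers are invented for illustration; the point is only that a handful of stray reads can change the presence / absence profile substantially while barely changing the frequencies.

```python
def presence(counts):
    """Convert counts to presence / absence (1 if non-zero, else 0)."""
    return [1 if c > 0 else 0 for c in counts]

# Hypothetical counts for four OTUs in one sample.
true_counts = [1200, 0, 340, 0]
# Assumed observed counts: three stray reads from other samples in the same run.
observed_counts = [1198, 2, 340, 1]

true_richness = sum(presence(true_counts))
observed_richness = sum(presence(observed_counts))

print(presence(true_counts), true_richness)
print(presence(observed_counts), observed_richness)
```

Here the number of OTUs called present doubles, even though the stray reads are a tiny fraction of the sample's total.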
Singleton counts are especially suspect
If you follow my recommended procedures, you will pool reads for all samples,
discard singleton unique sequences when making 97% OTUs, and discard unique
sequences with abundance <8 when making ZOTUs (denoising). Even so, many OTU
table entries are singletons (i.e., have value 1), especially for smaller
OTUs, because the total count is distributed over several samples. Small
counts, especially singletons, are more likely to be spurious, either because
the OTU itself is spurious (e.g., an undetected chimera) or because of
cross-talk.
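One way to see how common singleton entries are is to count them directly. The sketch below does this for a small made-up OTU table held in a Python dictionary (OTU name mapped to per-sample counts); the table contents and the function name are assumptions for illustration, not output from a real run.

```python
# Hypothetical OTU table: one row per OTU, one column per sample.
otu_table = {
    "Otu1": [950, 1200, 870],
    "Otu2": [40, 0, 65],
    "Otu3": [1, 3, 0],
    "Otu4": [0, 1, 1],  # total count 2, spread over two samples as singletons
}

def singleton_fraction(table):
    """Fraction of non-zero table entries that are equal to 1."""
    nonzero = [c for row in table.values() for c in row if c > 0]
    singletons = sum(1 for c in nonzero if c == 1)
    return singletons / len(nonzero)

print(singleton_fraction(otu_table))
```

Note that Otu4 would survive singleton filtering of the pooled reads, since its total abundance is 2, yet every one of its table entries is a singleton.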
Traditional diversity metrics are invalid or hard to interpret
Because of the issues described above, many diversity metrics are invalid,
meaningless or hard to interpret when calculated from OTUs. Some alpha
diversity metrics, including Chao1 and Robbins, explicitly use singleton
counts or singleton frequencies in their formulas. If singleton unique reads or
singleton OTUs are discarded, then these calculations are obviously invalid.
Either way, singleton counts are suspect as described above, so the
calculations are misleading or meaningless in practice. All beta diversity
metrics use OTU frequencies or presence / absence, neither of which can be
reliably determined from amplicon reads.
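The dependence of Chao1 on singletons can be sketched with the commonly used bias-corrected form, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs with count 1 and count 2. The counts below are invented, and this is a textbook formula rather than a statement of how any particular program computes it.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate from the OTU counts of one sample."""
    observed = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical counts for one sample, with and without singleton entries.
counts = [120, 45, 9, 2, 2, 1, 1, 1, 1]
filtered = [c for c in counts if c != 1]

print(chao1(counts))
print(chao1(filtered))
```

With singletons removed, the correction term vanishes and the estimate collapses to the observed number of OTUs, so the result depends heavily on exactly the entries that are least trustworthy.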