Interpreting counts and frequencies in OTU tables
USEARCH v11

See also
  Defining and interpreting OTUs
  Amplification bias
  Cross-talk
  UNBIAS algorithm
  UNCROSS2 algorithm
  Interpreting diversity metrics
  Recommended alpha and beta metrics
  Comparing alpha diversity between groups
  Statistical significance of diversity differences


Estimating microbial diversity
The diversity in a single sample (alpha diversity) is commonly measured using metrics such as the Shannon index and the Chao1 estimator, while the variation between pairs of samples (beta diversity) is measured using metrics such as the Jaccard distance or Bray-Curtis dissimilarity. Many such metrics, including Shannon, Chao1, Jaccard and Bray-Curtis, are calculated from OTU frequencies. Other metrics, e.g., unweighted UniFrac (called unifrac_binary in usearch), use presence / absence only, treating any non-zero count as one.
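To make the distinction between frequency-based and presence / absence metrics concrete, here is a minimal Python sketch. It uses common textbook definitions (Shannon with natural log, Bray-Curtis on per-sample frequencies, Jaccard on presence / absence). These are assumptions for illustration and are not necessarily the exact formulas usearch implements; the function names and the example counts are hypothetical.

```python
import math

def shannon(counts):
    """Shannon index (natural log) from the OTU counts of one sample."""
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in freqs)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity computed on the OTU frequencies of two samples.

    Assumes both samples contain at least one read.
    """
    total_a, total_b = sum(a), sum(b)
    fa = [x / total_a for x in a]
    fb = [y / total_b for y in b]
    num = sum(abs(x - y) for x, y in zip(fa, fb))
    den = sum(x + y for x, y in zip(fa, fb))
    return num / den

def jaccard_binary(a, b):
    """Presence / absence Jaccard distance: any non-zero count is treated as one."""
    shared = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    union = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    return 1 - shared / union

# Hypothetical OTU counts for two samples (same OTU order in both lists).
sample_a = [50, 30, 20, 0]
sample_b = [10, 0, 20, 70]

print(shannon(sample_a))                   # depends on the frequencies
print(bray_curtis(sample_a, sample_b))     # depends on the frequencies
print(jaccard_binary(sample_a, sample_b))  # depends only on which counts are non-zero
```

Scaling every count in a sample by the same factor leaves all three values unchanged, while changing a single zero to a one changes the binary metric immediately.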

OTU frequency does not correlate with species frequency
OTU frequencies have low correlation with species frequencies. This means, for example, that the most abundant OTU usually does not contain the most abundant species.

Cross-talk degrades presence / absence
Some diversity metrics use OTU presence / absence rather than frequencies. In usearch, such metrics are called "binary" because the count is considered to be zero or one. With amplicon reads, presence / absence cannot be reliably measured if samples are multiplexed because cross-talk often causes reads to be incorrectly assigned to a sample where the OTU is in fact absent. This problem is particularly severe if samples from different environments (e.g., human gut and mouse gut) are multiplexed into a single sequencing run.
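The following Python sketch illustrates how a handful of misassigned reads can change a binary metric. The read counts are invented for illustration, the amount of leakage is an assumption rather than a measured cross-talk rate, and the Jaccard definition is the textbook presence / absence form, which may differ in detail from what usearch computes.

```python
def jaccard_binary(a, b):
    """Presence / absence Jaccard distance: any non-zero count is treated as one."""
    shared = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    union = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    return 1 - shared / union

# Hypothetical true counts: the two samples share no OTUs.
human_gut = [900, 100, 0, 0]
mouse_gut = [0, 0, 700, 300]

# Same samples after a few reads leak between them (assumed cross-talk).
human_gut_xt = [900, 100, 3, 1]
mouse_gut_xt = [2, 0, 700, 300]

clean = jaccard_binary(human_gut, mouse_gut)
leaky = jaccard_binary(human_gut_xt, mouse_gut_xt)
print(clean, leaky)
```

In this toy case, six stray reads out of two thousand move the binary distance from "nothing shared" to "mostly shared", even though the frequencies barely change.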

Singleton counts are especially suspect
If you follow my recommended procedures, you will pool the reads for all samples, discard singleton unique sequences when making 97% OTUs, and discard unique sequences with abundance <8 when making ZOTUs (denoising). Even so, many OTU table entries for smaller OTUs are singletons (i.e., have value 1) because the total count is distributed over several samples. Small counts, especially singletons, are more likely to be spurious, either because the OTU itself is spurious (e.g., an undetected chimera) or because of cross-talk.
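As a rough illustration of why singleton entries survive abundance filtering, the Python sketch below counts how many non-zero entries of a small OTU table equal one. The table, the OTU names and the helper function are hypothetical; every OTU here has a pooled total of at least 2, so none would be a singleton after pooling.

```python
def singleton_entry_fraction(table):
    """Fraction of non-zero OTU table entries that are exactly 1."""
    nonzero = [c for row in table.values() for c in row if c > 0]
    ones = sum(1 for c in nonzero if c == 1)
    return ones / len(nonzero)

# Hypothetical OTU table: OTU name -> counts in four samples.
otu_table = {
    "Otu1": [500, 320, 410, 275],
    "Otu2": [40, 0, 12, 9],
    "Otu3": [1, 3, 0, 4],  # pooled total 8, but one entry is a singleton
    "Otu4": [1, 0, 1, 0],  # pooled total 2, both entries are singletons
}

fraction = singleton_entry_fraction(otu_table)
print(fraction)
```

The small OTUs contribute all of the singleton entries, which matches the point that the pooled count is spread thinly over several samples.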

Traditional diversity metrics are invalid or hard to interpret
Because of the issues described above, many diversity metrics are invalid, meaningless or hard to interpret when calculated from OTUs. Some alpha diversity metrics, including Chao1 and Robbins, explicitly use singleton counts or singleton frequencies in their formulas. If singleton unique reads or singleton OTUs are discarded, then these calculations are obviously invalid. Either way, singleton counts are suspect as described above, so the calculations are misleading or meaningless in practice. All beta diversity metrics use OTU frequencies or presence / absence, neither of which can be reliably determined from amplicon reads.
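To show how directly a metric such as Chao1 depends on singletons, here is a minimal Python sketch using the bias-corrected textbook form, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs with count 1 and count 2. The choice of this form is an assumption for illustration (the text above does not say which variant usearch uses), and the counts are invented.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate from the OTU counts of one sample."""
    s_obs = sum(1 for c in counts if c > 0)  # observed OTUs
    f1 = sum(1 for c in counts if c == 1)    # singletons
    f2 = sum(1 for c in counts if c == 2)    # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical counts for one sample.
counts = [10, 5, 1, 1, 1, 2, 2, 0]

# Same sample after discarding singleton entries.
no_singletons = [c for c in counts if c != 1]

with_singletons = chao1(counts)
without_singletons = chao1(no_singletons)
print(with_singletons, without_singletons)
```

Once singletons are discarded, F1 is zero and the estimate collapses to the number of observed OTUs, so the extrapolation term no longer carries any information.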