Rarefaction is a technique from numerical ecology that is
often applied to OTU analysis. However, with NGS reads,
low-abundance OTUs are
often spurious, and rarefaction analysis is therefore of dubious value.
The goal of rarefaction is to determine whether sufficient
observations have been made to get a reasonable estimate of a quantity (call it
R) that has been measured by sampling.
The most commonly considered quantity is
richness (the number of different species in an environment or ecosystem),
though similar analysis can be applied to any alpha
diversity metric (see alpha_div_rare).
The basic idea of rarefaction is to plot the value of a
measured quantity (call it R) against the number of observations used in the calculation.
Values of R for smaller numbers of observations are obtained by taking random
subsets. If we get a similar value of R with fewer observations, then it is
reasonable to infer that R has converged on a good estimate of the correct
value. Conversely, if R is systematically increasing or decreasing as more
observations are added, then we can infer that we cannot yet make a good
estimate of R for the full population.
These two cases are shown in the figure. In this example, the upper curve
(red) is still increasing, so has not converged. The lower curve (blue) has
reached a horizontal asymptote, so we can infer that the value of R is a good
estimate of the value that would be obtained if every individual was observed at least once.
This type of plot is called a "rarefaction curve". Note
that the conclusions we can draw from a rarefaction curve are suggestive but not
definitive -- there could be rare species that have not yet been observed even
if the curve appears to converge.
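The subsampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used by any particular tool: the function name rarefaction_curve, the number of trials and the toy community are all assumptions made for the example, and R is taken to be richness.

```python
import random

def rarefaction_curve(observations, sizes, trials=20, seed=0):
    """Mean richness of random subsets of each requested size (hypothetical helper)."""
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        total = 0
        for _ in range(trials):
            # Random subset of n observations, drawn without replacement.
            subset = rng.sample(observations, n)
            # R = richness = number of distinct labels in the subset.
            total += len(set(subset))
        curve.append((n, total / trials))
    return curve

# Assumed toy community: 5 species with 40 observations each.
observations = [f"sp{i}" for i in range(5) for _ in range(40)]
curve = rarefaction_curve(observations, sizes=[10, 50, 100, 200])
for n, r in curve:
    print(n, round(r, 2))
```

Plotting the second column against the first gives the rarefaction curve. In this toy case the curve flattens quickly because every species is abundant; a community with many rare species would keep rising for longer.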
If R does not converge, there are two possibilities: either we
need more samples to get a good estimate, e.g. because we have not yet observed
all the taxa present, or the number of spurious OTUs due to sequencing error grows
with the number of reads, in which case the measured R might increase indefinitely.
This effect is commonly seen with the number of OTUs. Suppose there is a fixed
probability that a read has >3% bad bases and will thus induce a spurious OTU.
As the number of reads increases, the number of OTUs will increase due to these
bad reads, regardless of whether all the species in the sample have been
detected. The number of OTUs will therefore never converge. This is usually the
case in practice, because it is impossible to completely eliminate spurious OTUs.
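A small simulation illustrates why the OTU count keeps growing. It is a sketch under simplifying assumptions that are mine, not measured values: there are 50 real species, each read is bad with fixed probability 0.01, and every bad read founds its own spurious OTU.

```python
import random

def count_otus(num_reads, num_species=50, p_bad=0.01, seed=0):
    """Count OTUs when each read has a fixed chance of being spurious."""
    rng = random.Random(seed)
    otus = set()
    for read_id in range(num_reads):
        if rng.random() < p_bad:
            # Assumption: a bad read creates a new, unique spurious OTU.
            otus.add(("spurious", read_id))
        else:
            # A good read is assigned to one of the real species.
            otus.add(("real", rng.randrange(num_species)))
    return len(otus)

counts = {n: count_otus(n) for n in (1_000, 10_000, 100_000)}
for n, c in counts.items():
    print(n, c)
```

All 50 real species are found early, yet the count continues to rise roughly in proportion to the number of reads, so the curve never reaches a horizontal asymptote.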
There is a standard formula for calculating the rarefaction curve
for richness given the observed abundances, but this formula is not quite
correct if singleton reads are discarded, as recommended in
the UPARSE pipeline. See abundance
rarefaction for further discussion. I doubt it matters in practice, because other
sources of error are probably more important, so rarefaction analysis has
dubious value for marker gene OTUs.
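For reference, here is a sketch of the analytic calculation, assuming the standard formula meant above is the classic expected-richness expression for sampling without replacement: a species with abundance a is missed by a subsample of n out of N individuals with probability C(N - a, n) / C(N, n). The function name expected_richness is hypothetical. Note that the formula assumes all observations are retained, so it inherits the caveat about discarded singletons.

```python
from math import comb

def expected_richness(abundances, n):
    """Expected number of species in a random subsample of n individuals."""
    total = sum(abundances)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total abundance")
    denom = comb(total, n)
    # Species with abundance a is absent with probability C(total - a, n) / C(total, n),
    # so it contributes one minus that probability to the expected richness.
    return sum(1 - comb(total - a, n) / denom for a in abundances)

# Assumed toy abundances: two common species and two singletons.
abundances = [5, 3, 1, 1]
for n in (1, 5, 10):
    print(n, expected_richness(abundances, n))
```

Evaluating the function for n from 1 up to the total abundance gives the whole curve without any random subsampling.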