See also

Tolstoy's paradox

I believe that for a given protocol, from sample collection through library prep, sequencing and data analysis, **the number of spurious OTUs is approximately independent of community structure**.

For very noisy methods such as QIIME closed-reference, this can be supported (though not shown definitively) by looking for dominant OTUs (manuscript in preparation).

The following arguments show that the claim is plausible.

**Real samples are like mock plus more**

Imagine sorting the species in a sample by decreasing abundance. Call the top 20 the *mock-like subset* and the remainder the *low-abundance tail*. A **given number of reads of the mock-like subset** will contain a **similar set of errors** due to PCR and sequencing, regardless of whether a low-abundance tail was also sequenced. If the top 20 represent only a small fraction of the sample, then repeat the thought experiment with the top 20 in the low-abundance tail, and so on.

**If errors are random, the probability of a spurious OTU per read is constant**

Errors are approximately random, which implies that each time a new read is added, the probability it will induce a new spurious OTU is approximately constant, because the probability of reproducing a previous spurious OTU is small due to the large number of ways in which errors can be different. Therefore, approximately the same number of spurious OTUs will be generated by generating a given number of reads, regardless of whether those reads were derived from few or many species.
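The argument above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a model of any real pipeline: the function name `count_spurious_otus`, the error rate of 0.1, the pool of one million possible error patterns, and the 1/rank abundance distribution are all hypothetical choices made for illustration. It also treats every distinct erroneous sequence as its own spurious OTU and ignores clustering, chimeras and abundance filtering.

```python
import random


def count_spurious_otus(n_species, n_reads, error_rate=0.1,
                        n_error_patterns=10**6, seed=0):
    """Toy model: count distinct erroneous sequences among n_reads reads.

    All parameters are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    # Assumed abundance distribution: the species at rank i has weight 1/(i+1).
    weights = [1.0 / (i + 1) for i in range(n_species)]
    species_of_read = rng.choices(range(n_species), weights=weights, k=n_reads)

    spurious = set()
    for species in species_of_read:
        if rng.random() < error_rate:
            # An erroneous read is identified by its source species plus a
            # randomly chosen error pattern. Because the number of possible
            # patterns is large, repeating an earlier pattern is unlikely.
            pattern = rng.randrange(n_error_patterns)
            spurious.add((species, pattern))
    return len(spurious)


# Same number of reads, very different community structure.
few = count_spurious_otus(n_species=20, n_reads=10_000, seed=1)
many = count_spurious_otus(n_species=2_000, n_reads=10_000, seed=2)
print("spurious OTUs, 20 species:  ", few)
print("spurious OTUs, 2000 species:", many)
```

Under these assumptions the two counts should be close to each other and close to the number of erroneous reads, because almost every erroneous read yields a new distinct sequence whichever species it came from. If `n_error_patterns` is made small, repeated errors become common and the counts start to depend on the number of species, which is why the claim is only approximate.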

**Spurious OTUs are not an artifact of mock community tests**

If the number of spurious OTUs does not strongly depend on the structure of the community, then we will get **similar numbers of spurious OTUs from mock and real samples**.

R.C. Edgar (2017), Accuracy of microbial community diversity estimated by closed- and open-reference OTUs, PeerJ 5:e3889